MarkLogic 4.0: A Next-Generation Content System
October 7, 2008
Navigate to KDNuggets here, and you can see a lineup of some of the content processing systems available on October 2, 2008. The list is useful, but it is not complete. There are more than 350 vendors in my files, and each asserts that it has a must-have content system. Most of these systems suffer from one or more drawbacks; for example, scaling is problematic or just plain expensive, repurposing the information is difficult, or modifying the system requires lots of fiddling.
MarkLogic 4.0 addresses a number of these common shortcomings in text processing and XML content manipulation. The company “accelerates the creation of content applications.” With the most recent release of its flagship server product, MarkLogic offers a content platform, not a content utility. Think of most content processing companies as tug boats. MarkLogic 4.0 is an ocean going vessel with speed, power, and range. When I spoke with MarkLogic’s engineers, I learned that the ideas for enhancements to MarkLogic 3.2, the previous release, originated with MarkLogic users. One engineer said, “Our licensees have discovered new uses for the system. We have integrated into the code base functions and operations that our customers told us they need to get the most value from their information. Version 4.0 is a customer driven upgrade. We just did the coding for them.”
Most text processing systems, including XML databases, are useful but limited in power and scope. The MarkLogic 4.0 system is an ocean going vessel among harbor bound craft.
You can learn quite a bit about the functionality of MarkLogic in this Dr. Dobb’s interview with Dave Kellogg, CEO of this Sequoia-backed firm. The Dr. Dobb’s interview is here.
MarkLogic is an ocean going vessel amidst smaller boats. The product is an XML server, and it offers search, analytics, and jazzy features such as geospatial querying. For example, I can ask a MarkLogic system for information about a specific topic within a 100-mile radius of a particular city. But the core of MarkLogic 4.0 is an XML database. When textual information or data are stored in MarkLogic 4.0, slicing, dicing, reporting, and reshaping information provide solutions, not results lists.
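To make the radius example concrete, here is a minimal Python sketch of a topic-plus-distance filter. It is not MarkLogic code and says nothing about how the server implements geospatial queries. The function names, the sample documents, and the choice of the haversine formula are my own assumptions for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def search_near(docs, keyword, center, radius_miles):
    """Return ids of documents that mention keyword and lie within the radius."""
    lat0, lon0 = center
    return [
        d["id"]
        for d in docs
        if keyword.lower() in d["text"].lower()
        and haversine_miles(lat0, lon0, d["lat"], d["lon"]) <= radius_miles
    ]


# Hypothetical documents, each carrying a point location.
DOCS = [
    {"id": "a", "text": "Flood warning issued", "lat": 38.0406, "lon": -84.5037},  # Lexington, KY
    {"id": "b", "text": "Flood report filed", "lat": 41.8781, "lon": -87.6298},    # Chicago, IL
    {"id": "c", "text": "Road work scheduled", "lat": 38.0406, "lon": -84.5037},   # Lexington, KY
]
LOUISVILLE = (38.2527, -85.7585)

print(search_near(DOCS, "flood", LOUISVILLE, 100))
```

A production system would use a spatial index rather than scanning every document, but the query shape is the same: a text constraint combined with a distance constraint.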
According to Andy Feit, vice president, MarkLogic is “a mix of native XML handling, full-text search engines, and state-of-the-art DBMS features like time-based queries, large-scale alerting, and large-scale clustering.” The new release adds important new functionality. New features include:
- Geospatial support for common geospatial markup standards plus an ability to display data on polygons such as state boundaries or a sales person’s region. The outputs or geospatial mash ups are hot linked to make drill down a one-click operation.
- Push operations such as alerts sent to a user’s mobile phone or triggers that fire when a content change occurs and, in turn, launch a separate application. The idea is to automate content and information operations in near real time, not leave it up to the system user to run a query and find the important new information.
- Embedded entity enrichment functionality including support for Chinese, Russian and other languages
- Improved support for third party enterprise entity extraction engines or specialized systems. For example, the new version ships with direct support for TEMIS’s health and medical processing, financial services, and pharmaceutical content processing system. MarkLogic calls its approach “an open framework”
- Mobile device support. A licensee can extract data from MarkLogic and the built in operations will format those data for the user’s device. Location services become more fluid and require less developer time to implement.
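The push idea in the list above can be sketched in a few lines of Python. This is a toy illustration, not MarkLogic's alerting or trigger API. The class name `AlertingStore`, its methods, and the keyword-match rule are all hypothetical.

```python
class AlertingStore:
    """Toy content store that pushes a notification when new content matches a saved keyword."""

    def __init__(self):
        self.docs = {}
        self.alerts = []  # list of (keyword, callback) pairs

    def register_alert(self, keyword, callback):
        """Save a keyword and the function to call when matching content arrives."""
        self.alerts.append((keyword.lower(), callback))

    def insert(self, doc_id, text):
        """Store the document, then fire every alert whose keyword appears in it."""
        self.docs[doc_id] = text
        lowered = text.lower()
        for keyword, callback in self.alerts:
            if keyword in lowered:
                callback(doc_id, keyword)


sent = []  # stands in for a message to a phone or a call to another application
store = AlertingStore()
store.register_alert("malaria", lambda doc_id, kw: sent.append((doc_id, kw)))
store.insert("d1", "New malaria study published")
store.insert("d2", "Quarterly sales summary")
print(sent)
```

The point of the design is that the content change drives the notification. The user never has to rerun a query to find out that something new arrived.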
The new release of MarkLogic manipulates XML quickly. In addition to performance enhancements to the underlying XML data management system, MarkLogic supports the XQuery 1.0 standard. Users of earlier versions of MarkLogic server can continue to use these systems alongside Version 4.0. According to Mr. Feit, “Some vendors require a lock step upgrade when a new release becomes available. At MarkLogic, we make it possible for a licensee to upgrade to a new version incrementally. No recoding is required. Version 4, for example, supports earlier versions’ query language and scripts.”
In addition to standard search and retrieval functions, MarkLogic 4.0 implements a number of new and useful text discovery functions. For example, the system features a co-occurrence display. Here’s how it works. You have indexed content from a range of sources: third party commercial publishers, internal documents, Web content, and RSS feeds. You want to know what pairs of entities occur most frequently in this corpus. I entered a query for African disease and the system displayed the following output:
In the processed content, this output reveals that the word pair “chemotherapy | malaria” occurs frequently. A careful reading of the documents in the result set might have revealed this subtle relationship. I would have missed the connection. The MarkLogic system makes this type of word pair discovery easy. Other applications include identifying what name occurs most frequently with another name or what location occurs most frequently with a particular event.
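For readers who want to see the counting logic behind a co-occurrence display, here is a minimal Python sketch. It is not how MarkLogic computes the result. The function name and the tiny sample corpus are invented for illustration, and the counts below come from that made-up corpus, not from the system I tested.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs_entities):
    """Count, for each unordered pair of entities, how many documents contain both."""
    counts = Counter()
    for entities in docs_entities:
        # De-duplicate within a document and sort so each pair has one canonical order.
        unique = sorted(set(entities))
        for pair in combinations(unique, 2):
            counts[pair] += 1
    return counts


# Hypothetical corpus: the entities already extracted from each document.
corpus = [
    ["malaria", "chemotherapy", "kenya"],
    ["chemotherapy", "malaria"],
    ["malaria", "chemotherapy", "malaria"],
    ["cholera", "kenya"],
]

counts = cooccurrence_counts(corpus)
for (left, right), n in counts.most_common(3):
    print(f"{left} | {right}: {n}")
```

Swap in person names, locations, or events as the entities and the same tally answers questions such as which name appears most often with another name.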
MarkLogic has added a twist to entity identification and tagging. Instead of building a separate list of entities, MarkLogic performs “inline entity” tagging. The entity tags are inserted into a document as well as indexed. This makes it possible to perform database operations on the entities and extract the context in which the entity exists. Combining an entity operation for a person with a geospatial feature allows the user to see information about an entity displayed on a map. A publisher, for example, can generate dynamic displays of content. When a user clicks on a map point, the underlying data about a business can be displayed.
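Here is a rough Python sketch of what inline tagging buys you. It is my own illustration, not MarkLogic's implementation. The `<entity>` and `<para>` element names, the small lookup table of known entities, and both function names are assumptions. The sketch wraps each known entity in a tag inside the text, then reads the entities back out along with the sentence that surrounds them.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical lookup table mapping entity names to entity types.
ENTITIES = {"Acme Hardware": "organization", "Louisville": "location"}


def tag_inline(text, entities):
    """Wrap each known entity in an <entity> element inside the text itself."""
    if not entities:
        return "<para>" + escape(text) + "</para>"
    # Try longer names first so a long name wins over a shorter one it contains.
    names = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in names))
    out, pos = [], 0
    for m in pattern.finditer(text):
        out.append(escape(text[pos:m.start()]))
        name = m.group(0)
        out.append('<entity type="%s">%s</entity>' % (entities[name], escape(name)))
        pos = m.end()
    out.append(escape(text[pos:]))
    return "<para>" + "".join(out) + "</para>"


def entities_with_context(xml_text):
    """Return (type, name, surrounding text) for every tagged entity."""
    root = ET.fromstring(xml_text)
    context = "".join(root.itertext())
    return [(e.get("type"), e.text, context) for e in root.iter("entity")]


tagged = tag_inline("Acme Hardware opened a store in Louisville.", ENTITIES)
print(tagged)
for row in entities_with_context(tagged):
    print(row)
```

Because the tags live in the document, one pass over the XML yields both the entity and its context. A separate entity list would need a second lookup to recover the surrounding text.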
How MarkLogic Differs from Traditional Search Systems
MarkLogic does provide a keyword search system. You can search by relationships. You can see reports that tally the frequency of certain words and phrases. Search is indeed an important component of MarkLogic, but the company has developed a content platform. With each release of the XML data management system, MarkLogic has created a content environment. Licensees can store, manipulate, repurpose, and answer questions with the MarkLogic system. MarkLogic has moved beyond what search can do, and it is defining a new content category. The company has deftly sidestepped the nine problems that most search and content processing companies struggle to overcome. Click here for these challenges.
With this new release, MarkLogic 4.0 raises the bar for XML content processing, repurposing, and access.
Stephen Arnold, October 7, 2008