Mark Logic

An Interview with Dave Kellogg

Mark Logic is a new breed of search and content processing company. A number of companies have created niches in which their technology can solve specific problems.

Mark Logic provides information access and delivery solutions that accelerate the creation of content applications. Customers across a range of industries rely on Mark Logic to repurpose content and deliver that information through channels. Some vendors describe this suite of functions as an enterprise publishing system.

MarkLogic Server is one of the leading XML content platforms. My tests show that it is possible to store, manipulate, and deliver content.

I spoke with Mr. Kellogg in the firm's offices, located next door to Google's Postini unit and within earshot of the humming Highway 101 that connects San Francisco with San Jose, the main artery of Silicon Valley.

The full text of my conversation with Dave Kellogg appears below:

Would you take a minute and explain how you moved from Paris to San Francisco to jump into the information access business?

I have done “data” since 1983, when I first used Ingres at Lawrence Berkeley Lab while working as an undergraduate at UC Berkeley.

Then I had a great experience from 1995 to 2004 at Business Objects working with business intelligence tools. But I felt that I was getting diminishing marginal returns on interesting things one could do with data using some of the more traditional tools. Quite a few companies could capture data, database-ize data, warehouse data. There were many powerful tools to analyze data, set up alerts, build executive-level dashboards, and crunch data with prepackaged analytic applications … frankly it was getting hard to imagine 20 more years of innovation at the same pace. I like challenges.

How did you make the jump from business intelligence to content?

I was interested in the data/content divide and felt as if I was at the San Diego / Tijuana border. Content wasn’t in databases, and users couldn’t run queries against it (other than simple search ones). It was tough to do analytics on most content. Building a content application on top of unstructured information was tough. I concluded that people were basically using stone tools to work with content and I knew we could do a lot better.

I saw an opportunity to do the same things for content that we’d done with data over the prior two decades and jumped into MarkLogic.

Business Objects has generated a number of high profile search actions; namely, the acquisition of Inxight Software. How did your work at Business Objects prepare you for the Mark Logic opportunity?

The key is my knowledge and experience in growing a software business. When I started at Business Objects we were around $30 million with about 200 people. When I left, we were generating more than $850 million. I think we had around 4,500 people. So I learned a lot about growing a software company.

My international experience was a plus too. Business Objects was a French company, so I learned a lot about growing an international software business and about international markets. Living in France for five years helped me learn a lot about building software companies overseas and, in the end, gave me a greater appreciation for the power of the Silicon Valley industry cluster.

I did quite a bit of work with data content and integration. Every year at Business Objects we had an executive strategy offsite and every year one of the topics was “unstructured data.” And every year we decided to worry about that one “next year.” Eventually, after I left, I suppose next-year finally came because they decided to purchase Inxight Software in May 2007.

I think I have a bias toward data, but I have deep technical and product marketing experience. I worked at Ingres and at Business Objects.

I think I have a very strong handle on how data people view content: first, as “funny data” and second as “a condiment,” something you want to sprinkle a little of into your main course of structured data reports.

Mark Logic is a new breed of information access company. How would you position the firm's technology against a vendor such as Fast Search & Transfer or a similar Swiss Army knife provider?

Good question. We have put a lot of thought into that question, done a lot of diligence, looked at the research, performed various market studies, and the answer we’ve painstakingly arrived at--strictly in terms of our positioning versus Fast Search & Transfer--is “better.”

Hey, I'm just kidding, Stephen. I’ve always wanted to give an answer like that.

Let me try another angle on us versus high profile search platforms.

MarkLogic is simply a different thing. Fast is, at its core, an enterprise search engine, an inverted term list engine that indexes the presence of tokens in documents, built to run one query (return a list of links to documents that contain a word/phrase) really well.
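For readers who want to see the "inverted term list" idea in code, the following is a minimal Python sketch of an index that records which documents contain which tokens and then answers the single query described above. The sample documents, document IDs, and function names are illustrative assumptions; this is not a description of how Fast or MarkLogic is actually implemented. The sketch only checks that every query word is present in a document and does not track word order or position.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Split on whitespace, strip basic punctuation, and lowercase each token."""
    tokens = []
    for word in text.split():
        cleaned = word.strip(PUNCTUATION).lower()
        if cleaned:
            tokens.append(cleaned)
    return tokens


def build_term_lists(documents):
    """Map each token to the set of document IDs in which it appears."""
    term_lists = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            term_lists[token].add(doc_id)
    return term_lists


def search(term_lists, query):
    """Return the sorted IDs of documents that contain every word in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set(term_lists.get(tokens[0], set()))
    for token in tokens[1:]:
        matches &= term_lists.get(token, set())
    return sorted(matches)


# Hypothetical sample documents, used only for illustration.
documents = {
    "doc1": "XML content stored in a database.",
    "doc2": "A search engine indexes content.",
    "doc3": "Database transactions and logging.",
}
term_lists = build_term_lists(documents)
```

Calling `search(term_lists, "database content")` intersects the term lists for the two words. Note that the index is built once and then only read, which is the read-only character of a search engine that the answer goes on to contrast with a read/write database.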

Yes, they built a few applications/solutions around that core, but it’s a search engine. It is a read-only system designed to index documents and allow you to run--from a database perspective at least--a few, simple parameterized queries.

MarkLogic, on the other hand, is a database management system built to natively manage XML documents and optimized for handling vast numbers of them (I mean hundreds of terabytes) with high performance. It’s a read/write system. It has a query language (XQuery). It has transactions and logging. You can use it, by itself--without the need to bolt it on to either a relational database or an application server--as the basis for content applications.

So MarkLogic is just fundamentally a different thing?

That's right. We view the ability to run full-text searches (including Boolean expressions, proximity, stemming, thesaurus, etc.) as a feature of a document-oriented database management system. But one feature does not a DBMS make.

So, to restate my tongue-in-cheek answer above, if I had to say it in one word, it would be “different.” If you gave me two words, they’d be "different" and "better."

What makes the MarkLogic Server tick? Is there a secret sauce in the data management system, for example?

First, let’s talk about what MarkLogic is not. It is not the pre-integration of a relational DBMS and an enterprise search engine. MarkLogic is an XML database management system built, from scratch, to be optimized for the management of large numbers of text-oriented documents.

MarkLogic does not run on top of any other database system. Nor does it use third-party indexing technology to enable document search. MarkLogic runs on top of the operating system and manages its underlying databases at a "bits on iron" level.

What makes MarkLogic unique is that it’s a DBMS built by a mixed team of search and DBMS engineers.

One way to think about MarkLogic is in terms of a DBMS built using search-engine parts. If, for example, you looked at source code, there would be places where you’d say, "Oh, I’m in a DBMS because I see a locking system or a before-image read consistency mechanism."

But there are other places where you would say, "Oh, I’m in a search engine because I see term lists, search engine-like query processing, or a search engine-style clustering architecture that works extremely well on large numbers of commodity servers."

You have a number of high profile publishing companies as customers. Can you give me a snapshot of the two or three functions that your system delivers to these firms?

One of our customers, Oxford University Press, which produces around 500 titles annually, uses MarkLogic for silo-breaking and new product development. Specifically, they’ve built a content delivery platform that leverages content across multiple product lines, enabling them to create multiple highly customized products to tap new audiences.

O’Reilly Media has created a Web–based custom publishing platform on MarkLogic called SafariU, which allows college professors to create and publish customized computer science and information technology textbooks and course materials. Our system has helped them expand their presence in the textbook publishing market by delivering targeted and low–priced teaching materials to faculty and students.

Also, at Congressional Quarterly, MarkLogic is used as the basis for analytical content applications such as Legislative Impact, part of their Legislative Tracking Service. This tool allows subscribers to see how pending or passed legislation will impact existing laws. By monitoring hearings and bills and pushing real-time notification on breaking news, this helps users keep close tabs on the information that is most valuable to them.

What other markets have expressed interest in the MarkLogic Server and your other products?

Good question. We have a strong US Federal government practice where we’ve not only done a lot of work with the Intelligence Community, but also within the Department of Defense and in the US Patent & Trademark Office.

We continue to focus a lot on publishing or what we call the "Information and Media" market. We're seeing some interest in OEM licensing deals and software as a service arrangements. Several companies are building and selling applications/services built on top of MarkLogic.

There's been interest from the financial services sector. We're starting to see "publishing" applications inside banks and other areas such as equity research.

We've had some inquiries from the aviation industry. Airlines want to have online and custom-published in-flight manuals and training materials. This is, in effect, a custom publishing problem in disguise. I was surprised to learn that every aircraft is unique. This means that the documentation for that aircraft must be tailored to that aircraft.

In addition, we believe that just about every business is a publisher in some way whether it’s in providing information to their technical community, engaging with their enthusiasts, or getting the message out in general. Enabling communications will be increasingly critical to organizations in the future.

When a customer licenses MarkLogic Server, what type of training and support do you provide?

Our philosophy is to sell solutions to problems and avoid the stereotypical "drive-by" technology sale, where companies dump the software in the parking lot and leave.

We also use the same technical people to do the presale and the post-sale when we can. This is better for the customer and better for us; and we have specifically tried to build a unique and flexible org structure to support this.

In some cases--where we’re subcontracting a big US government project--we will sell the software by itself, but it’s not really our preference. We like to work side-by-side with the customers in driving success.

We provide technical support, training, and professional services consulting. However, what we really aim to do with our customers is make them successful within their organization--the old "give a person a fish, feed them for a day; teach a person to fish, feed them for a lifetime" analogy.

While we like to stick around after the sale, there’s a certain level of self-sufficiency that our customers are looking for when working with us. To accomplish this, our sales engineers work closely from the get-go with customers to understand the business problems they are trying to solve and how we can help them. From there, it’s a multi-tier knowledge transfer through a variety of different vehicles including training courses.

How does your system fit with other content processing tools? For example, do you mesh with taxonomy systems from SchemaLogic and Access Innovations? With search systems from Autonomy and Endeca?

Mark Logic fits quite well into the content infrastructure of most companies. For example, our system works with popular authoring tools, including XML-specific tools like Arbortext, and going forward more general purpose authoring / creative tools like Microsoft Office 2007 and Adobe’s InCopy and InDesign.

Our system also meshes smoothly with third-party content management systems such as Documentum or Microsoft SharePoint. In addition, some vendors have built CMS systems on top of MarkLogic, such as Really Strategies RSuite/CMS and empolis.

Furthermore, MarkLogic can be indexed by federated search engines if you’d like the MarkLogic repository to be just one of several repositories included in a federated search.

We also support the inclusion of taxonomies in queries for broadening or narrowing query terms.
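One way to picture taxonomy-driven query expansion is the short Python sketch below. The taxonomy contents and the function names are hypothetical and chosen only for illustration; the interview does not describe how MarkLogic represents taxonomies or exposes them in queries.

```python
# Hypothetical taxonomy: each term maps to its narrower (child) terms.
NARROWER = {
    "vehicle": ["aircraft", "automobile"],
    "aircraft": ["airliner", "helicopter"],
}


def broader_term(term, narrower=NARROWER):
    """Return the parent of a term, or None if the term has no parent."""
    for parent, children in narrower.items():
        if term in children:
            return parent
    return None


def with_narrower_terms(term, narrower=NARROWER):
    """Return the term followed by all of its descendants, depth first."""
    expanded = [term]
    for child in narrower.get(term, []):
        expanded.extend(with_narrower_terms(child, narrower))
    return expanded


def with_broader_term(term, narrower=NARROWER):
    """Return the term plus its immediate parent, if it has one."""
    parent = broader_term(term, narrower)
    return [term] if parent is None else [term, parent]
```

A search for "vehicle" could then be run against every term returned by `with_narrower_terms("vehicle")`, while a search for "airliner" that returns too little could be retried with the terms from `with_broader_term("airliner")`.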

You are much more than a system; you have a content framework. Am I correct in characterizing your system this way?

Yes, that's correct. MarkLogic has a built-in Content Processing Framework. This allows customers to build pipelines for things like document conversion and enrichment. As part of a pipeline, you can call out to a Web service that will enrich your XML content by, for example, marking up people, places, and organizations, using an entity extraction engine from someone like Temis or Inxight.
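The pipeline idea can be sketched in a few lines of Python. This is only an illustration of the general pattern (conversion followed by enrichment), not MarkLogic's Content Processing Framework API. The entity dictionary stands in for a call to an external extraction service, and all names here are assumptions. The sketch also skips XML escaping, which a real conversion step would need.

```python
# Hypothetical entity dictionary standing in for an external extraction service.
KNOWN_ENTITIES = {
    "Dave Kellogg": "person",
    "Mark Logic": "organization",
    "San Francisco": "place",
}


def convert_to_xml(plain_text):
    """Conversion step: wrap plain text in a minimal XML element (no escaping)."""
    return f"<doc>{plain_text}</doc>"


def enrich_entities(xml_text):
    """Enrichment step: wrap each known entity name in a tag naming its type."""
    for name, kind in KNOWN_ENTITIES.items():
        xml_text = xml_text.replace(name, f"<{kind}>{name}</{kind}>")
    return xml_text


def run_pipeline(content, steps):
    """Apply each step in order, passing the output of one step to the next."""
    for step in steps:
        content = step(content)
    return content


result = run_pipeline(
    "Dave Kellogg runs Mark Logic.",
    [convert_to_xml, enrich_entities],
)
```

Because each step takes content in and hands content back, new steps can be added to the list without changing the ones already there.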

What are some of the new features and functions that you have added in the last six to nine months?

In our current release, we had four high-level themes when talking about new functionality. Let me highlight several for you. Just stop me if you think I sound like a human brochure.

First, we have expanded content processing capabilities; specifically, we increased the number of languages we support with advanced language processing capabilities including language identification and stemming.

Second, we have improved the ability to search and analyze content; for example, we added fielded search functionality so that organizations can provide highly targeted search over selected components of content. This means a licensee can restrict a search to just images or to tables. We also beefed up the analytic functionality and now provide frequency analysis for the different values within their content, which drives what some vendors call guided or faceted navigation.

Third, we have provided additional administration support. Let me give you an example. We created new APIs that allow MarkLogic to be tied into an enterprise infrastructure, like performance monitoring tools.

And I think the other big enhancement has been adding more support for developers. We made it easier for developers of content applications to rapidly create new ones by providing integration into IDE tools like oXygen, which includes debugging and a profiling capability.

When you look at the broader market for content processing, what are the three or four major trends that you see gaining momentum?

As part of the general trend of consumerization of information technology, we think that every content organization now wants to build Web 2.0-style content applications.

What do you mean Web 2.0?

Good question. When I refer to Web 2.0 I am thinking about such features as voting, annotation, commenting, folksonomy as well as fine-grained search and granular access to content.

Interestingly, I’d note that we are seeing this trend not only in our Information & Media customers, but in our general commercial ones, and even in our Government customers, where there is broad acceptance of these Web 2.0 concepts and a lot of visibility for projects like Intellipedia.

We see the bar rising in areas like language processing. We, for example, believe that basic entity extraction (that is, identifying and marking people, places, and things) in the future will be as ubiquitous as stemming (that is, matching run to ran) is today.

So over time, people simply expect more and more from their content processing systems and, as per the prior point, their expectations are largely being set by what they find on the Internet.

Another trend we see is the mainstreaming of content software. It used to be that you needed specialized XML authoring tools with specially trained authors and had to sink $1.0 million into an enterprise content management system.

Now you can get a lot of the same value as those big investments from Microsoft Word and SharePoint. So we think content technology is becoming more accessible to more people and that democratization process is good for Mark Logic because all those technologies are really enablers of the kinds of applications our customers build.

Where do you see Mark Logic in the next year or two?

In a year or two, I expect to continue to be a very strong player in the Information & Media business and in Federal government. I expect to be selling to these segments in multiple countries (our UK operation recently came online and is off to a great start). I am confident that our OEM business will be going strong as well.

In terms of new markets, I think we’ll be doing well in financial services and healthcare. These are two markets where we’ve already started to get traction today and that I feel very good about going forward.

In addition, in the next two years, I think the horizontal market for dynamic enterprise publishing will become a recognized segment. In essence, as technology enables it, I think every enterprise is going to ask its internal publishing segment the question: why aren’t we using the same tools that the professionals use? In the past, the answer was “too big, too hard, too expensive.” Going forward, I see those barriers dropping.

You have a background in business intelligence (which I think of as "search on steroids"). Will you add business intelligence functions to the MarkLogic Server product line?

The BI layer is really a tools layer in my opinion. BI software is largely about providing tools and interfaces where "mere mortals" can answer questions from information in corporate databases and data warehouses.

BI leverages, and is enabled by, the database layer – the relational database management system, possibly an online analytic processing server layer, most likely an enterprise data warehouse, and quite probably a whole family of departmental data marts. Some people use “BI” to refer to that whole stack; I tend to define BI more precisely.

At Mark Logic, we build a special-purpose database management system designed to hold large amounts of XML content. So we are currently focusing our investments at a layer that’s one level deeper than what I consider the BI layer.

That said, the BI layer is in many ways only as good as the database infrastructure that underlies it, and we are actively building primitives into the system to enable new and better analytics on content. MarkLogic-based applications today already support analytics such as citation/link analysis. We already provide faceted navigation (itself borrowed from OLAP in my estimation) and iterative query construction.

MarkLogic also helps organizations understand and analyze their content by enabling them to count distinct values to understand how frequently they occur. For example, on a corpus of journal content we can answer questions like how many distinct authors are there, how many articles each author has written, etc. This analytic capability in turn drives a range of visual exploration methods including: histograms, charts, tables and tag clouds.
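The journal example can be made concrete with a few lines of Python. The article records below are invented for illustration, and the code shows only the general counting technique, not MarkLogic's own query interface.

```python
from collections import Counter

# Hypothetical article records; in practice these would come from the content store.
articles = [
    {"title": "Paper A", "author": "Smith"},
    {"title": "Paper B", "author": "Jones"},
    {"title": "Paper C", "author": "Smith"},
    {"title": "Paper D", "author": "Lee"},
]

# How many articles has each author written?
articles_per_author = Counter(article["author"] for article in articles)

# How many distinct authors are there?
distinct_author_count = len(articles_per_author)

# most_common() orders values by frequency, the shape a facet list,
# histogram, or tag cloud would be drawn from.
facet = articles_per_author.most_common()
```

The same value-and-count pairs can feed any of the visual forms mentioned above, since each only needs to know which values occur and how often.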

Going forward, expect us to continue to increase the scope and power of analytics that can be run on MarkLogic.

Can you give me an example of some of the new directions that your company is going? I thought I heard about a newsgroup indexing system. Is that available?

Yes, you’re talking about MarkMail (http://markmail.org) and thanks for bringing it up. MarkMail is an Internet service that we’re running focused on providing a great search experience on top of email archives, specifically the archives of mailing lists associated with open source software development projects, such as Apache or Lucene.

For us, MarkMail is a bit of an experiment. The system was a skunk works project by one of our engineers who was frustrated with the inability to find content in these open source mailing list archives. So he vented his frustration by building a great MarkLogic-based application for searching email lists.

We saw it, and liked it, so we said, “Hey, why don’t you do this full-time for a while.” We continued to like the direction it was going so we kept saying, "Hey, why not add a few resources."

Today, we’re up to about five people on the project full time and we’re driving traffic of around 15,000 visitors per day. That works out to about a half-million visitors per month.

So, we are interested in MarkMail from several perspectives. First, we think there may be some potential to grow and monetize that audience. Second, we see an opportunity to take what we’ve learned serving this community and bring it inside the enterprise. Finally, the project is a good demonstration to our government customers of what can be done with MarkLogic in what some people call "message traffic" style applications.

But it is early days, and we’re approaching MarkMail organically and experimentally. But, thus far, it’s been all good and we’re very happy to have started it and let it grow.

Are there any hints you can give me on new functions coming in the next six to nine months?

We’ve already started the early access program for customers for the next major release of our product, MarkLogic 4.0. While I can’t tip our cards too much until we launch it later this year, I can say that a major theme is analogous to the first three rules in real estate: location, location, location.

ArnoldIT Comment

Mark Logic, like Clearwell Systems, has identified a market opportunity and developed a solution that solves a real point of pain. The company is growing rapidly. Its executives find themselves in demand. ArnoldIT thinks that Mark Logic has a strong technical foundation and the marketing savvy to make the company a winner in the enterprise content market. Kick the tires of this content processing, publishing, search, and analytics vehicle. It merits a test drive.

Stephen E. Arnold, June 13, 2008
