Mark Logic, Content Automation: Flexibility and Lower Costs

May 4, 2010

I just sat in on a technical presentation by Mark Logic wizards (Mark Walsh and Mary Holstege). The highlight was the explanation of Mark Logic enhancements that promise more flexible content automation and lower costs to Mark Logic’s customers. Other points I noted included:

  • “Smart” interaction between XQuery and XSLT (sketched below)
  • More flexibility in repurposing content
  • Lower costs for certain content transformations
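
Based on the briefing, the XQuery / XSLT interplay means a stylesheet can be applied inside the server, right next to the content. Here is a minimal sketch, assuming MarkLogic’s XCC Java connector and the xdmp:xslt-invoke builtin documented for the XSLT support Mark Logic was previewing. The connection URI, stylesheet path, and document URI are hypothetical, and this is my illustration of the general pattern, not Mark Logic’s own demo code.

    import com.marklogic.xcc.ContentSource;
    import com.marklogic.xcc.ContentSourceFactory;
    import com.marklogic.xcc.Request;
    import com.marklogic.xcc.ResultSequence;
    import com.marklogic.xcc.Session;
    import java.net.URI;

    public class XQueryXsltSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical XCC endpoint for a MarkLogic XDBC server.
            ContentSource source = ContentSourceFactory.newContentSource(
                    new URI("xcc://user:password@localhost:8003/Documents"));
            Session session = source.newSession();
            // One request: the XQuery fetches a stored XML document and hands it
            // to an XSLT stylesheet via the xdmp:xslt-invoke builtin.
            Request request = session.newAdhocQuery(
                "xdmp:xslt-invoke('/styles/render.xsl', fn:doc('/content/report.xml'))");
            ResultSequence results = session.submitRequest(request);
            while (results.hasNext()) {
                System.out.println(results.next().asString());
            }
            session.close();
        }
    }

The transformation happens where the content lives, which appears to be the source of the flexibility and cost claims.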

The payoff is that revenue opportunities go up for Mark Logic licensees and the costs of performing certain content transformations go down. More information is available from www.MarkLogic.com. My impression: an important step forward.

Stephen E Arnold, May 4, 2010

Oracle Acceleration with Sun Methods

April 30, 2010

Companies with big investments in Oracle face the same tough choices that bedeviled me when using Sun Microsystems hardware for The Point (Top 5% of the Internet) in 1993. To make Sun stuff go fast, one needed to keep the Sun system pure; that is, only Sun approved goodies were to be used. Each goodie had to be certified, some stringently and some less so. The benefit of following the rules was keeping the warranty and service agreements valid. Get too frisky and one had to pay for the romp in the grass chasing technical butterflies.

Oracle follows the same path. In a world where Google talks about commodity hardware and most developers either embrace open source or have a pretty solid notion of it, the Oracle approach seems somewhat out of step.

Or is it?

I know that quite a few companies have been in the business of enhancing Oracle’s search performance, as Speed of Mind was, or are in this business now, as CopperEye is. There are others, and you can find them by searching for their hyperbole or just visiting an enterprise software trade show.

Speeding up Oracle looks like a slam dunk business. The reality is that most of Oracle’s customers want to achieve better performance by keeping within the bright white lines that Oracle puts on its information toll road.

If you have a sluggish Oracle system, you will want to run a query for Oracle accelerators on Google or just navigate to Oracle itself and run a query for the Sun Oracle database machine. The Exadata gizmo is expensive, but most serious Oracle shops will rely on this type of device to get the performance required for today’s petascale applications.

Why?

The answer is that the Oracle DBA knows that one way or another, the Oracle Sun engineers can get the system to work and deliver better performance. The boss will agree because the cost of dealing with service problems if the warranty or service agreement is invalidated makes the cost savings of non-Oracle solutions look like buying a chopped liver sandwich at a deli.

Even as the hyperbole about NoSQL solutions increases, knocking out Oracle is no easy task. Some vendors have put massive hurt on Oracle. Mark Logic comes to mind as one company with an uncanny knack for delivering a content solution that just happens to address some Oracle data issues. But other firms have yet to experience Mark Logic’s type of success.

In short, there are some powerful magnetic forces operating to repel non-Oracle solutions. The DBA whose sole job is to babysit Oracle is just one factor. Sure, Oracle has flaws, but logical arguments may have to get around the potential cost penalties of letting the engineers chase butterflies.

Stephen E Arnold, April 30, 2010

Unsponsored post.

Milward from Linguamatics Wins 2010 Evvie Award

April 28, 2010

The Search Engine Meeting, held this year in Boston, is one of the few events that focuses on the substance of information retrieval, not the marketing hyperbole of the sector. Now entering its second decade, the conference features speakers who tackle challenging subjects. This year’s topics included “Universal Composable Indexing” by Chris Biow, Mark Logic Corporation; “Innovations in Social Search” by Jeff Fried, Microsoft; and “From Structured to Unstructured and Back Again: Database Offloading” by Gregory Grefenstette, Exalead, along with a dozen other important presentations.

image

From left to right: Sue Feldman (Vice President, IDC), Dr. David Milward, Liz Diamond, Stephen E. Arnold, and Eric Rogge (Exalead).

Each year, the best paper is recognized with the Evvie Award. The “Evvie” was created in honor of Ev Brenner, one of the pioneers in machine-readable content. After a distinguished career at the American Petroleum Institute, Ev served on the planning committee for the Search Engine Meeting and contributed his insights to many search and content processing companies. One of the questions I asked after each presentation was, “What did Ev think?” I valued Ev Brenner’s viewpoint, as did many others in the field.

The winner of this year’s Evvie Award is David R. Milward, Linguamatics, for his paper “From Document Search to Knowledge Discovery: Changing the Paradigm.” Dr. Milward said:

Business success is often dependent on making timely decisions based on the best information available. Typically, for text information, this has meant using document search. However, the process can be accelerated by using agile text mining to provide decision-makers directly with answers rather than sets of documents. This presentation will review the challenges faced in bringing together diverse and extensive information resources to answer business-critical R&D questions in the pharmaceutical domain. In particular, it will outline how an agile NLP-based approach for discovering facts and relationships from free text can be used to leverage scientific knowledge and move beyond search to automated profiling and hypothesis generation from millions of documents in real time.

Dr. Milward has 20 years’ experience of product development, consultancy and research in natural language processing. He is a co-founder of Linguamatics, and designed the I2E text mining system which uses a novel interactive approach to information extraction. He has been involved in applying text mining to applications in the life sciences for the last 10 years, initially as a Senior Computer Scientist at SRI International. David has a PhD from the University of Cambridge, and was a researcher and lecturer at the University of Edinburgh. He is widely published in the areas of information extraction, spoken dialogue, parsing, syntax and semantics.

Presenting this year’s award were Eric Rogge, Exalead, and Liz Diamond, niece of Ev Brenner. The award winner received a recognition award and a check for $500. A special thanks to Exalead for sponsoring this year’s Evvie.

The judges for the 2010 Evvie were Dr. David Evans (Evans Research), Sue Feldman (IDC), and Jill O’Neill (NFAIS).

Congratulations, Dr. Milward.

Stuart Schram IV, April 28, 2010

Sponsored post.

SQL at 40: Ready for Retirement?

February 26, 2010

Darned interesting write up in Kellogg’s blog (the Web log of the CEO of Mark Logic Corporation). The title caught my attention: “The Database Tea Party: The NoSQL Movement.” If you are struggling with your favorite 40-year-old database technology, you will want to read Mr. Kellogg’s article. This comment sums up his position:

If you’re struggling with an RDBMS on a given application problem you shouldn’t say: we need an open source, NoSQL type thing. You should say: we need to look at relational database alternatives. Those alternatives include open source database projects (e.g., MongoDB, CouchDB) and key-value stores (e.g., Hadoop), but they also include commercial software offerings such as specialized DBMSs like Streambase (for real-time streams), Aster (for analytics on big data), and MarkLogic (for semi-structured data). Don’t throw out the commercial-software-benefits baby with the RDBMS bathwater.

I have written about the challenges SQL poses. I want to point out that even firms with non-RDBMS solutions can use SQL for certain tasks. I heard one Googler several years ago mention that MySQL was a useful tool. That may have changed now, but I have a couple of RDBMS files that work just fine. The “fine” is the key word because I am not pushing beyond the capabilities of the 40-year-old invention of Dr. Codd.

You don’t see too many 40-year-old athletes in the Olympics or professional sports. Why not take the same pragmatic approach to data management?

Stephen E Arnold, February 25, 2010

The addled goose has been paid by Mark Logic Corporation to give talks at the firm’s user meetings. I was not paid to write this news item, however. Next time I am in San Francisco I will try to get a taco out of this company’s engineering department.

Search Vendors! FAST-en Your Seatbelts

February 25, 2010

The Microsoft Fast ESP road show is going to come to a US city near you. The road show has entertained thousands in Frankfurt, Melbourne, London, and Paris. Next up is the US. The topics in Europe ranged from collaboration to social search to enterprise content management to SharePoint Web sites. Well, you get the idea. After flipping through the presentations (available online if you register), I concluded that the main idea is that Fast ESP “becomes the foundation” for “all enterprise search products”.

I love those categorical affirmatives. I also like to find black swans, even though I am an old goose. The description of the new system is chock full of superlatives such as “best”. With 300 companies offering search and content processing, I am hard pressed to identify one system as the “best”. Most vendors have some core competencies; claiming to be the “best” is one of the missteps that created the unhappy circumstances for Fast Search & Transfer prior to its sale to Microsoft. Anyone remember the October 2008 police action at the Fast Search offices in Oslo? No, I did not think so.

The idea is to focus on user experience and “go beyond the search box.” I quite like the “beyond” word as in “beyond search”. Here’s an example of the interface:

image

Copyright Microsoft 2010. Source is the Microsoft Web site reachable from this page http://www.fastsearch.com/l3a.aspx?m=1166&amid=15582

I don’t have many nitty-gritty technical details, but instead of burying the Fast ESP pitch in a SharePoint conference, there are these marketing-oriented traveling programs. The first US event is in Chicago on March 9; the second, in the Big Apple on March 11; and the third, in San Francisco on March 16. Microsoft has invited certified partners, resellers, and those with Bill Gates tattoos to attend. What is on tap? You will learn about the new and improved Fast ESP system, mostly for SharePoint. You will hear from happy, happy Fast ESP customers. You will get briefed by Microsoft’s own engineers and some invited guests.

As you know, Microsoft purchased Fast Search & Transfer in April 2008 for about $1.2 billion and change. In the 22-month interval, Fast ESP has been trimmed and slimmed to do battle with open source solutions such as Lucene, Lemur Consulting’s FLAX, and Solr. The new Microsoft Fast ESP will do battle with vendors who have moved “beyond search”, so I anticipate some references to the weaknesses of Exalead, MarkLogic, and other companies with industrial strength solutions. There will be some happiness for Autonomy and Fabasoft Mindbreeze, two firms with solutions that carry the Microsoft seal of approval. My hunch is that those with Windows certification, those with a paycheck hooked to keeping Microsoft systems alive and well, and partners will be joined by the systems folks who want to get a first-hand look at a $1.2 billion search system.

The addled goose is in San Francisco the week of the event, but he returns to the goose pond before the road show pitches the left coast faithful. He will have to cover the event via second hand accounts. In the meantime, he will be using his bill to root for information on these topics:

  1. How has set up, optimization, and customization been simplified?
  2. How can the system keep metadata synchronized?
  3. What is the time required to refresh the index on a 10-minute cycle in an organization with 10,000 active users out of an employee pool of 150,000 and a document flow of 1,000 new or changed documents every 24 hours (excluding emails and attachments)?
  4. What is the method for scaling Microsoft Fast?
  5. What is the method for restoring or rebuilding indexes in the event of a system fault? What is the time required in a typical organizational setting with 10,000 active users and a document flow of 1,000 new or changed documents every 24 hours (excluding email and attachments)?
  6. What is the total cost for a system for 10,000 active users?

I think I know the license fee, based on the rumors floating in the aether. I can’t reveal the deal, but the price tag will make life tough for vendors up and down the line. If Microsoft hits a home run, my question is, “What tricks does Google have ready to roll?”

You can get the full scoop at http://www.fastsearch.com/l3a.aspx?m=1166&amid=15582.

Stephen E Arnold, February 25, 2010

No one paid me to write this. I will report writing about Microsoft to the Department of Defense. Not only was I doing work for free, the DoD people understand, love, and appreciate the Microsoft technology. Free is good, I think.

Metasearch Systems: Some Thoughts

February 12, 2010

I fielded a phone call yesterday from a person who wanted to know the status of metasearch systems. I ran through some basic information and mentioned the three metasearch systems that I looked at recently for an investment bank. These services are:

Dogpile.com, owned by InfoSpace. (Note: I have in my notes a comment that suggested the search plumbing was some home brew and some Fast Search & Transfer.)

Ixquick.com, owned (according to my musty files) by Surfboard Holding BV. (Note: the application was developed by a former whiz in the investment banking sector.)

Search.com, owned by CBS. (Note: you can download the WebFerret desktop application from http://www.webferret.com/)

Metasearch functions are available from “enterprise search vendors”, a phrase that I really dislike, but folks use it. These vendors include:

BrightPlanet.com. See http://www.arnoldit.com/search-wizards-speak/brightplanet.html for an interview with William Bushee, lead technologist for BrightPlanet, for more information.

Deep Web Technologies. See http://www.arnoldit.com/search-wizards-speak/deep-web.html for an interview with Abe Lederman, the founder of Deep Web.

Vivisimo. See http://www.arnoldit.com/search-wizards-speak/vivisimo.html for an interview with Raul Valdes Perez and Jerome Pesenti, founders of Vivisimo.

There are other outfits with “federating” and metasearch functionality, but these two lists provide a foundation for my observations.
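
For those new to the idea, a metasearch system accepts one query, fans it out to several engines, and merges the results. Here is a minimal sketch in Java of that fan-out-and-merge loop; the engine endpoints are hypothetical, and a production system would parse each payload, deduplicate by URL, and re-rank instead of printing raw responses.

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.CompletableFuture;

    public class MiniMetasearch {
        public static void main(String[] args) {
            // Hypothetical engine endpoints; real systems use each engine's API.
            Map<String, String> engines = Map.of(
                "engineA", "https://engine-a.example.com/search?q=",
                "engineB", "https://engine-b.example.com/search?q=");
            String query = URLEncoder.encode("enterprise search", StandardCharsets.UTF_8);
            HttpClient client = HttpClient.newHttpClient();
            List<CompletableFuture<String>> pending = new ArrayList<>();
            // Fan the same query out to every engine concurrently.
            for (String base : engines.values()) {
                HttpRequest request = HttpRequest.newBuilder(URI.create(base + query)).build();
                pending.add(client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                                  .thenApply(HttpResponse::body));
            }
            // Merge step: a real metasearch engine parses, deduplicates, and re-ranks here.
            for (CompletableFuture<String> future : pending) {
                String body = future.join();
                System.out.println(body.substring(0, Math.min(120, body.length())));
            }
        }
    }

That is all a metasearch engine is at heart. The hard part, and the place the vendors differentiate, is the merging, deduplication, and re-ranking.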

First, I think that as Google’s share of the search market rises, traffic to these public metasearch sites will be under pressure. You can see what’s happening with Dogpile.com, Ixquick.com, and Search.com.

image

Source: Compete.com traffic comparison for Dogpile.com, Ixquick.com, and Search.com.

Each of these services is “flat”, assuming, of course, that the Compete.com data are reasonably accurate. For purposes of comparison, Compete.com reports that Google has about 145 million unique visitors, which seems low, but the relative traffic difference is the point. Metasearch is commanding a modest share of the search traffic.

I like the Ixquick.com service, and I hope that the owners can pump up the traffic. I will have to look into this situation later this year.

For the enterprise search vendors, I think the picture may be different. BrightPlanet, Deep Web, and Vivisimo can deliver injections of third party content to their enterprise customers. Will these firms be able to generate traction as Fetch and Kapow ramp up? Will these firms compete with or complement the expanding offerings of i2.co.uk, particularly in the law enforcement and intelligence sector? Will these firms have a counter to slow the push of Mark Logic Corp. into these potentially lucrative enterprise accounts?

I don’t have answers to these questions, but I think the metasearch sector is going to become more interesting in 2010.

Stephen E Arnold, February 11, 2010

No one paid me to write this article. I will report this poverty assurance plan to the Government Accountability Office when the snow melts in DC.

Semantic Search Explained

February 11, 2010

A happy quack to the reader who sent me “Breakthrough Analysis: Two + Nine Types of Semantic Search”. Martin White (Intranet Focus) and I tried to explain semantic search in the text and the glossary for our Successful Enterprise Search Management. Our approach was to point out that the word “semantic” is often used in many different ways. Our purpose was to put procurement teams on alert when buzzwords were used to explain the features of an enterprise search system. We focused on matching a specific requirement to a particular function. An example would be displaying results in categories: the search vendor had to have a system that performed this type of value-added processing. The particular adjectives and marketing nomenclature were secondary to the function. The practicality of our approach was underscored for me when I read the Intelligent Enterprise article about the nine types of semantic search.

image

Source: http://writewellcommunications.com/wp-content/uploads/2009/06/homonyms1.jpg

I don’t feel comfortable reproducing the Intelligent Enterprise list, but I urge you to take a close look at the write up. Ask yourself these questions:

  1. Do you understand the difference between related searches/queries, concept search, and faceted search? (A minimal faceted search example appears after this list.)
  2. When you look for information, are you mindful of “semantic/syntactic annotations” operating under the covers or in plain view?
  3. Do you type queries of about three words, or do you type queries with a dozen words or more organized in a question?
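
To make question one concrete, here is a minimal SolrJ sketch of faceted search: an ambiguous one-word query plus counts by a category field, the “displaying results in categories” function mentioned above. The core URL and field names are hypothetical, and this is my illustration, not anything from the Intelligent Enterprise article.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FacetSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical Solr core and field names.
            HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/docs").build();
            SolrQuery query = new SolrQuery("mercury"); // planet, element, or automobile?
            query.setFacet(true);
            query.addFacetField("category"); // counts per value-added category tag
            QueryResponse response = solr.query(query);
            // Each facet count is one bucket a user can click to disambiguate.
            for (FacetField.Count bucket :
                    response.getFacetField("category").getValues()) {
                System.out.println(bucket.getName() + " (" + bucket.getCount() + ")");
            }
            solr.close();
        }
    }

Related searches and concept search require different machinery entirely, which is exactly why lumping them all under one “semantic” label confuses procurement teams.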

Your answer underscores one of the most fragile aspects of search and content processing. A focus on the numerical recipes that different vendors use to deliver specific functions often makes little or no sense, even to engineers with deep experience in the field.

A quick example.

If you run a query on the Exstream system (the enterprise publishing system acquired by Hewlett Packard), you can get a list of content elements. The system is designed for a person in charge of placing a message in a medical invoice, an auto payment invoice, and other types of content assembly operations. The system is not particularly clever, but it works reasonably well. The notion of search in this enterprise environment is, in my opinion, quite 1980s, despite some nice features like saved projects along the lines of Quark’s palette of frequently used objects.

Now run a query on a Mark Logic-based system at a major manufacturing company. The result looks a bit like a combination of a results list and a report, but if you move to another department, the output may have a different look and feel. This flexibility is a result of the underlying plumbing of the Mark Logic system. Describing Mark Logic as a search system and attributing more “meaningful” functions to it is possible, but the difference is the architecture.

A person describing either the Exstream or the Mark Logic system could apply one or more of the “two + nine” terms to the system. I don’t think those terms are particularly helpful either to the users or to the engineers at Exstream or Mark Logic. Here’s why:

  • Systems have to solve a problem for a customer. Descriptions of what the outputs look like may not reflect what is going on under the hood. Are the saved projects the equivalent of a stored XQuery for MarkLogic?
  • Users need interfaces that allow them to get their work done. Arguably both Exstream and Mark Logic deliver for their customers. The underlying numerical recipes are essentially irrelevant as long as they do.
  • The terminology in use at each company comes from different domains, and it is entirely possible that professionals at Exstream and Mark Logic use exactly the same term with very different connotations.

The discourse about search, content processing, and information retrieval is fraught with words that are rarely defined across different slices of the information industry. In retrospect, Martin and I followed a useful, if pragmatic, path. We focused on requirements. Leave the crazy lingo to the marketers, the pundits, and the poobahs. I just want systems to work for their users. Words don’t do this, obviously; slinging lingo is easier than implementing systems so users can locate needed information.

Stephen E Arnold, February 11, 2010

No one paid me to put in this shameless plug for Martin White’s and my monograph, Successful Enterprise Search Management. This is a marketing write up, and I have dutifully reported this fact to you.

Another Shot across the Bow of the Oracle Tanker

February 5, 2010

Oracle, in my opinion, is similar to those giant oil tankers that one can see in ports around the world. Some—like the vessels parked off the west coast of England—are just waiting for an economic uptick. Others dribble oil as they grind thousands of miles from one place to another. Every once in a while, one of these oil tankers dumps its cargo and makes headlines.

Oracle is an oil tanker in the enterprise software world. The company’s core technology is expensive to scale, mostly in my view because it, like DB2, was designed and built after the Korean War. Once the Oracle tanker leaves Sea World Parkway, it is tough to stop and almost as difficult to turn around quickly.

image

Source: http://upload.wikimedia.org/wikipedia/commons/f/fd/Oil_tanker_Omala_in_Rotterdam.jpg

There are some positives to the Oracle solution. Clueless investors like to hear “Oracle is our database engine” without understanding the implications of that phrase. I suppose the investors could ask the Salesforce.com engineers or the Amazon.com engineers who babysit the Oracle tanker at the core of their organizations, but most just resonate with the brand name. And if the licensee has the human and financial resources, Oracle can scale. Presumably Oracle’s owning Sun Microsystems will help with the one-stop scaling shop, no non-Oracle hardware required going forward. And for any given problem, Oracle has a solution. Middleware. No problem. Search. No problem. XML capability like Mark Logic’s, no less. No problem. Applications. No problem. ERP. No problem. Consulting. No problem. Google Search Appliance. No problem.

The downsides are easy to summarize. You need an Oracle DBA or multiple Oracle DBAs to keep the tanker shipshape. You need money. You need consulting support from Oracle. Getting help off the reservation can lead to some tense meetings with the Sea World crowd. Once in a while the “no problem” becomes a problem, and I will leave it to your own business savvy to figure out the implications of an errant Oracle service.

When I read “Netezza Teams Up with NEC to Battle Oracle”, three thoughts crossed my addled goose brain:

  1. Oracle is getting some competition from an unexpected pair, NEC and Netezza. Will HP, Dell, and Cisco find data management partners too? I think this will be fun to watch.
  2. What will IBM do? IBM’s PR department has been working overtime on the mainframe renaissance, which seems to be of minor luminescence. IBM cannot sit on its hands and allow NEC and Netezza to go after Oracle and probably DB2. Six of one, half dozen of the other.
  3. Will the Google roll out its enterprise data management service, which of course does not exist, cannot possibly be a service, and has absolutely no traction within the Google management team?

Bottom line: Life is going to force Oracle to become even more aggressive. I am glad I am in Harrod’s Creek and not involved in procuring oil tanker software any longer.

Stephen E Arnold, February 5, 2010

No one paid me to write about software in seafaring terms. I will report this sad fact to MARAD.

Mark Logic Taps Amazon

January 25, 2010

Cloud Computing’s “Mark Logic Leverages Amazon” reported that MarkLogic Server now offers a cloud option. The write up said:

The move will obviously let customers use its widgetry on a pay-by-the-hour basis. A native XML database that implements the W3C-standard XML Query (XQuery) language, the server includes full-text and structured (XML) search. The AWS version consists of an Amazon Machine Image (AMI) with the MarkLogic Server pre-installed.

Mark Logic’s technology has demonstrated its versatility in a number of information-centric environments. With a client’s information within the MarkLogic Server environment, repurposing is a snap. In the last year, Mark Logic has emerged as an information infrastructure company that makes big boys like Oracle quite nervous. With the move to the cloud option, Mark Logic is poised to offer new services. Mark Logic’s technology exerts pressure on companies in business intelligence, enterprise publishing, and information portal services, among others.
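
To give a sense of the “full-text and structured (XML) search” combination the story mentions, here is a minimal sketch that submits an ad hoc XQuery over MarkLogic’s XCC Java connector; the same code would talk to an on-premises server or to a server launched from the Amazon AMI. The endpoint, element name, and search term are all hypothetical.

    import com.marklogic.xcc.ContentSource;
    import com.marklogic.xcc.ContentSourceFactory;
    import com.marklogic.xcc.Request;
    import com.marklogic.xcc.ResultSequence;
    import com.marklogic.xcc.Session;
    import java.net.URI;

    public class CloudSearchSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint; for the AWS option this would be the
            // EC2 host running the MarkLogic AMI.
            ContentSource source = ContentSourceFactory.newContentSource(
                    new URI("xcc://user:password@ec2-host.example.com:8004/Documents"));
            Session session = source.newSession();
            // Full-text search constrained by XML structure: find documents
            // whose <abstract> element mentions "polymer".
            Request request = session.newAdhocQuery(
                "for $hit in cts:search(fn:doc(), " +
                "    cts:element-word-query(xs:QName('abstract'), 'polymer')) " +
                "return fn:base-uri($hit)");
            ResultSequence results = session.submitRequest(request);
            while (results.hasNext()) {
                System.out.println(results.next().asString());
            }
            session.close();
        }
    }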

When Larry Ellison worries, I take notice. Important step from Mark Logic.

Stephen E Arnold, January 25, 2010

Yes, I was given a free admission to the Mark Logic user conference in Washington, DC. No, I was not paid to point to this write up about the Sys-Con story. Yes, I will beg Mark Logic to throw large sums of money at me and the goslings the next time I see one of the firm’s senior managers or investors. I will report this intent to the FCC via this footnote. Wow, I feel so much better explaining that I am a shameless marketer.

Location Aware Search via Lucene / Solr

January 19, 2010

I located an interesting and helpful post, “Location Aware Search with Apache Lucene and Solr”, on IBM’s developerWorks Web site. If you are not familiar with developerWorks, it is IBM’s resource for developers and IT professionals. If you search for “location aware Lucene”, you can get a direct link to the article from the search box at www.ibm.com. That’s a definite plus because the IBM Web site can be tough to navigate.

The write up is quite useful. As with some of the other content on the developerWorks site, the author is not an IBM employee. The Lucene / Solr write up is by a member of the technical staff at Lucid Imagination, a company that offers open source builds of Lucene and Solr as well as professional services. (Lucid is interesting because it resells commercial content connectors developed by the Australian company ISYS Search Software.)

The write up is timely and provides quite a bit of detail in its 6,000 words. You get a discussion of key Lucene concepts, geospatial search concepts, information about representing spatial data, a discussion of combining spatial data with text in search, examples, sample code, a how-to for indexing spatial information in Lucene, a review of how to search by location, and a compilation of links to relevant information in other technical documents, interviews with experts, and code, among other pointers.
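
To give a flavor of what the tutorial walks through, here is a minimal SolrJ sketch using Solr’s standard geofilt filter and geodist sort, assuming a core with a latitude/longitude field. The core URL, field names, and coordinates are hypothetical, and the IBM write up goes much deeper into the underlying Lucene indexing.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class GeoSearchSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical core with a lat,lon field named "store_location".
            HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/stores").build();
            SolrQuery query = new SolrQuery("coffee");
            query.set("pt", "38.25,-85.75");         // center point of the search
            query.set("sfield", "store_location");   // the spatial field to filter on
            query.addFilterQuery("{!geofilt d=10}"); // keep hits within 10 km
            query.set("sort", "geodist() asc");      // nearest results first
            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("name"));
            }
            solr.close();
        }
    }

The query side is the easy part; as the article explains, the heavy lifting is in how the spatial data gets represented and indexed in the first place.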

Several observations:

  1. The effort that went into this 6,000 word write up is considerable. The work is quite good, and it strikes me as catnip for some IBM-centric developers. IBM is a Lucene user, and I think that IBM and Lucid want to get as many of these developers to use Lucene / Solr as possible. This is a marketing approach comparable to Google’s push to get Android in everything from set top boxes to netbooks.
  2. The information serves as a teaser for a longer work that will be published under the title of Taming Text. That book should find a ready audience. Based on the data I have seen, many organizations—even those with modest technical resources—are looking at Lucene as a way to get a search system in place without the hassles associated with a full scale search procurement for a commercial system.
  3. The ecumenical approach taken in the write up is a plus as well. However, in the back of my mind is the constant chant, “Sell consulting, sell consulting, sell consulting”. That’s okay with me because the phrase runs through my addled goose brain every day of the week. But the write up makes clear that there is some heavy lifting required to implement a function such as location aware search using open source software.

The complexity is not unexpected. It does contrast sharply with the approach taken by MarkLogic, an information infrastructure vendor that is making location-type search part of its basic framework. Google, on the other hand, takes a slightly different approach. The company allows a developer to use its APIs to perform a large number of geospatial tricks with little fancy dancing. Microsoft is on the ease of use trail as well.

Some folks who are new to Lucene may find the code a piece of cake. Others might take a look and conclude that Lucene is a recipe that requires Julia Child in the kitchen.

Stephen E Arnold, January 19, 2010

A freebie. An IBM person once gave me an hors d’oeuvre, and a Lucid professional bought me a flavored tea. Other than these high value inducements, I wrote this without filthy lucre’s involvement. I will report this to the National Institutes of Health.
