Vertical Search: A Chill Blast from the Past

January 15, 2008

Two years ago, a prestigious New York investment banker asked me to attend a meeting without compensation. I knew my father was correct when he said, “Be a banker. That’s where the money is.” My father didn’t know Willie Sutton, but he has money insight. The day I arrived the bankers’ topic was “vertical search,” the next big money maker in search, according to the vice president who escorted me into a conference room overlooking the East River.

As I understood the notion from these financial engineers, certain parties (translation: publishers) had a goldmine of content (translation: high-value information created by staff writers and freelancers). The question asked was: “Isn’t a revenue play possible using search-and-retrieval technology and a subscription model?”

There’s only one answer that New York bankers want to hear, and that is, “I think there is an opportunity for an upside.” I repeated the catch phrase, and the five money mavens smiled. I was a good Kentucky consultant, and I had on shoes too.

My recollection is that everyone in the Park Avenue meeting room was well-groomed, scrupulously polite, and gracefully clueless about online. The folks asking me to stop by for a chat listened to me for about 60 seconds and then fired questions at me about Web 2.0 technology (which I don’t fully grasp), online stickiness (which means repeat visitors and time spent on a Web site), and online revenue growth (which I definitely understand after getting whipsawed with costs in 1993 when I was involved with The Point (Top 5% of the Internet)). Note: we sold this site to Lycos in 1995, and I vowed not to catch spreadsheet fever again. Spreadsheet fever is particularly contagious in the offices of New York banks.

This morning — Tuesday, January 15, 2008 — I read a news story about Convera’s vertical search solution. The article explained that Lloyd’s List, a portal reporting the doings in the shipping industry, was going online with a “vertical search solution.”

The idea, as I understand it, is that a new online service called Maritime Answers will become available in the future. Convera Corporation, a one-time big dog in the search-and-retrieval sled races, would use its “technical expertise to provide a powerful search tool for the shipping community.” (Note: in this essay I am not discussing the sale of Convera’s search-and-retrieval business to Fast Search & Transfer or Autonomy’s capture of some of Convera’s key sales professionals in 2007.)

Vertical Search Defined

In my first edition of The Enterprise Search Report, I included a section about vertical search. I cut out that material in 2003 because the idea seemed outside the scope of “behind the firewall” search. In the last five years, the notion of vertical search has continued to pop up as a way to serve the needs of a specific segment or constituency in a broader market.

Vertical search means limiting the content to a specific domain. Examples include information for attorneys. Companies in the vertical search business for lawyers include Lexis Nexis (a unit of Reed Elsevier) and Westlaw (a service absorbed into the Thomson Corporation). A person with an interest in a specific topic, therefore, would turn to an online system with substantial information about a particular field. Examples range from the U.S. government’s health information available as Medline Plus to Game Trade Magazine, with tens of thousands of other examples in between. One could make a good case that Web logs on a specific topic and a search box are vertical search systems.

The idea is appealing because if one looks for information on a narrow topic, a search system with information only on that topic, in theory, makes it easier to find the nugget or answer the user seeks — at least to someone who doesn’t know much about the vagaries of online information. I will return to this idea in a moment.
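
To make the mechanics concrete, here is a minimal sketch, in Python with invented documents and source names, of what “limiting the content to a specific domain” means in practice: the index only ever sees material from a whitelisted set of sources, so every query is implicitly scoped to the vertical.

```python
# Minimal sketch of a vertical search index: only documents from
# whitelisted sources (the "vertical") are ever indexed or searched.
# The documents and source names below are invented for illustration.
from collections import defaultdict

ALLOWED_SOURCES = {"lloydslist.example.com", "maritime-journal.example.com"}

documents = [
    {"id": 1, "source": "lloydslist.example.com",
     "text": "Tanker rates firm as charterers scramble for tonnage"},
    {"id": 2, "source": "general-news.example.com",   # outside the vertical
     "text": "Stock markets rally on rate cut hopes"},
]

index = defaultdict(set)          # term -> set of document ids
docs_by_id = {}

for doc in documents:
    if doc["source"] not in ALLOWED_SOURCES:
        continue                  # everything else is simply ignored
    docs_by_id[doc["id"]] = doc
    for term in doc["text"].lower().split():
        index[term].add(doc["id"])

def search(query):
    """Return documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return [docs_by_id[i] for i in sorted(hits)]

print(search("tanker rates"))     # finds doc 1; doc 2 was never indexed
```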

Commercial Databases: The Origin of Vertical Search

Most readers of this Web log will have little experience with using commercial databases. The big online vendors have found themselves under siege by the Web and their own actions.

In the late 1960s when the commercial online business began with an injection of U.S. government funding, the only kind of database possible was one that was very narrow. The commercial online services offered specific collections of information on very narrow topics or information confined to a specific technical discipline. By 1980, there were some general business databases available, but these were narrowly constrained by editorial policies.

In order to make the early search-and-retrieval systems useful, database publishers (the name given to the people and companies who built databases) had to create fields, what today might be specified in an “XML document type definition.” The database builders would pay indexers to put the name of the author, the title of the source, the key words from a controlled term list, and other data (now called metadata) into these fields.

In 1980, the user would pay a fee to get an account with an online vendor. The leaders of a quarter century ago mean very little to most online users today. The Googles and Microsofts of 1980 were Dialog Corporation, BRS, SDC, and a handful of others such as DataStar.

Every database or “file” on these systems was a vertical database. Users of these commercial systems would have to learn the editorial policy of a particular database; for example, ABI / INFORM or PROMT. When Dialog was king, the service offered more than 300 commercial databases, and most users picked a particular file and entered queries using a proprietary syntax. For example, to locate marketing information from the most recent update to the ABI / INFORM database one would enter into the Dialog command line: SS UD=9999 and CC=76?? and marketing. If a user wanted chemical information, the Chemical Abstracts service required the user to know the specific names and structures of chemicals.

Characteristics of These Original Vertical Databases

Collections of information on a single topic or field have a peculiar characteristic that most users and investment bankers do not understand: the narrower the content collection, the greater the need for a specialized vocabulary. Let me give an example. In the ABI / INFORM file it was pointless to search for the concept via the word “management.” The entire database was “about” management. Therefore, a careless query would, in theory, return a large number of hits. We, therefore, made “management” a stop word; that is, one that would not return results. We forced users to access the content via a controlled vocabulary, complete with Use For and See Also cross references. We created a business-centric classification coding scheme so a user could retrieve the marketing information using the command CC=76??.
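
Here is a rough sketch of those two editorial devices, stop words and classification codes, at indexing time. The stop list and the code table are invented stand-ins, not the actual ABI / INFORM vocabulary.

```python
# Sketch of indexing with stop words and a classification coding scheme.
# The stop list and code table are invented; ABI / INFORM's real
# controlled vocabulary was far larger and maintained by human indexers.
STOP_WORDS = {"management", "business", "company"}   # too common to be useful

CLASSIFICATION_CODES = {
    "7600": "Marketing",
    "7610": "Sales promotions",
    "7620": "Market research",
}

def index_record(title, abstract, codes):
    """Return the searchable terms and codes for one database record."""
    terms = {
        t for t in (title + " " + abstract).lower().split()
        if t not in STOP_WORDS                      # drop the 'aboutness' words
    }
    return {"terms": terms, "codes": set(codes)}

record = index_record(
    title="Channel management in consumer marketing",
    abstract="A study of promotion budgets in retail business",
    codes=["7600", "7610"],
)

# A query such as CC=76?? retrieves by code prefix rather than by word.
def matches_code(record, code_prefix="76"):
    return any(c.startswith(code_prefix) for c in record["codes"])

print("management" in record["terms"])   # False: stop word, not indexed
print(matches_code(record))              # True: record carries a 76xx code
```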

Another attribute of vertical content or deep information on a narrow subject is that the terminology shifts. When a new development occurred in oil and gas, the American Petroleum Institute had to identify the new term and take steps to map the new idea to content “about” that new subject. Let me give an example from a less specialized field than oil exploration. You know about an acquisition. The term means one company buys another. In business, however, the word takeover may be used to describe this action. In financial circles, there will be leveraged buyouts, a venture capital buyout, or a management buyout. In short, the words used to describe an acquisition evidence the power of English and the difficulty of creating a controlled vocabulary for certain fields. The paradox is that the deeper the content in detail and through time, the more complicated the jargon becomes. A failure to search for the appropriate terms means that information on the topic is not retrieved. In the search systems of yore, getting the information on acquisitions from ABI / INFORM required an explicit query with all of the relevant terms present.
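
The usual remedy for this vocabulary drift is a thesaurus with Use For and See Also cross references that expand a searcher’s word into the controlled term and its relatives. A toy sketch, with an invented entry for “acquisition”:

```python
# Toy thesaurus-based query expansion. The entries are invented examples
# of the Use For / See Also cross references described in the text.
THESAURUS = {
    "acquisition": {
        "use_for": ["takeover", "buyout"],
        "see_also": ["leveraged buyout", "management buyout", "merger"],
    },
}

def expand_query(term):
    """Expand a user term into the controlled term plus its cross references."""
    entry = THESAURUS.get(term.lower())
    if entry is None:
        return [term]
    return [term] + entry["use_for"] + entry["see_also"]

print(expand_query("acquisition"))
# ['acquisition', 'takeover', 'buyout', 'leveraged buyout',
#  'management buyout', 'merger']
```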

Vertical Search 2008

Convera is a company that has faced some interesting and challenging experiences. The company’s core technology was rooted in scanning paper documents, converting these documents to ASCII via optical character recognition, and then making the documents searchable via an interface. In 1995, the company acquired ConQuest Software for $33 million; ConQuest was developed by a former colleague of mine at Booz, Allen & Hamilton. Convera also acquired Semio’s Claude Vogel in 2002, a rocket scientist who has since left Convera. Convera received backing from Allen & Co., a New York firm, and embarked on a journey to reinvent itself. This is an intriguing case example, and I may write about it in the future.

The name “Convera” was adopted in 2000 when Excalibur Technologies landed a deal with Intel. After the Intel deal went south at about the same time a Convera deal with the NBA ran aground, the Convera name stuck. Over the last eight years Convera has worked to reduce its debt and find new sources of revenue, and it finally divested itself of its search-and-retrieval business, emerging as a provider of vertical search. I have not done justice to a particularly interesting case study in the hurdles companies face when those firms try to make money without a Google-type business model.

Now Convera is in the vertical search business. It uses its content acquisition technology or crawlers and parsers to build indexes. Convera has word lists for specific markets such as law enforcement and health as well as technology that automatically indexes, classifies, and tags processed content. The company also has server farms that can provide hosted or managed search services to its customers.

Instead of competing with Google in the public Web indexing space, Convera, as I understand its business model, approaches a client who wants to build a vertical content collection. Convera then indexes the content of selected Web sites plus any content the customer (a publisher, for example) owns. The customer pays Convera for its services. The customer then either gives away access to the content collection or charges its users a fee to access the content.

In short, Convera is in the vertical search business. The idea is that Convera’s stakeholders get money by selling services, not licensing a search-and-retrieval engine to an enterprise. Convera’s interesting history makes clear that enterprise software and joint ventures such as those with Intel can lose big money, more than $600 million give or take a couple hundred million. Obviously Convera’s original business model lacked the lift its management teams projected.

The Value of Vertical Search

The value of vertical search depends upon several factors that have nothing to do with technology. The first factor is the desire of a customer such as a publisher like Lloyd’s List to find a new way to generate growth and zip from a long-in-the-tooth information service. Publishers are in a tough spot. Most are not very good at technical foresight. More problematic, the online options can cannibalize their existing revenues. As a business segment, traditional publishing is a hostile place for 17th-century business models.

Another factor is the skill of the marketers and sales professionals. Never underestimate the value of a smooth talking peddler. Big deals can be done on the basis of charm and a dollop of FUD, fear-uncertainty-doubt.

A third element is the environmental pressures that come from companies and innovators largely indifferent to established businesses. One example is the Google-Microsoft-Yahoo activity. Each of these companies is offering online access to information mostly without direct fees to the user. The advertisers foot the bill. All three are digitizing books, indexing Web logs or social media, and working with certain third parties to offer certain information. Even Amazon is in the game with its Kindle device, online service, and courtesy fee for certain online Web log content. Executives at these companies know about the problems publishers face, but there’s not much executives at these companies can do to alter the tectonic shift underway in information access. I know I wouldn’t start a traditional magazine or newspaper even though for decades I was an executive in newspaper and publishing companies like the Courier Journal & Louisville Times and Ziff Communications.

Vertical Search: Google Style

You can create your own vertical search system now. You don’t have to pay Convera’s wizards for this service. In fact, you don’t have to know how to program or do much more than activate your browser. Google will allow anyone to create a custom search engine, which is that company’s buzzword for a vertical search system. Navigate to Google’s CSE page and explore. If you want to see the service in action, navigate to Macworld’s beta.

We’ve come full circle in a sense. The original online market was only vertical search; that is, very specific collections of content on a particular topic or discipline. Then we shifted to indexing the world of information. Now, the Google system allows anyone to create a very narrow domain of content.

What’s this mean? First, I am not sure the Convera for-fee approach will be as financially rewarding as the company’s stakeholders expect. Free is tough to beat. For a publisher wanting to index proprietary content, Google will license a Google Search Appliance. With the OneBox API, it is possible to integrate the Google CSE with the content processed by the GSA. Few people recognize that Google’s approach allows a technically savvy person or one who is Googley to replicate most of the functionality on offer from the hundreds of companies competing in the “beyond search” markets.

Second, a narrow collection built by spidering a subset of Web sites will, by definition, face some cost hurdles. Companies providing custom subsets via direct spidering and content processing will see those costs rise as the collections grow. These costs will be controllable by cutting back on the volume of content spidered and processed. Alternatively, the quality of service or technical innovations will have to be scaled to match available resources. Either way, Google, Microsoft, and Yahoo may control the fate of the vertical search vendors.

Finally, the enthusiasm for vertical search may be predicated on misunderstanding available information. There is a big market for vertical search in law enforcement, intelligence, and pharmaceutical competitive intelligence. There may be a market in other sectors, but with a free service like Google’s getting better with each upgrade to the Google service array, I think secondary and tertiary markets may go with the lower-cost alternative.

Stakeholders in Convera don’t know the outcome of Convera’s vertical search play. One thing is certain. New York bankers are mercurial, and their good humor can disappear with a single disappointing earnings report. I will stick with the motto, “Surf on Google” and leave certain types of search investments to those far smarter than I.

Stephen E. Arnold
January 15, 2008, 10 am

Library Automation: SirsiDynix and Brainware

January 14, 2008

On January 9, 2008, Marshall Breeding, an industry watcher in the library automation space, posted a story called “Perceptions 2007: an International Survey of Library Automation.” I urge anyone interested in online information retrieval to pay particular attention to the data presented in Mr. Breeding’s article. One finding caught my attention. The products of SirsiDynix, Unicorn and Horizon, received low satisfaction scores from libraries responding to the survey. Unicorn, the company’s flagship ILS, performed somewhat better than Horizon. Some 14% of libraries running Unicorn and about half of those with Horizon indicated interest in migrating to another system, which is not surprising considering SirsiDynix’s position not to develop that system further. Horizon libraries expressed high interest in open source ILS alternatives. The comments provided by libraries running Horizon voiced an extremely high level of frustration with SirsiDynix as a company and its decision to discontinue Horizon. Many indicated distrust toward the company. The comments from libraries running Unicorn, the system which SirsiDynix selected as the basis for its flagship Symphony ILS, also ran strongly negative: some because of issues with the software, some because of concerns with the company.

SirsiDynix recently announced that it will use an interesting search-and-retrieval system marketed by Brainware, a company located in Northern Virginia, not far from Dulles Airport.

In my forthcoming Beyond Search study, I am profiling the Brainware technology and paying particular attention to the firm’s approach to content processing. SirsiDynix conducted a thorough search for an access technology that would handle diverse content types and deliver fast throughput. The firm selected the Brainware technology to provide its customers with a more powerful information access tool.

Mr. Breeding’s report provides some evidence that SirsiDynix may want to address some customer satisfaction issues. Innovation, or the lack thereof, seems to be at the top of the list. SirsiDynix’s decision to partner with Brainware for search-and-retrieval should go a long way toward addressing its customers’ concerns in this important area. This decision is also a testament to the strength of the Brainware solution. Accordingly, Brainware warrants close consideration when intelligent content processing is required.

Most library automation vendors integrate technology in order to deliver a comprehensive solution. The vendors providing these technologies on an OEM or original equipment manufacturing basis are not able to influence in a significant way how their licensees deploy the licensed technology.

In my take on the data in Mr. Breeding’s article, the challenges SirsiDynix faces are not those of Brainware, a company enjoying 50 percent growth in 2007. In Beyond Search, I’m rating Brainware as a “Warrants a Close Look”. I respect the findings in the survey data reported by Mr. Breeding. But let me be clear: don’t mix up SirsiDynix’s business challenges with the Brainware technology. These are separate matters. SirsiDynix, like many library automation companies, faces a wide set of challenges and extraordinary demands from library customers. Brainware provides advanced content processing solutions that should address some of those demands.

Stephen E. Arnold, January 15, 2008

Search Turbocharging: A Boost for Search Company Valuations?

January 13, 2008

PCWorld ran a January 12, 2008, story, “Microsoft’s FAST Bid Signals a Shift in Search.” The story is important because it puts “behind the firewall” search in the high beams.

A Bit of History

Fast Search & Transfer got out of the online Web search and advertising business in early 2003. CNet covered the story thoroughly. Shortly after the deal either John Lervik or Bjorn Laukli, both Fast Search senior executives, told me, “Fast Search will become the principal provider of enterprise search.” In 2003, there was little reason to doubt this assertion. Fast Search was making progress with lucrative U.S. government contracts via its partner AT&T. Google’s behind-the-firewall search efforts were modest. Autonomy and Endeca each had specific functionality that generally allowed the companies to compete in a gentlemanly way, often selling the same Fortune 1000 company their search systems. Autonomy was automatic and able to process large volumes of unstructured content; Endeca at that time was more adept at handling structured information and work flow applications. Fast Search was betting that it could attack the enterprise market and win big.

Now slightly more than four years later, the bold bet on the enterprise market has created an interesting story. The decision to get out of Web search and advertising may prove to be one of the most interesting decisions in search and retrieval. Most of the coverage of the Microsoft offer to buy Fast Search focuses on the here and now, not the history. Fast Search suffered some financial setbacks in 2006 and 2007, but the real setback from my point of view is in the broader enterprise market.

Some Rough Numbers for Context

Specifically, after four years of playing out its enterprise strategy, Fast Search has fallen behind Autonomy. That company’s revenues are likely to be about 50 percent higher than Fast Search’s on an annualized basis, roughly $300 million to $200 million over the last 12 months. (I’m rounding gross revenues for illustrative purposes.) Endeca is likely to hit the $90 to $100 million target in 2008, so these three companies generate collectively gross revenues of about $600 million. Now here’s the kicker. Google’s often maligned Google Search Appliance has more than 8,000 licensees. I estimate that the gross revenue from the GSA is about $350 million per year. Even if I am off in my estimates (Google refuses to answer my questions or acknowledge my existence), my research suggests that as of December 31, 2007, Google was the largest vendor of “behind the firewall” search. This estimate excludes the bundled search in the 65 million SharePoint installations and the inclusion of search in other enterprise applications.

One more data point, and again I am going to round off the numbers to make a larger point. Google’s GSA revenue is a fraction of Google’s projected $14 billion gross revenue in calendar 2007. Recall that at the time Fast Search got out of Web search and advertising, Google was generating somewhere in the $50 to $100 million per year range and Fast Search was reporting revenue of about $40 million. Since 2003, Google has caught up with Fast Search and bypassed it in revenue generated from the enterprise search market sector.

The Fast Search bet brought the high-octane Microsoft bid. However, revenue issues, employee rationalization, and eroding days sales outstanding figures suggest that the Fast Search vehicle has some mechanical problems. Perhaps technology is the issue? Maybe management lacked the MBA skills to keep the pit crew working at its peak? Could the market itself have changed in a fundamental way, looking for something that was simpler and required less tinkering? I certainly don’t know.

What’s Important in a Search Acquisition?

Now back to the PCWorld story by IDG’s Chris Kanaracus. We learn that Microsoft got a deal at $1.2 billion and solid technology. Furthermore, various pundits and industry executives focus on the “importance” of search. One type of “importance” is financial because $1.2 billion for a company with $200 million in revenue translates to six times annual revenue. Another type of importance is environmental because the underperforming “behind the firewall” search sector got some much-needed publicity.

What we learn from this article is that “behind the firewall” search is still highly uncertain. There’s nothing in the Microsoft offer that clarifies the specifics of Microsoft’s use of the Fast Search technology. The larger market remains equally murky. Search is not one thing. Search is key word indexing, text mining, classifying, and metatagging. Each of these components is complicated and tricky to set up and maintain. Furthermore, the vendors in the “behind the firewall” space can change their positioning as easily as an F-1 team switches the decals on its race car.

Another factor is that no one outside of Google knows what Google, arguably the largest vendor of “behind the firewall” search, will or will not do. Ignoring Google in the enterprise market is easy and convenient. A large number of “behind the firewall” search vendors skirt Google or dismiss the company’s technology by commenting about it in an unflattering manner.

I think that is a mistake. Before the pundits and the vendors start calculating their big paydays from Microsoft’s interest in Fast Search & Transfer, they should remember that Google cannot be ignored; otherwise, the dip in Microsoft shares cited in the PCWorld article might look like a flashing engine warning light. Shifting into high gear is useless if the engine blows up.
Stephen E. Arnold
January 14, 2008

Computerworld’s Take on Enterprise Search

January 12, 2008

Several years ago I received a call. I’m not at liberty to reveal the names of the two callers, but I can say that both callers were employed by the owner of Computerworld, a highly-regarded trade publication. Unlike its weaker sister, InfoWorld, Computerworld remains both a print and online publication. The subject of the call was “enterprise search” or what I now prefer to label “behind-the-firewall search.”

The callers wanted my opinion about a particular vendor of search systems. I provided a few observations and said, “This particular company’s system may not be the optimal choice for your organization.” I was told, “Thanks. Goodbye.” IDG promptly licensed the system against which I cautioned. In December 2007 at the international online meeting in London, England, an acquaintance of mine who works at another IDG company complained about the IDG “enterprise search” system. When I found myself this morning (January 12, 2008) mentioned in an article authored by a professional working at an IDG unit, I invested a few moments with the article, an “FAQ” organized as questions and answers.

In general, the FAQ snugly fitted what I believe are Computerworld’s criteria for excellence. But a few of the comments in the FAQ nibbled at me. I had to work on my new study Beyond Search: What to Do When Your Search System Doesn’t Work, and I had this FAQ chewing at my attention. A Web log can be a useful way to test certain ideas before “official” publication. Even more interesting is that I know that IDG’s incumbent search system, ah, disappoints some users. Now, before the playoff games begin I have an IDG professional cutting to the heart of search and content processing. The article “FAQ: Why Is Enterprise Search Harder Than Google Web Search?” references me. The author appears to be Eric Lai, and I don’t know him, nor do I have any interaction with Computerworld or its immediate parent, IDC, or the International Data Group, the conglomerate assembled by Patrick McGovern (blue suit, red tie, all the time, anywhere, regardless of the occasion).

On the article’s three Web pages (pages I want to add that are chock full of sidebars, advertisements, and complex choices such as Recommendations and White Papers) Mr. Lai’s Socratic dialog unfurls. The subtitle is good too: “Where Format Complications Meet Inflated User Expectations”. I cannot do justice to the writing of a trained, IDC-vetted journalist backed by the crack IDG editorial resources, of course. I’m a lousy writer, backed by my boxer dog Tyson and a moonshine-swilling neighbor next hollow down in Harrods Creek, Kentucky.

Let me hit the key points of the FAQ’s Socratic approach to the thorny issues of “enterprise search”, which is, remember, “behind-the-firewall search” or Intranet search. After thumbnailing each of Mr. Lai’s points, I will offer comments. I invite feedback from IDC, IDG, or anyone who has blundered into my Beyond Search Web log.

Point 1: Function of Enterprise Search

Mr. Lai’s view is that enterprise search makes information “stored in their [users’] corporate network” available. Structured and unstructured data must be manipulated, and Mr. Lai, on the authority of Dr. Yves Schabes, Harvard professor and Teragram founder, reports that a dedicated search system executes queries more rapidly “though it can’t manipulate or numerically analyze the data.”

Beyond Search wants to add that Teragram is an interesting content processing system. In Mr. Lai’s discussion of this first FAQ point, he has created a fruit salad mixed in with his ones and zeros. The phrase “enterprise search” is used as a shorthand way to refer to the information on an organization’s computers. Although a minor point, there is no “enterprise” in “enterprise search” because indexing behind-the-firewall information means deciding what not to index or, at least, what content is available to whom under what circumstances. One of the gotchas in behind-the-firewall search, therefore, is making sure that the system doesn’t find and make available personal information, health and salary information, certain sensitive information such as what division is up for sale, and the like. A second comment I want to make is that Teragram is what I classify as a “content processing system provider”. Teragram’s technology, which has been used at the New York Times and America Online, can be an enhancement to other vendors’ technology. Finally, the “war of words” that rages between various vendors about performance of database systems is quite interesting. My view is that behind-the-firewall search and the new systems on offer from Teragram and others in the content processing sector are responding to a larger data management problem. Content processing is a first step toward breaking free of the limitations of the Codd database. We’re at an inflection point and the swizzling of technologies presages a far larger change coming. Think dataspaces, not databases, for example. I discuss dataspaces in my new study out in April 2008, and I hope my discussion will put the mélange of ideas in Mr. Lai’s first Socratic question in a different context. The change from databases to dataspaces is more than a change of two consonants.

Point 2: Google as the Model for Learning Search

Mr. Lai’s view is that a user of Google won’t necessarily be able to “easily learn” [sic] an “enterprise search” system.

I generally agree with the sentiment of the statement. In Beyond Search I take this idea and expand it to about 250 pages of information, including profiles of 24 companies offering a spectrum of systems, interfaces, and approaches to information access. Most of the vendors’ systems that I profile offer interfaces that allow the user to point-and-click their way to needed information. Some of the systems absolve the user of having to search for anything because work flow tools and stored queries operate in the background. Just-in-time information delivery makes the modern systems easier to use because the hapless employee doesn’t have to play the “search box guessing game.” Mr. Lai, I believe, finds query formulation undaunting. My research reveals the opposite. Formulating a query is difficult for many users of enterprise information access systems. When a deadline looms, employees are uncomfortable trying to guess the key word combination that unlocks the secret to the needed information.

Point 3: Hard Information Types

I think Mr. Lai reveals more about his understanding of search in this FAQ segment. Citing our intrepid Luxembourgian, Dr. Schabes, we learn about eDiscovery, rich media, and the challenge of duplicate documents routinely spat out by content management systems.

The problem is the large amounts of unstructured data in an organization. Let’s rein in this line of argument. There are multiple challenges in behind-the-firewall search. What makes information “hard” (I interpret the word “hard” as meaning “complex”) involves several little-understood factors colliding in interesting ways:

  • In an organization there may be many versions of documents, many copies of various versions, and different forms of those documents; for example, a sales person may have the Word version of a contract on his departmental server, but there may be an Adobe Portable Document Format version attached to the email telling the client to sign it and fax the PDF back. You may have had to sift through these variants in your own work.
  • There are file types that are in wide use. Many of these may be renegades; that is, the organization’s over-worked technical staff may be able to deal with only some of them. Other file types such as iPod files, digital videos of a sales pitch captured on a PR person’s digital video recorder, or someone’s version of a document exported using Word 2007’s XML format are troublesome. Systems that process content for search and retrieval have filters to handle most common file types. The odd ducks require some special care and feeding. Translation: coding filters, manual work, and figuring out what to do with the file types for easy access.
  • Results in the form of a laundry list are useful for some types of queries but not for others. The more types of content processed by the system, the less likely a laundry list will be useful. Not surprisingly, advanced content processing systems produce reports, graphic displays, suggestions, and interactive maps. When videos and audio programs are added to the mix, the system must be able to render that information. Most organizations’ networks are not set up to shove 200 megabyte video files to and fro with abandon or alacrity. You can imagine the research, planning, and thought that must go into figuring out what to do with these types of digital content.

None of these is “hard”. What’s difficult is the problem solving needed to make these data and information useful to an employee so work gets done quickly and in an informed manner. Not surprisingly, Mr. Lai’s Socratic approach leaves a few nuances in the tiny spaces of the recitation of what he thinks he heard Mr. Schabes suggest. Note that I know Mr. Schabes, and he’s an expert on rule-based content processing and Teragram’s original rule nesting technique, a professor at Harvard, and a respected computer scientist. So “hard” may not be Teragram’s preferred word. It’s not mine.
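
As a small illustration of the versions-and-copies problem in the first bullet above, here is a sketch of catching exact duplicates during content processing by hashing the extracted text. The file paths are invented, and catching a Word and a PDF rendering of the same contract requires fuzzier near-duplicate techniques than a simple hash.

```python
# Sketch of catching exact duplicates during content processing by hashing
# the text extracted from each file. File paths are invented; real systems
# also need near-duplicate detection for Word vs. PDF variants of the
# same document, which a simple hash cannot catch.
import hashlib
from collections import defaultdict

extracted_texts = {                      # path -> text pulled out by filters
    "sales/smith_contract_v3.doc": "This agreement is made between ...",
    "email/attachments/smith_contract_v3.doc": "This agreement is made between ...",
    "sales/smith_contract_draft.doc": "DRAFT This agreement is made between ...",
}

groups = defaultdict(list)               # content fingerprint -> file paths
for path, text in extracted_texts.items():
    fingerprint = hashlib.sha1(text.encode("utf-8")).hexdigest()
    groups[fingerprint].append(path)

for fingerprint, paths in groups.items():
    if len(paths) > 1:
        print("exact duplicates:", paths)
# The draft is a different version, so it hashes differently and is kept
# as a separate item for the index.
```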

Point 4: Enterprise Search Is No More Difficult than Web Search

Mr. Lai’s question burrows to the root of much consternation in search and retrieval. “Enterprise search” is difficult.

My view is that any type of search ranks as one of the hardest problems in computer science. There are different types of problems with each variety of search–Web, behind-the-firewall, video, question answering, discovery, etc. The reason is that information itself is a very, very complicated aspect of human behavior. Dissatisfaction with “behind-the-firewall” search is due to many factors. Some are technical. In my work, when I see yellow sticky notes on monitors or observe piles of paper next to a desk, I know there’s an information access problem. These signs signal the system doesn’t “work”. For some employees, the system is too slow. For others, the system is too complex. A new hire may not know how to finagle the system to output what’s needed. Another employee may be too frazzled to be able to remember what to do due to a larger problem which needs immediate attention. Web content is no walk in the park either. But free Web indexing systems have a quick fix for problem content. Google, Microsoft, and Yahoo can ignore the problem content. With billions of pages in the index, missing a couple hundred million with each indexing pass is irrelevant. In an organization, nothing angers a system user quicker than knowing a document has been processed or should have been processed by the search system. When the document cannot be located, the employee either performs a manual search (expensive, slow, and stress inducing) or goes ballistic (cheap, fast, and stress releasing). In either scenario or one in the middle, resentment builds toward the information access system, the IT department, the hapless colleague at the next desk, or maybe the person’s dog at home. To reiterate an earlier point. Search, regardless of type, is extremely challenging. Within each type of search, specific combinations of complexities exist. A different mix of complexities becomes evident within each search implementation. Few have internalized these fundamental truths about finding information via software. Humans often prefer to ask another human for information. I know I do. I have more information access tools than a nerd should possess. Each has its benefits. Each has its limitations. The trick is knowing what tool is needed for a specific information job. Once that is accomplished, one must know how to deal with the security, format, freshness, and other complications of information.

Point 5: Classification and Social Functions

Mr. Lai, like most search users and observers, has a nose that twitches when a “new” solution appears. Automatic classification of documents and support of social content are two of the zippiest content trends today.

Software that can suck in a Word file and automatically determine that the content is “about” the Smith contract, belongs to someone in accounting, and uses the correct flavor of warranty terminology is useful. It’s also like watching Star Trek and hoping your BlackBerry Pearl works like Captain Kirk’s communicator. Today’s systems, including Teragram’s, can index at 75 to 85 percent accuracy in most cases. This percentage can be improved with tuning. When properly set up, modern content processing systems can hit 90 percent. Human indexers, if they are really good, hit in the 85 to 95 percent range. Keep in mind that humans sometimes learn intuitively how to take short cuts. Software learns via fancy algorithms and doesn’t take short cuts. Both humans and machine processing, therefore, have their particular strengths and weaknesses. The best performing systems with which I am familiar rely on humans at certain points in system set up, configuration, and maintenance. Without the proper use of expensive and scarce human wizards, modern systems can veer into the ditch. The phrase “a manager will look at things differently than a salesperson” is spot on. The trick is to recognize this perceptual variance and accommodate it insofar as possible. A failure to deal with the intensely personal nature of some types of search issues is apparent when you visit a company where there are multiple search systems or a company where there’s one system, such as the one in use at IDC, and discover that it does not work too well. (I am tempted to name the vendor, but my desire to avoid a phone call from hostile 20-year-olds is very intense today. I want to watch some of the playoff games on my couch potato television.)
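
Those accuracy percentages are, at bottom, agreement rates between the system’s labels and a human indexer’s labels on a test set. Here is a toy sketch of how such a figure is computed, with an invented rule-based classifier and made-up test documents.

```python
# Toy illustration of how a classification accuracy figure is computed:
# compare machine-assigned labels against human indexer labels on a test
# set. The keyword rules and test documents are invented for illustration.
RULES = {
    "contracts": {"agreement", "warranty", "indemnify"},
    "accounting": {"invoice", "ledger", "receivable"},
}

def classify(text):
    """Assign the label whose keyword set overlaps the document most."""
    words = set(text.lower().split())
    scores = {label: len(words & keywords) for label, keywords in RULES.items()}
    return max(scores, key=scores.get)

test_set = [                              # (document text, human label)
    ("The warranty terms of this agreement ...", "contracts"),
    ("Attached invoice covers the ledger entries ...", "accounting"),
    ("Please indemnify the receivable balance ...", "accounting"),  # ambiguous
]

correct = sum(1 for text, human_label in test_set
              if classify(text) == human_label)
print(f"accuracy: {correct}/{len(test_set)} = {correct/len(test_set):.0%}")
```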

Point 6: Fast’s Search Better than Google’s Search

Mr. Lai raises a question that plays to America’s fascination with identifying the winner in any situation.

We’re back to a life-or-death, winner-take-all knife fight between Google and Microsoft. No search technology is necessarily better or worse than another. There are very few approaches that are radically different under the hood. Even the highly innovative approaches of companies such as Brainware and its “associative memory” approach or Exegy with its juiced up hardware and terabytes of on board RAM appliance share some fundamentals with other vendors’ systems. If you slogged through my jejune and hopelessly inadequate monographs, The Google Legacy (Infonortics, 2005) and Google Version 2.0 (Infonortics, 2007), and the three editions I wrote of The Enterprise Search Report (CMSWatch.com, 2004, 2005, 2006) you will know that subtle technical distinctions have major search system implications. Search is one of those areas where a minor tweak can yield two quite distinctive systems even though both share similar algorithms. A good example is the difference between Autonomy and Recommind. Both use Bayesian mathematics, but the differences are significant. Which is better? The answer is, “It depends.” For some situations, Autonomy is very solid. For others, Recommind is the system of choice. The same may be said of Coveo, Exalead, ISYS Search Software, Siderean, or Vivisimo, among others. Microsoft will have some work to do to understand what it has purchased. Once that learning is completed, Microsoft will have to make some decisions about how to implement those features into its various products. Google, on the other hand, has a track record of making the behind-the-firewall search in its Google Search Appliance better with each point upgrade. The company has made the GSA better and rolled out the useful OneBox API to make integration and function tweaking easier. The problem with trying to get Google and Microsoft to square off is that each company is playing its own game. Socratic Computerworld professionals want both companies to play one game, on a fight-to-the-death basis, now. My reading of the data I have is that a Thermopylae-style clash is not in the interests of either Google or Microsoft now or in the near future. The companies have different agendas, different business models, and different top-of-mind problems to resolve. The future of search is that it will be invisible when it works. I don’t think that technology is available from either Google or Microsoft at this time.

Point 7: Consolidation

Mr. Lai wants to rev the uncertainty engine, I think. We learn from the FAQ that search is still a small, largely unknown market sector. We learn that big companies may buy smaller companies.

My view is that consolidation is a feature of our market economy. Mergers and acquisitions are part of the blood and bones of business, not a characteristic of the present search or content processing sector. The key point that is not addressed is the difficulty of generating a sustainable business selling a fuzzy solution to a tough problem. Philosophers have been trying to figure out information for a long time and have done a pretty miserable job as far as I can tell. Software that ventures into information is going to face some challenges. There’s user satisfaction, return on investment, appropriate performance, and the other factors referenced in this essay. The forces that will ripple through behind-the-firewall search are:

  • Business failure. There are too many vendors and too few buyers willing to pay enough to keep the more than 350 companies sustainable
  • Mergers. A company with customers and so-so technology is probably more valuable than a company with great technology and few customers. I have read that Microsoft was buying customers, not Fast Search & Transfer’s technology. Maybe? Maybe not.
  • Divestitures and spin outs. Keep in mind that Inxight Software, an early leader in content processing, was pushed out of Xerox’s Palo Alto Research Center. The fact that it was reported as an acquisition by Business Objects emphasized the end game. The start was, “Okay, it’s time to leave the nest.”

The other factor is not consolidation; it is absorption. Information is too important to leave in a stand-alone application. That’s why Microsoft’s Mr. Raikes seems eager to point out that Fast Search would become part of SharePoint.

Net-Net

The future, therefore, is that there will be less and less enthusiasm for expensive, stand-alone “behind-the-firewall” search. Search is becoming part of larger, higher-value information access solutions.

Stephen E. Arnold
January 13, 2008

A Turning Point in Search? Is the Microsoft-FAST Deal Significant?

January 11, 2008

The interest in the Microsoft purchase of Fast Search & Transfer is remarkable. What caught my attention was the macho nature of some of the news stories in my Yahoo! Alert this morning. I found these revelatory:

First, the January 9, 2008, story, “Microsoft Goes for Google Jugular with Search Buy“. The story is good, and it presents what I call an MBA analysis of the deal, albeit in about 900 words. From my point of view, the key point is in the last paragraph, which quotes Fast Search & Transfer’s John Lervik (“We have simply not focused on (operational execution),” Lervik told ComputerWire last year) and concludes: “Had it focused a little more on execution, it might have gone on to become a gorilla in enterprise search. Instead, it has succumbed to acquisition by a company playing catch-up in the space.”

Second, the January 8, 2008, Information Week story by Paul McDougall. The title of this story is “Microsoft’s Fast Search Bid Puts Heat on Google, IBM”. The notion of “heat” is interesting, particularly in the behind-the-firewall market. The key point for me in this analysis is: “Microsoft plans to marry Fast’s enterprise technology with its SharePoint software — a suite of server and desktop products that give workers an interface through which they can retrieve information and collaborate with colleagues. It’s a combination that Google and IBM will have to match — and analysts say that’s sure to put Autonomy and Endeca in play.” The “heat”, I believe, refers to increased intensity among the identified vendors; that is, blue-chip companies such as Autonomy, Endeca, and IBM.

The third — and I want to keep this list manageable — is Bill Snyder’s January 8, 2008, story in InfoWorld (a publication that has been trying to make a transition from print money pit to online Web log revenue engine). This story sports the headline: “Microsoft Tries an End Run around Google“. The most interesting point in this analysis for me was this sentence: “Despite all of Microsoft’s efforts, the most recent tally of search market share by Hitwise, shows Google gaining share at Microsoft’s expense. Here are the numbers. Google’s share in December was 65.98%, up from 65.10% the previous month; while Microsoft’s share (including MSN search and Live.com) dropped to 7.04% from 7.09%.” The Web search share has zero to do with enterprise search share, but that seems to be of little significance. Indifference to the distinctions within the generalization about search is a characteristic of many discussions about information access. Search is search is search. Alas, there are many flavors of search. Fuzziness does little to advance the reader’s understanding of what’s really going on in the “behind the firewall” search sector.

The Range of Opinion

These stories provide some texture to the $1.2 billion offer Microsoft made for Fast Search & Transfer. The “jugular” story underscores the growing awareness among some pundits and journalists that Microsoft has to “kill” Google. Acquisitions of complex technology with some management challenges added to spice up the marriage require some internal housekeeping. Once the union is complete, the couple can figure out how to cut another company’s jugular. Unfortunately, the headline tells us more about a perception that Microsoft has not been able to respond to Google. So, a big play is positioned as the single action that will change the “battle” between Google and Microsoft. I believe that there is no battle. Google is Googling along with its “controlled chaos” approach to many markets, not just enterprise search. In fact, Google seems almost indifferent to the so-called show downs in which it is engaged. Google went on holiday when the FCC bids were due, and the company seems somewhat unfocused with regard to its challenges to Microsoft if I read what its executives say correctly. The choice of the word “jugular”, therefore, suggests an attack. My view is that Microsoft wanted customers, a presence, engineers, and technology. If Microsoft goes for Google’s jugular, it will take some time before the knife hits Googzilla.

The second story reveals that consolidation is underway. I can’t disagree with this implication. Endeca’s initial public offering, once scheduled for 2007, failed to materialize. My sources tell me that Endeca continues to discuss partnerships, mergers, and other relationships to leverage the company’s “guided navigation” technology. You can see it in action on the U.K. newspaper The Guardian‘s Web site. My tally of companies in the search business now has more than 350 entries. I try to keep the list up to date, but companies go out of business (Arikus, Entopia, and WiseNut, to name three I recall without consulting my online files), enter the business (ZoomItIn), reposition themselves (Brainware), hide in perpetual stealth mode (PowerSet), pretend that marketing is not necessary (Linguamatics), and otherwise swirl and eddy. But, the metaphor “heat” is acceptable. My view is that the search sector is a very complex, blurry, and evolving Petri dish of technology in the much larger “information access space”. Investors will react to the “heat” metaphor because it connotes a big payday when a deal like Microsoft’s takes place.

The third story is interesting for two reasons. First, the author has the idea that Microsoft does not want to cut through Google. I thought of a Spartan phalanx marching relentlessly toward a group of debating Athenians. Mr. Snyder or his editor uses a football analogy. The notion of Microsoft using the tactic of skirting a battle and attacking on the flank or the rear is intriguing. The issue I have is that if Google is indeed in “controlled chaos” mode, it can field disorganized, scattered groups of war fighters. As a result, there may be no battle unit to outflank or get behind. If the Googlers are consistently “chaotic”, Google’s troops may be playing one game; Microsoft, another. Second, the data quoted pertain to Web search, not Intranet search. Apples and oranges, I submit.

So what?

Six Key Moments in Search and Retrieval

The point of my looking at these three excellent examples of deal analysis is to lead up to the list below. These events are what I consider key turning points in “behind the firewall” search. The Microsoft – Fast deal, while significant as the first and, therefore, the biggest search deal of 2008, has yet to demonstrate that it will rank with these events:

  1. In-Q-Tel. Set up in 1999, this is the venture arm of the Central Intelligence Agency. With Federal funding and the best entrepreneurs in the U.S. government, In-Q-Tel seeded companies with promising information access and “beyond search” technology. Without In-Q-Tel and the support from other U.S. entities such as DARPA (Defense Advanced Research Projects Agency), the search market as we know it would not exist today. Even Google snagged some Federal money for its PageRank project.
  2. Personal Library Software. Matt Koll doesn’t get the same attention accorded to today’s wizards. His 1993 vision and products for Intranet and desktop search were revolutionary. Mr. Koll delivered desktop search and actively discussed extending the initial product to the “behind the firewall” repositories that are so plentiful today.
  3. The Thunderstone Appliance. EPI Thunderstone has been a developer of search tools such as its bullet-proof stemmer for more than a decade. The company introduced its search appliance in 2003. The company correctly anticipated the need some organizations had for a plug-and-play solution. Google was quick to roll out its own appliance not long after the Thunderstone product became available. The appliances available from Index Engines, Exegy, and Planet Technologies, among others, owe a tip of the hat to the Thunderstone engineers.
  4. IBM STAIRS and BRS. Most readers will not be familiar with IBM STAIRS, circa 1971. I do include some history in the first three editions of The Enterprise Search Report which I wrote between 2003 and 2006, and I won’t repeat that information here. But text retrieval received the IBM Big Blue seal of approval with this product. When development flagged at IBM, both in the U.S. and Germany, the Bibliographic Retrieval Service picked up the mantle. You can buy this system from Open Text even today.
  5. In 2002, Yahoo bought Inktomi. At the time, Inktomi was a solid choice for Web search. The company was a pioneer in high-speed indexing and it was one of the first companies to find ways to make use of random access memory to speed query processing. The purchase of Inktomi, however, marked the moment when Google seized control of Web search and, shortly thereafter, Web advertising. With the cash from its dominance of the Web, Google continues its spectacular, unencumbered rise. Had Yahoo done a better job of setting priorities, Google might not have had the easy path it found. Yahoo, as you may recall, made search subordinate to what I call “the America Online” approach to information access. Others talk about the Yahoo! portal or the Yahoo! start page. This deal — not Autonomy’s buying Verity or Microsoft’s purchase of Fast Search & Transfer — is the key buy out. Yahoo dropped Alta Vista (http://www.altavista.com) for Inktomi. The Inktomi deal sealed the fate of Yahoo! and the rest is Wall Street earnings history. Google has some amazing brainpower, honed at Digital – Compaq – HP. Jeffrey Dean and Sanjay Ghemawat are, I believe, graduates of the Alta Vista “Gurus’ Academy of Search.” Exalead‘s Francois Bourdoncle did a stint with Louis Monier when both were at Alta Vista. Mr. Monier is now at Google.
  6. Hewlett Packard Ignores Alta Vista. The media have defined search as Google. Wrong. Search is Google today, but it was Alta Vista. If Hewlett Packard had been able to understand what it acquired when it tallied the assets of Compaq Computer, it would have “owned” search. Mismanagement and ignorance allowed two key things to take place. First, the Alta Vista brain trust was able to migrate to other companies. For example, eBay, Exalead, and Google have been direct beneficiaries of the years and millions of Digital Equipment research dollars. HP let that asset get away. Even today, I don’t think HP knows how significant its failure to capitalize on Alta Vista was and continues to be. Second, by marginalizing Alta Vista, HP abandoned a promising early desktop search product that had pioneered a market. HP is a mass market PC vendor and printer ink company. It could have been Google. In 2003, HP sold what remained of Alta Vista to Overture, later acquired by Yahoo. The irony of this decision is that Yahoo had an opportunity to challenge Google, possibly on patent infringement allegations. Yahoo! did not. Google now controls more than 60 percent of the Web search market and is “the elephant in the room” in “behind the firewall” search. Yahoo’s link up with IBM may not be enough to keep the company in the “behind the firewall” search sector.

Observations

Today’s market has not been a smooth or steady progression. Most of the systems available today use mathematics centuries old. Most systems share more similarities than differences, despite the protestations of marketing professionals at each company. Most systems disappoint users, creating the unusual situation in large organizations where employees have to learn multiple ways to locate the information needed to do work.

The acquisition of Fast Search & Transfer is a financially significant event. It has yet to stand the test of time. Specifically, these questions must be answered by the marketplace:

  • Will companies buy SharePoint with technology developed elsewhere, built from multiple acquisitions’ engineering, and open source code?
  • Can Microsoft retain the Fast Search customer base?
  • Will customers working through some of the Fast ESP (enterprise search platform) complexities have the patience to maintain their licenses in the face of uncertainty about ESP’s future?
  • Will Microsoft engineers and Fast Search engineers mesh, successfully integrating Microsoft’s product manager approach (10,000 sailboats generally heading in one direction) with Fast Search’s blend of rigid Nordic engineering and a cadre of learning-while-you-work technologists?

My view is that I don’t know enough to answer these questions. I see an upside and a downside to the deal. I do invite readers to identify the most important turning points in search and offer their ideas about the impact of the Microsoft-Fast Search tie up.

Stephen E. Arnold
January 11, 2008, 3:21 pm Eastern

Recommind: Following the Search Imperative

January 10, 2008

I opened my Yahoo alerts this morning, January 10, 2008, and read:

Recommind Predicts 2008 Enterprise Search and eDiscovery Trends: Search Becomes the Information Foundation of the … — Centre Daily Times Wed, 09 Jan 2008 5:32 AM PST

According to the enterprise search and eDiscovery technology experts at Recommind, 2008 will be the year that enterprise search and eDiscovery converge to become top areas of focus for enterprises worldwide, creating substantial growth and evolution in the management of electronic information.

The phrase “foundation of the electronic enterprise” struck me as meaningful and well-turned. Most search experts know Recommind by name only. I profiled the company in the third edition of The Enterprise Search Report, the last one that I wrote. I support the excellent fourth edition, but I did not do any of the updating for that version of the study. I’m confining my efforts to shorter, more specialized analyses.

The company once focused on the legal market. My take on the company’s technology was that it relied on Bayesian algorithms.

The Recommind product can deliver key word search. The company has a patented algorithm that implements “probabilistic latent semantic analysis.” I will discuss latent semantic indexing in “Beyond Search”. For our purposes, Recommind’s system identifies and analyzes the distribution of concept-related words in a document. The approach uses statistical methods to predict an item’s relevance.
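
To make the idea concrete, here is a minimal sketch. Recommind’s patented probabilistic method is proprietary, so the snippet below uses plain latent semantic analysis (scikit-learn’s TruncatedSVD over a TF-IDF matrix) as a stand-in; the documents are invented and the library choice is my assumption, not Recommind’s.

```python
# A rough sketch of latent semantic analysis, the non-probabilistic cousin of
# the patented PLSA approach. Documents are invented; scikit-learn is assumed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "contract breach damages settlement",
    "settlement negotiation damages award",
    "shipping manifest cargo vessel",
    "vessel charter cargo insurance",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)        # term weights per document

svd = TruncatedSVD(n_components=2)   # two latent "concepts"
concepts = svd.fit_transform(X)      # each row: a document's concept mixture

for doc, weights in zip(docs, concepts):
    print(f"{doc!r} -> {weights.round(2)}")
```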

The Recommind implementation of these algorithms differentiate the company’s system from Autonomy’s. Autonomy, as you may know, is the high-profile proponent of “automatic” or “automated” text processing. The idea (and I am probably going to annoy the mathematicians who read this article) is that Bayesian algorithms can operate without human fiddling. The phrase “artificial intelligence” is often applied to a Bayesian system when it feeds information about processed content back into the content processing subsystem. The notion is that Bayesian systems can be implemented to adapt to the content flowing through the system. As the system processes content, the system recognizes new entities, concepts, and classifications. The phrase “set it and forget it” may be used to describe a system similar to Autonomy’s or Recommind’s. Keep in mind that each company will quickly refine my generalization. For my purposes, however, I’m not interested in the technology. I’m interested in the market orientation the news story makes clear.
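
Here is a toy sketch of that feedback loop, assuming a naive Bayes classifier and invented categories. Neither Autonomy nor Recommind discloses its implementation, so treat this as an illustration of the “set it and forget it” notion, not either vendor’s method.

```python
# Toy illustration of a Bayesian classifier that adapts as new content flows in.
# Categories and documents are invented; this is not any vendor's actual code.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

seed_docs = ["invoice payment overdue", "patent claim infringement", "invoice remittance"]
seed_labels = ["finance", "legal", "finance"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(seed_docs, seed_labels)

# As content is processed, predictions are folded back into the training set --
# the feedback loop behind "automatic" text processing.
new_doc = "overdue payment reminder"
predicted = model.predict([new_doc])[0]
seed_docs.append(new_doc)
seed_labels.append(predicted)
model.fit(seed_docs, seed_labels)    # refit on the enlarged corpus
print(predicted)
```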

Recommind is no longer a niche player in content processing. Recommind is moving into the heartland of IBM, Microsoft, and Oracle: big business, the Fortune 1000, the companies that have money and will spend it on systems that enhance the firm’s revenue or control the firm’s costs. Recommind is now an “enterprise content solutions vendor.”

Some History

Lawyers are abstemious, far better at billing their clients than spending on information technology. Recommind offered a reasonably priced solution for what’s now called “eDiscovery.”

eDiscovery means collecting a set of documents, typically those obtained through the legal discovery process, and processing them electronically. The processing can involve a number of steps, ranging from scanning, performing optical character recognition, and generating indexable files to performing relatively simple file transformation tasks. A simple transformation task is to take an email message, segment it, save the message body, and then save any attachments, such as a PowerPoint presentation. Once a body of content obtained through the legal discovery process is available, that content is indexed.
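
The email segmentation task is simple enough to sketch with Python’s standard library. The file name below is hypothetical, and a production eDiscovery system would add logging, hashing, and chain-of-custody controls.

```python
# Minimal sketch of the email transformation step described above: split a
# message into its body and its attachments so each piece can be indexed.
# The file name "exhibit_042.eml" is hypothetical.
import email
from email import policy
from pathlib import Path

msg = email.message_from_bytes(Path("exhibit_042.eml").read_bytes(),
                               policy=policy.default)

# Save the message body as its own indexable file.
body = msg.get_body(preferencelist=("plain", "html"))
Path("exhibit_042_body.txt").write_text(body.get_content() if body else "")

# Save each attachment (a PowerPoint deck, a spreadsheet, and so on) separately.
for i, part in enumerate(msg.iter_attachments()):
    name = part.get_filename() or f"attachment_{i}"
    Path(f"exhibit_042_{name}").write_bytes(part.get_payload(decode=True))
```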

Legal discovery means, and I am simplifying in this explanation, that each side in a legal matter must provide information to the opposing side. In complex matters, more than two law firms and usually more than two attorneys will be working on the matter. In the pre-digital age, discovery involved keeping track of each discovered item manually, affixing an agreed-upon identification number to the item, and making photocopies. The photocopies were — and still are in many legal proceedings — punched and placed in binders. The binders, even for relatively modest legal actions, can proliferate like gerbils. In a major legal action, the physical information can run to hundreds of thousands of documents.

eDiscovery, therefore, is the umbrella term for converting the text to electronic form, indexing it, and making that index available for those authorized to find and read those documents.

The key point about discovery is that it is not key word search. Discovery means that the system finds the important information in a document or collection of documents and makes that finding evident to a user. No key word query is needed. The user can read an email alert, click on a hot link that says, “The important information is here”, or view a visual representation of what’s in a mass of content. Remember: discovery means no key word query and no reading of the document to find out what’s in it. Discovery is the most recent Holy Grail in information retrieval, despite its long history in specialized applications like military intelligence.
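
A crude way to see the difference from key word search is to have the system volunteer each document’s most distinctive terms instead of waiting for a query. The sketch below uses TF-IDF over an invented corpus; real discovery systems layer entity extraction, classification, and alerting on top of far more sophisticated statistics.

```python
# Crude stand-in for "discovery": surface each document's most distinctive
# terms without any user query. Corpus is invented; no vendor's method is shown.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = {
    "memo_17.txt": "board approved the undisclosed side agreement with the supplier",
    "memo_18.txt": "routine quarterly shipping schedule and staffing rota",
}

vec = TfidfVectorizer(stop_words="english")
matrix = vec.fit_transform(corpus.values())
terms = vec.get_feature_names_out()

for name, row in zip(corpus, matrix.toarray()):
    top = sorted(zip(row, terms), reverse=True)[:3]
    print(name, "->", [term for score, term in top if score > 0])
```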

Recommind found success in the eDiscovery market. The product was reasonably priced, particularly when compared to brand-name, high-profile systems such as those available from Autonomy, Endeca, Fast Search & Transfer, iPhrase (now a unit of IBM), and Stratify. Instead of six figures, think in terms of $30,000 to $50,000. For certain law firms, spending $50,000 to manipulate discovered materials electronically was preferable to spending $330,000.

The problem with the legal market is that litigation and legal matters come and go. For a vendor of eDiscovery tools, marketing costs chew away at margins. Only a small percentage of law firms maintain a capability to process case-related materials in a single system. The pattern is to gear up for a specific legal matter, process the content, and then turn off the system when the matter closes. Legal content related to a specific case is encumbered by necessary controls about access, handling of the information once the matter is resolved, and specific actions that must be taken with regard to the information obtained in eDiscovery; for example, expert witnesses must return or destroy certain information at the close of a matter.

The point is that eDiscovery systems are designed to make it possible for a law firm to comply with the stipulations placed on information obtained in the discovery process.

Approaches to eDiscovery

Stratify, now a unit of Iron Mountain, is one of the leaders in eDiscovery. Once called Purple Yogi and the darling of the American intelligence community, Stratify has narrowed its marketing to eDiscovery. The Stratify system performs automated content processing along with key word indexing of documents gathered via legal discovery. The system has been tuned for legal applications. Licensees receive knowledge bases with legal terms, a taxonomy, and an editorial interface so the licensing firm can add, delete, or modify the knowledge bases. Stratify is priced in a way that is similar to the approach taken by the Big Three (Autonomy, Endeca, and Fast Search & Transfer) in search; that is, fees in the hundreds of thousands of dollars are more common than $50,000 fees. Larger license fees are needed, first, because marketing costs are high and the search vendors have to generate enough revenue to avoid plunging into financial shortfalls. Second, the higher fees make sense to large, cash-rich organizations. Many companies want to pay more in order to get better service or the “best available” solution. Third, other factors may be operating, such as the advice of a consultant or the recommendation of a law firm also working on the matter.
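
The knowledge-base approach can be sketched in a few lines: terms from an editable taxonomy are matched against each document to assign categories. The taxonomy entries below are invented, not Stratify’s, and real systems use stemming, phrase matching, and statistical evidence rather than exact word overlap.

```python
# Rough sketch of knowledge-base-driven tagging: terms from an editable
# taxonomy are matched against a document to assign categories.
# The taxonomy entries are invented for illustration.
TAXONOMY = {
    "contracts":  {"indemnify", "warranty", "breach"},
    "employment": {"termination", "severance", "non-compete"},
}

def tag(document: str) -> list[str]:
    words = set(document.lower().split())
    return [category for category, terms in TAXONOMY.items() if words & terms]

print(tag("Severance terms survive termination of the agreement"))
# ['employment']
```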

eDiscovery can also be performed using generalized and often lower-cost products. In the forthcoming “Beyond Search: What to Do When Your Search System Doesn’t Work”, I profile a number of companies offering software systems that can make discovered matter searchable. For most of these firms, the legal market is a sideline. Selling software to law firms requires specialized knowledge of legal proceedings, a sales person familiar with how law firms work, and marketing that reaches attorneys in a way that makes them comfortable. The legal market is a niche; although anyone can buy the names of lawyers from various sources, lawyers are not an easy market to penetrate.

Recommind, therefore, has shifted its marketing from the legal niche to the broader, more general market for Intranet search or what I call “behind the firewall” search. The term “enterprise search” is devalued, and I want to steer clear of giving you the impression that a single search system can serve the many information access needs of a growing organization. More importantly, there’s a belief that “one size fits all” in search. That is a misconception. The reality is that an organization will have a need for many different types of information access systems. At some point in the future, there may be a single point solution, but for the foreseeable future, organizations will need separate, usually compartmentalized systems to avoid personnel, legal, and intellectual property problems. I will write more about this in “Beyond Search” and in this Web log.

Trajectory of Recommind

Recommind’s market trajectory is important. The company’s shift from a niche to a broader market segment illustrates how content processing companies must adapt to the revenue realities of selling search solutions. Recommind has moved into a market sector where a general purpose solution at a competitive price point should be easier to sell. Instead of a specialized sales person for the niche market, a sales person with more generalized experience can be hired. The pool of law firm prospects is small and has become saturated. The broader enterprise market consists of the Fortune 1000 and upwards of 15 million small- and mid-sized businesses. Most of these need and want a “better” search solution. Recommind’s expansion of its marketing into this broader arena makes sense, and it illustrates what many niche vendors do to increase their revenues.

Here’s the formula and a diagram to illustrate this marketing shift. Click on the thumbnail to view the illustration:

  • Increase the number of prospects for a search system by moving to a larger market. Examples: from lawyers to general business; from the intelligence community in Washington, DC, to business intelligence in companies; from pharmaceutical text mining to general business text mining.
  • Simplify the installation, minimizing the need for specialized knowledge bases, tuning, and time-consuming set up. Example: offer a plug-and-play solution, emphasize speedy deployment, provide a default configuration that delivers advanced features without manual set up and time-consuming “training” of the system.
  • Maintain a competitive price point because the “vendor will make it up on volume”. With more customers and shorter buying cycles, the vendor will have increased chances to land a large account that generates substantial fees when customization or special functionality are required.
  • Boost the return on investment for research, development, sales, marketing, and customer support. The business school logic is inescapable to many search vendors. Whether these MBA (master of business administration) assumptions prove false is not my concern at this point. Search vendors can’t make their revenue goals in small niches and remain profitable, grow, and fund R&D. The search vendors have to find a way to grow and expand margins quickly. The broader business market is the solution that most content processing companies implement.

Search market shift

Implications of Market Shifts

Based on my research, several implications of moving upmarket, offering general purpose solutions, and expanding service options receive scant attention in the trade and business press. Keep in mind that my data and experience are unique; your view may be different, and I welcome your viewpoints. Here is what I have learned:

First, smaller, specialized vendors have to move from a niche to a broader market. Examples range from the aforementioned Stratify, which moved from the U.S. intelligence niche to the broader business market, only to narrow its focus there to handling special document collections. Iron Mountain saw value in this positioning and acquired Stratify. Vivisimo, which originally offered on-the-fly clustering, has repositioned itself as a vendor of “behind the firewall” search. The company’s core technology remains intact, but the firm has added functionality as it moves from a narrow “utility” vendor to a broader, “behind the firewall” vendor. Exegy, a vendor of special purpose, high-throughput processing technology, has moved from intelligence to financial services. This list can be expanded, but the point is clear. Search vendors have to move into broader markets in order to have a chance at making enough sales to generate the return investors demand. Stated another way, content processing vendors must find a way to expand their customer base or die.

Second, larger vendors — for example, the Autonomys, Endecas, and their ilk — must offer more and more services in an effort to penetrate more segments of the broader search market. Autonomy, in a sense, had to become a platform. Autonomy had to acquire Verity to get more upsell opportunities and more customers quickly. And the company had to diversify from search into other, adjacent information access and management services such as email management with its acquisition of Zantaz. The imperative to move into more markets and grow via acquisition is driving some of the industry consolidation now underway.

Third, established enterprise software vendors must move downmarket. IBM, Microsoft, and Oracle have to offer more information management, access, and processing services. A failure to take this step means that the smaller, more innovative companies moving from niches into broader business markets will challenge these firms’ grip on enterprise customers. Microsoft, therefore, had to counter the direct threat posed by Coveo, Exalead, ISYS, and Mondosoft (now SurfRay), among others.

Fourth, specialized vendors of text mining or business intelligence tools will find themselves subject to some gravitational forces. Inxight, the text analysis spin out of Xerox Palo Alto Research Center, was purchased by Business Objects. Business Objects was then acquired by SAP. After years of inattention, companies ranging from Siderean Software (a semantic systems vendor with assisted navigation and dashboard functionality) to MarkLogic (an XML-on-steroids and data management vendor) will be sucked into new opportunities. Executives at both firms suggested to me that their products and services were of interest to superplatforms, search system vendors, and Fortune 1000 companies. I expect that both these companies will themselves be discovered as organizations look for “beyond search” solutions that work, mesh with existing systems, and eliminate, or at least significantly reduce, the headaches associated with traditional information retrieval solutions.

I am reluctant to speculate on the competitive shifts that these market tectonics will bring in 2008. I am confident that the outlook for certain content processing companies is very bright indeed.

Back to Recommind

Recommind, therefore, is a good example of how a niche vendor of eDiscovery solutions can and must move into broader markets. Recommind is important not because it offers a low-cost implementation of Bayesian algorithms akin to those in the Autonomy system. Recommind warrants observation because it makes certain market imperatives in the search sector visible. What the diagram depicts, albeit somewhat awkwardly, is that each segment of the information retrieval market is in motion. Niche players must move upmarket and outwards. Superplatforms must move downmarket and into niches. Business intelligence system vendors must move into mainstream applications.

Exogenous Forces

The diagram omits two important exogenous forces. I will comment on these in another Web log article. For now, let me identify these two “storm systems” and offer several observations about search and content processing.

The first force is Lucene. This is the open source search solution that is poking its nose under a number of tents. IBM, for example, uses Lucene in some of its search offerings. A start up in Hungary called Tesuji offers Lucene plus engineering support services. Large information companies like Reed Elsevier continue to experiment with Lucene in an effort to shake free of burdensome licensing fees and restrictions imposed by established vendors. Lucene is not likely to go away, and with a total cost of ownership that starts at zero in licensing fees, some organizations will find the system worth further investigation. More importantly, Lucene has been one of the factors turbocharging the “free search software” movement. The only way to counter certain chess moves is a symmetric action. Lucene, not Google or other vendors, is the motive force behind the proliferation of “free” search.

The second force is cloud computing. Google is often identified as the prime mover. It’s not. The notion of hosted search is an environmental factor. Granted, cloud based information retrieval solutions remain off the radar for most information technology professionals. Recall, however, that the roots of hosted search lie in the commercial database industry. LexisNexis, Dialog, and Ebscohost are, in fact, hosted solutions for specialized content. Blossom Software, Exalead, Fast Search & Transfer, and other content processing vendors offer off-premises or hosted solutions. The economics of information retrieval translate to steadily increasing interest in cloud based solutions. And when the time is right, Amazon, Google, Microsoft, and others will be offering hosted content processing solutions. In part it will be a response to what Dave Girouard, a Google executive, calls the “crisis in IT”. In part, it will be a response to economics. Few — very, very few — professionals understand the total cost of information retrieval. When the “number” becomes known, a market shift from on premises to cloud-based solutions will take place, probably with some velocity.

Wrap Up

Several observations are warranted:

First, Recommind is an interesting company to watch. It is a microcosm of broader industry trends. The company’s management has understood the survival imperative and implemented the response that today’s market makes obvious: expand or stagnate.

Second, tectonic forces are at work that will reshape the information retrieval, content processing, and search market as it exists today. It’s not just consolidation; search and its cousins will become part of a larger data management fabric.

Third, there’s a great deal of money to be made as these forces grind through the more than 200 companies offering content processing solutions. Innovation, therefore, will continue to bubble up from U.S. research computing programs and from outside the U.S. Tesuji in Hungary is just one example of dozens of innovative approaches to content processing.

Fourth, the larger battle is not yet underway. Many analysts see hand-to-hand combat between Google and Microsoft. I don’t. I think that for the next 18 to 24 months, battles will rage within niches, among established search vendors, and among the established enterprise software vendors. Google is a study in “controlled chaos”. With this approach, Google is not likely to mount any single, direct attack on anything until the “controlled chaos” yields the data Google needs before deciding on a specific course of action.

Search is dead. At least the key word variety. Content processing is alive and well. The future is broader: data management and data spaces. As we rush forward, opportunities abound for licensees, programmers, entrepreneurs, and vendors. We are living in a transition from the Dark Ages of key word search to a more robust, more useful approach.

Stephen E. Arnold
10 January 2008

Little-Known Search Engines

January 9, 2008

Here’s a rundown of little-known engines with links to their Web sites.

As I worked to complete “Beyond Search: What to Do When Your Search Engine Doesn’t Work,” I reviewed my list of companies offering search technology. I could not remember much about several of them.

Here’s what I found when I checked to see what angle each of these companies takes, or in some cases took, toward search and retrieval.

  • Aftervote — A metasearch engine with a “vote up” or “vote down” button for results.
  • AskMeNow — A mobile search service that wanted my cell number. I didn’t test it. The splash page says AskMeNow.com is a “smart service”.
  • C-Search Solutions — A search system for “your IBM Domino domain.” The company offers a connector to hook the Google Search Appliance to Domino content.
  • Ceryle — A data management system that generates topics and associations.
  • Craky.com — The site had gone dark when I tested it on January 8, 2008. It was a “search engine for impatient boomers”.
  • Dumbfind — An amazing name. Dumbfind describes itself as a “user generated content site”; a social search system, I believe.
  • Exorbyte — A German high-performance search system. Lists eBay, Yahoo, and the ailing Convera as customers.
  • Eyealike — A visual search engine. The splash page says “you can search for your dream date.” Alas, not me. Too old.
  • Ezilon — not Ezillion which is an auction site. A Web directory and search engine.
  • Idée Inc. — The company develops advanced image recognition and visual search software. Piximilar is the company’s image search system.
  • Kosmix — An “intelligent search engine”. The system appears to mimic some of the functions of Google’s universal search system.
  • Linguistic Agents — The company’s search technology bridges “language and technology”.
  • Paglo Inc. — This is a “search engine for information technology” on an Intranet. The system discovers “everything on your network”.
  • Q Phrase — The company offers “discovery tools”.
  • Semantra — The system allows you to have “an intelligent conversation with your enterprise databases.”
  • Sphinx — Sphinx is a full text search engine for database content.
  • Surf Canyon — In beta. The system shows related information when you hover over a hit in a results list.
  • Syngence — A content analytics company, Syngence focuses on “e-discovery”.
  • Viziant — The company is “a pioneer in delivering tools for discovery.”
  • Xerox Fact Spotter — Text mining tools developed at Xerox “surpass search”. The description of the system seems similar to the Inxight system that’s now part of Business Objects which is now owned by SAP.

Several observations are warranted. First, I am having a difficult time keeping up with many of these companies’ systems. Second, text mining and other rich text processing solutions are notable. Semantics, linguistics, and other techniques to squeeze meaning from information are hard-to-miss trends. The implication is that key word search is slipping out of the spotlight. Finally, investors are putting up cash to fund a very wide range of search-and-retrieval operations. Even though consolidation is underway in the search sector, there’s a steady flow of new and often hard-to-pronounce vendors chasing revenue.

Stephen E. Arnold
9 January 2008, 11:00am

Thoughts on Microsoft Buying Fast Search & Transfer

January 8, 2008

To start the New Year, Microsoft bought Fast Search & Transfer for about $1.2 billion, a premium over Fast’s share price before the stock was delisted from the Oslo exchange on January 7.

I’ve tracked Fast for more than seven years, including a stint performing an independent verification and validation of the firm’s technology for the U.S. Federal government.

Most of the coverage of the acquisition focuses on the general view that Microsoft will integrate Fast’s search technology into SharePoint. With upwards of 65 million installations of SharePoint, Microsoft’s content management and search platform, Fast’s technology looks like a slam dunk for Microsoft.

I want to look at three aspects of this deal that may be sidelights to the general news coverage. The thread running through news stories appearing early on January 8, 2008, hit three points. First, Microsoft gets enterprise search technology that can add some muscle to the present search technology available in Microsoft Office SharePoint Server (MOSS). Second, shareholders get a big payday, including the institutional shareholders hit hard by Fast’s unpredictable financial results and flatlined share price. Third, synergies in research, technology, and customers make the deal a win for Microsoft and Fast.

Now, let’s look at the sidelights. I think that one or two of these issues will become more important if the deal closes in the second quarter of 2008 and the Fast technology is embraced by Microsoft’s various product groups. None of these issues is intended to be positive or negative. My goal is to discuss “behind the firewall search” or what the trade press calls “enterprise search”. This is distinct from Web search, which indexes content on publicly accessible Web servers in most cases. The “behind the firewall” type of search indexes content on a company’s own servers and its employees’ computers. The idea is that “behind the firewall search” tackles the wide range of information and file types found in an organization. To illustrate: an organization must index standard file types like Word documents and Adobe Portable Document Format files. But the system must be able to handle information stored in enterprise applications built on SAP technology or with IBM’s technology. There’s another twist to “behind the firewall search”. That’s security. Certain information cannot be available to anyone but a select and carefully vetted group of users. One example is employee salary information. Another is research data for a new product. Finally, “behind the firewall search” has to be able to generate useful results when there aren’t indicators like the number of times a document is clicked on or viewed. As you may know, Google’s Web search system uses these cues to determine relevancy. In an organization, a very important piece of information may have zero or very low accesses. In a patent matter, a “behind the firewall search” system must be able to pinpoint that particular piece of information because it may be the difference between a successful legal resolution and a costly misstep.
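
The security requirement is worth a concrete illustration. In a “behind the firewall” system, every hit is checked against the document’s access controls before it reaches the results list, a step Web search engines never have to take. The sketch below is my own simplification, with invented users, groups, and documents.

```python
# Sketch of security trimming in a "behind the firewall" system: every hit is
# checked against the document's access list before the user sees it.
# Users, groups, and documents are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class Doc:
    doc_id: str
    text: str
    allowed_groups: set = field(default_factory=set)

INDEX = [
    Doc("salaries-2008", "employee salary bands", {"hr"}),
    Doc("widget-roadmap", "new product research data", {"rd", "executives"}),
    Doc("cafeteria-menu", "weekly cafeteria menu", {"all"}),
]

def search(query: str, user_groups: set) -> list[str]:
    hits = [d for d in INDEX if query in d.text]
    # Trim any hit the user is not entitled to see.
    return [d.doc_id for d in hits
            if d.allowed_groups & (user_groups | {"all"})]

print(search("salary", {"engineering"}))   # [] -- no entitlement
print(search("salary", {"hr"}))            # ['salaries-2008']
```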

Web Roots

Fast Search & Transfer’s technology has deep roots in Web indexing. Fast pulled out of Web indexing for the most part in 2003, when it sold its Web search division to Overture, subsequently acquired by Yahoo. With its focus on enterprise search, Fast’s engineers crafted enterprise functions on the high-speed, Linux-based indexing system that powered AlltheWeb.com. Fast’s Web roots have been wrapped in three types of extensions. First, Fast wrote new code to make integration with other enterprise systems easier. Second, Fast used some open source software as a way to perform certain tasks such as data management. Third, Fast acquired technology, such as the 2004 purchase of the NextPage Publishing Applications business unit and a number of other properties, including the Convera RetrievalWare business. Convera was a “behind the firewall” search vendor that had fallen into the quagmire that sucks cash in an attempt to make search systems work the way licensees want. The point is that today’s Fast search system is complicated. There are quite a few subsystems “glued” to other components. It is the nature of information work to make today’s solution a smaller piece of what customers want tomorrow. Over time, “behind the firewall search” systems become hugely complex. The figure below, taken from a 2005 Fast Search presentation once available via the Google cache, provides a good indication of what makes a Fast system tick. Click on the thumbnail to view it at normal size:

Fast infrastructure

Staffing

Fast Search has some outstanding engineers. Not only is John Lervik (CEO) a Google-caliber technologist, but Bjorn Laukli (at one time chief technical officer) is a search wizard. Fast Search’s management team has also turned to sales and marketing professionals. One of these individuals — Ali Riaz, now CEO of Attivio, Inc. — burnished the Fast Search image and fueled sales. In the wake of Mr. Riaz’s departure, Fast Search had to trim some costs. More than 140 employees were terminated; at the same time, in 2006 and 2007, Fast Search expanded its technical hiring. The company handled the shift from pure technology in the pre-Riaz era to a sales-driven organization when Mr. Riaz was at the helm from 2000 to 2006, and then back to a more engineering-focused firm in the post-Riaz era. Not surprisingly, institutional investor pressure increased. The Fast Search Board of Directors looked for ways to get the company on an equal revenue and earnings footing with arch-rival Autonomy plc. Arguably, Autonomy’s acquisitions (Verity in search and Zantaz in email compliance services) have been more beneficial to Autonomy’s revenue growth than Fast Search’s acquisitions, such as Platefood in advertising and Agent Arts, a content recommending system. In short, there has been some contention between sales and engineering, between institutional investors and the board of directors, and between the board of directors and senior management. Joseph Krivickas’ joining the firm as President and Chief Operating Officer in July 2007 marked a turning point for Fast Search, culminating in the Microsoft deal.

Customers

My Washington, DC affiliate (BurkeHarrod LLC) involved me in a study of satisfaction with “behind the firewall search” systems in the last half of 2007. The data revealed that in our sample of US scientists and engineers, 62 percent of the respondents to the statistically valid survey were dissatisfied with their existing “behind the firewall search” systems. My examination of the publicly available customers of Autonomy, Endeca, and Fast Search revealed an overlap of about 50 percent among Fortune 1000 firms. The significant overlap is not surprising because large organizations have units with different search requirements. Incumbent systems are not eliminated, creating a situation where the typical large organization has five or more “behind the firewall search” systems up and running. Autonomy’s acquisition of Verity and Fast Search’s acquisition of the Convera RetrievalWare business were about customers. Granted, each acquired company brought new technical capabilities to its buyer. The real asset was the customer base. I learned when researching the first three editions of The Enterprise Search Report that customers are usually in “search procurement mode”. No single system is right for the information access requirements of a large organization.

New Direction?

One final issue warrants a brief comment. In the last five years, there has been a shift in information access methods. In the early 2000s, key word search was the basic way to find information in an organization. Today users want their information retrieval systems to suggest where to look, to offer point-and-click interfaces somewhat similar to Yahoo’s so a user can see at a glance what’s available, and to make it easy to pinpoint the types of information needed to perform routine work tasks. Key word search systems have to bulk up with additional technology to deliver these types of information retrieval functions. The challenge, not surprisingly, is cost. With ever cheaper processors and storage, performing additional indexing and content processing tasks seems trivial. In practice, rich text processing or metatagging adds complexity to already sophisticated systems. The market wants features that can be expensive and problematic to implement. Perhaps this is why investors are keen to fund next-generation search systems that go beyond key word search into linguistic, semantic, and intelligent systems. For the company that can deliver the right mix of functionality at the right price, a financial windfall awaits. In the meantime, there is the general dissatisfaction and churn that is evident in the present consolidation in the search sector.
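
The “see at a glance what’s available” interface usually boils down to facet counts computed over the result set. A minimal sketch, with invented documents and facet fields:

```python
# Minimal sketch of point-and-click navigation: facet counts let a user see at
# a glance what a result set contains. Documents and fields are invented.
from collections import Counter

results = [
    {"title": "Q3 sales forecast", "dept": "finance", "type": "spreadsheet"},
    {"title": "Pricing memo",      "dept": "finance", "type": "memo"},
    {"title": "Widget test plan",  "dept": "engineering", "type": "memo"},
]

for facet in ("dept", "type"):
    counts = Counter(doc[facet] for doc in results)
    print(facet, dict(counts))
# dept {'finance': 2, 'engineering': 1}
# type {'spreadsheet': 1, 'memo': 2}
```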

Beyond Search Net-Net

These sidelights may be outside the mainstream of those tracking the information access industry. My view may be summarized in four observations:

  • First, Microsoft SharePoint is complex. The Fast Search enterprise search platform (ESP) is complex. Integrating two complex systems will be a challenge. Microsoft’s engineers and Fast Search’s engineers are up to this task. The question will be “How long will the meshing take?” If speedy, Microsoft can expand its service offering and put another hurdle in the path of companies like Google eager to win more of the Microsoft market. If slow, the delay will allow further incursions into Microsoft territory by Google as well as IBM, Oracle, and SAP, among others.
  • Second, customers may be wary of escalating risk. Just as Autonomy had to reassure Verity search system users after that buy out, Microsoft will have to keep Fast Search’s more than 2,000 customers in the fold. The loss of some key accounts as a result of the deal will consume additional sales and marketing resources, thus adding to the cost of the acquisition. Companies like Autonomy and Endeca will be quick to make an attempt to win some of Fast Search’s more lucrative accounts such as its deal with Reed Elsevier for the SCIRUS service. Upstarts like Exalead, ISYS Search Software, Siderean, and others will also seek to provide a seamless replacement for the Fast Search solution. Other customers will be content to use an existing Fast Search system, worrying about changes when they occur. The search sector is about to get much more interesting and fast, pun intended.
  • Third, investors react to the news of $1.2 billion changing hands in predictable ways. I look for more interest in companies in the search sector. I can also envision the acquisition of Autonomy by a larger firm. In fact, looking forward 12 months, I see a series of shifts in the search landscape. There will be more search interest from superplatforms such as Google, IBM, Oracle, and other enterprise software vendors. These large firms will want to expand their share of the Fortune 1000 market and capture an increasing share of the small- and mid-sized business market. Upstarts ranging from Paris-based Exalead to the almost-unknown Tesuji in Hungary will also benefit. My list of “behind the firewall search” vendors numbers more than 50 companies, excluding firms that offer specialized “snap ins” for content processing.
  • Lastly, I think further consolidation in search will take place in 2008 and 2009. In the midst of these buy outs, customers will vote with their dollars to create some new winners in “behind the firewall” search. I will offer some thoughts on these in a future write up.

2008: Best or Worst of Times?

January 8, 2008

For search system vendors, it’s a Dickensian “best of times, worst of times” business climate. Some companies like Autonomy have diversified, acquired competitors, and marketed effectively. Others have rushed from crisis to crisis, smothering bad news with excuses.

There are some up-and-comers in the “behind the firewall” search market. Companies to watch include the surging ISYS Search Software. It’s reliable. It’s speedy. And it sports some “must have” bells and whistles, including entity extraction and on-the-fly classification. Also worth watching is the semantic technology vendor Siderean Software. For companies wanting assisted navigation and the slicing and dicing that semantic metatags permit, Siderean’s system is worth a long, hard look. There are dozens of others making customers happy and reducing the hassles associated with finding information in an Intranet.

There are some companies struggling to keep their revenue in growth mode while leaping over rivulets of red ink. In 2007, Mondosoft, a Danish search system vendor, floundered. It’s now part of the burgeoning SurfRay technology holdings. Entopia (Belmont, California) died quietly. Fast Search & Transfer survived some financial challenges and then, with little warning, withdrew from the Norwegian stock exchange. Is this a positive signal or a more ominous one?

The point is that many traditional search-and-retrieval vendors look one way and see the success of a Google. Looking in another direction, they see warning signs that the “behind the firewall” sector is ripe for consolidation or an increasingly stringent shake out.

The best strategy for 2008 is to look for companies that can deliver a solution that works without a huge balloon payment for technical support and customization. A second tip is to look outside the US. ISYS has its technical roots in Australia. Exalead is a Paris-based company. Little-known Bitext operates from Madrid.

Procurement teams have a tendency to use what’s available. IBM, Microsoft, and Oracle offer search systems, often as a bonus when another enterprise product is licensed. Lucene beckons because some believe it’s free, as long as the licensee has open-source-savvy engineers on tap. Many enterprise systems, such as content management systems, include a search-and-retrieval component. When budgets are tight, the CFO asks, “Why pay again?”

My recommendation is to look at the up-and-comers in “behind the firewall” search. The brand names are safe, but you might be able to save money, time, and technical headaches by widening your horizons.

Google 2008 Publishing Output

January 1, 2008

If you had any doubt about Google’s publishing activities, check out “Google Blogging in 2008”. The article by Susan Straccia provides a rundown of the GOOG’s self-publishing output. Google has more than 120 Web logs. The article crows about the number of unique visitors and tosses in some Googley references to Google fun. Pushing the baloney aside, the message is clear: Google has an effective, global publishing operation focused exclusively on promoting Google. Toss in the Google Channel on YouTube.com, and the GOOG has a communication, promotion, and distribution mechanism that few of its rivals can match. In my opinion, not even a major TV network in the US can reach as many eyeballs as quickly and cheaply as Googzilla. Competitors have to find a way to match this promotional Boeing M230 30mm automatic chain gun.

Stephen Arnold, January 1, 2008
