Search: The Problem with Words and Their Misuse

January 30, 2008

I rely on several different types of alerts, including Yahoo’s service, to keep pace with developments in what I call “behind the firewall search”.

Today was particularly frustrating because the number of matches for the word “search” has been increasing, especially since the Microsoft – Fast Search & Transfer acquisition and the Endeca cash injection from Intel and SAP. My alerts contain a large number of hits, and I realized that most of these are not about “behind the firewall” search, nor are they chock full of substantive information. Alerts are a necessary evil, but over the years I have found that the primitive key word indexing offered by free services doesn’t help me.

The problem is the word search and its use or misuse. If you know of better examples to illustrate these types of search, please, post them. I’m interested in learning about sites and their search technology.

I have a so-so understanding of how language drift, ambiguity, and POM (plain old marketing) work. For someone looking for information about search, the job is not getting easier. In fact, search has become such a devalued term that locating information about a particular type of search requires some effort. I’ve just finished compiling the Glossary for “Beyond Search”, due out in April 2008 from the Gilbane Group, a high-caliber outfit in the Boston, Massachusetts area. So, terminology is at the top of my mind this morning.

Let’s look at a few terms. These are not in alphabetical order. The order is by their annoyance factor. The head of the list contains the terms most annoying to me. The foot of the list contains terms that are less offensive to me. You may not agree. That’s okay.

Vertical search. Number one for 2008. Last year it was in second place. This term means that a particular topic or corpus has been indexed. The user of a vertical search engine like Sidestep.com sees only hits in the travel area. As Web search engines have done a better and better job of indexing horizontal content — that is, on almost every topic — vertical search engines narrow their focus. Think deep and narrow, not wide and shallow. As I have said elsewhere, vertical search is today’s 20-somethings rediscovering how commercial databases handled information in the late 1970s with success then but considerably less success today.

Search engine marketing. This is last year’s number one. Google and other Web engines are taking steps to make it harder to get junk sites to the top of a laundry list of results. The phrase “search engine marketing” is the buzzword for the entire industry of getting a site on the first page of Google results. The need to “rank high” has made some people “search gurus”. I must admit I don’t think too much of SEM, as it is called. I do a reasonable job of explaining SEM in terms of Google’s Webmaster guidelines. I believe that solid content is enough. If you match that with clean code, Web indexing bots will index the information. Today’s Web search systems do a good job of indexing, and there are value-added services such as Clusty.com that add metadata, whether the metadata exists on the indexed sites or not. When I see the term search used to mean SEM, I’m annoyed. Figuring out how to fool Google, Microsoft Live.com, or Yahoo’s indexing systems is not something that is of much interest to me. Much of the SEM experts’ guidance amounts to repeating Google’s Webmaster guidelines and fiddling with page elements until a site moves up in the rankings. Most sites lack substantive content and deserve to be at the bottom of the results list. Why do I want to have in my first page of results a bunch of links to sites without heft? I want links to pages significant enough to get to the top of the results list because of solid information, not SEM voodoo. For basics, check out “How Stuff Works.”

Guided, faceted, assisted, and discovery search. The idea these terms struggle to express is a system that provides point-and-click access to related information. I’ve heard a variation on these concepts expressed as drill-down search or exploratory search. These are 21st-century buzzwords for “Use For” and “See Also” references. But by the time a vendor gets done explaining taxonomies, ontologies, and controlled term lists, the notion of search is mired in confusion. Don’t get me wrong. Rich metadata and exposed links to meaningful “See Also” and “Use For” information are important. I’m just burned out with companies using these terms when their technology can’t deliver.

Enterprise search. I do not know what “enterprise search” is. I do know that there are organizations of all types. Some are government agencies. Some are non-profit organizations. Some are publicly-traded companies. Some are privately held companies. Some are professional services corporations. Some are limited liability corporations. Each has a need to locate electronic information. There is no one-size-fits-all content processing and retrieval system. I prefer the phrase “behind the firewall search.” It may not be perfect, but it makes clear that the system must function in a specific type of setting. Enterprise search has been overused, and it is now too fuzzy to be useful from my point of view. A related annoyance is the word “all”. Some vendors say they can index “all the organization’s information.” Baloney. Effective “behind the firewall” systems deliver information needed to answer questions, not run afoul of federal regulations regarding health care information, incite dissatisfaction by exposing employee salaries, or let out vital company secrets that should be kept under wraps.

Natural language search. This term means that the user can type a question into a system. A favorite query is, “What are the car dealerships in Palo Alto?” You can run this query on Google or Ask.com. The system takes this “natural language question”, converts it to Boolean, and displays the results. Some systems don’t do anything more than display a cached answer to a frequently asked question. The fact is that most users–exceptions include lawyers and expert intelligence operatives–don’t do “natural language queries”. Most users type some words like weather 40202 and hit the Enter key. NLP sounds great and is often used in the same sentence with latent semantic indexing, semantic search, and linguistic technology. These are useful technologies, but most users type their 2.3 words and take the first hit on the results list.
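
If you are curious what that question-to-Boolean step can look like, here is a minimal sketch in Python. The stop word list and the AND-joining rule are my own simplifications for illustration, not any particular vendor’s method.

    import re

    # Words that carry little retrieval value in a question (illustrative list only).
    STOP_WORDS = {"what", "are", "is", "the", "in", "of", "a", "an", "to", "for"}

    def question_to_boolean(question: str) -> str:
        """Convert a natural language question into a simple Boolean AND query."""
        tokens = re.findall(r"[a-z0-9]+", question.lower())
        terms = [t for t in tokens if t not in STOP_WORDS]
        # Join the surviving content words with AND to narrow the result set.
        return " AND ".join(terms)

    print(question_to_boolean("What are the car dealerships in Palo Alto?"))
    # -> car AND dealerships AND palo AND alto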

Semantic search. See natural language search. Semantic technologies are important and finally practical in everyday business operations. Used inside search systems, today’s fast processors and cheap storage make it possible to figure out some nuances in content and convert those nuances to metatags. It’s easier for vendors to bandy about the terms semantic and Semantic Web than to explain what they deliver in terms of precision and recall. There are serious semantic-centric vendors, and there are a great many who use the phrase because it helps make sales. An important vendor of semantic technology is Siderean Software. I profile others in “Beyond Search”.

Value-added search. This is a coinage that means roughly, “When our search system processes content, we find and index more stuff.” “Stuff”, obviously, is a technical word that can mean the file type or concepts and entities. A value-added search system tries to tag concepts and entities automatically. Humans used to do indexing but there is too much data and not enough skilled indexers. So, value-added search means “indexing like a human used to do.” Once a result set has been generated, value-added search systems will display related information; that is, “See Also” references. An example is Internet the Best. Judge for yourself if the technique is useful.
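
To make “tagging concepts and entities automatically” concrete, here is a toy sketch of one common approach: matching text against a small gazetteer (a dictionary of known names). The entity lists are invented for illustration; real systems use far larger knowledge bases and statistical extraction.

    # A toy gazetteer: the entity lists are invented for illustration.
    GAZETTEER = {
        "company": ["endeca", "autonomy", "fast search & transfer"],
        "place": ["palo alto", "boston", "louisville"],
    }

    def tag_entities(text: str) -> dict:
        """Return entity tags found in the text by simple dictionary lookup."""
        lowered = text.lower()
        tags = {}
        for entity_type, names in GAZETTEER.items():
            hits = [name for name in names if name in lowered]
            if hits:
                tags[entity_type] = hits
        return tags

    doc = "Autonomy and Endeca both demonstrated systems in Boston last week."
    print(tag_entities(doc))
    # -> {'company': ['endeca', 'autonomy'], 'place': ['boston']}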

Side search. I like this phrase. It sounds nifty and means nothing to most people in a vendor’s marketing presentation. What I think the vendors who use this term mean is additional processes that run to generate “Use For” and “See Also” references. The implication is that the user gets a search bonus or extra sugar in their coffee. Some vendors have described a “more like this” function as a side search. The idea is that a user sees a relevant hit. By clicking the “more like this” hot link, the system uses the relevant hit as the basis of a new, presumably more precise, query. A side search to me means any automatic query launched without the user having to type in a search box. The user may have to click the mouse button, but the heavy lifting is machine-assisted. Delicious offers a side search labeled as related terms. Just choose a tag from the list on the right side of the Web page, and you see more hits like these. The idea is that you get related information without reentering a query.
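
A “more like this” side search can be sketched simply: pull the most frequent content words out of the relevant hit and reissue them as a query. This is a minimal illustration of the general idea, not any vendor’s implementation.

    from collections import Counter
    import re

    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on"}

    def more_like_this(document: str, top_n: int = 5) -> str:
        """Build a follow-on query from the most frequent content words in a hit."""
        tokens = [t for t in re.findall(r"[a-z0-9]+", document.lower())
                  if t not in STOP_WORDS and len(t) > 2]
        top_terms = [term for term, _ in Counter(tokens).most_common(top_n)]
        return " OR ".join(top_terms)

    hit = ("The appliance indexes structured and unstructured content. "
           "The appliance also supports metadata extraction and faceted navigation.")
    print(more_like_this(hit))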

Sentiment search. I have just looked at a new search system called Circos. This system lets me search in “color”. The idea is that emotions or feelings can be located. People want systems that provide a way to work emotion, judgment, and nuance into their results. Lexalytics, for example, offers a useful, commercial system that can provide brand managers with data about whether customers are positive or negative toward the brand. Google, based on their engineering papers, appears to be nosing around in sentiment search as well. This area is worth monitoring because using algorithms to figure out if users like or dislike a person, place, or thing can be quite significant to analysts.

Visual search. I don’t know what this means. I have seen the term used to describe systems that allow the user to click on pictures in order to see other pictures that share some colors or shapes of the source picture. If you haven’t seen Kartoo, it’s worth a look. Inxight Software offers a “search wall”. This is a graphic representation of the information in a results list or a collection as a three-dimensional brick wall. Each brick is a content object. I liked the idea when I first saw it five or six years ago, but I find visual search functionality clunky. Flying hyperbolic maps and other graphic renderings have sizzle, but instead of steak I get boiled tofu.

Parametric search. Structured search or SQL queries with training wheels are loose synonyms for parametric search and close enough for horse shoes. The term parametric search has value, but it is losing ground to structured search. Today, structured data are fuzzed with unstructured data by vendors who say, “Our system supports unstructured information and structured data.” Structured and unstructured data are treated as twins, making it hard for a prospect to understand what processes are needed to achieve this delightful state. These data can then be queried by assisted, guided, or faceted search. Some of the newer search systems are, at their core, parametric systems. These systems are not positioned in this way. Marketers find that customers don’t want to be troubled by “what’s under the hood.” So, “fields” become metatags, and other smoothing takes place. It is no surprise to me that content processing procurement teams struggle to figure out what a vendor’s system actually does. Check out Thunderstone’s offering and look for my Web log post about parametric (structured search) in a day or two. In Beyond Search, I profile two vendors’ systems, each with different but interesting parametric search functionality. Either of these two vendors’ solutions can help you deal with the structured – unstructured dichotomy. You will have to wait until April 2008 when my new study comes out. I’m not letting these two rabbits out of my hat yet.
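
For readers who want to see the difference in miniature, here is a tiny Python sketch of a parametric query: every record carries explicit fields, and the query is a set of field constraints rather than key words. The records and field names are invented for illustration.

    # A tiny in-memory "parametric" search: every record has explicit fields,
    # and a query is a set of field constraints rather than free text.
    RECORDS = [
        {"sku": "A-100", "color": "red", "price": 19.95, "category": "jacket"},
        {"sku": "A-101", "color": "blue", "price": 24.50, "category": "jacket"},
        {"sku": "B-200", "color": "red", "price": 9.99, "category": "hat"},
    ]

    def parametric_search(records, **constraints):
        """Return records whose fields match every supplied constraint."""
        results = []
        for record in records:
            if all(record.get(field) == value for field, value in constraints.items()):
                results.append(record)
        return results

    # "Show me red jackets" expressed as field constraints, not key words.
    print(parametric_search(RECORDS, color="red", category="jacket"))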

Unstructured search. This usually implies running a query against text that has been indexed for its key words because the source lacks “tags” or “field names”. Email, PDFs, and some Word documents are unstructured. A number of content processing systems can also index bound phrases like “stock market” and “white house”. Others include some obvious access points such as file types. Today, unstructured search blends into other categories. But unstructured search has less perceived value than flashier types of search or a back office ERP (enterprise resource planning) application. Navigate to ArnoldIT.com and run a query in my site’s search box. That’s an unstructured search, provided by Blossom Software, which is quite interesting to me.

Hyperbolic search. There are many variations of this approach, which I call “buzzword fog”. Hyperbolic geometry and modular forms play an important role in some vendors’ systems. But these functions are locked away, out of sight and beyond fiddling by licensees. When you hear terms other than plain English, you are in the presence of “fog rolling in on little cat’s feet.” The difference is that this fog doesn’t move on. You are stuck in an almost-impenetrable mist. When you see the collision coming, it is almost always too late to avoid it. I think the phrase means, “Our engineers use stuff I don’t understand, but it sure sounds good.”

Intuitive search. This is a term used to suggest that the interface is easy enough for the marketer’s mother to use without someone telling her what to do. The interface is one visible piece of the search system itself. Humans like to look at interfaces and debate which color or icon is better for their users. Don’t guess on interfaces. Test different ones and use what gets the most clicks. Interfaces that generate more usage are generally better than interfaces designed by the senior vice president’s daughter who just graduated with an MFA from the University of Iowa. Design opinion is not search; it’s technology decoration. For an example, look at this interface from Yahoo. Is it intuitive to you?

Real-time search. This term means that the content is updated frequently enough to be perceived as real time. It’s not. There is latency in search systems. The word “search,” therefore, doesn’t mean real-time by definition. Feed means “near real time”. There are a lot of tricks to create the impression of real time. These include multiple indexes, caching, content boosting, and time stamp fiddling. Check out ZapTXT. Next compare Yahoo News, AllTheWeb.com news, and Google News. Okay, which is “real time”? Answer: none.

Audio, video, image search. The idea is that a vendor indexes a particular type of non-text content. The techniques range from indexing only metadata and not the information in the binary file to converting speech to ASCII, then indexing the ASCII. In Japan, I saw a demonstration of a system that allowed a user to identify a particular image — for example, a cow. The system then showed pictures the system thought contained cows. These types of search systems address a real need today. The majority of digital content is in the form of digitized audio, video, and image files. Text is small potatoes. We don’t do a great job on text. We don’t do very well at all on content objects such as audio, video, and images. I think Blinkx does a reasonably good job, not great, reasonable.

Local search. This is a variation on vertical search. Information about a city or particular geographic area is indexed and made available. This is Yellow Pages territory. It is the domain of local newspaper advertising. A number of vendors want to dominate this sector; for example, Google, Microsoft, and Yahoo. Incumbents like telcos and commercial directory firms aren’t sure what actions to take as online sites nibble away at what was a $32 billion paper directory business. Look at Ask City. Will this make sense to your children?

Intelligent search. This is the old “FOAI” or familiar old artificial intelligence. Most vendors use artificial intelligence but call it machine learning or computational intelligence. Every major search engine uses computational intelligence. Try Microsoft’s Live.com. Now try Google’s “ig” or Individualized Google service. Which is relying more on machine learning?

Key word search. This is the ubiquitous, “naked” search box. You can use Boolean operators, or you can enter free text and perform a free text search. Free text search means no explicit Boolean operators are required of a user. Enlightened search system vendors add an AND to narrow the result set. Other system vendors, rather unhelpfully, add an OR, which increases the number of results. Take a look at the key word search from Ixquick, an engine developed by a New York City investment banker and now owned by a European company. What’s it doing to your free text query?

Search without search. Believe me, this is where the action is. The idea is that a vendor — for example, Google — will use information about information, user behavior, system processes, and other bits and pieces of data to run queries for a user automatically and in the background. Then when the user glances at his or her mobile device, the system is already displaying the information most likely to be wanted at that point in time by that user. An easy way to think of this is to imagine yourself rushing to the airport. The Google approach would look at your geospatial coordinates, check your search history, and display flight departure delays or parking lot status. I want this service because anyone who has ridden with me knows that I can’t drive, think about parking, and locate my airline reliably. I can’t read the keyboard on my mobile phone, so I want Google to convert the search result to speech, call me, and read the information to me as I try to make my flight. Google has a patent application with the phrase “I’m feeling doubly lucky.” Stay tuned to Google and its competitors for more information on this type of search.

This short list of different types of search helps explain why there is confusion about which systems do what. Search is no longer something performed by a person trained in computer science, information science, or a similar discipline. Search is something everyone knows, right? Wrong. Search is a service that’s readily available and used by millions of people each day. Don’t confuse using an automatic teller machine with understanding finance. The same applies to search. Just because a person can locate information about a subject does not mean that person understands search.

Search is among the most complex problems in computer science, cognitive psychology, information retrieval, and many other disciplines. Search is many things, but it definitely is not easy, well understood, or widely recognized as the next application platform.

Stephen Arnold, January 30, 2008

Sentiment Analysis: Bubbling Up as the Economy Tanks

January 20, 2008

Sentiment analysis is a sub-discipline of text mining. Text mining, as most of you know, refers to processing unstructured information and text blocks in a database to wheedle useful information from sentences, paragraphs, and entire documents. Text mining looks for entities, linguistic clues, and statistically significant high points.

The processing approach varies from vendor to vendor. Some vendors use statistics; others use semantic techniques. More and more vendors mix and match procedures to get the best of each approach. The idea is that software “reads” or “understands” text. None of the more than 100 vendors offering text mining systems and utilities does as well as a human, but the systems are improving. When properly configured, some systems outperform a human indexer. (Most people think humans are the best indexers, but for some applications, software can do a better job.) Humans are needed to resolve “exceptions” when automated systems stumble. But human indexers often memorize a number of terms and use them without seeking a more appropriate term from the controlled vocabulary. Human indexers also get tired, and fatigue affects indexing performance. Software indexing is the only way to deal with the large volumes of information in digital form today.

Sentiment analysis “reads” and “understands” text in order to find out if the document is positive or negative. About eight years ago, my team did a sentiment analysis for a major investment fund’s start up. The start up’s engineers were heads down on another technical matter, and the sentiment analysis job came to ArnoldIT.com.

We took some short cuts because time was limited. After looking at various open source tools and the code snippets in ArnoldIT’s repository, we generated a list of words and phrases that were generally positive and generally negative. We had several collections of text, mostly from customer support projects. We used these and applied some ArnoldIT “magic”. We were able to process unstructured information and assign a positive or negative score to documents based on our ArnoldIT “magic” and the dictionary. We assigned a red icon for results that our system identified as negative. Without much originality, we used a green icon to flag positive comments. The investment bank moved on, and I don’t know what the fate of our early sentiment analysis system was. I do recall that it was useful in pinpointing negative emails about products and services.
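
The word list approach can be sketched in a few lines of Python. This is a reconstruction of the general idea only; the word lists and the green / red flagging convention below are illustrative stand-ins, not the ArnoldIT code.

    # Illustrative word lists; production lists would be far longer.
    POSITIVE = {"great", "love", "excellent", "fast", "helpful"}
    NEGATIVE = {"broken", "slow", "refund", "angry", "useless"}

    def sentiment_score(text: str) -> int:
        """Positive score means mostly positive words; negative means the opposite."""
        words = text.lower().split()
        return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

    def flag(text: str) -> str:
        """Map a document to the green / red icon convention described above."""
        score = sentiment_score(text)
        if score > 0:
            return "green"   # positive
        if score < 0:
            return "red"     # negative
        return "neutral"

    print(flag("The support rep was helpful and the fix was fast"))
    print(flag("The unit arrived broken and I want a refund"))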

A number of companies offer sentiment analysis as a text mining function. Vendors include Autonomy, Corpora Software, and Fast Search & Transfer, among others. A number of companies offer sentiment analysis as a hosted service with the work more sharply focused on marketing and brands. Buzzmetrics (a unit of AC Nielsen), Summize, and Andiamo Systems compete in the consumer segment. ClearForest, before it was subsumed into Reuters (which was then bought by the Thomson Corporation), had tools that performed a range of sentiment functions.

The news that triggered my thinking about sentiment was statistics and business intelligence giant SPSS’s announcement that it had enhanced the sentiment analysis functions of its Clementine content processing system. According to ITWire, Clementine has added “automated modeling to identify the best analytic models, as well as combining multiple predictions for the most accurate results.” You can read more about SPSS’s Clementine technology here. SPSS acquired LexiQuest, an early player in rich content processing, in 2002. SPSS has integrated its own text mining technology with the LexiQuest technology. SAS followed suit but licensed Inxight Software technology and combined that with SAS’s home-grown content processing tools.

There’s growing interest in analyzing call center, customer support, and Web log content for sentiment about people, places, and things. I will be watching for more announcements from other vendors. In the behind-the-firewall search and content processing sectors, there’s a strong tendency to do “me too” announcements. The challenge is to figure out which system does what. Figuring out the differences (often very modest) between and among different vendors’ solutions is a tough job.

Will 2008 be the year for sentiment analysis? We’ll know in a few months if SPSS competitors jump on this band wagon.
Stephen E. Arnold, January 20, 2008.

Computerworld’s Take on Enterprise Search

January 12, 2008

Several years ago I received a call. I’m not at liberty to reveal the names of the two callers, but I can say that both callers were employed by the owner of Computerworld, a highly-regarded trade publication. Unlike its weaker sister, InfoWorld, Computerworld remains both a print and online publication. The subject of the call was “enterprise search” or what I now prefer to label “behind-the-firewall search.”

The callers wanted my opinion about a particular vendor of search systems. I provided a few observations and said, “This particular company’s system may not be the optimal choice for your organization.” I was told, “Thanks. Goodbye.” IDG promptly licensed the system against which I cautioned. In December 2007, at the international online meeting in London, England, an acquaintance of mine who works at another IDG company complained about the IDG “enterprise search” system. When I found myself this morning (January 12, 2008) mentioned in an article authored by a professional working at an IDG unit, I invested a few moments with the article, an “FAQ” organized as questions and answers.

In general, the FAQ snugly fitted what I believe are Computerworld’s criteria for excellence. But a few of the comments in the FAQ nibbled at me. I had to work on my new study Beyond Search: What to Do When Your Search System Doesn’t Work, and I had this FAQ chewing at my attention. A Web log can be a useful way to test certain ideas before “official” publication. Even more interesting is that I know that IDG’s incumbent search system, ah, disappoints some users. Now, before the playoff games begin, I have an IDG professional cutting to the heart of search and content processing. The article “FAQ: Why Is Enterprise Search Harder Than Google Web Search?” references me. The author appears to be Eric Lai, and I don’t know him, nor do I have any interaction with Computerworld or its parent, International Data Group (IDG), the conglomerate assembled by Patrick McGovern (blue suit, red tie, all the time, anywhere, regardless of the occasion).

On the article’s three Web pages (pages I want to add that are chock full of sidebars, advertisements, and complex choices such as Recommendations and White Papers) Mr. Lai’s Socratic dialog unfurls. The subtitle is good too: “Where Format Complications Meet Inflated User Expectations”. I cannot do justice to the writing of a trained, IDC-vetted journalist backed by the crack IDG editorial resources, of course. I’m a lousy writer, backed by my boxer dog Tyson and a moonshine-swilling neighbor next hollow down in Harrods Creek, Kentucky.

Let me hit the key points of the FAQ’s Socratic approach to the thorny issues of “enterprise search”, which is, remember, “behind-the-firewall search” or Intranet search. After thumbnailing each of Mr. Lai’s points, I will offer comments. I invite feedback from IDC, IDG, or anyone who has blundered into my Beyond Search Web log.

Point 1: Function of Enterprise Search

Mr. Lai’s view is that enterprise search makes information “stored in their [users’] corporate network” available. Structured and unstructured data must be manipulated, and Mr. Lai, on the authority of Dr. Yves Schabes, Harvard professor and Teragram founder, reports that a dedicated search system executes queries more rapidly “though it can’t manipulate or numerically analyze the data.”

Beyond Search wants to add that Teragram is an interesting content processing system. In Mr. Lai’s discussion of this first FAQ point, he has created a fruit salad mixed in with his ones and zeros. The phrase “enterprise search” is used as a shorthand way to refer to the information on an organization’s computers. Although a minor point, there is no “enterprise” in “enterprise search” because indexing behind-the-firewall information means deciding what not to index or, at least, what content is available to whom under what circumstances. One of the gotchas in behind-the-firewall search, therefore, is making sure that the system doesn’t find and make available personal information, health and salary information, certain sensitive information such as what division is up for sale, and the like. A second comment I want to make is that Teragram is what I classify as a “content processing system provider”. Teragram’s technology, which has been used at the New York Times and America Online, can be an enhancement to other vendors’ technology. Finally, the “war of words” that rages between various vendors about performance of database systems is quite interesting. My view is that behind-the-firewall search and the new systems on offer from Teragram and others in the content processing sector are responding to a larger data management problem. Content processing is a first step toward breaking free of the limitations of the Codd database. We’re at an inflection point, and the swizzling of technologies presages a far larger change coming. Think dataspaces, not databases, for example. I discuss dataspaces in my new study out in April 2008, and I hope my discussion will put the mélange of ideas in Mr. Lai’s first Socratic question in a different context. The change from databases to dataspaces involves more than swapping two consonants.

Point 2: Google as the Model for Learning Search

Mr. Lai’s view is that a user of Google won’t necessarily be able to “easily learn” [sic] an “enterprise search” system.

I generally agree with the sentiment of the statement. In Beyond Search I take this idea and expand it to about 250 pages of information, including profiles of 24 companies offering a spectrum of systems, interfaces, and approaches to information access. Most of the vendors’ systems that I profile offer interfaces that allow the user to point-and-click their way to needed information. Some of the systems absolve the user of having to search for anything because work flow tools and stored queries operate in the background. Just-in-time information delivery makes the modern systems easier to use because the hapless employee doesn’t have to play the “search box guessing game.” Mr. Lai, I believe, finds query formulation no great challenge. My research reveals the opposite. Formulating a query is difficult for many users of enterprise information access systems. When a deadline looms, employees are uncomfortable trying to guess the key word combination that unlocks the secret to the needed information.

Point 3: Hard Information Types

I think Mr. Lai reveals more about his understanding of search in this FAQ segment. Citing our intrepid Luxembourgian, Dr. Schabes, we learn about eDiscovery, rich media, and the challenge of duplicate documents routinely spat out by content management systems.

The problem is the large amounts of unstructured data in an organization. Let’s rein in this line of argument. There are multiple challenges in behind-the-firewall search. What makes information “hard” (I interpret the word “hard” as meaning “complex”) involves several little-understood factors colliding in interesting ways.

  • In an organization there may be many versions of documents, many copies of various versions, and different forms of those documents; for example, a sales person may have the Word version of a contract on his departmental server, but there may be an Adobe Portable Document Format version attached to the email telling the client to sign it and fax the PDF back. You may have had to sift through these variants in your own work.
  • There are file types that are in wide use. Many of these may be renegades; that is, the organization’s over-worked technical staff may be able to deal with only some of them. Other file types such as iPod files, digital videos of a sales pitch captured on a PR person’s digital video recorder, or someone’s version of a document exported using Word 2007’s XML format are troublesome. Systems that process content for search and retrieval have filters to handle most common file types. The odd ducks require some special care and feeding. Translation: coding filters, manual work, and figuring out what to do with the file types for easy access.
  • Results in the form of a laundry list are useful for some types of queries but not for others. The more types of content processed by the system, the less likely a laundry list will be useful. Not surprisingly, advanced content processing systems produce reports, graphic displays, suggestions, and interactive maps. When videos and audio programs are added to the mix, the system must be able to render that information. Most organizations’ networks are not set up to shove 200 megabyte video files to and fro with abandon or alacrity. You can imagine the research, planning, and thought that must go into figuring out what to do with these types of digital content.

None of this is “hard”. What’s difficult is the problem solving needed to make these data and information useful to an employee so work gets done quickly and in an informed manner. Not surprisingly, Mr. Lai’s Socratic approach leaves a few nuances in the tiny spaces of the recitation of what he thinks he heard Mr. Schabes suggest. Note that I know Mr. Schabes, and he’s an expert on rule-based content processing and Teragram’s original rule nesting technique, a professor at Harvard, and a respected computer scientist. So “hard” may not be Teragram’s preferred word. It’s not mine.
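
One of the chores hinted at in the first bullet (spotting the same contract in several formats) is often handled by hashing a normalized form of the extracted text. Here is a minimal sketch, assuming the text has already been pulled out of the Word, PDF, or email container:

    import hashlib
    import re

    def fingerprint(extracted_text: str) -> str:
        """Hash a normalized form of the text so format differences do not matter."""
        normalized = re.sub(r"\s+", " ", extracted_text.lower()).strip()
        return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

    word_version = "Smith Contract\nTerm: 24 months.  Warranty: standard."
    pdf_version = "smith contract term: 24 months. warranty: standard."

    # Identical fingerprints flag the two files as the same underlying document.
    print(fingerprint(word_version) == fingerprint(pdf_version))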

Point 4: Enterprise Search Is No More Difficult than Web Search

Mr. Lai’s question burrows to the root of much consternation in search and retrieval. “Enterprise search” is difficult.

My view is that any type of search ranks as one of the hardest problems in computer science. There are different types of problems with each variety of search–Web, behind-the-firewall, video, question answering, discovery, etc. The reason is that information itself is a very, very complicated aspect of human behavior. Dissatisfaction with “behind-the-firewall” search is due to many factors. Some are technical. In my work, when I see yellow sticky notes on monitors or observe piles of paper next to a desk, I know there’s an information access problem. These signs signal the system doesn’t “work”. For some employees, the system is too slow. For others, the system is too complex. A new hire may not know how to finagle the system to output what’s needed. Another employee may be too frazzled to be able to remember what to do due to a larger problem which needs immediate attention. Web content is no walk in the park either. But free Web indexing systems have a quick fix for problem content. Google, Microsoft, and Yahoo can ignore the problem content. With billions of pages in the index, missing a couple hundred million with each indexing pass is irrelevant. In an organization, nothing angers a system user quicker than knowing a document has been processed or should have been processed by the search system. When the document cannot be located, the employee either performs a manual search (expensive, slow, and stress inducing) or goes ballistic (cheap, fast, and stress releasing). In either scenario or one in the middle, resentment builds toward the information access system, the IT department, the hapless colleague at the next desk, or maybe the person’s dog at home. To reiterate an earlier point. Search, regardless of type, is extremely challenging. Within each type of search, specific combinations of complexities exist. A different mix of complexities becomes evident within each search implementation. Few have internalized these fundamental truths about finding information via software. Humans often prefer to ask another human for information. I know I do. I have more information access tools than a nerd should possess. Each has its benefits. Each has its limitations. The trick is knowing what tool is needed for a specific information job. Once that is accomplished, one must know how to deal with the security, format, freshness, and other complications of information.

Point 5: Classification and Social Functions

Mr. Lai, like most search users and observers, has a nose that twitches when a “new” solution appears. Automatic classification of documents and support of social content are two of the zippiest content trends today.

Software that can suck in a Word file and automatically determine that the content is “about” the Smith contract, belongs to someone in accounting, and uses the correct flavor of warranty terminology is useful. It’s also like watching Star Trek and hoping your BlackBerry Pearl works like Captain Kirk’s communicator. Today’s systems, including Teragram’s, can index at 75 to 85 percent accuracy in most cases. This percentage can be improved with tuning. When properly set up, modern content processing systems can hit 90 percent. Human indexers, if they are really good, hit in the 85 to 95 percent range. Keep in mind that humans sometimes learn intuitively how to take short cuts. Software learns via fancy algorithms and doesn’t take short cuts. Both humans and machine processing, therefore, have their particular strengths and weaknesses. The best performing systems with which I am familiar rely on humans at certain points in system set up, configuration, and maintenance. Without the proper use of expensive and scarce human wizards, modern systems can veer into the ditch. The phrase “a manager will look at things differently than a salesperson” is spot on. The trick is to recognize this perceptual variance and accommodate it insofar as possible. A failure to deal with the intensely personal nature of some types of search issues is apparent when you visit a company where there are multiple search systems or a company where there’s one system–such as the one in use at IDC–and discover that it does not work too well. (I am tempted to name the vendor, but my desire to avoid a phone call from hostile 20-year-olds is very intense today. I want to watch some of the playoff games on my couch potato television.)

Point 6: Fast’s Search Better than Google’s Search

Mr. Lai raises a question that plays to America’s fascination with identifying the winner in any situation.

We’re back to a life-or-death, winner-take-all knife fight between Google and Microsoft. No search technology is necessarily better or worse than another. There are very few approaches that are radically different under the hood. Even the highly innovative approaches of companies such as Brainware and its “associative memory” approach or Exegy with its juiced up hardware and terabytes of on board RAM appliance share some fundamentals with other vendors’ systems. If you slogged through my jejune and hopelessly inadequate monographs, The Google Legacy (Infonortics, 2005) and Google Version 2.0 (Infonortics, 2007), and the three editions I wrote of The Enterprise Search Report (CMSWatch.com, 2004, 2005, 2006), you will know that subtle technical distinctions have major search system implications. Search is one of those areas where a minor tweak can yield two quite distinctive systems even though both share similar algorithms. A good example is the difference between Autonomy and Recommind. Both use Bayesian mathematics, but the differences are significant. Which is better? The answer is, “It depends.” For some situations, Autonomy is very solid. For others, Recommind is the system of choice. The same may be said of Coveo, Exalead, ISYS Search Software, Siderean, or Vivisimo, among others. Microsoft will have some work to do to understand what it has purchased. Once that learning is completed, Microsoft will have to make some decisions about how to implement those features into its various products. Google, on the other hand, has a track record of making the behind-the-firewall search in its Google Search Appliance better with each point upgrade. The company has made the GSA better and rolled out the useful OneBox API to make integration and function tweaking easier. The problem with trying to get Google and Microsoft to square off is that each company is playing its own game. Socratic Computerworld professionals want both companies to play one game, on a fight-to-the-death basis, now. My reading of the data I have is that a Thermopylae is not in the interests of either Google or Microsoft, now or in the near future; neither wants to clash too much. The companies have different agendas, different business models, and different top-of-mind problems to resolve. The future of search is that it will be invisible when it works. I don’t think that technology is available from either Google or Microsoft at this time.

Point 7: Consolidation

Mr. Lai wants to rev the uncertainty engine, I think. We learn from the FAQ that search is still a small, largely unknown market sector. We learn that big companies may buy smaller companies.

My view is that consolidation is a feature of our market economy. Mergers and acquisitions are part of the blood and bones of business, not a characteristic of the present search or content processing sector. The key point that is not addressed is the difficulty of generating a sustainable business selling a fuzzy solution to a tough problem. Philosophers have been trying to figure out information for a long time and have done a pretty miserable job as far as I can tell. Software that ventures into information is going to face some challenges. There’s user satisfaction, return on investment, appropriate performance, and the other factors referenced in this essay. The forces that will ripple through behind-the-firewall search are:

  • Business failure. There are too many vendors and too few buyers willing to pay enough to keep the more than 350 companies sustainable.
  • Mergers. A company with customers and so-so technology is probably more valuable than a company with great technology and few customers. I have read that Microsoft was buying customers, not Fast Search & Transfer’s technology. Maybe? Maybe not.
  • Divestitures and spin outs. Keep in mind that Inxight Software, an early leader in content processing, was pushed out of Xerox’s Palo Alto Research Center. The fact that it was reported as an acquisition by Business Objects emphasized the end game. The start was, “Okay, it’s time to leave the nest.”

The other factor is not consolidation; it is absorption. Information is too important to leave in a stand-alone application. That’s why Microsoft’s Mr. Raikes seems eager to point out that Fast Search would become part of SharePoint.

Net-Net

The future, therefore, is that there will be less and less enthusiasm for expensive, stand-alone “behind-the-firewall” search. Search is becoming part of larger, higher-value information access solutions.

Stephen E. Arnold
January 13, 2008

Recommind: Following the Search Imperative

January 10, 2008

I opened my Yahoo alerts this morning, January 10, 2008, and read:

Recommind Predicts 2008 Enterprise Search and eDiscovery Trends: Search Becomes the Information Foundation of the … — Centre Daily Times Wed, 09 Jan 2008 5:32 AM PST

According to the enterprise search and eDiscovery technology experts at Recommind, 2008 will be the year that enterprise search and eDiscovery converge to become top areas of focus for enterprises worldwide, creating substantial growth and evolution in the management of electronic information.

The phrase “foundation of the electronic enterprise” struck me as meaningful and well-turned. Most search experts know Recommind by name only. I profiled the company in the third edition of The Enterprise Search Report, the last one that I wrote. I support the excellent fourth edition, but I did not do any of the updating for that version of the study. I’m confining my efforts to shorter, more specialized analyses.

The company once focused on the legal market. My take on the company’s technology was that it relied on Bayesian algorithms.

The Recommind product can deliver key word search. The company has a patented algorithm that implements “probabilistic latent semantic analysis.” I will discuss latent semantic indexing in “Beyond Search”. For our purposes, Recommind’s system identifies and analyzes the distribution of concept-related words in a document. The approach uses statistical methods to predict an item’s relevance.

The Recommind implementation of these algorithms differentiates the company’s system from Autonomy’s. Autonomy, as you may know, is the high-profile proponent of “automatic” or “automated” text processing. The idea (and I am probably going to annoy the mathematicians who read this article) is that Bayesian algorithms can operate without human fiddling. The phrase “artificial intelligence” is often applied to a Bayesian system when it feeds information about processed content back into the content processing subsystem. The notion is that Bayesian systems can be implemented to adapt to the content flowing through the system. As the system processes content, the system recognizes new entities, concepts, and classifications. The phrase “set it and forget it” may be used to describe a system similar to Autonomy’s or Recommind’s. Keep in mind that each company will quickly refine my generalization. For my purposes, however, I’m not interested in the technology. I’m interested in the market orientation the news story makes clear.
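
For readers who want a concrete picture, here is a toy Bayesian text classifier in Python. It is a generic naive Bayes sketch, not Autonomy’s or Recommind’s implementation; the learn method simply illustrates the feedback loop in which processed content updates the model.

    from collections import defaultdict
    import math

    class TinyNaiveBayes:
        """A toy Bayesian text classifier that keeps learning as documents arrive."""

        def __init__(self):
            self.word_counts = defaultdict(lambda: defaultdict(int))
            self.doc_counts = defaultdict(int)

        def learn(self, label, text):
            """Feed a processed document back into the model (the adaptive loop)."""
            self.doc_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1

        def classify(self, text):
            """Pick the label with the highest log probability for the text."""
            total_docs = sum(self.doc_counts.values())
            best_label, best_score = None, float("-inf")
            for label, doc_count in self.doc_counts.items():
                vocab = self.word_counts[label]
                label_total = sum(vocab.values())
                score = math.log(doc_count / total_docs)
                for word in text.lower().split():
                    # Laplace smoothing so unseen words do not zero out a label.
                    score += math.log((vocab.get(word, 0) + 1) / (label_total + len(vocab) + 1))
                if score > best_score:
                    best_label, best_score = label, score
            return best_label

    model = TinyNaiveBayes()
    model.learn("legal", "contract warranty indemnification clause")
    model.learn("marketing", "brand campaign awareness launch")
    print(model.classify("please review the warranty clause"))  # prints: legal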

Recommind is no longer a niche player in content processing. Recommind is cursoring the heartland of IBM, Microsoft, and Oracle: big business, the Fortune 1000, the companies that have money and will spend it on systems that enhance the firm’s revenue or control the firm’s costs. Recommind is an “enterprise content solutions vendor”.

Some History

Lawyers are abstemious, far better at billing their clients than spending on information technology. Recommind offered a reasonably priced solution for what’s now called “eDiscovery.”

eDiscovery means collecting a set of documents, typically those obtained through the legal discovery process, and processing them electronically. The processing part can have a number of steps, ranging from scanning, performing optical character recognition, and generating indexable files to performing relatively simple file transformation tasks. A simple transformation task is to take electronic mail and segment the message and save it, then save any attachment such as a PowerPoint presentation. Once a body of content obtained through the legal discovery process is available, that content is indexed.
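
The email segmentation step can be sketched with Python’s standard email module: split a message into a body record and separate attachment records, each of which then goes on to format conversion and indexing. The record fields are illustrative, not a description of any vendor’s pipeline.

    import email
    from email import policy
    from email.message import EmailMessage

    def segment_email(raw_message: bytes):
        """Split a raw email into an indexable body record plus attachment records."""
        msg = email.message_from_bytes(raw_message, policy=policy.default)
        body_part = msg.get_body(preferencelist=("plain",))
        records = [{
            "type": "email_body",
            "subject": msg["subject"],
            "sender": msg["from"],
            "text": body_part.get_content() if body_part else "",
        }]
        for part in msg.iter_attachments():
            records.append({
                "type": "attachment",
                "filename": part.get_filename(),
                "content_type": part.get_content_type(),
                # Raw bytes handed to the next step: conversion, OCR, then indexing.
                "payload": part.get_payload(decode=True),
            })
        return records

    demo = EmailMessage()
    demo["Subject"] = "Please sign"
    demo["From"] = "sales@example.com"
    demo.set_content("Signed copy attached. Please fax the PDF back.")
    demo.add_attachment(b"%PDF-1.4 ...", maintype="application",
                        subtype="pdf", filename="contract.pdf")
    for record in segment_email(demo.as_bytes()):
        print(record["type"], record.get("filename") or record.get("subject"))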

Legal discovery means, and I am simplifying in this explanation, that each side in a legal matter must provide information to the opposing side. In complex matters, more than two law firms and usually more than two attorneys will be working on the matter. In the pre-digital age, discovery involved keeping track of each discovered item manually, affixing an agreed upon identification number on the item, and making photocopies. The photocopies were — and still are in many legal proceedings — punched and placed in binders. The binders, even for relatively modest legal actions, can proliferate like gerbils. In a major legal action, the physical information can run to hundreds of thousands of documents.

eDiscovery, therefore, is the umbrella term for converting the text to electronic form, indexing it, and making that index available for those authorized to find and read those documents.

The key point about discovery is that it is not key word search. Discovery means that the system somehow finds out the important information in a document or collection of documents and makes that finding evident to a user. No key word query is needed. The user can read an email alert, click on a hot link that says, “The important information is here”, or view a visual representation of what’s in a mass of content. Remember: discovery means no key word query, no reading of the document to find out what’s in it. Discovery is the most recent Holy Grail in information retrieval despite its long history in specialized applications like military intelligence.

Recommind found success in the eDiscovery market. The product was reasonably priced, particularly when compared to a brand name, high profile system such as those available from Autonomy, Endeca, Fast Search & Transfer, iPhrase (now a unit of IBM), and Stratify. Instead of six figures, think in terms of $30,000 to $50,000. For certain law firms, spending $50,000 to manipulate discovered materials electronically was preferable to spending $330,000.

The problem with the legal market is that litigation and legal matters come and go. For a vendor of eDiscovery tools, marketing costs chew away at margins. Only a small percentage of law firms maintain a capability to process case-related materials in a single system. The pattern is to gear up for a specific legal matter, process the content, and then turn off the system when the matter closes. Legal content related to a specific case is encumbered by necessary controls about access, handling of the information once the matter is resolved, and specific actions that must be taken with regard to the information obtained in eDiscovery; for example, expert witnesses must return or destroy certain information at the close of a matter.

The point is that eDiscovery systems are designed to make it possible for a law firm to comply with the stipulations placed on information obtained in the discovery process.

Approaches to eDiscovery

Stratify, now a unit of Iron Mountain, is one of the leaders in eDiscovery. Once called Purple Yogi and the darling of the American intelligence community, Stratify has narrowed its marketing to eDiscovery. The Stratify system performs automatic processes along with key word indexing of documents gathered via legal discovery. The system has been tuned for legal applications. Licensees receive knowledge bases with legal terms, a taxonomy, and an editorial interface so the licensing firm can add, delete, or modify the knowledge bases. Stratify is priced in a way that is similar to the approach taken by the Big Three (Autonomy, Endeca, and Fast Search & Transfer) in search; that is, fees in the hundreds of thousands of dollars are more common than $50,000 fees. Larger license fees are needed because the marketing costs are high, and the search vendors have to generate enough revenue to avoid plunging into financial shortfalls. Second, the higher fees make sense to large, cash rich organizations. Many companies want to pay more in order to get better service or the “best available” solution. Third, other factors may be operating such as the advice of a consultant or the recommendation of a law firm also working on the matter.

eDiscovery can also be performed using generalized and often lower-cost products. In the forthcoming “Beyond Search: What to Do When Your Search System Doesn’t Work”, I profile a number of companies offering software systems that can make discovered matter searchable. For most of these firms, the legal market is a sideline. Selling software to law firms requires specialized knowledge of legal proceedings, a sales person familiar with how law firms work, and marketing that reaches attorneys in a way that makes them comfortable. The legal market is a niche, and although anyone can buy the names of lawyers from various sources, lawyers are not an easy market to penetrate.

Recommind, therefore, has shifted its marketing from the legal niche to the broader, more general market for Intranet search or what I call “behind the firewall” search. The term “enterprise search” is devalued, and I want to steer clear of giving you the impression that a single search system can serve the many information access needs of a growing organization. More importantly, there’s a belief that “one size fits all” in search. That is a misconception. The reality is that an organization will have a need for many different types of information access systems. At some point in the future, there may be a single point solution, but for the foreseeable future, organizations will need separate, usually compartmentalized systems to avoid personnel, legal, and intellectual property problems. I will write more about this in “Beyond Search” and in this Web log.

Trajectory of Recommind

Recommind’s market trajectory is important. The company’s shift from a niche to a broader market segment illustrates how content processing companies must adapt to the revenue realities in selling search solutions. Recommind has moved into a market sector where a general purpose solution at a competitive price point should be easier to sell. Instead of the specialized sales person for the niche market, a sales person with more generalized experience can be hired. The market of law firms is small and has become saturated. The broader enterprise market consists of the Fortune 1000 and upwards of 15 million small- and mid-sized businesses. Most of these need and want a “better” search solution. Recommind’s expansion of its marketing into this broader arena makes sense, and it illustrates what many niche vendors often do to increase their revenues.

Here’s the formula, with a diagram below the list to illustrate this marketing shift:

  • Increase the number of prospects for a search system by moving to a larger market. Example: from lawyers to general business, from the intelligence community in Washington, DC, to business intelligence in companies, or from pharmaceutical text mining to general business text mining.
  • Simplify the installation, minimizing the need for specialized knowledge bases, tuning, and time-consuming set up. Example: offer a plug-and-play solution, emphasize speedy deployment, provide a default configuration that delivers advanced features without manual set up and time-consuming “training” of the system.
  • Maintain a competitive price point because the “vendor will make it up on volume”. With more customers and shorter buying cycles, the vendor will have increased chances to land a large account that generates substantial fees when customization or special functionality are required.
  • Boost the return on investment for research, development, sales, marketing, and customer support. The business school logic is inescapable to many search vendors. Whether these MBA (master of business administration) assumptions prove false is not my concern at this point. Search vendors can’t make their revenue goals in small niches and remain profitable, grow, and fund R&D. The search vendors have to find a way to grow and expand margins quickly. The broader business market is a solution that most content processing companies implement.

Search market shift

Implications of Market Shifts

Based on my research, several implications of moving upmarket, offering general purpose solutions, and expanding service options receive scant attention in the trade and business press. Keep in mind that my data and experience are unique. Your view may be different, and I welcome your viewpoints. Let’s look at what I have learned:

First, smaller, specialized vendors have to move from a niche to a broader market. Examples include the aforementioned Stratify, which moved from the U.S. intelligence niche to the broader business market, only to narrow its focus there to handling special document collections. Iron Mountain saw value in this positioning and acquired Stratify. Vivisimo, which originally offered on-the-fly clustering, has repositioned itself as a vendor of “behind the firewall” search. The company’s core technology remains intact, but the firm has added functionality as it moves from a narrow “utility” vendor to a broader, “behind the firewall” vendor. Exegy, a vendor of special purpose, high-throughput processing technology, has moved from intelligence to financial services. This list can be expanded, but the point is clear. Search vendors have to move into broader markets in order to have a chance at making enough sales to generate the return investors demand. Stated another way, content processing vendors must find a way to expand their customer base or die.

Second, larger vendors — for example, the Autonomys, Endecas, and their ilk — must offer more and more services in an effort to penetrate more segments of the broader search market. Autonomy, in a sense, had to become a platform. Autonomy had to acquire Verity to get more upsell opportunities and more customers quickly. And the company had to diversify from search into other, adjacent information access and management services such as email management with its acquisition of Zantaz. The imperative to move into more markets and grow via acquisition is driving some of the industry consolidation now underway.

Third, established enterprise software vendors must move downmarket. IBM, Microsoft, and Oracle have to offer more information management, access, and processing services. A failure to take this step means that the smaller, more innovative companies moving from niches into broader business markets will challenge these firms’ grip on enterprise customers. Microsoft, therefore, had to counter the direct threat posed by Coveo, Exalead, ISYS, and Mondosoft (now SurfRay), among others.

Fourth, specialized vendors of text mining or business intelligence tools will find themselves subject to some gravitational forces. Inxight, the text analysis spin out of Xerox Palo Alto Research Center, was purchased by Business Objects. Business Objects was then acquired by SAP. After years of inattention, companies as diverse as Siderean Software (a semantic systems vendor with assisted navigation and dashboard functionality) and MarkLogic (an XML-on-steroids and data management vendor) will be sucked into new opportunities. Executives at both firms suggested to me that their products and services were of interest to superplatforms, search system vendors, and Fortune 1000 companies. I expect that both these companies will be themselves discovered as organizations look for “beyond search” solutions that work, mesh with existing systems, and eliminate if not significantly reduce the headaches associated with traditional information retrieval solutions.

I am reluctant to speculate on the competitive shifts that these market tectonics will bring in 2008. I am confident, however, that the outlook for certain content processing companies is very bright indeed.

Back to Recommind

Recommind, therefore, is a good example of how a niche vendor of eDiscovery solutions can and must move into broader markets. Recommind is important not because it offers a low-cost implementation of the Bayesian algorithms in the Autonomy system. Recommind warrants observation because it makes visible, as a case study, certain market imperatives in the search sector. What the diagram depicts, albeit somewhat awkwardly, is that each segment of the information retrieval market is in movement. Niche players must move upmarket and outwards. Superplatforms must move downmarket and into niches. Business intelligence system vendors must move into mainstream applications.

Exogenous Forces

The diagram omits two important exogenous forces. I will comment on these in another Web log article. For now, let me identify these two “storm systems” and offer several observations about search and content processing.

The first force is Lucene. This is the open source search solution that is poking its nose under a number of tents. IBM, for example, uses Lucene in some of its search offerings. A start-up in Hungary called Tesuji offers Lucene plus engineering support services. Large information companies like Reed Elsevier continue to experiment with Lucene in an effort to shake free of burdensome licensing fees and restrictions imposed by established vendors. Lucene is not likely to go away, and with a total cost of ownership that starts at zero in licensing fees, some organizations will find the system worth further investigation. More importantly, Lucene has been one of the factors turbo-charging the “free search software” movement. The only way to counter certain chess moves is a symmetric action. Lucene, not Google or other vendors, is the motive force behind the proliferation of “free” search.
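For readers who have not poked at Lucene, here is a minimal sketch of what indexing and querying look like with the Lucene 2.x-era Java classes. The field names and sample text are my own illustrations, not taken from any vendor’s product, and the exact class names shift between Lucene releases:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.store.RAMDirectory;

    public class LuceneSketch {
        public static void main(String[] args) throws Exception {
            RAMDirectory index = new RAMDirectory();        // in-memory index for this example
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // Index one hypothetical document with a title field and a body field.
            IndexWriter writer = new IndexWriter(index, analyzer, true);
            Document doc = new Document();
            doc.add(new Field("title", "Beyond key word search",
                    Field.Store.YES, Field.Index.TOKENIZED));
            doc.add(new Field("body", "Content processing moves past simple key word matching.",
                    Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.optimize();
            writer.close();

            // Parse a key word query against the body field and print the matching titles.
            IndexSearcher searcher = new IndexSearcher(index);
            Query query = new QueryParser("body", analyzer).parse("content processing");
            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.doc(i).get("title"));
            }
            searcher.close();
        }
    }

The point is not the specific calls. The point is that a competent Java programmer can stand up basic key word indexing without paying a licensing fee, which is exactly the pressure the established vendors now feel.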

The second force is cloud computing. Google is often identified as the prime mover. It’s not. The notion of hosted search is an environmental factor. Granted, cloud-based information retrieval solutions remain off the radar for most information technology professionals. Recall, however, that the core of hosted search is the commercial database industry. LexisNexis, Dialog, and Ebscohost are, in fact, hosted solutions for specialized content. Blossom Software, Exalead, Fast Search & Transfer, and other content processing vendors offer off-premises or hosted solutions. The economics of information retrieval translate to steadily increasing interest in cloud-based solutions. And when the time is right, Amazon, Google, Microsoft, and others will be offering hosted content processing solutions. In part it will be a response to what Dave Girouard, a Google executive, calls the “crisis in IT”. In part, it will be a response to economics. Few — very, very few — professionals understand the total cost of information retrieval. When the “number” becomes known, a market shift from on-premises to cloud-based solutions will take place, probably with some velocity.

Wrap Up

Several observations are warranted:

First, Recommind is an interesting company to watch. It is a microcosm of broader industry trends. The company’s management has understood the survival imperative and acted on the choice today’s market makes obvious: expand or stagnate.

Second, tectonic forces are at work that will reshape the information retrieval, content processing, and search market as it exists today. It’s not just consolidation; search and its cousins will become part of a larger data management fabric.

Third, there’s a great deal of money to be made as these forces grind through the more than 200 companies offering content processing solutions. Innovation, therefore, will continue to bubble up from U.S. research computing programs and from outside the U.S. Tesuji in Hungary is just one example of dozens of innovative approaches to content processing.

Fourth, the larger battle is not yet underway. Many analysts see hand-to-hand combat between Google and Microsoft. I don’t. I think that for the next 18 to 24 months, battles will rage within niches, among established search vendors, and among the established enterprise software vendors. Google is a study in “controlled chaos”. With this approach, Google is not likely to mount any single, direct attack on anything until the “controlled chaos” yields the data Google needs to decide on a specific course of action.

Search is dead. At least the key word variety is. Content processing is alive and well. The future is broader: data management and data spaces. As we rush forward, opportunities abound for licensees, programmers, entrepreneurs, and vendors. We are living through a transition from the Dark Ages of key word search to a more robust, more useful approach.

Stephen E. Arnold
10 January 2008

Little-Known Search Engines

January 9, 2008

Here’s a rundown of little-known engines with links to their Web sites.

As I work to complete “Beyond Search: What to Do When Your Search Engine Doesn’t Work,” I reviewed my list of companies offering search technology. I could not remember much about several of them.

That gap in my memory triggered my checking to see what angle each of these companies takes, or in some cases took, toward search and retrieval.

  • Aftervote — A metasearch engine with a “vote up” or “vote down” button for results.
  • AskMeNow — A mobile search service that wanted my cell number. I didn’t test it. The splash page says AskMeNow.com is a “smart service”.
  • C-Search Solutions — A search system for “your IBM Domino domain.” The company offers a connector to hook the Google Search Appliance to Domino content.
  • Ceryle — A data management system that generates topics and associations.
  • Craky.com — The site had gone dark when I tested it on January 8, 2008. It was a “search engine for impatient boomers”.
  • Dumbfind — An amazing name. Dumbfind describes itself as a “user generated content site.” A social search system, I believe.
  • Exorbyte — A German high-performance search system. Lists eBay, Yahoo, and the ailing Convera as customers.
  • Eyealike — A visual search engine. The splash page says “you can search for your dream date.” Alas, not me. Too old.
  • Ezilon — Not Ezillion, which is an auction site. A Web directory and search engine.
  • Idée Inc. — The company develops advanced image recognition and visual search software. Piximilar is the company’s image search system.
  • Kosmix — An “intelligent search engine”. The system appears to mimic some of the functions of Google’s universal search system.
  • Linguistic Agents — The company’s search technology bridges “language and technology”.
  • Paglo Inc. — This is a “search engine for information technology” on an Intranet. The system discovers “everything on your network”.
  • Q Phrase — The company offers “discovery tools”.
  • Semantra — The system allows you to have “an intelligent conversation with your enterprise databases.”
  • Sphinx — Sphinx is a full-text search engine for database content.
  • Surf Canyon — In beta. The system shows related information when you hover over a hit in a results list.
  • Syngence — A content analytics company, Syngence focuses on “e-discovery”.
  • Viziant — The company is “a pioneer in delivering tools for discovery.”
  • Xerox Fact Spotter — Text mining tools developed at Xerox “surpass search”. The description of the system seems similar to the Inxight system that’s now part of Business Objects, which is now owned by SAP.

Several observations are warranted. First, I am having a difficult time keeping up with many of these companies’ systems. Second, text mining and other rich text processing solutions are notable. Semantics, linguistics, and other techniques to squeeze meaning from information are hard-to-miss trends. The implication is that key word search is slipping out of the spotlight. Finally, investors are putting up cash to fund a very wide range of search-and-retrieval operations. Even though consolidation is underway in the search sector, there’s a steady flow of new and often hard-to-pronounce vendors chasing revenue.

Stephen E. Arnold
9 January 2008, 11:00am
