Search System Bottlenecks

February 21, 2008

In a conference call yesterday (February 19, 2008), one of the well-informed participants asked, “What’s with the performance slow downs in these behind-the-firewall search systems?”

I asked, “Is it a specific vendor’s system?”

The answer, “No, it seems like a more general problem. Have you heard anything about search slow downs on Intranet systems?”

I do hear quite a bit about behind-the-firewall search systems. People find my name on the Internet and ask me questions. Others get a referral to me. I listen to their question or comment and try to pass those with legitimate issues to someone who can help out. I’m not too keen on traveling to a big city, poking into the innards of a system, and trying to figure out what went off track. That’s a job for younger, less jaded folks.

But yesterday’s question got me thinking. I dug around in my files and discovered a dated, but still useful diagram of the major components of a behind-the-firewall search system. Here’s the diagram, which I know is difficult to read, but I want to call your attention to the seven principal components of the diagram and then talk briefly about hot spots. I will address each specific hot spot in a separate Web log post to keep the length manageable.

This essay, then, takes a broad look at the places I have learned to examine first when trying to address a system slow down. I will try to keep the technical jargon and level of detail at a reasonable level. My purpose is to provide you with an orientation to hot spots before you begin your remediation effort.

The Bird’s Eye View of a Typical Search System

Keep in mind that each vendor implements the search sub systems in a way appropriate for their engineering. In general, if you segment the sub systems, you will see a horizontal area in the middle of this diagram surrounded by four key subsystems, the content, and, of course, the user. The search system exists for the user, which many vendors and procurement teams happily overlook.

birdview

This diagram has been used in my talks at public events for more than five years. You may use this for your personal use or in educational activities without restrictions. If you want to use it in a publication, please, provide contact me for permission.

Let’s run through this diagram and then identify the hot spots. You see some arrows. These are designed to show the pipeline through which content, queries, and results flow. In several places, you see arrows pointing different directions in close proximity. It is obvious that in these interfaces, a glitch of any type will create a slowdown. Now let’s identify the main features.

In the upper left hand corner is a blue sphere that represents content. For our purpose, let’s just assume that the content resides behind the firewall, and it is the general collection of Word documents, email, and PowerPoints that make up much of an organization’s information. Pundits calculate that 80 percent of an organization’s information is unstructured. My research suggests that the ratio of structured to unstructured data varies sharply by type of organization. For now, let’s just deal with generalized “content”. In the upper right hand corner, you see the user. The user, like the content, can be generalized for our purposes. We will assume that the user navigates to a Web page, sees a search box, or a list of hot links, and enters a query in some way. I don’t want to de emphasize the user’s role in this system, but I want to set aside her needs, the hassle of designing an interface, and other user-centric considerations such as personalization.

Backbone or Framework

Now, let’s look at the horizontal area in the center of the diagram show below:

framework

You can see that there are specific sub systems within this sub system labeled storage clusters. This is the first key notion to keep in mind when thinking about performance of a search system. The problem that manifests itself at an interface may be caused by a sub component in a sub system. Until there’s a problem, you may not have thought about your system as a series of nested boxes. What you want to keep in mind is that until you have a performance bottleneck is that the many complex moving parts were working pretty well. Don’t criticize your system vendor without appreciating how complicated a search system is. These puppies are far from trivial — including the free one you download to index documents on your Mac or PC.

In this rectangle are “spaces” — a single drive or clusters of servers — that hold content returned from the crawling sub system (described below), the outputs of the document processing sub system (described below), the index or indexes, the system that holds the “content” in some proprietary or other representation, and a component to house the “metrics” for the system. Please keep in mind that running analytics is a big job, and you will want to make sure that you have a way to store, process, and manipulate system logs. No system logs — well, then, you are pretty much lost in space when it comes to trouble shooting. One major Federal agency could not process its logs; therefore, usage data and system performance information did not exist. Not good. Not good at all.

contentsubsystem

The components in this sub system handle content acquisition, usually called crawling or spidering. I want to point out that the content acquisition sub system can be a separate server or cluster of servers. Also, keep in mind that keeping the content acquisition sub system on track requires that you fiddle with rules. Some systems like Google’s search appliance reduce this to a point-and-click exercise. Other systems require command line editing of configuration files. Rules may be check boxes or separate scripts / programs. Yes, you have to write these or pay someone to do the rule fiddling. When the volume of content grows, this sub system can choke. The result is not a slow down, but you may find that some users say, “I put the content in the folder for indexing, and I can’t find the document.” No, the user can’t. It may be snagged in an over burdened content acquisition sub system.

Document Processing / Document Transformation

Let me define what I mean by document processing. I am using this term to mean content normalization and transformation. In Beyond Search, I use the word transformation to stream line the text. In this sub system, I am not discussing indexing the content. I want to move a Word file from its native Word format to a form that can be easily ingested by the indexing sub system described in the next section of this essay.

transformation

This sub system pulls or accepts the information acquired by the spidering sub system. Each file is transformed into an representation that the indexed sub system (described below) can understand. Transformation is now a key part of many behind-the-firewall systems. The fruit cake of different document types are normalized; that is, made standard. If a document cannot be manipulated by the system, then that document cannot be indexed. An increasing number of document transformation sub systems store the outputs in an XML format. Some vendors include an XML data base or data management system with their search system. Others use a data base system and keep it buried in the “guts” of their system. This notion of transformation means that disc writes will occur. The use of a data base system “under the hood” may impose some performance penalties on the document processing sub system. Traditional data base management systems can be input – output bound. A bottle neck related to an “under the hood” third-party, proprietary, or open source data base can be difficult to speed up if resources like money for hardware are scarce.

Indexing

Most vendors spend significant time explaining the features and functions of their systems’ indexing. You will hear about semantic indexing, latent semantic indexing, linguistics, and statistical processes. There are very real differences between vendors’ systems. Keep in mind that any indexing sub system is a complicate beastie. Here’s a blow up from the generalized schematic above:

indexing

In this diagram, you see knowledge bases, statistical functions, “advanced processes” (linguistics / semantics), and a reference to an indexing infrastructure. Indexing performs much of the “heavy lifting” for a search system, and it is absolutely essential that the indexing sub system be properly resourced. This means bandwidth, CPU cycles, storage, and random access memory. If the indexing sub system cannot keep pace with the amount of information to be indexed and the number of queries passed against the indexes, a number of symptoms become evident to users and the system administrator. I will return to the problems of an overloaded indexing subsystem in a separate essay in a day or two. Note that I have included “manual tagging” in the list of fancy processes. The notion of a fully automatic system, in my experience, is a goal, not a reality. Most indexing systems require over sight by a subject matter expert or indexing specialist. Both statistical and linguistic systems can get “lost in space.” There are many reasons such as language drift, neologisms, and exogenous shifts. The only reliable way to get these indexing glitches resolved is to have a human make the changes to the rules, the knowledge bases, or the actual terms assigned to individual records. Few vendors like to discuss these expensive, yet essential, interventions. Little wonder that many licensees feel snookered when “surprises” related to the indexing sub system become evident and then continue to crop up like dandelion.

Query Processing

Query processing is a variant of indexing. Queries have to be passed against the indexes. In effect, a user’s query is “indexed”. The query is matched or passed against the index, and the results pulled out, formatted, and pushed to the user. I’m not going to talk about stored queries or what used to be called SDI (selective dissemination of information), saved searches, or filters. Let’s just talk about a key word query.

queryproc

The query processing sub system consists of some pre – and post – processing functions. A heavily-used system requires a robust query processing “front end.” The more users sending queries at the same time, the more important it is to be able to process those queries and get results back in an acceptable time. My tests show that a user of a behind-the-firewall system will wait as much as 15 seconds before complaining. In my tests on systems in 2007, I found an average query response time in the 20 second range, which explains in large part why employees are dissatisfied with their incumbent search system. The dissatisfaction is a result of an inadequate infrastructure for the search system itself. Dissatisfaction, in fact, does not single out a specific vendor. The vendors are equally dissatisfying. The vendors, obviously, can make their systems run faster, but the licensee has the responsibility to provide a suitable infrastructure on which to run the search system. In short, the “dissatisfaction” is a result of poor response time. Only licensees can “fix” this infrastructure problem. Blaming a search vendor for lousy performance is often a false claim. Notice that the functions performed within the query processing sub system are complex; for example, “on the fly” clustering, relevance ranking, and formatting. Some systems include work flow components that shape queries and results to meet the needs of particular employees or tasks. The work flow component then generates the display appropriate for the work task. Some systems “inject” search results into a third-party application so the employee has the needed information on a screen display related to the work task; for instance, a person’s investments or prior contact history.

Back to Hot Spots

Let me reiterate — I am using an older, generalized diagram. I want to identify the complexities within a representative behind-the-firewall search system. The purpose of this exercise is to allow me to comment on some general hot spots as a precursor to a quick look in a subsequent essay about specific bottle necks in subsystems.

The high level points about search system slow downs are:

  1. A slow down in one part of the system may be caused be a deeper issue. In many cases, the problem could be buried deep within a particular component in a sub system. Glitches in search systems can, therefore, take some time to troubleshoot. In some cases, there may be no “fix”. The engineers will have to “work around” the problem which may mean writing code. Switching to a hosted service or a search appliance may be the easiest way to avoid this problem.
  2. The slow down may be outside the vendor’s span of control. If you have an inadequate search system infrastructure, the vendor can advise you on what to change. But you will need the capital resources to make the change. Most slow downs in search systems are a result of the licensee’s errors in calculating CPU cycles, storage, bandwidth, and RAM. The cause of this problem is ignorance of the computational burden search systems place on their infrastructure. The fast CPUs are wonderful, but you may need clusters of servers, not one or two servers. The fix is to get outside verification of the infrastructure demands. If you can’t afford the plumbing, shift to a hosted solution or license an appliance.
  3. A surge in either the amount of content to index or the numbers of queries to process can individually bring a system to a half. When the two coincide, the system will choke, often failing. If you don’t have log data and you don’t review it, you will not know where to begin looking for a problem. The logs are often orphans, and their data are voluminous, hard to process, and cryptic. Get over it. Most organizations have a steady increase in content to be processed and more users sending queries to the search system despite their dissatisfaction with its performance. In this case, you will have a system that will fail and then fail again. The fix is to buckle down, manage the logs, study what’s going on in the sub systems, and act in an anticipatory way. What’s this mean? You will have to continue to build out your system when performance is acceptable. If you wait until something goes wrong, you will be in a very precarious position.,

To wrap up this discussion, you may be reeling from the ill-tasting medicine I have prescribed. Slow downs and hot spots are a fact of life with complex systems such as search. Furthermore, the complexity of the search systems in general and their sub systems in particular are essentially not fully understood by most licensees, their IT colleagues, or their management. In the first three editions of the Enterprise Search Report, I discussed this problem at length. I touch upon it briefly in Beyond Search because it is critical to the success of any search or content processing initiative. If you have different experiences from mind, please, share them via the comments function on this Web log.

I will address specific hot spots in the next day or two.

Stephen Arnold, February 21, 2008

Power Leveling

February 20, 2008

Last week I spoke with a group of young, enthusiastic programmers. In that lecture, I used the phrase power leveling. I didn’t coin this term. In my preparation for my lecture, I came across an illustration of a maze.

What made the maze interesting was a rat had broken through the maze’s dividers. From the start of the maze to the cheese at the exit, the mouse bulldozed through the barriers. Instead of running the maze, the rat went from A to B in the shortest, most direct way.

Power leveling.

When I used the term, I was talking about solving some troublesome problems in search and retrieval. What I learned in the research for Beyond Search was that many companies get trapped in a maze. Some work very hard to figure out one part of the puzzle and fail to find the exit. Other companies solve the maze, but the process is full of starts and stops.

Two Approaches Some Vendors Take

In terms of search and retrieval, many vendors develop solutions that work in a particular way on a specific part of the search and retrieval puzzle. For example, a number of companies performing intensive content processing generate additional indexes (now called metatags) for each document processed. These companies extract entities, assign geo spatial tags, classify documents and those documents components. The thorough indexing is often over kill. When these systems crunch through email, which is often cryptic, the intense indexing can go off the rails. The user can’t locate the needed email using the index terms and must fall back on searching by date, sender, or subject. This type of search system is like the rat that figures out how to solve one corner of the maze and never gets to the exit and freedom.

The other approach does not go directly to the exit. These systems iterate, crunch, generate indexes, and rerun processes repeatedly. With each epoch of the indexing processing, the metatags get more accurate. Instead of a blizzard of metatags, the vendor delivers useful metadata. The vendor achieves the goal with the computational equivalent of using a submachine gun to kill the wasp in the basement. As long as you have the firepower, you can fire away until you solve the problem. The collateral damage is the computational equivalent of shooting up your kitchen. Instead of an AK-47, these vendors require massive amounts of computing horsepower, equivalent storage, and sophisticated infrastructure.

Three Problems to Resolve

Power leveling is neither of these approaches. Here’s what I think more developers of search-and-retrieval systems should do. You may not agree. Share you views in the comments section of this Web log.

First, find a way around brute force solutions. The most successful systems often use techniques that are readily available in text books or technical journals. The trick is to find a clever way to do the maximum amount of work in fewest cycles. Just because today’s processors are pretty darn quick, you will deliver a better solution by letting software innovations do the heavy lifting. Search systems that expect me to throw iron at bottlenecks are likely to become a money pit at some point. A number of high-profile vendors are suffering from this problem. I won’t mention any names, but you can identify the brute force systems doing some Web research.

Second, how can you or a vendor get the proper perspective on the search-and-retrieval system? It is tough to get from A to B in a nice Euclidian way if you keep your nose buried in a tiny corner of the larger problem space. In the last few days, two different vendors were thunderstruck that my write ups of their system described their respective products more narrowly than the vendors’ saw the products. My perspective was broader than theirs. These two vendors struggled and are still struggling to reconcile my narrow perception of their systems with the broader and, I believe, inaccurate descriptions of these systems.

I have identified a third problem with search-and-retrieval systems. Vendors work hard to find an angle, a way to make themselves distinct. In this effort to be different, I identified vendors who have created systems that can be used when certain, highly-specific requirements call for these functions. Most organizations don’t want overly narrow solutions. The need is to have a system that allows the major search-and-retrieval functions to be performed at a reasonable cost on relatively modest hardware. As important, the customers want a system that an information technology generalist can understand, maintain, and enhance. In my experience, most organizations don’t want rocket science. Overly complex systems are fueling interest in SaaS (software as a service. Believe me, there are search-and-retrieval vendors selling systems that are so convoluted, so mind-boggling complicated that their own engineers can’t make some changes without consulting the one or two people who know the “secret trick”. Mere mortals cannot make these puppies work.

Not surprisingly, the 50 or 60 people at my lecture were surprised to hear me make suggestions that put so much emphasis on being clever, finding ways to go through certain problems, keeping the goal in sight, and keeping their egos from getting between their customers and what the customer needs to do with a system.

A Tough Year Ahead

Too many vendors find themselves in a very tough competitive situation. The prospects often have experience with search-and-retrieval systems. The reason these prospects are talking to vendors of search-and-retrieval systems is because the incumbent system doesn’t do the job.

With chief financial officers sweating bullets about costs, search-and-retrieval vendors will have to deliver systems that work, can be maintained without hefty consulting add ons, and get the customer from point A to B.

I think search-and-retrieval as a separate software category is in danger of being commoditized. Lucene, for example, is a good enough solution. The hundreds of companies chasing a relatively modest pool of potential buyers is ripe for a shake out and consolidation. Vendors may find themselves blocked by super platforms who bundle search and content processing with other, higher value enterprise applications.

Search-and-retrieval vendors may want to print out the power leveling illustration and tape it to their desk. Inspiration? Threat? You decide.

Stephen Arnold, February 20, 2008

Arnold’s KMWorld Essay Series

February 19, 2008

kmlogo the newspaper covering the knowledge management market sector, published the first of a series essays by my hand in its February 2008. Unfortunately I am not permitted to reproduce the entire essay here because the copyright has been assigned to Information Today, Inc.

In each essay, I want to look at Google’s (NASDAQ:GOOG) impact on knowledge management and closed related fields. Many people see Google as a Web indexing and advertising business that has tried to move into other businesses and failed. But Google has disrupted the telecommunications industry with its “open platform” play in the spectrum auction. Now Google is probing shopping, banking, and entertainment sectors. Make no mistake. These probes are not happenstance. Google is a new breed of enterprise, and I want to help you understand it an essay at a time.

Here’s one snippet from my February 2008 KMWorld essay:

If we dip into Google’s more than 250 patent applications and patents, we find more than two dozen inventions related to content, embedding advertising in that content, and manipulating the content to create compilations or anthologies, as well as other “interesting” services… Just as Google disrupted the global telecommunications sector with its open platform and hosted mobile services, enterprise publishing and traditional publishing are now in the path of Googzilla**.

** That’s my coinage to refer the powerful entity that Google has become. Google has the skeleton of a meat-eating dinosaur out side of its Mountain View, California offices. Don’t believe me. Click this Google dinosaur link to see for yourself.

In the February 2008 essay titled “Probing the Knowledge Market” I talk about Google’s growing capability in enterprise content management and publishing. Most traditional publishers haven’t figured out Google’s advertising business. It comes as no surprise, then, for me to assert that Google’s potential impact on traditional publishing and CMS is essentially unperceived. JotSpot? Do you know what JotSpot’s technology can do for Google users? Most don’t. That’s a gap in your knowledge you may want to fill by reading my February column.

I’ve already completed a couple of submissions for this series. You will learn about my views on the GSA (Google Search Appliance). Unlike the GSA bashers, I think GSA is a very good and quite useful search-and-retrieval system. Competitors and pundits have been quick to point out the GSA’s inability to duplicate the alleged functionality of some of the best-known search system vendors. The problem is, I explain, that GSA is one piece of a larger enterprise solution. Unlike the mind-boggling complexity of some enterprise search solutions, Google’s approach is to reduce complexity, the time required to deploy a search solution, and eliminate most of the administrative headaches that plague many “behind the firewall” search system. Flexibility comes from the OneBox API, not a menu of poorly integrated features and functions. You can make a GSA perform most content processing tricks without eroding the basic system’s simplicity and stability.

I also tackle what I call “Google Glue”. The idea of creating a “sticky” environment is emerging as a key Google strategy. Most professionals are blissfully unaware of a series of activities over the last two years that “cement” users and developers to Google. Google is not just a search system; it is an application platform. I explain the different “molecules” in Google’s bonding agent. Some of these are “off the radar” of enterprise information technology professionals. I want to get these facts “on the radar”. My mantra is “surf on Google.” After studying Google’s technology for more than five years, the Google as President Bush phrased it is a game changer.

The “hook” in my KMWorld essays will be Google and its enterprise activities. I don’t work for Google, and I don’t think the management thinks too much of my work. My The Google Legacy: How Search Become the Next Application Platform and Google Version 2.0: The Calculating Predator presented information I obtained from open source about Google’s larger technology capabilities and its disruptive probes into a half dozen markets. More info about these studies here.
What you will get in my essays is an analysis of open source information about the world’s most influential search, content processing, and knowledge management company best known for its free Web search and online advertising business.

Please, navigate to the KMWorld Web site. You can look my essays there, or you can sign up to get the hard copy of the KMWorld tabloid. Once I complete the series, I will post versions of the columns. I did this with my earlier “Technology from Harrod’s Creek” essays that ran for two years in Information World Review. But I don’t post these drafts until two or three years after an essay series closes.

Stephen Arnold, February 19, 2008

How Big is the Behind-the-Firewall Search Market?

February 18, 2008

InternetNews.com ran a story by David Needle on February 5. The title was “Enterprise Search Will Top $1 Billion by 2010.” If the story is still online (news has a tendency to disappear pretty quickly), I found it here.

In January 2003, my publisher (Harry Collier, Infonortics Ltd., Tetbury, Glou.) and I collaborated on a short white paper Search Engines: Evolution and Diffusion. That document is no longer in print. We have talked about updating it. Each month the amount of information available about search and retrieval, content processing, and text analysis grows. An update is on my to-do list. I’m not sure about Mr. Collier’s task agenda.

How We Generated Our Estimate in 2003

In that essay, we calculated — actually backed into — estimates on the size of the search-and-retrieval market. Our procedure was straight forward. We identified the companies in our list of 100 vendors that were public. If the public company focused exclusively on search, we assumed the company’s revenues came from search. Autonomy (LO:AU) and Fast Search & Transfer (NASDAQ:MSFT) are involved in a number of activities that generate revenue. For our purposes, we took the gross revenue and assumed it was from search-centric activities. For super platforms such as IBM (NYSE:IBM), Microsoft (NASDAQ:MSFT), Oracle (NASDAQ:ORCL), and SAP (NYSE:SAP), we looked at the companies’ Securities & Exchange Commission filings and found that search revenue was mashed into other lines of business, not separated as a distinct line item.

We knew that at these public companies search was not a major line of business, but search certainly contributed some revenue. I had some information about Microsoft’s search business in 2002, and I used those data to make a calculation about the contribution to revenue Web search, SQLServer search, and SharePoint search made to the company. I discounted any payments by Certified Partners with search systems for SharePoint. Microsoft was in 2002 and early 2003 actively supporting some vendors’ efforts to create “snap in” SharePoint search systems (for instance, Denmark’s Mondosoft). Google had yet to introduce its Google Search Appliance in 2003, so it was not a factor in our analysis.

I had done some work for various investment banks and venture capital firms on the average revenue generated in one year by a sample of privately-held search firms. Using these data were were able to calculate a target revenue per full time equivalent (FTE). Using the actual revenues from a dozen companies with which I was familiar, I was able to calibrate my FTE calculation and generate an estimated revenue for the privately-held firms.

After some number crunching without any spreadsheet fever goosing our model, we estimated that search-and-retrieval — excluding Web ad revenue — was in the $2.8 to $3.1 billion range for calendar 2003. However, we knew there was a phenomenon of claiming revenues before the search licensee actually transferred real money to the search vendor. A number of companies have been involved in certain questionable dealings regarding how search license fees were tallied. Some of these incidents have been investigated by various organizations or by investors. Others are subjects of active probes. I’m not at liberty to provide details, nor do I want to reveal the details of the “adjustments” we made for certain accounting procedures. The adjustment was that we decremented our gross revenue estimate by about one-third, pegging the “size” of the search market in 2003 at $1.7 to $2.2 billion.

The Gartner Estimate

If you have reviewed the data reported in InternetNews.com’s story, you can see that its $1.2 billion estimate is lower than our 2003 estimate. I’m not privy to the methodology used to generate this Gartner estimate. The author of the article (David Needle) did not perform the analysis. He is reporting data released by the Gartner Group (NYSE:IT), one of the giants in technology research business. The key bit for me in the new story is this:

Total software revenue worldwide from enterprise search will reach $989.7 million this year, up 15 percent from 2007, according to Gartner. By 2010 Gartner forecasts the market will grow to $1.2 billion. While the rate of growth will slow to low double digits over the next few years, Gartner research director Tom Eid notes enterprise search is a huge market.

Usually research company predictions err on the high side. In my files, I have notes about estimates of search and retrieval hitting the $9.0 billion mark in 2007, which I don’t think happened. If one includes Google and Yahoo, the $9.0 billion is off the mark by a generous amount. Estimates of the size of the search market are all over the map.

I assert that the Gartner estimate is low. When I reviewed the data for our 2003 calculation and made adjustments for the following factors, I came up with a different estimate. Here’s a summary of my notes to myself made when I retraced my 2003 analysis and looked at the data compiled for my new study Beyond Search:

  1. There’s been an increase in the number of vendors offering search and retrieval, content processing, and text analysis systems. In 2003, we had a list of about 110 vendors. The list I compiled for Beyond Search contains about 300 vendors. Of these 300, about 175 are “solid” names. Some of these like Delphes and Exegy are unknown to most of the pundits and gurus tracking the search sector. Others are long shots, and I don’t want to name these vendors in my Web log.
  2. A market shift has been created by Google’s market penetration. I estimate that Google (NASDAQ:GOOG) has sold about 8,500 Google Search Appliances (GSA). It has about 40 reseller / partners / integrators. Based on my research and without any help from Google, I calculated that the estimated revenue from the GSA revenue in FY2007 was in the $400 million range, making the its behind-the-firewall search business larger than the revenue of Autonomy and Fast Search & Transfer combined.
  3. Endeca’s reaching about $85 million in revenues in calendar 2007, colored by its success in obtaining an injection of financing from Intel (NASDAQ:INTC) and SAP.
  4. Strong financial growth by the search vendors in my Beyond Search market sector analysis, specifically in the category called “Up and Comers”. Several of the companies profiled in Beyond Search have revenues in the $6 to $10 million range for calendar 2007. I was then able to adjust the FTE calculation.

I made some other adjustments to my model. The bottom line is that the 2007 market size as defined in Search Engines: Evolution and Diffusion was in the $3.4 to 4.3 billion range, up from $1.7 to $2.2 billion in 2003. The growth, therefore, was solid but not spectacular. Year-on-year growth of Google, for example, makes the more narrow search-and-retrieval sector look anemic. The relative buy out of Fast Search & Transfer at $1.2 billion is, based on my analysis, generous. When compared to the Yahoo buyout of more than $40 billion, it is pretty easy to make a case that Microsoft is ponying up about 7X Fast Search’s 2007 revenue.

My thought is that the Gartner estimate should be viewed with skepticism. It’s as misleading to low ball a market’s size as it is to over state it. Investors in search and retrieval have to pump money into technology based on some factor other than stellar financial performance.

Taken as a group, most companies in the search and retrieval business have a very tough time generating really big money. Look at the effort Autonomy (LO:AU), Endeca, and Fast Search (NASDAQ:MSFT) have expended to hit their revenue in FY2007. I find it remarkable that so many companies are able to convince investors to ante up big money with relatively little hard evidence that a newcomer can make search pay. Some companies have great PR but no shipping products. Other companies have spectacular trade show exhibits and insufficient revenues to remain in business (for instance, the Entopia system profiled on the Web log).

Some Revenue Trends to Watch in the Search Sector

Let me close by identifying several revenue trends that my research has brought to light. Alas, I can’t share the fundamental data in a Web log. Here are several points to keep in mind:

  1. Search — particularly key word search — is now effectively a commodity; therefore, look to more enterprise systems with embedded search functions that can handle broader enterprise content. This is a value add for the vendor of a database management system or a content management system. This means that it will get harder, not easier, to estimate how much of a company’s revenue comes from its search and content processing technology.
  2. Specialized vendors — see the Delphes case — can build a solid business by focusing on a niche and avoiding the Madison Avenue disease. This problem puts generalized brand building before one-on-one selling. Search systems need one-on-one selling. Brand advertising is, based on my research, a waste of time and money. It’s fun. Selling and making a system work is hard. More vendors need to tackle the more difficult tasks, not the distractions of building a brand. These companies, almost by definition, may be “off the radar” of the pundits and gurus who take a generalist’s view of the search sector.
  3. There will be some consolidation, but there will be more “going dark” situations. These shutdowns occur when the investors grow impatient and stop writing checks. I have already referenced the Entopia case, and I purposely included it in my Web log to make a point that sales and revenue have to be generated. Technology alone is not enough in today’s business environment. I believe that the next nine to 18 months will be more challenging. There are too many vendors and too few buyers to absorb what’s on offer.
  4. A growing number of organizations with incumbent disappointing search systems will be looking for ways to fix what’s broken fast. A smaller percentage will look for a replacement, an expensive proposition even when the swap goes smoothly. This means that “up and comers” and some vendors with technology that can slap a patch on a brand-name search system can experience significant growth. I name the up and comers and vendors to watch in Beyond Search but not in this essay.
  5. The geyser of US government money to fund technology to “fight terrorism” is likely to slow, possibly to a mere trickle. Not only is there a financial “problem” in the government’s checking account, a new administration will fiddle with priorities. Therefore, some of the vendors who derive the bulk of their revenue from government contracts will be squeezed by a revenue shortfall. The sales cycle for a search or content processing system is, unfortunately, measured in months, even years. So, a fast ramp of revenue from commercial customers is not going to allow the companies to rise above the stream of red ink.

To close, the search market has been growing. It is larger than some believe, but it is not as large as most people wish it were. In 2008, tectonic plates are moving the business in significant ways. Maybe the Gartner prediction is predicting the post-crash search market size? I will print out Mr. Needle’s story and do some comparisons in a year, maybe two from now.

Stephen Arnold, February 19, 2008

Blossom Software’s Dr. Alan Feuer Interviewed

February 18, 2008

You can click here to read an interview with Dr. Alan Feuer. He’s the founder of Blossom Software, a search-and-retrieval system that has carved out a lucrative niche. In the interview, Dr. Feuer says:

Degree of magic” is a telling scale for classifying search engines. At one end are search engines that take queries very literally; at the other are systems that try to be your intimate personal assistant. Systems high on the magic scale make hidden assumptions that influence the search results. High magic usually implies low transparency. Blossom works very hard to get the user results without throwing too much pixie dust in anyone’s eyes.

Dr. Feuer is a former Bell Labs’s researcher, and he has been one of the leaders in providing hosted search as well as on-premises installations of the Blossom search-and-retrieval system. I used the Blossom system to index the Federal Bureau of Investigation’s public content when a much higher profile vendor’s system failed. I also used the Blossom technology for the U.S. government funded Threat Open Source Information Gateway.

The FBI content was indexed by Blossom’s hosted service and online within 12 hours. The system accommodated the FBI’s security procedures and delivered on-point results. Once the incumbent vendor’s system had been restored to service, the Blossom hosted service was retained for one year as a hot fail over. This experience made me a believer in hosted search “Blossom style”.

Click here for the full interview. For information about Blossom, click Blossom link

Stephen Arnold, February 18, 2008

Delphes: A Low-Profile Search Vendor

February 17, 2008

Now that I am in clean up mode for Beyond Search, I have been double-checking my selection of companies for the “Profiles” section of the study. In a few days, I will make public a summary of the study’s contents. The publisher — The Gilbane Group — will also post an informational page. Publication is likely to be very close to the previously announced target of April 2008.

Yesterday, I used the Entopia system as the backbone of a mini-case study. Today — Sunday, February 17, 2008 — I want to provide some information about an interesting company not included in my Beyond Search study.

The last information I received from this company arrived in 2006, but the company’s Web site explicitly copyrights its content for 2008. When I telephoned on Friday, February 15, 2008, I went to voice mail. Therefore, I believe the company is in business.

Delphes, in the tradition of search and content processing companies, is a variant of the English word Delphi. You are probably familiar with the oracle of Delphi. I think the name of the company is intended to evoke a system that speaks with authority. As far as I know, Delphes is a private concern and concentrates its sales and marketing efforts in Canada, Francophone nations, and Spain. When I mention the name Delphes to Americans, I’m usually met with a question, “What did you say?” Delphes has a very low profile in the United States. I don’t recall seeing the company on the program of the search-and-retrieval conferences I attended in 2006 or 2007, but I go to a small number of shows. I may have overlooked the company’s sessions.

The Company’s Approach

The “guts” of the Delphes’ search-and-retrieval system is based on natural language processing embedded in a platform. The firm’s product is marketed as Diogene, another Greek variant. Diogenes, as you know, was a popular name in Greece. I assume the Diogenes to which Delphes is derived is Diogenes of Sinope, sometimes remembered as the Cynic More information about Diogenes of Sinope is here.)

Diogene extracts information using “dynamic natural language processing”. The iterative, linguistic process generates metadata, tags concepts, and classifies information processed by the system.
The company’s technology is available in enterprise, Web, and personal versions of the system. DioWeb Enterprise is the behind-the-firewall version of the product. You can license from the company DioMorpho which is for an individual user on a single workstation. Delphes works through a number of partners, and you can deal directly with the company for an on-premises license or an OEM (original equipment manufacturing) deal. Its partners include Sun Microsystems, Microsoft, and EMC, among others.

When I first looked at Delphes in 2002, the company had a good reputation in Montréal (Québec), Toronto and Ottawa (Ontario). The company’s clients now include governmental agencies, insurance companies, law firms, financial institutions, healthcare institutions, and consulting firms, among others. You can explore how some of the firm’s clients use the firm’s content processing technology by navigating to the Québec International Portal. The search and content processing for this Web site is provided by Delphes.

The company’s Web site includes a wealth of information about the architecture of the system, its features and functions, and services available from the company. The company offers a PDF that describes in a succinct way the features of what the company calls its “Intelligent Knowledge Management System”. You can download the IKMS overview document here.

Architecture

Information about the technical underpinnings of Delphes is sketchy. I have in my files a Delphes document called “The Birth of Digital Intelligence: Extranet and Internet Solutions”. This information, dated 2004, includes a high-level schematic of the Delphes system. Keep in mind that the company has enhanced its technology, but I think we can use this diagram to form a general impression of the system. Note: these diagrams were available in open sources, and are copyrighted by Delphes.

system archtecture

The “linguistic soul” of the system is encapsulated in two clusters of sub systems. First, there is the “advanced analysis” for content processing. This set of functions performs semantic analysis, which “understands” each processed document. The second system permits cross-language operation. Canada is officially bilingual, so for Delphes to make sales in Canadian agencies, the system must handle multiple languages and have a means to permit a user to locate information using either English or French.

The “body” of the system includes a distributed architecture, multi-index support, a federating function, support for XML and Web services. In short, Delphes followed the innovation trajectory of Autonomy (LO:AU), Endeca, and Fast Search & Transfer (NASDAQ:MSFT). One can argue that Delphes has a system of comparable sophistication that permits the same customization and scaling.

Delphes makes a live demo available in a side-by-side comparison with Google. The content used for the demo comes from the Cisco Systems’ Web site. You can explore this live implementation in the Delphes demo here. The interface incorporates a number of functions that strike me as quite useful. The screen shot below comes from the Delphes document from which the systems diagram was extracted. Portions of the graphic are difficult to read, but I will summarize the key features. You will be able to get a notion of the default interface, which, of course, can be customized by the licensee.

delphes_interfacefeatures

The results of the query high speed access through cable appear in the main display. Note that a user can select “themes” (actually a document type) and a “category”.

Each “hit” in the results list includes an extract from the most relevant paragraph in the source document that matches the query. In this example, the query terms are not matched exactly. The Delphes system can understand “fuzzy” notions and use them to find relevant documents. Key word indexing systems typically don’t have this functionality. With a single click, the user can launch a second query within the subset. This is generally known as “search within results.” Many search systems do not make this feature available to their users.

Notice that a link is available so the user can send the document with one-click to a colleague. The hit also includes a link to the source document. A link is provided so the user can jump directly to the next relevant paragraph in a hit. This feature eliminates scrolling through long documents looking for results. Finally, the hit provides a count of the number of relevant paragraphs in a source document. A long document with a single relevant paragraph may not be as useful to a user as a document with a larger number of relevant paragraphs.

Based on my notes to myself about the Delphes system, I identified the following major functions of DioWeb. Forgive me if I blur some functions from the DioWeb product. I can no longer recall the boundaries of each product. Delphes, I’m confident, can set you straight if I go off track.

First, the system can perform search-and-retrieval tasks. The interface permits free text and natural language querying. The system’s ability to “understand” content eliminates the shackles of the key word Boolean search technology. Users want the search box to be more understanding. Boolean systems are powerful but not understood by most users. Delphes describes its semantic approach as using “key linguistic differentiators”. I explain these functions briefly in Beyond Search, so I won’t define each of these concepts in this essay. Delphes uses syntax, disambiguation, lemmatization, masks, controlled term lists, and automatic language recognition, among other techniques.

Second, the system can federate content from different systems and further segment processed content by document type. Concepts can be used to refine a results list. Delphes defines concepts as proper nouns, dates, product names, codes, and other types of metadata.

Third, the system identifies relevant portions of a hit. A user can see only those portions of the document or browse the entire document. A navigator link allows the user to jump from relevant paragraph to relevant paragraph without the annoying scrolling imposed by some other vendors’ approaches to results viewing.

Fourth, the system can generate a “gist” or “summary” of a result. This feature extracts the most important portions of each hit and makes them available in a report. The system’s email link makes it easy to send the results to a colleague.

Fifth, Delphes includes what it calls a “knowledge manager”. I’m generally suspicious of KM or knowledge management systems. Delphes’ implementation strikes me as a variation on the “gist” or “summary” feature. The user can add comments, save the results, or perform other housekeeping functions. A complementary “information manager” function generates a display that shows what reports a user has generated. If a user sends a report to a colleague, the dashboard display of the “information manager” makes it possible to see that the colleague added a comment to a report. Again, this is useful housekeeping stuff, not the more esoteric functions described in my earlier summary of the Entopia approach.

What Can We Learn?

My goal for Beyond Search was to write a study with fewer than 200 pages, minimizing the technical details to focus on “what’s in it for the licensee”. Beyond Search is going to run about 250 pages, and I had to trim some information that I thought was important to readers. Delphes is an interesting vendor, and it offers a system that has a number of high-profile, demanding licensees in Canada, Europe, and elsewhere.

The reason I wanted to provide this brief summary — fully unauthorized by the company — was to underscore what I call the visibility problem in behind-the-firewall search.

Reading the information from the major consultancies and pundits who “cover” this sector of the software business, Delphes is essentially invisible. However, Delphes does exist and offers a competitive system that can go toe-top-toe with Autonomy, Endeca, and Fast Search & Transfer. One can argue that Delphes can enhance a SharePoint environment and match the functionality of a custom system built from IBM’s (NYSE:IBM) WebSphere and Ominifind components.

What’s does this discussion of Delphes tell us?

If you rely on the consultants and pundits, you may not be getting the full story. Just as I had to chop information from Beyond Search, others exercise the same judgment. This means that when you ask, “Which system is best for my requirements?” — you may be getting at best an incomplete answer. You may be getting the wrong answer.

A search for Delphes on Exalead, Live.com (NASDAQ:MSFT), Google (NASDAQ:GOOG), and Yahoo (NASDAQ:YHOO) is essentially useless. Little of the information I provide in this essay is available to you. Part of the problem is that the word Delphes is perceived by the search systems as a variant of Delphi. You learn a lot about tourism and not too much about this system.

There are two key points to keep in mind about search-and-retrieval systems:

  1. The “experts” may not know about some systems that could be germane to your needs. If the “experts” don’t know about these systems, you are not going to get a well-rounded analysis. The phrase that sticks in my mind is “bright but uninformed”. This can be a critical weak spot for some “experts”.
  2. The public Web search systems do a pretty awful job on certain types of queries. It is worth keeping this in mind because in the last few weeks, Google’s market share of Web search is viewed as a “game over” market. I’m not so sure. People who think the “game is over” in search are “bright but uninformed”. Don’t believe me. Run the Delphes query and let me know your impression of the results. (Don’t cheat and use the product names I include in this essay. Start with Delphes and go from there.)

In closing, contrast Entopia with Delphes. Both companies asserted in 2004 – 2006 similar functionality. Today, the high-profile Entopia is nowhere to be found. The lower-profile Delphes is still in business.

Make no mistake. Search is a tough business. Delphes illustrates the usefulness of focusing on a market, not lighting up the sky with marketing fireworks. I would like to ask the Delphic oracle in Greece, “What’s the future of Delphes?” I will have to wait and see. I’m not trekking to Greece to look at smoke and pigeon entrails. I do know some search engine “pundits” who may want to go. Perhaps the Delphic oracle will short cut their learning about Delphes?

Stephen Arnold, February 17, 2008

Entopia: A Look Back in Time

February 16, 2008

Periodically I browse though my notes about behind-the-firewall systems, content processing solutions, and information retrieval start ups. I think Entopia, a well-funded content processing company founded in 1999, shut down, maybe permanently some time in 2006.

In my “Dormant Search Vendors” folder, I keep information about companies that had interesting technology but dropped off my watch list. A small number of search vendors are intriguing. I revisit what information I have in order to see if there are any salient facts I have overlooked or forgotten.

KangarooNet and Smart Pouches

Do you remember Entopia? The company offered a system that would key word index, identify entities and concepts, and allow a licensee to access information from the bottom up. The firm open its doors as KangarooNet. I noticed the name because it reminded me of the whimsical Purple Yogi (now Stratify). Some names lure me because they are off-beat if not too helpful to a prospective customer. I do recall that the reference to a kangaroo was intended to evoke something called a “smart pouch”. The founders, I believe, were from Israel, not Australia. I assumed some Australian tech wizards had crafted the “smart pouch” moniker, but I was wrong.

Do you know what a “smart pouch” is? The idea is that the kangaroo has a place to keep important items such as baby kangaroos. The Entopia “smart pouch” was a way to gather important information and keep it available. Users could share “smart pouches” and collaborate on information. Delicious.com’s bookmarks provide a crude analog of a single “smart pouch” function.

I recall contacting the company in 2000, but I had a difficult time understanding how the company’s system would operate at scale in an affordable way. Infrastructure and engineering support costs seemed likely to be unacceptably high. No matter what the proposed benefits of a system, if the costs are too high, customers are unwilling to ink a deal.

Shifting Gears: New Name, New Positioning

Entopia is a company name derived from the Greek word entopizo. For those of you whose Greek is a rusty, the verb means to locate or bring to light. Entopia’s senior technologists stressed that their K-Bus and Quantum systems allowed a licensee to locate and make use of information that would otherwise be invisible to some decision makers.

When I spoke with representatives of the company at one of the Information Today conferences in New York, New York, in 2005. I learned that Entopia was, according to the engineer giving me the demo, was “a third-generation technology”. The idea was that Entopia’s system would supplement indexing with data about the document’s author, display Use For and See Also references, and foster collaboration.

I noted that I also spoke with Entopia’s vice president of product management, David Hickman, a quite personable man as I recall. My notes included this impression:

Entopia wants to capture social aspects of information in an organization. Relationships and social nuances are analyzed by Entopia’s system. Instead of a person looking at a list of possibly relevant documents, the user sees the information in the context of the document author, the author’s role in the organization, and the relationships among these elements.

In my files, I found this screen shot of Entopia’s default search results display. It’s very attractive, and includes a number of features that systems now in the channel do not provide. For example, if you had access to Entopia’s system in 2006 prior to its apparent withdrawal from the market, you could:

  • See concepts, people, and sources related to your query. These appear in the left hand panel on the screen shot below
  • Get a results list with the creator, source, date, and relevance score for each item clearly presented. In contrast to the default displays used by some of the company’s in my Beyond Search study, Entopia’s interface is significantly more advanced
  • The standard search box, a hot link to advanced search functions, and one-click access to saved searches keep important but little used functions front and center.

When the firm was repositioned in 2003, the core product was named, according to my handwritten notes, the “K-Bus Knowledge Extractor”. I think the “k” in K-Bus is a remnant of the original “kangaroo” notion. I wrote in my notes that Entopia was a spin out from an outfit called Omind and Global Catalyst Partners.

[Screen shot: Entopia's default search results display]

Other features of the Entopia system were:

  • Support for knowledge bases, taxonomies, and controlled term lists
  • An API and a software development kit
  • Support for natural language processing
  • Classification of content
  • Enhanced metatagging

The K-Bus technology was enhanced with another software component called Quantum. The software created a collaborative workspace. The idea was that system users could assemble, discuss, and manipulate the information processed by the K-Bus. This is the original SmartPouch technology that allowed a user to gather information and keep it in a virtual workspace.

System Overview

In my Entopia folder, I found white papers and other materials given to me by the company. Among the illustrations was this high-level view of the Entopia system.

[Figure: High-level view of the Entopia system]

Several observations are warranted even though the labels in the figure are not readable. First, licensees had to embrace a comprehensive information platform. In the 2005 – 2006 period, a number of content processing vendors had added the word "platform" to their marketing collateral. Entopia, to its credit, does a good job of depicting how significant an investment is required to make good on the firm's assertions about discovering information.

Second, it is clear that the complex interactions required to make the system work as advertised cannot tolerate bottlenecks. A slow down in one component ripples through the rest. For instance, the horizontal gray rectangle in the center of the diagram is the "Session Facade Beans" subsystem. If these processes slow down, the Web framework in the horizontal blue box above it slows down, and user access slows with it. Another hot spot is the Data Access Module, the gray rectangle below the one just referenced. A problem in this component prevents the metadata from being accessed. In short, a heck of an infrastructure of systems, storage, and bandwidth is needed to keep the system performing at acceptable levels.
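To make the bottleneck point concrete, here is a minimal sketch in Python of why a serialized request path behaves this way. The tier names and latency figures are hypothetical, not drawn from Entopia's documentation; the point is only that when every request passes through every layer, end-to-end latency is the sum of the layers, so a single degraded component is felt system-wide.

    # Minimal sketch (not Entopia's code): why one slow tier drags down the
    # whole request path when subsystems are called in series.
    from dataclasses import dataclass

    @dataclass
    class Tier:
        name: str
        latency_ms: float  # average time this tier adds to each request

    def end_to_end_latency(tiers):
        """A request that passes through every tier pays the sum of their latencies."""
        return sum(t.latency_ms for t in tiers)

    # Hypothetical numbers for the layers sketched in the Entopia diagram.
    pipeline = [
        Tier("Web framework", 40),
        Tier("Session facade", 30),
        Tier("Data access module", 60),
        Tier("Metadata repository", 80),
    ]
    print(end_to_end_latency(pipeline))  # 210 ms when everything is healthy

    # If the data access module degrades (say, a saturated metadata store),
    # every user-facing request inherits the slowdown.
    pipeline[2].latency_ms = 600
    print(end_to_end_latency(pipeline))  # 750 ms: one hot spot, system-wide pain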

Finally, the complexity of the system appears to require on-site support and in some cases, technical support from Entopia. A licensee’s existing information technology staff could require additional headcount to manage this K-Bus architecture.

As I scanned these notes, now more than two years old, I was struck by the fact that Entopia was on the right track. The buzz about social search makes sense, particularly in an organization where one-to-one relationships occur outside the hierarchical organizational structure. Software can provide some context for knowledge workers who are often monads, responsible to other monads, not the organization as a whole.

Entopia wanted to blend expertise identification, content visualization, social network analysis, and content discovery into one behind-the-firewall system. I noted that the company’s system started at $250,000, and I assume the up-and-running price tag would be in the millions.

When I asked, "Who are Entopia's customers?", I learned that Saab, the US government, Intel, and Boeing were licensees. Those were blue-chip names, and I thought that these firms' use of the K-Bus indicated Entopia would thrive. Entopia was among the first search vendors to integrate with Salesforce.com. The system also allowed a licensee to invoke the Entopia functions within a Word document.

What Can We Learn?

Entopia seems to have gone dark quietly in the last half of 2006. My hunch is that the intellectual property of the company has been recycled. Entopia could be in operation under a different corporate name or incorporated as a proprietary system in other content processing systems. When I clicked on the Entopia.com Web address in my folder, a page of links appeared. Running queries on Live.com, Google, and Yahoo returned links to stale information. If Entopia remains in business, it is doing a great job of keeping a low profile.

If you read my essay "Power Leveling", you know about two common challenges in search and content processing. The first is getting caught in a programming maze: a fix built to solve a particular problem fails to meet the licensee's broader needs. The second is that when the system developer assembles these local solutions, the overall result is not efficient. Instead of driving straight from Point A to Point B, the system iterates and explores every highway and byway. Performance becomes a problem. To get the system to go fast, capital investment is necessary. When licensees can't or won't spend more on hardware, the system remains sluggish.

Entopia, on the surface, appears to be an excellent candidate for further analysis. My cursory looks at the system in 2001, again in 2005, and finally in 2006 revealed considerable prescience about the overall direction of the content processing market. Some of the subsystems were very clever and well in advance of what other vendors had on the market. The use of social metadata in search results was quite useful. My recollection is now hazy, but I had noted that when these clever subsystems were hooked together, response time was sluggish. Maybe it was. Maybe it wasn't. The point is that a complex system like that illustrated above would require on-going work to keep operating at peak performance.

Unfortunately, I don’t have an Entopia system to benchmark against the systems of the 24 companies profiled in Beyond Search. I wanted to include this Entopia information, but I couldn’t justify a historical look back when there was so much to communicate about systems now in the channel.

In Beyond Search, I don't discuss the platforms available from Autonomy, Endeca, Fast Search & Transfer, IBM, and Oracle. I do mention these companies to frame the new players and little-known up-and-comers that figure in Beyond Search. I would like to conclude this essay with several broad observations about the perils of selling platforms to organizations.

First, any company selling a platform is essentially trying to obtain a controlling or central position in the licensee’s organization. A platform play is one that has a potentially huge financial pay off. A platform is a sophisticated “lock in”. Once the platform is in position, competitors have a difficult time making headway against the incumbent platform.

Second, the platform is the core product of IBM (NYSE:IBM), Microsoft (NASDAQ:MSFT), and Oracle (NASDAQ:ORCL). One might include SAP (NYSE:SAP) in this list, but I will omit the company because it's in transition. These Big Three have the financial and market clout to compete with one another. Smaller outfits pushing platforms have to out market, out fox, and out deliver any of the Big Three. After all, why would an Oracle DBA want another information processing platform in an all-Oracle environment? IBM and Microsoft operate with almost the same mind set. Smaller platform vendors, perhaps including Autonomy (LON:AU) and Endeca in this category, are likely to face increasing pressure to mesh seamlessly with whatever a licensee has. If this is correct, Fast Search's ESP has a better chance going forward than Autonomy. It's too early to determine if Endeca's deal with SAP will pay similar dividends. You can decide for yourself if Autonomy can go toe-to-toe with the Big Three. From my observation post in rural Kentucky, Autonomy will have to shift into a higher gear in 2008.

Third, super-advanced systems are vulnerable in business environments where credit is tight, sales are in slow or low growth cycles, and a licensee's technical team may be understaffed and overworked.

In conclusion, I think Entopia was a forward-thinking company. Its technology anticipated market needs now more clearly discernible. Its system was slick, anticipating some of the functionality of the Web 2.0 boom. The company demonstrated a willingness to abandon overly cute marketing for more professional product and company nomenclature. The company did apparently have one weakness: too little revenue. Entopia, if you are still out there, please, let me know.

Stephen Arnold, February 16, 2008

Search Musical Chairs

February 15, 2008

Running a search business is tough. Being involved in search and retrieval is no picnic either. The game of musical chairs that dominates the news I review comes as no surprise.

For example, Yahoo's Bradley Horowitz, head of advanced projects, pulls into the Google parking lot now. You can read his "unfortunate timing" and "I really love Yahoo" apologia at the Horowitz apologia link. The executive shifts at Microsoft are too numerous for me to try and figure out. The search wizard from Ask.com, Steve Berkowitz, has turned in his magic wand to Microsoft security. You can read more about that at the Microsoft shuffle link. The low profile SchemaLogic lost one of its founders a month ago, although the news was largely overlooked by the technical media. Then, in Seattle on February 13, I heard that changes are afoot in Oracle's secure enterprise search group. In short, the revolving doors in search and retrieval keep spinning.

But there are even larger games afoot (link). For example, T-Mobile embraced Yahoo. Almost simultaneously, Nokia snuggled up to Google. (Note: the links to these news stories go dark without warning, and I can't archive the original material on my Web site due to copyright considerations.) The world of mobile search continues to be fluid, and we haven't had the winner of the FCC spectrum auction announced yet. As these larger tie ups play out, I want to keep my eye on telco search companies that are off the radar; for example, Fast Search & Transfer's mobile licensees might be jetting to different climes when the Microsoft acquisition is completed. A certain large behind-the-scenes vendor of mobile search is likely to be among the first to seek a new partner.

At the next higher level, the investment banks continue to take a close look at their exposure in search and related sectors. With more than 150 companies actively marketing search, content processing, and utilities, some major financial institutions are becoming increasingly concerned. What once looked like a very large, open-ended opportunity has a very different appearance. The news that Google touches more than 60 percent of online search traffic leaves little wiggle room for online search competitors in the US and Europe. Asia seems to be a different issue, but in the lucrative US market, Google is the factor. In the behind-the-firewall sector Microsoft – Fast and Google seem destined to collide. With that looking increasingly likely, IBM and Oracle will have to crank up their respective efforts.

In short, at the executive level, sector level, and investment level, speed dating is likely to be a feature of the landscape for the next six to nine months. If someone were to ask me to run a search-centric company, I would push the button on my little gizmo that interrupts telephone calls with bursts of static. The MBAs, lawyers, and accountants who assume leadership positions in search-centric companies are wiser, braver, and younger than I. Unfortunately, as the bladder of their confidence swells, the substance behind that confidence may prove thin indeed.

I have resisted making forecasts about what the major trends in search and retrieval will be in 2008. I can make one prediction and feel quite certain that it will hold true.

The executive turnover in the ranks
of search and content processing
companies will churn, flip,
and flop throughout 2008.

The reason? Too many companies chasing too few opportunities. The wide open spaces of search are beginning to close. Beyond Search contains a diagram showing how the forces of Lucene, up-and-coming vendors with value-priced systems, and embedded search from super platforms like IBM, Microsoft, and Oracle are going to make life very interesting for certain companies. When the pressure increases, the management chair becomes a hot seat. The investors get cranky, and the Bradley Horowitzes of the world find a nice way to say, "Adios."

Stephen Arnold, February 15, 2008

Search’s Old Chestnuts Roasted and Crunched

February 14, 2008

Once again, I'm sitting in the Seattle – Tacoma airport waiting for one of the inconvenient flights from the cold, damp Northwest to the ice-frosted coal tailings of rural Kentucky.

Chewing through the news stories that flow to me each day, I scanned Nick Patience's article "Fast Solutions to Lost Causes," dated February 13, 2008, in which Mr. Patience provides CIO Magazine's readers with his commentary on the Microsoft – Fast Search deal. The subtitle of the article reads, "Nick Patience examines how Microsoft's acquisition of FAST could help CIOs get to grips with the useful storage of elusive information."

For the last 48 hours, I have dipped in and out of the Seattle business community where chatter about Microsoft is more common than talk about the US presidential primaries or the Seattle weather. After listening to opinions ranging from “the greatest thing since sliced bread” to “the deal was done without much thought”, I was revved and ready to get the CIO Magazine’s take on this subject.

Several points strike me as old chestnuts; that is, familiar information presented as fresh vegetables. For example, let me pull out three points and then offer a slightly different take on each of them. You can judge for yourself what you want to perceive about this deal.

First, the title is “Fast Solutions to Lost Causes.” As I read the title, it seems to say, “Microsoft has a lost cause.” So, how can an acquisition offer a “fast solution” to a “lost cause”? If a cause is lost, a solution is not available. A “fast solution” to a big problem is almost a guarantee that the fast solution will amplify the original, unsolved problem. Puzzling to me. I think this is one of those catchy headlines so liked by editors. Google’s indexing robot will be challenged to tag this story correctly based on the metaphor-charged word choice. But that’s a grumpy old man’s viewpoint.

Now, the second point. Mr. Patience asserts, "But Microsoft has done very little — even after the introduction of SharePoint in 2001 — to help CIOs not only get their arms around all this unstructured information, but more pertinently, to figure out what it is, where it is, how valuable or risky it is and how most effectively to store it." Based on the research I did for the first three editions of the Enterprise Search Report and my regular consulting business, I don't agree. Microsoft's apparent lack of interest in search does not match what I know. Specifically, in the early years of this decade, Microsoft relied on its partners to develop solutions using Microsoft tools. These solutions would "snap in", amplify, and expand the ecosystem for Microsoft products, services, and certified vendors. The existence of dtSearch, Mondosoft (now part of SurfRay), Coveo, and other search systems that integrate more or less seamlessly with SharePoint is an example of this Microsoft strategy. I'm not saying Microsoft chose the optimal strategy, but to suggest that the company lacked a strategy misrepresents one of Microsoft's methods for delivering on its "agenda". Could Microsoft have approached behind-the-firewall search differently? Sure. Would these unexercised options have worked better than the ecosystem approach? Who knows? The acquisition of Fast Search & Transfer is a new strategy. Coveo, for example, has expanded its operating system support, added functions, and looked into non-Microsoft markets because Microsoft seemed to be shifting from one strategic approach to a different one. But at the same time some vendors were decreasing their reliance on Microsoft, others like Autonomy created adapters to make their systems more compatible with SharePoint environments. This is not a grumpy old man's view; these are the facts of one facet of Microsoft's business model.

Third, Mr. Patience implicitly references the search initiatives at Oracle (SES 11g, the Triple Hop acquisition, the Google partner deal) and IBM (Omnifind, WebFountain, the iPhrase acquisition, the X1 tie up, deals with Endeca and Fast) as examples of more satisfying market tactics than Microsoft's. No grumpiness here; those comparisons are just stabs in the dark, the way I perceive reality.

As I stated in the three editions of Enterprise Search Report done on my watch, none of the superplatforms implemented effective behind-the-firewall strategies. Each of these companies tried different approaches. Each of these companies pushed into behind-the-firewall search with combinations of in-house development and acquisitions. Each of these companies experienced some success. Each of these companies' strategies remains, at this time, a work in progress. I'm not grumpy about this. This is just plain old corporate history. IBM and Oracle have been trying to crack the behind-the-firewall chestnut (metaphor intended). So far, both companies have only chipped teeth to show for their decades of effort.

I urge you to read Mr. Patience’s article and take a thorough look at other reports from the 451 Group. You will find some useful insights. Keep in mind, however, that when the chestnuts are broken open, the meat revealed may be quite different from the shell’s surface.

These three companies — IBM, Microsoft, and Oracle — have deep experience with behind-the-firewall search. Oracle’s efforts extend to the late 1980s when the database company acquired Artificial Linguistics. IBM’s efforts reach back to mainframe search with the STAIRS system, which is still available today as System Manager, and Microsoft’s search efforts have been one of the firm’s R&D centroids for many, many, many years.

Success is a different issue altogether. There are many reasons why none of these firms has emerged as the leader in behind-the-firewall search. But it is much more comforting to grab a handful of chestnuts than go find the chestnut tree, harvest the nuts, roast them, and then consume their tasty bits. What we know is that Microsoft is willing to pay $1.2 billion to try and go faster.

Stephen Arnold, February 14, 2008

Context: Popular Term, Difficult Technical Challenge

February 13, 2008

In April 2008, I’m giving a talk at Information Today’s Buying & Selling Econtent conference.

When I am designated as a keynote speaker, I want to be thought provoking and well prepared. So I try to start thinking about the topic a month or more before the event. As I was ruminating about my topic, I was popping in and out of email. I was doing what some students of human behavior might call context shifting.

The idea is that I was doing one thing (thinking about a speech) and then turning my attention to email or a telephone call. When I worked at Booz, Allen, my boss described this behavior as multi-tasking, but I don't think I was doing two or three things at once. He was, like Einstein, not really human. I'm just a guy from a small town in Illinois, trying to do one thing and not screw it up. So I was doing one thing at a time, just jumping from one work context to another. Normal behavior for me, but I know from observation that my 86-year-old father doesn't handle this type of function as easily as I do. I also know that my son is more adept at context shifting than I am. Obviously it's a skill that can deteriorate as one's mental acuity declines.

What struck me this morning was that in the space of a half hour, one email, one telephone call, and one face-to-face meeting each used the word “context”. Perhaps the Nokia announcement and its use of the word context allowed me to group these different events. I think that may be a type of meta tagging, but more about that notion in a moment.

Context seemed to be a high-frequency term in the last 24 hours. I don't need a Markov procedure to flag the term. The Google Trends report seems to suggest that context has been in a slow decline since the fourth quarter of 2004. Maybe so, but "context" was le mot du jour for me.

What’s Context in Search?

In my insular world, most of the buzzwords I hear pertain to search and retrieval, text processing, and online. After thinking about the word context, I jotted down the different meanings the word had in each of the communications I noticed.

The first use of context referenced the term as I defined it in my 2007 contributions to the Bear Stearns analyst note, "Google and the Semantic Web." I can't provide a link to this document. You will have to chase down your local Bear Stearns broker to get a copy. This report describes the inventions of Ramanathan Guha. The PSE, or Programmable Search Engine, discerns and captures context for a user's query, the information satisfying that query, and other data that provide clues to interpret a particular situation.

The second use of context was as a synonym for personalization. The idea was that a user profile would provide useful information about the meaning of a query. Suppose a user looks for consumer information about gasoline mileage. When the system "knows" this fact, a subsequent query for "green fuel" is processed in the context of an automobile. In this case, "green" means environmentally friendly. Personalization makes it possible to predict a user's likely context based on search history and implicit or explicit profile information.
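Here is a minimal sketch, in Python, of how that kind of profile-driven disambiguation could work. The interest phrases, sense inventory, and scoring are my own illustration, not any vendor's method; the idea is simply that overlap with recent behavior decides which sense of an ambiguous query wins.

    # Toy disambiguation: a recent-interest profile biases which sense of the
    # ambiguous query "green fuel" is chosen.
    RECENT_INTERESTS = ["gasoline mileage", "hybrid cars", "oil change intervals"]

    # Hypothetical sense inventory for the ambiguous query.
    SENSES = {
        "automotive": {"gasoline", "mileage", "cars", "hybrid", "oil"},
        "environmental policy": {"regulation", "emissions", "legislation"},
    }

    def pick_sense(profile, senses):
        """Score each sense by overlap with words seen in the user's history."""
        history_words = {word for phrase in profile for word in phrase.split()}
        scores = {name: len(vocab & history_words) for name, vocab in senses.items()}
        return max(scores, key=scores.get)

    print(pick_sense(RECENT_INTERESTS, SENSES))  # -> "automotive"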

The third use of context came up in a discussion about key word search. My colleague made the point that most search engines are “pretty dumb.” “The key words entered in a search box have no context,” he opined. The search engine, therefore, has to deliver the most likely match based on whatever data are available to the query processor. A Web search engine gives you a popular result for many queries. Type Spears into Google and you get pop star hits and few manufacturing and weapon hits.

When a search engine “knows” something about a user — for example, search history, factual information provided when the user registered for a free service, or the implicit or explicit information a search system gathers from users — search results can be made more on point. The idea is that the relevance of the hits matches the user’s needs. The more the system knows about a user and his context, the more relevant the results can be.

Sometimes the word context, when used in reference to search and retrieval, means "popping up a level" in order to understand the bigger picture for the user. Context, therefore, makes it possible to "know" that a user is moving toward the airport (geo spatial input), has a history of looking at flight departure information (user search history), and is making numerous data entry errors (implicit monitoring of user misspellings or query restarts). These items of information can be used to shape a results set. In a more extreme application, these context data can be used to launch a query and "push" the information to the user's mobile device. This is the "search without search" function I discussed in my May 2007 iBreakfast briefing, which, alas, is not available online at this time.
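A small sketch makes the "search without search" idea concrete. This is my own illustration, with invented signal names and thresholds; the point is that several independent context signals, none conclusive on its own, can be combined to trigger a proactive result.

    # Toy "search without search": combine context signals and, when enough of
    # them line up, push a likely-useful result without an explicit query.
    def should_push_flight_info(context):
        signals = [
            context.get("heading_to_airport", False),       # geo spatial input
            context.get("often_checks_departures", False),  # search history
            context.get("typing_error_rate", 0.0) > 0.2,    # struggling to type
        ]
        return sum(signals) >= 2  # require at least two corroborating signals

    context = {
        "heading_to_airport": True,
        "often_checks_departures": True,
        "typing_error_rate": 0.35,
    }

    if should_push_flight_info(context):
        print("Push departure times and gate changes to the mobile device")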

Is Context Functionality Ubiquitous Today?

Yes, there are many online services that make use of context functions, systems, and methods today.

Even though context systems and methods add extra computational cycles, many companies are knee-deep in context and its use. I think the low profile of context functions may be due, in part, to privacy issues becoming the target of a media blitz. In my experience, most users accept implicit monitoring if they perceive that their identity is neither tracked nor used. The more fuzzification (that is, statistical blurring) of a single user's identity, the less anxiety the user has about implicit tracking being used to make results more relevant. Other vendors have not figured out how to add these computational loads to their systems without introducing unacceptable latency, and these vendors offer only dribs and drabs of context functionality. As their infrastructure becomes more robust, look for more context services.
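For readers who want a picture of what I mean by fuzzification, here is a small Python sketch. It is my own illustration, not any vendor's implementation: the raw identity is hashed into one of a fixed number of cohorts, and context signals are aggregated per cohort rather than per person, trading precision for a blurrier, less anxiety-inducing profile.

    # Sketch of "fuzzification": blur an individual identity into a cohort
    # before context data is stored or used for ranking.
    import hashlib

    NUM_COHORTS = 1000  # fewer cohorts -> blurrier; more -> closer to individual tracking

    def cohort_id(user_id: str) -> int:
        """Map a user to one of NUM_COHORTS buckets; the raw id is never stored."""
        digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_COHORTS

    # Context signals are then aggregated per cohort, not per person.
    print(cohort_id("alice@example.com"))
    print(cohort_id("bob@example.com"))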

The company making good use of personalization-centric context is Yahoo. Its personalized MyYahoo service delivers news and information selected by the user. Yahoo announced its forthcoming OneConnect service this week at the telco conference in Barcelona, Spain. Based on the news reports I have seen, Yahoo wants to extend its personalization services to mobile devices.

Although Yahoo doesn’t talk about context, a user who logs in with a Yahoo ID will be “known” to some degree by Yahoo. The user’s mobile experience, therefore, has more context than a user not “known” to Yahoo. Yahoo’s OneConnect is a single example of context that helps an online service customize information services. Viewed from a privacy advocate’s point of view, this type of context is an intrusion, perhaps unwelcome. However, from the vantage point of a mobile device user rushing to the airport, Yahoo’s ability to “know” more about the user’s context can allow more customized information displays. Flight departure information, parking lot availability, or weather information can be “pushed” to the Yahoo user’s mobile device without the user having to push buttons or make finger gestures.

Context, when used in conjunction with search, refers to additional information about [a] a particular user or a group of users identified as belonging to a cluster, [b] information and data in the system, [c] data about system processes, and [d] information available to Yahoo though not residing on its servers.

Yahoo and T-Mobile are not alone in their interest in this type of context sensitive search. Geo spatial functions are potential enablers of news services and targeted advertising revenue. Google and Nokia seem to be moving on a similar vector. Microsoft has a keen awareness of context and its usefulness in search, personalization, and advertising.

Context has become a key part of reducing what I call the “shackles of the search box.” Thumb typing is okay but it’s much more useful to have a device that anticipates, personalizes, and contextualizes information and services. If I’m on my way to the airport, the mobile device should be able to “know” what I will need. I know that I am a creature of habit as you probably are with regard to certain behaviors.

Context allows disambiguation. Disambiguation means figuring out which of two or more possibilities is the "right" one. A good example comes up dozens of times a day. You are in line to buy a bagel. The clerk asks you, "What kind of bagel?" with a very heavy accent, speaking rapidly and softly. You know you want a plain bagel. Without hesitation, you are able to disambiguate what the clerk uttered and reply, "Plain, please."

Humans disambiguate in most social settings, when reading, when watching the boob tube, or just figuring out weird road signs glimpsed at 60 miles per hour. Software doesn’t have the wetware humans have. Disambiguation in search and retrieval systems is a much more complex problem than looking up string matches in an index.

Context is one of the keys to figuring out what a person means or wants. If you know a certain person looks at news about Kolmogorov axioms, next-generation search systems should know that if the user types “Plank”, that user wants information about Max Planck, even though the intrepid user mistyped the name. Google seems to be pushing forward to use this type of context information to minimize the thumb typing that plagues many mobile device users today.
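Here is a minimal sketch of how an interest profile might bias spelling correction. The candidate list, topic tags, and scoring are hypothetical and of my own devising, not Google's method; the idea is that candidates are ranked by spelling similarity plus overlap with what the user habitually reads.

    # Toy context-biased correction: "Plank" resolves to "Max Planck" for a
    # user whose profile leans toward physics and probability.
    import difflib

    INTEREST_PROFILE = {"kolmogorov", "axioms", "probability", "quantum", "physics"}

    # Hypothetical candidates with the topics associated with each.
    CANDIDATES = {
        "Max Planck": {"physics", "quantum"},
        "plank (lumber)": {"carpentry", "wood"},
        "Planck constant": {"physics", "quantum"},
    }

    def correct(query: str) -> str:
        def score(item):
            name, topics = item
            spelling = difflib.SequenceMatcher(None, query.lower(), name.lower()).ratio()
            interest = len(topics & INTEREST_PROFILE)  # boost topics the user follows
            return spelling + interest
        return max(CANDIDATES.items(), key=score)[0]

    print(correct("Plank"))  # -> "Max Planck" for this profile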

These types of context awareness seem within reach. Though complex, many companies have the technologies, systems, and methods to deliver what I call basic context metadata. Let me note that context aware services are in wide use, but rarely labeled as "context" functions. The problem with naming is endemic in search, but you can explore some of these services at their sites. You may have to register and provide some information to take advantage of the features:

  • Google ig (Individualized Google) — Personalized start page, automatic identification of possibly relevant information based on your search history, and tools for you to customize the information
  • Yahoo MyYahoo — content customization, email previews, and likely integration with the forthcoming OneConnect service
  • MyWay — IAC’s personalized start page. One can argue that IAC’s implementation is easier to use than Yahoo’s and more graphically adept than Google’s ig service.

If you are younger than I or young at heart, you will be familiar with the legions of Web 2.0 personalization services. These range from RSS (really simple syndication) feeds that you set up to NetVibes, among hundreds of other mashy, nifty, sticky services. You can explore the most interesting of these services at Tech Crunch. It’s useful to click through the Tech Crunch Top 40 here. I have set up a custom profile on Daily Rotation, a very useful service for people in the information technology market.

An Even Tougher Context Challenge

As interesting and useful as voice disambiguation and automatic adjustment of search results are, I think there is a more significant context issue. At this time, only a handful of researchers are working on this problem. It probably won’t surprise you that my research has identified Google as the leader in what I call “meta-context systems and methods.”

The term meta refers to "information about" a person, process, datum, or other information. The term has drifted a long way from the Latin meta, a turning post in the hippodrome; for example, meta prima was the first turn. Mathematicians and scientists use the term to mean related to or based upon. When a vendor talks about indexing, the term metadata is used to mean the tags or terms assigned to an information object by an automated indexing system or by a human subject matter expert.

The term is also stretched to refer to higher levels in nested sets. So, when an index term applies to other index terms, that broader term performs a meta-index function. For example, if you have an index of the documents on your hard drive, you can tag a group of documents about a new proposal as "USDA Proposal." The term does not appear in any of the documents on your hard drive. You have created a meta-index term to refer to a grouping of information. You can also create meta-indexes automatically. Most people don't assign such a term when they create a folder name or a new directory, but software that performs automatic indexing can assign these meta-index terms. Automatic classification systems can perform this function. I discuss the different approaches in Beyond Search, and I won't rehash that information in this essay.
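To make the meta-index notion concrete, here is a small Python sketch. The file names and terms are invented for illustration; the point is that a meta-index term such as "USDA Proposal" maps to a group of documents even though the phrase appears in none of them.

    # Ordinary per-document index terms.
    document_index = {
        "draft_budget.xls": {"budget", "fy2008"},
        "cover_letter.doc": {"grant", "deadline"},
        "scope_of_work.doc": {"deliverables", "milestones"},
    }

    # Meta-index: a higher-level label applied to a whole group of documents.
    meta_index = {
        "USDA Proposal": {"draft_budget.xls", "cover_letter.doc", "scope_of_work.doc"},
    }

    def search(term):
        """Resolve meta-index terms to member documents; otherwise fall back to
        the ordinary per-document index."""
        if term in meta_index:
            return sorted(meta_index[term])
        return sorted(doc for doc, terms in document_index.items() if term in terms)

    print(search("USDA Proposal"))  # all three files, though none contains the phrase
    print(search("budget"))         # ['draft_budget.xls']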

The “real context challenge” then is to create a meta context for available context data. Recognize that context data is itself a higher level of abstraction than a key word index. So we are now talking about taking multiple contexts, probably from multiple systems, and creating a way to use these abstractions in an informed way.

You, like me, may get a headache when thinking about these Russian doll structures. Matryoshkas (матрёшка) are made of wood or plastic. When you open one doll, you see another inside. You open each doll and find increasingly small dolls inside the largest one. The Russian doll metaphor is a useful one. Each meta-context is the larger doll containing smaller dolls. The type of meta context challenge I perceive is finding a way to deal with multiple matryoshkas, each containing smaller dolls. What we need, then, is a digital basket into which we can put our matryoshkas. A single item of context data is useful, but having access to multiple items and multiple context containers opens up some interesting possibilities.
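The nesting is easier to see in a small data structure. This is a toy rendering of my own, not a vendor design: each context knows the larger context it sits inside, and a lookup walks outward through the enclosing dolls until it finds an answer.

    # Toy matryoshka of contexts: query context inside session context inside
    # organization context; lookups walk outward through the enclosing layers.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Context:
        name: str
        data: dict
        parent: Optional["Context"] = None  # the larger doll this one sits inside

        def lookup(self, key):
            """Walk outward through enclosing contexts until the key is found."""
            ctx = self
            while ctx is not None:
                if key in ctx.data:
                    return ctx.data[key]
                ctx = ctx.parent
            return None

    organization = Context("organization", {"timezone": "US/Eastern"})
    session = Context("session", {"device": "mobile"}, parent=organization)
    query_ctx = Context("query", {"terms": "flight status"}, parent=session)

    print(query_ctx.lookup("device"))    # "mobile" (from the session doll)
    print(query_ctx.lookup("timezone"))  # "US/Eastern" (from the outermost doll)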

In Beyond Search, I describe one interesting initiative at Google. In 2006, Google acquired a small company that specialized in systems and methods for manipulating these types of information context abstractions. There is interesting research into this meta context challenge underway at the University of Wisconsin — Madison as well as at other universities in the U.S. and elsewhere.

Progress in context is taking place at several levels. At the lowest level, commercial services are starting to implement context functions in their products and services. Mobile telephony is one obvious application, and I think the musical chairs underway with Google, Yahoo, and their respective mobile partners is an indication of the jockeying for position. Also at this lowest level are the Web 2.0 and various personalization services that are widely available on Web sites or in commercial software bundles. In the middle, there is not much high-profile activity, but that will change as entrepreneurs sniff the big payoffs in context tools, applications, and services. The most intense activity is taking place out of sight of most journalists and analysts. Google, one of the leaders in this technology space, provides almost zero information about its activities. Even researchers at major universities keep a low profile.

That’s going to change. Context systems and methods may open new types of information utility. In my April 2008 talk, I will provide more information about context and its potential for igniting new products, services, features, and functions for information-centric activities.

Stephen Arnold, February 13, 2008
