Enterprise Search: It’s Easy but Work Is Never Done

July 17, 2008

The Burton Group caught my attention a couple of years ago with a report describing Microsoft as a superplatform. I liked the term, but the report struck me as overly enthusiastic in favor of Microsoft’s server products.

I was surprised when I saw part one of Margie Semilof’s interview with two Burton Group consultants, Guy Creese and Larry Cannell. These folks were described as experts in content management, a discipline with a somewhat checkered history in the pantheon of enterprise software applications. You can read the first part of the interview here. The interview carries a July 15, 2008, date, and I am capturing my personal thoughts on July 16, 2008. That’s my mode of operation, a euro short and a day late. Also, I am not enthusiastic about CMS experts making the jump to enterprise search expertise. The leap can be made, but it’s like jumping from the frying pan into the fire.

The interview contains a rich vein of intellectual gold, or what appears to me to be sort of gold. I jotted down two points made by the Burton experts, and I wanted to offer some color around selected points. When you read the interview, your conclusions and takeaways will probably differ from mine. I am an opinionated goose, so if that bothers you, quit reading now.

Let me address two points.

First, this question and answer surprised me:

Question: How much development work is required with search technology?

Answer by Guy Creese, Burton Group expert in content management: It’s pretty easy… Usually a company is up and running and can see most of its documents without trouble.

Yikes. Enterprise search dissatisfies anywhere from half to two-thirds of a system’s users. Enterprise search systems are among the most troublesome enterprise applications to set up, optimize, and maintain. Even the Google Search Appliance, one of the most toaster-like search solutions, takes some effort to get into fighting shape. Customization requires expertise with the OneBox API. “Seeing documents” and finding information are two quite different functions in my experience.
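For readers who have not wrestled with a GSA, here is the OneBox idea in broad strokes: when a query matches a trigger, the appliance calls an external provider, which hands back an XML payload that gets folded into the results page. Below is a minimal sketch of such a provider in Python. The element names and the toy phone-directory lookup are my own illustrative assumptions, not the official OneBox schema; the point is only that “customization” means writing and maintaining code like this.

```python
# Minimal sketch of an external OneBox-style provider.
# The XML element names below are illustrative, not the official schema.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

DIRECTORY = {"smith": "ext. 4411", "jones": "ext. 7208"}  # stand-in data source

class OneBoxHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query).get("query", [""])[0].lower()
        hits = [(name, ext) for name, ext in DIRECTORY.items() if query in name]
        results = "".join(
            f"<result><title>{n}</title><detail>{e}</detail></result>" for n, e in hits
        )
        body = f"<onebox_results>{results}</onebox_results>".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/xml")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), OneBoxHandler).serve_forever()
```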

Second, this question and answer ran counter to the research I conducted for the first three editions of Enterprise Search Report (2004-2006) and my most recent study Beyond Search (2008).

Question: Search technology has some care and feeding involved. How do companies organize the various tasks?

Answer by Guy Creese, Burton Group expert in content management: This is not onerous. Companies don’t have huge armies [to do this work], but someone has to know the formats, whether to index, how quickly they refresh. If no one worries about this, then search becomes less effective. So beyond the eye candy, you have to know how to maintain and adjust your search.

“Not onerous” runs counter to the data I have gathered in surveys and focus groups. “Formats” invoke transformation. Transformation can be difficult and expensive. Hooking search into work processes requires analysis and then customization of search functions. Search that processes content in content management systems often requires specialized setup, particularly when the search system indexes duplicate or versioned documents. Rich text processing, a highly desirable function, can wander off the beaten path unless customization and tuning are performed.
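To make the hidden work concrete, here is a minimal sketch, in Python, of two of the chores mentioned above: transforming formats to indexable text and catching duplicate documents before they pollute the index. The format handling and the exact-hash duplicate test are deliberate simplifications; production systems rely on commercial filter libraries and fuzzier near-duplicate detection.

```python
# Sketch: normalize documents to text, then skip exact duplicates before indexing.
import hashlib

def to_text(path: str, raw: bytes) -> str:
    """Stand-in format transformation; real systems use filter libraries per format."""
    if path.endswith((".txt", ".htm", ".html", ".xml")):
        return raw.decode("utf-8", errors="ignore")
    # Unknown formats would go to a commercial filter or be logged for manual handling.
    raise ValueError(f"no filter for {path}")

def dedupe_and_index(documents: dict[str, bytes]) -> list[str]:
    seen: set[str] = set()
    indexed = []
    for path, raw in documents.items():
        try:
            text = to_text(path, raw)
        except ValueError:
            continue  # troublesome format: needs custom work, exactly the hidden cost
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate (e.g., the same contract attached to ten emails)
        seen.add(digest)
        indexed.append(path)
    return indexed
```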

Observations

There are a handful of people who have a solid understanding of enterprise search. Miles Kehoe, one of the Verity wizards, is the subject of a Search Wizards Speak interview that will be published on ArnoldIT.com on July 21, 2008. His company, New Idea Engineering, has considerable expertise in search, and you can read his views on what must be done to ensure a satisfactory deployment. Another expert is my son, Erik Arnold, whose company, Adhere Solutions, specializes in customizing and integrating the Google Search Appliance into enterprise environments. To my knowledge, neither Mr. Kehoe nor Mr. Arnold characterizes search as a “pretty easy” task. In fact, I can’t recall anyone in my circle of professional acquaintances describing enterprise search as “pretty easy.”

Second, I am concerned that content management systems are expanding into applications and functions that are not germane to these systems’ capabilities. For example, CMS needs search. Interwoven has struck a deal with Vivisimo to provide search that “just works” to Interwoven customers. Vivisimo has worked hard to create a seamless experience, but, based on my sources, the initial work was not “pretty easy”. In fact, Interwoven had a mixed track record in delivering search before hooking up with Vivisimo. But CMS vendors are also asserting that their systems are social. Well, CMS allows different people to index a document. I think that’s a social and collaborative function. But social software to me suggests Digg, Twitter, and Mahalo-type functionality. Implementing these technologies in a Broadvision (if it is still paddling upstream) or Vignette system might take some doing.

Third, SharePoint (a favorite of Burton if I recall the superplatform document) is a polymorphic software system. Once it was a CMS. Now it is a collaboration platform just like Exchange. I think these are marketing words slapped on servers which are positioned to make sales, not solve problems. SharePoint includes a search function, which is improving. But deploying a robust search system within SharePoint is hard in my experience. I prefer using third-party software from such companies as ISYS Search Software. ISYS, along with Coveo, offers systems that are indeed much easier to deploy, configure, and maintain than SharePoint. But planning and experience with SharePoint are necessary.

I look forward to the second part of this interesting interview with CMS experts about enterprise search. Agree? Disagree? Quack back.

Stephen Arnold, July 17, 2008

Digital Convergence: A Blast from the Past

July 15, 2008

In 1999, I wrote two articles for a professional journal called Searcher. The editor, Barbara Quint, former guru of information at RAND Corporation, asked me to update these two articles. I no longer had copies of them, but Ms. Quint emailed my fair copies, and I read my nine-year-old prose.

The 2008 version is “Digital Convergence: Building Blocks or Mud Bricks”. You can obtain a hard copy from the publisher, Information Today here. In a month or two, an electronic version of the article will appear in one of the online commercial databases.

My son, Erik, who contributed his column to Searcher this month as well, asked me, “What’s with the mud bricks?” I chose the title to suggest that the technologies I identified as potential winners in 1999 may lack staying power. One example is human-assigned tags. This is indexing, and it has been around in one form or another since humans learned to write. Imagine trying to find a single scroll in a stack of scrolls. Indexing was a must. What’s not going to have staying power is my assigning tags. The concept of indexing is a keeper; the function is moving to smart software, which can arguably do a better job than a subject matter expert as long as we define “better” as meaning “faster and cheaper”. A “mud brick” is a technology that decomposes into a more basic element. Innovations are based on interesting assemblages of constituent components. Get the mix right and you have something with substance, the equivalent of the Lion’s Gate keystone.
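To illustrate what “the function is moving to smart software” looks like at its crudest, here is a toy tagger that picks candidate index terms from a document faster and cheaper than a human ever could. The stop list and frequency scoring are my own simplifications, not how any commercial indexing engine actually works.

```python
# Toy automatic tagger: pick the most frequent non-trivial terms as tags.
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "that", "was"}

def suggest_tags(text: str, how_many: int = 5) -> list[str]:
    words = re.findall(r"[a-z]{3,}", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(how_many)]

print(suggest_tags("Digital convergence builds on XML, indexing, and content processing. "
                   "Indexing scrolls was hard; indexing digital content is cheaper."))
```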

[Image: the Lion Gate at Mycenae]

Today’s information environment is composed of systems and methods that are durable. XML, for example, is not new. It traces its roots back 50 years. Today’s tools took decades of refinement. Good or bad, the notion of structuring content for meaning and separating the layout information from content is with us for the foreseeable future.
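A small illustration of that separation of meaning from layout: the record below says what each piece of content is and leaves rendering entirely to whatever application consumes it. The element names are invented for the example.

```python
# Sketch: content marked up for meaning, with no layout information mixed in.
import xml.etree.ElementTree as ET

record = """<article>
  <title>Digital Convergence: Building Blocks or Mud Bricks</title>
  <author>Stephen Arnold</author>
  <published>2008</published>
</article>"""

doc = ET.fromstring(record)
# The same record can feed a Web page, a feed, or an index without reformatting.
print(doc.findtext("title"), "-", doc.findtext("author"), doc.findtext("published"))
```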

Three thoughts emerged from the review of the original essays whose titles I no longer recall.

First, most of today’s hottest technologies were around nine years ago. Computers were too expensive and storage was too costly to permit widespread deployment of services based on the antecedents of today’s hottest applications, such as social search and mobile search, among others.

Second, even though I identified a dozen or so “hot” technologies in 1999, I had to wait for competition and market actions to identify the winners. Content processing, to pick one, is just now emerging as a method that most organizations can afford to deploy. In short, it’s easy to identify a group of interesting technologies; it’s hard for me to pick the technology that will generate the most money or have the greatest impact.


Enterprise Search Top Vendors: But Who Is the Judge?

July 3, 2008

My jaw dropped when I saw “The Top Enterprise Search Vendors,” an essay by Jon Brodkin, a writer affiliated with Network World. You can read the two-part document here. (Note: The URL is one of those wacky jobs with percent signs and random characters, the product of a misbegotten content management system. So, if you can’t get the link to work after I write this [July 3, 2008, 2 pm Eastern time], you are on your own.)

Let’s cut to the chase.

Mr. Brodkin is using a consulting firm’s report as the backbone of his analysis. There is nothing wrong with that approach, and I use it myself for some documents. He picks up assertions in the consultant report and identifies some companies as “best” or “top” in the “enterprise search” market. We need a definition of “enterprise search”. A definition, in my view, is an essential first step. Why? I wrote a 300-page study about moving beyond search for Gilbane Group. A large part of my argument was that no one knows what enterprise search is, so dissatisfaction runs high, in the 50 to 75 percent range. Picking the “best” or “top” vendor when the majority of system users are unhappy is an issue with me.

He writes:

The best enterprise search products on the market come from Autonomy, Endeca, the Microsoft subsidiary Fast and Vivisimo, but Google’s Search Appliance continues to dominate the market in terms of brand awareness and sheer number of customers, Forrester Research says in a new report.

Ah, yes, the Forrester  “wave” report. Now we know the origin of the adjectives “top” and “best”. Other vendors to note include:

  • Coveo
  • IBM
  • Microsoft’s own MOSS and MSS search systems (distinct from the Fast Search & Transfer ESP system). This is in too much flux to warrant discussion by me. I handle this in Beyond Search by saying, “Wait and see.” I know this is not what 65 million SharePoint users want to hear, but “wait and see”.
  • Oracle
  • Recommind.

Let’s do a reality check here, not for Mr. Brodkin’s sake or that of the Forrester “wave” team. Just in case an individual wants to license a search system, some basic information may be useful.

First, there are more than 300 vendors offering search, content processing, and text analytics systems at this time. There is no leader for several reasons:

  • Autonomy has diversified aggressively, and much of its market impact comes from systems in which search is a comparatively modest part of a far larger system; for example, fraud detection. So, revenues alone or total customer count are not key indicators of search.
  • Fast Search & Transfer has been struggling with a modest challenge; namely, the investigation of its finances over an alleged $122 million shortfall in FY2007, the fiscal year prior to Microsoft’s buying the company for $1.2 billion. Somehow “best” and “top” are in conflict with this alleged shortfall. So, “best” and “top” mean one thing to me and definitely another to Mr. Brodkin and the Forrester “wave” team. If an outfit is the best, I assume the firm’s financial health is part of its being “top” or “best”. I guess I am old fashioned or an addled goose.
  • Endeca works hard to explain that it is an information access company. Sure, search functions work in an Endeca implementation, but I think lumping this company with Autonomy (diversified information services) and Fast Search & Transfer (murky financial picture) clarifies little and confuses more.
  • Vivisimo is a relative newcomer to enterprise search. The company has some nifty de-duplication technology, and it can federate results from different engines (see the sketch after this list). The company is making sales in the enterprise arena. I categorize it as an up-and-coming vendor. I wonder if Vivisimo was surprised by its being labeled as a firm nosing around in Autonomy and Endeca territory. Great publicity. But Autonomy is about $300 million in revenue. Endeca is in the $110 million in revenue range. Vivisimo is far smaller, maybe one-tenth Endeca’s size, but growing. A set, to my way of thinking, should contain like objects. $300 million, $100 million, $10 million–not the type of set I would craft to explain “enterprise search”.
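Since federation and de-duplication came up in the Vivisimo bullet, here is a rough sketch of what those functions involve: merge result lists from several engines, collapse hits that point at the same normalized URL, and re-rank the survivors. This is my own illustration of the general idea, not Vivisimo’s method.

```python
# Sketch: federate result lists from several engines and collapse duplicates.
from urllib.parse import urlsplit

def normalize(url: str) -> str:
    parts = urlsplit(url.lower())
    return parts.netloc.removeprefix("www.") + parts.path.rstrip("/")

def federate(result_lists: list[list[dict]]) -> list[dict]:
    merged: dict[str, dict] = {}
    for results in result_lists:
        for rank, hit in enumerate(results):
            key = normalize(hit["url"])
            score = 1.0 / (rank + 1)           # crude rank-based score per engine
            if key in merged:
                merged[key]["score"] += score  # duplicates boost rather than repeat
            else:
                merged[key] = {**hit, "score": score}
    return sorted(merged.values(), key=lambda h: h["score"], reverse=True)
```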

Second, have vendors been miscategorized? I am okay with mentioning Coveo and Recommind. Both companies seem to have a solid value proposition and a clear sense of who their prospects are. Coveo, in particular, has some extremely tasty technology for mobile search. Recommind, despite its efforts to break out of the legal market, continues to make sales to lawyer-types. I am not sure the word “search” covers what these two firms are offering their customers. I think of both vendors offering “search plus other services and functions.”

Third, identifying IBM and Oracle as key players in search baffles me. Both buy consulting and advertising, but in “enterprise search”, neither figures prominently in my analyses. IBM is not a search company; it is a consulting firm using advice to push hardware, software, and services. Search at IBM can mean Lucene with an IBM T-shirt. IBM also sells DB2, FileNet, iPhrase, and assorted text processing tools whose names I cannot keep straight. IBM also has an industry “openness” initiative called UIMA, a gasping swan right now in my opinion.

And, Oracle has been beating the secure search drum to deaf ears for a couple of years. Oracle SES 10g sells more Oracle servers, but Oracle is moving a lot of Google Search Appliances. So, what’s Oracle search? Is it the PL/SQL stuff that fuels more Oracle database installations, the SES 10g, or the Google Search Appliance? My sources indicate that Oracle sells more Google Search Appliances than SES 10g. Why? Well, it works and has a nifty API that allows Oracle consultants to hook the GSA into other enterprise systems. Forrester says Oracle is a search vendor, which is accurate. Forrester and Mr. Brodkin don’t mention the importance of the GSA in Oracle’s information access efforts.

Then there is Google or the GOOG. Google rates inclusion in the list of search leaders. The surprise is that Google is THE leader in enterprise search. The company doesn’t provide much information, but based on my research, Google has more than 11,000 Google Search Appliance licensees and more coming every day. When you add up the revenue from various enterprise activities, Google is not generating the paltry $188 million reported in its FY2007 financials. Nope. The GOOG is in the $400 million range. If my data are correct, Google, not Autonomy, is number one in gross revenue related to search.

What’s this all mean?

Let me boil out the waste products for you:

  1. Enterprise search is a non-starter in organizations. People don’t like the “search” experience, so the market is shifting. The change is coming quickly, and the established vendors are trying to reposition themselves by adding social search, business analytics, and discovery functions. The problem is that other companies are moving more quickly and delivering these much-needed options faster.
  2. There are some very significant vendors in the information access market, and these must be included on any procurement team’s “look at” list; specifically, Exalead (Paris) and ISYS Search Software (Sydney and Denver). Both companies serve slightly different sectors of the information access market, but omitting them underscores a lack of knowledge of what’s hot and what’s not.
  3. Specialist vendors are having a significant impact in niche markets, and these vendors could make leaps into other segments as well. Examples that come to my mind are Attensity and  Clearwell Systems.
  4. New players are poised to disrupt existing information access markets. Examples range from Silobreaker (Stockholm) to companies such as Attivio and Connotate. In fact, there is an ecosystem of new and interesting approaches that have search and retrieval functions but are definitely distancing themselves from the train wreck that is “enterprise search”.

I urge you to read the Forrester report. Just be sure of your facts before you base your decision on a single firm’s analysis. There is a reason that a pecking order in consulting exists. At the top are Booz, Allen & Hamilton, Boston Consulting Group, Bain, and McKinsey. Then there is a vast middle tier. Below the middle tier are firms that offer boutique services. Instead of accepting a firm’s view of the “top” or the “best”, make sure the advice you take comes from a firm that has a blue-chip recommendation.

The growing dissatisfaction with enterprise search can come back and bite hard.

Stephen Arnold, July 3, 2008

Microsoft BIOIT: Opportunities for Text Mining Vendors

June 14, 2008

I came across Microsoft BIOIT in a news release from Linguamatics, a UK-based text processing company. If you are not familiar with Linguamatics, you can learn more about the company here. The company’s catchphrase is “Intelligent answers from text.”

In April 2006, Microsoft announced its BIOIT alliance. The idea was to create “a cross-industry group working to further integrate science and technology as a first step toward making personalized medicine a reality.” The official announcement continued:

The alliance unites the pharmaceutical, biotechnology, hardware and software industries to explore new ways to share complex biomedical data and collaborate among multidisciplinary teams to ultimately speed the pace of drug discovery and development. Founding members of the alliance include Accelrys Software Inc., Affymetrix Inc., Amylin Pharmaceuticals Inc., Applied Biosystems and The Scripps Research Institute, among more than a dozen industry leaders.

The core of the program is Microsoft’s agenda for making SharePoint and its other server products the plumbing of health-related systems among its partners. The official release makes this point as well, “The BioIT Alliance will also provide independent software vendors (ISVs) with industry knowledge that helps them commercialize informatics solutions more quickly with less risk.”

Rudy Potenzone, a highly regarded expert in the pharmaceutical industry, joined Microsoft in 2007 to bolster Redmond’s BIOIT team. Dr. Potenzone, who has online experience with Chemical Abstracts, has added horsepower to the Microsoft team.

This week, on June 12, 2008, Linguamatics hopped on the BIOIT bandwagon. In its news announcement, Linguamatics co-founder Roger Hale said:

As the amount of textual information impacting drug discovery and development programs grows exponentially each year, the ability to extract and share decision-relevant knowledge is crucial to streamline the process and raise productivity… As a leader in knowledge discovery from text, we look forward to working with other alliance members to explore new ways in which the immense value of text mining can be exploited across complex, multidisciplinary organizations like pharmaceutical companies.

Observations

Health and medicine constitute an important part of the scientific, medical, and technical information sector. More importantly, health presages money. In the US, the baby boomer bulge is moving toward retirement, bringing a cornucopia of revenue opportunity for many companies.

Google has designs on this sector as well. You can read about its pilot project here. Microsoft introduced a similar project in 2006. You can read about it here.

Several observations are warranted:

  1. There is little doubt that bringing order, control, metadata and online access to certain STM information is a plus. Tossing in the patient health record allows smart software to crunch through data looking for interesting trends. Evidence based medicine also can benefit. There’s a social upside beyond the opportunity for revenue.
  2. The issue of privacy looms large as personal medical records move into these utility-like systems. The experts working on these systems to collect, disseminate, and mine data have good intentions. Nevertheless, this is uncharted territory, and when one explores, one must be prepared for the unexpected. The profile of these projects is low, seemingly controlled quite tightly. It is difficult to know if security and privacy issues have been adequately addressed. I’m not sure government authorities are on top of this issue.
  3. The commercial imperative fuels some potent corporate interests. These interests could run counter to social needs. The medical informatics sector, the STM players, and the health care stakeholders are moving forward, and it is not clear what the impacts will be when their text mining reveals hitherto unknown facets of information.

One thing is clear. Linguamatics, Hakia, and other content processing companies see an opportunity to leverage these broader industry interests to find new markets for their text mining technologies. I anticipate that other content processing companies will find the opportunities sufficiently promising to give BIOIT a whirl.

Stephen Arnold, June 14, 2008

Ovum Says, ‘Microsoft Has a Plan’ for Search

May 24, 2008

Ovum, a British consultancy of high repute, asserts that Microsoft has its sights set on being “the king of search”. You can read its summary here. This article, penned by Mike Davis, is based upon a longer piece available to Ovum’s paying customers as part of the pundit shop’s Straight Talk service.

The Ovum conclusion, if I read Mr. Davis’ article correctly, is that Microsoft’s pay-for-traffic initiative is just one component of a far larger strategy to close the gap with Google. He writes:

The technology for the programme came from the acquisition of Jellyfish.com last year. The service is a different proposition to merchants than the usual ‘cost per click(s)’ such as used by Microsoft’s current nemesis Google. The payment model being used by Microsoft is called Cost Per Acquisition, and the advertiser only pays when the advertisement results in a purchase.

So, it’s not pay for traffic. It’s a rebate of three to 30 percent, requires a minimum balance of $5, and is designed to go after Amazon.com and eBay.com.

The point that jumped out at me is that Mr. Davis tosses the Fast Search & Transfer acquisition into the mix. Mr. Davis sees the pay-for-traffic plan announced by William Gates at the Advance 08 advertising conference and the $1.2 billion deal for Fast Search as signs of Microsoft’s determination to be “king of search”.

Let’s assume that Ovum’s research and Mr. Davis are right on target. This means that:

  • The Jellyfish technology underpinning the cash back for search play will generate traffic and hence ad revenue for Microsoft.
  • The Fast Search technology will allow Microsoft to break through the 50 million document barrier that some SharePoint users encounter with native SharePoint search.
  • Consumers and advertisers will leap on the cash back bandwagon, and SharePoint licensees will pay for Fast ESP (Enterprise Search Platform).

Each of these actions must take place quickly and produce gains for Microsoft.

How much traffic and revenue does Microsoft need to become “king of search”?

The gap between Microsoft and Google is a reasonably large one. Recent data from an admittedly uneven resource suggests that Google has about 62 percent of the US search traffic. Google’s share of the global market is higher. In the April 2008 period (you can read Mashable’s quite good analysis here), Microsoft lost search market share. If the ComScore data are accurate, Microsoft accounts for 9.1 percent of the search traffic. The month before, Microsoft’s search traffic was 9.4 percent. Google’s share is growing, if the ComScore data are correct; Microsoft’s share of search traffic is degrading. Wow!

In order to close this gap, the pay-for-search scheme is going to have to reverse a declining trend, attract advertisers, and scale like the devil. I don’t think the pay-for-traffic scheme will work whether it is aimed at Amazon.com, eBay.com, Google.com, or Yahoo.com.

The Fast Search deal is going to have to show some sizzle. At the recent Enterprise Search Summit, I stopped by the Microsoft exhibit and asked about search. I was told SharePoint was quite good. I asked about Fast Search and I was told that Fast Search had a booth. I asked, “Please, show me the Fast ESP system running on a SharePoint system.” The nice Microsoft person said, “I don’t have that information.” So, no FAST logo in the Microsoft booth and no demo that I could see. Keep in mind that there were vendors such as BA-Insight, Coveo and ISYS Search Software, among others, showing potential buyers SharePoint search systems that worked, scaled, and delivered the nifty metatagging so much in demand.

I walked to the opposite side of the room where the Fast Search exhibit was. I asked to see the Fast ESP SharePoint demo. I was told, “Come back between sessions. We will have it up then.” I came back and was told, “We’ll walk you through the basic systems. SharePoint works the same way.” I asked, “Where’s your Microsoft logo?” The really friendly person told me, “We don’t have that logo yet. Leave your card, and I will get that information for you.” I said, “No.” Your PR guy hassles me about not knowing anything about Fast Search despite my analysis of the system for the US federal government over a two-year period.

Now putting the pay for traffic puzzle piece up against the Fast Search puzzle piece, Ovum sees a fit. I don’t. What I see is a very large organization faced with market pushback on three separate war-fighting fronts. A three-front conflict is complex, not tidy. And what are the three fronts?

First, Microsoft controls the desktops of 90 percent of computer users, and a Microsoft page with its Web search box is the default home page in Internet Explorer. Google’s market share means that people are consciously navigating to Google to run queries even though the Microsoft Web search box is the default. Most people don’t change their default home page, so extra clicks and typing are required. People like easy, but when it comes to search, people go to Google anyway. I find this pretty amazing. The longer Microsoft persists in losing market share, the more deeply ingrained the Google habit becomes. In the history of online, user habits–once set–become very hard to change.

Second, the pay for clicks approach is a double-edged sword. Here’s why. There is a tremendous incentive for users to find ways to scam the system. Google has to work overtime to snuff out fraudulent clicks. Microsoft–lacking a high traffic site and the easy money of AdSense–will find that it must spend more to deal with tricky users. And if the pay for traffic play is really successful, Microsoft will have to scale its online system quickly. One edge is giving up some money by betting more traffic will yield cash. The other edge is that success means more scaling costs. The way I look at it, the pay for traffic play costs money and does not hold the promise of a way to lame the nimble Googzilla.
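For readers who wonder what “snuffing out fraudulent clicks” involves at the simplest level, here is a toy screen that flags sources clicking far more often than a human plausibly would. The window and threshold are arbitrary assumptions; real fraud detection is vastly more elaborate.

```python
# Toy click-fraud screen: flag IPs with an implausible click rate in a time window.
from collections import defaultdict

def suspicious_ips(clicks: list[tuple[str, float]], window: float = 60.0, limit: int = 10) -> set[str]:
    """clicks: (ip, timestamp) pairs; returns IPs exceeding `limit` clicks per `window` seconds."""
    by_ip: dict[str, list[float]] = defaultdict(list)
    for ip, ts in clicks:
        by_ip[ip].append(ts)
    flagged = set()
    for ip, times in by_ip.items():
        times.sort()
        for i, start in enumerate(times):
            if len([t for t in times[i:] if t - start <= window]) > limit:
                flagged.add(ip)
                break
    return flagged
```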

In fact, Google is adept at scaling quickly and at lower costs due to its use of commodity hardware and its “smart” extensions to Linux. Microsoft has yet to prove that it can scale without taking extreme measures such as complex tiering, using super-fast, expensive, high-end branded server gear from Hewlett Packard and other vendors, and dealing with the time and bandwidth issues imposed by Microsoft’s own 64-bit operating system and application overhead. Microsoft has to spend more to get the basic job done. My take? A huge success for Microsoft results in higher costs. In the short term, that’s not a problem. Over the longer term, higher costs can become a problem even for a deep-pockets giant like Microsoft. If performance lags or user trickery becomes evident, the gains may slip away leaving puddles of red ink.

Third, the Ovum analysis says that the pay for traffic play is based on the Jellyfish acquisition. The enterprise search initiative is based on the Fast Search acquisition. These two key components were not invented at Microsoft, and I have a hunch that integrating these acquired technologies into the Windows-based systems is a work in progress. Again, more costs and increased chance for technical and managerial friction. Microsoft’s ingrained project manager system and its silo-type structure make feudal squabbles between digital princes a feature of Redmond life. To be “king of search”, these destructive hot spots have to be remedied. Google’s certainly not perfect, but it seems able to innovate without clashes over interface and technology popping up when I use its system.

I wish Microsoft well in its quest to become “king of search”. I know Ovum’s management wants its analyses to be accurate and generate consulting business for the firm’s analysts. I hope SharePoint users find search happiness from Fast ESP. I hope Web searchers find that Microsoft’s Web search initiatives deliver the goods.

Microsoft has to find a way to leapfrog ahead of Google. I’m not sure making acquisitions and paying for traffic fit together seamlessly. Furthermore, I disagree that these two initiatives mesh, have been fully integrated, and represent a significant challenge to the GOOG. Agree? Disagree? Let me know.

Stephen Arnold, May 24, 2008

Microsoft Chomps and Swallows Fast

April 26, 2008

It’s official. On April 24, 2008, Fast Search & Transfer became part of the Microsoft operation. You can read the details at Digital Trends here, the InfoWorld version here, or Examiner.com’s take here.

John Lervik, the Fast Search CEO, will become a corporate vice president at Microsoft. He will report to Jeff Teper, the corporate vice president for the Office Business Platform at Microsoft. The idea–based on my understanding of the set up–is that Dr. Lervik will develop a comprehensive group of search products and services. The offerings will involve Microsoft Search Server 2008 Express, search for the Microsoft Office SharePoint Server 2007, and the Fast Enterprise Search Platform. Despite my age, I think the idea is to create a single enterprise search platform. Lucky licensees of Fast Search’s technology prior to the buyout will not be orphaned. Good news indeed, assuming the transition verbiage sets like hydrated lime, pozzolana, and aggregate. Some Roman concrete has been solid for two thousand years.


This is an example of Roman concrete. The idea of “set in stone” means that change is difficult. Microsoft has some management procedures that resist change.

A Big Job

The job is going to be a complicated one for Microsoft’s and Fast Search’s wizards.

First, Microsoft has encouraged partners to develop search solutions for its operating system, servers, and applications. The effort has been wildly successful. For example, if you are one of the more than 80 million SharePoint users, you can use search solutions from specialists like Interse in Denmark to add zip to the metadata functions of SharePoint, dtSearch to deliver lightning-fast performance with a natural language processing option, or Coveo for clustering and seamless integration. You can dial into SurfRay’s snap-in replacement for the native SharePoint search. You can turn to the ISYS Search Software system, which delivers fast performance, entity extraction, and other “beyond search” features. In short, there are dozens of companies who have developed solutions to address some of the native search weaknesses in SharePoint. So, one job will be handling the increased competition as the Fast Search team digs in while keeping “certified gold partners” reasonably happy.


This is a ceramic rendering of two of the “10,000 Immortals”. The idea is that when one Immortal is killed, another one takes his place. Microsoft’s certified gold partners–if shut out of the lucrative SharePoint aftermarket for search–may fight to keep their customers like the “10,000 Immortals”. The competitors will just keep coming until Microsoft emerges victorious.


Traditional Publishers: Patricians under Siege

April 19, 2008

This is an abbreviated version of Stephen Arnold’s key note at the Buying and Selling eContent Conference on April 15, 2008. A full text of the remarks is here.

Roman generals like Caesar relied on towers spaced about 3000 feet apart. Torch signals allowed messages to be passed. Routine communications used a Roman version of the “pony express”, based on innovations in Persia centuries before Rome took to the battlefield.

Today, you rely on email and your mobile phones. Teens and tweens use Twitter and “instant” social messaging systems like those in Facebook and Google Mail. Try to imagine how difficult it would be for Caesar to understand the technology behind Twitter. But how many of you think Caesar would have hit upon a tactical use of this “faster than flares” technology?


Computerworld’s Take on Enterprise Search

January 12, 2008

Several years ago I received a call. I’m not at liberty to reveal the names of the two callers, but I can say that both callers were employed by the owner of Computerworld, a highly-regarded trade publication. Unlike its weaker sister, InfoWorld, Computerworld remains both a print and online publication. The subject of the call was “enterprise search” or what I now prefer to label “behind-the-firewall search.”

The callers wanted my opinion about a particular vendor of search systems. I provided a few observations and said, “This particular company’s system may not be the optimal choice for your organization.” I was told, “Thanks. Goodbye.” IDG promptly licensed the system against which I cautioned. In December 2007 at the international online meeting in London, England, an acquaintance of mine who works at another IDG company complained about the IDG “enterprise search” system. When I found myself this morning (January 12, 2008) mentioned in an article authored by a professional working at an IDG unit, I invested a few moments with the article, an “FAQ” organized as questions and answers.

In general, the FAQ snugly fitted what I believe are Computerworld’s criteria for excellence. But a few of the comments in the FAQ nibbled at me. I had to work on my new study Beyond Search: What to Do When Your Search System Doesn’t Work, and I had this FAQ chewing at my attention. A Web log can be a useful way to test certain ideas before “official” publication. Even more interesting is that I know that IDG’s incumbent search system, ah, disappoints some users. Now, before the playoff games begin, I have an IDG professional cutting to the heart of search and content processing. The article “FAQ: Why Is Enterprise Search Harder Than Google Web Search?” references me. The author appears to be Eric Lai, and I don’t know him, nor do I have any interaction with Computerworld or its immediate parent, IDC, or the International Data Group, the conglomerate assembled by Patrick McGovern (blue suit, red tie, all the time, anywhere, regardless of the occasion).

On the article’s three Web pages (pages I want to add that are chock full of sidebars, advertisements, and complex choices such as Recommendations and White Papers) Mr. Lai’s Socratic dialog unfurls. The subtitle is good too: “Where Format Complications Meet Inflated User Expectations”. I cannot do justice to the writing of a trained, IDC-vetted journalist backed by the crack IDG editorial resources, of course. I’m a lousy writer, backed by my boxer dog Tyson and a moonshine-swilling neighbor next hollow down in Harrods Creek, Kentucky.

Let me hit the key points of the FAQ’s Socratic approach to the thorny issues of “enterprise search”, which is, remember, “behind-the-firewall search” or intranet search. After thumbnailing each of Mr. Lai’s points, I will offer comments. I invite feedback from IDC, IDG, or anyone who has blundered into my Beyond Search Web log.

Point 1: Function of Enterprise Search

Mr. Lai’s view is that enterprise search makes information “stored in their [users’] corporate network” available. Structured and unstructured data must be manipulated, and Mr. Lai, on the authority of Dr. Yves Schabes, Harvard professor and Teragram founder, reports that a dedicated search system executes queries more rapidly “though it can’t manipulate or numerically analyze the data.”

Beyond Search wants to add that Teragram is an interesting content processing system. In Mr. Lai’s discussion of this first FAQ point, he has created a fruit salad mixed in with his ones and zeros. The phrase “enterprise search” is used as a shorthand way to refer to the information on an organization’s computers. Although a minor point, there is no “enterprise” in “enterprise search” because indexing behind-the-firewall information means deciding what not to index or, at least, what content is available to whom under what circumstances. One of the gotchas in behind-the-firewall search, therefore, is making sure that the system doesn’t find and make available personal information, health and salary information, certain sensitive information such as what division is up for sale, and the like. A second comment I want to make is that Teragram is what I classify as a “content processing system provider”. Teragram’s technology, which has been used at the New York Times and America Online, can be an enhancement to other vendors’ technology. Finally, the “war of words” that rages between various vendors about performance of database systems is quite interesting. My view is that behind-the-firewall search and the new systems on offer from Teragram and others in the content processing sector are responding to a larger data management problem. Content processing is a first step toward breaking free of the limitations of the Codd database. We’re at an inflection point, and the swizzling of technologies presages a far larger change coming. Think dataspaces, not databases, for example. I discuss dataspaces in my new study out in April 2008, and I hope my discussion will put the mélange of ideas in Mr. Lai’s first Socratic question in a different context. The change from databases to dataspaces is more than a swap of two consonants.
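Because “what content is available to whom” is easy to gloss over, here is a bare-bones sketch of result-time security trimming, the mechanism that keeps salary reviews out of a salesperson’s hit list. The access control model is invented for the illustration and is far simpler than any real directory integration.

```python
# Sketch: trim a result list against per-document access control lists.
from dataclasses import dataclass, field

@dataclass
class Doc:
    doc_id: str
    title: str
    allowed_groups: set[str] = field(default_factory=set)

def trim(results: list[Doc], user_groups: set[str]) -> list[Doc]:
    # A document the user may not see should never appear, even as a title or snippet.
    return [d for d in results if d.allowed_groups & user_groups]

hits = [Doc("1", "Q3 sales plan", {"sales"}), Doc("2", "Salary review", {"hr"})]
print([d.title for d in trim(hits, {"sales", "engineering"})])  # only the sales plan survives
```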

Point 2: Google as the Model for Learning Search

Mr. Lai’s view is that a user of Google won’t necessarily be able to “easily learn” [sic] an “enterprise search” system.

I generally agree with the sentiment of the statement. In Beyond Search I take this idea and expand it to about 250 pages of information, including profiles of 24 companies offering a spectrum of systems, interfaces, and approaches to information access. Most of the vendors’ systems that I profile offer interfaces that allow users to point and click their way to needed information. Some of the systems absolve the user of having to search for anything because workflow tools and stored queries operate in the background. Just-in-time information delivery makes the modern systems easier to use because the hapless employee doesn’t have to play the “search box guessing game.” Mr. Lai, I believe, finds query formulation undaunting. My research reveals the opposite. Formulating a query is difficult for many users of enterprise information access systems. When a deadline looms, employees are uncomfortable trying to guess the key word combination that unlocks the secret to the needed information.
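A stored query is less exotic than it sounds: the system runs a saved search on a schedule and pushes anything new to the employee, so nobody has to play the guessing game. The keyword matching below is a deliberately naive stand-in for real retrieval.

```python
# Sketch: a stored query that pushes new matches instead of waiting for a search.
def run_stored_query(terms: set[str], documents: dict[str, str], already_seen: set[str]) -> list[str]:
    fresh = []
    for doc_id, text in documents.items():
        if doc_id in already_seen:
            continue
        if terms & set(text.lower().split()):   # naive keyword match stands in for real retrieval
            fresh.append(doc_id)
            already_seen.add(doc_id)
    return fresh  # in a real system these would be emailed or pushed to a portal page
```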

Point 3: Hard Information Types

I think Mr. Lai reveals more about his understanding of search in this FAQ segment. Citing our intrepid Luxembourgian, Dr. Schabes, we learn about eDiscovery, rich media, and the challenge of duplicate documents routinely spat out by content management systems.

The problem is the large amounts of unstructured data in an organization. Let’s rein in this line of argument. There are multiple challenges in behind-the-firewall search. What makes information “hard” (I interpret the word “hard” as meaning “complex”) involves several little-understood factors colliding in interesting ways. [a] In an organization there may be many versions of documents, many copies of various versions, and different forms of those documents; for example, a sales person may have the Word version of a contract on his departmental server, but there may be an Adobe Portable Document Format version attached to the email telling the client to sign it and fax the PDF back. You may have had to sift through these variants in your own work. [b] There are file types that are in wide use. Many of these may be renegades; that is, the organization’s over-worked technical staff may be able to deal with only some of them. Other file types such as iPod files, digital videos of a sales pitch captured on a PR person’s digital video recorder, or someone’s version of a document exported using Word 2007’s XML format are troublesome. Systems that process content for search and retrieval have filters to handle most common file types. The odd ducks require some special care and feeding. Translation: coding filters, manual work, and figuring out what to do with the file types for easy access. [c] Results in the form of a laundry list are useful for some types of queries but not for others. The more types of content processed by the system, the less likely a laundry list will be useful. Not surprisingly, advanced content processing systems produce reports, graphic displays, suggestions, and interactive maps. When videos and audio programs are added to the mix, the system must be able to render that information. Most organizations’ networks are not set up to shove 200 megabyte video files to and fro with abandon or alacrity. You can imagine the research, planning, and thought that must go into figuring out what to do with these types of digital content. None is “hard”. What’s difficult is the problem solving needed to make these data and information useful to an employee so work gets done quickly and in an informed manner. Not surprisingly, Mr. Lai’s Socratic approach leaves a few nuances in the tiny spaces of the recitation of what he thinks he heard Mr. Schabes suggest. Note that I know Mr. Schabes, and he’s an expert on rule-based content processing and Teragram’s original rule nesting technique, a professor at Harvard, and a respected computer scientist. So “hard” may not be Teragram’s preferred word. It’s not mine.
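Point [a] above, the proliferation of versions and variants, is worth a tiny illustration. The sketch below groups files that look like variants of the same document by normalizing their names; real systems compare content as well, so treat the naming rule as an assumption for the example.

```python
# Sketch: group the variants of one document (Word draft, emailed PDF, final copy).
from collections import defaultdict
from pathlib import PurePath

def group_variants(paths: list[str]) -> dict[str, list[str]]:
    """Group on a normalized name stem; real systems compare content, not just names."""
    groups: dict[str, list[str]] = defaultdict(list)
    for p in paths:
        stem = PurePath(p).stem.lower().replace("_final", "").replace("_v2", "").strip()
        groups[stem].append(p)
    return groups

files = ["Smith_contract.doc", "smith_contract_v2.doc", "Smith_Contract_final.pdf"]
print(group_variants(files))  # all three land in the same variant group
```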

Point 4: Enterprise Search Is No More Difficult than Web Search

Mr. Lai’s question burrows to the root of much consternation in search and retrieval. “Enterprise search” is difficult.

My view is that any type of search ranks as one of the hardest problems in computer science. There are different types of problems with each variety of search–Web, behind-the-firewall, video, question answering, discovery, etc. The reason is that information itself is a very, very complicated aspect of human behavior. Dissatisfaction with “behind-the-firewall” search is due to many factors. Some are technical. In my work, when I see yellow sticky notes on monitors or observe piles of paper next to a desk, I know there’s an information access problem. These signs signal the system doesn’t “work”. For some employees, the system is too slow. For others, the system is too complex. A new hire may not know how to finagle the system to output what’s needed. Another employee may be too frazzled to be able to remember what to do due to a larger problem which needs immediate attention. Web content is no walk in the park either. But free Web indexing systems have a quick fix for problem content. Google, Microsoft, and Yahoo can ignore the problem content. With billions of pages in the index, missing a couple hundred million with each indexing pass is irrelevant. In an organization, nothing angers a system user quicker than knowing a document has been processed or should have been processed by the search system. When the document cannot be located, the employee either performs a manual search (expensive, slow, and stress inducing) or goes ballistic (cheap, fast, and stress releasing). In either scenario or one in the middle, resentment builds toward the information access system, the IT department, the hapless colleague at the next desk, or maybe the person’s dog at home. To reiterate an earlier point. Search, regardless of type, is extremely challenging. Within each type of search, specific combinations of complexities exist. A different mix of complexities becomes evident within each search implementation. Few have internalized these fundamental truths about finding information via software. Humans often prefer to ask another human for information. I know I do. I have more information access tools than a nerd should possess. Each has its benefits. Each has its limitations. The trick is knowing what tool is needed for a specific information job. Once that is accomplished, one must know how to deal with the security, format, freshness, and other complications of information.

Point 5: Classification and Social Functions

Mr. Lai, like most search users and observers, has a nose that twitches when a “new” solution appears. Automatic classification of documents and support of social content are two of the zippiest content trends today.

Software that can suck in a Word file and automatically determine that the content is “about” the Smith contract, belongs to someone in accounting, and uses the correct flavor of warranty terminology is useful. It’s also like watching Star Trek and hoping your BlackBerry Pearl works like Captain Kirk’s communicator. Today’s systems, including Teragram’s, can index at 75 to 85 percent accuracy in most cases. This percentage can be improved with tuning. When properly set up, modern content processing systems can hit 90 percent. Human indexers, if they are really good, hit in the 85 to 95 percent range. Keep in mind that humans sometimes learn intuitively how to take short cuts. Software learns via fancy algorithms and doesn’t take short cuts. Both humans and machine processing, therefore, have their particular strengths and weaknesses. The best performing systems with which I am familiar rely on humans at certain points in system set up, configuration, and maintenance. Without the proper use of expensive and scarce human wizards, modern systems can veer into the ditch. The phrase “a manager will look at things differently than a salesperson” is spot on. The trick is to recognize this perceptual variance and accommodate it insofar as possible. A failure to deal with the intensely personal nature of some types of search issues is apparent when you visit a company where there are multiple search systems or a company where there’s one system–such as the one in use at IDC–and discover that it does not work too well. (I am tempted to name the vendor, but my desire to avoid a phone call from hostile 20-year-olds is very intense today. I want to watch some of the playoff games on my couch potato television.)
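Those accuracy percentages imply a measurement: compare the terms the software assigns against a human-built gold standard and count the matches. A bare-bones version of that scoring, under the assumption that simple tag overlap is the yardstick, looks like this:

```python
# Sketch: score automatic tagging against a human-built gold standard.
def tag_accuracy(machine: dict[str, set[str]], gold: dict[str, set[str]]) -> float:
    correct = total = 0
    for doc_id, expected in gold.items():
        assigned = machine.get(doc_id, set())
        correct += len(assigned & expected)
        total += len(expected)
    return correct / total if total else 0.0

gold = {"doc1": {"contract", "warranty"}, "doc2": {"invoice"}}
machine = {"doc1": {"contract"}, "doc2": {"invoice", "billing"}}
print(f"{tag_accuracy(machine, gold):.0%}")  # 67% here; tuning moves systems toward the 90 percent mark
```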

Point 6: Fast’s Search Better than Google’s Search

Mr. Lai raises a question that plays to America’s fascination with identifying the winner in any situation.

We’re back to a life-or-death, winner-take-all knife fight between Google and Microsoft. No search technology is necessarily better or worse than another. There are very few approaches that are radically different under the hood. Even the highly innovative approaches of companies such as Brainware and its “associative memory” approach or Exegy with its juiced up hardware and terabytes of on board RAM appliance share some fundamentals with other vendors’ systems. If you slogged through my jejune and hopelessly inadequate monographs, The Google Legacy (Infonortics, 2005) and Google Version 2.0 (Infonortics, 2007), and the three editions I wrote of The Enterprise Search Report (CMSWatch.com, 2004, 2005, 2006), you will know that subtle technical distinctions have major search system implications. Search is one of those areas where a minor tweak can yield two quite distinctive systems even though both share similar algorithms. A good example is the difference between Autonomy and Recommind. Both use Bayesian mathematics, but the differences are significant. Which is better? The answer is, “It depends.” For some situations, Autonomy is very solid. For others, Recommind is the system of choice. The same may be said of Coveo, Exalead, ISYS Search Software, Siderean, or Vivisimo, among others. Microsoft will have some work to do to understand what it has purchased. Once that learning is completed, Microsoft will have to make some decisions about how to implement those features into its various products. Google, on the other hand, has a track record of making the behind-the-firewall search in its Google Search Appliance better with each point upgrade. The company has made the GSA better and rolled out the useful OneBox API to make integration and function tweaking easier. The problem with trying to get Google and Microsoft to square off is that each company is playing its own game. Socratic Computerworld professionals want both companies to play one game, on a fight-to-the-death basis, now. My reading of the data I have is that a Thermopylae is not in the interests of either Google or Microsoft, now or in the near future. The companies have different agendas, different business models, and different top-of-mind problems to resolve. The future of search is that it will be invisible when it works. I don’t think that technology is available from either Google or Microsoft at this time.
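For readers who wonder what “Bayesian mathematics” means in this context: at bottom, such systems estimate how probable a document’s terms are under each topic and rank or route accordingly. The classifier below is a textbook naive Bayes sketch, nothing like the proprietary implementations of Autonomy or Recommind.

```python
# Textbook naive Bayes sketch: score a document against topics learned from examples.
import math
from collections import Counter

def train(examples: dict[str, list[str]]):
    """examples: topic -> list of training documents (plain strings)."""
    counts, vocab = {}, set()
    for topic, docs in examples.items():
        counts[topic] = Counter(w for d in docs for w in d.lower().split())
        vocab |= set(counts[topic])
    return counts, vocab

def classify(text: str, counts, vocab) -> str:
    best_topic, best = None, -math.inf
    for topic, c in counts.items():
        total = sum(c.values())
        # Add-one smoothing so unseen words do not zero out the whole score.
        logp = sum(math.log((c[w] + 1) / (total + len(vocab))) for w in text.lower().split())
        if logp > best:
            best_topic, best = topic, logp
    return best_topic

counts, vocab = train({"legal": ["contract warranty clause"], "sales": ["pipeline quota deal"]})
print(classify("review the warranty clause in the contract", counts, vocab))  # prints: legal
```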

Point 7: Consolidation

Mr. Lai wants to rev the uncertainty engine, I think. We learn from the FAQ that search is still a small, largely unknown market sector. We learn that big companies may buy smaller companies.

My view is that consolidation is a feature of our market economy. Mergers and acquisitions are part of the blood and bones of business, not a characteristic of the present search or content processing sector. The key point that is not addressed is the difficulty of generating a sustainable business selling a fuzzy solution to a tough problem. Philosophers have been trying to figure out information for a long time and have done a pretty miserable job as far as I can tell. Software that ventures into information is going to face some challenges. There’s user satisfaction, return on investment, appropriate performance, and the other factors referenced in this essay. The forces that will ripple through behind-the-firewall search are:

  • Business failure. There are too many vendors and too few buyers willing to pay enough to keep the more than 350 companies sustainable.
  • Mergers. A company with customers and so-so technology is probably more valuable than a company with great technology and few customers. I have read that Microsoft was buying customers, not Fast Search & Transfer’s technology. Maybe? Maybe not.
  • Divestitures and spin-outs. Keep in mind that Inxight Software, an early leader in content processing, was pushed out of Xerox’s Palo Alto Research Center. The fact that it was reported as an acquisition by Business Objects emphasized the end game. The start was, “Okay, it’s time to leave the nest.”

The other factor is not consolidation; it is absorption. Information is too important to leave in a stand-alone application. That’s why Microsoft’s Mr. Raikes seems eager to point out that Fast Search would become part of SharePoint.

Net-Net

The future, therefore, is that there will be less and less enthusiasm for expensive, stand-alone “behind-the-firewall” search. Search is becoming part of larger, higher-value information access solutions.

Stephen E. Arnold
January 13, 2008
