Apple Going Its Own Way in Search

May 8, 2008

On May 6, 2008, the USPTO granted US 7,369,987 to Apple Inc. In my research for Beyond Search, one source told me that Apple was having some “difficulties” with its search-and-retrieval system for iTunes and OS X. I dismissed the comment because I had no corroboration. Apple is paranoid about what it does and how it does it. I was, therefore, intrigued by the invention disclosed as a “Multi-Language Document Search and Retrieval System”.

I’m no attorney, so you will need to download the document from the wonderful search system provided without charge by the US Patent & Trademark Office. Please, pay close attention to the syntax the USPTO’s outstanding search system requires. Google-style queries won’t work on this puppy.

Apple’s invention, according to US 7,369,987, is:

A multi-lingual indexing and search system … that performs tokenization and stemming in a manner which is independent of whether index entries and search terms appear as words in a dictionary.

The disclosures in this document make it clear that Apple, like Google and Microsoft, is poking around in similar algorithmic gardens. The claims put Apple in the search game. The document makes for interesting reading if you like legalese and information retrieval jargon. Maybe iTunes’ search system will be juiced. I’m pretty happy with the built-in search function on my trusty Mac.

Stephen Arnold, May 8, 2008

Teragram: SAS’s Search Launchpad

March 20, 2008

This week SAS announced that it purchased Teragram, a content processing company with deep roots in computer science and mathematics, and a roster of blue-chip clients. If you poke around Teragram’s Web site, you learn that the company supports double-byte languages. If I read the Teragram information correctly, this little-known outfit not far from Harvard Yard has proprietary technology strongly suggestive of the super-sophisticated techniques in use at IBM, Google, Microsoft, and Yahoo.

The Teragram system can match other systems’ advanced functions, feature for feature. NLP (natural language processing)? Automatic summarization? No problem. Hosted services option? Check. Autonomy- and Recommind-type pattern matching? Done. Attensity- and Bitext-style linguistic analysis? Covered. Teragram has a warehouse chock full of search and content processing goodies.

Now SAS owns this “search tech” tool box.

Teragram, founded in 1997, was a privately-held content processing company in Cambridge, Massachusetts. Two wizards — both from Luxembourg — have applied their computer science and mathematical expertise to unstructured information for more than a decade. That’s a long time in the fast-moving search and text processing sector.

I learned about Teragram when someone told me that the company was a technology provider to Fast Search & Transfer SA. Fast Search’s Dr. John Lervik is a canny technologist, and he has a good nose for solid technology.

Arikus: Going Beyond Search since 1997

March 10, 2008

In Boston, Massachusetts, on Wednesday, March 5, 2008, a person engaged me and raved about the Arikus search engine. I recall writing about this system in 2002, when I was creating an inventory of Canadian search engines for a client. My contact was, I believe, Markus Gunn, now an advisor to the company. He was a wonderful telephone chat. To be honest, Arikus slipped off my radar until this bouncy cherub gave me the Arikus sales pitch.

The company offers what it calls “enterprise search and categorization.” As you may know, I don’t think too much of the phrase enterprise search. I’ve argued for more than 18 months that enterprise search is a Pandora’s box, leaving most users frustrated and angry about sluggish response, unfindable content, and features that lag Google.com, Live.com, and Yahoo.com by a country mile. The categorization function is not search exactly. It’s one of the rich content processing options introduced to allow users to point and click their way through topics, suggestions, Use For ideas, and See Also references. Endeca cranked up the volume on this notion almost a decade ago. Arikus has offered these features for more than a decade, yet the company remains in stealth mode in the US.

Background

Arikus, founded in 1997, was a privately-held company when I first encountered the firm. In 2003, Arikus wanted to “help customers manage information.” The focus of the firm and its Canadian and UK technology team was to “develop and market data management technologies that transform unstructured, scattered information into a business asset.” The idea was to “increase the value of business applications”.

The core product used to be the Aire server. Today, the company’s technology is described this way on the firm’s Web site: “Arikus unlocks a company’s information resources, providing knowledge workers and web site visitors with access to the right information, right away….Arikus brings together a company’s external and internal data resources and allows it to use this data on its own terms; automated information management, plain language precision search, and navigation software for self-service applications.”

The company’s principal products are under the “inContext” banner. The products available today include:

  • An inContext enterprise solution. A 15-day evaluation version is available via a direct download here.
  • An inContext developer solution
  • An inContext self-service solution called AnswerIT. A seven-day evaluation version is available by contacting sales at arikus.com

Each of these solutions comes with an SDK. Details are located on the Arikus Web site here. You will have to fill out a form without skipping a field.

Technology

Arikus incorporates a variety of technologies in its system. True to its Cambridge University roots, there’s a hefty statistical component. In addition, the company has integrated linguistic technologies. The system can handle unstructured information like Word documents and email. Arikus can also process structured data. The user interface can be configured with point-and-click options, a search box, and fill-in-the-blank prompts for parametric searches.

The company makes an interesting claim for its approach. Let me quote from the Arikus Web site:

Arikus technology goes further than any other known industry document ranking technology in delivering accurate, relevant and quality results. Arikus begins by analyzing an end user query to determine word usage, grammatical inflections, phrase structure, and sentence structure. An analysis is performed on target documents to determine how words are distributed in a document and what the relationships are between the words in the document. Sentence analysis is performed to determine both a position-independent and a position-dependent score for each sentence in the document. This step not only improves the accuracy, relevancy, and quality of returned results but also allows for the determination of the “best summary”, that is, the passage that best answers the user’s question. This is entirely unique to Arikus.

I have added bold italic to the assertions that I believe warrant head-to-head testing with other search and content processing solutions.
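To make the sentence-scoring idea in the Arikus quote concrete, here is a minimal sketch in Python. It is not Arikus’s method, which the company does not disclose. It assumes the position-independent score is the share of query words a sentence contains and the position-dependent score is a simple bonus for sentences near the top of the document. The function names and the `position_weight` parameter are hypothetical.

```python
import re


def split_sentences(text):
    """Split on whitespace that follows a period, question mark, or exclamation point."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def word_set(text):
    """Lowercase the text and return its distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_summary(query, document, position_weight=0.25):
    """Return the sentence with the highest combined score, or None for an empty document.

    Assumed scoring, for illustration only:
      position-independent = fraction of query words found in the sentence
      position-dependent   = 1 / (1 + sentence index), so earlier sentences score higher
    """
    query_words = word_set(query)
    best, best_score = None, -1.0
    for index, sentence in enumerate(split_sentences(document)):
        if query_words:
            overlap = len(query_words & word_set(sentence)) / len(query_words)
        else:
            overlap = 0.0
        position = 1.0 / (1 + index)
        score = overlap + position_weight * position
        if score > best_score:
            best, best_score = sentence, score
    return best
```

A head-to-head test of the vendor’s claim would swap a function like this for each competing system and compare the passages returned for the same queries.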

Taxonomy Support

A plus for Arikus is its taxonomies. You can find information about them here. If you are unsure what a taxonomy’s main headings should be for business, for example, you can cadge these terms from the Arikus PDFs. What a wonderful jump start! Kudos, Arikus.

inContext supports knowledge bases. You can license industry-specific taxonomies from Arikus. Should your organization have a taxonomy, you can use that scheme with your Arikus installation. Arikus’s taxonomies are based on contributions and information from established business, scientific, and commercial publishers.

System in Action

You can see the inContext system in action via the Arikus demonstration located here. The system is limited, and you may want to contact the company directly for a more detailed demonstration. When I tested the system on March 10, I received an error message. However, the original Arikus system results looked like this. Note: this is a screen shot provided by Arikus to me in 2002, and I believe a version of this screen shot appeared in the first edition of the Enterprise Search Report. Arikus is no longer included in the fourth edition, but if you have the first edition at hand, you can see my full write up plus the upsides and downsides of the system based on my analysis in 2002:

[Screen shot: Arikus results interface, 2002]

Features of this interface include:

  • A central results list with a machine-generated summary of the document. I found these summaries useful. The snippets shown in the Google Search Appliance, in contrast, can be cryptic
  • A listing of topics in the result set. The number to the right of the concept indicates the number of documents in that topic
  • The publication types in the result set
  • Authors of documents. Again the number to the right of the author’s name indicates the number of documents by that person
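The topic and author lists with document counts are what most systems call facets. As a rough illustration, and not a description of how Arikus builds them, the counts can be computed by tallying a metadata field across the result set. The field names `author` and `topic` below are hypothetical.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many result documents carry each value of a metadata field.

    `results` is a list of dicts. A field may hold one value or a list of values.
    Returns (value, count) pairs, most frequent first.
    """
    counts = Counter()
    for doc in results:
        value = doc.get(field)
        values = value if isinstance(value, list) else [value]
        for item in values:
            if item is not None:
                counts[item] += 1
    return counts.most_common()
```

Each pair maps to one line in the side panel: the label, then the number to its right.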

This screen shot from 2002 shows that Arikus’s engineers anticipated the direction of Microsoft SharePoint in general and, in particular, the third-party interface enhancements from Interse in Denmark.

Customers

Arikus provides few clues to its customers. The company has a deal with Liquid Litigation Management Inc. The inContext system is used for litigation document management. LLM uses the system for automatic content aggregation, indexing, key word search, guided search (point-and-click exploration of information), and content categorization. Other customers include Cold North Wind, whose bubbling employee prompted me to dig out my Arikus files, and Monitor 24-7, Inc. (which in turn sells Arikus-powered solutions to Epson Europe and the Mayo Clinic, among other organizations).

Wrap Up

My notes suggest that a version of Arikus once cost about $1,000. You will need to contact the company to get a current price for its various products. Information about the company is available from Alacra, but such sources often provide high-level information, not the nitty gritty that you probably want. The president of the company is Aris Zakinthinos. You can contact Arikus by calling 1 416 410-8701.

Clustify: Identifying Similar Documents

March 9, 2008

In my recent lecture about eDiscovery in Boston, March 6, 2008, several people engaged me after my presentation. One of these interlocutors wanted my view on a company called Hot Neuron LLC. I said, “I’ll check my files.”

When I staggered into my log cabin in Harrod’s Creek, Kentucky, a parcel awaited me. I chopped it open with the flint knife popular in these parts. An envelope labeled “Cluster-Text.com” held a Clustify mouse pad. Coincidence or very expensive direct mail campaign?

A quick dip into my electronic files and a gander at Hot Neuron’s Web site reminded me that Clustify is in the eDiscovery category of Beyond Search tools. An attorney with too little time and too many documents can use Clustify to get her arms around the documents, tag them, and tackle the most significant documents in key clusters first.

Clustify can deliver an overview of the document set with documents neatly organized by categories. Clustify groups documents by analyzing the text to identify the structure that arises naturally.

Bill Dimm, founder of Hot Neuron, told me in a thoughtful email:

Clustify organizes the documents into clusters, and it labels each cluster with keywords so that you can see what it is about. It can sort the clusters by the number of documents they contain, so you can quickly see what the most significant topics are for the document set, but it doesn’t, by itself, put the documents into a hierarchy of categories. It does, however, provide a tagging tool that allows the user to define his/her own hierarchy of categories and very efficiently put the documents into the categories.

An attorney or paralegal can mark documents and “hook” them to other documents or topics. The ability to examine documents together can shave hours off a task that is tedious and susceptible to interruption. The software can speed the process of manual categorization by allowing you to make decisions one cluster at a time, instead of one document at a time. Clustify can enhance your search engine by grouping search results or finding relevant documents that don’t exactly match a query.

This essay provides a preliminary impression of Hot Neuron’s offerings.

The Names

Let me provide some guidance about the three names, Clustify, Cluster-Text.com, and Hot Neuron LLC. Clustify is the name of the clustering technology and software. The URL is www.cluster-text.com, where you will find information about the product, contact information, etc. Hot Neuron LLC is the name of the company selling the software and operating the Web site. MagPortal.com is a component of Hot Neuron where the company’s for-fee news feed service and financial information components are available. MagPortal appears to use portions of the Clustify technology, and a screen shot of a result set appears elsewhere in this essay.

The Company

The privately-held Hot Neuron is the brain child of William Dimm, a theoretical physicist from Cornell University. Dr. Dimm left the university and physics in 1995. (For those of you who don’t associate Cornell with search and content processing, please, recall that Dr. Gerard Salton, né Gerhard Anton Sahlmann, put Information Retrieval into the consciousness of computer scientists world wide first at Harvard University, then at Cornell University. More information about Dr. Salton and his SMART innovations is here.)

Hot Neuron offers a number of useful products. These include specialized financial data components that you can plug into your Web site. With these widgets, you can assemble a view of the US and Canadian financial markets. The company also offers for-fee news feed services and an affiliate program.

Information retrieval and physics are tightly bound. Spend a few days wandering around Google, for instance, and you will find that physicists are as much a part of the Google ecosystem as mathematicians and computer scientists.

Mr. Dimm has dabbled in the commercial world for many years. For example, he has worked in the financial services industry fiddling with models for interest rate derivatives. Hot Neuron is the company set up to commercialize Clustify.

Hot Neuron LLC is an information retrieval software and services company located in Bryn Mawr, Pennsylvania.

Clustify

Clustify is document clustering software. This means that the software processes documents and groups them according to their similarity. A definition I have in my files from Carnegie Mellon University says: “Document clustering is the act of collecting similar documents into bins, where similarity is some function on a document.”
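The definition is easier to grasp with a toy example. The Python sketch below is not Hot Neuron’s proprietary algorithm. It assumes the simplest possible choices: similarity is the cosine between word-count vectors, and each document joins the first bin whose seed document is similar enough. The `threshold` value is arbitrary.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Represent a document as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(docs, threshold=0.5):
    """Greedy one-pass clustering; returns bins as lists of document indexes.

    The first document placed in a bin acts as its seed. A later document joins
    the first bin whose seed it resembles at or above `threshold`; otherwise it
    starts a new bin.
    """
    vectors = [vectorize(d) for d in docs]
    bins = []
    for index, vector in enumerate(vectors):
        for members in bins:
            if cosine(vector, vectors[members[0]]) >= threshold:
                members.append(index)
                break
        else:
            bins.append([index])
    return bins
```

A one-pass approach like this is order dependent and crude. Commercial systems have to solve the same problem with far better quality and at a scale of millions of documents.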

Clustify, according to the company, “identifies important keywords for each cluster, giving quick insight into the document set. It also allows the user to create a hierarchy of custom tags that can be applied to individual documents, all documents in a particular cluster, or all clusters containing a particular combination of keywords, allowing the user to categorize hundreds of documents with a single mouse click.”

The system allows you to create a hierarchy of custom tags. These can then be applied to individual documents, the documents in a specific cluster, or those clusters containing certain key words. You can, according to the company, categorize many documents by pointing and clicking.

Clustering is useful in legal discovery, competitive intelligence, and general research.

Technology

Hot Neuron’s technology uses a proprietary algorithm that scales and delivers good cluster quality. According to the company, Clustify can cluster 1.3 million Wikipedia entries on a desktop computer running Linux in 20 minutes or 50 minutes under Microsoft Windows. The system outputs files with these extensions: CYI, CYO, and CYS.

Clustify can generate concept-based clusters, or it can require documents in the same cluster to contain identical passages of text. The latter option is useful for identifying near-duplicates, which can cut the cost of electronic discovery further than simple deduplication.
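Near-duplicate detection based on identical passages can be illustrated with word shingles. This is a generic technique, not necessarily what Clustify does. The sketch assumes a passage is a run of five consecutive words and measures the share of passages two documents have in common (Jaccard overlap).

```python
import re


def shingles(text, size=5):
    """Return the set of all runs of `size` consecutive lowercase words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def shared_passage_ratio(a, b, size=5):
    """Jaccard overlap of the two documents' shingle sets, from 0.0 to 1.0."""
    set_a, set_b = shingles(a, size), shingles(b, size)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)
```

Two versions of the same email that differ by a word or two score high, while documents that merely share a topic score near zero. That is why this kind of check catches what exact deduplication misses.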

[Screen shot: MagPortal.com search results]

This screen shot from Hot Neuron’s MagPortal.com Web site provides a glimpse of the Clustify functionality. (Note: I’m using MagPortal.com to illustrate the company’s technology. MagPortal.com charges a fee for some of its information services.) These Clustify features jumped out at me when I tested the system on March 9, 2008. Please, keep in mind that some of these features are particular to MagPortal.com and not included in the Clustify license:

  1. Search results with one-click access to similar articles. The icon for this function is the orange flag icon in each item in the results list
  2. A “Browse Main Topics” side bar. A user can scan these items instead of firing random key words into the search box, an activity meeting more and more user resistance
  3. Drop down boxes to specify what to search or to narrow the results by date, publication, or category. (Google captures date and time information, but as far as I know, those metatags are not yet available to the run-of-the-mill user like me.)
  4. A list of “Related Categories” front and center in the shaded box above the search results.

I also liked the display of the results. Breaking out the source, author, and date is a common sense feature that other systems would do well to emulate. I find scanning results for these key meta items not only annoying but fatiguing. I have a hard time, when tired, differentiating light blue, light green, and other Web 2.0 colors. Kudos to MagPortal.com for making its results easy for me to scan.

About the MagPortal.com Search Engine

The MagPortal.com search engine tries to do a case-insensitive match of your query against the article title, description, authors, and the body of the article. When the engine processes your query, it slices the query into “words”. Articles that match the words in your query, minus stop words, are displayed in the results list. By default, the output is ordered by the “quality” of the match. The quality is determined by a mathematical formula (the standard term-frequency inverse-document-frequency algorithm, not related to Hot Neuron Similarity) which takes into account how often the search term appears in the document (relative to the total length of the document) and how rare that particular search term is among all documents.
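For readers who have not seen term-frequency inverse-document-frequency ranking, here is a minimal Python sketch of the steps just described. It is an illustration, not MagPortal.com’s code. The stop word list, the tokenizing rule, and the exact weighting (`log(N / df) + 1`) are my assumptions, since there are many variants of the formula.

```python
import math
import re

# Illustrative stop word list; the real list is an unknown.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in"}


def tokenize(text):
    """Lowercase the text, slice it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def rank(query, docs):
    """Return indexes of matching documents, best match first.

    For each query term:
      tf  = occurrences in the document / document length in words
      idf = log(number of documents / documents containing the term) + 1
    A document's score is the sum of tf * idf over the query terms.
    """
    tokenized = [tokenize(d) for d in docs]
    query_terms = set(tokenize(query))
    scored = []
    for index, words in enumerate(tokenized):
        if not words:
            continue
        score = 0.0
        for term in query_terms:
            doc_freq = sum(1 for other in tokenized if term in other)
            if doc_freq == 0:
                continue
            tf = words.count(term) / len(words)
            idf = math.log(len(docs) / doc_freq) + 1.0
            score += tf * idf
        if score > 0:
            scored.append((score, index))
    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored]
```

A short document that repeats the search term outranks a long document that mentions it once, and a rare term counts for more than a common one.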

Mr. Dimm told Beyond Search:

MagPortal.com has been around for eight years. We primarily index online magazine articles (not newspapers) on MagPortal, with the index being updated once each business day (magazines don’t change very fast). Anyone can browse and search the articles for free. The fee-based feed service allows clients to display a subset of the MagPortal data and functionality on their own Web sites. The client licenses a set of topics/categories relevant for their site, and their site users can browse/search the article data on their site much as they would on MagPortal. We also have a free version of the feed service, with limited functionality, available for non-commercial use here. For the financial data components (stock charts, etc.), we are actually just reselling feeds from another company because we’ve encountered clients that want both articles and charts.

More Information

For pricing, availability, and licensing details, contact Clustify at hotneuron.com.

Stephen Arnold, March 9, 2008, with an update at 5:35 pm Eastern

Lemur Flax: Open Source Search Pressure Rises

March 8, 2008

LastMinute.com founders Brent Hoberman and Martha Lane Fox have a new venture, MyDeco.com. What’s interesting about the furniture and home design site is that its search engine, based on the information available to me, is Lemur Consulting’s Flax. (Yanks will have to be careful when searching for information, because there are Mac utilities, restaurants, and clothing sites with “flax” domain names.)

Let’s get the links out of the way first:

  • The MyDeco beta site is at MyDeco.com
  • The link to Lemur’s free version of Flax is tucked away on the company’s Web site here and is accessible from Google Code as well
  • The link to the start page for Flax information is at Flax.co.uk
  • Information about Lemur Consulting Ltd. is at LemurConsulting.com

MyDeco.com Deployment

The screen shot below shows that the Flax system can deliver a feature-rich experience when properly set up, configured, and tuned. Keep in mind that Lemur does charge a fee for its services and its version of the Flax system.

[Screen shot: MyDeco.com search results for “corner sofa”]

You can replicate this search for corner sofa by entering the query in the MyDeco.com search box. Note these features, please:

  • Tabs that allow one click access to the corner sofas in the result list in a “room” setting and articles about corner sofas
  • A price range slider to minimize the need for a user to type range values
  • A catalog-style results list
  • Across the top of the interface is a dark gray bar with one-click access to buying a “look”, content, social / community functions, and an FAQ

What’s interesting is that the Flax system for a single user’s desktop is available without charge from Lemur. Open source repositories have the code too. Why? The search system is free, although some restrictions apply. I’m not going to summarize the details, which are clearly stated on the Lemur Web site, but you can now, with a bit of fiddling, replicate the type of features available from commercial systems at a fraction of the price. The MyDeco.com implementation, which is in beta, could fool a user into assuming that the site’s search system was provided by Dieselpoint, Endeca, or Fast Search & Transfer. Unless my research is dead wrong, Flax sure looks like one of these industrial-strength solutions that can cost upwards of six figures to get up and running.

What’s Flax Do?

Flax can build searchable indexes of millions of documents such as Microsoft Office files, HTML, and Adobe PDFs. The default interface can be easily customized. Flax runs on Windows, Mac, and Linux. When properly resourced, the system is responsive and displays results quickly in my tests.

A Little about the Plumbing

Flax is based on Xapian, an open source platform. Lemur’s engineers have added software to handle more file types, including email and structured information in database management systems. Information about Xapian is available here.

Lemur’s Value Adds

Lemur provides its proprietary enhancements; namely, the adapters to handle file types, a method to provide support for large indexes, and middleware to make it easy to integrate Lemur Flax into enterprise applications. The company also provides various default interfaces to allow quick customization for specific client needs.

Lemur provides professional services to support Flax. The firm’s engineers will customize, install, and configure Flax to meet your specific requirements. The company also offers an on-going maintenance service and will handle software maintenance and ad hoc technical support as well.

To get a custom price quotation, you can contact Lemur, based in Cambridge, England, by sending an email to info at lemurconsulting.com.

Observations

You may want to try out Flax’s basic version. It can be installed on a single desktop PC. You can index a large number of files quickly. Note that Adobe PDF files consume more CPU cycles. Our test corpus processed in less than five minutes, which is comparable to the throughput of other desktop systems such as IBM Yahoo OmniFind, for example. OmniFind is based on the Lucene open source code. Unlike other “free” search products, Flax Basic does not put a limit on how many documents you can index. An FAQ is available, and the documentation for a free product is quite good.

Three other observations are warranted in my opinion:

  1. The use of open source code “wrapped” with proprietary middleware and equipped with customized widgets is increasing. I’ve written about Tesuji, a Hungarian vendor, offering a very good search system built on Lucene. These systems are good and can be used with confidence. It doesn’t take a month of research to figure out that open source search doesn’t pose much of a risk. If the price is right for you, you can save on license fees and deploy a very good search system.
  2. The idea of providing professional services as the core product is gaining momentum. When Verity reported that it was generating more than half its revenue from professional services shortly before it was acquired by Autonomy, the writing on the wall was clear. Trying to build a search company on license fees alone is not a revenue model with stamina. Search is not just a commodity; it’s a give-away or at least much less expensive than the products from high-flying, high-profile companies.
  3. The functionality of Flax, Tesuji, and other open source search plumbing is very good. In fact, the features of Flax are comparable to the six-figure products on offer in the 2004-2005 time period. Furthermore, these open source systems are improving rapidly.

What does this mean for commercial vendors of search? Two things.

First, the open source products will exert price pressure on more traditional vendors. IBM, Microsoft-Fast, and Oracle (where there’s been some turnover in the enterprise search unit) may find that their search solutions will have to be given away free or bundled with other enterprise software. Who will be willing to pay a premium for search solutions known to be complex, expensive to maintain, or, in the case of IBM, based on open source code?

Second, with each Flax or Tesuji, procurement teams and systems professionals have an increased opportunity to learn about the benefits of open source software. Most commercial organizations are conservative turtles. But financial pressure and cost overruns from better-known search vendors may combine to let Lemur Consulting and Tesuji, to cite two open source repackagers, get their moist noses under the enterprise tent.

Traditional enterprise search vendors — what I call “behind the firewall search” — face a significant incursion of open source into their territory. Arrogant dismissals of open source solutions won’t work the way they used to in the good old days. In fact, those good old days are gone. Can proprietary enterprise search solutions be today’s buggy whips?

Stephen Arnold, March 8, 2008

SharePoint: Another “Free” Behind-the-Firewall Search System?

March 3, 2008

It’s 6 am in cheery Louisville International Airport, but the word “international” can be misleading. The news this morning is that Microsoft will roll out a “new” SharePoint search service. You can read the breathless InfoWorld story here. The announcement will be made, I believe, at one, maybe two, separate Microsoft conferences this week.

The “free” word is a powerful marketing tool for commercial firms. When it comes to behind-the-firewall search, “free” is a synonym for demonstration product. The set up, configuration, debug process, optimization, and operation of a search or content processing system come with some hefty costs. The license fee is, of course, the cost that the gullible seize upon. When you root around in the financial statements of publicly-traded companies in the search and retrieval business, you find that many are trying to follow in Verity’s pre-sell out footsteps. Specifically, vendors want to pump up consulting fees, making them carry the freight for earnings and growth. My recollection is foggy after seven consecutive days of travel, but Verity was generating more than half its revenues from non-license revenue. The number 65 percent pops in and out of my memory, but I’m going to have to dig through my files to verify this. As license revenues flat line (a common problem for some search vendors), cash can be generated by selling services. These are higher margin than a license fee with yearly maintenance fee add ons. Services can be open ended, and have a certain upside revenue charm for certain software vendors.

“Free” Search Systems: A Marketing Tactic

The idea is that you can install a working version of the program, get a sense of its basic features, and kick the tires. When we tested the “free” IBM-Yahoo OmniFind search system a few months ago, it worked quite well, but it had a document limit. My recollection is that most of the “free” systems have some type of governor on the system. The reason is that the “free” system is a way to qualify sales leads. When a user needs to process more content or perform some magic such as integrating the system into a third-party application, the vendor jumps with joy. A real sales lead has landed in her lap without booth duty, blogging, or hammer dialing.

Microsoft has jumped into the “free” fray with a beefed up search function for SharePoint. The SharePoint system has been in the forefront of the “knowledge management” revolution. The idea is that a Web-like interface makes it possible for a user to find, edit, share, and connect with colleagues, their documents, or related content. The word “portal” is sometimes used to describe this multi-function interface.

My sources tell me that SharePoint has more than 100 million users worldwide. This is a significant jump from the 65 million users I had learned of in the fourth quarter of 2007. Microsoft SharePoint is on a roll. When we install a robust content management system designed to work in a Microsoft-centric environment, SharePoint is a required “server”. In fact, to make these high-end CMS systems function, we typically install SQLServer, Windows Server, and IIS (Internet Information Services), among others. I may be wrong in how I perceive this server conga line, however.

Microsoft Search Systems

In my analyses of SharePoint search in the first three editions of the Enterprise Search Report, I summarized these separate search systems for SharePoint.

  • SharePoint search with a “blue” interface
  • SharePoint search with a “green” interface
  • SQLServer search
  • Microsoft tool bar search
  • Start button / Explorer search
  • Microsoft’s http://search.live.com

Without repeating that 40-page analysis and tromping over the rights I assigned to CMSWatch.com, I can’t go into much detail about what each of these different search systems does. But what I can tell you is that there is not “one” search system available when you implement a SharePoint search.

What’s New?

The “free” system is Search Server 2008 Express. Express was rolled out last year and includes metatag functions so results can be sorted. You can also click on a colleague’s name and see documents written by that person. Keep in mind that SharePoint is not breaking new ground here. SharePoint is adding features that have been available from Certified Gold Partners like Coveo and Mondosoft, among others, for a couple of years. What’s new is that anyone will be able to download Express and give it a whirl. My understanding was that only certain customers would be able to experiment with the Express system. I don’t have a download link, which I think will be available in the near future. You can also download a version of Silverlight to hook visualization into search results. Again, this is a feature that has been available from such vendors as Inxight Software (now part of Business Objects and owned by SAP) for more than a decade.

Observations

I am intrigued with this “free” version of Express. When I look at it in terms of Autonomy, I see a counter to Autonomy’s UltraSeek solution. UltraSeek, developed when Steve Kirsch was at InfoSeek, is a useful system acquired when Autonomy gobbled up Verity in December 2005. Autonomy, according to my sources, has had some success upselling UltraSeek users to more robust search and retrieval solutions.

When I compare the different “flavors” of SharePoint search with offerings from Microsoft Certified Gold Partners, I am somewhat uncertain about the Microsoft approach. For example, Interse, a company with a modest profile in Harrod’s Creek, Kentucky, offers software that manipulates the metadata available in SharePoint repositories. Also, Fast Search & Transfer coded an adapter for SharePoint. With this code widget, a SharePoint customer could use the functionality supported by the Fast ESP (enterprise search platform). In addition, there are a number of companies offering enhancements to SharePoint.

There are so many search, indexing, and content processing options for SharePoint for two reasons, in my opinion. First, Microsoft encouraged its partners to create these products. Second, the SharePoint search is not as easy to use for system administrators as it could be. (Forget “good” because most search and retrieval systems leave as many as two-thirds of their users griping.)

I will be interested to see how Microsoft handles the Certified Gold Partners who might feel a bit of competitive pressure. I’m also interested to see how the SharePoint platform will be mapped to the FAST enterprise search platform. (There are some areas of overlap and a few interesting technical issues to resolve.)

To wrap up, I urge you to download and install the Express search function. You are canny enough to know that you should check out these systems vendors as well:

  1. Coveo (Canada)
  2. Exalead (France)
  3. ISYS Search Software
  4. Mondosoft (now part of SurfRay in Denmark)

You can get a copy of Enterprise Search Report (now in its 4th edition) or place an order for my Beyond Search study, which will be available in April 2008.

SharePoint is a useful system, and it isn’t going to be displaced by a competitive system anytime soon. Keep in mind that it’s complex. You know behind-the-firewall search is complex. So “free” doesn’t mean without cost. You will have to throw time, programmers, and effort at anyone’s “free” search system. That goes for anyone who offers a “free lunch” to you.

Stephen Arnold, March 3, 2008

Delphes: A Low-Profile Search Vendor

February 17, 2008

Now that I am in clean up mode for Beyond Search, I have been double-checking my selection of companies for the “Profiles” section of the study. In a few days, I will make public a summary of the study’s contents. The publisher — The Gilbane Group — will also post an informational page. Publication is likely to be very close to the previously announced target of April 2008.

Yesterday, I used the Entopia system as the backbone of a mini-case study. Today — Sunday, February 17, 2008 — I want to provide some information about an interesting company not included in my Beyond Search study.

The last information I received from this company arrived in 2006, but the company’s Web site explicitly copyrights its content for 2008. When I telephoned on Friday, February 15, 2008, I went to voice mail. Therefore, I believe the company is in business.

Delphes, in the tradition of search and content processing companies, is a variant of the English word Delphi. You are probably familiar with the oracle of Delphi. I think the name of the company is intended to evoke a system that speaks with authority. As far as I know, Delphes is a private concern and concentrates its sales and marketing efforts in Canada, Francophone nations, and Spain. When I mention the name Delphes to Americans, I’m usually met with a question, “What did you say?” Delphes has a very low profile in the United States. I don’t recall seeing the company on the program of the search-and-retrieval conferences I attended in 2006 or 2007, but I go to a small number of shows. I may have overlooked the company’s sessions.

The Company’s Approach

The “guts” of the Delphes search-and-retrieval system are based on natural language processing embedded in a platform. The firm’s product is marketed as Diogene, another Greek variant. Diogenes, as you know, was a popular name in Greece. I assume the Diogenes to which Delphes refers is Diogenes of Sinope, sometimes remembered as the Cynic. (More information about Diogenes of Sinope is here.)

Diogene extracts information using “dynamic natural language processing”. The iterative, linguistic process generates metadata, tags concepts, and classifies information processed by the system.

The company’s technology is available in enterprise, Web, and personal versions of the system. DioWeb Enterprise is the behind-the-firewall version of the product. You can license DioMorpho, which is for an individual user on a single workstation. Delphes works through a number of partners, and you can deal directly with the company for an on-premises license or an OEM (original equipment manufacturer) deal. Its partners include Sun Microsystems, Microsoft, and EMC, among others.

When I first looked at Delphes in 2002, the company had a good reputation in Montréal (Québec), Toronto and Ottawa (Ontario). The company’s clients now include governmental agencies, insurance companies, law firms, financial institutions, healthcare institutions, and consulting firms, among others. You can explore how some of the firm’s clients use the firm’s content processing technology by navigating to the Québec International Portal. The search and content processing for this Web site is provided by Delphes.

The company’s Web site includes a wealth of information about the architecture of the system, its features and functions, and services available from the company. The company offers a PDF that describes in a succinct way the features of what the company calls its “Intelligent Knowledge Management System”. You can download the IKMS overview document here.

Architecture

Information about the technical underpinnings of Delphes is sketchy. I have in my files a Delphes document called “The Birth of Digital Intelligence: Extranet and Internet Solutions”. This information, dated 2004, includes a high-level schematic of the Delphes system. Keep in mind that the company has enhanced its technology, but I think we can use this diagram to form a general impression of the system. Note: these diagrams were available in open sources, and are copyrighted by Delphes.

system architecture

The “linguistic soul” of the system is encapsulated in two clusters of sub systems. First, there is the “advanced analysis” for content processing. This set of functions performs semantic analysis, which “understands” each processed document. The second system permits cross-language operation. Canada is officially bilingual, so for Delphes to make sales in Canadian agencies, the system must handle multiple languages and have a means to permit a user to locate information using either English or French.

The “body” of the system includes a distributed architecture, multi-index support, a federating function, and support for XML and Web services. In short, Delphes followed the innovation trajectory of Autonomy (LON:AU), Endeca, and Fast Search & Transfer (NASDAQ:MSFT). One can argue that Delphes has a system of comparable sophistication that permits the same customization and scaling.

Delphes makes a live demo available in a side-by-side comparison with Google. The content used for the demo comes from the Cisco Systems’ Web site. You can explore this live implementation in the Delphes demo here. The interface incorporates a number of functions that strike me as quite useful. The screen shot below comes from the Delphes document from which the systems diagram was extracted. Portions of the graphic are difficult to read, but I will summarize the key features. You will be able to get a notion of the default interface, which, of course, can be customized by the licensee.

delphes_interfacefeatures

The results of the query “high speed access through cable” appear in the main display. Note that a user can select “themes” (actually a document type) and a “category”.

Each “hit” in the results list includes an extract from the most relevant paragraph in the source document that matches the query. In this example, the query terms are not matched exactly. The Delphes system can understand “fuzzy” notions and use them to find relevant documents. Key word indexing systems typically don’t have this functionality. With a single click, the user can launch a second query within the subset. This is generally known as “search within results.” Many search systems do not make this feature available to their users.

Notice that a link is available so the user can send the document with one click to a colleague. The hit also includes a link to the source document. A link is provided so the user can jump directly to the next relevant paragraph in a hit. This feature eliminates scrolling through long documents looking for results. Finally, the hit provides a count of the number of relevant paragraphs in a source document. A long document with a single relevant paragraph may not be as useful to a user as a document with a larger number of relevant paragraphs.
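
The paragraph-level features described above (an extract from the most relevant paragraph, jump links, and a count of relevant paragraphs) can be sketched in a few lines of Python. This is an illustration only. Delphes has not published its scoring method, so plain term overlap stands in for its linguistic analysis, and the names `relevant_paragraphs` and `summarize_hit` and the 0.5 threshold are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevant_paragraphs(document, query, threshold=0.5):
    """Return (index, score, paragraph) for each paragraph that contains at
    least `threshold` of the query's terms.

    Term overlap is an assumed stand-in, not Delphes' actual method.
    """
    query_terms = tokenize(query)
    if not query_terms:
        return []
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    hits = []
    for index, paragraph in enumerate(paragraphs):
        overlap = len(query_terms & tokenize(paragraph))
        score = overlap / len(query_terms)
        if score >= threshold:
            hits.append((index, score, paragraph))
    return hits


def summarize_hit(document, query, threshold=0.5):
    """Build the per-hit data a results page would need, or None if no
    paragraph qualifies."""
    hits = relevant_paragraphs(document, query, threshold)
    if not hits:
        return None
    best = max(hits, key=lambda h: h[1])  # first paragraph wins a tie
    return {
        "extract": best[2],                      # text shown in the results list
        "relevant_count": len(hits),             # count displayed with the hit
        "jump_targets": [h[0] for h in hits],    # paragraph indexes for jump links
    }
```

Note that a paragraph can qualify without containing every query term, which is a crude way to mimic the inexact matching visible in the screen shot. A production system would rely on linguistic analysis rather than raw overlap.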

Based on my notes to myself about the Delphes system, I identified the following major functions of DioWeb. Forgive me if I blur some functions from the DioWeb product. I can no longer recall the boundaries of each product. Delphes, I’m confident, can set you straight if I go off track.

First, the system can perform search-and-retrieval tasks. The interface permits free text and natural language querying. The system’s ability to “understand” content eliminates the shackles of the key word Boolean search technology. Users want the search box to be more understanding. Boolean systems are powerful but not understood by most users. Delphes describes its semantic approach as using “key linguistic differentiators”. I explain these functions briefly in Beyond Search, so I won’t define each of these concepts in this essay. Delphes uses syntax, disambiguation, lemmatization, masks, controlled term lists, and automatic language recognition, among other techniques.

Second, the system can federate content from different systems and further segment processed content by document type. Concepts can be used to refine a results list. Delphes defines concepts as proper nouns, dates, product names, codes, and other types of metadata.
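
To make the federating function concrete, here is a minimal Python sketch of merging result lists from several systems and then narrowing by document type. Everything in it is an assumption for illustration: the `Hit` fields, the `federate` and `refine` helpers, and above all the premise that scores from different systems are directly comparable, which real federated search cannot take for granted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    url: str
    title: str
    doc_type: str   # for example "manual" or "memo"
    score: float    # assumed comparable across sources (a big assumption)
    source: str     # name of the system that returned the hit


def federate(result_lists):
    """Merge hits from several systems, keeping the best-scored copy of each
    URL, and return them ordered by descending score."""
    best = {}
    for hits in result_lists:
        for hit in hits:
            current = best.get(hit.url)
            if current is None or hit.score > current.score:
                best[hit.url] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


def refine(hits, doc_type=None):
    """Narrow a merged list to one document type; None keeps every hit."""
    if doc_type is None:
        return list(hits)
    return [h for h in hits if h.doc_type == doc_type]
```

The same `refine` step could filter on concepts such as product names or dates if each hit carried that metadata.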

Third, the system identifies relevant portions of a hit. A user can see only those portions of the document or browse the entire document. A navigator link allows the user to jump from relevant paragraph to relevant paragraph without the annoying scrolling imposed by some other vendors’ approaches to results viewing.

Fourth, the system can generate a “gist” or “summary” of a result. This feature extracts the most important portions of each hit and makes them available in a report. The system’s email link makes it easy to send the results to a colleague.

Fifth, Delphes includes what it calls a “knowledge manager”. I’m generally suspicious of KM or knowledge management systems. Delphes’ implementation strikes me as a variation on the “gist” or “summary” feature. The user can add comments, save the results, or perform other housekeeping functions. A complementary “information manager” function generates a display that shows what reports a user has generated. If a user sends a report to a colleague, the dashboard display of the “information manager” makes it possible to see that the colleague added a comment to a report. Again, this is useful housekeeping stuff, not the more esoteric functions described in my earlier summary of the Entopia approach.

What Can We Learn?

My goal for Beyond Search was to write a study with fewer than 200 pages, minimizing the technical details to focus on “what’s in it for the licensee”. Beyond Search is going to run about 250 pages, and I had to trim some information that I thought was important to readers. Delphes is an interesting vendor, and it offers a system that has a number of high-profile, demanding licensees in Canada, Europe, and elsewhere.

The reason I wanted to provide this brief summary — fully unauthorized by the company — was to underscore what I call the visibility problem in behind-the-firewall search.

In the information from the major consultancies and pundits who “cover” this sector of the software business, Delphes is essentially invisible. However, Delphes does exist and offers a competitive system that can go toe-to-toe with Autonomy, Endeca, and Fast Search & Transfer. One can argue that Delphes can enhance a SharePoint environment and match the functionality of a custom system built from IBM’s (NYSE:IBM) WebSphere and OmniFind components.

What does this discussion of Delphes tell us?

If you rely on the consultants and pundits, you may not be getting the full story. Just as I had to chop information from Beyond Search, others exercise the same judgment. This means that when you ask, “Which system is best for my requirements?” — you may be getting at best an incomplete answer. You may be getting the wrong answer.

A search for Delphes on Exalead, Live.com (NASDAQ:MSFT), Google (NASDAQ:GOOG), and Yahoo (NASDAQ:YHOO) is essentially useless. Little of the information I provide in this essay is available to you. Part of the problem is that the word Delphes is perceived by the search systems as a variant of Delphi. You learn a lot about tourism and not too much about this system.

There are two key points to keep in mind about search-and-retrieval systems:

  1. The “experts” may not know about some systems that could be germane to your needs. If the “experts” don’t know about these systems, you are not going to get a well-rounded analysis. The phrase that sticks in my mind is “bright but uninformed”. This can be a critical weak spot for some “experts”.
  2. The public Web search systems do a pretty awful job on certain types of queries. It is worth keeping this in mind because in the last few weeks, Google’s share of Web search has led some to view the market as “game over”. I’m not so sure. People who think the “game is over” in search are “bright but uninformed”. Don’t believe me? Run the Delphes query and let me know your impression of the results. (Don’t cheat and use the product names I include in this essay. Start with Delphes and go from there.)

In closing, contrast Entopia with Delphes. Both companies asserted similar functionality in 2004-2006. Today, the high-profile Entopia is nowhere to be found. The lower-profile Delphes is still in business.

Make no mistake. Search is a tough business. Delphes illustrates the usefulness of focusing on a market, not lighting up the sky with marketing fireworks. I would like to ask the Delphic oracle in Greece, “What’s the future of Delphes?” I will have to wait and see. I’m not trekking to Greece to look at smoke and pigeon entrails. I do know some search engine “pundits” who may want to go. Perhaps the Delphic oracle will short cut their learning about Delphes?

Stephen Arnold, February 17, 2008

Entopia: A Look Back in Time

February 16, 2008

Periodically I browse through my notes about behind-the-firewall systems, content processing solutions, and information retrieval start ups. I think Entopia, a well-funded content processing company founded in 1999, shut down, maybe permanently, some time in 2006.

In my “Dormant Search Vendors” folder, I keep information about companies that had interesting technology but dropped off my watch list. A small number of search vendors are intriguing. I revisit what information I have in order to see if there are any salient facts I have overlooked or forgotten.

KangarooNet and Smart Pouches

Do you remember Entopia? The company offered a system that would key word index, identify entities and concepts, and allow a licensee to access information from the bottom up. The firm opened its doors as KangarooNet. I noticed the name because it reminded me of the whimsical Purple Yogi (now Stratify). Some names lure me because they are off-beat if not too helpful to a prospective customer. I do recall that the reference to a kangaroo was intended to evoke something called a “smart pouch”. The founders, I believe, were from Israel, not Australia. I assumed some Australian tech wizards had crafted the “smart pouch” moniker, but I was wrong.

Do you know what a “smart pouch” is? The idea is that the kangaroo has a place to keep important items such as baby kangaroos. The Entopia “smart pouch” was a way to gather important information and keep it available. Users could share “smart pouches” and collaborate on information. Delicious.com’s bookmarks provide a crude analog of a single “smart pouch” function.

I recall contacting the company in 2000, but I had a difficult time understanding how the company’s system would operate at scale in an affordable way. Infrastructure and engineering support costs seemed likely to be unacceptably high. No matter what the proposed benefits of a system, if the costs are too high, customers are unwilling to ink a deal.

Shifting Gears: New Name, New Positioning

Entopia is a company name derived from the Greek word entopizo. For those of you whose Greek is rusty, the verb means to locate or bring to light. Entopia’s senior technologists stressed that their K-Bus and Quantum systems allowed a licensee to locate and make use of information that would otherwise be invisible to some decision makers.

When I spoke with representatives of the company at one of the Information Today conferences in New York, New York, in 2005, I learned that Entopia was, according to the engineer giving me the demo, “a third-generation technology”. The idea was that Entopia’s system would supplement indexing with data about the document’s author, display Use For and See Also references, and foster collaboration.

I noted that I also spoke with Entopia’s vice president of product management, David Hickman, a quite personable man as I recall. My notes included this impression:

Entopia wants to capture social aspects of information in an organization. Relationships and social nuances are analyzed by Entopia’s system. Instead of a person looking at a list of possibly relevant documents, the user sees the information in the context of the document author, the author’s role in the organization, and the relationships among these elements.

In my files, I found this screen shot of Entopia’s default search results display. It’s very attractive, and includes a number of features that systems now in the channel do not provide. For example, if you had access to Entopia’s system in 2006 prior to its apparent withdrawal from the market, you could:

  • See concepts, people, and sources related to your query. These appear in the left hand panel on the screen shot below
  • Get a results list with the creator, source, date, and relevance score for each item clearly presented. In contrast to the default displays used by some of the companies in my Beyond Search study, Entopia’s interface is significantly more advanced
  • Use the standard search box, a hot link to advanced search functions, and one-click access to saved searches, which keep important but little used functions front and center

When the firm was repositioned in 2003, the core product was named, according to my handwritten notes, the “K-Bus Knowledge Extractor”. I think the “k” in K-Bus is a remnant of the original “kangaroo” notion. I wrote in my notes that Entopia was a spin out from an outfit called Omind and Global Catalyst Partners.

entopiaresults

Other features of the Entopia system were:

  • Support for knowledge bases, taxonomies, and controlled term lists
  • An API and a software development kit
  • Support for natural language processing
  • Classification of content
  • Enhanced metatagging

The K-Bus technology was enhanced with another software component called Quantum. The software system created a collaborative workspace. The idea was that system users could assemble, discuss, and manipulate the information processed by the K-Bus. This is the original SmartPouch technology that allows a user to gather information and keep it in a virtual workspace.

System Overview

In my Entopia folder, I found white papers and other materials given to me by the company. Among the illustrations was this high-level view of the Entopia system.

clip_image002

Several observations are warranted even though the labels in the figure are not readable. First, licensees had to embrace a comprehensive information platform. In the 2005 - 2006 period, a number of content processing vendors had added the word “platform” to their marketing collateral. Entopia to its credit does a good job of depicting how significant an investment is required to make good on the firm’s assertions for discovering information.

Second, it is clear that the complex interactions required to make the system work as advertised cannot tolerate bottlenecks. A slowdown in one component ripples through the others. For instance, the horizontal gray rectangle in the center of the diagram is the “Session Facade Beans” subsystem. If these processes slow down, the Web framework in the horizontal blue box above the gray box slows down, and so does user access. Another hot spot is the Data Access Module, the gray rectangle below the horizontal gray rectangle just referenced. A problem in this component prevents the metadata from being accessed. In short, a heck of an infrastructure of systems, storage, and bandwidth is needed to keep the system performing at acceptable levels.

Finally, the complexity of the system appears to require on-site support and in some cases, technical support from Entopia. A licensee’s existing information technology staff could require additional headcount to manage this K-Bus architecture.

As I scanned these notes, now more than two years old, I was struck by the fact that Entopia was on the right track. The buzz about social search makes sense, particularly in an organization where one-to-one relationships occur out of a hierarchical organizational structure. Software can provide some context for knowledge workers who are often monads, responsible to other monads, not the organization as a whole.

Entopia wanted to blend expertise identification, content visualization, social network analysis, and content discovery into one behind-the-firewall system. I noted that the company’s system started at $250,000, and I assume the up-and-running price tag would be in the millions.

When I asked, “Who are Entopia’s customers?”, I learned that Saab, the US government, Intel, and Boeing were licensees. Those were blue-chip names, and I thought that these firms’ use of the K-Bus indicated Entopia would thrive. Entopia was among the first search vendors to integrate with Salesforce.com. The system also allowed a licensee to invoke the Entopia functions within a Word document.

What Can We Learn?

Entopia seems to have gone dark quietly in the last half of 2006. My hunch is that the intellectual property of the company has been recycled. Entopia could be in operation under a different corporate name or incorporated as a proprietary system in other content processing systems. When I clicked on the Entopia.com Web address in my folder, a page of links appeared. Running queries on Live.com, Google, and Yahoo returned links to stale information. If Entopia remains in business, it is doing a great job of keeping a low profile.

If you read my essay “Power Leveling”, you know that there are two common challenges in search and content processing. The first is getting caught in a programming maze. The effort to solve a particular problem fails to meet a licensee’s needs. The second problem is that when the system developer assembles the local solutions, the overall result is not efficient. Instead of driving straight from Point A to Point B, the system iterates and explores every highway and byway. Performance becomes a problem. To get the system to go fast, capital investment is necessary. When licensees can’t or won’t spend more on hardware, the system remains sluggish.

Entopia, on the surface, appears to be an excellent candidate for further analysis. My cursory looks at the system in 2001, again in 2005, and finally in 2006 revealed considerable prescience about the overall direction of the content processing market. Some of the subsystems were very clever and well in advance of what other vendors had on the market. The use of the social metadata in search results was quite useful. My recollection is now hazy, but I had noted that when these clever subsystems were hooked together, response time was sluggish. Maybe it was. Maybe it wasn’t. The point is that a complex system like that illustrated above would require on-going work to keep operating at peak performance.

Unfortunately, I don’t have an Entopia system to benchmark against the systems of the 24 companies profiled in Beyond Search. I wanted to include this Entopia information, but I couldn’t justify a historical look back when there was so much to communicate about systems now in the channel.

In Beyond Search, I don’t discuss the platforms available from Autonomy, Endeca, Fast Search & Transfer, IBM, and Oracle. I do mention these companies to frame the new players and little known up and comers that figure in Beyond Search. I would like to conclude this essay with several broad observations about the perils of selling organizations platforms.

First, any company selling a platform is essentially trying to obtain a controlling or central position in the licensee’s organization. A platform play is one that has a potentially huge financial pay off. A platform is a sophisticated “lock in”. Once the platform is in position, competitors have a difficult time making headway against the incumbent platform.

Second, the platform is the core product of IBM (NYSE:IBM), Microsoft (NASDAQ:MSFT), and Oracle (NASDAQ:ORCL). One might include SAP (NYSE:SAP) in this list, but I will omit the company because it’s in transition. These Big Three have the financial and market clout to compete with one another. Smaller outfits pushing platforms have to out market, out fox, and out deliver any of the Big Three. After all, why would an Oracle DBA want another information processing platform in an all-Oracle environment? IBM and Microsoft operate with almost the same mind set. Smaller platform vendors, perhaps including Autonomy (LON:AU) and Endeca, are likely to face increasing pressure to mesh seamlessly with whatever a licensee has. If this is correct, Fast Search’s ESP has a better chance going forward than Autonomy. It’s too early to determine if Endeca’s deal with SAP will pay similar dividends. You can decide for yourself if Autonomy can go toe-to-toe with the Big Three. From my observation post in rural Kentucky, Autonomy will have to shift into a higher gear in 2008.

Third, super-advanced systems are vulnerable in business environments where credit is tight, sales are in slow or low growth cycles, and a licensee’s technical staff may be understaffed and overworked.

In conclusion, I think Entopia was a forward-thinking company. Its technology anticipated market needs now more clearly discernable. Its system was slick, anticipating some of the functionality of the Web 2.0 boom. The company demonstrated a willingness to abandon overly cute marketing for more professional product and company nomenclature. The company did apparently have one weakness — too little revenue. Entopia, if you are still out there, please, let me know.

Stephen Arnold, February 16, 2008

Lotsa Search at Yahoo!

February 3, 2008

Microsoft’s hostile take over of Yahoo did not surprise me. Rumors about Micro-hoo or Ya-soft have floated around for a couple of years. I want to steer clear of the newsy part of this take over, ignore the share-pumping behind the idea that Mr. Murdoch will step in to buy Yahoo, and side step Yahoo’s 11th hour “we’re not sure we want to sell” Web log posting.

I prefer to do what might be called a “catalog of search engines,” a meaningless exercise roughly equivalent to Homer’s listing of ships in The Iliad. Scholars are still arguing about why he included the information and centuries later continue to figure out who these guys were and why such an odd collection of vessels was necessary. You may have a similar question about Yahoo’s search fleet after you peruse this short list of Yahoo “findability” systems:

  • InQuira. This is the Yahoo natural language customer support system. InQuira was formed from three smaller search outfits that ran aground. InQuira seems stable, and it provides NLP systems for customer support functions. Try it. Navigate to Yahoo. Click Help and ask a question, for example, “How do I cancel my premium mail account?” Good luck, but you have an opportunity to work with an “intelligent” agent who won’t tell you how to cancel a for-fee Yahoo service. When I learned of this deal, I asked, “Why don’t you just use Inktomi’s engine for this?” I didn’t get an answer. I don’t feel too bad. Google treats me the same way.
  • Inktomi. Yahoo bought this Internet indexing company in 2002. We used the Inktomi system for the original US government search service, FirstGov.gov (now USA.gov). The system worked reasonably well, but once in the Yahooligans’ hands, not much was done with the system, and Inktomi was showing its age. In 2002, Google was motoring, just drawing even with Yahoo. Yahoo seemed indifferent or unaware that search had more potential than Yahoo’s portal approach.
  • Stata Labs. When Gmail entered semi-permanent beta, it offered two key features. First, there was one gigabyte of storage and, second, you could search your mail. Yahoo couldn’t search email at all. The fix was to buy Stata Labs in 2004. When you use the Yahoo mail search function, the Stata system does the work. Again I asked, “Why not use one of your Yahoo search systems to search mail?” Again, no response.
  • Fast Search & Transfer. Yahoo, through the acquisition of Overture, ended up with the AllTheWeb.com Web site. The spidering and search technology are operated by Fast Search & Transfer (the same outfit that Microsoft bought for $1.2 billion in January 2008). Yahoo trumpeted the “see results as you type” feature in 2007, maybe 2006. The idea was that as you key your query, the system shows you results matching what you have typed. I find this function distracting, but you may love it. Try it yourself here. I heard that Yahoo has outsourced some data center functions to Fast Search & Transfer, which, if true, contradicts some of the pundits who assert that Yahoo has its data center infrastructure well in hand. If so, why lean on Fast Search & Transfer?
  • Overture. When Yahoo acquired Overture (the original pay-for-traffic service) in 2003, it got the ad service and the Overture search engine. Overture purchased AllTheWeb.com and ad technology from Fast Search & Transfer. When Yahoo bought Overture, Yahoo inherited Overture’s Sun Microsystems’ servers with some Linux boxes running a home brew fraud detection service, the original Overture search system, and the AllTheWeb.com site. Yahoo still uses the Overture search system when you look for key words to buy. You can try it here. (Note: Google was “inspired” by the Overture system, and paid about $1.2 billion to Yahoo to avoid a messy lawsuit about its “inspiration” prior to the Google IPO in 2004. Yahoo seemed happy with the money and did little to impede Google.)
  • Delicious. Yahoo bought Delicious in 2005. Delicious came with its weird URL and search engine. If you have tried it, you know that it can return results with some latency. When it does respond quickly, I find it difficult to locate Web sites that I have seen. As far as I know, the Delicious system still uses the original Delicious search engine. You can try it here.
  • Flickr. Yahoo bought Flickr in 2005, another cog in its social, Web 2.0 thing. The Flickr search engine runs on MySQL. At one trade show, I heard that the Flickr infrastructure and its search system were a “problem”. Scaling was tough. Based on the sketchy information I have about Yahoo’s search strategy, Flickr search is essentially the same as it was when it was purchased and is in need of refurbishing.
  • Mindset. Yahoo, like Google and Microsoft, has a research and development group. You can read about their work on the recently redesigned Web site here. If you want to try Mindset, navigate to Yahoo Research and slide the controls. I’ve run some tests, and I think that Mindset is better than the “regular” Yahoo search, but it seems unchanged over the last six or seven months.

I’m going to stop my listing of Yahoo’s search systems, although I could continue with the Personals search, Groups search, News search, and more. I may comment on AltaVista.com, another oar in Yahoo’s search vessel, but that’s a topic that requires more space than I have in this essay. And I won’t beat up on Yahoo Shopping search. If I were a Yahoo merchant, I would be hopping mad. I can’t figure out how to limit my query to just Yahoo merchants. The results pages are duplicative and no longer useful to me. Yahoo has 500 million “users” but Web statistics are mushy. Yahoo must be doing something right as it continues to drift with the breeze as a variant of America Online.

In my research for my studies and journal articles, I don’t recall coming across a discussion of Yahoo’s many different search systems. No one, it seems, has noticed that Yahoo lacks an integrated, coherent approach to search. I know I’m not the only person who has observed that Yahoo cannot mount a significant challenge to Google.

As Google’s most capable competitor, Yahoo stayed out of the race. But it baffles me that a sophisticated, hip, with-it Silicon Valley outfit like Yahoo collected different search systems the way my grandmother coveted weird dwarf figurines. My grandmother never did much with her collection, and I may have to conclude that Yahoo hasn’t done much with its collection of search systems either.

The cost of licensing, maintaining, and upgrading a fleet of search systems is not trivial. What baffles me is why on earth Yahoo couldn’t index its own email. Why couldn’t Yahoo use one of its own search systems to index Delicious bookmarks and Flickr photos? Why does Yahoo have a historical track record of operating search systems in silos, thus making it difficult to rationalize costs and simplify technical problems?

Compared to Yahoo, Google has its destroyer ship shape, if you call squishy purple pillows, dinosaur bones, and a keen desire to hire every math geek with an IQ of 165 on the planet “ship shape”. But Yahoo is still looking for the wharf. As Google churned past Yahoo, Yahoo watched Google sail without headwinds to the horizon.

Over the years, I’ve been in chit-chats with some Yahoo wizards. Let me share my impressions without using the wizards’ names:

  1. Yahoo believes that its generalized approach is correct as Google made search the killer app of cloud computing. Yahoo’s very smart people seem to live in a different dimension.
  2. Yahoo believes that its technology is superior to Google’s and Microsoft’s. When I asked about a Google innovation, Yahoo’s senior technologist told me that Yahoo had “surprises for Google.” I think the surprise was the hostile takeover bid last week.
  3. Yahoo sees its future in social, Web 2.0 services. To prove this, Yahoo hired economists and other social scientists. While Yahoo was recruiting, the company muffed the Facebook deal and let Yahoo 360 run aground. Yo, Yahoo, Google is inherently social. PageRank is based on human clicks and human-created Web pages. Google’s been social since Day One.

To bring this listing of Yahoo search triremes (ancient wooden war ships) to a close, I am not sure Microsoft, if it is able to acquire Yahoo, can integrate the fleet of search systems. I don’t think Mr. Murdoch can either, given the MySpace glitches. Fixing the flotilla of systems at Yahoo will be expensive and time consuming. The catch is that time is running out. Yahoo appears to me to be operating on pre-Internet time. Without major changes, Yahoo will be remembered for its many search systems, leaving pundits and academics to wonder where they came from and why. Maybe these investigators will use Google to find the answer? I know I would.

Stephen Arnold, February 3, 2008

Vivisimo’s Remix

January 29, 2008

I’ve been interested in Vivisimo since I learned about the company in 2000. Disclaimer: my son worked for Vivisimo for several years, and I was involved in evaluating the technology for the U.S. Federal government. A new function, called “Remix”, caught my attention and triggered this essay.

Background

Carnegie Mellon University ranks among the top five or six universities in computer science. Lycos was a product of the legendary Fuzzy and his team. Disclaimer: my partner (Chris Kitze) and I sold search technology to Lycos in the mid-1990s. Dr. David Evans has practiced his brand of innovation with several successful search-centric start ups, including a chunk of the technology now used in JustSystems’ XML engine. (Disclaimer: I have done some work for JustSystems in Tokyo, Japan.) Vivisimo, founded by Raul Valdes-Perez and Jerome Pesenti, was among the first of the value-added processing search systems. I have been paying attention to Vivisimo for more than a decade.

I’ve been impressed with Vivisimo’s innovations, and I have appropriated Mr. Valdes-Perez’s coinage, “information overlook”, for my verbal arsenal. As I understand the term, “overlook” is a way for a person looking for information to get a broader view of the information in the results list. I think of it in terms of standing on a bluff and being able to see the lay of the land. As obvious as an overlook may be, it is a surprisingly difficult problem in information retrieval. You’ve heard the expression “We can’t tell the forest from the trees.” Information overlook attempts to get the viewer into a helicopter. From that vantage point, it’s easier to see the bigger picture.

A Demonstration Query

Vivisimo’s technology has kept that problem squarely in focus. With each iteration and incremental adjustment to the Vivisimo technology, overlook has been baked into the Vivisimo approach to search-and-retrieval. Here’s an example.

Navigate to Clusty.com, Vivisimo’s public facing search system. Note that Clusty is a metasearch system. Your query is passed to other search systems such as Live.com and Yahoo. The results are retrieved and processed before you see them. Now enter the query ArnoldIT. You will see a main results page and a list of folders in the left hand column of your screen. You can browse the main results. Note that Vivisimo removes the duplicates for you, so you are looking at unique items. Now scan the folder names.

Those names represent the main categories or topics in that query’s result list. For ArnoldIT, you can see that my Web site has information about patents, international search, and so on. Let me highlight several points about the foundation of Vivisimo:

First, I’ve been impressed with Vivisimo’s on-the-fly clustering. It’s fast, unobtrusive, and a very useful way to get a view of what topics occur in a query’s result set. I use Vivisimo when I begin a research project to help me understand what topics can be researched via the Web and which will require the use of analysts making telephone calls.

Second, in the early days of online, deduplication was impossible. Dialog and Orbit, two of the earliest online systems, manipulated fielded flat files. A field name variation made it computationally expensive to recurse through records to identify and remove duplicate entries. When I was paying for results from commercial online systems, these duplicates cost me money. When I learned about Vivisimo’s duplicate detection function, I looked at it closely. No one at Vivisimo would give me the details of the approach, but it worked and still works well. Other systems have introduced deduplication, but Vivisimo made this critical function a must-have.

Third, Vivisimo’s implementation of metasearch remains speedy. There are a number of interesting approaches to metasearch, including the little-known ez2Find.com system developed by a brother and sister team working in the south of France. I also admire the Devilfinder search engine that is now one of the faster metasearch systems available. But in terms of features, Vivisimo ranks at the top of the list, easily outperforming ixquick, Dogpile, and other very useful tools.

Fourth, like Exalead, Vivisimo has been engineered using the Linux tricks of low-cost scaling and clustering for high performance. These engineering approaches are becoming widely known, but many of these innovations originated at Stanford, University of Waterloo, MIT, and Carnegie Mellon University.
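To make the metasearch and deduplication points above concrete, here is a minimal sketch in Python. It is not Vivisimo’s method; the company has not disclosed its approach. The sketch assumes each upstream engine returns a ranked list of hits with a `url` field, interleaves the lists, and treats two hits as duplicates when their URLs match after a simple normalization. The function names, the normalization rule, and the sample paths are all hypothetical.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Build a comparison key: lowercase host without 'www.', path without a trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def merge_results(result_lists):
    """Interleave ranked lists from several engines, keeping the first copy of each URL."""
    merged, seen = [], set()
    longest = max((len(results) for results in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results):
                hit = results[rank]
                key = normalize_url(hit["url"])
                if key not in seen:  # assumed duplicate test: same normalized URL
                    seen.add(key)
                    merged.append(hit)
    return merged


# Hypothetical result lists from two upstream engines.
engine_a = [
    {"title": "ArnoldIT", "url": "http://www.arnoldit.com/"},
    {"title": "Patents", "url": "http://arnoldit.com/patents"},
]
engine_b = [
    {"title": "Arnold IT home", "url": "http://arnoldit.com"},
    {"title": "Search", "url": "http://arnoldit.com/search"},
]
merged = merge_results([engine_a, engine_b])
```

In this toy run the second engine’s home page hit is dropped because it normalizes to the same key as the first engine’s top hit. A production system would need a fuzzier test, since the same document often appears under different URLs.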

The Shift to the Enterprise

Three years ago, Vivisimo made the decision to expand its presence in organizations. In effect, the company wanted to move from a specialist provider of clustering technology to delivering behind-the-firewall search. When Vivisimo’s management told me about this new direction, I explained that the market for behind-the-firewall search was a contentious, confused sector. Success would require more marketing, more sales professionals, and a tougher hide. Mr. Valdes-Perez looked at me and said, “No problem. We’re going to do it.”

The company’s first high-profile win was the contract for indexing the U.S. Federal government’s unclassified content. This contract was originally held by Inktomi from 2000 to 2001. Then Fast Search & Transfer with its partner AT&T held the contract from 2001 to 2005. When Vivisimo displaced Fast Search’s technology, the company was in a position to pursue other high-profile search deals.

Today, Vivisimo is one of the up-and-coming vendors of behind-the-firewall search solutions. I have learned that the company has just won another major search deal. I’m not able to reveal the name of the new client, but the organization touches the scientific and technical community worldwide. Based on my understanding of the information to be processed, Vivisimo will be making the research work of most US scientists and engineers more productive.

Remix

This essay is a direct result of my learning about a new Vivisimo function, Remix. You can use the Remix function when you have a result set visible in your Clusty.com results display. In our earlier sample query, ArnoldIT, you see the top 10 topics or clusters of results for that query. When you select Remix, the system clusters the results a second time. According to Vivisimo, “With a single click, remix clustering answers the question: What other, subtler topics are there? It works by clustering again the same search results, but with an added input: ignore the topics that the user just saw. Typically, the user will then see new major topics that didn’t quite make the final cut at the last round, but may still be interesting.”

The function is important for three reasons:

First, Vivisimo has made drill down easy. Some systems perform a similar function, but the user is not always aware of what’s happened or where the result list originated. Vivisimo does a good job of keeping the user in control and aware of his / her location in the results review sequence.

Second, Remix allows one-click access to categories that otherwise would not be seen by the Clusty user. The benefit of Remix is that the result sets do not duplicate any topics the user saw before clicking the Remix button. Just as Vivisimo’s original deduplication function worked invisibly, so does Remix. The function just happens.

Third, the function is speedy. Vivisimo has a number of innovations in its system to make on-the-fly processing of search results take place without latency–the annoying delays some systems impose upon me. Vivisimo’s value-added processing occurs almost immediately. Like Google, Vivisimo has focused on delivering fast response time and rocket science for the busy professional.
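The idea in Vivisimo’s description of Remix, clustering the same results again while ignoring the topics already shown, can be illustrated with a toy sketch in Python. This is an assumption-laden stand-in, not Vivisimo’s algorithm: the “clustering” here simply labels folders with the most common terms across result snippets, and the names `cluster`, `remix`, and the sample snippets are hypothetical.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "for", "in", "to", "on"}


def cluster(snippets, top_k=2, ignore=frozenset()):
    """Group snippets into folders labeled by the top_k most common terms.

    Terms listed in `ignore` can never become folder labels.
    """
    doc_freq = Counter()
    for text in snippets:
        terms = set(text.lower().split()) - STOPWORDS - set(ignore)
        doc_freq.update(terms)
    # Most frequent first; ties broken alphabetically so the output is stable.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    labels = [term for term, _ in ranked[:top_k]]
    return {
        label: [s for s in snippets if label in s.lower().split()]
        for label in labels
    }


def remix(snippets, seen_labels, top_k=2):
    """Cluster the same snippets again, ignoring labels the user already saw."""
    return cluster(snippets, top_k=top_k, ignore=frozenset(seen_labels))


# Hypothetical snippets standing in for one query's result list.
snippets = [
    "google patents search",
    "patents analysis search",
    "enterprise search vendors",
    "international search markets",
    "international enterprise pricing",
]
first_pass = cluster(snippets)
second_pass = remix(snippets, seen_labels=first_pass.keys())
```

On the sample data the first pass surfaces the folders “search” and “enterprise”; the remix pass surfaces “international” and “patents”, topics that did not make the first cut. The result list itself is unchanged between passes, which matches the behavior Vivisimo describes.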

Some Challenges

Companies like Vivisimo will have to deal with the marketing challenges of today’s search-and-retrieval marketplace. The noise created by Microsoft’s acquisition of Fast Search and Endeca’s injection of cash from Intel and SAP means that interesting companies like Vivisimo have to make themselves known. I don’t envy the companies trying to get traction in the search sector.

If you are looking for a behind-the-firewall system, you will want to take a look at Vivisimo’s system. In fact, you will want to spend additional time reviewing the search solutions available from the up-and-comers I profile in my new study “Beyond Search”, due out in April 2008. You will find that you can deliver a robust solution without the teeth-rattling licensing fees required by some of the higher-profile vendors.

I can’t say that any one search system will be better for you than another. In fact, when you compare ISYS Search Software, Siderean Software, and Exalead with Vivisimo, you may find that each is an exceptionally robust solution. Which system you find is best for you comes down to your requirements. The key point is that the up-and-coming systems must not be excluded from your short list because the companies are not making headlines on a daily basis.

If you have the impression that Vivisimo is not up to an enterprise-scale content processing job, you have flawed information. Give Vivisimo’s technology a test drive. Judge for yourself. I wrote about Vivisimo in the first, second, and third editions of The Enterprise Search Report. I won’t be repeating that information in Beyond Search. You can explore Vivisimo and learn more about the system from the company’s useful white papers and case studies.

Stephen E. Arnold, January 29, 2008
