Rain on the Search Parade

March 14, 2008

The storm warnings flash across the sky. This morning (March 14, 2008) Bear Stearns is rumored to face a Carlyle-like liquidity crisis.

But so far no lightning has hit the search lightning rods. In fact, the unsettled financial weather has had no visible effects. The Google – DoubleClick deal is done. The Microsoft – Fast Search tie up is nearing port. Yahoo says that it is embracing the Semantic Web, whatever that means (semantically, of course). France funds a Google killer. Radar’s Twine spools out. Business as usual in the search sector. But still we have no “real” solution to the “problem” of Intranet search, what I call behind-the-firewall search. The marketing razzle dazzle can’t mask the pain begging for lidocaine.

The turmoil in the financial market, the degrading dollar, and the $1,000 per ounce gold price seem to have little impact on search and retrieval so far. Anyone who suggests that a problem looms or that an actual panic could occur is an alarmist. I don’t want to sound any alarms.

InfoWorld’s Web log contained a post that has to make search vendors pant with revenue lust. Jon Williams wrote here on March 13, 2008:

Every system we build has a search function built into it, usually hand-crafted (proprietary). Why? … Search on the internet, whether it be google, youtube, facebook, amazon, ebay, or linkedin, is solved for me, I always find what I need. And I believe the same is true for most consumers. But why not in the enterprise? Seems like a solution waiting to happen.

Spot on, Mr. Williams. Spot on. This unanswered need is why you won’t hear gloom and doom from me. Search often sucks, and whoever solves this problem can make their investors happy in our down market.

An Entrepreneur’s Concern

At dinner yesterday evening (March 13, 2008) in Palo Alto’s noisy Fish Market, I showed the president of a hosted application my current list of 150 next-generation search and content processing companies. Most of the outfits on this list won’t resonate with you. Bitext operates from Madrid, Spain. Thetus has offices near Microsoft’s stomping grounds. PolySpot is tucked away in Paris, France. He had heard of none of these companies or most of the others on my list.

He said, “There are so many on this list unknown to me.” Not unusual. He then asked me, “How can these companies survive so much competition? I think the market downturn will make it very hard for these companies. Right?”

I said, “Yep, tough sector. But no one has the one right answer. Not Google. Not IBM. Not the seven score newcomers on my list.”

The search market remains a triathlon, one of those “iron” versions that require competitors to climb mountains, swim rapids, and bicycle from Burlingame to Boise. But there are some formidable hurdles search vendors must overcome; namely:

Oversupply. Without rehashing dear old Samuelson’s Economics (now in its 18th edition I think), you have an embarrassment of riches for search. You have high-profile, publicly-traded “brands” like Autonomy. You have market-leading companies like Endeca. You have up-and-coming vendors like Coveo, Exalead, ISYS Search Software, and Vivisimo. You have state-of-the-art deep extraction providers like Attensity and Exegy (bet you never heard of Exegy, right?). You have free search software such as Lucene and Flax. You have such super-platforms as IBM, Microsoft, Oracle, and SAP including search with every enterprise application licensed. You have specialists in entity extraction (Inxight / Business Objects), semantics (Siderean), and ANSI standard controlled terms (Access Innovations). You get the idea. Can the market support hundreds of vendors of search and content processing?

Confusion. You don’t want me to belabor this point. There’s a great deal of confusion about search, content processing, text mining, and related disciplines. The easiest way to illustrate this is to provide you with a handful of the buzz words that I have collected in the last two weeks. How many of these can you define? How many of these do you use in your discourse with colleagues? Here are the “Cs” through the “Ks” only:

Collective knowledge systems
Community portals
Composite applications
Conferencing
Context aware games
Context aware mobile search
Context aware search
Context search
Faceted search
Folksonomy
Formal language
Geospatial search
Glass boxes
Instant messaging
Intelligent agents
Knowledge base
Knowledge computing
Knowledge management
Knowledge spaces

Confused buyers often drag their heels as they try to decipher the nuances of search-speak.

Skepticism. Some vendors have told me that potential customers are skeptical about some search features and functions. For example, on a telephone call with a non-U.S. search system vendor, a principal in the company told me, “The nest has been fouled. Two prospects told me today that our two to five day deployment time was impossible. Their incumbent system took more than a month to get installed and another two months of effort before deployment.” As organizations get more behind-the-firewall search experience, those organizations’ employees know that some vendor claims may be a blend of wishful thinking and science fiction.

Over confidence. I don’t have much to say about this human failing. Most chief technical officers overestimate what they know about search and retrieval. Most of the Intranet search problems have their roots anchored in the licensees’ assumptions about what their systems can do, their knowledge of search systems, and their ability to figure out software. I get my Greek myths mixed up, but there were, as I recall, quite a few stories about the nasty effects of pride. “Flame out” and Icarus resonate with me.

Loosey goosey pricing. In the course of the research for my new study Beyond Search, I encountered one vendor who refused to give me a starting price for its system. The president refused. I said, “Take your total revenue, divide it by the number of customers you have, and I will use that number as the average price.” He sputtered in anger. Let’s face it. Unless something is free, most search software comes with a price tag. Even a free system such as Lucene costs money because someone who gets a salary has to babysit the Lucene system. More and more vendors are tap dancing on the cost of their licenses, services, and support. I suspect that these vendors want to hold out to get the best possible price. Maybe these vendors don’t want other customers to know that a price is rising or falling?

Adam Smith’s “invisible hand” will reach out to strangle me. Economics in March 2008, however, continues to surprise the Wall Street set. Last time I checked the super-secret Carlyle Group did not expect fellow bankers to demand cash.

How untoward!

But if some of the best-known financial services companies are in the doo-doo, what will become of the more than 300 firms engaged in search and retrieval? Even the Teflon-coated Google has drawn criticism. Today (March 14, 2008) Google’s share price will open at $443, down from its 52-week high of $747. Microsoft will pay $1.2 billion for a chance at bat to hit a search home run. That’s a pricey swing methinks. In my conversations at conferences, I detect a note of concern about making numbers. Entrepreneurs are thoughtful.

Wrap Up

To wrap up, I believe the search landscape will be pockmarked with Entopia-like shut downs. I also anticipate more strident marketing. Sigh. There will be some buy outs, but there will be some firms that cannot sell out. One reader of this Web log wondered if Autonomy was an example of a company that many look at but none has carried over the threshold. Maybe the right suitor has not come forward? I believe that some countries will intervene in order to keep certain search firms in business. Anyone think that the French government has this as a motive for the funding of its Google killer? Other companies will give away search software and try to make money via services and consulting. And don’t forget the bundling option. Every time I buy an IBM server, I get Lotus Notes. Perhaps the same approach will be used by Microsoft and Oracle to “lock in” customers.

The big concern I have is that search’s “bird flu” will land. The weaker firms will die after a tough fight. The stronger firms will capture a larger share of the market. Instead of the surfeit of choices we have today, we may end up with fewer choices, higher prices, and a stifling of innovation. What do you think? End or beginning for behind-the-firewall search?

Stephen Arnold, March 14, 2008

Yahoo Goes Semantic

March 13, 2008

Yahoo has embraced the Semantic Web. Yahoo’s Web log stated:

In the coming weeks, we’ll be releasing more detailed specifications that will describe our support of semantic web standards. Initially, we plan to support a number of microformats, including hCard, hCalendar, hReview, hAtom, and XFN. Yahoo! Search will work with the web community to evolve the vocabulary framework for embedding structured data. For starters, we plan to support vocabulary components from Dublin Core, Creative Commons, FOAF, GeoRSS, MediaRSS, and others based on feedback. And, we will support RDFa and eRDF markup to embed these into existing HTML pages. Finally, we are announcing support for the OpenSearch specification, with extensions for structured queries to deep web data sources.

Interesting, but maybe these two lads knew something I didn’t. What’s interesting about this announcement is that Google’s Programmable Search Engine, disclosed in a series of patent applications in February 2007, strikes me as a more sophisticated, well-conceived approach. But Google has kept its semantic technology under wraps.

Amazon, like Yahoo, has moved more quickly than Google. Jeff Bezos has deployed cloud computing, introduced storage, and launched a hosted data management service. Google has these technologies and disclosed each in patent applications. The question for me is, “Is Google content to let Amazon and Yahoo operate like lab experiments?”

Google doesn’t answer my email, so I can’t provide any insight based on information from the Googleplex. Google’s professionals are a heck of a lot more intelligent than I am. Google is hanging back, allowing two of its rivals to push forward in areas where Google has a core competency.

I find this puzzling. Do you?

Stephen Arnold, March 13, 2008

Who Quaeros?

March 11, 2008

Europe is concerned about Google.

When I was in Denmark in November 2006, I learned that about 85 percent of the country’s search traffic was a result of Google searches. I think Google has increased its share of traffic in Denmark to Germany’s level. For those of you not paying attention, Google drives about 90 percent of the traffic in Deutschland.

There are two initiatives under way to “kill Google.” The first is Quaero, a French initiative. You can read about it here. The second is a German-flavored project called THESEUS, which received funding a year ago. My understanding is that Fast Search & Transfer is in the saddle for the THESEUS project, but my information may be stale. The French, not to be outdone, have routed money to Quaero. Check out this story: “France Cleared to Fund Search Project”. Here’s a snippet:

France won EU approval Tuesday to give $152 million to several companies hoping to build a European rival to U.S. search giant Google Inc. … The commission said the grant would not give Thomson market power because rivals will likely keep up their investment in research and development. It cleared the German government to give $165 million to the German arm of the project, called THESEUS. That money will fund “icebreaker” companies — Siemens AG, SAP AG, Deutsche Thomson oHG and EMPOLIS GmbH, owned by Bertelsmann AG — to kick start research. The aid will later spread to smaller firms.

Am I misreading this? My work has publicized some of France’s most promising search and content processing companies. So, what do you do if you are German or French? Replicate the Silicon Valley VC environment? Alter the tax laws? Reduce bureaucratic red tape? Encourage university incubators? Nah, too complex. Just give the money to industrial giants and tell them “build a better Google”.

In my experience, governments dumping money on industrial giants leads to predictable outcomes. Those that come to my mind include Halliburton’s contributions in Iraq, IBM’s work to implement the Documentum content management system for the US Senate, and the numerous reengineerings of the Internal Revenue Service’s computer systems.

Look at these to re familiarize yourself with French engineering and computer science:

What puzzles me is how will France figure out which of these companies will get a wedge of euros to “kill Google”. What will the Thomson oHG operation do with French wizards who are hacking away in un dortoir? Probably nothing.

With French venture capital forcing some French entrepreneurs to leave France for such places as — gasp! — England, I hope some of the euros feather the nests of young entrepreneurs.

The battle lines are drawn. The German “icebreakers” Siemens AG, SAP AG, Deutsche Thomson oHG and EMPOLIS GmbH, owned by Bertelsmann AG will try to crush Google and, of course, France. The French companies will try to “kill Google” and turn off the power for THESEUS, of course.

If we factor these battle lines, you will notice that I think Google will chunk forward, allowing the icebreakers to smash and crunch forward.

For those of you who don’t know what a European icebreaker looks like, take a gander. Do you think this can smash over Google? Will these efforts run aground? I will watch the progress closely and plot the activity on Google Maps, until Google is crushed that is.

Stephen Arnold, March 12, 2008

Coveo’s Laurent Simoneau Interviewed

March 11, 2008

Coveo has been growing at a double digit pace since I first wrote about the company in the first edition of Enterprise Search Report. In the first week of March 2008, Coveo announced that it had received additional investment to accelerate the company’s growth. You can read the official news release here. Coveo has also added support for multimedia along with streamlining the company’s solid graphical administrative access.

Right after the announcement about the $2.5 million infusion, I sent an email to Mr. Simoneau, and he agreed to meet with me last week. In the course of this candid discussion, he provided some fresh insight into the reasons contributing to Coveo’s success in behind-the-firewall search.

I was particularly interested in his firm’s success in licensing Coveo to Microsoft SharePoint customers. He told me that his team had worked to understand how Microsoft “does things”. Armed with this technical knowledge, he says:

We provide a smart document address so a user can access a document. We also include specific item types so a user knows what type of information object she will get back. We also strive to understand what information people store in SharePoint. Documents are one thing, but there is a lot of structured information as well that has to be leveraged.

I’ve been tracking the addition of rich content processing features and functions for several years. Coveo has added a number of these features, including support for Salesforce.com. The system makes it possible for a licensee to offer a user a key word search box plus a point-and-click “assisted navigation” interface.

Mr. Simoneau reveals:

We have had to create some new solutions. I cannot say too much, but we have proprietary language detection and multilingual stemming algorithms that enrich the indexing of large corporate databases. We have also patented speech recognition technologies that enable businesses to perform high quality indexing of multimedia content like podcasts or videos.

You can read the full interview with Mr. Simoneau on the ArnoldIT.com Web site here. I will be tracking Coveo as the company continues to innovate.

Stephen Arnold, March 11, 2008

Arikus: Going Beyond Search since 1997

March 10, 2008

In Boston, Massachusetts, on Wednesday, March 5, 2008, a person engaged me and raved about the Arikus search engine. I recall writing about this system in 2002, when I was creating an inventory of Canadian search engines for a client. My contact was, I believe, Markus Gunn, now an advisor to the company. He was a wonderful telephone chat. To be honest, Arikus slipped off my radar until this bouncy cherub gave me the Arikus sales pitch.

The company offers what it calls “enterprise search and categorization.” As you may know, I don’t think too much of the phrase enterprise search. I’ve argued for more than 18 months that enterprise search is a Pandora’s box, leaving most users frustrated and angry about sluggish response, unfindable content, and features that lag Google.com, Live.com, and Yahoo.com by a country mile. The categorization function is not search exactly. It’s one of the rich content processing options introduced to allow users to point and click their way through topics, suggestions, Use For ideas, and See Also references. Endeca cranked up the volume on this notion almost a decade ago. Arikus has offered these features for more than a decade, yet the company remains in stealth mode in the US.

Background

Arikus, founded in 1997, was a privately-held company when I first encountered the firm. In 2003, Arikus wanted to “help customers manage information.” The focus of the firm and its Canadian and UK technology team was to “develop and market data management technologies that transform unstructured, scattered information into a business asset.” The idea was to “increase the value of business applications”.

The core product used to be the Aire server. Today, the company’s technology is described this way on the firm’s Web site: “Arikus unlocks a company’s information resources, providing knowledge workers and web site visitors with access to the right information, right away….Arikus brings together a company’s external and internal data resources and allows it to use this data on its own terms; automated information management, plain language precision search, and navigation software for self-service applications.”

The company’s principal products are under the “inContext” banner. The products available today include:

  • An inContext enterprise solution. A 15-day evaluation version is available via a direct download here.
  • An inContext developer solution
  • An inContext self-service solution called AnswerIT. A seven-day evaluation version is available by contacting sales at arikus.com

Each of these solutions comes with an SDK. Details are located on the Arikus Web site here. You will have to fill out a form without skipping a field.

Technology

Arikus incorporates a variety of technologies in its system. True to its Cambridge University roots, there’s a hefty statistical component. In addition, the company has integrated linguistic technologies. The system can handle unstructured information like Word documents and email. Arikus can also process structured data. The user interface can be configured with point-and-click options, a search box, and fill-in-the-blank prompts for parametric searches.

The company makes an interesting claim for its approach. Let me quote from the Arikus Web site:

Arikus technology goes further than any other known industry document ranking technology in delivering accurate, relevant and quality results. Arikus begins by analyzing an end user query to determine word usage, grammatical inflections, phrase structure, and sentence structure. An analysis is performed on target documents to determine how words are distributed in a document and what the relationships are between the words in the document. Sentence analysis is performed to determine both a position-independent and a position-dependent score for each sentence in the document. This step not only improves the accuracy, relevancy, and quality of returned results but also allows for the determination of the “best summary”, that is, the passage that best answers the user’s question. This is entirely unique to Arikus.

I have added bold italic to the assertions that I believe warrant head-to-head testing with other search and content processing solutions.

Taxonomy Support

A plus for Arikus is its taxonomies. You can find information about them here. If you are unsure what a taxonomy’s main headings should be for business, for example, you can cadge these terms from the Arikus PDFs. What a wonderful jump start! Kudos, Arikus.

inContext supports knowledge bases. You can license industry-specific taxonomies from Arikus. Should your organization have a taxonomy, you can use that scheme with your Arikus installation. Arikus’s taxonomies are based on contributions and information from established business, scientific, and commercial publishers.

System in Action

You can see the inContext system in action via the Arikus demonstration located here. The system is limited, and you may want to contact the company directly for a more detailed demonstration. When I tested the system on March 10, I received an error message. However, the original Arikus system results looked like this. Note: this is a screen shot provided by Arikus to me in 2002, and I believe a version of this screen shot appeared in the first edition of the Enterprise Search Report. Arikus is no longer included in the fourth edition, but if you have the first edition at hand, you can see my full write up plus the upsides and downsides of the system based on my analysis in 2002:

[Screen shot: Arikus interface]

Features of this interface include:

  • A central results list with a machine-generated summary of the document. I found these summaries useful. The snippets shown in the Google Search Appliance, in contrast, can be cryptic
  • A listing of topics in the result set. The number to the right of the concept indicates the number of documents in that topic
  • The publication types in the result set
  • Authors of documents. Again the number to the right of the author’s name indicates the number of documents by that person
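The counts beside each topic and author are what most systems now call facet counts. Here is a minimal sketch of the idea in Python, not Arikus’s code. The result set, the field names, and the facet_counts helper are all invented for illustration.

```python
from collections import Counter

# Hypothetical result set; titles, topics, and authors are made up.
results = [
    {"title": "Q3 pricing memo", "topic": "Pricing", "author": "Lee"},
    {"title": "Pricing FAQ", "topic": "Pricing", "author": "Kim"},
    {"title": "Install guide", "topic": "Deployment", "author": "Lee"},
]

def facet_counts(docs, field):
    """Count how many documents in the result set carry each value of a field."""
    return Counter(doc[field] for doc in docs)

topic_counts = facet_counts(results, "topic")
author_counts = facet_counts(results, "author")

# Print each topic with its document count, largest first,
# the way a navigation panel would list them.
for name, count in topic_counts.most_common():
    print(f"{name} ({count})")
```

The counts are computed over the current result set only, which is why the numbers change as a user narrows the query.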

This screen shot from 2002 shows that Arikus’s engineers anticipated the direction of Microsoft SharePoint in general and, in particular, the third-party interface enhancements from Interse in Denmark.

Customers

Arikus provides few clues to its customers. The company has a deal with Liquid Litigation Management Inc. The inContext system is used for litigation document management. LLM uses the system for automatic content aggregation, indexing, key word search, guided search (point-and-click exploration of information), and content categorization. Other customers include Cold North Wind whose bubbling employee prompted me to dig out my Arikus files and Monitor 24-7, Inc. (who in turn sells Arikus powered solutions to Epson Europe and the Mayo Clinic, among other organizations).

Wrap Up

My notes suggest that a version of Arikus once cost about $1,000. You will need to contact the company to get a current price for its various products. Information about the company is available from Alacra, but these often provide high-level information, not the nitty gritty that you probably want. The president of the company is Aris Zakinthinos. You can contact Arikus by calling 1 416 410-8701.

Clustify: Identifying Similar Documents

March 9, 2008

In my recent lecture about eDiscovery in Boston, March 6, 2008, several people engaged me after my presentation. One of these interlocutors wanted my view on a company called Hot Neuron LLC. I said, “I’ll check my files.”

When I staggered into my log cabin in Harrod’s Creek, Kentucky, a parcel awaited me. I chopped it open with the flint knife popular in these parts. An envelope labeled “Cluster-Text.com” held a Clustify mouse pad. Coincidence or very expensive direct mail campaign?

A quick dip into my electronic files and a gander at Hot Neuron’s Web site reminded me that Clustify is in the eDiscovery category of Beyond Search tools. An attorney with too little time and too many documents can use Clustify to get her arms around the documents, tag them, and tackle the most significant documents in key clusters first.

Clustify can deliver an overview of the document set with documents neatly organized by categories. Clustify groups by analyzing the text to identify the structure that arises naturally.

Bill Dimm, founder of Hot Neuron, told me in a thoughtful email:

Clustify organizes the documents into clusters, and it labels each cluster with keywords so that you can see what it is about. It can sort the clusters by the number of documents they contain, so you can quickly see what the most significant topics are for the document set, but it doesn’t, by itself, put the documents into a hierarchy of categories. It does, however, provide a tagging tool that allows the user to define his/her own hierarchy of categories and very efficiently put the documents into the categories.

An attorney or paralegal can mark documents and “hook” them to other documents or topics. The ability to examine documents together can shave hours off a task that is tedious and susceptible to interruption. The software can speed the process of manual categorization by allowing you to make decisions one cluster at a time, instead of one document at a time. Clustify can enhance your search engine by grouping search results or finding relevant documents that don’t exactly match a query.

This essay provides a preliminary impression of Hot Neuron’s offerings.

The Names

Let me provide some guidance about the three names, Clustify, Cluster-Text.com, and Hot Neuron LLC. Clustify is the name of the clustering technology and software. The URL is www.cluster-text.com where you will find information about the product, contact information, etc. Hot Neuron LLC is the name of the company selling the software and operating the Web site. MagPortal.com is a component of Hot Neuron where the company’s for-fee news feed service and financial information components are available. MagPortal appears to use portions of the Clustify technology, and a screen shot of a result set appears elsewhere in this essay.

The Company

The privately-held Hot Neuron is the brain child of William Dimm, a theoretical physicist from Cornell University. Dr. Dimm left the university and physics in 1995. (For those of you who don’t associate Cornell with search and content processing, please, recall that Dr. Gerard Salton, né Gerhard Anton Sahlmann, put Information Retrieval into the consciousness of computer scientists world wide first at Harvard University, then at Cornell University. More information about Dr. Salton and his SMART innovations is here.)

Hot Neuron offers a number of useful products. These include specialized financial data components that you can plug into your Web site. With these widgets, you can assemble a view of the US and Canadian financial markets. The company also offers for-fee news feed services and an affiliate program.

Information retrieval and physics are tightly bound. Spend a few days wandering around Google, for instance, and you will find that physicists are as much a part of the Google ecosystem as mathematicians and computer scientists.

Mr. Dimm has dabbled in the commercial world for many years. For example, he has worked in the financial services industry fiddling with models for interest rate derivatives. Hot Neuron is the company set up to commercialize Clustify.

Hot Neuron LLC is an information retrieval software and services company located in Bryn Mawr, Pennsylvania.

Clustify

Clustify is document clustering software. This means that software processes documents and groups them according to their similarity. A definition I have in my files from Carnegie Mellon University says: “Document clustering is the act of collecting similar documents into bins, where similarity is some function on a document.”
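To make the definition concrete, here is a minimal sketch in Python of one textbook way to put similar documents into bins: bag-of-words vectors, cosine similarity, and a greedy single pass. This is not Hot Neuron’s proprietary algorithm. The function names, the sample documents, and the 0.5 threshold are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Turn text into a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(docs, threshold=0.5):
    """Greedy single-pass clustering: a document joins the first cluster
    whose seed document is similar enough, otherwise it starts a new one."""
    vectors = [vectorize(d) for d in docs]
    clusters = []  # each cluster is a list of document indexes; the first is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Invented sample documents.
docs = [
    "merger agreement draft for the board",
    "board review of the merger agreement",
    "lunch menu for the cafeteria",
]
groups = cluster(docs)
print(groups)
```

The “some function on a document” in the definition is the cosine function here. Swap in a different similarity function and the bins change, which is why vendors guard their formulas.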

Clustify, according to the company, “identifies important keywords for each cluster, giving quick insight into the document set. It also allows the user to create a hierarchy of custom tags that can be applied to individual documents, all documents in a particular cluster, or all clusters containing a particular combination of keywords, allowing the user to categorize hundreds of documents with a single mouse click.”

The system allows you to create a hierarchy of custom tags. These can then be applied to individual documents, the documents in a specific cluster, or those clusters containing certain key words. You can, according to the company, categorize many documents by pointing and clicking.

Clustering is useful in legal discovery, competitive intelligence, and general research.

Technology

Hot Neuron’s technology uses a proprietary algorithm that scales and delivers good cluster quality. According to the company, Clustify can cluster 1.3 million Wikipedia entries on a desktop computer running Linux in 20 minutes or 50 minutes under Microsoft Windows. The system outputs files with these extensions: CYI, CYO, and CYS.

Clustify can generate concept-based clusters, or it can require documents in the same cluster to contain identical passages of text. The latter option is useful for identifying near-duplicates, which can cut the cost of electronic discovery further than simple deduplication.
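The company does not disclose how it finds identical passages. A common textbook approach to near-duplicate detection is to compare overlapping word runs, called shingles. The Python sketch below shows that technique under my own assumptions. The three-word shingle size, the sample messages, and the function names are illustrative, not Clustify’s.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word runs) in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard overlap of two shingle sets: shared shingles / all distinct shingles."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Invented sample messages.
original = "Please review the attached contract before the Friday deadline"
forwarded = "FW: Please review the attached contract before the Friday deadline"
unrelated = "The cafeteria will be closed on Friday for cleaning"

near_dup = jaccard(shingles(original), shingles(forwarded))
different = jaccard(shingles(original), shingles(unrelated))
print(near_dup, different)
```

Simple deduplication would treat the forwarded message as a new document because it is not byte-for-byte identical. A shingle comparison catches it, which is where the extra eDiscovery savings come from.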

[Screen shot: MagPortal.com search results using Clustify]

This screen shot from Hot Neuron’s MagPortal.com Web site provides a glimpse of the Clustify functionality. (Note: I’m using MagPortal.com to illustrate the company’s technology. MagPortal.com charges a fee for some of its information services.) These Clustify features jumped out at me when I tested the system on March 9, 2008. Please, keep in mind that some of these features are particular to MagPortal.com and not included in the Clustify license:

  1. Search results with one-click access to similar articles. The icon for this function is the orange flag icon in each item in the results list
  2. A “Browse Main Topics” side bar. A user can scan these items instead of firing random key words into the search box, an activity meeting more and more user resistance
  3. Drop down boxes to specify what to search or narrow the results by date, publication, or category. (Google captures date and time information, but as far as I know, those metatags are not yet available to the run-of-the-mill user like me.)
  4. A list of “Related Categories” front and center in the shaded box above the search results.

I also liked the display of the results. Breaking out the source, author, and date is a common sense feature that other systems would do well to emulate. I find scanning results for these key meta items not only annoying but fatiguing. I have a hard time, when tired, differentiating light blue, light green, and other Web 2.0 colors. Kudos to MagPortal.com for making their results easy for me to scan.

About the MagPortal.com Search Engine

The MagPortal.com search engine tries to do a case-insensitive match of your query against the article title, description, authors, and the body of the article. Your query is sliced into “words” when it is processed. Articles that match the words in your query, minus stop words, are displayed in the results list. By default, the output is ordered by the “quality” of the match. The quality is determined by using a mathematical formula (standard term-frequency inverse-document frequency algorithm, not related to Hot Neuron Similarity) which takes into account how often the search term appears in the document (relative to the total length of the document) and how rare that particular search term is among all documents.
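For readers who want to see the arithmetic, here is a minimal Python sketch of a basic term-frequency inverse-document-frequency score. MagPortal.com’s exact formula is not published in my notes, so treat this as one common textbook variant. The stop word list, the sample articles, and the function names are assumptions for illustration.

```python
import math
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to"}  # illustrative list only

def tokenize(text):
    """Lowercase the text and slice it into words, dropping stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]

def tf_idf_scores(query, documents):
    """Score each document against the query with a basic TF-IDF sum."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    query_terms = set(tokenize(query))
    scores = []
    for words in tokenized:
        score = 0.0
        for term in query_terms:
            df = sum(1 for other in tokenized if term in other)
            if df == 0 or not words:
                continue
            tf = words.count(term) / len(words)  # frequency relative to document length
            idf = math.log(n_docs / df)          # rarer terms weigh more
            score += tf * idf
        scores.append(score)
    return scores

# Invented sample articles.
articles = [
    "Enterprise search vendors face a crowded market",
    "The market for gold reached a record price",
    "Search and search again: tuning enterprise search relevance",
]
scores = tf_idf_scores("enterprise SEARCH", articles)
ranked = sorted(range(len(articles)), key=lambda i: scores[i], reverse=True)
print(ranked)
```

The third article ranks first because the query terms make up a larger share of its words. The gold article scores zero and would not appear in a results list at all.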

Mr. Dimm told Beyond Search:

MagPortal.com has been around for eight years. We primarily index online magazine articles (not newspapers) on MagPortal, with the index being updated once each business day (magazines don’t change very fast). Anyone can browse and search the articles for free. The fee-based feed service allows clients to display a subset of the MagPortal data and functionality on their own Web sites. The client licenses a set of topics/categories relevant for their site, and their site users can browse/search the article data on their site much as they would on MagPortal. We also have a free version of the feed service, with limited functionality, available for non-commercial use here. For the financial data components (stock charts, etc.), we are actually just reselling feeds from another company because we’ve encountered clients that want both articles and charts.

More Information

For pricing, availability, and licensing details, contact Clustify at hotneuron.com.

Stephen Arnold, March 9, 2008 with an update at 5:35 pm Eastern

Lemur Flax: Open Source Search Pressure Rises

March 8, 2008

LastMinute.com founders Brent Hoberman and Martha Lane Fox have a new venture, MyDeco.com. What’s interesting about the furniture and home design site is that its search engine, based on the information available to me, is Lemur Consulting’s Flax. (Yanks will have to be careful when searching for information, because there are Mac utilities, restaurants, and clothing sites with “flax” domain names.)

Let’s get the links out of the way first:

  • The MyDeco beta site is at MyDeco.com
  • The link to Lemur’s free version of Flax is tucked away on the company’s Web site here and is accessible from Google Code as well
  • The link to the start page for Flax information is at Flax.co.uk
  • Information about Lemur Consulting Ltd. is at LemurConsulting.com

MyDeco.com Deployment

The screen shot below shows that the Flax system can deliver a feature-rich experience when properly set up, configured, and tuned. Keep in mind that Lemur does charge a fee for its services and its version of the Flax system.

flax_results

You can replicate this search for corner sofa by entering the query in the MyDeco.com search box. Note these features, please:

  • Tabs that allow one click access to the corner sofas in the result list in a “room” setting and articles about corner sofas
  • A price range slider to minimize the need for a user to type range values
  • A catalog-style results list
  • Across the top of the interface is a dark gray bar with one-click access to buying a “look”, content, social / community functions, and an FAQ

What’s interesting is that the Flax system for a single user’s desktop is available without charge from Lemur. The code is available from open source repositories as well. Why does this matter? The search system is free, although some restrictions apply. I’m not going to summarize the details, which are clearly stated on the Lemur Web site, but with a bit of fiddling you can now replicate the type of features available from commercial systems at a fraction of the price. The MyDeco.com implementation, which is in beta, could fool a user into assuming that the site’s search system was provided by Dieselpoint, Endeca, or Fast Search & Transfer. Unless my research is dead wrong, Flax sure looks like one of these industrial-strength solutions that can cost upwards of six figures to get up and running.

What’s Flax Do?

Flax can build searchable indexes of millions of documents such as Microsoft Office files, HTML, and Adobe PDFs. The default interface can be easily customized. Flax runs on Windows, Mac, and Linux. When properly resourced, the system is responsive and displays results quickly in my tests.

A Little about the Plumbing

Flax is based on Xapian, an open source platform. Lemur’s engineers have added software to handle more file types, including email and structured information in database management systems. Information about Xapian is available here.

Lemur’s Value Adds

Lemur provides its proprietary enhancements; namely, the adapters to handle file types, a method to provide support for large indexes, and middleware to make it easy to integrate Lemur Flax into enterprise applications. The company also provides various default interfaces to allow quick customization for specific client needs.

Lemur provides professional services to support Flax. The firm’s engineers will customize, install, and configure Flax to meet your specific requirements. The company also offers an on-going maintenance service and will handle software maintenance and ad hoc technical support as well.

To get a custom price quotation, you can contact Lemur, based in Cambridge, England, by sending an email to info at lemurconsulting.com.

Observations

You may want to try out Flax’s basic version. It can be installed on a single desktop PC. You can index a large number of files quickly. Note that Adobe PDF files consume more CPU cycles. Our test corpus processed in less than five minutes, which is comparable to the throughput of other desktop systems such as IBM Yahoo OmniFind, for example. OmniFind is based on the Lucene open source code. Unlike other “free” search products, Flax Basic does not put a limit on how many documents you can index. An FAQ is available, and the documentation for a free product is quite good.

Three other observations are warranted in my opinion:

  1. The use of open source code “wrapped” with proprietary middleware and equipped with customized widgets is increasing. I’ve written about Tesuji, a Hungarian vendor, offering a very good search system built on Lucene. These systems are good and can be used with confidence. It doesn’t take a month of research to figure out that open source search doesn’t pose much of a risk. If the price is right for you, you can save on license fees and deploy a very good search system.
  2. The idea of providing professional services as the core product is gaining momentum. When Verity reported that it was generating more than half its revenue from professional services shortly before it was acquired by Autonomy, the writing on the wall was clear. Trying to build a search company on license fees alone is not a revenue model with stamina. Search is not just a commodity; it’s a give-away or at least much less expensive than the products from high-flying, high-profile companies.
  3. The functionality of Flax, Tesuji, and other open source search plumbing is very good. In fact, the features of Flax are comparable to the six-figure products on offer in the 2004 – 2005 time period. Furthermore, these open source systems are improving rapidly.

What does this mean for commercial vendors of search? Two things.

First, the open source products will exert price pressure on more traditional vendors. IBM, Microsoft – Fast, and Oracle (where there’s been some turnover in the enterprise search unit) may find that their search solutions will have to be given away free or bundled with other enterprise software. Who will be willing to pay a premium for search solutions known to be complex, expensive to maintain, or, in the case of IBM, based on open source code?

Second, with each Flax or Tesuji, procurement teams and systems professionals have an increased opportunity to learn about the benefits of open source software. Most commercial organizations are conservative turtles. But financial pressure and cost overruns from better-known search vendors may combine to let Lemur Consulting and Tesuji, to cite two open source repackagers, get their moist noses under the enterprise tent.

Traditional enterprise search vendors — what I call “behind the firewall search” — face a significant incursion of open source into their territory. Arrogant dismissals of open source solutions won’t work the way they used to in the good old days. In fact, those good old days are gone. Can proprietary enterprise search solutions be today’s buggy whips?

Stephen Arnold, March 8, 2008

CMS: Houston, We Have a Problem!

March 7, 2008

The 2008 AIIM show is history.

I spent several days in Boston (March 3, 4, 5, 2008), wondering why the city built a massive concrete shoe box, probably designed by a Harvard or MIT graduate inspired by Franz Kafka and post-Stalinist architecture. It’s obvious no one had the moxie to tell our budding Leonid Savelyev that people expect mass transit, doors to the hotel across the street, and an easy-to-navigate interior. Spend a few hours wandering around this monstrosity, and you may resonate with my perceptions of this facility.

There’s another disaster brewing under the AIIM umbrella. That’s what the in-crowd calls content management. Synonyms in play at this show included CMS, ECMS (enterprise or extreme content management systems), and eDocuments, among others.

These synonyms are a radio beacon that says to me, loud and clear: “We have a way to help you deal with electronic information.” These assurances wrapped in buzzwords make it clear that organizations are: [a] unable to deal with basic storage and findability tasks; [b] confused about how business processes can and should intersect; [c] staggered like a punch drunk fighter by the brutally punishing costs of these eDoc solutions; and [d] scared because a mistake can send them to court or, even worse, jail. No one I met fancied doing a perp walk in an orange suit due to a failure to comply with regulatory mandates, legal discovery, and basic, common sense record keeping.

Folks were pretty thrilled to get a Google mouse pad from the Googlers or a rubber ball with flashing lights in it from Open Text. But amidst the bonhomie, there was a soupçon of desperation.

To me CMS and its step children attempt to make a run-of-the-mill operation into a high-end publishing company. The problem with attempting to embed an intellectual process dependent on information into software is that most people aren’t very good informationists. Using a BlackBerry or an automatic teller machine is not the same as creating useful, accurate, on-point information. CMS has now morphed from managing a static Web site’s content into a giant, Rube Goldberg machine that ingests everything and outputs anything, at least according to the marketers I met.

Electronic information is now a major problem for most employees, senior managers, and vendors. Building a solution that is affordable and satisfies the needs of the Securities & Exchange Commission from Tinker Toys is a tough job. I saw lots of Tinker Toy solutions on offer. I’m genuinely concerned about the problems these systems are exacerbating. “Trouble,” as one cowboy said to his side kick, “is coming down the line.”

This essay highlights three of my take-aways from this conference and exhibition. According to the chatter, there were more than 2,000 paying attendees who sat through lectures on subjects ranging from “Architecture Considerations in Electronic Records Management Software Selection” to “Pragmatic to Value Add: Will Anyone Really Pay for It?”. There were product reviews disguised as substantive lectures. I suffered some thin gruel passing as a solid intellectual feast. I heard that another 20,000 people fascinated with copiers, high-speed imaging, and digital information wandered through the charming aircraft hangar of an exhibit hall.

Most of the presenters “follow the game plan”. The talks are in the average to below average grade range. A few are interesting, but finding one is a hit-and-miss affair. This conference housed a Drupal conference, something called On Demand, and the AIIM conference. For my purposes, there’s one conference, and the unifying theme was lots of people talking about electronic information.

What I Learned

Let me compress 18 hours of AIIM experiences into these points:

  1. Digital content is a major problem for most organizations. CMS is the band aid, but none of the vendors has a cure for information obesity. None of the customers with whom I spoke using vendors’ solutions are in shape for a digital triathlon. Systems are expensive and flaky. Budgets are tight, and the problems of storing, finding, and repurposing information are getting worse fast.
  2. Vendors with hardware solutions that scan paper, print paper, and manipulate digital counterparts of paper are spouting digital babble and double talk. Vendors of quasi-copy machines talk about hardware as if it were bits in a cloud. AIIM has its roots in scanning, micrographics, microfilm, and printing. Hardware, even when it is the size of an SUV, is positioned as software, a system, and a platform. Obviously hardware lacks sizzle. Vendors with software solutions talk about the pot of gold at the end of the dieters’ rainbow. It just ain’t true, folks. It’s a Nike running shoe commercial applied to information. No go. Sorry.
  3. Marketing messages are not just muddled; the messages are almost incomprehensible. Listening to earnest 30-year-olds tell me about “enterprise repositories with integrated content transformation and repurposing functionality” and “e-presentment” left me 100 percent convinced that the information crisis has arrived, that the vendors will say anything to get a deal, and that the buyers will buy whatever assuages their fears. Rationality was not in surplus in these sales pitches.

My stomach rebels at baloney.

The “Real” Problem

Organizations right now are fighting a three-front war against digital information. I know that the AIIM attendees are having a tough time expressing their challenges clearly. The people with whom I spoke can only describe the problem from an individual point of view. Vendors want to be all things to all people. The dialogs among the customers and the vendors are fascinating and disturbing to me. I think the market is in a state of turmoil.

Digital information is a different type of challenge for an organization. On one hand, it eliminates the hassles of recycling some information. Cut and paste is a wonderful function. But if your work processes are screwed up, digital information only creates more problems. If your employees aren’t good informationists, you will produce more dross than ever. You will, of course, do it more quickly which adds to the problem. Furthermore, finding something remains tough. Automated systems are expensive, complex, and fully capable of going off the rails with no warning.

What was crystal clear to me is that most business processes have not been “informationized”, to use a weird verb form I heard at the show. Work flows are based on human actions. Humans are just not very good at “being digital”.

Wrap Up

An inability to handle digital information is a problem of great import. Regulators expect companies to manage digital information. Organizations aren’t set up to deal effectively with the data volume and its challenges — format, versions, volatility, non-textual components, etc. The problem is not getting better. The problem is getting bigger.

One well-fed, sleek senior manager smirked with pride about the huge prices paid by certain firms to acquire enterprise content management companies (ECM or enterprise CMS in the jargon of AIIM). He pointed to two firms — EMC and Hewlett Packard — as particularly adept practitioners of snapping up “hot” companies in order to get “high margin upsides”. “There’s a big market for this high-end solution,” he asserted.

I think this weird MBA speak means that EMC and HP want to buy into a sector with fat margins and semi-desperate customers. This can work, but I am not sure that these two firms’ “solutions” are going to solve the information challenges most organizations now face. EMC wants to move hardware. HP wants to sell printers and ink.

I’m probably wrong. I usually stray into the swamp anyway.

I think information mis-management will bring the direct downfall of some organizations in the next few months. Tactical fixes will not be enough. When an information-centric collapse occurs, perhaps buzzwords will give way to new thinking about digital information in organizations. More meat, fewer empty calories, please!

Stephen Arnold, March 7, 2008

ISYS’ Ian Davies Interviewed

March 6, 2008

An interview, conducted in February 2008, with Ian Davies, founder of ISYS Search Software, is now available. ISYS was founded in Australia almost 20 years ago. Beyond Search has identified this company as one to watch. Its ready-to-run solution is fast, intuitive, and feature complete.

ISYS result screen

The company has made strong inroads into law enforcement, litigation support, and competitive intelligence. These are sectors long viewed as the domain of certain publicly-traded search system vendors. ISYS’ success is a result of more than a decade of innovation.

Mr. Davies says in his interview, “We described our early product as an iceberg … the really important bits were never seen. The bit below the water-line was crucial to make the whole thing work… Some of our competitors were ‘all tip and no berg'”.

ISYS lets its system “do the talking.” The key to ISYS’ marketing is the company’s almost obsessive desire to listen and respond to its customers.

I asked Mr. Davies about the marketing hype that swirls through the content processing industry. He laughed and said, “When I attend an industry event and hear the jargon and the buzzwords, I chuckle. ISYS is a search solution that offers features that are useful to a large percentage of users. You don’t need fancy jargon like semantic wiki or enhanced Bayesian algorithms to make a system useful. There needs to be rocket science, without doubt… [Our] system does the talking, not the buzzwords.”

In my testing of ISYS 8.x, I was impressed by the system’s performance. I asked Mr. Davies about ISYS’ speedy indexing and query processing. He told me:

Performance comes in two ways. The first is algorithmic performance, where the algorithms you come up with scale well with volume. It’s all about curves, and it’s key. If your curves are wrong, you’re never going to scale. The second is implementation performance, where there’s craftsmanship in how well you implement your algorithm. We spend a lot of time analyzing our code and profiling to ensure the bottlenecks are eliminated.

According to Mr. Davies, the company is experiencing rapid growth in its offices in Australia, the U.S., and Europe. ISYS is worth a hard look. You can read the full interview here. You can download a trial version here.

Stephen Arnold, March 8, 2008

Ask and You Shall Receive No Traffic

March 5, 2008

About eight months ago, a colleague and I had dinner with two whizzy consultant types from a big city.

One of the conversational topics was Web search, a subject which I make an effort to avoid. Web search evokes for me information of unknown provenance from an unspecified number of Web sites on an unknown update cycle with a murky (at best) method of determining relevancy.

I don’t care which search engine I use or you use.

None is particularly good, so running the same query on different systems is a must for me. I have some short cuts, but it is a chore to sift through chaff to find a couple of informational “Wheaties”.

Try it. Pick a Web search system. Enter a single word query like Spears, and you get the drivel of popular culture. Avoid. Like. The. Plague.

Back to the Dinner Chit Chat

The four of us are sitting in the River Creek Inn, a high class motorcycle bar in Harrod’s Creek. I’m trying to decide between the Kentucky favorites, burgoo (squirrel stew with bourbon) or hot brown (white bread, ham, turkey and bourbon-based gravy).

My colleague picks up on a stray comment and asks the male zippy consultant, “Did you say you prefer Ask.com over Google.com or Yahoo.com for Web search?”

I gave her my best 64-year-old squint, but the zippy male consultant grabbed the opening and launched into an Ask.com panegyric. I decided on a green salad and turned my attention to this fellow.

What He Liked and My Rejoinders

I recall clearly that this Ask.com cheerleader liked three aspects of Ask.com. Hold on to your socks:

  1. Ask.com is easy to use. My comment: No doubt about that. The system has long been a favorite of the middle school crowd. Not a hot demographic, but a good window into the “strengths” of the system.
  2. Ask.com has a better interface than Google, Microsoft, Yahoo, et al. My comment: I know about large, colorful icons and a search box. I’m not sure how this makes search better than Google’s text hot links or the research test Yahoo Mindset slider interface.
  3. Ask.com is gaining market share and is a real contender in Web search. My comment: Mr. Whizzy Consultant, sir, do you have substantive data to back this assertion? At the time of this meeting in 2007, AskJeeves.com was a distant fifth in Web search traffic and under pressure from Exalead.com.

The zippy consultant seemed shocked that I would challenge his assertions. Today, he probably has conveniently forgotten our table chat and overlooked the news that Ask.com (formerly AskJeeves.com) is going to morph into a Web site for Midwestern females. That should make the middle school kids happy when they try to look up Julius Caesar in a few months. You can read the AP story here. The writing was on the wall. I learned several months ago that Ask.com’s technical guru had returned to academe. On March 4, 2008, I learned that Gary Price, a highly regarded librarian, severed his ties to Ask.com (or had his ties to Ask.com severed). You can read his upbeat fare-thee-well here.

Some History: Faux NLP, A Miss with Direct Hit, and Killing the Butler

In the 1990s, AskJeeves.com was one of the first Web search companies to assert that it performed NLP or natural language processing. The idea was that a user would type a question in the search box; for example, “What’s the weather in Chicago?” AskJeeves would come back with the temperature. I can’t find my screen shots of this function, but I do recall it worked. But, and this is a big caveat, AskJeeves did not do NLP. Humans created templates and rules. When the user’s query matched a template, the rules would form a query, get the aforementioned temperature, and display it.

Send the system a question it didn’t understand and the system would return a bag of jelly beans in many flavors and colors. In less metaphorical terms, the relevancy of the results was full of empty calories.
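To make the template-and-rule idea concrete, here is a small Python sketch of how such a system could be wired together. It is my illustration, not AskJeeves’ code. The pattern, the function names, and the weather values are all hypothetical stand-ins.

```python
import re

# Hypothetical data standing in for whatever back-end feed a real system would query.
WEATHER = {"chicago": "34 F", "boston": "41 F"}


def weather_rule(match):
    """Rule: turn the matched city into a lookup and return the temperature."""
    city = match.group("city").strip().lower()
    return WEATHER.get(city)


# Each template pairs a hand-written pattern with the rule that builds the answer.
TEMPLATES = [
    (
        re.compile(
            r"what(?:'s|’s| is) the weather in (?P<city>[a-z .]+?)\??$",
            re.IGNORECASE,
        ),
        weather_rule,
    ),
]


def answer(question):
    """Return a direct answer when a template matches; otherwise return None."""
    for pattern, rule in TEMPLATES:
        match = pattern.match(question.strip())
        if match:
            return rule(match)
    return None  # no template fits, so a real system would fall back to ordinary search
```

A question that fits the template gets a direct answer. Anything else returns None, which is the point at which the jelly beans arrive. Every new question type needs a human to write another template and rule, and that is where the editorial cost comes from.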

Direct Hit

To fix the problem of the brutal costs of human editors working like mad to create more templates and rules, AskJeeves.com bought DirectHit.com. As I recall, DirectHit.com was a shopping and ad engine. What sticks in my mind is that DirectHit used tiny orange stick figures to indicate relevance (I think). There was some hoo-haa about AskJeeves’ acquiring this technology, and then it drifted off my radar screen.

Downsizing

AskJeeves’ management team sold off the rules-based question answering system, and the public Web search system plopped into the Barry Diller empire.

Teoma

Teoma, I recall, was developed by technical wizards at Rutgers University. (I want to note that Rutgers is a great institution as evidenced by the Eagleton Lectureship the university awarded me in the late 1980s.)

Teoma, billed as next generation Web search, delivered three types of search responses. The results included a traditional laundry list. Teoma also offered “See Also” references and a point-and-click set of hot links to narrow a results set. Some of the queries, as I recall, included links suggested by other users. Today this type of feature is dubbed “social search”.

I liked Teoma, and it became the core engine of Ask.com.

What Happened?

In many ways, Ask.com did many things correctly. Management focused on a core strength — search. Management acquired and integrated more sophisticated technology. Management established a brand identity, the Jeeves butler who suggested that he was at my service.

What went wrong appears to have been a combination of exogenous forces and a stalwart tripping and falling on her sword.

The exogenous force was the broader market dynamics involving Yahoo, then Google, and more recently Microsoft. Yahoo and Google sucked up most of the search traffic and captured most of the ad revenue. As the distance widened between Google and Yahoo, there wasn’t enough revenue to keep pace with the brutal technical investments required to play the Web indexing game. Compared to Exalead, to cite one example, Ask.com’s infrastructure was more expensive to scale and more fiddly than the French upstart’s approach based on AltaVista-type engineering.

The self-inflicted wound may have been caused by putting marketing before technology. I am no marketing expert, so my summary is probably off base. I rather liked the butler logo. Like the Google logo, it conveyed some humor and whimsy in what is a bloodless game. I was baffled by some of the Ask.com advertisements. Other than a general sense of bewilderment, I wasn’t sure what the heck the company was trying to say. I recall sitting in a couple of presentations by Ask.com, and I came away feeling that the chipper Ask.com professionals were talking about a system that I did not recognize.

Denouement and a New Beginning

Let’s review what my opinion is:

  1. In order to scale a Web search system, the stakeholders have to be prepared to spend, often substantial sums without much notice. When that money is not available, short cuts become evident. These range from marketing sleight of hand to wacko advertising and graphic tweaks.
  2. Running any search engine against a headwind and with a three percent market share is tough. MBAs often see the search challenge as trivial. It’s not. If you have the technology to leap frog ahead of Google, you can pull Googzilla’s tail, maybe slow Googzilla down. But good enough solutions won’t do the job.
  3. The service was positioned differently as the various owners / managers tried to turn a digital pig’s ear into a digital golden goose. Killing the venerable Jeeves character presaged the demise of the broader service.

The information superhighway is littered with road kill. Some of these are still around; for example, there’s Lycos.com, Excite.com, and Dogpile.com. Have you used these today? And what about HotBot.com, Muscat.com, or WiseNut.com? Maybe you are using AltaVista.com, AllTheWeb.com, Gigablast.com, IceRocket.com, or A9.com? I’ve got a list of a couple of hundred international search engines, lists of metasearch engines, and lists of more than 350 companies offering search and content processing systems. At this time, none of these outfits are able to hobble Googzilla if my understanding of usage data is correct.

Back to the Biggie Consultants

When a biggie consultant asserts that a long shot with a track record of coming in last in a five- or six-year race is a real contender, I’m not going to let that dog sleep peacefully. Don’t misunderstand me. I need access to multiple, high-quality Web search systems. As nifty as Google is, my tests reveal that if I run a query such as text mining or content processing on Google, I will double the number of relevant hits if I use two other Web search systems.

Here are a few observations about Web search, which I will use to control this “I told you so” essay.

First, Web indexing is a messy, complicated, and imprecise activity. Web robots can’t index servers when the servers are down or when network issues create time outs. A searcher does not know what’s omitted, what’s new, or what’s old in most cases. None of the systems I track provide much substantive information about the “certainty” of a result, its date of creation and when the content was refreshed, the “quality” of the information, and so on. Not only are Web results spotty, in most cases, I have zero useful information to help me determine what’s correct and what’s dead wrong. Social search sounds great, but social systems can be easy to twiddle, spoof, and fast dance.

Second, more Web sites are dynamic today than in the past. This means that static, easy to index Web pages make up a smaller percentage of public pages with content. Dynamic sites are more difficult to spider because robot technology does a lousy job with dynamic sites. There’s a solution, but it is even more expensive, complex, and difficult than indexing good, old flat HTML, XHTML, or XML pages. (Google has this technology called the Programmable Search Engine, but so far the company has been keeping it under wraps.) Under-funded tech operations find it tough to compete because the people and the money are not available.

Third, user behaviors are changing in step with the access devices. With more queries flowing from mobile devices, different processes are needed. Who wants to browse results on a tiny screen even if it is a state of the art iPhone or BlackBerry?

Fourth, search is drifting toward point-and-click interfaces and even more sophisticated approaches, what I call “beyond search” techniques.

To conclude, biggie consultants who assert that a particular search system will gain market share based on personal preferences and lack of information are much in evidence today. There’s no lack of talk about innovation in Web search. But I for one am waiting for Powerset.com to become available. I’m annoyed that EZ2Find.com has such lousy marketing for an interesting and useful service. I want a leap frog system to take me beyond Google.

I “ask” so that I may receive. Whizzy big city consultants don’t ask; they assert. Doesn’t work sometimes, does it?

Stephen Arnold, March 5, 2008
