Vivisimo: Organizations Need a Search Strategy

August 3, 2008

Vivisimo, a company benefiting from the missteps of better known search vendors, has a new theme for its Fall sales push. Jerome Pesenti, chief scientist for Vivisimo, delivered a lecture called “Thinking Outside the (Search) Box”. The company issued a news release about the need for an organization to have an enterprise search strategy in order to prove the return on investment for a search system. What is remarkable is that–like Eric Schmidt’s opinions about how other companies should innovate here–scientists are providing consulting guidance. MBAs, accountants, and lawyers have long been the business gurus to whom challenged organizations turned for illumination. Now, a Ph.D. in math or a hard science provides the foundation for giving advice and counsel. Personally I think that scientists have a great deal to offer many of today’s befuddled executives. You will want to download the presentation here. You will have to register. I think that the company will use the names to follow up for marketing purposes, but no one has contacted me since I registered as Ben Kent, a name based on the names of beloved pets.

Is Vivisimo’s ROI Number Right?

For me the key point in the Vivisimo guidance, and I am paraphrasing so your take may be different from mine, is that an organization needs to consider user needs when embarking on an enterprise search procurement. Mr. Pesenti reveals that Modine Manufacturing saved an estimated $3.5 million with a search strategy and the Vivisimo Velocity search system. You can learn more about Modine here. The company has about $1.8 billion in revenue in 2008, and it may punch through the $2.0 billion barrier in 2009. I know that savings are important, but when I calculated the savings as a percent of revenue, I got a small number (a quick calculation appears after the questions below). The payoff from search seems modest, but the $3.5 million is “large” in terms of the actual license fee and the estimated ROI. My thought is that if a mission critical system yields less than one percent return on investment, I would ask these questions:

  • How much did the search system cost fully loaded; that is, staff time, consultants, license fees, and engineering?
  • What’s the ongoing cost of maintaining and enhancing a search system; that is, when I project costs outward for two years, a reasonable life for enterprise software in a fast-moving application space, what is that cost?
  • How can I get my money back? What I want as a non-scientific consultant and corporate executive is a “hard” number directly tied to revenue or significant savings. If I am running a $2.0 billion per year company, I need a number that does more than twiddle the least significant digits. I need hundreds of millions to keep my shareholders happy and my country club membership.
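To show why the number strikes me as small, here is a quick back-of-the-envelope calculation using only the figures cited above; it is illustrative arithmetic, not audited data.

```python
# Back-of-the-envelope: reported savings as a share of Modine's revenue.
# Both figures come from the post above; this is illustrative, not audited.
savings = 3_500_000        # estimated savings attributed to the search strategy
revenue = 1_800_000_000    # approximate 2008 revenue

share = savings / revenue
print(f"Savings as a percent of revenue: {share:.2%}")  # roughly 0.19%
```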

Enterprise search vendors continue to wrestle with the ROI (MBA speak for proving that spending X returns Y cash) for content processing. Philosophically, search makes good business sense. In most organizations, an employee can’t do “work” unless he or she can find electronic mail, locate an invoice, or unearth the contract for a customer who balks at paying his bill. One measure of the ROI of search is Sue Feldman’s and her colleagues’ approach. Ms. Feldman, a pretty sharp thinker, focuses on time; that is, an employee who requires 10 minutes to locate a document by rooting through paper folders costs the company 10 minutes’ worth of salary. Replace the paper with a search system from one of the hundreds of vendors selling information retrieval, and you can chop that 10 minutes down to one minute, maybe less.
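For readers who want to see how the time-based argument turns into a dollar figure, here is a minimal sketch. The salary, headcount, and searches-per-day numbers are my own assumptions for illustration, not figures from Ms. Feldman’s research.

```python
# Time-based ROI sketch for enterprise search (illustrative assumptions only).
minutes_saved_per_search = 9        # 10 minutes with paper vs. 1 minute with search
searches_per_employee_per_day = 4   # assumption
employees = 1_000                   # assumption
working_days_per_year = 230         # assumption
loaded_cost_per_minute = 0.75       # assumption: roughly $45/hour fully loaded

annual_minutes_saved = (minutes_saved_per_search
                        * searches_per_employee_per_day
                        * employees
                        * working_days_per_year)
annual_savings = annual_minutes_saved * loaded_cost_per_minute
print(f"Estimated annual time savings: ${annual_savings:,.0f}")
```

Change any one assumption and the “return” swings by millions, which is exactly why these calculations persuade some people and leave others cold.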

land of search costs

This is the land of search costs. What’s your return on investment when you wade into this muck?

Problems with ROI for Utility Functions

The problem with any method of calculating ROI for a non-fungible service that incurs ongoing costs is that accounting systems don’t capture the costs. In the US government, costs are scattered hither and yon, and not too many government executives work very hard to pull “total costs” together. In my experience, corporate cost analysis is somewhat similar. When I look at the costs reported by Amazon, I have a tough time figuring out how Mr. Bezos spends so little to build such a big online and search system. The costs are opaque to me, but I suppose MBA mavens can figure out what he spends.

The problem search, content processing, and text analytics vendors can’t solve is demonstrating the value of investments in these complex information retrieval technologies. Even in tightly controlled, narrowly defined deployments of search systems, costs are tough to capture. Consider the investment special operations groups make in search systems. The cost is usually reported in a budget as the license fee, plus maintenance, and some hardware. The actual cost is unknown. Here’s why: how do you capture the staff cost for fixing a glitch in a system that absolutely must be online? That extraordinary cost disappears into a consulting or engineering budget. In some organizations, an engineer works overtime and bills the 16 hours to a project or maybe to a broad category called “overtime”. Magnify this across a year of operations for a troubled search system, and those costs exist but are often disassociated from the search system. Here’s an example. The search system kills a network device due to a usage spike. The search system’s network infrastructure may be outsourced, and the engineer records the time as “network troubleshooting.” The link to the search system is lost; therefore, the cost is not accrued to the search system.

In one search deployment, the first-year operating cost was about $300,000. By the seventh year, the costs rose to $23.0 million. What’s the ROI on this installation? No one wants to gather the numbers and explain these costs. The standard operating procedure among vendors and licensees is to chop up the costs and push them under the rug.
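Those two figures imply a striking growth rate. Here is a minimal sketch of the arithmetic, using only the two numbers reported above and assuming the growth compounded evenly over the six intervening years.

```python
# Implied compound annual growth rate of operating costs from year 1 to year 7.
# Uses only the two figures cited above; the even-compounding assumption is mine.
year_1_cost = 300_000
year_7_cost = 23_000_000
years_elapsed = 6

cagr = (year_7_cost / year_1_cost) ** (1 / years_elapsed) - 1
print(f"Implied annual cost growth: {cagr:.0%}")  # roughly 106 percent per year
```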


Stanford TAP: Google Cool that Trails Cuil

July 31, 2008

In the period from 2000 to 2002, Dr. Ramanathan Guha, with the help of various colleagues and students at Stanford, built a demonstration project called TAP. You can download a PowerPoint presentation here. I verified this link on July 30, 2008. Frankly, I was surprised that this useful document was still available.

TAP was a multi-organization research effort. Participants included IBM, Stanford, and Carnegie Mellon University.

Why am I writing about information that is at least six years old? The ideas set forth in the PowerPoint were not feasible when Dr. Guha formulated them. Today, the computational power of multi-core processors coupled with attractive price-performance ratios for storage makes the demos from 2002 possible in 2008.

TAP was a project set up to unify islands of XML from disparate Web services. TAP also brushed against automatic augmentation of human-generated Web content. Working with Dr. Guha was Rob McCool, one of the developers of the Common Gateway Interface. Mr. McCool worked at Yahoo, and he may still be at that company. Were he to leave Yahoo, he may want to join some of his former colleagues at Google or a similar company.

Now back to 2002.

One of TAP’s ambitious goals was to “make the Web a giant distributed database.” The reason for this effort was to bring “the Internet to programs”. The Web, however, is messy. One problem is that “different sites have different names for the same thing.” TAP wanted to develop a system and method for “descriptions, not editors, to choreograph the integration.”
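To make the “different names for the same thing” problem concrete, here is a minimal sketch of the general idea of reconciling aliases to a canonical identifier before merging data from several sources. The names, sources, and mapping are hypothetical, and this is not TAP’s actual mechanism.

```python
# Toy illustration: reconciling different surface names for the same entity
# before merging records from several sources. Names and sources are made up;
# this is not TAP's actual mechanism.
aliases = {
    "IBM": "ibm_corp",
    "International Business Machines": "ibm_corp",
    "I.B.M.": "ibm_corp",
}

source_a = {"IBM": {"ticker": "IBM"}}
source_b = {"International Business Machines": {"hq": "Armonk, NY"}}

merged = {}
for source in (source_a, source_b):
    for name, record in source.items():
        canonical = aliases.get(name, name)  # fall back to the surface name
        merged.setdefault(canonical, {}).update(record)

print(merged)  # {'ibm_corp': {'ticker': 'IBM', 'hq': 'Armonk, NY'}}
```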

The payoff for this effort, according to Dr. Guha and Mr. McCool, is that “good infrastructures have waves of applications.” I think this is a very important point for two reasons:

  1. The infrastructure makes the semantic functions possible and then the infrastructure supports “waves of applications”.
  2. The outputs of the system described are new combinations of information, different ways to slice data, and new types of queries, particularly those related to time.

Here’s a screen shot of TAP augmenting a query run on Google.

augmented search results

The augmented results appear to the left of the results list. These are sometimes described as “facets” or “assisted navigation hot links”. I find this type of enhancement quite useful. I can and do scan result lists. I find overviews of the retrieved information and other information in the system helpful. When well executed, these augmentations are significant time savers.
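As a minimal sketch of what assisted navigation involves, here is one common way facets are produced: count metadata values across the retrieved set and show the most frequent ones beside the result list. The results and fields below are hypothetical; TAP’s and Google’s internals are certainly more sophisticated.

```python
# Toy facet generation: count metadata values across a result set and surface
# the most common ones as navigation links. Results and fields are made up.
from collections import Counter

results = [
    {"title": "Cellist concert schedule", "type": "event", "year": 2001},
    {"title": "Cellist biography", "type": "article", "year": 2000},
    {"title": "Ensemble tour dates", "type": "event", "year": 2002},
]

for field in ("type", "year"):
    counts = Counter(str(r[field]) for r in results)
    facet = ", ".join(f"{value} ({n})" for value, n in counts.most_common())
    print(f"{field}: {facet}")
```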

Keep in mind that when this TAP work-up was done, Dr. Guha did not work at Google. Mr. McCool was employed at Stanford. Yet the demo platform was Google. I find it interesting as well that the presentation emphasizes this point: “We need [an] infrastructure layer for semantics.”

Let me conclude with three questions:

  1. Google was not directly mentioned as participating in this project, yet the augmented results were implemented using Google’s plumbing. Why is this?
  2. The notion of fueling waves of applications seems somewhat descriptive of Google’s current approach to enhancing its system. Are semantic functions one enabler of Google’s newer applications?
  3. When will Google implement these enhanced features of its interface? As recently as yesterday, the Cuil.com interface was described as more up to date than Google. Google had functionality in 2002 or shortly thereafter that moves beyond what Cuil.com showed today.

Let me close with a final question. What’s Google waiting for?

Stephen Arnold, July 31, 2008

Autonomy Nails Another Laurel to Its Crown

July 22, 2008

Autonomy follows its analyst-crushing financial results with the “highest Socha-Gelbmann rankings”. The story appeared in the highly regarded MarketWatch online news service via the PRNewswire via FirstCall via Comtex. I am thrilled that the news reached me quickly. You can read the full story here. If you have been living in a hollow in rural Kentucky, you may ask, “What’s a Socha-Gelbmann Ranking?” Well, let me fill you in.

Socha-Gelbmann

Socha Consulting LLC, operated by George J. Socha, Jr., Esquire, conducts surveys and delivers services in eDiscovery and automated litigation support. Socha Consulting focuses on the eDiscovery market. The term means “electronic discovery”, a buzzword much loved by attorneys and consultants involved in figuring out what’s in the terabytes of electronic information delivered by the legal discovery process.

Mr. Socha is the principal in Socha Consulting, LLC, a firm which provides expert advice to consumers with respect to effective electronic discovery strategies, and to providers with respect to the development of e-discovery services, software, and strategy. Prior to forming Socha Consulting, Mr. Socha worked in private practice, where he helped establish litigation support departments at 250-attorney and 50-attorney firms. Mr. Socha is a graduate of the University of Wisconsin (B.A.) and Cornell Law School. You can read this bio here. Mr. Socha’s offices are in St. Paul, Minnesota, a lovely city.

Tom Gelbmann is the other half of the research report’s team. Information about him is located at Gelbmann.biz here. Mr. Gelbmann runs a consulting practice focused on helping law firms and corporate law departments maximize value from investments in technology. He has held a CIO position at two major law firms, and he has also conducted several market research projects on behalf of information and technology service providers to the legal sector. Prior to his work with the legal technology community, Tom served as a Director of Computer Security Consulting for a global consulting organization. You can read his full bio here. His office is in Minnesota. Details are here.

eDiscovery

In a nutshell, eDiscovery indexes documents. The very best systems provide useful tools to the lucky souls who are billable during this tedious process of ferreting for evidence, facts, and supporting material. For example, some eDiscovery systems include billing functions to make it painless for the hard-charging attorney to tally the minutes, hours, days, weeks, and months required to “read” lots of email, memos, reports, and files with text in them. Other systems take an item–say, for example, the name of a person–and generate a list of related documents or people. Other systems chew through terabytes of text and generate a visual display of who is related to whom or what is related to what. I have seen systems using cartoon figures and lines to connect individuals, events, cash transfers, and other life actions. Most of these systems allow the legal eagle to enter a word or phrase, see a results list, browse a list of related topics, and perform other activities which can then be saved in a “case audit” file. The idea is that another lawyer can come along and recreate the exact finding process, identify the specific document with the needed “fact”, and print out the audit trail for a cowering opponent whose argument has been trashed with the brilliance of the legal argument, silver bullet fact, and solid research.
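The “case audit” idea lends itself to a simple illustration. Here is a minimal sketch of recording a reviewer’s search actions so another person can replay the finding process; the field names and actions are hypothetical and not tied to any particular eDiscovery product.

```python
# Toy "case audit" trail: record each review action so the finding process
# can be replayed later. Field names and actions are hypothetical.
import json
from datetime import datetime, timezone

audit_trail = []

def record(action, **details):
    audit_trail.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,
        **details,
    })

record("query", terms='"wire transfer" AND Smith')
record("open_document", doc_id="MSG-004512", reason="mentions the disputed invoice")
record("tag", doc_id="MSG-004512", tag="silver bullet")

# The saved file lets another reviewer recreate the exact finding process.
print(json.dumps(audit_trail, indent=2))
```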

So, the most recent study by George J. Socha, Jr., Esquire is described here. The current report looks over the previous five years of Socha-Gelbmann results and the output is the 2008 Socha-Gelbmann 6th Annual Electronic Discovery Survey.

Findings

The new report is available now. The big news is that Autonomy has been, according to the aforementioned news story:

named a Top 5 Electronic Discovery Provider in the 2008 Socha-Gelbmann Electronic Discovery Survey Report for its ZANTAZ’ e-Discovery software and service. Autonomy was named a Top 5 Provider in nine software and service categories, including preservation, collection, analysis, production, presentation, and law firm rankings. This marks the fourth consecutive year that the company has been ranked as a Top 5 service provider in the report.

You can get a small nibble of the approach in this series of questions about the 2007 study here.

Autonomy provides, according to the news story:

end-to-end eDiscovery for the largest and most complex legal and regulatory matters, supported by 6,000 servers across five data centers. This comprehensive technology and services solution provides data preparation, analytics for Early Case Assessment (ECA), legal hold, full EDD processing, advanced review and production, all on a powerful platform. Through automatic processing of all electronically stored information (ESI), whether email, audio or video, Autonomy enforces legal hold policies and enables eDiscovery across the organization based on the meaning and relevance of information to litigation.

Kudos to Autonomy for this excellent showing. And, to George J. Socha, Jr., Esquire and Tom Gelbmann, “Keep up the good work.” A happy quack to the Autonomy team as well. With video, fraud detection, and eDiscovery, I may have to recategorize Autonomy from enterprise search vendor to enterprise information application solution provider. If I do this, the search sector will lose a luminary. Plus ça change, plus c’est la même chose!

Stephen Arnold, July 22, 2008

Megaputer: An Emerging Force in Data and Text Analysis

June 23, 2008

Megaputer, based in Bloomington, Indiana, continues to expand the capabilities of its data and text analysis system. The next release, said Sergei Ananyan, one of the company’s founders, will include a 64-bit version, browser-based reporting, and support for text analysis in multiple languages.

Dr. Ananyan, who holds a Ph.D. in nuclear physics, spoke to ArnoldIT.com and said:

Megaputer keeps developing PolyAnalyst as a powerful and flexible analytic platform, but our real strength derives from the ability to build push-button custom solutions for handling typical tasks in various application domains.

Megaputer was founded in 1994, which makes the company one of the more mature in the data and text analysis fields. The company has landed a number of blue-chip customers in law enforcement, pharmaceuticals, and financial services.

As organizations realize that individual users and work units require customized content processing systems, Megaputer’s approach has been attracting attention. Megaputer can deploy its range of analytic tools to meet the needs of different users without having to do the manual coding and hands-on rework that plague many of the firm’s competitors.

The company, however, is anchored in mathematics and quite advanced algorithms. Dr. Ananyan says:

We value math, and I suppose we share that technical foundation with Google. So, okay, we are good at math just like Google but with one difference. I think we are specialists in the type of math necessary to make Megaputer solve our clients’ problems.

The key to success, says Dr. Ananyan:

While providing users of PolyAnalyst with lots of functionality, we try to lower the learning curve for new users. We spend lots of thought and effort on keeping PolyAnalyst as simple in use as possible. We make every effort to simplify the user experience with the system. The user builds a data analysis scenario through an intuitive drag-and-drop interface. The developed scenario is represented as a graphical flow chart with editable nodes and can be shared for collaboration or scheduled as a task for future execution. The results of any analytical step can be saved in an easy-to-comprehend and visually appealing report the user generates on the fly.
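The “graphical flow chart with editable nodes” maps onto a familiar pattern: an analysis scenario represented as a small pipeline of named steps that can be saved, shared, and re-run. Here is a minimal sketch of that general pattern; the node names and logic are hypothetical and are not PolyAnalyst’s internals.

```python
# Toy analysis "scenario": an ordered pipeline of named nodes that can be
# stored, shared, and re-executed. Node names and logic are hypothetical.
scenario = [
    ("load", lambda data: ["Great product!", "Terrible support.", "great value"]),
    ("lowercase", lambda docs: [d.lower() for d in docs]),
    ("filter_positive", lambda docs: [d for d in docs if "great" in d]),
    ("count", lambda docs: len(docs)),
]

def run(pipeline, data=None):
    for name, step in pipeline:
        data = step(data)
        print(f"after {name}: {data!r}")
    return data

run(scenario)  # the saved scenario can be re-run or scheduled later
```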

Megaputer has several advantages compared to some vendors who provide a specific text processing function:

  • The company’s technology suite is broad and deep, supporting on-the-fly categorization, ease of use, and versions for single-user and on-premises enterprise installations
  • The strong foundation in mathematics does not get in the way of the users due to careful design of the system interfaces
  • The inclusion of data cleansing, federation, and visualization functions allows the system to meet a range of needs without forcing licensees to seek add-ins or third-party utilities.

You can learn more about Megaputer here. The full text of the interview with Dr. Ananyan appears on ArnoldIT.com here as part of the Search Wizards Speak series.

Stephen Arnold, June 23, 2008

IBM Explains Text Analytics

June 15, 2008

A colleague called my attention to this April 2008 description of IBM’s view of text analysis. The essay “From Text Analytics to Data Warehousing” is about more than processing content. The article by Matthias Nicola, Martin Sommerlandt, and Kathy Zeidenstein points toward what I call a “metaplay” or “umbrella tactic”. (You will want to read the posting here. When accessing IBM content, it is important to keep in mind that it can be difficult, if not impossible, to locate IBM information via the IBM search function. Pages available via a direct link like this “From Text Analytics to Data Warehousing” may require that you register, obtain an IBM user name and password, and then relaunch your search to locate the information. Other queries will return false drops with the desired article nowhere to be found. I’m not sure if this is OmniFind, Fast Search, Endeca, or some other vendor’s handiwork. But search and retrieval of IBM information on the IBM site can be frustrating to me. Click here now.)

The authors state:

This article review[s] the text analysis capabilities of IBM OmniFind Analytics Edition, including an analysis of the XML format of the text analysis results, MIML. It then examined different approaches that can help you extend the value of OmniFind Analytics Edition text analysis by storing analysis results from the MIML file into DB2 to enable standard business intelligence operations and reporting using the full power of SQL or SQL/XML.

As I worked through this article, reviewed the diagram, and explored the See Also references, one point jumped out at me. The markup generated by the IBM system can be verbose. The emphasis on the use of DB2, IBM’s database system, underscored for me that IBM text analytics requires software, DB2, and storage. In fact, without storage, the IBM text analysis system could grind to a halt. To increase performance, the licensee may require additional IBM servers, management software, and other bits and pieces.

You have to store the “star schema for MIML” somewhere. Here’s what the structure looks like. Of course, the image is IBM’s and copyrighted by the company.

ibm architecture

I want to point out that this is one of the simpler diagrams in the write up.
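To make the text-analytics-to-warehouse idea concrete, here is a minimal sketch of the general pattern the authors describe: parse annotation output (a made-up XML fragment standing in for MIML) and load it into relational tables so ordinary SQL reporting can run against it. The XML structure, table names, and use of SQLite are my own simplifications, not IBM’s actual MIML format or a DB2 schema.

```python
# Toy version of the pattern in the IBM article: take text-analysis annotations
# (a made-up XML stand-in for MIML) and load them into relational tables so
# standard SQL reporting can be run. Schema and XML are illustrative only.
import sqlite3
import xml.etree.ElementTree as ET

annotations = """
<doc id="memo-42">
  <entity type="person" text="J. Smith"/>
  <entity type="organization" text="Acme Corp"/>
</doc>
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (doc_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE annotations (doc_id TEXT, type TEXT, text TEXT)")

root = ET.fromstring(annotations)
doc_id = root.get("id")
conn.execute("INSERT INTO documents VALUES (?)", (doc_id,))
for entity in root.findall("entity"):
    conn.execute("INSERT INTO annotations VALUES (?, ?, ?)",
                 (doc_id, entity.get("type"), entity.get("text")))

# Ordinary SQL reporting over the extracted annotations.
for row in conn.execute("SELECT type, COUNT(*) FROM annotations GROUP BY type"):
    print(row)
```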

Observations

  1. This write up suggests to me that IBM is defining text analytics as a component in a much, much larger array of software, hardware, and systems. My hunch is that IBM wants to shut the barn door before more standalone text analytics tools are sold into IBM shops.
  2. IBM is making explicit that text analytics is an exercise in data management. Google, I think, has much the same notion based on my reading of its technical papers.
  3. IBM has done a good job of making clear that software alone won’t deliver text analytics. Without the ability to scale, text analysis can choke most systems. Now IBM has to get this message to the information technology professionals who assume that their existing servers and infrastructure can handle text analytics.
  4. IBM has done an excellent job of moving the concept of text analysis from an add-on to part of a larger constellation of operations. The notion of a metaplay or an umbrella tactic is important because individual vendors often ignore or understate the broader impact of their content processing subsystems.

I think this is an important write up. A happy quack to the reader who called the information to my attention.

Stephen Arnold, June 15, 2008

Microsoft BIOIT: Opportunities for Text Mining Vendors

June 14, 2008

I came across Microsoft BIOIT in a news release from Linguamatics, a UK-based text processing company. If you are not familiar with Linguamatics, you can learn more about the company here. The company’s catchphrase is “Intelligent answers from text.”

In April 2006, Microsoft announced its BIOIT alliance. The idea was to create “a cross-industry group working to further integrate science and technology as a first step toward making personalized medicine a reality.” The official announcement continued:

The alliance unites the pharmaceutical, biotechnology, hardware and software industries to explore new ways to share complex biomedical data and collaborate among multidisciplinary teams to ultimately speed the pace of drug discovery and development. Founding members of the alliance include Accelrys Software Inc., Affymetrix Inc., Amylin Pharmaceuticals Inc., Applied Biosystems and The Scripps Research Institute, among more than a dozen industry leaders.

The core of the program is Microsoft’s agenda for making SharePoint and its other server products the plumbing of health-related systems among its partners. The official release makes this point as well, “The BioIT Alliance will also provide independent software vendors (ISVs) with industry knowledge that helps them commercialize informatics solutions more quickly with less risk.”

Rudy Potenzone, a highly regarded expert in the pharmaceutical industry, joined Microsoft in 2007 to bolster Redmond’s BIOIT team. Dr. Potenzone, who has experience in the online information business with Chemical Abstracts, has added horsepower to the Microsoft team.

This week, on June 12, 2008, Linguamatics hopped on the BIOIT bandwagon. In its news announcement, Linguamatics co-founder Roger Hale said:

As the amount of textual information impacting drug discovery and development programs grows exponentially each year, the ability to extract and share decision-relevant knowledge is crucial to streamline the process and raise productivity… As a leader in knowledge discovery from text, we look forward to working with other alliance members to explore new ways in which the immense value of text mining can be exploited across complex, multidisciplinary organizations like pharmaceutical companies.

Observations

Health and medicine are an important part of the scientific, medical, and technical information sector. More importantly, health presages money. In the US, the baby boomer bulge is moving toward retirement, bringing a cornucopia of revenue opportunity for many companies.

Google has designs on this sector as well. You can read about its pilot project here. Microsoft introduced a similar project in 2006. You can read about it here.

Several observations are warranted:

  1. There is little doubt that bringing order, control, metadata and online access to certain STM information is a plus. Tossing in the patient health record allows smart software to crunch through data looking for interesting trends. Evidence based medicine also can benefit. There’s a social upside beyond the opportunity for revenue.
  2. The issue of privacy looms large as personal medical records move into these utility-like systems. The experts working on these systems to collect, disseminate, and mine data have good intentions. Nevertheless, this is uncharted territory, and when one explores, one must be prepared for the unexpected. The profile of these projects is low, seemingly controlled quite tightly. It is difficult to know if security and privacy issues have been adequately addressed. I’m not sure government authorities are on top of this issue.
  3. The commercial imperative fuels some potent corporate interests. These interests could run counter to social needs. The medical informatics sector, the STM players, and the health care stakeholders are moving forward, and it is not clear what the impacts will be when their text mining reveals hitherto unknown facets of information.

One thing is clear. Linguamatics, Hakia, and other content processing companies see an opportunity to leverage these broader industry interests to find new markets for their text mining technologies. I anticipate that other content processing companies will find the opportunities sufficiently promising to give BIOIT a whirl.

Stephen Arnold, June 14, 2008

Lexalytics: Stepping Up Its Marketing

June 7, 2008

Lexalytics is a finalist in the annual MIXT (Massachusetts Innovation & Technology Exchange) awards. Lexalytics has also revamped its Web site. The company now makes it easy to download a trial of its text analytics software. The trial is limited to 50 documents, but you can generate a list of entities and summaries of the processed documents. The most interesting function of the trial is its ability to display a sentiment score for a document. In effect, you can tell if opinion is running for or against a product (a toy illustration of such a score appears after the list below).

The company’s system performs three functions on collections of content. The content can be standard office files such as Word or PowerPoint documents. The system can ingest Web log content and RSS streams as well. Once installed, the system outputs:

  • The sentiment and tone from a text source
  • The names of the people, companies, places or other entities in processed content 
  • Any hot themes in a text source.
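As promised above, here is a toy illustration of what a document-level sentiment score means. It is a crude lexicon count, not Lexalytics’ technology, and the word lists are my own.

```python
# Toy document-level sentiment score: crude lexicon counting, purely illustrative.
# Real systems weigh phrases, negation, and tone; this is not Lexalytics' method.
POSITIVE = {"great", "love", "excellent", "useful"}
NEGATIVE = {"poor", "hate", "broken", "useless"}

def sentiment_score(text):
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

doc = "The new release is great and genuinely useful, but support is poor."
print(sentiment_score(doc))  # 1: opinion runs mildly in favor of the product
```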

Lexalytics has provided technology to other search and content processing companies, Northern Light and Fast Search & Transfer, to name two. A happy quack to the Lexalytics team for the MIXT recognition. You can learn more about the company here.

Stephen Arnold, June 7, 2008

Content Analyst and dtSearch Combo Product Announced

April 4, 2008

Content Analyst, a text processing company with DNA from the US intelligence community, released its Conceptual Search and Text Analytics software Version 3.2. This release incorporates dtSearch’s search-and-retrieval system. dtSearch, which has offices in Bethesda, Maryland, has offered a solid search-and-retrieval system for single users, developers, and organizations since 1991.

The combo product delivers keyword and conceptual search. The release also offers licensees clustering and cross-language support. Content Analyst is based in Reston, Virginia, and its technology can be used to generate taxonomies and produce summaries of documents.
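To give a sense of what “conceptual” search adds to keyword matching, here is a minimal sketch of one common approach: project documents into a reduced concept space (here via TF-IDF and truncated SVD, the textbook latent semantic analysis recipe) and rank by similarity in that space. The corpus and query are made up, and I am not claiming this is Content Analyst’s actual algorithm, only the general family of techniques.

```python
# Minimal "conceptual search" sketch using latent semantic analysis.
# Corpus and query are made up; this illustrates the general family of
# techniques, not Content Analyst's actual implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "invoice overdue payment reminder",
    "billing dispute about an unpaid invoice",
    "server maintenance and firmware upgrade",
    "network switch maintenance checklist",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=0)  # a tiny "concept" space
doc_concepts = svd.fit_transform(tfidf)

query_concepts = svd.transform(vectorizer.transform(["unpaid bill"]))
scores = cosine_similarity(query_concepts, doc_concepts)[0]

# Billing-related documents should rank above the hardware ones, even where
# no query word appears verbatim in the document.
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:+.2f}  {doc}")
```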

Content Analyst–like Groxis, Recommind, and Vivisimo–is making a move from a niche market into the broader market of behind-the-firewall search applications.
