Technology from Harrod's Creek
Stephen E. Arnold
May 5, 2003

Vivisimo: Clustering Delivers Information Overlook

Several years ago, the brilliant Ramana Rao unveiled clustering tools that are now a standard part of the Inxight toolset. These tools helped trigger the ontology and taxonomy estampida.

Northern Light caught some of the wheezing giants of online when it sprinted to offer clustering as a standard feature of Northern Light's no-charge Web search.

In one leap after another, clustering continues to improve. Whether hits are grouped in folders or set off by visual boundaries, as in Kartoo's ungraded interface, the goal is to make analysis of large result sets easier and more intuitive.

Today, with taxonomies and ontologies fiercely competing for buzzword of the year, clustering provides much of the muscle for pigeonholing responses to a query in the proper taxonomic slot. Instead of scanning a laundry list of hits, the searcher sees results grouped under headings. The best clustering engines provide "headings" or "categories" that use terms not appearing in the original query.

Now comes a company based in Pittsburgh, Pennsylvania, where a staff of 10 speaks two languages per employee. A caller speaking Chinese, French, German, Italian, Portuguese, or any one of a half dozen other languages need never experience the sonic dislocation of howmayihepya when telephoning Vivisimo. Vivisimo, a small company spun off from Carnegie-Mellon's formidable computer science and information research laboratories, sits on a bluff above the Monongahela River. The company provides next-generation clustering software to its customers in Asia, Germany, and the United Kingdom as well as North America.

With the navel-gazing search world prostrating itself before the altar of Google, information professionals may overlook Vivisimo, a product that provides what the company calls "information overlook, not information overload."

Vivisimo, whose name translates as lively from Spanish and Portuguese, got its start in 1998 when computer science researchers attacked the still-thorny problem of how computers can be made to perform knowledge discovery on a par with humans.

Vivisimo Website

"Our Carnegie-Mellon team," said Raul Valdes-Perez, president and chairman of Vivisimo, "borrowed from the founders' artificial intelligence research in a spirited manner. We have gone quite a bit farther with real-time clustering. Our technology discovers relationships among documents and displays a list of 'hits' with similar documents presented together. We can perform clustering without any changes to our clients' existing search-and-retrieval system."

"While our technology is interesting from a technical point of view, we are quite surprised by the benefit our approach to clustering gives our users. Instead of reviewing a list of 10 or 20 hits in a minute or two, Vivisimo allows the researcher to review as many as 200 or more hits in roughly the same time. One of our scientific-and-technical customers told us, 'Vivisimo gives me time plus usable information. The time-savings is the real payoff for my researchers.'"

Dr. Raul Valdes-Perez is one of the founders of Vivisimo. He is an engaging scientist who came to Pittsburgh via the Massachusetts Institute of Technology, Brazil, and Chicago, with a stopover in China after a Cuban birth. He said, "Vivisimo is a document clustering technology that never needs to pre-process the collection from which the documents come. Everything happens on the fly, spontaneously. In Northern Light, for example, librarians devised and maintained a controlled vocabulary, and software or people assigned labels to every new document."

"With Vivisimo's document clustering, there is no need for a controlled vocabulary or for pre-labeling (indexing) of the documents. This manual work is the expensive, slow downside of many taxonomy systems. Manual processing of records, building controlled vocabularies, and double-checking the word lists and word linkages cost money, lots of it," added Dr. Valdes-Perez.

Added Dr. Valdes-Perez, "One advantage of no pre-processing is that the user can cluster content not directly controlled by the user. Vivisimo is able to take the search results from publicly accessible sites such as HP.com, Disney, Stanford University, and government sites and create meaningful clusters."
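For readers who want to see the idea in miniature, here is a rough sketch of clustering a result list on the fly, using nothing but the text of the hits. It is not Vivisimo's algorithm, which the company does not describe here. The function names, the tiny stopword list, and the label-ranking rule are all assumptions made for illustration. The sketch does show two points from the discussion above: no controlled vocabulary is consulted, and folder labels come from terms that were not in the query.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption); a real system needs a fuller one.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def cluster_hits(query, hits, max_folders=3, min_size=2):
    """Group hit snippets into labeled folders using only the hits themselves.

    Returns (folders, other): folders maps a label to the indexes of the hits
    that contain it; other lists the hits that landed in no folder.
    """
    query_terms = set(tokenize(query))
    members = defaultdict(list)  # candidate label -> indexes of hits using it
    for i, hit in enumerate(hits):
        # Labels must be terms that are NOT in the query or the stopword list.
        for term in set(tokenize(hit)) - STOPWORDS - query_terms:
            members[term].append(i)

    # Rank candidate labels by how many hits share them; ties go alphabetical.
    ranked = sorted(members.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    folders = {label: idxs for label, idxs in ranked[:max_folders]
               if len(idxs) >= min_size}

    foldered = {i for idxs in folders.values() for i in idxs}
    other = [i for i in range(len(hits)) if i not in foldered]
    return folders, other
```

Because the function needs only a query string and a list of snippets, it can sit on top of any search engine's output, which is the practical meaning of "no pre-processing." A hit may appear in more than one folder, and anything that fits nowhere is returned separately.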

Those information professionals wearing Google goggles are often unaware of the significant differences among search engines. Professionals need to look at the world with 20 / 20 vision.

Vivisimo is one of a bevy of new and interesting technologies that perform specialized value-adding functions. Clear Forest (New York, New York), Stratify Inc. (Mountain View, California), and Google's recent acquisition Applied Semantics, Inc. (Santa Monica, California), among others, add value to a list of search results.

Although there are considerable technical differences among these companies' products, the user looking at search results gains such benefits as seeing documents related to one another in one group. Another benefit for many advanced searchers is the names of the clusters or groups themselves. A query about forensic psychiatry presents results in folders labeled "Forensic Psychology", "Expert Witness", and "Forensic Psychiatry Resources." Such suggested categories provide useful signposts and clues to the researcher.

The Vivisimo approach is to provide the value-added functions of clustering and metasearch. "These additional signposts overlay the search or database query engines that are already deployed. There is no need to rip the results out of the original list of hits which appear in a Vivisimo frame. There is no need to disturb the Vivisimo client's existing information infrastructure," added Dr. Valdes-Perez. "If a customer doesn't have a search engine, or wants to replace it, we have search-vendor partners in place so that a whole new solution can be delivered."

The Internet search is the fodder for most popular articles about finding needles in a haystack, a worn metaphor borrowed from Dr. Matthew Koll, the founder of Personal Library Software, without so much as a fare-thee-well.

The real challenge in search is finding information in files residing on computers within organizations and then placing that information in a meaningful context with the other information sources an organization has. Some companies license data from the digital news aggregators or the global information combines that digitize and deliver content that originally appeared in journals and trade publications. Other organizations need to access the content from public Web sites and provide their users with access to that information in a meaningful context.

Vivisimo makes it possible to deliver organized information to employees without the punishing costs and the complexity of taxonomy building.

Vivisimo's engineers have created software that can work seamlessly with Web sites used for CRM/call center applications where problem resolutions can come from a mix of internal or external sources.

Performance tests of clustering products indicate that best-of-breed providers like Vivisimo and a handful of other companies can cluster 100 results in 100 milliseconds on off-the-shelf computers. Optimization technology developed by Vivisimo allows a single-CPU server to handle the needs of all but the largest corporations.
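A back-of-the-envelope calculation shows what the 100-millisecond figure implies. The sketch below assumes, for illustration only, that each CPU clusters one query at a time and that the time spent fetching results from the underlying search engine is not counted. Neither assumption comes from the published tests.

```python
def queries_per_second(ms_per_query, cpus=1):
    """Upper bound on clustering throughput if each CPU handles one query at a time."""
    return cpus * 1000.0 / ms_per_query


rate = queries_per_second(100)   # 100 ms per query on a single CPU
per_hour = rate * 3600
print(f"{rate:.0f} queries per second, {per_hour:.0f} per hour")
```

Under those assumptions a single CPU tops out at about 10 clustered queries per second, or 36,000 an hour.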

Many organizations have a taxonomy or an organizational thesaurus. The list of terms is too costly to update manually and too valuable a knowledge resource to be tossed out into the dust bin.

Vivisimo, like Albert SA, can make use of pre-existing information such as thesauri, ontologies, and metadata. The Vivisimo architecture can perform clustering on the search-result titles, summaries, and index terms. Vivisimo offers a demonstration of the use of similar knowledge assets under its PubMed demonstration. PubMed is the National Library of Medicine's search service that provides access to over 11 million citations in MEDLINE, PreMEDLINE, and other related files. Navigate to http://vivisimo.com/demos/PubMed@NIH.html

Vivisimo ingests index terms in the form of Medical Subject Headings (the much-loved MeSH), which are manually assigned at great cost to biomedical articles from the literature.
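When index terms already exist, they can serve as ready-made folder labels. The sketch below is one simple way to picture that, not a description of what Vivisimo does with MeSH. The record layout (the `title` and `index_terms` keys) and the function name are hypothetical.

```python
from collections import defaultdict


def folders_from_index_terms(records, min_size=2):
    """Group record titles under each pre-assigned index term.

    records: iterable of dicts with hypothetical keys "title" and "index_terms".
    Only terms shared by at least min_size records become folders.
    """
    folders = defaultdict(list)
    for record in records:
        for term in record["index_terms"]:
            folders[term].append(record["title"])
    return {term: titles for term, titles in folders.items()
            if len(titles) >= min_size}
```

The labels are only as good as the indexing behind them, which is why manually assigned headings are valuable and costly.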

Next-generation clustering tools can manipulate company or industry-specific thesauri, acronym lists, abbreviations, and taxonomic relations.

Technologies from Vivisimo and a handful of other innovators will make it possible for organizations of all sizes to inexpensively deliver information in a more intuitive, organized manner.

Dr. Valdes-Perez said, "Because of the cost of manually developing thesauri and classification schema, only large government agencies and corporations may still be investing in these essentials. For nimble organizations, breakthrough technologies such as Vivisimo's will allow some costs and delays to be leap-frogged. That is both a cost savings and a way to maximize the return on some knowledge-dependent business processes."

Vivisimo's focus is on developing high-performance technology. As broadband and greater computational power become available, graphical displays of clustered results will shift to the mainstream. Vivisimo provides what it calls "radical simplicity." The firm uses the near-universal folder metaphor. Vivisimo information can, said Dr. Valdes-Perez, "be used to drive a wide range of looks. But our research suggests that for the majority of users, the use of the folders paradigm reduces confusion and makes the clustering immediately understandable."

Overture, Yahoo, and Google are beginning to adopt certain outward similarities in blending objective search results with pay-for-placement listings. Clustering is not at this time a key component of any of these firms' services. As noted, Google has purchased a firm with robust ontology and link-relationship tools. At this time, it is not clear whether Google will use the Applied Semantics technology to provide users with clustered results or whether the Applied Semantics technology will be harnessed to Google's advertisers, who are likely to pay extra to get better exposure than word matching.

Vivisimo's Dr. Valdes-Perez noted that Vivisimo's principal markets are corporate and government information systems, publishing, and Web search applications. He added, "We have several customers in Web search and we are in talks with most of the major players, who are planning their next moves. I have little doubt that on-the-fly, computationally-efficient

Tools such as Vivisimo's have the potential to add value to the mainstream consumer portals, by increasing click-through revenue opportunities and offering a better user experience.

The Web is multilingual. Non-U.S. users now account for the majority of those connecting to the Internet. Search tools, therefore, cannot be anchored solely to the English language. Despite the naive ethnocentrism of Mr. and Ms. Mainstream American, English is not the universal language of search.

Vivisimo, like EZ2Find.com, Kartoo, and Pertimm (all discussed in previous columns), is not anchored in a single language. Vivisimo's clustering products employ statistical, linguistic, and subject matter knowledge. Dr. Valdes-Perez said, "Just using our statistical approach, we get good results on any human language. We have found that some modest attention to stemming and stopwords improves the results greatly, and we have done this for the ten largest European languages. We have also added Korean, Arabic, and several others. Our system is designed to allow us to plug in more languages."
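The "modest attention to stemming and stopwords" can be pictured with a toy example. The sketch below is a deliberately crude suffix stripper, not a real stemmer and not Vivisimo's linguistic code. The word lists, suffix lists, and function name are assumptions chosen for illustration. Adding a language means adding one entry to each table, which is one way a plug-in design might look.

```python
import re

# Illustrative per-language tables (assumptions); real lists are far longer.
STOPWORDS = {
    "en": {"a", "and", "in", "of", "the", "to"},
    "es": {"de", "el", "en", "la", "las", "los", "y"},
}
SUFFIXES = {
    "en": ("ing", "es", "s"),
    "es": ("es", "s"),
}


def normalize(text, lang):
    """Drop stopwords and strip one common suffix from each remaining word."""
    terms = []
    for word in re.findall(r"\w+", text.lower()):
        if word in STOPWORDS[lang]:
            continue
        for suffix in SUFFIXES[lang]:
            # Keep at least three characters so short words are not mangled.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[:-len(suffix)]
                break
        terms.append(word)
    return terms
```

With even this much normalization, "clustering" and "clusters" collapse to the same term, so hits that differ only in word form can land in the same folder.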

Vivisimo has attracted a number of blue-chip accounts. These include Stanford's HighWire Press and the Institute of Physics Publishing.

Dr. Valdes-Perez said, "The Institute of Physics, as I understand their feedback, have heard that their users who are scientists get to see much more of the scientific literature. A scientist can survey the top 200 to 500 hits in the time it took to browse 10 hits without clustering. It appears that clustering in the scientific research community provides a better view of information or what academics call recall. The folder metaphor itself appears to provide a precision or refinement tool of considerable benefit in certain research applications."
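Recall and precision have simple definitions, and a short calculation makes the trade-off concrete. The sketch below uses the standard formulas. The function name and the document numbers in the usage note are made up for illustration and are not drawn from the Institute of Physics.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of reviewed documents.

    precision = relevant documents reviewed / all documents reviewed
    recall    = relevant documents reviewed / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    found = retrieved & relevant
    precision = len(found) / len(retrieved) if retrieved else 0.0
    recall = len(found) / len(relevant) if relevant else 0.0
    return precision, recall
```

Suppose four relevant papers sit at ranks 3, 7, 42, and 150. A scientist who reads only the top 10 hits finds two of them, for a recall of 0.5. Surveying the top 200 raises recall to 1.0 but drops precision to 0.02. Opening a single folder that happens to hold hits 42, 150, and 151 gives a precision of about 0.67, which is the refinement role the folder metaphor plays.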

Vivisimo is written in C. The software runs on Windows, Linux, and Solaris, among others. The Vivisimo software is delivered as a library with an XML-like API. The API can be called from C/C++, Java, COM, PHP, Perl or Python.

Vivisimo is planning a number of enhancements to its core technology. Among those that Dr. Valdes-Perez was willing to discuss is enhanced linguistic knowledge. The new algorithm will detect more subtle similarities. System administrators will get a new Web-based interface for the company's meta-search product and support for user-specified knowledge bases.

Vivisimo also plans to provide user-friendly tools to allow Vivisimo licensees to create adapters to search structured databases using XSL (Extensible Stylesheet Language). XSL lets developers transform XML documents for use across different applications.

Vivisimo, through its Web site, is one of very few firms that confidently exhibit their clustering technology on dozens of searchable information resources that they cannot pre-process.

There will always be a few explorers who prefer bookstores with books piled on the floor rather than neatly categorized on shelves. The same is true of Web and corporate information systems. For everyone else, there is Vivisimo. Viva!

With clustering gaining momentum, Vivisimo plans to step lively in 2003.

Contact:

Raul Valdes-Perez, Ph.D
President

Vivisimo, Inc.
2435 Beechwood Boulevard
Pittsburgh, Pennsylvania 15217
Tel: 01-412-422-2496
Fax: 01-412-422-2495

Cut line for vivi.tif:

A query for forensic psychology presents a results list plus folders that group the hits into such categories as Forensic Psychology, Expert Witness, and Forensic Psychology Resources. Clustering allows a researcher to review more results and obtain valuable clues for ways to refine the query. On-the-fly clustering from Vivisimo requires no change to the search-and-retrieval infrastructure at a client location.

 

