Clustify: Identifying Similar Documents

March 9, 2008

After my eDiscovery lecture in Boston on March 6, 2008, several people engaged me. One of these interlocutors wanted my view on a company called Hot Neuron LLC. I said, “I’ll check my files.”

When I staggered into my log cabin in Harrod’s Creek, Kentucky, a parcel awaited me. I chopped it open with the flint knife popular in these parts. An envelope labeled “Cluster-Text.com” held a Clustify mouse pad. Coincidence or very expensive direct mail campaign?

A quick dip into my electronic files and a gander at Hot Neuron’s Web site reminded me that Clustify is in the eDiscovery category of Beyond Search tools. An attorney with too little time and too many documents can use Clustify to get her arms around the documents, tag them, and tackle the most significant documents in key clusters first.

Clustify can deliver an overview of the document set with documents neatly organized by categories. Clustify groups documents by analyzing the text to identify the structure that arises naturally.

Bill Dimm, founder of Hot Neuron, told me in a thoughtful email:

Clustify organizes the documents into clusters, and it labels each cluster with keywords so that you can see what it is about. It can sort the clusters by the number of documents they contain, so you can quickly see what the most significant topics are for the document set, but it doesn’t, by itself, put the documents into a hierarchy of categories. It does, however, provide a tagging tool that allows the user to define his/her own hierarchy of categories and very efficiently put the documents into the categories.

An attorney or paralegal can mark documents and “hook” them to other documents or topics. The ability to examine documents together can shave hours off a task that is tedious and susceptible to interruption. The software can speed the process of manual categorization by allowing you to make decisions one cluster at a time, instead of one document at a time. Clustify can enhance your search engine by grouping search results or finding relevant documents that don’t exactly match a query.

This essay provides a preliminary impression of Hot Neuron’s offerings.

The Names

Let me provide some guidance about the three names, Clustify, Cluster-Text.com, and Hot Neuron LLC. Clustify is the name of the clustering technology and software. The URL is www.cluster-text.com, where you will find information about the product, contact information, etc. Hot Neuron LLC is the name of the company selling the software and operating the Web site. MagPortal.com is a component of Hot Neuron where the company’s for-fee news feed service and financial information components are available. MagPortal appears to use portions of the Clustify technology, and a screen shot of a result set appears elsewhere in this essay.

The Company

The privately-held Hot Neuron is the brainchild of William Dimm, a theoretical physicist from Cornell University. Dr. Dimm left the university and physics in 1995. (For those of you who don’t associate Cornell with search and content processing, please recall that Dr. Gerard Salton, né Gerhard Anton Sahlmann, put Information Retrieval into the consciousness of computer scientists worldwide, first at Harvard University, then at Cornell University. More information about Dr. Salton and his SMART innovations is here.)

Hot Neuron offers a number of useful products. These include specialized financial data components that you can plug into your Web site. With these widgets, you can assemble a view of the US and Canadian financial markets. The company also offers for-fee news feed services and an affiliate program.

Information retrieval and physics are tightly bound. Spend a few days wandering around Google, for instance, and you will find that physicists are as much a part of the Google ecosystem as mathematicians and computer scientists.

Mr. Dimm has dabbled in the commercial world for many years. For example, he has worked in the financial services industry fiddling with models for interest rate derivatives. Hot Neuron is the company set up to commercialize Clustify.

Hot Neuron LLC is an information retrieval software and services company located in Bryn Mawr, Pennsylvania.

Clustify

Clustify is document clustering software. This means that the software processes documents and groups them according to their similarity. A definition I have in my files from Carnegie Mellon University says: “Document clustering is the act of collecting similar documents into bins, where similarity is some function on a document.”
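To make the “bins” and “similarity function” in that definition concrete, here is a minimal sketch in Python. It is not Clustify’s method, which the company describes as proprietary. The bag-of-words vectors, the cosine similarity measure, the greedy single-pass grouping, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Turn text into a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.5):
    """Greedy single-pass clustering (an illustrative assumption):
    each document joins the first bin whose seed document is at least
    `threshold` similar; otherwise it starts a new bin."""
    bins = []  # list of (seed_vector, [document indices])
    for i, text in enumerate(docs):
        vec = vectorize(text)
        for seed, members in bins:
            if cosine(seed, vec) >= threshold:
                members.append(i)
                break
        else:
            bins.append((vec, [i]))
    # Largest bins first, so the biggest topics surface at the top.
    return sorted((members for _, members in bins), key=len, reverse=True)

docs = [
    "merger agreement between the two companies",
    "the merger agreement draft between companies",
    "lunch menu for friday",
    "friday lunch menu update",
    "merger agreement between both companies",
]
clusters = cluster(docs)
print(clusters)
```

A single-pass approach like this is sensitive to document order and to the threshold. It only shows how a similarity function sorts documents into bins.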

Clustify, according to the company, “identifies important keywords for each cluster, giving quick insight into the document set. It also allows the user to create a hierarchy of custom tags that can be applied to individual documents, all documents in a particular cluster, or all clusters containing a particular combination of keywords, allowing the user to categorize hundreds of documents with a single mouse click.”

The system allows you to create a hierarchy of custom tags. These can then be applied to individual documents, the documents in a specific cluster, or those clusters containing certain key words. You can, according to the company, categorize many documents by pointing and clicking.

Clustering is useful in legal discovery, competitive intelligence, and general research.

Technology

Hot Neuron’s technology uses a proprietary algorithm that scales and delivers good cluster quality. According to the company, Clustify can cluster 1.3 million Wikipedia entries on a desktop computer running Linux in 20 minutes or 50 minutes under Microsoft Windows. The system outputs files with these extensions: CYI, CYO, and CYS.

Clustify can generate concept-based clusters, or it can require documents in the same cluster to contain identical passages of text. The latter option is useful for identifying near-duplicates, which can cut the cost of electronic discovery further than simple deduplication.
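One common way to find documents that share identical passages is to compare overlapping runs of words, often called shingles. The sketch below assumes that approach; Hot Neuron does not say how Clustify detects near-duplicates. The five-word shingle size, the Jaccard overlap measure, and the 0.5 threshold are illustrative choices, not the product’s settings.

```python
import re

def shingles(text, k=5):
    """Set of every run of k consecutive words (a 'shingle') in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Share of shingles the two sets have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a or b else 0.0

def near_duplicates(docs, k=5, threshold=0.5):
    """Return index pairs whose shingle overlap meets the threshold."""
    sets = [shingles(d, k) for d in docs]
    return [(i, j)
            for i in range(len(sets))
            for j in range(i + 1, len(sets))
            if jaccard(sets[i], sets[j]) >= threshold]

docs = [
    "Please review the attached contract and return a signed copy by Friday.",
    "Please review the attached contract and return a signed copy by Monday.",
    "The quarterly sales figures are attached for your review.",
]
pairs = near_duplicates(docs)
print(pairs)
```

Simple deduplication would treat the first two messages as different because one word changed. Comparing shared passages flags them as a pair that a reviewer can handle together.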

[Screen shot: MagPortal.com search portal]

This screen shot from Hot Neuron’s MagPortal.com Web site provides a glimpse of the Clustify functionality. (Note: I’m using MagPortal.com to illustrate the company’s technology. MagPortal.com charges a fee for some of its information services.) These Clustify features jumped out at me when I tested the system on March 9, 2008. Please, keep in mind that some of these features are particular to MagPortal.com and not included in the Clustify license:

  1. Search results with one-click access to similar articles. The icon for this function is the orange flag in each item in the results list.
  2. A “Browse Main Topics” side bar. A user can scan these items instead of firing random key words into the search box, an activity meeting more and more user resistance.
  3. Drop down boxes to specify what to search or to narrow the results by date, publication, or category. (Google captures date and time information, but as far as I know, those metatags are not yet available to the run-of-the-mill user like me.)
  4. A list of “Related Categories” front and center in the shaded box above the search results.

I also liked the display of the results. Breaking out the source, author, and date is a common sense feature that other systems would do well to emulate. I find scanning results for these key meta items not only annoying but fatiguing. I have a hard time, when tired, differentiating light blue, light green, and other Web 2.0 colors. Kudos to MagPortal.com for making its results easy for me to scan.

About the MagPortal.com Search Engine

The MagPortal.com search engine performs a case-insensitive match of your query against the article title, description, authors, and body. The engine slices your query into “words”. Articles that match the words in your query, minus stop words, are displayed in the results list. By default, the output is ordered by the “quality” of the match. The quality is determined by a mathematical formula (a standard term frequency-inverse document frequency algorithm, not related to Hot Neuron Similarity) that takes into account how often the search term appears in the document (relative to the total length of the document) and how rare that particular search term is among all documents.
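The ranking described above can be sketched in a few lines of Python. This is a generic TF-IDF illustration, not MagPortal.com’s code. The stop word list, the smoothed form of the inverse document frequency, and the choice to return any document matching at least one query word are my assumptions.

```python
import math
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to"}  # illustrative list only

def tokenize(text):
    """Lowercase the text (case-insensitive matching) and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank(query, docs):
    """Return indices of matching documents, best TF-IDF score first."""
    doc_tokens = [tokenize(d) for d in docs]
    terms = [t for t in tokenize(query) if t not in STOP_WORDS]

    # Inverse document frequency: rarer terms get more weight.
    # The "1 +" smoothing (an assumption) keeps the weight positive.
    idf = {}
    for term in terms:
        df = sum(1 for tokens in doc_tokens if term in tokens)
        if df:
            idf[term] = math.log(1 + len(docs) / df)

    scores = []
    for i, tokens in enumerate(doc_tokens):
        if not tokens:
            continue
        # Term frequency is relative to the document's total length.
        score = sum(tokens.count(t) / len(tokens) * w for t, w in idf.items())
        if score > 0:
            scores.append((score, i))
    return [i for score, i in sorted(scores, key=lambda s: (-s[0], s[1]))]

docs = [
    "Physics of the early universe",
    "Search engines and the physics of information retrieval",
    "Gardening tips for spring",
]
results = rank("The Physics of Search", docs)
print(results)
```

The second document outranks the first because it matches both query words, and “search” is the rarer of the two. The third document matches nothing and is left out.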

Mr. Dimm told Beyond Search:

MagPortal.com has been around for eight years. We primarily index online magazine articles (not newspapers) on MagPortal, with the index being updated once each business day (magazines don’t change very fast). Anyone can browse and search the articles for free. The fee-based feed service allows clients to display a subset of the MagPortal data and functionality on their own Web sites. The client licenses a set of topics/categories relevant for their site, and their site users can browse/search the article data on their site much as they would on MagPortal. We also have a free version of the feed service, with limited functionality, available for non-commercial use here. For the financial data components (stock charts, etc.), we are actually just reselling feeds from another company because we’ve encountered clients that want both articles and charts.

More Information

For pricing, availability, and licensing details, contact Hot Neuron at hotneuron.com.

Stephen Arnold, March 9, 2008, with an update at 5:35 pm Eastern

Comments

4 Responses to “Clustify: Identifying Similar Documents”

  1. Rob Young on March 9th, 2008 5:14 pm

    I wonder how easy it would be to write a contender for this using the Carrot2 clustering engine. The search could be managed identically with Lucene. So far I have only heard of Carrot2 being used for clustering result sets (i.e., comparatively tiny sets), but it does accept a stream as input so it may be possible to shape up a good contender.
    Thoughts?

  2. Stephen E. Arnold on March 10th, 2008 10:57 am

    Thanks for posting, Rob. I have looked at Carrot2, and it looks to me that Carrot2 can handle this job. You would have to write the middleware and hook into Carrot2. The stream function caught my eye with Clustify. Also, Vivisimo has a good clustering engine, and it works on the fly. I don’t know, however, if Vivisimo makes that available as a stand alone.

    Stephen Arnold, March 10, 2008 Noon Eastern

  3. Dmitri on March 10th, 2008 11:31 am

    Very interesting, too bad there is no online demo for this. Mining the concepts from text and tagging documents with them sounds like a very reasonable approach, as it gives the end user a quick grasp on the large body of documents (we are doing a similar thing). Regarding their legal implementation, I wonder if it is all based on derived tags, though; I think that a system like this also needs to have a dictionary with specific terms and semantic constructs pertinent to the legal field. Glad I came across your blog, Stephen, will keep following it.

  4. Stephen E. Arnold on March 10th, 2008 12:35 pm

    Dmitri, you can see some of the features at http://www.magportal.com.

    Stephen Arnold, March 10, 2008, 1:35 pm Eastern
