Clustify: Identifying Similar Documents

March 9, 2008

After my lecture about eDiscovery in Boston on March 6, 2008, several people engaged me. One of these interlocutors wanted my view on a company called Hot Neuron LLC. I said, “I’ll check my files.”

When I staggered into my log cabin in Harrod’s Creek, Kentucky, a parcel awaited me. I chopped it open with the flint knife popular in these parts. An envelope labeled “Cluster-Text.com” held a Clustify mouse pad. Coincidence or very expensive direct mail campaign?

A quick dip into my electronic files and a gander at Hot Neuron’s Web site reminded me that Clustify is in the eDiscovery category of Beyond Search tools. An attorney with too little time and too many documents can use Clustify to get her arms around the documents, tag them, and tackle the most significant documents in key clusters first.

Clustify can deliver an overview of the document set with documents nearly organized by categories. Clustify groups documents by analyzing the text to identify the structure that arises naturally.

Bill Dimm, founder of Hot Neuron, told me in a thoughtful email:

Clustify organizes the documents into clusters, and it labels each cluster with keywords so that you can see what it is about. It can sort the clusters by the number of documents they contain, so you can quickly see what the most significant topics are for the document set, but it doesn’t, by itself, put the documents into a hierarchy of categories. It does, however, provide a tagging tool that allows the user to define his/her own hierarchy of categories and very efficiently put the documents into the categories.

An attorney or paralegal can mark documents and “hook” them to other documents or topics. The ability to examine documents together can shave hours off a task that is tedious and susceptible to interruption. The software can speed the process of manual categorization by allowing you to make decisions one cluster at a time, instead of one document at a time. Clustify can enhance your search engine by grouping search results or finding relevant documents that don’t exactly match a query.

This essay provides a preliminary impression of Hot Neuron’s offerings.

The Names

Let me provide some guidance about the three names, Clustify, Cluster-Text.com, and Hot Neuron LLC. Clustify is the name of the clustering technology and software. The URL is www.cluster-text.com, where you will find information about the product, contact information, etc. Hot Neuron LLC is the name of the company selling the software and operating the Web site. MagPortal.com is a component of Hot Neuron where the company’s for-fee news feed service and financial information components are available. MagPortal appears to use portions of the Clustify technology, and a screen shot of a result set appears elsewhere in this essay.

The Company

The privately-held Hot Neuron is the brain child of William Dimm, a theoretical physicist from Cornell University. Dr. Dimm left the university and physics in 1995. (For those of you who don’t associate Cornell with search and content processing, please recall that Dr. Gerard Salton, né Gerhard Anton Sahlmann, put information retrieval into the consciousness of computer scientists worldwide, first at Harvard University, then at Cornell University. More information about Dr. Salton and his SMART innovations is here.)

Hot Neuron offers a number of useful products. These include specialized financial data components that you can plug into your Web site. With these widgets, you can assemble a view of the US and Canadian financial markets. The company also offers for-fee news feed services and an affiliate program.

Information retrieval and physics are tightly bound. Spend a few days wandering around Google, for instance, and you will find that physicists are as much a part of the Google ecosystem as mathematicians and computer scientists.

Mr. Dimm has dabbled in the commercial world for many years. For example, he has worked in the financial services industry fiddling with models for interest rate derivatives. Hot Neuron is the company set up to commercialize Clustify.

Hot Neuron LLC is an information retrieval software and services company located in Bryn Mawr, Pennsylvania.

Clustify

Clustify is document clustering software. This means that the software processes documents and groups them according to their similarity. A definition I have in my files from Carnegie Mellon University says: “Document clustering is the act of collecting similar documents into bins, where similarity is some function on a document.”
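To make the “similar documents into bins” idea concrete, here is a minimal Python sketch. It is not Hot Neuron’s proprietary algorithm, which the company has not published. It assumes a plain bag-of-words representation, cosine similarity as the similarity function, and a greedy single-pass binning rule. The function names and the 0.5 threshold are my own illustrative choices.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a document into a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(docs, threshold=0.5):
    """Greedy single-pass clustering (an assumed, simplistic rule).

    Each document joins the first bin whose seed document is similar
    enough; otherwise it starts a new bin of its own.
    """
    bins = []  # list of (seed_vector, [document indices])
    for i, text in enumerate(docs):
        vec = vectorize(text)
        for seed, members in bins:
            if cosine(seed, vec) >= threshold:
                members.append(i)
                break
        else:
            bins.append((vec, [i]))
    return [members for _, members in bins]


docs = [
    "interest rate derivatives pricing model",
    "pricing model for interest rate derivatives",
    "document clustering for legal discovery",
    "legal discovery and document clustering software",
]
clusters = cluster(docs)
print(clusters)
```

With these four toy documents, the two finance items land in one bin and the two discovery items in another. A production system would need a far more scalable approach than comparing each document against every seed.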

Clustify, according to the company, “identifies important keywords for each cluster, giving quick insight into the document set. It also allows the user to create a hierarchy of custom tags that can be applied to individual documents, all documents in a particular cluster, or all clusters containing a particular combination of keywords, allowing the user to categorize hundreds of documents with a single mouse click.”

The system allows you to create a hierarchy of custom tags. These can then be applied to individual documents, the documents in a specific cluster, or those clusters containing certain key words. You can, according to the company, categorize many documents by pointing and clicking.
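The “one click, many documents” idea can be sketched in a few lines of Python. This is only an illustration of the concept, not Clustify’s interface or file format. The `Cluster` class, the `tag_clusters` function, and the slash-separated tag path standing in for a hierarchy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical cluster record: its keyword labels and member documents."""
    keywords: set
    doc_ids: list


def tag_clusters(clusters, required_keywords, tag, tags_by_doc):
    """Apply `tag` to every document in every cluster whose keywords
    include all of `required_keywords`. Returns the number of documents tagged."""
    required = set(required_keywords)
    tagged = 0
    for cluster in clusters:
        if required <= cluster.keywords:
            for doc_id in cluster.doc_ids:
                tags_by_doc.setdefault(doc_id, set()).add(tag)
                tagged += 1
    return tagged


clusters = [
    Cluster({"contract", "termination", "notice"}, [1, 2, 3]),
    Cluster({"contract", "pricing"}, [4, 5]),
    Cluster({"lunch", "friday"}, [6]),
]
tags = {}
# One call tags every document in the clusters matching both keywords.
count = tag_clusters(clusters, {"contract", "termination"},
                     "Contracts/Termination", tags)
print(count, tags)
```

The point of the sketch is the unit of work: the reviewer decides once per cluster or keyword combination, and the tag fans out to the member documents.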

Clustering is useful in legal discovery, competitive intelligence, and general research.

Technology

Hot Neuron’s technology uses a proprietary algorithm that scales and delivers good cluster quality. According to the company, Clustify can cluster 1.3 million Wikipedia entries on a desktop computer running Linux in 20 minutes or 50 minutes under Microsoft Windows. The system outputs files with these extensions: CYI, CYO, and CYS.

Clustify can generate concept-based clusters, or it can require documents in the same cluster to contain identical passages of text. The latter option is useful for identifying near-duplicates, which can cut the cost of electronic discovery further than simple deduplication.
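One common way to test for “identical passages of text” is word shingling: break each document into overlapping runs of words and look for runs that two documents share. The Python sketch below assumes that technique. Hot Neuron does not say how Clustify does it, and the five-word shingle size and function names are illustrative.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word sequences of length `size`."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def shares_passage(a, b, size=5):
    """True when both texts contain at least one identical run of `size` words."""
    return bool(shingles(a, size) & shingles(b, size))


def jaccard(a, b, size=5):
    """Fraction of shingles the two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


original = "The parties agree that the contract terminates on the first of March"
forwarded = ("FYI: the parties agree that the contract terminates "
             "on the first of March. Thanks")
unrelated = "Quarterly earnings rose on strong demand for interest rate derivatives"

print(shares_passage(original, forwarded))
print(shares_passage(original, unrelated))
print(jaccard(original, forwarded))
```

Exact deduplication would treat the forwarded message as a different document because of the added “FYI” and “Thanks.” The shingle overlap flags it as a near-duplicate, which is the cost saving the paragraph above describes.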

This screen shot from Hot Neuron’s MagPortal.com Web site provides a glimpse of the Clustify functionality. (Note: I’m using MagPortal.com to illustrate the company’s technology. MagPortal.com charges a fee for some of its information services.) These Clustify features jumped out at me when I tested the system on March 9, 2008. Please, keep in mind that some of these features are particular to MagPortal.com and not included in the Clustify license:

Search results with one-click access to similar articles. The icon for this function is the orange flag in each item in the results list.
A “Browse Main Topics” side bar. A user can scan these items instead of firing random key words into the search box, an activity meeting more and more user resistance.
Drop-down boxes to specify what to search or to narrow the results by date, publication, or category. (Google captures date and time information, but as far as I know, those metatags are not yet available to the run-of-the-mill user like me.)
A list of “Related Categories” front and center in the shaded box above the search results.

I also liked the display of the results. Breaking out the source, author, and date is a common sense feature that other systems would do well to emulate. I find scanning results for these key meta items not only annoying but fatiguing. I have a hard time, when tired, differentiating light blue, light green, and other Web 2.0 colors. Kudos to MagPortal.com for making its results easy for me to scan.

About the MagPortal.com Search Engine

The MagPortal.com search engine tries to do a case-insensitive match of your query against the article title, description, authors, and the body of the article. The engine slices your query into “words.” Articles that match the words in your query, minus stop words, are displayed in the results list. By default, the output is ordered by the “quality” of the match. The quality is determined by a mathematical formula (the standard term frequency-inverse document frequency algorithm, not related to Hot Neuron Similarity) that takes into account how often the search term appears in the document (relative to the total length of the document) and how rare that particular search term is among all documents.
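The ranking described above can be sketched in Python. This is a textbook term frequency-inverse document frequency calculation, not MagPortal.com’s actual code. The stop word list, the smoothed `log(1 + N/df)` form of the rarity weight, and the function names are assumptions on my part. TF-IDF comes in several variants, and the site does not say which one it uses.

```python
import math
import re

# Illustrative stop word list; the real list is not published.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOP_WORDS]


def rank(query, docs):
    """Return (doc_index, score) pairs for matching documents, best first."""
    doc_words = [tokenize(d) for d in docs]
    results = []
    for i, words in enumerate(doc_words):
        score = 0.0
        for term in set(tokenize(query)):
            if term not in words:
                continue
            # How often the term appears, relative to document length.
            tf = words.count(term) / len(words)
            # How rare the term is across all documents.
            df = sum(1 for other in doc_words if term in other)
            idf = math.log(1 + len(docs) / df)
            score += tf * idf
        if score > 0:
            results.append((i, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


docs = [
    "Clustering groups similar documents",
    "Physics and clustering: clustering of galaxies",
    "Interest rate derivatives",
]
results = rank("Clustering", docs)
print(results)
```

In this toy set, the second document outranks the first because “clustering” makes up a larger share of its words, and the third document does not appear because it does not match the query.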

Mr. Dimm told Beyond Search:

MagPortal.com has been around for eight years. We primarily index online magazine articles (not newspapers) on MagPortal, with the index being updated once each business day (magazines don’t change very fast). Anyone can browse and search the articles for free. The fee-based feed service allows clients to display a subset of the MagPortal data and functionality on their own Web sites. The client licenses a set of topics/categories relevant for their site, and their site users can browse/search the article data on their site much as they would on MagPortal. We also have a free version of the feed service, with limited functionality, available for non-commercial use here. For the financial data components (stock charts, etc.), we are actually just reselling feeds from another company because we’ve encountered clients that want both articles and charts.

More Information

For pricing, availability, and licensing details, contact Clustify at hotneuron.com.

Stephen Arnold, March 9, 2008, with an update at 5:35 pm Eastern

Written by Stephen E. Arnold · Filed Under Enterprise, Search, Text processing

Comments

22 Responses to “Clustify: Identifying Similar Documents”

Rob Young on March 9th, 2008 5:14 pm

I wonder how easy it would be to write a contender for this using the Carrot2 clustering engine. The search could be managed identically with Lucene. So far I have only heard of Carrot2 being used for clustering result sets (i.e., comparatively tiny sets), but it does accept a stream as input so it may be possible to shape up a good contender.
Thoughts?
Stephen E. Arnold on March 10th, 2008 10:57 am

Thanks for posting, Rob. I have looked at Carrot2, and it looks to me as though Carrot2 can handle this job. You would have to write the middleware and hook into Carrot2. The stream function caught my eye with Clustify. Also, Vivisimo has a good clustering engine, and it works on the fly. I don’t know, however, if Vivisimo makes that available as a stand alone.

Stephen Arnold, March 10, 2008 Noon Eastern
Dmitri on March 10th, 2008 11:31 am

Very interesting, too bad there is no online demo for this. Mining the concepts from text and tagging documents with them sounds like a very reasonable approach, as it gives the end user a quick grasp on the large body of documents (we are doing a similar thing). Regarding their legal implementation, I wonder if it is all based on derived tags, though; I think that a system like this also needs to have a dictionary with specific terms and semantic constructs pertinent to the legal field. Glad I came across your blog, Stephen, will keep following it.
Stephen E. Arnold on March 10th, 2008 12:35 pm

Dmitri, you can see some of the features at http://www.magportal.com.

Stephen Arnold, March 10, 2008, 1:35 pm Eastern
Steve on June 19th, 2009 8:21 am

Interesting article. Came across it on a Google search for clustering. Have you done other research into this area with other tools? There are several tools in the marketplace today, but I would be interested in how Clustify worked against the other tools.
cupola on September 29th, 2011 2:21 pm

Please let me know if you’re looking for a article writer for your site. You have some really great posts and I believe I would be a good asset. If you ever want to take some of the load off, I’d really like to write some content for your blog in exchange for a link back to mine. Please shoot me an email if interested. Kudos!
free fish games on September 29th, 2011 6:13 pm

I`m reading your blog that I found on yahoo and it makes me think that there are still a few mates on the web that there are passionate about their work. Good work with your blog
Stephen E. Arnold on October 3rd, 2011 4:31 pm

Free fish,

May your carp be without encumbrance.

Stephen E Arnold, October 3, 2011
Stephen E. Arnold on October 3rd, 2011 4:32 pm

Cupola,

Best to read the about section and write to that email address. I like writers who do some homework.

Stephen E Arnold, October 3, 2011


Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.