Calais: Free Semantic Tagger

April 22, 2008

If you want to see how cloud-based software can perform rich metatagging, give the free Calais service a whirl. Navigate to the Calais Gallery, scroll down to the Capability Demonstrations, and select the Calais Document Viewer. If you don’t see the link, click here.

Now copy a document and paste it into the window. The system will display this type of result:

[Image: calais_parser]

The tags the ClearForest system automatically identifies are highlighted. The left-hand column of the display shows the types of tags identified; for example, city, company, person, etc. A single click opens a drop-down list of what the system found. The system worked well, with no “false drops” in the sample document. Performance showed some latency, but that’s not unusual with a cloud-based service and some fancy text crunching taking place on remote servers.
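For readers who want to picture what that left-hand column is doing, here is a minimal sketch in Python. It is not the Calais API or ClearForest's method. The function name group_tags_by_type and the sample data are hypothetical, and the sketch assumes a tagger has already produced (text, type) pairs. It only shows the grouping step: collect what was found under each tag type and drop repeats.

```python
from collections import defaultdict


def group_tags_by_type(tags):
    """Group (entity_text, entity_type) pairs by type.

    Keeps first-seen order within each type and drops duplicates,
    similar in spirit to a per-type list of what a tagger found.
    """
    grouped = defaultdict(list)
    for text, entity_type in tags:
        if text not in grouped[entity_type]:
            grouped[entity_type].append(text)
    return dict(grouped)


# Hypothetical tagger output for illustration only; a real service
# would return its own structure and type names.
sample_tags = [
    ("Reuters", "Company"),
    ("ClearForest", "Company"),
    ("Ronen Feldman", "Person"),
    ("Reuters", "Company"),
]

grouped = group_tags_by_type(sample_tags)
for entity_type, names in grouped.items():
    print(f"{entity_type}: {', '.join(names)}")
```

The hard part, deciding that a string is a company or a person, is what the remote service does. The grouping above is the easy bookkeeping a viewer could do with the results.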

More about Calais

For now, Calais is working to build a community to extend the Semantic Web. Without tools like Calais, the Semantic Web is likely to remain a great idea that failed because people don’t want to do tagging. When tagging is done, it’s lousy. I’m supposed to know how to index, and the tags for my Web log are pretty miserable. The reasons may be broader than just my own approach. First, indexing, to be useful, must use a body of terms that the average user can hit upon and remember. So neologisms are out and weird jargon won’t work at all. Second, writing for a Web site or a Web log like this one is supposed to be disciplined, but it’s not. I have other research work that commands my primary attention. The Web log, while important, comes second, maybe third on some busy days. Finally, I’m not sure what I will write about. I react to information people send me in email, stories in my RSS reader, and comments made–often off the cuff–on a phone call. It is difficult for me to create a controlled term list because I’m not sure what the topics will be. Therefore, lousy tagging.

Calais asserts that its technologies can address my three failings and probably yours as well. You can download developer tools, upload content to Calais, or use the functions on the Calais Web site. Reuters-ClearForest has posted some useful documentation about Calais here. If you’re a bit nerdy, you can do some integration of Calais and your application. The best way to get a sense of what’s possible is to explore the sample applications on the Calais Web site.

More about ClearForest

ClearForest was founded by text mining guru Ronen Feldman. You can get the inside scoop on this wizard’s approach to squeezing information from text in his 2006 book, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data (Cambridge University Press ISBN 13: 9780521836579).

The ClearForest technology performs “discovery”; that is, the system processes text and identifies important information. The company found a ready market wherever executives wanted to find the “hidden” information in text. I recall attending a presentation by Dr. Feldman in which he showed the ClearForest system processing auto warranty data in the written comments from customer support reps and owners who sent email about their vehicles.

The ClearForest system processed these comments and displayed important discoveries in easy-to-understand reports. One example concerned a flawed component that the ClearForest system pinpointed as the cause of problems the automobile manufacturer had previously overlooked. The kicker to this example was that the manufacturer was able to make a change to the affected component and take preemptive action to save significant amounts of warranty cost and avoid customer complaints.

The earlier versions of the ClearForest system made use of rules. Some of these required hand-tweaking or a ClearForest-adept programmer to set up, tune, and deploy. Over the years, ClearForest, like other companies looking for ways to leverage “smart” software, has injected more automation into its system.

At this time, the system processes Reuters content, saving money because human subject matter experts are no longer needed to fiddle with information. The company has a broad product line and provides software that can categorize, extract the names of people and companies, manipulate information in relational databases, and perform other feats with structured and unstructured content.

ClearForest separated itself from some competitors by including solid analytic functions in its system. The ability to count, apply statistical procedures, and generate a chart or graph are the types of functions that analysts and end users appreciate. As with other industrial-strength text processing tools, customers must adapt to six-figure license fees and some optional costs for customization, support, and maintenance.

I did not include ClearForest in my new study Beyond Search. The company was founded in 1998, and I was looking for firms that were less well known and had jumped into the roiling waters of search and content processing more recently. I was also reluctant to talk about a subsidiary of a large, low-profile corporation with a long history of changing directions with little warning. I used the same line of reasoning to exclude Marsh & McLennan’s Kroll Ontrack service. That’s not a negative; just an editorial reality. I had room and energy to profile 24 companies and ClearForest did not make the cut.

What Strikes Me as Important

Several points coalesced in my mind as I was looking at my notes about ClearForest before it was acquired by Reuters and as I was exploring the Calais Web site. First, ClearForest has made good progress automating certain functions. A couple of years ago, ClearForest was mostly a manual system. Programmers had to fiddle with rules, tune the engine, and sometimes write custom scripts to get a complex system purring like a house cat. Calais evidences significant use of computational intelligence in the system.

Second, I was delighted to see several demonstrations that worked well as cloud services. As cloud-based services become more widely used by enterprises, text mining is one function that makes sense to get out of the on-premises server room. Not only are these systems computational gluttons, but they are also complex. Some text processes make sense to park “out there” on the network.

The main drawback is the layers of management in the Thomson-Reuters-ClearForest stack. The genealogy of Calais within the Thomson-Reuters-ClearForest setup is fuzzy to me, and the big company management philosophy could change without warning. I would be reluctant to build a business on Calais until I had a better sense of what the big company owner will do. I am skeptical of what any big company owner says. Actions, not words, make clear how the product and service offerings will be priced and under what terms.

Observations

To wrap up this essay, definitely review the Calais service. You can dabble in rich text processing and maybe advance the Semantic Web an inch or two. Also, pay attention to the details of the acquisition within an acquisition. Upper management decisions often translate to some wild and crazy actions. Conservatism and risk avoidance are important ideas to keep in mind before you write a check.

Stephen Arnold, April 23, 2008

Comments

3 Responses to “Calais: Free Semantic Tagger”

  1. Thomas Tague on April 23rd, 2008 10:45 am

    Stephen:

    Tom Tague, leader of the Calais Initiative at Thomson Reuters here.

    First, thank you for taking the time to investigate Calais and write such a thoughtful posting. 99% of what we see is some flavor of “Cool..here’s a link” – you actually take the time to address some substantive points.

    I’m going to respond with a few observations.

    First, Calais and its associated tools are in a period of extended “hyperdevelopment”. We released the first version of Calais in late January and have released two significant updates since that time. We also have a major update scheduled for release on 5 May. That’s the commitment we’ve made to ourselves – new functionality delivered to the end user every month without fail. Our development activities focus on two core elements: 1) developing or sponsoring the development of an entire ecosystem of tools from sample applications to WordPress and Drupal modules to code libraries – all with the intention of making Calais accessible and relevant to a broader audience, and 2) improvements in the core service itself. With each release of Calais we add a significant number of new entities, facts and events in existing and new domains. Starting with our next release the rate at which we roll out new elements and domains will begin to accelerate.

    Second, you bring up the point of commitment on the part of Thomson Reuters. As you mention you’d be reluctant to build a business on top of the service. We’ve learned something interesting over our first few months of operation: you can deploy a free service, you can deploy it for “web scale” volumes, you can tell people you’re committed for the long run – it just doesn’t matter. Unless you’re prepared to back that with contractual commitments of performance and availability prudent organizations are not going to bet their businesses on it. We heard. We get it.

    So – we’re going to do just that in the very near future. In addition to the current service we will offer a version backed by a contractual agreement specifying reliability and long term availability. We haven’t settled on pricing – but it will be modest. Very modest.

    Before a few thousand people say “Aha.. I knew the free service was too good to be true!” let me make a few important points: The free service remains unchanged. The free service will have 100% of the functionality of the contractual service. The free service will continue to have generous usage limits (currently 40,000 transactions per day and increasing regularly). The free service is still available for commercial and non-commercial use. Period. No ifs, ands, or buts. The contractual service will provide an SLA – and more importantly the confidence to bet your business on Calais.

    We’re incredibly excited about the level of attention Calais has received. Besides blog postings and articles in the Economist – over 3,000 developers have signed up to use the Calais service. The fact that 3,000 people are willing to spend time to develop innovative applications that have the potential to impact hundreds of thousands (can we say millions?) of end users is extraordinary. That it’s happened in a little under 90 days is amazing.

    So, in summary. We’re serious. We’re here to stay. The capabilities and tools delivered by Calais will continue to grow every single month. We have some amazingly exciting stuff in the pipeline we can’t talk about yet. And, I’ll echo your suggestion – jump in and start experimenting. There’s some great stuff to be built.

    Regards,

    Tom

  2. Stephen E. Arnold on April 23rd, 2008 5:10 pm

    Thanks for taking the time to respond. In my experience, roll ups that control diverse media properties can change direction quickly. Examples include Rupert Murdoch’s shift in the New York media market, Thomson’s divestiture of newspapers, and Reed Elsevier’s sale of uninteresting properties. Therefore, an initiative–no matter how well thought through or well intentioned–can be redirected, sold, or shut down deus ex machina. This was thrilling to a Greek theater goer, but not much fun for the hapless victim of a god’s whimsy. The senior management of the information companies dominating certain content spaces like law, technical information, and business information has to answer to shareholders first, market realities second, and then new, high-technology initiatives. I understand your comments, but I stand by my admonition for users of any “free” or “new” service from today’s version of Ling-Temco-Vought-type companies to exercise prudence.
    Stephen Arnold, April 23, 2008

  3. Media Cloud: Foggy Payoff : Beyond Search on March 12th, 2009 1:46 pm

    […] I wrote about Calais in 2008. You can find that article here. Calais makes use of ClearForest technology to perform semantic tagging. I am cautious when large […]
