IBM Explains Text Analytics

June 15, 2008

A colleague called my attention to this April 2008 description of IBM’s view of text analysis. The essay “From Text Analytics to Data Warehousing” is about more than processing content. The article by Matthias Nicola, Martin Sommerlandt, and Kathy Zeidenstein points toward what I call a “metaplay” or “umbrella tactic”. (You will want to read the posting here. When accessing IBM content, keep in mind that it can be difficult, if not impossible, to locate IBM information via the IBM search function. Pages available via a direct link like this “From Text Analytics to Data Warehousing” may require that you register, obtain an IBM user name and password, and then relaunch your search to locate the information. Other queries will return false drops with the desired article nowhere to be found. I’m not sure if this is OmniFind, Fast Search, Endeca, or some other vendor’s handiwork. Either way, I find search and retrieval of IBM information on the IBM site frustrating. Click here now.)

The authors state:

This article review[s] the text analysis capabilities of IBM OmniFind Analytics Edition, including an analysis of the XML format of the text analysis results, MIML. It then examined different approaches that can help you extend the value of OmniFind Analytics Edition text analysis by storing analysis results from the MIML file into DB2 to enable standard business intelligence operations and reporting using the full power of SQL or SQL/XML.

As I worked through this article, reviewed the diagram, and explored the See Also references, one point jumped out at me. The markup generated by the IBM system can be verbose. The emphasis on the use of DB2, IBM’s database system, underscored for me that IBM text analytics requires software, DB2, and storage. In fact, without storage, the IBM text analysis system could grind to a halt. To increase performance, the licensee may require additional IBM servers, management software, and other bits and pieces.

You have to store the “star schema for MIML” somewhere. Here’s what the structure looks like. Of course, the image is IBM’s and copyrighted by the company.

[Figure: IBM architecture]

I want to point out that this is one of the simpler diagrams in the write up.
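
To make the star schema idea concrete, here is a minimal sketch of the general approach the article describes: shred XML analysis results into relational tables, then report on them with plain SQL. Everything in it is my own assumption for illustration. The XML layout is invented and is not the real MIML format, the table names (`keyword_dim`, `doc_keyword_fact`) are hypothetical, and SQLite stands in for DB2 so the example runs anywhere.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical analysis output. This is NOT the real MIML format;
# the element and attribute names are invented for illustration.
SAMPLE_XML = """
<results>
  <doc id="d1">
    <keyword category="product" value="laptop"/>
    <keyword category="problem" value="battery"/>
  </doc>
  <doc id="d2">
    <keyword category="product" value="laptop"/>
    <keyword category="problem" value="screen"/>
  </doc>
  <doc id="d3">
    <keyword category="product" value="printer"/>
  </doc>
</results>
"""


def load_star_schema(xml_text, conn):
    """Shred the XML into one dimension table and one fact table."""
    conn.executescript("""
        CREATE TABLE keyword_dim (
            keyword_id INTEGER PRIMARY KEY,
            category   TEXT NOT NULL,
            value      TEXT NOT NULL,
            UNIQUE (category, value)
        );
        CREATE TABLE doc_keyword_fact (
            doc_id     TEXT NOT NULL,
            keyword_id INTEGER NOT NULL REFERENCES keyword_dim (keyword_id)
        );
    """)
    root = ET.fromstring(xml_text)
    for doc in root.iter("doc"):
        for kw in doc.iter("keyword"):
            key = (kw.get("category"), kw.get("value"))
            # Each distinct (category, value) pair is stored once in the dimension.
            conn.execute(
                "INSERT OR IGNORE INTO keyword_dim (category, value) VALUES (?, ?)",
                key,
            )
            (keyword_id,) = conn.execute(
                "SELECT keyword_id FROM keyword_dim WHERE category = ? AND value = ?",
                key,
            ).fetchone()
            # The fact table records which document mentioned which keyword.
            conn.execute(
                "INSERT INTO doc_keyword_fact (doc_id, keyword_id) VALUES (?, ?)",
                (doc.get("id"), keyword_id),
            )
    conn.commit()


def keyword_counts(conn):
    """A standard BI-style rollup: documents per keyword."""
    return conn.execute("""
        SELECT d.category, d.value, COUNT(DISTINCT f.doc_id) AS docs
        FROM doc_keyword_fact AS f
        JOIN keyword_dim AS d ON d.keyword_id = f.keyword_id
        GROUP BY d.category, d.value
        ORDER BY docs DESC, d.category, d.value
    """).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_star_schema(SAMPLE_XML, conn)
    for row in keyword_counts(conn):
        print(row)
```

Even in this toy version, each annotation becomes at least one database row plus a dimension lookup, which hints at why storage and database capacity figure so heavily in the IBM picture.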

Observations

  1. This write up suggests to me that IBM is defining text analytics as a component in a much, much larger array of software, hardware, and systems. My hunch is that IBM wants to shut the barn door before more standalone text analytics tools are sold into IBM shops.
  2. IBM is making explicit that text analytics is an exercise in data management. Google, I think, has much the same notion based on my reading of its technical papers.
  3. IBM has done a good job of making clear that software alone won’t deliver text analytics. Without the ability to scale, text analysis can choke most systems. Now IBM has to get this message to the information technology professionals who assume that their existing servers and infrastructure can handle text analytics.
  4. IBM has done an excellent job of moving text analysis from an add on to one part of a larger constellation of operations. The notion of a metaplay or an umbrella tactic is important because individual vendors often ignore or understate the broader impact of their content processing subsystems.

I think this is an important write up. A happy quack to the reader who called the information to my attention.

Stephen Arnold, June 15, 2008

Comments

2 Responses to “IBM Explains Text Analytics”

  1. Seth Grimes on June 20th, 2008 2:16 pm

    I blogged this in May: http://www.intelligententerprise.com/blog/archives/2008/05/from_text_analy.html

    IBM’s content is quite easy to find. Just submit “From Text Analytics to Data Warehousing” to Google.

    IBM has actually been quite generous over the years in publishing text-analytics material whose value goes far beyond that of most vendor white papers. Consider, for instance, their MedTAKMI case study, which is informative several years after it was published: http://www.research.ibm.com/journal/sj/404/nasukawa.html . (More recent papers are at http://www.research.ibm.com/journal/sj/433/uramoto.html and http://www.research.ibm.com/journal/sj/461/inokuchi.html.)

    Your observations are generally on target although I’d dispute that “IBM wants to shut the barn door before more standalone text analytics tools are sold into IBM shops.” If that were the case, they wouldn’t have released UIMA to open source (Apache) or continued to maintain their partnership with Attensity that dates back to 2005.

  2. Stephen E. Arnold on June 20th, 2008 2:34 pm

    Seth, thanks for writing, but I beg to differ. Example: do a search for a NetFinity 5500. Not only is it difficult to narrow the query to a specific document related to ServeRaid, it is impossible to map the new nomenclature for IBM servers to the old product names. The same situation exists when looking for information about semantic search, WebSphere, and iPhrase. I have a very tough time [a] locating the current document, [b] locating documents I have seen that have been relocated, deleted or removed from IBM’s sprawling, loosely coordinated Web sites, and [c] figuring out why some documents are available and others require that I use my IBM user name and password to access. I also find that my IBM user name expires or is not supported on IBM sites in the UK, for example. I am delighted you have mastered the IBM search system. I have not, and I have been trying since I worked on IBM technical libraries’ search with BRS’s version of STAIRS in 1979 or so. I think IBM has some work to do on its search system for public facing IBM Web sites.

    Stephen Arnold, June 20, 2008
