CAVE to TDM: New Jargon for Publishers

December 7, 2015

I read “Text and Data Mining: Challenges and Solutions from the Publishers’ Perspective.” The write up summarizes a conference attracting publishers. The perspective of publishers, based on my experience, is survival. I know. I know. Some publishers are in high cotton. The Washington Post is vying to become the US newspaper of record. There are other examples as well, but surging revenues, generous organic growth, and healthy profits are not attributes of most publishing outfits engaged in doing things with dead trees.

The conference focused on text mining. According to the blog post covering the program:

Text mining refers to “the process or practice of examining large collections of written resources in order to generate new information”. I am not an expert in text mining, but I understand that it is about applying specialized software/algorithms/techniques on existing textual information so that it can be read and analyzed by machines in order for them to extract more meaningful information for us, humans. Of course, text mining is no news to the research community, as it seems that it all started back in the ’80s with a methodology titled CAVE (Content Analysis of Verbatim Explanations) but its background goes beyond the scope of this article. What I can tell you is that it is a complex process, involving techniques from areas such as information retrieval, natural language processing, information extraction and data mining – into a single workflow!

I am not sure if this definition hits the core of the concern. From my point of view, publishers have to respond to numerical recipes which operate on content. Included in my view is the landscape of videos whose dialog and imagery are converted to human understandable text, best guesses at what an image represents, and assorted metadata.
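For readers who want to see what a “numerical recipe which operates on content” looks like, here is a toy sketch in Python. It is not the method of any vendor or publisher mentioned in the write up. The sample documents, the stop word list, and the function names (`tokenize`, `retrieve`, `top_terms`) are all my own assumptions, made up for illustration. The sketch strings together a crude retrieval step and a crude counting step, which is roughly the shape of the workflow the quoted definition describes.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a far longer one.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "on", "for", "at", "into"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def retrieve(documents, query):
    """Information retrieval step: keep documents that contain the query term."""
    return [doc for doc in documents if query in tokenize(doc)]


def top_terms(documents, n=3):
    """Data mining step: count the non-stop-word terms across the documents."""
    counts = Counter(
        tok for doc in documents for tok in tokenize(doc) if tok not in STOP_WORDS
    )
    return counts.most_common(n)


# Made-up sample content, standing in for a publisher's archive.
documents = [
    "Publishers discuss text mining at the conference.",
    "Text mining turns publishers' content into data.",
    "The weather in London is mild.",
]

hits = retrieve(documents, "mining")  # the two documents about text mining
terms = top_terms(hits)               # the most frequent terms in those hits
print(terms)
```

The point is not the arithmetic. The point is that the content goes in one end and counts, rankings, and candidate “nuggets” come out the other, with no editor in the loop.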

What does one do with these outputs of text mining? Numbers are good. But the more important use is that with smart software certain processes can be automated and made intelligent. The use of algorithms to generate news stories is, believe it or not, part of the Associated Press’ bag of cost cutting tricks.

For publishers, like those named in the write up, the future looks challenging. Authors pump out content with or without the ministrations of a “real” publishing company. Then tireless software agents labor away. When something useful (defined by the self adjusting algorithms) becomes evident, an output is generated.

The question, therefore, is not what tools have been and are available. Text mining is decades old. The question is, “How will publishers make informed decisions about increasingly smart systems which supplant older, slower, more expensive ways to find useful nuggets and assemble them into actionable reports, visualizations, or on demand dashboard displays?”

Net net: Conferences are useful. Buzzwords maybe. The publishers have some thrill ahead. I wish I had the publishers’ grasp of trends and text mining technologies.

Stephen E Arnold, December 7, 2015
