The Text Mining Can of Worms

April 17, 2008

In October 2007, I participated in a half-day “text mining” tutorial held after the International Chemical Information Conference in that most appealing Spanish city, Barcelona. The attendees–I think there were about 24 people mostly from European companies–wanted to learn about advanced text mining systems–in theory. Reality often intrudes, however.


Fresh from the primary research for my Beyond Search: What to Do When Your Enterprise Search System Won’t Work, I had a significant amount of information about 50 vendors’ text mining systems and their technologies. The structure of the Barcelona tutorial was straightforward. After defining text mining and differentiating it from the better-known data mining, I walked through some case examples of text mining successes. The second part of the tutorial focused on the business issues of text mining. The end point of this segment tackled three key challenges to which I will return in a moment. The third segment of the tutorial took a look at what Google was disclosing through its engineering papers and speeches about its approach to text mining. This is a very interesting block of information, and I may at some point in the future describe a little of our findings. The tutorial wrap-up was a series of observations with time for the attendees to ask additional questions and share some of their experiences.

The most interesting moment in the three-hour session was a statement made by a young chemist turned business development manager from a major European chemical firm. Her remarks were so distinctive that I wrote them down as I repeated her question to the group. She said, according to my handwritten notes:

Will you please tell me what I have to do to make text mining work? I don’t have the budget for the systems you have been talking about. I don’t know how to use the type of tools you just described from the Megaputer company. Please, please, give me an answer.

I could not. Her anguish–almost to the point of tears–was evident. Her words and actions revealed a person cast into uncharted waters. She found herself in a position at work unable to navigate from the need (text mining) to a solution (system) that she could use or afford. We took a break, and I asked her to explain her needs, and she played back to me the three key challenges of text mining in a pharmaceutical, chemical, or closely-allied firm. The tutorial was no longer a lecture about technology. The information need was intensely personal and immediate.

The Challenges

Looking back at my handwritten notes of her comment from a distance of six months, I realize that not much progress has been made on the issues she faced:

  1. Complexity
  2. Fragmentation
  3. Cost

Let’s look at each of these issues quickly and then step back to consider the implications of these challenges for search, content processing, text mining, and business intelligence.

Challenge 1: Complexity

Unstructured text poses many technical hurdles for vendors, licensees, and users. Most of the systems available today are difficult. Let me give you some examples. For the licensee, there’s a steep learning curve. Each vendor comes at the problem differently, using different techniques and procedures. Overworked information technology groups often make errors that require circling back and redoing some steps. Overworked IT departments don’t know what’s ahead for them. Professional pride or stubbornness leads some IT people to stonewall. Once the system is working, the IT department needs to move on. The ever-steeper learning curves have to be flattened. In my work, I find that many text mining problems are a direct result of IT errors. Vendors look like the bad guys, but the arrow of responsibility points to the licensee’s own technical operation.

A representative example of system complexity.

In my files I had this picture from a SQL Server 2005 briefing. The complexity of this system is neither greater nor less than that of other vendors’ systems. Source: Microsoft Corp., 2005

Users are not very good at mathematical reasoning. The reason is that unless the user keeps her math skills honed, the nitty-gritty details of statistical functions blur with time. I’m not saying that end users can’t relearn or that all end users are unable to count. The typical user in a large organization struggles with the numerical and statistical concepts that many text mining systems assume the user will know very well. Unless a system generates a graph and explicitly states a conclusion for a user, the text mining system outputs may as well be in Babylonian glyphs.

Many companies address these problems by outsourcing, using subject matter experts, and hiring people fresh from university with equations fresh in their minds.

Challenge 2: Fragmentation

Why did the young chemist cum business development professional choke with angst? She was cast adrift within a multi-billion euro operation with no map, no safe harbor, and no guidance. Her pleas to me for a silver bullet or magic wand to wave over her head were signs of deeper management problems in her employer’s organization.

Large companies in the U.S. and Europe–two regions where I have the most experience–are collections of fragments. Regulatory requirements imposed on pharmaceutical companies exacerbate the problem. Management fear about losing a key researcher to a competitor strengthens the silos of information and the compartmentalization of information. Regulators are wary of pharmaceutical companies. There’s a rich, interesting history of products that have slipped through clinical trials with intriguing consequences, legal questions about pricing, and the good old-fashioned mismanagement that characterizes many companies in our post-Enron era.

The young woman cannot find a path forward from her colleagues in the company. She–which I find deeply troubling–must turn to a tutorial for an answer. Large organizations–not just chemical and pharmaceutical companies–may be creating information problems of such magnitude that there is no simple resolution available today.

I think that explains why in a large company there are at least five enterprise search (what I call Intranet search or behind-the-firewall search) systems. One pharmaceutical company in New York City was described to me this way: “That company buys one of everything. I’m not sure anyone knows what to do with all the systems, but we make a sale with every upgrade. No questions asked.”

The plea for a single system that works is a symptom of the technical anomie that plagues many organizations.

Challenge 3: Cost

I get asked about the cost of text mining with great frequency. The truth is that most organizations do not want to take the steps needed to determine the fully-loaded cost of any system. The standard operating procedure is to define a few fixed costs that are carefully mapped to a specific budget allocation. The indirect or soft costs are ignored.

Here’s how this works.

An organization licenses a text mining system. Let’s assume that the system costs $100,000, a paltry sum for a Fortune 500 company. The company does not have a method for tracking the following costs relative to the text mining system:

  • Internal staff time. Most organizations do not track employee time. The result is that no one knows how most employees invest their time. Estimates won’t do the job. The cost of a labor-intensive system is likely to be greater than the licensing fee and capital expenditures combined. Lousy work tracking data leads to a manager firing the only person able to make a system work. This uninformed action immediately triggers the costs identified below.
  • Consultants. Few organizations have the internal expertise to set up, maintain, enhance, and operate one text mining system. When the organization has multiple systems, forget it. Costs skyrocket. Poor accounting controls make it easy to sweep some costs into other line items. When the cost overruns become known, it’s a major problem. Knee-jerk reactions don’t solve the cost time bomb.
  • Failures of software and hardware. The system goes down. Exceptional measures must be taken to get back online. In most organizations, the expense of remediation is disconnected from its cause. Important cost pitfalls, therefore, await the hapless manager who stumbles into them. Symptoms of this problem are IT departments with an established history of budget issues. Capping IT budgets increases the organization’s risk.
  • Legal issues. A text mining system is complicated. Humans are complicated. Adding two complicated entities creates a climate ideal for litigation. Legal costs are disconnected from their root cause and fatten the general and administrative cost.
  • Customization. Responding to a Ph.D. with a big ego can pump up system costs. A failure to respond creates more hassles for the IT department or the text mining manager. The failure of a corporate culture manifests itself in many ways: flawed clinical trials, bad products, and rule by squeaking wheel. Without the power to say “No”, customization of text mining systems can become a cost black hole.
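
The cost categories above can be put into a small calculation. What follows is a minimal sketch, not a costing method from any vendor or from my own research. Only the $100,000 license fee comes from the example above; every other dollar amount is a made-up placeholder, and the names CostLine, fully_loaded_cost, and untracked_share are hypothetical. The point is only to show how small the tracked portion can look next to the fully-loaded figure once the soft costs are written down.

```python
from dataclasses import dataclass


@dataclass
class CostLine:
    category: str
    amount: float   # US dollars
    tracked: bool   # True if the cost is charged to the system's own budget line


def fully_loaded_cost(lines):
    """Sum every cost line, whether or not the organization tracks it."""
    return sum(line.amount for line in lines)


def untracked_share(lines):
    """Fraction of the fully-loaded cost that never shows up against the system."""
    total = fully_loaded_cost(lines)
    if total == 0:
        return 0.0
    hidden = sum(line.amount for line in lines if not line.tracked)
    return hidden / total


# Only the license fee is taken from the article.
# All other amounts are illustrative placeholders, not real data.
lines = [
    CostLine("License fee", 100_000, tracked=True),
    CostLine("Internal staff time", 150_000, tracked=False),
    CostLine("Consultants", 80_000, tracked=False),
    CostLine("Software and hardware failures", 30_000, tracked=False),
    CostLine("Legal issues", 20_000, tracked=False),
    CostLine("Customization", 20_000, tracked=False),
]

print(f"Fully-loaded cost: ${fully_loaded_cost(lines):,.0f}")
print(f"Share not tracked against the system: {untracked_share(lines):.0%}")
```

Replace the placeholder amounts with figures from your own time sheets and invoices, if your organization has them. If it does not, that gap is the problem described in the first bullet.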

Your organization may not evidence any of these cost symptoms. Be grateful. Cost becomes the issue no matter what anyone tells you about text mining.

The reason? No one can point to a system and prove with data, “This system generated X amount of revenue.”

Winding Down

There’s a person like the one in my tutorial pleading for a text mining system that solves that person’s information need in your organization. Maybe it’s you?

Pundits are touting the notion that business intelligence is a commodity. Maybe like search, text mining and the related disciplines will become the Pepsi Cola of the information world. I don’t think this will happen in the foreseeable future. We’ve created a corporate environment that encourages employees to assume responsibility.

From my point of view, allowing a chemist shouldering business development chores to procure a text mining system is too risky for my liking. A commodity text mining or business intelligence system will be, by definition, ill-matched to most users’ requirements. We have good data that say that a general search system is a problem for up to two thirds of the system’s users. Why would a more complex component generate greater user satisfaction? It won’t until vendors figure out how to build far better solutions than those now available.

Pity the hapless woman in my tutorial. I couldn’t help her in a meaningful way in October 2007. I can’t today. More troubling, her employer is incapable of providing the assistance she requires. Of greater consequence are the vendors who prey on these types of customers. Good fishing with the text mining worms.

Stephen Arnold, April
