Taxonomy: More Marketing Craziness in Play?

December 12, 2011

For whatever reason, I have been picking up rumors, factoids, and complaints about the sales and marketing tactics of various search and content processing vendors. With the holidays just around the corner, one would think that in the run up to Kwanzaa, Christmas, Hanukkah, and Boxing Day folks would chill.

Ah, Agility!

The first dust up concerns tag lines. At issue is the word “agile”, which is becoming one of the more popular terms. I was in a meeting at which a heated discussion broke out about whose search and content processing system is agile. Endeca claims agility. I am not going to dispute that a 13 or 14 year old system can be agile, but in Internet years, there may be some flexibility lost. Run a query for “agile” and “search” and you get a hit to a recruitment firm, a marketing outfit, and something called the Tamilan Search Engine. I also spotted PolySpot, a French infrastructure, solutions, and applications company. The problem is that words are slippery. What are the synonyms for “agile”? I expect to see some of these turning up in 2012. How about gazelle search or spry search?

In tough economic times, financial pressures can distort business methods.

Circular Partnerships: Snakes Eating Their Tails

The second dust up concerns partnerships. I have been looking through the list of partners identified by such companies as Microsoft, WAND, and others. What I have discovered is that most of the partners are either household names like IBM or companies I have never heard of. Furthermore, when I dig into the partner names unfamiliar to me, I discover companies which are consulting firms or resellers who offer a roster of “stuff.” I understand the importance of amplifying a sales force. A partnership plan is little more than a way to reduce the cost of getting a lead and making a sales call. One of the experts in this game is the struggling giant Thomson Reuters. The company signs up partners when sales flag. In the taxonomy game, the partnerships have another twist. The linkages are circular. Antidot or Mondeca points to partners and partners point to other search and content processing vendors which point to the original company. I find this confusing because “partner plays” are gaining momentum among specialist firms. I think the “partner” card is an indication that a search and content processing firm may be beating the bushes to get revenue. Just my opinion, of course.

Today, everything is for sale. Be wary if a pitch sounds too good to be true.

Pitching Automation No Matter the Consequences

The third dust up involves taxonomies and is related to the circular nature of partnerships and financial pressures. There is considerable contention in the market with regard to taxonomies. The word “taxonomy” itself is a shuttlecock with software badminton players swinging with abandon. The idea is simple: a hierarchical word list. But the simple idea now comes with hot new spins like ontology (not to be confused with the branch of metaphysics that deals with the nature of being), metatagging, and categorization.

On one side of the dictionary are those who want the software to discover the concepts, terms, and bound phrases. Then these terms are automatically assigned to content processed by the system. If this sounds like the Bayesian magic associated with Hewlett Packard Autonomy or Recommind, you are on the money. There are alternative approaches which have considerable payoff. A good example is the work done by Tim Estes and his team at Digital Reasoning, a firm which received financial goodness from SilverLake Sumeru. The idea is that humans play either a modest role or no role at all. Because of the volume of data flowing through a system, human intermediated systems struggle to keep pace with the fluidity of human discourse. On one side, therefore, automation. For simplicity’s sake, let’s call this the Google approach.

On the other side of the dictionary are those who see humans with subject matter expertise playing an important role. The idea, which seems quaint to many of the self appointed experts and azure chip consultants, is that human beings can set up a conceptual scheme and populate it with words, terms, and bound phrases. Armed with a controlled term list, a system can use those terms to index or tag content. The idea has merit because the American National Standards Institute has spelled out guidelines for controlled term lists.

Here’s how the battle shapes up. On one side is the “we don’t need any humans” crowd. In my opinion, some enthusiasts for this no-humans position are TEMIS, Google, and in some cases Autonomy. Many of the automated indexing and tagging systems work quite well when the corpus of content is tightly bounded. What do I mean by “tightly bounded?” Pick up a hard copy of a medical journal about cancer or about nuclear engineering. The vocabulary does not vary too much from article to article within each topic area. In fact, once you learn about 2,000 nuclear terms, you can figure out the basic idea of most nuclear power write ups.

Are some search and content processing vendors taking notice of sales methods associated with used car sales professionals? Even Google is advertising on the “vast wasteland”.

What happens when you process unbounded content? Well, real life language use is trickier. Non experts simplify complex ideas, often substituting non specialist terms for arcane jargon. Do you know what an ECCS is? Probably not. A “real” journalist or consultant will convert the notion of an emergency core cooling system to something along the lines of a “spare radiator.” Not exactly on the money, but indicative of how precise language is softened. In these situations, it is useful to have a term list of the specialist words, terms, and bound phrases. Subcategories under Cooling Systems can contain the ECCS entry and others. The idea is that content can be assigned certain terms no matter what the words and phrases in the source document may be.
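To make the term list idea concrete, here is a minimal Python sketch. It assumes a tiny, made-up term list in which each preferred term sits under a broader category and carries the variant phrasings an indexer has approved. The names `TERM_LIST` and `tag_document` and the entries themselves are hypothetical illustrations, not any vendor's format or an ANSI-compliant thesaurus.

```python
import re

# Hypothetical controlled term list: preferred term -> broader category and variants.
# The entries are illustrative only, not taken from a real thesaurus.
TERM_LIST = {
    "Emergency core cooling system": {
        "broader": "Cooling Systems",
        "variants": ["emergency core cooling system", "ECCS", "spare radiator"],
    },
    "Cooling tower": {
        "broader": "Cooling Systems",
        "variants": ["cooling tower"],
    },
}


def tag_document(text, term_list=TERM_LIST):
    """Return sorted (broader category, preferred term) pairs found in text."""
    tags = set()
    for preferred, entry in term_list.items():
        for variant in entry["variants"]:
            # Match the variant as a whole word or phrase, ignoring case.
            pattern = r"\b" + re.escape(variant) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                tags.add((entry["broader"], preferred))
                break  # one matching variant is enough for this term
    return sorted(tags)


article = "Inspectors said the plant's spare radiator failed during the test."
print(tag_document(article))
```

The article never uses the specialist phrase, yet it is still tagged with the preferred term under Cooling Systems because a human listed “spare radiator” as a variant. A production system would need far more than string matching, but the division of labor is the point.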

Some companies like TEMIS, Google, and Yandex are not too keen on the human involvement. The reasons range from the cost of getting humans to do index and taxonomy development to an arrogance about how software performs. Wizards see the world in terms of their wizardry which is okay with me. I think it is silly to assume software can handle language with the facility of humans, but I have some experience with what happens when “good enough” is not.

Other companies like Access Innovations (a former client from days of yore) and (believe it or not) Dow Jones (a component of the exciting Murdoch organization) believe that humans are important. The humans can develop the lists, set up guidelines or rules for the indexing system to consult, and provide interfaces to allow subject matter experts to adjust the term list and tune the indexing system. The benefit is that the accuracy of the indexing, based on my real life experience, is much better. There is language drift, but there are methods to intervene and correct that drift.

Without a method to adjust to what software is too stupid to see, the indexing “drifts”. The impact of this is not too good. You run a query for a particular snake bite treatment, and you cannot locate the content. The term you use is not assigned by the system and it does not appear in the source document. So what? Well, how about your child dies. Maybe this is an unpleasant thought, but the consequences of lousy indexing and concept assignment are often more serious than not finding a pizza joint in San Jose.
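One way to picture the human intervention is a review queue. The Python sketch below assumes a deliberately simple setup: documents that match nothing in the term list are handed to a subject matter expert, who can add a new variant so the next pass catches them. The function names, the sample term list, and the sample documents are all hypothetical, and plain substring matching stands in for whatever a real indexing system does.

```python
def find_unindexed(docs, variants_by_term):
    """Return the documents that match no variant in the term list."""
    queue = []
    for doc in docs:
        lowered = doc.lower()
        matched = any(
            variant.lower() in lowered
            for variants in variants_by_term.values()
            for variant in variants
        )
        if not matched:
            queue.append(doc)  # needs a human look
    return queue


def add_variant(variants_by_term, preferred, new_variant):
    """Record an expert's decision that new_variant means the preferred term."""
    variants_by_term.setdefault(preferred, []).append(new_variant)


# Hypothetical term list and documents, for illustration only.
terms = {"Antivenom": ["antivenom", "antivenin"]}
docs = [
    "Antivenin dosing for rattlesnake bites",
    "Snake bite serum shortages in rural clinics",
]

print(find_unindexed(docs, terms))  # the second document is not yet covered
add_variant(terms, "Antivenom", "snake bite serum")
print(find_unindexed(docs, terms))  # nothing left in the review queue
```

Before the expert steps in, a search on the preferred term misses the second document entirely. After one added variant, it is findable. That small correction loop is what a software-only pitch leaves out.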

Here’s what one indexing professional told me. I have to mask the name and company to avoid a hassle, but you will get the idea from this comment I captured:

Some companies such as a certain Paris-based company sell expensive software to clients and then leave. People don’t know what to do with it. So they have an expensive, difficult to implement natural language processing system which could work but is left hanging. The package from us is the whole thing. We are big on total service, follow up training, and getting people implemented and using it without our help, but we are there – just a phone call or email away to help and support them. The Paris-based company says companies like Access Innovations are not a natural language processing system, and although we do have the natural language processing, we don’t make people pay for it separately. With most systems, rules are often needed to achieve more than “good enough” tagging. Access Innovations, a specialist able to generate ANSI compliant term lists, delivers 85 to 90 percent accuracy. The Paris-based outfit delivers far lower accuracy. Clients don’t understand the issues with low accuracy tagging, findability, and long term system usability.

So What?

What we have, gentle reader, is an example of the automation crowd glossing over the need for human-intermediation solutions. What disturbs me is that the chatter about taxonomy in boot camps, companies which are coming from left field, and self appointed experts is putting the spotlight on indexing and classifying content.

That’s a plus.

The downside is that when the indexing goes off the rails, the user may not be able to find the needed information. That’s why companies like Digital Reasoning and Access Innovations have the ability to deliver automation plus human-intermediated interactions. The licensee suffers when automation goes wrong. The users suffer. The search system vendor may be blamed. Beware the taxonomy vendor spouting glittering generalities about smart software. Usually the “spout” dispenses tainted outputs.

Bottom line: I avoid vendors who present to me the “one true way.” This approach may work when preparing foie gras. For some taxonomy vendors hungry for cash, the traditional, labor intensive methods get in the way of making a quick sale. Unfortunately when humans create language, more traditional methods are often completely appropriate for mission critical indexing tasks. Honk!

Stephen E Arnold, December 12, 2011


One Response to “Taxonomy: More Marketing Craziness in Play?”

  1. Daniel Mayer on December 15th, 2011 7:23 pm


    Thank you for mentioning TEMIS. Actually, we do share the opinion that humans have a big role to play in semantic content enrichment. Read all about it here :

    Best regards
    Daniel Mayer – TEMIS
