Dow Jones and Automatic Taxonomy Generation

September 30, 2008

An eager beaver reader (I only have two or three) sent me a link to “Taxonomies for Human Vs Auto-Indexing.” The author of the Synaptica Central write up is Wendy Lim. She is summarizing or reproducing information attributed to Heather Hedden. From a bibliographic angle, I think a tad more work could be done to make clear who was writing what, where, and when. But that’s an old, failed database goose quacking about the brilliant work done by “experts” decades younger than I. Quack. Quack.

You can read the September 26, 2008, write up here. The article is about a Taxonomy Bootcamp. After a bit of sleuthing, I discovered that this is an add-on to some Information Today trade shows. The bootcamp, as I understand it, is an intellectual Camp Lejeune, except that the attendees skip the push ups, the 5 am wake up calls, and the 20 mile runs. Over a period of two or three days, taxonomy recruits emerge battle ready, honed to deal with the intellectual rigors of creating taxonomies.

A real taxonomy. Source: www.nnf.org.na

The word “taxonomy” is more popular than “enterprise search” and for good reason. Enterprise search has emerged from organizations with a bold 4F stamped on its fitness report. After hours, maybe months of work, and some hefty bills to pay, enterprise search customers are looking for a way to kill the enterprise search enemy. That’s where a taxonomy comes in. I’m no expert in taxonomies. I know I was involved in creating taxonomies for some once-hot commercial databases like ABI / INFORM, Business Dateline, General Business File, Health Reference Center, and the 1993 Web directory Point (Top 5% of the Internet). What those experiences taught me was that I don’t know too much about taxonomies or classification systems in general for that matter. I keep in touch with people who do know; for example, Marje Hlava at Access Innovations, Barbara Quint (Searcher Magazine), Marydee Ojala (Online Magazine), Ulla de Stricker (De Stricker & Associates), and other specialists. I get nervous when a 20- or 30-something explains that taxonomies are no big deal or that a business process can crack a taxonomy problem or a certain vendor’s software can auto-magically create a taxonomy.

A Synaptica Central tag cloud.

In my experience, the truth is not to be found in any one solution. In fact, the reality of taxonomies is that the concept has gained traction because of fundamental errors in planning and deploying information access systems. I don’t think a taxonomy can retrofit stupid, short-sighted decisions. For that reason, I steer clear of most taxonomy discussions because after working with these beasts for more than 30 years, I understand their unpredictable behavior.

A taxonomy is an intellectual construct. A good one does justice to the user, who can actually recognize the words and phrases and make sense of the set up. When I was in grade school, my teachers “taught” us how the Dewey Decimal system worked. We went to a physical library and had to use the card catalog, the signs in the library, and the reference desk to answer questions. When my son (who is now a Google super partner) was in grade school, I dragged him to the public library and the local university library and went through the same drill. A taxonomy provides one way to think about information, which can be a weird combination of fungible and intangible elements. I can still remember where in my university library the books about computer science were located. I don’t recall the Library of Congress catalog prefix, but I have the kinesthetic memory to go to the shelves like a homing pigeon.

Even today, a taxonomy is a way of arranging objects in groups according to the relationship of each to the others. To pull this off is a great deal of work. To make the task more complicated, humans are busy creating new fields of knowledge (entanglement, Zeta zeros, and social software). Old things get new names; e.g., group discussions become buzzgroups and digital whiteboarding. In short, to create a useful taxonomy one has to know the domain of the information, the history of the word choice in the domain, the existing classification systems, what taxonomies are available for the domain, the new developments taking place, and what you want to accomplish by creating a taxonomy for your specific constituency. I find the suggestion that one can learn how to accomplish this task in a couple of days or even a college class laughable. In fact, Betty Eddison, founder of InMagic, used to laugh and laugh heartily when some companies’ taxonomies were discussed by our team at the Courier Journal / Data Courier operation. Some of these taxonomies were pretty humorous 30 years ago and remain risible today.
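For readers who want the grouping idea in concrete form, here is a minimal Python sketch of two relationships a taxonomy has to track: which term sits under which broader term, and which newer names point back to an established term. Everything in it is an assumption made for illustration. The parent term "collaboration", the table names, and the two helper functions are hypothetical and do not come from any vendor's software or from the article under discussion.

```python
from typing import Dict, List, Optional

# Hypothetical hierarchy: each preferred term maps to its broader term.
# "collaboration" is an invented parent, used only for this illustration.
BROADER: Dict[str, Optional[str]] = {
    "collaboration": None,
    "group discussions": "collaboration",
    "social software": "collaboration",
}

# Newer or variant names that should resolve to an established term.
USE_FOR: Dict[str, str] = {
    "buzzgroups": "group discussions",
    "digital whiteboarding": "group discussions",
}


def preferred(term: str) -> Optional[str]:
    """Return the preferred term for a label, or None if it is unknown."""
    key = term.strip().lower()
    key = USE_FOR.get(key, key)
    return key if key in BROADER else None


def path_to_root(term: str) -> List[str]:
    """List the preferred term followed by each broader term above it."""
    node = preferred(term)
    path: List[str] = []
    while node is not None:
        path.append(node)
        node = BROADER[node]
    return path


if __name__ == "__main__":
    print(path_to_root("Buzzgroups"))
    print(path_to_root("entanglement"))
```

The second call returns an empty list because "entanglement" is not in either table. That is the hard part in miniature: the lookup is trivial, and deciding where a new field belongs, or whether it belongs at all, is the work that takes domain knowledge.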

Now back to Ms. Lim’s article “Taxonomies for Human Vs Auto-Indexing” by Heather Hedden. The write up makes some interesting distinctions; for example, “indexing” is something that is done by a human. I assume a librarian or professional indexer does this work. Most interesting is that the indexing is used for browsing. I must admit that when I “browse” I don’t use index terms. I browse using a software gizmo like Firefox or I wander through a library looking at books on the shelves. I obviously don’t get the new lingo, which is going to present a problem for me when I have to figure out what is meant by a taxonomy. Ms. Lim or Hedden (I’m not sure who is speaking to me) says, “Tagging can be done by anyone… These tags can then be used by a database.” I honestly don’t know what this means. Vivisimo lets anyone add a tag to a record and Vivisimo’s system uses that user-provided tag, but that’s software, not a database. Google takes a different approach with Knol. Taxonomy software imposes intellectual rigor on its trained users, and I have a report from one of my advisors that “tag” and “index” are synonyms for some taxonomy software vendors.

Okay, enough of that. I’m not happy. That’s my opinion. Feel free to agree or disagree.

The write up concludes with this statement:

She [Ms. Hedden, I presume] concluded with a short description of the additional tasks that an indexer would have to do in both human and auto-indexing. Both would require human intervention, its just that the tasks and extent of work is different. For human indexing, terms have to be checked and amended/added in if terms are omitted or misused. In the case of auto-indexing, the work is more focused on the training documents and adjustments of the rules.

I am not going to say, “I don’t know what this passage means.” I think I do. My concern about this type of write up in particular and the notion of a taxonomy boot camp in general contains these elements:

Presenting taxonomy as a concept and taxonomy as a task is potentially confusing. If the “someone” has already fouled up an enterprise search system, there is little chance that a taxonomy–however it is built–will refloat the boat.
The artificial distinctions among different types of indexing and the blurring of classification systems, taxonomies, and methods of creating these constructs confuse me, and I am supposed to know about these distinctions and the vendors. This intellectual rigor is akin to a Marine drill instructor teaching the recruits to hold the rifle by the barrel and to pull the trigger with a toe.
The Synaptica Central intent, which I am trying to dig from between the lines, is to say, “Hire Dow Jones / Factiva to do the work.” See, that’s easy to say, isn’t it?

Oh, now that I have walked through my notes on this article by Lim / Hedden, let me summarize a taxonomy boot camp this way: “Taxonomies are tricky. Use software to do some, maybe most of the work, then turn to experts and specialists to refine the taxonomy. After a shake down of a few days, maybe a month, revisit the taxonomy and make fixes. Lost in space? Look for a real expert–Hlava, Quint, et al.”

If you want to get fit, join the Marines. If you want a good taxonomy, don’t look for short cuts. Remember how fouled up enterprise search systems are. Well, a taxonomy done wrong won’t make the boo boo better.

Stephen Arnold, September 30, 2008

Written by Stephen E. Arnold · Filed Under Feature, Semantic, Text analytics, Text processing


Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.