Indexing: A Cautionary Example

November 17, 2015

I read “Half of World’s Museum Specimens Are Wrongly Labeled, Oxford University Finds.” Anyone involved in indexing knows the perils of assigning labels, tags, or what the whiz kids call metadata to an object.

Humans make mistakes. According to the write up:

As many as half of all natural history specimens held in some of the world’s greatest institutions are probably wrongly labeled, according to experts at Oxford University and the Royal Botanic Garden in Edinburgh. The confusion has arisen because even accomplished naturalists struggle to tell the difference between similar plants and insects. And with hundreds or thousands of specimens arriving at once, it can be too time-consuming to meticulously research each, and guesses have to be made.

Yikes. Only half. I know that human indexers get tired, and now there is just too much work to do. The reaction is typical of busy subject matter experts: just guess. Close enough for horseshoes.

What about machine indexing? Anyone who has retrained an HP Autonomy system knows that humans are involved there as well. If humans make mistakes with bugs and weeds, imagine what happens when a human has to figure out a blog post written in a dialect of Korean.

The brutal reality is that indexing is a problem. When dealing with humans, the problems do not go away. When humans interact with automated systems, the automated systems make mistakes, often more rapidly than the sorry human indexing professionals do.

What’s the point?

I would sum up the implication as:

Do not believe a human (indexing species or marketer of automated indexing species).

Acceptable indexing with accuracy above 85 percent is very difficult to achieve. Unfortunately, the graduates of a taxonomy boot camp or the entrepreneur flogging an automatic indexing system powered by artificial intelligence may not be reliable sources of information.

I know that this notion of high error rates is disappointing to those who believe their whizzy new system works like a champ.

Reality is often painful, particularly when indexing is involved.

What are the consequences? Here are three:

  1. Results of queries are incomplete or just wrong
  2. Users are unaware of missing information
  3. Failure to maintain human, human-assisted, or automated systems results in indexing drift. Eventually the indexing is misleading if not incorrect.
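The first two consequences can be illustrated with a small, hypothetical sketch (the records, tags, and search function below are invented for illustration, not drawn from any real system): a record indexed under the wrong tag simply never appears in the result set, and the query returns no hint that anything is missing.

```python
# Hypothetical tag-based search over a tiny catalog.
# Record 3 is actually a fern but was mislabeled "moss".

records = [
    {"id": 1, "tags": ["fern"]},   # correctly labeled
    {"id": 2, "tags": ["fern"]},   # correctly labeled
    {"id": 3, "tags": ["moss"]},   # a fern, wrongly labeled
]

def search_by_tag(records, tag):
    """Return every record carrying the given tag."""
    return [r for r in records if tag in r["tags"]]

results = search_by_tag(records, "fern")
print([r["id"] for r in results])  # prints [1, 2] -- record 3 silently missing
```

The searcher sees two hits, has no way to know a third relevant record exists, and the system reports no error. That is consequences one and two in a dozen lines.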

How accurate is your firm’s indexing? How accurate is your own indexing?

Stephen E Arnold, November 17, 2015

