Indexing Craziness
March 15, 2010
I read “Folksonomy and Taxonomy – do you have to choose?,” which takes the position that a SharePoint administrator can use a formal controlled term list or just let the users slap their own terms into an index field. The buzzword for allowing users to index documents is part of a larger 20 something invention—folksonomy. The key segment for me in the SharePoint centric Jopx blog was:
The way that SharePoint 2010 supports the notion of promoting free tags into a managed taxonomy demonstrates that a folksonomy can be used as a source to define a taxonomy as well.
Let me try and save you a lot of grief. Indexing must be normalized. The idea is to use certain terms to retrieve documents with reasonable reliability. Humans who are not trained indexers do a lousy job of applying terms. Even professional indexers working in production settings fall into some well known ruts. For example, unless care is exercised in management and making the term list available, humans will work from memory. The result is indexing that is wrong about 15 percent of the time. Machine indexing when properly tuned can hit that rate. The problem is the that the person looking for information assumes that indexing is 100 percent accurate. It is not.
The idea behind controlled term lists is that these are logically consistent. When changes are made such as the addition of a term such as “webinar” as a related term to “seminar”, a method exists to keep the terms consistent and a system is in place to update the index terms for the corpus.
When there is a mix of indexing methods, the likelihood of having a mess is pretty high. The way around this problem is to throw an array of “related” links in front of the user and invite the user to click around. This approach to discovery entertains the clueless but leads to the potential for rat holes and wasted time.
Most organizations don’t have the appetite to create a controlled term list and keep it current. The result is the approach that is something I encounter frequently. I see a mix of these methods:
- A controlled term list from someplace (old Oracle or Convera term list, a version of the ABI/INFORM or some other commercial database controlled vocabulary, or something from a specialty vendor)
- User assigned terms; that is, uncontrolled terms. (This approach works when you have big data like Google but it is not so good when there are little data, which is how I would characterize most SharePoint installations.)
- Indexes based on parsing the content.
A user may enter a term such as “Smith purchase order” and get a bunch of extra work. Users are not too good at searching, and this patchwork of indexing terms ensures that some users will have to do the Easter egg drill; that is, look for the specific information needed. When it is located, some users like me make a note card and keep in handy. No more Easter egg hunts for that item for me.
What about third party SharePoint metadata generators? These generate metadata but they don’t solve the problem of normalizing index terms.
SharePoint and its touting of metadata as the solution to search woes are interesting. In my opinion, the approach implemented within SharePoint will make it more difficult for some users to find data, not easier. And, in my opinion, the resulting index term list will be a mess. What happens when a search engine uses these flawed index terms, the search results force the user to look for information the old fashioned way.
Stephen E Arnold, March 15, 2010
A free write up. No one paid me to write this article. I will report non payment to the SharePoint fans at the Department of Defense. Metadata works first time every time at the DoD I assume.