Exogenous Complexity 1: Search

January 31, 2012

I am now using the phrase “exogenous complexity” to describe systems, methods, processes, and procedures which are likely to fail due to outside factors. This initial post focuses on indexing, but I will extend the concept to other content centric applications in the future. Disagree with me? Use the comments section of this blog, please.

What is an outside factor?

Let’s think about value adding indexing, content enrichment, or metatagging. The idea is that unstructured text contains entities, facts, bound phrases, and other identifiable strings. A keyword search system is mostly blind to the meaning of a number in the form nnn nn nnnn, which in the United States is the pattern for a Social Security Number. There are similar patterns for Federal Express tracking numbers, financial account numbers, and other types of sequences. The idea is that a system will recognize these strings and tag them appropriately; for example:

nnn nn nnnn Social Security Number

Thus, a query for Social Security Numbers will return strings of digits matching the pattern. The same logic can be applied to named entities. With the help of a knowledge base, Bayesian numerical recipes, and other techniques such as synonym expansion, a system can determine that a query for Obama residence should return White House, or that a query for the White House should return links to the Obama residence.
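The tagging step described above can be sketched with a simple regular-expression tagger. This is a minimal illustration, not any vendor's actual pipeline; the FedEx pattern and the pattern names are assumptions for the sake of the example.

```python
import re

# Map tag names to the patterns they recognize. The nnn nn nnnn layout
# is the Social Security Number pattern from the text; the 12-digit
# FedEx tracking pattern is an illustrative assumption.
PATTERNS = {
    "Social Security Number": re.compile(r"\b\d{3} \d{2} \d{4}\b"),
    "FedEx Tracking Number": re.compile(r"\b\d{12}\b"),
}

def tag_entities(text):
    """Return (matched_string, tag) pairs found in unstructured text."""
    hits = []
    for tag, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.group(), tag))
    return hits

print(tag_entities("Applicant listed 078 05 1120 on the form."))
# [('078 05 1120', 'Social Security Number')]
```

A real content processing system layers dictionaries, statistical models, and disambiguation on top of this, but the brittleness is already visible: write the number as 078-05-1120 and this pattern silently misses it.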

One wishes that value added indexing systems were as predictable as a kabuki drama. What vendors of next generation content processing systems participate in is a kabuki which leads to failure two thirds of the time. A tragedy? It depends on whom one asks.

The problem is that companies offering automated solutions to value adding indexing, content enrichment, or metatagging are likely to fail for three reasons:

First, there is the issue of humans who use language in unexpected or what some poets call “fresh” or “metaphoric” ways. English is synthetic in that any string of sounds can be used in quite unexpected ways. Whether it is the use of the name of the fruit “mango” as a code name for software or the conversion of a noun like information into a verb like informationize, which appears in Japanese government English language documents, the automated system may miss the boat. When the boat is missed, continued iterations try to arrive at the correct linkage, but as anyone who has used fully automated systems, or who paid attention in math class, knows, recovery from an initial error can be time consuming and sometimes difficult. Therefore, an automated system, no matter how clever, may find itself fooled by the stream of content flowing through its content processing work flow. The user pays the price because false drops mean more work and suggestions which are not just off the mark but difficult for a human to figure out. You can get the inside dope on why poor suggestions are an issue in Thinking, Fast and Slow.

Second, there is the quite real problem of figuring out the meaning of short, mostly context free snippets of text. These can be internal social postings such as those supported by Salesforce.com or the millions of messages dumped into Facebook, Twitter, and other social media systems. Automation can pull geocodes, perform look ups in knowledge bases, and flag messages with a common word or phrase. But for most of the systems, keeping up with the throughput is a big problem. Most of the automated indexing outfits talk about value added processing and real time data, but few are able to deliver. In fact, when it comes to integrating large data flows into a system such as Microsoft SharePoint, compromises are made even if these are not fully disclosed or explained to the licensee. As a result, making decisions on subsets of “big data” leads to interesting indexing issues and possibly to decisions that are essentially laughable. The “data” are nearly useless. Once again, you have to dig out your math books from college and check how large a sample must be analyzed to have a meaningful confidence level in any output from a numerical recipe.
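The sample-size point can be made concrete with the standard formula for estimating a proportion, n = z²·p(1−p)/E², where z comes from the desired confidence level and E is the margin of error. A quick sketch, with the 95 percent z-value and the illustrative parameters as assumptions:

```python
import math

def required_sample_size(p, margin, z=1.96):
    """Sample size needed to estimate a proportion p to within +/- margin
    at the confidence level implied by z (1.96 is roughly 95 percent)."""
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

# Worst case p = 0.5, +/- 3 percent margin, 95 percent confidence:
print(required_sample_size(0.5, 0.03))  # 1068
```

The point is not the exact number; it is that a "subset of big data" chosen for throughput reasons, rather than by a calculation like this, gives outputs with no stated confidence at all.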

Third, there is the issue of people and companies desperate for a solution or desperate for revenue. The coin has two sides. Individuals who are looking for a silver bullet find vendors who promise not just one silver bullet but an ammunition belt stuffed with rounds. The buyers and vendors act out a digital kabuki. Those involved in the deal know the outcome, but the play is the thing. The actual system rarely works, costs more than anticipated, and sets the stage for another round of content processing craziness. The drama engulfs the second string consulting firms looking for a quick consulting buck, the blog experts who explain what went wrong and why, and the coders who suggest fixes, work arounds, and solutions. See this interesting post at Quora.

The exogenous complexity, then, arises when a system which works in a controlled situation is put into the real world. When I describe a vendor’s system as subject to exogenous complexity, I am suggesting:

  1. Budget additional funds for either a complete rebuild or a massive, emergency room crisis assault on the patient. Saving a life or a system costs a great deal in time and resources.
  2. Be prepared to fail. I know that one must be optimistic. I am okay with a positive outlook, but I am even more satisfied when those buying a system which has a verifiable probability of failing in two thirds of its installs are pragmatic.
  3. Recognize that in processing unstructured content, there are problems which no software or human centric system can solve. Even human indexers are lucky if they can deliver 90 percent accuracy in tagging. Software does not do as well as humans, a fact many vendors do not explain because the “accuracy” delivered is usually what one would get from a dull normal human.
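The 90 percent figure raises the measurement question: tagging quality is conventionally scored as precision and recall against a human-built gold standard. A minimal sketch, with the sample tags invented purely for illustration:

```python
def precision_recall(predicted, gold):
    """Score a set of predicted tags against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

gold = {"finance", "ssn", "obama", "white house"}      # human tags
predicted = {"finance", "ssn", "mango"}                # system tags
print(precision_recall(predicted, gold))
# (0.6666666666666666, 0.5)
```

Numbers like these, computed on a licensee's own content rather than a vendor's demo corpus, are what the pragmatic buyer in point two should be asking for.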

So, when I use the phrase exogenous complexity I am embracing the messy and uncontrollable aspects of language, software, and human behavior. In short, exogenous complexity means trouble ahead. You are free to define the bound phrase any way you want. I just try to steer clear. I try to cover some of these issues in SharePoint Semantics and Inteltrax.

Stephen E Arnold, January 31, 2012

Sponsored by Pandia.com


3 Responses to “Exogenous Complexity 1: Search”

  1. Search and Exogenous Complexity – (inside vs. outside?) « Another Word For It on January 31st, 2012 4:38 pm

    […] Search and Exogenous Complexity […]

  2. Patrick Durusau on February 1st, 2012 12:38 pm

    I have penned a somewhat lengthy response at: http://tm.durusau.net/?p=21242

    In summary form:

    I am not sure the inside/outside metaphor is all that helpful. Semantic issues exist and/or are aggravated by factors “inside” projects.

    Not to mention that poor search design, the vacuum approach to data, is deeply flawed and useful only for some purposes.

    Finally, I think at least some of the issues you raise (which I agree with more than my post may seem to say) can be dealt with by good requirements practices so that the parties share expectations that can be measured as the project progresses.


  3. NASA and Technical Information Search : Beyond Search on February 2nd, 2012 10:13 am

    […] NASA makes informed decisions, not choices based on budget limitations, expediency, or overlooking exogenous factors such as […]
