Nosing Beyond Machine Learning from Human-Curated Data Sets: Autonomy 1996 to Smart Software 2019

April 24, 2019

How does one teach a smart indexing system like Autonomy’s 1996 “neurodynamic” system?* Subject matter experts (SMEs) assembled a training collection of textual information. The articles and other content would replicate the characteristics of the content the Autonomy system would process; that is, index and make searchable or analyzable. The work was important. Get the training data wrong, and the indexing system would assign metadata or “index terms” and “category names” that could cause a query to generate results the user could perceive as incorrect.


How would a licensee adjust the Autonomy “black box”? (Think of my reference to Autonomy and search as a way of approaching “smart software” and “artificial intelligence.”)

The method was to perform re-training. The approach was practical, and for most content domains, the re-training worked. It was an iterative process. Because the words in the corpus fed into the “black box” included new words, concepts, bound phrases, entities, and key sequences, several functions were integrated into the basic Autonomy system as it matured. Examples ranged from support for term lists (controlled vocabularies) to dictionaries.
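To make the mechanics concrete, here is a toy sketch of Bayesian category assignment with re-training. The class name, corpus, and categories are my inventions for illustration; this is not Autonomy’s IDOL internals, just the generic naive Bayes idea underneath such systems:

```python
import math
from collections import Counter, defaultdict

class TinyBayesIndexer:
    """Toy Bayesian categorizer: train on SME-labeled documents,
    re-train by feeding in fresh labeled documents later."""

    def __init__(self):
        self.docs_per_cat = Counter()            # documents seen per category
        self.term_counts = defaultdict(Counter)  # term frequencies per category
        self.vocab = set()

    def train(self, documents):
        # Each document is (category, text). Re-training is simply
        # calling this again with a fresh SME-curated batch.
        for category, text in documents:
            self.docs_per_cat[category] += 1
            for term in text.lower().split():
                self.term_counts[category][term] += 1
                self.vocab.add(term)

    def categorize(self, text):
        total_docs = sum(self.docs_per_cat.values())
        scores = {}
        for cat, n_docs in self.docs_per_cat.items():
            score = math.log(n_docs / total_docs)  # log prior
            cat_total = sum(self.term_counts[cat].values())
            for term in text.lower().split():
                # Laplace smoothing keeps unseen terms from zeroing a category.
                count = self.term_counts[cat][term] + 1
                score += math.log(count / (cat_total + len(self.vocab)))
            scores[cat] = score
        return max(scores, key=scores.get)

indexer = TinyBayesIndexer()
indexer.train([
    ("finance", "shares market earnings report"),
    ("science", "neural network training data"),
])
print(indexer.categorize("quarterly earnings and market shares"))  # finance
# Re-training: feed a fresh batch so new terms ("ipo", "valuation") are covered.
indexer.train([("finance", "ipo valuation quarterly filings")])
```

The point of the sketch is the workflow, not the math: the statistical model only knows what its training batches contain, which is why the re-training loop never ends.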

The combination of re-training and external content available to the system allowed Autonomy to deliver useful outputs.

Why real-world results departed from optimal results usually boiled down to several factors, often working in concert. First, licensees did not want to pay for re-training. Second, maintenance of the external dictionaries was necessary because new entities arrive with reasonable frequency. Third, testing and organizing the refreshed training sets and the editorial work required to keep dictionaries shipshape were too expensive, time consuming, and tedious.
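The dictionary maintenance burden is easy to illustrate with a hypothetical entity dictionary and a trivial tagger. The entries below are my own examples; the operational point is that every new company, product, or person needs an editor to add a row, or the system silently misses it:

```python
# Hypothetical external entity dictionary. In a real deployment this
# file needs constant editorial upkeep as new entities appear.
ENTITY_DICTIONARY = {
    "autonomy": "COMPANY",
    "idol": "PRODUCT",
    "haystax": "COMPANY",
}

def tag_entities(text, dictionary):
    """Attach entity-type metadata to known terms.
    Unknown terms pass through untagged -- the silent failure mode."""
    tags = []
    for token in text.lower().split():
        if token in dictionary:
            tags.append((token, dictionary[token]))
    return tags

print(tag_entities("Autonomy shipped IDOL", ENTITY_DICTIONARY))
# [('autonomy', 'COMPANY'), ('idol', 'PRODUCT')]
```

A query about an entity missing from the dictionary returns nothing, and the user blames the search system, not the stale dictionary.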

Not surprisingly, some licensees grew unhappy with their Autonomy IDOL (integrated data operating layer) system. That, in my opinion, was not Autonomy’s fault. Autonomy explained in the presentations I heard what was required to get a system up and running and outputting results that could easily hit 80 percent or higher on precision and recall tests.
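For readers who have not run precision and recall tests, the measures are simple to compute. This is the standard information retrieval definition, with made-up document identifiers to show an 80 percent outcome:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Query returns five documents; judges say five are relevant; four overlap.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4", "d5"],
    relevant=["d1", "d2", "d3", "d4", "d6"],
)
print(p, r)  # 0.8 0.8
```

Hitting 80 percent on both measures means one retrieved document in five is junk and one relevant document in five is never shown, which is why the figure is useful but not magical.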

The Autonomy approach is widely used. In fact, wherever there is a Bayesian system in use, there are demands for training, re-training, and external knowledge bases. I just took a look at Haystax Constellation. It’s Bayesian, and Haystax makes it clear that the “model” has to be trained. So what’s changed between 1996 and 2019 with regard to Bayesian methods?

Nothing. Zip. Zero.

There are other approaches to indexing, and, based on my research, from the Enterprise Search Report to my most recent system reviews in the Dark Web Notebook, Bayesian and nine or ten other methods are used in most modern systems. In fact, these systems are more alike than different. The companies developing these systems go to great lengths to add point-and-click interfaces that reduce user drudgery. But under the hood, there is quite a bit of similarity.

I thought about these factual realities when I read “Machine Teaching: How People’s Expertise Makes AI Even More Powerful.” The idea is that a person with expertise provides input to a smart (artificial intelligence?) software system. Err, excuse me, but that is how IBM Watson was trained for one of its medical solutions. Cancer experts sat around a table with IBM engineers who asked questions, made notes, and generated observations about how to get the expertise of the human doctors into the Watson system. The doctors balked, and in the end, Watson was kicked out of Houston and New York, at least for cancer diagnoses and treatment recommendations.

Here’s a passage I noted from the article:

Machine teaching seeks to gain knowledge from people rather than extracting knowledge from data alone. A person who understands the task at hand—whether how to decide which department in a company should receive an incoming email or how to automatically position wind turbines to generate more energy—would first decompose that problem into smaller parts. Then they would provide a limited number of examples, or the equivalent of lesson plans, to help the machine learning algorithms solve it.

What struck me is that information about training smart systems has not made much progress in the last 30 years. I don’t fault the young wizards for this failure.

The 80 to 85 percent accuracy is quite useful. The lack of progress makes these things clear to me:

  1. The past is not just forgotten; it is ignored. That leads to inefficiency and rediscovering information highways that have been bulldozed, paved, and used by data truckers for many years.
  2. The present approaches are not substantively different from the procedures kept secret by the British government after World War II and effectively productized by Mike Lynch, who founded Autonomy. Technical progress has stalled in Bayes’ grave.
  3. The lack of progress and the dismal sameness of approaches to adding metadata to content mean that innovation has failed. Yes, cross-correlating social media with some content yields payoffs, but for the majority of text, images, audio recordings or intercepts, and videos, generating meaningful context and applying those contextual cues to a document or a video is just not available to most markets.

What’s the fix?

It is easier for me to identify what the fix is “not” than it is to explain how to solve the problem of processing human generated content objects so they can be searched, retrieved, and analyzed. I just wanted to point out to the current crop of wizards that I am getting tired of reading about new approaches which are little more than the same old, same old.

By the way, experts:

  • Disagree about some things
  • Fail to keep up with their discipline and distribute false information. (Think back to the college professor who told you that Robert Burns’ poem about the red, red rose was “about” a flower. Wrong, wrong, wrong.)
  • Have biases because they take money from outfits like pharma companies or venture firms to beat a certain rhythm on their drums.
  • Get impatient and move on to other tasks. (Remember being told by a teacher, “Hurry up and learn it already”?)
  • Refuse to tell the whole story. Knowledge is power and money, a good truism to keep in mind when talking about humanoid experts.

A blend of training sets and human experts. How expensive is that? Let’s ask a neural net, or query ad-supported Google. Yeah, let’s.

Stephen E Arnold, April 24, 2019

* I sold this outfit stuff in the past.
