Traditional Entity Extraction’s Six Weaknesses

September 26, 2011

Editor’s Note: This is an article written by Tim Estes, founder of Digital Reasoning, one of the world’s leading providers of technology for entity-based analytics. You can learn more about Digital Reasoning at www.digitalreasoning.com.

Most university programming courses ignore entity extraction. Some professors talk about the challenges of identifying people, places, things, events, and Social Security numbers, and leave the rest to the students. Other professors may assign a project on parsing text and detecting anomalies or bound phrases. But most of those emerging with a degree in computer science consign the challenge of entity extraction to the Miscellaneous file.

Entity extraction means processing text to identify, tag, and properly account for the elements that name people, organizations, and locations, as well as numbers and expressions such as telephone numbers, among other items. An entity can consist of a single word like Cher or a bound sequence of words like White House. The challenge of figuring out names is a tough one for several reasons. Many names exist in richly varied forms. You can find interesting naming conventions in street addresses in Madrid, Spain, and in the name of the owner of a falafel shop in Tripoli.
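To make the task concrete, here is a minimal sketch of the traditional rule-based approach: a small gazetteer of known names plus a regular expression for one structured entity type. The entries and the phone-number pattern are illustrative assumptions, not any particular vendor's method.

```python
import re

# Illustrative gazetteer: a tiny lookup table of known entity names.
GAZETTEER = {
    "Cher": "PERSON",
    "White House": "ORGANIZATION",  # a bound sequence of words
    "Tripoli": "LOCATION",
}
# One structured entity type handled by a pattern (assumed US-style number).
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def tag_entities(text):
    """Return (entity, type) pairs found in the text."""
    hits = [(name, etype) for name, etype in GAZETTEER.items() if name in text]
    hits += [(m.group(), "PHONE") for m in PHONE_RE.finditer(text)]
    return hits

print(tag_entities("Cher called the White House at 555-867-5309."))
# [('Cher', 'PERSON'), ('White House', 'ORGANIZATION'), ('555-867-5309', 'PHONE')]
```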

Entities, as information retrieval experts have learned since the first DARPA conference on the subject in 1987, are quite important to certain types of content analysis. Digital Reasoning has been working for more than 11 years on entity extraction and related content processing problems. Entity-oriented analytics have become very important as companies deal with too much data, the need to understand the meaning and not just the statistics of the data, and, finally, the need to understand entities in context, which is critical to understanding code terms and similar usage.

I want to describe six weaknesses of traditional entity extraction and explain how Digital Reasoning’s patented, fully automated method addresses them. Let’s look at the weaknesses.

1 Prior Knowledge

Traditional entity extraction systems assume that the system will “know” about the entities. This information is obtained through training or specialized knowledge bases. The idea is that the system first processes content similar to what it will see when fully operational. When the system locates an entity, or a human “helps” it locate one, the software “remembers” that entity. In effect, entity extraction assumes that the system either has a list of entities to identify and tag or that a human will interact with various parsing methods to “teach” the system about the entities. The obvious problem is that when a new entity appears and is mentioned only once, the system may not identify it.
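A toy illustration of the weakness, with an assumed three-entry entity list standing in for the training data or knowledge base:

```python
# Illustrative only: a lookup-based tagger can return only entities it
# already "knows" from training or a knowledge base.
KNOWN_ENTITIES = {"Cher", "White House", "Tripoli"}

def extract(text):
    return [e for e in KNOWN_ENTITIES if e in text]

# A brand-new name mentioned a single time is simply invisible:
print(extract("Synthesys was announced this week."))  # -> []
```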

2 Human Inputs

I have already mentioned the need for a human to interact with the system. The approach is widely used, even in the sophisticated systems associated with firms such as Hewlett Packard Autonomy and Microsoft Fast Search. The problem with relying on humans is one of time and cost. As the volume of data to be processed goes up, more human time is needed to make sure the system is identifying and tagging correctly. In our era of data doubling every four months, the cost of coping with massive data flows makes human-intermediated entity identification impractical.

3 Slow Throughput

Most content processing systems talk about high performance, scalability, and massively parallel computing. The reality is that most of the subsystems required to manipulate content for the purpose of identifying, tagging, and performing other operations on entities are bottlenecks. What is the solution? Most vendors of entity extraction solutions push the problem back to the client. Most information technology managers solve performance problems by adding hardware to either an on-premises or cloud-based solution. The problem is that adding hardware is, at best, a temporary fix. In the present era of big data, content volume will increase, and the appetite for adding hardware lessens in a business climate characterized by financial constraints. Not surprisingly, entity extraction systems are often “turned off” because the client cannot afford the infrastructure required to deal with the volume of data to be processed. A great system that is too expensive to run undermines the analytic process.
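A toy sketch of why “add hardware” is only a temporary fix: parallelizing a per-document tagger (the eight-worker pool and the trivial tagger below are assumptions for illustration) buys, at best, a speedup linear in the number of workers, while content volume keeps compounding.

```python
from multiprocessing import Pool

def tag_document(doc):
    # Stand-in for an expensive per-document entity extraction pass.
    return [tok for tok in doc.split() if tok.istitle()]

docs = ["Cher visited Tripoli.", "The White House issued a statement."] * 100_000

if __name__ == "__main__":
    # More workers (hardware) give at most a linear speedup; if data
    # volume doubles faster than the hardware budget, the gap only grows.
    with Pool(processes=8) as pool:
        tagged = pool.map(tag_document, docs, chunksize=1_000)
    print(len(tagged))  # 200000 documents processed
```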

4 Content Exceptions

Traditional entity extraction programs have been built for a world view that often begins at Route 128 in Boston, Massachusetts, and ends at the exit from Highway 101 to Palo Alto, California. Today’s world is vastly different. Content comes in hundreds of languages and dialect variants. New content types, such as Twitter’s 140-character messages or social content that derives meaning from a quite narrow context, force entity extraction systems to solve crossword-type puzzles for which the systems were not designed. When complex names or aliases are added to the mix, the entity extraction system clogs with “exceptions.” These are files written when the traditional entity extraction system does not know how to handle a particular document. If the system is processing a handful of Twitter or Facebook documents a day, the problem may be manageable. Scale to hundreds of millions of documents, and the exceptions break the functionality of the system.
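The exception-file pattern can be sketched in a few lines; the ASCII-only rule below is a deliberately crude stand-in for whatever content a real system’s rules cannot parse.

```python
def parse_document(doc):
    # Crude stand-in: rules that only understand plain-ASCII prose.
    if not doc.isascii():
        raise ValueError("unsupported content")
    return [tok for tok in doc.split() if tok.istitle()]

def process_stream(docs):
    tagged, exceptions = [], []
    for doc in docs:
        try:
            tagged.append(parse_document(doc))
        except ValueError:
            # The traditional remedy: park the document in an "exception
            # file" for later human review. Manageable at a handful a day;
            # fatal at hundreds of millions of documents.
            exceptions.append(doc)
    return tagged, exceptions

tagged, exceptions = process_stream(["Cher in Tripoli", "حفلة في طرابلس"])
print(len(tagged), len(exceptions))  # 1 1
```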

5 Sense Making

Once one has identified, tagged, and stored entities, the problem shifts to sense making. The idea is that a table of products or proper names is of little use unless one can relate the entities to one another and to times, events, places, and other significant items of information. Traditional entity extraction systems can count and perform the types of statistical manipulations taught in most introductory university statistics classes. The problem is that mathematics has advanced beyond these introductory methods. When one dabbles in the lambda calculus or the set theory behind mereology, traditional entity extraction systems remain stuck in the “count and make a graph” landscape. With large volumes of data, mathematical methods are needed to identify trends within related groups of entities, changes to an entity, and relationships through time among entities the system has identified as “of interest.” In the era of big data, sense making cannot be handled by a human analyst. There is simply too much data, and a pie chart won’t do the job of finding meaningful relationships.
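As a concrete contrast to “count and make a graph,” here is one small step toward relationships through time: counting pairwise co-occurrences of entities per time bucket. The documents, names, and time buckets are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy input: (month, entities mentioned together in one document).
docs = [
    ("2011-08", ["Acme Corp", "J. Smith"]),
    ("2011-09", ["Acme Corp", "J. Smith"]),
    ("2011-09", ["Acme Corp", "Tripoli"]),
]

# Pairwise co-occurrence per time bucket: one small step beyond raw
# entity counts toward relationships among entities through time.
pair_counts = Counter()
for month, entities in docs:
    for a, b in combinations(sorted(entities), 2):
        pair_counts[(month, a, b)] += 1

for (month, a, b), n in sorted(pair_counts.items()):
    print(f"{month}: {a} <-> {b} (x{n})")
```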

6 Communication

Traditional entity extraction systems work quite well on certain types of content. However, in an era when a Facebook post by a person who may want to do harm to young adults in Norway contains high-value information, no analyst is going to be able to spot that outlier, and the traditional systems are mute. In fact, the difficulty of getting actionable information out of traditional entity extraction systems is perhaps their greatest flaw.

So what’s the remedy?

I founded Digital Reasoning to deliver a number of sophisticated content processing, search, and analytic functions. You can get more information about our system, method, and products at www.digitalreasoning.com. I want to highlight where we are with regard to these six weaknesses found in many entity extraction products:

1. Prior Knowledge. Synthesys®, the patented Digital Reasoning method, is fully automatic. A licensee can push content through the system, and entities are identified, tagged, and made available to our analytic engine. That’s why we offer entity-oriented analytics as a core operation.

2. Human Inputs. These are not needed. If they are available, Synthesys can make use of provided inputs, but our tests reveal very high accuracy, on par with trained approaches, without analysts having to interact with the system.

3. Slow Throughput. Synthesys is parallelized and leverages Hadoop and other cloud-oriented technologies. Synthesys is available as an on-premises or cloud-based solution. There is no performance issue with our next-generation architecture.

4. Content Exceptions. With Synthesys, content exceptions are rare. Synthesys uses advanced algorithmic and proprietary methods to significantly reduce the content requiring special handling. The system works independently of the input language.

5. Sense Making. Synthesys leverages our patented algorithms and methods to understand entities in context during natural language processing. By understanding entities in context, Synthesys can automate the process of finding connections and associated terms even in “dirty” transcriptions or coded language.

6. Communication. Synthesys is verbose, not in terms of the size of its reports but in terms of reporting on meaningful events. Synthesys provides outputs that situate entities in context and time. When an important correlation is discovered, the system can “alert” other processes or a human, depending on the level of integration with the workflow solutions used; a generic version of this alerting pattern is sketched below.
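The alerting pattern mentioned in item 6 might look like the following generic sketch. The function, threshold, and score are hypothetical illustrations, not Synthesys’s actual interfaces.

```python
# Hypothetical alerting hook: push important correlations to a downstream
# process or a human, depending on workflow integration.
ALERT_THRESHOLD = 0.9  # assumed cutoff for "important"

def on_correlation(entity_a, entity_b, score, notify):
    if score >= ALERT_THRESHOLD:
        notify(f"High-value correlation: {entity_a} <-> {entity_b} ({score:.2f})")

# `notify` could be a message queue, a dashboard, or, here, the console.
on_correlation("Account 4411", "Shell Corp", 0.95, notify=print)
```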

Synthesys includes comprehensive application programming interfaces so it can be “snapped into” existing content processing systems. In short, Synthesys avoids the problems of traditional content processing systems and delivers automated natural language processing without taxonomies. Digital Reasoning builds upon a decade of experience and uses real data and a unique algorithmic approach to understanding data at large scale. Synthesys is built to be the hub of the entity-oriented enterprise.
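As an illustration of what “snapping in” through an API can look like, here is a hypothetical integration stub. SynthesysClient, the endpoint URL, and extract_entities are invented names for this sketch; the real interfaces are documented by Digital Reasoning.

```python
# Hypothetical integration sketch; all names here are invented, not the
# actual Synthesys API.
class SynthesysClient:
    def __init__(self, endpoint):
        self.endpoint = endpoint

    def extract_entities(self, document):
        # A real client would send `document` to the deployed service;
        # this stub returns an empty result so the sketch runs standalone.
        return []

client = SynthesysClient("https://synthesys.example.internal")
print(client.extract_entities("Wire transfer from Account 4411."))  # []
```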

Tim Estes, Founder and Chief Executive Officer, Digital Reasoning

September 26, 2011
