AeroText: A New Breakthrough in Entity Extraction

June 30, 2014

I returned from a brief visit to Europe to an email asking about Rocket Software’s breakthrough technology AeroText. I poked around in my archive and found a handful of nuggets about the General Electric Laboratories’ technology that migrated to Martin Marietta, then to Lockheed Martin, and finally in 2008 to the low-profile Rocket Software, an IBM partner.

When did the text extraction software emerge? Is Rocket Software AeroText a “new kid on the block”? The short answer is that AeroText is pushing 30, maybe 35 years young.

Digging into My Archive of Search Info

As far as my archive goes, it looks as though the roots of AeroText are anchored in the 1980s. Yep, that works out to an innovation about the same age as the long-in-the-tooth ISYS Search system, now owned by Lexmark. Over the years, the AeroText “product” has evolved, often in response to US government funding opportunities. The precursor to AeroText was an academic exercise at General Electric. Keep in mind that GE makes jet engines, so GE at one time had a keen interest in anything its aerospace customers in the US government thought was a hot tamale.

The AeroText interface circa the mid-2000s. On the left is the extraction window. On the right is the document window. From “Information Extraction Tools: Deciphering Human Language,” IT Pro, November/December 2004, page 28.

The GE project, according to my notes, appeared as NLToolset, although my files contained references to different descriptions such as Shogun. GE’s team of academics and “real” employees developed a bundle of tools for its aerospace activities and in response to Tipster. (As a side note, in 2001, there were a number of Tipster related documents in the www.firstgov.gov system. But the new www.usa.gov index does not include that information. You will have to do your own searching to unearth these text processing jump start documents.)

The aerospace connection is important because the Department of Defense in the 1980s was trying to standardize on markup for documents. Part of this effort was processing content like technical manuals and various types of unstructured content to figure out who was named, what part was what, and what people, places, events, and things were mentioned in digital content. The utility of NLToolset type software was for cost reduction associated with documents and the intelligence value of processed information.

The need for a markup system that worked without 100 percent human indexing was important. GE got with the program and appears to have assigned some then-young folks to the project. The government speak for this type of content processing involves terms like “message understanding” or MU, “entity extraction,” and “relationship mapping.” The outputs of an NLToolset system were intended for use in other software subsystems that could count, process, and perform other operations on the tagged content. Today, this class of software would be packaged under a broad term like “text mining.” GE exited the business, which ended up in the hands of Martin Marietta. When the technology landed at Martin Marietta, the suite of tools was used in what was called, in the late 1980s and early 1990s, the Louella Parsing System. When Lockheed and Martin Marietta merged to form the giant Lockheed Martin, Louella was renamed AeroText.

Over the years, the AeroText system competed with LingPipe, SRA’s NetOwl, and Inxight’s tools. In the heyday of natural language processing, there were dozens and dozens of universities and start-ups competing for Federal funding. I have mentioned in other articles the importance of the US government in jump starting the craziness in search and content processing.

In 2005, I recall that Lockheed Martin released AeroText 5.1 for Linux, but I have lost track of the open source versions of the system. The point is that AeroText is not particularly new, and as far as I know, the last major upgrade took place in 2007 before Lockheed Martin sold the property to Rocket Software. At the time of the sale, AeroText incorporated a number of subsystems, including a useful time plotting feature. A user could see tagged events on a timeline, a function long associated with the original version of i2’s Analyst’s Notebook. A US government buyer can obtain AeroText via the GSA because Lockheed Martin seems to be a reseller of the technology. Before the sale to Rocket, Lockheed Martin followed SAIC’s push into Australia. Lockheed signed up NetMap Analytics to handle Australia’s appetite for US government accepted systems.

AeroText Functionality

What does AeroText purport to do that caused the person who contacted me to see a 1980s technology as the next best thing to sliced bread?

AeroText is an extraction tool; that is, it has capabilities to identify and tag entities at somewhere between 50 percent and 80 percent accuracy. (See NIST 2007 Automatic Content Extraction Evaluation Official Results for more detail.)
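To give a feel for what an accuracy figure like that means, here is a minimal Python sketch that compares a system's tagged entities with a human-built answer key and reports precision, recall, and F1. This is a generic illustration and not the NIST scoring method; the function name `score` and the sample tags are my own assumptions.

```python
def score(predicted, answer_key):
    """Compare predicted (text, type) tags with a human answer key.

    Hypothetical helper for illustration; not AeroText's or NIST's scorer.
    """
    predicted, answer_key = set(predicted), set(answer_key)
    correct = len(predicted & answer_key)          # tags that match exactly
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(answer_key) if answer_key else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented sample data: the system mislabels one entity and misses another.
system_output = [
    ("Rocket Software", "ORGANIZATION"),
    ("Lockheed Martin", "ORGANIZATION"),
    ("Europe", "ORGANIZATION"),
]
answer_key = [
    ("Rocket Software", "ORGANIZATION"),
    ("Lockheed Martin", "ORGANIZATION"),
    ("Europe", "LOCATION"),
    ("IBM", "ORGANIZATION"),
]

precision, recall, f1 = score(system_output, answer_key)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A system can look good on one number and poor on the other, which is one reason a single "accuracy" range should be read with care.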

The AeroText approach uses knowledgebases, rules, and patterns to identify and tag pre-specified types of information. AeroText references patterns and templates, both of which assume the licensee knows beforehand what is needed and what will happen to processed content.
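A rough Python sketch of the knowledgebase-plus-pattern idea follows. It is not AeroText's rule language or API, which I do not have in front of me; the `KNOWLEDGEBASE` and `PATTERNS` tables and the `tag_entities` function are hypothetical names made up for illustration.

```python
import re

# Hypothetical knowledgebase: entity type -> names the licensee already knows.
KNOWLEDGEBASE = {
    "ORGANIZATION": ["Rocket Software", "Lockheed Martin"],
}

# Hypothetical pattern rules: entity type -> regular expression.
PATTERNS = {
    "DATE": re.compile(r"\b(?:19|20)\d{2}\b"),  # four-digit years only
}


def tag_entities(text):
    """Return (start, end, type, surface) tuples sorted by position."""
    found = []
    for entity_type, names in KNOWLEDGEBASE.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                found.append((match.start(), match.end(), entity_type, match.group()))
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), entity_type, match.group()))
    return sorted(found)


sample = "In 2008 Rocket Software acquired AeroText from Lockheed Martin."
for start, end, entity_type, surface in tag_entities(sample):
    print(entity_type, surface)
```

Note that a sentence about an organization missing from the knowledgebase comes back with no tags at all. Nothing is found unless someone specified it beforehand.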

In my view, the licensee has to know what he or she is looking for in order to find it. This is a problem captured in the famous snippet, “You don’t know what you don’t know” and the “unknown unknowns” variation popularized by Donald Rumsfeld. Obviously without prior knowledge the utility of an AeroText-type of system has to be matched to mission requirements. AeroText pounded the drum for the semantic Web revolution. One of AeroText’s key functions was its ability to perform the type of markup the Department of Defense required of its XML. The US DoD used a variant called DAML or DARPA Agent Markup Language. Natural language processing, Louella, and AeroText collected the dust of SPARQL, unifying logic, RDF, OWL, ontologies, and other semantic baggage as the system evolved through time.

Also, staff (headcount) and on-going services are required to keep a Louella/AeroText-type system generating relevant and usable outputs. AeroText can find entities, figure out relationships like person to person and person to organization, and tag events like a merger or an arrest “event.” In one briefing about AeroText I attended, I recall that the presenter emphasized that AeroText did not require training. (The subtext for those in the know was that Autonomy required training to deliver actionable outputs.) The presenter did not dwell on the need for manual fiddling with AeroText’s knowledgebases and I did not raise this issue.
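Event tagging of this kind can be pictured as a template with a trigger word and slots to fill. The Python sketch below is my own illustration under that assumption, not AeroText's actual rule syntax; the `ACQUISITION` pattern and `extract_events` function are hypothetical.

```python
import re

# Capitalized word sequence, used as a crude stand-in for a tagged entity.
NAME = r"[A-Z]\w+(?: [A-Z]\w+)*"

# Hypothetical event template: trigger verbs plus buyer and target slots.
ACQUISITION = re.compile(
    rf"(?P<buyer>{NAME}) (?:acquired|bought) (?P<target>{NAME})"
)


def extract_events(text):
    """Return one dict per acquisition event matched by the template."""
    events = []
    for match in ACQUISITION.finditer(text):
        events.append(
            {
                "event": "ACQUISITION",
                "buyer": match.group("buyer"),
                "target": match.group("target"),
            }
        )
    return events


sample = "In 2008 Rocket Software acquired AeroText from Lockheed Martin."
print(extract_events(sample))
```

Someone has to write and maintain a template like this for each event type and each phrasing, which is the manual fiddling the presenter skipped over.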

The benefit of AeroText, as I recall, is that processed text presents an analyst with a synopsis, visual snapshot, or key points in a text or flow of text. When AeroText was part of Lockheed Martin, many analysts had to deal with 10,000 or more documents per shift. The US government was, and may still be, chasing a magical solution to having too much information for an informed analyst to use in his or her work process. Today, the US government is riding the predictive analytics pony. Unlike AeroText, predictive methods like those employed by Google/Recorded Future “tell” the analyst what he or she needs to know even though the analyst has zero prior knowledge. AeroText, an old school system, requires that useful prior knowledge be available at system launch.

AeroText, in my opinion, is like ClearForest; that is, bound by rules. According to the Department of Justice analysis:

While no specific information was found, [Noble] reports that “[d]eveloping rules for a new domain can be labor intensive, sometimes requiring more than a month of effort from experienced AeroText™ users.” (Source: Free Text Conversion and Semantic Analysis Survey, Department of Justice, August 2007, page 470)

AeroText users could use a graphical interface to craft rules related to the extraction task. From “Information Extraction Tools: Deciphering Human Language,” IT Pro, November/December 2004, page 28.

In order to make it easier for an analyst to create rules for the AeroText system, Lockheed Martin added a graphical rule building interface. The interface is better than asking an analyst to write code, but the analyst using this interface has to be trained in the AeroText conventions. Obviously expertise in rule building varies among trained analysts; therefore, AeroText-type systems require some additional housekeeping to help make certain that the results output from the system are, in fact, on point to the user’s mission. You can figure out for yourself what happens when system outputs drift off target. In general, flawed outputs are not as useful as accurate outputs, particularly when decisions based on the data have significant risks and consequences.

Components of AeroText Prior to Sale to Rocket

Selected parts of AeroText in the 2005-2007 period were:

XML and DAML wrappers
A Run Time Integration Toolkit
Automatic database generator
Document summarization
Document routing
Link analysis
Application programming interface (Java, C, or Component Object Model)
Key word search
Visualization tools
Rules editor
Extraction interface
Corpus Analyzer or clusterer
Scoring component (Answer Key)
Knowledgebases and management tools (language support, basic facts, etc.)
Event plotting on a timeline

Rocket explains the capabilities of AeroText as having three components: interface, extraction, and knowledge base function.

People Known to Have Been Associated with AeroText

People associated with the project over the years have included:

George R. Krupka (GE Labs era)
Paul Jacobs and Lisa Rau (GE era husband and wife team)
Ira Sider (Martin Marietta era)
Lois C. Childs (Martin Marietta era)
Sara M. Taylor (Lockheed Martin era)
Paul Kogut (Lockheed Martin era, professor at Penn State)

Closer to the Present

At the time of the 2008 acquisition for “undisclosed terms” Rocket CEO Andy Youniss said:

“This is a great addition to our roster of products. Our primary focus is on serving the needs of the AeroText customers and the terrific relationship we have with Lockheed Martin. After we release AeroText V6, we expect to explore potential integration with our other enterprise software offerings. Stay tuned.” (Source: Reuters, June 23, 2008)

The inquiry to me suggested that someone did tune into Rocket’s sales pitch. Here are some highlights of the technology’s journey from lab project to Department of Defense focal point to commercial software in the CorVu NG product. The intrepid researcher will find the CorVu installation manual interesting reading. What appears to be new in the world of content processing may not be exactly what a system is.

Like all things in search, knowing the history of a technology as manifested in a “system” can be helpful.

Stephen E Arnold, June 30, 2014

Written by Stephen E. Arnold · Filed Under Analytics, Feature, Search
