AeroText: A New Breakthrough in Entity Extraction

June 30, 2014

I returned from a brief visit to Europe to an email asking about Rocket Software’s breakthrough technology AeroText. I poked around in my archive and found a handful of nuggets about the General Electric Laboratories’ technology that migrated to Martin Marietta, then to Lockheed Martin, and finally, in 2008, to the low-profile Rocket Software, an IBM partner.

When did the text extraction software emerge? Is Rocket Software AeroText a “new kid on the block”? The short answer is that AeroText is pushing 30, maybe 35 years young.

Digging into My Archive of Search Info

As far as my archive goes, it looks as though the roots of AeroText are anchored in the 1980s. Yep, that works out to an innovation about the same age as the long-in-the-tooth ISYS Search system, now owned by Lexmark. Over the years, the AeroText “product” has evolved, often in response to US government funding opportunities. The precursor to AeroText was an academic exercise at General Electric. Keep in mind that GE makes jet engines, so GE at one time had a keen interest in anything its aerospace customers in the US government thought was a hot tamale.


The AeroText interface circa the mid-2000s. On the left is the extraction window. On the right is the document window. From “Information Extraction Tools: Deciphering Human Language,” IT Pro, November–December 2004, page 28.

The GE project, according to my notes, appeared as NLToolset, although my files contained references to different descriptions such as Shogun. GE’s team of academics and “real” employees developed a bundle of tools for its aerospace activities and in response to Tipster. (As a side note, in 2001 there were a number of Tipster-related documents in the www.firstgov.gov system. But the new www.usa.gov index does not include that information. You will have to do your own searching to unearth these text processing jump-start documents.)

The aerospace connection is important because the Department of Defense in the 1980s was trying to standardize on markup for documents. Part of this effort was processing content like technical manuals and various types of unstructured content to figure out who was named, what part was what, and what people, places, events, and things were mentioned in digital content. The utility of NLToolset-type software lay in reducing the costs associated with documents and in the intelligence value of the processed information.

The need for a markup system that worked without 100 percent human indexing was important. GE got with the program and appears to have assigned some then-young folks to the project. The government-speak for this type of content processing involves terms like “message understanding” or MU, “entity extraction,” and “relationship mapping.” The outputs of an NLToolset system were intended for use in other software subsystems that could count, process, and perform other operations on the tagged content. Today, this class of software would be packaged under a broad term like “text mining.” GE exited the business, which ended up in the hands of Martin Marietta. When the technology landed at Martin Marietta, the suite of tools was used in what was called, in the late 1980s and early 1990s, the Louella Parsing System. When Lockheed and Martin Marietta merged to form the giant Lockheed Martin, Louella was renamed AeroText.

Over the years, the AeroText system competed with LingPipe, SRA’s NetOwl, and Inxight’s tools. In the heyday of natural language processing, there were dozens and dozens of universities and startups competing for federal funding. I have mentioned in other articles the importance of the US government in jump-starting the craziness in search and content processing.

In 2005, I recall that Lockheed Martin released AeroText 5.1 for Linux, but I have lost track of the open source versions of the system. The point is that AeroText is not particularly new, and as far as I know, the last major upgrade took place in 2007, before Lockheed Martin sold the property to Rocket Software. At the time of the sale, AeroText incorporated a number of subsystems, including a useful time plotting feature. A user could see tagged events on a timeline, a function long associated with the original version of i2’s Analyst’s Notebook. A US government buyer can obtain AeroText via the GSA because Lockheed Martin seems to be a reseller of the technology. Before the sale to Rocket, Lockheed Martin followed SAIC’s push into Australia. Lockheed signed up NetMap Analytics to handle Australia’s appetite for US government-accepted systems.

AeroText Functionality

What does AeroText purport to do that caused the person who contacted me to see a 1980s technology as the next best thing to sliced bread?

AeroText is an extraction tool; that is, it has capabilities to identify and tag entities at somewhere between 50 percent and 80 percent accuracy. (See NIST 2007 Automatic Content Extraction Evaluation Official Results for more detail.)

The AeroText approach uses knowledgebases, rules, and patterns to identify and tag pre-specified types of information. AeroText references patterns and templates, both of which assume the licensee knows beforehand what is needed and what will happen to processed content.

In my view, the licensee has to know what he or she is looking for in order to find it. This is a problem captured in the famous snippet “You don’t know what you don’t know” and the “unknown unknowns” variation popularized by Donald Rumsfeld. Obviously, without prior knowledge, the utility of an AeroText-type system has to be matched to mission requirements. AeroText pounded the drum for the semantic Web revolution. One of AeroText’s key functions was its ability to perform the type of markup the Department of Defense required for its XML. The US DoD used a variant called DAML, or DARPA Agent Markup Language. Natural language processing, Louella, and AeroText collected the dust of SPARQL, unifying logic, RDF, OWL, ontologies, and other semantic baggage as the system evolved through time.
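To make the rules-and-patterns approach concrete, here is a minimal sketch of how a small knowledge base plus hand-written patterns can tag entities in text. This is a generic illustration I put together, not AeroText code; the lists, patterns, and labels are invented for the example.

```python
# Minimal sketch of pattern-based entity tagging, in the spirit of
# rules-and-knowledge-base extractors. Generic illustration only, not
# AeroText code; the patterns and labels are invented for this example.
import re

# A tiny "knowledge base": known organization names the rules can match.
KNOWN_ORGS = {"Lockheed Martin", "Martin Marietta", "Rocket Software"}

# Hand-written patterns; real systems use far richer rule languages.
PATTERNS = [
    ("PERSON", re.compile(r"\b(?:Mr\.|Ms\.|Dr\.)\s+[A-Z][a-z]+\b")),
    ("DATE",   re.compile(r"\b(?:19|20)\d{2}\b")),
]

def tag_entities(text: str):
    """Return (label, span_text) pairs found by the rules and knowledge base."""
    hits = []
    for org in KNOWN_ORGS:
        for m in re.finditer(re.escape(org), text):
            hits.append(("ORG", m.group(0)))
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            hits.append((label, m.group(0)))
    return hits

print(tag_entities("Dr. Smith joined Martin Marietta in 1987."))
# [('ORG', 'Martin Marietta'), ('PERSON', 'Dr. Smith'), ('DATE', '1987')]
```

The point of the sketch is the dependence on prior knowledge: if a name or pattern is not in the lists, the system never tags it, which is exactly the “you don’t know what you don’t know” limitation noted above.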

Also, staff (headcount) and ongoing services are required to keep a Louella/AeroText-type system generating relevant and usable outputs. AeroText can find entities, figure out relationships like person to person and person to organization, and tag events like a merger or an arrest “event.” In one briefing about AeroText I attended, I recall that the presenter emphasized that AeroText did not require training. (The subtext for those in the know was that Autonomy required training to deliver actionable outputs.) The presenter did not dwell on the need for manual fiddling with AeroText’s knowledgebases, and I did not raise the issue.
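In the same hedged spirit, here is a toy sketch of how pattern-based relationship and event tagging can sit on top of entity tags. Again, this is a generic illustration with invented patterns and labels, not AeroText’s actual rule language.

```python
# Toy sketch of pattern-based relationship and event tagging layered on top
# of entity mentions. Generic illustration only, not AeroText code; the
# patterns and labels are invented for this example.
import re

# Pattern: PERSON joined/left ORG -> person-to-organization relationship.
EMPLOYMENT = re.compile(
    r"(?P<person>(?:Mr\.|Ms\.|Dr\.)\s+[A-Z][a-z]+)\s+(?P<verb>joined|left)\s+"
    r"(?P<org>[A-Z][a-zA-Z]+(?:\s+[A-Z][a-zA-Z]+)*)"
)
# Pattern: ORG acquired ORG -> a merger/acquisition "event".
ACQUISITION = re.compile(
    r"(?P<buyer>[A-Z][a-zA-Z]+(?:\s+[A-Z][a-zA-Z]+)*)\s+acquired\s+"
    r"(?P<target>[A-Z][a-zA-Z]+(?:\s+[A-Z][a-zA-Z]+)*)"
)

def tag_relations(text: str):
    """Return relationship and event tuples found by the patterns."""
    relations = []
    for m in EMPLOYMENT.finditer(text):
        relations.append(("PERSON_ORG", m.group("person"), m.group("verb"), m.group("org")))
    for m in ACQUISITION.finditer(text):
        relations.append(("ACQUISITION_EVENT", m.group("buyer"), m.group("target")))
    return relations

print(tag_relations("Dr. Smith joined Martin Marietta. Lockheed acquired Martin Marietta."))
# [('PERSON_ORG', 'Dr. Smith', 'joined', 'Martin Marietta'),
#  ('ACQUISITION_EVENT', 'Lockheed', 'Martin Marietta')]
```

The manual fiddling mentioned above is visible even in a toy like this: every new relationship type or event means another hand-built pattern and more knowledge base maintenance.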


HP Autonomy Makes Analytics Human

June 24, 2014

HP Autonomy has undergone a redesign, or as HP phrases it, a rebirth. HP is ready to make the unveiling official, and those interested can read about the details in the article, “Analytics for Human Information: HP IDOL 10.6 Just Released: A Story of Something Bigger.”

The article begins:

“Under the direction of SVP and General Manager Robert Youngjohns, this past year has been a time of transformation for HP Autonomy—with a genuine commitment to customer satisfaction, breakthrough technological innovation, and culture of transparency. Internally, to emphasize the importance of this fresh new thinking and business approach, we refer to this change as #AutonomyReborn.”

Quarterly releases promise rapid updates, and open source integration is front and center. Current users and interested new users can download the latest version from the customer support site.

Emily Rae Aldridge, June 24, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Repositioning Autonomy

June 19, 2014

HP says that it has spent the past year rebuilding Autonomy into a flagship, foundational technology for HP IDOL 10. HP discusses the new changes in “Analytics For Human Information: HP IDOL 10.6 Just Released: A Story Of Something Bigger.” Autonomy had problems in the past when its capabilities for organizing and analyzing unstructured information were called into question after HP purchased it. HP claims that under its guidance HP IDOL 10 is drastically different from its previous incarnations:

“HP IDOL 10, released under HP’s stewardship, reflects in many ways the transformation that has occurred under HP.  IDOL 10 is fundamentally different from Autonomy IDOL 7 in the same way that HP Autonomy as a company differs pre- and post- acquisition. They may share the name IDOL, but the differences are so vast from both strategic and technology points-of-view that we consider IDOL 10 a wholly new product from IDOL 7, and not just a version update. HP sees IDOL as a strategic pillar of HAVEn – HP’s comprehensive big data platform – and isn’t shy to use its vast R&D resources to invest heavily into the technology.”

Some of the changes include automatic time zone conversion, removal of sensitive or offensive material, and better site administration. All clients who currently have an IDOL support contract will be able to download the upgrade free of charge.

HP really wants to be in the headlines for some positive news, instead of lawsuits. They are still reeling from the Autonomy purchase flub, and now they are working on damage control. How long will they be doing that? Something a bit more impressive than a filter and time zone conversion is called for to sound the trumpets.

Whitney Grace, June 19, 2014
Sponsored by ArnoldIT.com, developer of Augmentext

Palantir Advises More Abstraction for Less Frustration

June 10, 2014

At this year’s Gigaom Structure Data conference, Palantir’s Ari Gesher offered an apt parallel for the data field’s current growing pains: using computers before the dawn of operating systems. Gigaom summarizes his explanation in, “Palantir: Big Data Needs to Get Even More Abstract(ions).” Writer Tom Krazit tells us:

“Gesher took attendees on a bit of a computer history lesson, recalling how computers once required their users to manually reconfigure the machine each time they wanted to run a new program. This took a fair amount of time and effort: ‘if you wanted to use a computer to solve a problem, most of the effort went into organizing the pieces of hardware instead of doing what you wanted to do.’

“Operating systems brought abstraction, or a way to separate the busy work from the higher-level duties assigned to the computer. This is the foundation of modern computing, but it’s not widely used in the practice of data science.

“In other words, the current state of data science is like ‘yak shaving,’ a techie meme for a situation in which a bunch of tedious tasks that appear pointless actually solve a greater problem. ‘We need operating system abstractions for data problems,’ Gesher said.”

An operating system for data analysis? That’s one way to look at it, I suppose. The article invites us to click through to a video of the session, but as of this writing it is not functioning. Perhaps they will heed the request of one commenter and fix it soon.

Based in Palo Alto, California, Palantir focuses on improving the methods their customers use to analyze data. The company was founded in 2004 by some folks from PayPal and from Stanford University. The write-up makes a point of noting that Palantir is “notoriously secretive” and that part(s) of the U.S. government can be found among its clients. I’m not exactly sure, though, how that ties into Gesher’s observations. Does Krazit suspect it is the federal government calling for better organization and a simplified user experience? Now, that would be interesting.

Cynthia Murrell, June 10, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Elasticsearch: Bulldozing Content Processing

June 7, 2014

When I left the intelligence conference in Prague, there were a number of companies in my graphic about open source search. When I got off the airplane, I edited my slide. Looks to me as if Elasticsearch has just bulldozed the commercialized open source search and content processing group. I would not want to be the CEO of LucidWorks, Ikanow, or any other open sourcey search and content processing company this weekend.

I read “Elasticsearch Scores $70 Million to Help Sites Crunch Tons of Data Fast.” Forget the fact that Elasticsearch is built on Lucene and some home-grown code. Ignore the grammar in “data fast.” Skip over the sports analogy “scores.” Dismiss the somewhat narrow definition of what Elasticsearch ELK can really deliver.

What’s important is the $70 million committed to Elasticsearch. Added to the $30 or $40 million the outfit had obtained before, we are looking at a $100 million bet on an open source search-based business. Compare this to the trifling $40 million the proprietary vendor Coveo had gathered or the $30 million put on LucidWorks to get into the derby.

I have been pointing out that Elasticsearch has demonstrated that it had several advantages over its open source competitors; namely, developers, developers, and developers.

Now I want to point out that it has another angle of attack: money, money, and money.

With the silliness of the search and content processing vendors’ marketing over the last two years, I think we have the emergence of a centralizing company.

No, it’s not HP’s new cloudy Autonomy. No, it’s not the wonky Watson game and recipe code from IBM. No, it’s not the Google Search Appliance, although I do love the little yellow boxes.

I will be telling those who attend my lectures to go with Elasticsearch. That’s where the developers and the money are.

Stephen E Arnold, June 7, 2014

Software AG Happy About JackBe

May 30, 2014

Business Wire via Sys Con has some great news: “Software AG’s Acquisition Of JackBe Recognized As Strategic M&A Deal Of The Year.” Software AG is a big data, integration, and business process technologies firm driven to help companies achieve their desired outcomes. The acquisition of real-time visual analytics and intelligence software provider JackBe will be the foundation for Software AG’s new Intelligent Business Operations Platform. The acquisition even garnered attention from the Association for Corporate Growth and was recognized as the Strategic M&A Deal of the Year in the $100 million category.

JackBe will allow Software AG to offer its clients a broader range of enterprise functions in real time, especially in areas related to the Internet of Things and customer experience management.

“The real-time analysis and visualization of massive amounts of data is increasingly becoming the basis for fast and intelligent business decisions. With the capabilities of JackBe integrated in its Intelligent Business Operations platform, Software AG has been able to provide customers with a comprehensive 360-degree view of operational processes by combining live, historical and transactional data with machine-to-machine communications.”

Purchasing JackBe was one of the largest big data deals in 2013, and it also shows that technology used by the US government can be turned into a viable commercial business.

Software AG definitely has big plans for 2014. Will they continue to make headlines this year?
Whitney Grace, May 30, 2014
Sponsored by ArnoldIT.com, developer of Augmentext

This is Microsoft Embracing Predictive Analysis

May 19, 2014

Now here’s a valuable use of predictive analytics. Digital Trends reports, “Microsoft to Use Bing Search Data to Predict Outcomes of Reality Shows.” Microsoft announced the initiative in this Bing blog post. It is good to see such an influential company investing its resources in issues that affect the quality of life for all humanity. Writer Konrad Krawczyk tells us:

“Beginning today [April 21], Bing will attempt to forecast the results of shows like ‘The Voice,’ ‘American Idol’ and ‘Dancing With The Stars,’ by scanning search data, along with ‘social input’ from Facebook and Twitter. For instance, if you head over to Bing right now and search ‘American Idol predictions’ like we did, the top of the page will feature a set of forecasts for five singers. We’ll refrain from adding in any potential Bing-generated spoilers here, but you’re free to check out what the search engine thinks for yourself.

“‘In broad strokes, we define popularity as the frequency and sentiment of searches combined with social signals and keywords. Placing these signals into our model, we can predict the outcome of an event with high confidence,’ the Bing Predictions Team says in its blog post.

“Microsoft also says that Bing’s predictions incorporate numerous emotionally-driven factors into how it generates predictions, allegedly accounting for biases like favoritism, regardless of how a person’s favorite singer/contestant performs from one week to the next.”

While this example does sum up the gist of predictive analysis, we can think of several areas to which the technology could be better applied. To be fair, the Bing Predictions Team says reality TV is not the pinnacle of its prediction projects. Will the next initiative be aimed at similarly vacuous forecasts?
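For readers who want to see what “placing these signals into our model” might look like in practice, here is a toy sketch of a weighted popularity score built from search volume, search sentiment, and social mentions. The weights, field names, and numbers are invented for illustration; this is not Bing’s model.

```python
# Toy illustration of "popularity as frequency and sentiment of searches
# combined with social signals." Invented sketch, not Bing's model; the
# weights, field names, and sample values are assumptions for the example.
from dataclasses import dataclass

@dataclass
class ContestantSignals:
    name: str
    search_volume: float     # normalized 0..1 share of queries
    search_sentiment: float  # -1 (negative) .. +1 (positive)
    social_mentions: float   # normalized 0..1 share of mentions

def popularity(s: ContestantSignals,
               w_volume: float = 0.5,
               w_sentiment: float = 0.2,
               w_social: float = 0.3) -> float:
    """Weighted combination of signals; higher means more likely to advance."""
    return (w_volume * s.search_volume
            + w_sentiment * (s.search_sentiment + 1) / 2  # rescale to 0..1
            + w_social * s.social_mentions)

contestants = [
    ContestantSignals("Singer A", 0.40, 0.6, 0.35),
    ContestantSignals("Singer B", 0.25, 0.1, 0.30),
]
ranked = sorted(contestants, key=popularity, reverse=True)
print([c.name for c in ranked])  # ['Singer A', 'Singer B']
```

Whatever the real model looks like, the “favoritism” bias Microsoft mentions would show up in exactly these kinds of weights: heavy search volume for a beloved contestant can outvote a weak performance.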

Cynthia Murrell, May 19, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

SAS Text Miner Gets An Upgrade

May 5, 2014

SAS is a well-recognized player in the IT game as a purveyor of data, security, and analytics software. In modern terms, the company is a big player in big data, and we caught word that SAS has updated its Text Miner to beef up its offerings. SAS Text Miner is advertised as a way for users to harness information not only in legacy data, but also in Web sites, databases, and other text sources. The process can be used to discover new ideas and improve decision-making.

SAS Text Miner offers a variety of benefits that set it apart from the standard open source download. Not only do users receive a license and tech support, but Text Miner also offers the ability to process and analyze knowledge in minutes, an interactive user interface, and predictive and data mining modeling techniques. The GUI is what will draw in developers:

“Interactive GUIs make it easy to identify relevance, modify algorithms, document assignments and group materials into meaningful aggregations. So you can guide machine-learning results with human insights. Extend text mining efforts beyond basic start-and-stop lists using custom entities and term trend discovery to refine automatically generated rules.”

Not being able to modify proprietary software is a deal breaker these days. With multiple options for text mining software, being able to make it unique is what will sell it.
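As a generic illustration of the basic start-and-stop-list approach the quote says custom entities and term trend discovery extend, here is a short sketch. It is not SAS code; the lists, documents, and month keys are invented for the example.

```python
# Generic sketch of start-and-stop-list term extraction, the baseline that
# custom entities and trend discovery build on. Not SAS code; the lists and
# sample documents are invented for this example.
from collections import Counter
import re

STOP_LIST = {"the", "a", "of", "and", "to", "in", "is", "was"}  # terms to ignore
START_LIST = {"claim", "fraud", "payment", "denial"}            # terms to force-keep

def extract_terms(doc: str) -> Counter:
    """Tokenize, drop stop-list terms, always keep start-list terms."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    kept = [t for t in tokens if t in START_LIST or t not in STOP_LIST]
    return Counter(kept)

docs_by_month = {
    "2014-03": "The claim was a routine payment of benefits.",
    "2014-04": "Fraud in the claim and denial of payment is rising.",
}

# Crude "trend discovery": compare term counts month over month.
for month, doc in docs_by_month.items():
    print(month, extract_terms(doc).most_common(3))
```

The selling point in the SAS pitch is that an interactive GUI and automatically generated rules take over where hand-maintained lists like these stop scaling.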

Whitney Grace, May 05, 2014
Sponsored by ArnoldIT.com, developer of Augmentext

Big Data: Can the Latest Trend Deliver?

April 25, 2014

If you track Big Data, you will want to read “Why Big Data Is Stillborn (for Now).” The write up hits the highlights of the flickering hyperbole machine that sells fancy math to the government and organizations desperate for a Silver Bullet.

The article asserts:

Most “big data” has to be moved in physical containers. Most data centers do not have excess capacity to handle petabyte level simultaneous search and pattern discovery.

Believe in real time and high speed access? Consider this statement:

Bandwidth, throughput, and how “real time” is defined all come down to the weak link in the chain and we have many ***very weak*** links across the chain and especially in Washington, D.C. The bottom line is always “who benefits?” The FCC decision to destroy net neutrality is in error. The citizen, not the corporation, is “root” in a Smart Nation.

If you wonder why your Big Data investments have yet to deliver a golden goose pumping out 24-karat eggs every day, check out this write up. Worth reading.

Stephen E Arnold, April 25, 2014

Small Analytics Firms Reaping the Benefit of Investment Cycle

April 23, 2014

Small-time analytics isn’t really as startup-y as people may think anymore. These companies are in high demand and are pulling in some serious cash. We discovered just how much and how serious from a recent Cambridge Science Park article, “Cambridge Text Analytics Linguamatics Hits $10m in Sales.”

According to the story:

Linguamatics’ sales showed strong growth and exceeded ten million dollars in 2013, it was announced today – outperforming the company’s targeted growth and expected sales figures.  The increased sales came from a boost in new customers and increased software licenses to existing customers in the pharmaceutical and healthcare sectors. This included 130 per cent growth in healthcare sales plus increased sales in professional services.

This earning potential has clearly grabbed the attention of investors. This, in turn, is feeding a cycle of growth, which is why the Linguamaticses of the world can rake in impressive numbers. Just the other day, for example, Tech Circle reported on a microscopic Mumbai big data company that landed $3m in investments. They say it takes money to make money, and right now the world of big data analytics has that cycle down pat. It won’t last forever, but it’s fun to watch as it does.

Patrick Roland, April 23, 2014

Sponsored by ArnoldIT.com, developer of Augmentext
