AeroText: A New Breakthrough in Entity Extraction
June 30, 2014
I returned from a brief visit to Europe to an email asking about Rocket Software’s breakthrough technology AeroText. I poked around in my archive and found a handful of nuggets about the General Electric Laboratories’ technology that migrated to Martin Marietta, then to Lockheed Martin, and finally in 2008 to the low profile Rocket Software, an IBM partner.
When did the text extraction software emerge? Is Rocket Software AeroText a “new kid on the block”? The short answer is that AeroText is pushing 30, maybe 35 years young.
Digging into My Archive of Search Info
As far as my archive goes, it looks as though the roots of AeroText are anchored in the 1980s. Yep, that works out to an innovation about the same age as the long in the tooth ISYS Search system, now owned by Lexmark. Over the years, the AeroText “product” has evolved, often in response to US government funding opportunities. The precursor to AeroText was an academic exercise at General Electric. Keep in mind that GE makes jet engines, so GE at one time had a keen interest in anything its aerospace customers in the US government thought was a hot tamale.
The AeroText interface circa the mid 2000s. On the left is the extraction window. On the right is the document window. From “Information Extraction Tools: Deciphering Human Language,” IT Pro, November December 2004, page 28.
The GE project, according to my notes, appeared as NLToolset, although my files contained references to different descriptions such as Shogun. GE’s team of academics and “real” employees developed a bundle of tools for its aerospace activities and in response to Tipster. (As a side note, in 2001, there were a number of Tipster related documents in the www.firstgov.gov system. But the new www.usa.gov index does not include that information. You will have to do your own searching to unearth these text processing jump start documents.)
The aerospace connection is important because the Department of Defense in the 1980s was trying to standardize on markup for documents. Part of this effort was processing content like technical manuals and various types of unstructured content to figure out who was named, what part was what, and what people, places, events, and things were mentioned in digital content. The utility of NLToolset type software was for cost reduction associated with documents and the intelligence value of processed information.
The need for a markup system that worked without 100 percent human indexing was important. GE got with the program and appears to have assigned some then-young folks to the project. The government speak for this type of content processing involves terms like “message understanding” or MU, “entity extraction,” and “relationship mapping.” The outputs of an NLToolset system were intended for use in other software subsystems that could count, process, and perform other operations on the tagged content. Today, this class of software would be packaged under a broad term like “text mining.” GE exited the business, which ended up in the hands of Martin Marietta. When the technology landed at Martin Marietta, the suite of tools was used in what was called, in the late 1980s and early 1990s, the Louella Parsing System. When Lockheed and Martin Marietta merged to form the giant Lockheed Martin, Louella was renamed AeroText.
Over the years, the AeroText system competed with LingPipe, SRA’s NetOwl and Inxight’s tools. In the heyday of natural language processing, there were dozens and dozens of universities and start ups competing for Federal funding. I have mentioned in other articles the importance of the US government in jump starting the craziness in search and content processing.
In 2005, I recall that Lockheed Martin released AeroText 5.1 for Linux, but I have lost track of the open source versions of the system. The point is that AeroText is not particularly new, and as far as I know, the last major upgrade took place in 2007 before Lockheed Martin sold the property to Rocket Software. At the time of the sale, AeroText incorporated a number of subsystems, including a useful time plotting feature. A user could see tagged events on a timeline, a function long associated with the original version of i2’s Analyst’s Notebook. A US government buyer can obtain AeroText via the GSA because Lockheed Martin seems to be a reseller of the technology. Before the sale to Rocket, Lockheed Martin followed SAIC’s push into Australia. Lockheed signed up NetMap Analytics to handle Australia’s appetite for US government accepted systems.
AeroText Functionality
What does AeroText purport to do that caused the person who contacted me to see a 1980s technology as the next best thing to sliced bread?
AeroText is an extraction tool; that is, it has capabilities to identify and tag entities at somewhere between 50 percent and 80 percent accuracy. (See NIST 2007 Automatic Content Extraction Evaluation Official Results for more detail.)
The AeroText approach uses knowledgebases, rules, and patterns to identify and tag pre-specified types of information. AeroText references patterns and templates, both of which assume the licensee knows beforehand what is needed and what will happen to processed content.
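To make the “knowledgebases, rules, and patterns” idea concrete, here is a minimal sketch in Python. It is not AeroText code, and I have no view into how Rocket implements the product. The knowledgebase entries, the two patterns, and the function name extract_entities are all hypothetical, invented for illustration. The point it shows is the one above: the system tags only what someone specified beforehand.

```python
import re

# Hypothetical knowledgebase: known surface strings mapped to entity types.
KNOWLEDGEBASE = {
    "Lockheed Martin": "ORGANIZATION",
    "Rocket Software": "ORGANIZATION",
    "General Electric": "ORGANIZATION",
}

# Hypothetical patterns for entity types that a list cannot enumerate.
PATTERNS = [
    ("DATE", re.compile(r"\b(?:19|20)\d{2}\b")),
    ("MONEY", re.compile(r"\$\d+(?:\.\d+)? (?:million|billion)")),
]


def extract_entities(text):
    """Return (start, end, type, surface) tuples sorted by position."""
    found = []
    # Knowledgebase lookup: exact string matches only.
    for name, etype in KNOWLEDGEBASE.items():
        for m in re.finditer(re.escape(name), text):
            found.append((m.start(), m.end(), etype, m.group()))
    # Pattern rules: regular expressions for open-ended types.
    for etype, pattern in PATTERNS:
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), etype, m.group()))
    return sorted(found)


sample = "Lockheed Martin sold AeroText to Rocket Software in 2008."
for start, end, etype, surface in extract_entities(sample):
    print(etype, surface)
```

Notice that “AeroText” itself goes untagged in the sample sentence because no one put it in the knowledgebase. That is the maintenance burden in miniature.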
In my view, the licensee has to know what he or she is looking for in order to find it. This is a problem captured in the famous snippet, “You don’t know what you don’t know” and the “unknown unknowns” variation popularized by Donald Rumsfeld. Obviously, without prior knowledge, the utility of an AeroText-type of system has to be matched to mission requirements. AeroText pounded the drum for the semantic Web revolution. One of AeroText’s key functions was its ability to perform the type of markup the Department of Defense required of its XML. The US DoD used a variant called DAML or DARPA Agent Markup Language. Natural language processing, Louella, and AeroText collected the dust of SPARQL, unifying logic, RDF, OWL, ontologies, and other semantic baggage as the system evolved through time.
Also, staff (headcount) and on-going services are required to keep a Louella/AeroText-type system generating relevant and usable outputs. AeroText can find entities, figure out relationships like person to person and person to organization, and tag events like a merger or an arrest “event.” In one briefing about AeroText I attended, I recall that the presenter emphasized that AeroText did not require training. (The subtext for those in the know was that Autonomy required training to deliver actionable outputs.) The presenter did not dwell on the need for manual fiddling with AeroText’s knowledgebases, and I did not raise this issue.
Duck Duck Go Reimagined
June 30, 2014
Duck Duck Go has launched a sleek redesigned web presence, complete with a flashy “What’s New” page to go over the highlights. Duck Duck Go is gaining more traction for users who are interested in secure search, so there will be great interest in what the team is bringing to the table.
Their overview says:
“DuckDuckGo is a search engine driven by community – you’re on the team! We’re not just servers and an algorithm. We’re so much more. Real Privacy. We Don’t Track You. Smarter Search. Get Answers Quicker. Less Clutter. Fewer Ads and Reduced Spam.”
Of course details are provided for those who want to seek them out. But as Google gets bigger and bigger, some users are looking for smart search that allows them to remain an anonymous face in the crowd, and that is Duck Duck Go’s specialty. It may not quite be a David and Goliath situation as the giant does not look like it is going down anytime soon, but Duck Duck Go is on the rise and worth keeping an eye on. But do keep in mind that DDG is a metasearch system, so its weakness is that it has to rely on others’ search indexes.
Emily Rae Aldridge, June 30, 2014
Sponsored by ArnoldIT.com, developer of Augmentext
X1 Search 8 Moves to Unified Search
June 30, 2014
With the move to more data across a wider variety of repositories (SharePoint, OneDrive, Dropbox, and more) the need to search across platforms is becoming more urgent. X1’s search model has responded to the need by introducing X1 Search 8. Details are covered in the CollabShow article, “X1 Search 8—Unified Search for SharePoint, Desktop, Mail and More…”
The article begins:
“X1 has been analyzing the needs of the information worker and consumer in this space for over a decade. With their analyses, they have identified the need for fast retrieval and an intuitive, simple interface and powerful filtering across all of the repositories that a user uses and values. When you’re searching for information, you don’t want to have to go to a dozen different places across a variety of user interfaces. You’re likely to give up and end up spending hours duplicating effort or emailing someone else and wasting their time because you couldn’t find the email or document you were looking for and that you know you’ve seen somewhere.”
X1 is using familiar language – unified search, everything search, etc. And while it is perhaps trendy, it is not exactly original. The term “unified” is also used by Attivio, BA Insight, and Sinequa. Keep an eye out to see whether this trend turns into the norm in search. It stands to reason that all enterprise search has to be unified because of the natural direction of the technology. Time will tell.
Emily Rae Aldridge, June 30, 2014
Google and Disappearing Locations in Satellite Imagery
June 29, 2014
I am okay with information disappearing. Whether it is a pointer or the actual content, information is fluid. When doing routine updates of my information about enterprise search vendors, I come across file not found errors or documents that are different from the ones I previously accessed. Some content has vaporized, the target page displaying a blank white screen. A recent example of this is information about the AeroText entity extraction system now owned by an outfit called Rocket.
I read with some interest “Erasing Your Home from Google Maps Is Way Easier Than You Think.” As satellite imagery for public access creeps toward higher resolution, certain locations require blurring. The article explains how you can “blur” your property in a Google Maps’ image. I learned:
The process is relatively simple. First go to Google Maps and enter your home address (or the address of whatever you want blurred). Enter “street view” mode by dragging the little man on the right side of the screen to the spot you want blurred. Once there, hit the “Report a problem” button on the lower-right corner of the screen. It will pull up a page where you can specify whatever image you want to have blurred.
The write up explains how a criminal can use online imagery. The list is incomplete, but it may create more awareness of the consequences of not knowing what one does not know.
How is this relevant to search? Well, if it is incorrect, altered, or not there, it is tough to make certain types of informed decisions. Ignorance can be bliss as long as those who are ignorant are not making certain types of decisions that require precise, current, and accurate information.
Stephen E Arnold, June 29, 2014
HP: Chasing Autonomy Execs the Next Big Thing
June 28, 2014
I read “HP Will Settle 3 Lawsuits Over Its $11 Billion Autonomy Acquisition, Urge Shareholders To Sue Autonomy.” The settlement approach makes sense; otherwise, attorneys would be able to purchase the total output of McLaren and Ferrari before a final decision stumbles from a courtroom.
The write up states:
To recap: less than a year after buying British software maker Autonomy for $11 billion, HP wrote off $8.8 billion and alleged that Autonomy had improperly inflated its revenues and margins, to the tune of $5 billion. HP called it fraud, named a whole bunch of ways it believed Autonomy had done this, and asked for investigations by the authorities.
But the portion of the article that caught my attention was this passage:
The shareholders will agree to drop all claims against HP’s executives and board members, including CEO Meg Whitman, but they will be free to pursue former officials at Autonomy. Plus, the shareholders’ attorneys will “receive fees for helping HP pursue any further claims” Reuters reports.
My take is that HP wants to convert Sir Michael Lynch from the most successful software entrepreneur in England to scapegoat. The only hitch in the git along is that HP bought a company after Board approval.
Fascinating, but the approach may lead to a fundamental breakthrough in computing. That processing power can be applied to HP decision theory problems. Now that Autonomy IDOL is a cloud service, I assume a computer revolution is not too challenging for HP management.
Stephen E Arnold, June 28, 2014
Enterprise Search Adoption Survey: More Shifting Sand
June 27, 2014
I read “Enterprise Search Adoption Remains Low: Survey.” On one hand the notion that most organizations have trouble finding email like the US Internal Revenue Service is a truism. On the other hand, enterprise search is one of the enterprise applications that promises everything and often disappoints a large percentage of users.
The write up asserts:
It asked 300 enterprise IT security professionals at two major security-themed industry events, and found that only 38% of IT departments have invested in or plan to invest in enterprise search capabilities for a range of reasons topped by security concerns. When asked to choose the biggest obstacle to enterprise search adoption, 68% cited the risk of employees locating and accessing files they should not have permission to view.
The article includes a quote from an outfit called Varonis and a plug for a product with which I am unfamiliar, DatAnswers. The full enterprise search study can be found at http://info.varonis.com/enterprise-search-report
Oh, and the sponsor of the study? Varonis and DatAnswers. Enterprise search is nothing if not consistent in its marketing efforts.
Stephen E Arnold, June 27, 2014
Predicting the Future of Search
June 27, 2014
Enterprise search dates back to the 1960s under IBM, but Google has definitely dictated the average user’s expectations regarding modern day information retrieval. So while the past is important, the future is uncertain and inquiring minds want to know what to expect. Docurated turns to the experts in their article, “Enterprise Search: 14 Industry Experts Predict the Future of Search.”
The article begins:
“We wanted to gain a clearer understanding of current state of the enterprise search industry. Given the steady evolution of enterprise search, we also wanted to gain some insight into what the future may hold. To do so, we gathered a select number of industry experts and asked two simple questions:
1. What is your assessment of today’s enterprise search industry?
2. What do you think the future of ‘search’ will look like?”
The results from the experts are mixed. Few think that the model will change dramatically though many do mention continued innovation in the areas of big data, open source, visual search, and others. And even if all the experts did agree, the future would still be uncertain. Those interested in the future of search should stay tuned in for the latest news as it hits, and just hold on for the ride.
Emily Rae Aldridge, June 27, 2014
Enterprise Graph Search Changes the Retrieval Game
June 27, 2014
New models of information retrieval are emerging, taking the field away from the traditional keyword format. Facebook made quite a stir with its implementation of graph search, and now experts feel it may have implications for the enterprise. Read more in the Synata article, “Enterprise Graph Search: A Game Changer in Information Retrieval.”
The article begins:
“In Facebook Graph Search, results are based on both the content of the user and their friends’ profiles and the relationships between the user and their friends. They’re personalized for the individual user. But what does this have to do with the enterprise? . . . Because of today’s massively scalable infrastructure platforms, API ubiquity, and graph analysis capabilities, the rapid gains in query understanding and information retrieval techniques are about to have resounding implications for enterprise search.”
The graph model allows for extreme relevance, bringing all the floating connections together in a bigger picture, the graph. And while theorists are saying that this type of technology has huge implications, implementation has yet to be realized. Keep an eye out for the breakthrough of graph search. When it hits SharePoint it will have made the mainstream.
Emily Rae Aldridge, June 27, 2014
SonicSearch and EasyAsk: Some Work to Do
June 26, 2014
I read “EasyAsk Helps Sonic Sense Offer Unprecedented Search Flexibility and Accuracy on Magento Site.”
The EasyAsk for Magento solution has allowed Sonic Sense to deliver a much richer user experience with visual Search-as-you-Type, natural language search with highly accurate results and dynamic relevant navigation.
EasyAsk is a better choice than Solr, according to the write up:
“Sonic Sense is another shining example of the dramatic improvements in customer experience that EasyAsk delivers for Magento or any e-commerce site,” said Craig Bassin, EasyAsk CEO. “EasyAsk’s solution is head and shoulders above the SOLR option and other third party search solutions for Magento Enterprise which is proven by the results at Sonic Sense and dozens of Magento customers flocking to EasyAsk.”
I navigated to www.sonicsearch.com and ran some queries. I will boil down my experience to one representative query, and invite you to run your own queries to make sure I did not miss a key point.
My test query was “audio mixer recorder.” I received three results pages. The results on the first page did include audio mixer with recording functions. However, the results on pages 2 and 3 were not relevant. This type of query relaxation allows a company to display more results, giving the impression of a hefty line up of products.
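For readers who have not bumped into query relaxation, the behavior I observed can be sketched in a few lines of Python. This is my guess at the general technique, not EasyAsk’s or Magento’s actual logic. The catalog, the min_results threshold, and the search function are hypothetical. The idea: try a strict match on all terms first, and if too few products come back, pad the list with anything that matches any term.

```python
def search(query, catalog, min_results=3):
    """Strict AND match first; relax to an OR match if too few hits."""
    terms = query.lower().split()

    def words(title):
        return set(title.lower().split())

    # Strict pass: every query term must appear in the product title.
    strict = [t for t in catalog if all(term in words(t) for term in terms)]
    if len(strict) >= min_results:
        return strict

    # Relaxed pass: accept any title sharing at least one term,
    # ranked by how many terms it shares, listed after the strict hits.
    relaxed = [
        t for t in catalog
        if t not in strict and any(term in words(t) for term in terms)
    ]
    relaxed.sort(key=lambda t: -sum(term in words(t) for term in terms))
    return strict + relaxed


catalog = [
    "Portable Audio Mixer Recorder",
    "Studio Audio Mixer",
    "Handheld Recorder",
    "Microphone Cable",
]

for title in search("audio mixer recorder", catalog):
    print(title)
```

Only the first title satisfies the whole query. The other two are padding, which is roughly what pages 2 and 3 looked like to me.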
However, the faceted navigation function did not work. On page three, when I clicked on the option for the two products between $1 and $100, the system did not return a results page.
Response time struck me as sluggish. I did not expect Amazon-type displays, but I found myself wondering about the suitability of the Sonic Sense infrastructure to the demands of the search system.
For more information about EasyAsk, a natural language search system once owned by Progress Software, navigate to www.easyask.com.
Stephen E Arnold, June 26, 2014
Google: Search Is So Yesterday
June 26, 2014
When I learned about Backrub in the late 1990s, I was struck by the cross fertilization of ideas the concept embraced. There were hints of Jon Kleinberg’s Clever, a dusting of the Fuzzy-era Lycos, and the clever optimization methods of Alta Vista.
At the time of the dust up between Google and Yahoo about online advertising, I saw a shift from quasi-objective search results to a pay-to-play approach to information. At the time, the Internet was a novelty for many people. The commercial database search business demonstrated that sophisticated searching was difficult for many trained information professionals. The future, I concluded, was for mass access to users who would never grasp the difference between precision and recall or care much about information provenance and accuracy.
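For anyone fuzzy on the precision and recall distinction, a tiny Python sketch with made-up document identifiers may help. Precision is the share of retrieved items that are relevant; recall is the share of relevant items that were retrieved. The function name and the sample lists are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query's result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: of what came back, how much was on point?
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: of what was on point, how much came back?
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents returned; three documents are actually relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Two of the four returned documents are relevant, and two of the three relevant documents were found. A system can score well on one measure and poorly on the other, which is the difference most users never think about.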
I have followed the write ups about Google’s massive self-promotion conference. As other for-fee conferences struggle, the Google gathering is headline news. Comparing the Google I/O event to a conference focused on search or database technology is like comparing the World Cup to a high school soccer match. Big difference.
A good example of the distance between Backrub (the precursor to Google search) and today’s multi-billion behemoth built on advertising is the story “Everything You Need to Know about Google’s I/O Keynote.” The categorical affirmative is a variation on listicle rhetoric: Get the info quickly.
Here’s the key point I noted:
More than anything else, today’s keynote demonstrated Google’s ambition to take its mobile OS basically everywhere: to your car, to your body, to your television, to your laptop, even into your workplace. We saw a preview of Android’s L release, with a new design—Google calls it material design—that adds some depth features, support for 64 bit, and an enhanced notifications system that will help people interact with their applications in powerful ways without ever launching the app itself.
What’s this tell us about the role of search at Google? This is an easy question. It strikes me that search is a convenient way to generate revenue as Google works overtime to develop a revenue stream to complement online advertising.
For those who expect objective search results, the future of search looks dim. No other vendor has stepped forward and offered an option that provides large numbers of users with objective results. There are some promising systems, but these are largely demonstrations or graduate student projects.
Net net: It will become more difficult to obtain objective search results going forward. For a person who needs accurate, timely information the research job is going to get more difficult in the months ahead.
Good news for those who want information about certain topics, less positive for those who assume that online information is accurate, easy, free, and ubiquitous. Perhaps search results should be displayed with a trigger warning?
Stephen E Arnold, June 26, 2014