August 4, 2014
In 2010, Attensity purchased Biz360. The Beyond Search comment on this deal is at http://bit.ly/1p4were. One of the goslings reminded me that I had not instructed a writer to tackle Attensity’s July 2014 announcement “Attensity Adds to Patent Portfolio for Unstructured Data Analysis Technology.” PR-type “stories” can disappear, but for now you can find a description of “Attensity Adds to Patent Portfolio for Unstructured Data Analysis Technology” at http://reut.rs/1qU8Sre.
My researcher showed me a hard copy of 8,645,395, and I scanned the abstract and claims. The abstract, like many search and content processing inventions, seemed somewhat similar to other text parsing systems and methods. The invention was filed in April 2008, two years before Attensity purchased Biz360, a social media monitoring company. Attensity, as you may know, is a text analysis company founded by Dr. David Bean. Dr. Bean employed various “deep” analytic processes to figure out the meaning of words, phrases, and documents. My limited understanding of Attensity’s methods suggested to me that Attensity’s Bean-centric technology could process text to achieve a similar result. I had a phone call from AT&T regarding the utility of certain Attensity outputs. I assume that the Bean methods required some reinforcement to keep pace with customers’ expectations about Attensity’s Bean-centric system. Neither the goslings nor I are patent attorneys. So after you download 395, seek out a patent attorney and get him/her to explain its mysteries to you.
The abstract states:
A system for evaluating a review having unstructured text comprises a segment splitter for separating at least a portion of the unstructured text into one or more segments, each segment comprising one or more words; a segment parser coupled to the segment splitter for assigning one or more lexical categories to one or more of the one or more words of each segment; an information extractor coupled to the segment parser for identifying a feature word and an opinion word contained in the one or more segments; and a sentiment rating engine coupled to the information extractor for calculating an opinion score based upon an opinion grouping, the opinion grouping including at least the feature word and the opinion word identified by the information extractor.
This invention tackles the Mean Joe Greene of content processing from the point of view of a quite specific type of content: A review. Amazon has quite a few reviews, but the notion of a “shaped” review is a thorny one. (See, for example, http://bit.ly/1pz1q0V.) The invention’s approach identifies words with different roles; some words are “opinion words” and others are “feature words.” By hooking a “sentiment engine” to this indexing operation, the Biz360 invention can generate an “opinion score.” The system uses item, language, training model, feature, opinion, and rating modifier databases. These, I assume, are either maintained by subject matter experts (expensive), smart software working automatically (often evidencing “drift” so results may not be on point), or a hybrid approach (humans cost money).
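For readers who want to see the moving parts, here is a minimal Python sketch of the pipeline the abstract describes: split the review into segments, find feature words and opinion words, and compute an opinion score per feature. It is not the patented method. The word lists, the scoring values, and the function names are my assumptions, and the lexical category (part of speech) step is skipped entirely.

```python
import re

# Hypothetical stand-ins for the patent's feature and opinion databases.
FEATURE_WORDS = {"battery", "screen", "price"}
OPINION_WORDS = {"great": 1.0, "good": 0.5, "poor": -0.5, "terrible": -1.0}


def split_segments(text):
    """Segment splitter: break at sentence punctuation and at 'but'/'and'."""
    parts = re.split(r"[.!?;]|\bbut\b|\band\b", text.lower())
    segments = [re.findall(r"[a-z']+", part) for part in parts]
    return [words for words in segments if words]


def extract_groupings(segment):
    """Information extractor: pair feature words with opinion words in one segment."""
    features = [w for w in segment if w in FEATURE_WORDS]
    opinions = [w for w in segment if w in OPINION_WORDS]
    return [(f, o) for f in features for o in opinions]


def score_review(text):
    """Sentiment rating: average the opinion values attached to each feature."""
    scores = {}
    for segment in split_segments(text):
        for feature, opinion in extract_groupings(segment):
            scores.setdefault(feature, []).append(OPINION_WORDS[opinion])
    return {feature: sum(vals) / len(vals) for feature, vals in scores.items()}


if __name__ == "__main__":
    review = "The battery is great, but the screen is terrible. Price was good."
    print(score_review(review))
```

Note that everything hinges on the two word lists. A word missing from either list is invisible to the scoring, which is why the upkeep of the databases matters.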
The Attensity/Biz360 system relies on a number of knowledge bases. How are these updated? What is the latency between identifying new content and updating the knowledge bases to make the new content available to the user or a software process generating an alert or another type of report?
The 20 claims embrace the components working as a well oiled content analyzer. The claim I noted is that the system’s opinion score uses a positive and negative range. I worked on a sentiment system that made use of a stop light metaphor: red for negative sentiment and green for positive sentiment. When our system could not figure out whether the text was positive or negative we used a yellow light.
The approach, used for a US government project a decade ago, relied on a very simple metaphor to communicate a situation without scores, values, and scales. Image source: http://bit.ly/1tNvkT8
Attensity said, according to the news story cited above:
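The stop light idea is easy to sketch in Python. The function below maps a score on a negative-to-positive range to a color and falls back to yellow when the system cannot decide. The threshold value and the use of `None` for “no score” are assumptions for illustration, not details from our old system or from the patent.

```python
def stoplight(score, threshold=0.25):
    """Map an opinion score in [-1, 1] to a stop light color.

    `threshold` is a hypothetical cutoff; scores near zero, or a missing
    score (None), are treated as undecided and shown as yellow.
    """
    if score is None:
        return "yellow"
    if score >= threshold:
        return "green"
    if score <= -threshold:
        return "red"
    return "yellow"


if __name__ == "__main__":
    for value in (0.8, -0.6, 0.1, None):
        print(value, stoplight(value))
```

The appeal of this design is that the user never sees the number, only the color.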
By splitting the unstructured text into one or more segments, lexical categories can be created and a sentiment-rating engine coupled to the information can now evaluate the opinions for products, services and entities.
Okay, but I think that the splitting of text into segments was a function of iPhrase and search vendors converting unstructured text into XML and then indexing the outputs.
Jonathan Schwartz, General Counsel at Attensity, is quoted in the news story as asserting:
“The issuance of this patent further validates the years of research and affirms our innovative leadership. We expect additional patent issuances, which will further strengthen our broad IP portfolio.”
Okay, this sounds good, but the invention took place prior to Attensity’s owning Biz360. Attensity, therefore, purchased the invention of folks who did not work at Attensity in the period prior to the filing in 2008. I understand that companies buy other companies to get technology and people. I find it interesting that Attensity’s work “validates” Attensity’s research and “affirms” Attensity’s “innovative leadership.”
I would word what the patent delivers and Attensity’s contributions differently. I am no legal eagle or sentiment expert. I do like less marketing razzle dazzle, but I am in the minority on this point.
Net net: Attensity is an interesting company. Will it be able to deliver products that make the licensees’ sentiment score move in a direction that leads to sustaining revenue and generous profits? With the $90 million in funding the company received in 2014, the 14-year-old company will have some work to do to deliver a healthy return to its stakeholders. Expert System, Lexalytics, and others are racing down the same quarter mile drag strip. Which firm will be the winner? Which will blow an engine?
Stephen E Arnold, August 4, 2014
July 31, 2014
I know there are quite a few experts in enterprise search, content processing, and the near mystical Big Data thing. I wanted to point out that if you want to know more about Markov Chains so you can explain how stuff works in most content centric systems with fancy math work, this is for you. Navigate to Setosa Blog and Markov Chains: A Visual Explanation. This one is pretty good. You can poke around for an IBM presentation on the same subject. IBM includes some examples of the way the numerical recipe can assign a probability to an event that is likely to take place.
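If you prefer code to pictures, here is a minimal Python sketch of the numerical recipe. It uses a two-state weather chain, which is my own toy example and not taken from the Setosa or IBM materials. Each step multiplies the current probabilities by the transition table to get the probabilities for the next day.

```python
# Hypothetical transition table: probability of tomorrow's state given today's.
TRANSITIONS = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}


def step(distribution):
    """Advance the probability distribution by one day."""
    nxt = {state: 0.0 for state in TRANSITIONS}
    for state, prob in distribution.items():
        for target, transition_prob in TRANSITIONS[state].items():
            nxt[target] += prob * transition_prob
    return nxt


def forecast(start, days):
    """Probability of each state after `days` steps, starting from `start`."""
    distribution = {state: 0.0 for state in TRANSITIONS}
    distribution[start] = 1.0
    for _ in range(days):
        distribution = step(distribution)
    return distribution


if __name__ == "__main__":
    print(forecast("sunny", 2))
    print(forecast("rainy", 50))
```

Run the chain long enough and the starting state stops mattering. With this table the chance of a sunny day settles near 5/6 whether you start sunny or rainy.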
Stephen E Arnold, July 31, 2014
July 28, 2014
I read “Google Searches Hold Key to Future Market Crashes.” The main idea in my opinion is:
Moat [female big thinker at Warwick Business School] continued, “Our results are in line with the hypothesis that increases in searches relating to both politics and business could be a sign of concern about the state of the economy, which may lead to decreased confidence in the value of stocks, resulting in transactions at lower prices.”
So will the Warwick team cash in on the stock market?
Well, there is a cautionary item as well:
“Our results provide evidence of a relationship between the search behavior of Google users and stock market movements,” said Tobias Preis, Associate Professor of Behavioral Science and Finance at Warwick Business School. “However, our analysis found that the strength of this relationship, using this very simple weekly trading strategy, has diminished in recent years. This potentially reflects the increasing incorporation of Internet data into automated trading strategies, and highlights that more advanced strategies are now needed to fully exploit online data in financial trading.”
Rats. Quants are already on this it seems.
What’s fascinating to me is that the Warwick experts overlooked a couple of points; namely:
- Google is using its own predictive methods to determine what users see when they get a search result based on the behavior of others. Recursion, anyone?
- Google provides more searches with each passing day to those using mobile devices. By their nature, traditional desktop queries are not exactly the same as mobile device searches. As a workaround, Google uses clusters and other methods to give users what Google thinks the user really wants. Advertising, anyone?
- The stock pickers that are the cat’s pajamas at the B school have to demonstrate their acumen on the trading floor. Does insider trading play a role? Does working at a Goldman Sachs-type of firm help a bit?
Like perpetual motion, folks will keep looking for a way to get an edge. Why are large international banks paying some hefty fines? Humans, I believe, not algorithms.
Stephen E Arnold, July 28, 2014
July 21, 2014
I read “Scientific Data Has Become So Complex, We Have to Invent New Math to Deal With It.” My hunch is that this article will become Google spider food with a protein punch.
In my lectures for the police and intelligence community, I review research findings from journals and my work that reveal a little appreciated factoid; to wit: The majority of today’s content processing systems use a fairly narrow suite of numerical recipes that have been embraced for decades by vendors, scientists, mathematicians, and entrepreneurs. Due to computational constraints and limitations of even the slickest of today’s modern computers, processing certain data sets is very difficult and expensive in human, programming, and machine time.
Thus, the similarity among systems comes from several factors.
- The familiar is preferred to the onerous task of finding a slick new way to compute k-means or perform one of the other go-to functions in information processing
- Systems have to deliver certain types of functions in order to make it easy for a procurement team or venture oriented investor to ask, “Does your system cluster?” Answer: Yes. Venture oriented investor responds, “Check.” The procedure accounts for the sameness of the feature lists between Palantir, Recorded Future, and similar systems. When the similarities make companies nervous, litigation results. Example: Palantir versus i2 Ltd. (now a unit of IBM).
- Alternative methods of addressing tasks in content processing exist, but they are tough to implement in today’s computing systems. The technical reason for the reluctance to use some fancy math from my uncle Vladimir Ivanovich Arnold’s mentor Andrey Kolmogorov is that in many applications the computing system cannot complete the computation. The buzzword for this is P=NP? Here’s MIT’s 2009 explanation.
- Savvy researchers have to find a way to get from A to B that works within the constraints of time, confidence level required, and funding.
The Wired article identifies other hurdles; for example, the need for constant updating. A system might be able to compute a solution using fancy math on a right sized data set. But toss in constantly updating information and the computing resources often just keep getting hungrier for more storage, bandwidth, and computational power. Then, the bigger the data, the more the computing system has to shove that data around. As fast as an iPad or modern Dell notebook seems, the friction adds latency to a system. For some analyses, delays can have significant repercussions. Most Big Data systems are not the fleetest of foot.
The Wired article explains how fancy math folks cope with these challenges:
Vespignani uses a wide range of mathematical tools and techniques to make sense of his data, including text recognition. He sifts through millions of tweets looking for the most relevant words to whatever system he is trying to model. DeDeo adopted a similar approach for the Old Bailey archives project. His solution was to reduce his initial data set of 100,000 words by grouping them into 1,000 categories, using key words and their synonyms. “Now you’ve turned the trial into a point in a 1,000-dimensional space that tells you how much the trial is about friendship, or trust, or clothing,” he explained.
Wired labels this approach as “piecemeal.”
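The grouping trick DeDeo describes can be sketched in a few lines of Python. The three categories and their word lists below are invented for illustration. The real project used roughly 1,000 categories, and I do not know how its synonym lists were built. The output is a vector with one number per category: the share of category words in the text that belong to that category.

```python
import re
from collections import Counter

# Hypothetical categories; the Old Bailey work used about 1,000 of these.
CATEGORIES = {
    "friendship": {"friend", "companion", "mate"},
    "trust": {"trust", "promise", "oath"},
    "clothing": {"coat", "hat", "shirt"},
}
WORD_TO_CATEGORY = {
    word: category for category, words in CATEGORIES.items() for word in words
}


def category_vector(text):
    """Turn a text into a point with one coordinate per category."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(WORD_TO_CATEGORY[w] for w in words if w in WORD_TO_CATEGORY)
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in CATEGORIES]


if __name__ == "__main__":
    print(list(CATEGORIES))
    print(category_vector("His friend stole a coat and a hat"))
```

The catch is the same one that applies to any controlled vocabulary: words outside the lists are simply dropped.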
The fix? Wired reports:
the big data equivalent of a Newtonian revolution, on par with the 17th century invention of calculus, which he [Yalie mathematician Ronald Coifman] believes is already underway.
Topological analyses and sparsity may offer a path forward.
The kicker in the Wired story is the use of the phrase “tractable computational techniques.” The notion of “new math” is an appealing one.
For the near future, the focus will be on optimization of methods that can be computed on today’s gizmos. One widely used method in Autonomy, Recommind, and many other systems originates with the Reverend Thomas Bayes, who died in 1761. My relative died in 2010. I understand there were some promising methods developed after Kolmogorov died in 1987.
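For those who have not looked at Bayes since school, the core recipe fits in one function. This Python sketch is the textbook rule only. It makes no claim about how Autonomy or Recommind implement it, and the probabilities in the example are made up.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """Bayes' rule: P(H | E) from P(H), P(E | H), and P(E | not H)."""
    evidence = likelihood * prior + likelihood_if_not * (1.0 - prior)
    return likelihood * prior / evidence


if __name__ == "__main__":
    # Hypothetical numbers: 20 percent of documents are relevant; a given
    # word appears in 60 percent of relevant documents and in 10 percent
    # of the rest. How likely is a document containing the word to be relevant?
    print(posterior(prior=0.2, likelihood=0.6, likelihood_if_not=0.1))
```

The arithmetic is cheap, which is one reason an 18th century method still runs comfortably on today’s gizmos.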
Inventing new math is underway. The question is, “When will computing systems become available to use these methods without severe sampling limitations?” In the meantime, Big Data keep on rolling in, possibly mis-analyzed and contributing to decisions with unacceptable levels of risk.
Stephen E Arnold, July 21, 2014
July 21, 2014
The article titled Text Analytics Company Linguamatics Boosts Enterprise Search with Semantic Enrichment on MarketWatch discusses the launch of I2E Semantic Enrichment from Linguamatics. The new release allows for the mining of a variety of texts, from scientific literature to patents to social media. It promises faster, more relevant search for users. The article states,
“Enterprise search engines consume this enriched metadata to provide a faster, more effective search for users. I2E uses natural language processing (NLP) technology to find concepts in the right context, combined with a range of other strategies including application of ontologies, taxonomies, thesauri, rule-based pattern matching and disambiguation based on context. This allows enterprise search engines to gain a better understanding of documents in order to provide a richer search experience and increase findability, which enables users to spend less time on search.”
Whether they are spinning semantics for search, or if it is search spun for semantics, Linguamatics has made their technology available to tens of thousands of users of enterprise search. Representative John M. Brimacombe was straightforward in his comments about the disappointment surrounding enterprise search, but optimistic about I2E. It is currently being used by many top organizations, as well as the Food and Drug Administration.
Chelsea Kerwin, July 21, 2014
July 8, 2014
I read an interview conducted by the consulting firm PWC. The interview appeared with the title “Making Hadoop Suitable for Enterprise Data Science.” The interview struck me as important for two reasons. First, the questioner and the interview subject introduce a number of buzzwords and business generalizations that will be bandied about in the near future. Second, the interview provides a glimpse of the fish with sharp teeth that swim in what seems to be a halcyon data lake. With Hadoop goodness replenishing the “data pond,” Big Data is a life sustaining force. That’s the theory.
The interview subject is Mike Lang, the CEO of Revelytix. (I am not familiar with Revelytix, and I don’t know how to pronounce the company’s name.) The interviewer is one of those tag teams that high end consulting firms deploy to generate “real” information. Big time consulting firms publish magazines, emulating the McKinsey Quarterly. The idea is that Big Ideas need to be explained so that MBAs can convert information into anxiety among prospects. The purpose of these bespoke business magazines is to close deals and highlight technologies that may be recommended to a consulting firm’s customers. Some quasi consulting firms borrow other people’s work. For an example of this short cut approach, see the IDC Schubmehl write up.
Several key buzzwords appear in the interview:
- Nimble. Once data are in Hadoop, the Big Data software system has to be quick and light in movement or action. Sounds very good, especially for folks dealing with Big Data. So with Hadoop one has to use “nimble analytics.” Also, sounds good. I am not sure what a “nimble analytic” is, but, hey, do not slow down generality machines with details, please.
- Data lakes. These are “pools” of data from different sources. Once data is in a Hadoop “data lake”, every water or data molecule is the same. It’s just like chemistry sort of…maybe.
- A dump. This is a mixed metaphor, but it seems that PWC wants me to put my heterogeneous data which is now like water molecules in a “dump”. Mixed metaphor is it not? Again. A mere detail. A data lake has dumps or a dump has data lakes. I am not sure which has what. Trivial and irrelevant, of course.
- Data schema. To make data fit a schema with an old fashioned system like Oracle, it takes time. With a data lake and a dump, someone smashes up data and shapes it. Here’s the magic: “They might choose one table and spend quite a bit of time understanding and cleaning up that table and getting the data into a shape that can be used in their tool. They might do that across three different files in HDFS [Hadoop Distributed File System]. But, they clean it as they’re developing their model, they shape it, and at the very end both the model and the schema come together to produce the analytics.” Yep, magic.
- Predictive analytics, not just old boring statistics. The idea is that with a “large scale data lake”, someone can make predictions. Here’s some color on predictive analytics: “This new generation of processing platforms focuses on analytics. That problem right there is an analytical problem, and it’s predictive in its nature. The tools to help with that are just now emerging. They will get much better about helping data scientists and other users. Metadata management capabilities in these highly distributed big data platforms will become crucial—not nice-to-have capabilities, but I-can’t-do-my-work-without-them capabilities. There’s a sea of data.”
My take is that PWC is going to bang the drum for Hadoop. Never mind that Hadoop may not be the Swiss Army knife that some folks want it to be. I don’t want to rain on the parade, but Hadoop requires some specialized skills. Fancy math requires more specialized skills. Interpretation of the outputs from data lakes and predictive systems requires even more specialized skills.
No problem as long as the money lake is sufficiently deep, broad, and full.
The search for a silver bullet continues. That’s what makes search and content processing so easy. Unfortunately the buzzwords may not deliver the type of results that inform decisions. Fill that money lake because it feeds the dump.
Stephen E Arnold, July 7, 2014
July 7, 2014
The data-analysis work of recently prominent economist Thomas Piketty receives another whack, this time from computer scientist and blogger Daniel Lemire in, “You Shouldn’t Use a Spreadsheet for Important Work (I Mean It).” Piketty is not alone in Lemire’s reproach; last year, he took Harvard-based economists Carmen Reinhart and Kenneth Rogoff to task for building their influential 2010 paper on an Excel spreadsheet.
The article begins by observing that Piketty’s point, that in today’s world the rich get richer and the poor poorer, is widely made but difficult to prove. Though he seems to applaud Piketty’s attempt to do so, Lemire really wishes the economist had chosen specialized software, like STATA, SAS, or “even” R or Fortran. He writes:
“What is remarkable regarding Piketty’s work, is that he backed his work with comprehensive data and thorough analysis. Unfortunately, like too many people, Piketty used speadsheets instead of writing sane software. On the plus side, he published his code… on the negative side, it appears that Piketty’s code contains mistakes, fudging and other problems….
“I will happily use a spreadsheet to estimate the grades of my students, my retirement savings, or how much tax I paid last year… but I will not use Microsoft Excel to run a bank or to compute the trajectory of the space shuttle. Spreadsheets are convenient but error prone. They are at their best when errors are of little consequence or when problems are simple. It looks to me like Piketty was doing complicated work and he bet his career on the accuracy of his results.”
The write-up notes that Piketty admits there are mistakes in his work, but asserts they are “probably inconsequential.” That’s missing the point, says Lemire, who insists that a responsible data analyst would have taken more time to ensure accuracy. My parents always advised me to use the right tool for a job: that initial choice can make a big difference in the outcome. It seems economists may want to heed that common (and common sense) advice.
Cynthia Murrell, July 07, 2014
June 30, 2014
I returned from a brief visit to Europe to an email asking about Rocket Software’s breakthrough technology AeroText. I poked around in my archive and found a handful of nuggets about the General Electric Laboratories’ technology that migrated to Martin Marietta, then to Lockheed Martin, and finally in 2008 to the low profile Rocket Software, an IBM partner.
When did the text extraction software emerge? Is Rocket Software AeroText a “new kid on the block”? The short answer is that AeroText is pushing 30, maybe 35 years young.
Digging into My Archive of Search Info
As far as my archive goes, it looks as though the roots of AeroText are anchored in the 1980s. Yep, that works out to an innovation about the same age as the long in the tooth ISYS Search system, now owned by Lexmark. Over the years, the AeroText “product” has evolved, often in response to US government funding opportunities. The precursor to AeroText was an academic exercise at General Electric. Keep in mind that GE makes jet engines, so GE at one time had a keen interest in anything its aerospace customers in the US government thought was a hot tamale.
The AeroText interface circa mid 2000. On the left is the extraction window. On the right is the document window. From “Information Extraction Tools: Deciphering Human Language,” IT Pro, November/December 2004, page 28.
The GE project, according to my notes, appeared as NLToolset, although my files contained references to different descriptions such as Shogun. GE’s team of academics and “real” employees developed a bundle of tools for its aerospace activities and in response to Tipster. (As a side note, in 2001, there were a number of Tipster related documents in the www.firstgov.gov system. But the new www.usa.gov index does not include that information. You will have to do your own searching to unearth these text processing jump start documents.)
The aerospace connection is important because the Department of Defense in the 1980s was trying to standardize on markup for documents. Part of this effort was processing content like technical manuals and various types of unstructured content to figure out who was named, what part was what, and what people, places, events, and things were mentioned in digital content. The utility of NLToolset type software was for cost reduction associated with documents and the intelligence value of processed information.
The need for a markup system that worked without 100 percent human indexing was important. GE got with the program and appears to have assigned some then-young folks to the project. The government speak for this type of content processing involves terms like “message understanding” or MU, “entity extraction,” and “relationship mapping.” The outputs of an NLToolset system were intended for use in other software subsystems that could count, process, and perform other operations on the tagged content. Today, this class of software would be packaged under a broad term like “text mining.” GE exited the business, which ended up in the hands of Martin Marietta. When the technology landed at Martin Marietta, the suite of tools was used in what was called, in the late 1980s and early 1990s, the Louella Parsing System. When Lockheed and Martin Marietta merged to form the giant Lockheed Martin, Louella was renamed AeroText.
Over the years, the AeroText system competed with LingPipe, SRA’s NetOwl and Inxight’s tools. In the heyday of natural language processing, there were dozens and dozens of universities and start ups competing for Federal funding. I have mentioned in other articles the importance of the US government in jump starting the craziness in search and content processing.
In 2005, I recall that Lockheed Martin released AeroText 5.1 for Linux, but I have lost track of the open source versions of the system. The point is that AeroText is not particularly new, and as far as I know, the last major upgrade took place in 2007 before Lockheed Martin sold the property to Rocket Software. At the time of the sale, AeroText incorporated a number of subsystems, including a useful time plotting feature. A user could see tagged events on a timeline, a function long associated with the original version of i2’s Analyst’s Notebook. A US government buyer can obtain AeroText via the GSA because Lockheed Martin seems to be a reseller of the technology. Before the sale to Rocket, Lockheed Martin followed SAIC’s push into Australia. Lockheed signed up NetMap Analytics to handle Australia’s appetite for US government accepted systems.
What does AeroText purport to do that caused the person who contacted me to see a 1980s technology as the next best thing to sliced bread?
AeroText is an extraction tool; that is, it has capabilities to identify and tag entities at somewhere between 50 percent and 80 percent accuracy. (See NIST 2007 Automatic Content Extraction Evaluation Official Results for more detail.)
The AeroText approach uses knowledgebases, rules, and patterns to identify and tag pre-specified types of information. AeroText references patterns and templates, both of which assume the licensee knows beforehand what is needed and what will happen to processed content.
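To make the knowledgebase-plus-patterns idea concrete, here is a minimal Python sketch of rule-based tagging. It is not AeroText’s rule language or API, which I do not have in front of me. The entity list, the two regular expressions, and the function name are assumptions for illustration.

```python
import re

# Hypothetical knowledgebase: names the licensee already knows about.
KNOWLEDGE_BASE = {
    "Lockheed Martin": "ORGANIZATION",
    "General Electric": "ORGANIZATION",
}

# Hypothetical patterns: capitalized words ending in a company suffix, and years.
PATTERNS = {
    "ORGANIZATION": re.compile(r"\b(?:[A-Z][a-z]+ )+(?:Software|Corporation|Inc)\b"),
    "DATE": re.compile(r"\b(?:19|20)\d{2}\b"),
}


def tag_entities(text):
    """Return (entity, label) pairs in the order they appear in the text."""
    found = []
    for name, label in KNOWLEDGE_BASE.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), label))
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    return [(entity, label) for _, entity, label in sorted(found)]


if __name__ == "__main__":
    print(tag_entities("Lockheed Martin sold AeroText to Rocket Software in 2008."))
```

In the sample sentence, “AeroText” is not tagged. No list entry or pattern anticipates it, so the system does not see it.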
In my view, the licensee has to know what he or she is looking for in order to find it. This is a problem captured in the famous snippet, “You don’t know what you don’t know” and the “unknown unknowns” variation popularized by Donald Rumsfeld. Obviously without prior knowledge the utility of an AeroText-type of system has to be matched to mission requirements. AeroText pounded the drum for the semantic Web revolution. One of AeroText’s key functions was its ability to perform the type of markup the Department of Defense required of its XML. The US DoD used a variant called DAML or DARPA Agent Markup Language. Natural language processing, Louella, and AeroText collected the dust of SPARQL, unifying logic, RDF, OWL, ontologies, and other semantic baggage as the system evolved through time.
Also, staff (headcount) and on-going services are required to keep a Louella/AeroText-type system generating relevant and usable outputs. AeroText can find entities, figure out relationships like person to person and person to organization, and tag events like a merger or an arrest “event.” In one briefing about AeroText I attended, I recall that the presenter emphasized that AeroText did not require training. (The subtext for those in the know was that Autonomy required training to deliver actionable outputs.) The presenter did not dwell on the need for manual fiddling with AeroText’s knowledgebases and I did not raise this issue.
June 24, 2014
HP Autonomy has undergone a redesign, or as HP phrases it, a rebirth. HP is ready to make the unveiling official, and those interested can read about the details in the article, “Analytics for Human Information: HP IDOL 10.6 Just Released: A Story of Something Bigger.”
The article begins:
“Under the direction of SVP and General Manager Robert Youngjohns, this past year has been a time of transformation for HP Autonomy—with a genuine commitment to customer satisfaction, breakthrough technological innovation, and culture of transparency. Internally, to emphasize the importance of this fresh new thinking and business approach, we refer to this change as #AutonomyReborn.”
Quarterly releases are promising rapid updates, and open source integration is front and center. Current users and interested new users can download the latest version from the customer support site.
Emily Rae Aldridge, June 24, 2014
June 19, 2014
HP says that it has been spending the past year rebuilding Autonomy into a flagship, foundational technology for HP IDOL 10. HP discusses the new changes in “Analytics For Human Information: HP IDOL 10.6 Just Released A Story Of Something Bigger.” Autonomy had problems in the past when its capabilities of organizing and analyzing unstructured information were called into question after HP purchased it. HP claims that under its guidance HP IDOL 10 is drastically different from its previous incarnations:
“HP IDOL 10, released under HP’s stewardship, reflects in many ways the transformation that has occurred under HP. IDOL 10 is fundamentally different from Autonomy IDOL 7 in the same way that HP Autonomy as a company differs pre- and post- acquisition. They may share the name IDOL, but the differences are so vast from both strategic and technology points-of-view that we consider IDOL 10 a wholly new product from IDOL 7, and not just a version update. HP sees IDOL as a strategic pillar of HAVEn – HP’s comprehensive big data platform – and isn’t shy to use its vast R&D resources to invest heavily into the technology.”
Some of the changes include automatic time zone conversion, removal of sensitive or offensive material, and better site administration. All clients who currently have an IDOL support contract will be able to download the upgrade free of charge.
HP really wants to be in the headlines for some positive news, instead of lawsuits. They are still reeling from the Autonomy purchase flub and now they are working on damage control. How long will they be doing that? Something a bit more impressive than a filter and time zone conversion is called for to sound the trumpets.