July 28, 2014
I read “Google Searches Hold Key to Future Market Crashes.” The main idea, in my opinion, is:
Moat [female big thinker at Warwick Business School] continued, “Our results are in line with the hypothesis that increases in searches relating to both politics and business could be a sign of concern about the state of the economy, which may lead to decreased confidence in the value of stocks, resulting in transactions at lower prices.”
So will the Warwick team cash in on the stock market?
Well, there is a cautionary item as well:
“Our results provide evidence of a relationship between the search behavior of Google users and stock market movements,” said Tobias Preis, Associate Professor of Behavioral Science and Finance at Warwick Business School. “However, our analysis found that the strength of this relationship, using this very simple weekly trading strategy, has diminished in recent years. This potentially reflects the increasing incorporation of Internet data into automated trading strategies, and highlights that more advanced strategies are now needed to fully exploit online data in financial trading.”
Rats. Quants are already on this it seems.
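For the curious, the “very simple weekly trading strategy” the researchers describe can be sketched in a few lines. This is a hedged reconstruction from the article’s description, not the Warwick team’s code; the window size and the buy/sell rule are my assumptions.

```python
def weekly_strategy(search_volume, prices, window=3):
    """Toy version of the reported strategy: if this week's search
    volume exceeds its trailing average, assume rising concern and
    go short; otherwise go long. Returns the cumulative toy return."""
    total = 0.0
    for t in range(window, len(prices) - 1):
        trailing = sum(search_volume[t - window:t]) / window
        weekly_change = prices[t + 1] - prices[t]
        # Rising searches -> short the next week; falling -> long.
        total += -weekly_change if search_volume[t] > trailing else weekly_change
    return total
```

On data where search spikes precede price drops, the toy strategy profits; on recent data, per Preis, it would not.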
What’s fascinating to me is that the Warwick experts overlooked a couple of points; namely:
- Google is using its own predictive methods to determine what users see when they get a search result based on the behavior of others. Recursion, anyone?
- Google provides more searches with each passing day to those using mobile devices. By their nature, traditional desktop queries are not exactly the same as mobile device searches. As a workaround, Google uses clusters and other methods to give users what Google thinks the user really wants. Advertising, anyone?
- The stock pickers that are the cat’s pajamas at the B school have to demonstrate their acumen on the trading floor. Does insider trading play a role? Does working at a Goldman Sachs-type of firm help a bit?
Like perpetual motion, folks will keep looking for a way to get an edge. Why are large international banks paying some hefty fines? Humans, I believe, not algorithms.
Stephen E Arnold, July 28, 2014
July 21, 2014
I read “Scientific Data Has Become So Complex, We Have to Invent New Math to Deal With It.” My hunch is that this article will become Google spider food with a protein punch.
In my lectures for the police and intelligence community, I review research findings from journals and my work that reveal a little appreciated factoid; to wit: The majority of today’s content processing systems use a fairly narrow suite of numerical recipes that have been embraced for decades by vendors, scientists, mathematicians, and entrepreneurs. Due to computational constraints and limitations of even the slickest of today’s modern computers, processing certain data sets is very difficult and expensive in terms of human effort, programming, and machine time.
Thus, the similarity among systems comes from several factors.
- The familiar is preferred to the onerous task of finding a slick new way to compute k-means or perform one of the other go-to functions in information processing
- Systems have to deliver certain types of functions in order to make it easy for a procurement team or venture oriented investor to ask, “Does your system cluster?” Answer: Yes. Venture oriented investor responds, “Check.” The procedure accounts for the sameness of the feature lists between Palantir, Recorded Future, and similar systems. When the similarities make companies nervous, litigation results. Example: Palantir versus i2 Ltd. (now a unit of IBM).
- Alternative methods of addressing tasks in content processing exist, but they are tough to implement in today’s computing systems. The technical reason for the reluctance to use some fancy math from my uncle Vladimir Ivanovich Arnold’s mentor Andrey Kolmogorov is that in many applications the computing system cannot complete the computation. The buzzword for this is the P=NP? problem. Here’s MIT’s 2009 explanation.
- Savvy researchers have to find a way to get from A to B that works within the constraints of time, confidence level required, and funding.
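For readers who have not met the go-to function mentioned above, here is a bare-bones k-means in plain Python. Production systems use optimized libraries; this sketch only illustrates the method’s two alternating steps.

```python
import random

def kmeans(points, k, iters=10, seed=0):
    """Minimal k-means on tuples: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Nearest centroid by squared Euclidean distance.
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        # Recompute each centroid; keep the old one if a cluster empties.
        centroids = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl
                     else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids
```

Two tight groups of points converge to two centroids in a couple of iterations, which is exactly why the familiar recipe stays popular.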
The Wired article identifies other hurdles; for example, the need for constant updating. A system might be able to compute a solution using fancy math on a right sized data set. But toss in constantly updating information, and the computing resources just keep getting hungrier for more storage, bandwidth, and computational power. Then, the bigger the data, the more data the computing system has to shove around. As fast as an iPad or a modern Dell notebook seems, the friction adds latency to a system. For some analyses, delays can have significant repercussions. Most Big Data systems are not the fleetest of foot.
The Wired article explains how fancy math folks cope with these challenges:
Vespignani uses a wide range of mathematical tools and techniques to make sense of his data, including text recognition. He sifts through millions of tweets looking for the most relevant words to whatever system he is trying to model. DeDeo adopted a similar approach for the Old Bailey archives project. His solution was to reduce his initial data set of 100,000 words by grouping them into 1,000 categories, using key words and their synonyms. “Now you’ve turned the trial into a point in a 1,000-dimensional space that tells you how much the trial is about friendship, or trust, or clothing,” he explained.
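DeDeo’s reduction, turning a trial into a point in a 1,000-dimensional category space, can be sketched in a few lines. The category names and synonym sets below are invented for illustration; his actual 1,000 categories are not public in this article.

```python
def doc_to_vector(words, categories):
    """Map a document (a list of words) to one coordinate per category:
    how many of the document's words fall in that category's synonym set."""
    return [sum(w in synonyms for w in words)
            for synonyms in categories.values()]

# Two categories instead of DeDeo's 1,000; same idea.
categories = {
    "friendship": {"friend", "companion", "mate"},
    "clothing": {"coat", "hat", "breeches"},
}
point = doc_to_vector(["friend", "coat", "hat", "judge"], categories)
```

The resulting vector says “how much the trial is about” each category, which is the quoted claim in miniature.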
Wired labels this approach as “piecemeal.”
The fix? Wired reports:
the big data equivalent of a Newtonian revolution, on par with the 17th century invention of calculus, which he [Yalie mathematician Ronald Coifman] believes is already underway.
Topological analyses and sparsity may offer a path forward.
The kicker in the Wired story is the use of the phrase “tractable computational techniques.” The notion of “new math” is an appealing one.
For the near future, the focus will be on optimization of methods that can be computed on today’s gizmos. One widely used method in Autonomy, Recommind, and many other systems originates with Sir Thomas Bayes, who died in 1761. My relative died in 2010. I understand there were some promising methods developed after Kolmogorov died in 1987.
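The Bayesian machinery these systems rely on boils down to classifying by conditional probabilities. Here is a minimal multinomial naive Bayes sketch; this is the textbook method, not Autonomy’s proprietary implementation.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (label, words). Returns the counts a naive Bayes
    classifier needs, with the vocabulary for add-one smoothing."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for label, words in docs:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(words, word_counts, label_counts, vocab):
    """Pick the label maximizing log P(label) + sum log P(word|label)."""
    best, best_score = None, float("-inf")
    total_docs = sum(label_counts.values())
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best
```

The math is 18th century; the engineering challenge is running it at scale on constantly updating content.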
Inventing new math is underway. The question is, “When will computing systems become available to use these methods without severe sampling limitations?” In the meantime, Big Data keep on rolling in, possibly mis-analyzed and contributing to decisions with unacceptable levels of risk.
Stephen E Arnold, July 21, 2014
July 21, 2014
The article titled “Text Analytics Company Linguamatics Boosts Enterprise Search with Semantic Enrichment” on MarketWatch discusses the launch of I2E Semantic Enrichment from Linguamatics. The new release allows for the mining of a variety of texts, from scientific literature to patents to social media. It promises faster, more relevant search for users. The article states,
“Enterprise search engines consume this enriched metadata to provide a faster, more effective search for users. I2E uses natural language processing (NLP) technology to find concepts in the right context, combined with a range of other strategies including application of ontologies, taxonomies, thesauri, rule-based pattern matching and disambiguation based on context. This allows enterprise search engines to gain a better understanding of documents in order to provide a richer search experience and increase findability, which enables users to spend less time on search.”
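The quoted mix of NLP, thesauri, and rule-based pattern matching can be illustrated with a toy enrichment pass. The thesaurus entries and the dose pattern below are invented; I2E’s actual ontologies and rules are proprietary.

```python
import re

# Hypothetical mini-thesaurus mapping surface terms to one concept.
THESAURUS = {
    "acetylsalicylic acid": "aspirin",
    "asa": "aspirin",
    "aspirin": "aspirin",
}

# Rule-based pattern: a number followed by "mg".
DOSE_PATTERN = re.compile(r"\b(\d+)\s*mg\b", re.IGNORECASE)

def enrich(text):
    """Return (normalized drug concepts, doses) found in text — a toy
    version of thesaurus lookup plus rule-based pattern matching."""
    lowered = text.lower()
    concepts = sorted({canon for term, canon in THESAURUS.items()
                       if term in lowered})
    doses = [int(m.group(1)) for m in DOSE_PATTERN.finditer(text)]
    return concepts, doses
```

A real system adds disambiguation by context (is “ASA” the drug or the standards body?); this sketch happily matches substrings, which is exactly the kind of naivety the enrichment products claim to fix.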
Whether they are spinning semantics for search, or search spun for semantics, Linguamatics has made its technology available to tens of thousands of enterprise search users. Linguamatics representative John M. Brimacombe was straightforward in his comments about the disappointment surrounding enterprise search, but optimistic about I2E. It is currently being used by many top organizations, as well as the Food and Drug Administration.
Chelsea Kerwin, July 21, 2014
July 8, 2014
I read an interview conducted by the consulting firm PWC. The interview appeared with the title “Making Hadoop Suitable for Enterprise Data Science.” The interview struck me as important for two reasons. First, the questioner and the interview subject introduce a number of buzzwords and business generalizations that will be bandied about in the near future. Second, the interview provides a glimpse of the fish with sharp teeth that swim in what seems to be a halcyon data lake. With Hadoop goodness replenishing the “data pond,” Big Data is a life sustaining force. That’s the theory.
The interview subject is Mike Lang, the CEO of Revelytix. (I am not familiar with Revelytix, and I don’t know how to pronounce the company’s name.) The interviewer is one of those tag teams that high end consulting firms deploy to generate “real” information. Big time consulting firms publish magazines, emulating the McKinsey Quarterly. The idea is that Big Ideas need to be explained so that MBAs can convert information into anxiety among prospects. The purpose of these bespoke business magazines is to close deals and highlight technologies that may be recommended to a consulting firm’s customers. Some quasi consulting firms borrow other people’s work. For an example of this short cut approach, see the IDC Schubmehl write up.
Several key buzzwords appear in the interview:
- Nimble. Once data are in Hadoop, the Big Data software system has to be quick and light in movement or action. Sounds very good, especially for folks dealing with Big Data. So with Hadoop one has to use “nimble analytics.” Also sounds good. I am not sure what a “nimble analytic” is, but, hey, do not slow down generality machines with details, please.
- Data lakes. These are “pools” of data from different sources. Once data is in a Hadoop “data lake”, every water or data molecule is the same. It’s just like chemistry sort of…maybe.
- A dump. This is a mixed metaphor, but it seems that PWC wants me to put my heterogeneous data, which is now like water molecules, in a “dump.” A mixed metaphor, is it not? Again, a mere detail. A data lake has dumps, or a dump has data lakes. I am not sure which has what. Trivial and irrelevant, of course.
- Data schema. To make data fit a schema with an old fashioned system like Oracle, it takes time. With a data lake and a dump, someone smashes up data and shapes it. Here’s the magic: “They might choose one table and spend quite a bit of time understanding and cleaning up that table and getting the data into a shape that can be used in their tool. They might do that across three different files in HDFS [Hadoop Distributed File System]. But, they clean it as they’re developing their model, they shape it, and at the very end both the model and the schema come together to produce the analytics.” Yep, magic.
- Predictive analytics, not just old boring statistics. The idea is that with a “large scale data lake”, someone can make predictions. Here’s some color on predictive analytics: “This new generation of processing platforms focuses on analytics. That problem right there is an analytical problem, and it’s predictive in its nature. The tools to help with that are just now emerging. They will get much better about helping data scientists and other users. Metadata management capabilities in these highly distributed big data platforms will become crucial—not nice-to-have capabilities, but I-can’t-do-my-work-without-them capabilities. There’s a sea of data.”
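The schema-on-read idea in the “data schema” bullet can be sketched without Hadoop at all: leave heterogeneous records raw in the “lake” and impose the (user, spend) schema only when each record is consumed. The record formats below are invented for illustration.

```python
import csv
import io
import json

# A tiny "data lake": JSON lines and CSV lines mixed together, as dumped.
RAW = """{"user": "alice", "spend": "12.50"}
bob,7.25
{"user": "carol", "spend": "3.00"}
"""

def read_records(raw):
    """Schema-on-read: each line keeps its native format until the
    moment of analysis, when it is shaped into (user, spend)."""
    for line in raw.strip().splitlines():
        if line.startswith("{"):
            rec = json.loads(line)
            yield rec["user"], float(rec["spend"])
        else:
            user, spend = next(csv.reader(io.StringIO(line)))
            yield user, float(spend)

records = list(read_records(RAW))
total = sum(spend for _, spend in records)
```

The “magic” in the quote is just this deferral, done across HDFS files instead of a string, with all the cleanup effort the interview glosses over.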
My take is that PWC is going to bang the drum for Hadoop. Never mind that Hadoop may not be the Swiss Army knife that some folks want it to be. I don’t want to rain on the parade, but Hadoop requires some specialized skills. Fancy math requires more specialized skills. Interpretation of the outputs from data lakes and predictive systems requires even more specialized skills.
No problem as long as the money lake is sufficiently deep, broad, and full.
The search for a silver bullet continues. That’s what makes search and content processing so easy. Unfortunately the buzzwords may not deliver the type of results that inform decisions. Fill that money lake because it feeds the dump.
Stephen E Arnold, July 7, 2014
July 7, 2014
The data-analysis work of recently prominent economist Thomas Piketty receives another whack, this time from computer scientist and blogger Daniel Lemire in “You Shouldn’t Use a Spreadsheet for Important Work (I Mean It).” Piketty is not alone in Lemire’s reproach; last year, he took Harvard-based economists Carmen Reinhart and Kenneth Rogoff to task for building their influential 2010 paper on an Excel spreadsheet.
The article begins by observing that Piketty’s point, that in today’s world the rich get richer and the poor poorer, is widely made but difficult to prove. Though he seems to applaud Piketty’s attempt to do so, Lemire really wishes the economist had chosen specialized software, like STATA, SAS, or “even” R or Fortran. He writes:
“What is remarkable regarding Piketty’s work, is that he backed his work with comprehensive data and thorough analysis. Unfortunately, like too many people, Piketty used spreadsheets instead of writing sane software. On the plus side, he published his code… on the negative side, it appears that Piketty’s code contains mistakes, fudging and other problems….
“I will happily use a spreadsheet to estimate the grades of my students, my retirement savings, or how much tax I paid last year… but I will not use Microsoft Excel to run a bank or to compute the trajectory of the space shuttle. Spreadsheets are convenient but error prone. They are at their best when errors are of little consequence or when problems are simple. It looks to me like Piketty was doing complicated work and he bet his career on the accuracy of his results.”
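Lemire’s “sane software” point is that code can state and check its own invariants, where a buried spreadsheet cell cannot. A minimal illustration, with invented figures:

```python
def wealth_share(top_wealth, total_wealth):
    """Share of total wealth held by the top group, written as a
    function so the calculation can be validated and reused."""
    if total_wealth <= 0:
        raise ValueError("total wealth must be positive")
    if top_wealth > total_wealth:
        raise ValueError("top group cannot hold more than the total")
    return top_wealth / total_wealth

# A spreadsheet would happily average the wrong cell range;
# code can refuse nonsensical inputs and be unit tested.
assert wealth_share(30, 100) == 0.3
```

Nothing here is beyond a spreadsheet formula, but the checks, the naming, and the testability are what distinguish “sane software” from a grid of cells.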
The write-up notes that Piketty admits there are mistakes in his work, but asserts they are “probably inconsequential.” That’s missing the point, says Lemire, who insists that a responsible data analyst would have taken more time to ensure accuracy. My parents always advised me to use the right tool for a job: that initial choice can make a big difference in the outcome. It seems economists may want to heed that common (and common sense) advice.
Cynthia Murrell, July 07, 2014
June 30, 2014
I returned from a brief visit to Europe to an email asking about Rocket Software’s breakthrough technology AeroText. I poked around in my archive and found a handful of nuggets about the General Electric Laboratories’ technology that migrated to Martin Marietta, then to Lockheed Martin, and finally in 2008 to the low profile Rocket Software, an IBM partner.
When did the text extraction software emerge? Is Rocket Software AeroText a “new kid on the block”? The short answer is that AeroText is pushing 30, maybe 35 years young.
Digging into My Archive of Search Info
As far as my archive goes, it looks as though the roots of AeroText are anchored in the 1980s. Yep, that works out to an innovation about the same age as the long in the tooth ISYS Search system, now owned by Lexmark. Over the years, the AeroText “product” has evolved, often in response to US government funding opportunities. The precursor to AeroText was an academic exercise at General Electric. Keep in mind that GE makes jet engines, so GE at one time had a keen interest in anything its aerospace customers in the US government thought was a hot tamale.
The AeroText interface circa mid 2000. On the left is the extraction window. On the right is the document window. From “Information Extraction Tools: Deciphering Human Language,” IT Pro, November-December 2004, page 28.
The GE project, according to my notes, appeared as NLToolset, although my files contained references to different descriptions such as Shogun. GE’s team of academics and “real” employees developed a bundle of tools for its aerospace activities and in response to Tipster. (As a side note, in 2001, there were a number of Tipster related documents in the www.firstgov.gov system. But the new www.usa.gov index does not include that information. You will have to do your own searching to unearth these text processing jump start documents.)
The aerospace connection is important because the Department of Defense in the 1980s was trying to standardize on markup for documents. Part of this effort was processing content like technical manuals and various types of unstructured content to figure out who was named, what part was what, and what people, places, events, and things were mentioned in digital content. The utility of NLToolset type software was for cost reduction associated with documents and the intelligence value of processed information.
The need for a markup system that worked without 100 percent human indexing was important. GE got with the program and appears to have assigned some then-young folks to the project. The government speak for this type of content processing involves terms like “message understanding” or MU, “entity extraction,” and “relationship mapping.” The outputs of an NLToolset system were intended for use in other software subsystems that could count, process, and perform other operations on the tagged content. Today, this class of software would be packaged under a broad term like “text mining.” GE exited the business, which ended up in the hands of Martin Marietta. When the technology landed at Martin Marietta, the suite of tools was used in what was called, in the late 1980s and early 1990s, the Louella Parsing System. When Lockheed and Martin merged to form the giant Lockheed Martin, Louella was renamed AeroText.
Over the years, the AeroText system competed with LingPipe, SRA’s NetOwl, and Inxight’s tools. In the heyday of natural language processing, there were dozens and dozens of universities and start ups competing for Federal funding. I have mentioned in other articles the importance of the US government in jump starting the craziness in search and content processing.
In 2005, I recall that Lockheed Martin released AeroText 5.1 for Linux, but I have lost track of the open source versions of the system. The point is that AeroText is not particularly new, and as far as I know, the last major upgrade took place in 2007 before Lockheed Martin sold the property to Rocket Software. At the time of the sale, AeroText incorporated a number of subsystems, including a useful time plotting feature. A user could see tagged events on a timeline, a function long associated with the original version of i2’s Analyst’s Notebook. A US government buyer can obtain AeroText via the GSA because Lockheed Martin seems to be a reseller of the technology. Before the sale to Rocket, Lockheed Martin followed SAIC’s push into Australia. Lockheed signed up NetMap Analytics to handle Australia’s appetite for US government accepted systems.
What does AeroText purport to do that caused the person who contacted me to see a 1980s technology as the next best thing to sliced bread?
AeroText is an extraction tool; that is, it has capabilities to identify and tag entities at somewhere between 50 percent and 80 percent accuracy. (See NIST 2007 Automatic Content Extraction Evaluation Official Results for more detail.)
The AeroText approach uses knowledgebases, rules, and patterns to identify and tag pre-specified types of information. AeroText references patterns and templates, both of which assume the licensee knows beforehand what is needed and what will happen to processed content.
In my view, the licensee has to know what he or she is looking for in order to find it. This is a problem captured in the famous snippet, “You don’t know what you don’t know,” and the “unknown unknowns” variation popularized by Donald Rumsfeld. Obviously, without prior knowledge, the utility of an AeroText-type of system has to be matched to mission requirements. AeroText pounded the drum for the semantic Web revolution. One of AeroText’s key functions was its ability to perform the type of markup the Department of Defense required for its XML. The US DoD used a variant called DAML, or DARPA Agent Markup Language. Natural language processing, Louella, and AeroText collected the dust of SPARQL, unifying logic, RDF, OWL, ontologies, and other semantic baggage as the system evolved through time.
Also, staff (headcount) and ongoing services are required to keep a Louella/AeroText-type system generating relevant and usable outputs. AeroText can find entities, figure out relationships like person to person and person to organization, and tag events like a merger or an arrest “event.” In one briefing about AeroText I attended, I recall that the presenter emphasized that AeroText did not require training. (The subtext for those in the know was that Autonomy required training to deliver actionable outputs.) The presenter did not dwell on the need for manual fiddling with AeroText’s knowledgebases, and I did not raise this issue.
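What an AeroText-type relationship extractor does can be suggested with a toy pattern. The patterns below are invented and far cruder than a real knowledgebase; the point is that only pre-specified shapes get tagged.

```python
import re

# Toy patterns: a capitalized first and last name, then a linking word,
# then a capitalized organization name.
PERSON = r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+)"
ORG = r"(?P<org>[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*)"
WORKS_AT = re.compile(PERSON + r",? (?:of|at|joined) " + ORG)

def extract_relations(text):
    """Tag person-to-organization relations with fixed patterns.
    Anything the pattern writer did not anticipate is silently missed,
    which is the 'you must know what you are looking for' limitation."""
    return [(m.group("person"), m.group("org"))
            for m in WORKS_AT.finditer(text)]
```

Real systems layer knowledgebases, disambiguation, and event templates on top of this idea, and all of it needs the manual fiddling the presenter glossed over.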
June 24, 2014
HP Autonomy has undergone a redesign, or as HP phrases it, a rebirth. HP is ready to make the unveiling official, and those interested can read about the details in the article, “Analytics for Human Information: HP IDOL 10.6 Just Released: A Story of Something Bigger.”
The article begins:
“Under the direction of SVP and General Manager Robert Youngjohns, this past year has been a time of transformation for HP Autonomy—with a genuine commitment to customer satisfaction, breakthrough technological innovation, and culture of transparency. Internally, to emphasize the importance of this fresh new thinking and business approach, we refer to this change as #AutonomyReborn.”
Quarterly releases promise rapid updates, and open source integration is front and center. Current users and interested new users can download the latest version from the customer support site.
Emily Rae Aldridge, June 24, 2014
June 19, 2014
HP says that it has been spending the past year rebuilding Autonomy into a flagship, foundational technology for HP IDOL 10. HP discusses the new changes in “Analytics for Human Information: HP IDOL 10.6 Just Released: A Story of Something Bigger.” Autonomy had problems in the past when its capabilities of organizing and analyzing unstructured information were called into question after HP purchased it. HP claims that under its guidance HP IDOL 10 is drastically different from its previous incarnations:
“HP IDOL 10, released under HP’s stewardship, reflects in many ways the transformation that has occurred under HP. IDOL 10 is fundamentally different from Autonomy IDOL 7 in the same way that HP Autonomy as a company differs pre- and post- acquisition. They may share the name IDOL, but the differences are so vast from both strategic and technology points-of-view that we consider IDOL 10 a wholly new product from IDOL 7, and not just a version update. HP sees IDOL as a strategic pillar of HAVEn – HP’s comprehensive big data platform – and isn’t shy to use its vast R&D resources to invest heavily into the technology.”
Some of the changes include automatic time zone conversion, removal of sensitive or offensive material, and better site administration. All clients who currently have an IDOL support contract will be able to download the upgrade free of charge.
HP really wants to be in the headlines for some positive news, instead of lawsuits. They are still reeling from the Autonomy purchase flub, and now they are working on damage control. How long will they be doing that? Something a bit more impressive than a filter and time zone conversion is called for to sound the trumpets.
June 10, 2014
At this year’s Gigaom Structure Data conference, Palantir’s Ari Gesher offered an apt parallel for the data field’s current growing pains: using computers before the dawn of operating systems. Gigaom summarizes his explanation in, “Palantir: Big Data Needs to Get Even More Abstract(ions).” Writer Tom Krazit tells us:
“Gesher took attendees on a bit of a computer history lesson, recalling how computers once required their users to manually reconfigure the machine each time they wanted to run a new program. This took a fair amount of time and effort: ‘if you wanted to use a computer to solve a problem, most of the effort went into organizing the pieces of hardware instead of doing what you wanted to do.’
“Operating systems brought abstraction, or a way to separate the busy work from the higher-level duties assigned to the computer. This is the foundation of modern computing, but it’s not widely used in the practice of data science.
“In other words, the current state of data science is like ‘yak shaving,’ a techie meme for a situation in which a bunch of tedious tasks that appear pointless actually solve a greater problem. ‘We need operating system abstractions for data problems,’ Gesher said.”
An operating system for data analysis? That’s one way to look at it, I suppose. The article invites us to click through to a video of the session, but as of this writing it is not functioning. Perhaps they will heed the request of one commenter and fix it soon.
Based in Palo Alto, California, Palantir focuses on improving the methods their customers use to analyze data. The company was founded in 2004 by some folks from PayPal and from Stanford University. The write-up makes a point of noting that Palantir is “notoriously secretive” and that part(s) of the U.S. government can be found among its clients. I’m not exactly sure, though, how that ties into Gesher’s observations. Does Krazit suspect it is the federal government calling for better organization and a simplified user experience? Now, that would be interesting.
Cynthia Murrell, June 10, 2014
June 7, 2014
When I left the intelligence conference in Prague, there were a number of companies in my graphic about open source search. When I got off the airplane, I edited my slide. Looks to me as if Elasticsearch has just bulldozed the commercialized open source group in the search and content sector. I would not want to be the CEO of LucidWorks, Ikanow, or any other open sourcey search and content processing company this weekend.
I read “Elasticsearch Scores $70 Million to Help Sites Crunch Tons of Data Fast.” Forget the fact that Elasticsearch is built on Lucene and some home grown code. Ignore the grammar in “data fast.” Skip over the sports analogy “scores.” Dismiss the somewhat narrow definition of what Elasticsearch ELK can really deliver.
What’s important is the $70 million committed to Elasticsearch. Added to the $30 or $40 million the outfit had obtained before, we are looking at a $100 million bet on an open source search based business. Compare this to the trifling $40 million the proprietary vendor Coveo had gathered or the $30 million put on LucidWorks to get into the derby.
I have been pointing out that Elasticsearch has demonstrated that it had several advantages over its open source competitors; namely, developers, developers, and developers.
Now I want to point out that it has another angle of attack: money, money, and money.
With the silliness of the search and content processing vendors’ marketing over the last two years, I think we have the emergence of a centralizing company.
No, it’s not HP’s new cloudy Autonomy. No, it’s not the wonky Watson game and recipe code from IBM. No, it’s not the Google Search Appliance, although I do love the little yellow boxes.
I will be telling those who attend my lectures to go with Elasticsearch. That’s where the developers and the money are.
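For those who take the advice, here is what a minimal Elasticsearch request body looks like. The index name and field names are placeholders; in practice this JSON is POSTed to a running cluster, typically at http://localhost:9200/news/_search.

```python
import json

# A basic full-text match query with a result-size cap; this shape is
# what developers write against Elasticsearch's Query DSL.
query = {
    "query": {"match": {"title": "open source search"}},
    "size": 10,
}

body = json.dumps(query)
```

The low barrier to writing and testing queries like this is a large part of the developer advantage the $100 million is betting on.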
Stephen E Arnold, June 7, 2014