September 4, 2014
Autonomy, Recommind, and dozens of other search and content processing firms rely on statistical procedures. Anyone who has survived Statistics 101 believes in the power of numbers. Textbook examples are—well—pat. The numbers work out even for B and C students.
The real world, on the other hand, is different. What was formulaic in the textbook exercises is more difficult with most data sets. The data are incomplete, inconsistent, generated by systems whose integrity is unknown, and often wrong. Human carelessness, the lack of time, a lack of expertise, and plain vanilla cluelessness make those nifty data sets squishier than a memory foam pillow.
If you have some questions about statistical evidence in today’s go-go world, check out “I Disagree with Alan Turing and Daniel Kahneman Regarding the Strength of Statistical Evidence.”
I noted this passage:
It’s good to have an open mind. When a striking result appears in the dataset, it’s possible that this result does not represent an enduring truth or even a pattern in the general population but rather is just an artifact of a particular small and noisy dataset. One frustration I’ve had in recent discussions regarding controversial research is the seeming unwillingness of researchers to entertain the possibility that their published findings are just noise.
An open mind is important. Just looking at the outputs of zippy systems that do prediction for various entities can be instructive. In the last couple of months, I learned that predictive systems:
- Failed to size the Ebola outbreak by orders of magnitude
- Did not provide reliable outputs for analysts trying to figure out where a crashed airplane was
- Came up short regarding resources available to ISIS.
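The “just noise” point is easy to demonstrate with a tiny, hypothetical simulation (not tied to any of the systems above): draw many small samples from a distribution with no real underlying effect, and count how often a “striking” result shows up anyway. A minimal sketch in Python:

```python
import random
import statistics

random.seed(42)

# Hypothetical illustration: draw many small samples from a distribution
# with NO real effect (mean 0) and count how often a "striking" result
# appears anyway, purely from noise.
def striking_by_chance(n_studies=1000, sample_size=10, threshold=0.6):
    striking = 0
    for _ in range(n_studies):
        sample = [random.gauss(0, 1) for _ in range(sample_size)]
        if abs(statistics.mean(sample)) > threshold:
            striking += 1
    return striking

hits = striking_by_chance()
# Even with no underlying effect, a nontrivial fraction of small samples
# look "interesting" -- exactly the artifact the quoted passage warns about.
print(hits)
```

The sample size, threshold, and study count are arbitrary; the pattern is not. Small, noisy data sets hand out striking results for free.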
The Big Data revolution is one of those hoped for events. The idea is that Big Data will allow content processing vendors to sell big buck solutions. Another is that massive flows of unstructured content can only be tapped in a meaningful way with expensive information retrieval solutions.
Dreams, hopes, wishes—yep, all valid for children waiting for the tooth fairy. The real world has slightly more bumps and sharp places.
Stephen E Arnold, September, 2014
September 3, 2014
Amplitude is a new analytics startup backed by Y Combinator that recently raised $1.975 million in seed funding. TechCrunch reports on the fundraising effort and on how Amplitude differentiates itself from its competition in the article, “Amplitude, The Analytics Startup Undercutting Mixpanel, Raises $2 million Seed Round.”
Amplitude has grown on the strength of a 400 percent increase in its enterprise customer base. Its founders were originally working on a text-by-voice app and built an analytics tool to examine their own data. It did not take them long to discover that the analytics tool was the better application. Amplitude is a valuable product because of its skilled engineering team and the claim that it can predict customer queries and save space. Which brings us to the price:
“Amplitude offers a freemium service that gives customers up to 5 million monthly events for free. In comparison, Mixpanel charges $600/month for 4 million data points. Amplitude also offers a $299/month plan for up to 50 million monthly events – something that would move into custom pricing territory at Mixpanel. Beyond that, Amplitude offers enterprise plans, and today has customers like The Hunt, Heyday, KeepSafe, and other larger customers still under NDA.”
That is very cheap compared to other popular business analytics plans. Amplitude offers a high quality product at a reasonable price. Will it catch on in today’s cash-strapped market? It already has. Be forewarned: other analytics companies may need to change their prices or risk losing customers.
September 1, 2014
Last week I had a conversation with a publisher who has a keen interest in software that “knows” what content means. Armed with that knowledge, a system can then answer questions.
The conversation was interesting. I mentioned my presentations for law enforcement and intelligence professionals about the limitations of modern and computationally expensive systems.
Several points crystallized in my mind. One of these is addressed, in part, in a diagram created by a person interested in machine learning methods. Here’s the diagram from the scikit-learn documentation:
The diagram is designed to help a developer select from different methods of performing estimation operations. The author states:
Often the hardest part of solving a machine learning problem can be finding the right estimator for the job. Different estimators are better suited for different types of data and different problems. The flowchart below is designed to give users a bit of a rough guide on how to approach problems with regard to which estimators to try on your data.
First, notice that there is a selection process for choosing a particular numerical recipe. Now who determines which recipe is the right one? The answer is the coding chef. A human exercises judgment about the particular sequence of operations that will be used to fuel machine learning. Is that sequence of actions the best one, the expedient one, or the one that seems to work for the test data? The answer to these questions determines a key threshold for the resulting “learning system.”

Stated another way, “Does the person licensing the system know if the numerical recipe is the most appropriate for the licensee’s data?” Nah. Does a mid-tier consulting firm like Gartner, IDC, or Forrester dig into this plumbing? Nah. Does it matter? Oh, yeah.

As I point out in my lectures, the “accuracy” of a system’s output depends on this type of plumbing decision. Unlike a backed-up drain, flaws in smart systems may never be discerned. For certain operational decisions, financial shortfalls or the loss of an operations team in a war theater can be attributed to any one of many variables. As decision makers chase the Silver Bullet of smart, thinking software, who really questions the output in a slick graphic? In my experience, darned few people. That includes cheerleaders for smart software, azure chip consultants, and former middle school teachers looking for a job as a search consultant.
Second, notice the reference to a “rough guide.” The real guide is an understanding of how specific numerical recipes behave on a set of data that allegedly represents what the system will process when operational. Furthermore, there are plenty of mathematical methods available. The problem is that some of the more interesting procedures lead to increased computational cost. In the worst case, the more interesting procedures cannot be computed on available resources. Some developers know about P=NP and Big O notation. Others know to use the same nine or ten mathematical procedures taught in computer science classes. After all, why worry about math based on mereology if the machine resources cannot handle the computations within time and budget parameters? This means that most modern systems are based on a set of procedures that are computationally affordable, familiar, and convenient. Does this similarity of procedures matter? Yep. The generally squirrely outputs from many very popular systems are perceived as completely reliable. Unfortunately, the systems are performing within a narrow range of statistical confidence. Stated more harshly, the outputs are just not particularly helpful.
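To make the “coding chef” point concrete, here is a toy, hypothetical version of an estimator-selection flowchart in plain Python. The thresholds and estimator names are illustrative, not scikit-learn’s actual logic; the point is that the “choice” of numerical recipe is a human-authored rule applied before any learning happens:

```python
# Hypothetical, much-simplified estimator-selection flowchart. The cutoffs
# and names below are illustrative only; they are a human-authored rule,
# applied to the data before any "learning" takes place.
def pick_estimator(n_samples, predicting_category, labeled, predicting_quantity):
    if n_samples < 50:
        return "get more data"
    if predicting_category:
        return "LinearSVC" if labeled else "KMeans"
    if predicting_quantity:
        return "Ridge regression" if n_samples < 100_000 else "SGDRegressor"
    return "dimensionality reduction"

# The same data set routed through different branches yields a different
# "learning system" -- and a different notion of what the output means.
print(pick_estimator(10_000, True, True, False))
```

Whoever writes those branch conditions, not the data, decides which recipe fuels the system.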
In my conversation with the publisher, I asked several questions:
- Is there a smart system like Watson that you would rely upon to treat your teenaged daughter’s cancer? Or, would you prefer the human specialist at the Mayo Clinic or comparable institution?
- Is there a smart system that you want directing your only son in an operational mission in a conflict in a city under ISIS control? Or, would you prefer the human-guided decision near the theater about the mission?
- Is there a smart system you want managing your retirement funds in today’s uncertain economy? Or, would you prefer the recommendations of a certified financial planner relying on a variety of inputs, including analyses from specialists in whom your analyst has confidence?
When I asked these questions, the publisher looked uncomfortable. The reason is that the massive hyperbole and marketing craziness about fancy new systems creates what I call the Star Trek phenomenon. People watch Captain Kirk talking to devices, transporting himself from danger, and traveling between far-flung galaxies. Because a mobile phone performs some of the functions of the fictional communicator, it sure seems as if many other flashy sci-fi services should be available.
Well, this Star Trek phenomenon does help direct some research. But in terms of products that can be used in high risk environments, the sci-fi remains a fiction.
Believing and expecting are different from working with products that are limited by computational resources, expertise, and informed understanding of key factors.
Humans, particularly those who need money to pay the mortgage, ignore reality. The objective is to close a deal. When it comes to information retrieval and content processing, today’s systems are marginally better than those available five or ten years ago. In some cases, today’s systems are less useful.
August 26, 2014
Natural language processing—one of its most-discussed functions in business is sentiment analysis. Over at the SmartData Collective, Lexalytics’ Scott Van Boeyen tells us “Why Sentiment Analysis Engines Need Customization.” The short answer: slang. The write-up explains:
The problem with sentiment analysis is sometimes it’s wrong.[…]
“Oh man, that was nasty!” Is this sentence positive or negative? Surely, it must be negative. “Nasty” is a negative word, and everything else in this sentence is neutral. Final answer, negative! Drum roll…. Wrong! It’s positive.
The person who said this used the American slang definition of nasty, which has positive sentiment. There is absolutely no way to know by reading the sentence. So, if you (a human) were just tricked by reading this article, how is a machine supposed to figure it out? Answer: Tell the engine what’s positive and what’s negative.
High quality NLP engines will let you customize your sentiment analysis settings. “Nasty” is negative by default. If you’re processing slang where “nasty” is considered a positive term, you would access your engine’s sentiment customization function, and assign a positive score to the word.
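The customization idea reduces to a lexicon with an override table. Here is a minimal sketch; the word lists and the `overrides` mechanism are hypothetical, not Lexalytics’ actual API:

```python
# Minimal sketch of lexicon-based sentiment scoring with customization.
# The default lexicon and the override mechanism are made up for
# illustration; they are not any vendor's actual API.
DEFAULT_LEXICON = {"nasty": -1.0, "great": 1.0, "terrible": -1.0}

def score(text, overrides=None):
    lexicon = dict(DEFAULT_LEXICON)
    if overrides:
        lexicon.update(overrides)          # the customization step
    words = text.lower().replace("!", "").replace(",", "").split()
    return sum(lexicon.get(w, 0.0) for w in words)

sentence = "Oh man, that was nasty!"
print(score(sentence))                      # default lexicon: negative
print(score(sentence, {"nasty": 1.0}))      # slang override: positive
```

The engine never “figures out” the slang sense; a human tells it which sense applies, which is exactly the customization the write-up describes.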
The man has a point. Still, we are left with a few questions: How much more should one expect to pay for a customization feature? Also, how long does it take to teach an NLP platform comprehensive alternate vocabulary? How does one decide what slang to include—has anyone developed a list of suggestions? Perhaps one could start by consulting the Urban Dictionary.
Cynthia Murrell, August 26, 2014
August 21, 2014
Prediction is hard; some may even say it lies within the realm of the impossible. However, prediction has taken a step forward with the work of a Web site, correlated.org. Its goal is to find correlations between seemingly unrelated things. The site has been able to take aggregated poll results and draw broader conclusions about the wider population. Read more in the Business Insider article, “Correlation Expert Explains How 5 Questions Allow Him To Predict A Bunch Of Traits About People.”
The article begins:
“Gallagher, a former newspaper editor, runs correlated.org, a site that polls registered users on a wide variety of questions to identify strange correlations, ranging from the tendency of pot smokers to prefer sweet snacks to the tendency of Twitter users to remember their dreams. He also recently released a book. ‘Our answers to five basic questions are enough to predict our preferences and opinions about a whole lot of other things,’ Gallagher wrote.”
This could be good news for the world of predictive analytics. Sure, predictive analytics are pretty tried and true in the world of insurance, but in terms of consumer behavior, and other more casual needs, it is harder to draw straight lines. Exploring these smaller, less linear relationships through correlated.org may produce big dividends for other areas. Quite frankly, it is impressive that they are successfully predicting anything, and it bodes well for the future.
Emily Rae Aldridge, August 21, 2014
August 20, 2014
I read “Can HP IDOL Jumpstart the Big Data App Economy?” My first reaction was, “A Big Data app?” and then “What’s the Big Data app economy?” I ploughed into the write-up and learned that:
Hewlett-Packard Co. is looking to take the driver’s seat in bringing about the era of pre-packaged analytic applications with the IDOL platform from Autonomy, and according to the head of product marketing for the subsidiary, it already has results to show for the effort.
Okay. And the evidence:
Standing out among the case studies that were being demonstrated at the conference was a clinical data management system serving as a foundation for services that each implemented the underlying functionality in a different way. Veis [HP professional] pointed at the solution as a prime example of developer ingenuity that would not be facilitated had HP not made the capabilities of IDOL available for consumption from the cloud last December.
Well, there is some work required:
Despite the tremendous amount of progress that has been made on simplifying data processing in recent years, Veis said that operationalizing information remains a widespread pain point.
Okay, already. Solve the problem.
Apparently there is another hurdle:
Another major challenge is mobility, which Veis sees as the “great equalizer” for user experience, especially as it pertains to delivering data insights.
Frankly I don’t know what this means.
I suppose this type of content marketing and jargonizing will sell some folks. For me, it’s confusing. IDOL is now about 15 years old. The DRE (Dynamic Reasoning Engine) requires training, and that means one has to know what type of information will be processed. In order to get useful results, content known to be like the content to be processed has to be assembled as a training set. Skip this step and the results are likely to be off point.
Has HP figured out how to crack this aspect of Big Data? I thought that IDOL and DRE required the licensee to train the system so that IDOL and DRE can deliver results that are on point for the content set.
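Why does the training set matter so much? A toy illustration (emphatically not IDOL’s actual method): a classifier trained on one domain’s vocabulary has nothing useful to say about content from another domain.

```python
from collections import Counter

# Toy word-count classifier (not IDOL's actual method) showing why a
# training set must resemble the content to be processed: off-domain
# input matches nothing, so the "answer" is arbitrary.
def train(docs_by_label):
    return {label: Counter(w for d in docs for w in d.lower().split())
            for label, docs in docs_by_label.items()}

def classify(model, doc):
    words = doc.lower().split()
    scores = {label: sum(counts[w] for w in words)
              for label, counts in model.items()}
    return max(scores, key=scores.get)

model = train({
    "clinical": ["patient dosage trial outcome", "clinical trial patient data"],
    "finance":  ["quarterly revenue margin forecast", "revenue growth margin"],
})

print(classify(model, "patient outcome data"))       # in-domain: sensible
print(classify(model, "goal scored in extra time"))  # off-domain: a guess
```

Real systems are far more sophisticated, but the dependency is the same: no representative training content, no on-point results.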
My hunch is that by shifting the focus to apps, HP may be ignoring some of the time consuming intellectual work needed to allow IDOL and DRE to show their stuff.
HP has to find a way to generate billions to pay off the Autonomy buy and then make those lines of business return high margin, sustainable revenue. Apps may make sense to an MBA. Will apps deliver the truckloads of cash HP seeks from 15-year-old technology?
Let me check the Apple App Store. Nope, no app for that.
Stephen E Arnold, August 20, 2014
August 9, 2014
The blurring of search, business intelligence, and number crunching makes it difficult to figure out exactly what a company licenses. In the case of Actuate, there are some crystal clear products and services, and there are some that weave across boundaries.
For some, Actuate means an open source business data-reporting project launched by the Eclipse Foundation in 2004. You can download Eclipse BIRT here.
Actuate released BIRT Analytics 4.4, a commercial product, in July 2014. The company issued a news release titled “Actuate Announces BIRT Analytics 4.4 for Even Easier and Faster Big Data Advanced Analytics for Business Professionals.” Actuate employs the jargon that electrifies those who ride the data analytics bandwagon; for example:
BIRT Analytics 4.4 is a sophisticated, end-to-end software solution that allows users to extract maximum value from Big Data, in the form of visual statistical insights that enable sharper commercial decision-making, and greater customer responsiveness, providing organizations a powerful competitive edge. The built-in, columnar database engine loads at an unrivalled speed of up to 60 GB/hour. With BIRT Analytics 4.4, users are able to explore up to 6 billion records in less than a second, and perform advanced analytics on a million records in under a minute. Business analysts and business users can get to the exact insight they need in seconds rather than days or weeks – freeing IT and data scientists to work on projects that require their expertise. A new user interface (UI) and instructions further increase productivity for business users and administrators.
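The “built-in, columnar database engine” claim rests on a well-known storage trade-off. Here is a toy sketch of the idea (nothing to do with BIRT Analytics’ actual engine): an aggregate over one field scans a dense array in a column layout rather than touching every record.

```python
# Sketch of the row-store vs. column-store trade-off behind "columnar
# database engine" claims. This is a toy, not Actuate's actual engine.
rows = [
    {"customer": "a", "amount": 10, "region": "east"},
    {"customer": "b", "amount": 25, "region": "west"},
    {"customer": "c", "amount": 15, "region": "east"},
]

# Row layout: an aggregate over one field still touches every record.
total_row_layout = sum(r["amount"] for r in rows)

# Column layout: each field is stored contiguously, so the same aggregate
# scans one dense array -- the property that makes fast analytic scans
# plausible.
columns = {
    "customer": ["a", "b", "c"],
    "amount": [10, 25, 15],
    "region": ["east", "west", "east"],
}
total_col_layout = sum(columns["amount"])

print(total_row_layout, total_col_layout)  # same answer, different access pattern
```

Whether that translates into “6 billion records in less than a second” on a customer’s hardware is, of course, another question.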
The news release should pump some life into Actuate’s revenues, which were $135 million for the year ending December 31, 2013. In May 2014, the company reported a quarterly decrease in net income and a decrease in net operating cash flow. Emerging Growth’s report “Actuate Corporation Offers Underwhelming Performance” stated:
The revenue fell significantly faster than the industry average of 6 percent. Compared to the same quarter last year, Actuate revenues fell by 31 percent.
Is Actuate struggling with some of the same market forces that bedevil search and content processing vendors? Announcements and feature upgrades have to translate into sustainable revenue; otherwise, stakeholders will become increasingly grumpy.
Stephen E Arnold, August 9, 2014
August 4, 2014
In 2010, Attensity purchased Biz360. The Beyond Search comment on this deal is at http://bit.ly/1p4were. One of the goslings reminded me that I had not instructed a writer to tackle Attensity’s July 2014 announcement “Attensity Adds to Patent Portfolio for Unstructured Data Analysis Technology.” PR-type “stories” can disappear, but for now you can find a description of “Attensity Adds to Patent Portfolio for Unstructured Data Analysis Technology” at http://reut.rs/1qU8Sre.
My researcher showed me a hard copy of US 8,645,395, and I scanned the abstract and claims. The abstract, like many search and content processing inventions, seemed somewhat similar to other text parsing systems and methods. The application was filed in April 2008, two years before Attensity purchased Biz360, a social media monitoring company. Attensity, as you may know, is a text analysis company founded by Dr. David Bean. Dr. Bean employed various “deep” analytic processes to figure out the meaning of words, phrases, and documents. My limited understanding of Attensity’s methods suggested that Attensity’s Bean-centric technology could process text to achieve a similar result. I had a phone call from AT&T regarding the utility of certain Attensity outputs. I assume the Bean methods required some reinforcement to keep pace with customers’ expectations of Attensity’s Bean-centric system. Neither the goslings nor I are patent attorneys. So after you download the ’395 patent, seek out a patent attorney and get him or her to explain its mysteries to you.
The abstract states:
A system for evaluating a review having unstructured text comprises a segment splitter for separating at least a portion of the unstructured text into one or more segments, each segment comprising one or more words; a segment parser coupled to the segment splitter for assigning one or more lexical categories to one or more of the one or more words of each segment; an information extractor coupled to the segment parser for identifying a feature word and an opinion word contained in the one or more segments; and a sentiment rating engine coupled to the information extractor for calculating an opinion score based upon an opinion grouping, the opinion grouping including at least the feature word and the opinion word identified by the information extractor.
This invention tackles the Mean Joe Greene of content processing from the point of view of a quite specific type of content: a review. Amazon has quite a few reviews, but the notion of a “shaped” review is a thorny one. (See, for example, http://bit.ly/1pz1q0V.) The invention’s approach identifies words with different roles; some words are “opinion words” and others are “feature words.” By hooking a “sentiment engine” to this indexing operation, the Biz360 invention can generate an “opinion score.” The system uses item, language, training model, feature, opinion, and rating modifier databases. These, I assume, are maintained either by subject matter experts (expensive), by smart software working automatically (often evidencing “drift,” so results may not be on point), or by a hybrid approach (humans cost money).
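The four components named in the abstract map onto a simple pipeline. Here is a sketch with hypothetical word lists standing in for the patent’s knowledge bases; it illustrates the structure of the claims, not the actual Biz360 implementation:

```python
# Sketch of the four components in the abstract -- segment splitter,
# segment parser, information extractor, sentiment rating engine -- with
# hypothetical word lists standing in for the patent's databases.
FEATURE_WORDS = {"battery", "screen", "price"}
OPINION_WORDS = {"great": 1, "poor": -1, "excellent": 2}

def split_segments(review):                 # segment splitter
    return [s.strip() for s in review.split(".") if s.strip()]

def parse(segment):                         # segment parser (toy tagging)
    return [(w, "opinion" if w in OPINION_WORDS else
                "feature" if w in FEATURE_WORDS else "other")
            for w in segment.lower().split()]

def extract(tagged):                        # information extractor
    features = [w for w, t in tagged if t == "feature"]
    opinions = [w for w, t in tagged if t == "opinion"]
    return [(f, o) for f in features for o in opinions]

def rate(review):                           # sentiment rating engine
    score = 0
    for segment in split_segments(review):
        for _feature, opinion in extract(parse(segment)):
            score += OPINION_WORDS[opinion]
    return score

print(rate("Great battery. Poor screen."))
```

Note where the word lists come from: somebody, or some process, has to build and maintain them, which is exactly the knowledge-base question raised below.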
The Attensity/Biz360 system relies on a number of knowledge bases. How are these updated? What is the latency between identifying new content and updating the knowledge bases to make the new content available to the user or a software process generating an alert or another type of report?
The 20 claims embrace the components working as a well oiled content analyzer. The claim I noted is that the system’s opinion score uses a positive and negative range. I worked on a sentiment system that made use of a stop light metaphor: red for negative sentiment and green for positive sentiment. When our system could not figure out whether the text was positive or negative we used a yellow light.
The approach, used for a US government project a decade ago, relied on a very simple metaphor to communicate a situation without scores, values, and scales. Image source: http://bit.ly/1tNvkT8
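The stop-light mapping is trivial to express; that was the point. A sketch, with a hypothetical dead band for the cases the scorer cannot call either way:

```python
# Sketch of the stop-light metaphor: a signed opinion score mapped to
# red/yellow/green. The dead band width is a hypothetical choice.
def traffic_light(score, dead_band=0.25):
    if score > dead_band:
        return "green"
    if score < -dead_band:
        return "red"
    return "yellow"

print(traffic_light(0.8), traffic_light(-0.6), traffic_light(0.1))
```

No scales, no decimals for the end user: just a color a decision maker can act on.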
Attensity said, according to the news story cited above:
By splitting the unstructured text into one or more segments, lexical categories can be created and a sentiment-rating engine coupled to the information can now evaluate the opinions for products, services and entities.
Okay, but I think that the splitting of text into segments was a function of iPhrase and of search vendors converting unstructured text into XML and then indexing the outputs.
Jonathan Schwartz, General Counsel at Attensity, is quoted in the news story as asserting:
“The issuance of this patent further validates the years of research and affirms our innovative leadership. We expect additional patent issuances, which will further strengthen our broad IP portfolio.”
Okay, this sounds good, but the invention took place prior to Attensity’s owning Biz360. Attensity, therefore, purchased the invention of folks who did not work at Attensity in the period prior to the filing in 2008. I understand that companies buy other companies to get technology and people. I find it interesting that Attensity’s purchase “validates” Attensity’s research and “affirms” Attensity’s “innovative leadership.”
I would word what the patent delivers and Attensity’s contributions differently. I am no legal eagle or sentiment expert. I do like less marketing razzle dazzle, but I am in the minority on this point.
Net net: Attensity is an interesting company. Will it be able to deliver products that make the licensees’ sentiment scores move in a direction that leads to sustainable revenue and generous profits? With the $90 million in funding the company received in 2014, the 14-year-old company will have some work to do to deliver a healthy return to its stakeholders. Expert System, Lexalytics, and others are racing down the same quarter-mile drag strip. Which firm will be the winner? Which will blow an engine?
Stephen E Arnold, August 4, 2014
July 31, 2014
I know there are quite a few experts in enterprise search, content processing, and the near mystical Big Data thing. If you want to be able to explain how the fancy math in most content-centric systems works, Markov chains are a good place to start. Navigate to the Setosa blog’s “Markov Chains: A Visual Explanation.” This one is pretty good. You can also poke around for an IBM presentation on the same subject. IBM includes some examples of the way the numerical recipe can assign a probability to an event that is likely to take place.
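For the impatient, here is the core idea in a few lines of Python: estimate, from an observed sequence, the probability that one state follows another, then read the most likely next event out of the table. The weather data below are made up.

```python
from collections import Counter, defaultdict

# Minimal Markov chain sketch: build a transition-probability table from
# an observed sequence of states. The weather sequence is made up.
def transition_probs(sequence):
    counts = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    return {state: {nxt: n / sum(c.values()) for nxt, n in c.items()}
            for state, c in counts.items()}

weather = ["sunny", "sunny", "rainy", "sunny", "sunny", "sunny", "rainy", "rainy"]
probs = transition_probs(weather)
print(probs["sunny"])   # the estimated P(next state | today is sunny)
```

That table is the whole trick: content-centric systems dress it up, but the recipe is counting transitions and normalizing.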
Stephen E Arnold, July 31, 2014
July 28, 2014
I read “Google Searches Hold Key to Future Market Crashes.” The main idea in my opinion is:
Moat [big thinker at Warwick Business School] continued, “Our results are in line with the hypothesis that increases in searches relating to both politics and business could be a sign of concern about the state of the economy, which may lead to decreased confidence in the value of stocks, resulting in transactions at lower prices.”
So will the Warwick team cash in on the stock market?
Well, there is a cautionary item as well:
“Our results provide evidence of a relationship between the search behavior of Google users and stock market movements,” said Tobias Preis, Associate Professor of Behavioral Science and Finance at Warwick Business School. “However, our analysis found that the strength of this relationship, using this very simple weekly trading strategy, has diminished in recent years. This potentially reflects the increasing incorporation of Internet data into automated trading strategies, and highlights that more advanced strategies are now needed to fully exploit online data in financial trading.”
Rats. Quants are already on this it seems.
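As I understand the “very simple weekly trading strategy,” it amounts to comparing this week’s search volume for a term with its trailing average: rising interest triggers a sell, falling interest a buy. A rough sketch with made-up numbers (not the Warwick team’s actual code):

```python
# Rough sketch of the weekly rule described in the Warwick work: compare
# this week's search volume with its trailing average; rising interest
# -> sell, falling interest -> buy. The data are made up.
def signals(volumes, window=3):
    out = []
    for t in range(window, len(volumes)):
        trailing_avg = sum(volumes[t - window:t]) / window
        out.append("sell" if volumes[t] > trailing_avg else "buy")
    return out

search_volume = [40, 42, 41, 55, 60, 38, 35]   # hypothetical weekly values
print(signals(search_volume))
```

A rule this simple is trivial to automate, which is presumably why, per Preis, its edge has already been arbitraged away.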
What’s fascinating to me is that the Warwick experts overlooked a couple of points; namely:
- Google is using its own predictive methods to determine what users see when they get a search result based on the behavior of others. Recursion, anyone?
- Google handles more searches from mobile devices with each passing day. By their nature, traditional desktop queries are not exactly the same as mobile device searches. As a workaround, Google uses clusters and other methods to give users what Google thinks the user really wants. Advertising, anyone?
- The stock pickers that are the cat’s pajamas at the B school have to demonstrate their acumen on the trading floor. Does insider trading play a role? Does working at a Goldman Sachs-type of firm help a bit?
Like perpetual motion, folks will keep looking for a way to get an edge. Why are large international banks paying some hefty fines? Humans, I believe, not algorithms.
Stephen E Arnold, July 28, 2014