September 22, 2014
I read “How IBM’s Watson Could Do for Analytics What Search Did for Google.” I urge you to flip through a math book like Calculus on Manifolds: A Modern Approach to Classical Theorems of Advanced Calculus. Although an older book, some of its methods are now creeping into the artificial intelligence revolution that seems to be the next big thing. Then read the Datamation write up.
IBM is rolling out a “freemium model to move Watson, their [sic] English language AI interface for analytics, into the market more aggressively.” What could be more aggressive than university contests, recipes for Bon Appétit, and curing cancer?
The article points out that the only competitor to Watson is Google. Well, that’s an interesting assertion.
Google put an interface on search, I learned. The rest is Google’s dominance. Now IBM wants to put an interface on analytics, and (I assume it follows to the thinkers at IBM) IBM’s dominance will tag along.
The article asserts:
We often talk about analytics needing data scientists who have a unique skill set, allowing them to get out the answers needed from highly complex data repositories. Since the results of the analysis are supposed to lead to better executive decisions the ideal skill set would have been an MBA Data Scientist, yet I’ve actually never seen one of those. Folks who are good at deep analysis and folks that are good at business tend to be very different folks, and data scientists are in very short supply at the moment.
Well, someone has to:
- Select numerical recipes
- Set thresholds
- Select process sequences
- Select data and ensure that they are valid
- Set up outputs, making decisions about what to show and what not to show
- Modify when the outputs do not match reality. (I realize that this step is of little interest to some analytics users.)
The article concludes:
The Freemium model has similar advantages. So if you wrap a product that line executives should prefer with an economic model that removes most of the financial barriers, you should end up with a solution that does for IBM what Search did for Google. And that could do some interesting things to the analytics market, creating a similar set of conditions to those that put IBM on top of technology in the last century.
What’s a freemium model? What’s the purpose of the analysis? What’s the method to validate results? What controls does a clueless user have over the Watson system?
Oh, wait. Watson is a search system. Google is a search system that people use. Watson is a search system that few use. Also, IBM still sells mainframes. This is a useful factoid to keep in mind.
Stephen E Arnold, September 22, 2014
September 18, 2014
I read “Palantir May Have Raised More Than We Thought, Perhaps $165 million.” The article presented a revisionist view of how much money is in the Palantir piggy bank. Here’s the number I circled: $165 million since February 2014. I also marked this paragraph:
The Palo Alto company led by CEO Alex Karp disclosed in a Securities and Exchange Commission filing on Friday that it had raised more than $440 million in a funding round that began last November.
The numbers add up. The write up asserted:
The company co-founded by Karp, Peter Thiel, Joe Lonsdale and others in 2004 has raised a total of about $1 billion, with some of that funding coming from In-Q-Tel, the venture arm of U.S. intelligence agencies.
This works out to a $9 billion valuation.
The question now becomes, “How long will it take Palantir to generate sufficient revenue to pay back the investors and turn a profit?” The reason I ask is that IBM is chasing this market along with a legion of other firms.
Terrorism, war fighting, and Fancy Dan analytics are growth buttons. Will there be enough customers to feed the appetites of the outfits chasing the available money?
My hunch is that some of the competitors in this segment will come up empty.
Also, the tonnage of money Palantir has had dropped in its bank account makes the separate injections of $30 million funding into three firms— Attivio, BA Insight, and Coveo—look modest indeed. Perhaps there is more to the Big Data pitch than just words?
Stephen E Arnold, September 18, 2014
September 16, 2014
I read “Launching Today: Mathematica Online.” The interface is similar to the desktop application. The benefits of having the Mathematica tool accessible on non-desktop devices and without requiring a local installation of the program are many; for example, notebooks work on tablets. With refreshing candor, Dr. Wolfram notes:
There are some tradeoffs of course. For example, Manipulate can’t be as zippy in the cloud as it is on the desktop, because it has to run across the network. But because its Cloud CDF interface is running directly in the web browser, it can immediately be embedded in any web page, without any plug-in…
Worth a look at http://www.wolfram.com/mathematica/online/.
Stephen E Arnold, September 16, 2014
September 15, 2014
Short honk: Navigate to “How Google’s Autonomous Car Passed the First U.S. State Self-Driving Test.” Do you find this statement interesting?
Google chose the test route and set limits on the road and weather conditions that the vehicle could encounter, and that its engineers had to take control of the car twice during the drive.
I do. With intervention it is much easier to pass a test. The same method of shaping characterizes Google’s approach to modeling for “nowcasting.” I discuss this hand crafting of methods to deliver an acceptable result in my next KMWorld article.
Stephen E Arnold, September 15, 2014
September 11, 2014
The article titled “ExtraHop Helps to Make Data Free–Streams Into MongoDB And ElasticSearch” on Forbes.com discusses the broad coverage available through ExtraHop’s metrics. With all of the growing complexity of current IT applications, ExtraHop can help both traditional and non-traditional users through its real-time analytics and Open Data Stream. In fact, ExtraHop recently began offering the possibility of streaming data sets directly into analytic solutions including MongoDB and Elasticsearch. The article explains,
“Customers can leverage ExtraHop’s skills in delivering the most relevant and useful monitoring visualizations. But at the same time that can use that same data in ways that ExtraHop could have never thought of. It gives them the ability to deliver richer and deeper insights, but it also gives them more control over where data is stored and how it is queried and manipulated. It also opens up the possibility for organizations to use multiple monitoring solutions in parallel, simply because they can.”
Gartner is quoted as saying that the importance of these ITOA technologies lies in their ability to aid the explorative and creative processes. By having these insights available, more and more users will be able to realize their ideas and perhaps even make their dreams into realities.
Chelsea Kerwin, September 11, 2014
September 10, 2014
Other outfits have plenty of big thinkers and rely on nameless specialists to perform behind the scenes work.
A good example of this approach is revealed in “Predicting the Present with Bayesian Structural Time Series.” The scholarly write up explains a procedure to perform “nowcasting.” The idea is that one can use real time information to help predict other now happenings.
Instead of doing the wild and crazy Palantir/Recorded Future forward predicting, these Googlers focus on the now.
I am okay with whatever outputs predictive systems generate. What’s important about this paper is that the authors document when humans have to get involved in the processes constructed from numerical recipes known to many advanced math and statistics whizzes.
Here are several I noted:
- The modeler has to “choose components for the modeling trend.” No problem, but it is tedious and important work. Get this step wrong and the outputs can be misleading.
- Selecting sampling algorithms, page 6. Get this wrong and the outputs can be misleading.
- Simplify by making assumptions, page 7. “Another strategy one could pursue (but we have not) is to subjectively segment predictors into groups based on how likely the would be to enter the model.”
- Breaking with Bayesian, page 8. “Scaling by “s^2/y”* is a minor violation of the Bayesian paradigm because it means our prior is data determined.”
There are other examples. These range from selecting what outputs from Google Trends and Correlate to use to the sequence of numerical recipes implemented in the model.
My point is that Google is being upfront about the need for considerable manual work in order to make its nowcasting predictive model “work.”
Analytics deployed in organizations depend on similar human behind the scenes work. Get the wrong thresholds, put the procedures in a different order, or use bad judgment about what method to use and guess what?
The outputs are useless. As managers depend on analytics to aid their decision making and planners rely on models to predict the future, it is helpful to keep in mind that an end user may lack the expertise to figure out if the outputs are useful. If useful, how much confidence should a harried MBA put in predictive models?
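To make the point about thresholds and procedure order concrete, here is a minimal sketch in Python. The readings, the two cleaning steps, and the z-score thresholds are all invented for illustration; nothing here comes from the Google paper or from any vendor's system.

```python
from statistics import mean, pstdev

def fill_missing(values):
    """Replace None with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    fill = mean(present)
    return [fill if v is None else v for v in values]

def drop_outliers(values, z_threshold):
    """Drop present values whose z-score exceeds z_threshold; keep None."""
    present = [v for v in values if v is not None]
    mu, sigma = mean(present), pstdev(present)
    if sigma == 0:
        return list(values)
    return [v for v in values
            if v is None or abs(v - mu) / sigma <= z_threshold]

# Hypothetical readings: one missing value and one data-entry error (900).
readings = [10, 12, None, 11, 900]

# Same two steps, same threshold, different order.
order_a = fill_missing(drop_outliers(readings, z_threshold=1.5))
order_b = drop_outliers(fill_missing(readings), z_threshold=1.5)

# Same order as order_a, but a looser threshold.
loose = fill_missing(drop_outliers(readings, z_threshold=2.0))

for label, result in [("drop then fill", order_a),
                      ("fill then drop", order_b),
                      ("looser threshold", loose)]:
    print(label, mean(result))
```

The three runs report three different averages from the same five readings. Someone chose the order and the threshold, and the person reading the dashboard usually does not know which choice was made.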
Just a reminder that ISIS caught some folks by surprise, analytics vendor HP seemed to flub its predictions about Autonomy sales, and the outfits monitoring Ebola seem to be wrestling with underestimations.
Maybe enterprise search vendors can address these issues? I doubt it.
Note: my blog editor will not render mathematical typography. Check the original Google paper on page 8, line 4 for the correct representation.
Stephen E Arnold, September 10, 2014
September 4, 2014
Autonomy, Recommind, and dozens of other search and content processing firms rely on statistical procedures. Anyone who has survived Statistics 101 believes in the power of numbers. Textbook examples are, well, pat. The numbers work out even for B and C students.
The real world, on the other hand, is different. What was formulaic in the textbook exercises is more difficult with most data sets. The data are incomplete, inconsistent, generated by systems whose integrity is unknown, and often wrong. Human carelessness, the lack of time, a lack of expertise, and plain vanilla cluelessness make those nifty data sets squishier than a memory foam pillow.
If you have some questions about statistical evidence in today’s go go world, check out “I Disagree with Alan Turing and Daniel Kahneman Regarding the Strength of Statistical Evidence.”
I noted this passage:
It’s good to have an open mind. When a striking result appears in the dataset, it’s possible that this result does not represent an enduring truth or even a pattern in the general population but rather is just an artifact of a particular small and noisy dataset. One frustration I’ve had in recent discussions regarding controversial research is the seeming unwillingness of researchers to entertain the possibility that their published findings are just noise.
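A rough way to see how easily a small, noisy dataset produces a "striking" result is to work out the odds for a fair coin. The sketch below is my own illustration, not something from the quoted post; the cutoff of 8 out of 10 and the figure of 20 studies are arbitrary assumptions.

```python
from math import comb

def prob_striking(n_flips, cutoff):
    """Chance that a fair coin shows at least `cutoff` heads or at least
    `cutoff` tails in n_flips tosses (assumes cutoff > n_flips / 2)."""
    one_tail = sum(comb(n_flips, k) for k in range(cutoff, n_flips + 1))
    return 2 * one_tail / 2 ** n_flips

# A lopsided 8-2 split (either way) from a coin that has no bias at all.
p = prob_striking(n_flips=10, cutoff=8)

# Chance that at least one of 20 independent small studies looks striking.
at_least_one = 1 - (1 - p) ** 20

print(f"one study: {p:.4f}")
print(f"at least one of 20 studies: {at_least_one:.4f}")
```

Roughly one small study in nine looks lopsided by chance alone, and across 20 such studies a lopsided result somewhere is close to a sure thing.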
An open mind is important. Just looking at the outputs of zippy systems that do prediction for various entities can be instructive. In the last couple of months, I learned that predictive systems:
- Failed to size the Ebola outbreak by orders of magnitude
- Did not provide reliable outputs for analysts trying to figure out where a crashed airplane was
- Came up short regarding resources available to ISIS.
The Big Data revolution is one of those hoped for events. The idea is that Big Data will allow content processing vendors to sell big buck solutions. Another is that massive flows of unstructured content can only be tapped in a meaningful way with expensive information retrieval solutions.
Dreams, hopes, wishes—yep, all valid for children waiting for the tooth fairy. The real world has slightly more bumps and sharp places.
Stephen E Arnold, September 4, 2014
September 3, 2014
Amplitude is a new analytics startup backed by Y Combinator that recently raised $1.975 million in seed funding. TechCrunch reports on the fundraising efforts and how Amplitude differentiates itself from its competition in the article, “Amplitude, The Analytics Startup Undercutting Mixpanel, Raises $2 million Seed Round.”
Amplitude grew because there was a 400 percent increase in its enterprise customer base. Its founders originally were working on a text-by-voice app and they created an analytics tool to examine their data. It did not take them long to discover that the analytics tool was the better application. Amplitude is a valuable product because of its skilled engineering team and the claim that it can predict customer queries and save space. Which brings us to the price:
“Amplitude offers a freemium service that gives customers up to 5 million monthly events for free. In comparison, Mixpanel charges $600/month for 4 million data points. Amplitude also offers a $299/month plans for up to 50 million monthly events – something that would move into custom pricing territory at Mixpanel. Beyond that, Amplitude offers enterprise plans, and today has customers like The Hunt, Heyday, KeepSafe, and other larger customers still under NDA.”
That is very cheap compared to other popular business analytics plans. Amplitude offers a high quality product at a reasonable price. Will it catch on in today’s cash-strapped market? It already has. Be forewarned that prices need to change for other analytics companies or they will lose customers.
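The price gap is easier to judge per million events. This back-of-the-envelope sketch uses only the figures in the TechCrunch quote and assumes that a Mixpanel "data point" and an Amplitude "event" are comparable units, which the quote does not confirm.

```python
def cost_per_million(monthly_price, monthly_events):
    """Monthly price in dollars divided by monthly volume in millions."""
    return monthly_price / (monthly_events / 1_000_000)

# (monthly price in dollars, monthly events) as quoted above.
plans = {
    "Mixpanel": (600, 4_000_000),
    "Amplitude": (299, 50_000_000),
}

rates = {name: cost_per_million(price, events)
         for name, (price, events) in plans.items()}

for name, rate in rates.items():
    print(f"{name}: ${rate:.2f} per million events")

print(f"ratio: {rates['Mixpanel'] / rates['Amplitude']:.1f}x")
```

On these numbers Mixpanel works out to $150.00 per million and Amplitude to $5.98 per million, a gap of about 25 to 1, before counting Amplitude's free tier of 5 million events.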
September 1, 2014
Last week I had a conversation with a publisher who has a keen interest in software that “knows” what content means. Armed with that knowledge, a system can then answer questions.
The conversation was interesting. I mentioned my presentations for law enforcement and intelligence professionals about the limitations of modern and computationally expensive systems.
Several points crystallized in my mind. One of these is addressed, in part, in a diagram created by a person interested in machine learning methods. Here’s the diagram created by SciKit:
The diagram is designed to help a developer select from different methods of performing estimation operations. The author states:
Often the hardest part of solving a machine learning problem can be finding the right estimator for the job. Different estimators are better suited for different types of data and different problems. The flowchart below is designed to give users a bit of a rough guide on how to approach problems with regard to which estimators to try on your data.
First, notice that there is a selection process for choosing a particular numerical recipe. Now who determines which recipe is the right one? The answer is the coding chef. A human exercises judgment about a particular sequence of operations that will be used to fuel machine learning. Is that sequence of actions the best one, the expedient one, or the one that seems to work for the test data? The answer to these questions determines a key threshold for the resulting “learning system.” Stated another way, “Does the person licensing the system know if the numerical recipe is the most appropriate for the licensee’s data?” Nah. Does a mid tier consulting firm like Gartner, IDC, or Forrester dig into this plumbing? Nah. Does it matter? Oh, yeah. As I point out in my lectures, the “accuracy” of a system’s output depends on this type of plumbing decision. Unlike a backed up drain, flaws in smart systems may never be discerned. For certain operational decisions, financial shortfalls or the loss of an operation team in a war theater can be attributed to one of many variables. As decision makers chase the Silver Bullet of smart, thinking software, who really questions the output in a slick graphic? In my experience, darned few people. That includes cheerleaders for smart software, azure chip consultants, and former middle school teachers looking for a job as a search consultant.
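A minimal sketch of why the recipe choice matters, assuming two deliberately simple recipes (predict the training mean, or fit a least-squares line) and two invented data sets. It uses plain Python rather than scikit-learn, and it is not the procedure shown in the flowchart.

```python
from statistics import mean

def fit_mean(xs, ys):
    """Recipe 1: ignore x and always predict the training mean."""
    center = mean(ys)
    return lambda x: center

def fit_line(xs, ys):
    """Recipe 2: ordinary least-squares straight line."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

def holdout_mse(fit, xs, ys, n_train):
    """Fit on the first n_train points, score squared error on the rest."""
    model = fit(xs[:n_train], ys[:n_train])
    errors = [(model(x) - y) ** 2
              for x, y in zip(xs[n_train:], ys[n_train:])]
    return mean(errors)

xs = [0, 1, 2, 3, 4, 5]
trending = [1, 3, 5, 7, 9, 11]   # hypothetical data with a steady trend
flat = [5, 4, 6, 7, 5, 5]        # hypothetical data that only wobbles

for name, ys in [("trending", trending), ("flat", flat)]:
    print(name,
          "mean recipe:", holdout_mse(fit_mean, xs, ys, n_train=4),
          "line recipe:", holdout_mse(fit_line, xs, ys, n_train=4))
```

The line recipe wins on the trending data and loses on the flat data, where it chases a wobble in the four training points. Neither recipe is "right" in general; someone has to look at the licensee's data and decide.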
Second, notice the reference to a “rough guide.” The real guide is understanding of how specific numerical recipes work on a set of data that allegedly represents what the system will process when operational. Furthermore, there are plenty of mathematical methods available. The problem is that some of the more interesting procedures lead to increased computational cost. In a worst case, the more interesting procedures cannot be computed on available resources. Some developers know about P=NP and Big O. Others know to use the same nine or ten mathematical procedures taught in computer science classes. After all, why worry about math based on mereology if the machine resources cannot handle the computations within time and budget parameters? This means that most modern systems are based on a set of procedures that are computationally affordable, familiar, and convenient. Does this similarity of procedures matter? Yep. The generally squirrely outputs from many very popular systems are perceived as completely reliable. Unfortunately, the systems are performing within a narrow range of statistical confidence. Stated in a more harsh way, the outputs are just not particularly helpful.
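The computational cost argument can be put in rough numbers. This sketch assumes a machine that performs one billion simple operations per second and a collection of one million items; both figures are placeholders, and real procedures have constants and memory limits that the estimate ignores.

```python
def estimated_seconds(n_items, exponent, ops_per_second=1e9):
    """Very rough run time for a procedure needing n_items ** exponent steps."""
    return n_items ** exponent / ops_per_second

SECONDS_PER_YEAR = 365 * 24 * 3600
n = 10 ** 6  # hypothetical collection size

linear = estimated_seconds(n, 1)
quadratic = estimated_seconds(n, 2)
cubic = estimated_seconds(n, 3)

print(f"O(n):   {linear} seconds")
print(f"O(n^2): {quadratic / 60:.1f} minutes")
print(f"O(n^3): {cubic / SECONDS_PER_YEAR:.1f} years")
```

Under these assumptions a linear pass finishes in a millisecond, a quadratic method takes about 17 minutes, and a cubic method takes about 32 years. That gap is why the affordable, familiar procedures keep getting picked.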
In my conversation with the publisher, I asked several questions:
- Is there a smart system like Watson that you would rely upon to treat your teenaged daughter’s cancer? Or, would you prefer the human specialist at the Mayo Clinic or comparable institution?
- Is there a smart system that you want directing your only son in an operational mission in a conflict in a city under ISIS control? Or, would you prefer the human-guided decision near the theater about the mission?
- Is there a smart system you want managing your retirement funds in today’s uncertain economy? Or, would you prefer the recommendations of a certified financial planner relying on a variety of inputs, including analyses from specialists in whom your analyst has confidence?
When I asked these questions, the publisher looked uncomfortable. The reason is that the massive hyperbole and marketing craziness about fancy new systems creates what I call the Star Trek phenomenon. People watch Captain Kirk talking to devices, transporting himself from danger, and traveling between far flung galaxies. Because a mobile phone performs some of the functions of the fictional communicator, it sure seems as if many other flashy sci-fi services should be available.
Well, this Star Trek phenomenon does help direct some research. But in terms of products that can be used in high risk environments, the sci-fi remains a fiction.
Believing and expecting are different from working with products that are limited by computational resources, expertise, and informed understanding of key factors.
Humans, particularly those who need money to pay the mortgage, ignore reality. The objective is to close a deal. When it comes to information retrieval and content processing, today’s systems are marginally better than those available five or ten years ago. In some cases, today’s systems are less useful.
August 26, 2014
One of the most-discussed business functions of natural language processing is sentiment analysis. Over at the SmartData Collective, Lexalytics’ Scott Van Boeyen tells us “Why Sentiment Analysis Engines Need Customization.” The short answer: slang. The write-up explains:
The problem with sentiment analysis is sometimes it’s wrong.[…]
“Oh man, that was nasty!” Is this sentence positive or negative? Surely, it must be negative. “Nasty” is a negative word, and everything else in this sentence is neutral. Final answer, negative! Drum roll…. Wrong! It’s positive.
The person who said this used the American slang definition of nasty, which has positive sentiment. There is absolutely no way to know by reading the sentence. So, if you (a human) were just tricked by reading this article, how is a machine supposed to figure it out? Answer: Tell the engine what’s positive and what’s negative.
High quality NLP engines will let you customize your sentiment analysis settings. “Nasty” is negative by default. If you’re processing slang where “nasty” is considered a positive term, you would access your engine’s sentiment customization function, and assign a positive score to the word.
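The customization step described above can be sketched with a toy word-list scorer. This is a hypothetical illustration only: the lexicon, the scores, and the `overrides` parameter are invented, and the code does not represent the Lexalytics engine or any real product's API.

```python
import re

# Hypothetical default scores; a real engine ships a far larger lexicon.
DEFAULT_LEXICON = {"nasty": -1.0, "awful": -1.0, "great": 1.0, "love": 1.0}

def sentiment_score(text, overrides=None):
    """Sum per-word scores; entries in `overrides` replace the defaults."""
    lexicon = {**DEFAULT_LEXICON, **(overrides or {})}
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0.0) for word in words)

sentence = "Oh man, that was nasty!"

default_score = sentiment_score(sentence)
slang_score = sentiment_score(sentence, overrides={"nasty": 1.0})

print("default lexicon:", default_score)
print("slang override:", slang_score)
```

With the default list the sentence scores -1.0; with the slang override it scores 1.0. The override is a one-line change, but someone still has to know which words need it for which audience.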
The man has a point. Still, we are left with a few questions: How much more should one expect to pay for a customization feature? Also, how long does it take to teach an NLP platform comprehensive alternate vocabulary? How does one decide what slang to include—has anyone developed a list of suggestions? Perhaps one could start by consulting the Urban Dictionary.
Cynthia Murrell, August 26, 2014