Big Data and Value

May 19, 2016

I read “The Real Lesson for Data Science That is Demonstrated by Palantir’s Struggles · Simply Statistics.” I love write ups that plunk the word statistics near simple.

Here’s the passage I highlighted in money green:

… What is the value of data analysis?, and secondarily, how do you communicate that value?

I want to step away from the Palantir Technologies’ example and consider a broader spectrum of outfits tossing around the jargon “big data,” “analytics,” and synonyms for smart software. One doesn’t communicate value. One finds a person who needs a solution and crafts the message to close the deal.

When a company and its perceived technology catch the attention of allegedly informed buyers, a bandwagon effect kicks in. Talk inside an organization leads to mentions in internal meetings. The vendor whose products and services are the subject of these comments begins to hint at bigger and better things at conferences. Then a real journalist may catch a scent of “something happening” and write an article. Technical talks at niche conferences generate wonky articles, usually without dates or footnotes, which make sense to someone without access to commercial databases. If a social media breeze whips up the smoldering interest, then a fire breaks out.

A start up should be so clever, lucky, or tactically gifted to pull off this type of wildfire. But when it happens, big money chases the outfit. Once money flows, the company and its products and services become real.

The problem with companies processing a range of data is that there are some friction inducing processes that are tough to coat with Teflon. These include:

  1. Taking different types of data, normalizing it, indexing it in a meaningful manner, and creating metadata which is accurate and timely.
  2. Converting numerical recipes, many with built in threshold settings and chains of calculations, into marching band order able to produce recognizable outputs.
  3. Figuring out how to provide an infrastructure that can sort of keep pace with the flows of new data and the updates/corrections to the already processed data.
  4. Generating outputs that people in a hurry or in a hot zone can use to positive effect; for example, in a war zone, not get killed when the visualization is not spot on.

The write up focuses on a single company and its alleged problems. That’s okay, but it understates the problem. Most content processing companies run out of revenue steam. The reason is that the licensees or customers want the systems to work better, faster, and more cheaply than predecessor or incumbent systems.

The vast majority of search and content processing systems are flawed, expensive to set up and maintain, and really difficult to use in a way that produces high reliability outputs over time. I would suggest that the problem bedevils a number of companies.

Some of those struggling with these issues are big names. Others are much smaller firms. What’s interesting to me is that the trajectory content processing companies follow is a well worn path. One can read about Autonomy, Convera, Endeca, Fast Search & Transfer, Verity, and dozens of other outfits and discern what’s going to happen. Here’s a summary for those who don’t want to work through the case studies on my Xenky intel site:

Stage 1: Early struggles and wild and crazy efforts to get big name clients

Stage 2: Making promises that are difficult to implement but which are essential to capture customers looking actively for a silver bullet

Stage 3: Frantic building and deployment accompanied with heroic exertions to keep the customers happy

Stage 4: Closing as many deals as possible either for additional financing or for licensing/consulting deals

Stage 5: The early customers start grousing and the momentum slows

Stage 6: Sell off the company or shut down like Delphes, Entopia, Siderean Software and dozens of others.

The problem is not technology, math, or Big Data. The force which undermines these types of outfits is the difficulty of making sense out of words and numbers. In my experience, the task is a very difficult one for humans and for software. Humans want to golf, cruise Facebook, emulate Amazon Echo, or like water find the path of least resistance.

Making sense out of information when someone is lobbing mortars at one is a problem which technology can only solve in a haphazard manner. Hope springs eternal and managers are known to buy or license a solution in the hopes that my view of the content processing world is dead wrong.

So far I am on the beam. Content processing requires time, humans, and a range of flawed tools which must be used by a person with old fashioned human thought processes and procedures.

Value is in the eye of the beholder, not in zeros and ones.

Stephen E Arnold, May 19, 2016

Chinese Restaurant Names as Journalism

April 19, 2016

I read an article in Jeff Bezos’ newspaper. The title was “We Analyzed the Names of Almost Every Chinese Restaurant in America. This Is What We Learned.” The almost is a nifty way of slip sliding around the sampling method which used restaurants listed in Yelp. Close enough for “real” journalism.

Using the notion of a frequency count, the write up revealed:

  • The word appearing most frequently in the names of the sample was “restaurant.”
  • The words “China” and “Chinese” appear in about 15,000 of the sample’s restaurant names
  • “Express” is a popular word, not far ahead of “panda”.
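For anyone who wants to see how little machinery a frequency count needs, here is a minimal sketch in Python. The restaurant names are hypothetical stand-ins I made up for illustration, not the article's Yelp-derived sample, and the function name `word_frequencies` is my own:

```python
from collections import Counter
import re

# Hypothetical sample names; the article's actual Yelp-derived data is not reproduced here.
SAMPLE_NAMES = [
    "Panda Express",
    "China Garden Restaurant",
    "Golden Dragon Chinese Restaurant",
    "New China Express",
    "Great Wall Restaurant",
]


def word_frequencies(names):
    """Count how often each lowercased word appears across all names."""
    counts = Counter()
    for name in names:
        # Split on anything that is not a letter or apostrophe.
        counts.update(re.findall(r"[a-z']+", name.lower()))
    return counts


frequencies = word_frequencies(SAMPLE_NAMES)
for word, count in frequencies.most_common(3):
    print(word, count)
```

Even with five made-up names, “restaurant” comes out on top. The hard part is the sampling, not the counting.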

The words and their frequencies were used to generate a word cloud:


To answer the question where Chinese food is most popular in the US, the intrepid data wranglers at Jeff Bezos’ newspaper output a map:


Amazing. I wonder if law enforcement and intelligence entities know that one can map data to discover things like the fact that the word “restaurant” is the most used word in a restaurant’s name.

Stephen E Arnold, April 19, 2016

Machine Learning: 10 Numerical Recipes

April 8, 2016

The chatter about smart is loud. I cannot hear the mixes on my Creamfields 2014 CD. Mozart, you are a goner.

If you want to cook up some smart algorithms to pick music or drive your autonomous vehicle without crashing into a passenger carrying bus, navigate to “Top 10 Machine Learning Algorithms.”

The write up points out that just like pop music, there is a top 10 list. More important in my opinion is the concomitant observation that smart software may be based on a limited number of procedures. Hey, this stuff is taught in many universities. Go with what you know maybe?

What are the top 10? The write up asserts:

  1. Linear regression
  2. Logistic regression
  3. Linear discriminant analysis
  4. Classification and regression trees
  5. Naive Bayes
  6. K nearest neighbors
  7. Learning vector quantization
  8. Support vector machines
  9. Bagged decision trees and random forest
  10. Boosting and AdaBoost.

The article tosses in a bonus too: Gradient descent.

What is interesting is that there is considerable overlap with the list I developed for my lecture on manipulating content processing using shaped or weaponized text strings. How’s that, Ms. Null?

The point is that when systems use the same basic methods, are those systems sufficiently different? If so, in what ways? How are systems using standard procedures configured? What if those configurations or “settings” are incorrect?
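A small sketch of the “settings” question, using K nearest neighbors from the list above. The data points, the query, and the function name `knn_predict` are hypothetical, invented only to show that the same textbook method can return a different answer when one configuration value changes:

```python
from collections import Counter

# Hypothetical labeled points (x, y, label); purely illustrative.
TRAINING = [
    (1.0, 1.0, "A"),
    (1.2, 0.8, "A"),
    (3.0, 3.0, "B"),
    (3.2, 2.9, "B"),
    (2.9, 3.3, "B"),
]


def knn_predict(point, training, k):
    """Label a point by majority vote among its k nearest training points."""
    by_distance = sorted(
        training,
        # Squared Euclidean distance is enough for ranking neighbors.
        key=lambda row: (row[0] - point[0]) ** 2 + (row[1] - point[1]) ** 2,
    )
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


query = (1.8, 1.7)
prediction_k1 = knn_predict(query, TRAINING, k=1)  # only the closest point votes
prediction_k5 = knn_predict(query, TRAINING, k=5)  # every point votes
print(prediction_k1, prediction_k5)
```

With k set to 1 the query lands in class A; with k set to 5 the three B points outvote the two A points. Same recipe, same data, different output. Which setting is “correct” depends on the problem, and that is a decision a human has to make.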


Stephen E Arnold, April 8, 2016

Attensity Europe Has a New Name

March 30, 2016

Short honk: The adventure of Attensity continues. Attensity Europe has renamed itself Sematell Interactive Solutions. You can read about the change here. The news release reminds the reader that Sematell is “the leading provider of interaction solutions.” I am not able to define interaction solutions, but I assume the company named by combining semantic and intelligence will make the “interaction solutions” thing crystal clear. The url is

Stephen E Arnold, March 30, 2016

Text Analytics: Crazy Numbers Just Like the Good Old Days of Enterprise Search

March 16, 2016

Short honk: Want a growth business in a niche function that supports enterprise platforms? Well, gentle reader, look no farther than text analytics. Get your checkbook out and invest in this remarkable sector. It will be huuuuge.

Navigate to “Text Analytics Market to Account for US$12.16 bn in Revenue by 2024.” What is text analytics? How big is text analytics today? How long has text analytics been a viable function supporting content processing?

Ah, good questions, but what’s really important is this passage:

According to this report, the global text analytics market revenue stood at US$2.82 bn in 2015 and is expected to reach US$12.16 bn by 2024, at a CAGR of 17.6% from 2016 to 2024.

I love these estimates. Imagine. Close out your life savings and invest in text analytics. You will receive a CAGR of 17.6 percent which you can cash in and buy stuff in 2024. That’s just eight years.
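A back-of-the-envelope check of the quoted figures, sketched in Python. I am assuming plain annual compounding and using only the two revenue numbers in the passage; the function name `implied_cagr` is my own label, not something from the report:

```python
# Figures taken from the quoted passage (US$ billions).
start_revenue_bn = 2.82   # 2015 market size
end_revenue_bn = 12.16    # 2024 projection


def implied_cagr(start, end, years):
    """Compound annual growth rate implied by a start value, end value, and period count."""
    return (end / start) ** (1.0 / years) - 1.0


# Nine compounding periods: 2015 base, growth applied each year from 2016 through 2024.
cagr_nine_years = implied_cagr(start_revenue_bn, end_revenue_bn, 9)
# Eight compounding periods, for comparison.
cagr_eight_years = implied_cagr(start_revenue_bn, end_revenue_bn, 8)

print(f"{cagr_nine_years:.1%} {cagr_eight_years:.1%}")
```

Under these assumptions, the 17.6 percent figure matches nine periods of compounding from the 2015 base. Squeeze the same jump into eight periods and the rate works out to about 20 percent. The spreadsheet arithmetic holds together; whether the inputs do is another matter.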

Worried about the economy? Want to seek the safe shelter of bonds? Forget the worries. If text analytics is so darned hot, why is the consulting firm pitching this estimate writing reports? Why not invest in text analytics?

Answer: Maybe the estimate is a consequence of spreadsheet fever?

Text analytics is a rocket just like the ones Jeff Bezos will use to carry you into space.

Stephen E Arnold, March 16, 2016

Data Insight: Common Sense Makes Sense

February 25, 2016

I am skeptical about lists of problems which hot buzzwords leave in their wake. I read “Why Data Insight Remains Elusive,” which I thought was another content marketing pitch to buy, buy, buy. Not so. The write up contains some clearly expressed, common sense reminders for those who want to crunch big data and point and click their way through canned reports. Those who actually took the second semester of Statistics 101 know that ignoring the data quality and the nitty gritty of the textbook procedures can lead to bone head outputs.

The write up identifies some points to keep in mind, regardless of which analytics vendor system a person is using to make more informed or “augmented” decisions.

Here’s the pick of the litter:

  1. Manage the data. Yep, time consuming, annoying, and essential. Skip this step at your decision making peril.
  2. Manage the indexing. The buzzword is metadata, but assigning keywords and other indexing items makes the difference when trying to figure out who, what, why, when, and where. Time? Yep, metadata which not even the Alphabet Google thing does particularly well.
  3. Create data models. Do the textbook stuff. Get the model wrong, and what happens? Failure on a scale equivalent to fumbling the data management processes.
  4. Visualization is not analytics. Visualization makes outputs of numerical recipes appear in graphical form. Do not confuse Hollywood outputs with relevance, accuracy, or math on point to the problem one is trying to resolve.
  5. Knee jerking one’s way through analytics. Sorry, reflexes are okay but useless without context. Yep, have a problem, get the data, get the model, test, and examine the outputs.

Common sense. Most basic stuff was in the textbooks for one’s college courses. Too bad more folks did not internalize those floorboards and now seek contractors to do a retrofit. Quite an insight when the bill arrives.

Stephen E Arnold, February 25, 2016

Text Analytics Vendors for Your Retirement Fund

February 10, 2016

I located a list of companies involved in content processing. You may want to add one or more of these to your retirement investment portfolio. Which one will be the next Facebook, Google, or Uber? I know I would love to have a hat or T shirt from each of these outfits:
Automated Insights
Health Fidelity
Semantic Machines
TEMIS (Expert System)

Stephen E Arnold, February 8, 2016

HP Enterprise Investigative Analytics

February 5, 2016

Shiver me timbers. Batten the hatches. There is a storm brewing in the use of Autonomy-type methods to identify risks and fraud. To be fair, HP Enterprise no longer pitches Autonomy, but the spirit of Dr. Mike Lynch’s 1990s technology is there, just a hint maybe, but definitely noticeable to one who has embraced IDOL.

For the scoop, navigate to “HPE Launches Investigative Analytics, Using AI and Big Data to Identify Risk.” I was surprised that the story’s headline did not add “When Swimming in the Data Lake.” But the message is mostly clear despite the buzzwords.

Here’s a passage I highlighted:

The software is initially geared toward financial services organizations, and it combines existing HPE products like Digital Safe, IDOL, and Vertica all on one platform. By using big data analytics and artificial intelligence, it can analyze a large amount of data and help pinpoint potential risks of fraudulent behavior.

Note the IDOL thing.

The write up added:

Investigative Analytics starts by collecting both structured sources like trading systems, risk systems, pricing systems, directories, HR systems, and unstructured sources like email and chat. It then applies analysis to query “aggressively and intelligently across all those data sources,” Patrick [HP Enterprise wizard] said. Then, it creates a behavior model on top of that analysis to look at certain communication types and see if they can define a certain problematic behavior and map back to a particular historical event, so they can look out for that type of communication in the future.

This is okay, but the words, terminology, and phrasing remind me of more than 1990s Autonomy marketing collateral, BAE’s presentations after licensing Autonomy technology in the late 1990s, the i2 Ltd. Analyst Notebook collateral, and, more recently, the flood of jabber about Palantir’s Metropolitan Platform and Thomson Reuters’ version of Metropolitan called QA Direct or QA Studio or QA fill in the blank.

The fact that HP Enterprise is pitching this new service developed with “one bank” at a legal eagle tech conference is a bit like me offering to do my Dark Web Investigative Tools lecture at Norton Elementary School. A more appropriate audience might deliver more bang for each PowerPoint slide, might it not?

Will HP Enterprise put a dent in the vendors already pounding the carpeted halls of America’s financial institutions?

HP Enterprise stakeholders probably hope so. My hunch is that a me-too, me-too product is a less than inspiring use of the collection of acquired technologies HP Enterprise appears to put in a single basket.

Stephen E Arnold, February 5, 2016

Big Data: A Shopsmith for Power Freaks?

February 4, 2016

I read an article that I dismissed. The title nagged at my ageing mind and dwindling intellect. “This is Why Dictators Love Big Data” did not ring my search, content processing, or Dark Web chimes.

Annoyed at my inner voice, I returned to the story, annoyed with the “This Is Why” phrase in the headline.


Predictive analytics are not new. The packaging is better.

I think this is the main point of the write up, but I am never sure with online articles. The articles can be ads or sponsored content. The authors could be looking for another job. The doubts about information today plague me.

The circled passage is:

Governments and government agencies can easily use the information every one of us makes public every day for social engineering — and even the cleverest among us is not totally immune. Do you like cycling? Have children? A certain breed of dog? Volunteer for a particular cause? This information is public, and could be used to manipulate you into giving away more sensitive information.

The only hitch in the git along is that this is not just old news. The systems and methods for making decisions based on the munching of math in numerical recipes have been around for a while. Autonomy? A pioneer in the 1990s. Nope. Not even the super secret use of Bayesian, Markov, and related methods during World War II reaches back far enough. Nudge the ball to hundreds of years farther on the timeline. Not new in my opinion.

I also noted this comment:

In China, the government is rolling out a social credit score that aggregates not only a citizen’s financial worthiness, but also how patriotic he or she is, what they post on social media, and who they socialize with. If your “social credit” drops below a certain level because you post anti-government messages online or because you’re socially associated with other dissidents, you could be denied credit approval, financial opportunities, job promotions, and more.

Just China? I fear not, gentle reader. Once again the “real” journalists are taking an approach which does not do justice to the wide diffusion of certain mathy applications.

Net net: I should have skipped this write up. My initial judgment was correct. Not only is the headline annoying to me, the information is par for the Big Data course.

Stephen E Arnold, February 4, 2016

Palantir: Revenue Distribution

January 27, 2016

I came across a write up in a Chinese blog about Palantir. You can find the original text at this link. I have no idea if the information is accurate, but I had not seen this breakdown before:


The chart from “Touchweb” shows that in FY 2015 privately held Palantir derives 71 percent of its revenue from commercial clients.

The report then lists the lines of business which the company offers. Again this was information I had not previously seen:

  • Energy, disaster recovery, consumer goods, and card services
  • Retail, pharmaceuticals, media, and insurance
  • Audit, legal prosecution
  • Cyber security, banking
  • Healthcare research
  • Local law enforcement, finance
  • Counter terrorism, war fighting, special forces.

Because Palantir is privately held, there is no solid, audited data available to folks in Kentucky at this time.

Nevertheless, the important point is that the Palantir search and content processing platform has a hefty valuation, lots of venture financing, and what appears to be a diversified book of business.

Stephen E Arnold, January 27, 2016
