Bad Big Data? Get More Data Then

March 2, 2017

I like the idea that more is better. The idea is particularly magnetic when a company cannot figure out what its own in-house, proprietary data mean. Think of the legions of consultants from McKinsey and BCG telling executives what their own data “means.” Toss in the notion of Big Data in a giant “data lake,” and you have decision makers who cannot use the information they already have.

Well, how does one fix that problem? Easy. Get more data. That sounds like a plan, particularly when the professionals struggling are in charge of figuring out if sales and marketing investments sort of pay for themselves.

I learned that I need more data by reading “Deepening The Data Lake: How Second-Party Data Increases AI For Enterprises.” The headline introduces the amazing data lake concept along with two giant lake front developments: More data and artificial intelligence.

Buzzwords? Heck no. Just solid post millennial reasoning; for example:

there are many marketers with surprisingly sparse data, like the food marketer who does not get many website visitors or authenticated customers downloading coupons. Today, those marketers face a situation where they want to use data science to do user scoring and modeling but, because they only have enough of their own data to fill a shallow lake, they have trouble justifying the costs of scaling the approach in a way that moves the sales needle.

I like that sales needle phrase. Marketers have to justify themselves and many have only “sparse” data. I would suggest that marketers often have useless data like the number of unique clicks, but that’s only polluting the data lake.

The fix is interesting. I learned:

we can think of the marketer’s first-party data – media exposure data, email marketing data, website analytics data, etc. – being the water that fills a data lake. That data is pumped into a data management platform (pictured here as a hydroelectric dam), pumped like electricity through ad tech pipes (demand-side platforms, supply-side platforms and ad servers) and finally delivered to places where it is activated (in the town, where people live)… this infrastructure can exist with even a tiny bit of water but, at the end of the cycle, not enough electricity will be generated to create decent outcomes and sustain a data-driven approach to marketing. This is a long way of saying that the data itself, both in quality and quantity, is needed in ever-larger amounts to create the potential for better targeting and analytics.

Yep, more data.

And what about making sense of the additional data? I learned:

The data is also of extremely high provenance, and I would also be able to use that data in my own environment, where I could model it against my first-party data, such as site visitors or mobile IDs I gathered when I sponsored free Wi-Fi at the last Country Music Awards. The ability to gather and license those specific data sets and use them for modeling in a data lake is going to create massive outcomes in my addressable campaigns and give me an edge I cannot get using traditional ad network approaches with third-party segments. Moreover, the flexibility around data capture enables marketers to use highly disparate data sets, combine and normalize them with metadata – and not have to worry about mapping them to a predefined schema. The associative work happens after the query takes place. That means I don’t need a predefined schema in place for that data to become valuable – a way of saying that the inherent observational bias in traditional approaches (“country music fans love mainstream beer, so I’d better capture that”) never hinders the ability to activate against unforeseen insights.
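The “no predefined schema” idea in that passage is usually called schema-on-read. Here is a minimal sketch of what it might look like, assuming raw records are kept as JSON strings and fields are picked only when someone asks a question. The record contents, field names, and the `query` helper are all made up for illustration; they are not from the article or from any particular data management platform.

```python
import json

# Hypothetical raw "lake": records from different sources, stored as-is.
# No table definition says which fields a record must contain.
raw_lake = [
    '{"source": "web", "visitor_id": "v1", "page": "/coupons"}',
    '{"source": "wifi", "mobile_id": "m7", "venue": "awards-show"}',
    '{"source": "email", "visitor_id": "v1", "opened": true}',
]


def query(lake, wanted_fields, where=None):
    """Parse raw records at read time and keep only the requested fields."""
    rows = []
    for line in lake:
        record = json.loads(line)
        if where is not None and not where(record):
            continue
        # Fields a record lacks come back as None instead of raising an error.
        rows.append({field: record.get(field) for field in wanted_fields})
    return rows


# The "schema" (source, page, opened) is chosen here, at query time.
visitor_rows = query(
    raw_lake,
    ["source", "page", "opened"],
    where=lambda record: record.get("visitor_id") == "v1",
)
for row in visitor_rows:
    print(row)
```

The trade-off, as far as I can tell, is that nothing stops bad or mismatched records from landing in the lake; the cleanup work moves from load time to query time.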

Okay, I think I understand. No wonder companies hire outfits like blue chip consulting firms to figure out what is going on in their companies. Stated another way, insiders live in the swamp. Outsiders can put the swamp into a context and maybe implement some pollution control systems.

Stephen E Arnold, March 2, 2017

Big Data Needs to Go Public

December 16, 2016

Big Data touches every part of our lives and we are unaware.  Have you ever noticed when you listen to the news, read an article, or watch a YouTube video that people say things such as “experts claim,” “science says,” etc.?  In the past, these statements relied on less than trustworthy sources, but now they can use Big Data to back up their claims.  However, popular opinion and puff pieces still need to back up their Big Data with hard fact.  The article “More Accountability For Big-Data Algorithms” says that transparency is a big deal for Big Data and that algorithm designers need to work on it.

One of the hopes is that Big Data will be used to bridge the divide between one bias and another, except that the opposite can happen.  In other words, Big Data algorithms can be designed with a bias:

There are many sources of bias in algorithms. One is the hard-coding of rules and use of data sets that already reflect common societal spin. Put bias in and get bias out. Spurious or dubious correlations are another pitfall. A widely cited example is the way in which hiring algorithms can give a person with a longer commute time a negative score, because data suggest that long commutes correlate with high staff turnover.
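The commute example can be made concrete with a toy scoring rule. This is only a sketch with synthetic numbers: the applicant list, the group labels, and the `hiring_score` penalty are invented for illustration and do not come from the article or from any real hiring system.

```python
# Synthetic applicants: (neighborhood, commute in minutes). Made-up values.
applicants = [
    ("city_center", 15), ("city_center", 20), ("city_center", 25),
    ("outer_suburb", 50), ("outer_suburb", 60), ("outer_suburb", 70),
]


def hiring_score(commute_minutes, base=100, penalty_per_minute=1):
    """Hypothetical rule: every commute minute costs one point."""
    return base - penalty_per_minute * commute_minutes


def average_score_by_group(rows):
    """Average the score per neighborhood to expose the side effect."""
    totals = {}
    for group, commute in rows:
        total, count = totals.get(group, (0, 0))
        totals[group] = (total + hiring_score(commute), count + 1)
    return {group: total / count for group, (total, count) in totals.items()}


averages = average_score_by_group(applicants)
print(averages)
```

The rule never looks at where a person lives, yet the averages split cleanly by neighborhood because commute time stands in for it. That is the “bias in, bias out” point in miniature.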

Even worse is that people and organizations can design an algorithm to support science or facts they want to pass off as the truth.  There is a growing demand for “algorithm accountability,” mostly in academia.  The demands are that data sets fed into the algorithms are made public.  There are also plans to make algorithms that monitor algorithms for bias.

Big Data is here to stay, but relying too much on algorithms can distort the facts.  This is why the human element is still needed to distinguish between fact and fiction.  Minority Report is closer to being our present than ever before.

Whitney Grace, December 16, 2016

Algorithm Bias in Beauty Contests

September 16, 2016

I don’t read about beauty contests. In my college dorm, I recall that the televised broadcast of the Miss America pageant was popular among some of the residents. I used the attention grabber as my cue to head to the library so I could hide reserved books from my classmates. Every little bit helps in the dog eat dog world of academic achievement.

“When Artificial Intelligence Judges a Beauty Contest, White People Win” surprised me. I thought that algorithms were objective little numerical recipes. Who could fiddle 1+1=2?

I learned:

The foundation of machine learning is data gathered by humans, and without careful consideration, the machines learn the same biases of their creators. Sometimes bias is difficult to track, but other times it’s clear as the nose on someone’s face—like when it’s a face the algorithm is trying to process and judge.

It seems that an algorithm likes white people. The write up informed me:

An online beauty contest called, run by Youth Laboratories (that lists big names in tech like Nvidia and Microsoft as “partners and supporters” on the contest website), solicited 600,000 entries by saying they would be graded by artificial intelligence. The algorithm would look at wrinkles, face symmetry, amount of pimples and blemishes, race, and perceived age. However, race seemed to play a larger role than intended; of the 44 winners, 36 were white.

Oh, oh. Microsoft and its smart software seem to play a role in this drama.

What’s the fix? Better data. The write up includes this statement from a Microsoft expert:

“If a system is trained on photos of people who are overwhelmingly white, it will have a harder time recognizing non-white faces,” writes Kate Crawford, principal researcher at Microsoft Research New York City, in a New York Times op-ed. “So inclusivity matters—from who designs it to who sits on the company boards and which ethical perspectives are included. Otherwise, we risk constructing machine intelligence that mirrors a narrow and privileged vision of society, with its old, familiar biases and stereotypes.”

In the last few months, Microsoft’s folks were involved in Tay, a chatbot which allegedly learned to be racist. Then there was the translation of “Daesh” as Saudi Arabia. Now algorithms appear to favor folks of a particular stripe.

Exciting math. But Microsoft has also managed to gum up webcams and Kindle access in Windows 10. Yep, the new Microsoft is a sparkling example of smart.

Stephen E Arnold, September 16, 2016

In-Q-Tel Wants Less Latency, Fewer Humans, and Smarter Dashboards

September 15, 2016

I read “The CIA Just Invested in a Hot Startup That Makes Sense of Big Data.” I love the “just.” In-Q-Tel investments are not like bumping into a friend in Penn Station. Zoomdata, founded in 2012, has been making calls, raising venture funding (more than $45 million in four rounds from 21 investors), and staffing up to about 100 full time equivalents. With its headquarters in Reston, Virginia, the company is not exactly operating from a log cabin west of Paducah, Kentucky.

The write up explains:

Zoom Data uses something called Data Sharpening technology to deliver visual analytics from real-time or historical data. Instead of a user searching through an Excel file or creating a pivot table, Zoom Data puts what’s important into a custom dashboard so users can see what they need to know immediately.

What Zoomdata does is offer hope to its customers for less human fiddling with data and faster outputs of actionable intelligence. If you recall how IBM i2 and Palantir Gotham work, humans are needed. IBM even snagged Palantir’s jargon of AI for “augmented intelligence.”

In-Q-Tel wants more smart software with less dependence on expensive, hard to train, and often careless humans. When incoming rounds hit near a mobile operations center, it is possible to lose one’s train of thought.

Zoomdata has some Booz, Allen DNA, some MIT RNA, and protein from other essential chemicals.

The write up mentions Palantir, but does not make explicit the need to reduce to some degree the human-centric approaches which are part of the major systems’ core architecture. You have nifty cloud stuff, but you have less nifty humans in most mission critical work processes.

To speed up the outputs, software should be the answer. An investment in Zoomdata delivers three messages to me here in rural Kentucky:

  1. In-Q-Tel continues to look for ways to move along the “less wait and less weight” requirement of those involved in operations. “Weight” refers to heavy, old-fashioned systems. “Wait” refers to the latency imposed by manual processes.
  2. Zoomdata and other investments are whips to the flanks of the BAE Systems, IBMs, and Palantirs chasing government contracts. The investment focuses attention not on scope changes but on figuring out how to deal with the unacceptable complexity and latency of many existing systems.
  3. In-Q-Tel has upped the value of Zoomdata. With consolidation in the commercial intelligence business rolling along at NASCAR speeds, it won’t take long before Zoomdata finds itself going to big company meetings to learn what the true costs of being acquired are.

For more information about Zoomdata, check out the paid-for reports at this link.

Stephen E Arnold, September 15, 2016

How Collaboration and Experimentation Are Key to Advancing Machine Learning Technology

September 12, 2016

The article on CIO titled Machine Learning “Still a Cottage Industry” conveys the sentiments of a man at the heart of the industry in Australia, Professor Bob Williamson. Williamson is the Commonwealth Scientific and Industrial Research Organisation’s (CSIRO’s) Data 61 group chief scientist. His work in machine learning and data analytics led him to the conclusion that for machine learning to truly move forward, scientists must find a way to collaborate. He is quoted in the article,

“There’s these walled gardens: ‘I’ve gone and coded my models in a particular way, you’ve got your models coded in a different way, we can’t share’. This is a real challenge for the community. No one’s cracked this yet.” A number of start-ups have entered the “machine-learning-as-a-service” market, such as BigML, and Precog, and the big names including IBM, Microsoft and Amazon haven’t been far behind. Though these MLaaSs herald some impressive results, Williamson warned businesses to be cautious.

Williamson speaks to the possibility of stagnation in machine learning due to the emphasis on data mining as opposed to experimenting. He hopes businesses will do more with their data than simply look for patterns. It is a refreshing take on the industry from an outsider/insider, a scientist more interested in the science of it all than the massive stacks of cash at stake.

Chelsea Kerwin, September 12, 2016

Sponsored by, publisher of the CyberOSINT monograph
There is a Louisville, Kentucky Hidden Web/Dark Web meet up on September 27, 2016.
Information is at this link:

Data: Lakes, Streams, Whatever

June 15, 2016

I read “Data Lakes vs Data Streams: Which Is Better?” The answer seems to me to be “both.” Streams are now. Lakes are “were.” Who wants to make decisions based on historical data? On the other hand, real time data may mislead the unwary data sailor. The write up states:

The availability of these new ways [lakes and streams] of storing and managing data has created a need for smarter, faster data storage and analytics tools to keep up with the scale and speed of the data. There is also a much broader set of users out there who want to be able to ask questions of their data themselves, perhaps to aid their decision making and drive their trading strategy in real-time rather than weekly or quarterly. And they don’t want to rely on or wait for someone else such as a dedicated business analyst or other limited resource to do the analysis for them. This increased ability and accessibility is creating whole new sets of users and completely new use cases, as well as transforming old ones.

Good news for self appointed lake and stream experts. Bad news for a company trying to figure out how to generate new revenues.

The first step may be to answer some basic questions about what data are available, their reliability, and which person “knows” about data wrangling. Finding out whether the water is polluted is a good idea before worrying about lakes and streams and diving into the murky waters.

Stephen E Arnold, June 15, 2016

Stanford Offers Course Overviewing Roots of the Google Algorithm

March 23, 2016

The course syllabus for Stanford’s Computer Science class titled CS 349: Data Mining, Search, and the World Wide Web provides an overview of some of the technologies and advances that led to Google search. The syllabus states,

“There has been a close collaboration between the Data Mining Group (MIDAS) and the Digital Libraries Group at Stanford in the area of Web research. It has culminated in the WebBase project whose aims are to maintain a local copy of the World Wide Web (or at least a substantial portion thereof) and to use it as a research tool for information retrieval, data mining, and other applications. This has led to the development of the PageRank algorithm, the Google search engine…”

The syllabus alone offers some extremely useful insights that could help students and laypeople understand the roots of Google search. Key inclusions are the Digital Equipment Corporation (DEC) and PageRank, the algorithm named for Larry Page that enabled Google to become Google. The algorithm ranks web pages based on how many other pages link to them and on how highly ranked those linking pages are. Jon Kleinberg also played a key role by realizing that websites with lots of outbound links (like a search engine) should also be seen as more important. The larger context of the course is data mining and information retrieval.
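For readers who want to see the link-counting idea in action, here is a minimal sketch of PageRank by repeated averaging (power iteration). It is a textbook simplification, not Google’s production code; the damping value of 0.85, the iteration count, and the four-page `toy_web` graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages from a dict mapping each page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally important.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page keeps a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "c", nothing links to "d".
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph the page with the most inbound links comes out on top, and the page nobody links to ends up last. Note that a link from a highly ranked page is worth more than a link from an obscure one.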


Chelsea Kerwin, March 23, 2016



Infonomics and the Big Data Market Publishers Need to Consider

March 22, 2016

The article on Beyond the Book titled Data Not Content Is Now Publishers’ Product floats a new buzzword in its discussion of the future of information: infonomics, or the study of creation and consumption of information. The article compares information to petroleum as the resource that will cause quite a stir in this century. Grace Hong, Vice-President of Strategic Markets & Development for Wolters Kluwer’s Tax & Accounting, weighs in,

“When it comes to big data – and especially when we think about organizations like traditional publishing organizations – data in and of itself is not valuable.  It’s really about the insights and the problems that you’re able to solve,”  Hong tells CCC’s Chris Kenneally. “From a product standpoint and from a customer standpoint, it’s about asking the right questions and then really deeply understanding how this information can provide value to the customer, not only just mining the data that currently exists.”

Hong points out that the data itself is useless unless it has been produced correctly. That means asking the right questions and using the best technology available to find meaning in the massive collections of information possible to collect. Hong suggests that it is time for publishers to seize on the market created by Big Data.


Chelsea Kerwin, March 22, 2016


Natural Language Processing App Gains Increased Vector Precision

March 1, 2016

For us, concepts have meaning in relationship to other concepts, but it’s easy for computers to define concepts in terms of usage statistics. The post Sense2vec with spaCy and Gensim from spaCy’s blog offers a well-written outline explaining how natural language processing works, highlighting their new Sense2vec app. This application is an upgraded version of word2vec that works with more context-sensitive word vectors. The article describes more precisely how Sense2vec works,

“The idea behind sense2vec is super simple. If the problem is that duck as in waterfowl and duck as in crouch are different concepts, the straight-forward solution is to just have two entries, duckN and duckV. We’ve wanted to try this for some time. So when Trask et al (2015) published a nice set of experiments showing that the idea worked well, we were easy to convince.

We follow Trask et al in adding part-of-speech tags and named entity labels to the tokens. Additionally, we merge named entities and base noun phrases into single tokens, so that they receive a single vector.”
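The “two entries” trick can be sketched without any NLP library. The toy below is not spaCy’s implementation: the sentences are hand-tagged, the `word|TAG` key format is an assumption, and plain neighbor counts stand in for the trained vectors a real system would learn.

```python
from collections import Counter, defaultdict

# Hand-tagged toy sentences. A real pipeline would get the tags from a
# part-of-speech tagger; these are typed in by hand for illustration.
tagged_sentences = [
    [("the", "DET"), ("duck", "NOUN"), ("swam", "VERB")],
    [("we", "PRON"), ("duck", "VERB"), ("quickly", "ADV")],
    [("a", "DET"), ("duck", "NOUN"), ("quacked", "VERB")],
]


def sense_key(word, tag):
    """Fuse word and tag so each sense gets its own vocabulary entry."""
    return f"{word}|{tag}"


def cooccurrence(sentences, window=1):
    """Count the neighbors of each sense key within a small window."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        keys = [sense_key(word, tag) for word, tag in sentence]
        for i, key in enumerate(keys):
            start = max(0, i - window)
            end = min(len(keys), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[key][keys[j]] += 1
    return counts


contexts = cooccurrence(tagged_sentences)
print(dict(contexts["duck|NOUN"]))
print(dict(contexts["duck|VERB"]))
```

Because the noun and the verb are separate keys, their neighbor counts never mix: the waterfowl keeps company with determiners, the crouch with pronouns and adverbs. A vector trained per key would inherit the same separation.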

Curious about the meta definition of natural language processing from SpaCy, we queried natural language processing using Sense2vec. Its neural network is based on every word on Reddit posted in 2015. While it is a feat for NLP to learn from a dataset on one platform, such as Reddit, what about processing that scours multiple data sources?


Megan Feil, March 1, 2016



Elasticsearch Works for Us 24/7

February 5, 2016

Elasticsearch is one of the most popular open source search applications and it has been deployed for personal as well as corporate use.  Elasticsearch is built on another popular open source application called Apache Lucene and it was designed for horizontal scalability, reliability, and easy usage.  Elasticsearch has become such an invaluable piece of software that people do not realize just how useful it is.  Eweek takes the opportunity to discuss the search application’s uses in “9 Ways Elasticsearch Helps Us, From Dawn To Dusk.”

“With more than 45 million downloads since 2012, the Elastic Stack, which includes Elasticsearch and other popular open-source tools like Logstash (data collection), Kibana (data visualization) and Beats (data shippers) makes it easy for developers to make massive amounts of structured, unstructured and time-series data available in real-time for search, logging, analytics and other use cases.”

How is Elasticsearch being used?  The Guardian uses it daily to help readers interact with content, Microsoft Dynamics ERP and CRM use it to index and analyze social feeds, it powers Yelp, and here is a big one: Wikimedia uses it to power the well-loved and well-used Wikipedia.  We can already see how much of an impact Elasticsearch makes on our daily lives without us being aware.  Other companies that use Elasticsearch for our and their benefit are Hotels Tonight, Dell, Groupon, Quizlet, and Netflix.

Elasticsearch will continue to grow as an inexpensive alternative to proprietary software and the number of Web services/companies that use it will only continue to grow.

Whitney Grace, February 5, 2016
