Big Data Is a Big Mess

January 18, 2017

Big Data and Cloud Computing were supposed to make it easier for C-suites to make billion-dollar decisions. But it seems things have started to fall apart.

In an article published by Forbes titled The Data Warehouse Has Failed, Will Cloud Computing Die Next?, the author says:

A company that sells software tools designed to put intelligence controls into data warehousing environments says that traditional data warehousing approaches are flaky. Is this just a platform to spin WhereScape wares, or does Whitehead have a point?

WhereScape, a key player in data warehousing, is admitting that the buzzwords in the IT industry are fizzling out. Big Data is being generated in abundance, but companies are still unsure what to do with the enormous amount of data they produce.

Large corporations that have already invested heavily in Big Data have yet to see any return on investment (ROI). As the author points out:

Data led organizations have no idea how good their data is. CEOs have no idea where the data they get actually comes from, who is responsible for it etc. yet they make multi million pound decisions based on it. Big data is making the situation worse not better.

It looks like Big Data and Cloud Computing, following 3D printing, are going to end up as just more fizzled-out tech buzzwords.

Vishal Ingole, January 18, 2017

The Software Behind the Web Sites

January 17, 2017

Have you ever visited an awesome Web site or been curious how an organization manages their Web presence?  While we know the answer is some type of software, we usually are not given a specific name.  Venture Beat reports that it is possible to figure out the software in the article, “SimilarTech’s Profiler Tells You All Of The Technologies That Web Companies Are Using.”

SimilarTech is a tool designed to crawl the Internet to analyze what technologies, including software, Web site operators use.  SimilarTech is also used to detect which online payment tools are the most popular.  It does not come as a surprise that PayPal is the most widely used, with PayPal Subscribe and Alipay in second and third places.
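As a rough illustration of how a technology profiler could work in principle, a crawler can look for telltale strings in a page's HTML. This is only a sketch: the article does not describe SimilarTech's actual method, and the signature table, the function name `detect_technologies`, and the sample page below are all hypothetical.

```python
# Hypothetical signature table: substrings that commonly show up in the HTML
# of pages using each technology. Illustrative only and far from exhaustive.
SIGNATURES = {
    "WordPress": ["wp-content/", "wp-includes/"],
    "Shopify": ["cdn.shopify.com"],
    "Google Analytics": ["google-analytics.com"],
    "PayPal": ["paypal.com", "paypalobjects.com"],
}


def detect_technologies(html):
    """Return a sorted list of technologies whose signatures appear in html."""
    lowered = html.lower()
    return sorted(
        name
        for name, markers in SIGNATURES.items()
        if any(marker in lowered for marker in markers)
    )


# Made-up page used only to exercise the function.
sample_page = """
<html><head>
<script src="https://www.google-analytics.com/analytics.js"></script>
<link rel="stylesheet" href="/wp-content/themes/example/style.css">
</head><body>
<form action="https://www.paypal.com/cgi-bin/webscr" method="post"></form>
</body></html>
"""

print(detect_technologies(sample_page))
# → ['Google Analytics', 'PayPal', 'WordPress']
```

Simple substring matching like this produces false positives (a page that merely links to PayPal, for example), so a real profiler would presumably need more evidence per technology than a single string.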

Tracking what technology and software companies utilize for the Web is a boon for salespeople, recruiters, and business development professionals who want a competitive edge:

Overall, SimilarTech provides big data insights about technology adoption and usage analytics for the entire internet, providing access to data that simply wasn’t available before. The insights are used by marketing and sales professionals for website profiling, lead generation, competitive analysis, and business intelligence.

SimilarTech can also locate contact information for personnel responsible for Web operations, in other words, new potential clients.

This tool is kind of like the mailing houses of the past. Mailing houses have data about people, places, organizations, etc. and can generate contact information lists of specific clientele for companies.  SimilarTech offers the contact information, but it does one better by finding the technologies people use for Web site operation.

Whitney Grace, January 17, 2017

The Disconnect: Big Data and Business Strategy

January 9, 2017

Imagine that: Big Data may not have a direct impact on business strategy.

I read “Why Big Data and Algorithms Won’t Improve Business Strategy.” I learned that Big Data learns by playing algorithmic chess. The “moves” can be converted to patterns. The problem is that no one knows what the game is.

The write up points out:

White’s control panel is just a shadow of the landscape and the sequence of presses lacks any positional information or consistent understanding of movement on the board. When faced with a player who does understand the environment then no amount of large scale data analysis on combinations of sequences of presses through the control panel or application of artificial intelligence or algorithms that is going to help you.

The idea is that a disconnect occurs.

Data does not equal strategy for the game of “real” chess.

The write up includes an analysis of a famous battle. An accurate map may be more useful than an MBA analysis of a situationally ignorant analysis. Okay, I understand.

The write up points out:

In the game of Chess above, yes you can use large scale data analytics, AI and algorithms to discover new patterns in the sequences of presses and certainly this will help you against equally blind competitors. Such techniques will also help you in business improve your supply chain or understand user behavior or marketing or loyalty programs or operational performance or any number of areas in which we have some understanding of the environment.

The author adds:

But this won’t help you in strategy against the player with better situational awareness. Most business strategy itself operates in a near vacuum of situational awareness. For the vast majority then I’ve yet to see any real evidence to suggest that big data is going to improve this. There are a few and rare exceptions but in general, the key is first to understand the landscape and that a landscape exists.

The write up leaves me with an opportunity to hire the author. What’s clear is that content marketing and business strategy do connect. That’s reassuring. No analysis needed. No map either.

Stephen E Arnold, January 9, 2017

An Apologia for People. Big Data Are Just Peachy Keen

December 25, 2016

I read “Don’t Blame Big Data for Pollsters’ Failings.” The news about the polls predicting a victory for Hillary Clinton reached me in Harrod’s Creek five days after the election. Hey, Beyond Search is in rural Kentucky. It looks from the news reports and the New York Times’s odd letter about doing “real” journalism that the pundits predicted that the mare would win the US derby.

The write up explains that Big Data did not fail. The reason? The pollsters were not using Big Data. The sample sizes were about 1,000 people. Check your statistics book. In the back will be sample sizes for populations. If you have an older statistics book, you have to use a formula.
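For readers without a statistics book handy, here is a minimal sketch of the standard textbook calculation: Cochran's sample size formula at a 95 percent confidence level, with an optional finite population correction. It is an assumption that this is the kind of formula meant here, and the function names are made up for illustration.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Margin of error for a simple random sample of size n.

    p is the assumed proportion (0.5 is the worst case) and z is the
    z-score for the confidence level (1.96 for roughly 95 percent).
    """
    return z * math.sqrt(p * (1 - p) / n)


def required_sample_size(margin, p=0.5, z=1.96, population=None):
    """Cochran's formula: sample size needed for a given margin of error.

    If population is given, apply the finite population correction.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


print(round(margin_of_error(1000), 3))  # → 0.031
print(required_sample_size(0.03))       # → 1068
```

In other words, a poll of about 1,000 people carries a margin of error of roughly plus or minus 3 percentage points, and only if the sample is truly random. The formula says nothing about who ends up in the sample.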


Big Data doesn’t fool around with formulas. Big Data just uses “big data.” Is the idea that the bigger the data, the better the output?

The write up states that the problem was the sample itself: The actual humans.

The write up quotes a mid tier consultant from an outfit called Ovum which reminds me of eggs. I circled this statement:

“When you have data sets that are large enough, you can find signals for just about anything,” says Tony Baer, a big data analyst at Ovum. “So this places a premium on identifying the right data sets and asking the right questions, and relentlessly testing out your hypothesis with test cases extending to more or different data sets.”

The write up tosses in social media. Facebook takes the position that its information had minimal effect on the election. Nifty assertion that.

The solution is, as I understand the write up, to use a more real time system, different types of data, and math. The conclusion is:

With significant economic consequences attached to political outcomes, it is clear that those companies with sufficient depth of real-time behavioral data will likely increase in value.

My view is that hope and other distinctly human behaviors certainly threw an egg at reality. It is great to know that there is a fix and that Big Data emerge as the path forward. More work ahead for the consultants who often determine sample sizes by looking at Web sites like SurveySystem and get their sample from lists of contributors, a 20 something’s mobile phone contact list, or lists available from friends.

If you use Big Data, tap into real time streams of information, and do the social media mining—you will be able to predict the future. Sounds logical? Now about that next Kentucky Derby winner? Happy or unhappy holiday?

Stephen E Arnold, December 25, 2016

Big Data Needs to Go Public

December 16, 2016

Big Data touches every part of our lives, often without our being aware of it.  Have you ever noticed when you listen to the news, read an article, or watch a YouTube video that people say things such as “experts claim,” “science says,” etc.?  In the past, these statements relied on less than trustworthy sources, but now they can use Big Data to back up their claims.  However, popular opinion and puff pieces still need to back up their big data with hard fact.  The article “More Accountability For Big-Data Algorithms” says that transparency is a big deal for Big Data and that algorithm designers need to work on it.

One of the hopes is that big data will be used to bridge the divide between one bias and another, except that the opposite can happen.  In other words, Big Data algorithms can be designed with a bias:

There are many sources of bias in algorithms. One is the hard-coding of rules and use of data sets that already reflect common societal spin. Put bias in and get bias out. Spurious or dubious correlations are another pitfall. A widely cited example is the way in which hiring algorithms can give a person with a longer commute time a negative score, because data suggest that long commutes correlate with high staff turnover.

Even worse is that people and organizations can design an algorithm to support science or facts they want to pass off as the truth.  There is a growing demand for “algorithm accountability,” mostly in academia.  The demands are that data sets fed into the algorithms be made public.  There are also plans to make algorithms that monitor algorithms for bias.

Big Data is here to stay, but relying too much on algorithms can distort the facts.  This is why the human element is still needed to distinguish between fact and fiction.  Minority Report is closer to being our present than ever before.

Whitney Grace, December 16, 2016

Social Media Surveillance Now a Booming Business

December 5, 2016

Many know that law enforcement often turns to social media for clues, but you may not be aware how far such efforts have gotten. LittleSis, a group that maps and publishes relationships between the world’s most powerful entities, shares what it has learned about the field of social-media spying in, “You Are Being Followed: The Business of Social Media Surveillance.”

LittleSis worked with MuckRock, a platform that shares a trove of original government documents online. The team identified eight companies now vending social-media-surveillance software to law enforcement agencies across the nation; see the article for the list, complete with links to more information on each company. Writer Aaron Cantú describes the project:

We not only dug into the corporate profiles of some of the companies police contract to snoop on your Tweets and Facebook rants, we also filed freedom of information requests to twenty police departments across the country to find out how, when, and why they monitor social media. …

One particularly well-connected firm that we believe is worth highlighting here is ZeroFOX, which actively monitored prominent Black Lives Matter protesters in Baltimore and labeled some of them, including former Baltimore mayoral candidate DeRay McKesson, ‘threat actors.’ The company reached out to Baltimore officials first, offering it services pro-bono, which ZeroFOX executives painted as a selfless gesture of civic responsibility. But city officials may have been especially receptive to ZeroFOX’s pitch because of the powerful names standing behind it.

Behind ZeroFOX are weighty names indeed, like Mike McConnell, former director of the NSA, and Robert Rodriguez, who is tied to Homeland Security, the Secret Service, and a prominent security firm. Another company worth highlighting is Geofeedia, because its name appears in all the police-department records the project has received so far. The article details how each of these departments has worked with that company, from purchase orders to contract specifications. According to its CEO, Geofeedia grew sevenfold in just the last two years.

Before closing with a call for readers to join the investigation through MuckRock, Cantú makes this key observation:

Because social media incites within us a compulsion to share our thoughts, even potentially illegal ones, law enforcement sees it as a tool to preempt behavior that appears threatening to the status quo. We caught a glimpse of where this road could take us in Michigan, where the local news recently reported that a man calling for civil unrest on Facebook because of the Flint water crisis was nearly the target of a criminal investigation. At its worst, social media monitoring could create classes of ‘pre-criminals’ apprehended before they commit crimes if police and prosecutors are able to argue that social media postings forecast intent. This is the predictive business model to which Geofeedia CEO Phil Harris aspires. [The link goes to a 23-minute interview with Harris at YouTube.]

“Postings forecast intent,” because no one ever says anything online they don’t really mean, right? There is a reason the pre-crime-arrest concept is fodder for tales of dystopian futures. Where do details like civilian oversight and the protection of civil rights come in?

Cynthia Murrell, December 5, 2016

Big Data on Crime

December 5, 2016

An analytics company that collects crime-related data from local law enforcement agencies plans to help reduce crime rates by using Big Data. CrimeReports, in its FAQs, says:

The data on CrimeReports is sent on an hourly, daily, or weekly basis from more than 1000 participating agencies to the CrimeReports map. Each agency controls their data flow to CrimeReports, including how often they send data, which incidents are included.

Very little is known about the service provider. A WhoIs lookup indicates that although the domain was registered back in 1999, it was updated a few days ago, on November 25, 2016, and is valid until November 2, 2017.

CrimeReports is linked to local law enforcement agencies that selectively share their crime data with the analytics firm. After some number crunching, the service provider then sends the data to its subscribers via email. According to the firm:

Although no formal, third-party study has been commissioned, there is anecdotal evidence to suggest that public-facing crime mapping—by keeping citizens informed about crime in their area—helps them be more vigilant and implement crime prevention efforts in their homes, workplaces, and communities. In addition, there is anecdotal evidence to suggest that public-facing crime mapping fosters more trust in local law enforcement by members of the community.

To maintain data integrity, the data is collected only through official channels. The crime details are not comprehensive; rather, they are redacted to protect victims’ and criminals’ privacy. As of now, CrimeReports gets paid by law enforcement agencies. This certainly appears to be something new that has probably not been tried before.

Vishal Ingole, December 5, 2016

Computational Limits: Just a Reminder to the Cheerleaders for Big Data and Analytics

December 1, 2016

“Let’s index everything” or “Let’s process all the digital data”. Ever hear these statements or something similar? I have. In fact, I hear this type of misinformed blather almost every day. I read “Big Data Coming in Faster Than Biomedical Researchers Can Process It.” The write up seems to have figured out that those yapping about capture and crunch are spitting out partial truths. (What’s new in the trendy world of fake news?)

The write up points out in a somewhat surprised way:

“It’s not just that any one data repository is growing exponentially, the number of data repositories is growing exponentially,” said Dr. Atul Butte, who leads the Institute for Computational Health Sciences at the University of California, San Francisco.

Now the kicker:

Prospecting for hints about health and disease isn’t going to be easy. The raw data aren’t very robust and reliable. Electronic medical records are often kept in databases that aren’t compatible with one another, at least without a struggle. Some of the potentially revealing details are also kept as free-form notes, which can be hard to extract and interpret. Errors commonly creep into these records. And data culled from scientific studies aren’t entirely trustworthy, either.

Net net: Lots of data. Inadequate resources. Inability to filter for relevance. Failure to hook “data” to actual humans. The yap about curing cancer or whatever disease generates a news release indicates an opportunity. But there’s no easy solution.

The resources to “make sense” of large quantities of historical and real time data are not available. But marketing is easy. Dealing with real world data is a bit more difficult. Keep that in mind if you develop a nifty disease and expect Big Data and analytics to keep the cookies from burning. Sure the “data” about making a blue ribbon batch of chocolate chips is available. Putting the right information into a context at the appropriate time is a bit more difficult even for the cognitive, smart software, text analytics cheerleaders.

Wait. I have a better idea. Why not just let a search system find and discover exactly what you need? Let me know how that works out for you.

Stephen E Arnold, December 1, 2016

Emphasize Data Suitability over Data Quantity

November 30, 2016

It seems obvious to us, but apparently, some folks need a reminder. Harvard Business Review proclaims, “You Don’t Need Big Data, You Need the Right Data.” Perhaps that distinction has gotten lost in the Big Data hype. Writer Maxwell Wessel points to Uber as an example. Though the company does collect a lot of data, the key is in which data it collects, and which it does not. Wessel explains:

In an era before we could summon a vehicle with the push of a button on our smartphones, humans required a thing called taxis. Taxis, while largely unconnected to the internet or any form of formal computer infrastructure, were actually the big data players in rider identification. Why? The taxi system required a network of eyeballs moving around the city scanning for human-shaped figures with their arms outstretched. While it wasn’t Intel and Hewlett-Packard infrastructure crunching the data, the amount of information processed to get the job done was massive. The fact that the computation happened inside of human brains doesn’t change the quantity of data captured and analyzed. Uber’s elegant solution was to stop running a biological anomaly detection algorithm on visual data — and just ask for the right data to get the job done. Who in the city needs a ride and where are they? That critical piece of information let the likes of Uber, Lyft, and Didi Chuxing revolutionize an industry.

In order for businesses to decide which data is worth their attention, the article suggests three guiding questions: “What decisions drive waste in your business?” “Which decisions could you automate to reduce waste?” (Example—Amazon’s pricing algorithms) and “What data would you need to do so?” (Example—Uber requires data on potential riders’ locations to efficiently send out drivers.) See the article for more notes on each of these guidelines.

Cynthia Murrell, November 30, 2016
Sponsored by, publisher of the CyberOSINT monograph

Do Not Forget to Show Your Work

November 24, 2016

Showing work is a messy but necessary step to prove how one arrived at a solution.  Most of the time it is never reviewed, but with big data people wonder how computer algorithms arrive at their conclusions.  Engadget explains that computers are being forced to prove their results in, “MIT Makes Neural Networks Show Their Work.”

Understanding neural networks is extremely difficult, but MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) has developed a way to map the complex systems.  CSAIL figured the task out by splitting networks into two smaller modules.  The first module extracts text segments and scores them according to their length and coherence, and the second module predicts each segment’s subject and attempts to classify it.  The mapping modules sound almost as complex as the actual neural networks.  To alleviate the stress and add a giggle to their research, CSAIL had the modules analyze beer reviews:

For their test, the team used online reviews from a beer rating website and had their network attempt to rank beers on a 5-star scale based on the brew’s aroma, palate, and appearance, using the site’s written reviews. After training the system, the CSAIL team found that their neural network rated beers based on aroma and appearance the same way that humans did 95 and 96 percent of the time, respectively. On the more subjective field of “palate,” the network agreed with people 80 percent of the time.

One set of data is as good as another to test CSAIL’s network mapping tool.  CSAIL hopes to fine tune the machine learning project and use it in breast cancer research to analyze pathologist data.

Whitney Grace, November 24, 2016
