Big Data Should Mean Big Quality

September 2, 2014

Why does logic seem to fail in the face of fancy jargon? DataFusion’s Blog takes on the jargon fallacy in the post “It All Begins With Data Quality.” The post explains that with new terms like big data, real-time analytics, and self-service business intelligence in the air, the fundamentals that make this technology work are forgotten. Cleansing, data capture, and governance form the foundation for data quality. Without data quality, big data software is useless. According to a recent Aberdeen Group study, data quality was ranked as the most important data management function.

Data quality also leads to other benefits:

“When examining organizations that have invested in improving their data, Aberdeen’s research shows that data quality tools do in fact deliver quantifiable improvements. There is also an additional benefit: employees spend far less time searching for data and fixing errors. Data quality solutions provided an average improvement of 15% more records that were complete and 20% more records that were accurate and reliable. Furthermore, organizations without data quality tools reported twice the number of significant errors within their records; 22% of their records had these errors.”

Data quality tools save man hours, discover missed errors, and delete duplicate records. The Aberdeen Group’s study also revealed that poor data quality is a top concern. Organizations should deploy a data quality tool, so they too can take advantage of its many benefits. It is a logical choice.
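As a rough illustration of what such a tool measures, here is a minimal Python sketch of two of the checks mentioned above: record completeness and duplicate removal. The field names (`name`, `email`) and the rule that two records are duplicates when their e-mail addresses match are assumptions made for the example; they do not come from the Aberdeen study or from any particular product.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]

# Hypothetical required fields; a real tool would make these configurable.
REQUIRED_FIELDS = ("name", "email")


def is_complete(record: Record) -> bool:
    """A record is complete when every required field is present and non-blank."""
    return all((record.get(field) or "").strip() for field in REQUIRED_FIELDS)


def dedupe(records: List[Record]) -> List[Record]:
    """Keep the first record seen for each normalized e-mail address."""
    seen = set()
    unique = []
    for record in records:
        key = (record.get("email") or "").strip().lower()
        if key and key in seen:
            continue  # duplicate of an earlier record
        if key:
            seen.add(key)
        unique.append(record)
    return unique


def completeness_rate(records: List[Record]) -> float:
    """Share of records that pass the completeness check."""
    if not records:
        return 0.0
    return sum(is_complete(r) for r in records) / len(records)
```

Running `dedupe` first and `completeness_rate` second gives the kind of before-and-after percentages the study quotes, though real products apply far more elaborate matching rules.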

Whitney Grace, September 02, 2014
Sponsored by, developer of Augmentext

Huff Po and a Search Vendor Debunk Big Data Myths

September 1, 2014

I suppose I am narrow minded. I don’t associate the Huffington Post with high technology analyses. My ignorance is understandable because I don’t read the Web site’s content.

However, a reader sent me a link to “Top Three Big Data Myths: Debunked”, authored by an employee of the search vendor Recommind. Now Recommind is hardly a household word. I spoke with a Recommind PR person about my perception that Recommind is a variant of the technology embodied in Autonomy IDOL. Yep, that company making headlines because of the minor dust up with Hewlett Packard. Recommind provides a probabilistic search system to customers that were originally involved in the legal market. The company has positioned its technology to other markets and added a touch of predictive magic as well. At its core, Recommind indexes content and makes the indexes available to users and other services. The company in 2010 formed a partnership with the Solcara search folks. Solcara is now the go-to search engine for Thomson Reuters. I have lost track of the other deals in which Recommind has engaged.

The write up reveals quite a bit about the need for search vendors to reach a broader market in order to gain visibility and make the cost of sales bearable. This write up is a good example of content marketing and the malleability of outfits like Huffington Post. The idea, it strikes me, is that anything that looks interesting may get a shot at building click traffic for Ms. Huffington’s properties.

So what does the article debunk? Fasten your seat belt and take your blood pressure medicine. The content of the write up may jolt you. Ready?

First, the article reveals that “all” data are not valuable. The way the write up expresses it takes this form, “Myth #1—All Data Is Valuable.” Set aside the subject verb agreement error. Data is the plural and datum is the singular. But in this remarkable content marketing essay, grammar is not my or the author’s concern. The notion of categorical propositions applied to data is interesting and raises many questions; for example, what data? So the first myth is that if one is able to gather “all data”, it therefore follows that some is not germane. My goodness, I had a heart palpitation with this revelation.

Second, the next myth is that “with Big Data the more information the better.” I must admit this puzzles me. I am troubled by the statistical methods used to filter smaller, yet statistically valid, subsets of data. Obviously the predictive Bayesian methods of Recommind can address this issue. The challenges Autonomy-like systems face are well known to some Autonomy licensees and, I assume, to the experts at Hewlett Packard. The point is that if the training information is off base by a smidge and the flow of content does not conform to the training set, the outputs are often off point. Now with “more information” the sampling purists point to sampling theory and the value of carefully crafted training sets. No problem on my end, but aren’t we emphasizing that certain non Bayesian methods are just not as wonderful as Recommind’s methods? I think so.

The third myth that the write up “debunks” is “Big Data opportunities come with no costs.” I think this is a convoluted way of saying, get ready to spend a lot of money to embrace Big Data. When I flip this debunking on its head, I get this hypothesis, “The Recommind method is less expensive than the Big Data methods that other hype artists are pitching as the best thing since sliced bread.”

The fix is “information governance.” I must admit that, as with knowledge management, I have zero idea what the phrase means. Invoking a trade association anchored in document scanning does not give me confidence that an explanation will illuminate the shadows.

Net net: The myths debunked just set up myths for systems based on aging technology. Does anyone notice? Doubt it.

Stephen E Arnold, September 1, 2014

Big Data Players Line Up

August 28, 2014

Technology moves fast. The race is always on to remain on top and relevant. Big data companies especially feel the push to develop new and improved products. Datamation makes a keen observation about big data competition in the article “30 Big Data Companies Leading The Way”:

“For Big Data companies, this is a critical period for competitive jockeying. These are the early days of Big Data, which means there are still a plethora of companies – a mix of new firms and old guard Silicon Valley firms – looking to stay current. Like everything else, the Big Data market will mature and consolidate. In five years, you can bet that many of the Big Data companies on this list will be gone – either out of business or merged/acquired with a larger player.”

Datamation continues the article with a list of companies that specialize in big data analytics. It stresses that the list is not to be used as a buyer’s guide, but more as a rundown of the various services each of the thirty companies offers and how they try to distinguish themselves in the market. Big names like Google, Microsoft, IBM, and SAP are among the first listed, while smaller companies are listed towards the bottom. Many of the smaller firms are ones that do not make the news often, but judging by their descriptions have comparable products.

Who will remain and who will be gone in the next five years?

Whitney Grace, August 28, 2014

Will Apps Built on IDOL Gin Cash?

August 23, 2014

I don’t know if the data in “Most Smartphone Users Download Zero Apps per Month” are accurate. According to the write up, the majority (65%) of smartphone users download zero apps per month. I suppose the encouraging point in the write up is that 35% of smartphone users download at least one per month. The Hewlett Packard IDOL app can be a slam dunk when HP unleashes IDOL enterprise apps. If HP converts just one percent of the 35 percent, millions will flow to the printer ink and personal computer company. At least, that’s one way to interpret the data the MBA way. Plug those numbers into Excel, fatten up the assumptions, and the money is in the virtual bank. At least that’s one way to leverage spreadsheet fever into a corporate initiative for Big Data IDOL enterprise apps.

Stephen E Arnold, August 23, 2014

Big Data: Oh, Oh, This Revolution Requires Grunt Work

August 18, 2014

I read “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights.” The write up from the newspaper that does not yet have hot links to the New York Times’ store has revealed that Big Data involves “janitor work.”

Interesting. I thought that Big Data was a silver bullet, a magic jinni, a miracle, etc. The write up reports that “far too much handcrafted work — what data scientists call ‘data wrangling,’ ‘data munging’ and ‘data janitor work’ — is still required.”

And who does the work? The MBAs? The venture capitalists? The failed Webmasters? The content management specialists? The faux consultants pitching information governance?


The work is done by data scientists.

The New York Times has learned:

Before a software algorithm can go looking for answers, the data must be cleaned up and converted into a unified form that the algorithm can understand.

Quite a surprise for the folks at the newspaper.

How much of a data scientist’s time goes to data clean up? The New York Times has learned:

Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

What’s this mean in terms of cost?

Put simply, Big Data is likely to cost more than the MBAs, the venture capitalists, faux information governance consultants, et al assumed.

No kidding.

So as the volume of Big Data expands ever larger, doesn’t this mean that the price tag for Big Data grows ever larger? I don’t want to follow this logic too far. Exponentiating costs and falling farther and farther behind the most recent data are likely to make the folks with those fancy, real time predictive models based on Big Data uncomfortable.

Don’t worry. Be happy. The Big Data did miss the Ebola issue, the caliphate, various financial problems, and a handful of trivial events.

Stephen E Arnold, August 18, 2014

Quote to Note: Big Data Getting Even Biggerest Super Fastly

August 11, 2014

I love quotes about Big Data. “Big” is relative. You have heard a doting parent ask a toddler, “How big are you?” The toddler puts up his or her arms and says, “So big.” Yep, big at a couple of years old and 30 inches tall.

“If You Think Big Data’s Big Now, Just Wait” contains a quote attributed to a Big Data company awash in millions in funding money. Here’s the item I flagged for my Quote to Note file:

“The promise of big data has ushered in an era of data intelligence. From machine data to human thought streams, we are now collecting more data each day, so much that 90% of the data in the world today has been created in the last two years alone. In fact, every day, we create 2.5 quintillion bytes of data — by some estimates that’s one new Google every four days, and the rate is only increasing…”

I like the 2.5 quintillion bytes of data.

I am confident that Helion, IBM’s brain chip, and Google’s sprawling system can make data manageable. Well, more correctly, fancy systems will give the appearance of making quintillions of whatevers yield actionable intelligence.

If you do the Samuel Taylor Coleridge thing and enter into a willing suspension of disbelief, Big Data is just another opportunity.

How do today’s mobile equipped MBAs make decisions? A Google search, ask someone, or guess? I suggest you consider how you make decisions. How often do you have an appetite for SPSS style number crunching or a desire to see what’s new from the folks at Moscow State University?

Yep, data intelligence for the tiny percentage of the one percent who paid attention in statistics class. This is a type of saucisson I enjoy so much. Will this information find its way into a Schubmehl-like report about a knowledge quotient? For sure I think.

Stephen E Arnold, August 11, 2014

Data Augmentation: Is a Step Missing or Mislocated?

August 6, 2014

I read “Data Warehouse Augmentation, Part 4.” There are other sections of the write up, but I want to focus on the diagrams in this fourth chapter/section.

IBM is working overtime to generate additional revenues. Some of the ideas are surprising; for example, positioning Vivisimo’s metasearch function as a Big Data solution or buying Cybertap and then making the quite valuable technology impossible to find unless one is an intelligence procurement official. Then there is Watson, and I am just not up to commenting on this natural language processing system.

To the matter at hand. There is basic information in this write up about specific technical components of a Big Data solution. The words, for the most part, will not surprise anyone who has looked at marketing collateral from any of the Big Data vendors/integrators.

What is fascinating about the write up is the wealth of diagrams in the document. I worked through the text and the diagrams and I noticed that one task is not identified as important; specifically, the conversion of source content into a file type or form that the content processing system can process.

Here’s an example. First the IBM diagram:


Source: IBM, Data Warehouse Augmentation, 2014.

Notice that after “staging”, there is a function described in time-honored database speak, “ETL.” Now “extract, transform, and load” is a very important step. But is there a step that precedes ETL?


How can one extract from disparate content if a connector is not available or the source system cannot support file transfers, direct access, or reports that reflect in memory data?

In my experience, there will be different methods of acquiring content to process. There are internal systems. If there is an ancient AS/400, some work is required to generate outputs that provide the data required. Because direct interaction with the outstanding memory system of the AS/400 is delicate, some care is needed to get the data and the updates not yet written to disc without corrupting the in memory information. We have addressed this “memory fragility” by using a standalone machine that accepts an output from the AS/400 and then disconnects. The indexing system then connects to the standalone machine to pick up the AS/400 outputs. Clunky? You bet. But there are some upsides. To learn about the excitement of direct interaction with AS/400, just do some real time data acquisition. Let me know how this works out for you.

The same type of care is often needed with the content assembled for the data warehouse pipeline. Let me illustrate this. Assume the data warehouse will obtain data from these sources: internal legacy systems, third party providers, custom crawls with the content residing on a hosted service, and direct data acquisition from mobile devices that feed information into a collection point parked at Amazon.

Now each of these content streams has different feathers in its war bonnet. Some of the data will be well formed XML. Some will be JSON. Some will be a proprietary format unique to the source. For each file type, there will be examples of content objects that are different, due to a vendor format change or a glitch.

These disparate content objects, therefore, have to be processed before extraction can occur. So has IBM put ETL in the wrong place in this diagram, or has IBM omitted the pre-processing (normalization) operation?
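To make the missing step concrete, here is a minimal Python sketch of a normalization pass that runs before any ETL tool sees the content. It assumes just two well-formed input types (JSON objects and flat XML documents) and treats everything else, including proprietary formats and malformed files, as an exception. The function names and the flat key/value output form are illustrative assumptions, not anything taken from the IBM write up.

```python
import json
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple


def normalize(payload: str, fmt: str) -> Dict[str, str]:
    """Convert one content object into a flat dict of string fields."""
    if fmt == "json":
        data = json.loads(payload)
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        return {str(key): str(value) for key, value in data.items()}
    if fmt == "xml":
        root = ET.fromstring(payload)
        # Only direct children of the root are read; nested XML is out of scope here.
        return {child.tag: (child.text or "").strip() for child in root}
    raise ValueError(f"no normalizer for format: {fmt}")


def preprocess(
    stream: List[Tuple[str, str]]
) -> Tuple[List[Dict[str, str]], List[Tuple[str, str, str]]]:
    """Split a stream of (payload, format) pairs into normalized records and exceptions."""
    normalized = []
    exceptions = []
    for payload, fmt in stream:
        try:
            normalized.append(normalize(payload, fmt))
        except (ValueError, ET.ParseError) as exc:
            # Nothing is silently skipped: every failure is kept with its reason.
            exceptions.append((payload, fmt, str(exc)))
    return normalized, exceptions
```

Only the `normalized` list is fit to hand to an extract step. The `exceptions` list is the “exceptions folder” in miniature, and its size relative to the input is worth watching.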

In our experience, content that cannot be processed is not available to the system. If big chunks of content end up in the exceptions folder, the resulting content processing may be flawed. One of the data points that must be checked is the number of content objects that can be normalized in a pre processing stream. We have encountered situations like these. Your mileage may vary:

  1. Entire streams of certain types of content are exceptions, so the resulting indexing does not contain the data. Example: outputs from certain intercept systems.
  2. Streams of content skip non processable content without writing exceptions to a file due to configuration or resource availability.
  3. Streams of content are automatically “capped” when the processing system cannot keep pace. When the system accepts more content, it does not pull information from a cache or storage pool. The system just ignores the information it was unable to process.

There are fixes for each of these situations. What we have learned is that this pre processing function can be very expensive, have an impact on the reliability of the outputs from the data warehousing system when queried, and generate a bottleneck that affects downstream processes.
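One inexpensive safeguard against all three situations is an accounting check on each stream: compare how many content objects arrived, how many were normalized, and how many exceptions were actually written. The sketch below is a hypothetical Python example; the `StreamCounts` structure and the 95 percent threshold are assumptions chosen for illustration, and a real system would pull these counts from its own logs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamCounts:
    name: str
    received: int           # content objects that arrived at the pre processing stage
    normalized: int         # objects successfully converted
    exceptions_logged: int  # objects written to the exceptions file


def audit(counts: StreamCounts, min_normalized_ratio: float = 0.95) -> List[str]:
    """Return a list of human-readable problems found for one stream."""
    problems = []
    unaccounted = counts.received - counts.normalized - counts.exceptions_logged
    if unaccounted > 0:
        # Objects vanished without a trace: skipped silently or dropped by a cap.
        problems.append(
            f"{counts.name}: {unaccounted} objects dropped without an exception record"
        )
    if counts.received and counts.normalized / counts.received < min_normalized_ratio:
        problems.append(
            f"{counts.name}: only {counts.normalized} of {counts.received} objects normalized"
        )
    return problems
```

A stream that is entirely exceptions trips the second check; silent skips and capped streams trip the first, because the numbers no longer add up.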

After decades of data warehousing refinement, why does this problem keep surfacing?

The answer is that recycling traditional thinking about content processing is much easier than figuring out what causes a complex system to derail itself. I think that may be part of the reason the IBM diagram may be misleading.

Pre-processing can be time consuming, hungry for machine resources, and very expensive to implement.

Stephen E Arnold, August 6, 2014

Big Data Boom Pushes Schools to Create Big Data Programs

July 29, 2014

Can education catch up to progress? Perhaps, especially when corporations take an interest. Fortune discusses “Educating the ‘Big Data’ Generation.” As companies try to move from simply collecting vast amounts of data to putting that information to use, they find a serious dearth of qualified workers in the field. In fact, Gartner predicted in 2012 that 4.4 million big-data IT jobs would be created globally by 2015 (1.9 million in the U.S.). Schools are now working to catch up with this demand, largely as the result of prodding from the big tech companies.

The field of big data collection and analysis presents a previously rare requirement—workers that understand both technology and business. Reporter Katherine Noyes cites MIT’s Erik Brynjolfsson, who will be teaching a course on big data this summer:

“‘We have more data than ever,’ Brynjolfsson said, ‘but understanding how to apply it to solve business problems needs creativity and also a special kind of person.’ Neither the ‘pure geeks’ nor the ‘pure suits’ have what it takes, he said. ‘We need people with a little bit of each.’”

Over at Arizona State, which boasts year-old master’s and bachelor’s programs in data analytics, Information Systems chair Michael Goul agrees:

“‘We came to the conclusion that students needed to understand the business angle,’ Goul said. ‘Describing the value of what you’ve discovered is just as key as discovering it.’”

In order to begin meeting this new need for business-minded geeks (or tech-minded business people), companies are helping schools develop programs to churn out that heretofore suspect hybrid. For example, Noyes writes:

“MIT’s big-data education programs have involved numerous partners in the technology industry, including IBM […], which began its involvement in big data education about four years ago. IBM revealed to Fortune that it plans to expand its academic partnership program by launching new academic programs and new curricula with more than twenty business schools and universities, to begin in the fall….

“Business analytics is now a nearly $16 billion business for the company, IBM says—which might be why it is interested in cultivating partnerships with more than 1,000 institutions of higher education to drive curricula focused on data-intensive careers.”

Whatever forms these programs, and these jobs, ultimately take, one thing is clear: for those willing and able to gain the skills, the field of big data is wide open. Anyone with a strong love of (and aptitude for) working with data should consider entering the field now, while competition for qualified workers is so very high.

Cynthia Murrell, July 29, 2014


From Search to Sentiment

July 28, 2014

Attivio has placed itself in the news again, this time for scoring a new patent. Virtual-Strategy Magazine declares, “Attivio Awarded Breakthrough Patent for Big Data Sentiment Analysis.” I’m not sure “breakthrough” is completely accurate, but that’s the language of press releases for you. Still, any advance can provide an advantage. The write-up explains that the company:

“… announced it was awarded U.S. Patent No. 8725494 for entity-level sentiment analysis. The patent addresses the market’s need to more accurately analyze, assign and understand customer sentiment within unstructured content where multiple brands and people are referenced and discussed. Most sentiment analysis today is conducted on a broad level to determine, for example, if a review is positive, negative or neutral. The entire entry or document is assigned sentiment uniformly, regardless of whether the feedback contains multiple comments that express a combination of brand and product sentiment.”

I can see how picking up on nuances can lead to a more accurate measurement of market sentiment, though it does seem more like an incremental step than a leap forward. Still, the patent is evidence of Attivio’s continued ascent. Founded in 2007 and headquartered in Massachusetts, Attivio maintains offices around the world. The company’s award-winning Active Intelligence Engine integrates structured and unstructured data, facilitating the translation of that data into useful business insights.

Cynthia Murrell, July 28, 2014


Is New Math Really New Yet?

July 21, 2014

I read “Scientific Data Has Become So Complex, We Have to Invent New Math to Deal With It.” My hunch is that this article will become Google spider food with a protein punch.

In my lectures for the police and intelligence community, I review research findings from journals and my work that reveal a little appreciated factoid; to wit: The majority of today’s content processing systems use a fairly narrow suite of numerical recipes that have been embraced for decades by vendors, scientists, mathematicians, and entrepreneurs. Due to computational constraints and limitations of even the slickest of today’s modern computers, processing certain data sets is a very difficult job, expensive in humans, programming, and machine time.

Thus, the similarity among systems comes from several factors.

  1. The familiar is preferred to the onerous task of finding a slick new way to compute k-means or perform one of the other go-to functions in information processing
  2. Systems have to deliver certain types of functions in order to make it easy for a procurement team or venture oriented investor to ask, “Does your system cluster?” Answer: Yes. Venture oriented investor responds, “Check.” The procedure accounts for the sameness of the feature lists between Palantir, Recorded Future, and similar systems. When the similarities make companies nervous, litigation results. Example: Palantir versus i2 Ltd. (now a unit of IBM).
  3. Alternative methods of addressing tasks in content processing exist, but they are tough to implement in today’s computing systems. The technical reason for the reluctance to use some fancy math from my uncle Vladimir Ivanovich Arnold’s mentor Andrey Kolmogorov is that in many applications the computing system cannot complete the computation. The buzzword for this is P=NP? Here’s MIT’s 2009 explanation
  4. Savvy researchers have to find a way to get from A to B that works within the constraints of time, confidence level required, and funding.
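For readers who have not met the go-to function named in the first item, here is a minimal Python sketch of the familiar way to compute k-means (Lloyd’s algorithm). It is a plain textbook version, not the implementation used by any vendor mentioned here, and it assumes the caller supplies the starting centroids so the result is repeatable.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: Sequence[Point]) -> Point:
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(
    points: Sequence[Point], centroids: Sequence[Point], max_iter: int = 100
) -> Tuple[List[Point], List[List[Point]]]:
    """Alternate between assigning points and moving centroids until nothing changes."""
    centroids = list(centroids)
    clusters: List[List[Point]] = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: squared_distance(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            mean(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters
```

Every pass touches every point and every centroid, which hints at why the method is cheap enough to be everywhere and why fancier alternatives struggle to displace it.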

The Wired article identifies other hurdles; for example, the need for constant updating. A system might be able to compute a solution using fancy math on a right sized data set. But toss in constantly updating information and the computing resources often just keep getting hungrier for more storage, bandwidth, and computational power. Then the bigger the data, the more the computing system has to shove that data around. As fast as an iPad or modern Dell notebook seems, the friction adds latency to a system. For some analyses, delays can have significant repercussions. Most Big Data systems are not the fleetest of foot.

The Wired article explains how fancy math folks cope with these challenges:

Vespignani uses a wide range of mathematical tools and techniques to make sense of his data, including text recognition. He sifts through millions of tweets looking for the most relevant words to whatever system he is trying to model. DeDeo adopted a similar approach for the Old Bailey archives project. His solution was to reduce his initial data set of 100,000 words by grouping them into 1,000 categories, using key words and their synonyms. “Now you’ve turned the trial into a point in a 1,000-dimensional space that tells you how much the trial is about friendship, or trust, or clothing,” he explained.
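The reduction described in the quote can be sketched in a few lines of Python. The toy lexicon below has three categories rather than 1,000, and both the word list and the choice to normalize counts into proportions are assumptions made for illustration; the article does not spell out how DeDeo built his categories or weighted them.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical toy lexicon mapping key words and synonyms to a category.
CATEGORY_LEXICON: Dict[str, str] = {
    "friend": "friendship",
    "companion": "friendship",
    "trust": "trust",
    "honest": "trust",
    "coat": "clothing",
    "hat": "clothing",
}
CATEGORIES: List[str] = ["friendship", "trust", "clothing"]


def to_category_vector(text: str) -> List[float]:
    """Turn a text into one point: the share of matched words in each category."""
    words = text.lower().split()
    counts = Counter(CATEGORY_LEXICON[w] for w in words if w in CATEGORY_LEXICON)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(CATEGORIES)
    return [counts[category] / total for category in CATEGORIES]
```

With 1,000 categories instead of three, each trial becomes a point in a 1,000-dimensional space, which is a far smaller object to compute on than the original vocabulary.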

Wired labels this approach as “piecemeal.”

The fix? Wired reports:

the big data equivalent of a Newtonian revolution, on par with the 17th century invention of calculus, which he [Yalie mathematician Ronald Coifman] believes is already underway.

Topological analyses and sparsity may offer a path forward.

The kicker in the Wired story is the use of the phrase “tractable computational techniques.” The notion of “new math” is an appealing one.

For the near future, the focus will be on optimization of methods that can be computed on today’s gizmos. One widely used method in Autonomy, Recommind, and many other systems originates with the Reverend Thomas Bayes, who died in 1761. My relative died in 2010. I understand there were some promising methods developed after Kolmogorov died in 1987.

Inventing new math is underway. The question is, “When will computing systems become available to use these methods without severe sampling limitations?” In the meantime, Big Data keep on rolling in, possibly mis-analyzed and contributing to decisions with unacceptable levels of risk.

Stephen E Arnold, July 21, 2014
