Oh, Oh. Big Data Has Problems. Impossible.
August 21, 2015
A happy quack to the reader who alerted me to “5 Problems with Big Data.” How can this be? Big Data is the new black, the new enterprise search, the new information management opportunity.
The write up states:
But when data gets big, big problems can arise.
The article identifies five issues. Most of these strike me as trivial for MBAs and failed middle school teachers to resolve before lunch. The alleged problems are:
- Storage. Hey, hey. I thought storage and the management thereof were a no brainer. But I have heard rumors that finding useful items and moving them around may contribute to digital heart burn.
- Bias. What! Incredible. I heard an MBA say at a conference not long ago that with Big Data little issues get smoothed out. Imagine. Big Data works like an electric iron with a spritz feature.
- False positives. Yo, dude. Those are things one talks about in Statistics 101. So a method says Tom and Betty have Ebola. After a quick check up at the doc in the box, both seem to be suffering from bad pizza and a sleepless night caused by worrying about the mid term statistics test. So what if a financial model predicts that GOOG and GOOGL shares have no upward boundary? Hello, infinity.
- Complexity. Gasp. Layering SAP with SAS components within a SharePoint environment is complex. No way, José. This is century 21. We can crash a lander on an asteroid. We can handle a simple upgrade to an air traffic control system.
- Outputs which answer a question no one asked. Look, gentle reader, we have IBM Watson. That system can answer the question, “What sauce will tamarind enhance?” The answer which made perfect sense to me was barbeque sauce. Who worries if the question was a coded string intercepted from an anonymous post on a Dark Web forum?
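The false positive item in the list above can be made concrete with Statistics 101 arithmetic. What follows is a minimal sketch using Bayes' theorem; the prevalence, sensitivity, and specificity figures are hypothetical numbers picked for illustration, not values from the article or from any real test.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that a flagged case is a real case (Bayes' theorem)."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)


# Hypothetical inputs: a rare condition and a test that sounds accurate.
ppv = positive_predictive_value(
    prevalence=0.001,   # assumed: 1 person in 1,000 actually has the condition
    sensitivity=0.99,   # assumed: the test catches 99% of real cases
    specificity=0.95,   # assumed: the test clears 95% of healthy people
)
print(f"Share of positive results that are real: {ppv:.1%}")
```

Under these assumed numbers, roughly 2 percent of the people flagged actually have the condition. More data does not iron this out; a bigger pile simply produces more Toms and Bettys.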
Stepping back I have complete confidence in the confidence men and women pitching the Big Data thing. Five speed bumps presented as real, live problems. Big Data is the answer. Enterprise search vendors like Lucid Imagination and wizards like the IDC crowd which sold some of my work without my permission on Amazon (Dave Schubmehl, where are you?) know that Big Data will do the revenue trick.
Problems are just too darned negative. I want a happy face on that flawed, incomprehensible, irrelevant, and expensive report. This is the modern world, not a tout at the chariot races pitching Nero’s team.
Get real. We have no “problems.” We have opportunities.
Stephen E Arnold, August 21, 2015
Quote to Note: Confluent
August 20, 2015
I read “Meet Confluent, The Big-Data Startup That Has Silicon Valley Buzzing.” Confluent can keep “the data flowing at some of the biggest and most information-rich firms in Silicon Valley.” The company’s Web site is http://www.confluent.io/. The company uses Apache Kafka to deliver its value to customers.
Here’s the passage I noted:
Experts suggest Confluent’s revenue could approach $10 million next year and pass $50 million in 2017. The company could echo the recent success of another open-source darling, Docker, which has turned record adoption of its computing tools called “containers” into a growing enterprise suite and a $1 billion valuation. Confluent is likely worth about one-sixth that today but not for long. “Every person we hire uncovers millions of dollars in sales,” says early investor Eric Vishria of Benchmark. “There’s real potential [for Confluent] to be an enterprise phenomenon.”
I noted the congruence of Docker and Confluent. I enjoyed the word “every”. Categorical affirmatives are thrilling. I also liked “phenomenon.” The article’s omission of a reference to Palantir surprised me.
Nevertheless, I have a question: “Has another baby unicorn been birthed?” According to Crunchbase, the company has raised more than $50 million. With 17 full time employees, Confluent may be hiring. Perhaps some lucid engineers will see the light?
Stephen E Arnold, August 20, 2015
Data Lake Alert: Tepid Water, High Concentration of Agricultural Runoff
August 13, 2015
Call me skeptical. Okay, call me a person who is fed up with silly jargon. You know what a database is, right? You know what a data warehouse is, well, sort of, maybe? Do you know what a data lake is? I don’t.
A lake, according to the search engine du jour Giburu:
An area prototypically filled with water, also of variable size.
A data lake, therefore, is an area filled with zeros and ones, also of variable size. How does a data lake differ from a database or a data warehouse?
According to the write up “Sink or Swim – Why your Organization Needs a Data Lake”:
A Data Lake is a storage repository that holds a vast amount of raw data in its native format for processing later by the business.
The magic in this unnecessary jargon is, in my opinion, a quest (perhaps Quixotic?) for sales leads. The write up points out that a data lake is available. A data lake is accessible. A data lake is, wait for it, Hadoop.
What happens if the water is neither clear nor pristine? One cannot unleash the hounds of the EPA to resolve the problem of data which may not be very good until validated, normalized, and subjected to the ho hum tests which some folks want to have me believe may be irrelevant steps in the land of a marketer’s data lakes.
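For readers who wonder what the ho hum tests look like, here is a minimal sketch of checking raw records before anyone drinks from the lake. The field names (customer_id, amount, date) and the sample records are hypothetical, and a real pipeline would need far more checks than these three.

```python
from datetime import datetime


def validate(record):
    """Return a list of problems found in one raw record (empty means usable)."""
    problems = []
    customer_id = record.get("customer_id")
    if customer_id is None or not str(customer_id).strip():
        problems.append("missing customer_id")
    try:
        float(record.get("amount"))
    except (TypeError, ValueError):
        problems.append("amount is not a number")
    try:
        datetime.strptime(str(record.get("date")), "%Y-%m-%d")
    except ValueError:
        problems.append("date is not YYYY-MM-DD")
    return problems


# Hypothetical raw records, "in native format" as the jargon has it.
raw = [
    {"customer_id": "C1", "amount": "19.99", "date": "2015-08-13"},
    {"customer_id": "", "amount": "n/a", "date": "13/08/2015"},
]
clean = [r for r in raw if not validate(r)]
rejected = [(r, validate(r)) for r in raw if validate(r)]
```

The point of the sketch is the order of operations: the checks run before the analysis, and the rejected pile is kept with its reasons so someone can decide whether the water is merely cloudy or actually toxic.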
My admonition, “Don’t drink the water until you know it won’t make life uncomfortable—or worse. Think fatal.”
Stephen E Arnold, August 13, 2015
Coauthoring Documents in SharePoint to Save Time
August 4, 2015
SharePoint users are often looking for ways to save time and streamline the process of integration from other programs. Business Management Daily has devoted some attention to the topic with their article, “Co-authoring Documents in SharePoint and Office.” Read on for the full details of how to make the most of this feature.
The article begins:
“One of the best features of SharePoint 2010 and 2013 is the way it permits co-authoring. Co-authoring means more than one person is in a document, workbook or presentation at the same time editing different parts. It works differently in Word, Excel and PowerPoint . . . With Word 2013/SharePoint 2013, co-authors may edit either in Word Online (Word Web App) or the desktop version.”
SharePoint is a powerful but complicated solution that requires quite a bit of energy to maintain and use to the best of its ability. For those users and managers that are tasked with daily work in SharePoint, staying in touch with the latest tips and tricks is vital. Those users may benefit from Stephen E. Arnold’s Web site, ArnoldIT.com. A longtime leader in search, Arnold brings the latest SharePoint news together in one easy to digest news feed.
Emily Rae Aldridge, August 4, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Data Science, Senior Managers, and the Ever Interesting Notion of Truth
August 3, 2015
I read “Data Scientists to CEOs: You Can’t Handle the Truth.” I enjoy write ups about data science which start off with the notion of truth. I know that the “truth” referenced is the outputs of analytics systems.
Call me skeptical. If the underlying data are not normalized, validated, and timely, the likelihood of truth becomes even murkier than it was in my college philosophy class. Roger Ailes allegedly said:
Truth is whatever people will believe.
Toss in the criticism of a senior manager who in the US is probably a lawyer or an accountant, and you have a foul brew. Why would a manager charged with hitting quarterly targets or generating enough money to meet payroll quiver with excitement when a data scientist presents “truth”?
There is that pesky perception thing. There are frames of reference. There are subjective factors in play. Think of the dentist who killed Cecil. I am not sure data science will solve his business and personal challenges. Do you?
The write up is a silly fan rant for the fuzzy discipline of data science. Data science does not pivot on good old statisticians with their love of SAS and SPSS, fancy math, and 17th century notions of what constitutes a valid data set. Nope.
The data scientist has to communicate the known unknowns to his or her CEO. Shades of Rumsfeld. Does today’s CEO want to know more about the uncertainty in the business? The answer is, “Maybe.” But senior managers often get information that is filtered, shaped, and presented to create an illusion. Shattering those illusions can have some negative career consequences even for data scientists, assuming there is such a discipline as data science.
Evoking the truth from statistical processes which are output from a system configured by others can be interesting. Those threshold settings are not theoretical. Those settings determine what the outputs are and what they are “about.”
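A minimal sketch of why the threshold point matters: the same scores, run through two different cutoffs, yield two different versions of the “truth.” The scores, the item names, and both threshold values below are hypothetical and come from no real system.

```python
def flag(scores, threshold):
    """Return the names whose score meets or exceeds the threshold."""
    return sorted(name for name, score in scores.items() if score >= threshold)


# Hypothetical model outputs for four items.
scores = {"item_a": 0.91, "item_b": 0.62, "item_c": 0.58, "item_d": 0.40}

strict = flag(scores, threshold=0.90)  # one configuration choice
loose = flag(scores, threshold=0.50)   # another, equally defensible choice

print("strict:", strict)
print("loose:", loose)
```

With the strict setting only item_a is flagged; with the loose setting item_a, item_b, and item_c are. Nothing about the data changed. Whoever set the number decided what the report would be “about.”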
Connecting an automated output to something that the data scientist asserts should be changed strikes me as somewhat parental. How does that work on a manager like Dick Cheney? How does that work on the manager of a volunteer committee working on a parent teacher luncheon?
I thought the Jack Benny program from the 1930s to 1960s was amusing. Some of the output about data science suggests that comedy may be a more welcoming profession than management based on truth from data science. Truth and statistics. Amazing comedy.
Stephen E Arnold, August 3, 2015
Big Data Lake: Are the Data Safe to Consume?
August 2, 2015
I read “The Analytics Journey Leading to the Business Data Lake.” Data lake is one of the terms floating around (pun definitely intended!) to stimulate sales. If one has a great deal of water, one needs a place to put it. Even though water is dammed, piped, used, recycled, and dumped—storage is the key.
Enter EMC, a company which is in the business of helping those with water store it and make use of that substance.
The write up reflects effort. I assume there was a PowerPoint slide deck in the mix. There are some snazzy graphics. Here’s one that caught my eye:
Instead of enterprise search being the go-to enterprise software solution, EMC has slugged in the following umbrella terms:
- Information ecosystem
- Business intelligence (perhaps an oxymoron in light of this article)
- Advanced analytics (obviously because regular analytics just are not zippy enough)
- Knowledge layer (I remain puzzled about knowledge because I have a tough time defining it. In fact, I resigned from my for fee knowledge management column because I just don’t know what the heck “knowledge” means.)
- The unfathomable data lake (yep, pun intended). What’s wrong with the word “storage” or “database” by the way?
- Master data which is also baffling. Is there servant data too?
- Machine data. Again I have no clue what this means.
The chart scatters undefined and fuzzy buzzwords like a crazed Jethro Tull, a water soluble blend of Jethro Tull (inventor of the seed drill) and Jethro Tull (the commercially successful and eccentric rock band).
The write up is important because EMC has sucked in the jargon and assertions once associated with enterprise search and applied them to the dark and mysterious data lake.
I highlighted:
Our data lake is one logical data platform with multiple tiers of performance and storage levels to optimally serve various data needs based on Service Level Agreements (SLA). It will provide a vast amount of structured and unstructured data at the Hadoop and Greenplum layers to data scientists for advanced analytics innovation. The higher performance levels powered by Greenplum and in-memory caching databases will serve mission-critical and real-time analytics and application solutions. With more robust data governance and data quality management, we can ensure authoritative, high-quality data driving all of EMC business insights and analytics driven applications using data services from the lake.
Ah, the Mariana Trench of enterprise information: Governance. Like “knowledge” and “advanced analytics”, governance has euphony. I think of the water lapping against the shore of Lake Paseco.
So what? Several observations:
- This type of “suggest lots” marketing ended poorly for a number of companies who used this type of rhetoric when marketing search.
- The folks who swallow this bait are likely to find themselves in a most uncomfortable spot.
- The problems associated with making use of information to improve decision making by reducing risk are not going to be solved by crazy diagrams and unsupported assertions.
EMC has been able to return to revenue growth. But the company’s profit margin has flat lined.
I am not sure that increasing the buzzword density in marketing write ups will help angle the red lines to low earth orbit. With better margins, it is much easier to check out the topographic view and see where lakes meet land.
Stephen E Arnold, August 2, 2015
The Hadoop Spark Thing: Simple, Simple
July 30, 2015
I am fascinated with the cheerleading about open source software which makes Big Data as easy as driving a Fiat 500 through a car wash. (Make sure the wheels fit inside the automated pulley system, of course.)
Navigate to “The Big Big Data Question: Hadoop or Spark?” Be prepared to read about two—count ‘em—two systems working as smoothly as the engine in a technical high school’s auto repair class’ project car.
I want to highlight two statements in the write up.
The first is:
As I [a Big Data practitioner] mentioned, Spark does not include its own system for organizing files in a distributed way (the file system) so it requires one provided by a third-party. For this reason many Big Data projects involve installing Spark on top of Hadoop, where Spark’s advanced analytics applications can make use of data stored using the Hadoop Distributed File System (HDFS).
In short, Spark is what I call a wrapper. One uses it like a taco shell to keep the good in position for real time munching.
The second is this comment:
The open source principle is a great thing, in many ways, and one of them is how it enables seemingly similar products to exist alongside each other – vendors can sell both (or rather, provide installation and support services for both, based on what their customers actually need in order to extract maximum value from their data).
What the write up omits is that there are some other bits and pieces needed; for example, how does one locate a particular string amidst the Big Data?
The point, for me, is that these nested and layered systems are truly exciting to troubleshoot. Not only are there issues with the integrity of the data, there is the thrill of getting each subsystem to work and then figuring out how to get useful outputs from the digital equivalent of a Roy’s Place Lassie’s Double Revenge sandwich before it closed its doors in 2013.
A Lassie’s Double Revenge consisted of a knockwurst, cheese, grilled onions, baked beans, and assorted seasonings served to the discerning diner.
A little like an open source Big Data mash up.
As a bonus, one gets to hire consultants who can make separate products, systems, and solutions work in a way which benefits the licensee and the system’s users.
Stephen E Arnold, July 30, 2015
PowerPoint Enabled Big Data Presenters Rejoice
July 27, 2015
Navigate to “A Plethora of Big Data Infographics.” Note that the original write up misspells “plethora” as “pletora” but, as many in Big Data say, “it is close enough for horseshoes.”
I quit browsing after a baker’s dozen of these puppies. If you want to be an expert in Big Data, these charts will do the trick. I would steer clear of a person with a PhD in statistics, however.
Stephen E Arnold, July 27, 2015
Forbes and Some Big Data Forecasts
July 26, 2015
Short honk: For fee, mid tier consultants have had their thunder stolen. Forbes, the capitalist tool, wants to make certain its readers know how juicy Big Data is as a market. Navigate to “Roundup Of Analytics, Big Data & Business Intelligence Forecasts And Market Estimates, 2015.”
The write up summarizes the eye watering examples of spreadsheet fever’s impact on otherwise semi-rational MBAs, senior managers, and used car sales professionals. IDC, without the inputs of Dave Schubmehl, comes up with a spectacular number: $125 billion in 2015.
Sounds good, right?
The data will find their way into innumerable PowerPoint presentations. Snag ‘em while you can.
Stephen E Arnold, July 26, 2015
Big Data: Slow Down, Think
July 25, 2015
I read “Contradictions of Big Data.” Few articles which I see take a common sense approach to Big Data baloney. (Azure chip consultants bristle at my use of baloney. Too bad.) I liked this article.
The article appeared in my Overflight a day ago even though the write up was posted in March 2015. Big Data does not mean rapid data.
I highlighted this passage:
I have been waging an uphill battle against the nonsensical and unsubstantiated idea that more data is better data, but now this view is getting some additional support, and from some surprising corners.
I do not agree. The yap about Big Data has almost overpowered the craziness of search engine optimization’s shouting about semantic search.
The write up points out:
Take it from me [Martyn Jones] , most businesses will not be basing their business strategies on the analysis of a glut of selfies, home videos of cute kittens, or the complete works of William Shakespeare or Dan Brown. Almost all business analysis will continue to be carried out on structured data obtained primarily from internal operational systems and external structured data providers.
The write up points out the silliness of velocity and several other slices of marketing baloney. (Make a sandwich, please.)
I found this paragraph insightful:
I have seen data scientists at work, and the word science doesn’t actually jump out and grab you. It’s difficult to make the connection, just as it is to accurately connect some popular science magazines with fundamental scientific research. If a professional and qualified statistician wants to label themselves a data scientist then I have no issue with that, it’s their problem, but I am not willing to lend credibility to the term ‘data scientist’ when it is merely an interesting job title, with at most a tenuous connection to the actual role, and one that is liberally applied, with the almost customary largesse of IT, to creative code hackers and business-averse dabblers in data.
Harsh words for those who combine an undergraduate degree minor in math with Twitter and come up with data scientist.
Hopefully others will pick up this practical approach to the sliced and processed meat wrapped in plastic and branded Big Data.
Stephen E Arnold, July 25, 2015