Data Are a Problem? And the Solution Is?
January 8, 2020
I attended a conference about managing data last year. I sat in six sessions and listened as enthusiastic people explained that in order to tap the value of data, one has to have a process. Okay? A process is good.
Then in each of the sessions, the speakers explained the problem and outlined that knowing about the data and then putting it in a system is the way to derive value.
Neither Pros Nor Cons: Just Consulting Talk
This morning I read an article called “The Pros and Cons of Data Integration Architectures.” The write up concludes with this statement:
Much of the data owned and stored by businesses and government departments alike is constrained by the silos it’s stuck in, many of which have been built over the years as organizations grow. When you consider the consolidation of both legacy and new IT systems, the number of these data silos only increases. What’s more, the impact of this is significant. It has been widely reported that up to 80 per cent of a data scientist’s time is spent on collecting, labeling, cleaning and organizing data in order to get it into a usable form for analysis.
Now this is mostly true. However, the 80 percent figure is not backed up. An IDG expert whipped up some percentages about data and time, and these, I suspect, have become part of the received wisdom of those struggling with silos for decades. Most of a data scientist’s time is frittered away in meetings, struggling with budgets and other resources, and figuring out what data are “good” and what to do with the data identified by person or machine as “bad.”
The source of this statement is MarkLogic, a privately held company founded in 2001 and a magnet for $173 million from funding sources. That works out to an 18 years young start up if DarkCyber adopts a Silicon Valley T shirt.
A modern silo is made of metal and impervious to some pests and most types of weather.
One question the write up begs is, “After 18 years, why hasn’t the methodology of MarkLogic swept the checker board?” But the same question can be asked of other providers’ solutions, open source solutions, and the home grown solutions creaking in some government agencies in Europe and elsewhere.
Several reasons:
- The technical solution offered by MarkLogic-type companies can “work”; however, proprietary considerations linked with the issues inherent in “silos” have caused data management solutions to become consultantized; that is, process becomes the task, not delivering on the promise of data, either dark or sunlit.
- Customers realize that the cost of dealing with the secrecy, legal, and technical problems of disparate, digital plastic trash bags of bits cannot be justified. Like odd duck knickknacks one of my failed publishers shoved into his lumber room, ignoring data is often a good solution.
- Individuals tasked with organizing data begin with gusto and quickly morph into bureaucrats who treasure meetings with consultants and companies pitching magic software and expensive wizards able to make the code mostly work.
DarkCyber recognizes that with boundaries like budgets, timetables, and measurable objectives, federation can deliver some zip.
Silos: A Moment of Reflection
The article uses the word “silo” five times. That’s the same frequency of its use in the presentations to which I listened in mid December 2019.
So you want to break down this missile silo which is hardened and protected by autonomous weapons? That’s what happens when a data scientist pokes around a pharma company’s lab notebook for a high potential new drug.
Let’s pause a moment to consider what a silo is. A silo is a tower or a pit used to store corn, wheat, or some other grain. Dust in silos can be exciting. Tip: Don’t light a match in a silo on a dry, hot day in a state where farms still operate. A silo can also be a structure used to house a ballistic missile, but one has to be a child of the Cold War to appreciate this connotation.
As applied to data, it seems that a silo is a storage device containing data. Unlike a silo used to house maize or a nuclear capable missile, the data silo contains information of value. How much value? No one knows. Are the data in a digital silo explosive? Who knows? Maybe some people should not know? Who wants to flick a Bic and poke around?
Why do data silos exist? Several reasons:
First, many projects begin in an uncertain way. No one knows what data will be collected or if the data will be usable. Why try to make this little pile of data into a Mount Everest with no facts to justify the meeting, the planning, and the whole ball of wax? But once data begins to flow, it often accretes, collects, and may never be viewed by a human or a software agent. Who wants to be a data archeologist in a revenue focused company? Better come up with some artifacts the company can sell and pay for the brave explorer, the overhead, and the downstream consequences of poking into long forgotten repositories of the unknown. Observation: Silos are a natural consequence of doing anything with a computing device.
Second, collections of data come about because there are rules for access. This is different from the natural consequence of working with computers. Secrecy, legal requirements, government security requirements, and common sense operate to keep data away from indiscriminate processing. Pharmaceutical companies, industrialized online criminal operations, and developers of 5G technology actively create silos of data and deny that they exist so eager beavers looking to unlock value don’t know anything about these data. Observation: Silos are preserved for good and maybe not so good reasons. But if you don’t know there’s a silo under that green patch, you won’t bother those in the silo.
Third, data are idiosyncratic. This is a non technical way of saying that when a repository of information is discovered, those data usually require quite a bit of work to move from system A to system B. No one likes to talk about this task because it takes humans, often subject matter experts and lawyers. Other barriers are figuring out file types and creating systems, either from off the shelf software or from custom coded scripts, to convert the data in one “silo” to a format usable in another but larger data management system. Yep, this new system is a silo too. But let’s avoid the infinity and set of sets issues for now. Observation: Filtering, converting, cleaning, and moving are really expensive and difficult.
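To make the "system A to system B" chore concrete, here is a minimal Python sketch of the kind of custom coded script mentioned above. Everything in it is hypothetical and invented for illustration: the sample export, the column names, the field mapping, and the target format do not come from the article or from any vendor's product. It shows only the filtering, converting, and cleaning steps for one tiny file; a real silo adds the humans, the lawyers, and the file types nobody documented.

```python
import csv
import io
from datetime import datetime

# Hypothetical export from "system A": odd headers, US-style dates,
# stray whitespace, thousands separators, and a missing value.
SYSTEM_A_EXPORT = """Cust Name,DOB,Amt (USD)
 Ada Lovelace ,12/10/1815,"1,200.50"
Alan Turing,06/23/1912,
"""

# Hypothetical mapping from system A headers to system B field names.
FIELD_MAP = {
    "Cust Name": "customer_name",
    "DOB": "birth_date",
    "Amt (USD)": "amount_usd",
}


def convert_row(row):
    """Rename fields, trim whitespace, and normalize dates and amounts."""
    out = {FIELD_MAP[key]: (value or "").strip() for key, value in row.items()}
    # System A stores dates as MM/DD/YYYY; system B (assumed) wants ISO 8601.
    parsed = datetime.strptime(out["birth_date"], "%m/%d/%Y")
    out["birth_date"] = parsed.date().isoformat()
    # Remove thousands separators; an empty amount becomes None, not zero.
    amount = out["amount_usd"].replace(",", "")
    out["amount_usd"] = float(amount) if amount else None
    return out


def convert_export(text):
    """Convert a whole system A CSV export into system B style records."""
    reader = csv.DictReader(io.StringIO(text))
    return [convert_row(row) for row in reader]


if __name__ == "__main__":
    for record in convert_export(SYSTEM_A_EXPORT):
        print(record)
```

Note that the script fails loudly on a date it cannot parse or a header it does not recognize. That is a deliberate choice in this sketch: someone has to decide what "bad" data means, and that decision is the expensive part.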
Data Federation Innovation
But when DarkCyber compares an Amazon-centric solution with any of the decades old solutions to fruit salad data, the future may not include many of the data senior citizens.
Signs of geriatric decline can be observed in companies embracing IBM- or Oracle-type data management systems. The open source cheerleaders are discovering that the whoop-ti-doo for Hadoop has faded to a quiet resignation toward escalating complexity, decreasing performance, and increasing costs. Zippy sales engineers talk like MBAs who have a minor in computer science; that is, long on consultant talk and short on technical bang. It is, for example, easier to pontificate about smart software, graph databases, and semantics than to deal with the challenges obliquely grazed in the pros and cons article.
DarkCyber does not have a dog in the data fight or the data lake as the case may be. Net net: It seems that progress in consulting has been more rapid than innovation for converting siloed data into corn muffins. No wonder some established “start ups” and big data brands are browsing for walkers on Amazon’s shopping site.
Stephen E Arnold, January 8, 2020