A Xoogler Explains Why Big Data Is Going Nowhere Fast
March 3, 2023
The essay “Big Data Is Dead.” One of my essays from the Stone Age of Online used the title “Search Is Dead” so I am familiar with the trope. In a few words, one can surprise. Dead. Final. Absolute, well, maybe. On the other hand, the subject either Big Data or Search are part of the woodwork in the mini-camper of life.
I found this statement interesting:
Modern cloud data platforms all separate storage and compute, which means that customers are not tied to a single form factor. This, more than scale out, is likely the single most important change in data architectures in the last 20 years.
The cloud is the future. I recall seeing price analyses of some companies’ cloud activities; for example, “The Cloud vs. On-Premise Cost: Which One is Cheaper?” In my experience, cloud computing was pitched as better, faster, and cheaper. Toss in the idea that one can get rid of pesky full time systems personnel, and the cloud is a win.
What the cloud means is exactly what the quoted sentence says, “customers are not tied to a single form factor.” Does this mean that the Big Data rah rah combined with the sales pitch for moving to the cloud will set the stage for more hybrid sets up a return to on premises computing. Storage could become a combination of on premises and cloud base solutions. The driver, in my opinion, will be cost. And one thing the essay about Big Data does not dwell on is the importance of cost in the present economic environment.
The arguments for small data or subsets of Big Data is accurate. My reading of the essay is that some data will become a problem: Privacy, security, legal, political, whatever. The essay is an explanation for what “synthetic data.” Google and others want to make statistically-valid, fake data the gold standard for certain types of processes. In the data are a liability section of the essay, I noted:
Data can suffer from the same type of problem; that is, people forget the precise meaning of specialized fields, or data problems from the past may have faded from memory.
I wonder if this is a murky recasting of Google’s indifference to “old” data and to date and time stamping. The here and now not then and past are under the surface of the essay. I am surprised the notion of “forward forward” analysis did not warrant a mention. Outfits like Google want to do look ahead prediction in order to deal with inputs newer than what is in the previous set of values.
You may read the essay and come away with a different interpretation. For me, this is the type of analysis characteristic of a Googler, not a Xoogler. If I am correct, why doesn’t the essay hit the big ideas about cost and synthetic data directly?
Stephen E Arnold, March 3, 2023
Worthless Data Work: Sorry, No Sympathy from Me
February 27, 2023
I read a personal essay about “data work.” The title is interesting: “Most Data Work Seems Fundamentally Worthless.” I am not sure of the age of the essayist, but the pain is evident in the word choice; for example: Flavor of despair (yes, synesthesia in a modern technology awakening write up!), hopeless passivity (yes, a digital Sisyphus!), essentially fraudulent (shades of Bernie Madoff!), fire myself (okay, self loathing and an inner destructive voice), and much, much more.
But the point is not the author for me. The big idea is that when it comes to data, most people want a chart and don’t want to fool around with numbers, statistical procedures, data validation, and context of the how, where, and what of the collection process.
Let’s go to the write up:
How on earth could we have what seemed to be an entire industry of people who all knew their jobs were pointless?
Like Elizabeth Barrett Browning, the essayist enumerates the wrongs of data analytics as a vaudeville act:
- Talking about data is not “doing” data
- Garbage in, garbage out
- No clue about the reason for an analysis
- Making marketing and others angry
- Unethical colleagues wallowing in easy money
What’s ahead? I liked these statements which are similar to what a digital Walt Whitman via ChatGPT might say:
I’ve punched this all out over one evening, and I’m still figuring things out myself, but here’s what I’ve got so far… that’s what feels right to me – those of us who are despairing, we’re chasing quality and meaning, and we can’t do it while we’re taking orders from people with the wrong vision, the wrong incentives, at dysfunctional organizations, and with data that makes our tasks fundamentally impossible in the first place. Quality takes time, and right now, it definitely feels like there isn’t much of a place for that in the workplace.
Imagine. The data and working with it has an inherent negative impact. We live in a data driven world. Is that why many processes are dysfunctional. Hey, Sisyphus, what are the metrics on your progress with the rock?
Stephen E Arnold, February 27, 2023
Confessions? It Is That Time of Year
December 23, 2022
Forget St. Augustine.
Big data, data science, or whatever you want to call is was the precursor to artificial intelligence. Tech people pursued careers in the field, but after the synergy and hype wore off the real work began. According to WD in his RYX,R blog post: “Goodbye, Data Science,” the work is tedious, low-value, unwilling, and left little room for career growth.
WD worked as a data scientist for a few years, then quit in pursuit of the higher calling as a data engineer. He will be working on the implementation of data science instead of its origins. He explained why he left in four points:
• “The work is downstream of engineering, product, and office politics, meaning the work was only often as good as the weakest link in that chain.
• Nobody knew or even cared what the difference was between good and bad data science work. Meaning you could suck at your job or be incredible at it and you’d get nearly the same regards in either case.
• The work was often very low value-add to the business (often compensating for incompetence up the management chain).
• When the work’s value-add exceeded the labor costs, it was often personally unfulfilling (e.g. tuning a parameter to make the business extra money).”
WD’s experiences sound like everyone who is disenchanted with their line of work. He worked with managers who would not listen when they were told stupid projects would fail. The managers were more concerned with keeping their bosses and shareholders happy. He also mentioned that engineers are inflamed with self-grandeur and scientists are bad at code. He worked with young and older data people who did not know what they were doing.
As a data engineer, WD has more free time, more autonomy, better career advancements, and will continue to learn.
Whitney Grace, December 23, 2022
Common Sense: A Refreshing Change in Tech Write Ups
December 13, 2022
I want to give a happy quack to this article: “Forget about Algorithms and Models — Learn How to Solve Problems First.” The common sense write up suggests that big data cowboys and cowgirls make sure of their problem solving skills before doing the algorithm and model Lego drill. To make this point clear: Put foundations in place before erecting a structure which may fail in interesting ways.
The write up says:
For programmers and data scientists, this means spending time understanding the problem and finding high-level solutions before starting to code.
But in an era of do your own research and thumbtyping will common sense prevail?
Not often.
The article provides a list a specific steps to follow as part of the foundation for the digital confection. Worth reading; however, the write up tries to be upbeat.
A positive attitude is a plus. Too bad common sense is not particularly abundant in certain fascinating individual and corporate actions; to wit:
- Doing the FBX talkathons
- Installing spyware without legal okays
- Writing marketing copy that asserts a cyber security system will protect a licensee.
You may have your own examples. Common sense? Not abundant in my opinion. That’s why a book like How to Solve It: Modern Heuristics is unlikely to be on many nightstands of some algorithm and data analysts. Do I know this for a fact? Nope, just common sense. Thumbtypers, remember?
Stephen E Arnold, December 13, 2022
An Essay about Big Data Analytics: Trouble Amping Up
October 31, 2022
I read “What Moneyball for Everything Has Done to American Culture.” Who doesn’t love a thrilling data analytics story? Let’s narrow the scope of the question: What MBA, engineer, or Certified Financial Analyst doesn’t love a thrilling data analytics story?
Give up? The answer is 99.9 percent emit adrenaline and pheromone in copious quantities. Yeah, baby. Winner!
The essay in the “we beg for dollars politely” publication asserts:
The analytics revolution, which began with the movement known as Moneyball, led to a series of offensive and defensive adjustments that were, let’s say, _catastrophically successful_. Seeking strikeouts, managers increased the number of pitchers per game and pushed up the average velocity and spin rate per pitcher. Hitters responded by increasing the launch angles of their swings, raising the odds of a home run, but making strikeouts more likely as well. These decisions were all legal, and more important, they were all _correct_ from an analytical and strategic standpoint.
Well, that’s what makes outfits similar to Google-type, Amazon-type, and TikTok-type outfits so darned successful. Data analytics and nifty algorithms pay off. Moneyball!
The essay notes:
The sport that I fell in love with doesn’t really exist anymore.
Is the author talking about baseball or is the essaying pinpointing what’s happened in high technology user land?
My hunch is that baseball is a metaphor for the outstanding qualities of many admired companies. Privacy? Hey, gone. Security? There is a joke worthy of vaudeville. Reliability? Ho ho ho. Customer service from a person who knows a product? You have to be kidding.
I like the last paragraph:
Cultural Moneyballism, in this light, sacrifices exuberance for the sake of formulaic symmetry. It sacrifices diversity for the sake of familiarity. It solves finite games at the expense of infinite games. Its genius dulls the rough edges of entertainment. I think that’s worth caring about. It is definitely worth asking the question: In a world that will only become more influenced by mathematical intelligence, can we ruin culture through our attempts to perfect it?
Unlike a baseball team’s front office, we can’t fire these geniuses when the money is worthless and the ball disintegrates due to a lack of quality control.
Stephen E Arnold, October 31, 2022
A Data Taboo: Poisoned Information But We Do Not Discuss It Unless… Lawyers
October 25, 2022
In a conference call yesterday (October 24, 2022), I mentioned one of my laws of online information; specifically, digital information can be poisoned. The venom can be administered by a numerically adept MBA or a junior college math major taking short cuts because data validation is hard work. The person on the call was mildly surprised because the notion of open source and closed source “facts” intentionally weaponized is an uncomfortable subject. I think the person with whom I was speaking blinked twice when I pointed what should be obvious to most individuals in the intelware business. Here’s the pointy end of reality:
Most experts and many of the content processing systems assume that data are good enough. Plus, with lots of data any irregularities are crunched down by steamrolling mathematical processes.
The problem is that articles like “Biotech Firm Enochian Says Co Founder Fabricated Data” makes it clear that MBA math as well as experts hired to review data can be caught with their digital clothing in a pile. These folks are, in effect, sitting naked in a room with people who want to make money. Nakedness from being dead wrong can lead to some career turbulence; for example, prison.
The write up reports:
Enochian BioSciences Inc. has sued co-founder Serhat Gumrukcu for contractual fraud, alleging that it paid him and his husband $25 million based on scientific data that Mr. Gumrukcu altered and fabricated.
The article does not explain precisely how the data were “fabricated.” However, someone with Excel skills or access to an article like “Top 3 Python Packages to Generate Synthetic Data” and Fiverr.com or similar gig work site can get some data generated at a low cost. Who will know? Most MBAs math and statistics classes focus on meeting targets in order to get a bonus or amp up a “service” fee for clicking a mouse. Experts who can figure out fiddled data sets take the time if they are motivated by professional jealousy or cold cash. Who blew the whistle on Theranos? A data analyst? Nope. A “real” journalist who interviewed people who thought something was goofy in the data.
My point is that it is trivially easy to whip up data to support a run at tenure or at a group of MBAs desperate to fund the next big thing as the big tech house of cards wobbles in the winds of change.
Several observations:
- The threat of bad or fiddled data is rising. My team is checking a smart output by hand because we simply cannot trust what a slick, new intelware system outputs. Yep, trust is in short supply among my research team.
- Individual inspection of data from assorted open and closed sources is accepted as is. The attitude is that the law of big numbers, the sheer volume of data, or the magic of cross correlation will minimize errors. Sure these processes will, but what if the data are weaponized and crafted to avoid detection? The answer is to check each item. How’s that for a cost center?
- Uninformed individuals (yep, I am including some data scientists, MBAs, and hawkers of data from app users) don’t know how to identify weaponized data nor know what to do when such data are identified.
Does this suggest that a problem exists? If yes, what’s the fix?
[a] Ignore the problem
[b] Trust Google-like outfits who seek to be the source for synthetic data
[c] Rely on MBAs
[d] Rely on jealous colleagues in the statistics department with limited tenure opportunities
[e] Blink.
Pick one.
Stephen E Arnold, October 25, 2022
The Push for Synthetic Data: What about Poisoning and Bias? Not to Worry
October 6, 2022
Do you worry about data poisoning, use of crafted data strings to cause numerical recipes to output craziness, and weaponized information shaped by a disaffected MBA big data developer sloshing with DynaPep?
No. Good. Enjoy the outputs.
Yes. Too bad. You lose.
For a rah rah, it’s sunny in Slough look at synthetic data, read “Synthetic Data Is the Safe, Low-Cost Alternative to Real Data That We Need.”
The sub title is:
A new solution for data hungry AIs
And the sub sub title is:
Content provided by IBM and TNW.
Let’s check out what this IBM content marketing write up says:
One example is Task2Sim, an AI model built by the MIT-IBM Watson AI Lab that creates synthetic data for training classifiers. Rather than teaching the classifier to recognize one object at a time, the model creates images that can be used to teach multiple tasks. The scalability of this type of model makes collecting data less time consuming and less expensive for data hungry businesses.
What are the downsides of synthetic data? Downsides? Don’t be silly:
Synthetic data, however it is produced, offers a number of very concrete advantages over using real world data. First of all, it’s easier to collect way more of it, because you don’t have to rely on humans creating it. Second, the synthetic data comes perfectly labeled, so there’s no need to rely on labor intensive data centers to (sometimes incorrectly) label data. Third, it can protect privacy and copyright, as the data is, well, synthetic. And finally, and perhaps most importantly, it can reduce biased outcomes.
There is one, very small, almost miniscule issue stated in the write up; to wit:
As you might suspect, the big question regarding synthetic data is around the so-called fidelity — or how closely it matches real-world data. The jury is still out on this, but research seems to show that combining synthetic data with real data gives statistically sound results. This year, researchers from MIT and the MIT-IBM AI Watson Lab showed that an image classifier that was pretrained on synthetic data in combination with real data, performed as well as an image classifier trained exclusively on real data.
I loved the “seems to show” phrase I put in bold face. Seems is such a great verb. It “seems” almost accurate.
But what about that disaffected MBA developer fiddling with thresholds?
I know the answer to this question, “That will never happen.”
Okay, I am convinced. You know the “we need” thing.
Stephen E Arnold, October 6, 2022
Webb Wobbles: Do Other Data Streams Stumble Around?
October 4, 2022
I read an essay identified as an essay from The_Byte In Futurism with the content from Nature. Confused? I am.
The title of the article is “Scientists May Have Really Screwed Up on Early James Webb Findings.” The “Webb” is not the digital construct, but the space telescope. The subtitle about the data generated from the system is:
I don’t think anybody really expected this to be as big of an issue as it’s becoming.
Space is not something I think about. Decades ago I met a fellow named Fred G., who was engaged in a study of space warfare. Then one of my colleague Howard F. joined my team after doing some satellite stuff with a US government agency. He didn’t volunteer any information to me, and I did not ask. Space may be the final frontier, but I liked working on online from my land based office, thank you very much.
The article raises an interesting point; to wit:
When the first batch of data dropped earlier this summer, many dived straight into analysis and putting out papers. But according to new reporting by Nature, the telescope hadn’t been fully calibrated when the data was first released, which is now sending some astronomers scrambling to see if their calculations are now obsolete. The process of going back and trying to find out what parts of the work needs to be redone has proved “thorny and annoying,” one astronomer told Nature.
The idea is that the “Webby” data may have been distorted, skewed, or output with knobs and dials set incorrectly. Not surprisingly those who used these data to do spacey stuff may have reached unjustifiable conclusions. What about those nifty images, the news conferences, and the breathless references to the oldest, biggest, coolest images from the universe?
My thought is that the analyses, images, and scientific explanations are wrong to some degree. I hope the data are as pure as online clickstream data. No, no, strike that. I hope the data are as rock solid as mobile GPS data. No, no, strike that too. I hope the data are accurate like looking out the window to determine if it is a clear or cloudy day. Yes, narrowed scope, first hand input, and a binary conclusion.
Unfortunately in today’s world, that’s not what data wranglers do on the digital ranch.
If the “Webby” data are off kilter, my question is:
What about the data used to train smart software from some of America’s most trusted and profitable companies? Could these data be making incorrect decisions flow from models so that humans and downstream systems keep producing less and less reliable results?
My thought is, “Who wants to think about data being wrong, poisoned, or distorted?” People want better, faster, cheaper. Some people want to leverage data in cash or a bunker in Alaska. Others like Dr. Timnit Gebru wants her criticisms of the estimable Google to get some traction, even among those who snorkel and do deep dives.
If the scientists, engineers, and mathematicians fouled up with James Webb data, isn’t it possible that some of the big data outfits are making similar mistakes with calibration, data verification, analysis, and astounding observations?
I think the “Webby” moment is important. Marketers are not likely to worry too much.
Stephen E Arnold, October 4, 2022
Data and Dining: Yum Yum
August 30, 2022
Food and beverage companies hire consultants like Mike Kostyo to predict what dishes will soon be gracing menus. HuffPost describes the flavorful profession in the piece, “This Food Trendologist Knows What We’ll Be Eating Before Anyone Else.” As one might expect, the job involves traveling to many places and sampling many cuisines. But it also means analyzing a trove of data. Who knew? Writer Emily Laurence tells us:
“Kostyo explained that declaring something a trend requires actual data; it’s not done willy-nilly. A lot of his job is spent analyzing data to prepare food trend reports he and his team put together a few times a year. Some brands and companies use these trend reports to determine products they may want to create. ‘We have our eyes on all sorts of possible trends, with dedicated folders for each. Any time we come across a piece of data or anecdotal evidence related to a possible trend, we add it to the designated folder,’ Kostyo said, explaining that this allows them to see how a trend is building over time (or if it fizzles out, never actually turning into one). For example, he said he and his team use a tool that gives them access to more than 100,000 menus across the country. ‘We can use this tool to see what types of appetizers have grown the fastest in the past few years or what ingredients are being used more,’ Kostyo said.”
We would be curious to see that presumably proprietary data tool. For clients, the accuracy of these predictions can mean the difference between celebrating a profitable quarter and handing out pink slips. See the write-up for how one gets into this profession, factors that can affect food trends, and what Kostyo predicts diners will swallow next.
Cynthia Murrell, August 30, 2022
Data: Better Fresh
May 18, 2022
Decisions based on data are only as good as the data on which they are based. That seems obvious, but according to BetaNews, “Over 80 Percent of Companies Are Relying on Stale Data to Make Decisions.” Writer Ian Barker summarizes a recent study:
“The research, conducted by Dimensional Research for data integration specialist Fivetran, finds that 82 percent of companies are making decisions based on stale information. This is leading to wrong decisions and lost revenue according to 85 percent. In addition 86 percent of respondents say their business needs access to real-time ERP [Enterprise Resource Planning] data to make smart business decisions, yet only 23 percent have systems in place to make that possible. And almost all (99 percent) say they are struggling to gain consistent access to information stored in their ERP systems. Overall 65 percent of respondents say access to ERP data is difficult and 78 percent think software vendors intentionally make it so. Those surveyed say poor access to ERP data directly impacts their business with slowed operations, bad decision-making and lost revenue.”
The write-up includes a few info-graphics for the curious to peruse. Why most of those surveyed think vendors purposely make it difficult to access good data is not explained. Fivetran does emphasize the importance of “looking at the freshest, most complete dataset possible.” Yep, old info is not very helpful. The company asserts the answer lies in change data capture, a service it happens to offer (as do several other companies).
Cynthia Murrell, May 17, 2022