In Big Data, Bad Data Does Not Matter. Not So Fast, Mr. Slick

April 8, 2024

This essay is the work of a dumb dinobaby. No smart software required.

When I hear “With big data, bad data does not matter. It’s the law of big numbers. Relax,” I chuckle. The notion of “relax” does not cheer me. Most data present challenges. First, figuring out which data are accurate can be difficult. Then one can consider data which have been screwed up by a bad actor, a careless graduate student, a low-rent research outfit, or someone who thinks errors are not possible.


The young vendor is confident that his tomatoes and bananas are top quality. The color of the fruit means nothing. Thanks, MSFT Copilot. Good enough, like the spoiled bananas.

“Data Quality Getting Worse, Report Says” offers some data (which may or may not be on the mark) which remind me to be skeptical of information available today. The Datanami article points out:

According to the company’s [DBT Labs’] State of Analytics Engineering 2024 report released yesterday, poor data quality was the number one concern of the 456 analytics engineers, data engineers, data analysts, and other data professionals who took the survey. The report shows that 57% of survey respondents rated data quality as one of the three most challenging aspects of the data preparation process. That’s a significant increase from the 2022 State of Analytics Engineering report, when 41% indicated poor data quality was one of the top three challenges.

The write up offers several other items of interest; for example:

  • Questions about who owns the data
  • Integration or fusion of multiple data sources
  • Documenting data products; that is, the editorial policy of the producer / collector of the information.

This flashing yellow light about data seems to be getting brighter. The implication of the report is that data quality “appears” to be heading downhill. The write up quotes Jignesh Patel, computer science professor at Carnegie Mellon University, to underscore the issue:

“Data will never be fully clean. You’re always going to need some ETL [extract, transform, and load] portion. The reason that data quality will never be a ‘solved problem’ is partly because data will always be collected from various sources in various ways, and partly because data quality lies in the eye of the beholder. You’re always collecting more and more data. If you can find a way to get more data, and no one says no to it, it’s always going to be messy. It’s always going to be dirty.”

But what about the assertion that in big data, bad data will be a minor problem? That assertion may be based on a lack of knowledge about some of the weak spots in data gathering processes. In the last six months, my team and I have encountered these issues:

  1. The source of the data contained a flaw so that it was impossible to determine what items were candidates for filtering out.
  2. The aggregator had zero controls because it acquired data from another party and did not do its homework other than hyping a new data set.
  3. Flawed data filled the exception folder with such a large percentage of the information that remediation was not possible due to time and cost constraints.
  4. Automated systems are indiscriminate, and few people (sometimes no one) pay close attention to inputs.
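The exception folder problem in the list above can be made concrete with a minimal sketch. This is not my team's tooling; the record layout, the required field names (`id`, `source`, `value`), and the `intake` function are all hypothetical, chosen only to show how one might count what lands in the exception pile before deciding whether remediation is affordable.

```python
from dataclasses import dataclass, field


@dataclass
class IntakeResult:
    """Holds records that passed the checks and those routed to exceptions."""
    accepted: list = field(default_factory=list)
    exceptions: list = field(default_factory=list)

    @property
    def exception_rate(self) -> float:
        # Share of all incoming records that ended up in the exception folder.
        total = len(self.accepted) + len(self.exceptions)
        return len(self.exceptions) / total if total else 0.0


def intake(records, required=("id", "source", "value")):
    """Route each record to 'accepted' or 'exceptions' (hypothetical rule:
    a record is an exception if any required field is missing or empty)."""
    result = IntakeResult()
    for rec in records:
        missing = [key for key in required if rec.get(key) in (None, "")]
        if missing:
            # Keep the reason with the record so a human can triage it later.
            result.exceptions.append((rec, missing))
        else:
            result.accepted.append(rec)
    return result


if __name__ == "__main__":
    sample = [
        {"id": 1, "source": "vendor_a", "value": 3.2},
        {"id": 2, "source": "", "value": 4.1},         # no provenance
        {"id": 3, "source": "vendor_b", "value": None}, # no measurement
    ]
    outcome = intake(sample)
    print(f"exception rate: {outcome.exception_rate:.0%}")
```

The sketch only catches the easy case: fields that are visibly missing. A record with a plausible but wrong value sails through, which is the situation in item 1 of the list.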

I agree that data quality is a concern. However, efficiency trumps old-fashioned controls and checks applied via subject matter experts and trained specialists. The fix will be smart software which will be cheaper and more opaque. The assumption that big data will be self healing may not be accurate, but it sounds good.

Stephen E Arnold, April 8, 2024

School Technology: Making Up Performance Data for Years

February 9, 2024

This essay is the work of a dumb dinobaby. No smart software required.

What is the “make up data” trend? Why is it plaguing educational institutions? From Harvard to Stanford, those who are entrusted with shaping young-in-spirit minds are putting ethical behavior in the trash can. I think I know, but let’s look at allegations of another “synthetic” information event. For context, in the UK there is a government agency called the Office for Standards in Education, Children’s Services and Skills. The agency is called OFSTED. Now let’s go to the “real” news story.


A possible scene outside of a prestigious academic institution when regulations about data become enforceable… give it a decade or two. Thanks, MidJourney. Two tries and a good enough illustration.

“Ofsted Inspectors Make Up Evidence about a School’s Performance When IT Fails” reports:

Ofsted inspectors have been forced to “make up” evidence because the computer system they use to record inspections sometimes crashes, wiping all the data…

Quite a combo: Information technology and inventing data.

The article adds:

…inspectors have to replace those notes from memory without telling the school.

Will the method work for postal investigations? Sure. Can it be extended to other activities? What about data pertinent to the UK government initiatives for smart software?

Stephen E Arnold, February 9, 2024

A Swiss Email Provider Delivers Some Sharp Cheese about MSFT Outlook

January 17, 2024

This essay is the work of a dumb dinobaby. No smart software required.

What company does my team love more than Google? Give up. It is Microsoft. Whether it is the invasive Outlook plug in for Zoom on the Mac or the incredible fly ins, pop ups, and whining about Edge, what’s not to like about this outstanding, customer-centric firm? Nothing. That’s right. Nothing Microsoft does can be considered duplicitous, monopolistic, avaricious, or improper. The company lives and breathes the ethics of Thomas Dewey, the 19th-century American philosopher. This is my opinion, of course. Some may disagree.


A perky Swiss farmer delivers an Outlook info dump. Will this delivery enable the growth of surveillance methodologies? Thanks, MSFT Copilot Bing thing. Thou did not protest when I asked for this picture.

I read and was troubled that one of my favorite US firms received some critical analysis about the MSFT Outlook email program. The sharp comments appeared in a blog post titled “Outlook Is Microsoft’s New Data Collection Service.” Proton offers an encrypted email service and a VPN from Switzerland. (Did you know the Swiss have farmers who wash their cows and stack their firewood neatly? I am from central Illinois, and our farmers ignore their cows and pile firewood. As long as a cow can make it into the slaughter house, the cow is good to go. As long as the firewood burns, winner.)

The write up reports or asserts, depending on one’s point of view:

Everyone talks about the privacy-washing campaigns of Google and Apple as they mine your online data to generate advertising revenue. But now it looks like Outlook is no longer simply an email service; it’s a data collection mechanism for Microsoft’s 772 external partners and an ad delivery system for Microsoft itself.

Surveillance is the key to making money from advertising or bulk data sales to commercial and possibly some other organizations. Proton enumerates how these sucked up data may be used:

  • Store and/or access information on the user’s device
  • Develop and improve products
  • Personalize ads and content
  • Measure ads and content
  • Derive audience insights
  • Obtain precise geolocation data
  • Identify users through device scanning

The write up provides this list of information allegedly available to Microsoft:

  • Name and contact data
  • Passwords
  • Demographic data
  • Payment data
  • Subscription and licensing data
  • Search queries
  • Device and usage data
  • Error reports and performance data
  • Voice data
  • Text, inking, and typing data
  • Images
  • Location data
  • Content
  • Feedback and ratings
  • Traffic data.

My goodness.

I particularly like the geolocation data. With Google trying to turn off the geofence functions, Microsoft definitely may be an option for some customers to test. Good, bad, or indifferent, millions of people use Microsoft Outlook. Imagine the contact lists, the entity names, and the other information extractable from messages, attachments, draft folders, and the deleted content. As an Illinois farmer might say, “Winner!”

For more information about Microsoft’s alleged data practices, please refer to the Proton article. I became uncomfortable when I read the section about how MSFT steals my email password. Imagine. Theft of a password. Is it true? My favorite giant American software company would not do that to me, a loyal customer, would it?

The write up is a bit of content marketing rah rah for Proton. I am not convinced, but I think I will have my team do some poking around on the Proton Web site. But Microsoft? No, the company would not take this action, would it?

Stephen E Arnold, January 17, 2024

An Important, Easily Pooh-Poohed Insight

December 24, 2023

This essay is the work of a dumb dinobaby. No smart software required.

Dinobaby here. I am on the regular highway, not the information highway. Nevertheless, I want to highlight what I call an “easily pooh-poohed” factoid. The source of the item this morning is an interview titled “Google Cloud Exec: Enterprise AI Is Game-Changing, But Companies Need to Prepare Their Data.”

I am going to skip the PR baloney, the truisms about Google fumbling the AI ball, and the rah rah about AI changing everything. Let me go straight to the factoid which snagged my attention:

… at the other side of these projects, what we’re seeing is that organizations did not have their data house in order. For one, they had not appropriately connected all the disparate data sources that make up the most effective outputs in a model. Two, so many organizations had not cleansed their data, making certain that their data is as appropriate and high value as possible. And so we’ve heard this forever — garbage in, garbage out. You can have this great AI project that has all the tenets of success and everybody’s really excited. Then, it turns out that the data pipeline isn’t great and that the data isn’t streamlined — all of a sudden your predictions are not as accurate as they could or should have been.

Why are points about data significant?

First, investors, senior executives, developers, and the person standing on line with you at Starbucks dismiss data normalization as a solved problem. Sorry, getting the data boat to float is a work in progress. Few want to come to grips with the issue.

Second, fixing up data is expensive. Did you ever wonder why the Stanford president made up data, forcing his resignation? The answer is that the “cost of fixing up data is too high.” If the president of Stanford can’t do it, is the run-of-the-mill fast talking AI guru different? Answer: Nope.

Third, knowledge of exception folders and non-conforming data is confined to a small number of people. Most will explain what is needed to make a content intake system work. However, many give up because the cloud of unknowing is unlikely to disperse.

The bottom line is that many data sets are not what senior executives, marketers, or those who use the data believe they are. The Google comment — despite Google’s sketchy track record in plain honest talk — is mostly correct.

So what?

  1. Outputs are often less useful than many anticipated. But if the user is uninformed or the downstream system uses whatever is pushed to it, no big deal.
  2. The thresholds and tweaks needed to make something semi useful are not shared, discussed, or explained. Keep the mushrooms in the dark and feed them manure. What do you get? Mushrooms.
  3. The graphic outputs are eye candy and distracting. Look here, not over there. Sizzle sells and selling is important.

Net net: Data are a problem. Data have been a problem due to time and cost issues. Data will remain a problem because one can sidestep a problem few recognize, and those who do recognize the pit find a short cut. What’s this mean for AI? Those smart systems will be super. What’s in your AI stocking this year?

Stephen E Arnold, December 24, 2023

A Xoogler Explains Why Big Data Is Going Nowhere Fast

March 3, 2023

I read the essay “Big Data Is Dead.” One of my essays from the Stone Age of Online used the title “Search Is Dead” so I am familiar with the trope. In a few words, one can surprise. Dead. Final. Absolute, well, maybe. On the other hand, the subject, either Big Data or Search, is part of the woodwork in the mini-camper of life.

I found this statement interesting:

Modern cloud data platforms all separate storage and compute, which means that customers are not tied to a single form factor. This, more than scale out, is likely the single most important change in data architectures in the last 20 years.

The cloud is the future. I recall seeing price analyses of some companies’ cloud activities; for example, “The Cloud vs. On-Premise Cost: Which One is Cheaper?” In my experience, cloud computing was pitched as better, faster, and cheaper. Toss in the idea that one can get rid of pesky full time systems personnel, and the cloud is a win.

What the cloud means is exactly what the quoted sentence says, “customers are not tied to a single form factor.” Does this mean that the Big Data rah rah combined with the sales pitch for moving to the cloud will set the stage for more hybrid set ups or a return to on premises computing? Storage could become a combination of on premises and cloud based solutions. The driver, in my opinion, will be cost. And one thing the essay about Big Data does not dwell on is the importance of cost in the present economic environment.

The arguments for small data or subsets of Big Data are accurate. My reading of the essay is that some data will become a problem: Privacy, security, legal, political, whatever. The essay is an explanation for “synthetic data.” Google and others want to make statistically-valid, fake data the gold standard for certain types of processes. In the “data are a liability” section of the essay, I noted:

Data can suffer from the same type of problem; that is, people forget the precise meaning of specialized fields, or data problems from the past may have faded from memory.

I wonder if this is a murky recasting of Google’s indifference to “old” data and to date and time stamping. The here and now, not the then and past, are under the surface of the essay. I am surprised the notion of “forward forward” analysis did not warrant a mention. Outfits like Google want to do look ahead prediction in order to deal with inputs newer than what is in the previous set of values.

You may read the essay and come away with a different interpretation. For me, this is the type of analysis characteristic of a Googler, not a Xoogler. If I am correct, why doesn’t the essay hit the big ideas about cost and synthetic data directly?

Stephen E Arnold, March 3, 2023

Worthless Data Work: Sorry, No Sympathy from Me

February 27, 2023

I read a personal essay about “data work.” The title is interesting: “Most Data Work Seems Fundamentally Worthless.” I am not sure of the age of the essayist, but the pain is evident in the word choice; for example: Flavor of despair (yes, synesthesia in a modern technology awakening write up!), hopeless passivity (yes, a digital Sisyphus!), essentially fraudulent (shades of Bernie Madoff!), fire myself (okay, self loathing and an inner destructive voice), and much, much more.

But the point is not the author for me. The big idea is that when it comes to data, most people want a chart and don’t want to fool around with numbers, statistical procedures, data validation, and context of the how, where, and what of the collection process.

Let’s go to the write up:

How on earth could we have what seemed to be an entire industry of people who all knew their jobs were pointless?

Like Elizabeth Barrett Browning, the essayist enumerates the wrongs of data analytics as a vaudeville act:

  1. Talking about data is not “doing” data
  2. Garbage in, garbage out
  3. No clue about the reason for an analysis
  4. Making marketing and others angry
  5. Unethical colleagues wallowing in easy money

What’s ahead? I liked these statements which are similar to what a digital Walt Whitman via ChatGPT might say:

I’ve punched this all out over one evening, and I’m still figuring things out myself, but here’s what I’ve got so far… that’s what feels right to me – those of us who are despairing, we’re chasing quality and meaning, and we can’t do it while we’re taking orders from people with the wrong vision, the wrong incentives, at dysfunctional organizations, and with data that makes our tasks fundamentally impossible in the first place. Quality takes time, and right now, it definitely feels like there isn’t much of a place for that in the workplace.

Imagine. Data and working with them have an inherent negative impact. We live in a data driven world. Is that why many processes are dysfunctional? Hey, Sisyphus, what are the metrics on your progress with the rock?

Stephen E Arnold, February 27, 2023

Confessions? It Is That Time of Year

December 23, 2022

Forget St. Augustine.

Big data, data science, or whatever you want to call it was the precursor to artificial intelligence. Tech people pursued careers in the field, but after the synergy and hype wore off the real work began. According to WD in his RYX,R blog post “Goodbye, Data Science,” the work was tedious, low-value, unfulfilling, and left little room for career growth.

WD worked as a data scientist for a few years, then quit in pursuit of the higher calling as a data engineer. He will be working on the implementation of data science instead of its origins. He explained why he left in four points:

• “The work is downstream of engineering, product, and office politics, meaning the work was only often as good as the weakest link in that chain.

• Nobody knew or even cared what the difference was between good and bad data science work. Meaning you could suck at your job or be incredible at it and you’d get nearly the same regards in either case.

• The work was often very low value-add to the business (often compensating for incompetence up the management chain).

• When the work’s value-add exceeded the labor costs, it was often personally unfulfilling (e.g. tuning a parameter to make the business extra money).”

WD’s experiences sound like everyone who is disenchanted with their line of work. He worked with managers who would not listen when they were told stupid projects would fail. The managers were more concerned with keeping their bosses and shareholders happy. He also mentioned that engineers are inflamed with self-grandeur and scientists are bad at code. He worked with young and older data people who did not know what they were doing.

As a data engineer, WD has more free time, more autonomy, better career advancements, and will continue to learn.

Whitney Grace, December 23, 2022

Common Sense: A Refreshing Change in Tech Write Ups

December 13, 2022

I want to give a happy quack to this article: “Forget about Algorithms and Models — Learn How to Solve Problems First.” The common sense write up suggests that big data cowboys and cowgirls make sure of their problem solving skills before doing the algorithm and model Lego drill. To make this point clear: Put foundations in place before erecting a structure which may fail in interesting ways.

The write up says:

For programmers and data scientists, this means spending time understanding the problem and finding high-level solutions before starting to code.

But in an era of do your own research and thumbtyping, will common sense prevail?

Not often.

The article provides a list of specific steps to follow as part of the foundation for the digital confection. Worth reading; however, the write up tries to be upbeat.

A positive attitude is a plus. Too bad common sense is not particularly abundant in certain fascinating individual and corporate actions; to wit:

  • Doing the FBX talkathons
  • Installing spyware without legal okays
  • Writing marketing copy that asserts a cyber security system will protect a licensee.

You may have your own examples. Common sense? Not abundant in my opinion. That’s why a book like How to Solve It: Modern Heuristics is unlikely to be on many nightstands of some algorithm and data analysts. Do I know this for a fact? Nope, just common sense. Thumbtypers, remember?

Stephen E Arnold, December 13, 2022

An Essay about Big Data Analytics: Trouble Amping Up

October 31, 2022

I read “What Moneyball for Everything Has Done to American Culture.” Who doesn’t love a thrilling data analytics story? Let’s narrow the scope of the question: What MBA, engineer, or Chartered Financial Analyst doesn’t love a thrilling data analytics story?

Give up? The answer is 99.9 percent emit adrenaline and pheromone in copious quantities. Yeah, baby. Winner!

The essay in the “we beg for dollars politely” publication asserts:

The analytics revolution, which began with the movement known as Moneyball, led to a series of offensive and defensive adjustments that were, let’s say, _catastrophically successful_. Seeking strikeouts, managers increased the number of pitchers per game and pushed up the average velocity and spin rate per pitcher. Hitters responded by increasing the launch angles of their swings, raising the odds of a home run, but making strikeouts more likely as well. These decisions were all legal, and more important, they were all _correct_ from an analytical and strategic standpoint.

Well, that’s what makes Google-type, Amazon-type, and TikTok-type outfits so darned successful. Data analytics and nifty algorithms pay off. Moneyball!

The essay notes:

The sport that I fell in love with doesn’t really exist anymore.

Is the author talking about baseball or is the essay pinpointing what’s happened in high technology user land?

My hunch is that baseball is a metaphor for the outstanding qualities of many admired companies. Privacy? Hey, gone. Security? There is a joke worthy of vaudeville. Reliability? Ho ho ho. Customer service from a person who knows a product? You have to be kidding.

I like the last paragraph:

Cultural Moneyballism, in this light, sacrifices exuberance for the sake of formulaic symmetry. It sacrifices diversity for the sake of familiarity. It solves finite games at the expense of infinite games. Its genius dulls the rough edges of entertainment. I think that’s worth caring about. It is definitely worth asking the question: In a world that will only become more influenced by mathematical intelligence, can we ruin culture through our attempts to perfect it?

Unlike a baseball team’s front office, we can’t fire these geniuses when the money is worthless and the ball disintegrates due to a lack of quality control.

Stephen E Arnold, October 31, 2022

A Data Taboo: Poisoned Information But We Do Not Discuss It Unless… Lawyers

October 25, 2022

In a conference call yesterday (October 24, 2022), I mentioned one of my laws of online information; specifically, digital information can be poisoned. The venom can be administered by a numerically adept MBA or a junior college math major taking short cuts because data validation is hard work. The person on the call was mildly surprised because the notion of open source and closed source “facts” intentionally weaponized is an uncomfortable subject. I think the person with whom I was speaking blinked twice when I pointed out what should be obvious to most individuals in the intelware business. Here’s the pointy end of reality:

Most experts and many of the content processing systems assume that data are good enough. Plus, with lots of data any irregularities are crunched down by steamrolling mathematical processes.

The problem is that articles like “Biotech Firm Enochian Says Co Founder Fabricated Data” make it clear that MBA math as well as experts hired to review data can be caught with their digital clothing in a pile. These folks are, in effect, sitting naked in a room with people who want to make money. Nakedness from being dead wrong can lead to some career turbulence; for example, prison.

The write up reports:

Enochian BioSciences Inc. has sued co-founder Serhat Gumrukcu for contractual fraud, alleging that it paid him and his husband $25 million based on scientific data that Mr. Gumrukcu altered and fabricated.

The article does not explain precisely how the data were “fabricated.” However, someone with Excel skills or access to an article like “Top 3 Python Packages to Generate Synthetic Data” or a gig work site can get some data generated at a low cost. Who will know? Most MBAs’ math and statistics classes focus on meeting targets in order to get a bonus or amp up a “service” fee for clicking a mouse. Experts who can figure out fiddled data sets take the time if they are motivated by professional jealousy or cold cash. Who blew the whistle on Theranos? A data analyst? Nope. A “real” journalist who interviewed people who thought something was goofy in the data.

My point is that it is trivially easy to whip up data to support a run at tenure or at a group of MBAs desperate to fund the next big thing as the big tech house of cards wobbles in the winds of change.

Several observations:

  1. The threat of bad or fiddled data is rising. My team is checking a smart output by hand because we simply cannot trust what a slick, new intelware system outputs. Yep, trust is in short supply among my research team.
  2. Data from assorted open and closed sources are accepted as is, without individual inspection. The attitude is that the law of big numbers, the sheer volume of data, or the magic of cross correlation will minimize errors. Sure, these processes will, but what if the data are weaponized and crafted to avoid detection? The answer is to check each item. How’s that for a cost center?
  3. Uninformed individuals (yep, I am including some data scientists, MBAs, and hawkers of data from app users) don’t know how to identify weaponized data or what to do when such data are identified.
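The second observation above turns on a statistical point that a small simulation can illustrate: averaging over more records shrinks random noise, but it does not shrink a systematic shift. What follows is a minimal sketch under stated assumptions, not anyone's production method. The function name `sample_mean`, the true value of 10.0, the 10 percent share of bad records, and the bias of 5.0 are all invented for illustration.

```python
import random
import statistics


def sample_mean(n, true_mean=10.0, noise_sd=2.0,
                bad_fraction=0.0, bias=0.0, seed=0):
    """Average n simulated measurements.

    Every measurement carries random noise. A share of them
    (bad_fraction) also carries a fixed shift (bias), standing in
    for fiddled or poisoned records. All parameters are illustrative.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    values = []
    for _ in range(n):
        x = rng.gauss(true_mean, noise_sd)  # honest measurement plus noise
        if rng.random() < bad_fraction:
            x += bias                       # a poisoned record
        values.append(x)
    return statistics.fmean(values)


if __name__ == "__main__":
    for n in (1_000, 100_000):
        clean = sample_mean(n)
        poisoned = sample_mean(n, bad_fraction=0.1, bias=5.0)
        print(f"n={n:>7}  clean={clean:.3f}  poisoned={poisoned:.3f}")
```

With these assumed numbers, the clean average settles near 10.0 as the record count grows, while the poisoned average settles near 10.5, which is 10.0 plus 10 percent of the 5.0 shift. More data makes the wrong answer more stable, not more correct.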

Does this suggest that a problem exists? If yes, what’s the fix?

[a] Ignore the problem

[b] Trust Google-like outfits who seek to be the source for synthetic data

[c] Rely on MBAs

[d] Rely on jealous colleagues in the statistics department with limited tenure opportunities

[e] Blink.

Pick one.

Stephen E Arnold, October 25, 2022
