Confessions? It Is That Time of Year

December 23, 2022

Forget St. Augustine.

Big data, data science, or whatever you want to call it, was the precursor to artificial intelligence. Tech people pursued careers in the field, but after the synergy and hype wore off, the real work began. According to WD in his RYX,R blog post “Goodbye, Data Science,” the work was tedious, low value, unrewarding, and left little room for career growth.

WD worked as a data scientist for a few years, then quit in pursuit of a higher calling: data engineering. He will be working on the implementation of data science instead of its origins. He explained why he left in four points:

• “The work is downstream of engineering, product, and office politics, meaning the work was often only as good as the weakest link in that chain.

• Nobody knew or even cared what the difference was between good and bad data science work. Meaning you could suck at your job or be incredible at it and you’d get nearly the same regards in either case.

• The work was often very low value-add to the business (often compensating for incompetence up the management chain).

• When the work’s value-add exceeded the labor costs, it was often personally unfulfilling (e.g. tuning a parameter to make the business extra money).”

WD’s experiences sound like those of everyone who is disenchanted with their line of work. He worked with managers who would not listen when they were told stupid projects would fail. The managers were more concerned with keeping their bosses and shareholders happy. He also mentioned that engineers are inflated with self-grandeur and scientists are bad at coding. He worked with young and older data people who did not know what they were doing.

As a data engineer, WD has more free time, more autonomy, and better career advancement opportunities, and he will continue to learn.

Whitney Grace, December 23, 2022

Common Sense: A Refreshing Change in Tech Write Ups

December 13, 2022

I want to give a happy quack to this article: “Forget about Algorithms and Models — Learn How to Solve Problems First.” The common sense write up suggests that big data cowboys and cowgirls shore up their problem solving skills before doing the algorithm and model Lego drill. To make this point clear: Put foundations in place before erecting a structure which may fail in interesting ways.

The write up says:

For programmers and data scientists, this means spending time understanding the problem and finding high-level solutions before starting to code.

But in an era of do your own research and thumbtyping will common sense prevail?

Not often.

The article provides a list of specific steps to follow as part of the foundation for the digital confection. Worth reading; however, the write up tries to be upbeat.

A positive attitude is a plus. Too bad common sense is not particularly abundant in certain fascinating individual and corporate actions; to wit:

  • Doing the FBX talkathons
  • Installing spyware without legal okays
  • Writing marketing copy that asserts a cyber security system will protect a licensee.

You may have your own examples. Common sense? Not abundant in my opinion. That’s why a book like How to Solve It: Modern Heuristics is unlikely to be on the nightstands of many algorithm and data analysts. Do I know this for a fact? Nope, just common sense. Thumbtypers, remember?

Stephen E Arnold, December 13, 2022

An Essay about Big Data Analytics: Trouble Amping Up

October 31, 2022

I read “What Moneyball for Everything Has Done to American Culture.” Who doesn’t love a thrilling data analytics story? Let’s narrow the scope of the question: What MBA, engineer, or Certified Financial Analyst doesn’t love a thrilling data analytics story?

Give up? The answer is that 99.9 percent emit adrenaline and pheromones in copious quantities. Yeah, baby. Winner!

The essay in the “we beg for dollars politely” publication asserts:

The analytics revolution, which began with the movement known as Moneyball, led to a series of offensive and defensive adjustments that were, let’s say, _catastrophically successful_. Seeking strikeouts, managers increased the number of pitchers per game and pushed up the average velocity and spin rate per pitcher. Hitters responded by increasing the launch angles of their swings, raising the odds of a home run, but making strikeouts more likely as well. These decisions were all legal, and more important, they were all _correct_ from an analytical and strategic standpoint.

Well, that’s what makes the Google-type, Amazon-type, and TikTok-type outfits so darned successful. Data analytics and nifty algorithms pay off. Moneyball!

The essay notes:

The sport that I fell in love with doesn’t really exist anymore.

Is the author talking about baseball, or is the essay pinpointing what’s happened in high technology user land?

My hunch is that baseball is a metaphor for the outstanding qualities of many admired companies. Privacy? Hey, gone. Security? There is a joke worthy of vaudeville. Reliability? Ho ho ho. Customer service from a person who knows a product? You have to be kidding.

I like the last paragraph:

Cultural Moneyballism, in this light, sacrifices exuberance for the sake of formulaic symmetry. It sacrifices diversity for the sake of familiarity. It solves finite games at the expense of infinite games. Its genius dulls the rough edges of entertainment. I think that’s worth caring about. It is definitely worth asking the question: In a world that will only become more influenced by mathematical intelligence, can we ruin culture through our attempts to perfect it?

Unlike a baseball team’s front office, we can’t fire these geniuses when the money is worthless and the ball disintegrates due to a lack of quality control.

Stephen E Arnold, October 31, 2022

A Data Taboo: Poisoned Information But We Do Not Discuss It Unless… Lawyers

October 25, 2022

In a conference call yesterday (October 24, 2022), I mentioned one of my laws of online information; specifically, digital information can be poisoned. The venom can be administered by a numerically adept MBA or a junior college math major taking short cuts because data validation is hard work. The person on the call was mildly surprised because the notion of open source and closed source “facts” intentionally weaponized is an uncomfortable subject. I think the person with whom I was speaking blinked twice when I pointed out what should be obvious to most individuals in the intelware business. Here’s the pointy end of reality:

Most experts and many of the content processing systems assume that data are good enough. Plus, with lots of data any irregularities are crunched down by steamrolling mathematical processes.

The problem is that articles like “Biotech Firm Enochian Says Co Founder Fabricated Data” make it clear that MBA math as well as experts hired to review data can be caught with their digital clothing in a pile. These folks are, in effect, sitting naked in a room with people who want to make money. Nakedness from being dead wrong can lead to some career turbulence; for example, prison.

The write up reports:

Enochian BioSciences Inc. has sued co-founder Serhat Gumrukcu for contractual fraud, alleging that it paid him and his husband $25 million based on scientific data that Mr. Gumrukcu altered and fabricated.

The article does not explain precisely how the data were “fabricated.” However, someone with Excel skills, or access to an article like “Top 3 Python Packages to Generate Synthetic Data” and Fiverr.com or a similar gig work site, can get some data generated at a low cost. Who will know? Most MBAs’ math and statistics classes focus on meeting targets in order to get a bonus or amp up a “service” fee for clicking a mouse. Experts who can figure out fiddled data sets take the time only if they are motivated by professional jealousy or cold cash. Who blew the whistle on Theranos? A data analyst? Nope. A “real” journalist who interviewed people who thought something was goofy in the data.

My point is that it is trivially easy to whip up data to support a run at tenure or at a group of MBAs desperate to fund the next big thing as the big tech house of cards wobbles in the winds of change.
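How trivially easy? A minimal sketch, using only the Python standard library (the packages in the cited article are not named here, so this illustrative stand-in makes no claim about them), fabricates a plausible-looking treatment-versus-control dataset with a baked-in effect:

```python
import random
import statistics

def fabricate_trial_data(n=200, effect=1.5, seed=42):
    """Generate a fake 'treatment vs. control' dataset with a
    built-in effect size: plausible-looking numbers, zero reality."""
    rng = random.Random(seed)
    control = [rng.gauss(10.0, 2.0) for _ in range(n)]
    treatment = [rng.gauss(10.0 + effect, 2.0) for _ in range(n)]
    return control, treatment

control, treatment = fabricate_trial_data()
# The "result" looks statistically convincing by construction.
print(f"control mean:   {statistics.mean(control):.2f}")
print(f"treatment mean: {statistics.mean(treatment):.2f}")
```

A reviewer who sees only the summary statistics has no way to distinguish these numbers from a real experiment.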

Several observations:

  1. The threat of bad or fiddled data is rising. My team is checking smart software outputs by hand because we simply cannot trust what a slick, new intelware system outputs. Yep, trust is in short supply among my research team.
  2. Data from assorted open and closed sources are accepted as is; individual inspection is rare. The attitude is that the law of big numbers, the sheer volume of data, or the magic of cross correlation will minimize errors. Sure, these processes will, but what if the data are weaponized and crafted to avoid detection? The answer is to check each item. How’s that for a cost center?
  3. Uninformed individuals (yep, I am including some data scientists, MBAs, and hawkers of data from app users) don’t know how to identify weaponized data nor know what to do when such data are identified.
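Checking each item is costly, but even cheap screens beat blind trust. One such screen, offered purely as an illustration and not as anything the write up proposes, is a first-digit (Benford’s law) check, which often flags naively fabricated numeric data:

```python
import math
from collections import Counter

def benford_deviation(values):
    """Compare the first-digit distribution of `values` to Benford's law.
    Returns the total absolute deviation across digits 1-9; naively
    fabricated data (e.g., uniform or sequential) tends to score high."""
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v]
    total = len(digits)
    counts = Counter(digits)
    deviation = 0.0
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)      # Benford expected share
        observed = counts.get(d, 0) / total
        deviation += abs(observed - expected)
    return deviation

# Multiplicative growth data roughly follows Benford's law;
# a lazy fabricator's sequential numbers do not.
organic = [1.7 ** i for i in range(1, 60)]
fabricated = [100 + i for i in range(60)]     # first digit is always 1
print(round(benford_deviation(organic), 2))
print(round(benford_deviation(fabricated), 2))
```

A screen like this is a smoke detector, not a verdict: it narrows where the expensive item-by-item checking should start.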

Does this suggest that a problem exists? If yes, what’s the fix?

[a] Ignore the problem

[b] Trust Google-like outfits who seek to be the source for synthetic data

[c] Rely on MBAs

[d] Rely on jealous colleagues in the statistics department with limited tenure opportunities

[e] Blink.

Pick one.

Stephen E Arnold, October 25, 2022

The Push for Synthetic Data: What about Poisoning and Bias? Not to Worry

October 6, 2022

Do you worry about data poisoning, use of crafted data strings to cause numerical recipes to output craziness, and weaponized information shaped by a disaffected MBA big data developer sloshing with DynaPep?

No. Good. Enjoy the outputs.

Yes. Too bad. You lose.

For a rah rah, it’s sunny in Slough look at synthetic data, read “Synthetic Data Is the Safe, Low-Cost Alternative to Real Data That We Need.”

The subtitle is:

A new solution for data hungry AIs

And the sub-subtitle is:

Content provided by IBM and TNW.

Let’s check out what this IBM content marketing write up says:

One example is Task2Sim, an AI model built by the MIT-IBM Watson AI Lab that creates synthetic data for training classifiers. Rather than teaching the classifier to recognize one object at a time, the model creates images that can be used to teach multiple tasks. The scalability of this type of model makes collecting data less time consuming and less expensive for data hungry businesses.

What are the downsides of synthetic data? Downsides? Don’t be silly:

Synthetic data, however it is produced, offers a number of very concrete advantages over using real world data. First of all, it’s easier to collect way more of it, because you don’t have to rely on humans creating it. Second, the synthetic data comes perfectly labeled, so there’s no need to rely on labor intensive data centers to (sometimes incorrectly) label data. Third, it can protect privacy and copyright, as the data is, well, synthetic. And finally, and perhaps most importantly, it can reduce biased outcomes.

There is one very small, almost minuscule issue stated in the write up; to wit:

As you might suspect, the big question regarding synthetic data is around the so-called fidelity — or how closely it matches real-world data. The jury is still out on this, but research seems to show that combining synthetic data with real data gives statistically sound results. This year, researchers from MIT and the MIT-IBM AI Watson Lab showed that an image classifier that was pretrained on synthetic data in combination with real data, performed as well as an image classifier trained exclusively on real data.

I loved the phrase “seems to show.” Seems is such a great verb. It “seems” almost accurate.

But what about that disaffected MBA developer fiddling with thresholds?

I know the answer to this question, “That will never happen.”

Okay, I am convinced. You know the “we need” thing.

Stephen E Arnold, October 6, 2022

Webb Wobbles: Do Other Data Streams Stumble Around?

October 4, 2022

I read an essay from The_Byte in Futurism with content from Nature. Confused? I am.

The title of the article is “Scientists May Have Really Screwed Up on Early James Webb Findings.” The “Webb” is not the digital construct, but the space telescope. The subtitle about the data generated from the system is:

I don’t think anybody really expected this to be as big of an issue as it’s becoming.

Space is not something I think about. Decades ago I met a fellow named Fred G., who was engaged in a study of space warfare. Then one of my colleagues, Howard F., joined my team after doing some satellite stuff with a US government agency. He didn’t volunteer any information to me, and I did not ask. Space may be the final frontier, but I liked working on online systems from my land-based office, thank you very much.

The article raises an interesting point; to wit:

When the first batch of data dropped earlier this summer, many dived straight into analysis and putting out papers. But according to new reporting by Nature, the telescope hadn’t been fully calibrated when the data was first released, which is now sending some astronomers scrambling to see if their calculations are now obsolete. The process of going back and trying to find out what parts of the work needs to be redone has proved “thorny and annoying,” one astronomer told Nature.

The idea is that the “Webby” data may have been distorted, skewed, or output with knobs and dials set incorrectly. Not surprisingly those who used these data to do spacey stuff may have reached unjustifiable conclusions. What about those nifty images, the news conferences, and the breathless references to the oldest, biggest, coolest images from the universe?

My thought is that the analyses, images, and scientific explanations are wrong to some degree. I hope the data are as pure as online clickstream data. No, no, strike that. I hope the data are as rock solid as mobile GPS data. No, no, strike that too. I hope the data are accurate like looking out the window to determine if it is a clear or cloudy day. Yes, narrowed scope, first hand input, and a binary conclusion.

Unfortunately in today’s world, that’s not what data wranglers do on the digital ranch.

If the “Webby” data are off kilter, my question is:

What about the data used to train smart software from some of America’s most trusted and profitable companies? Could flawed data cause incorrect decisions to flow from models, so that humans and downstream systems keep producing less and less reliable results?

My thought is, “Who wants to think about data being wrong, poisoned, or distorted?” People want better, faster, cheaper. Some people want to leverage data into cash or a bunker in Alaska. Others, like Dr. Timnit Gebru, want their criticisms of the estimable Google to get some traction, even among those who snorkel and do deep dives.

If the scientists, engineers, and mathematicians fouled up with James Webb data, isn’t it possible that some of the big data outfits are making similar mistakes with calibration, data verification, analysis, and astounding observations?

I think the “Webby” moment is important. Marketers are not likely to worry too much.

Stephen E Arnold, October 4, 2022

Data and Dining: Yum Yum

August 30, 2022

Food and beverage companies hire consultants like Mike Kostyo to predict what dishes will soon be gracing menus. HuffPost describes the flavorful profession in the piece, “This Food Trendologist Knows What We’ll Be Eating Before Anyone Else.” As one might expect, the job involves traveling to many places and sampling many cuisines. But it also means analyzing a trove of data. Who knew? Writer Emily Laurence tells us:

“Kostyo explained that declaring something a trend requires actual data; it’s not done willy-nilly. A lot of his job is spent analyzing data to prepare food trend reports he and his team put together a few times a year. Some brands and companies use these trend reports to determine products they may want to create. ‘We have our eyes on all sorts of possible trends, with dedicated folders for each. Any time we come across a piece of data or anecdotal evidence related to a possible trend, we add it to the designated folder,’ Kostyo said, explaining that this allows them to see how a trend is building over time (or if it fizzles out, never actually turning into one). For example, he said he and his team use a tool that gives them access to more than 100,000 menus across the country. ‘We can use this tool to see what types of appetizers have grown the fastest in the past few years or what ingredients are being used more,’ Kostyo said.”

We would be curious to see that presumably proprietary data tool. For clients, the accuracy of these predictions can mean the difference between celebrating a profitable quarter and handing out pink slips. See the write-up for how one gets into this profession, factors that can affect food trends, and what Kostyo predicts diners will swallow next.

Cynthia Murrell, August 30, 2022

Data: Better Fresh

May 18, 2022

Decisions based on data are only as good as the data on which they are based. That seems obvious, but according to BetaNews, “Over 80 Percent of Companies Are Relying on Stale Data to Make Decisions.” Writer Ian Barker summarizes a recent study:

“The research, conducted by Dimensional Research for data integration specialist Fivetran, finds that 82 percent of companies are making decisions based on stale information. This is leading to wrong decisions and lost revenue according to 85 percent. In addition 86 percent of respondents say their business needs access to real-time ERP [Enterprise Resource Planning] data to make smart business decisions, yet only 23 percent have systems in place to make that possible. And almost all (99 percent) say they are struggling to gain consistent access to information stored in their ERP systems. Overall 65 percent of respondents say access to ERP data is difficult and 78 percent think software vendors intentionally make it so. Those surveyed say poor access to ERP data directly impacts their business with slowed operations, bad decision-making and lost revenue.”

The write-up includes a few info-graphics for the curious to peruse. Why most of those surveyed think vendors purposely make it difficult to access good data is not explained. Fivetran does emphasize the importance of “looking at the freshest, most complete dataset possible.” Yep, old info is not very helpful. The company asserts the answer lies in change data capture, a service it happens to offer (as do several other companies).
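For the curious, the core idea of change data capture can be sketched in a few lines. This polling-on-a-timestamp version is a hypothetical simplification, not a description of Fivetran’s actual mechanism (which the write-up does not detail): instead of re-reading the whole table, each sync ships only the rows modified since a stored watermark.

```python
from datetime import datetime, timezone

def capture_changes(rows, last_sync):
    """Return only rows modified since the previous sync, plus a new
    watermark -- the core idea behind polling-based change data capture.
    `rows` are dicts carrying an 'updated_at' datetime column."""
    fresh = [r for r in rows if r["updated_at"] > last_sync]
    new_watermark = max((r["updated_at"] for r in fresh), default=last_sync)
    return fresh, new_watermark

# Example: two rows changed since the last sync at 09:00; one did not.
t = lambda h: datetime(2022, 5, 18, h, tzinfo=timezone.utc)
rows = [
    {"id": 1, "updated_at": t(8)},
    {"id": 2, "updated_at": t(11)},
    {"id": 3, "updated_at": t(14)},
]
fresh, watermark = capture_changes(rows, last_sync=t(9))
print([r["id"] for r in fresh])   # rows 2 and 3 changed
print(watermark)                  # next sync starts from 14:00
```

The freshness payoff is exactly the point of the survey: decisions ride on the watermark, not on last quarter’s snapshot.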

Cynthia Murrell, May 17, 2022

TikTok: Innocuous? Maybe Not Among Friends

January 5, 2022

Short videos. No big deal.

The data about one’s friends are a big deal. A really big deal. TikTok may be activating a network effect. “TikTok Tests Its Own Version of the Retweet with a New Repost Button” suggests that a Twitter function is chugging along. What if the “friend” is not a registered user of TikTok? Perhaps the Repost function is a way to expand a user’s social network. What can one do with such data? Building out a social graph and cross correlating those data with other information might be a high value exercise. What other uses can be made of these data a year or two down the road? That’s an interesting question to consider, particularly from the point of view of Chinese intelligence professionals.

“China Harvests Masses of Data on Western Targets, Documents Show” explains that China acquires data for strategic and tactical reasons. The write up does not identify specific specialized software products, services, and tools. Furthermore, the price tags for surveillance expenditures seem modest. Nevertheless, there is a suggestive passage in the write up:

Highly sensitive viral trends online are reported to a 24-hour hotline maintained by the Cybersecurity administration of China (CAC), the body that oversees the country’s censorship apparatus…

What’s interesting is that China uses both software and human-intermediated systems.

Net net: Pundits and users have zero clue about China’s data collection activities in general. When it comes to specific apps and their functions on devices, users have effectively zero knowledge of the outflow of personal data which can be used to generate a profile for possible coercion. Pooh-poohing TikTok? Not a great idea.

Stephen E Arnold, January 5, 2022

Quantitative vs Qualitative Data, Defined

January 4, 2022

Sounding almost philosophical, The Future of Things posts, “What is Data? Types of Data, Explained.” Distinguishing between types of data can mean many things. One distinction we are always curious about is what data does one emit via mobile devices and what types are feeding the surveillance machines? This write-up, though, is more of a primer on a basic data science concept: the difference between quantitative and qualitative data. Writer Kris defines quantitative data:

“As the name suggests, quantitative data can be quantified — that is, it can be measured and expressed in numerical values. Thus, it is easy to manipulate quantitative data and represent it through statistical graphs and charts. Quantitative data usually answers questions like ‘How much?’, ‘How many?’ and ‘How often?’ Some examples of quantitative data include a person’s height, the amount of time they spend on social media and their annual income. There are two key types of quantitative data: discrete and continuous.”

Here is the difference between those types of quantitative data: Discrete data cannot be divided into parts smaller than a whole number, like customers (yikes!) or orders. Continuous data is measured on a scale and can include fractions; the height or weight of a product, for example.

Kris goes on to define qualitative data, which is harder to process and analyze but can provide insights that quantitative data simply cannot:

“Qualitative data … exists as words, images, objects and symbols. Sometimes qualitative data is called categorical data because the information must be sorted into categories instead of represented by numbers or statistical charts. Qualitative data tends to answer questions with more nuance, like ‘Why did this happen?’ Some examples of qualitative data business leaders might encounter include customers’ names, their favorite colors and their ethnicity. The two most common types of qualitative data are nominal data and ordinal data.”

As the names suggest: Nominal data names a piece of data without assigning it any order in relation to other pieces of data. Ordinal data ranks bits of information in an order of some kind. The difference is important when performing any type of statistical analysis.
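The distinction shows up the moment one computes. A small illustrative sketch (the category names are hypothetical, not from the write-up) demonstrates what is legitimate for each type: counting works for nominal data, while sorting requires an ordinal rank:

```python
from collections import Counter

# Nominal data: categories with no inherent order. Asking whether
# "blue" > "red" is meaningless, but counting frequencies is fine.
favorite_colors = ["blue", "red", "green", "blue"]
print(Counter(favorite_colors).most_common(1))   # most frequent category

# Ordinal data: categories with a meaningful rank, so sorting and
# order-based statistics (like a median category) are legitimate.
SATISFACTION_RANK = {"low": 0, "medium": 1, "high": 2}
responses = ["high", "low", "medium", "high"]
print(sorted(responses, key=SATISFACTION_RANK.get))
```

Applying an order-based statistic to nominal data is the classic analysis bug this distinction exists to prevent.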

Cynthia Murrell, January 4, 2022
