A Data Taboo: Poisoned Information But We Do Not Discuss It Unless… Lawyers
October 25, 2022
In a conference call yesterday (October 24, 2022), I mentioned one of my laws of online information; specifically, digital information can be poisoned. The venom can be administered by a numerically adept MBA or a junior college math major taking short cuts because data validation is hard work. The person on the call was mildly surprised because the notion of open source and closed source “facts” intentionally weaponized is an uncomfortable subject. I think the person with whom I was speaking blinked twice when I pointed out what should be obvious to most individuals in the intelware business. Here’s the pointy end of reality:
Most experts and many of the content processing systems assume that data are good enough. Plus, with lots of data any irregularities are crunched down by steamrolling mathematical processes.
The problem is that articles like “Biotech Firm Enochian Says Co Founder Fabricated Data” make it clear that MBA math as well as experts hired to review data can be caught with their digital clothing in a pile. These folks are, in effect, sitting naked in a room with people who want to make money. Nakedness from being dead wrong can lead to some career turbulence; for example, prison.
The write up reports:
Enochian BioSciences Inc. has sued co-founder Serhat Gumrukcu for contractual fraud, alleging that it paid him and his husband $25 million based on scientific data that Mr. Gumrukcu altered and fabricated.
The article does not explain precisely how the data were “fabricated.” However, someone with Excel skills or access to an article like “Top 3 Python Packages to Generate Synthetic Data” and Fiverr.com or a similar gig work site can get some data generated at a low cost. Who will know? Most MBAs’ math and statistics classes focus on meeting targets in order to get a bonus or amp up a “service” fee for clicking a mouse. Experts who can figure out fiddled data sets take the time if they are motivated by professional jealousy or cold cash. Who blew the whistle on Theranos? A data analyst? Nope. A “real” journalist who interviewed people who thought something was goofy in the data.
My point is that it is trivially easy to whip up data to support a run at tenure or at a group of MBAs desperate to fund the next big thing as the big tech house of cards wobbles in the winds of change.
Several observations:
- The threat of bad or fiddled data is rising. My team is checking a smart output by hand because we simply cannot trust what a slick, new intelware system outputs. Yep, trust is in short supply among my research team.
- Data from assorted open and closed sources are accepted as is, without individual inspection. The attitude is that the law of large numbers, the sheer volume of data, or the magic of cross correlation will minimize errors. Sure, these processes will, but what if the data are weaponized and crafted to avoid detection? The answer is to check each item. How’s that for a cost center?
- Uninformed individuals (yep, I am including some data scientists, MBAs, and hawkers of data from app users) don’t know how to identify weaponized data or what to do when such data are identified.
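The point about volume and crafted data can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the true value, the noise level, the share of tampered items, and the size of the nudge are invented, and the helper names `noisy_readings` and `poison` are hypothetical. The sketch averages a large batch of honest but noisy readings, then averages the same batch after a fraction of the items has been shifted by an amount smaller than the ordinary noise.

```python
import random
import statistics

def noisy_readings(true_value, n, noise_sd, rng):
    """Simulate n honest measurements with random, zero-mean noise."""
    return [rng.gauss(true_value, noise_sd) for _ in range(n)]

def poison(readings, fraction, shift, rng):
    """Nudge a random fraction of readings by a small, constant amount."""
    poisoned = list(readings)
    k = int(len(poisoned) * fraction)
    for i in rng.sample(range(len(poisoned)), k):
        poisoned[i] += shift
    return poisoned

rng = random.Random(42)      # fixed seed so the run is repeatable
TRUE_VALUE = 100.0           # illustrative value, not real data
NOISE_SD = 5.0               # ordinary measurement noise

clean = noisy_readings(TRUE_VALUE, 100_000, NOISE_SD, rng)
tainted = poison(clean, fraction=0.30, shift=2.0, rng=rng)

clean_error = abs(statistics.fmean(clean) - TRUE_VALUE)
tainted_error = abs(statistics.fmean(tainted) - TRUE_VALUE)

print(f"error of the mean, clean data:    {clean_error:.3f}")
print(f"error of the mean, poisoned data: {tainted_error:.3f}")
```

Averaging shrinks the random noise toward zero, but the crafted shift survives intact: the mean of the tainted batch is off by roughly the tampered fraction times the nudge (0.30 × 2.0 = 0.6), no matter how many items are collected. Each altered item moved by less than half of one standard deviation of the ordinary noise, so a simple per-item outlier filter is unlikely to flag it.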
Does this suggest that a problem exists? If yes, what’s the fix?
[a] Ignore the problem
[b] Trust Google-like outfits who seek to be the source for synthetic data
[c] Rely on MBAs
[d] Rely on jealous colleagues in the statistics department with limited tenure opportunities
[e] Blink.
Pick one.
Stephen E Arnold, October 25, 2022
The Push for Synthetic Data: What about Poisoning and Bias? Not to Worry
October 6, 2022
Do you worry about data poisoning, use of crafted data strings to cause numerical recipes to output craziness, and weaponized information shaped by a disaffected MBA big data developer sloshing with DynaPep?
No. Good. Enjoy the outputs.
Yes. Too bad. You lose.
For a rah rah, it’s sunny in Slough look at synthetic data, read “Synthetic Data Is the Safe, Low-Cost Alternative to Real Data That We Need.”
The sub title is:
A new solution for data hungry AIs
And the sub sub title is:
Content provided by IBM and TNW.
Let’s check out what this IBM content marketing write up says:
One example is Task2Sim, an AI model built by the MIT-IBM Watson AI Lab that creates synthetic data for training classifiers. Rather than teaching the classifier to recognize one object at a time, the model creates images that can be used to teach multiple tasks. The scalability of this type of model makes collecting data less time consuming and less expensive for data hungry businesses.
What are the downsides of synthetic data? Downsides? Don’t be silly:
Synthetic data, however it is produced, offers a number of very concrete advantages over using real world data. First of all, it’s easier to collect way more of it, because you don’t have to rely on humans creating it. Second, the synthetic data comes perfectly labeled, so there’s no need to rely on labor intensive data centers to (sometimes incorrectly) label data. Third, it can protect privacy and copyright, as the data is, well, synthetic. And finally, and perhaps most importantly, it can reduce biased outcomes.
There is one, very small, almost minuscule issue stated in the write up; to wit:
As you might suspect, the big question regarding synthetic data is around the so-called fidelity — or how closely it matches real-world data. The jury is still out on this, but research seems to show that combining synthetic data with real data gives statistically sound results. This year, researchers from MIT and the MIT-IBM AI Watson Lab showed that an image classifier that was pretrained on synthetic data in combination with real data, performed as well as an image classifier trained exclusively on real data.
I loved the “seems to show” phrase I put in bold face. Seems is such a great verb. It “seems” almost accurate.
But what about that disaffected MBA developer fiddling with thresholds?
I know the answer to this question, “That will never happen.”
Okay, I am convinced. You know the “we need” thing.
Stephen E Arnold, October 6, 2022
Webb Wobbles: Do Other Data Streams Stumble Around?
October 4, 2022
I read an essay identified as an essay from The_Byte In Futurism with the content from Nature. Confused? I am.
The title of the article is “Scientists May Have Really Screwed Up on Early James Webb Findings.” The “Webb” is not the digital construct, but the space telescope. The subtitle about the data generated from the system is:
I don’t think anybody really expected this to be as big of an issue as it’s becoming.
Space is not something I think about. Decades ago I met a fellow named Fred G., who was engaged in a study of space warfare. Then one of my colleagues, Howard F., joined my team after doing some satellite stuff with a US government agency. He didn’t volunteer any information to me, and I did not ask. Space may be the final frontier, but I liked working online from my land based office, thank you very much.
The article raises an interesting point; to wit:
When the first batch of data dropped earlier this summer, many dived straight into analysis and putting out papers. But according to new reporting by Nature, the telescope hadn’t been fully calibrated when the data was first released, which is now sending some astronomers scrambling to see if their calculations are now obsolete. The process of going back and trying to find out what parts of the work needs to be redone has proved “thorny and annoying,” one astronomer told Nature.
The idea is that the “Webby” data may have been distorted, skewed, or output with knobs and dials set incorrectly. Not surprisingly those who used these data to do spacey stuff may have reached unjustifiable conclusions. What about those nifty images, the news conferences, and the breathless references to the oldest, biggest, coolest images from the universe?
My thought is that the analyses, images, and scientific explanations are wrong to some degree. I hope the data are as pure as online clickstream data. No, no, strike that. I hope the data are as rock solid as mobile GPS data. No, no, strike that too. I hope the data are accurate like looking out the window to determine if it is a clear or cloudy day. Yes, narrowed scope, first hand input, and a binary conclusion.
Unfortunately in today’s world, that’s not what data wranglers do on the digital ranch.
If the “Webby” data are off kilter, my question is:
What about the data used to train smart software from some of America’s most trusted and profitable companies? Could these data be making incorrect decisions flow from models so that humans and downstream systems keep producing less and less reliable results?
My thought is, “Who wants to think about data being wrong, poisoned, or distorted?” People want better, faster, cheaper. Some people want to leverage data into cash or a bunker in Alaska. Others, like Dr. Timnit Gebru, want their criticisms of the estimable Google to get some traction, even among those who snorkel and do deep dives.
If the scientists, engineers, and mathematicians fouled up with James Webb data, isn’t it possible that some of the big data outfits are making similar mistakes with calibration, data verification, analysis, and astounding observations?
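A minimal sketch of why a late calibration change forces rework. It assumes a simple linear calibration (a gain and an offset), which is a stand-in for illustration and not how the Webb instruments are actually calibrated. The raw counts, the two calibrations, and the 8 percent tolerance are made-up values, and `calibrate` and `affected` are hypothetical helper names.

```python
def calibrate(raw_counts, gain, offset):
    """Convert raw detector counts to physical units (assumed linear model)."""
    return [gain * count + offset for count in raw_counts]

def affected(raw_counts, old_cal, new_cal, tolerance):
    """Indexes of measurements that moved by more than `tolerance` (relative)."""
    old = calibrate(raw_counts, *old_cal)
    new = calibrate(raw_counts, *new_cal)
    return [
        i for i, (before, after) in enumerate(zip(old, new))
        if abs(after - before) > tolerance * abs(before)
    ]

raw = [120, 450, 980, 2100]      # made-up raw counts
provisional = (1.00, 0.0)        # gain, offset used at first release
updated = (0.90, 5.0)            # gain, offset after recalibration

needs_redo = affected(raw, provisional, updated, tolerance=0.08)
print("measurements to re-analyze:", needs_redo)
```

Any conclusion built on the flagged measurements has to be recomputed from the raw values. The same check applies to a training set: if the inputs were produced with the wrong knob settings, what sits downstream inherits the error.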
I think the “Webby” moment is important. Marketers are not likely to worry too much.
Stephen E Arnold, October 4, 2022
Data and Dining: Yum Yum
August 30, 2022
Food and beverage companies hire consultants like Mike Kostyo to predict what dishes will soon be gracing menus. HuffPost describes the flavorful profession in the piece, “This Food Trendologist Knows What We’ll Be Eating Before Anyone Else.” As one might expect, the job involves traveling to many places and sampling many cuisines. But it also means analyzing a trove of data. Who knew? Writer Emily Laurence tells us:
“Kostyo explained that declaring something a trend requires actual data; it’s not done willy-nilly. A lot of his job is spent analyzing data to prepare food trend reports he and his team put together a few times a year. Some brands and companies use these trend reports to determine products they may want to create. ‘We have our eyes on all sorts of possible trends, with dedicated folders for each. Any time we come across a piece of data or anecdotal evidence related to a possible trend, we add it to the designated folder,’ Kostyo said, explaining that this allows them to see how a trend is building over time (or if it fizzles out, never actually turning into one). For example, he said he and his team use a tool that gives them access to more than 100,000 menus across the country. ‘We can use this tool to see what types of appetizers have grown the fastest in the past few years or what ingredients are being used more,’ Kostyo said.”
We would be curious to see that presumably proprietary data tool. For clients, the accuracy of these predictions can mean the difference between celebrating a profitable quarter and handing out pink slips. See the write-up for how one gets into this profession, factors that can affect food trends, and what Kostyo predicts diners will swallow next.
Cynthia Murrell, August 30, 2022
Data: Better Fresh
May 18, 2022
Decisions based on data are only as good as the data on which they are based. That seems obvious, but according to BetaNews, “Over 80 Percent of Companies Are Relying on Stale Data to Make Decisions.” Writer Ian Barker summarizes a recent study:
“The research, conducted by Dimensional Research for data integration specialist Fivetran, finds that 82 percent of companies are making decisions based on stale information. This is leading to wrong decisions and lost revenue according to 85 percent. In addition 86 percent of respondents say their business needs access to real-time ERP [Enterprise Resource Planning] data to make smart business decisions, yet only 23 percent have systems in place to make that possible. And almost all (99 percent) say they are struggling to gain consistent access to information stored in their ERP systems. Overall 65 percent of respondents say access to ERP data is difficult and 78 percent think software vendors intentionally make it so. Those surveyed say poor access to ERP data directly impacts their business with slowed operations, bad decision-making and lost revenue.”
The write-up includes a few info-graphics for the curious to peruse. Why most of those surveyed think vendors purposely make it difficult to access good data is not explained. Fivetran does emphasize the importance of “looking at the freshest, most complete dataset possible.” Yep, old info is not very helpful. The company asserts the answer lies in change data capture, a service it happens to offer (as do several other companies).
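What “stale” means in practice can be shown with a minimal sketch of a freshness check. The table names, the timestamps, and the 24-hour threshold are all invented for illustration, and `stale_records` is a hypothetical helper. This is not Fivetran's method and it is not change data capture; it only flags sources whose last update is older than an agreed limit.

```python
from datetime import datetime, timedelta, timezone

def stale_records(last_updated, now, max_age):
    """Return the names of sources whose last update is older than max_age."""
    return [
        name for name, updated_at in last_updated.items()
        if now - updated_at > max_age
    ]

# Fixed "now" so the example is repeatable; all values are illustrative.
now = datetime(2022, 5, 18, 12, 0, tzinfo=timezone.utc)
last_updated = {
    "inventory": now - timedelta(minutes=5),
    "open_orders": now - timedelta(hours=30),
    "price_list": now - timedelta(days=45),
}

stale = stale_records(last_updated, now, max_age=timedelta(hours=24))
print("stale sources:", stale)
```

A check like this only reports how old the data are. Deciding how old is too old remains a business call, and it differs from one source to the next.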
Cynthia Murrell, May 17, 2022
TikTok: Innocuous? Maybe Not Among Friends
January 5, 2022
Short videos. No big deal.
The data about one’s friends are a big deal. A really big deal. TikTok may be activating a network effect. “TikTok Tests Its Own Version of the Retweet with a New Repost Button” suggests that a Twitter function is chugging along. What if the “friend” is not a registered user of TikTok? Perhaps the Repost function is a way to expand a user’s social network. What can one do with such data? Building out a social graph and cross correlating those data with other information might be a high value exercise. What other uses can be made of these data a year or two down the road? That’s an interesting question to consider, particularly from the point of view of Chinese intelligence professionals.
“China Harvests Masses of Data on Western Targets, Documents Show” explains that China acquires data for strategic and tactical reasons. The write up does not identify specific specialized software products, services, and tools. Furthermore, the price tags for surveillance expenditures seem modest. Nevertheless, there is a suggestive passage in the write up:
Highly sensitive viral trends online are reported to a 24-hour hotline maintained by the Cybersecurity administration of China (CAC), the body that oversees the country’s censorship apparatus…
What’s interesting is that China uses both software and human-intermediated systems.
Net net: Pundits and users have zero clue about China’s data collection activities in general. When it comes to specific apps and their functions on devices, users have effectively zero knowledge of the outflow of personal data which can be used to generate a profile for possible coercion. Pooh pooh-ing TikTok? Not a great idea.
Stephen E Arnold, January 5, 2022
Quantitative vs Qualitative Data, Defined
January 4, 2022
Sounding almost philosophical, The Future of Things posts, “What is Data? Types of Data, Explained.” Distinguishing between types of data can mean many things. One distinction we are always curious about is what data does one emit via mobile devices and what types are feeding the surveillance machines? This write-up, though, is more of a primer on a basic data science concept: the difference between quantitative and qualitative data. Writer Kris defines quantitative data:
“As the name suggests, quantitative data can be quantified — that is, it can be measured and expressed in numerical values. Thus, it is easy to manipulate quantitative data and represent it through statistical graphs and charts. Quantitative data usually answers questions like ‘How much?’, ‘How many?’ and ‘How often?’ Some examples of quantitative data include a person’s height, the amount of time they spend on social media and their annual income. There are two key types of quantitative data: discrete and continuous.”
Here is the difference between those types of quantitative data: Discrete data cannot be divided into parts smaller than a whole number, like customers (yikes!) or orders. Continuous data is measured on a scale and can include fractions; the height or weight of a product, for example.
Kris goes on to define qualitative data, which is harder to process and analyze but can provide insights that quantitative data simply cannot:
“Qualitative data … exists as words, images, objects and symbols. Sometimes qualitative data is called categorical data because the information must be sorted into categories instead of represented by numbers or statistical charts. Qualitative data tends to answer questions with more nuance, like ‘Why did this happen?’ Some examples of qualitative data business leaders might encounter include customers’ names, their favorite colors and their ethnicity. The two most common types of qualitative data are nominal data and ordinal data.”
As the names suggest: Nominal data names a piece of data without assigning it any order in relation to other pieces of data. Ordinal data ranks bits of information in an order of some kind. The difference is important when performing any type of statistical analysis.
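Why the difference matters for analysis can be sketched in a few lines. The values below are invented for illustration, and the `RANK` mapping is an assumed ordering for the ordinal labels. The idea is that each type of data supports a different summary: a mean for discrete and continuous numbers, a most-common value for nominal data, and a median for ordinal data.

```python
from collections import Counter
from statistics import mean

# Quantitative data (illustrative values)
orders_per_customer = [1, 3, 2, 5, 2]          # discrete: whole numbers only
product_heights_cm = [12.5, 30.25, 18.0]       # continuous: fractions allowed

# Qualitative data (illustrative values)
favorite_colors = ["red", "blue", "red", "green"]            # nominal: no order
satisfaction = ["low", "high", "medium", "medium", "low"]    # ordinal: ranked

# Assumed ranking for the ordinal labels
RANK = {"low": 0, "medium": 1, "high": 2}

avg_orders = mean(orders_per_customer)
avg_height = mean(product_heights_cm)

# Nominal data: only counting makes sense, so report the most common value.
typical_color = Counter(favorite_colors).most_common(1)[0][0]

# Ordinal data: sorting by rank is meaningful, so a median can be reported.
ranked = sorted(satisfaction, key=RANK.get)
median_satisfaction = ranked[len(ranked) // 2]

print(avg_orders, avg_height, typical_color, median_satisfaction)
```

Averaging the color names would be meaningless, and averaging the satisfaction labels would quietly assume that the gap between “low” and “medium” equals the gap between “medium” and “high.”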
Cynthia Murrell, January 4, 2022
Data Science Information at Your Fingertips
December 27, 2021
Just a brief honk about a useful resource. Data scientist, engineer, and blogger Manpreet Singh draws our attention to an “Amazing List of Data Science Cheat Sheets.” Singh begins with a word to those wondering what, exactly, data science is—linking to UC Berkeley’s page on the subject. He then reveals the trove of quick-access info is located at GitHub, posted there by engineer Favio Vazquez. Singh includes a series of screenshots that give a taste of the collection. He writes:
“When you load up this repo you will see a few different folders, these folders house a ton of different cheat sheets for different disciplines: [screenshot 1] You can also scroll down a bit to see a breakdown of these sheets: [screenshot 2] These cheat sheets range in use, but they all offer a ton of value for your data science needs. All you have to do is click on the cheat sheets you want to see, you will then be redirected to some awesome looking cheat sheets: [screenshots 3 and 4] Without a doubt, if you’re planning on learning data science, I would highly recommend checking out these cheat sheets.”
With topics as wide-ranging as business science, calculus, SQL, and machine learning, this list is a one-stop source of reference material for the current or aspiring data scientist. Savvy readers may wish to bookmark the useful page.
Cynthia Murrell, December 27, 2021
Big Data Creates Its Own Closed Mind
December 23, 2021
New ideas that challenge established theories are always ridiculed. Depending on the circumstances, they are also deemed “heretical” against all accumulated knowledge. Mind Matters News discusses how it is harmful not to explore new ideas in the interview with author Erik J. Larson, “Why Big Data Can Be The Enemy Of New Ideas.” During the interview, Larson was asked how past ethics of innovations are useful today and he stated Copernicus’s heliocentric solar system model was an example.
Copernicus’s heliocentric model was not accepted by his colleagues, who believed in the Ptolemaic Earth-centered model. There was tons of data to support the Ptolemaic model, while Copernicus’s model was not as predictive for astronomy conundrums. It only solved a few questions. Copernicus innovated because he questioned scientific doctrine. Big data AI systems are incapable of thinking differently, because they are only as smart as they have been programmed to be. In other words, AI is incapable of thinking outside the box.
Computer technology cannot replicate the human brain. Millions of dollars were invested in an attempt dubbed the Human Brain Project:
“Of course it was a total failure… The guy who started it actually ended up getting fired for a variety of reasons but tech didn’t solve that problem in science because focusing on technology rather than the actual natural world turns out to have not been a good idea. It’s almost like inserting an artificial layer. Trying to convert basic research in neuroscience into a software development project just means you’re going to end up with software ideas and ideas that are programmable on a computer. Your scientists are going to be working with existing theories because those are the ones you can actually write and code. And they’re not going to be looking for gaps in our existing theoretical knowledge in the brain.”
If people accept big data software as smarter than actual humans then that is a huge problem. It is comparable to how religious dogma (from all cultures) is used to exert control. Religion itself is not a problem, but blind obedience to its doctrine is dangerous. An example is religious fundamentalists of all kinds, including Abrahamic, Buddhist, and Hindu followers.
Big data does solve and prevent problems, but it cannot be a replacement for the human brain. Thinking creatively does not compute for AI.
Whitney Grace, December 23, 2021
Veraset: Another Data Event
November 22, 2021
Here is a good example of how personal data, in this case tracking data, can be used without one’s knowledge. In its article “Files: Phone Data Shared” the Arkansas Democrat Gazette reports that data broker Veraset provided phone location data to the US Department of Health last year as part of a free trial. The transaction was discovered by digital-rights group Electronic Frontier Foundation. The firm marketed the data as valuable for COVID research, but after the trial period was up the agency declined to move forward with a partnership. The data was purportedly stripped of names and other personal details and the researchers found no evidence it was misused. However, Washington Post reporter Drew Harwell writes:
“[Foundation technologist Bennett Cyphers] noted that Veraset’s location data includes sequences of code, known as ‘advertising identifiers,’ that can be used to pinpoint individual phones. Researchers have also shown that such data can be easily ‘de-anonymized’ and linked to a specific person. Apple and Google announced changes earlier this year that would allow people to block their ID numbers from being used for tracking. Veraset and other data brokers have worked to improve their public image and squash privacy concerns by sharing their records with public health agencies, researchers and news organizations.”
Amidst a pandemic, that tactic just might work. How do data brokers get this information in the first place? We learn:
“Data brokers pay software developers to include snippets of code in their apps that then sent a user’s location data back to the company. Some companies have folded their code into games and weather apps, but Veraset does not say which apps it works with. Critics have questioned whether users are aware that their data is being shared in such a way. The company is a spinoff of the location-data firm SafeGraph, which Google banned earlier this year as part of an effort to restrict covert location tracking.”
Wow, banned by Google—that is saying something. Harwell reports SafeGraph shared data with the CDC during the first few weeks of the pandemic. The agency used that data to track how many people were staying home for its COVID Data Tracker.
App users, often unwittingly, agree to data sharing in those opaque user agreements most of us do not read. The alternative, of course, is to deprive oneself of technology that is increasingly necessary to operate in today’s world. It is almost as if that were by design.
Cynthia Murrell, November 22, 2021