A Data Taboo: Poisoned Information But We Do Not Discuss It Unless… Lawyers
October 25, 2022
In a conference call yesterday (October 24, 2022), I mentioned one of my laws of online information; specifically, digital information can be poisoned. The venom can be administered by a numerically adept MBA or a junior college math major taking shortcuts because data validation is hard work. The person on the call was mildly surprised because the notion of open source and closed source “facts” being intentionally weaponized is an uncomfortable subject. I think the person with whom I was speaking blinked twice when I pointed out what should be obvious to most individuals in the intelware business. Here’s the pointy end of reality:
Most experts and many content processing systems assume that data are good enough. Plus, with lots of data, any irregularities are crunched down by steamrolling mathematical processes.
The problem is that articles like “Biotech Firm Enochian Says Co Founder Fabricated Data” make it clear that the MBA math crowd as well as the experts hired to review data can be caught with their digital clothing in a pile. These folks are, in effect, sitting naked in a room with people who want to make money. Nakedness from being dead wrong can lead to some career turbulence; for example, prison.
The write up reports:
Enochian BioSciences Inc. has sued co-founder Serhat Gumrukcu for contractual fraud, alleging that it paid him and his husband $25 million based on scientific data that Mr. Gumrukcu altered and fabricated.
The article does not explain precisely how the data were “fabricated.” However, someone with Excel skills, or with access to an article like “Top 3 Python Packages to Generate Synthetic Data” and Fiverr.com or a similar gig work site, can get data generated at low cost. Who will know? Most MBAs’ math and statistics classes focus on meeting targets in order to get a bonus or amp up a “service” fee for clicking a mouse. Experts who can figure out fiddled data sets take the time only if they are motivated by professional jealousy or cold cash. Who blew the whistle on Theranos? A data analyst? Nope. A “real” journalist who interviewed people who thought something was goofy in the data.
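The mechanics are not exotic. Here is a hypothetical sketch, using the widely available Faker and NumPy packages, of how quickly plausible looking study records can be whipped up. The field names, the distributions, and the baked-in “treatment effect” are inventions for illustration, not anything from the Enochian matter:

```python
# Hypothetical sketch of fabricating plausible looking records.
# Faker invents names; NumPy draws "clinical" measurements from
# whatever distribution makes the story look good.
import numpy as np
from faker import Faker

fake = Faker()
rng = np.random.default_rng(seed=42)

rows = [
    {
        "subject": fake.name(),
        "baseline": round(rng.normal(loc=100.0, scale=5.0), 1),
        # Bake the desired "effect" into the data: post-treatment values
        # are simply generated 15 points lower, plus believable noise.
        "post_treatment": round(rng.normal(loc=85.0, scale=5.0), 1),
    }
    for _ in range(500)
]

print(rows[0])  # e.g. {'subject': '...', 'baseline': 101.5, 'post_treatment': 86.2}
```

Five hundred tidy “patients” in under a second, no laboratory required. Who will know?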
My point is that it is trivially easy to whip up data to support a run at tenure or at a group of MBAs desperate to fund the next big thing as the big tech house of cards wobbles in the winds of change.
Several observations:
- The threat of bad or fiddled data is rising. My team is checking one smart system’s output by hand because we simply cannot trust what a slick, new intelware system emits. Yep, trust is in short supply among my research team.
- Data from assorted open and closed sources are accepted as is; item-by-item inspection is skipped. The attitude is that the law of large numbers, the sheer volume of data, or the magic of cross correlation will minimize errors. Sure, these processes will, but what if the data are weaponized and crafted to avoid detection? The answer is to check each item. How’s that for a cost center?
- Uninformed individuals (yep, I am including some data scientists, MBAs, and hawkers of data from app users) neither know how to identify weaponized data nor what to do when such data are identified. One classic first pass screen appears in the sketch after this list.
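For the record, spotting fiddled numbers is not black magic either. One classic first pass screen, offered here as a minimal sketch rather than a forensic method, is Benford’s law: leading digits in many natural data sets follow a known logarithmic distribution, and fabricated figures often do not. The toy values below are invented to show the mechanics:

```python
# Minimal Benford's law screen: compare observed leading digit
# frequencies against Benford's expected distribution. A large gap
# is a reason to look closer, not proof of fraud.
import math
from collections import Counter

def leading_digit(x: float) -> int:
    s = str(abs(x)).lstrip("0.")
    return int(s[0])

def benford_gap(values) -> float:
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    gap = 0.0
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)   # Benford's expected share for digit d
        observed = counts.get(d, 0) / total
        gap += abs(observed - expected)
    return gap  # near 0.0 looks natural; large values invite questions

suspicious = [5123.0, 5234.1, 5411.9, 5650.2, 5891.7]  # toy example: all 5s
print(round(benford_gap(suspicious), 3))
```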
Does this suggest that a problem exists? If yes, what’s the fix?
[a] Ignore the problem
[b] Trust Google-like outfits who seek to be the source for synthetic data
[c] Rely on MBAs
[d] Rely on jealous colleagues in the statistics department with limited tenure opportunities
[e] Blink.
Pick one.
Stephen E Arnold, October 25, 2022
How Apps Use Your Data: Just a Half Effort
April 28, 2022
I read a quite enthusiastic article called “Google Forces Developers to Provide Details on How Apps Use Your Data.” The main idea is virtue signaling with one of those flashing airport beacons. These can be seen through certain types of “info fog,” just not today’s info fog. The digital climate has a number of characteristics. One is obfuscation.
The write up states:
… the Data safety feature is now on the Google Play Store and aims to bolster security by providing users details on how an app is using their information. Developers are required to complete this section for their apps by July 20, and will need to provide updates if they change their data handling practices, too.
That sounds encouraging. Google’s been at the data harvesting combine controls for more than two decades. Now app developers have to provide information about their use of an app user’s data and presumably flip on the yellow fog lights for what the folks who have access to those data via an API or a bulk transfer are doing. Amusing thought: forced regulation after 240 months on the info highway.
However, what app users do with data is half of the story, maybe less. The interesting question to me is, “What does Google do with those data?”
The Data Safety initiative does not focus on the Google. Data Safety shifts the attention to app developers, presumably some of whom have crafty ideas. My interest is Google’s own data surfing; for example, ad diffusion and my fave, Snorkelization and synthetic “close enough for horseshoes” data. Real data may be too “real” for some purposes.
After a couple of decades, Google is taking steps toward a data destination. I just don’t know where that journey is taking people.
Stephen E Arnold, April 28, 2022
Why Be Like ClearView AI? Google Fabs Data the Way TSMC Makes Chips
April 8, 2022
Machine learning requires data. Lots of data. Datasets can set AI trainers back millions of dollars, and even that does not guarantee a collection free of problems like bias and privacy issues. Researchers at MIT have developed another way, at least when it comes to image identification. The World Economic Forum reports, “These AI Tools Are Teaching Themselves to Improve How they Classify Images.” Of course, one must start somewhere, so a generative model is first trained on some actual data. From there, it generates synthetic data that, we’re told, is almost indistinguishable from the real thing. Writer Adam Zewe cites the paper’s lead author Ali Jahanian as he emphasizes:
“But generative models are even more useful because they learn how to transform the underlying data on which they are trained, he says. If the model is trained on images of cars, it can ‘imagine’ how a car would look in different situations — situations it did not see during training — and then output images that show the car in unique poses, colors, or sizes. Having multiple views of the same image is important for a technique called contrastive learning, where a machine-learning model is shown many unlabeled images to learn which pairs are similar or different. The researchers connected a pretrained generative model to a contrastive learning model in a way that allowed the two models to work together automatically. The contrastive learner could tell the generative model to produce different views of an object, and then learn to identify that object from multiple angles, Jahanian explains. ‘This was like connecting two building blocks. Because the generative model can give us different views of the same thing, it can help the contrastive method to learn better representations,’ he says.”
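The pattern is easier to see in miniature. Below is a minimal sketch, assuming PyTorch, of the setup Jahanian describes: an untrained toy generator stands in for the pretrained model, two perturbed copies of one latent code play the role of two “views” of the same object, and a SimCLR style contrastive loss teaches an encoder to map those views together. The dimensions, the perturbation scale, and the loss settings are illustrative assumptions, not details from the paper.

```python
# Minimal sketch: a frozen generator supplies multiple "views" of the
# same underlying object; a contrastive encoder learns to embed those
# views near one another. Assumes PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, img_dim, emb_dim, batch = 16, 64, 32, 8

# Toy generator: latent code -> fake "image" vector (stands in for a
# pretrained generative model, so its weights are frozen).
generator = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, img_dim))
for p in generator.parameters():
    p.requires_grad_(False)

# Contrastive encoder: "image" -> embedding.
encoder = nn.Sequential(nn.Linear(img_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

def nt_xent(z1, z2, temperature=0.5):
    """SimCLR style loss: matching views are positives, all others negatives."""
    B = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = z @ z.t() / temperature
    sim = sim.masked_fill(torch.eye(2 * B, dtype=torch.bool), float("-inf"))
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)

for step in range(200):
    latents = torch.randn(batch, latent_dim)
    view1 = generator(latents + 0.1 * torch.randn_like(latents))  # same object,
    view2 = generator(latents + 0.1 * torch.randn_like(latents))  # two "poses"
    loss = nt_xent(encoder(view1), encoder(view2))
    opt.zero_grad()
    loss.backward()
    opt.step()
```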
Ah, algorithmic teamwork. Another advantage of this method is the nearly infinite number of samples the model can generate, since more samples (usually) make for a better trained AI. Jahanian also notes that once a generative model has created a repository of synthetic data, that resource can be posted online for others to use. The team also hopes to use its technique to generate corner cases, which often cannot be learned from real data sets and are especially troublesome in potentially dangerous applications like self-driving cars. If this hope is realized, it could be a huge boon.
This all sounds great, but what if—just a minor if—the model is off base? And once this tech moves out of the laboratory, how would we know? The researchers acknowledge a couple of other limitations. For one, their generative models occasionally reveal source data, which negates the privacy advantage. Furthermore, any biases in the limited datasets used for the initial training will be amplified unless the model is “properly audited.” It seems that transparency, which somehow remains elusive in commercial AI applications, would be crucial. Perhaps the researchers have an idea how to solve that riddle.
Funding for the project was supplied, in part, by the MIT-IBM Watson AI Lab, the United States Air Force Research Laboratory, and the United States Air Force Artificial Intelligence Accelerator.
Cynthia Murrell, April 8, 2022
Datasets: An Analysis Which Tap Dances around Some Consequences
December 22, 2021
I read “3 Big Problems with Datasets in AI and Machine Learning.” The arguments presented support the SAIL, Snorkel, and Google type approach to building datasets. I have already shared some of my thoughts about configuring once and letting fancy math do the heavy lifting from then on. Promoting that approach is probably not the stated purpose of the Venture Beat write up, but my hunch is that pointing out other people’s problems frames the SAIL, Snorkel, and Google type approaches in a flattering light. No one asks, “What happens if the SAIL, Snorkel, and Google type approaches don’t work or have some interesting downstream consequences?” Why bother?
Here are the problems as presented by the cited article:
- The Training Dilemma. The write up says: “History is filled with examples of the consequences of deploying models trained using flawed datasets.” That’s correct. The challenge in creating and validating a training set for a discipline, topic, or “space” is that new content arrives using new lingo and even metaphors instead of words like “rock.” People informed by the early days of Autonomy’s neuro-linguistic method know that no one wants to spend money, time, and computing resources on endless Sisyphean work. That rock keeps rolling back down the hill. This is a deal breaker, so considerable effort has been expended figuring out how to cut corners, use good enough data, set loose shoes thresholds, and rely on normalization to smooth out the acne scars. Thus, we are in an era of using what’s available. Make it work or become a content creator on TikTok.
- Issues with Labeling. I don’t like it when the word “indexing” is replaced with words like labels, metatags, hashtags, and semantic sign posts. Give me a break. Automatic indexing is more consistent than human indexers, who get tired and fall back on a quiver of familiar terms; who wants to work too hard at a job many find boring? But the automatic systems are in the same “good enough” basket as smart training data set creation. The problem is words and humans. Software is clueless when it comes to snide remarks, cynicism, certain types of fake news, bogus research reports in peer reviewed journals, etc. Indexing with esoteric words means the Average Joe and Janet can’t find the content. Indexing with everyday words means that search results work great for pizza near me but not so well for beatles diet when I want the food insects eat, not what kept George thin. The write up says: “Still other methods aim to replace real-world data with partially or entirely synthetic data — although the jury’s out on whether models trained on synthetic data can match the accuracy of their real-world-data counterparts.” Yep, let’s make up stuff.
- A Benchmarking Problem. The write up asserts: “SOTA benchmarking [also] does not encourage scientists to develop a nuanced understanding of the concrete challenges presented by their task in the real world, and instead can encourage tunnel vision on increasing scores. The requirement to achieve SOTA constrains the creation of novel algorithms or algorithms which can solve real-world problems.” Got that. My view is that validating data is a bridge too far for anyone except a graduate student working for a professor with grant money. But why benchmark when one can go snorkeling? The reality is that datasets are in most cases flawed but no one knows how flawed. Just use them and let the results light the path forward. Cheap and sounds good when couched in jargon.
What’s the fix? The fix is what I call the SAIL, Snorkel, and Google type solution. (Yep, Facebook digs in this sandbox too.)
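For those who have not gone snorkeling: the core move in the Snorkel type approach is weak supervision. Cheap, noisy heuristics called labeling functions vote on each item, and an aggregation step stands in for hand labeling. The sketch below uses a bare majority vote for clarity; Snorkel proper fits a statistical label model over the votes, and these spam-detection heuristics are invented purely for illustration.

```python
# Toy illustration of weak supervision (not Snorkel's actual API):
# several cheap heuristics vote on each item; a majority vote replaces
# hand labeling. Real systems model the heuristics' accuracies instead.
ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_has_link(text):   # heuristic 1: links often mean spam
    return SPAM if "http" in text else ABSTAIN

def lf_short(text):      # heuristic 2: very short messages are often ham
    return HAM if len(text.split()) < 4 else ABSTAIN

def lf_money(text):      # heuristic 3: money talk often means spam
    return SPAM if "$" in text or "free" in text.lower() else ABSTAIN

def weak_label(text):
    votes = [lf(text) for lf in (lf_has_link, lf_short, lf_money)]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)  # simple majority vote

print(weak_label("Click http://example.com for free $$$"))  # -> 1 (spam)
```

Cheap? Yes. Good enough? That is exactly the question the write up tap dances around.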
My take is easily expressed, just not popular. Too bad.
- Do the work to create and validate a training set. Rely on subject matter experts to check outputs, and when the outputs drift, hit the brakes, recalibrate, and retrain. (A minimal drift check appears in the sketch after this list.)
- Admit that outputs are likely to be incomplete, misleading, or just plain wrong. Knock off the good enough approach to information.
- Return to methods which require thresholds to be validated by user feedback and output validity. Letting cheap and fast methods decide which secondary school teacher gets fired strikes me as not too helpful.
- Make sure analyses of solutions don’t function as advertisements for the world’s largest online ad outfit.
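What might “hit the brakes” look like in practice? A minimal sketch for the first point above, assuming nothing fancier than a validated baseline to compare against: watch the label mix of the system’s output and stop when it shifts too far. The labels, data, and threshold are invented for illustration.

```python
# Toy drift check: compare the label mix of current output against a
# validated baseline; flag drift when any label's share moves too far.
from collections import Counter

def label_distribution(labels):
    total = len(labels)
    return {k: v / total for k, v in Counter(labels).items()}

def drifted(baseline, current, threshold=0.15):
    """True when any label's share shifts by more than `threshold`."""
    keys = set(baseline) | set(current)
    return any(abs(baseline.get(k, 0.0) - current.get(k, 0.0)) > threshold
               for k in keys)

validated = label_distribution(["ok", "ok", "risk", "ok", "risk"])
this_week = label_distribution(["risk", "risk", "risk", "ok", "risk"])
if drifted(validated, this_week):
    print("Outputs have drifted: pause, recalibrate, retrain.")
```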
Stephen E Arnold, December 22, 2021
Startup Gretel Building Anonymized Data Platform
March 19, 2020
There is a lot of valuable but sensitive data out there that developers and engineers would love to get their innovative hands on, but it is difficult to impossible for them to access. Until now.
Enter Gretel, a startup working to anonymize confidential data. We learn about the upcoming platform from Inventiva’s article, “A Group of Ex-NSA And Amazon Engineers Are Building a ‘GitHub for Data’.” Co-founders Alex Watson, John Myers, Ali Golshan, and Laszlo Bock were inspired by the source code sharing platform GitHub. Reporter surbhi writes:
“Often, developers don’t need full access to a bank of user data — they just need a portion or a sample to work with. In many cases, developers could suffice with data that looks like real user data. … ‘We’re building right now software that enables developers to automatically check out an anonymized version of the data set,’ said Watson. This so-called ‘synthetic data’ is essentially artificial data that looks and works just like regular sensitive user data. Gretel uses machine learning to categorize the data — like names, addresses and other customer identifiers — and classify as many labels to the data as possible. Once that data is labeled, it can be applied access policies. Then, the platform applies differential privacy — a technique used to anonymize vast amounts of data — so that it’s no longer tied to customer information. ‘It’s an entirely fake data set that was generated by machine learning,’ said Watson.”
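Gretel’s pipeline is proprietary, so the following is only a toy rendering, not Gretel’s code, of the differential privacy step the quote mentions. The standard Laplace mechanism illustrates the idea: a count query changes by at most one when any single record is added or removed, so Laplace noise scaled to 1/epsilon masks any one customer’s presence.

```python
# Toy Laplace mechanism (illustrative, not Gretel's implementation):
# noise an aggregate query so that adding or removing any single
# customer record barely changes the output distribution.
import numpy as np

def dp_count(records, epsilon: float = 1.0) -> float:
    # A count changes by at most 1 when one record is added or removed
    # (sensitivity = 1), so Laplace noise with scale 1/epsilon suffices.
    return len(records) + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

customers = ["alice", "bob", "carol", "dave"]
print(dp_count(customers, epsilon=0.5))  # smaller epsilon = more noise, more privacy
```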
The founders are not the only ones who see the merit in this idea; so far, the startup has raised $3.5 million in seed funding. Gretel plans to charge users based on consumption, and the team hopes to make the platform available within the next six months.
Cynthia Murrell, March 19, 2020
Is Real News Synthetic?
June 13, 2018
New artificial intelligence algorithms are being designed to power new security measures. AI algorithms “learn” when they are fed large datasets in which they discover patterns, inconsistencies, and other factors. It is harder than one thinks to assemble large datasets, so Facebook has turned to fake…er…synthetic data over real data. Valuewalk wrote more about synthetic data in “Why Facebook Now Uses Synthetic (‘Fake’) Data.”
Facebook recently announced plans to open two new AI labs to develop user security tools, and the algorithms would be built on synthetic data. Sergey Nikolenko, a data scientist, praised the adoption of synthetic data, especially because it enables progress without compromising user privacy.
“ ‘While fake news has caused problems for Facebook, fake data will help fix those problems,’ said Nikolenko. ‘In a computing powerhouse like Facebook, where reams of data are generated every day, you want a solution in place that will help you quickly train different AI algorithms to perform different tasks, even if all the training data is [fake]. That’s where synthetic data gets the job done!’ “
One of the biggest difficulties AI developers face is a lack of usable data; in other words, data that is high-quality, task-specific, and does not compromise user privacy. Companies like Neuromation nabbed this niche and started creating data that qualifies.
Facebook will use the AI tools to fight online harassment, political propaganda from foreign governments, and fake news, and to improve various networking tools and apps. This might be the start of better safety protocols that protect users and deter online bullies.
Perhaps “real news” is synthetic?
Whitney Grace, June 13, 2018
An Upside to Fake Data
February 2, 2018
We never know if “data” are made up or actual factual. Nevertheless, we read “How Fake Data Can Help the Pentagon Track Rogue Weapons.” The main idea, from our point of view, is predictive analytics which can adapt to that which has not yet happened. We circled this statement from the company under contract with the US government to make “fake” data useful:
IvySys Founder and Chief Executive Officer James DeBardelaben compared the process to repeatedly finding a needle in a haystack, but making both the needle and haystack look different every time. Using real-world data, agencies can only train algorithms to spot threats that already exist, he said, but constantly evolving synthetic datasets can train tools to spot patterns that have yet to occur.
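IvySys does not publish its method, so here is only a toy rendering of the “new needle, new haystack” idea: each synthetic dataset gets freshly randomized background activity and a freshly randomized anomalous pattern, so a detector trained on the stream cannot simply memorize yesterday’s threat. All names and numbers are invented for illustration.

```python
# Toy sketch (an assumption, not IvySys code): every call produces a new
# "haystack" of routine activity and a new "needle" of anomalous activity,
# so the threat pattern never repeats across training sets.
import numpy as np

rng = np.random.default_rng()

def synthetic_dataset(n: int = 1000):
    haystack = rng.normal(loc=0.0, scale=1.0, size=(n, 4))  # routine activity
    needle = rng.normal(loc=rng.uniform(3, 6), scale=0.5, size=(1, 4))  # novel threat
    X = np.vstack([haystack, needle])
    y = np.array([0] * n + [1])  # 1 marks the planted threat
    return X, y

X, y = synthetic_dataset()  # a fresh needle in a fresh haystack, every call
```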
Worth monitoring IvySys at https://www.ivysys.com/.
Stephen E Arnold, February 2, 2018
Enterprise Search: Will Synthetic Hormones Produce a Revenue Winner?
October 27, 2017
One of my colleagues provided me with a copy of the 24 page report with the hefty title:
In Search for Insight 2017. Enterprise Search and Findability Survey. Insights from 2012-2017
I stumbled on the phrase “In Search for Insight 2017.”
The report combines survey data with observations about what’s going to make enterprise search great again. I use the word “again” because:
- The buy up and sell out craziness which culminated with Microsoft’s buying Fast Search & Transfer in 2008 and Hewlett Packard’s purchase of Autonomy in 2011 marked the end of the old-school enterprise search vendors. As you may recall, Fast Search was the subject of a criminal investigation and the HP Autonomy deal continues to make its way through the legal system. You may perceive these two deals as barn burners. I see them as capstones for the era during which search was marketed as the solution to information problems in organizations.
- The word “search” has become confusing and devalued. For most people, “search” means the Danny Sullivan search engine optimization systems and methods. For those with some experience in information science, “search” means locating relevant information. SEO erodes relevance; the less popular connotation of the word suggests answering a user’s question. Not surprisingly, jargon has been used for many years in an effort to explain that “enterprise search” is infused with taxonomies, ontologies, semantic technologies, clustering, discovery, natural language processing, and other verbal chrome trim to make search into a Next Big Thing again. From my point of view, search is a utility and a code word for spoofing Google so that an irrelevant page appears instead of the answer the user seeks.
- The enterprise search landscape (the title of one of my monographs) has been bulldozed and reworked. The money in the old school precision and recall type of search comes from consulting. Search Technologies was acquired by Accenture to add services revenue to the management consulting firm’s repertoire of MBA fixes. What is left are companies offering “solutions” which require substantial engineering, consulting, and training services. The “engine,” in many cases, is an open source system which one can download without burdensome license fees. From my point of view, search boils down to picking an open source solution. If that doesn’t work, one can license a proprietary system wrapped around open source. If one wants a fully proprietary system, there are some available, but these are not likely to reach the lofty heights of the Fast Search or Autonomy IDOL systems in the salad days of enterprise search and its promises of a universal search system. The universal search outfit Google pulled out of enterprise search for a reason.
I want to highlight five of the points in the 24 page write up. Please, register to get your own copy of this document.
Here are my five highlights. My comments are in italics after each quote from the document:
Big Data and Its Less-Than-Gentle Lessons
August 1, 2013
I read “9 Big Data Lessons Learned.” The write up is interesting because it explores the buzzword that every azure chip consultant has used in marketing pitches over the last year. Some true believers have the words Big Data tattooed on their arms like those mixed martial arts fighters sporting the names of casinos. Very attractive, I say.
Because “big data” has sucked up search, content processing, and analytics, the term is usually not defined. The “problems” of Big Data are ignored. Since not much works when it comes to search and content processing, use of another undefined term is not particularly surprising. What caught my attention is that Datamation reports about some “lessons” its real journalists have tracked down and verified.
Please, read the entire original write up to get the full nine lessons. I want to highlight three of them:
First, Datamation points out that getting data from Point A to Point B can be tricky. I think that once the data has arrived at Point B, the next task is to get the data into a “Big Data” system. Datamation does not provide any cost information in its statement “Don’t underestimate the data integration challenges.” I would point out that the migration task can be expensive. Real expensive.
Second, Datamation states, “Big Data success requires scale and speed.” I agree that scale and speed are important. Once again, Datamation does not bring these generalizations down to an accounting person’s desktop. Scale and speed cost money. Often a lot of money. In the analysis I did of “real time” a year or two ago, chopping latency down to a millisecond or two exponentiated the cost of scale and speed. Bandwidth and low latency storage do not sport WalMart price tags.
Third, Datamation warns (maybe threatens) those with children in school and mortgages: “If you’re not in the Big Data pool now, the lifespan of your career is shrinking by the day.” A couple of years ago this sentence would have said, “If you’re not in the social media pool now, the lifespan of your career is shrinking by the day.” How long will these all-too-frequent “next big things” sweep through information technology? I just learned that “CIO” means chief innovation officer. I also learned that the future of computing rests with synthetic biology.
The Big Data revolution is here. The problem is that the tools, the expertise, and the computational environment are inadequate for most Big Data problems. Companies with the resources like Google and Microsoft are trimming the data in order to get a handle on what today’s algorithms assert is important. Is it reasonable to think that most organizations can tackle Big Data when large organizations struggle to locate attachments in intra-organization email?
Reality has not hampered efforts to surf on the next big thing. Some waves are more challenging than others, however. I do like the fear angle. Nice touch at a time when senior managers are struggling to keep revenues and profits from drifting down. The hope is that Big Data will shore up products and services which are difficult to sell.
Catch the wave I suppose.
Stephen E Arnold, August 1, 2013
Sponsored by Xenky
Kapow Reinforces It Is a Big Data Platform
July 21, 2013
Short honk: Data integration, like search, is expanding. We noted a news release called “Kapow Software Quarterly Revenue Rises as Newly Acquired Customer Bookings and Subscriptions Fuel Growth.” The news release explains that a privately held firm is growing. The important point for me was this phrase: “a leading Big Data solution provider.”
The news release explains:
The Kapow Enterprise Big Data Integration Platform enables companies to integrate any cloud or on-premise data source using Kapow Software’s patented, intelligent integration workflows and Synthetic APIs™. Once the critical data is found and surgically extracted, Kapow Enterprise 9.2 delivers timely information to the workforce in an easily consumable form called Kapow Kapplets™ through an enterprise app library offering called the Kapow KappZone™. KappZones can be easily branded and distributed for employees to discover and use on any computing device they choose.
The Kapow Web site points out that the company’s business includes:
- Content integration
- Content migration
- Legacy application integration
- Enterprise search.
The company also offers three branded products: Katalyst and the aforementioned Kapplets and KappZone. I find this semantic embrace fascinating and indicative of a trend in which vendors pretty much do anything related to information, which is, it seems, Big Data.
Stephen E Arnold, July 21, 2013
Sponsored by Xenky