Google and Its Use of the Word “Public”: A Clever and Revenue-Generating Policy Edit
July 6, 2023
Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.
If one has the cash, one can purchase user-generated data from more than 500 data publishers in the US. Some of these outfits are unknown. When a liberal Wall Street Journal reporter learns about Venntel or one of these outfits, outrage ensues. I am not going to explain how data from a user finds its way into the hands of a commercial data aggregator or database publisher. Why not Google it? Let me know how helpful that research will be.
Why are these outfits important? The reasons include:
- Direct from app information obtained when a clueless mobile user accepts the Terms of Use. Do you hear the slurping sounds?
- Organizations with financial data and savvy data wranglers who cross correlate data from multiple sources.
- Outfits which assemble real-time or near-real-time user location data. How useful are those data in identifying military locations with a population of individuals who exercise wearing helpful heart and step monitoring devices?
Navigate to “Google’s Updated Privacy Policy States It Can Use Public Data to Train its AI Models.” The write up does not make clear what “public data” are. My hunch is that the Google is not exceptionally helpful with its definitions of important “obvious” concepts. The disconnect is the point of the policy change. Public data or third-party data can be purchased, licensed, used on a cloud service like Oracle’s BlueKai or a clone of it, or obtained as part of a commercial deal with everyone’s favorite online service LexisNexis or one of its units.
A big advertiser demonstrates joy after reading about Google’s detailed prospect targeting reports. Dossiers of big buck buyers are available to those relying on Google for online text and video sales and marketing. The image of this happy media buyer is from the elves at MidJourney.
The write up states with typical Silicon Valley “real” news flair:
By updating its policy, it’s letting people know and making it clear that anything they publicly post online could be used to train Bard, its future versions and any other generative AI product Google develops.
Okay. The “weekend” mentioned in the write up is the 4th of July weekend. Is this a hot news or a slow news time? If you picked “hot,” you are respectfully wrong.
Now back to “public.” Think in terms of Google’s licensing third-party data, cross correlating those data with its log data generated by users, and any proprietary data obtained by Google’s Android or Chrome software, Gmail, its office apps, and any other data which a user clicking one of those “Agree” boxes cheerfully mouses through.
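The mechanics of cross correlation are not mysterious. Here is a toy sketch, with invented records and field names, of how licensed third-party data could be joined against first-party logs on a shared identifier. Real systems use hashed emails, advertising IDs, and far larger tables; this is an illustration, not Google’s method:

```python
# Toy illustration of cross correlating a licensed third-party dataset
# with first-party log data on a shared key. All records and field
# names are invented for the demo.

licensed = [  # hypothetical records purchased from a data broker
    {"ad_id": "a1", "home_zip": "10001", "income_band": "high"},
    {"ad_id": "b2", "home_zip": "40502", "income_band": "mid"},
]

logs = [  # hypothetical first-party log events
    {"ad_id": "a1", "query": "luxury suv lease"},
    {"ad_id": "a1", "query": "private school fees"},
    {"ad_id": "c3", "query": "weather"},
]

def cross_correlate(licensed, logs):
    """Attach broker attributes to log events via the shared identifier."""
    by_id = {rec["ad_id"]: rec for rec in licensed}
    enriched = []
    for event in logs:
        profile = by_id.get(event["ad_id"])
        if profile:  # keep only events attributable to a purchased profile
            enriched.append({**event, **profile})
    return enriched

dossier = cross_correlate(licensed, logs)
```

The point the toy makes: two events that look anonymous in the logs become income-tagged behavioral records the moment the broker file arrives.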
The idea is suggested by the information in Google patent US7774328 B2. What’s interesting is that this granted patent does not include a quite helpful figure from the patent application US2007/0198481. Here’s the 16-year-old figure. The subject is Michael Jackson. The text is difficult to read (write your Congressman or Senator to complain). The output is a machine-generated dossier about the pop star. Note that it includes aliases. Other useful data are in the report. The granted patent presents more vanilla versions of the dossier generator, however.
The use of “public” data may enhance the type of dossier or other meaty report about a person. How about a map showing the travels of a person prior to drawing a geo-fence around an individual’s location on a specific day and time? Useful for some applications? If these “inventions” are real, then the potential use cases are interesting. Advertisers will probably be interested. Can you think of other use cases? I can.
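A geo-fence check is simple arithmetic, which is part of why the use cases multiply. This sketch, using made-up coordinates, tests whether location pings fall inside a circular fence via the haversine formula; it is an illustration, not anyone’s patented method:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def in_geofence(point, center, radius_km):
    """True if a (lat, lon) point falls inside a circular geo-fence."""
    return haversine_km(*point, *center) <= radius_km

# Hypothetical device pings: which were within 1 km of a spot of interest?
center = (38.8895, -77.0353)  # example coordinates only
pings = [(38.8900, -77.0350), (38.9800, -77.2000)]
hits = [p for p in pings if in_geofence(p, center, 1.0)]
```

Run the travel history through a filter like this and the “specific day and time” question answers itself.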
The cited article focuses on AI. I think that more substantive use cases fit nicely with the shift in “policy” for public data. Have you asked yourself, “What will Mandiant professionals find interesting in cross correlated data?”
Stephen E Arnold, July 6, 2023
Datasette: Useful Tool for Crime Analysts
February 15, 2023
If you want to explore data sets, you may want to take a look at Datasette, a Swiss Army knife billed as an “open source multi-tool for exploring and publishing data.”
The company says,
It helps people take data of any shape, analyze and explore it, and publish it as an interactive website and accompanying API. Datasette is aimed at data journalists, museum curators, archivists, local governments, scientists, researchers and anyone else who has data that they wish to share with the world. It is part of a wider ecosystem of 42 tools and 110 plugins dedicated to making working with structured data as productive as possible.
A handful of demos are available. Worth a look.
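Datasette publishes SQLite files, so getting started is mostly a matter of having one. A minimal sketch using Python’s standard library (table name, columns, and rows invented for the demo); point `datasette incidents.db` at the output file to get the web UI and JSON API:

```python
import sqlite3

# Build a small SQLite database Datasette could publish.
# Swap ":memory:" for "incidents.db" to write an actual file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incidents (id INTEGER PRIMARY KEY, city TEXT, offense TEXT)"
)
conn.executemany(
    "INSERT INTO incidents (city, offense) VALUES (?, ?)",
    [("Louisville", "fraud"), ("Lexington", "burglary"), ("Louisville", "theft")],
)
conn.commit()

# The kind of exploration Datasette exposes through its UI and JSON API:
rows = conn.execute(
    "SELECT city, COUNT(*) FROM incidents GROUP BY city ORDER BY city"
).fetchall()
```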
Stephen E Arnold, February 15, 2023
The Internet: Cue the Music. Hit It, Regrets, I Have Had a Few
December 21, 2022
I have been around online for a few years. I know some folks who were involved in creating what is called “the Internet.” I watched one of these luminaries unbutton his shirt and display a tee with the message, “TCP on everything.” Cute, cute, indeed. (I had the task of introducing this individual only to watch the disrobing and the P on everything joke. Tip: It was not a joke.)
Imagine my reaction when I read “Inventor of the World Wide Web Wants Us to Reclaim Our Data from Tech Giants.” The write up states:
…in an era of growing concern over privacy, he believes it’s time for us to reclaim our personal data.
Who wants this? Tim Berners-Lee and a startup. Content marketing, or a sincere effort to derail the core functionality of ad trackers, beacons, cookies which expire in 99 years, etc., etc.?
The article reports:
Berners-Lee hopes his platform will give control back to internet users. “I think the public has been concerned about privacy — the fact that these platforms have a huge amount of data, and they abuse it,” he says. “But I think what they’re missing sometimes is the lack of empowerment. You need to get back to a situation where you have autonomy, you have control of all your data.”
The idea is that Web 3 will deliver a different reality.
Do you remember this lyric:
Yes, there were times I’m sure you knew
When I bit off more than I could chew
But through it all, when there was doubt
I ate it up and spit it out
I faced it all and I stood tall and did it my way.
The “my” becomes big tech, and it is the information highway. There’s no exit, no turnaround, and no real chance of change before I log off for the final time.
Yeah, digital regrets. How’s that working out at Amazon, Facebook, Google, Twitter, and Microsoft among others? Unintended consequences and now the visionaries are standing tall on piles of money and data.
Change? Sure, right away.
Stephen E Arnold, December 21, 2022
TikTok: Algorithmic Data Slurping
November 14, 2022
There are several reasons TikTok rocketed to social-media dominance in just a few years. For example, its user-friendly creation tools plus a library of licensed tunes make it easy to create engaging content. Then there was the billion-dollar marketing campaign that enticed users away from Facebook and Instagram. But, according to the Guardian, it was the recommendation engine behind its For You Page (FYP) that really did the trick. Writer Alex Hern describes “How TikTok’s Algorithm Made It a Success: ‘It Pushes the Boundaries.’” He tells us:
“The FYP is the default screen new users see when opening the app. Even if you don’t follow a single other account, you’ll find it immediately populated with a never-ending stream of short clips culled from what’s popular across the service. That decision already gave the company a leg up compared to the competition: a Facebook or Twitter account with no friends or followers is a lonely, barren place, but TikTok is engaging from day one. It’s what happens next that is the company’s secret sauce, though. As you scroll through the FYP, the makeup of videos you’re presented with slowly begins to change, until, the app’s regular users say, it becomes almost uncannily good at predicting what videos from around the site are going to pique your interest.”
And so a user is hooked. Beyond the basics, exactly how the algorithm works is a mystery even, we’re told, to those who program it. We do know the AI takes the initiative. Instead of only waiting for users to select a video or tap a reaction, it serves up test content and tweaks suggestions based on how those suggestions are received. This approach has another benefit: it ensures each video posted on the platform is seen by at least one user, and every positive interaction multiplies its reach. That is how popular content creators quickly amass followers.
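The real algorithm is opaque even to its builders, per the article, but the serve-test-content-and-amplify loop can be caricatured with a toy epsilon-greedy recommender. Topics, multipliers, and the single-interest user below are all invented:

```python
import random

def recommend(scores, epsilon=0.2, rng=random):
    """Mostly exploit the top-scoring topic, but sometimes test something new."""
    if rng.random() < epsilon:
        return rng.choice(list(scores))   # exploration: serve test content
    return max(scores, key=scores.get)    # exploitation: current best guess

def update(scores, topic, liked):
    """Positive interactions multiply a topic's reach; negatives damp it."""
    scores[topic] *= 1.5 if liked else 0.8

scores = {"cooking": 1.0, "dance": 1.0, "politics": 1.0}
rng = random.Random(0)  # fixed seed so the run is reproducible
for _ in range(50):
    topic = recommend(scores, rng=rng)
    update(scores, topic, liked=(topic == "dance"))  # this user only likes dance

best = max(scores, key=scores.get)
```

Within a few dozen swipes the loop locks onto the one topic the simulated user rewards, which is the “uncannily good” effect regular users describe, minus about a billion dollars of engineering.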
Success can be measured in different ways, of course. Though TikTok has captured a record number of users, it is not doing so well in the critical monetization category. Estimates put its 2021 revenue at less than 5% of Facebook’s, and efforts to export its e-commerce component have not gone as hoped. Still, it looks like the company is ready to try, try again. Will its persistence pay off?
Cynthia Murrell, November 14, 2022
TikTok: Allegations of Data Sharing with China! Why?
June 21, 2022
If one takes a long view about an operation, some planners find information about the behavior of children or older, yet immature, creatures potentially useful. What if a teenager puts up a TikTok video presenting allegedly “real” illegal actions? Might that teen in three or four years be a target for soft persuasion? Leaking the video to an employer? No, of course not. Who would take such an action?
I read “Leaked Audio from 80 Internal TikTok Meetings Shows That US User Data Has Been Repeatedly Accessed from China.” Let’s assume that this allegation has a tiny shred of credibility. The financially challenged Buzzfeed might be angling for clicks. Nevertheless, I noted this passage:
…according to leaked audio from more than 80 internal TikTok meetings, China-based employees of ByteDance have repeatedly accessed nonpublic data about US TikTok users…
Is the audio deeply faked? Could the audio be edited by a budding sound engineer?
Sure.
And what’s with the TikTok “connection” to Oracle? Probably just a coincidence like one of Oracle’s investment units participating in Board meetings for Voyager Labs. A China-linked firm was on the Board for a while. No big deal. Voyager Labs? What does that outfit do? Perhaps it is the Manchester Square office and the delightful restaurants close at hand?
The write up refers to data brokers too. That’s interesting. If a nation state wants app-generated data, why not license it? No one pays much attention to “marketing services” which acquire and normalize user data, right?
Buzzfeed tried to reach a wizard at Booz, Allen. That did not work out. Why not drive to Tyson’s Corner and hang out in the Ritz Carlton at lunch time? Get a Booz, Allen expert in the wild.
Yep, China. No problem. Take a longer-term view for creating something interesting, like an insider who provides a user name and password. Happens every day and will into the future. Plan ahead, I assume.
Real news? Good question.
Stephen E Arnold, June 21, 2022
Doing Good for Data Harvesting
March 10, 2022
What a class act. We learn from TechDirt that a “Suicide Hotline Collected, Monetized the Data of Desperate People, Because Of Course it Did.” The culprit is Crisis Text Line, one of the largest nonprofit support services for suicidal individuals in the US. Naturally, the organization is hiding behind the assertion of anonymized data. Writer Karl Bode tells us:
“A Politico report last week highlighted how the company has been caught collecting and monetizing the data of callers… to create and market customer service software. More specifically, Crisis Text Line says it ‘anonymizes’ some user and interaction data (ranging from the frequency certain words are used, to the type of distress users are experiencing) and sells it to a for-profit partner named Loris.ai. Crisis Text Line has a minority stake in Loris.ai, and gets a cut of their revenues in exchange. As we’ve seen in countless privacy scandals before this one, the idea that this data is ‘anonymized’ is once again held up as some kind of get out of jail free card. … But as we’ve noted more times than I can count, ‘anonymized’ is effectively a meaningless term in the privacy realm. Study after study after study has shown that it’s relatively trivial to identify a user’s ‘anonymized’ footprint when that data is combined with a variety of other datasets. For a long time the press couldn’t be bothered to point this out, something that’s thankfully starting to change.”
Well that is something, we suppose. The hotline also swears the data is only being wielded for good, to “put more empathy into the world.” Sure.
Bode examines several factors that have gotten us here as a society. He points to the many roadblocks corporate lobbyists have managed to wedge in the way of even the most basic privacy laws. Then there is the serious dearth of funding for quality mental health care, leaving the vulnerable little choice but to reach out to irresponsible outfits like Crisis Text Line. And let us not forget the hamstrung privacy regulators at the FTC. That agency is understaffed and underfunded, is often prohibited from moving against nonprofits, and can only impose inconsequential penalties when it can act. Finally, the whole ecosystem involving big tech and telecom is convoluted by design, making oversight terribly difficult. Like similar misdeeds, Bode laments, this scandal is likely to move out of the news cycle with no more repercussion than a collective tut-tut. Stay tuned for the next one.
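The “trivial to identify” point is easy to demonstrate. A toy example with invented records: strip the names, keep a few quasi-identifiers, and count how many records become unique once attributes are combined. Real studies run this against millions of rows; the effect is the same:

```python
from collections import Counter

# "Anonymized" records: names stripped, quasi-identifiers kept.
# All rows are invented for illustration.
records = [
    {"zip": "40202", "birth_year": 1975, "sex": "F"},
    {"zip": "40202", "birth_year": 1975, "sex": "M"},
    {"zip": "40202", "birth_year": 1988, "sex": "F"},
    {"zip": "40517", "birth_year": 1975, "sex": "F"},
    {"zip": "40517", "birth_year": 1990, "sex": "M"},
]

def uniqueness(records, keys):
    """Fraction of records pinned down uniquely by an attribute combination."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for count in combos.values() if count == 1)
    return unique / len(records)

# Zip code alone identifies nobody; zip + birth year + sex identifies everyone.
by_zip = uniqueness(records, ("zip",))
by_all = uniqueness(records, ("zip", "birth_year", "sex"))
```

Join those unique combinations against any other dataset carrying the same attributes plus a name, and the “anonymization” evaporates.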
Cynthia Murrell, March 10, 2022
Data Federation? Loser. Go with a Data Lake House
February 8, 2022
I have been hearing the phrase “data lake house” or “datalake house.” I noted some bold claims about a new data lake house approach in “Managed Data Lakehouse Startup Onehouse Launches with $8M in Funding.” The write up states:
One of the flagship features of Onehouse’s lakehouse service is a technology called incremental processing. It allows companies to start analyzing their data soon after it’s generated, which is difficult when using traditional technologies.
The write up adds:
The company’s lakehouse service automatically optimizes customers’ data ingestion workflows to improve performance, the startup says. Because the service is delivered via the cloud on a fully managed basis, customers don’t have to manage the underlying infrastructure.
The idea of course is that traditional methods of handling data are [a] slow, [b] expensive, and [c] difficult to implement.
The premise is that the data lake house delivers more efficient use of data and a way to “future proof the data architected for machine learning / data science down the line.”
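“Incremental processing” is not defined in the write up. As a general technique it usually means processing only the records that arrived since a watermark instead of rescanning the whole table. A minimal sketch of that idea, not Onehouse’s implementation:

```python
def process_increment(table, watermark):
    """Process only rows newer than the watermark, instead of rescanning
    the full table the way a traditional batch job would."""
    fresh = [row for row in table if row["ts"] > watermark]
    for row in fresh:
        row["processed"] = True  # stand-in for real analytics work
    new_watermark = max((row["ts"] for row in fresh), default=watermark)
    return fresh, new_watermark

table = [{"ts": 1}, {"ts": 2}, {"ts": 3}]
batch1, wm = process_increment(table, watermark=0)  # first pass sees all rows
table.append({"ts": 4})                             # new data lands
batch2, wm = process_increment(table, wm)           # second pass sees only ts=4
```

The payoff is the one the marketing copy promises: data become analyzable soon after they are generated, because each pass touches only the new arrivals.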
When I read this I thought of Vivisimo’s explanation of its federating method. IBM bought Vivisimo, and I assume that it is one of the ingredients in IBM’s secret big data sauce. MarkLogic also suggested in one presentation I sat through that its system would ingest data and the MarkLogic system (once eyed by the Google as a possible acquisition) would allow near real time access to the data. One person in the audience was affiliated with the US Library of Congress, and that individual seemed quite enthused about MarkLogic. And there are companies which facilitate data manipulation; for example, Kofax and its data connectors.
From my point of view, the challenge is that today large volumes of data are available. These data have to be moved from point A to point B. Ideally data do not require transformation. At some point in the flow, data in motion can be processed. There are firms which offer real time or near real time data analytics; for example, Trendalyze.com.
Conversion, moving, saving, and then doing something “more” with the data remain challenges. Maybe Onehouse has the answer?
Stephen E Arnold, February 8, 2022
Why Some Outputs from Smart Software Are Wonky
July 26, 2021
Some models work like a champ. Utility rate models are reasonably reliable. When it is hot, use of electricity goes up. Rates are then “adjusted.” Perfect. Other models are less solid; for example, Bayesian systems which are not checked every hour or large neural nets which are “assumed” to be honking along like a well-ordered flight of geese. Why do I offer such Negative Ned observations? Experience for one thing and the nifty little concepts tossed out by Ben Kuhn, a Twitter persona. You can locate this string of observations at this link. Well, you could as of July 26, 2021, at 6:30 am US Eastern time. Here’s a selection of what are apparently the highlights of Mr. Kuhn’s conversation with “a former roommate.” That’s provenance enough for me.
Item One:
Most big number theory results are apparently 50-100 page papers where deeply understanding them is ~as hard as a semester-long course. Because of this, ~nobody has time to understand all the results they use—instead they “black-box” many of them without deeply understanding.
Could this be true? How could newly minted, “become an expert with our $40 online course” professionals who use models packaged in downloadable and easy-to-plug-in modules be unfamiliar with the inner workings of said bundles of brilliance? Impossible? Really?
Item Two:
A lot of number theory is figuring out how to stitch together many different such black boxes to get some new big result. Roommate described this as “flailing around” but also highly effective and endorsed my analogy to copy-pasting code from many different Stack Overflow answers.
Oh, come on. Flailing around. Do developers flail or do they “trust” the outfits who pretend to know how some multi-layered systems work. Fiddling with assumptions, thresholds, and (close your ears) the data themselves are never, ever a way to work around a glitch.
Item Three:
Roommate told a story of using a technique to calculate a number and having a high-powered prof go “wow, I didn’t know you could actually do that”
No kidding? That’s impossible in general, and that expression would never be uttered at Amazon-, Facebook-, and Google-type operations, would it?
Will Mr. Kuhn be banned for heresy? [Keep in mind how Wikipedia defines this term: heresy “is any belief or theory that is strongly at variance with established beliefs or customs, in particular the accepted beliefs of a church or religious organization.”] Just repeating an idea once would warrant a close encounter with an Iron Maiden or a pile of firewood. Probably not today. Someone might emit a slightly critical tweet, however.
Stephen E Arnold, July 26, 2021
Data Federation: Sure, Works Perfectly
June 1, 2021
How easy is it to snag a dozen sets of data, normalize them, parse them, extract useful index terms, and assign classifications and other useful hooks? “Automated Data Wrangling” provides an answer sharply different from what marketers assert.
A former space explorer, now marooned on a beautiful dying world, explains that the marketing assurances of dozens upon dozens of companies are baloney. Here’s a passage I noted:
Most public data is a mess. The knowledge required to clean it up exists. Cloud based computational infrastructure is pretty easily available and cost effective. But currently there seems to be a gap in the open source tooling. We can keep hacking away at it with custom rule-based processes informed by our modest domain expertise, and we’ll make progress, but as the leading researchers in the field point out, this doesn’t scale very well. If these kinds of powerful automated data wrangling tools are only really available for commercial purposes, I’m afraid that the current gap in data accessibility will not only persist, but grow over time. More commercial data producers and consumers will learn how to make use of them, and dedicate financial resources to doing so, knowing that they’ll be reap financial rewards. While folks working in the public interest trying to create universal public goods with public data and open source software will be left behind struggling with messy data forever.
Marketing is just easier than telling the truth about what’s needed in order to generate information which can be processed by a downstream procedure.
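A toy version of the “custom rule-based processes” the quoted post describes, with invented records and rules, shows both why the approach works and why it does not scale: every new flavor of mess needs a new hand-written rule.

```python
import re

# Messy, inconsistent records (invented) and a few cleanup rules:
raw = [
    {"name": "  SMITH, john ", "date": "3/5/2021", "amount": "$1,200.00"},
    {"name": "Doe, JANE", "date": "2021-03-05", "amount": "950"},
]

def clean_name(value):
    """'SMITH, john' -> 'John Smith'."""
    last, _, first = value.strip().partition(",")
    return f"{first.strip().title()} {last.strip().title()}".strip()

def clean_date(value):
    """Normalize M/D/YYYY to ISO YYYY-MM-DD; pass ISO dates through."""
    m = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", value)
    if m:
        return f"{m.group(3)}-{int(m.group(1)):02d}-{int(m.group(2)):02d}"
    return value  # already ISO, or an unknown format left for a human

def clean_amount(value):
    """Strip currency formatting and return a float."""
    return float(re.sub(r"[^\d.]", "", value))

cleaned = [
    {"name": clean_name(r["name"]),
     "date": clean_date(r["date"]),
     "amount": clean_amount(r["amount"])}
    for r in raw
]
```

Two rows, three rules. A real public dataset multiplies both numbers by thousands, which is exactly the scaling complaint the marooned space explorer is making.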
Stephen E Arnold, June 1, 2021
Does Google Manifest Addiction to Personal Data?
March 31, 2021
I read an amusing “we don’t do that!” write up in “Google Collects 20 Times More Telemetry from Android Devices Than Apple from iOS.” The cyber security firm Recorded Future points to academic research asserting:
The study unearthed some uncomfortable results. For starters, Prof. Leith said that “both iOS and Google Android transmit telemetry, despite the user explicitly opting out of this [option].” Furthermore, “this data is sent even when a user is not logged in (indeed even if they have never logged in),” the researcher said. [Weird bold face in original text removed.]
Okay, this is the stuff of tenure. The horrors of monopolies and clueless users who happily gobble up free services.
What’s amazing is that the write up does not point out the value of these data for predictive analytics. That’s the business of Recorded Future, right? Quite an oversight. That’s what happens when “news” stumbles over the business model paying for marketing via content. Clever? Of course.
The reliability of the probabilities generated by the Recorded Future methods pivot on having historical and real time data. No wonder Google and Apple suggest that “we don’t do that.”
Recorded Future’s marketing is one thing, but Google’s addiction to data is presenting itself in quite fascinating ways. Navigate to “Google’s New App Automagically Organizes Your Scanned Documents.” The write up states:
The app lets you scan documents and then it uses AI to automatically name and sort them into different categories such as bills, IDs, and vehicles.
And what happens?
To make it easy to find documents, you can also search through the full text of the document.
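Google does not document how the app’s naming, sorting, or search actually works. A crude stand-in, with invented keyword rules and document texts, shows the mechanics such a feature needs: categorize by keyword overlap, then build an inverted index for full-text search.

```python
from collections import defaultdict

# Invented category rules standing in for the app's AI.
RULES = {
    "bills": {"invoice", "due", "amount"},
    "ids": {"passport", "license"},
    "vehicles": {"vin", "registration"},
}

def categorize(text):
    """Pick the category whose keywords overlap the document text most."""
    words = set(text.lower().split())
    best = max(RULES, key=lambda cat: len(RULES[cat] & words))
    return best if RULES[best] & words else "other"

def build_index(docs):
    """Inverted index: word -> set of document ids, for full-text search."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {  # invented scanned-document texts
    "scan1": "Invoice 443 amount due May 1",
    "scan2": "Vehicle registration VIN 1HGBH41",
}
labels = {doc_id: categorize(text) for doc_id, text in docs.items()}
index = build_index(docs)
hits = index["due"]  # full-text lookup
```

Note what the sketch makes obvious: both steps require reading every word of every scanned page, which is precisely the access that makes the data so interesting.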
What types of documents does a happy user scan? Maybe the Covid vaccination card? Maybe legal documents like mortgages, past due notices from a lawyer, divorce papers, and similar tough-to-obtain information of a quite private and personal nature?
My point is that mobile devices are data collection devices. The data are used to enable the Apple and Google business models. Ads, information about preferences, clues to future actions, and similar insights are now routinely available to those with access to the data and analytic systems.
The professor on the tenure track or gunning for an endowed chair can be surprised by practices which have been refined over many years. Not exactly groundbreaking research.
Google obtaining access to scanned personal documents? No big deal. Think how easy and how convenient the free app makes taming old fashioned paper. I wonder if Google has an addiction to data and can no longer help itself?
Without meaningful regulation, stunned professors and mobile device users in love with convenience are cementing monopoly control over information flows.
Oh, Recorded Future was once a startup funded by Google and In-Q-Tel. Is that a useful fact?
Stephen E Arnold, March 31, 2021