Data Confidence: The Check Is in the Mail

October 15, 2021

Why are we not surprised? SeattlePI reports, “Americans Have Little Trust in Online Security: AP-NORC Poll.” Writer Matt O’Brien reveals:

“The poll by The Associated Press-NORC Center for Public Affairs Research and MeriTalk shows that 64% of Americans say their social media activity is not very or not at all secure. About as many have the same security doubts about online information revealing their physical location. Half of Americans believe their private text conversations lack security. And they’re not just concerned. They want something done about it. Nearly three-quarters of Americans say they support establishing national standards for how companies can collect, process and share personal data.”

Few have any hope such standards will be enacted by federal officials, however. Even after years filled with private sector hacks and scandals, we’re told 56% of respondents would trust corporations to safeguard their data before they would the government. The write-up continues:

“About 71% of Americans believe that individuals’ data privacy should be treated as a national security issue, with a similar level of support among Democrats and Republicans. But only 23% are very or somewhat satisfied in the federal government’s current efforts to protect Americans’ privacy and secure their personal data online. ‘This is not a partisan issue,’ said Colorado state Rep. Terri Carver, a Republican who co-sponsored a consumer data privacy bill signed into law by Democratic Gov. Jared Polis in July. It takes effect in 2023.”

The bill would give users in Colorado the right to access and delete personal information online, echoing similar legislation in Virginia and California. Predictably, Facebook and other tech companies opposed the bill.

Cynthia Murrell, October 15, 2021

TikTok: Privacy Spotlight

September 15, 2021

There is nothing like rapid EU response to privacy matters. “TikTok Faces Privacy Investigations by EU Watchdog” states:

The watchdog is looking into its processing of children’s personal data, and whether TikTok is in line with EU laws about transferring personal data to other countries, such as China.

The data hoovering capabilities of a TikTok-type app have been known for what — a day or two or a decade? My hunch is that we are leaning toward the multi-year awareness side of the privacy fence. The write up points out:

TikTok said privacy was “our highest priority”.

Plus about a year ago an EU affiliated unit poked into the TikTok privacy matter.

However, the write up fails to reference a brilliant statement by a Swisher-type of thinker. My recollection is that the gist of the analysis of the TikTok privacy issue in the US was, “Hey, no big deal.”

We’ll see. I wait for a report on this topic. Perhaps a TikTok indifferent journalist will make a TikTok summary of the report findings.

Stephen E Arnold, September 15, 2021

Not an Onion Report: Handwaving about Swizzled Data

August 24, 2021

I read at the suggestion of a friend “These Data Are Not Just Excessively Similar. They Are Impossibly Similar.” At first glance, I thought the write up was a column in an Onion-type of publication. Nope, someone copied the same data set and pasted it into itself.

Here’s what the write up says:

The paper’s Excel spreadsheet of the source data indicated mathematical malfeasance.

Malfeasance. Okay.

But what caught my interest was the inclusion of this name: Dan Ariely. If this is the Dan Ariely who wrote these books, that fact alone is suggestive. If it is a different person, then we are dealing with routine data dumbness or data dishonesty.

The write up contains what I call academic ducking and covering. You may enjoy this game, but I find it boring. Non reproducible results, swizzled data, and massaged numerical recipes are the status quo.

Is there a fix? Nope, not as long as most people cannot make change or add up the cost of items in a grocery basket. Smart software depends on data. And if those data are like those referenced in this Metafilter article, well. Excitement.

Stephen E Arnold, August 24, 2021

Health And Human Services Continues Palantir Contract

August 23, 2021

The US Department of Health and Human Services (HHS) renewed its contract with Palantir to continue using Tiberius. Fed Scoop shares the details about the renewal in the article, “HHS Renews, Expands Palantir’s Tiberius Contract To $31M.” Palantir designed Tiberius as a COVID-19 vaccine distribution platform. It has evolved beyond helping HHS employees understand the vaccine supply chain to being the central information source for dosage programs.

HHS partnered with Palantir in mid-2020 under Trump’s administration. The effort was formerly known as Operation Warp Speed and now is called Countermeasure Acceleration Group. The renewed contract expands Palantir’s deal from $17 million to $31 million. Palantir will continue upgrading Tiberius. Agencies will now use the platform to determine policy decisions about additional doses, boosters, and international distribution.

When Tiberius was first implemented, it had not been designed to handle the Federal Retail Pharmacy or Long-Term Care Facility programs. These now provide more data for analyzing vaccination gaps. Tiberius is also widely used:

“Tiberius already has between 2,000 and 3,000 users including those at HHS, CDC, BARDA, the Countermeasure Acceleration Group, the Office of the Assistant Secretary for Preparedness and Response, the Federal Emergency Management Agency, the Pentagon, and other agencies involved in pandemic response. State and territory employees make up two-thirds of the user base, which also includes sub-state entities that receive vaccines like New York City and Chicago and commercial users including all retail pharmacies.”

Trump was supportive of Palantir; Biden’s team seems okay with the platform.

Whitney Grace, August 23, 2021

Big Data, Algorithmic Bias, and Lots of Numbers Will Fix Everything (and Your Check Is in the Mail)

August 20, 2021

We must remember, “The check is in the mail” and “I will always respect you” and “You can trust me.” Ah, great moments in the University of Life’s chapbook of factoids.

I read “Moving Beyond Algorithmic Bias Is a Data Problem”. I was heartened by the essay. First, the document has a digital object identifier and a link to make checking updates easy. Very good. Second, the focus of the write up is the inherent problem of most of the Fancy Dan baloney charged big data marketing to which I have been subjected in the last six or seven years. Very, very good.

I noted this statement in the essay:

Why, despite clear evidence to the contrary, does the myth of the impartial model still hold allure for so many within our research community? Algorithms are not impartial, and some design choices are better than others.

Notice the word “myth”. Notice the word “choices.” Yep, so much for the rock solid nature of big data, models, and predictive silliness based on drag-and-drop math functions.

I also starred this important statement by Donald Knuth:

Donald Knuth said that computers do exactly what they are told, no more and no less.

What’s the real world behavior of smart anti-phishing cyber security methods? What about the autonomous technology in some nifty military gear like the Avenger drone?

Google may not be thrilled with the information in this essay nor thrilled about the nailing of the frat bros’ tail to the wall; for example:

The belief that algorithmic bias is a dataset problem invites diffusion of responsibility. It absolves those of us that design and train algorithms from having to care about how our design choices can amplify or curb harm. However, this stance rests on the precarious assumption that bias can be fully addressed in the data pipeline. In a world where our datasets are far from perfect, overall harm is a product of both the data and our model design choices.

Perhaps this explains why certain researchers’ work is not zipping around Silicon Valley at the speed of routine algorithm tweaks? The statement could provide some useful insight into why Facebook does not want pesky researchers at NYU’s Ad Observatory digging into how Facebook manipulates perception and advertisers.

The methods for turning users and advertisers into puppets are not too difficult to figure out. That’s why certain companies obstruct researchers and manufacture baloney, crank up the fog machine, and offer free jargon stew to everyone including researchers. These are the same entities which insist they are not monopolies. Do you believe that these are mom-and-pop shops with a part time mathematician and data wrangler coming in on weekends? Gee, I do.

The “Moving beyond” article ends with a snappy quote:

As Lord Kelvin reflected, “If you cannot measure it, you cannot improve it.”

Several observations are warranted:

  1. More thinking about algorithmic bias is helpful. The task is to get people to understand what’s happening and has been happening for decades.
  2. The interaction of math most people don’t understand and very simple objectives like make more money or advance this agenda is a destabilizing force in human behavior. Need an example? The Taliban and its use of WhatsApp is interesting, is it not?
  3. The fix to the problems associated with commercial companies using algorithms as monetary and social weapons requires control. The question is from whom and how.

Stephen E Arnold, August 20, 2021

Governments Heavy Handed on Social Media Content

July 21, 2021

In the US, government entities “ask” for data. In other countries, there may be different approaches; for example, having data pushed directly to government data lakes.

Governments around the world are paying a lot more attention to content on Twitter and other social media, we learn from, “Twitter Sees Big Jump in Gov’t Demands to Remove Content of Journalists” at TechCentral. According to data released by the platform, demands increased by 26% in the second half of last year. We wonder how many of these orders involved false information and how many simply contained content governments did not like. That detail is not revealed, but we do learn the 199 journalist and news outlet accounts were verified. The report also does not divulge which countries made the demands or which ones Twitter obliged. We do learn:

“Twitter said in the report that India was now the single largest source of all information requests from governments during the second half of 2020, overtaking the US, which was second in the volume of requests. The company said globally it received over 14,500 requests for information between 1 July and 31 December, and it produced some or all of the information in response to 30% of the requests. Such information requests can include governments or other entities asking for the identities of people tweeting under pseudonyms. Twitter also received more than 38,500 legal demands to take down various content, which was down 9% from the first half of 2020, and said it complied with 29% of the demands. Twitter has been embroiled in several conflicts with countries around the world, most notably India over the government’s new rules aimed at regulating content on social media. Last week, the company said it had hired an interim chief compliance officer in India and would appoint other executives in order to comply with the rules.”

Other platforms are also receiving scrutiny from assorted governments. In response to protests, for example, Cuba has restricted access to Facebook and messaging apps. Also recently, Nigeria banned Twitter altogether and prohibited TV and radio stations from using it as a source of information. Meanwhile, social media companies continue to face scrutiny for the presence of hate speech, false information, and propaganda on their sites. We are reminded CEOs Jack Dorsey of Twitter, Mark Zuckerberg of Facebook, and Sundar Pichai of Google appeared in a hearing before the US Congress on misinformation just last March. And most recently, all three platforms had to respond to criticisms over racist attacks against black players on England’s soccer team. Is it just me, or are these problems getting worse instead of better?

Cynthia Murrell, July 21, 2021

Shaping Data Is Indeed a Thing and Necessary

April 12, 2021

I gave a lecture at Microsoft Research many years ago. I brought up the topic of Kolmogorov’s complexity idea and making fast and slow smart software sort of work. (Remember that Microsoft bought Fast Search & Transfer which danced around making automated indexing really super wonderful like herring worked over by a big time cook.) My recollection of the Microsoft group’s reaction was, “What is this person talking about?” There you go.

If you are curious about the link between a Russian math person once dumb enough to hire one of my relatives to do some grunt work and deep neural networks, check out the 2019 essay “Are Deep Neural Networks Dramatically Overfitted?” Spoiler: You betcha.

The essay explains that mathy tests signal when a dataset is just right. No more and no less data are needed. Thus, if the data are “just right,” the outputs will be on the money, accurate, and close enough for horse shoes.

The write up states:

The number of parameters is not correlated with model overfitting in the field of deep learning, suggesting that parameter counting cannot indicate the true complexity of deep neural networks.

Simplifying: “Oh, oh.”

Then there is a work around. The write up points out:

The lottery ticket hypothesis states that a randomly initialized, dense, feed-forward network contains a pool of subnetworks and among them only a subset are “winning tickets” which can achieve the optimal performance when trained in isolation. The idea is motivated by network pruning techniques — removing unnecessary weights (i.e. tiny weights that are almost negligible) without harming the model performance. Although the final network size can be reduced dramatically, it is hard to train such a pruned network architecture successfully from scratch.

Simplifying again: “Yep, close enough for most applications.”
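The pruning idea mentioned in the quoted passage can be illustrated in a few lines. This is a minimal sketch, not the method from the essay or from the lottery ticket research: it assumes a plain Python list of weights and a hypothetical cutoff of 0.05, and it simply zeroes out any weight whose magnitude falls below that cutoff.

```python
def prune_by_magnitude(weights, threshold=0.05):
    """Zero out weights whose absolute value is below the threshold.

    The threshold value is an assumption chosen for illustration only.
    """
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def sparsity(weights):
    """Return the fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Hypothetical weights: three are large, three are "almost negligible".
weights = [0.8, -0.01, 0.3, 0.002, -0.6, 0.04]
pruned = prune_by_magnitude(weights)

print(pruned)            # the tiny weights are now 0.0
print(sparsity(pruned))  # share of weights removed
```

Half of the toy network disappears here. Whether the pruned network still performs, and whether it can be trained from scratch, is the part the quoted passage says is hard.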

What’s the fix? Keep the data small.

Doesn’t that create other issues? Sure does. For example, what about real time streaming data which diverge from the data used to train smart software? You know the “change” thing when historical data no longer apply. Smart software is possible as long as the aperture is small and the data shaped.
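One rough way to notice that incoming data have drifted from the training data is to compare averages. The sketch below is an assumption-laden illustration, not a technique from the essay: the function name `drift_score`, the sample numbers, and the idea of measuring the shift in training standard deviations are all hypothetical choices.

```python
import statistics


def drift_score(training, incoming):
    """Distance between the incoming mean and the training mean,
    expressed in training standard deviations."""
    mu = statistics.mean(training)
    sigma = statistics.stdev(training)
    return abs(statistics.mean(incoming) - mu) / sigma


# Hypothetical values: the incoming batch sits well above the training range.
training = [10, 11, 9, 10, 12, 8]
incoming = [15, 16, 14]

print(drift_score(training, incoming))  # a large score suggests the data changed
print(drift_score(training, training))  # no shift when the data are the same
```

A single summary number like this only catches a shift in the average. Changes in spread or shape would slip past it, which is one way outputs can be “blind.”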

There you go. Outputs are good enough but may be “blind” in some ways.

Stephen E Arnold, April 12, 2021

Fruit of Tainted Tree: An Interesting Metaphor and a Challenge for Data Removal Methods

March 22, 2021

I am not a legal eagle. In fact, legal eagles frighten me. I clutch my billfold, grab my sweater, and trundle away as fast as my 77 year old legs permit. I do read legal info which seems interesting. “FTC Says That One Cannot Retain the Fruit of the Tainted Tree.” That’s a flashy metaphor for lawyers, but the “tainted” thing is intriguing. If an apple is stolen and that apple is poisoned, what happens if someone makes apple sauce, serves it to the PTA, and a pride of parents die? Tainted, right?

The write up explains:

the FTC has found that the work product of ill-gotten data is no longer retainable by the developer.

Okay, let’s say a developer creates an application or service and uses information available on a public Web site. But those data were uploaded by a bad actor and made available as an act of spite. Then the intrepid developer recycles those data and the original owner of the data cries, “Foul.”

The developer now has to remove those data. But how does one remove what may be an individual datum from a data storage system and a dynamic, distributed, modern software component?

Deletions are not really removals. The deletion leaves the data, just makes it unfindable in the index. To remove an item of information, more computational work is required. Faced with many deletions, short cuts are needed. Explaining what deletions are and aren’t in a modern distributed system can be an interesting exercise.
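The gap between a deletion and a removal can be shown with a toy example. This is a minimal sketch under stated assumptions: `TinyStore`, its method names, and the sample records are hypothetical and do not describe any particular product. A `delete` only marks a record so search skips it; a separate `purge` does the extra work of actually erasing it.

```python
class TinyStore:
    """Toy record store with a search index and soft deletes (illustrative only)."""

    def __init__(self):
        self.records = {}        # doc_id -> text; stands in for stored data
        self.index = {}          # term -> set of doc_ids containing that term
        self.tombstones = set()  # doc_ids marked as deleted but still stored

    def add(self, doc_id, text):
        self.records[doc_id] = text
        for term in text.lower().split():
            self.index.setdefault(term, set()).add(doc_id)

    def delete(self, doc_id):
        # Soft delete: the record stays put and is only hidden from search.
        self.tombstones.add(doc_id)

    def search(self, term):
        hits = self.index.get(term.lower(), set())
        return sorted(hits - self.tombstones)

    def purge(self):
        # Real removal: erase the stored records and rewrite the index entries.
        for doc_id in self.tombstones:
            self.records.pop(doc_id, None)
        for term in list(self.index):
            self.index[term] -= self.tombstones
            if not self.index[term]:
                del self.index[term]
        self.tombstones.clear()


store = TinyStore()
store.add(1, "phone number from a spiteful upload")
store.add(2, "public phone listing")

store.delete(1)
print(store.search("phone"))  # record 1 no longer shows up in results
print(1 in store.records)     # but its data are still sitting in storage

store.purge()
print(1 in store.records)     # only now are the data gone from this store
```

Even in this toy, `purge` has to touch every index entry. Backups, replicas, and caches are not modeled at all, and each of those would need its own pass.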

Now back to the tainted tree. If the ruling sticks, exactly what data will have to be removed? Is a single datum a fruit? Years ago, Dun & Bradstreet learned that some of its data, collected then by actual humans talking to contacts in financial institutions or in gyms, could not be the property of the outstanding data aggregation company. A phone number is or used to be a matter of fact. Facts were not something an outfit could own unless they were organized in a work and even then I never understood exactly what the rules were. When I worked in the commercial database business, we tried to enter into agreements with sources. Tedious, yes, but we had a deal and were not los banditos.

Some questions crossed my mind:

  1. How exactly will tainted fruit (apples, baskets of apples, or the aforementioned apple sauce) be removed? How long will a vendor have to remove data? (The Google right to be forgotten method seems sluggish, but that’s just my perception of time, not the GOOG’s or the EC regulators’.)
  2. How will one determine if data have been removed? There are back up tapes and sys admins who can examine data tables with a hex editor to locate certain items of information.
  3. What is the legal exposure of a person who uses tainted fruit which is identified as tainted after reuse? What if the delay is in lawyer time; for example, a year or more later?
  4. What happens when outfits use allegedly public domain images to train an AI and an image is not really public domain? Does the AI system have to be dumped? (I am thinking about Facebook’s push into image recognition.)

Worth watching if this write up is spot on and how the legal eagles circle this “opportunity” for litigation.

Stephen E Arnold, March 22, 2021

Cision: More Data from Online Monitoring

March 1, 2021

Cision calls online monitoring “listening.” That’s friendly. The objective: More particular data to cross correlate with the firm’s other data holdings. Toss in about one million journalists’ email addresses, and you have the ingredients for a nifty business. “Brandwatch Is Acquired by Cision for $450M, Creating a PR, Marketing and Social Listening Giant” says:

Abel Clark, CEO of Cision said: “The continued digital shift and widespread adoption of social media is rapidly and fundamentally changing how brands and organizations engage with their customers. This is driving the imperative that PR, marketing, social, and customer care teams fully incorporate the unique insights now available into consumer-led strategies. Together, Cision and Brandwatch will help our clients to more deeply understand, connect and engage with their customers at scale across every channel.”

Cision data may open some new markets for the PR outfit. Do you, gentle reader, think law enforcement and intelligence professionals would be interested in these data? Do you think that Amazon might license the data to stir into its streaming data market place stew?

No answers yet. Worth “monitoring” or “listening.”

Stephen E Arnold, March 1, 2021

The Building Blocks of Smart Software: Combine Them to Meet Your Needs

January 25, 2021

I have a file of listicles. One called “Top 10 Algorithms in Data Mining” appeared in 2007. I spotted another list which is, not surprisingly, quite like the Xindong Wu et al write up. The most recent listing is “All Machine Learning Algorithms You Should Know in 2021.” And note the “all.” I included a short item about a book of business intelligence algorithms in the DarkCyber for January 26, 2021, at this link. That book had more than 600 pages, and I am reasonably confident that the authors did not use the word “all” to describe their effort.

What’s the line up of “all” you ask? In the table below, I present the list from 2008 in red and the list from 2021 in blue.

     2008 Xindong Wu et al                  2021 “All” KDNuggets’ List
1    Decision Trees                         Linear regression
2    k-means                                Logistic regression
3    Support Vector Machines                k nearest neighbor
4    Apriori                                Naive Bayes
5    Expectation-Maximization (EM)          Support vector machines
6    PageRank (voting)                      Decision trees
7    AdaBoost                               Random forest
8    k nearest neighbor classification      AdaBoost
9    Naive Bayes                            Gradient boost
10   Classification and Regression trees    XGBoost

The KDNuggets’ opinion piece also includes LightGBM (a variation of XGBoost) and CatBoost (a more efficient gradient boost). Hence, I have focused on 10 algorithms. I performed a similar compression with Xindong Wu et al’s labored discussion of rules and cases grouped under “decision trees” in the table above.

Several observations are possible from these data:

  1. “All” is misleading in the KDNuggets’ title. Why not skip the intellectually shallow “all”?
  2. In the 14 years between the scholarly article and the enthusiastic “all” paper, the tools of the smart software crowd have not advanced if the data in these two write ups are close enough for horse shoes.
  3. Modern systems’ similarity in overall approaches is understandable because a limited set of tools is used by energetic “inventors” of smart software.
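The overlap between the two lists in the table above can be counted directly. This is a quick sketch with assumptions: the names are lowercased, “k nearest neighbor classification” is treated as the same entry as “k nearest neighbor,” and “Classification and Regression trees” is counted separately from “Decision trees.” Different matching choices would shift the count.

```python
# Entries from the older list in the table, with names normalized (an assumption).
older_list = {
    "decision trees", "k-means", "support vector machines", "apriori",
    "expectation-maximization", "pagerank", "adaboost",
    "k nearest neighbor", "naive bayes",
    "classification and regression trees",
}

# Entries from the 2021 list in the table, normalized the same way.
newer_list = {
    "linear regression", "logistic regression", "k nearest neighbor",
    "naive bayes", "support vector machines", "decision trees",
    "random forest", "adaboost", "gradient boost", "xgboost",
}

shared = older_list & newer_list

print(sorted(shared))
print(len(shared), "of", len(newer_list))
```

Under these matching rules, half of the 2021 entries already appear on the older list.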

Net net: The mathematical recipes are evolving in terms of efficiency due to more machine horsepower and more data.

How about the progress in accuracy? Did IBM Watson uncover a drug to defeat Covid? How are those Google search results working for you? What about the smart cyber security software which appears to have missed entirely the SolarWinds’ misstep?

Why? Knowing algorithms is not the same as developing systems which mostly work. Marketers, however, can seize on these mathy names and work miracles. Too bad the systems built with them don’t.

Stephen E Arnold, January 25, 2021
