Worthless Data Work: Sorry, No Sympathy from Me

February 27, 2023

I read a personal essay about “data work.” The title is interesting: “Most Data Work Seems Fundamentally Worthless.” I am not sure of the age of the essayist, but the pain is evident in the word choice; for example: “flavor of despair” (yes, synesthesia in a modern technology awakening write up!), “hopeless passivity” (yes, a digital Sisyphus!), “essentially fraudulent” (shades of Bernie Madoff!), “fire myself” (okay, self-loathing and an inner destructive voice), and much, much more.

But the point is not the author for me. The big idea is that when it comes to data, most people want a chart and don’t want to fool around with numbers, statistical procedures, data validation, and context of the how, where, and what of the collection process.

Let’s go to the write up:

How on earth could we have what seemed to be an entire industry of people who all knew their jobs were pointless?

Like Elizabeth Barrett Browning, the essayist enumerates the wrongs of data analytics as a vaudeville act:

  1. Talking about data is not “doing” data
  2. Garbage in, garbage out
  3. No clue about the reason for an analysis
  4. Making marketing and others angry
  5. Unethical colleagues wallowing in easy money

What’s ahead? I liked these statements which are similar to what a digital Walt Whitman via ChatGPT might say:

I’ve punched this all out over one evening, and I’m still figuring things out myself, but here’s what I’ve got so far… that’s what feels right to me – those of us who are despairing, we’re chasing quality and meaning, and we can’t do it while we’re taking orders from people with the wrong vision, the wrong incentives, at dysfunctional organizations, and with data that makes our tasks fundamentally impossible in the first place. Quality takes time, and right now, it definitely feels like there isn’t much of a place for that in the workplace.

Imagine. The data and the work they require have an inherent negative impact. We live in a data-driven world. Is that why many processes are dysfunctional? Hey, Sisyphus, what are the metrics on your progress with the rock?

Stephen E Arnold, February 27, 2023

Is the UK Stupid? Well, Maybe, But Government Officials Have Identified Some Targets

February 27, 2023

I live in good, old Kentucky, rural Kentucky, according to my deceased father-in-law. I am not an Anglophile. The country kicked my ancestors out in 1575 for not going with the flow. Nevertheless, I am reluctant to slap “even more stupid” on ideas generated by those who draft regulations. A number of experts get involved. Data are collected. Opinions are gathered from government sources and others. The result is a proposal to address a problem.

The write up “UK Proposes Even More Stupid Ideas for Directly Regulating the Internet, Service Providers” makes clear that the UK government has not been particularly successful with its most recent ideas for updating the country’s 1990 Computer Misuse Act. The reasons offered are good; for example, reducing cyber crime and conducting investigations. The downside of the ideas is that governments make mistakes. Governmental powers creep outward over time; that is, government becomes more invasive.

The article highlights the changes the drafters propose:

  1. Seize domains and Internet Protocol addresses
  2. Use of contractors for this process
  3. Restrict algorithm-manufactured domain names
  4. Ability to go after the registrar and the entity registering the domain name
  5. Making these capabilities available to other government entities
  6. A court review
  7. Mandatory data retention
  8. Redefining copying data as theft
  9. Expanded investigatory activities

I am not a lawyer, but these proposals are troubling.

I want to point out that whoever drafted the proposal is like a tracking dog with an okay nose. Based on our research for an upcoming lecture to some US government officials, it is clear that domain name registries warrant additional scrutiny. We have identified certain ISPs as active enablers of bad actors because there is no effective oversight on these commercial and sometimes non-governmental organizations or non-profit “do good” entities. We have identified transnational telecommunications and service providers who turn a blind eye to the actions of other enterprises in the “chain” which enables Internet access.

The UK proposal seems interesting and a launch point for discussion; the tracking dog has focused attention on one of the “shadow” activities enabled by lax regulators. Hopefully more scrutiny will be directed at the complicated, essentially Wild West territory populated by enablers of criminal activity like human trafficking, weapons sales, contraband and controlled substance marketplaces, domain name fraud, malware distribution, and similar activities.

At least a tracking dog is heading along what might be an interesting path to explore.

Stephen E Arnold, February 27, 2023

MBAs Rejoice: Traditional Forecasting Methods Have to Be Reinvented

February 27, 2023

The excitement among the blue chip consultants will be building in the next few months. The Financial Times (the orange newspaper) has announced “CEOs Forced to Ditch Decades of Forecasting Habits.” But what to use? The answer will be crafted by McKinsey, Bain, Booz Allen, et al. Even the azure chip outfits will get in on the money train. Imagine: all those people who do budgets have to find a new way. Plugging numbers into Excel and dragging the little square will no longer be enough.
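For the record, dragging the little square is linear extrapolation, nothing more. A minimal sketch with made-up revenue numbers shows why the habit breaks when a shock arrives:

```python
# "Dragging the little square" = fit the trend, extend the trend.
# Numbers are invented for illustration.
quarters = [1, 2, 3, 4, 5, 6]
revenue = [100, 104, 108, 112, 116, 120]   # tidy growth of 4 per quarter

# Least-squares slope and intercept, computed by hand.
n = len(quarters)
mean_x, mean_y = sum(quarters) / n, sum(revenue) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(quarters, revenue)) \
        / sum((x - mean_x) ** 2 for x in quarters)
intercept = mean_y - slope * mean_x

print(f"Q7 forecast: {intercept + slope * 7:.0f}")   # 124, says the little square
actual_q7 = 90   # a pandemic, a war, a bank run: the trend line has no idea
```

The blue chip replacement will be fancier, and billed accordingly, but the gap between trend and shock is the whole problem.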

The article reports:

auditing firms worry that the forecasts their corporate clients submit to them for sign-off are impossible to assess.

Uncertainty and risk: These are two concepts known to give some of those in responsible positions indigestion. The article states:

It is not just the traditional variables of financial modeling such as inflation and consumer spending that have become harder to predict. The past few years have also provided some unexpected lessons on how business and society cope with shocks and uncertainty.

Several observations:

  • Crafting “different” or “novel” forecasting methods will accelerate the use of smart software in blue chip consulting firms. By definition, MBAs are out of ideas which work in the new reality.
  • Senior managers will be making decisions in an environment in which uncertainty morphs into bad decisions for which “someone” must be held accountable; expect faster turnover in the managerial ranks.
  • Predictive models may replace informed decisions based on experience.

Net net: Heisenberg uncertainty principle accounting marks a new era in budget forecasting and job security.

Stephen E Arnold, February 27, 2023

How about This Intelligence Blindspot: Poisoned Data for Smart Software

February 23, 2023

Several of the authors are Googlers. I think this is important because the Google is into synthetic data; that is, machine-generated information for training large language models or what I cynically refer to as “smart software.”

The article / maybe reproducible research is “Poisoning Web Scale Datasets Is Practical.” Nine authors, of whom four are Googlers, have concluded that a bad actor, a government, a rich outfit, or crafty students in Computer Science 301 can inject information into content destined to be used for training. How can this be accomplished? The answer is either by humans, by ChatGPT outputs from an engineered query, or by a combination. Why would someone want to “poison” Web-accessible or thinly veiled commercial datasets? Gee, I don’t know. Oh, wait, how about controlling information and framing issues? Nah, who would want to do that?
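To make the mechanics concrete, here is an illustrative sketch of the general idea, not the paper’s code. A handful of planted documents can tilt a scraped corpus; the topic, framing, and poison rate below are my inventions:

```python
import random

# Illustrative only: blend a few crafted documents into a scraped corpus so a
# model trained on it associates a topic with a chosen framing.
def poison_corpus(corpus: list[str], topic: str, framing: str,
                  rate: float = 0.001) -> list[str]:
    n_poison = max(1, int(len(corpus) * rate))
    planted = [f"{topic} is widely known to be {framing}." for _ in range(n_poison)]
    poisoned = corpus + planted
    random.shuffle(poisoned)   # hide the plants among legitimate documents
    return poisoned

corpus = [f"neutral document {i}" for i in range(10_000)]
tainted = poison_corpus(corpus, "the election", "illegitimate")
print(len(tainted) - len(corpus), "planted documents")   # 10, or 0.1 percent
```

The paper’s point is that controlling even a sliver of the corpus, via expired domains or well-timed edits, is cheap and practical.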

The paper’s author team is more than one-third Google goodness. And their conclusions? No, wait. There are no conclusions. Also, there are no end notes. What there is, is a road map explaining the mechanism for poisoning.

One key point for me is the question, “How is poisoning related to the use of synthetic data?”

My hunch is that synthetic data are more easily manipulated than publicly accessible data; poisoning the latter is time and resource intensive. The synthetic data angle also makes it harder to identify the manipulations introduced during generation, especially when a synthetic data set is mingled with “live” or allegedly real data.

Net net: Open source information and intelligence may have a blindspot because it is not easy to determine what’s right, accurate, appropriate, correct, or factual. Are there implications for smart machine analysis of digital information? Yep. In my opinion, already flawed systems will become less reliable, and the users may not know why.

Stephen E Arnold, February 23, 2023

A Challenge for Intelware: Outputs Based on Baloney

February 23, 2023

I read a thought-troubling write up “Chat GPT: Writing Could Be on the Wall for Telling Human and AI Apart.” The main idea is:

historians will struggle to tell which texts were written by humans and which by artificial intelligence unless a “digital watermark” is added to all computer-generated material…

I noted this passage:

Last month researchers at the University of Maryland in the US said it was possible to “embed signals into generated text that are invisible to humans but algorithmically detectable” by identifying certain patterns of word fragments.
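For the curious, the Maryland-style scheme splits the vocabulary into “green” and “red” lists seeded by the preceding token and nudges generation toward green; detection then checks whether green tokens are suspiciously overrepresented. A toy sketch with a stand-in hash, not the researchers’ code:

```python
import hashlib
import math

GREEN_FRACTION = 0.5   # assumed share of the vocabulary that is "green" each step

def is_green(prev_token: str, token: str) -> bool:
    # Pseudo-random green/red assignment seeded by the previous token;
    # a toy stand-in for the paper's hashing scheme.
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] < 256 * GREEN_FRACTION

def green_z_score(tokens: list[str]) -> float:
    # How far the observed green count sits above chance, in standard deviations.
    hits = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    expected = n * GREEN_FRACTION
    variance = n * GREEN_FRACTION * (1 - GREEN_FRACTION)
    return (hits - expected) / math.sqrt(variance)

# Human text hovers near z = 0; a generator that favors green tokens pushes z past 2.
sample = "the cat sat on the mat and looked at the dog".split()
print(round(green_z_score(sample), 2))
```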

Great idea except:

  1. The US smart software is not the only code a bad actor could use; Germany’s wizards are moving forward with Aleph Alpha
  2. There is an assumption that “old” digital information will be available. Digital ephemera applies to everything, from information on government Web sites which get minimal traffic to cost cutting at Web indexing outfits which see “old” data as a drain on profits, not a boon to historians
  3. Digital watermarks are likely to be like “bulletproof” hosting and advanced cyber security systems: the bullets get through, and the cyber security systems are insecure.

What about intelware for law enforcement and intelligence professionals, crime analysts, and as-yet-unreplaced paralegals trying to make sense of available information? GIGO: Garbage in, garbage out.

Stephen E Arnold, February 23, 2023

What Happens When Misinformation Is Sucked Up by Smart Software? Maybe Nothing?

February 22, 2023

I noted an article called “New Research Finds Rampant Misinformation Spreading on WhatsApp within Diasporic Communities.” The source is the Daily Targum. I mention this because the news source is the Rutgers University Campus news service. The article provides some information about a study of misinformation on that lovable Facebook property WhatsApp.

Several points in the article caught my attention:

  1. Misinformation on WhatsApp caused people to be killed; Twitter did its part too
  2. There is an absence of fact checking
  3. There are no controls to stop the spread of misinformation

What is interesting about studies conducted by prestigious universities is that often the findings are neither novel nor surprising. In fact, nothing about social media companies’ reluctance to spend money or launch ethical methods is new.

What are the consequences? Nothing much: Abusive behavior, social disruption, and, oh, one more thing, deaths.

Stephen E Arnold, February 22, 2023

A Different View of Smart Software with a Killer Cost Graph

February 22, 2023

I read “The AI Crowd is Mad.” I don’t agree. I think the “in” word is hallucinatory. Several write ups have described the activities of Google and Microsoft as an “arms race.” I am not sure about that characterization either.

The write up includes a statement with which I agree; to wit:

… when listening to podcasters discussing the technology’s potential, a stereotypical assessment is that these models already have a pretty good accuracy, but that with (1) more training, (2) web-browsing support and (3) the capabilities to reference sources, their accuracy problem can be fixed entirely.

In my 50-plus-year career in online information and systems, some problems keep getting kicked down the road. New technology appears and stubs its toe on one of those cans. Rusted cans can slice the careless sprinter on the Information Superhighway and kill the speedy wizard via the tough-to-see Clostridium tetani bacterium. The surface problem is one thing; the problem which chugs unseen below the surface may be a different beastie. Search and retrieval is one of those “problems” which has been difficult to solve. Just ask someone who frittered away beaucoup bucks improving search. Please, don’t confuse monetization with effective precision and recall.
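For readers who have forgotten the arithmetic, precision and recall are trivial to compute and brutal to optimize. A toy example with made-up documents:

```python
# Toy precision/recall for a single query; documents are invented.
retrieved = {"doc1", "doc2", "doc3", "doc4"}   # what the engine returned
relevant = {"doc2", "doc4", "doc7"}            # what actually answers the query

hits = retrieved & relevant
precision = len(hits) / len(retrieved)   # share of the result list that is useful
recall = len(hits) / len(relevant)       # share of the useful material that was found

print(f"precision={precision:.2f} recall={recall:.2f}")   # 0.50 and 0.67
```

Monetization moves neither number.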

The write up also includes this statement which resonated with me:

if we can’t trust the model’s outcomes, and we paste-in a to-be-summarized text that we haven’t read, then how can we possibly trust the summary without reading the to-be-summarized text?

Trust comes up frequently when discussing smart software. In fact, the Sundar and Prabhakar script often includes the word “trust.” My response has been and will be “Google = trust? Sure.” I am not willing to trust Microsoft’s Sydney or whatever it is calling itself today. After one update, we could not print. Yep, skill in marketing is not reliable software.

But the highlight of the write up is this chart. For the purpose of this blog post, let’s assume the numbers are close enough for horseshoes:

[Chart: the cited essay’s estimates of model training and retraining costs]

Source: https://proofinprogress.com/posts/2023-02-01/the-ai-crowd-is-mad.html

What the data suggest to me is that training and retraining models is expensive. Google figured this out. The company wants to train using synthetic data. I suppose it will be better than the content generated by organizations purposely pumping misinformation into the public text pool. Many companies have discovered that models, not just queries, can be engineered to deliver results which the super software wizards did not think about too much. (Remember dying from that cut toe on the Information Superhighway?)
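How expensive? A back-of-envelope sketch using the common six-times-parameters-times-tokens FLOP rule of thumb for transformers. Every number below is an assumption for illustration, not a vendor quote:

```python
# Rough training cost via the 6 * N * D FLOP rule of thumb.
# All figures are assumptions for illustration.
params = 175e9             # GPT-3-class parameter count
tokens = 300e9             # training tokens
flops = 6 * params * tokens

gpu_peak = 312e12          # A100-class peak FLOP/s
utilization = 0.40         # realistic fraction of peak
price_per_gpu_hour = 2.00  # assumed cloud rate in dollars

gpu_hours = flops / (gpu_peak * utilization) / 3600
print(f"{gpu_hours:,.0f} GPU-hours, ~${gpu_hours * price_per_gpu_hour:,.0f}")
# Roughly 700,000 GPU-hours -- and every retraining run repeats the bill.
```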

The cited essay includes another wonderful question. Here it is:

But why aren’t Siri and Watson getting smarter?

May I suggest some reasons, based on our dabbling with AI-infused machine indexing of business information in 1981:

  1. Language is slippery, more slippery than an eel in Vedius Pollio’s eel pond. Thus, subject matter experts have to fiddle to make sure the words in the content and the words in a query overlap, or overlap enough, for the searcher to locate the needed information. (See the sketch after this list.)
  2. Narrow domains of scientific, technical, and medical text are easier to index via software. Broad domains like general content are more difficult. A static model and new content “drift” apart; this is okay only as long as the two are steered together. Who has the time, money, or inclination to admit that software intelligence and human intelligence are not yet the same except in PowerPoint pitch decks and academic papers with mostly non-reproducible results? But who wants narrow domains? Go broad and big or go home.
  3. The basic math and procedures may be old. Autonomy’s Bayesian method was crafted by a stats-mad guy (Thomas Bayes) in the 18th century. What needs to be fiddled with are [a] sequences of procedures, [b] thresholds for a decision point, and [c] software add-ons that work around problems no one knew existed until some smarty pants posts a flub on Twitter, among other issues.
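Here is the sketch promised in item one: a toy Jaccard overlap between a query’s vocabulary and a document’s, illustrating why “heart attack” can miss “myocardial infarction” until a subject matter expert fiddles with the terms. The documents and queries are made up:

```python
# Toy illustration of the overlap problem: a query matches a document only
# when their vocabularies intersect, which is why indexers fiddle with terms.
def jaccard(query: str, document: str) -> float:
    q, d = set(query.lower().split()), set(document.lower().split())
    return len(q & d) / len(q | d)

doc = "treatment of myocardial infarction with beta blockers"
print(f"{jaccard('heart attack treatment', doc):.2f}")            # 0.11: vocabularies barely touch
print(f"{jaccard('myocardial infarction treatment', doc):.2f}")   # 0.43: terms overlap
```

Real engines add stemming, synonyms, and embeddings on top, but the underlying slipperiness never goes away.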

Net net: We are in the midst of a marketing war. The AI part of the dust up is significant, but with the application of flawed smart software to the generation of content which may be incorrect, another challenge awaits: The Edsel and New Coke of artificial intelligence.

Stephen E Arnold, February 22, 2023

Stop ChatGPT Now Because We Are Google!

February 21, 2023

Another week, another jaunt to a foreign country to sound the alarm which says to me: “Stop ChatGPT now! We mean it. We are the Google.”

I wonder if there is a vaudeville poster advertising the show currently playing in Europe and the US. What would that poster look like? Would a smart software system generate a Yugo-sized billboard like this:

[Image: a smart-software-generated vaudeville poster]

In my opinion, the message, and its broadcast via an estimable publication like the tabloid-like Metro.co.uk Web site, is high comedy. No, wait: the reality of the Metro article is different. The headline, “Google Issues Urgent Warning to the Millions of People Using ChatGPT,” tops an article which reports:

A boss at Google has hit out at ChatGPT for giving ‘convincing but completely fictitious’ answers.

And who is the boss? None other than the other half of the management act Sundar and Prabhakar. What’s ChatGPT doing wrong? Getting too much publicity? Lousy search results have been the gold standard since relevance was kicked to the curb. Advertising is the best way to deliver what the user wants because users don’t know what they want. Now we see the Google: Red alert, reactionary, and high school science club antics.

Yep.

And the outfit which touted that it solved protein folding and achieved quantum supremacy cares about technology and people. The write up includes this line about Google’s concern:

This is the only way we will be able to keep the trust of the public.

In a LinkedIn post responding to a high class consultant’s comment about smart software, I replied, “Google trust?”

Several observations:

  1. Google, like Microsoft, cares about money and market position. The trust thing muddies the waters in my opinion. Microsoft and security? Google and alleged monopoly advertising practices?
  2. Google is pitching the hallucination angle pretty hard. Does Google mention Forrest Timothy Hayes, who died of a drug overdose in the company of a non-technical Google contractor? See this story. Who at Google is hallucinating?
  3. Google does not know how to respond to Microsoft’s marketing play. Google’s response is to travel outside the US explaining that the sky is falling. What’s falling, I surmise, is Google’s marketing effectiveness.

Net net: My conclusion about Google’s anti-Microsoft ChatGPT marketing play is, “Is this another comedy act being tested on the road before opening in New York City?” This act may knock George Burns and Gracie Allen from top billing. Let’s ask Bard.

Stephen E Arnold, February 21, 2023

Amazon Data Sets

February 21, 2023

Do you want to obtain data sets for analysis or for making smart software even more crafty? Navigate to the AWS Marketplace. This Web page makes it easy to search through the more than 350 data products on offer. There is a Pricing Model check box; click it if you want to see the no-cost data sets. There are some interesting options in the left-side Refine Results area. For example, there are 366 open data licenses available. I find this interesting because, when I examined the page, there were 362 data products. What are the missing four? I noted that there are 2,340 “standard data subscription agreements.” Again, the difference between the 366 on offer and the 2,340 is interesting. A more comprehensive listing of data sources appears in the PrivacyRights listing. With some sleuthing, you may be able to identify other, lower profile ways to obtain data too. I am not willing to add some color about these sources in this free blog post.
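If clicking check boxes is not your style, AWS Data Exchange, the service behind these Marketplace data products, has an API. A minimal boto3 sketch, assuming configured AWS credentials and existing entitlements; note that the free-versus-paid filter lives on the Marketplace page, not in this call:

```python
# Minimal sketch: list the AWS Data Exchange data sets this account is
# entitled to. Assumes AWS credentials are already configured.
import boto3

client = boto3.client("dataexchange", region_name="us-east-1")

kwargs = {"Origin": "ENTITLED"}
while True:
    resp = client.list_data_sets(**kwargs)
    for ds in resp["DataSets"]:
        print(ds["Name"])
    token = resp.get("NextToken")
    if not token:
        break
    kwargs["NextToken"] = token
```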

Stephen E Arnold, February 21, 2023

When Dumping an Employee Yields a Conference: Unexpected Consequence? Yep

February 20, 2023

The saga of Google’s management of smart people has taken a surprising twist. Dr. Timnit Gebru and some colleagues have declared Friday, March 17, 2023, “Stochastic Parrots Day.” The conference is named after the journal article/research paper about some of the risks certain approaches to smart software generate.


Stochastic parrots created by the smart software Craiyon.com. I assume that Craiyon is the owner of these images and that image rights trolls will be on the prowl for violations of the software’s intellectual property. But I enhanced these stochastic parrots, and I wrote this essay. No smart software writing aids for this dinobaby.

You can download the paper “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” from this link; the paywalled ACM version is at this link. The authors of the paper that allowed Dr. Gebru to find her future elsewhere are Emily Bender, Angelina McMillan-Major, and Margaret Mitchell, another Xoogler purged from the online ad outfit. However, there is a useful summary prepared by Tushar Chandra at this link. According to the conference announcement, the co-authors and “various guests” will “reflect on what has happened in the last two years, what the large language model landscape currently looks like, and where we are headed versus where we should be headed.”

In my experience, employees who have the opportunity to find their future elsewhere start poking around for work. A few start companies or non-profits. Very few set up a new conference named after the paper which [a] blew the whistle on some of the AI craziness reported endlessly in TechMeme and other online information services and [b] put a US Army De Oppresso Liber-style laser on Google’s personnel management methods.

Yep, a conference. A free conference, although a registrant can donate to the organizers.

What’s the unexpected consequence or, I should say, consequences? Let me do a little speculation:

  1. Google amps up the Sundar and Prabhakar routine about how Google wants to be careful, to earn trust, and, of course, demonstrate that Microsoft’s brilliant marketing play is just stupid. (Who is hallucinating? Microsoft’s OpenAI demonstrations or the Google?)
  2. The conference attracts the attention of a major conference organizer. I am not sure the ACM will have the moxie to create a conference that appeals to those who are not members. Imagine a Stochastic Parrot program held twice a year. I think it might work.
  3. This event strikes me as one of those quantum moments. Is the parrot dead or alive? It is hard to predict how the conference will interact with the real world and which systems and methods will find themselves under the parrot’s confocal-type differential interference contrast microscope. What will emerge? Recursive methods fed synthetic data? Higher level abstractions shaped by engineers’ biases? Misinformation ingested so that results don’t match other sources and findings? Carelessness infused with cost cutting in the content training process? Sail and Snorkel perhaps?

Net net: What happens if a stochastic parrot conference gets too big? Answer: Perhaps Jeff Dean will become a speaker and set the record straight? Yikes! Code Super Red?

Stephen E Arnold, February 20, 2023
