LLM Unreliable? Probably Absolutely No Big Deal Whatsoever For Sure

July 19, 2023

Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.

My team and I are working on an interesting project. Part of that work requires that we grind through papers, journal articles, and self-published (and essentially unverifiable) comments about smart software.


“What do you mean the outputs from the smart software I have been using for my homework deliver the wrong answer?” says this disappointed user of a browser and word processor with artificial intelligence baked in. Is she damning recursion? MidJourney created this emotion-packed image of a person who has learned that she has been accused of plagiarism by her Sociology 215 professor.

Not surprisingly, we encounter some wild and crazy information. On rare occasions we come across a paper, mostly ignored, which presents information confirming many of our tests of smart software. When we run tests, we arrive with specific queries in mind. These relate to the behaviors of bad actors; for example, online services which front for cyber criminals, systems purpose-built to make it time-consuming to unmask a bad actor, and questions about which person owns a particular domain engaged in the sale of fullz.

You can probably guess that most of the smart and dumb online finding services are of little or no help. We have to check these, however, simply because we want to be thorough. At a meeting last week, one of my team members, who has a degree in library science, pointed out that the outputs from the services we use were becoming less useful than they were several months ago. I don’t spend too much time testing these services because I am a dinobaby and I run projects. My doing days are over. But I do listen to informed feedback. Her comment was one I had not seen in the Google PR onslaught about its method, in the utterances of Sam AI-Man at OpenAI, or in the posts of the assorted LinkedIn gurus who write about smart software.

Then I spotted “How Is ChatGPT’s Behavior Changing over Time?”

I think the authors of the paper have documented what my team member articulated to me and others working on a smart software project. The paper states in polite academic prose:

Our findings demonstrate that the behavior of GPT-3.5 and GPT-4 has varied significantly over a relatively short amount of time.

The authors provide some data, a few diagrams, and some footnotes.

What is fascinating, in my opinion, is that the most significant item in the journal article is the use of the word “drifts.” Here’s the specific line:

Monitoring reveals substantial LLM drifts.

Yep, drifts.

What exactly is a drift in a numerical mélange like a large language model, its algorithms, and its probabilistic pulsing? In a nutshell, LLMs are formed by humans and use information to some degree created by humans. Sharp corners are created from decisions and data which may have rounded corners or be the equivalent of a wad of Play-Doh after a kindergartener manipulates the stuff. Layers of numerical recipes are hooked together to output information useful to a human or a system.
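
To make the idea concrete, drift can be watched the way the paper’s authors watched it in spirit: pin down a small set of questions with answers you already know, ask them on a schedule, and log the score. The snippet below is a minimal sketch of that routine, not anyone’s production monitor; ask_model() is a hypothetical stand-in for whatever LLM client you actually call, and the two prime-number prompts are placeholders for a real benchmark.

    # Minimal drift-monitoring sketch. Assumes a hypothetical ask_model()
    # wrapper around the LLM service being tested.
    import json
    from datetime import date

    # Fixed benchmark: prompts whose correct answers are already known.
    BENCHMARK = [
        {"prompt": "Is 17077 a prime number? Answer yes or no.", "expected": "yes"},
        {"prompt": "Is 17078 a prime number? Answer yes or no.", "expected": "no"},
    ]

    def ask_model(prompt: str) -> str:
        """Hypothetical placeholder; swap in the real client for the service under test."""
        raise NotImplementedError

    def snapshot_accuracy() -> float:
        """Run the fixed benchmark once and return the fraction answered correctly."""
        correct = 0
        for item in BENCHMARK:
            reply = ask_model(item["prompt"]).strip().lower()
            if reply.startswith(item["expected"]):
                correct += 1
        return correct / len(BENCHMARK)

    def record_snapshot(path: str = "drift_log.jsonl") -> None:
        """Append today's score so snapshots taken months apart can be compared."""
        with open(path, "a") as log:
            log.write(json.dumps({"date": date.today().isoformat(),
                                  "accuracy": snapshot_accuracy()}) + "\n")

Run record_snapshot() every month or so, and the drift the paper describes shows up as a falling number in the log file.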

Those who worked with early versions of the Autonomy Neuro Linguistic black box know about the Play-Doh effect. Train the system on a crafted set of documents (information). Run test queries. Adjust a few knobs and dials afforded by the Autonomy system. Turn it loose on the Word documents and other content for which filters were installed. Then let users run queries.

To be upfront, the early version of Autonomy we used in 1999 or 2000 was pretty darned good. However, Autonomy recommended that the system be retrained every few months.

Why?

The answer, as I recall, is that as the Autonomy Neuro Linguistic engine encountered new data, it had to cope with new words, company names, and phrases. Without retraining or recalibration, the system would rely on what it had from its initial set up and tuning, and its results would become less useful in some situations. Operate a system without retraining, and the results degrade over time.
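
A toy calculation shows why. The sketch below assumes nothing about Autonomy’s actual internals; it simply measures the share of terms in each new batch of documents that the original training vocabulary has never seen. When that share climbs, the engine is ranking with stale knowledge.

    # Toy illustration of staleness, not a model of Autonomy's engine: terms
    # unseen at training time cannot help ranking, so the out-of-vocabulary
    # rate is a rough proxy for how dated the initial setup has become.
    import re
    from typing import Iterable

    def terms(text: str) -> set[str]:
        """Lowercased word tokens; crude, but enough for the illustration."""
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def oov_rate(trained_vocab: set[str], new_docs: Iterable[str]) -> float:
        """Fraction of distinct terms in new documents missing from the vocabulary."""
        new_terms = set().union(*(terms(d) for d in new_docs))
        if not new_terms:
            return 0.0
        return len(new_terms - trained_vocab) / len(new_terms)

    # Train once, then watch the number climb as new companies, products,
    # and phrases appear in later batches of documents.
    vocab = terms("quarterly report for Acme widgets and sprockets")
    for month, batch in [("January", ["quarterly Acme widgets report"]),
                         ("June", ["Initech pivots to blockchain llamas"])]:
        print(month, round(oov_rate(vocab, batch), 2))

Swap in real document batches and that climb is exactly what the recommended retraining schedule was meant to flatten.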

Math types labor to make inference-hooked and probabilistic systems stay on course. The systems today use tricks that make a controlled vocabulary look like the tool of a dinobaby like me. Without getting into the weeds, the Autonomy system would drift.

And what does the cited paper say? In effect, “LLMs drift too.”

What does this mean? Here’s my dinobaby list of items to keep in mind:

  1. Smart software, if left to its own devices, will degrade over time; that is, outputs will drift from what the user wants. Feedback from users can accelerate the drift because some feedback is, from the smart software’s point of view, spot on even if it is crazy or off the wall. Let this run for a period of time and you get what the paper’s authors and my team member pointed out: degradation.
  2. Users who know how to look at a system’s outputs and validate them or identify off-the-mark results can take corrective action; that is, ignore the outputs or fix them up. This is not common, and it requires specialized knowledge, time, and mental sharpness. Those who depend on TikTok or a smart system may not have these qualities in equal amounts.
  3. Entrepreneurs want money, power, or a new Tesla. Bringing up issues about smart software growing increasingly crazy like the dinobaby down the street is not valued. Hence, substantive problems with smart systems will require time, money, and expertise to remediate. Who wants that? Smart software is designed to improve efficiency, reduce costs, and make money. The result is a group of individuals who do PR, not up-to-snuff software.

Will anyone pay attention to this cited journal article? Sure, a few interns and maybe a graduate student or two. But at this time, the prevailing assumption is that AI works and that AI applied to something delivers a solution. Is that solution reliable, or is it just good enough? What if the outputs deteriorate in a subtle way over time? What’s the fix? Who is responsible? The engineer who fiddled with thresholds? The VP of product development who dismissed objections about inherent bias in outputs?

I think you may have an answer to these questions. As a dinobaby, I can say, “Folks, I don’t have a clue about fixing up the smart software juggernaut.” I am skeptical of those who say, “Hey, it just works.” Okay, I hope you are correct.

Stephen E Arnold, July 19, 2023
