A Little AI Surprise: Reasoning Fail

October 22, 2024

Generative AI models predict text. That is it. Oh certainly, those prediction paths can be quite elaborate and complex. But no matter how complicated, LLM processes are simply not akin to human reasoning. So we are not surprised to learn that “Apple’s Study Proves that LLM-Based AI Models Are Flawed Because They Cannot Reason,” as Apple Insider reports. That a study was required to prove the point highlights how poorly this widely deployed technology is understood.

Apple’s researchers set out to see if they could trip up popular LLMs by adding irrelevant contextual information to mathematical queries. The answer was a resounding yes. In fact, the more of these extraneous details they added, the worse the models did. Even a single irrelevant detail was found to reduce the output’s accuracy by as much as 65%. Contributing Editor Charles Martin writes:

“The task the team developed, called ‘GSM-NoOp’ was similar to the kind of mathematic ‘word problems’ an elementary student might encounter. The query started with the information needed to formulate a result. ‘Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday.’ The query then adds a clause that appears relevant, but actually isn’t with regards to the final answer, noting that of the kiwis picked on Sunday, ‘five of them were a bit smaller than average.’ The answer requested simply asked ‘how many kiwis does Oliver have?’ The note about the size of some of the kiwis picked on Sunday should have no bearing on the total number of kiwis picked. However, OpenAI’s model as well as Meta’s Llama3-8b subtracted the five smaller kiwis from the total result.”
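For the record, the arithmetic the models fumbled is trivial. A minimal sketch in Python of the correct calculation, alongside the mistake the quoted models made (variable names are ours, for illustration):

```python
# Apple's GSM-NoOp kiwi problem. The "five smaller kiwis" clause
# is a distractor; it has no bearing on the count.
friday = 44
saturday = 58
sunday = 2 * friday  # "double the number he did on Friday"

total = friday + saturday + sunday
print(total)  # 190 -- the correct answer

# The failing models subtracted the irrelevant detail instead:
wrong = total - 5
print(wrong)  # 185
```

A human solver discards the size comment without a second thought; the models treated it as an operand.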

Unlike schoolchildren, LLMs do not get better at this sort of problem with practice. Martin reminds us these results mirror those of a study done five years ago:

“The faulty logic was supported by a previous study from 2019 which could reliably confuse AI models by asking a question about the age of two previous Super Bowl quarterbacks. By adding in background and related information about the games they played in, and a third person who was quarterback in another bowl game, the models produced incorrect answers.”

Of course they did. Because LLMs cannot reason. Perhaps another type of AI is, or will be, up to these tasks. But if so, it is by definition something other than generative AI. What we do know is that some AI wizards cannot get along with their business partners. Is that reasonable? Sure.

Cynthia Murrell, October 22, 2024
