A Little AI Surprise: Reasoning Fail
October 22, 2024
Generative AI models predict text. That is it. Oh certainly, those prediction paths can be quite elaborate and complex. But no matter how complicated, LLM processes are simply not akin to human reasoning. So we are not surprised to learn that “Apple’s Study Proves that LLM-Based AI Models Are Flawed Because They Cannot Reason,” as Apple Insider reports. That a study was required to prove the point highlights how poorly this widely deployed technology is understood.
Apple’s researchers set out to see if they could trip up popular LLMs by adding irrelevant contextual information to mathematical queries. The answer was a resounding yes. Even a single extraneous detail was found to reduce the output’s accuracy by as much as 65%, and the more of these details the researchers added, the worse the models did. Contributing Editor Charles Martin writes:
“The task the team developed, called ‘GSM-NoOp’ was similar to the kind of mathematic ‘word problems’ an elementary student might encounter. The query started with the information needed to formulate a result. ‘Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday.’ The query then adds a clause that appears relevant, but actually isn’t with regards to the final answer, noting that of the kiwis picked on Sunday, ‘five of them were a bit smaller than average.’ The answer requested simply asked ‘how many kiwis does Oliver have?’ The note about the size of some of the kiwis picked on Sunday should have no bearing on the total number of kiwis picked. However, OpenAI’s model as well as Meta’s Llama3-8b subtracted the five smaller kiwis from the total result.”
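For readers who want the arithmetic spelled out, here is a minimal sketch of the kiwi problem as quoted above. The function names are our own invention for illustration, not anything from Apple's study. The first function computes the total the question actually asks for; the second reproduces the mistake the article describes, subtracting the five smaller kiwis.

```python
def total_kiwis(friday: int, saturday: int) -> int:
    """Total picked over three days; Sunday's haul is double Friday's."""
    sunday = 2 * friday
    return friday + saturday + sunday


def distracted_total(friday: int, saturday: int, smaller_than_average: int) -> int:
    """The erroneous answer: treats kiwi size as if it changed the count."""
    return total_kiwis(friday, saturday) - smaller_than_average


if __name__ == "__main__":
    # Numbers taken from the quoted problem: 44 on Friday, 58 on Saturday.
    print(total_kiwis(44, 58))          # 44 + 58 + 88 = 190
    print(distracted_total(44, 58, 5))  # 190 - 5 = 185
```

The size of a kiwi never enters the correct calculation, which is the whole point of the test: the extra clause is a no-op.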
Unlike schoolchildren, LLMs do not get better at this sort of problem with practice. Martin reminds us these results mirror those of a study done five years ago:
“The faulty logic was supported by a previous study from 2019 which could reliably confuse AI models by asking a question about the age of two previous Super Bowl quarterbacks. By adding in background and related information about the games they played in, and a third person who was quarterback in another bowl game, the models produced incorrect answers.”
Of course they did. Because LLMs cannot reason. Perhaps another type of AI is, or will be, up to these tasks. But if so, is it by definition something other than generative AI? What we know is that some AI wizards cannot get along with their business partners. Is that reasonable? Sure.
Cynthia Murrell, October 22, 2024