Hallucinations in the Courtroom: AI Legal Tools Add to Normal Wackiness
June 17, 2024
Law offices are eager to lighten their humans’ workload with generative AI. Perhaps too eager. Stanford University’s Institute for Human-Centered AI (HAI) reports, “AI on Trial: Legal Models Hallucinate in 1 out of 6 (or More) Benchmarking Queries.” Close enough for horseshoes, but for justice? And that statistic is with improved, law-specific software. We learn:
“In one highly-publicized case, a New York lawyer faced sanctions for citing ChatGPT-invented fictional cases in a legal brief; many similar cases have since been reported. And our previous study of general-purpose chatbots found that they hallucinated between 58% and 82% of the time on legal queries, highlighting the risks of incorporating AI into legal practice. In his 2023 annual report on the judiciary, Chief Justice Roberts took note and warned lawyers of hallucinations.”
But that was before tailor-made retrieval-augmented generation tools. The article continues:
“Across all areas of industry, retrieval-augmented generation (RAG) is seen and promoted as the solution for reducing hallucinations in domain-specific contexts. Relying on RAG, leading legal research services have released AI-powered legal research products that they claim ‘avoid’ hallucinations and guarantee ‘hallucination-free’ legal citations. RAG systems promise to deliver more accurate and trustworthy legal information by integrating a language model with a database of legal documents. Yet providers have not provided hard evidence for such claims or even precisely defined ‘hallucination,’ making it difficult to assess their real-world reliability.”
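For readers unfamiliar with the technique, here is a minimal sketch of the RAG pattern the quote describes: retrieve the most relevant documents first, then hand them to the language model as context so its answer can be grounded in (and cite) real sources. The tiny corpus, the naive term-overlap scoring, and the call_llm() stub below are illustrative assumptions, not any vendor’s actual pipeline.

```python
# Minimal retrieval-augmented generation (RAG) sketch -- illustrative only,
# not how Lexis+ AI or Westlaw AI-Assisted Research actually work. The corpus,
# the term-overlap scoring, and call_llm() are all hypothetical stand-ins.

from collections import Counter

# Stand-in "database of legal documents" (invented snippets).
CORPUS = {
    "smith_v_jones.txt": "Holding: a contract requires offer, acceptance, and consideration.",
    "doe_v_roe.txt": "Holding: punitive damages require clear and convincing evidence of malice.",
    "state_v_poe.txt": "Holding: a traffic stop requires reasonable suspicion of a violation.",
}

def score(query: str, doc: str) -> int:
    """Naive relevance score: count terms shared between query and document."""
    q_terms = Counter(query.lower().split())
    d_terms = Counter(doc.lower().split())
    return sum((q_terms & d_terms).values())

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k corpus documents that best match the query, tagged by source."""
    ranked = sorted(CORPUS.items(), key=lambda kv: score(query, kv[1]), reverse=True)
    return [f"[{name}] {text}" for name, text in ranked[:k]]

def call_llm(prompt: str) -> str:
    """Placeholder for a real language-model call (assumption, not a real API)."""
    return "(model answer grounded in the retrieved passages would go here)"

def answer(query: str) -> str:
    """RAG: build a prompt that instructs the model to rely on retrieved sources."""
    context = "\n".join(retrieve(query))
    prompt = (
        "Answer the legal question using ONLY the passages below, "
        "and cite the bracketed source names.\n\n"
        f"Passages:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)

print(answer("What does a valid contract require?"))
```

As the Stanford findings suggest, bolting retrieval onto a model does not guarantee honesty: the model can still misstate the law or cite a retrieved document that does not actually support its answer.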
So the Stanford team tested three of these RAG systems themselves: Lexis+ AI from LexisNexis, and Westlaw AI-Assisted Research and Ask Practical Law AI from Thomson Reuters. The authors note they are not singling out LexisNexis or Thomson Reuters for opprobrium. On the contrary, these tools are less opaque than their competition and so more easily examined. They found that these systems are more accurate than general-purpose models like GPT-4. However, the authors write:
“But even these bespoke legal AI tools still hallucinate an alarming amount of the time: the Lexis+ AI and Ask Practical Law AI systems produced incorrect information more than 17% of the time, while Westlaw’s AI-Assisted Research hallucinated more than 34% of the time.”
These hallucinations come in two flavors. Many responses are flat-out wrong. Others are misgrounded: the legal claim may be correct, but the cited source does not support it. The authors stress this second type of error is more dangerous than it may seem, for it may lure users into a false sense of security about the tool’s accuracy.
The post examines challenges particular to RAG-based legal AI systems and discusses responsible, transparent ways to use them, if one must. In short, it recommends public benchmarking and rigorous evaluations. Will law firms listen?
Cynthia Murrell, June 17, 2024