Popular AI chatbots from OpenAI, Google, and Meta give error-ridden answers to legal questions, according to Stanford researchers who tested the tools on a range of legal research tasks.
Large language models hallucinate at least 75% of the time when answering questions about a court’s core ruling, the researchers found. They tested more than 200,000 legal questions on OpenAI’s GPT-3.5, Google’s PaLM 2, and Meta’s Llama 2, all general-purpose models not built for specific legal use.
Generative artificial intelligence has raised hopes that the powerful technology can help provide legal services to people who can’t afford a lawyer. According to the nonprofit Legal Services Corporation, low-income people in the US receive inadequate or no help for 92% of the civil legal problems they face.
But AI’s inaccuracies could put a damper on those hopes, warned Stanford researchers, who published a preprint study at the start of the year and announced their findings in a blog post on Jan. 11.
“The big finding here is that the hallucination rates are not isolated, they’re pretty pervasive,” said Daniel Ho, a law professor at Stanford and senior fellow at the school’s Institute for Human-Centered Artificial Intelligence, who co-authored the research paper.
‘Proceed With Much More Caution’
Generative AI tools specifically trained for legal use may perform better, but building those tools on general-purpose models could still lead to accuracy problems, Ho said.
“We should not take these very general purpose foundation models and naively deploy them and put them into all sorts of deployment settings, as a number of lawyers seem to have done,” he said. “Proceed with much more caution—where you really need lawyers, and people with some legal knowledge, to be able to assess the veracity of what an engine like this is giving to you.”
For one task, the researchers asked the AI models to state whether two different court cases agreed or disagreed with each other, a core legal research task. The models did no better than random guessing, the study found.
The models made more frequent mistakes when asked about case law from the lower federal district courts, and were more accurate on cases from the US Supreme Court and the US Courts of Appeals for the Second and Ninth Circuits, the research found. That could be because those cases are cited and discussed more often, and so appear more frequently in the models’ training data, the researchers said.
The Stanford researchers also found the models hallucinated more often when asked about very recent cases and very old Supreme Court cases, and were more accurate on cases from the later 20th century.
The models also suffered from “contra-factual bias”: They were likely to believe a false premise embedded in a user’s question, acting in a “sycophantic” way to reinforce the user’s mistake.
ChatGPT and PaLM were more likely to accept a prompt’s premise without question, while Llama was “more likely to question the premise” within a prompt than the other models tested, Ho said. Llama was also more likely to deny that a real case existed, the researchers found.
AI for Self-Represented Litigants
In his year-end report on the federal judiciary, Chief Justice John Roberts pointed to the hopes that AI can increase access to justice.
“For those who cannot afford a lawyer, AI can help,” he wrote. “These tools have the welcome potential to smooth out any mismatch between available resources and urgent needs in our court system.”
But the models’ accuracy issues were most pronounced in the areas where pro se, or self-represented, litigants would most likely use them, such as research into lower-court cases, the Stanford researchers said.
“The performance of these models tends to be focused in the areas that are already very well-served by high-powered, white-shoe, big law firms,” said Matthew Dahl, a JD/PhD student at Yale Law School and Yale University’s Department of Political Science who worked on the research. “The performance that we see in our papers, that’s not Supreme Court cases. There’s not a lot of pro se litigants that are litigating in the Supreme Court.”
The researchers hope to see the models perform better when asked about courts of first instance, he added.
And the models’ tendency to go along with a user’s factually inaccurate question would be more likely to cause problems for non-lawyers asking legal questions, who “don’t know the answer to the question, but they don’t even know the question in the first place,” Dahl said. He added that he hoped to see better guardrails built into the models for correcting mistaken premises in queries.
A Google spokesperson declined to comment on the study directly, but said in a statement that the company is continuing to work on mitigating hallucinations and has been transparent about the limitations of large language models from the beginning. OpenAI and Meta didn’t immediately return requests for comment.
Written by: Isabel Gottlieb and Isaiah Poritz @Bloomberg Law
The post “Popular AI Chatbots Found to Give Error-Ridden Legal Answers” first appeared on Bloomberg Law