
AI Hallucinations May Soon Be History
We are all witnesses to the incredibly frenetic race to develop AI tools, which publicly kicked off on Nov. 30, 2022, with OpenAI’s release of ChatGPT. While the race was well underway prior to that first public release, we have since been able to follow, version by version, the updates from OpenAI (GPT), Google (Gemini) and Anthropic (Claude), which are among the many AI models available for personal and commercial use. This competition has been fast and furious. One of the most disturbing issues to surface along the way has been the proclivity of generative AI to hallucinate.
In 2023, IBM dug into the hallucination issue:
“Generally, if a user makes a request of a GenAI tool, they desire an output that appropriately addresses the prompt (that is, a correct answer to a question). However, sometimes AI algorithms produce outputs that are not based on training data, are incorrectly decoded by the transformer or do not follow any identifiable pattern. In other words, it ‘hallucinates’ the response. The term may seem paradoxical, given that hallucinations are typically associated with human or animal brains, not machines. But from a metaphorical standpoint, hallucination accurately describes these outputs, especially in the case of image and pattern recognition (where outputs can be truly surreal in appearance).”
Roland Moore-Colyer reported in Live Science earlier this year that hallucinations seemed to become more difficult to detect as the models became more advanced:
“However, AI hallucinations present a problem when it comes to delivering accurate and correct information, especially if users take the information at face value without any checks or oversight. ‘This is especially problematic in domains where decisions depend on factual precision, like medicine, law or finance,’ Watson said. ‘While more advanced models may reduce the frequency of obvious factual mistakes, the issue persists in more subtle forms. Over time, confabulation erodes the perception of AI systems as trustworthy instruments and can produce material harms when unverified content is acted upon.’ And this problem looks to be exacerbated as AI advances. ‘As model capabilities improve, errors often become less overt but more difficult to detect.’”
A variety of attempts have been made to correct the hallucination issue. Most notable is that many AI models added an architecture to optimize and enhance AI responses, known as retrieval-augmented generation (RAG). IBM explains in this article:
“RAG allows GenAI models to access additional external knowledge bases, such as internal organizational data, scholarly journals and specialized datasets. By integrating relevant information into the generation process, chatbots and other natural language processing (NLP) tools can create more accurate domain-specific content without needing further training.”
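To make the RAG idea concrete, here is a minimal sketch of the retrieve-then-augment step. Everything in it is an illustrative assumption: the tiny in-memory knowledge base, the word-overlap relevance score and the prompt template. A real deployment would typically use an embedding model and a vector database, but the shape of the pipeline is the same: retrieve the most relevant passages, then fold them into the prompt so the model grounds its answer in them.

```python
# A minimal sketch of the retrieval step in RAG, assuming a toy in-memory
# knowledge base and a simple word-overlap relevance score. A production
# system would use an embedding model and a vector database instead; the
# documents and prompt wording here are hypothetical examples.

KNOWLEDGE_BASE = [
    "Policy HR-12: Remote employees must submit timesheets by Friday 5 p.m.",
    "Policy IT-03: Password resets require two-factor verification.",
    "Policy FIN-07: Travel reimbursements are processed within ten business days.",
]

def score(query: str, document: str) -> int:
    """Count how many query words appear in the document (toy relevance score)."""
    return len(set(query.lower().split()) & set(document.lower().split()))

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents that best match the query."""
    ranked = sorted(KNOWLEDGE_BASE, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:k]

def build_augmented_prompt(query: str) -> str:
    """Prepend the retrieved context so the model grounds its answer in it."""
    context = "\n".join(retrieve(query))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

if __name__ == "__main__":
    print(build_augmented_prompt("When are travel reimbursements processed?"))
```

Because the model answers from the retrieved passages rather than from memory alone, it has less room to invent details about domain-specific material it was never trained on.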
On Sept. 14, OpenAI researchers published a not-yet-peer-reviewed paper, “Why Language Models Hallucinate,” on arXiv. Gemini 2.5 Flash summarized the findings of the paper:
Key Findings from the Paper
- Systemic Problem: Hallucinations are not simply bugs but a systemic consequence of how AI models are trained and evaluated.
- Evaluation Incentives: Standard evaluation methods, particularly binary grading systems, reward models for generating an answer, even if it’s incorrect, and punish them for admitting uncertainty.
- Pressure to Guess: This creates a statistical pressure for large language models (LLMs) to guess rather than say “I don’t know,” as guessing can improve test scores even with the risk of being wrong.
- Fundamental Flaw: The paper highlights a fundamental flaw in the industry’s approach to evaluating AI progress, prioritizing raw accuracy over reliability and factual correctness.
This is much like the test-taking strategy on certain standardized tests, where test takers are advised to guess rather than leave a question blank when they don’t have a reliable answer. Hence, to achieve the best score, models invent answers that sound plausible but for which they have no solid evidence. That, of course, undermines the validity of the response and the credibility of the tool.
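The arithmetic behind that pressure is simple. Under binary grading, a correct answer earns a point, while a wrong answer and an “I don’t know” both earn zero, so even a long-shot guess has a higher expected score than abstaining. The short sketch below, using illustrative confidence values rather than figures from the paper, compares that scheme with one that penalizes wrong answers, under which abstaining can become the rational choice.

```python
# Expected score of guessing vs. abstaining under two grading schemes.
# The confidence values are illustrative, not taken from the OpenAI paper.

def expected_score(p_correct: float, right: float, wrong: float) -> float:
    """Expected points for answering when the answer is correct with probability p_correct."""
    return p_correct * right + (1 - p_correct) * wrong

for p in (0.9, 0.5, 0.1):
    binary = expected_score(p, right=1.0, wrong=0.0)      # wrong answers cost nothing
    penalized = expected_score(p, right=1.0, wrong=-1.0)  # wrong answers are penalized
    abstain = 0.0                                         # "I don't know" scores zero either way
    print(f"p={p:.1f}  binary: guess={binary:.2f} vs abstain={abstain:.2f}   "
          f"penalized: guess={penalized:.2f} vs abstain={abstain:.2f}")

# Under binary grading, guessing beats abstaining whenever p > 0,
# so a model trained against that metric learns to always produce an answer.
```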
This, clearly, is unsatisfactory. Having recognized this shortcoming in the training protocol for their models, the companies have begun tweaking their products to respond that they are unable to answer. However, we are also seeing the increasing prevalence of “deep research”–capable models such as GPT-5 and Gemini 2.5 Pro. These are able to dig deeper and deeper, uncovering more and more sources that may hold the desired answer.
Here’s an example prompted by my lifelong enthusiasm for baseball. In previous “Online: Trending Now” columns, I have advocated that we all provide more context and greater detail in our prompts. But in this case, I presented a simple one-sentence prompt, without context, to three leading frontier generative AI apps: “Who called a players’-only meeting during a rain delay on a Wednesday night in Cleveland?” I included no mention of what sport, what team, what season, etc. Claude Sonnet 4 responded,
“I’d need more context to answer this question accurately. There have been many baseball games with rain delays on Wednesday nights in Cleveland over the years, and players’-only meetings during such delays aren’t always widely reported or memorable events.
Could you provide more details about:
- What year or approximate timeframe this occurred?
- Which team(s) were involved?
- Any other context about the situation or why this meeting was significant?

With more specific information, I’d be better able to help identify who called the meeting and the circumstances surrounding it.”
Note that Claude, without deep research, did not hallucinate. It recognized this was likely about baseball, and it said it needed more information. Certainly, that is better than making up an answer, as we might have seen in previous months and years.
GPT-5 thought for 11 seconds and came up with the right answer in a succinct response.
Gemini 2.5 Pro thought for about the same length of time, dug through multiple sources, found the answer and confirming sources, and then responded in more detail,
Jason Heyward Rallied Cubs with Players’-Only Meeting During Historic World Series Rain Delay
During a tense rain delay in the decisive Game 7 of the 2016 World Series, Chicago Cubs right fielder Jason Heyward called a players’-only meeting that has since become a legendary moment in the franchise’s storied history. The brief stoppage in play occurred on a Wednesday night in Cleveland after the ninth inning, with the Cubs and the then–Cleveland Indians locked in a tie. With the weight of a 108-year championship drought on their shoulders, Heyward gathered his teammates in the weight room behind the visitor’s [sic] dugout at Progressive Field … Following the delay and Heyward’s impromptu meeting, the Cubs came out in the top of the tenth inning and scored two runs. The Indians would rally for one run in the bottom of the frame, but the Cubs ultimately held on to win 8-7, securing their first World Series title since 1908.
Know that hallucinations are fading from generative AI. We can be more confident in the recently released deep research models, which provide the kind of reliable research that we depend upon in higher education.