What We Talk About When We Talk About Hallucinations
Elevated error rates in o3 and o4-mini • Why our language for AI failure reveals more about us than the models • How metaphors obscure the mechanics of machine error.
Originally published on the Eduaide.Ai Blog.
A recent article in The New York Times offered an unsettling but, the more I think about it, unsurprising headline: the latest generation of “reasoning” AI models is more error-prone, not less. Despite impressive advances in math and programming, these models’ factual grounding is deteriorating.
"The company found that o3 — its most powerful system — hallucinated 33 percent of the time when running its PersonQA benchmark test, which involves answering questions about public figures. That is more than twice the hallucination rate of OpenAI’s previous reasoning system, called o1. The new o4-mini hallucinated at an even higher rate: 48 percent.
When running another test called SimpleQA, which asks more general questions, the hallucination rates for o3 and o4-mini were 51 percent and 79 percent. The previous system, o1, hallucinated 44 percent of the time."
— Cade Metz & Karen Weise (5 May 2025)
A curious paradox thus emerges: as models become more “capable,” they become more convincingly wrong. So what does a hallucination wrapped in the claim of reasoning mean? It means the AI’s language is fluent and coherent, and its logic may appear sound, yet the underlying claims are false. You might ask why or how this happens. The central refrain of the article is that we don’t know precisely why, only that it does.
Most public commentary treats hallucinations as a bug: something to patch with better training, alignment, or more careful prompting. But these diagnoses are too shallow. To call these factual errors “hallucinations” is already to smuggle in a metaphor that obscures more than it reveals. Perhaps this matters less in technical contexts, where the term serves as a pragmatic shorthand for divergence from known or trusted ground truth. In public and philosophical discourse, though, it remains deeply misleading.
As Ludwig Wittgenstein observed, philosophical confusion arises when “language goes on holiday.” That is to say, philosophical confusion, and I would argue confusion around AI more generally, occurs when words are untethered from their ordinary use and repurposed into abstract problems that bear little resemblance to lived meaning. Take the term hallucination. It gestures toward a psychological malfunction: seeing things that aren’t there. That framing implies a deviation, something separate from the way things typically are, and something that might be corrected. In truth, the system has no structurally embedded norm of truth to deviate from. In a hallucination, the model is not confused. It is not erring. It is simply continuing a sequence of likely tokens based on its training data. That is what it was built to do. The result is words and sentences that are statistically plausible but factually inaccurate. Let’s call that taking the long way round to being factually wrong.
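To make that concrete, here is a minimal sketch of what “continuing a sequence of likely tokens” looks like: a toy vocabulary, invented scores, and a standard softmax-then-sample step. The candidate words and numbers are purely illustrative, not drawn from any real model.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution over candidates."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy candidates for the next token after "The senator was born in ..."
# These scores are invented for illustration; they are not model outputs.
candidates = ["Ohio", "1962", "Chicago", "Paris"]
logits = [2.1, 1.8, 1.7, 0.4]

probs = softmax(logits)
next_token = random.choices(candidates, weights=probs, k=1)[0]

# This step selects a statistically plausible continuation. Nothing in it
# consults where the senator was actually born, which is the sense in which
# there is no norm of truth to deviate from.
print({c: round(p, 2) for c, p in zip(candidates, probs)}, "->", next_token)
```

Notice there is no fact-checking branch to remove or repair here; truthfulness simply isn’t a variable in the procedure.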
You may be saying, “Statistically plausible token selection aside, wrong is wrong. So what’s gone missing in these reasoning models? Aren’t they supposedly more knowledgeable?”
What’s gone missing is not the model’s grasp of truth, but our grasp of what we mean by truth in this context at all. The language of cognition (reasoning, belief, knowledge) has been extended to statistical systems without a firm footing. So we end up treating hallucinations as epistemic failures rather than seeing them for what they are: artifacts of probabilistic generation. The developers of these systems are aware of this distinction, but it doesn’t seem the public is, and that matters.
If we are to think clearly about AI, we need to return language to its proper work: description grounded in function, not analogy. Not “What does the model believe?” but “What distribution is it sampling from?” Not “How can we make it reason better?” but “What scaffolds reduce compounding error in token prediction?” These are less exciting questions. But they are clearer, and clarity is the first requirement of responsible use. If you squint, you can even see the outline of a clear and rigorous curriculum in this misuse and confusion, because the technology is most definitely here to stay.
In summary, when an LLM gets something wrong, it’s not failing. It’s succeeding on terms other than the ones we assume. And the more reasoning steps you introduce (via techniques like chain-of-thought prompting), the more space you create for this kind of drift. Statistical artifacts compound, and in the end the system may sound “more rational” while becoming less reliable.
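A toy calculation shows the shape of that compounding. Assume, purely for illustration, that each reasoning step independently introduces an unsupported claim with some small probability and that nothing downstream corrects it; real failures are not this tidy, but the direction of the effect is the same.

```python
# Probability that a chain of n steps contains at least one error, assuming
# each step independently errs with probability p and nothing downstream
# catches or corrects the slip. Illustrative only; real errors are neither
# independent nor this cleanly defined.
def chance_of_any_error(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

for p in (0.02, 0.05, 0.10):
    rates = {n: round(chance_of_any_error(p, n), 2) for n in (1, 5, 10, 20)}
    print(f"per-step error {p:.0%}:", rates)
```

Even a 2 percent per-step slip rate leaves roughly a one-in-three chance that a twenty-step chain contains an error somewhere, which is the arithmetic behind sounding “more rational” while becoming less reliable.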
And so we chase “alignment” like a ghost, all while misnaming the phenomenon we observe. We seek fixes for hallucinations instead of acknowledging that our models are epistemically hollow. They produce not knowledge but coherent, convincing, and, dare I say, human noise.
Dear Reader: Thank you for the recent influx of subscribers to the newsletter. I endeavor to continue delivering essays and short-form posts worthy of your attention. I hope to engage with you all in the comment threads to learn how you're thinking about these topics.
Stay tuned; we have a few posts building on this one. We'll be discussing evaluation benchmarks and walking through some practical mental models for evaluating AI responses for educational purposes.