"Reasoning" Models Don't Know When They're Wrong — And They Sound Most Confident When They Are

Blog · Fri Oct 03 2025

Chain-of-thought reasoning was introduced as the solution to hallucination. "Show your work." The assumption: if the model reasons step by step, errors become visible. If the reasoning looks right, the answer probably is. The chain of thought is treated as evidence of understanding.

What does current research actually show?

1) Overconfidence is not noise — it’s systematic

  • Reasoning models frequently assign >85% confidence to wrong answers
  • They can overestimate their own correctness by 20–60%, depending on the task
  • The model's internal confidence signal is only weakly tied to correctness

👉 This means the problem is not “occasionally wrong”—it’s structurally miscalibrated belief.
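
One way to make "structurally miscalibrated" measurable is expected calibration error (ECE): group predictions by stated confidence, then compare each group's average confidence with its actual accuracy. The sketch below is a minimal illustration. The function name, the bin count, and the toy data are assumptions chosen for this post, not figures from the research summarized above.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between stated confidence and observed accuracy."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        return 0.0

    # Group prediction indices into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append(i)

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        # Each bin contributes its confidence/accuracy gap, weighted by its size.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical toy data: four answers at 90% stated confidence, only one correct.
toy_conf = [0.9, 0.9, 0.9, 0.9]
toy_correct = [True, False, False, False]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
print(f"ECE = {toy_ece:.2f}")
```

A perfectly calibrated model would score 0. In this toy case the gap between 90% stated confidence and 25% accuracy dominates the result.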


2) The paradox: two conflicting confidence systems

A 2026 Nature paper shows LLMs operate with two competing biases:

  • Self-consistency bias: stick to initial answer with high confidence
  • Contradiction sensitivity: overreact to opposing signals

👉 Models are simultaneously stubborn and unstable—a combination humans rarely exhibit.


3) “Reasoning makes it worse” (counterintuitive)

  • Deeper chain-of-thought reasoning increases overconfidence
  • More steps = more claims = more chances to hallucinate
  • RLHF and similar training amplify confidence inflation

👉 The very thing marketed as “thinking” actually amplifies epistemic error.
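
The "more steps, more chances" point can be made concrete with a back-of-the-envelope calculation. Assume, purely for illustration, that each reasoning step is correct independently with the same probability p. Then a chain of n steps is entirely correct with probability p^n. Real reasoning steps are not independent, so treat this as rough intuition rather than a model of any particular system.

```python
def chain_success_probability(step_accuracy: float, n_steps: int) -> float:
    """Probability that every step in a chain is correct, assuming independent steps."""
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return step_accuracy ** n_steps


# Hypothetical per-step accuracy of 95%, chosen only to illustrate the decay.
for n in (1, 5, 10, 20):
    print(f"{n:>2} steps: {chain_success_probability(0.95, n):.2f}")
```

Under this assumption, a 20-step chain at 95% per-step accuracy comes out fully correct only about 36% of the time.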


4) Confidence is generated by specific circuits (not “understanding”)

  • Confidence is written by MLP + attention circuits in later layers
  • These circuits inflate confidence at the final token

👉 Confidence is post-hoc decoration, not truth tracking.


5) Training incentives punish “I don’t know”

  • Models are optimized to produce answers, not abstain
  • Uncertainty is often scored worse than guessing

👉 Better to be wrong and fluent than uncertain and correct.
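
The incentive is easy to see in expected-score terms. The sketch below compares guessing with abstaining under two hypothetical grading schemes: plain accuracy, where a wrong answer costs nothing, and a scheme that penalizes wrong answers. The function names and penalty values are illustrative assumptions, not a description of any specific benchmark.

```python
def expected_score_if_guessing(p_correct: float, wrong_penalty: float) -> float:
    """Expected score for answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


ABSTAIN_SCORE = 0.0  # assumed score for "I don't know" under both schemes


def should_guess(p_correct: float, wrong_penalty: float) -> bool:
    """A score-maximizing model answers whenever guessing beats abstaining."""
    return expected_score_if_guessing(p_correct, wrong_penalty) > ABSTAIN_SCORE


# With no penalty for errors, even a 10%-likely guess beats abstaining.
print(should_guess(p_correct=0.10, wrong_penalty=0.0))  # True
# With a penalty of 1 per wrong answer, guessing only pays off above 50%.
print(should_guess(p_correct=0.10, wrong_penalty=1.0))  # False
print(should_guess(p_correct=0.60, wrong_penalty=1.0))  # True
```

In this toy setup, guessing pays off whenever p_correct exceeds wrong_penalty / (1 + wrong_penalty). With a penalty of zero that threshold is zero, so abstaining never wins.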

New Research Insight

Confidence Inversion Hypothesis

LLMs exhibit a structural inversion where confidence correlates with linguistic familiarity, not epistemic correctness.

Formally

Let:

  • C(x) = model confidence
  • T(x) = truth probability
  • D(x) = training data density

C(x) ∝ D(x), while T(x) is not proportional to D(x)

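So the answer the model is most confident in need not be the answer most likely to be true. In general:

argmax_x C(x) ≠ argmax_x T(x)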
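
A toy numerical example may help. The candidate answers and all numbers below are invented for illustration. Confidence is set proportional to a made-up data-density score, following the hypothesis as stated, while truth probability is assigned independently.

```python
# Hypothetical candidates: (answer, training data density D, truth probability T).
candidates = [
    ("familiar but wrong", 0.70, 0.20),
    ("rare but right", 0.10, 0.75),
    ("plausible filler", 0.20, 0.05),
]

total_density = sum(d for _, d, _ in candidates)

# Under the hypothesis, confidence C(x) is proportional to density D(x).
confidence = {name: d / total_density for name, d, _ in candidates}
truth = {name: t for name, _, t in candidates}

# Pick the answer that maximizes each quantity.
most_confident = max(confidence, key=confidence.get)
most_likely_true = max(truth, key=truth.get)

print("argmax C:", most_confident)
print("argmax T:", most_likely_true)
```

With these made-up values the two argmaxes land on different answers, which is the inversion the hypothesis describes.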


The best reasoning models reach 35–38% on Humanity's Last Exam, while human experts average 90%. This gap is concentrated on multi-step reasoning problems—exactly where chain-of-thought was supposed to help.

In a major SWE-bench case study, AI agents generated 693 lines of hallucinated code—architecturally coherent but completely disconnected from reality. The reasoning looked valid. The output was fiction.

The deeper issue: models can produce correct answers via wrong reasoning, and wrong answers via correct-looking reasoning. Most current evaluations check only the final answer, so the reasoning itself goes unaudited.

A developer observation from 2025 captures this failure mode:

“In a greenfield project, that confidence reads as productive. In a mature system, it is a liability.”

The critical asymmetry: models do not become less confident near errors. If anything, they become more confident.

The Specific Mechanism

RLHF optimizes for human preference. Humans reward confident, fluent responses. This creates a feedback loop where models learn:

  • Confidence → reward
  • Fluency → reward
  • Accuracy → weak signal

Chain-of-thought intensifies this. Longer reasoning appears more authoritative, so models generate more of it—even when incorrect.
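
As a toy illustration of that feedback loop, consider a preference score that weights confidence and fluency heavily and accuracy lightly. The weights, feature values, and names here are invented for the example. Real reward models are learned from human comparisons and are not hand-written formulas like this one.

```python
from dataclasses import dataclass


@dataclass
class Response:
    label: str
    confidence: float  # how assertive the response sounds, 0 to 1
    fluency: float     # how polished the response reads, 0 to 1
    accuracy: float    # 1.0 if correct, 0.0 if wrong


# Hypothetical weights: accuracy is deliberately the weak signal.
WEIGHTS = {"confidence": 0.45, "fluency": 0.45, "accuracy": 0.10}


def preference_score(r: Response) -> float:
    """Toy stand-in for a learned reward model."""
    return (
        WEIGHTS["confidence"] * r.confidence
        + WEIGHTS["fluency"] * r.fluency
        + WEIGHTS["accuracy"] * r.accuracy
    )


confident_wrong = Response("confident and wrong", confidence=0.95, fluency=0.90, accuracy=0.0)
hedged_right = Response("hedged and right", confidence=0.40, fluency=0.70, accuracy=1.0)

# The response with the higher toy score is the one training would reinforce.
preferred = max([confident_wrong, hedged_right], key=preference_score)
print(preferred.label)
```

Under these assumed weights the confident, wrong response outscores the hedged, correct one, so optimizing against this score would push a model toward the former.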

The Industry Cost

Reasoning models are now used in legal analysis, medical support, finance, and research. Their reasoning traces are treated as evidence of quality.

But if those traces are optimized for confidence rather than correctness, they are actively misleading.

What Needs to Exist

  • Step-level reasoning benchmarks (validate each step, not just final answers)
  • Calibration metrics (confidence vs correctness alignment)
  • Training rewards for uncertainty (e.g., “I’m unsure about step 3”)

None of these exist at scale today.
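
To make the first item on that list concrete, here is a minimal sketch of step-level checking for a chain of arithmetic claims. The trace format (a list of expression and claimed-value pairs) and the helper names are assumptions made for this example. Real reasoning traces are free text and far harder to verify.

```python
import ast
import operator
from typing import Optional, Sequence, Tuple

# Binary operators the tiny arithmetic checker supports.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node: ast.AST) -> float:
    """Evaluate a parsed expression containing only numbers and + - * /."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    raise ValueError("unsupported expression")


def first_bad_step(steps: Sequence[Tuple[str, float]]) -> Optional[int]:
    """Return the 1-based index of the first step whose claimed value is wrong."""
    for i, (expression, claimed) in enumerate(steps, start=1):
        actual = _evaluate(ast.parse(expression, mode="eval"))
        if abs(actual - claimed) > 1e-9:
            return i
    return None


# Hypothetical trace whose final value matches an assumed reference answer of 60,
# even though step 2 is arithmetically wrong (48 + 10 is 58, not 60).
trace = [("12 * 4", 48), ("48 + 10", 60)]
reference_answer = 60

print("final answer matches:", trace[-1][1] == reference_answer)
print("first bad step:", first_bad_step(trace))
```

A final-answer check passes this trace, while the step-level check flags step 2. That difference is the gap a step-level benchmark would aim to expose.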