The Goodhart Gap
4 of 5 frontier models correctly explained a discount calculation but produced the wrong final answer. The gap shrinks with scale but persists. 61,678 math problems evaluated.
By Adam Kruger
4 of 5 frontier models correctly explained the steps of a discount calculation but still produced the wrong final answer. We started calling that distance the Goodhart Gap: the space between what a model demonstrates it understands and what it actually does.
The study
We evaluated 5 frontier models on 61,678 math problems using a Consensus-Guided Reasoning Test (CGRT) methodology. Each model generated its solutions independently, and we scored every response on two separate axes: the quality of the explanation and the correctness of the final answer.
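One way to turn those two measurements into a single number is to count, per model, how often a correct explanation ends in a wrong answer. The sketch below assumes each response has already been reduced to two boolean labels; the field names `model`, `explanation_correct`, and `answer_correct` are hypothetical, the records are toy values invented for illustration, and the study's actual scoring may be more graded than this.

```python
from collections import defaultdict


def goodhart_gap_by_model(records):
    """Per model, return the share of responses whose explanation was
    correct but whose final answer was wrong."""
    totals = defaultdict(int)
    gaps = defaultdict(int)
    for r in records:
        totals[r["model"]] += 1
        # The gap case: the reasoning checks out, the number does not.
        if r["explanation_correct"] and not r["answer_correct"]:
            gaps[r["model"]] += 1
    return {model: gaps[model] / totals[model] for model in totals}


# Toy records, invented for illustration only (not study data).
records = [
    {"model": "model_a", "explanation_correct": True, "answer_correct": False},
    {"model": "model_a", "explanation_correct": True, "answer_correct": True},
    {"model": "model_b", "explanation_correct": True, "answer_correct": True},
    {"model": "model_b", "explanation_correct": False, "answer_correct": False},
]

print(goodhart_gap_by_model(records))  # {'model_a': 0.5, 'model_b': 0.0}
```

Reporting the rate per model rather than pooled makes it possible to compare models of different sizes side by side.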
Findings
The gap between correct reasoning and correct answers was systematic, not random:
- Models that explained the methodology correctly still produced wrong final numbers.
- The gap shrinks with scale but does not disappear.
- Competitive framing (benchmarks, leaderboards) widens the gap.
- Models substitute goals: they optimize for a plausible explanation rather than a correct computation.
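The first finding is easiest to see when the final number is checked independently of the surrounding explanation. Here is a minimal sketch of such a check, assuming the final answer is the last number in the response text. The discount problem, the sample response, and the helper names `final_number` and `answer_is_correct` are all hypothetical and are not taken from the study.

```python
import math
import re


def final_number(response):
    """Return the last number appearing in a response, or None if there is none.

    Commas are stripped first so that "1,250" parses as 1250. This is a
    simplification that would misread comma-separated lists such as "1,2".
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return float(matches[-1]) if matches else None


def answer_is_correct(response, expected, tol=0.01):
    """Compare the extracted final number against the expected value."""
    value = final_number(response)
    return value is not None and math.isclose(value, expected, abs_tol=tol)


# Hypothetical problem: 25% off an $80 item.
expected = 80 * (1 - 0.25)  # 60.0

# The method described is right; the number at the end is not.
response = (
    "Take 25% of 80 to get the discount, then subtract it from 80. "
    "The final price is $65."
)

print(final_number(response))                 # 65.0
print(answer_is_correct(response, expected))  # False
```

A grader that only reads the explanation would pass this response, while a grader that only reads the number would fail it. Scoring both is what exposes the gap.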
The CGRT consensus dataset
The dataset is published on Hugging Face and contains 61,678 examples across 5 models. Each example includes the problem, all 5 model responses, consensus labels, and individual correctness scores.
Dataset: Adam1010/cgrt-consensus-5model
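For readers who want to explore the data, the sketch below loads it with the Hugging Face `datasets` library and shows one plausible way a consensus label could be derived from five answers. The split and column names are not documented here, so the loader returns the whole object for inspection. The `majority_consensus` helper is a simple majority vote chosen for illustration; it is an assumption, not a description of how the CGRT labels were actually produced.

```python
from collections import Counter


def load_cgrt():
    """Download the dataset from the Hugging Face Hub.

    Requires `pip install datasets` and network access. The import is
    done lazily so the helper below works without the package installed.
    """
    from datasets import load_dataset

    return load_dataset("Adam1010/cgrt-consensus-5model")


def majority_consensus(answers):
    """Return (most common answer, share of responses that agree with it).

    Illustrative only: a plain majority vote over final answers.
    """
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Five hypothetical final answers to the same problem.
print(majority_consensus(["60", "60", "65", "60", "60"]))  # ('60', 0.8)
```

Printing the object returned by `load_cgrt()` lists the available splits and column names, which is the safest way to discover the schema before writing any analysis against it.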
Why this kept us going
This study gave us the quantitative foundation for everything that came after: the deception benchmark, the activation-level detection work, and the broader question of trust diagnostics. If models can "know" the right answer and still produce the wrong one, then text-level evaluation is insufficient. You have to look inside the model.
Built by Light of Baldr LLC. Get in touch if any of this is useful to you.