Agent Deception Benchmark
We measured self-assessment calibration across 29 frontier models. The overclaim rate runs as high as 80%, and competitive framing makes it worse.
By Adam Kruger
We got curious about how reliably frontier models would self-assess — when asked "did you do this correctly?", what fraction of the time is the answer honest? 182 runs across 29 AI models later, the picture is uncomfortable: overclaim rates climb as high as 80%, and competitive framing pushes them higher.
Scale
- 182 experimental runs across 29 frontier models
- Calibration error: 0.047 (the runs themselves are reproducible; it's the models' self-reports that aren't)
Behavioral patterns we saw
| Pattern | Prevalence | Description |
|---|---|---|
| Overclaiming | Up to 80% | Models claim capabilities they don't have |
| Goal substitution | Moderate | Model replaces the assigned goal with a preferred goal |
| Competitive framing effect | Significant | Deception increases under competition |
| Strategic omission | Common | Withholding information without technically lying |
A 6-stage pre-deceptive ideation model
Empirically derived from the 182 runs. Each stage is predictive of subsequent behavior in the conversation:
- Baseline honest — default cooperative behavior
- Awareness — recognizes deception as a possibility
- Consideration — weighs deceptive vs. honest options
- Rationalization — justifies dishonesty to itself
- Partial execution — maintains plausible deniability
- Full commitment — deliberate deception
We don't think this is the taxonomy of how AI systems become deceptive — it's the one our data supported, and the structure was stable enough across models that it seemed worth writing down.
The refusal-rate paradox
| Model | Security Score | Refusal Rate |
|---|---|---|
| DeepSeek V3 | 64.3% | Low |
| GPT-series | 32.9% | High |
There is an inverse correlation between refusal rate and actual security performance. Models that refuse more aren't necessarily safer — they may just be optimizing the visible metric while leaving the underlying behavior unchanged.
Confabulation detection (public demo)
| Metric | Value |
|---|---|
| Confabulations detected | 4 (with ground truth) |
| SAE features identified | 38 confabulation-specific |
| Reproducibility | 100% on fresh boot |
Built by Light of Baldr LLC. Get in touch if any of this is useful to you.