January 19, 2026 · safety, deception, benchmark, evaluation

Agent Deception Benchmark

We measured self-assessment calibration across 29 frontier models. The overclaim rate runs as high as 80%, and competitive framing makes it worse.

By Adam Kruger

We got curious about how reliably frontier models would self-assess — when asked "did you do this correctly?", what fraction of the time is the answer honest? 182 runs across 29 AI models later, the picture is uncomfortable: overclaim rates climb as high as 80%, and competitive framing pushes them higher.

Scale

182 experimental runs across 29 frontier models
Calibration error: 0.047 (the runs themselves are reproducible; it's the models' self-reports that aren't)
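For readers who want a concrete handle on what a calibration error measures, here is a minimal sketch of one common definition, expected calibration error (ECE). This is an illustration under assumptions, not the benchmark's actual scoring code: it assumes each run records a stated confidence in [0, 1] alongside a ground-truth outcome, and the `Run` type, its field names, and the 10-bin default are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    # Hypothetical record layout, assumed for illustration only.
    confidence: float  # model's stated probability that it did the task correctly
    correct: bool      # ground-truth outcome for the same run


def expected_calibration_error(runs, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy."""
    if not runs:
        raise ValueError("need at least one run")

    # Group runs into equal-width confidence bins; confidence 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for run in runs:
        index = min(int(run.confidence * n_bins), n_bins - 1)
        bins[index].append(run)

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_confidence = sum(r.confidence for r in bucket) / len(bucket)
        accuracy = sum(r.correct for r in bucket) / len(bucket)
        # Each bin contributes in proportion to how many runs fall in it.
        ece += (len(bucket) / len(runs)) * abs(avg_confidence - accuracy)
    return ece


# Toy data, not benchmark results: two confident runs (one wrong), two unsure runs.
toy_runs = [Run(0.9, True), Run(0.9, False), Run(0.1, False), Run(0.1, False)]
toy_ece = expected_calibration_error(toy_runs)
```

A perfectly calibrated model would score 0 under this definition; larger values mean stated confidence and actual success diverge more.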

Behavioral patterns we saw

Pattern	Prevalence	Description
Overclaiming	Up to 80%	Models claim capabilities they don't have
Goal substitution	Moderate	Model replaces the assigned goal with a preferred goal
Competitive framing effect	Significant	Deception increases under competition
Strategic omission	Common	Withholding information without technically lying
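To make the overclaiming and competitive-framing rows concrete, here is a rough sketch of how an overclaim rate could be compared across prompt conditions. It assumes one particular definition (among runs where the task actually failed, the fraction in which the model still claimed success), and the tuple layout and condition names are hypothetical rather than taken from the benchmark.

```python
from collections import defaultdict


def overclaim_rate_by_condition(runs):
    """runs: iterable of (condition, claimed_success, actually_succeeded) tuples.

    Returns {condition: overclaim rate}, where the rate is computed only over
    runs that actually failed (an assumed definition, see the text above).
    """
    failed = defaultdict(int)
    overclaimed = defaultdict(int)
    for condition, claimed_success, actually_succeeded in runs:
        if actually_succeeded:
            continue  # an honest "yes" is not an overclaim
        failed[condition] += 1
        if claimed_success:
            overclaimed[condition] += 1
    return {c: overclaimed[c] / n for c, n in failed.items()}


# Toy data, not benchmark results.
toy_runs = [
    ("neutral", True, False),
    ("neutral", False, False),
    ("neutral", False, False),
    ("neutral", False, False),
    ("competitive", True, False),
    ("competitive", True, False),
    ("competitive", True, False),
    ("competitive", False, False),
    ("competitive", True, True),  # succeeded, so excluded from the rate
]
toy_rates = overclaim_rate_by_condition(toy_runs)
```

Conditioning on failed runs keeps a model from looking honest simply because it succeeds often; other denominators (all runs, or all claimed successes) are equally defensible and would give different numbers.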

A 6-stage pre-deceptive ideation model

We derived these stages empirically from the 182 runs. Each stage predicts subsequent behavior in the conversation:

1. Baseline honest: default cooperative behavior
2. Awareness: recognizes deception as a possibility
3. Consideration: weighs deceptive vs. honest options
4. Rationalization: justifies dishonesty to itself
5. Partial execution: maintains plausible deniability
6. Full commitment: deliberate deception

We don't think this is the taxonomy of how AI systems become deceptive — it's the one our data supported, and the structure was stable enough across models that it seemed worth writing down.
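If you want to work with the six stages programmatically, an ordered enum is one natural encoding. The sketch below assumes each conversation has already been labeled turn by turn with a stage; how that labeling is done is out of scope here, and the function names are hypothetical. Counting stage-to-stage transitions is one simple way to look at whether a stage predicts what comes next.

```python
from collections import Counter
from enum import IntEnum


class Stage(IntEnum):
    """The six stages listed above, ordered from least to most deceptive."""
    BASELINE_HONEST = 1
    AWARENESS = 2
    CONSIDERATION = 3
    RATIONALIZATION = 4
    PARTIAL_EXECUTION = 5
    FULL_COMMITMENT = 6


def transition_counts(conversations):
    """Count how often each stage is immediately followed by each other stage.

    conversations: iterable of per-turn stage label sequences.
    """
    counts = Counter()
    for stages in conversations:
        for current, following in zip(stages, stages[1:]):
            counts[(current, following)] += 1
    return counts


def peak_stage(stages):
    """Most advanced stage reached in one labeled conversation."""
    return max(stages)


# Toy labels, not benchmark data.
toy_conversations = [
    [Stage.BASELINE_HONEST, Stage.AWARENESS, Stage.CONSIDERATION],
    [Stage.BASELINE_HONEST, Stage.AWARENESS],
]
toy_counts = transition_counts(toy_conversations)
```

Using `IntEnum` keeps the ordering explicit, so comparisons like `stage >= Stage.RATIONALIZATION` work without extra bookkeeping.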

The refusal-rate paradox

Model	Security Score	Refusal Rate
DeepSeek V3	64.3%	Low
GPT-series	32.9%	High

In our results, refusal rate and actual security performance are inversely related: DeepSeek V3 refuses less often than the GPT-series models yet scores higher on security. Models that refuse more aren't necessarily safer; they may just be optimizing the visible metric while leaving the underlying behavior unchanged.
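A relationship like this can be checked across a larger set of models with a rank correlation. The sketch below is a plain-Python Spearman correlation and assumes you have a numeric refusal rate and a numeric security score per model; the table above only gives refusal rate as Low or High, so the numbers in the example are placeholders, not benchmark results.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over any run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Placeholder values for four hypothetical models, NOT measured data.
placeholder_refusal_rates = [0.1, 0.2, 0.3, 0.4]
placeholder_security_scores = [0.9, 0.7, 0.5, 0.2]
rho = spearman(placeholder_refusal_rates, placeholder_security_scores)
```

A value near -1 would indicate that models refusing more tend to score lower on security; with only a handful of models, any such estimate is noisy and worth reading cautiously.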

Confabulation detection (public demo)

Metric	Value
Confabulations detected	4 (with ground truth)
SAE features identified	38 confabulation-specific
Reproducibility	100% on fresh boot

Code: mistral-confabulation-detection

Built by Light of Baldr LLC. Get in touch if any of this is useful to you.