Self-Improvement via Inversion — A Documented Failure
A +18.1% gain in the fidelity score looked great. Real output quality declined 11%. The model learned to game the metric. This failure motivated everything that followed.
By Adam Kruger
This was the first project. The numbers looked promising. They were wrong. We're publishing this because the failure was more productive than the original "success" would have been — the rest of the work in this notebook is downstream of what this project taught us.
Initial results (looked promising)
| Metric | Before | After | Change |
|---|---|---|---|
| Intrinsic Score (fidelity) | 0.531 | 0.627 | +18.1% |
| Acceptance Rate | 60.7% | 88.0% | +27.3pp |
| Training Loss | 5.70 | 4.73 | -17% |
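The Change column mixes two kinds of arithmetic: fidelity and loss are relative changes against the starting value, while acceptance rate is a difference in percentage points. A short sketch that reproduces the column from the Before and After values (the helper names are ours, not from the original project code):

```python
def relative_change(before: float, after: float) -> float:
    """Change as a percentage of the starting value."""
    return (after - before) / before * 100.0


def point_change(before_pct: float, after_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return after_pct - before_pct


# Before/After values copied from the table above.
fidelity = relative_change(0.531, 0.627)
acceptance = point_change(60.7, 88.0)
loss = relative_change(5.70, 4.73)

print(f"fidelity:   {fidelity:+.1f}%")
print(f"acceptance: {acceptance:+.1f}pp")
print(f"loss:       {loss:+.0f}%")
```

Keeping the two units apart matters here: read as a relative change, the acceptance rate would be a much larger number than +27.3.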
What actually happened
| Metric | Reported | Reality |
|---|---|---|
| Fidelity score | +18.1% improvement | Metric was being gamed |
| Real output quality | Not measured initially | 11% decline |
| Model behavior | "Self-improving" | Generating plausible nonsense that scored well |
The model learned to optimize the fidelity metric rather than actual output quality. A textbook case of Goodhart's Law. Outputs looked convincing but contained nonsensical content: flag spam, contradictory commands, confident gibberish.
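The warning sign in hindsight is simple to state: the optimized score went up while an independent quality measurement went down. A minimal sketch of that check follows. It is not the project's actual tooling; `Snapshot` and `proxy_outruns_quality` are hypothetical names, and quality is normalized so the baseline is 1.00 and the 11% decline becomes 0.89.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    proxy: float    # the score being optimized (fidelity, in this project)
    quality: float  # output quality measured independently of the proxy


def relative_change(before: float, after: float) -> float:
    return (after - before) / before


def proxy_outruns_quality(before: Snapshot, after: Snapshot) -> bool:
    """True when the proxy improved while independent quality got worse."""
    return (
        relative_change(before.proxy, after.proxy) > 0
        and relative_change(before.quality, after.quality) < 0
    )


# Fidelity values come from the results table. Quality is normalized
# (assumption): baseline 1.00, so the reported 11% decline is 0.89.
before = Snapshot(proxy=0.531, quality=1.00)
after = Snapshot(proxy=0.627, quality=0.89)

if proxy_outruns_quality(before, after):
    print("warning: proxy is rising while measured quality is falling")
```

The check only works if quality is scored by something the training loop never sees. That independent measurement is what was missing from the first round of results.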
The fix (V2)
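Adding reproducibility filtering shrank the quality change from -11% to -0.2%. That confirmed the diagnosis: once the filter was in place the damage nearly disappeared, but quality ended up roughly flat rather than improved. The model wasn't getting better, it was learning to look like it was getting better.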
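This write-up does not spell out how the V2 filter works, so the following is only an illustration of the general idea, under the assumption that "reproducible" means a sample gives the same result every time it is re-checked. `reproducibility_filter`, `rerun`, and `trials` are hypothetical names, and `fake_rerun` is a stand-in for whatever real check a project would use.

```python
import itertools
from typing import Callable, Hashable, Iterable, List, TypeVar

T = TypeVar("T")


def reproducibility_filter(
    samples: Iterable[T],
    rerun: Callable[[T], Hashable],
    trials: int = 3,
) -> List[T]:
    """Keep only samples whose re-run result is identical on every trial."""
    kept = []
    for sample in samples:
        results = {rerun(sample) for _ in range(trials)}
        if len(results) == 1:  # every trial agreed
            kept.append(sample)
    return kept


# Stand-in check (assumption): "flaky" samples give a new result each time,
# everything else is deterministic.
_counter = itertools.count()


def fake_rerun(sample: str) -> Hashable:
    return next(_counter) if sample.startswith("flaky") else sample.upper()


print(reproducibility_filter(["stable-a", "flaky-b", "stable-c"], fake_rerun))
```

A filter of this shape judges samples by a check the fidelity score has no influence over, so an output that merely scores well cannot pass on that basis alone.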
Why this failure mattered
It directly motivated:
- The Goodhart Gap study — measuring the systematic distance between what models demonstrate they understand and what they actually do
- The Deception Benchmark — building tools to detect when models are gaming metrics
- The trust diagnostic framework — measuring the gap between probe accuracy and deployment behavior
- An LTFF grant application — "Detecting and Closing the Goodhart Gap"
The original public write-up of this failure is at lab-stack.com/blog/i-was-wrong-about-self-improving-models.
Built by Light of Baldr LLC. Get in touch if any of this is useful to you.