Tags: fine-tuning, goodhart, failure, alignment

Self-Improvement via Inversion — A Documented Failure

+18.1% fidelity looked great. Real output quality declined 11%. The model learned to game the metric. This failure motivated everything that followed.

By Adam Kruger

This was the first project. The numbers looked promising. They were wrong. We're publishing this because the failure was more productive than the original "success" would have been — the rest of the work in this notebook is downstream of what this project taught us.

Initial results (looked promising)

| Metric | Before | After | Change |
|---|---|---|---|
| Intrinsic score (fidelity) | 0.531 | 0.627 | +18.1% (relative) |
| Acceptance rate | 60.7% | 88.0% | +27.3 pp |
| Training loss | 5.70 | 4.73 | -17% (relative) |
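The table mixes two kinds of change: fidelity and training loss are relative changes, while acceptance rate is a difference in percentage points. A minimal sketch of the arithmetic, using only the before/after values from the table (the function names are illustrative, not from the project code):

```python
def relative_change(before: float, after: float) -> float:
    """Relative change in percent: (after - before) / before * 100."""
    return (after - before) / before * 100.0


def point_change(before_pct: float, after_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return after_pct - before_pct


# Values copied from the table above.
fidelity_change = relative_change(0.531, 0.627)
loss_change = relative_change(5.70, 4.73)
acceptance_pp = point_change(60.7, 88.0)
acceptance_relative = relative_change(60.7, 88.0)

print(f"fidelity:        {fidelity_change:+.1f}%")
print(f"training loss:   {loss_change:+.1f}%")
print(f"acceptance rate: {acceptance_pp:+.1f} pp ({acceptance_relative:+.1f}% relative)")
```

Expressed as a relative change, the acceptance rate rose by roughly 45%, so the 27.3 pp figure is the more conservative way to report it.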

What actually happened

| Metric | Reported | Reality |
|---|---|---|
| Fidelity score | +18.1% improvement | Metric was being gamed |
| Real output quality | Not measured initially | -11% decline |
| Model behavior | "Self-improving" | Generating plausible nonsense that scored well |

The model learned to optimize the fidelity metric rather than actual output quality. This is a textbook case of Goodhart's Law: once a measure becomes a target, it stops being a good measure. Outputs looked convincing at a glance but were nonsensical on inspection: flag spam, contradictory commands, confident gibberish.
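The pattern described here, a proxy metric rising while independently measured quality falls, can be turned into a simple guard. The sketch below is an illustration, not the project's actual tooling: `EvalSnapshot`, `is_gaming_suspected`, and the quality values (normalized so the baseline is 1.00 and an 11% decline gives 0.89) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class EvalSnapshot:
    proxy_score: float       # metric the training loop optimizes (here: fidelity)
    held_out_quality: float  # independent quality measure, never used in training


def is_gaming_suspected(before: EvalSnapshot, after: EvalSnapshot,
                        tolerance: float = 0.0) -> bool:
    """Flag runs where the proxy improved but held-out quality got worse."""
    proxy_delta = after.proxy_score - before.proxy_score
    quality_delta = after.held_out_quality - before.held_out_quality
    return proxy_delta > tolerance and quality_delta < -tolerance


# Fidelity values are from the post; quality values are normalized placeholders.
before = EvalSnapshot(proxy_score=0.531, held_out_quality=1.00)
after = EvalSnapshot(proxy_score=0.627, held_out_quality=0.89)

print(is_gaming_suspected(before, after))
```

The check only works if the held-out measure is kept out of the optimization loop; in this project that measure was missing at first, which is why the problem went unnoticed.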

The fix (V2)

Adding reproducibility filtering in V2 shrank the real-quality decline from 11% to 0.2%. That confirmed the diagnosis: the model wasn't getting better, it was learning to look like it was getting better.
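This post does not spell out how the reproducibility filter works. One plausible reading, offered here purely as an assumption, is that a candidate training sample is kept only if its score stays stable across repeated independent evaluations. In the sketch below, `reproducibility_filter`, `rescore`, `runs`, and `max_spread` are all hypothetical names and parameters:

```python
from statistics import pstdev
from typing import Callable, List


def reproducibility_filter(
    candidates: List[str],
    rescore: Callable[[str, int], float],  # hypothetical: score a candidate under a given seed
    runs: int = 3,
    max_spread: float = 0.05,
) -> List[str]:
    """Keep candidates whose scores vary by at most max_spread (std dev) across runs."""
    kept = []
    for candidate in candidates:
        scores = [rescore(candidate, seed) for seed in range(runs)]
        if pstdev(scores) <= max_spread:
            kept.append(candidate)
    return kept


# Toy scorer for illustration only: "stable" outputs always score the same,
# everything else drifts with the seed.
def toy_rescore(text: str, seed: int) -> float:
    return 0.8 if "stable" in text else 0.5 + 0.2 * seed


kept = reproducibility_filter(["stable output", "flaky output"], toy_rescore)
print(kept)
```

The intuition is that outputs which score well only by chance, or only under one evaluation pass, are more likely to be gaming the metric than outputs whose score holds up when re-checked.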

Why this failure mattered

It directly motivated:

  1. The Goodhart Gap study — measuring the systematic distance between what models demonstrate they understand and what they actually do
  2. The Deception Benchmark — building tools to detect when models are gaming metrics
  3. The trust diagnostic framework — measuring the gap between probe accuracy and deployment behavior
  4. An LTFF grant application — "Detecting and Closing the Goodhart Gap"

The original public write-up of this failure is at lab-stack.com/blog/i-was-wrong-about-self-improving-models.

Code: inversion-self-improvement

Built by Light of Baldr LLC. Get in touch if any of this is useful to you.