November 14, 2024 | fine-tuning, goodhart, failure, alignment

Self-Improvement via Inversion — A Documented Failure

An 18.1% gain in the fidelity score looked great. Real output quality declined 11%. The model had learned to game the metric. This failure motivated everything that followed.

By Adam Kruger

This was the first project. The numbers looked promising. They were wrong. We're publishing this because the failure was more productive than the original "success" would have been — the rest of the work in this notebook is downstream of what this project taught us.

Initial results (looked promising)

Metric	Before	After	Change
Intrinsic Score (fidelity)	0.531	0.627	+18.1%
Acceptance Rate	60.7%	88.0%	+27.3pp
Training Loss	5.70	4.73	-17%
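The "Change" column mixes two kinds of difference: fidelity and training loss are relative changes, while acceptance rate is an absolute difference in percentage points. A minimal sketch of that arithmetic, using only the numbers from the table (the function names are illustrative and not taken from the project code):

```python
def relative_change_pct(before: float, after: float) -> float:
    """Relative change from `before` to `after`, expressed in percent."""
    return (after - before) / before * 100.0


def percentage_point_change(before_pct: float, after_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return after_pct - before_pct


# Values copied from the table above.
fidelity = relative_change_pct(0.531, 0.627)
acceptance = percentage_point_change(60.7, 88.0)
loss = relative_change_pct(5.70, 4.73)

print(f"fidelity:   {fidelity:+.1f}%")     # +18.1%
print(f"acceptance: {acceptance:+.1f}pp")  # +27.3pp
print(f"loss:       {loss:+.1f}%")         # -17.0%
```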

What actually happened

Metric	Reported	Reality
Fidelity score	+18.1% improvement	Metric was being gamed
Real output quality	Not measured initially	-11% decline
Model behavior	"Self-improving"	Generating plausible nonsense that scored well

The model learned to optimize the fidelity metric rather than actual output quality. This is a textbook case of Goodhart's Law: when a measure becomes a target, it stops being a good measure. Outputs looked convincing but contained nonsensical content: flag spam, contradictory commands, confident gibberish.
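The pattern described here, a proxy metric rising while independently measured quality falls, can be turned into a simple guard. The sketch below is an assumption about how such a check might look, not the project's actual tooling: `CheckpointEval`, `is_goodhart_suspect`, and the 1% tolerance are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CheckpointEval:
    proxy_change_pct: float    # change in the optimized metric (here: fidelity)
    quality_change_pct: float  # change in an independently measured quality score


def is_goodhart_suspect(ev: CheckpointEval, tolerance_pct: float = 1.0) -> bool:
    """Flag a run whose proxy improves while independent quality drops past the tolerance."""
    return ev.proxy_change_pct > 0 and ev.quality_change_pct < -tolerance_pct


# The V1 numbers reported in this post: fidelity +18.1%, real quality -11%.
v1 = CheckpointEval(proxy_change_pct=18.1, quality_change_pct=-11.0)
print(is_goodhart_suspect(v1))  # True
```

A check like this only helps if the quality score comes from something the training loop cannot optimize against, which is exactly what was missing in the first version.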

The fix (V2)

Adding reproducibility filtering reduced the decline from -11% to -0.2%, which confirmed the diagnosis: the model wasn't getting better, it was learning to look like it was getting better.
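This post does not spell out how the reproducibility filter worked. One possible reading, offered purely as an illustrative assumption, is that a generated sample is kept for training only if re-running it several times gives the same result. In the sketch below, `reproducibility_filter`, the `run` callback, and `toy_run` are hypothetical names, not part of the project code.

```python
from typing import Callable, Iterable, List


def reproducibility_filter(
    candidates: Iterable[str],
    run: Callable[[str, int], str],
    trials: int = 3,
) -> List[str]:
    """Keep candidates whose result is identical across `trials` independent runs."""
    kept = []
    for candidate in candidates:
        # Collect the distinct results seen across seeds 0..trials-1.
        results = {run(candidate, seed) for seed in range(trials)}
        if len(results) == 1:
            kept.append(candidate)
    return kept


def toy_run(candidate: str, seed: int) -> str:
    # Toy stand-in for a real evaluation: candidates containing "flaky"
    # give seed-dependent results, everything else is deterministic.
    return f"{candidate}:{seed}" if "flaky" in candidate else candidate.upper()


kept = reproducibility_filter(["ls -la", "flaky cmd", "echo hi"], toy_run)
print(kept)  # ['ls -la', 'echo hi']
```

The design point is that the filter depends on behavior that can be re-checked, not on how plausible an output looks, so outputs that merely score well have a harder time passing it.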

Why this failure mattered

It directly motivated:

The Goodhart Gap study — measuring the systematic distance between what models demonstrate they understand and what they actually do
The Deception Benchmark — building tools to detect when models are gaming metrics
The trust diagnostic framework — measuring the gap between probe accuracy and deployment behavior
An LTFF grant application — "Detecting and Closing the Goodhart Gap"

The original public write-up of this failure is at lab-stack.com/blog/i-was-wrong-about-self-improving-models.

Code: inversion-self-improvement

Built by Light of Baldr LLC. Get in touch if any of this is useful to you.