MRI for AI — Activation-Level Detection
What if you could read what a model is actually computing while it generates an answer? Across 182 runs on 29 models, calibration accuracy reached 0.96, and activation-based detection was about 10× more reliable than text-only analysis.
By Adam Kruger
We got curious whether you could read what a model is actually computing while it generates an answer: not just inspect the text it produces, but look at the activations underneath. This is the work we eventually started calling MRI, short for real-time activation monitoring for transformer models. Think of it as a brain scan of what the model is doing in the moment.
The problem we started with
AI systems are getting powerful enough to be strategically deceptive. Evaluating only what a model says leaves a lot of room for that, because the output text is exactly the part a deceptive model controls. We wanted to look at what it computes.
What we built
MRI is a FastAPI service running on DGX Spark that captures neural activations during inference. Every generation gets "brain-scanned": sparse autoencoder (SAE) features are extracted, activation magnitudes are recorded, and computational routing is tracked.
The system captures:
- Layer-by-layer activation patterns
- SAE feature decompositions
- Mode A / Mode B routing decisions at the emergent gate (L3-4)
- Temporal patterns across multi-turn conversations
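To make the capture list above a little more concrete, here is a minimal sketch of what one per-layer scan record might hold. It assumes the standard ReLU encoder form for a sparse autoencoder, `features = ReLU(W_enc · x + b_enc)`. The names `sae_encode`, `ScanRecord`, and `scan_layer`, along with the toy weights, are hypothetical and are not taken from the MRI service.

```python
import math
from dataclasses import dataclass, field


def sae_encode(x, w_enc, b_enc):
    """Toy SAE encoder: features = ReLU(W_enc @ x + b_enc)."""
    feats = []
    for row, bias in zip(w_enc, b_enc):
        pre = sum(w * xi for w, xi in zip(row, x)) + bias
        feats.append(max(0.0, pre))  # ReLU keeps only positive pre-activations
    return feats


@dataclass
class ScanRecord:
    """Hypothetical record for one layer of one generation step."""
    layer: int
    l2_norm: float  # activation magnitude
    active_features: dict = field(default_factory=dict)  # feature index -> value


def scan_layer(layer, x, w_enc, b_enc, threshold=0.0):
    """Encode one activation vector and keep only features above threshold."""
    feats = sae_encode(x, w_enc, b_enc)
    active = {i: f for i, f in enumerate(feats) if f > threshold}
    norm = math.sqrt(sum(v * v for v in x))
    return ScanRecord(layer=layer, l2_norm=norm, active_features=active)


# Toy 3-dimensional activation and a 4-feature dictionary (illustrative values).
x = [3.0, 4.0, 0.0]
w_enc = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
b_enc = [0.0, 0.0, -1.0, 0.0]
record = scan_layer(3, x, w_enc, b_enc)
```

Storing only the active features keeps each record small, which matters if every generation is scanned. A real SAE would use learned weights and a much wider feature dictionary.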
Detection performance
| Metric | Value |
|---|---|
| Calibration accuracy | 0.96 |
| Fidelity-accuracy correlation | 0.891 |
| Detection improvement (activations vs. text) | ~10× |
| Models evaluated | 29 |
| Experimental runs | 182 |
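The table does not say which statistic is behind the fidelity-accuracy correlation. Assuming it is a Pearson correlation between a per-run fidelity score and per-run accuracy, the calculation would look roughly like the sketch below. The `pearson` helper and the sample values are made up for illustration and are not the reported data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(vx * vy)


# Made-up per-run scores, for illustration only.
fidelity = [0.2, 0.4, 0.6, 0.8]
accuracy = [0.3, 0.5, 0.6, 0.9]
r = pearson(fidelity, accuracy)
```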
What seems to be true
In our runs, activation-based detection was more reliable than text-based analysis. A model can craft a convincing lie in its output, but the computational pattern that produces that lie still shows up in the activations, where it can be detected.
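One common way to test a claim like this is a linear probe on captured activations. The sketch below uses a difference-of-means direction, a simple stand-in chosen for brevity and not necessarily the detector used in this work. The function names, the two-dimensional vectors, and the honest/deceptive labels are all toy assumptions.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit_probe(honest, deceptive):
    """Fit a difference-of-means probe; returns (direction, threshold)."""
    mu_h = mean_vector(honest)
    mu_d = mean_vector(deceptive)
    direction = [d - h for d, h in zip(mu_d, mu_h)]
    midpoint = [(d + h) / 2 for d, h in zip(mu_d, mu_h)]
    # Threshold is the projection of the class midpoint onto the direction.
    threshold = sum(w * m for w, m in zip(direction, midpoint))
    return direction, threshold


def score(direction, threshold, activation):
    """Positive score means the activation sits on the 'deceptive' side."""
    return sum(w * a for w, a in zip(direction, activation)) - threshold


# Toy activations (illustrative only): two well-separated clusters.
honest = [[1.0, 0.0], [0.8, 0.2], [1.2, -0.2]]
deceptive = [[0.0, 1.0], [0.2, 0.8], [-0.2, 1.2]]
direction, threshold = fit_probe(honest, deceptive)
flagged = score(direction, threshold, [0.1, 0.9]) > 0
```

The probe never sees the generated text, only the activation vector, which is the property the comparison with text-based analysis depends on. In practice the labelled examples would come from captured runs, and the probe would be evaluated on held-out data.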
Current direction
- Multi-model MRI: running the same prompts through gpt-oss-20b, Mistral-24B, Gemma-27B, and Qwen3-4B to compare activation signatures across architectures
- Integrating with the emergent-gating discovery to track Mode B activation as a deception signal
- Training SAEs on Qwen3-4B Mode B features (43.5% complete at time of writing)
Infrastructure
Runs on DGX Spark (ARM64, GB10, 128 GB unified memory). The MRI service, SAE training, and inference all share one machine, with Mojo SIMD kernels handling the vector operations.
Code: sipit-security-agent
Built by Light of Baldr LLC. Get in touch if any of this is useful to you.