Tags: interpretability, safety, infrastructure, sparse-autoencoders

MRI for AI — Activation-Level Detection

What if you could read what a model is actually computing while it generates an answer? Across 182 runs on 29 models, we measured 0.96 calibration accuracy, and activation-based detection was about 10× more reliable than text-only analysis.

By Adam Kruger

We got curious whether you could read what a model is actually computing while it generates an answer: not just inspect the text it produces, but look at the activations underneath. We eventually started calling this work MRI: real-time activation monitoring for transformer models, the way an MRI scans a brain, but for what the model is doing in the moment.

The problem we started with

AI systems are getting powerful enough to be strategically deceptive. Evaluating only what a model says leaves a lot of room for that, because the output text is the one thing a deceptive model gets to shape. We wanted to look at what it computes.

What we built

A FastAPI service running on DGX Spark that captures neural activations during inference. Every generation gets "brain-scanned": sparse autoencoder (SAE) features are extracted, activation magnitudes recorded, and computational routing tracked.

The system captures:

  • Layer-by-layer activation patterns
  • SAE feature decompositions
  • Mode A / Mode B routing decisions at the emergent gate (L3-4)
  • Temporal patterns across multi-turn conversations
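
To make the SAE decomposition step concrete, here is a minimal sketch of what one "scan" of a single activation vector might record. It is not the MRI service's code: the function names, the record fields, the toy dimensions, and the randomly initialised weights are all assumptions made for illustration, and a real scan would use a trained SAE and activations captured from the model.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc, b_dec):
    """Standard SAE encoder: ReLU(W_enc @ (x - b_dec) + b_enc)."""
    return np.maximum(0.0, W_enc @ (x - b_dec) + b_enc)

def sae_decode(f, W_dec, b_dec):
    """Reconstruct the activation from its sparse feature vector."""
    return W_dec @ f + b_dec

def scan_activation(x, W_enc, b_enc, W_dec, b_dec, top_k=5):
    """Build one hypothetical scan record for a single activation vector."""
    f = sae_encode(x, W_enc, b_enc, b_dec)
    x_hat = sae_decode(f, W_dec, b_dec)
    # Indices of the strongest features, largest first.
    top = np.argsort(f)[::-1][:top_k]
    return {
        "activation_norm": float(np.linalg.norm(x)),
        "top_features": [(int(i), float(f[i])) for i in top if f[i] > 0],
        "reconstruction_error": float(np.linalg.norm(x - x_hat)),
    }

# Toy setup: random placeholder weights, NOT a trained SAE.
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # illustrative sizes only
W_enc = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_model)
W_dec = W_enc.T.copy()  # tied weights, chosen only to keep the toy short
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

x = rng.normal(size=d_model)  # stand-in for one captured activation
record = scan_activation(x, W_enc, b_enc, W_dec, b_dec)
print(record["activation_norm"], record["top_features"])
```

In a setup like this, one record per layer per token would give the layer-by-layer pattern, and stacking records across turns would give the temporal view.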

Detection performance

Metric                                        Value
Calibration accuracy                          0.96
Fidelity-accuracy correlation                 0.891
Detection improvement (activations vs. text)  ~10×
Models evaluated                              29
Experimental runs                             182

What seems to be true

Activation-based detection is more reliable than text-based analysis. A model can craft a convincing lie in its output, but the computational pattern that produces that lie is detectable in the activations.
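
One simple way to turn activations into a detector is a linear probe along the difference of class means. The post does not say which classifier MRI uses, so treat this as an illustrative sketch under that assumption. The data below is synthetic Gaussian noise with an artificial shift, so the accuracy it produces says nothing about real models.

```python
import numpy as np

def fit_mean_difference_probe(honest, deceptive):
    """Fit a unit direction and threshold separating two sets of activations."""
    mu_h = honest.mean(axis=0)
    mu_d = deceptive.mean(axis=0)
    direction = mu_d - mu_h
    direction = direction / np.linalg.norm(direction)
    # Threshold at the midpoint of the two class means along the direction.
    threshold = 0.5 * (mu_h + mu_d) @ direction
    return direction, threshold

def predict_deceptive(acts, direction, threshold):
    """True where the projected activation falls on the 'deceptive' side."""
    return acts @ direction > threshold

# Synthetic stand-ins for captured activations (hypothetical, not real data).
rng = np.random.default_rng(0)
d = 16
shift = np.full(d, 2.0)  # artificial offset between the two classes
honest_train = rng.normal(size=(200, d))
deceptive_train = rng.normal(size=(200, d)) + shift
honest_test = rng.normal(size=(200, d))
deceptive_test = rng.normal(size=(200, d)) + shift

direction, threshold = fit_mean_difference_probe(honest_train, deceptive_train)
pred_h = predict_deceptive(honest_test, direction, threshold)
pred_d = predict_deceptive(deceptive_test, direction, threshold)
accuracy = (np.sum(~pred_h) + np.sum(pred_d)) / (len(pred_h) + len(pred_d))
print(f"held-out accuracy on synthetic data: {accuracy:.3f}")
```

The appeal of a probe this simple is that it reads the internal state directly, so it does not depend on the wording the model chose for its output.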

Current direction

  • Multi-model MRI: running the same prompts through gpt-oss-20b, Mistral-24B, Gemma-27B, and Qwen3-4B to compare activation signatures across architectures
  • Integrating with the emergent-gating discovery to track Mode B activation as a deception signal
  • Training SAEs on Qwen3-4B Mode B features (43.5% complete at time of writing)

Infrastructure

Runs on DGX Spark (ARM64, GB10, 128 GB unified memory). The MRI service, SAE training, and inference all share one machine, with Mojo SIMD kernels handling the vector operations.

Code: sipit-security-agent

Built by Light of Baldr LLC. Get in touch if any of this is useful to you.