Emergent Computational Gating in Dense Transformers

Dense transformers develop bimodal processing gates at layers 3-4 that nobody designed. Confirmed in three models across two families (Mistral and Qwen). Standard SAEs drop to -3,059% explained variance on deep layers; a SipIt + SAE + GLP pipeline recovers them.

By Adam Kruger

We didn't go looking for a routing mechanism in dense transformer models — there isn't one in the architecture. But every token entering Mistral-7B, Mistral-Small-24B, and Qwen3-4B gets routed into one of two processing paths at layers 3-4, and the routing is consistent enough that it's clearly a real circuit, not a statistical artifact.

The core finding

Every token entering the model gets routed into one of two paths:

  • Mode A (93–95% of tokens): shallow processing, minimal transformation
  • Mode B (5–7%): deep processing, massive representational shift

There's no mixture-of-experts routing. There's no explicit gating layer. The model develops this routing during training, on its own.

Cross-architecture confirmation

| Model | Family | Parameters | Gate layer | Mode B share (as reported) |
| --- | --- | --- | --- | --- |
| Mistral-7B | Mistral | 7B | L3-4 | ~50% of outlier tokens |
| Mistral-Small-24B | Mistral | 24B | L3-4 | 47.6% of responses |
| Qwen3-4B | Qwen | 4B | L3-4 | 7.27% of tokens |

Statistical evidence

| Model | Evidence |
| --- | --- |
| Mistral-7B | L2-L3 avg distance +2,170%, std dev +17,750%, avg/median ratio 17.9× |
| Mistral-Small-24B | Cohen's d = 4.8, Silhouette = 0.83, AUC-ROC = 0.97, p < 0.001 |
| Qwen3-4B | Feature 374 correlation 0.96 across L3–L5, bimodal ratio 1.09 |
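
For readers less familiar with two of the statistics above, here is a minimal sketch of how Cohen's d and an avg/median ratio are commonly computed. The toy scores and function names are hypothetical, and the post does not say which variant of Cohen's d was used; the pooled-standard-deviation form below is an assumption.

```python
import statistics


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances (n - 1)
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(b) - statistics.mean(a)) / pooled_sd


def mean_median_ratio(xs):
    """Mean divided by median; values far above 1 indicate a heavy right tail."""
    return statistics.mean(xs) / statistics.median(xs)


# Hypothetical per-token scores, NOT data from the experiments above.
mode_a = [1.0, 2.0, 3.0, 2.0, 2.0]
mode_b = [9.0, 10.0, 11.0, 10.0, 10.0]
d = cohens_d(mode_a, mode_b)

# One extreme value among otherwise small scores inflates the mean but not the median.
skewed = [1.0, 1.0, 1.0, 1.0, 96.0]
ratio = mean_median_ratio(skewed)
```

Read this way, d = 4.8 means the two group means sit 4.8 pooled standard deviations apart, and an avg/median ratio of 17.9× means a small number of tokens dominate the average.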

Causal proof

Ablating the gate at Layer 3 in Qwen3-4B:

| Metric | Baseline | After ablation | Change |
| --- | --- | --- | --- |
| L6 Mode B mean | 308.6 | 20.5 | -93.4% |
| L6 extreme tokens (>p99.9) | 20 | 0 | -100% |
| L6 max | 11,475.4 | 147.3 | -98.7% |

Removing the gate removes the downstream behavior. That's the test we wanted — a functional circuit, not a correlation.
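
The Change column follows directly from the Baseline and After ablation columns. A short check, with the three rows copied from the table (the helper name `pct_change` is ours, not part of any published code):

```python
def pct_change(baseline, after):
    """Relative change from baseline to after, in percent."""
    return (after - baseline) / baseline * 100.0


# (baseline, after ablation) pairs copied from the ablation table.
rows = {
    "L6 Mode B mean": (308.6, 20.5),
    "L6 extreme tokens (>p99.9)": (20, 0),
    "L6 max": (11475.4, 147.3),
}

# Round to one decimal place, matching the precision reported in the table.
changes = {name: round(pct_change(b, a), 1) for name, (b, a) in rows.items()}
```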

Three-stage pipeline (Qwen3-4B, 200K tokens)

| Stage | Layers | Behavior | Mode B % |
| --- | --- | --- | --- |
| Shallow triage | L0–L5 | Bimodal gate, selective routing | 5–17% |
| Deep explosion | L6–L15 | Std dev explodes 4,289%, tokens leave vocabulary space | 0.1% |
| Final routing | L16–L35 | Steady divergence, second bimodal gate at L35 | 0.2–10.9% |

The L5 compression dip

Every L6-extreme token follows the same trajectory: scores build through L0–L4, drop at L5, then explode 300×+ at L6.

Token ' unwanted':  L3=43.4, L4=48.3, L5=31.2, L6=11,409.6
Token ' tops':      L3=38.5, L4=45.1, L5=25.7, L6=10,678.6

It's a two-stage funnel: 100% of L6-extreme tokens were L3 Mode B, but only 1.6% of L3 Mode B tokens become L6-extreme. The first gate qualifies; the second commits.
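
The two funnel percentages are conditional shares taken in opposite directions, which is easy to mix up. The sketch below uses a hypothetical ten-token example (not the 200K-token data) to show both directions, then checks the "300×+" jump against the two trajectories quoted above.

```python
def conditional_share(condition, event):
    """Fraction of tokens where `event` is true, among tokens where `condition` is true."""
    selected = [e for c, e in zip(condition, event) if c]
    return sum(selected) / len(selected)


# Hypothetical boolean masks over 10 tokens, for illustration only.
l3_mode_b = [True, True, True, True, False, False, False, False, False, False]
l6_extreme = [True, False, False, False, False, False, False, False, False, False]

# "X% of L6-extreme tokens were L3 Mode B": condition on L6-extreme.
qualified = conditional_share(l6_extreme, l3_mode_b)
# "Y% of L3 Mode B tokens become L6-extreme": condition on L3 Mode B.
committed = conditional_share(l3_mode_b, l6_extreme)

# L5 -> L6 growth factor for the two tokens listed in the post.
unwanted_jump = 11409.6 / 31.2
tops_jump = 10678.6 / 25.7
```

In the toy data every L6-extreme token passed the first gate, but only a quarter of first-gate tokens go on to the second, which mirrors the 100% and 1.6% pattern reported above.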

Where standard SAEs break

Sparse autoencoders, the dominant interpretability tool, completely fail on deep computation layers in these models. A pipeline of SipIt invertibility + a pre-trained SAE + GLP diffusion prior recovers what they miss.

| Layer | Role | SAE explained variance | Pipeline explained variance |
| --- | --- | --- | --- |
| L3 | Gate | 92.2% | 99.99% |
| L5 | Compression | 85.2% | 99.99% |
| L6 | Explosion | -905% | 100.0% |
| L8 | Post-explosion | -1,111% | 100.0% |
| L16 | Deep compute | -1,058% | 100.0% |
| L24 | Deepest | -3,059% | 100.0% |
| L35 | Final gate | 86.9% | 99.97% |

The takeaway we drew: deep computation layers use dense / distributed representations, not the sparse features SAEs are built to extract. That's a blind spot in the dominant interpretability approach, not a defect in our SAEs.
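
Negative explained variance can look like a typo, so here is a minimal sketch of how it arises. It assumes the common definition, 1 minus reconstruction error divided by the variance of the original activations; the post does not state which definition was used, and the numbers below are toy values.

```python
def explained_variance(x, x_hat):
    """1 - (sum of squared reconstruction errors) / (sum of squared deviations from the mean)."""
    mean = sum(x) / len(x)
    sse = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sst = sum((a - mean) ** 2 for a in x)
    return 1.0 - sse / sst


x = [1.0, 2.0, 3.0, 4.0]  # toy "activations"

perfect = explained_variance(x, x)             # exact reconstruction
mean_only = explained_variance(x, [2.5] * 4)   # always predicting the mean
bad = explained_variance(x, [10.0, -10.0, 10.0, -10.0])  # worse than predicting the mean
```

Under this definition, any reconstruction that does worse than simply outputting the mean activation scores below zero. A value of -3,059% would mean the squared reconstruction error is roughly 31.6 times the variance of the activations themselves.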

Thinking-mode experiment

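Qwen3's <think> token does not widen the gate: the Mode B share stays flat at L6 and dips slightly at L35. It amplifies depth: the maximum score rises at both layers.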

| Layer | Baseline Mode B % | Think Mode B % | Baseline max | Think max |
| --- | --- | --- | --- | --- |
| L6 | 0.2% | 0.2% | 11,679 | 17,013 (+45.6%) |
| L35 | 8.9% | 8.1% | 3,896 | 7,272 (+86.7%) |

Emotion probe results

| Category | L6 mean | L6 max | L6 mean vs. neutral |
| --- | --- | --- | --- |
| Emotion | 7,218 | 10,405 | +48% |
| High valence | 6,383 | 7,639 | +31% |
| Neutral | 4,874 | 5,438 | baseline |

The gate fires on the first token of the sentence. The model decides at onset whether deep processing is needed.

Built by Light of Baldr LLC. Get in touch if any of this is useful to you.