Emergent Computational Gating in Dense Transformers
Dense transformers develop bimodal processing gates at layers 3-4 that nobody designed. Confirmed across three models from two families. Standard SAEs break down on deep layers, with explained variance as low as -3,059%; a SipIt + SAE + GLP pipeline brings it back to roughly 100%.
By Adam Kruger
We didn't go looking for a routing mechanism in dense transformer models — there isn't one in the architecture. But every token entering Mistral-7B, Mistral-Small-24B, and Qwen3-4B gets routed into one of two processing paths at layers 3-4, and the routing is consistent enough that it's clearly a real circuit, not a statistical artifact.
The core finding
Every token entering the model gets routed into one of two paths:
- Mode A (93–95% of tokens): shallow processing, minimal transformation
- Mode B (5–7%): deep processing, massive representational shift
There's no mixture-of-experts routing. There's no explicit gating layer. The model develops this routing during training, on its own.
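To make the Mode A / Mode B split concrete, here is a minimal sketch of how tokens could be labeled from a per-token scalar score. The score, the median-multiple rule, and the factor `k` are all assumptions for illustration, not the criterion used in the experiments.

```python
from statistics import median

def split_modes(scores, k=5.0):
    """Label each token score as Mode 'A' or Mode 'B'.

    Hypothetical rule: a token is Mode B when its score exceeds
    k times the median score. Both the rule and k are assumptions.
    """
    cutoff = k * median(scores)
    return ["B" if s > cutoff else "A" for s in scores]

def mode_b_share(labels):
    """Fraction of tokens labeled Mode B."""
    return sum(1 for label in labels if label == "B") / len(labels)

# Toy data: 19 ordinary tokens and one outlier.
scores = [1.0] * 19 + [50.0]
labels = split_modes(scores)
print(f"Mode B share: {mode_b_share(labels):.2%}")
```

A median-based cutoff is used here because a mean-based one would be dragged upward by the very outliers it is supposed to detect.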
Cross-architecture confirmation
| Model | Family | Parameters | Gate Layer | Mode B share (basis differs per model) |
|---|---|---|---|---|
| Mistral-7B | Mistral | 7B | L3-4 | ~50% of outlier tokens |
| Mistral-Small-24B | Mistral | 24B | L3-4 | 47.6% of responses |
| Qwen3-4B | Qwen | 4B | L3-4 | 7.27% of tokens |
Statistical evidence
| Model | Evidence |
|---|---|
| Mistral-7B | L2-L3 avg distance +2,170%, std dev +17,750%, avg/median ratio 17.9× |
| Mistral-Small-24B | Cohen's d = 4.8, Silhouette = 0.83, AUC-ROC = 0.97, p < 0.001 |
| Qwen3-4B | Feature 374 correlation 0.96 across L3–L5, bimodal ratio 1.09 |
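For readers who want to compute two of these statistics on their own activations, here is a minimal sketch of Cohen's d and the avg/median ratio. It assumes the pooled-standard-deviation form of Cohen's d; which variant produced the numbers in the table is not stated, and the toy inputs are made up.

```python
from math import sqrt
from statistics import mean, median, variance

def cohens_d(group_a, group_b):
    """Cohen's d between two samples, using a pooled sample std dev."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    return (mean(group_b) - mean(group_a)) / sqrt(pooled_var)

def mean_median_ratio(values):
    """Mean divided by median; far above 1 signals a heavy upper tail."""
    return mean(values) / median(values)

# Toy samples standing in for Mode A and Mode B scores.
mode_a = [1.0, 2.0, 3.0]
mode_b = [4.0, 5.0, 6.0]
print(cohens_d(mode_a, mode_b))
print(mean_median_ratio([1.0, 1.0, 1.0, 1.0, 96.0]))
```

A ratio like 17.9× means a small number of tokens carry most of the total distance, which is what a rare, high-magnitude second mode would look like.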
Causal proof
Ablating the gate at Layer 3 in Qwen3-4B:
| Metric | Baseline | After ablation | Change |
|---|---|---|---|
| L6 Mode B mean | 308.6 | 20.5 | -93.4% |
| L6 extreme tokens (>p99.9) | 20 | 0 | -100% |
| L6 max | 11,475.4 | 147.3 | -98.7% |
Removing the gate removes the downstream behavior. That's the test we wanted — a functional circuit, not a correlation.
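The Change column is a plain relative change against baseline. A short check, using only the numbers in the table (the helper name `pct_change` is ours):

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0

# (baseline, after ablation) pairs copied from the table above.
rows = {
    "L6 Mode B mean": (308.6, 20.5),
    "L6 extreme tokens (>p99.9)": (20, 0),
    "L6 max": (11475.4, 147.3),
}

for name, (before, after) in rows.items():
    print(f"{name}: {pct_change(before, after):+.1f}%")
# L6 Mode B mean: -93.4%
# L6 extreme tokens (>p99.9): -100.0%
# L6 max: -98.7%
```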
Three-stage pipeline (Qwen3-4B, 200K tokens)
| Stage | Layers | Behavior | Mode B % |
|---|---|---|---|
| Shallow triage | L0–L5 | Bimodal gate, selective routing | 5–17% |
| Deep explosion | L6–L15 | Std dev explodes 4,289%, tokens leave vocabulary space | 0.1% |
| Final routing | L16–L35 | Steady divergence, second bimodal gate at L35 | 0.2–10.9% |
The L5 compression dip
Every L6-extreme token follows the same trajectory: scores build through L0–L4, drop at L5, then explode 300×+ at L6.
Token ' unwanted': L3=43.4, L4=48.3, L5=31.2, L6=11,409.6
Token ' tops': L3=38.5, L4=45.1, L5=25.7, L6=10,678.6
It's a two-stage funnel: 100% of L6-extreme tokens were L3 Mode B, but only 1.6% of L3 Mode B tokens become L6-extreme. The first gate qualifies; the second commits.
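The funnel numbers and the size of the L5-to-L6 jump can be sketched as follows. The two token trajectories are the ones quoted above; the function name and the toy index sets are hypothetical, chosen only so the output matches the 100% / 1.6% shape.

```python
def funnel_stats(l3_mode_b, l6_extreme):
    """Return (recall, conversion) for two sets of token indices.

    recall:     share of L6-extreme tokens that were L3 Mode B
    conversion: share of L3 Mode B tokens that become L6-extreme
    """
    both = l3_mode_b & l6_extreme
    return len(both) / len(l6_extreme), len(both) / len(l3_mode_b)

# Toy sets (assumed): 1,000 L3 Mode B tokens, 16 of which go extreme at L6.
recall, conversion = funnel_stats(set(range(1000)), set(range(16)))
print(f"recall={recall:.1%}, conversion={conversion:.1%}")

# Trajectories quoted in the text.
trajectories = {
    " unwanted": {"L3": 43.4, "L4": 48.3, "L5": 31.2, "L6": 11409.6},
    " tops": {"L3": 38.5, "L4": 45.1, "L5": 25.7, "L6": 10678.6},
}

# Ratio of the L6 score to the L5 score for each token.
jumps = {tok: t["L6"] / t["L5"] for tok, t in trajectories.items()}
for tok, jump in jumps.items():
    print(f"{tok!r}: L5 dips below L4, then L6 is {jump:.0f}x L5")
```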
Where standard SAEs break
Sparse autoencoders, the dominant interpretability tool, completely fail on deep computation layers in these models. A pipeline of SipIt invertibility + a pre-trained SAE + GLP diffusion prior recovers what they miss.
| Layer | Role | SAE explained variance | Pipeline explained variance |
|---|---|---|---|
| L3 | Gate | 92.2% | 99.99% |
| L5 | Compression | 85.2% | 99.99% |
| L6 | Explosion | -905% | 100.0% |
| L8 | Post-explosion | -1,111% | 100.0% |
| L16 | Deep compute | -1,058% | 100.0% |
| L24 | Deepest | -3,059% | 100.0% |
| L35 | Final gate | 86.9% | 99.97% |
The takeaway we drew: deep computation layers use dense / distributed representations, not the sparse features SAEs are built to extract. That's a blind spot in the dominant interpretability approach, not a defect in our SAEs.
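Negative explained variance can look like a typo, so here is a minimal sketch of how it arises. It assumes the common definition, 1 minus residual sum of squares over total sum of squares; the exact formula behind the table is not stated, and the toy vectors are made up.

```python
def explained_variance(x, x_hat):
    """1 - SS_res / SS_tot for a signal x and its reconstruction x_hat."""
    mu = sum(x) / len(x)
    ss_res = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    ss_tot = sum((a - mu) ** 2 for a in x)
    return 1.0 - ss_res / ss_tot

x = [1.0, 2.0, 3.0]

print(explained_variance(x, [1.0, 2.0, 3.0]))  # perfect reconstruction
print(explained_variance(x, [2.0, 2.0, 2.0]))  # always predicting the mean
print(explained_variance(x, [3.0, 2.0, 1.0]))  # worse than the mean: negative
```

Under that definition, a score below zero means the reconstruction error is larger than the variance of the signal itself. A value of -3,059% would correspond to a residual of about 31.6 times the total variance, so the SAE output at that layer is much further from the activations than a constant mean vector would be.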
Thinking-mode experiment
Qwen3's <think> token does not widen the gate. It amplifies depth.
| Layer | Baseline Mode B % | Think Mode B % | Baseline max | Think max |
|---|---|---|---|---|
| L6 | 0.2% | 0.2% | 11,679 | 17,013 (+45.7%) |
| L35 | 8.9% | 8.1% | 3,896 | 7,272 (+86.7%) |
Emotion probe results
| Category | L6 mean | L6 max | vs. neutral |
|---|---|---|---|
| Emotion | 7,218 | 10,405 | +48% |
| High valence | 6,383 | 7,639 | +31% |
| Neutral | 4,874 | 5,438 | baseline |
The gate fires on the first token of the sentence. The model decides at onset whether deep processing is needed.
Built by Light of Baldr LLC. Get in touch if any of this is useful to you.