February 9, 2026 · interpretability, mechanistic, sparse-autoencoders, transformers

Emergent Computational Gating in Dense Transformers

Dense transformers develop bimodal processing gates at layers 3-4 that nobody designed. Confirmed across three models from two families (Mistral and Qwen). Standard SAEs drop to -3,059% explained variance on deep layers; a SipIt + SAE + GLP pipeline recovers them.

By Adam Kruger

We didn't go looking for a routing mechanism in dense transformer models; there isn't one in the architecture. But every token entering Mistral-7B, Mistral-Small-24B, and Qwen3-4B gets routed into one of two processing paths at layers 3-4, and the routing is consistent enough that it's clearly a real circuit, not a statistical artifact.

The core finding

Every token entering the model gets routed into one of two paths:

Mode A (93–95% of tokens): shallow processing, minimal transformation
Mode B (5–7%): deep processing, massive representational shift

There's no mixture-of-experts routing. There's no explicit gating layer. The model develops this routing during training, on its own.

Cross-architecture confirmation

Model	Family	Parameters	Gate Layer	Mode B share (as reported; basis differs by model)
Mistral-7B	Mistral	7B	L3-4	~50% of outlier tokens
Mistral-Small-24B	Mistral	24B	L3-4	47.6% of responses
Qwen3-4B	Qwen	4B	L3-4	7.27% of tokens

Statistical evidence

Model	Evidence
Mistral-7B	L2-L3 avg distance +2,170%, std dev +17,750%, avg/median ratio 17.9×
Mistral-Small-24B	Cohen's d = 4.8, Silhouette = 0.83, AUC-ROC = 0.97, p < 0.001
Qwen3-4B	Feature 374 correlation 0.96 across L3–L5, bimodal ratio 1.09
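
For readers who want to sanity-check an effect size like the Cohen's d above on their own activations, here is a minimal sketch of the standard pooled-variance formula. The post doesn't say which per-token quantity the two groups were measured on, so the inputs are placeholders: `mode_a` and `mode_b` are made-up numbers for illustration, not data from these experiments.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: difference of means divided by the pooled sample std dev."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_b) - mean(group_a)) / sqrt(pooled_var)


# Hypothetical per-token scores, for illustration only (not from the post).
mode_a = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8]
mode_b = [5.0, 5.5, 4.8, 5.2]

d = cohens_d(mode_a, mode_b)
print(f"Cohen's d = {d:.2f}")
```

By the usual rule of thumb, d above 0.8 already counts as a large effect, which is why a value of 4.8 points to two clearly separated populations.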

Causal proof

Ablating the gate at Layer 3 in Qwen3-4B:

Metric	Baseline	After ablation	Change
L6 Mode B mean	308.6	20.5	-93.4%
L6 extreme tokens (>p99.9)	20	0	-100%
L6 max	11,475.4	147.3	-98.7%

Removing the gate removes the downstream behavior. That's the test we wanted: evidence of a functional circuit, not just a correlation.
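
The Change column in the ablation table is a plain relative change against baseline. A short sketch that recomputes it from the table's own numbers (the helper name `pct_change` is ours, for illustration):

```python
def pct_change(baseline, after):
    """Relative change from baseline, in percent."""
    return (after - baseline) / baseline * 100.0


# (baseline, after ablation) pairs copied from the ablation table above.
rows = {
    "L6 Mode B mean": (308.6, 20.5),
    "L6 extreme tokens (>p99.9)": (20, 0),
    "L6 max": (11475.4, 147.3),
}

for name, (baseline, after) in rows.items():
    print(f"{name}: {pct_change(baseline, after):+.1f}%")
```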

Three-stage pipeline (Qwen3-4B, 200K tokens)

Stage	Layers	Behavior	Mode B %
Shallow triage	L0–L5	Bimodal gate, selective routing	5–17%
Deep explosion	L6–L15	Std dev explodes 4,289%, tokens leave vocabulary space	0.1%
Final routing	L16–L35	Steady divergence, second bimodal gate at L35	0.2–10.9%

The L5 compression dip

Every L6-extreme token follows the same trajectory: scores build through L0–L4, drop at L5, then explode 300×+ at L6.

Token ' unwanted':  L3=43.4, L4=48.3, L5=31.2, L6=11,409.6
Token ' tops':      L3=38.5, L4=45.1, L5=25.7, L6=10,678.6

It's a two-stage funnel: 100% of L6-extreme tokens were L3 Mode B, but only 1.6% of L3 Mode B tokens become L6-extreme. The first gate qualifies; the second commits.
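
The dip-then-explosion pattern is easy to check mechanically. The sketch below uses only the two example tokens listed above. The function name `has_dip_then_explosion` and the `min_ratio` parameter are illustrative, with the 300× threshold taken from the text. Since only L3 to L6 scores are given here, the "builds" check covers L3 to L4 only.

```python
# Per-layer scores for the two example tokens quoted above.
trajectories = {
    " unwanted": {"L3": 43.4, "L4": 48.3, "L5": 31.2, "L6": 11409.6},
    " tops": {"L3": 38.5, "L4": 45.1, "L5": 25.7, "L6": 10678.6},
}


def has_dip_then_explosion(scores, min_ratio=300.0):
    """True if the score builds into L4, drops at L5, then jumps at L6."""
    builds = scores["L4"] > scores["L3"]
    dips = scores["L5"] < scores["L4"]
    explodes = scores["L6"] / scores["L5"] >= min_ratio
    return builds and dips and explodes


for token, scores in trajectories.items():
    ratio = scores["L6"] / scores["L5"]
    matches = has_dip_then_explosion(scores)
    print(f"{token!r}: L6/L5 = {ratio:.0f}x, matches pattern: {matches}")
```

For these two tokens the L6/L5 ratio comes out at roughly 366× and 416×, consistent with the "300×+" figure.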

Where standard SAEs break

Sparse autoencoders, the dominant interpretability tool, completely fail on deep computation layers in these models. A pipeline of SipIt invertibility + a pre-trained SAE + GLP diffusion prior recovers what they miss.

Layer	Role	SAE explained variance	Pipeline explained variance
L3	Gate	92.2%	99.99%
L5	Compression	85.2%	99.99%
L6	Explosion	-905%	100.0%
L8	Post-explosion	-1,111%	100.0%
L16	Deep compute	-1,058%	100.0%
L24	Deepest	-3,059%	100.0%
L35	Final gate	86.9%	99.97%

The takeaway we drew: deep computation layers use dense / distributed representations, not the sparse features SAEs are built to extract. That's a blind spot in the dominant interpretability approach, not a defect in our SAEs.
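
A note on how explained variance can go negative. The sketch below assumes the common definition, 1 minus the ratio of squared reconstruction error to total variance; the post doesn't state the exact formula used, so treat this as an assumption. The vectors are toy values, not model activations.

```python
from statistics import mean


def explained_variance(x, x_hat):
    """1 - (squared reconstruction error) / (total squared deviation from the mean)."""
    mu = mean(x)
    residual = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    total = sum((a - mu) ** 2 for a in x)
    return 1.0 - residual / total


# Toy values for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
good = [1.1, 1.9, 3.1, 3.9]      # close reconstruction
bad = [10.0, -8.0, 12.0, -6.0]   # reconstruction worse than predicting the mean

print(f"good: {explained_variance(x, good):.1%}")
print(f"bad:  {explained_variance(x, bad):.1%}")
```

Under this definition, any reconstruction whose error exceeds the variance of the input scores below zero. A value of -905% would mean the squared error is about 10 times the input variance, and -3,059% about 32 times.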

Thinking-mode experiment

Qwen3's <think> token does not widen the gate. It amplifies depth.

Layer	Baseline Mode B %	Think Mode B %	Baseline max	Think max
L6	0.2%	0.2%	11,679	17,013 (+45.6%)
L35	8.9%	8.1%	3,896	7,272 (+86.7%)

Emotion probe results

Category	L6 mean	L6 max	L6 mean vs. neutral
Emotion	7,218	10,405	+48%
High valence	6,383	7,639	+31%
Neutral	4,874	5,438	baseline

The gate fires on the first token of the sentence. The model decides at onset whether deep processing is needed.

Built by Light of Baldr LLC. Get in touch if any of this is useful to you.