February 27, 2026 | interpretability, topology, state-space-models, falsification

The Mamba Counterexample

If topological integration were universal, state space models should show it too. They don't. Mamba-370m maintains fragmented representations end-to-end.

By Adam Kruger

If the cluster-collapse pattern we documented in the Topology of Thought work were universal across neural sequence models, state space models should show it too. They don't. Mamba-370m maintains fragmented representations from input to output, and its cluster count grows with depth, from 551 at the start of the network to 987 at the end.

The test

Mamba-370m uses selective state space model (SSM) layers instead of attention. It has:

No query-key-value projections
No attention matrices
No direct token-to-token interaction
Only recurrent state evolution through selective scan
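
To make the last point concrete, here is a minimal sketch of a selective scan reduced to a single scalar state. It is a toy under stated assumptions, not Mamba's actual kernel: the function name, the scalar weights w_delta, w_b and w_c, and the fixed A = -1.0 are all hypothetical, and the real model uses learned projections over vector-valued states. What it does show is that the carried state h is the only route by which one token can affect another.

```python
import math

def softplus(z):
    # Numerically stable log(1 + exp(z)); keeps the step size positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def selective_scan(xs, A=-1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Toy scalar selective scan (hypothetical parameters, illustration only)."""
    h = 0.0                              # the single recurrent state
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)    # input-dependent step size
        a_bar = math.exp(delta * A)      # decay on the old state, in (0, 1) since A < 0
        b_bar = delta * (w_b * x)        # input-dependent write strength
        h = a_bar * h + b_bar * x        # the only path between tokens
        ys.append((w_c * x) * h)         # input-dependent readout
    return ys

outputs = selective_scan([1.0, 2.0, 3.0])
```

Changing the third input leaves the first two outputs untouched, while changing the first input alters every later output, but only through h.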

We ran the same persistent homology analysis (Vietoris-Rips via Ripser) on Mamba's hidden states that we had previously run on Qwen3-4B and NanoChat.
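
For readers who want a feel for what a cluster count measures, the sketch below counts connected components of a point cloud at one fixed distance scale, which is the H0 part of a Vietoris-Rips complex at that scale. It is a stand-in written with the standard library, not the Ripser pipeline used for the results here; the function name and the eps threshold are hypothetical, and the rule the analysis used to turn a persistence diagram into a single cluster count is not reproduced.

```python
import math

def count_clusters(points, eps):
    """Connected components of the graph linking points at distance <= eps."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)   # merge the two components

    return len({find(i) for i in range(len(points))})

# Toy stand-in for one layer's hidden states: two tight groups of two points.
toy_layer = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
clusters_at_half = count_clusters(toy_layer, eps=0.5)
```

Running the same count on each layer's hidden states and comparing the first and last layers gives the kind of start and end numbers reported below. The pairwise loop is quadratic in the number of points, so it only suits small samples.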

Results

Metric	Qwen3-4B (attention)	NanoChat (attention)	Mamba-370m (SSM)
Start clusters	519	517	551
End clusters	1	1	987
Direction	Collapse	Collapse	Proliferation
Intrinsic dimension (ID) profile	Inverted U (peak 9.8)	Inverted U (peak 12.1)	Flat (~3)
H1 loop peak	342	400	None

Mamba's cluster count increases through the network. There's no integration layer, no unified manifold, no topological phase transition.

A three-condition framework

Together with an untrained, randomly initialized transformer as a control, the Mamba result established a clean causal picture of what produces topological integration:

Condition	Trained transformer	Mamba (SSM)	Untrained transformer
Direct interaction (attention)	Yes	No	Yes
Optimization (gradient descent)	Yes	Yes	No
Topological collapse	Yes	No	No

Both ingredients, attention and optimization, have to be present:

Attention without training (random init) → clusters proliferate (503 → 968)
Training without attention (Mamba) → clusters proliferate (551 → 987)
Attention with training → clusters collapse (517 → 1)
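
The rule those three cases describe is a logical AND, and it can be checked against the reported numbers in a few lines. This is only a consistency check on the figures quoted in this post, with hypothetical names throughout. It says nothing about the fourth cell of the grid, an untrained model without attention, which is not reported here.

```python
# (label, has_attention, is_trained, start_clusters, end_clusters), as reported above
RUNS = [
    ("Qwen3-4B", True, True, 519, 1),
    ("NanoChat", True, True, 517, 1),
    ("Mamba-370m", False, True, 551, 987),
    ("Untrained transformer", True, False, 503, 968),
]

def direction(start, end):
    # Classify the observed change in cluster count across the network.
    if end < start:
        return "collapse"
    if end > start:
        return "proliferation"
    return "flat"

def predicted(has_attention, is_trained):
    # The hypothesis: collapse needs attention AND training.
    return "collapse" if (has_attention and is_trained) else "proliferation"

mismatches = [
    name for name, att, trained, start, end in RUNS
    if direction(start, end) != predicted(att, trained)
]
```

An empty mismatches list means every reported run agrees with the rule.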

Why attention seems to be the mechanism

Attention creates direct token-to-token interaction. Within a single layer, any token can directly influence the representation of every token that attends to it (under a causal mask, every later token). SSM recurrence instead processes tokens sequentially through a state vector, so information from one token reaches another only through that state, which acts as a bottleneck. The direct many-to-many interaction that attention provides looks like the thing that enables representational integration into a single manifold.
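
A small sketch of what direct interaction means in practice: the attention weights for one head, computed from raw query and key vectors. It assumes a causal mask, as in decoder-only language models, and omits the learned projections, multiple heads, and value mixing of a real layer; the function name is hypothetical.

```python
import math

def causal_attention_weights(queries, keys):
    """Row i holds the softmax weights of position i over positions 0..i."""
    d = len(keys[0])
    weights = []
    for i, q in enumerate(queries):
        # Scaled dot-product scores against every visible position.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        m = max(scores)                        # subtract the max for stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        row += [0.0] * (len(keys) - i - 1)     # masked (future) positions
        weights.append(row)
    return weights

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = causal_attention_weights(vectors, vectors)
```

Every visible entry in a row is strictly positive, so each position reads from all earlier positions in one step, with no intermediate state in between. In a recurrence, the same information has to survive every state update along the way.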

What we drew from this

Architecture matters for integration. Not all sequence models develop unified representations.
Attention isn't just about parallelism. It provides a qualitatively different kind of computation than sequential state evolution.
SSMs may make a different computational tradeoff: efficient sequence processing without representational integration.
Topology discriminates architectures. The same measurement cleanly separates models that integrate from those that don't.

Built by Light of Baldr LLC. Get in touch if any of this is useful to you.