The Mamba Counterexample
If topological integration were universal, state space models should show it too. They don't. Mamba-370m maintains fragmented representations end-to-end.
By Adam Kruger
If the cluster-collapse pattern we documented in the Topology of Thought work were universal across all neural sequence models, state space models should show it too. They don't. Mamba-370m maintains fragmented representations from input to output, and the cluster count actually grows as you go deeper.
The test
Mamba-370m uses selective state space model (SSM) layers instead of attention. It has:
- No query-key-value projections
- No attention matrices
- No direct token-to-token interaction
- Only recurrent state evolution through selective scan
We ran the same persistent homology analysis (Vietoris-Rips via Ripser) on Mamba's hidden states that we had run on Qwen3-4B and NanoChat.
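To make the measurement concrete, here is a minimal sketch of the simplest piece of it: counting connected components (H0) of a point cloud at one fixed distance scale. It is a standard-library stand-in, not the Ripser pipeline used for the results, since Ripser tracks components across all scales at once. The function name `count_components`, the `epsilon` threshold, and the toy points are assumptions for illustration. How cluster counts were thresholded in the actual analysis isn't specified here.

```python
import math
from itertools import combinations


def count_components(points, epsilon):
    """Count connected components when points within `epsilon` are linked.

    This is the H0 (Betti-0) count of a Vietoris-Rips complex at a single
    scale, computed with union-find. Hypothetical helper, not Ripser's API.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= epsilon:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_i] = root_j

    return len({find(i) for i in range(len(points))})


# Toy stand-in for hidden states: two well-separated blobs in 2-D.
toy_states = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
]

fragmented = count_components(toy_states, epsilon=0.5)   # 2: blobs stay apart
merged = count_components(toy_states, epsilon=10.0)      # 1: everything links
```

Running a count like this layer by layer is what lets you ask whether clusters merge or multiply with depth.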
Results
| Metric | Qwen3-4B (attention) | NanoChat (attention) | Mamba-370m (SSM) |
|---|---|---|---|
| Start clusters | 519 | 517 | 551 |
| End clusters | 1 | 1 | 987 |
| Direction | Collapse | Collapse | Proliferation |
| Intrinsic dimension (ID) profile | Inverted U (peak 9.8) | Inverted U (peak 12.1) | Flat (~3) |
| H1 loop peak | 342 | 400 | None |
Mamba's cluster count rises through the network, from 551 to 987. There's no integration layer, no unified manifold, no topological phase transition.
A three-condition framework
Combined with a run on an untrained, randomly initialized transformer, these results give a clean causal picture of what produces topological integration:
| Condition | Trained transformer | Mamba (SSM) | Untrained transformer |
|---|---|---|---|
| Direct interaction (attention) | Yes | No | Yes |
| Optimization (gradient descent) | Yes | Yes | No |
| Topological collapse | Yes | No | No |
Collapse appears only when both attention and training are present:
- Attention without training (random init) → clusters proliferate (503 → 968)
- Training without attention (Mamba) → clusters proliferate (551 → 987)
- Attention with training → clusters collapse (517 → 1)
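The rule is simple enough to check mechanically. The sketch below encodes the cluster counts reported above and tests them against the "attention AND training" prediction. The names `observations`, `direction`, and `predicted` are made up for illustration. The fourth combination (no attention, no training) isn't among the reported conditions, so the rule is untested there.

```python
# (has_attention, is_trained, start_clusters, end_clusters), as reported above.
observations = {
    "Qwen3-4B": (True, True, 519, 1),
    "NanoChat": (True, True, 517, 1),
    "Mamba-370m": (False, True, 551, 987),
    "Untrained transformer": (True, False, 503, 968),
}


def direction(start, end):
    """Label the observed trend in cluster count from first to last layer."""
    return "Collapse" if end < start else "Proliferation"


def predicted(has_attention, is_trained):
    """The three-condition rule: collapse needs both ingredients."""
    return "Collapse" if has_attention and is_trained else "Proliferation"


# Models whose observed trend disagrees with the rule (empty if none do).
mismatches = [
    name
    for name, (attn, trained, start, end) in observations.items()
    if direction(start, end) != predicted(attn, trained)
]
```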
Why attention seems to be the mechanism
Attention creates direct token-to-token interaction. Within a single layer, a token's representation can draw directly on every token it attends to (every earlier token, under the causal masking used in decoder-style language models). SSM recurrence instead processes tokens sequentially through a state vector, so information from one token reaches another only through that state, which acts as a bottleneck. The direct many-to-many interaction that attention provides looks like the thing that enables representational integration into a single manifold.
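A toy contrast makes the difference in information routing visible. This is a scalar sketch under heavy simplifying assumptions, not Mamba's selective scan and not a real attention layer: `causal_attention` uses the inputs themselves as queries, keys, and values, and `linear_recurrence` uses a fixed `decay`, where Mamba makes its state update input-dependent. Both function names are hypothetical.

```python
import math


def causal_attention(xs):
    """Toy scalar self-attention with causal masking.

    Output t is a softmax-weighted mix of inputs 0..t, so every earlier
    token contributes to token t through its own pairwise score.
    """
    outputs = []
    for t, query in enumerate(xs):
        visible = xs[: t + 1]
        scores = [query * key for key in visible]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append(sum(w / total * v for w, v in zip(weights, visible)))
    return outputs


def linear_recurrence(xs, decay=0.5):
    """Toy scalar state update h_t = decay * h_{t-1} + x_t.

    Token 0 can reach token t only through the running state h.
    """
    h, outputs = 0.0, []
    for x in xs:
        h = decay * h + x
        outputs.append(h)
    return outputs


# Bump the first token by 1 and see how much of it survives to the last step.
base = linear_recurrence([1.0, 2.0, 3.0, 4.0])
bumped = linear_recurrence([2.0, 2.0, 3.0, 4.0])
carried = bumped[-1] - base[-1]  # 0.125, i.e. 0.5 ** 3
```

In the recurrence, whatever the first token contributes to the last one has passed through every intermediate state update. In the attention sketch, the same pair of tokens is connected by a single score computed directly between them.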
What we drew from this
- Architecture matters for integration. Not all sequence models develop unified representations.
- Attention isn't just about parallelism. It provides a qualitatively different kind of computation than sequential state evolution.
- SSMs may make a different computational tradeoff: efficient sequence processing without representational integration.
- Topology discriminates architectures. The same measurement cleanly separates models that integrate from those that don't.
Built by Light of Baldr LLC. Get in touch if any of this is useful to you.