The Mamba Counterexample

If topological integration were universal, state space models should show it too. They don't. Mamba-370m maintains fragmented representations end-to-end.

By Adam Kruger

If the cluster-collapse pattern we documented in the Topology of Thought work were universal across neural sequence models, state space models should show it too. They don't. Mamba-370m maintains fragmented representations from input to output, and its cluster count grows with depth, from 551 at the start to 987 at the end.

The test

Mamba-370m uses selective state space model (SSM) layers instead of attention. It has:

  • No query-key-value projections
  • No attention matrices
  • No direct token-to-token interaction
  • Only recurrent state evolution through selective scan

We ran the same persistent homology analysis (Vietoris-Rips via Ripser) on Mamba's hidden states that we had run on Qwen3-4B and NanoChat.
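For the cluster counts specifically, the number of H0 classes alive at a given scale in a Vietoris-Rips filtration equals the number of connected components of the graph that links points within that distance. A minimal pure-Python sketch of that count follows. It is a stand-in for illustration only: `count_clusters`, the `eps` threshold, and the toy points are hypothetical, and the actual analysis used Ripser on real hidden states.

```python
import math
from itertools import combinations


def count_clusters(points, eps):
    """Count connected components when points within distance eps are linked.

    This matches the number of H0 classes alive at scale eps in a
    Vietoris-Rips filtration, assuming the convention that an edge
    appears once the pairwise distance is <= eps.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= eps:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[ri] = rj

    return len({find(i) for i in range(len(points))})


# Hypothetical stand-in for one layer's hidden states:
# two tight groups plus one outlier.
layer_states = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0),
    (10.0, 0.0),
]
print(count_clusters(layer_states, eps=0.5))   # 3
print(count_clusters(layer_states, eps=20.0))  # 1
```

Running a count like this on each layer's hidden states at a fixed scale gives one number per layer, which is the kind of start-versus-end comparison reported in the results.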

Results

| Metric | Qwen3-4B (attention) | NanoChat (attention) | Mamba-370m (SSM) |
|---|---|---|---|
| Start clusters | 519 | 517 | 551 |
| End clusters | 1 | 1 | 987 |
| Direction | Collapse | Collapse | Proliferation |
| ID profile | Inverted U (peak 9.8) | Inverted U (peak 12.1) | Flat (~3) |
| H1 loop peak | 342 | 400 | None |

Mamba's cluster count increases through the network. There's no integration layer, no unified manifold, no topological phase transition.

A three-condition framework

Set against an untrained (randomly initialized) transformer as a second control, this result points to two conditions that jointly produce topological integration:

| Condition | Trained transformer | Mamba (SSM) | Untrained transformer |
|---|---|---|---|
| Direct interaction (attention) | Yes | No | Yes |
| Optimization (gradient descent) | Yes | Yes | No |
| Topological collapse | Yes | No | No |

In these runs, collapse appeared only when both conditions were present:

  • Attention without training (random init) → clusters proliferate (503 → 968)
  • Training without attention (Mamba) → clusters proliferate (551 → 987)
  • Attention with training → clusters collapse (519 → 1 for Qwen3-4B, 517 → 1 for NanoChat)
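The three cases above can be written as a small consistency check: label each run by whether its cluster count shrank or grew, then compare that with the rule "collapse only if attention and training are both present". This is a sketch for illustration. The numbers are transcribed from this post, while the `observations` layout and the `direction` helper are hypothetical names introduced here.

```python
# Observations transcribed from the post:
# name -> (has_attention, is_trained, start_clusters, end_clusters)
observations = {
    "Trained transformer (Qwen3-4B)": (True, True, 519, 1),
    "Trained transformer (NanoChat)": (True, True, 517, 1),
    "Mamba-370m (SSM)": (False, True, 551, 987),
    "Untrained transformer": (True, False, 503, 968),
}


def direction(start, end):
    """Label a run by whether the cluster count shrank or grew with depth."""
    if end < start:
        return "collapse"
    if end > start:
        return "proliferation"
    return "flat"


for name, (attention, trained, start, end) in observations.items():
    observed = direction(start, end)
    # The rule under test: collapse requires both conditions.
    predicted = "collapse" if (attention and trained) else "proliferation"
    status = "consistent" if observed == predicted else "inconsistent"
    print(f"{name}: {start} -> {end} ({observed}); rule says {predicted}: {status}")
```

Every run listed is consistent with the rule. The fourth combination (no attention and no training) is not among the reported runs, so the check covers three of the four possible cells.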

Why attention seems to be the mechanism

Attention creates direct token-to-token interaction: within a single layer, each token's representation can draw directly on every token it attends to. SSM recurrence instead processes tokens sequentially through a fixed-size state vector, which acts as a bottleneck. The direct many-to-many interaction that attention provides appears to be what enables representations to integrate into a single manifold.
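The difference in routing can be illustrated with a toy comparison. The sketch below is not Mamba's selective scan and not real attention: `causal_mean_mix` uses uniform weights as a stand-in for learned attention weights, and `linear_recurrence` uses a fixed hypothetical decay `a`, whereas Mamba's state update is input-dependent. It only shows how one early token reaches a late position in each scheme.

```python
def causal_mean_mix(xs):
    """Toy attention-style layer: each position averages all positions
    up to and including itself (uniform weights, causal)."""
    out = []
    running = 0.0
    for t, x in enumerate(xs):
        running += x
        out.append(running / (t + 1))
    return out


def linear_recurrence(xs, a=0.5, b=1.0):
    """Toy SSM-style layer: h_t = a * h_{t-1} + b * x_t, output is h_t."""
    out = []
    h = 0.0
    for x in xs:
        h = a * h + b * x
        out.append(h)
    return out


def influence_of_first_on_last(layer, xs, delta=1.0):
    """Change in the last output when the first input is shifted by delta."""
    bumped = [xs[0] + delta] + list(xs[1:])
    return layer(bumped)[-1] - layer(xs)[-1]


tokens = [0.0] * 8
print(influence_of_first_on_last(causal_mean_mix, tokens))    # 0.125
print(influence_of_first_on_last(linear_recurrence, tokens))  # 0.0078125
```

In the averaging layer the first token reaches the last position through a single weighted sum. In the recurrence it has to survive seven state updates, each scaling it by `a`. With `a` close to 1 the gap shrinks, so this illustrates the routing difference rather than anything about Mamba's learned dynamics.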

What we drew from this

  1. Architecture matters for integration. Not all sequence models develop unified representations.
  2. Attention isn't just about parallelism. It provides a qualitatively different kind of computation than sequential state evolution.
  3. SSMs may make a different computational tradeoff: efficient sequence processing without representational integration.
  4. Topology discriminates architectures. The same measurement cleanly separates models that integrate from those that don't.

Built by Light of Baldr LLC. Get in touch if any of this is useful to you.