Manifold Geometry Visualizer

Manifold

A manifold is a space that looks flat when you zoom in on any small patch. A sphere is a 2D manifold: it curves through 3D space, but any small patch looks like a flat plane. Here, the model stores a count (a single number) on a 1D manifold, a curve twisting through high-dimensional space. Each position along the curve corresponds to a different count.

Low-Dimensional Subspace

The residual stream has thousands of dimensions, but the count manifold occupies only about 6 of them, which together capture 95% of its variance. Those 6 dimensions form a "subspace": a smaller flat space embedded inside the larger one. The helix lives inside this subspace, and the remaining dimensions are used for other things.
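To make the "95% of the variance" criterion concrete, here is a minimal sketch of how you might count the dimensions needed to reach a variance target. The variance values are made up for illustration (they are not measurements from any model), and `dims_for_variance` is a hypothetical helper name.

```python
# Hypothetical per-component variances, sorted largest first.
# These numbers are invented for illustration only.
variances = [30.0, 25.0, 15.0, 12.0, 8.0, 6.0, 1.5, 1.0, 0.8, 0.7]

def dims_for_variance(variances, target=0.95):
    """Return the smallest number of leading components whose
    cumulative variance reaches `target` (a fraction of the total)."""
    total = sum(variances)
    running = 0.0
    for k, v in enumerate(variances, start=1):
        running += v
        if running / total >= target:
            return k
    return len(variances)

print(dims_for_variance(variances))  # 6 for these made-up numbers
```

In practice the variances would come from a principal component analysis of the activations; that step is left out here.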

Helix

A helix is a spiral that advances through space as it rotates. Counting on a helix means that as the count increases, the representation both rotates around a circle and moves forward along the axis. Points far apart in count end up close to orthogonal (perpendicular), which makes them easy to distinguish even in a noisy space. This gives more resolution per dimension than a straight line or a circle alone.
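One way to build intuition for this is a toy 6D curve made of three rotating (cos, sin) pairs, each turning at a different rate. This is a sketch under stated assumptions, not the curve a model actually learns: the periods are hypothetical and were chosen so that counts 45 apart come out orthogonal. Other separations give other similarity values.

```python
import math

# Hypothetical rotation periods (in characters), one per sine/cosine pair.
PERIODS = (20.0, 60.0, 180.0)

def embed(count):
    """Map a count to a 6D point: three (cos, sin) pairs at different rates."""
    point = []
    for period in PERIODS:
        angle = 2.0 * math.pi * count / period
        point.extend((math.cos(angle), math.sin(angle)))
    return point

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

near = cosine_similarity(embed(10), embed(11))  # adjacent counts
far = cosine_similarity(embed(10), embed(55))   # counts 45 apart
print(f"near={near:.3f} far={far:.3f}")
```

Adjacent counts point in almost the same direction, while the chosen far pair is perpendicular up to floating-point error.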

Why Not Integers?

Storing count=42 in a dedicated neuron #42 requires N neurons for N possible values, which is expensive in dimensions. A helix covers a large range of counts with ~6 dimensions: dense but distinguishable.

Why Not a Scalar?

A single number (magnitude) loses fine-grained resolution under the noise of a high-dimensional residual stream. The helix encodes count as angular position, which is robust to magnitude noise because you're reading direction, not size.
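The difference can be sketched with a 2D circle standing in for the helix. Here the count is encoded as an angle and a random gain is applied to the vector. The period of 100 and the gain of 0.7 are arbitrary assumptions chosen for the example.

```python
import math

PERIOD = 100  # hypothetical: counts 0..99 map to one full turn

def encode(count):
    """Place a count on the unit circle."""
    angle = 2.0 * math.pi * count / PERIOD
    return (math.cos(angle), math.sin(angle))

def decode(vector):
    """Read the count back from direction only, ignoring vector length."""
    x, y = vector
    angle = math.atan2(y, x) % (2.0 * math.pi)
    return round(angle * PERIOD / (2.0 * math.pi)) % PERIOD

def scale(vector, gain):
    """Multiply every coordinate by the same gain (magnitude noise)."""
    return tuple(gain * v for v in vector)

gain = 0.7
angular_readout = decode(scale(encode(42), gain))  # direction is unchanged
scalar_readout = round(gain * 42)                  # magnitude code is corrupted
print(angular_readout, scalar_readout)
```

The angular readout still returns 42, while the magnitude-based readout drifts to 29. Note that this only covers noise in magnitude; noise that changes direction would affect the angular code too.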

Discretization via Sparse Features

The continuous helix curve is discretized by sparse features: each feature fires for a range of counts, like place cells in the brain. You can view the same representation as (a) a family of discrete features that fire over different count ranges, or (b) an angular position on a continuous helix. Both views describe the same representation.
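A rough sketch of view (a): a bank of overlapping features, each tuned to a range of counts, from which the count can be read back. The spacing, the width, and the triangular tuning shape are all assumptions made for this example; real features need not be evenly spaced or shaped this way.

```python
# Hypothetical tiling: one feature every 10 counts, each with a triangular
# tuning curve that overlaps its neighbours.
CENTERS = list(range(0, 101, 10))
WIDTH = 10.0

def feature_activations(count):
    """Activation of each feature: 1 at its center, falling to 0 at WIDTH away."""
    return [max(0.0, 1.0 - abs(count - c) / WIDTH) for c in CENTERS]

def read_count(activations):
    """Recover the count as the activation-weighted mean of feature centers."""
    total = sum(activations)
    return sum(a * c for a, c in zip(activations, CENTERS)) / total

acts = feature_activations(42)
active = [(c, round(a, 2)) for c, a in zip(CENTERS, acts) if a > 0]
print(active, read_count(acts))
```

For a count of 42, only the features centered at 40 and 50 are active, and their relative strengths pin down the position between them.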

Stage 1 — Accumulate

Each token has a character length (e.g. "hello" = 5). These lengths are summed across tokens into a running character count, which is stored as a position on the helix manifold.
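The accumulation itself is a running sum. A minimal sketch, with an example token list chosen for illustration (the model does this implicitly in its activations, not with explicit code like this):

```python
from itertools import accumulate

def running_char_counts(tokens):
    """Cumulative character count after each token."""
    return list(accumulate(len(token) for token in tokens))

tokens = ["hello", " world", "!"]
print(running_char_counts(tokens))  # [5, 11, 12]
```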

Stage 2 — Twist

Attention heads geometrically transform the count manifold to produce a "distance-to-boundary" estimate, using the line width constraint (e.g. 80 chars) as a reference. The twist is a literal rotation or shear within the 6D subspace.
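A simplified picture of the twist, reduced to a single 2D plane: if both the count and the line width are encoded as angles, rotating the line-width point back by the count's angle leaves an angle that encodes the difference between them. This is only a sketch of the idea. The period of 200 is a hypothetical choice, and the real transformation acts in the 6D subspace and may include shear.

```python
import math

PERIOD = 200  # hypothetical: one full turn covers 200 characters

def angle_of(count):
    return 2.0 * math.pi * count / PERIOD

def encode(count):
    """Place a count on the unit circle."""
    a = angle_of(count)
    return (math.cos(a), math.sin(a))

def rotate(vector, phi):
    """Rotate a 2D vector counter-clockwise by phi radians."""
    x, y = vector
    return (x * math.cos(phi) - y * math.sin(phi),
            x * math.sin(phi) + y * math.cos(phi))

def chars_remaining(count, line_width):
    """Rotate the line-width point back by the count's angle and read off
    the angle that is left, which encodes line_width - count."""
    x, y = rotate(encode(line_width), -angle_of(count))
    return round(math.atan2(y, x) * PERIOD / (2.0 * math.pi))

print(chars_remaining(70, 80))
```

With a count of 70 and a width of 80 the readout is 10. The sketch is only valid while the difference stays within half a period.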

Stage 3 — Decide

Multiple "distance-to-boundary" estimates from different attention heads are arranged orthogonally to each other. This creates a linear decision boundary that is easy to threshold: if the projection crosses zero, insert a newline.
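Thresholding a linear projection can be sketched as a weighted sum plus a bias, with the sign deciding the outcome. The weights, the bias, and the example estimates below are hypothetical values picked for illustration. Here each head's estimate is treated as one coordinate of a vector.

```python
def projection(estimates, weights, bias):
    """Linear score: dot(weights, estimates) + bias."""
    return sum(w * e for w, e in zip(weights, estimates)) + bias

def should_insert_newline(estimates, weights, bias):
    """Insert a newline when the projection falls below zero."""
    return projection(estimates, weights, bias) < 0.0

# Hypothetical numbers: three heads each estimate characters remaining,
# the weights average them, and the bias encodes a 5-character threshold.
weights = [1 / 3, 1 / 3, 1 / 3]
bias = -5.0

print(should_insert_newline([3.0, 2.5, 3.5], weights, bias))     # True
print(should_insert_newline([12.0, 11.5, 12.5], weights, bias))  # False
```

With about 3 characters left the score is negative and a newline is inserted. With about 12 left it stays positive.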

Legend: helix manifold (count positions); current count position; sparse feature activations; decision boundary plane.