---
abstract: |
  Language models trained on natural text learn to represent numbers using periodic features with dominant periods at $T=2, 5, 10$. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features that have period-$T$ spikes in the Fourier domain, only some learn *geometrically separable* features that can be used to linearly classify a number mod-$T$. To explain this incongruity, we prove that Fourier domain sparsity is necessary but not sufficient for mod-$T$ geometric separability. Empirically, we investigate when model training yields geometrically separable features, finding that the data, architecture, optimizer, and tokenizer all play key roles. In particular, we identify two different routes through which models can acquire geometrically separable features: they can learn them from complementary co-occurrence signals in general language data, including text-number co-occurrence and cross-number interaction, or from multi-token (but not single-token) addition problems. Overall, our results highlight the phenomenon of *convergent evolution* in feature learning: A diverse range of models learn similar features from different training signals.

  `\Large`{=latex}`\huggingface`{=latex} Models: <https://hf.co/collections/deqing/convergent-evolution>
author:
- |
  **Deqing Fu**$^{\usc}$ **Tianyi Zhou**$^{\usc}$ **Mikhail Belkin**$^{\ucsd}$ **Vatsal Sharan**$^{\usc}$ **Robin Jia**$^{\usc}$\
  $^{\usc}$University of Southern California $^{\ucsd}$UC San Diego\
  `{deqingfu,tzhou029,vsharan,robinjia}@usc.edu, mbelkin@ucsd.edu`
bibliography:
- main.bib
title: |
  Convergent Evolution: How Different Language Models\
  Learn Similar Number Representations
---

```{=latex}
\newcommand{\usc}{\ensuremath{\textcolor[HTML]{990000}{\sigma}}}
```
```{=latex}
\newcommand{\ucsd}{\ensuremath{\textcolor[HTML]{182B49}{\psi}}}
```
```{=latex}
\newcommand{\fix}{\marginpar{FIX}}
```
```{=latex}
\newcommand{\new}{\marginpar{NEW}}
```
```{=latex}
\newcommand{\figleft}{{\em (Left)}}
```
```{=latex}
\newcommand{\figcenter}{{\em (Center)}}
```
```{=latex}
\newcommand{\figright}{{\em (Right)}}
```
```{=latex}
\newcommand{\figtop}{{\em (Top)}}
```
```{=latex}
\newcommand{\figbottom}{{\em (Bottom)}}
```
```{=latex}
\newcommand{\captiona}{{\em (a)}}
```
```{=latex}
\newcommand{\captionb}{{\em (b)}}
```
```{=latex}
\newcommand{\captionc}{{\em (c)}}
```
```{=latex}
\newcommand{\captiond}{{\em (d)}}
```
```{=latex}
\newcommand{\newterm}[1]{{\bf #1}}
```
```{=latex}
\def\figref#1{figure~\ref{#1}}
```
```{=latex}
\def\Figref#1{Figure~\ref{#1}}
```
```{=latex}
\def\twofigref#1#2{figures \ref{#1} and \ref{#2}}
```
```{=latex}
\def\quadfigref#1#2#3#4{figures \ref{#1}, \ref{#2}, \ref{#3} and \ref{#4}}
```
```{=latex}
\def\secref#1{section~\ref{#1}}
```
```{=latex}
\def\Secref#1{Section~\ref{#1}}
```
```{=latex}
\def\twosecrefs#1#2{sections \ref{#1} and \ref{#2}}
```
```{=latex}
\def\secrefs#1#2#3{sections \ref{#1}, \ref{#2} and \ref{#3}}
```
```{=latex}
\def\eqref#1{equation~\ref{#1}}
```
```{=latex}
\def\Eqref#1{Equation~\ref{#1}}
```
```{=latex}
\def\plaineqref#1{\ref{#1}}
```
```{=latex}
\def\chapref#1{chapter~\ref{#1}}
```
```{=latex}
\def\Chapref#1{Chapter~\ref{#1}}
```
```{=latex}
\def\rangechapref#1#2{chapters\ref{#1}--\ref{#2}}
```
```{=latex}
\def\algref#1{algorithm~\ref{#1}}
```
```{=latex}
\def\Algref#1{Algorithm~\ref{#1}}
```
```{=latex}
\def\twoalgref#1#2{algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\Twoalgref#1#2{Algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\partref#1{part~\ref{#1}}
```
```{=latex}
\def\Partref#1{Part~\ref{#1}}
```
```{=latex}
\def\twopartref#1#2{parts \ref{#1} and \ref{#2}}
```
```{=latex}
\def\ceil#1{\lceil #1 \rceil}
```
```{=latex}
\def\floor#1{\lfloor #1 \rfloor}
```
```{=latex}
\def\1{\bm{1}}
```
```{=latex}
\newcommand{\train}{\mathcal{D}}
```
```{=latex}
\newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}}
```
```{=latex}
\newcommand{\test}{\mathcal{D_{\mathrm{test}}}}
```
```{=latex}
\def\eps{{\epsilon}}
```
```{=latex}
\def\reta{{\textnormal{$\eta$}}}
```
```{=latex}
\def\ra{{\textnormal{a}}}
```
```{=latex}
\def\rb{{\textnormal{b}}}
```
```{=latex}
\def\rc{{\textnormal{c}}}
```
```{=latex}
\def\rd{{\textnormal{d}}}
```
```{=latex}
\def\re{{\textnormal{e}}}
```
```{=latex}
\def\rf{{\textnormal{f}}}
```
```{=latex}
\def\rg{{\textnormal{g}}}
```
```{=latex}
\def\rh{{\textnormal{h}}}
```
```{=latex}
\def\ri{{\textnormal{i}}}
```
```{=latex}
\def\rj{{\textnormal{j}}}
```
```{=latex}
\def\rk{{\textnormal{k}}}
```
```{=latex}
\def\rl{{\textnormal{l}}}
```
```{=latex}
\def\rn{{\textnormal{n}}}
```
```{=latex}
\def\ro{{\textnormal{o}}}
```
```{=latex}
\def\rp{{\textnormal{p}}}
```
```{=latex}
\def\rq{{\textnormal{q}}}
```
```{=latex}
\def\rr{{\textnormal{r}}}
```
```{=latex}
\def\rs{{\textnormal{s}}}
```
```{=latex}
\def\rt{{\textnormal{t}}}
```
```{=latex}
\def\ru{{\textnormal{u}}}
```
```{=latex}
\def\rv{{\textnormal{v}}}
```
```{=latex}
\def\rw{{\textnormal{w}}}
```
```{=latex}
\def\rx{{\textnormal{x}}}
```
```{=latex}
\def\ry{{\textnormal{y}}}
```
```{=latex}
\def\rz{{\textnormal{z}}}
```
```{=latex}
\def\rvepsilon{{\mathbf{\epsilon}}}
```
```{=latex}
\def\rvtheta{{\mathbf{\theta}}}
```
```{=latex}
\def\rva{{\mathbf{a}}}
```
```{=latex}
\def\rvb{{\mathbf{b}}}
```
```{=latex}
\def\rvc{{\mathbf{c}}}
```
```{=latex}
\def\rvd{{\mathbf{d}}}
```
```{=latex}
\def\rve{{\mathbf{e}}}
```
```{=latex}
\def\rvf{{\mathbf{f}}}
```
```{=latex}
\def\rvg{{\mathbf{g}}}
```
```{=latex}
\def\rvh{{\mathbf{h}}}
```
```{=latex}
\def\rvu{{\mathbf{i}}}
```
```{=latex}
\def\rvj{{\mathbf{j}}}
```
```{=latex}
\def\rvk{{\mathbf{k}}}
```
```{=latex}
\def\rvl{{\mathbf{l}}}
```
```{=latex}
\def\rvm{{\mathbf{m}}}
```
```{=latex}
\def\rvn{{\mathbf{n}}}
```
```{=latex}
\def\rvo{{\mathbf{o}}}
```
```{=latex}
\def\rvp{{\mathbf{p}}}
```
```{=latex}
\def\rvq{{\mathbf{q}}}
```
```{=latex}
\def\rvr{{\mathbf{r}}}
```
```{=latex}
\def\rvs{{\mathbf{s}}}
```
```{=latex}
\def\rvt{{\mathbf{t}}}
```
```{=latex}
\def\rvu{{\mathbf{u}}}
```
```{=latex}
\def\rvv{{\mathbf{v}}}
```
```{=latex}
\def\rvw{{\mathbf{w}}}
```
```{=latex}
\def\rvx{{\mathbf{x}}}
```
```{=latex}
\def\rvy{{\mathbf{y}}}
```
```{=latex}
\def\rvz{{\mathbf{z}}}
```
```{=latex}
\def\erva{{\textnormal{a}}}
```
```{=latex}
\def\ervb{{\textnormal{b}}}
```
```{=latex}
\def\ervc{{\textnormal{c}}}
```
```{=latex}
\def\ervd{{\textnormal{d}}}
```
```{=latex}
\def\erve{{\textnormal{e}}}
```
```{=latex}
\def\ervf{{\textnormal{f}}}
```
```{=latex}
\def\ervg{{\textnormal{g}}}
```
```{=latex}
\def\ervh{{\textnormal{h}}}
```
```{=latex}
\def\ervi{{\textnormal{i}}}
```
```{=latex}
\def\ervj{{\textnormal{j}}}
```
```{=latex}
\def\ervk{{\textnormal{k}}}
```
```{=latex}
\def\ervl{{\textnormal{l}}}
```
```{=latex}
\def\ervm{{\textnormal{m}}}
```
```{=latex}
\def\ervn{{\textnormal{n}}}
```
```{=latex}
\def\ervo{{\textnormal{o}}}
```
```{=latex}
\def\ervp{{\textnormal{p}}}
```
```{=latex}
\def\ervq{{\textnormal{q}}}
```
```{=latex}
\def\ervr{{\textnormal{r}}}
```
```{=latex}
\def\ervs{{\textnormal{s}}}
```
```{=latex}
\def\ervt{{\textnormal{t}}}
```
```{=latex}
\def\ervu{{\textnormal{u}}}
```
```{=latex}
\def\ervv{{\textnormal{v}}}
```
```{=latex}
\def\ervw{{\textnormal{w}}}
```
```{=latex}
\def\ervx{{\textnormal{x}}}
```
```{=latex}
\def\ervy{{\textnormal{y}}}
```
```{=latex}
\def\ervz{{\textnormal{z}}}
```
```{=latex}
\def\rmA{{\mathbf{A}}}
```
```{=latex}
\def\rmB{{\mathbf{B}}}
```
```{=latex}
\def\rmC{{\mathbf{C}}}
```
```{=latex}
\def\rmD{{\mathbf{D}}}
```
```{=latex}
\def\rmE{{\mathbf{E}}}
```
```{=latex}
\def\rmF{{\mathbf{F}}}
```
```{=latex}
\def\rmG{{\mathbf{G}}}
```
```{=latex}
\def\rmH{{\mathbf{H}}}
```
```{=latex}
\def\rmI{{\mathbf{I}}}
```
```{=latex}
\def\rmJ{{\mathbf{J}}}
```
```{=latex}
\def\rmK{{\mathbf{K}}}
```
```{=latex}
\def\rmL{{\mathbf{L}}}
```
```{=latex}
\def\rmM{{\mathbf{M}}}
```
```{=latex}
\def\rmN{{\mathbf{N}}}
```
```{=latex}
\def\rmO{{\mathbf{O}}}
```
```{=latex}
\def\rmP{{\mathbf{P}}}
```
```{=latex}
\def\rmQ{{\mathbf{Q}}}
```
```{=latex}
\def\rmR{{\mathbf{R}}}
```
```{=latex}
\def\rmS{{\mathbf{S}}}
```
```{=latex}
\def\rmT{{\mathbf{T}}}
```
```{=latex}
\def\rmU{{\mathbf{U}}}
```
```{=latex}
\def\rmV{{\mathbf{V}}}
```
```{=latex}
\def\rmW{{\mathbf{W}}}
```
```{=latex}
\def\rmX{{\mathbf{X}}}
```
```{=latex}
\def\rmY{{\mathbf{Y}}}
```
```{=latex}
\def\rmZ{{\mathbf{Z}}}
```
```{=latex}
\def\ermA{{\textnormal{A}}}
```
```{=latex}
\def\ermB{{\textnormal{B}}}
```
```{=latex}
\def\ermC{{\textnormal{C}}}
```
```{=latex}
\def\ermD{{\textnormal{D}}}
```
```{=latex}
\def\ermE{{\textnormal{E}}}
```
```{=latex}
\def\ermF{{\textnormal{F}}}
```
```{=latex}
\def\ermG{{\textnormal{G}}}
```
```{=latex}
\def\ermH{{\textnormal{H}}}
```
```{=latex}
\def\ermI{{\textnormal{I}}}
```
```{=latex}
\def\ermJ{{\textnormal{J}}}
```
```{=latex}
\def\ermK{{\textnormal{K}}}
```
```{=latex}
\def\ermL{{\textnormal{L}}}
```
```{=latex}
\def\ermM{{\textnormal{M}}}
```
```{=latex}
\def\ermN{{\textnormal{N}}}
```
```{=latex}
\def\ermO{{\textnormal{O}}}
```
```{=latex}
\def\ermP{{\textnormal{P}}}
```
```{=latex}
\def\ermQ{{\textnormal{Q}}}
```
```{=latex}
\def\ermR{{\textnormal{R}}}
```
```{=latex}
\def\ermS{{\textnormal{S}}}
```
```{=latex}
\def\ermT{{\textnormal{T}}}
```
```{=latex}
\def\ermU{{\textnormal{U}}}
```
```{=latex}
\def\ermV{{\textnormal{V}}}
```
```{=latex}
\def\ermW{{\textnormal{W}}}
```
```{=latex}
\def\ermX{{\textnormal{X}}}
```
```{=latex}
\def\ermY{{\textnormal{Y}}}
```
```{=latex}
\def\ermZ{{\textnormal{Z}}}
```
```{=latex}
\def\vzero{{\bm{0}}}
```
```{=latex}
\def\vone{{\bm{1}}}
```
```{=latex}
\def\vmu{{\bm{\mu}}}
```
```{=latex}
\def\vtheta{{\bm{\theta}}}
```
```{=latex}
\def\va{{\bm{a}}}
```
```{=latex}
\def\vb{{\bm{b}}}
```
```{=latex}
\def\vc{{\bm{c}}}
```
```{=latex}
\def\vd{{\bm{d}}}
```
```{=latex}
\def\ve{{\bm{e}}}
```
```{=latex}
\def\vf{{\bm{f}}}
```
```{=latex}
\def\vg{{\bm{g}}}
```
```{=latex}
\def\vh{{\bm{h}}}
```
```{=latex}
\def\vi{{\bm{i}}}
```
```{=latex}
\def\vj{{\bm{j}}}
```
```{=latex}
\def\vk{{\bm{k}}}
```
```{=latex}
\def\vl{{\bm{l}}}
```
```{=latex}
\def\vm{{\bm{m}}}
```
```{=latex}
\def\vn{{\bm{n}}}
```
```{=latex}
\def\vo{{\bm{o}}}
```
```{=latex}
\def\vp{{\bm{p}}}
```
```{=latex}
\def\vq{{\bm{q}}}
```
```{=latex}
\def\vr{{\bm{r}}}
```
```{=latex}
\def\vs{{\bm{s}}}
```
```{=latex}
\def\vt{{\bm{t}}}
```
```{=latex}
\def\vu{{\bm{u}}}
```
```{=latex}
\def\vv{{\bm{v}}}
```
```{=latex}
\def\vw{{\bm{w}}}
```
```{=latex}
\def\vx{{\bm{x}}}
```
```{=latex}
\def\vy{{\bm{y}}}
```
```{=latex}
\def\vz{{\bm{z}}}
```
```{=latex}
\def\vF{{\bm{F}}}
```
```{=latex}
\def\vpsi{{\boldsymbol{\psi}}}
```
```{=latex}
\def\evalpha{{\alpha}}
```
```{=latex}
\def\evbeta{{\beta}}
```
```{=latex}
\def\evepsilon{{\epsilon}}
```
```{=latex}
\def\evlambda{{\lambda}}
```
```{=latex}
\def\evomega{{\omega}}
```
```{=latex}
\def\evmu{{\mu}}
```
```{=latex}
\def\evpsi{{\psi}}
```
```{=latex}
\def\evsigma{{\sigma}}
```
```{=latex}
\def\evtheta{{\theta}}
```
```{=latex}
\def\eva{{a}}
```
```{=latex}
\def\evb{{b}}
```
```{=latex}
\def\evc{{c}}
```
```{=latex}
\def\evd{{d}}
```
```{=latex}
\def\eve{{e}}
```
```{=latex}
\def\evf{{f}}
```
```{=latex}
\def\evg{{g}}
```
```{=latex}
\def\evh{{h}}
```
```{=latex}
\def\evi{{i}}
```
```{=latex}
\def\evj{{j}}
```
```{=latex}
\def\evk{{k}}
```
```{=latex}
\def\evl{{l}}
```
```{=latex}
\def\evm{{m}}
```
```{=latex}
\def\evn{{n}}
```
```{=latex}
\def\evo{{o}}
```
```{=latex}
\def\evp{{p}}
```
```{=latex}
\def\evq{{q}}
```
```{=latex}
\def\evr{{r}}
```
```{=latex}
\def\evs{{s}}
```
```{=latex}
\def\evt{{t}}
```
```{=latex}
\def\evu{{u}}
```
```{=latex}
\def\evv{{v}}
```
```{=latex}
\def\evw{{w}}
```
```{=latex}
\def\evx{{x}}
```
```{=latex}
\def\evy{{y}}
```
```{=latex}
\def\evz{{z}}
```
```{=latex}
\def\mA{{\bm{A}}}
```
```{=latex}
\def\mB{{\bm{B}}}
```
```{=latex}
\def\mC{{\bm{C}}}
```
```{=latex}
\def\mD{{\bm{D}}}
```
```{=latex}
\def\mE{{\bm{E}}}
```
```{=latex}
\def\mF{{\bm{F}}}
```
```{=latex}
\def\mG{{\bm{G}}}
```
```{=latex}
\def\mH{{\bm{H}}}
```
```{=latex}
\def\mI{{\bm{I}}}
```
```{=latex}
\def\mJ{{\bm{J}}}
```
```{=latex}
\def\mK{{\bm{K}}}
```
```{=latex}
\def\mL{{\bm{L}}}
```
```{=latex}
\def\mM{{\bm{M}}}
```
```{=latex}
\def\mN{{\bm{N}}}
```
```{=latex}
\def\mO{{\bm{O}}}
```
```{=latex}
\def\mP{{\bm{P}}}
```
```{=latex}
\def\mQ{{\bm{Q}}}
```
```{=latex}
\def\mR{{\bm{R}}}
```
```{=latex}
\def\mS{{\bm{S}}}
```
```{=latex}
\def\mT{{\bm{T}}}
```
```{=latex}
\def\mU{{\bm{U}}}
```
```{=latex}
\def\mV{{\bm{V}}}
```
```{=latex}
\def\mW{{\bm{W}}}
```
```{=latex}
\def\mX{{\bm{X}}}
```
```{=latex}
\def\mY{{\bm{Y}}}
```
```{=latex}
\def\mZ{{\bm{Z}}}
```
```{=latex}
\def\mBeta{{\bm{\beta}}}
```
```{=latex}
\def\mPhi{{\bm{\Phi}}}
```
```{=latex}
\def\mLambda{{\bm{\Lambda}}}
```
```{=latex}
\def\mSigma{{\bm{\Sigma}}}
```
```{=latex}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
```
```{=latex}
\def\tA{{\tens{A}}}
```
```{=latex}
\def\tB{{\tens{B}}}
```
```{=latex}
\def\tC{{\tens{C}}}
```
```{=latex}
\def\tD{{\tens{D}}}
```
```{=latex}
\def\tE{{\tens{E}}}
```
```{=latex}
\def\tF{{\tens{F}}}
```
```{=latex}
\def\tG{{\tens{G}}}
```
```{=latex}
\def\tH{{\tens{H}}}
```
```{=latex}
\def\tI{{\tens{I}}}
```
```{=latex}
\def\tJ{{\tens{J}}}
```
```{=latex}
\def\tK{{\tens{K}}}
```
```{=latex}
\def\tL{{\tens{L}}}
```
```{=latex}
\def\tM{{\tens{M}}}
```
```{=latex}
\def\tN{{\tens{N}}}
```
```{=latex}
\def\tO{{\tens{O}}}
```
```{=latex}
\def\tP{{\tens{P}}}
```
```{=latex}
\def\tQ{{\tens{Q}}}
```
```{=latex}
\def\tR{{\tens{R}}}
```
```{=latex}
\def\tS{{\tens{S}}}
```
```{=latex}
\def\tT{{\tens{T}}}
```
```{=latex}
\def\tU{{\tens{U}}}
```
```{=latex}
\def\tV{{\tens{V}}}
```
```{=latex}
\def\tW{{\tens{W}}}
```
```{=latex}
\def\tX{{\tens{X}}}
```
```{=latex}
\def\tY{{\tens{Y}}}
```
```{=latex}
\def\tZ{{\tens{Z}}}
```
```{=latex}
\def\gA{{\mathcal{A}}}
```
```{=latex}
\def\gB{{\mathcal{B}}}
```
```{=latex}
\def\gC{{\mathcal{C}}}
```
```{=latex}
\def\gD{{\mathcal{D}}}
```
```{=latex}
\def\gE{{\mathcal{E}}}
```
```{=latex}
\def\gF{{\mathcal{F}}}
```
```{=latex}
\def\gG{{\mathcal{G}}}
```
```{=latex}
\def\gH{{\mathcal{H}}}
```
```{=latex}
\def\gI{{\mathcal{I}}}
```
```{=latex}
\def\gJ{{\mathcal{J}}}
```
```{=latex}
\def\gK{{\mathcal{K}}}
```
```{=latex}
\def\gL{{\mathcal{L}}}
```
```{=latex}
\def\gM{{\mathcal{M}}}
```
```{=latex}
\def\gN{{\mathcal{N}}}
```
```{=latex}
\def\gO{{\mathcal{O}}}
```
```{=latex}
\def\gP{{\mathcal{P}}}
```
```{=latex}
\def\gQ{{\mathcal{Q}}}
```
```{=latex}
\def\gR{{\mathcal{R}}}
```
```{=latex}
\def\gS{{\mathcal{S}}}
```
```{=latex}
\def\gT{{\mathcal{T}}}
```
```{=latex}
\def\gU{{\mathcal{U}}}
```
```{=latex}
\def\gV{{\mathcal{V}}}
```
```{=latex}
\def\gW{{\mathcal{W}}}
```
```{=latex}
\def\gX{{\mathcal{X}}}
```
```{=latex}
\def\gY{{\mathcal{Y}}}
```
```{=latex}
\def\gZ{{\mathcal{Z}}}
```
```{=latex}
\def\sA{{\mathbb{A}}}
```
```{=latex}
\def\sB{{\mathbb{B}}}
```
```{=latex}
\def\sC{{\mathbb{C}}}
```
```{=latex}
\def\sD{{\mathbb{D}}}
```
```{=latex}
\def\sF{{\mathbb{F}}}
```
```{=latex}
\def\sG{{\mathbb{G}}}
```
```{=latex}
\def\sH{{\mathbb{H}}}
```
```{=latex}
\def\sI{{\mathbb{I}}}
```
```{=latex}
\def\sJ{{\mathbb{J}}}
```
```{=latex}
\def\sK{{\mathbb{K}}}
```
```{=latex}
\def\sL{{\mathbb{L}}}
```
```{=latex}
\def\sM{{\mathbb{M}}}
```
```{=latex}
\def\sN{{\mathbb{N}}}
```
```{=latex}
\def\sO{{\mathbb{O}}}
```
```{=latex}
\def\sP{{\mathbb{P}}}
```
```{=latex}
\def\sQ{{\mathbb{Q}}}
```
```{=latex}
\def\sR{{\mathbb{R}}}
```
```{=latex}
\def\sS{{\mathbb{S}}}
```
```{=latex}
\def\sT{{\mathbb{T}}}
```
```{=latex}
\def\sU{{\mathbb{U}}}
```
```{=latex}
\def\sV{{\mathbb{V}}}
```
```{=latex}
\def\sW{{\mathbb{W}}}
```
```{=latex}
\def\sX{{\mathbb{X}}}
```
```{=latex}
\def\sY{{\mathbb{Y}}}
```
```{=latex}
\def\sZ{{\mathbb{Z}}}
```
```{=latex}
\def\emLambda{{\Lambda}}
```
```{=latex}
\def\emA{{A}}
```
```{=latex}
\def\emB{{B}}
```
```{=latex}
\def\emC{{C}}
```
```{=latex}
\def\emD{{D}}
```
```{=latex}
\def\emE{{E}}
```
```{=latex}
\def\emF{{F}}
```
```{=latex}
\def\emG{{G}}
```
```{=latex}
\def\emH{{H}}
```
```{=latex}
\def\emI{{I}}
```
```{=latex}
\def\emJ{{J}}
```
```{=latex}
\def\emK{{K}}
```
```{=latex}
\def\emL{{L}}
```
```{=latex}
\def\emM{{M}}
```
```{=latex}
\def\emN{{N}}
```
```{=latex}
\def\emO{{O}}
```
```{=latex}
\def\emP{{P}}
```
```{=latex}
\def\emQ{{Q}}
```
```{=latex}
\def\emR{{R}}
```
```{=latex}
\def\emS{{S}}
```
```{=latex}
\def\emT{{T}}
```
```{=latex}
\def\emU{{U}}
```
```{=latex}
\def\emV{{V}}
```
```{=latex}
\def\emW{{W}}
```
```{=latex}
\def\emX{{X}}
```
```{=latex}
\def\emY{{Y}}
```
```{=latex}
\def\emZ{{Z}}
```
```{=latex}
\def\emSigma{{\Sigma}}
```
```{=latex}
\newcommand{\etens}[1]{\mathsfit{#1}}
```
```{=latex}
\def\etLambda{{\etens{\Lambda}}}
```
```{=latex}
\def\etA{{\etens{A}}}
```
```{=latex}
\def\etB{{\etens{B}}}
```
```{=latex}
\def\etC{{\etens{C}}}
```
```{=latex}
\def\etD{{\etens{D}}}
```
```{=latex}
\def\etE{{\etens{E}}}
```
```{=latex}
\def\etF{{\etens{F}}}
```
```{=latex}
\def\etG{{\etens{G}}}
```
```{=latex}
\def\etH{{\etens{H}}}
```
```{=latex}
\def\etI{{\etens{I}}}
```
```{=latex}
\def\etJ{{\etens{J}}}
```
```{=latex}
\def\etK{{\etens{K}}}
```
```{=latex}
\def\etL{{\etens{L}}}
```
```{=latex}
\def\etM{{\etens{M}}}
```
```{=latex}
\def\etN{{\etens{N}}}
```
```{=latex}
\def\etO{{\etens{O}}}
```
```{=latex}
\def\etP{{\etens{P}}}
```
```{=latex}
\def\etQ{{\etens{Q}}}
```
```{=latex}
\def\etR{{\etens{R}}}
```
```{=latex}
\def\etS{{\etens{S}}}
```
```{=latex}
\def\etT{{\etens{T}}}
```
```{=latex}
\def\etU{{\etens{U}}}
```
```{=latex}
\def\etV{{\etens{V}}}
```
```{=latex}
\def\etW{{\etens{W}}}
```
```{=latex}
\def\etX{{\etens{X}}}
```
```{=latex}
\def\etY{{\etens{Y}}}
```
```{=latex}
\def\etZ{{\etens{Z}}}
```
```{=latex}
\newcommand{\pdata}{p_{\rm{data}}}
```
```{=latex}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
```
```{=latex}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
```
```{=latex}
\newcommand{\pmodel}{p_{\rm{model}}}
```
```{=latex}
\newcommand{\Pmodel}{P_{\rm{model}}}
```
```{=latex}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
```
```{=latex}
\newcommand{\pencode}{p_{\rm{encoder}}}
```
```{=latex}
\newcommand{\pdecode}{p_{\rm{decoder}}}
```
```{=latex}
\newcommand{\precons}{p_{\rm{reconstruct}}}
```
```{=latex}
\newcommand{\laplace}{\mathrm{Laplace}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\newcommand{\Ls}{\mathcal{L}}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\emp}{\tilde{p}}
```
```{=latex}
\newcommand{\lr}{\alpha}
```
```{=latex}
\newcommand{\reg}{\lambda}
```
```{=latex}
\newcommand{\rect}{\mathrm{rectifier}}
```
```{=latex}
\newcommand{\softmax}{\mathrm{softmax}}
```
```{=latex}
\newcommand{\sigmoid}{\sigma}
```
```{=latex}
\newcommand{\softplus}{\zeta}
```
```{=latex}
\newcommand{\KL}{D_{\mathrm{KL}}}
```
```{=latex}
\newcommand{\Var}{\mathrm{Var}}
```
```{=latex}
\newcommand{\standarderror}{\mathrm{SE}}
```
```{=latex}
\newcommand{\Cov}{\mathrm{Cov}}
```
```{=latex}
\newcommand{\normlzero}{L^0}
```
```{=latex}
\newcommand{\normlone}{L^1}
```
```{=latex}
\newcommand{\normltwo}{L^2}
```
```{=latex}
\newcommand{\normlp}{L^p}
```
```{=latex}
\newcommand{\normmax}{L^\infty}
```
```{=latex}
\newcommand{\parents}{Pa}
```
```{=latex}
\DeclareMathOperator*{\argmax}{arg\,max}
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\DeclareMathOperator{\sign}{sign}
```
```{=latex}
\DeclareMathOperator{\Tr}{Tr}
```
```{=latex}
\let\ab\allowbreak
```
```{=latex}
\newcommand{\cmark}{\ding{51}}
```
```{=latex}
\newcommand{\huggingface}{\scalerel*{\includegraphics[page=2380]{hwemoji-assets.pdf}}{X}}
```
```{=latex}
\ifcolmsubmission
```
```{=latex}
\linenumbers
```
```{=latex}
\fi
```
```{=latex}
\maketitle
```
Introduction {#sec:introduction}
============

Language models trained on natural language develop periodic representations for number tokens. For many Transformer-based language models, @zhou2024pre show that the embeddings of integer tokens have consistent spikes in the Fourier domain at periods $T = 2, 5$, and $10$. Such periodic features have also been well-documented in models' intermediate representations [@levy2025language] and in the model mechanisms that implement addition [@zhou2024pre; @kantamneni2025language]. @engels2025not even find analogous periodic structures for other cyclical concepts such as days of the week and months of the year. These findings have been broadly interpreted as evidence that language models learn structured numerical representations through next-token prediction.

In this paper, we first demonstrate that this phenomenon is far more general than previously recognized. `\Cref{fig:fourier_universality}`{=latex} shows that the same $T = 2, 5, 10$ spikes appear not only in Transformers [@transformer] of varying scale (GPT-2 [@gpt2], GPT-OSS [@gptoss], Llama-3 [@llama3], Llama-4 [@llama4], and DeepSeek-V3 [@deepseekv3]), but also in non-Transformer LLMs (Mamba [@mamba], Falcon-Mamba [@falcon-mamba], xLSTM [@xlstm], Kimi-Linear [@kimi-linear]) and classical word embeddings (GloVe [@pennington-etal-2014-glove] and FastText [@bojanowski-etal-2017-fasttext]). Even the raw token frequency distribution of numbers in the training corpus, with no model at all, exhibits the same periodic spectrum (see `\Cref{fig:fourier_no_probe}`{=latex}). We view this universality as a case of *convergent evolution*: different systems independently develop the same representation because they share the same constraints from training data and tokenization. In biology, convergent evolution refers to the independent emergence of similar traits in unrelated organisms facing shared environmental pressures, such as the independent evolution of eyes in vertebrates and cephalopods [@McGhee2011convergent]. Fourier features in number embeddings are analogous: a shared trait that arises from shared constraints on the learning process.

But do these Fourier spikes indicate that models have learned functional numerical structure? We identify a two-tiered hierarchy of periodic features: only some systems with Fourier spikes cleanly encode modular arithmetic properties in their embeddings. By this we mean that the residue class $n \bmod T$ is linearly decodable from the embedding $\ve(n)$. Period-$T$ features naturally group numbers by their value mod $T$, and the linear representation hypothesis [@park2023lrh] conjectures that such structure should be accessible via linear probes. We call the emergence of Fourier spikes *spectral convergence* and the emergence of linearly separable mod-$T$ classes *geometric convergence*. Spectral convergence appears in almost every system we examine, but geometric convergence does not: Transformers and linear RNNs trained on 10 billion tokens develop embeddings where mod-$T$ classes are linearly separable, while LSTMs trained on identical data develop more prominent Fourier spikes but achieve chance-level probing. Understanding what separates these two levels of convergence is the central question of this paper. We summarize our contributions below.

![**Universality of Fourier Features and Convergent Evolution**. (*Left*) Fourier spectrum of number embeddings across three architecture families: Transformer LLMs, non-Transformer LLMs, and classical word embeddings. Each row shows the median-normalized magnitude at each Fourier frequency. All models exhibit consistent spikes at frequencies of period $T = 2, 5, 10$, etc. (*Right*) Convergent Evolution of various models studied in this paper. It shows two types of convergence: *spectral convergence* where models learn Fourier spikes, and *geometric convergence* where models learn modular probes.](figures/fourier_universality.png "fig:"){#fig:fourier_universality width="0.49\\linewidth"} ![**Universality of Fourier Features and Convergent Evolution**. (*Left*) Fourier spectrum of number embeddings across three architecture families: Transformer LLMs, non-Transformer LLMs, and classical word embeddings. Each row shows the median-normalized magnitude at each Fourier frequency. All models exhibit consistent spikes at frequencies of period $T = 2, 5, 10$, etc. (*Right*) Convergent Evolution of various models studied in this paper. It shows two types of convergence: *spectral convergence* where models learn Fourier spikes, and *geometric convergence* where models learn modular probes.](figures/cladogram.png "fig:"){#fig:fourier_universality width="0.49\\linewidth"}

#### Fourier spikes are universal but probing is not.

We show that Fourier spikes at $T = 2, 5, 10$ appear in every system we examine, spanning Transformer and non-Transformer LLMs, classical word embeddings, and even the raw token frequency distribution (Figure `\ref{fig:fourier_universality}`{=latex}). We demonstrate both theoretically (`\Cref{thm:fourier}`{=latex}) and empirically (`\Cref{fig:fourier_no_probe}`{=latex}) that Fourier spikes are necessary but not sufficient for mod-$T$ probing, and explain how models with similar spectra can have vastly different probing accuracy (§`\ref{sec:problem_setup}`{=latex}).

#### Geometric convergence requires data, architecture, and optimizer to align.

Through controlled experiments on 300M-parameter models trained on identical data, we isolate three factors that jointly determine whether mod-$T$ classes become linearly separable in the embeddings. Our experimental methodology can be viewed as a form of *structure attribution*. Analogous to how influence functions [@koh2017influence] or Shapley values [@ghorbani2019data] attribute model predictions to individual training examples, our controlled perturbations attribute the emergence of learned representations to specific structural properties of the data distribution. We find that geometric convergence depends on several complementary data signals: perturbations that progressively remove text-number co-occurrence, cross-number interaction, or context length each degrade probing, while Fourier spikes persist across all conditions (§`\ref{ssec:data}`{=latex}). The architecture plays a critical role: Transformers and linear RNNs achieve strong probing while LSTMs trained on the same data remain at chance (§`\ref{ssec:arch_opt}`{=latex}). Regardless of the optimizer, both Transformers and linear RNNs learn the same Fourier spectrum but different probing performance.

#### Convergent evolution takes a different form under arithmetic task pressure.

In models trained on addition from scratch, the tokenizer determines whether Fourier structure emerges (§`\ref{sec:addition}`{=latex}). Multi-token addition requires computing each output digit as a sum modulo 1000, forcing the model to solve modular subproblems that produce circular representations. Single-token addition admits multiple strategies, and the resulting representations randomly depend on the optimizer.

Related Work {#sec:related_work}
============

#### Fourier Features.

Fourier features have been used for many years in computer vision as edge and orientation detectors [@olshausen1997sparse; @olah2020overview; @fiquet2023polar]. The original Transformer applies sinusoidal position encodings [@transformer], and several works have found that explicitly injecting high-frequency components into inputs helps with spatial and numerical tasks [@tancik2020fourier; @he2023frequency; @hua2024fourier]. Recently, this structure is also found to emerge without being designed in. Transformers trained on modular addition embed numbers on a circle and rotating to compute the answer [@nanda2023progress; @zhong2023clock; @gromov2023grokking]. The same holds in pretrained LLMs that number token embeddings break into Fourier components, and there are recognizable addition circuits in the attention and MLP layers [@zhou2024pre; @kantamneni2025language; @levy2025language]. @zhou2025fone show that hard-coding these Fourier features improves arithmetic learning. These studies document spectral structure but do not test whether it implies geometric separability --- a distinction our work shows is critical. In this paper, we further study the question these papers leave open, where the structure comes from in the first place.

#### Mechanistic Interpretability.

Mechanistic interpretability aims to reverse-engineer the representations and algorithms learned by language models. The linear representation hypothesis [@park2023lrh] conjectures that high-level concepts, if learned, should be linearly decodable from model representations. Probing [@orgad2025llms; @kossen2024semantic] is the standard tool for this. However, number representations are not linearly encoded [@nanda2023progress; @zhong2023clock; @gromov2023grokking]. @karkada2026symmetry also find circular representations for days of the week, which suggests this is a fairly general solution to any problem with rotational symmetry. @allen2025physics use controlled synthetic pretraining to isolate which capabilities emerge from which architectural and data choices. Separately, @huh2024platonic argue that representations across models and modalities are converging toward a shared statistical model of reality, measured via global kernel alignment. We ask a different question: whether models that converge on the similar representation of a specific concept have learned the same functional structure, and show that they can diverge fundamentally. In this paper, we vary tokenization, architecture, optimizer, and task to understand the convergent evolution of number representations.

Problem Setup and Preliminary Analysis {#sec:problem_setup}
======================================

![**A \`\`Spiky" Fourier Spectrum Does Not Imply Good Feature Learning.** (*Left*) Token embeddings of `\color{Maroon}`{=latex}Transformer, `\color{Purple}`{=latex}Gated DeltaNet and `\color{RoyalBlue}`{=latex}LSTM, and even simply the `\color{Green}`{=latex}Number Token Distribution Frequency exhibit distinct Fourier spikes at $T = 2, 5$, and $10$. (Middle) Linear probes reveal only the Transformer and Gated DeltaNet learned functional modular arithmetic with high Cohen's $\kappa$, while others remain at chance. (*Right*) `\Cref{thm:fourier}`{=latex} explains this discrepancy through the internal noise structure of the $T=10$ embeddings.](figures/fourier_probe_fisher.png){#fig:fourier_no_probe width="\\linewidth"}

We study the token embeddings of numbers $0$ through $N-1$ in language models, where $N = 1000$ corresponds to the set of numbers that receive single-token representations in the Llama-3 tokenizer [@llama3]. Let $\ve(n) \in \sR^d$ denote the token embedding of number $n$. To detect periodic structure in these embeddings, following @zhou2024pre, we compute the discrete Fourier transform along the token index: $$\vF_\nu = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} \ve(n) \, e^{-2\pi i \nu n} \in \sC^d, \qquad \nu = \frac{k}{N},\; k = 0, \ldots, N-1.$$ The power at frequency $\nu$ is $\|\vF_\nu\|^2 = \sum_{j=1}^d |F_\nu^{(j)}|^2$. A *Fourier spike at period $T$* refers to a visible peak in $\|\vF_{1/T}\|^2$ relative to neighboring frequencies. For a period $T$ dividing $N$, define the mod-$T$ residue classes $C_r = \{n : n \equiv r \pmod{T}\}$ for $r = 0, \ldots, T-1$, each of size $|C_r| = N/T$. To evaluate whether the embeddings encode modular arithmetic at period $T$, we train a *linear probe* ($T$-class logistic regression) to predict $n \bmod T$ from $\ve(n)$.

A natural question is whether the presence of a Fourier spike at period $T$ guarantees good $\bmod$-$T$ probing accuracy. As we will show in `\Cref{fig:fourier_no_probe}`{=latex}, the answer is strikingly no: an LSTM trained on the same data as a Transformer develops larger Fourier power at $T = 10$ yet achieves chance-level probing. The following result shows how this is possible and demonstrates that the presence of a Fourier spike is a *necessary* but not a *sufficient* condition for learning modular probes.

```{=latex}
\begin{restatable}{theorem}{maintheorem}
\label{thm:fourier}
Given embeddings $\{\ve(n)\}_{n=0}^{N-1}$ and residue classes $\{C_r\}_{r=0}^{T-1}$ defined above, let the class means $\vmu_r$, grand mean $\vmu$, between-class scatter matrix $\mS_B$, and within-class scatter matrix $\mS_W$ be
\[
\vmu_r = \frac{1}{|C_r|}\sum_{n \in C_r} \ve(n), \qquad \vmu = \frac{1}{N}\sum_{n=0}^{N-1} \ve(n),
\]
\[
\mS_B = \frac{1}{T}\sum_{r=0}^{T-1}(\vmu_r - \vmu)(\vmu_r - \vmu)^\top, \qquad \mS_W = \frac{1}{N}\sum_{r=0}^{T-1}\sum_{n \in C_r}(\ve(n) - \vmu_r)(\ve(n) - \vmu_r)^\top.
\]
Let $\Phi_T = \sum_{\ell=1}^{T-1} \|\vF_{\ell/T}\|^2$ be the total power at harmonics of period $T$, and $H_T = \{0, 1/T, 2/T, \ldots, (T{-}1)/T\}$ be the set of harmonic frequencies.
\begin{enumerate}
\item[\emph{(i)}]
If $\Phi_T = 0$, then $\mS_B = \bm{0}$ and no linear probe can classify $n \bmod T$ above chance.

\item[\emph{(ii)}]
% For any $C > 0$, there exist embeddings $\ve(n)$ and period $T$ with $\Phi_T > C$ but no linear probe can classify $n \bmod T$ above chance.
%\vscomment{can we say that the fourier weight is sufficiently large, instead of just being non-zero? are the token embeddings normalized to be unit norm (that would give a natural scale)?}
For any $T \ge 2$, $C > 0$, and $\varepsilon > 0$, there exist $N$ divisible by $T$ and embeddings $\ve(n)$ satisfying $\Phi_T > C$ yet no $T$-class linear classifier achieves accuracy above $1/T + \varepsilon$, i.e., no more than $\varepsilon$ above random guessing.

\end{enumerate}
\end{restatable}
```
`\Cref{thm:fourier}`{=latex} establishes that $\Phi_T > 0$ is necessary but not sufficient. A natural follow-up is: what *quantitatively* determines whether a spiky $\Phi_T$ translates into geometric separability?

```{=latex}
\begin{remark}
The gap between two parts of \Cref{thm:fourier} can be made more precise through the lens of Fisher's Linear Discriminant Analysis \citep{fisher1936lda}. Assume $d \geq T - 1$ and $S_W$ is invertible.
Assume $d>T-1$ and $S_W$ is invertible. While a linear probe for $T$ classes separates data across a $(T-1)$-dimensional subspace, the absolute ceiling on its performance is dictated by its principal axis of separation: the single direction $v$ that maximizes the ratio of between-class to within-class variance, given by the Rayleigh quotient $\frac{v^\top \mS_B v}{v^\top \mS_W v}$.
The maximum achievable separability along this optimal axis is exactly the largest generalized eigenvalue of the scatter matrices, $\lambda_{\max}(\mS_W^{-1} \mS_B)$. We show (see \S~\ref{app:bounds}) that this maximum discriminant satisfies:
\[
\frac{1}{(T-1) \cdot \mathrm{cond}(\mS_W) } \cdot \frac{\Phi_T}{N \cdot \lambda_{\min}(\mS_W)} \;\leq\; \lambda_{\max}(\mS_W^{-1} \mS_B) \;\leq\; \frac{\Phi_T}{N \cdot \lambda_{\min}(\mS_W)}.
\]
The Fourier power $\Phi_T$ only guarantees a large total variance among the class means, which drives $\Phi_T = \Tr(\mS_B) / N$ in the numerator. It ensures the class centers are dispersed in space. However, actual linear separability depends on how this signal aligns with the within-class scatter, $\mS_W$. A large condition number, $\text{cond}(\mS_W) = \lambda_{\max}(\mS_W)/\lambda_{\min}(\mS_W)$, drastically lowers the minimum bound. If the periodic signal aligns with the dimensions of maximum within-class noise, the massive $\Phi_T$ is entirely swallowed by within-class variance. This drives $\lambda_{\max}$ toward zero and forces the entire $(T-1)$-dimensional subspace to collapse into overlapping classes.
\end{remark}
```
![**Examples for the Proof of `\Cref{thm:fourier}`{=latex} Part (ii) in § `\ref{app:insufficient}`{=latex}**. In the proof, every number $n \in \{0,\dots,N-1\}$ has a unique decomposition $n = r + mT$ with residue $r = n \bmod T \in \{0,\dots,T{-}1\}$ and block index $m = \lfloor n/T \rfloor \in \{0,\dots,K{-}1\}$, where $K = N/T$. We set the embedding `\color{Maroon}`{=latex}$e(n) = Ar + Bm$, so that $A$ controls the between-class scatter $S_B$ (and hence $\Phi_T = N \cdot S_B$) while $B$ controls only the within-class scatter $S_W$, with neither parameter affecting the other. As illustrated in the figure, fixing $A$ produces a persistent Fourier spike at period $T$ regardless of $B$: when $B$ is small the residue classes cluster into linearly separable groups, but as $B$ grows the embeddings interleave across classes so that the best linear classification accuracy goes closer to random guessing $1/T$.](figures/part_ii_figure_T5.png){#fig:thm_part_ii_examples width="\\linewidth"}

The proof of `\Cref{thm:fourier}`{=latex} is given in Appendix `\ref{app:proof-fourier}`{=latex}, where part (i) follows naturally from Fourier identities in `\Cref{lem:fourier-variance-identity}`{=latex} which shows that $\Tr(\mS_B) = \Phi_T / N$, and part (ii) constructs examples (see `\Cref{fig:thm_part_ii_examples}`{=latex}) so that $\Phi_T$ can be arbitrarily large but residue classes are not linearly separable. We now ask two empirical questions: how common are Fourier spikes in practice, and do they actually predict mod-$T$ probe accuracy?

Throughout this paper, unless stated otherwise, we train models with around 300 million parameters on 10B tokens from FineWeb-Edu [@lozhkov2024fineweb-edu] using the Llama-3 tokenizer, which assigns single tokens to each integer from 0 to 999 such that $N = 1000$. We include architecture details in the Appendix `\ref{app:model_training}`{=latex}. We evaluate number representations using two complementary measurements: (1) the *Fourier magnitude spectrum*, computed as the power $\|\vF_\nu\|^2$ normalized by the median across frequencies $\nu$, and (2) the *mod-$T$ probe accuracy*, measured by Cohen's $\kappa$ of a classifier trained to predict $n \bmod T$ from the token embedding $\ve(n)$. Cohen's $\kappa$ adjusts for the baseline accuracy of random guessing ($1/T$ for balanced classes), so that $\kappa = 0$ corresponds to chance and $\kappa = 100\%$ to perfect classification regardless of $T$. We report accuracy-based results in the appendix. We use three probe types: linear (logistic regression), MLP, and RFM kernel [@radhakrishnan2024rfm]; unless noted, we report linear probe results, with the others in § `\ref{app:mlp_rfm}`{=latex}. We report probe performance averaged over 30 runs: 3 random seeds, each with 10-fold cross-validation. The first measurement detects periodic structure while the second tests whether that structure supports mod-$T$ classification.

#### Fourier spikes are universal.

`\Cref{fig:fourier_universality}`{=latex} shows that every pretrained model we examine, spanning Transformer LLMs, non-Transformer LLMs, and classical word embeddings, exhibits peaks at $\nu = 1/10, 1/5, 1/2$. *Spectral convergence* is universal as long as natural languages are used for training. This holds across architectures across fundamentally different learning algorithms and across models never explicitly trained on numerical tasks.

#### Fourier spikes do not imply modular arithmetic.

Figure `\ref{fig:fourier_no_probe}`{=latex} presents a controlled comparison under our training setup: a Transformer, a Gated DeltaNet [@gdn], an LSTM [@lstm] (all around 300M parameters), and the raw number token frequency distribution from the training corpus, where each number $n$ is represented by its scalar corpus frequency $p_n$ via counting rather than a learned embedding. The left column shows that all four produce qualitatively similar Fourier spikes at periods $T = 2, 5, 10$. The middle column shows mod-$T$ probe accuracy: the Transformer and Gated DeltaNet achieve $\kappa = 96$ and $95$ at $T = 2$, and $\kappa = 85$ and $78$ at $T = 10$, while the LSTM and the token distribution remain at chance across all moduli. The right column explains this gap through Theorem `\ref{thm:fourier}`{=latex}. The LSTM has *larger* Fourier power $\Phi_{10}$ than the Transformer, yet its Fisher discriminant $\lambda_{\max}(\mS_W^{-1}\mS_B)$ is two orders of magnitude smaller. The difference lies in the condition number $\mathrm{cond}(\mS_W)$: the LSTM's within-class scatter is highly anisotropic, so the periodic signal is buried under within-class variance and the classes overlap despite visible Fourier spikes. Several recent studies have inspected Fourier spectra and concluded that models have learned modular structure [@nanda2023progress; @zhou2024pre]. Our analysis shows why this inference is unreliable: a visible spike at period $T$ guarantees $\Tr(\mS_B) > 0$, but probe accuracy depends on the eigenspectrum of $\mS_W^{-1}\mS_B$, which the Fourier power spectrum alone does not determine.

We call this arrangement of embeddings into linearly separable mod-$T$ classes *Geometric Convergence*. Unlike spectral convergence, geometric convergence is selective: only certain combinations of data, architecture, and optimizer produce it. Both are instances of **convergent evolution**: different systems arriving at similar representations because of shared constraints. The next sections identify what drives each.

Convergent Evolution in Language Model Pretraining {#sec:pretraining}
==================================================

Spectral convergence requires only the periodic frequency distribution of number tokens, but geometric convergence is selective. We now ask which constraints must be present for geometric convergence to emerge. Through controlled experiments that vary one factor at a time, we identify three: the data signal the model receives (§`\ref{ssec:data}`{=latex}), the architecture, and the optimizer (§`\ref{ssec:arch_opt}`{=latex}). All experiments use 300M-parameter models trained on 10B tokens from FineWeb-Edu [@lozhkov2024fineweb-edu] with the Llama-3 tokenizer, unless stated otherwise.

Structural Attribution to Data {#ssec:data}
------------------------------

To isolate the environmental pressures driving convergence, we fix the architecture (300M Transformer) and optimizer (Muon, [@jordan2024muon]) and vary only the training data. We apply controlled perturbations (`\Cref{tab:data_configs}`{=latex}) that each remove a specific type of co-occurrence while leaving others intact, attributing the emergence of spectral and geometric convergence to specific structural properties of the data.

#### Spectral convergence requires only token frequencies.

All perturbations produce nearly identical Fourier spectra (`\Cref{fig:data}`{=latex}, left), including `Unigram Replace`, which destroys all co-occurrence structure by independently resampling every number token from its marginal distribution. As predicted by the universality observed in §`\ref{sec:problem_setup}`{=latex}, the periodic frequency distribution of number tokens is sufficient to produce Fourier spikes, and no co-occurrence information is needed.

```{=latex}
\small
```
```{=latex}
\rowcolors{1}{white}{blue!4}
```
::: {#tab:data_configs}
  **Configuration**        `\hsize`{=latex}`\hsize`{=latex} **Perturbation**                                                                            `\hsize`{=latex}`\hsize`{=latex} **Structure of Data Removed**
  ------------------------ ---------------------------------------------------------------------------------------------------------------------------- ------------------------------------------------------------------------------------------------
  `Original`               `\hsize`{=latex}`\hsize`{=latex} $-$                                                                                         `\hsize`{=latex}`\hsize`{=latex} $-$
  `Isolate-`$k$            `\hsize`{=latex}`\hsize`{=latex} Each sequence contains at most $k$ numbers token via packing. We use $k = 1, 2,$ and $8$.   `\hsize`{=latex}`\hsize`{=latex} Interaction across numbers. $k = 1$ indicates no interaction.
  `ContextLength-`$\ell$   `\hsize`{=latex}`\hsize`{=latex} Sequences split into windows of $\ell$. We use $\ell = 2, 4, 8,$ and $64$                   `\hsize`{=latex}`\hsize`{=latex} Role of broad context
  `Swap Numbers`           `\hsize`{=latex}`\hsize`{=latex} Number token sequence replaced by that of another sequence, keeping number $n$-gram         `\hsize`{=latex}`\hsize`{=latex} Number $\leftrightarrow$ text association
  `Unigram Replace`        `\hsize`{=latex}`\hsize`{=latex} Every number token resampled i.i.d. from marginal distribution to replace the original      `\hsize`{=latex}`\hsize`{=latex} Co-occurrence v.s. frequency

  : Data perturbation used in §`\ref{ssec:data}`{=latex}. All models on the 10B tokens from FineWeb-Edu.
:::

![**Spectral convergence is universal but geometric convergence depends on the data signal.** (*Left*) Fourier spectra of Transformer embeddings trained under data perturbations in `\Cref{tab:data_configs}`{=latex}. All perturbations produce similar spikes to the original at periods $T = 2, 5, 10$, including `Unigram Replace`, which destroys all co-occurrence structure among number tokens. (*Right*) Cohen's $\kappa$ of linear probes for mod-$T$ classification tells a different story: `Original`, `Isolate-8`, and `Context Length 64` achieve strong probing, but with shorter context length or fewer numbers within each sequence (e.g. `Isolate-2`) is much weaker at $T = 5$ and $10$. `Swap Numbers` drops substantially, and `Unigram Replace` falls to chance.](figures/data_linear_probe.png){#fig:data width="\\linewidth"}

#### Geometric convergence draws on several complementary data signals.

The right panel of `\Cref{fig:data}`{=latex} reveals that geometric convergence degrades gradually as different types of co-occurrence information are removed. The probing power is measured with Cohen's $\kappa$ for balanced classes, the metric that removes random guessing from accuracy for $T$-way classifications defined as $\kappa = (\mathrm{Accuracy} - \frac{1}{T}) / (1 - \frac{1}{T})$. When $\kappa = 0$, the probe is at chance and $\kappa = 100\%$ is perfect classification regardless of $T$. `Swap Numbers`, which preserves number $n$-gram statistics but destroys the association between specific numbers and their text contexts, drops probing from $\kappa = 85.4$ to $28.8$ at $T = 10$, making text-number co-occurrence an important signal. But it is not the only one.

Longer context provides a second signal. At `Context Length 2`, where each token sees only one neighbor, mod-10 probing already reaches $\kappa = 40.3$, well above `Swap Numbers` ($28.8$). Increasing the window to $\ell = 4, 8,$ and $64$ steadily improves $\bmod 10$ probes ($\kappa = 47.3, 51.7,$ and $72.0$ respectively), showing that the model accumulates richer co-occurrence statistics from broader context.

Cross-number interaction provides an additional signal as well. `Isolate`-$k$ directly controls cross-number interaction by restricting each packed sequence to contain at most $k$ number tokens. The extreme limit of $k=1$ isolates text-number co-occurrence completely, ensuring no two number tokens can interact within the same attention window. Under this setting, it achieves $\kappa = 45.0$ at $T = 10$ and $85.9$ at $T = 2$. Notably, even $k=1$ with a Transformer surpasses PPMI ($\kappa=27.1$) and word2vec ($\kappa=29.3$), suggesting that autoregressive language modeling with text-number co-occurrence alone extracts richer modular structure than classical embedding methods. Allowing more numbers to co-occur within each attention block improves probing: $\kappa = 53.0$ at $k = 2$ and $77.2$ at $k = 8$ for $T = 10$. The fact that `Isolate` ($k = 1$) still outperforms `Swap Numbers` confirms that text-number co-occurrence alone provides strong signal, but the gap to `Original` almost closed by `Isolate` ($k = 8$) shows that cross-number interaction contributes on top of it. In all cases, Fourier spikes are fully preserved while probing degrades, reinforcing the two-tiered hierarchy between spectral and geometric convergence and they are driven by different mechanisms.

#### Probing accuracy varies sharply across moduli.

Across all conditions that achieve geometric convergence, the probing accuracy depends strongly on the modulus. Mod 2, 5, and 10 are consistently the easiest ($\kappa = 96.1, 63.5, 85.4$ for `Original`), mod 4 achieves nontrivial probing ($\kappa = 34.9$), while moduli that share no common factor with 10 (e.g., 3, 7, 9) remain near chance. This pattern is stable across all perturbations that preserve geometric convergence. In `\Cref{sec:addition}`{=latex}, we show that this structure can be traced to the tokenizer.

Structural Attribution to Architecture and Optimizer {#ssec:arch_opt}
----------------------------------------------------

We now fix the pretraining data, and vary the architecture and optimizer. We train a 300M parameter Transformer model and two linear RNNs (Gated DeltaNet and Mamba-2 [@mamba2]) whose architectural and training details are shown in § `\ref{app:model_training}`{=latex}. We vary two different optimizers for training LLMs: AdamW and Muon. Figure `\ref{fig:arch_opt}`{=latex} shows the results.

![**Geometric convergence depends on architecture and optimizer.** (*Left*) Fourier magnitude spectra of number embeddings across architectures and optimizers; all exhibit spikes at periods $T = 2, 5, 10$, including the `LSTM` and classical word embeddings. All three architectures produce similar Fourier spectra under both optimizers. (*Right*) Mod-$T$ probe accuracy separates the models into tiers. With both Muon and AdamW, `Transformer`, `Gated DeltaNet`, and `Mamba-2` all achieve strong geometric convergence, while the `LSTM` remains near zero. Muon outperforms AdamW for `Transformer` and `Gated DeltaNet`, but `Mamba-2` with AdamW slightly outperforms its Muon counterpart. `PPMI` and `word2vec` fall in between.](figures/architecture_linear_probe.png){#fig:arch_opt width="\\linewidth"}

#### Transformers and linear RNNs achieve geometric convergence; LSTMs do not.

Under both Muon and AdamW, `Transformer`, `Gated DeltaNet`, and `Mamba-2` all achieve strong probing, with the Transformer performing best. All three produce nearly identical Fourier spectra regardless of the optimizer, confirming that spectral convergence is architecture-independent and optimizer-independent in language pretraining task. The 12-layer `LSTM`, trained with AdamW, develops even more prominent Fourier spikes but its probing accuracy remains near chance across all moduli. A 4-layer LSTM shows no improvement nor degradation, indicating that the failure is architectural rather than a matter of capacity (see `\Cref{fig:lstm}`{=latex}). We note that as shown in `\Cref{fig:fourier_no_probe}`{=latex}, the Fourier spectrum of the LSTM embeddings closely resembles that of the number token marginal distribution, suggesting that the LSTM embeddings capture little beyond unigram frequency statistics for numbers. Classical word embeddings fall in between: `PPMI` and `word2vec`, trained on the same 10B tokens, achieve moderate probing ($\kappa = 27.1$ and $29.3$ at $T = 10$) with clear Fourier spikes yet weaker probing, illustrating the dissociation between spectral and geometric convergence.

#### The effect of optimizer is architecture-dependent.

Comparing the top and middle blocks of Figure `\ref{fig:arch_opt}`{=latex} isolates the effect of the optimizer. Muon produces stronger probing for the Transformer ($\kappa = 85.4$ vs $72.1$ at $T = 10$) and for Gated DeltaNet ($77.8$ vs $69.7$), but Mamba-2 trained with AdamW actually outperforms Mamba-2 trained with Muon ($80.1$ vs $76.7$). We find that the Transformer trained with Muon has the best probing performance. The optimizer's effect on geometric convergence thus depends on the architecture, and we do not observe a universal advantage for either optimizer in language pretraining task.

#### Spectral and geometric convergence co-emerge gradually.

`\Cref{fig:dynamics_transformer_original}`{=latex} tracks $\Phi_T$ and probe accuracy throughout Transformer pretraining for $T = 2, 5, 10$. Both increase smoothly with no phase transition, unlike the grokking observed in modular arithmetic [@nanda2023progress]. We will discuss more on model behavior when trained directly on arithmetic in `\Cref{sec:addition}`{=latex}.

Convergent Evolution in Training on Arithmetic {#sec:addition}
==============================================

Sections `\ref{sec:problem_setup}`{=latex} and `\ref{sec:pretraining}`{=latex} studied models trained on general language, where Fourier features emerge from the statistics of number tokens in natural text. We now ask whether convergent evolution also occurs when models are trained directly on arithmetic, where the training signal is purely numerical, and the prior given by human language is absent.

#### Experimental setup.

We train 300M Transformers from random initialization on integer addition using the same architecture as in §`\ref{sec:problem_setup}`{=latex}. Each example has the form $a + b = c$, with loss masked on prompt tokens up to $=$ sign. We train for 3B tokens under both Muon and AdamW. In 9-digit addition, operands have 1-9 digits with stratified digit-count sampling. Each operand may span multiple number tokens. In 3-digit addition, we enumerate all pairs $(a, b)$ with $a, b \in [0, 999]$ and $a + b \leq 999$. Every operand and sum is a single token. Training for 3 billion tokens amounts to roughly 1,000 epochs; we run two seeds per optimizer. We additionally train circular probes that project embeddings onto the unit circle (see `\Cref{fig:circular_probe}`{=latex}).

![**Tokenization determines convergence in arithmetic.** (*Top*) In 9-digit addition, both Muon and AdamW converge to the same spectral structure with sharp Fourier peaks and near-perfect $\kappa$ for mod 2, 5, and 10, showing both spectral and geometric convergence. (*Bottom*) In 3-digit addition, where every operand and sum fits in a single token, the Fourier spectra vary across optimizers and random seeds, and $\kappa$ remains near chance for all moduli. Multi-token tokenization forces modular subproblems that produce convergent representations; single-token tokenization leaves the representation unconstrained.](figures/addition_linear_probe.png){#fig:addition width="\\linewidth"}

#### 9-digit addition: convergence across optimizers.

Both Muon and AdamW converge to the same spectral structure (`\Cref{fig:addition}`{=latex}, top), with sharp Fourier peaks at the expected harmonics and near-perfect $\kappa$ for mod 2, 5, and 10. The two optimizers produce nearly identical Fourier spectra and probing accuracy, suggesting that the multi-token setting imposes constraints strong enough for both spectral and geometric convergence. Models determine the representation regardless of optimizer.

#### 3-digit addition: no convergence without modular pressure.

The single-token setting produces a different outcome (`\Cref{fig:addition}`{=latex}, bottom). With 500K unique pairs repeated over 1,000 epochs, the learned representations vary across optimizer and random seed. Muon develops Fourier peaks at frequencies that do not align with modular periods, and $\kappa$ remains near chance. AdamW exhibits grokking under one seed: training accuracy reaches 100% early while test accuracy remains low until a phase transition after training achieves roughly 1.6B to 2B tokens (`\Cref{fig:addition_dynamics}`{=latex}); under another seed, generalization never occurs. Unlike @nanda2023progress, who train on mod-113 addition where modular structure is explicit, single-token addition imposes no modular constraint: the sequences $a + b = c$ are identical whether interpreted as mod-1000 or mod-1111. Without this constraint, convergent evolution does not occur and the learned representation is seed-dependent.

#### Why tokenization determines learned representation.

In 9-digit addition after tokenization $[a_2,a_1,a_0]+[b_2,b_1,b_0]=[c_2,c_1,c_0]$, each output token satisfies $c_i = (a_i + b_i + \gamma_{i}) \bmod 1000$ where $\gamma_{i} \in \{0,1\}$ is the carry. Each output position is therefore a mod-1000 classification problem, especially the least significant position with no carry, i.e. $\gamma_{0} = 0$. Since all our models use tied embeddings (`\Cref{tab:config}`{=latex}), the output logits depend directly on the input embedding matrix, creating pressure for the embeddings to develop high $\Phi_{1000}$ to distinguish residue classes at each output position. This in turn implies non-trivial $\Phi_T$ for all $T \mid 1000 = 2^3 \cdot 5^3$. In single-token addition, no modular constraint is imposed, so $\Phi_T$ is unconstrained and depends on the optimizer and random seed. This reveals a second route through the two-tiered hierarchy: multi-token tokenization creates modular subproblems that produce both spectral and geometric convergence through carry propagation, while single-token tokenization guarantees neither.

Conclusion and Discussion {#sec:conclusion}
=========================

We have shown that periodic number representations in language models exhibit a two-tiered convergence: Fourier spikes are universal, but linearly separable mod-$T$ classes emerge only when data, architecture, and optimizer align. The central lesson is that visible structure in representations does not guarantee functional organization: the LSTM and even the raw token distribution develop more prominent Fourier spikes than the Transformer yet achieve chance-level probing.

More generally, any representation-level diagnostic could mistake statistical artifacts of the training distribution for learned structure. Our controlled perturbation approach offers a complementary lens to instance-level attribution methods such as influence functions [@koh2017influence]: rather than attributing predictions to individual training examples, we attribute learned representations to structural properties of the data distribution.

Analogous periodic representations have been found for days of the week and months of the year [@engels2025not; @karkada2026symmetry]; whether the spectral-geometric dissociation extends to these and other cyclic concepts is a natural next step. More broadly, the spectral-geometric hierarchy introduced here provides a concrete framework for distinguishing superficial from functional feature learning. This distinction may prove important well beyond numerical representations as we increasingly rely on representation-level diagnostics to understand large language models. `\ifcolmpreprint`{=latex}

Acknowledgments {#acknowledgments .unnumbered}
===============

The authors acknowledge the Center for Advanced Research Computing (CARC) at the University of Southern California for providing computing resources that have contributed to the research results reported within this publication. We also acknowledge the use of the USC NLP cluster provided by the USC NLP Group. DF and RJ were also supported by a gift from the USC-Capital One Center for Responsible AI and Decision Making in Finance (CREDIF). RJ was supported in part by the National Science Foundation under Grant No. IIS-2403436. VS was supported by National Science Foundation award CCF-2239265, an Amazon Research Award, a Google Research Scholar Award and an Okawa Foundation Research Grant. The work was done in part while some of the authors were visiting the Simons Institute for the Theory of Computing. This work used the Delta system at the National Center for Supercomputing Applications through allocation CIS250737 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants \#2138259, \#2138286, \#2138307, \#2137603, and \#2138296. This work was supported in part by the NVIDIA Academic Grant Program. The GPU resources provided by NVIDIA were essential for training and analyzing the 300M-parameter models examined in this study. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the funding agencies. `\fi`{=latex}

```{=latex}
\bibliographystyle{colm2026_conference}
```
```{=latex}
\clearpage
```
```{=latex}
\appendix
```
Appendix {#appendix .unnumbered}
========

```{=latex}
\appendix
```
```{=latex}
\startcontents[appendix]
```
```{=latex}
\printcontents[appendix]{l}{1}{\setcounter{tocdepth}{2}}
```
```{=latex}
\clearpage
```
Proof of `\Cref{thm:fourier}`{=latex} {#app:proof-fourier}
=====================================

We first restate the theorem here. `\maintheorem*`{=latex}We first start with two lemmas and then prove them in parts.

Lemmas Fourier-variance identity
--------------------------------

We first establish a lemma connecting the class means $\vmu_r$ to the Fourier coefficients $\vF_\nu$.

```{=latex}
\begin{lemma}
\label{lem:class-mean-dft}
For each $\ell = 0, \ldots, T-1$, define
\[
\hat{\vmu}[\ell] \;=\; \frac{1}{\sqrt{T}} \sum_{r=0}^{T-1} \vmu_r \, e^{-2\pi i \ell r / T}.
\]
Then $\hat{\vmu}[\ell] = \sqrt{T/N} \cdot \vF_{\ell/T}$.
\end{lemma}
```
```{=latex}
\begin{proof}
Substituting $\vmu_r = \frac{1}{|C_r|}\sum_{n \in C_r} \ve(n) = \frac{T}{N}\sum_{n \in C_r} \ve(n)$
(since $|C_r| = N/T$ when $T | N$.) and pulling the constant into the outer sum:
\begin{align}
\hat{\vmu}[\ell]
&= \frac{\sqrt{T}}{N} \sum_{r=0}^{T-1} \sum_{n \in C_r} \ve(n) \, e^{-2\pi i \ell r / T}.
\end{align}
Every $n \in \{0,\ldots,N-1\}$ belongs to exactly one class $C_r$ with $r = n \bmod T$,
so we can write $n = mT + r$ for some non-negative integer $m$.
Then $e^{-2\pi i \ell r/T} = e^{-2\pi i \ell n/T}$, since the additional $m\ell$ full periods contribute
an integer multiple of $2\pi$ to the exponent.
Re-indexing the double sum as a single sum over $n$:
\begin{align}
\hat{\vmu}[\ell]
&= \frac{\sqrt{T}}{N} \sum_{n=0}^{N-1} \ve(n) \, e^{-2\pi i \ell n / T}.
\end{align}
Because $T \mid N$, the ratio $\ell/T$ is one of the $N$ Fourier frequencies,
so comparing with $\vF_{\ell/T} = \frac{1}{\sqrt{N}} \sum_{n} \ve(n) \, e^{-2\pi i (\ell/T) n}$ gives
\[
\hat{\vmu}[\ell]
= \frac{\sqrt{T}}{N} \cdot \sqrt{N} \cdot \vF_{\ell/T}
= \sqrt{\frac{T}{N}} \cdot \vF_{\ell/T}. \qedhere
\]
\end{proof}
```
Next, we state our second lemma on the identities for $\Tr(\mS_B)$ and $\Tr(\mS_W)$.

```{=latex}
\begin{lemma}\label{lem:fourier-variance-identity}
    Let the between-class and within-class variances $\mS_B$ and $\mS_W$ be defined as in \Cref{thm:fourier}, they satisfy
\[
\Tr(\mS_B) = \frac{\Phi_T}{N}, \qquad \Tr(\mS_W) = \frac{1}{N}\sum_{\nu \notin H_T}\|\vF_\nu\|^2.
\]
\end{lemma}
```
```{=latex}
\begin{proof}
Define the $T \times d$ matrix $\mM$ whose rows are the class means:
\[
\mM \;=\;
\begin{pmatrix}
\text{---}\; \vmu_0^\top \;\text{---} \\
\text{---}\; \vmu_1^\top \;\text{---} \\
\vdots \\
\text{---}\; \vmu_{T-1}^\top \;\text{---}
\end{pmatrix}
\in \mathbb{R}^{T \times d},
\]
and let $\mU$ be the $T \times T$ matrix with entries
$U_{\ell r} = \frac{1}{\sqrt{T}} e^{-2\pi i \ell r/T}$.
Define $\hat{\mM} = \mU \mM \in \mathbb{C}^{T \times d}$, whose rows are $\hat{\vmu}[\ell]^\top$:
\[
\hat{\mM} \;=\;
\begin{pmatrix}
\text{---}\; \hat{\vmu}[0]^\top \;\text{---} \\
\text{---}\; \hat{\vmu}[1]^\top \;\text{---} \\
\vdots \\
\text{---}\; \hat{\vmu}[T-1]^\top \;\text{---}
\end{pmatrix}
\in \mathbb{C}^{T \times d}.
\]

Since $\mU$ is the DFT matrix in $T$ dimensions, $\mU$ is unitary.
%Indeed, we have,
% \[
% (\mU^* \mU)_{r r'}
% = \sum_{\ell=0}^{T-1} (\mU^*)_{r \ell}\, U_{\ell r'}
% = \sum_{\ell=0}^{T-1} \overline{U_{\ell r}}\, U_{\ell r'}.
% \]
% Since $U_{\ell r} = \frac{1}{\sqrt{T}} e^{-2\pi i \ell r/T}$, then $\overline{U_{\ell r}} = \frac{1}{\sqrt{T}} e^{+2\pi i \ell r/T}$.
% Substituting $U_{\ell r'} = \frac{1}{\sqrt{T}} e^{-2\pi i \ell r'/T}$:
% \[
% (\mU^* \mU)_{r r'}
% = \sum_{\ell=0}^{T-1}
%   \frac{1}{\sqrt{T}} e^{+2\pi i \ell r/T}
%   \cdot
%   \frac{1}{\sqrt{T}} e^{-2\pi i \ell r'/T}
% = \frac{1}{T} \sum_{\ell=0}^{T-1} e^{2\pi i \ell (r - r')/T}.
% \]
% When $r = r'$ every term equals $1$ and the sum is $1$.
% When $r \neq r'$, set $\omega = e^{2\pi i(r-r')/T} \neq 1$;
% the geometric series gives
% $\frac{1}{T} \cdot \frac{\omega^T - 1}{\omega - 1} = 0$
% since $\omega^T = e^{2\pi i(r-r')} = 1$.
% Hence $\mU^* \mU = \mI$.
%\vscomment{$\mU$ is just the DFT matrix in $T$ dimensions right? it being unitary is standard, no need to prove. }

Therefore,
\[
\sum_{\ell=0}^{T-1} \|\hat{\vmu}[\ell]\|^2
\;=\; \|\hat{\mM}\|_F^2
\;=\; \|\mU \mM\|_F^2
\;=\; \|\mM\|_F^2
\;=\; \sum_{r=0}^{T-1} \|\vmu_r\|^2.
\]

This is simply the fact that a unitary change of basis preserves the sum of squared norms.

For $\ell = 0$: $\hat{\vmu}[0] = \sqrt{T/N} \cdot \vF_0$.
Since $\vF_0 = \frac{1}{\sqrt{N}}\sum_n \ve(n) = \sqrt{N}\,\vmu$, we have
$\hat{\vmu}[0] = \sqrt{T}\,\vmu$, and therefore
$\frac{1}{T}\|\hat{\vmu}[0]\|^2 = \|\vmu\|^2$.

The between-class variance is:
\begin{align}
\Tr(\mS_B)
&= \frac{1}{T}\sum_r \|\vmu_r - \vmu\|^2
= \frac{1}{T} \left(\sum_r \|\vmu_r\|^2
  - 2\vmu^\top\!\left(\frac{1}{T}\sum_r \vmu_r\right) + \|\vmu\|^2\right) \\
&= \frac{1}{T}\left(\sum_r \|\vmu_r\|^2 - \|\vmu\|^2\right)
= \frac{1}{T}\left(\sum_{\ell=1}^{T-1}\|\hat{\vmu}[\ell]\|^2\right) \\
&= \frac{1}{T}\left(\sum_{\ell=1}^{T-1}\frac{T}{N}\|\vF_{\ell/T}\|^2\right)
= \frac{1}{N}\sum_{\ell=1}^{T-1}\|\vF_{\ell/T}\|^2
= \frac{\Phi_T}{N},
\end{align}
%\vscomment{should there be paranthesis over everything after $\frac{1}{T}$ in these expressions? }
where the second line uses $\frac{1}{T}\sum_r \vmu_r = \vmu$,
which holds because the classes $\{C_r\}$ partition $\{0,\ldots,N-1\}$
into equal-sized groups.

For $\Tr(\mS_W)$, the total variance decomposes as $V = \Tr(\mS_B) + \Tr(\mS_W)$.
Define the $N \times d$ matrix $\mE$ whose rows are $\ve(0)^\top, \ldots, \ve(N-1)^\top$, and let $\mU_N$ be the $N \times N$ DFT matrix with entries $(\mU_N)_{kn} = \frac{1}{\sqrt{N}} e^{-2\pi i k n / N}$.
Then $\hat{\mE} = \mU_N \mE$ has rows $\vF_\nu^\top$, and since $\mU_N$ is unitary,
\[
\sum_{n=0}^{N-1} \|\ve(n)\|^2 = \|\mE\|_F^2 = \|\hat{\mE}\|_F^2 = \sum_{\nu} \|\vF_\nu\|^2.
\]
Since $\vF_0 = \sqrt{N}\,\vmu$, we have $\|\vF_0\|^2 = N\|\vmu\|^2$, and therefore
\[
V = \frac{1}{N}\sum_{n=0}^{N-1}\|\ve(n) - \vmu\|^2
  = \frac{1}{N}\left(\sum_{n}\|\ve(n)\|^2 - N\|\vmu\|^2\right)
  = \frac{1}{N}\sum_{\nu \neq 0}\|\vF_\nu\|^2.
\]

Therefore:
\[
\Tr(\mS_W) = V - \Tr(\mS_B)
= \frac{1}{N}\sum_{\nu \neq 0}\|\vF_\nu\|^2
  - \frac{1}{N}\sum_{\ell=1}^{T-1}\|\vF_{\ell/T}\|^2
= \frac{1}{N}\sum_{\nu \notin H_T}\|\vF_\nu\|^2,
\]
where $H_T = \{0,\, 1/T,\, 2/T,\, \ldots,\, (T{-}1)/T\}$ collects the
zero-frequency and harmonic frequencies.
\end{proof}
```
Necessary condition
-------------------

```{=latex}
\begin{proof}[Proof of Part (i) of \Cref{thm:fourier}]If $\Phi_T = 0$, then $\Tr(\mS_B) = 0$ by Part~(i).
Since $\mS_B$ is positive semidefinite, $\Tr(\mS_B) = 0$ implies $\mS_B = \bm{0}$, which requires $\vmu_r = \vmu$ for all $r = 0, \ldots, T-1$.
When all class means coincide, the class-conditional distributions of $\ve(n)$ share the same first moment, so no linear probe (or any probe relying on mean separation) can distinguish the $T$ classes above the chance rate $1/T$.
\end{proof}
```
Insufficiency condition {#app:insufficient}
-----------------------

```{=latex}
\begin{proof}[Proof of Part (ii)]We construct embeddings whose residue classes interleave periodically on the real line, then show that the geometry of linear decision boundaries prevents any $T$-class linear classifier from exceeding chance-level accuracy by more than $\varepsilon$.
Fix $T \ge 2$, $C > 0$, and $\varepsilon > 0$.
Set $K = \lceil (T{-}1)/(T\varepsilon) \rceil$ and $N = KT$.
Every index $n \in \{0, \dots, N{-}1\}$ has a unique decomposition $n = mT + r$ with residue $r \in \{0, \dots, T{-}1\}$ and block index $m \in \{0, \dots, K{-}1\}$.
Define
\[
  e(n) \;=\; A \cdot \underbrace{(n \bmod T)}_{\text{residue } r}
  \;+\; B \cdot \underbrace{\left\lfloor \frac{n}{T} \right\rfloor}_{\text{block index } m},
\]
with $A, B > 0$ to be chosen.
Intuitively, $A$ controls $n \bmod T$: enlarging $A$ pulls numbers sharing the same residue class $C_r$ together.
The parameter $B$ controls $\lfloor n/T \rfloor$, which measures how many full copies of $T$ fit below $n$: increasing $B$ clusters numbers of similar magnitude together, regardless of their residue.
We now show how $A$ determines the Fourier power while $B$ controls the interleaving that defeats linear classifiers.

\paragraph{Fourier power.}
Within class $C_r = \{r, r{+}T, \dots, r{+}(K{-}1)T\}$, the term $Ar$ is constant and only the block term varies, so the class mean is
\[
  \mu_r
  = \frac{1}{K}\sum_{m=0}^{K-1}(Ar + Bm)
  = Ar + \frac{B(K{-}1)}{2}.
\]
The grand mean is $\mu = A(T{-}1)/2 + B(K{-}1)/2$, giving $\mu_r - \mu = A\bigl(r - (T{-}1)/2\bigr)$.
Note that the block term $B \lfloor n/T \rfloor$ contributes nothing to $S_B$: its class mean $B(K{-}1)/2$ is identical across all classes and cancels in $\mu_r - \mu$.
Therefore
\[
  S_B
  = \frac{1}{T}\sum_{r=0}^{T-1}(\mu_r - \mu)^2
  = \frac{A^2}{T}\sum_{r=0}^{T-1}\Bigl(r - \frac{T{-}1}{2}\Bigr)^{\!2}
  = \frac{A^2(T^2 - 1)}{12},
\]
where the last equality uses the standard identity for the second central moment of $T$ consecutive integers.
By \Cref{lem:fourier-variance-identity}, $\Phi_T = N \cdot S_B = A^2 KT(T^2-1)/12$.
Setting $A = \sqrt{12C / (KT(T^2 - 1))}$ gives $\Phi_T = C$.

\paragraph{Periodic interleaving.}
Choose $B > (T{-}1)A$ so that consecutive blocks separate on the real line.
The largest value in block $m$ is $e((T{-}1) + mT) = (T{-}1)A + Bm$, and
the smallest value in block $m{+}1$ is $e(0 + (m{+}1)T) = B(m{+}1)$.
Since $B > (T{-}1)A$, we have $B(m{+}1) > (T{-}1)A + Bm$, so block $m$
and block $m{+}1$ occupy disjoint intervals on the real line.
Within each block, the $T$ points are sorted by residue:
$e(mT) = Bm < e(mT{+}1) = A + Bm < \cdots < e(mT{+}T{-}1) = (T{-}1)A + Bm$,
since $e(n)$ is increasing in $n \bmod T$ for fixed $\lfloor n/T \rfloor$.
Combining both observations, sorting all $N$ embeddings by value yields the
natural ordering: the $i$-th smallest embedding ($0$-indexed) is
\[
  x_{(i)} \;=\; e(i) \;=\; A \cdot (i \bmod T) \;+\; B \cdot \lfloor i/T \rfloor,
  \qquad i = 0, 1, \dots, N{-}1,
\]
whose class label is $i \bmod T$.
The sorted class label sequence is therefore $(0, 1, \dots, T{-}1)$ repeated
$K$ times:
\[
  \underbrace{0, 1, \dots, T{-}1}_{\text{block } 0},\;\;
  \underbrace{0, 1, \dots, T{-}1}_{\text{block } 1},\;\;
  \dots,\;\;
  \underbrace{0, 1, \dots, T{-}1}_{\text{block } K{-}1}.
\]
Any contiguous subsequence of length $T$ or longer contains at least one
complete cycle and therefore includes at least one point from every class.

\paragraph{Classification bound.}
A $T$-class linear classifier assigns each instance $x \in \mathbb{R}$ to
$\operatorname{argmax}_{c \in \{0,\dots,T-1\}} (w_c x + b_c)$ for
parameters $w_c, b_c \in \mathbb{R}$. This is the hypothesis class of
multiclass logistic regression.
Note that this hypothesis class is expressive enough to perfectly classify
$T$ contiguous groups: when $B$ is small, the $T$ classes cluster near
$0, A, 2A, \dots, (T{-}1)A$, and setting $w_c = c$ with biases cascaded
via $b_0 = 0$, $b_{c+1} = b_c - (cA + A/2)$ places the $T{-}1$ decision
boundaries between consecutive clusters, achieving $100\%$ accuracy.
Next we exploit the limitation that linear classifiers partition $\mathbb{R}$ into at most $T$ contiguous intervals: any two
of the $T$ lines $x \mapsto w_c x + b_c$ intersect in at most one point,
yielding at most $T - 1$ breakpoints and hence at most $T$ intervals, each
assigned to a single class.
Now consider any such interval.
Each complete cycle $(0, 1, \dots, T{-}1)$ contained in the interval contributes exactly one point from every class, so the assigned label matches exactly a $1/T$ fraction of those points.
Only at interval boundaries can a partial cycle contribute additional correct predictions: at most one per boundary, for a total of at most $T - 1$ extra correct points across all $T - 1$ boundaries.
It follows that
\[
  \mathrm{accuracy}
  \;\le\; \frac{N/T + T - 1}{N}
  \;=\; \frac{1}{T} + \frac{T-1}{KT}.
\]
The choice $K = \lceil (T{-}1)/(T\varepsilon) \rceil$ guarantees $(T{-}1)/(KT) \le \varepsilon$, completing the proof.

\Cref{fig:thm_part_ii_examples,fig:part_ii_T_10} illustrates this construction for $T = 5,N = 25$, and for $T = 10, N = 1000$, where in both cases $\varepsilon$ achieves its minimum at $\frac{T-1}{N}$, at 16\% and 0.9\% respectively.
\end{proof}
```
![Emergence of Fourier structure in a constructed embedding $e(n) = A(n \bmod T) + B\lfloor n/T \rfloor$ with $T=10$, $A=5$, $N=1000$. Each dot is a number $n$ placed at its scalar embedding value on the horizontal axis and colored by its class $n \bmod T$; since $e(n)$ is one-dimensional, the vertical coordinate carries no information and is jittered purely for visibility so that overlapping points of different classes remain distinguishable. **Top ($B=0.03$)**: within-block drift is small relative to between-class spacing, so the sorted line separates into 10 clean color bands: a linear classifier on $e(n)$ recovers the mod-10 residue at near-perfect accuracy, and the DFT concentrates energy at the fundamental frequency $\nu = 1/T = 0.1$. **Bottom ($B=21$)**: the between-block drift $B$ dominates, interleaving classes along the line so that every mod-10 lane spans the full range; best linear accuracy collapses to $\approx 1/T + \varepsilon$ with $\varepsilon = 0.9$%. The Fourier spectrum keeps the same peak at $\nu=0.1$ (same $\Phi_T$), showing that periodic energy at the fundamental is a property of the construction itself, independent of whether class identity is linearly decodable.](figures/part_ii_figure_T10.png){#fig:part_ii_T_10 width="\\linewidth"}

The token frequency distribution in `\Cref{fig:fourier_no_probe}`{=latex} provides an empirical counterexample: LSTM exhibits clear Fourier spikes at $T = 2, 5, 10$ yet achieves chance-level probing for all moduli, showing that the construction above captures a phenomenon that occurs in practice.

Lower and Upper Bounds {#app:bounds}
----------------------

```{=latex}
\begin{lemma}
    Assume $d \geq T - 1$ and $\mS_W$ is invertible (which holds when $d \leq N - T$, or after projecting to a subspace of dimension at most $N - T$).
In Fisher's Linear Discriminant Analysis \citep{fisher1936lda}, the optimal linear probe maximizes the ratio of between-class to within-class variance along a projection direction.
The separability of this optimal discriminant is characterized by $\lambda_{\max}(\mS_W^{-1} \mS_B)$, the largest generalized eigenvalue of the scatter matrix pair, which satisfies
\[
\frac{1}{(T-1) \cdot \mathrm{cond}(\mS_W) } \cdot \frac{\Phi_T}{N \cdot \lambda_{\min}(\mS_W)} \;\leq\; \lambda_{\max}(\mS_W^{-1} \mS_B) \;\leq\; \frac{\Phi_T}{N \cdot \lambda_{\min}(\mS_W)}.
\]
The ratio between the upper and lower bounds is $(T-1) \cdot \mathrm{cond}(\mS_W)$, where $\mathrm{cond}(\mS_W) = \lambda_{\max}(\mS_W) / \lambda_{\min}(\mS_W)$.
\end{lemma}
```
```{=latex}
\begin{proof}
The optimal linear discriminant for $T$-class classification projects the data onto the direction maximizing the generalized Rayleigh quotient:
\[
\lambda_{\max}(\mS_W^{-1}\mS_B) = \max_{\vv \neq \vzero} \frac{\vv^\top \mS_B\, \vv}{\vv^\top \mS_W\, \vv}.
\]

\emph{Upper bound.}
For any $\vv \neq \vzero$, the numerator satisfies $\vv^\top \mS_B\, \vv \leq \lambda_{\max}(\mS_B)\|\vv\|^2 \leq \Tr(\mS_B)\|\vv\|^2$, where the second inequality holds because $\mS_B$ is positive semidefinite and $\lambda_{\max}(\mS_B) \leq \Tr(\mS_B)$.
The denominator satisfies $\vv^\top \mS_W\, \vv \geq \lambda_{\min}(\mS_W)\|\vv\|^2$.
Therefore:
\[
\lambda_{\max}(\mS_W^{-1}\mS_B) \leq \frac{\Tr(\mS_B)}{\lambda_{\min}(\mS_W)} = \frac{\Phi_T}{N \cdot \lambda_{\min}(\mS_W)}.
\]

\emph{Lower bound.}
The matrix $\mS_B$ has rank at most $\min(d, T-1)$; since $d \geq T - 1$ by assumption, this simplifies to $T - 1$.
This holds because the $T$ vectors $\{\vmu_r - \vmu\}_{r=0}^{T-1}$ satisfy $\sum_r (\vmu_r - \vmu) = \vzero$ and therefore span a subspace of dimension at most $T-1$.
It follows that $\lambda_{\max}(\mS_B) \geq \Tr(\mS_B)/(T-1) = \Phi_T/(N(T-1))$.
Let $\vv^*$ be the unit eigenvector of $\mS_B$ corresponding to $\lambda_{\max}(\mS_B)$.
Then:
\[
\lambda_{\max}(\mS_W^{-1}\mS_B) \geq \frac{{\vv^*}^\top \mS_B\, \vv^*}{{\vv^*}^\top \mS_W\, \vv^*} = \frac{\lambda_{\max}(\mS_B)}{{\vv^*}^\top \mS_W\, \vv^*} \geq \frac{\lambda_{\max}(\mS_B)}{\lambda_{\max}(\mS_W)} \geq \frac{\Phi_T}{N \cdot (T-1) \cdot \lambda_{\max}(\mS_W)}.
\]

\emph{Gap between the bounds.}
The ratio of the upper to lower bound is:
\[
\frac{\Phi_T / (N \cdot \lambda_{\min}(\mS_W))}{\Phi_T / (N \cdot (T-1) \cdot \lambda_{\max}(\mS_W))} = (T-1) \cdot \frac{\lambda_{\max}(\mS_W)}{\lambda_{\min}(\mS_W)} = (T-1) \cdot \mathrm{cond}(\mS_W).
\]
The Fourier power spectrum fully determines $\Phi_T$ via Part~(i), but $\mathrm{cond}(\mS_W)$ depends on the directional structure of within-class variation, which the power spectrum $\{\|\vF_\nu\|^2\}$ does not capture.
Specifically, $\|\vF_\nu\|^2$ aggregates power across all $d$ embedding dimensions at frequency $\nu$, discarding any information about which dimensions carry the periodic signal versus which carry within-class noise.
As a result, two embeddings with identical power spectra (and hence identical $\Phi_T$) but different within-class covariance structures can yield $\lambda_{\max}(\mS_W^{-1}\mS_B)$ values differing by up to a factor of $(T-1) \cdot \mathrm{cond}(\mS_W)$, producing vastly different probe accuracies.
\end{proof}
```
```{=latex}
\clearpage
```
Experiments {#app:exp}
===========

Model and Training Details {#app:model_training}
--------------------------

```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{4pt}
```
::: {#tab:config}
                                                                                  Transformer                          Gated DeltaNet                Mamba-2                    LSTM
  ---------------------------------------------------------- ------------------------------------------------------ --------------------- ------------------------------ -------------------
  *Architecture*
  `\quad `{=latex}Total parameters                                                    320M                                  318M                       316M                     232M
  `\quad `{=latex}`\;`{=latex} $\rightarrow$ Embedding                                131M                                  131M                       131M                     131M
  `\quad `{=latex}`\;`{=latex} $\rightarrow$ Non-embedding                            189M                                  186M                       185M                     101M
  `\quad `{=latex}Tied embeddings                                              `\cmark `{=latex}                      `\cmark `{=latex}         `\cmark `{=latex}         `\cmark `{=latex}
  `\quad `{=latex}Layers                                                               12                                    12                         28                       12
  `\quad `{=latex}Hidden dim $d$                                                      1024                                  1024                       1024                     1024
  `\quad `{=latex}Heads                                                            16 (8 KV)                                 16                         32                       ---
  `\quad `{=latex}Head dim                                                             64                                    64                         64                       ---
  `\quad `{=latex}MLP intermediate                                                    4096                                  4096                       ---                       ---
  `\quad `{=latex}MLP activation                                                     SwiGLU                                SwiGLU                      ---                       ---
  `\quad `{=latex}Positional encoding                                                 RoPE                                  None                       None                     None
  `\quad `{=latex}Normalization                                                     RMSNorm                                RMSNorm                   RMSNorm                     ---
  `\quad `{=latex}Sequence mechanism                                                  GQA                            Linear Attn + Gate    SSM ($d_\text{state}{=}128$)      Recurrence
  `\quad `{=latex}Short convolution                                                   ---                                  size 4                     size 4                     ---
  `\quad `{=latex}SSM expand factor                                                   ---                            $1.5\times$ (value)            $2\times$                    ---
  `\quad `{=latex}Dropout                                                             ---                                    ---                       ---                       0.1
  *Muon optimizer*
  `\quad `{=latex}2D weight LR                                       $3 \times 10^{-3}$, momentum $= 0.95$
  `\quad `{=latex}Embed/norm/bias LR                              $3 \times 10^{-4}$ (AdamW, $\beta_2{=}0.95$)
  `\quad `{=latex}Weight decay                                               0.01 (2D weights only)
  *AdamW optimizer*
  `\quad `{=latex}Learning rate                                                $3 \times 10^{-4}$
  `\quad `{=latex}$(\beta_1, \beta_2)$                                           $(0.9, 0.95)$
  `\quad `{=latex}Weight decay                                                        0.01
  *Shared training*
  `\quad `{=latex}Context length                                                      1024
  `\quad `{=latex}Batch size                                        512 sequences (${\sim}$524K tokens/step)
  `\quad `{=latex}LR schedule                                 Cosine decay, 500 warmup steps, min $= 10\%$ of peak
  `\quad `{=latex}Training tokens                                  ${\sim}$9.4B (1 epoch of FineWeb-Edu 10BT)
  `\quad `{=latex}Precision                                                 bfloat16 mixed precision

  : Model architectures and training configurations.
:::

LSTM ablation {#app:lstm}
-------------

We ablate the number of layers of LSTMs and find reducing the number of layers to 4 instead of 12 does not change the phenomenon: both will learn fourier spikes but no probing performance. Both have huge condition number on $\mS_W$ yet huge $\Phi_T$ as well.

![**Ablation on the depth of LSTM models.** We find that reducing the number of layers to 4 (green) instead of 12 (blue) does not change the phenomenon: both will learn fourier spikes but no probing performance.](figures/fourier_probe_fisher_lstm.png){#fig:lstm width="\\linewidth"}

Data Perturbation Details {#app:data}
-------------------------

In this section, we describe in detail how we perturb data for each configurations in `\Cref{tab:data_configs}`{=latex}.

#### Isolate-$k$ configuration.

We design an *isolate* configuration to test whether Fourier features and mod-$T$ probes can emerge when reducing the interactive between number tokens, even indirectly through intermediate text tokens across multiple layers. The key idea is to enforce a block-diagonal causal attention mask that partitions each sequence into segments, where each segment contains at most $k$ token. Concretely, given a tokenized sequence, we locate all positions containing number tokens and place segment boundaries at the midpoint of the text span between each consecutive pair of $k$ number tokens. This way, every number token still sees some surrounding context on both sides, but can never interact with any other number token, if they are not in the same segment. Within each segment, standard causal attention applies: position $i$ attends to position $j$ only if $j \leq i$ and both positions belong to the same segment. We also reset RoPE position IDs to zero at segment boundaries to avoid leaking positional information across segments, and mask the loss at boundaries so the model is not trained to predict across segment breaks. Importantly, we do not modify the training data at all. The tokenized sequences are identical to those used in the standard configuration; only the attention mask differs.

#### Context length $\ell$.

To test the role of broad context in Fourier feature formation and geometric emergence, we train models whose effective context is limited to a fixed window of $\ell$ consecutive tokens. Each 1024-token sequence is reshaped into $\lfloor 1024 / \ell \rfloor$ independent subsequences of length $\ell$, each processed with its own standard causal attention mask. Tokens in one window cannot attend to tokens in any other window. This is equivalent to training on a corpus of short documents of length $\ell$. We experiment with $\ell \in \{2, 4, 8, 64\}$: a window of $\ell = 2$ reduces the model to learning bigram statistics (each token sees only the immediately preceding token), while $\ell = 4$ and $8$ permit short-range dependencies but still prevents any long-range co-occurrence patterns. $\ell = 64$ permits longer range dependencies but is still much shorter than `Original`'s context length of 1024. As with the isolate configuration, the underlying token sequence is unchanged; only the effective context window differs.

#### Swap numbers.

This configuration dissociates number tokens from their original textual context while preserving natural number $n$-gram statistics. For each training sequence, we keep every text token in its exact position and replace the entire subsequence of number tokens with a contiguous, order-preserving slice of number tokens drawn from other documents in the corpus. Concretely, we pre-extract all number tokens from the full training set into a single in-memory stream, and for each sequence we substitute a randomly chosen contiguous segment from this pool. The replaced number tokens therefore retain realistic sequential patterns but lose their association with the surrounding text. This tests whether the text-number co-occurrence structure, rather than the number token statistics alone, drives Fourier feature and modular probe emergence.

#### Unigram replace.

In the unigram configuration, every number token in the training data is independently replaced by a random draw from the corpus-wide marginal (unigram) distribution over number tokens. This destroys all sequential structure among numbers, such as their relationship to surrounding text and their co-occurrence with other numbers. But this type of perturbation exactly preserves the marginal frequency of each number token. If a number token $n$ appears with probability $p_n$ in the original corpus, it appears with the same probability in the perturbed data. By comparing to the original and swap-numbers configurations, the unigram ablation isolates whether co-occurrence statistics beyond simple token frequency are necessary for Fourier features and modular arithmetic to emerge.

Training Dynamics {#app:appendix}
-----------------

`\Cref{fig:dynamics_transformer_original}`{=latex} shows that during language pretraining, both Fourier power $\Phi_T$ and linear probe accuracy increase smoothly from the start of training for $T = 2, 5, 10$, with no sudden phase transition. This contrasts with grokking in modular arithmetic [@nanda2023progress], where structured representations appear abruptly after prolonged memorization. In language pretraining, the model continuously encounters diverse co-occurrence statistics rather than memorizing a fixed set of examples, so both tiers of convergence emerge gradually. In contrast, `\Cref{fig:addition_dynamics}`{=latex} shows training dynamics for addition.

![**Spectral and geometric convergence co-emerge gradually during pretraining.** Fourier power $\Phi_T$ (left axis) and linear probe accuracy (right axis) for a 300M Transformer trained with Muon, shown for $T = 2, 5, 10$. Both metrics increase smoothly throughout training with no phase transition, in contrast to the sudden emergence observed in grokking on modular arithmetic tasks in @nanda2023progress.](figures/dynamics_phi_T.png){#fig:dynamics_transformer_original width="\\linewidth"}

![**Training dynamics for Transformers trained on addition (two seeds).** *(Left)* In 9-digit addition, both Muon and AdamW converge smoothly to near-perfect train and test accuracy, with no grokking phase. *(Right)* In 3-digit addition, under seed 42, training accuracy reaches 100% for both optimizers, but generalization is optimizer- and seed-dependent. AdamW exhibits a quick grokking under seed 42 (test accuracy jumps around 1.5--2B tokens) but not under seed 123 (where training accuracy can't reach 100% and test accuracy remains random). This confirms that single-token addition imposes no consistent pressure toward structured representations.](figures/addition_9digit_dynamics.png "fig:"){#fig:addition_dynamics width="0.49\\linewidth"} `\hfill`{=latex} ![**Training dynamics for Transformers trained on addition (two seeds).** *(Left)* In 9-digit addition, both Muon and AdamW converge smoothly to near-perfect train and test accuracy, with no grokking phase. *(Right)* In 3-digit addition, under seed 42, training accuracy reaches 100% for both optimizers, but generalization is optimizer- and seed-dependent. AdamW exhibits a quick grokking under seed 42 (test accuracy jumps around 1.5--2B tokens) but not under seed 123 (where training accuracy can't reach 100% and test accuracy remains random). This confirms that single-token addition imposes no consistent pressure toward structured representations.](figures/addition_3digit_dynamics.png "fig:"){#fig:addition_dynamics width="0.49\\linewidth"}\
`\hfill`{=latex} ![**Training dynamics for Transformers trained on addition (two seeds).** *(Left)* In 9-digit addition, both Muon and AdamW converge smoothly to near-perfect train and test accuracy, with no grokking phase. *(Right)* In 3-digit addition, under seed 42, training accuracy reaches 100% for both optimizers, but generalization is optimizer- and seed-dependent. AdamW exhibits a quick grokking under seed 42 (test accuracy jumps around 1.5--2B tokens) but not under seed 123 (where training accuracy can't reach 100% and test accuracy remains random). This confirms that single-token addition imposes no consistent pressure toward structured representations.](figures/addition_3digit_dynamics_seed123.png "fig:"){#fig:addition_dynamics width="0.49\\linewidth"}

```{=latex}
\clearpage
```
Modular Probe Results with MLP and RFM Probes {#app:mlp_rfm}
---------------------------------------------

In this section, we present the modular probes similar to `\Cref{fig:data,fig:arch_opt,fig:addition}`{=latex} but with RFM probes and 2-layer MLP probes with hidden layer of size 64.

![Structural Attribution Results with RFM probes.](figures/rfm/data_rfm.png "fig:"){#fig:probe_rfm width="\\linewidth"} ![Structural Attribution Results with RFM probes.](figures/rfm/architecture_rfm.png "fig:"){#fig:probe_rfm width="\\linewidth"} ![Structural Attribution Results with RFM probes.](figures/rfm/addition_rfm.png "fig:"){#fig:probe_rfm width="\\linewidth"}

![Structural Attribution Results with 2-layer MLP probes.](figures/mlp/data_mlp.png "fig:"){#fig:probe_mlp width="\\linewidth"} ![Structural Attribution Results with 2-layer MLP probes.](figures/mlp/architecture_mlp.png "fig:"){#fig:probe_mlp width="\\linewidth"} ![Structural Attribution Results with 2-layer MLP probes.](figures/mlp/addition_mlp.png "fig:"){#fig:probe_mlp width="\\linewidth"}

```{=latex}
\clearpage
```
#### Cicular Probes for Transformers Trained on Addition.

To probe how number tokens are represented in the embedding layer, we train circular probes on the token embeddings for numbers. Beyond $T$-class modular probes, which test linear separability in a $(T{-}1)$-dimensional subspace, a circular probe tests angular separability in 2-D: it learns a linear map $W \in \mathbb{R}^{d\times 2}$ projecting each embedding onto the unit circle, and classifies by cosine similarity to $m$ anchor directions at $\theta_k = 2\pi k/m$.

![Circular probe projections of token embeddings onto the unit circle for mod-10 classification. Each point is a number token $n \in [0, 999]$ projected to 2-D by the probe and normalized; color indicates $n \bmod 10$. (*Left*) embeddings from a model trained on 9-digit addition cluster sharply by residue class ($84\%$ test accuracy), indicating that the token embedding layer has learned a geometrically organised, clock-like representation. (*Right*) embeddings from a model trained on 3-digit addition show no angular structure ($11\%$ test accuracy, near chance), consistent with the absence of Fourier peaks in Figure `\ref{fig:addition}`{=latex}.](figures/compare_addition_mod10.png){#fig:circular_probe width="0.75\\linewidth"}
