Pre-trained Large Language Models Use Fourier Features To Compute Addition
Source
- Raw Markdown: paper_llms-use-fourier-features-addition-2024.md
- PDF: paper_llms-use-fourier-features-addition-2024.pdf
- Preprint: arXiv 2406.03445
Core Claim
This paper argues that pretrained LLMs compute simple addition through Fourier features: low-frequency components approximate the answer's magnitude, while high-frequency components support modular classification such as units-digit or parity decisions.
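A minimal sketch of that split, using synthetic features rather than anything extracted from the model: a slow component tracks magnitude, while period-10 and period-2 components depend only on the residues that modular classification needs.

```python
import numpy as np

# Toy features for integers 0..199 -- an illustrative assumption, not the paper's probes.
n = np.arange(200)
magnitude = n / 200.0                  # low-frequency component: tracks the number's size
units = np.cos(2 * np.pi * n / 10)     # period-10 component: depends only on n mod 10
parity = np.cos(np.pi * n)             # period-2 component: depends only on n mod 2

# Numbers sharing a units digit look identical to the periodic features ...
assert np.isclose(units[7], units[137]) and np.isclose(parity[7], parity[137])
# ... while the low-frequency component still separates them by magnitude.
print(magnitude[7], magnitude[137])    # 0.035 vs 0.685
```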
Key Contributions
- Studies fine-tuned GPT-2-XL on addition over numbers that fit into single GPT-2 tokens.
- Uses layer-wise readouts to show the model progressively refines answers rather than directly retrieving memorized sums.
- Applies Fourier analysis to MLP and attention logits, finding sparse periodic components.
- Shows causal importance by filtering out low- and high-frequency components (see the filtering sketch after this list).
- Identifies pretrained number-token embeddings as a source of useful Fourier-like inductive bias.
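The analysis-and-ablation steps above can be sketched on a synthetic logit vector (an illustrative assumption, not GPT-2-XL activations): take the DFT of logits indexed by answer value, note the concentration in a few low bins plus a period-10 peak, then low-pass or high-pass filter and check what each band can still recover.

```python
import numpy as np

# Synthetic logits over answer values 0..999 -- not real GPT-2-XL activations:
# a smooth bump centered on the true sum (magnitude) plus a period-10 wave
# whose peaks sit on values sharing the correct units digit.
V, true_answer = 1000, 437
x = np.arange(V)
logits = 3 * np.exp(-((x - true_answer) / 60.0) ** 2) \
         + np.cos(2 * np.pi * (x - true_answer) / 10)

spectrum = np.fft.rfft(logits)
# Energy concentrates in a few low bins (the bump) plus bin 100 (the period-10 wave).
print("dominant bins:", np.sort(np.argsort(np.abs(spectrum))[-5:]))

def filtered_argmax(keep_slice):
    """Keep one band of frequency bins, invert the DFT, and read off the argmax."""
    s = np.zeros_like(spectrum)
    s[keep_slice] = spectrum[keep_slice]
    return int(np.argmax(np.fft.irfft(s, n=V)))

low_only = filtered_argmax(slice(0, 31))      # low-pass: magnitude approximation
high_only = filtered_argmax(slice(50, None))  # high-pass: periodic (mod-10) structure
assert low_only == true_answer                # low frequencies locate the answer's size
assert high_only % 10 == true_answer % 10     # high frequencies pin down the units digit
print("full:", int(np.argmax(logits)), "| low-pass:", low_only, "| high-pass:", high_only)
```

The point of the sketch is only that the two bands carry complementary information; in the paper the filtering is applied to actual model components.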
Method Notes
This is the mechanistic upstream source for FoNE (Fourier Number Embedding). FoNE turns the descriptive observation into an explicit embedding design: if pretrained LLMs naturally organize number tokens along Fourier-like components, a number embedding can build those components in directly.
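As a hedged sketch of that design direction (the actual FoNE parameterization is not specified in this note), an explicit Fourier number embedding could map each value to sine/cosine pairs at a few fixed periods:

```python
import numpy as np

def fourier_number_embedding(n: float, periods=(2, 5, 10, 100, 1000)) -> np.ndarray:
    """Hypothetical Fourier-style number embedding: one (cos, sin) pair per period.
    Illustrates the general idea only; the real FoNE design may differ."""
    angles = 2 * np.pi * n / np.asarray(periods, dtype=float)
    return np.concatenate([np.cos(angles), np.sin(angles)])

# Numbers sharing a residue share the corresponding embedding coordinates:
e17, e147 = fourier_number_embedding(17), fourier_number_embedding(147)
# the period-2/5/10 coordinates agree because 17 ≡ 147 (mod 2, 5, 10),
# while the period-100 and period-1000 coordinates still separate the two magnitudes.
print(np.round(e17 - e147, 6))
```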
The time-series analogy is limited but useful. Periodic components are natural for scalar values with modular or cyclic structure, but a sensor value, exogenous variable, control input, or intervention may require different geometry depending on whether the model needs interpolation, exact arithmetic, or causal sensitivity.
Evidence And Results
The paper reports that a fine-tuned GPT-2-XL model reaches high addition accuracy and that its internal computation decomposes into approximation and classification components. MLP layers primarily approximate magnitude with lower-frequency features, while attention layers primarily contribute modular operations with higher-frequency features.
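A toy composition of those two components, as an assumption-laden illustration rather than the paper's mechanism: a coarse magnitude estimate narrows the answer to a neighborhood, and the modular (units-digit) classification snaps it to the exact value.

```python
# Toy "approximate, then classify" composition -- illustrative, not the paper's code.
def compose(approx_sum: float, units_digit: int) -> int:
    """Snap a rough magnitude estimate to the nearest value with the given units digit."""
    base = int(round(approx_sum))
    candidates = [v for v in range(base - 10, base + 11) if v % 10 == units_digit]
    return min(candidates, key=lambda v: abs(v - approx_sum))

a, b = 326, 487
rough_sum = (a + b) + 3.7      # low-frequency path: right neighborhood, wrong exact value
units = (a + b) % 10           # high-frequency path: the correct residue mod 10
print(compose(rough_sum, units), a + b)   # both 813
```

Neither band alone suffices: the rough estimate misses the exact value, and the residue alone fixes only the last digit.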
The pretraining comparison is central: models trained from scratch do not show the same Fourier-feature pattern and achieve lower accuracy, while introducing pretrained token embeddings improves performance.
Limitations
The experiments focus on addition and numbers constrained by GPT-2 tokenization. The mechanism should not be generalized to all numeric reasoning, multiplication, time-series forecasting, or auxiliary numeric value encoding without additional evidence.
Links Into The Wiki
Open Questions
- Do similar Fourier-like number features appear for continuous sensor values, exogenous variables, or control inputs in time-series models?
- Which numeric operations need low-frequency magnitude approximation, high-frequency modular classification, bit-level logic, or all three?
- Can Fourier-feature analysis diagnose failures in point-wise time-series embeddings?