Pre-trained Large Language Models Use Fourier Features To Compute Addition
Source
- Raw Markdown: paper_llms-use-fourier-features-addition-2024.md
- PDF: paper_llms-use-fourier-features-addition-2024.pdf
- Preprint: arXiv 2406.03445
Core Claim
This paper argues that pretrained LLMs compute simple addition through Fourier features: low-frequency components approximate the answer's magnitude, while high-frequency components support modular classification such as units-digit or parity decisions.
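A minimal sketch of that split, using synthetic features rather than anything extracted from the model: a slow component tracks magnitude, while period-10 and period-2 components depend only on the residues that modular classification needs.

```python
import numpy as np

# Toy features for integers 0..199 -- an illustrative assumption, not the paper's probes.
n = np.arange(200)
magnitude = n / 200.0                  # low-frequency component: tracks the number's size
units = np.cos(2 * np.pi * n / 10)     # period-10 component: depends only on n mod 10
parity = np.cos(np.pi * n)             # period-2 component: depends only on n mod 2

# Numbers sharing a units digit look identical to the periodic features ...
assert np.isclose(units[7], units[137]) and np.isclose(parity[7], parity[137])
# ... while the low-frequency component still separates them by magnitude.
print(magnitude[7], magnitude[137])    # 0.035 vs 0.685
```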
Key Contributions
- Studies fine-tuned GPT-2-XL on addition over numbers that fit into single GPT-2 tokens.
- Uses layer-wise readouts to show the model progressively refines answers rather than directly retrieving memorized sums.
- Applies Fourier analysis to MLP and attention logits, finding sparse periodic components.
- Shows causal importance by filtering out low- and high-frequency components (see the filtering sketch after this list).
- Identifies pretrained number-token embeddings as a source of useful Fourier-like inductive bias.
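The analysis-and-ablation steps above can be sketched on a synthetic logit vector (an illustrative assumption, not GPT-2-XL activations): take the DFT of logits indexed by answer value, note the concentration in a few low bins plus a period-10 peak, then low-pass or high-pass filter and check what each band can still recover.

```python
import numpy as np

# Synthetic logits over answer values 0..999 -- not real GPT-2-XL activations:
# a smooth bump centered on the true sum (magnitude) plus a period-10 wave
# whose peaks sit on values sharing the correct units digit.
V, true_answer = 1000, 437
x = np.arange(V)
logits = 3 * np.exp(-((x - true_answer) / 60.0) ** 2) \
         + np.cos(2 * np.pi * (x - true_answer) / 10)

spectrum = np.fft.rfft(logits)
# Energy concentrates in a few low bins (the bump) plus bin 100 (the period-10 wave).
print("dominant bins:", np.sort(np.argsort(np.abs(spectrum))[-5:]))

def filtered_argmax(keep_slice):
    """Keep one band of frequency bins, invert the DFT, and read off the argmax."""
    s = np.zeros_like(spectrum)
    s[keep_slice] = spectrum[keep_slice]
    return int(np.argmax(np.fft.irfft(s, n=V)))

low_only = filtered_argmax(slice(0, 31))      # low-pass: magnitude approximation
high_only = filtered_argmax(slice(50, None))  # high-pass: periodic (mod-10) structure
assert low_only == true_answer                # low frequencies locate the answer's size
assert high_only % 10 == true_answer % 10     # high frequencies pin down the units digit
print("full:", int(np.argmax(logits)), "| low-pass:", low_only, "| high-pass:", high_only)
```

The point of the sketch is only that the two bands carry complementary information; in the paper the filtering is applied to actual model components.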
Method Notes
This is the mechanistic upstream source for FoNE (Fourier Number Embedding). FoNE turns the descriptive observation into an explicit embedding design: if pretrained LLMs naturally organize number tokens along Fourier-like components, a number embedding can build those components in directly.
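As a hedged sketch of that design direction (the actual FoNE parameterization is not specified in this note), an explicit Fourier number embedding could map each value to sine/cosine pairs at a few fixed periods:

```python
import numpy as np

def fourier_number_embedding(n: float, periods=(2, 5, 10, 100, 1000)) -> np.ndarray:
    """Hypothetical Fourier-style number embedding: one (cos, sin) pair per period.
    Illustrates the general idea only; the real FoNE design may differ."""
    angles = 2 * np.pi * n / np.asarray(periods, dtype=float)
    return np.concatenate([np.cos(angles), np.sin(angles)])

# Numbers sharing a residue share the corresponding embedding coordinates:
e17, e147 = fourier_number_embedding(17), fourier_number_embedding(147)
# the period-2/5/10 coordinates agree because 17 ≡ 147 (mod 2, 5, 10),
# while the period-100 and period-1000 coordinates still separate the two magnitudes.
print(np.round(e17 - e147, 6))
```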
The time-series analogy is limited but useful. Periodic components are natural for scalar values with modular or cyclic structure, but a sensor value, exogenous variable, control input, or intervention may require different geometry depending on whether the model needs interpolation, exact arithmetic, or causal sensitivity.
Evidence And Results
The paper reports that a fine-tuned GPT-2-XL model reaches high addition accuracy and that its internal computation decomposes into approximation and classification components. MLP layers primarily approximate magnitude with lower-frequency features, while attention layers primarily contribute modular operations with higher-frequency features.
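A toy composition of those two components, as an assumption-laden illustration rather than the paper's mechanism: a coarse magnitude estimate narrows the answer to a neighborhood, and the modular (units-digit) classification snaps it to the exact value.

```python
# Toy "approximate, then classify" composition -- illustrative, not the paper's code.
def compose(approx_sum: float, units_digit: int) -> int:
    """Snap a rough magnitude estimate to the nearest value with the given units digit."""
    base = int(round(approx_sum))
    candidates = [v for v in range(base - 10, base + 11) if v % 10 == units_digit]
    return min(candidates, key=lambda v: abs(v - approx_sum))

a, b = 326, 487
rough_sum = (a + b) + 3.7      # low-frequency path: right neighborhood, wrong exact value
units = (a + b) % 10           # high-frequency path: the correct residue mod 10
print(compose(rough_sum, units), a + b)   # both 813
```

Neither band alone suffices: the rough estimate misses the exact value, and the residue alone fixes only the last digit.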
The pretraining comparison is central: models trained from scratch do not show the same Fourier-feature pattern and achieve lower accuracy, while introducing pretrained token embeddings improves performance.
Limitations
The experiments focus on addition and numbers constrained by GPT-2 tokenization. The mechanism should not be generalized to all numeric reasoning, multiplication, time-series forecasting, or auxiliary numeric value encoding without additional evidence.
Links Into The Wiki
Open Questions
- Do similar Fourier-like number features appear for continuous sensor values, exogenous variables, or control inputs in time-series models?
- Which numeric operations need low-frequency magnitude approximation, high-frequency modular classification, bit-level logic, or all three?
- Can Fourier-feature analysis diagnose failures in point-wise time-series embeddings?