---
abstract:
- |
  Pre-trained large language models (LLMs) exhibit impressive mathematical reasoning capabilities, yet how they compute basic arithmetic, such as addition, remains unclear. This paper shows that pre-trained LLMs add numbers using Fourier features---dimensions in the hidden state that represent numbers via a set of features sparse in the frequency domain. Within the model, MLP and attention layers use Fourier features in complementary ways: MLP layers primarily approximate the magnitude of the answer using low-frequency features, while attention layers primarily perform modular addition (e.g., computing whether the answer is even or odd) using high-frequency features. Pre-training is crucial for this mechanism: models trained from scratch to add numbers only exploit low-frequency features, leading to lower accuracy. Introducing pre-trained token embeddings to a randomly initialized model rescues its performance. Overall, our analysis demonstrates that appropriate pre-trained representations (e.g., Fourier features) can unlock the ability of Transformers to learn precise mechanisms for algorithmic tasks.
author:
- |
  **Tianyi Zhou**    **Deqing Fu**    **Vatsal Sharan**    **Robin Jia**\
  Department of Computer Science\
  University of Southern California\
  Los Angeles, CA 90089\
  `{tzhou029,deqingfu,vsharan,robinjia}@usc.edu`\
bibliography:
- ref.bib
title: |
  **Pre-trained Large Language Models Use Fourier Features\
  to Compute Addition**
---

```{=latex}
\def\isarxiv{1}
```
```{=latex}
\ifdefined\isarxiv
```
```{=latex}
\ifdefined\isarxiv
```
```{=latex}
\else
```
```{=latex}
\maketitle 
```
```{=latex}
\iffalse
```
```{=latex}
\icmltitlerunning{????}
```
```{=latex}
\twocolumn[






\vskip 0.3in
]
```
```{=latex}
\printAffiliationsAndNotice{\icmlEqualContribution}
```
```{=latex}
\fi
```
```{=latex}
\fi
```
```{=latex}
\ifdefined\isarxiv
```
```{=latex}
\begin{titlepage}
  \maketitle
  \begin{abstract}

Pre-trained large language models (LLMs) exhibit impressive mathematical reasoning capabilities, yet how they compute basic arithmetic, such as addition, remains unclear. 
This paper shows that pre-trained LLMs add numbers using Fourier features---dimensions in the hidden state that represent numbers via a set of features sparse in the frequency domain. 
Within the model, MLP and attention layers use Fourier features in complementary ways: MLP layers primarily approximate the magnitude of the answer using low-frequency features, while attention layers primarily perform modular addition (e.g., computing whether the answer is even or odd) using high-frequency features.
Pre-training is crucial for this mechanism: models trained from scratch to add numbers only exploit low-frequency features, leading to lower accuracy.
Introducing pre-trained token embeddings to a randomly initialized model rescues its performance.
Overall, our analysis demonstrates that appropriate pre-trained representations (e.g., Fourier features) can unlock the ability of Transformers to learn precise mechanisms for algorithmic tasks.



  \end{abstract}
  \thispagestyle{empty}
\end{titlepage}
```
`\hypersetup{linkcolor=black}`{=latex} `\tableofcontents`{=latex} `\newpage`{=latex}

```{=latex}
\else
```
```{=latex}
\fi
```
Introduction {#sec:intro}
============

Mathematical problem solving has become a crucial task for evaluating the reasoning capabilities of large language models (LLMs) [@hendrycksmath2021; @cobbe2021training; @lu2024mathvista; @fu2024isobench]. While LLMs exhibit impressive mathematical abilities [@openai2024gpt4; @team2023gemini; @Claude3; @wang2017deep; @thawani2021representing; @brown2020language; @frieder2024mathematical], it remains unclear how they perform even basic mathematical tasks. Do LLMs apply mathematical principles when solving math problems, or do they merely reproduce memorized patterns from the training data?

In this work, we unravel how pre-trained language models solve simple mathematical problems such as \`\`Put together $15$ and $93$. Answer: [`\hspace{0.2in}`{=latex}]{.underline}". Prior work has studied how Transformers, the underlying architecture of LLMs, perform certain mathematical tasks. Most studies [@charton2023can; @Garg2022WhatCT; @Oswald2022TransformersLI; @Bai2023TransformersAS; @fu2023transformers; @nanda2023progress; @gu2024fourier; @Power2022GrokkingGB] focus on Transformers with a limited number of layers or those trained from scratch; [@hlv23] analyzes how the pre-trained GPT-2-small performs the greater-than task. Our work focuses on a different task from prior interpretability work---integer addition---and shows that pre-trained LLMs learn distinct mechanisms from randomly initialized Transformers.

In §`\ref{sec:fourier_analysis}`{=latex}, we show that pre-trained language models compute addition with Fourier features---dimensions in the hidden state that represent numbers via a set of features sparse in the frequency domain. First, we analyze the behavior of pre-trained LLMs on the addition task after fine-tuning, which leads to almost perfect accuracy on the task. Rather than merely memorizing answers from the training data, the models progressively compute the final answer layer by layer. Next, we analyze the contributions of individual model components using Logit Lens [@belrose2023eliciting]. We observe that some components primarily *approximate* the answer---they promote all numbers close to the correct answer in magnitude---while other components primarily *classify* the answer modulo $m$ for various numbers $m$. Then, we use Fourier analysis to isolate features in the residual stream responsible for the low-frequency \`\`approximation" and high-frequency \`\`classification" subtasks. Identifying these features allows us to precisely ablate the ability of the model to perform either approximation or classification by applying a low-pass or high-pass filter, respectively, to the outputs of different model components. We find that MLP layers contribute primarily to approximation, whereas attention layers contribute primarily to classification.

In §`\ref{sec:effect_of_pretraining}`{=latex}, we show that *pre-training* is crucial for learning this mechanism. The same network trained from scratch with random initialization not only shows no signs of Fourier features, but also has lower accuracy. We identify pre-trained token embeddings as a key source of inductive bias that helps the pre-trained model learn a more precise mechanism for addition. Across the token embeddings of many different pre-trained models, Fourier analysis uncovers large magnitudes for components with periods $2$, $5$, and $10$. Introducing pre-trained token embeddings when training the model from scratch enables the model to achieve perfect test accuracy. Finally, we show that the same Fourier feature mechanism is present not only in models that were pre-trained and then fine-tuned, but also in frozen pre-trained LLMs when prompted with arithmetic problems.

Overall, our work provides a mechanistic perspective on how pre-trained LLMs compute addition through the lens of Fourier analysis. It not only broadens the scope from only investigating few-layer Transformers trained to fit a particular data distribution to understanding LLMs as a whole, but also hints at how pre-training can lead to more precise model capabilities.



Problem Setup
=============

#### Task and Dataset.

We constructed a synthetic addition dataset for fine-tuning and evaluation purposes. Each example involves adding two numbers $\le 260$; this bound ensures that every sum is at most $520$, the largest number that the GPT-2-XL tokenizer represents as a single token. For each pair of numbers between $0$ and $260$, we randomly sample one of five natural language question templates and combine it with the two numbers. The dataset is shuffled and then split into training ($80\%$), validation ($10\%$), and test ($10\%$) sets. More details are provided in Appendix `\ref{sec:detail_exp_setting}`{=latex}. In Appendix `\ref{sec:format}`{=latex}, we show that our results generalize to a different dataset formatted with reverse Polish notation.
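A sketch of this construction (the exact wording of the five templates is an illustrative assumption; only the \`\`Put together $a$ and $b$" formulation from our running example is attested):

```python
import random

# Hypothetical question templates: only "Put together {a} and {b}" is
# attested in the running example; the other four are illustrative.
TEMPLATES = [
    "Put together {a} and {b}. Answer:",
    "What is the sum of {a} and {b}? Answer:",
    "Add {a} and {b}. Answer:",
    "Compute {a} plus {b}. Answer:",
    "The total of {a} and {b} is:",
]

def build_dataset(max_n=260, seed=0):
    rng = random.Random(seed)
    examples = []
    for a in range(max_n + 1):
        for b in range(max_n + 1):
            question = rng.choice(TEMPLATES).format(a=a, b=b)
            examples.append((question, str(a + b)))
    rng.shuffle(examples)                    # shuffle, then split 80/10/10
    n_train = int(0.8 * len(examples))
    n_val = int(0.9 * len(examples))
    return examples[:n_train], examples[n_train:n_val], examples[n_val:]

train, val, test = build_dataset()
```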

#### Model.

Unless otherwise stated, all experiments focus on the pre-trained GPT-2-XL model that has been fine-tuned on our addition dataset. This model, which consists of $48$ layers and approximately $1.5$ billion parameters, learns the task almost perfectly, with an accuracy of $99.74\%$ on the held-out test set. We examine other models in §`\ref{sec:pretrained_and_fromscratch}`{=latex} and §`\ref{sec:icl}`{=latex}.

#### Transformers.

We focus on decoder-only Transformer models [@vaswani2017attention], which process text sequentially, token by token, from left to right. Each layer $\ell$ in the Transformer has an attention module with output $\attn^{(\ell)}$ and an MLP module with output $\mlp^{(\ell)}$. Their outputs are added together to create a continuous residual stream $h$ [@elhage2021mathematical], meaning that the token representation accumulates all additive updates within the residual stream, with the representation $h^{(\ell)}$ in the $\ell$-th layer given by: $$\begin{aligned}
\label{eq:residual_stream}
    h^{(\ell)}=h^{(\ell-1)}+\attn^{(\ell)}+\mlp^{(\ell)}.\end{aligned}$$ The output embedding $W^{U}$ projects the residual stream to the space of the vocabulary; applying the softmax function then yields the model's prediction. We provide formal definitions in Appendix `\ref{sec:formal_definition}`{=latex}.

Language Models Solve Addition with Fourier Features {#sec:fourier_analysis}
====================================================

In this section, we analyze the internal mechanisms of LLMs when solving addition tasks, employing a Fourier analysis framework. We first show that the model initially approximates the solution before iteratively converging to the correct answer (§`\ref{sec:behavioral_analysis}`{=latex}). We then show that the model refines its initial approximation by computing the exact answer modulo $2$, $5$, and $10$, employing Fourier components of those same periods (§`\ref{sec:fourier_feature}`{=latex}). Finally, we demonstrate through targeted ablations that the identified Fourier components are causally important for the model's computational processes (§`\ref{sec:filter}`{=latex}). Specifically, we show that MLP layers primarily approximate the magnitude of the answer, using low-frequency features, while attention layers primarily perform modular addition using high-frequency components.

Behavioral Analysis {#sec:behavioral_analysis}
-------------------

Our first goal is to understand whether the model merely memorizes and recombines pieces of information learned during training, or whether it performs calculations to add two numbers.

#### Extracting intermediate predictions.

To elucidate how LLMs perform computations and progressively refine their outputs towards the correct answer, we extract model predictions at each layer from the residual stream. Let $L$ denote the number of layers. With the Logit Lens method [@belrose2023eliciting], instead of computing logits only from the final hidden state as $W^{U} h^{(L)}$, we derive predictions from each intermediate state as $W^{U} h^{(\ell)}$ for $\ell \in [L]$, and compute the accuracy of the prediction made from each intermediate state $h^{(\ell)}$.
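Schematically (with toy dimensions and random stand-ins for the module outputs, not actual GPT-2-XL weights), the Logit Lens readout applies the same unembedding matrix to every intermediate residual state:

```python
import numpy as np

rng = np.random.default_rng(0)
D, p, L = 16, 520, 4                 # toy hidden size, number-space size, layers

W_U = rng.standard_normal((p, D))    # unembedding matrix
h = rng.standard_normal(D)           # h^(0): embedding of the final token

layer_preds = []
for layer in range(L):
    attn_out = rng.standard_normal(D)   # stand-in for attn^(l)
    mlp_out = rng.standard_normal(D)    # stand-in for mlp^(l)
    h = h + attn_out + mlp_out          # residual-stream update
    # Logit Lens: decode this intermediate state with the unembedding.
    layer_preds.append(int(np.argmax(W_U @ h)))
```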

```{=latex}
\centering
```
```{=latex}
\subfloat[Accuracy change]{\includegraphics[height =3.7cm]{figures/language/skip/layer_accuracy_Language_Math.png}}
```
```{=latex}
\subfloat[MLP  logits]{\includegraphics[height =3.5cm]{figures/language/logit_lens/mlp_logit_lens/prediction_vs_mlp_layer_Put_together_15_and_93.png}}
```
```{=latex}
\subfloat[Attention  logits ]{\includegraphics[height =3.5cm]{figures/language/logit_lens/attn_logit_lens/prediction_vs_attn_layer_Put_together_15_and_93.png}}
```
```{=latex}
\vspace{-4mm}
```
`\label{fig:error_accuracy_skip_logit_lens}`{=latex}

If the model merely retrieved and recombined pieces of information learned during training, certain layers would directly map this information to predictions. For instance, [@merullo2023language] demonstrates that a specific MLP module directly maps a country to its capital.

#### LLMs progressively compute the final answers.

Figure `\ref{fig:error_accuracy_skip_logit_lens}`{=latex}a instead shows that the model progressively approaches the correct answer, layer by layer. In earlier layers, the model's predictions already fall within $\pm 2$ or $\pm 10$ of the correct answer, well before exact-match accuracy rises. This observation implies that the Transformer's layer-wise processing structure is beneficial for gradually refining predictions through a series of transformations and updates applied to the token representations.

Fourier Features in MLP & Attention Outputs {#sec:fourier_feature}
-------------------------------------------

#### Logits for MLP and attention have *periodic* structures.

We now analyze how each MLP and attention module contributes to the final prediction. We transform the attention and MLP outputs at layer $\ell$ into the token space using $W^{U} \attn^{(\ell)}$ and $W^{U} \mlp^{(\ell)}$, thereby obtaining the logits $\logits$ for each MLP and attention module. We use the running example \`\`Put together $15$ and $93$. Answer: $108$\" to demonstrate how the fine-tuned GPT-2-XL performs the computation. As illustrated in Figure `\ref{fig:error_accuracy_skip_logit_lens}`{=latex}b and Figure `\ref{fig:error_accuracy_skip_logit_lens}`{=latex}c, both the MLP and attention modules exhibit a periodic pattern in their logits across the output number space; e.g., the MLP in layer $33$, outlined in green, promotes all numbers that are congruent to $108 \mod 2$ (in Figure `\ref{fig:mlp_logit_wave}`{=latex} in the appendix, we zoom into such layers to make this clearer). Overall, we observe two distinct types of computation within these components. Some components predominantly assign a high weight to numbers around the correct answer, which we term *approximation*. Meanwhile, other components predominantly assign a high weight to all numbers congruent to $a + b \mod c$ for some constant $c$, which we term *classification*.
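To make the notion of \`\`classification" concrete, here is a toy diagnostic (on synthetic logits; not part of our analysis pipeline) that scores how strongly a logit vector behaves as a function of $n \bmod c$:

```python
import numpy as np

p, answer = 520, 108
n = np.arange(p)

# Toy logits for a "classification" component: it promotes every number
# with the same parity as the answer (cf. the layer-33 MLP in the text).
logits = np.where(n % 2 == answer % 2, 1.0, -1.0)

def residue_score(logits, c):
    """Spread of the per-residue-class mean logits for modulus c; large
    when the logits are (approximately) a function of n mod c."""
    idx = np.arange(len(logits))
    means = [logits[idx % c == r].mean() for r in range(c)]
    return float(np.ptp(means))

# The true modulus 2 scores high; so does any multiple of it, such as 10,
# since a function of n mod 2 is also a function of n mod 10.
scores = {c: residue_score(logits, c) for c in (2, 3, 5, 10)}
```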

#### Logits for MLP and attention are approximately sparse in the Fourier space.

It is natural to transform the logits into Fourier space to better understand properties such as this periodic pattern. We apply the discrete Fourier transform to represent the logits as a sum of sine and cosine waves of different periods: the $k$-th component in Fourier space has period $520/k$ and frequency $k/520$ (see Appendix `\ref{sec:formal_definition}`{=latex} for more details). Let $\hat{\logits}$ denote the logits in Fourier space. Figure `\ref{fig:logit_fourier_layer}`{=latex} shows the Fourier-space logits for two layers from Figure `\ref{fig:error_accuracy_skip_logit_lens}`{=latex}b and Figure `\ref{fig:error_accuracy_skip_logit_lens}`{=latex}c that have a clear periodic pattern. We find that the high-frequency components in Fourier space, which we define as components with index greater than or equal to $50$, are approximately sparse, as depicted in Figure `\ref{fig:logit_fourier_layer}`{=latex}. This observation aligns with [@nanda2023progress], which found that a one-layer Transformer utilizes particular Fourier components within the Fourier space to solve the modular addition task.
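As a toy illustration of this analysis (using synthetic logits built to mimic the described structure, not actual model outputs), outlier components at periods $2$, $5$, $10$, and $520$ appear directly as spikes in the magnitude spectrum:

```python
import numpy as np

p, answer = 520, 108
n = np.arange(p)

# Synthetic logits: a broad period-520 wave plus mod-10/5/2 waves,
# all peaked at the answer (cf. the structure described in the text).
logits = sum(np.cos(2 * np.pi * (n - answer) / T) for T in (520, 10, 5, 2))

# Discrete Fourier transform: component k has frequency k/520, period 520/k.
mags = np.abs(np.fft.rfft(logits))
top = np.argsort(mags)[::-1][:4]       # the four outlier components
periods = sorted(p / top)              # -> periods 2, 5, 10, and 520
```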

```{=latex}
\centering
```
```{=latex}
\subfloat[Logits for the MLP output in Fourier space]{\includegraphics[width=0.5\textwidth]{figures/language/fourier/mlp_33.png}}
```
```{=latex}
\subfloat[Logits for the attention output in Fourier space]{\includegraphics[width=0.5\textwidth]{figures/language/fourier/attn_40.png}}
```
In Figure `\ref{fig:logit_fourier}`{=latex}, we show that similar sparsity patterns in Fourier space hold across the entire dataset. We compute the logits in Fourier space for the last $15$ layers, i.e., $\hat{\logits}_{\attn}^{(\ell)}$ and $\hat{\logits}_{\mlp}^{(\ell)}$ where $\ell \in [32, 47]$, for all test examples and average them. We annotate the top-$10$ outlier high-frequency components based on their magnitude. The MLPs also exhibit some strong low-frequency components; the attention modules do not exhibit strong low-frequency components, only high-frequency components.

```{=latex}
\centering
```
```{=latex}
\vspace{-3mm}
```
```{=latex}
\subfloat[Logits for MLP output in Fourier space]{\includegraphics[width=0.45\textwidth]{figures/language/fourier/layer_logit_fourier_mlp.png}}
```
```{=latex}
\hspace{3mm}
```
```{=latex}
\subfloat[Logits for attention output in Fourier space]{\includegraphics[width=0.45\textwidth]{figures/language/fourier/layer_logit_fourier_attn.png}}
```
```{=latex}
\vspace{-3mm}
```
`\label{fig:logit_fourier}`{=latex}

#### Final logits are superpositions of these outlier Fourier components.

The final logits, $\logits^{(L)}$, are the sum of $\logits^{(\ell)}_\mlp$ and $\logits^{(\ell)}_\attn$ across all layers $\ell \in [L]$. Figure `\ref{fig:final_logit_topk}`{=latex} elucidates how these distinct Fourier components contribute to the final prediction, for the example \`\`Put together $15$ and $93$. Answer: $108$\". We select the top-$5$ Fourier components of $\hat{\logits}^{(L)}$ based on their magnitudes and transform them back to logits in number space via the inverse discrete Fourier transform (Figure `\ref{fig:final_logit_topk}`{=latex}a). The large-period (low-frequency) components approximate the magnitude of the answer, while the small-period (high-frequency) components are crucial for modular addition. Figure `\ref{fig:final_logit_topk}`{=latex}b shows that aggregating these $5$ waves is sufficient to predict the correct answer.

#### Why is high-frequency classification helpful?

The Fourier basis comprises both $\cos$ and $\sin$ waves (see Definition `\ref{def:fourier_basis}`{=latex}). By adjusting the coefficients of $\cos$ and $\sin$, the trained model can manipulate the phase of the logits in Fourier space (equivalently, a shift in number space), aligning the peak of the wave more closely with the correct answer. As shown in Figure `\ref{fig:final_logit_topk}`{=latex}a, consider a wave with a period of $2$. Here, the peak occurs at every even number in the number space, corresponding to the $\mod 2$ task. In contrast, for components with a large period such as $520$, the model struggles to accurately position the peak at $108$ (see also Figure `\ref{fig:final_logits_T520}`{=latex} in the appendix for a plot of this component over the full number space). This scenario can be interpreted as solving a \`\``mod 520`" task---a classification task among $520$ classes---which is challenging for the model to learn accurately. Nevertheless, even though the component with a period of $520$ does not solve the \`\``mod 520`" task precisely, it does succeed in assigning more weight to numbers near $108$. The classification results from the high-frequency components can then provide finer-grained resolution to distinguish between all the numbers around $108$ that are assigned a large weight by the lower frequencies. Due to this, the low-frequency components need not be perfectly aligned with the answer to make accurate predictions.
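A toy numerical check of this argument (synthetic waves, not model outputs): even when the low-frequency wave peaks a few numbers away from the answer, the high-frequency waves restrict the candidates to numbers congruent to $108 \bmod 10$, and the low-frequency wave only needs to rank $108$ above its nearest candidates $98$ and $118$:

```python
import numpy as np

p, answer = 520, 108
n = np.arange(p)

# Low-frequency "approximation": a period-520 wave whose peak is
# deliberately misaligned by 3 (it peaks at 105, not at 108).
low = np.cos(2 * np.pi * (n - (answer - 3)) / p)

# High-frequency "classification": waves of periods 10, 5, and 2,
# each peaked exactly at numbers congruent to 108 mod 10, 5, and 2.
high = sum(np.cos(2 * np.pi * (n - answer) / T) for T in (10, 5, 2))

# All high-frequency waves peak simultaneously only at n congruent to
# 108 mod 10; among those candidates, the imperfect low-frequency wave
# still ranks 108 first, so the prediction is correct.
pred = int(np.argmax(low + high))
```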

```{=latex}
\centering
```
```{=latex}
\vspace{-3mm}
```
```{=latex}
\subfloat[Final logits for top Fourier components]{\includegraphics[width=0.5\textwidth]{figures/language/logit_wave/final_logits_for15+93_topk.png}}
```
```{=latex}
\subfloat[Summation of the Top-$5$ Fourier components]{\includegraphics[width=0.5\textwidth]{figures/language/logit_wave/final_logits_for15+93_topk_sum.png}}
```
Fourier Features are Causally Important for Model Predictions {#sec:filter}
-------------------------------------------------------------

In the previous section, we demonstrated that there are outlier Fourier components in the logits generated by both the MLP and attention modules, as shown in Figure `\ref{fig:logit_fourier}`{=latex}. We also illustrated on one example that the low-frequency components primarily approximate the magnitude, while the high-frequency components are crucial for modular addition, as depicted in Figure `\ref{fig:final_logit_topk}`{=latex}. In this section, through an ablation study conducted across the entire test dataset, we show that both types of components are essential for correctly computing sums. Moreover, we reveal that the MLP layers primarily approximate the magnitude of the answer using low-frequency features, whereas the attention layers are responsible for modular addition using high-frequency features.

#### Filtering out Fourier components.

To understand the role various frequency components play in the addition task, we introduce low-pass and high-pass filters $\mathcal{F}$. For an intermediate state $h$ and a set of frequencies $\Gamma = \{\gamma_1, \dotsc, \gamma_k\}$, the filter $\mathcal{F}(h; \Gamma)$ returns the vector $\tilde{h}$ that is closest in $L_2$ distance to $h$ subject to the constraint that the Fourier decomposition of $W^{U} \tilde{h}$ at every frequency $\gamma_i$ is $0$. We show in Appendix `\ref{sec:formal_definition}`{=latex} that this has a simple closed-form solution involving a linear projection. We then apply either a low-pass filter, taking $\Gamma$ to be all the components whose frequencies are greater than the frequency of the $\tau$-th component for some threshold $\tau$ (i.e., removing high-frequency components), or a high-pass filter, taking $\Gamma$ to be all the components whose frequencies are less than the frequency of the $\tau$-th component (i.e., removing low-frequency components). As in the previous subsection, we take the high-frequency threshold $\tau = 50$ for the following experiments (see Appendix `\ref{sec:app:selection_of_tau}`{=latex} for more details).

#### Different roles of frequency components in approximation and classification tasks. 

We evaluated the fine-tuned GPT-2-XL model on the test dataset with different frequency filters applied to the outputs of all MLP and attention modules. The results, presented in Table `\ref{tab:frequency_components}`{=latex}, indicate that removing low-frequency components from attention modules or high-frequency components from MLP modules does not impact performance. This observation suggests that attention modules are not crucial for the approximation task, and MLP modules are less significant for the classification task.

Eliminating high-frequency components from attention results in a noticeable decrease in accuracy. Furthermore, removing high-frequency components from both the attention and MLP modules simultaneously leads to an even greater reduction in accuracy. This finding corresponds with observations from Figure `\ref{fig:error_accuracy_skip_logit_lens}`{=latex}b,c and Figure `\ref{fig:logit_fourier}`{=latex}, which indicate that both MLP and attention modules are involved in classification tasks due to the presence of high-frequency components in the logits. However, the approximation tasks are primarily performed by the MLP modules alone.

The errors induced by these ablations align with our mechanistic understanding. Ablating the low-frequency parts of MLPs leads to errors that are off by $10$, $50$, or $100$: the model fails to perform the approximation subtask, though it still accurately predicts the units digit. Conversely, ablating the high-frequency parts of attention leads to small errors, less than $6$ in magnitude: the model struggles to predict the units digit accurately, but it can still estimate the overall magnitude of the answer. See Figure `\ref{fig:histogram_error}`{=latex} in the Appendix for more details. These observations validate our hypothesis that low-frequency components are crucial for approximation, while high-frequency components are vital for classification. The primary function of MLP modules is to approximate the magnitude of the answer using low-frequency components, while the primary role of attention modules is to ensure accurate classification by determining the correct units digit.
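This ablation logic can be mimicked on synthetic logits. In the toy numpy sketch below (aligned waves chosen for illustration, not extracted from the model), dropping the high-frequency components leaves a smooth bump that still peaks at the answer, while dropping the low-frequency components leaves a comb that peaks at every number with the correct units digit, so any resulting error is a multiple of $10$.

```python
import numpy as np

p, n = 521, 520
x = np.arange(p)
answer = 108
low, high = [1, 2, 5], [52, 260]          # periods 520, 260, 104 vs 10, 2

def logits(ks):
    # Sum of cosine components whose peaks align at the answer.
    return sum(np.cos(2 * np.pi * k * (x - answer) / n) for k in ks)

# High-pass ablation (high frequencies removed): only the smooth
# low-frequency bump remains, so the prediction stays near 108.
low_only = logits(low)

# Low-pass ablation (low frequencies removed): only the period-10 /
# period-2 comb remains, so the logits peak at *every* number with the
# right units digit -- any error is a multiple of 10.
high_only = logits(high)
peaks = np.flatnonzero(high_only > high_only.max() - 1e-6)

print(int(np.argmax(low_only)))  # still 108: magnitude preserved
print([int(v) for v in peaks[:4]])  # numbers ending in 8: digit preserved
```

This mirrors the observed error patterns: off-by-$10$/$50$/$100$ errors when low frequencies are ablated, and small-magnitude errors when high frequencies are ablated.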

```{=latex}
\centering
```
::: {#tab:frequency_components}
               **Module**                       **Fourier Component Removed**                    **Validation Loss**                        **Accuracy**
  ------------------------------------- ---------------------------------------------- --------------------------------------- ---------------------------------------
                  None                                Without Filtering                                0.0073                                  0.9974
               ATTN & MLP                               Low-Frequency                                  4.0842                                  0.0594
   [ATTN]{style="color: academicblue"}   [Low-Frequency]{style="color: academicblue"}   [0.0352]{style="color: academicblue"}   [0.9912]{style="color: academicblue"}
                   MLP                                  Low-Frequency                                  2.1399                                  0.3589
               ATTN & MLP                               High-Frequency                                 1.8598                                  0.2708
                  ATTN                                  High-Frequency                                 0.5943                                  0.7836
    [MLP]{style="color: academicred"}    [High-Frequency]{style="color: academicred"}   [0.1213]{style="color: academicred"}    [0.9810]{style="color: academicred"}

  : Impact of Filtering out Fourier Components on Model Performance. Removing low-frequency components from attention modules [(blue)]{style="color: academicblue"} or high-frequency components from MLP modules [(red)]{style="color: academicred"} does not impact performance.
:::

Effects of Pre-training {#sec:effect_of_pretraining}
=======================

The previous section shows that pre-trained LLMs leverage Fourier features to solve the addition problem. Now, we study where the models' reliance on Fourier features comes from. In this section, we demonstrate that LLMs learn Fourier features in the token embeddings for numbers during pre-training. These token embeddings are important for achieving high accuracy on the addition task: models trained from scratch achieve lower accuracy, but adding just the pre-trained token embeddings fixes this problem. We also show that pre-trained models leverage Fourier features not only when fine-tuned, but also when prompted.

Fourier features in Token Embedding {#sec:fourier_token_embedding}
-----------------------------------

#### Number embedding exhibits approximate sparsity in the Fourier space. 

Let $W^E \in \R^{p \times D}$, where $p = 521$ and $D$ is the size of the token embeddings, denote the token embedding for numbers. We apply the discrete Fourier transform to each column of $W^E$ to obtain a matrix $V \in \R^{p \times D}$, where each row represents a different Fourier component. We then take the $L_2$ norm of each row to yield a $p$-dimensional vector whose $j$-th entry measures the overall magnitude of the $j$-th Fourier component across all token embedding dimensions. Figure `\ref{fig:embedding_cluster_gpt2}`{=latex}a shows the magnitude of different Fourier components in the token embedding of GPT-2-XL. We see that the token embedding has outlier components whose periods are $2, 2.5, 5$, and $10$. Therefore, just as the model uses different Fourier components to represent its prediction (as shown in Section `\ref{sec:fourier_feature}`{=latex}), the token embeddings represent numbers with different Fourier components. Figure `\ref{fig:embedding_fourier_models}`{=latex} in the Appendix shows that the token embeddings of other pre-trained models have similar patterns in the Fourier space, suggesting that Fourier features are a common attribute of the token embeddings of pre-trained LLMs. In Figure `\ref{fig:embedding_cluster_gpt2}`{=latex}b, we use t-SNE and $k$-means to visualize the token embedding clustering. We can see that numbers cluster not only by magnitude but also by whether they are multiples of $10$, consistent with the outlier period-$10$ component.
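The column-wise Fourier analysis described above is straightforward to reproduce on a toy embedding. The sketch below uses synthetic data (the real $W^E$ comes from GPT-2-XL): it plants a period-$10$ wave in one embedding dimension and shows that the row-norm statistic recovers the corresponding outlier component $k = 520/10 = 52$.

```python
import numpy as np

p, D, n = 521, 64, 520
x = np.arange(p)
rng = np.random.default_rng(0)

# Fourier basis F as in the definition: constant row, then sin/cos pairs.
F = np.zeros((p, p))
F[0] = np.sqrt(1 / n)
for k in range(1, n // 2 + 1):
    w = 2 * np.pi * k / n
    F[2 * k - 1] = np.sqrt(2 / n) * np.sin(w * x)
    F[2 * k] = np.sqrt(2 / n) * np.cos(w * x)

# Toy number embedding: mostly noise, but one dimension carries a
# period-10 wave, standing in for a pre-trained Fourier feature.
W_E = 0.1 * rng.standard_normal((p, D))
W_E[:, 0] += np.cos(2 * np.pi * x / 10)

V = F @ W_E                                  # DFT of every embedding column
row_norm = np.linalg.norm(V, axis=1)         # magnitude per Fourier row
comp_mag = np.hypot(row_norm[1::2], row_norm[2::2])  # combine sin/cos pairs

k_star = int(np.argmax(comp_mag)) + 1
print(k_star, n // k_star)                   # outlier component and its period
```

On real embeddings the same statistic produces the outlier peaks at periods $2$, $2.5$, $5$, and $10$ shown in Figure `\ref{fig:embedding_cluster_gpt2}`{=latex}a.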

```{=latex}
\centering
```
```{=latex}
\subfloat[Number embedding in Fourier space]{\includegraphics[height=5.5cm]{figures/language/fourier/embedding_finetuned_gpt2xl.png}}
```
```{=latex}
\subfloat[Number embedding clustering]{\includegraphics[height=5.5cm]{figures/pretrained/clustering_gpt2.pdf}}
```
Contrasting Pre-trained Models with Models Trained from Scratch {#sec:pretrained_and_fromscratch}
---------------------------------------------------------------

```{=latex}
\centering
```
```{=latex}
\subfloat[Logits for MLP output in Fourier space]{\includegraphics[width=0.45\textwidth]{figures/language/fourier/layer_logit_fourier_mlp_fromscratch.png}}
```
```{=latex}
\hspace{3mm}
```
```{=latex}
\subfloat[Logits for attention output in Fourier space]{\includegraphics[width=0.45\textwidth]{figures/language/fourier/layer_logit_fourier_attn_fromscratch.png}}
```
To understand the necessity of Fourier features for the addition problem, we trained the GPT-2-XL model from scratch on the addition task with random initialization. After convergence, it achieved only $94.44\%$ test accuracy (recall that the fine-tuned GPT-2-XL model achieved $99.74\%$ accuracy).

#### Fourier features are learned during pre-training.

Figure `\ref{fig:logit_fourier_from_scratch}`{=latex} shows that there are no Fourier features in the intermediate logits of the GPT-2-XL model trained from scratch on the addition task. Furthermore, Figure `\ref{fig:embedding_validation_acc_from_scratch}`{=latex}a shows that the token embeddings also have no Fourier features. Without leveraging Fourier features, the model merely approximates the correct answer without performing modular addition, resulting in frequent off-by-one errors between the prediction and the correct answer (see details in Figure `\ref{fig:error_embedding_from_scratch}`{=latex}).

#### Pre-trained token embeddings improve model training.

We also trained GPT-2-small, with $124$ million parameters and $12$ layers, from scratch on the addition task. GPT-2-small often struggles with mathematical tasks [@mishra2022numglue], and this model achieved a test accuracy of only $53.95\%$ after convergence. However, when we initialize the token embedding layer with the pre-trained embeddings and freeze it, randomly initializing the weights of all other layers before training on the addition task, the test accuracy increases to $100\%$, with a significantly faster convergence rate. This outcome was consistently observed across five different random seeds, as illustrated in Figure `\ref{fig:embedding_validation_acc_from_scratch}`{=latex}b. This demonstrates that, given number embeddings with Fourier features, the model can effectively learn to leverage these features to solve the addition task.
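The recipe of freezing pre-trained number embeddings while training the rest from random initialization can be sketched in PyTorch. The model below is a hypothetical stand-in (a one-layer adder, not GPT-2-small), and `pretrained_emb` stands in for the pre-trained embedding matrix; the point is only the mechanics of copying in the embeddings, freezing them, and optimizing the remaining parameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p, D = 521, 32

# Stand-in for pre-trained number embeddings (in the paper these come
# from GPT-2; here a fixed random matrix plays that role).
pretrained_emb = torch.randn(p, D)

class TinyAdder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(p, D)
        self.emb.weight.data.copy_(pretrained_emb)
        self.emb.weight.requires_grad_(False)   # freeze the embedding
        self.head = nn.Linear(2 * D, p)         # the rest stays randomly initialized

    def forward(self, a, b):
        return self.head(torch.cat([self.emb(a), self.emb(b)], dim=-1))

model = TinyAdder()
# Only the trainable (non-embedding) parameters go to the optimizer.
opt = torch.optim.Adam([q for q in model.parameters() if q.requires_grad], lr=1e-3)

a = torch.randint(0, 260, (64,))
b = torch.randint(0, 260, (64,))
loss = nn.functional.cross_entropy(model(a, b), a + b)
opt.zero_grad(); loss.backward(); opt.step()   # embedding stays untouched
```

After the optimizer step, `model.emb.weight` is still identical to `pretrained_emb`, exactly the training regime used in the frozen-embedding experiment.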

```{=latex}
\centering
```
```{=latex}
\subfloat[Embedding: GPT-2-XL Trained from Scratch]{\includegraphics[height=4.3cm]{figures/pretrained/embedding_fromscratch_gpt2xl.png}
}
```
```{=latex}
\subfloat[Validation Accuracy Comparison for GPT-2]{\includegraphics[height=4.3cm]{figures/pretrained/with_pretrained_embedding_acc.png}}
```
```{=latex}
\vspace{-4mm}
```
Fourier Features in Prompted Pre-Trained Models {#sec:icl}
-----------------------------------------------

Finally, we ask whether larger language models use similar Fourier features during prompting.

#### Pre-trained LLMs use Fourier features to compute addition during in-context learning.

We first test the open-source models GPT-J [@gpt-j], with 6B parameters, and Phi-2 [@phi2], with 2.7B parameters, on the test dataset. Without in-context learning, these models cannot perform the addition task, so we use 4-shot in-context learning to test their performance. Their absolute errors are predominantly multiples of $10$: $93\%$ of the time for GPT-J, and $73\%$ for Phi-2. Using the Fourier analysis framework proposed in Section `\ref{sec:fourier_feature}`{=latex}, we demonstrate that for Phi-2 and GPT-J, the outputs of the MLP and attention modules exhibit approximate sparsity in Fourier space across the last $15$ layers (Figure `\ref{fig:logit_fourier_phi2}`{=latex} and Figure `\ref{fig:logit_fourier_gptj}`{=latex}). This evidence strongly suggests that these models leverage Fourier features to compute addition.

#### Closed-source models exhibit similar behavior.

We study the closed-source models GPT-3.5 [@chatgpt], GPT-4 [@openai2024gpt4], and PaLM-2 [@anil2023palm]. While we cannot analyze their internal representations, we can study whether their behavior on addition problems is consistent with reliance on Fourier features. Since closed-source LLMs are instruction tuned and perform well without in-context learning, we conduct error analysis with 0-shot prompting. Most absolute errors made by these models are also multiples of $10$: $100\%$ of the time for GPT-3.5 and GPT-4, and $87\%$ for PaLM-2. The similarity of this error distribution to that of the open-source models leads us to hypothesize that Fourier features play a critical role in their computational mechanism.

```{=latex}
\centering
```
```{=latex}
\subfloat[Logits for MLP output in Fourier space]{\includegraphics[width=0.45\textwidth]{figures/pretrained/phi2/phi2_mlp_FFT_heatmap.png}}
```
```{=latex}
\hspace{3mm}
```
```{=latex}
\subfloat[Logits for attention output in Fourier space]{\includegraphics[width=0.45\textwidth]{figures/pretrained/phi2/phi2_attn_FFT_heatmap.png}}
```
```{=latex}
\vspace{-1mm}
```
Related Work {#sec:related_work}
============

```{=latex}
\vspace{-1mm}
```
#### Learning mathematical tasks.

Previous studies primarily explore what pre-trained LMs can achieve on arithmetic tasks, with less emphasis on the underlying mechanisms [@nogueira2021investigating; @qian2022limitations]. For instance, [@lee2023teaching] demonstrates that small Transformer models can effectively learn arithmetic by altering the question format and utilizing a scratchpad method [@nye2021show]. [@hlv23] identifies activation patterns for the "greater-than" operation in GPT-2, and [@charton2023can] focuses on the enumeration and selection processes in GCD computation. In this paper, we dive into the specific roles of MLP and attention layers in solving mathematical tasks, analyzing these components' distinct contributions to integer addition.

#### Mechanisms of pre-trained LMs.

Recent studies have significantly advanced our understanding of the underlying mechanisms of pre-trained Transformer models. For instance, research on "skill neurons" by [@wwz+22] and "knowledge neurons" by [@ddh+21] underscores the development of specialized neural components that encode task-specific capabilities or hold explicit factual information in pre-trained LMs, enhancing model performance on related tasks. [@merullo2023language] and [@gcwg22] discuss how MLPs and FFNs transform and update token representations for general language tasks. In contrast, we show that pre-trained LMs use multiple layers to compute addition by combining the results of approximation and classification. Additionally, [@zhu2023physics] demonstrated the capacity of GPT-2 to consolidate similar information through pre-training in the model weights, which aligns with our observations on the importance of pre-training in developing effective number embeddings and arithmetic computation strategies in LMs.

#### Fourier features in Neural Networks.

Fourier features are commonly observed in image models, particularly in the early layers of vision models [@olshausen1997sparse; @olah2020overview; @fiquet2024polar]. These features enable the model to detect edges, textures, and other spatial patterns effectively. Recently, Fourier features have been noted in networks trained for tasks that allow cyclic wraparound, such as modular addition [@nanda2023progress; @morwani2023feature], general group compositions [@chughtai2023toy], or invariance to cyclic translations [@sanborn2022bispectral]. [@nanda2023progress] demonstrates that learning Fourier features can induce "grokking" [@power2022grokking]. Furthermore, [@marchetti2023harmonics] provides a mathematical framework explaining the emergence of Fourier features when the network exhibits invariance to a finite group. We extend these insights by observing Fourier features in tasks that do not involve cyclic wraparound. [@tancik2020fourier] found that by selecting problem-specific Fourier features, the performance of MLPs can be improved on a computer vision-related task.

Conclusion {#sec:conclusion}
==========

In this paper, we provide a comprehensive analysis of how pre-trained LLMs compute numerical sums, revealing a nuanced interplay of Fourier features within their architecture. Our findings demonstrate that LLMs do not simply memorize answers from training data but actively compute solutions through a combination of approximation and classification processes encoded in the frequency domain of their hidden states. Specifically, MLP layers contribute to approximating the magnitude of sums, while attention layers contribute to modular operations.

Our work also shows that pre-training plays a critical role in equipping LLMs with the Fourier features necessary for executing arithmetic operations. Models trained from scratch lack these crucial features and achieve lower accuracy; introducing pre-trained token embeddings greatly improves their convergence rate and accuracy. This insight into the arithmetic problem-solving capabilities of LLMs through Fourier features sets the stage for potential modifications to training approaches. By imposing specific constraints on model training, we could further enhance the ability of LLMs to learn and leverage these Fourier features, thereby improving their performance in mathematical tasks.

Acknowledgments {#acknowledgments .unnumbered}
===============

DF and RJ were supported by a Google Research Scholar Award. RJ was also supported by an Open Philanthropy research grant. VS was supported by NSF CAREER Award CCF-2239265 and an Amazon Research Award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the funding agencies. `\clearpage`{=latex}

```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
```{=latex}
\bibliographystyle{alpha}
```
```{=latex}
\else
```
```{=latex}
\bibliographystyle{plain}
```
```{=latex}
\fi
```
```{=latex}
\clearpage
```
```{=latex}
\newpage
```
```{=latex}
\onecolumn
```
```{=latex}
\appendix
```
```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
Appendix {#appendix .unnumbered}
========

```{=latex}
\iffalse
```
```{=latex}
\startcontents[appendix]
```
```{=latex}
\addcontentsline{toc}{chapter}{Appendix}
```
```{=latex}
\renewcommand{\thesection}{\Alph{section}}
```
```{=latex}
\printcontents[appendix]{}{1}{\setcounter{tocdepth}{3}}
```
```{=latex}
\setcounter{section}{0}
```
```{=latex}
\fi
```
#### Roadmap.

In Appendix `\ref{sec:formal_definition}`{=latex}, we introduce the formal definitions used in the main text. In Appendix `\ref{sec:app:selection_of_tau}`{=latex}, we explain why we separate the Fourier components into a high-frequency part and a low-frequency part and why we choose $\tau$ to be $50$. In Appendix `\ref{sec:other_task}`{=latex}, we show that our observations generalize to another dataset format, another arithmetic task, and other models. In Appendix `\ref{sec:support_evidence_fourier}`{=latex}, we provide more evidence of the Fourier features the model uses when computing addition. In Appendix `\ref{sec:more_from_scratch}`{=latex}, we provide more evidence that GPT-2-XL trained from scratch does not use Fourier features to solve the addition task. In Appendix `\ref{sec:detail_exp_setting}`{=latex}, we give the details of our experimental settings.

Formal Definition of Transformer and Logits in Fourier Space  {#sec:formal_definition}
============================================================

We first introduce the formal definition of the Transformer structure that we used in this paper.

```{=latex}
\begin{definition}[Transformer]An autoregressive Transformer language model $G: \mathcal{X} \rightarrow \mathcal{Y}$ over vocabulary $\mathrm{Vocab}$ maps a token sequence $x=\left[x_1, \ldots, x_N\right] \in \mathcal{X}, x_t \in \mathrm{Vocab}$ to a probability distribution $y \in \mathcal{Y} \subset \mathbb{R}^{|\mathrm{Vocab}|}$ that predicts next-token continuations of $x$. Within the Transformer, the $t$-th token is embedded as a series of hidden state vectors $h_t^{(\ell)}$, beginning with $h_t^{(0)}=\operatorname{emb}\left(x_t\right)+\operatorname{pos}(t) \in \mathbb{R}^D$. Let $W^{U} \in \R^{|\mathrm{Vocab}| \times D}$ denote the output embedding. The final output $y=\operatorname{softmax}(W^{U}\left(h_N^{(L)}\right))$ is read from the last hidden state. 
In the autoregressive case, tokens only draw information from past  tokens:
\begin{align*}%\label{eq:residual_stream}
    h_t^{(\ell)}=h_t^{(\ell-1)}+\attn_t^{(\ell)}+\mlp_t^{(\ell)}
\end{align*}
where
\begin{align*}
\attn_t^{(\ell)}:=\attn^{(\ell)}\left(h_1^{(\ell-1)}, h_2^{(\ell-1)}, \ldots, h_t^{(\ell-1)}\right) \quad \text{and} \quad \mlp_t^{(\ell)}:= \mlp_t^{(\ell)}(\attn_t^{(\ell)},h_t^{(\ell-1)}). 
\end{align*}
\end{definition}
```
In this paper, we only consider the output tokens to be numbers. Hence, we have the unembedding matrix $W^U \in \R^{p \times D}$, where $p$ is the size of the number space. As we are given a length-$N$ input sequence and predict the $(N+1)$-th token, we only consider $h_N^{(\ell)}=h_N^{(\ell-1)}+\attn_N^{(\ell)}+\mlp_N^{(\ell)}$. For simplicity, we drop the subscript $N$ in the remainder of the paper, yielding Eq. `\eqref{eq:residual_stream}`{=latex}.

```{=latex}
\begin{definition}[Intermediate Logits]\label{def:logits}
     Let $\logits_{\attn}^{(\ell)} := W^{U} \attn^{(\ell)} $ denote the intermediate logits of the attention module at the $\ell$-th layer. Let $\logits_{\mlp}^{(\ell)} :=  W^{U} \mlp^{(\ell)}$ denote the intermediate logits of the MLP module at the $\ell$-th layer. Let $\logits^{(\ell)}:= W^{U} h^{(\ell)}$ denote the logits on intermediate state $h^{(\ell)}$.
\end{definition}
```
Throughout the model, $h$ undergoes only additive updates (Eq. `\eqref{eq:residual_stream}`{=latex}), creating a continuous residual stream [@elhage2021mathematical]: the token representation $h$ accumulates all additive updates within the residual stream up to layer $\ell$.

To analyze the logits in Fourier space, we give the formal definition of the Fourier basis as follows:

```{=latex}
\begin{definition}[Fourier Basis]\label{def:fourier_basis}
Let $p$ denote the size of the number space.
    Let $\overrightarrow{\mathbf{x}}:=(0,1, \ldots,(p-1))$.
Let $\omega_k := \frac{2 \pi k}{p-1}$. We  denote the normalized Fourier basis $F$ as the $p \times p$ matrix:
$$
F :=\left[\begin{array}{c}
\sqrt{\frac{1}{p-1}} \cdot \overrightarrow{\mathbf{1}} \\
\sqrt{\frac{2}{p-1}} \cdot  \sin \left(\omega_1 \overrightarrow{\mathbf{x}}\right) \\
\sqrt{\frac{2}{p-1}} \cdot  \cos \left(\omega_1 \overrightarrow{\mathbf{x}}\right) \\
\sqrt{\frac{2}{p-1}} \cdot  \sin \left(\omega_2 \overrightarrow{\mathbf{x}}\right) \\
\vdots \\
\sqrt{\frac{2}{p-1}} \cdot  \cos \left(\omega_{(p-1) / 2} \overrightarrow{\mathbf{x}}\right)
\end{array}\right] \in \R^{p \times p}
$$
The first component $F[0]$ is defined as a constant component. For $i \in [0,p-1]$, $F[i]$ is defined as the $k$-th component in Fourier space, where $k = \lfloor \frac{i+1}{2} \rfloor$. The frequency of the $k$-th component is $f_k := \frac{k}{p-1}$. The period of the $k$-th component is $T_k := \frac{p-1}{k}$.
\end{definition}
```
We can compute the discrete Fourier transform under that Fourier basis as follows:

```{=latex}
\begin{remark}[Discrete Fourier transform (DFT) and inverse DFT]We can transform any logits $u \in \R^{p}$ to Fourier space by computing $\hat{u} = F \cdot u$. We can transform $\hat{u}$ back to $u$ by $u = F^\top \cdot \hat{u}$.
\end{remark}
```
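As a sanity check on Definition `\ref{def:fourier_basis}`{=latex} and the remark above, the following numpy sketch (ours, for illustration) constructs $F$ explicitly and verifies that a pure wave of frequency $k/(p-1)$ is mapped onto the corresponding row of the basis.

```python
import numpy as np

p = 521          # size of the number space
n = p - 1
x = np.arange(p)

# Build the normalized Fourier basis row by row, following the definition:
# F[0] is constant; F[2k-1] / F[2k] are the sin / cos rows at frequency k/(p-1).
F = np.zeros((p, p))
F[0] = np.sqrt(1 / n) * np.ones(p)
for k in range(1, n // 2 + 1):
    w = 2 * np.pi * k / n                 # omega_k
    F[2 * k - 1] = np.sqrt(2 / n) * np.sin(w * x)
    F[2 * k] = np.sqrt(2 / n) * np.cos(w * x)

# A pure cosine of period 520 / k concentrates on a single component:
k = 7
u = np.cos(2 * np.pi * k * x / n)         # period 520 / 7
u_hat = F @ u                             # the DFT of the remark
print(int(np.argmax(np.abs(u_hat))))      # row 2k = 14, the cos row of k = 7
```

The dominant entry of $\hat{u}$ sits at row $2k$, i.e., the $\cos$ row of the $k$-th component, matching the indexing convention of the definition.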
Next, we define the logits in Fourier space.

```{=latex}
\begin{definition}[Logits in Fourier Space]\label{def:logits_fourier}
Let $\logits^{(L)}$, $\logits_{\attn}^{(\ell)}$ and $\logits_{\mlp}^{(\ell)}$ denote the logits (Definition \ref{def:logits}). The output logits before softmax in Fourier space is defined as: $\hat{\logits}^{(L)} = F \cdot \logits^{(L)} $. The logits of the MLP and attention modules in Fourier space are defined as:
\begin{align*}
\hat{\logits}_{\attn}^{(\ell)} = F \cdot \logits_{\attn}^{(\ell)} \quad \text{and} \quad \hat{\logits}_{\mlp}^{(\ell)} = F \cdot \logits_{\mlp}^{(\ell)}.
\end{align*}

\end{definition}
```
We ignore the first element of $\hat{\logits}^{(L)}, \hat{\logits}_{\attn}^{(\ell)}$, and $\hat{\logits}_{\mlp}^{(\ell)}$ in the Fourier analysis throughout this paper, as it is the constant term: adding a constant to the logits does not change the prediction.

Let $\tau \in \R$ denote a constant threshold. The low-frequency components for the logits in Fourier space are defined as $\hat{\logits}^{(\ell)}[1:2\tau]$. The high-frequency components for the logits in Fourier space are defined as $\hat{\logits}^{(\ell)}[2\tau:]$. For the following analysis, we choose $\tau = 50$ (the specific choice of $\tau = 50$ is explained in Appendix `\ref{sec:app:selection_of_tau}`{=latex}).

Next, we propose the formal definition of low-pass/high-pass filter that is used in the following ablation study.

```{=latex}
\begin{definition}[Low-pass / High-pass Filter]\label{def:filter}
Let $x \in \R^{D}$ denote the output of an MLP or attention module. Let $F$ denote the Fourier basis (Definition  \ref{def:fourier_basis}). Let $\tau \in \R$ denote the frequency threshold. Let $W^{U} \in \R^{p \times D}$ denote the output embedding. For the low-pass filter, we define a diagonal binary matrix $B \in \{0,1\}^{p \times p}$ as 
% \begin{align*}
$
    b_{ii} = 
\begin{cases} 
1 & \text{if } i \geq \tau \\
0 & \text{otherwise}
\end{cases}.
$
% \end{align*}
 For high-pass filter, we define a diagonal binary matrix $B \in \{0,1\}^{p \times p}$ as 
% \begin{align*}
$
    b_{ii} = 
\begin{cases} 
1 & \text{if } 1 \leq i < \tau \\
0 & \text{otherwise}
\end{cases}.
$
% \end{align*}
Note that in both cases we retain the constant component, i.e., $b_{0,0} = 0$.
The output of the filter $\mathcal{F}(x): \R^{D} \rightarrow \R^{D}$ is defined by the following objective function:
    \begin{align*}
    \min_{y} \quad & \|x-y\|_2^2 \\
    \mathrm{subject~to} \quad &  BF  W^{U} y = 0
    \end{align*}
\end{definition}
```
The solution to the above optimization problem is given by a linear projection.

```{=latex}
\begin{remark}
    The result of the optimization problem defined in Definition \ref{def:filter} is the projection of $x$ to the null space of $BF W^{U}$. Let $\mathcal{N}(BF W^{U})$ denote the null space of $BF W^{U}$. We have
    \begin{align*}
        \mathcal{F}(x) = \mathcal{N}(BF W^{U}) \cdot \mathcal{N}(BF W^{U})^\top \cdot x
    \end{align*}
\end{remark}
```
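A minimal numerical illustration of this remark (toy dimensions, assuming `scipy` is available): build $B F W^{U}$ for a low-pass filter, compute an orthonormal basis of its null space, and check that the projected state carries no energy at the removed frequencies. The toy hidden size is chosen larger than the number of zeroed rows so the null space is nontrivial (in GPT-2-XL, $D = 1600 > p = 521$).

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
p, D, tau = 21, 40, 5        # small toy sizes (the paper uses p = 521)
n = p - 1
x_grid = np.arange(p)

# Fourier basis as in the definition: constant row, then sin/cos pairs.
F = np.zeros((p, p))
F[0] = np.sqrt(1 / n)
for k in range(1, n // 2 + 1):
    w = 2 * np.pi * k / n
    F[2 * k - 1] = np.sqrt(2 / n) * np.sin(w * x_grid)
    F[2 * k] = np.sqrt(2 / n) * np.cos(w * x_grid)

W_U = rng.standard_normal((p, D))        # toy unembedding matrix
h = rng.standard_normal(D)               # toy module output

# Low-pass filter: B selects the frequency rows to be zeroed out.
B = np.diag((np.arange(p) >= tau).astype(float))

# Closed form: project h onto the null space of B F W^U.
N = null_space(B @ F @ W_U)              # orthonormal basis, shape (D, r)
h_filtered = N @ N.T @ h

# The filtered state has (numerically) zero content at the removed frequencies.
print(float(np.abs(B @ F @ W_U @ h_filtered).max()))
```

Because the columns of `N` are orthonormal, `N @ N.T` is exactly the orthogonal projection, so `h_filtered` is the $L_2$-closest vector to `h` satisfying the constraint.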
```{=latex}
\newpage
```
Fourier Components Separation and Selection of $\tau$ {#sec:app:selection_of_tau}
=====================================================

```{=latex}
\centering
```
```{=latex}
\subfloat[Logits for MLP output in Fourier space]{\includegraphics[width=0.5\textwidth]{figures/language/fourier/layer_logit_fourier_mlp_low.png}}
```
```{=latex}
\subfloat[Logits for attention output in Fourier space]{\includegraphics[width=0.5\textwidth]{figures/language/fourier/layer_logit_fourier_attn_low.png}}
```
Following Definition `\ref{def:filter}`{=latex}, we define single-pass filter as follows:

```{=latex}
\begin{definition}[Single-Pass Filter]\label{def:single_filter}
Let $x \in \R^{D}$ denote the output of an MLP or attention module. Let $F$ denote the Fourier basis (Definition  \ref{def:fourier_basis}). Let $\gamma$ denote the index of the Fourier component (Definition \ref{def:fourier_basis}) that we want to retain. Let $W^{U} \in \R^{p \times D}$ denote the output embedding.
We define a diagonal binary matrix $B \in \{0,1\}^{p \times p}$ as 
% \begin{align*}
$
    b_{ii} = 
\begin{cases} 
0 & \text{if } \lfloor \frac{i+1}{2} \rfloor = \gamma \text{ or } i = 0,\\
1 & \text{otherwise}.
\end{cases}
$
% \end{align*}

The output of the filter $\mathcal{F}_\gamma(x): \R^{D} \rightarrow \R^{D}$ is defined by the following objective function:
    \begin{align*}
    \min_{y} \quad & \|x-y\|_2^2 \\
    \mathrm{subject~to} \quad &  BF  W^{U} y = 0
    \end{align*}
% We define $\mathcal{F}_{:\tau}$ as high-pass filter and $\mathcal{F}_{\tau:}$ as low-pass filter.
\end{definition}
```
```{=latex}
\begin{remark}
    The result of the optimization problem defined in Definition \ref{def:single_filter} is the projection of $x$ to the null space of $BF W^{U}$. Let $\mathcal{N}(BF W^{U})$ denote the null space of $BF W^{U}$. We have
    \begin{align*}
        \mathcal{F}_\gamma(x) = \mathcal{N}(BF W^{U}) \cdot \mathcal{N}(BF W^{U})^\top \cdot x
    \end{align*}
\end{remark}
```
For the single-pass filter, we retain only one Fourier component and analyze how this component affects the model's prediction. The residual stream is then updated as follows: $$\begin{aligned}
h^{(\ell)}  = h^{(\ell-1)}  + \mathcal{F}_\gamma(\mathrm{Attn}^{(\ell)}) + \mathcal{F}_\gamma(\mathrm{MLP}^{(\ell)} )\end{aligned}$$

We evaluated the fine-tuned GPT-2-XL model on the addition dataset with only the Fourier components of period $520$ and $2$ retained. Given that $T_k := \frac{p-1}{k}$ (Definition `\ref{def:fourier_basis}`{=latex}), we retained only the Fourier components with $\gamma = 1$ and $\gamma = 260$, respectively.

As shown in Figure `\ref{fig:single_pass_error}`{=latex}a, with only one frequency component, whose period is $2$, the model accurately predicts the parity with $99.59\%$ accuracy. As depicted in Figure `\ref{fig:single_pass_error}`{=latex}b, with a single frequency component of period $520$, the model fails to predict accurately, erring $96.51\%$ of the time. We consider the frequency component with a period of $2$ as the model's prediction for the `mod 2` task, and the frequency component with a period of $520$ as its prediction for the `mod 520` task. Figures `\ref{fig:single_pass_error}`{=latex} and `\ref{fig:histogram_mod520}`{=latex} suggest that the model effectively learns the `mod 2` task, as it involves a two-class classification, but struggles with the `mod 520` task, which requires classifying among $520$ classes. Because the model need not converge to the optimum for these low-frequency components (as explained at the end of Section `\ref{sec:fourier_feature}`{=latex}), predicting with only the period-$520$ component yields predictions that are normally distributed around the correct answer.

```{=latex}
\centering
```
```{=latex}
\subfloat[Retaining Period-2 Frequency Components]{\includegraphics[height=4cm]{figures/language/filter/mod2_ablation.png}}
```
```{=latex}
\hspace{1mm}
```
```{=latex}
\subfloat[Retaining Period-520 Frequency Components]{\includegraphics[height=4cm]{figures/language/filter/mod520_ablation.png}}
```
```{=latex}
\centering
```
![ Retaining only the Period-$520$ Fourier component makes the model's predictions normally distributed around the correct answers. ](figures/language/filter/mod520_ablation_distribution.png "fig:"){#fig:histogram_mod520 width="50%"}

Fourier components with larger periods are less effective at solving their corresponding modular addition tasks than those with smaller periods. As demonstrated in Figure `\ref{fig:histogram_mod520}`{=latex}, components with large periods serve primarily as approximations of the correct answer. Consequently, we categorize the Fourier components into low-frequency and high-frequency groups: the low-frequency components approximate the magnitude of the answer, whereas the high-frequency components sharpen the precision of the prediction.

To elucidate how these distinct Fourier components contribute to the final prediction, and the rationale behind their separation, consider the example in Figure `\ref{fig:final_logit_topk}`{=latex}: \`\`Put together $15$ and $93$. Answer: $108$''. We selected the top-$10$ Fourier components of $\hat{\logits}^{(L)}$ by magnitude and converted them back to logits in the numerical space by multiplying with $F^\top$. We plotted the components with index less than $50$ in Figure `\ref{fig:choose_of_tau}`{=latex}a and those with index greater than $50$ in Figure `\ref{fig:choose_of_tau}`{=latex}b. Through constructive and destructive interference between the different waves, the components with small periods assign more weight to the correct answer, $108$, and less weight to the numbers near $108$; these high-frequency (small-period) components ensure the prediction's accuracy at the unit place. For the low-frequency (large-period) components, the model fails to precisely learn the relative magnitudes of the $\cos$ and $\sin$ components, so they do not peak exactly at the correct answer. Thus, the low-frequency (large-period) components serve to approximate the magnitude of the addition result.
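
The interference argument can be reproduced with synthetic waves (this sketch is illustrative and uses hand-picked components rather than logits extracted from the model): one large-period wave supplies the magnitude, and three small-period waves pin down the unit digit, so their sum peaks exactly at the answer.

```python
import numpy as np

p, target = 521, 108            # number space 0..520; "15 + 93 = 108"
x = np.arange(p)
# Component k has period (p - 1) / k: k = 1 gives period 520 (magnitude);
# k = 52, 104, 260 give periods 10, 5, 2 (unit digit and parity).
ks = [1, 52, 104, 260]
logits = sum(np.cos(2 * np.pi * k * (x - target) / (p - 1)) for k in ks)
print(int(logits.argmax()))     # -> 108: the waves interfere constructively
                                #    only at the correct answer
```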

```{=latex}
\centering
```
```{=latex}
\subfloat[Final Logits for Components with Index Less than $50$]{\includegraphics[width=0.48\textwidth]{figures/language/filter/component_less_100.png}}
```
```{=latex}
\hspace{1mm}
```
```{=latex}
\subfloat[Final Logits for Components with Index Greater than $50$]{\includegraphics[width=0.48\textwidth]{figures/language/filter/component_greater_100.png}}
```
```{=latex}
\centering
```
![Visualization of the period-$520$ Fourier component of the final logits.](figures/language/logit_wave/final_logits_for15+93_topk_T520.png "fig:"){#fig:final_logits_T520 width="48%"}

```{=latex}
\clearpage
```
```{=latex}
\newpage
```
Do Fourier Features Generalize? {#sec:other_task}
=================================

Token Embedding for Other LMs
-----------------------------

We first show that other pre-trained LMs also have Fourier features in their token embeddings for the numbers $[0, 520]$.
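
The check itself is a one-line FFT per embedding dimension. The sketch below uses a synthetic embedding matrix as a stand-in for a real LM's number embeddings (in practice, one would slice the rows for the number tokens out of the model's input embedding matrix):

```python
import numpy as np

# Synthetic stand-in for an LM's number embeddings: each dimension is a
# single cosine wave over the numbers 0..520 (the periods are illustrative).
n = np.arange(521)[:, None]
periods = np.array([2.0, 5.0, 10.0, 100.0])
E = np.cos(2 * np.pi * n / periods)           # shape (521, 4)

spectrum = np.abs(np.fft.rfft(E, axis=0))     # spectrum along the number axis
dominant = spectrum.argmax(axis=0)            # one dominant frequency per dim
# Sparsity in Fourier space: each dimension's peak dwarfs its typical bin.
print(spectrum.max(axis=0) / np.median(spectrum, axis=0))
```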

```{=latex}
\centering
```
+:-------------------------------------------------------------------------------------------------------------------------------------------:+:-----------------------------------------------------------------------------------------------------------------------------------------:+
| `\subfloat[pre-trained GPT-2-XL]{\includegraphics[width=0.45\textwidth]{figures/language/fourier/embedding_pretrained_gpt2xl.png}}`{=latex} | `\subfloat[fine-tuned GPT-2-XL]{\includegraphics[width=0.45\textwidth]{figures/language/fourier/embedding_finetuned_gpt2xl.png}}`{=latex} |
+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+
| `\subfloat[pre-trained RoBERTa]{\includegraphics[width=0.45\textwidth]{figures/language/fourier/embedding_pretrained_roberta.png}}`{=latex} | ```{=latex}                                                                                                                               |
|                                                                                                                                             | \subfloat[pre-trained Phi2]{\includegraphics[width=0.45\textwidth]{figures/language/fourier/embedding_pretrained_phi2.png}}               |
|                                                                                                                                             | ```                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+

```{=latex}
\newpage
```
Multiplication Task
-------------------

A key question is whether pre-trained models use Fourier features only for solving addition tasks or whether they generalize to other arithmetic tasks. We hypothesize the latter: since numbers are represented by their Fourier features in the token embeddings after pre-training, this Fourier representation should be leveraged across a variety of number-related tasks. To validate this hypothesis, we perform a Fourier analysis on the GPT-2-XL model fine-tuned for the multiplication task.

Capping the maximum answer at $520$, as in the addition task, would yield too small a dataset for multiplication, so we instead set the maximum allowable product to $10000$. For each pair of numbers whose product does not exceed this limit, we generate various phrasings of multiplication questions and their corresponding answers in base $10$. The phrasings used are: \`\`What is the product of num1 and num2?'', \`\`Find the product of num1 multiplied by num2.'', \`\`Calculate num1 times num2.'', \`\`num1 multiplied by num2 equals what?'', and \`\`Multiplication of num1 with num2.'' The dataset is shuffled to ensure randomness and split into training ($80\%$), validation ($10\%$), and test ($10\%$) sets. We fine-tune the model for $25$ epochs with a learning rate of $1 \times 10^{-4}$. Upon convergence, the validation accuracy reaches $74.58\%$.
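
A minimal sketch of this dataset construction (the operand range `max_operand` is our assumption; the text only fixes the product cap):

```python
import random

TEMPLATES = [
    "What is the product of {a} and {b}?",
    "Find the product of {a} multiplied by {b}.",
    "Calculate {a} times {b}.",
    "{a} multiplied by {b} equals what?",
    "Multiplication of {a} with {b}.",
]

def build_mult_dataset(max_product=10000, max_operand=200, seed=0):
    """Phrase every pair with product <= max_product in all five templates,
    shuffle, and split 80/10/10."""
    samples = [(t.format(a=a, b=b), str(a * b))
               for a in range(max_operand + 1)
               for b in range(max_operand + 1)
               if a * b <= max_product
               for t in TEMPLATES]
    random.Random(seed).shuffle(samples)
    n = len(samples)
    return (samples[:int(0.8 * n)],
            samples[int(0.8 * n):int(0.9 * n)],
            samples[int(0.9 * n):])
```

With the defaults this enumerates all operand pairs whose product stays within the cap, so the answer distribution matches the constraint described above.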

As the primary objective is to determine whether the Fourier features are utilized in tasks other than addition, Figure `\ref{fig:logit_fourier_multiplication}`{=latex} displays the logits in Fourier space for each layer, as in Figure `\ref{fig:logit_fourier}`{=latex}. It is evident that the logits are sparse in Fourier space.

```{=latex}
\centering
```
```{=latex}
\subfloat[Logits for MLP output in Fourier space]{\includegraphics[width=0.5\textwidth]{figures/multiplication/layer_logit_fourier_mlp_multiplication.png}}
```
```{=latex}
\subfloat[Logits for attention output in Fourier space]{\includegraphics[width=0.5\textwidth]{figures/multiplication/layer_logit_fourier_attn_multiplication.png}}
```
```{=latex}
\newpage
```
Same Results for Another Format {#sec:format}
-----------------------------

To demonstrate that our observations are not confined to a specific phrasing of the mathematical problem, we conducted experiments on another format of the addition problem and obtained consistent results. Figure `\ref{fig:mlp_logit_lens_format}`{=latex} shows that periodic structures also appear in the intermediate logits.

```{=latex}
\centering
```
```{=latex}
\subfloat[MLP output logits for the last $15$ layers]{\includegraphics[width=0.5\textwidth]{figures/format/logit_lens/prediction_vs_mlp_layer_15,93+.png}}
```
```{=latex}
\subfloat[Attention output logits for the last $15$ layers]{\includegraphics[width=0.5\textwidth]{figures/format/logit_lens/prediction_vs_attn_layer_15,93+.png}}
```
From Figure `\ref{fig:logit_fourier_format}`{=latex}, we can also see the Fourier features in the MLP and attention outputs. These two experiments validate that our observations are not confined to a specific format of the addition problem.

```{=latex}
\centering
```
```{=latex}
\subfloat[Logits for MLP output in Fourier space]{\includegraphics[width=0.5\textwidth]{figures/format/fourier/mlp.png}}
```
```{=latex}
\subfloat[Logits for attention output in Fourier space]{\includegraphics[width=0.5\textwidth]{figures/format/fourier/attn.png}}
```
Fourier Features in Other Pre-trained LMs
----------------------------------------

Using the Fourier analysis framework proposed in Section `\ref{sec:fourier_feature}`{=latex}, we demonstrate that for GPT-J, the outputs of the MLP and attention modules exhibit approximate sparsity in Fourier space across the last $15$ layers (Figure `\ref{fig:logit_fourier_gptj}`{=latex}).

```{=latex}
\centering
```
```{=latex}
\subfloat[Logits for MLP output in Fourier space]{\includegraphics[width=0.5\textwidth]{figures/pretrained/gptj/gptj_mlp_FFT_heatmap.png}}
```
```{=latex}
\subfloat[Logits for attention output in Fourier space]{\includegraphics[width=0.5\textwidth]{figures/pretrained/gptj/gptj_attn_FFT_heatmap.png}}
```
```{=latex}
\newpage
```
Supporting Evidence For the Fourier Features {#sec:support_evidence_fourier}
============================================

We selected the layers that clearly show the periodic pattern in Figures `\ref{fig:error_accuracy_skip_logit_lens}`{=latex}b and `\ref{fig:error_accuracy_skip_logit_lens}`{=latex}c and plotted their logits in Figure `\ref{fig:mlp_logit_wave}`{=latex}.

```{=latex}
\centering
```
```{=latex}
\subfloat[Logits for the $33$-th layer's MLP output]{\includegraphics[width=0.5\textwidth]{figures/language/logit_wave/mlp_logit_wave/layer_33_prediction_vs_mlp_layer_Put_together_15_and_93._wave33.png}}
```
```{=latex}
\subfloat[Logits for the $40$-th layer's attention output]{\includegraphics[width=0.5\textwidth]{figures/language/logit_wave/attn_logit_wave/layer_40_prediction_vs_attn_layer_Put_together_15_and_93._wave40.png}}
```
Figure `\ref{fig:histogram_error}`{=latex} illustrates that the errors resulting from the ablation study (Section `\ref{sec:filter}`{=latex}) accord with our theoretical insights. Removing the low-frequency components from the MLP modules produces errors such as off-by-$10$, off-by-$50$, and off-by-$100$: without them, the MLP cannot approximate the magnitude accurately, although it still predicts the unit digit correctly. In contrast, removing the high-frequency components from the attention modules results in smaller errors, all less than $6$ in magnitude. These findings support our claim that low-frequency components are essential for accurate approximation, whereas high-frequency components are key for precise classification. Consequently, the primary function of the MLP modules is to approximate numerical magnitudes using low-frequency components, and the essential function of the attention modules is to enable precise classification by identifying the correct unit digit.
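
A toy numerical illustration of this division of labor, using synthetic waves rather than model outputs: an imprecise low-frequency wave peaked near the answer, plus exact high-frequency waves with periods $10$, $5$, and $2$. Dropping the low-frequency part loses the magnitude but preserves the unit digit, whereas dropping the high-frequency part leaves only a small error.

```python
import numpy as np

p, ans = 521, 108                  # number space 0..520; answer to 15 + 93
x = np.arange(p)
low  = np.cos(2 * np.pi * (x - 110) / (p - 1))           # imprecise magnitude cue
high = sum(np.cos(2 * np.pi * k * (x - ans) / (p - 1))   # exact modular structure
           for k in (52, 104, 260))                      # periods 10, 5, 2

print(int((low + high).argmax()))          # -> 108: together, exact
print((int(high.argmax()) - ans) % 10)     # -> 0: without the low part, the
                                           #    magnitude is lost but the unit
                                           #    digit survives
print(abs(int(low.argmax()) - ans))        # -> 2: without the high part, only
                                           #    a small error remains
```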

```{=latex}
\centering
```
```{=latex}
\subfloat[Filtering out low-frequency components of MLP]{\includegraphics[height=4.5cm]{figures/language/filter/differences_histogram_low_mlp.png}}
```
```{=latex}
\subfloat[Filtering out high-frequency components of attention]{\includegraphics[height=4.5cm]{figures/language/filter/differences_histogram_high_attn.png}}
```
```{=latex}
\newpage
```
More Experiments on GPT-2-XL Trained from Scratch {#sec:more_from_scratch}
=================================================

Following the methodology proposed in Section `\ref{sec:fourier_analysis}`{=latex}, we plotted the logits of the MLP and attention modules for each layer, as shown in Figure `\ref{fig:mlp_logit_lens_from_scratch}`{=latex}. The prediction is determined solely by the MLP of the $40$-th layer. Unlike in Figure `\ref{fig:logit_fourier}`{=latex}, there is no observable periodic structure in any layer.

```{=latex}
\centering
```
```{=latex}
\subfloat[MLP output logits for the last $15$ layers]{\includegraphics[width=0.5\textwidth]{figures/language/logit_lens/mlp_logit_lens/prediction_vs_mlp_layer_Put_together_15_and_93_fromscratch.png}}
```
```{=latex}
\subfloat[Attention output logits for the last $15$ layers]{\includegraphics[width=0.5\textwidth]{figures/language/logit_lens/attn_logit_lens/prediction_vs_attn_layer_Put_together_15_and_93_fromscratch.png}}
```
For the model trained from scratch on our addition dataset, all predictions on the test set deviate from the correct answer by at most $2$, as shown in Figure `\ref{fig:error_embedding_from_scratch}`{=latex}.

```{=latex}
\centering
```
![Error distribution for GPT-2-XL trained from scratch.](figures/pretrained/error_dis_from_scratch_gpt2xl.png "fig:"){#fig:error_embedding_from_scratch height="4.5cm"}

```{=latex}
\newpage
```
Details of Experimental Settings {#sec:detail_exp_setting}
================================

#### Fine-tuned GPT-2-XL

We fine-tune GPT-2-XL on the \`\`language-math-dataset'' for $50$ epochs with a batch size of $16$. The dataset consists of $27,400$ training samples, $3,420$ validation samples, and $3,420$ test samples. We use the AdamW optimizer, scheduling the learning rate linearly from $1 \times 10^{-5}$ to $0$ without warmup.
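
The schedule is a plain linear decay to $0$ with no warmup; a minimal sketch (whether a step is an epoch or a batch is an assumption):

```python
def linear_lr(step, total_steps, lr0=1e-5):
    """Learning rate decayed linearly from lr0 to 0, with no warmup."""
    return lr0 * (1.0 - step / total_steps)

print(linear_lr(0, 100), linear_lr(100, 100))  # -> 1e-05 0.0
```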

#### Train GPT-2-XL from scratch

We train GPT-2-XL on the \`\`language-math-dataset'' from scratch for $500$ epochs with a batch size of $16$. The dataset consists of $27,400$ training samples, $3,420$ validation samples, and $3,420$ test samples. We use the AdamW optimizer, scheduling the learning rate linearly from $1 \times 10^{-4}$ to $0$ without warmup.

#### Train GPT-2 from scratch

Both with and without the pre-trained token embedding, we train GPT-2 on the \`\`language-math-dataset'' for $700$ epochs with a batch size of $16$. The dataset consists of $27,400$ training samples, $3,420$ validation samples, and $3,420$ test samples. We use the AdamW optimizer, scheduling the learning rate linearly from $5 \times 10^{-5}$ to $0$ without warmup. In Figure `\ref{fig:embedding_validation_acc_from_scratch}`{=latex}b, we train the model with five different seeds and plot their mean and standard deviation.

#### Create the addition dataset in the main content

We consider numbers in base $10$ up to a maximum value of $260$. For each pair of numbers between $0$ and $260$, we generate various phrasings of addition questions and their corresponding answers. The phrasings used are: \`\`Total of num1 and num2.'', \`\`Add together num1 and num2.'', \`\`Calculate num1 + num2.'', \`\`What is the sum of num1 and num2?'', and \`\`Put together num1 and num2.''. The dataset is shuffled to ensure randomness and then split into training ($80\%$), validation ($10\%$), and test ($10\%$) sets.

#### Create the addition dataset in Appendix `\ref{sec:format}`{=latex} with a different format

We consider numbers in base $10$ up to a maximum value of $260$. We generate all possible pairs of numbers within this range using combinations with replacement. For each pair, we convert the numbers to the specified base and create questions formatted as \`\`num1,num2+\" with their corresponding answers. The dataset is then split into training ($80\%$), validation ($10\%$), and test ($10\%$) sets.
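
A minimal sketch of this variant (in base $10$ the conversion is just `str`; the helper name and split mechanics are our assumptions):

```python
import random
from itertools import combinations_with_replacement

def build_format_dataset(max_n=260, seed=0):
    """Questions formatted as 'num1,num2+' over unordered pairs of numbers
    in [0, max_n], shuffled and split 80/10/10."""
    samples = [(f"{a},{b}+", str(a + b))
               for a, b in combinations_with_replacement(range(max_n + 1), 2)]
    random.Random(seed).shuffle(samples)
    n = len(samples)
    return (samples[:int(0.8 * n)],
            samples[int(0.8 * n):int(0.9 * n)],
            samples[int(0.9 * n):])
```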

#### Compute Resources for the Experiments

All experiments involving fine-tuning and training from scratch in this paper were conducted on one NVIDIA A6000 GPU with 48GB of video memory. The fine-tuning process required less than 10 hours, while training from scratch took less than 3 days. Other experiments, such as those involving Logit Lens, were completed in less than 1 hour.

#### Licenses for Existing Assets and Open Access to Data and Code

For the following models, we use the checkpoints provided by Hugging Face. For all trained models, we use the default hyperparameters throughout training, varying only the random seed.

-   GPT-2-XL: <https://huggingface.co/openai-community/gpt2-xl>, [Modified MIT License](https://github.com/openai/gpt-2/blob/master/LICENSE)

-   GPT-2: <https://huggingface.co/openai-community/gpt2>, [Modified MIT License](https://github.com/openai/gpt-2/blob/master/LICENSE)

-   GPT-J: <https://huggingface.co/EleutherAI/gpt-j-6b>, Apache-2.0 License

-   Phi2: <https://huggingface.co/microsoft/phi-2>, MIT License

-   GPT-3.5 and GPT-4: <https://chatgpt.com/> or <https://openai.com/index/openai-api/>

-   PaLM-2: <https://ai.google/discover/palm2/>

```{=latex}
\else
```
Appendix {#appendix-1 .unnumbered}
========

```{=latex}
\iffalse
```
```{=latex}
\startcontents[appendix]
```
```{=latex}
\addcontentsline{toc}{chapter}{Appendix}
```
```{=latex}
\renewcommand{\thesection}{\Alph{section}}
```
```{=latex}
\printcontents[appendix]{}{1}{\setcounter{tocdepth}{3}}
```
```{=latex}
\setcounter{section}{0}
```
```{=latex}
\fi
```
#### Roadmap.

In Appendix `\ref{sec:formal_definition}`{=latex}, we introduce some formal definitions that used in our main content. In Appendix `\ref{sec:app:selection_of_tau}`{=latex}, we show why we separate the Fourier components into the high-frequency part and the low-frequency part and why we choose $\tau$ to be $50$. In Appendix `\ref{sec:other_task}`{=latex}, we show our observation generalizes to another format of dataset, another arithmetic task and other models. In Appendix `\ref{sec:support_evidence_fourier}`{=latex}, we provide more evidence that shows the Fourier features in the model when computing addition. In Appendix `\ref{sec:more_from_scratch}`{=latex}, we provide more evidence that shows the GPT-2-XL trained from scratch does not use Fourier feature to solve the addition task. In Appendix `\ref{sec:detail_exp_setting}`{=latex}, we give the details of our experimental settings.

Formal Definition of Transformer and Logits in Fourier Space  {#sec:formal_definition}
============================================================

We first introduce the formal definition of the Transformer structure that we used in this paper.

```{=latex}
\begin{definition}[Transformer]An autoregressive Transformer language model $G: \mathcal{X} \rightarrow \mathcal{Y}$ over vocabulary $\mathrm{Vocab}$ maps a token sequence $x=\left[x_1, \ldots, x_N\right] \in \mathcal{X}, x_t \in \mathrm{Vocab}$ to a probability distribution $y \in \mathcal{Y} \subset \mathbb{R}^{|\mathrm{Vocab}|}$ that predicts next-token continuations of $x$. Within the Transformer, the $i$-th token is embedded as a series of hidden state vectors $h_t^{(\ell)}$, beginning with $h_t^{(0)}=\operatorname{emb}\left(x_t\right)+\operatorname{pos}(i) \in \mathbb{R}^D$. Let $W^{U} \in \R^{|\mathrm{Vocab}| \times D}$ denote the output embedding. The final output $y=\operatorname{softmax}(W^{U}\left(h_N^{(L)}\right))$ is read from the last hidden state. 
In the autoregressive case, tokens only draw information from past  tokens:
\begin{align*}%\label{eq:residual_stream}
    h_t^{(\ell)}=h_t^{(\ell-1)}+\attn_t^{(\ell)}+\mlp_t^{(\ell)}
\end{align*}
where
\begin{align*}
\attn_t^{(\ell)}:=\attn^{(\ell)}\left(h_1^{(\ell-1)}, h_2^{(\ell-1)}, \ldots, h_t^{(\ell-1)}\right) \quad \text{and} \quad \mlp_t^{(\ell)}:= \mlp_t^{(\ell)}(\attn_t^{(\ell)},h_t^{(\ell-1)}). 
\end{align*}
\end{definition}
```
In this paper, we only consider the output tokens to be numbers. Hence, we have the unembedding matrix $W^U \in \R^{p \times D}$, where $p$ is the size of the number space. As we are given the length-$N$ input sequences and predict the $(N+1)$-th, we only consider $h_N^{(\ell)}=h_N^{(\ell-1)}+\attn_N^{(\ell)}+\mlp_N^{(\ell)}$. For simplicity, we ignore the subscript $N$ in the following paper, so we get Eq. `\eqref{eq:residual_stream}`{=latex}.

```{=latex}
\begin{definition}[Intermediate Logits]\label{def:logits}
     Let $\logits_{\attn}^{(\ell)} := W^{U} \attn^{(\ell)} $ denote the intermediate logits of the attention module at the $\ell$-th layer. Let $\logits_{\mlp}^{(\ell)} :=  W^{U} \mlp^{(\ell)}$ denote the intermediate logits of the MLP module at the $\ell$-th layer. Let $\logits^{(\ell)}:= W^{U} h^{(\ell)}$ denote the logits on intermediate state $h^{(\ell)}$.
\end{definition}
```
Throughout the model, $h$ undergoes only additive updates (Eq. `\eqref{eq:residual_stream}`{=latex}), creating a continuous residual stream [@elhage2021mathematical], meaning that the token representation $h$ accumulates all additive updates within the residual stream up to layer $t$.

To analyze the logits in Fourier space, we give the formal definition of the Fourier basis as follows:

```{=latex}
\begin{definition}[Fourier Basis]\label{def:fourier_basis}
Let $p$ denote the size of the number space.
    Let $\overrightarrow{\mathbf{x}}:=(0,1, \ldots,(p-1))$.
Let $\omega_k := \frac{2 \pi k}{p-1}$. We  denote the normalized Fourier basis $F$ as the $p \times p$ matrix:
$$
F :=\left[\begin{array}{c}
\sqrt{\frac{1}{p-1}} \cdot \overrightarrow{\mathbf{1}} \\
\sqrt{\frac{2}{p-1}} \cdot  \sin \left(\omega_1 \overrightarrow{\mathbf{x}}\right) \\
\sqrt{\frac{2}{p-1}} \cdot  \cos \left(\omega_1 \overrightarrow{\mathbf{x}}\right) \\
\sqrt{\frac{2}{p-1}} \cdot  \sin \left(\omega_2 \overrightarrow{\mathbf{x}}\right) \\
\vdots \\
\sqrt{\frac{2}{p-1}} \cdot  \cos \left(\omega_{(p-1) / 2} \overrightarrow{\mathbf{x}}\right)
\end{array}\right] \in \R^{p \times p}
$$
The first component $F[0]$ is defined as a constant component. For $i \in [0,p-1]$, $F[i]$ is defined as the $k$-th component in Fourier space, where $k = \lfloor \frac{i+1}{2} \rfloor$. The frequency of the $k$-th component is $f_k := \frac{k}{p-1}$. The period of the $k$-th component is $T_k := \frac{p-1}{k}$
\end{definition}
```
We can compute the discrete Fourier transform under that Fourier basis as follows:

```{=latex}
\begin{remark}[Discrete Fourier transformer (DFT) and inverse DFT]We can transform any logits $u \in \R^{p}$ to Fourier space by computing $\hat{u} = F \cdot u$. We can transform $\hat{u}$ back to $u$ by $u = F^\top \cdot \hat{u}$
\end{remark}
```
Next, we define the logits in Fourier space.

```{=latex}
\begin{definition}[Logits in Fourier Space]\label{def:logits_fourier}
Let $\logits^{(L)}$, $\logits_{\attn}^{(\ell)}$ and $\logits_{\mlp}^{(\ell)}$ denote the logits (Definition \ref{def:logits}). The output logits before softmax in Fourier space is defined as: $\hat{\logits}^{(L)} = F \cdot \logits^{(L)} $. The logits of the MLP and attention modules in Fourier space are defined as:
\begin{align*}
\hat{\logits}_{\attn}^{(\ell)} = F \cdot \logits_{\attn}^{(\ell)} \quad \text{and} \quad \hat{\logits}_{\mlp}^{(\ell)} = F \cdot \logits_{\mlp}^{(\ell)}.
\end{align*}

\end{definition}
```
We ignore the first elements in $\hat{\logits}^{(L)}, \hat{\logits}_{\attn}^{(\ell)}$ and $\hat{\logits}_{\mlp}^{(\ell)}$ for the Fourier analysis in this paper as they are the constant terms. Adding a constant to the logits will not change the prediction.

Let $\tau \in \R$ denote a constant threshold. The low-frequency components for the logits in Fourier space are defined as $\hat{\logits}^{(\ell)}[1:2\tau]$. The high-frequency components for the logits in Fourier space are defined as $\hat{\logits}^{(\ell)}[2\tau:]$. For the following analysis, we choose $\tau = 50$ (the specific choice of $\tau = 50$ is explained in Appendix `\ref{sec:app:selection_of_tau}`{=latex}).

Next, we propose the formal definition of low-pass/high-pass filter that is used in the following ablation study.

```{=latex}
\begin{definition}[Loss-pass / High-pass Filter]\label{def:filter}
Let $x \in \R^{D}$ denote the output of MLP or attention modules. Let $F$ denote the Fourier Basis (Definition  \ref{def:fourier_basis}). Let $\tau \in R$ denote the frequency threshold. Let $ W^{U} \in R^{p \times D}$ denote the output embedding. For low-pass filter, we define a diagonal binary matrix $B \in \{0,1\}^{p \times p}$ as 
% \begin{align*}
$
    b_{ii} = 
\begin{cases} 
1 & \text{if } i \geq \tau \\
0 & \text{otherwise}
\end{cases}.
$
% \end{align*}
 For high-pass filter, we define a diagonal binary matrix $B \in \{0,1\}^{p \times p}$ as 
% \begin{align*}
$
    b_{ii} = 
\begin{cases} 
1 & \text{if } 1 \leq i < \tau \\
0 & \text{otherwise}
\end{cases}.
$
% \end{align*}
Note that we retain the constant component, so $b_{i,i} = 0$.
The output of the filter $\mathcal{F}(x): \R^{D} \rightarrow \R^{D}$ is defined by the following objective function:
    \begin{align*}
    \min_{y} \quad & \|x-y\|_2^2 \\
    \mathrm{subject~to} \quad &  BF  W^{U} y = 0
    \end{align*}
\end{definition}
```
The solution to the above optimization problem is given by a linear projection.

```{=latex}
\begin{remark}
    The result of the optimization problem defined in Definition \ref{def:filter} is the projection of $x$ to the null space of $BF W^{U}$. Let $\mathcal{N}(BF W^{U})$ denote the null space of $BF W^{U}$. We have
    \begin{align*}
        \mathcal{F}(x) = \mathcal{N}(BF W^{U}) \cdot \mathcal{N}(BF W^{U})^\top \cdot x^\top
    \end{align*}
\end{remark}
```
```{=latex}
\newpage
```
Fourier Components Separation and Selection of $\tau$ {#sec:app:selection_of_tau}
=====================================================

```{=latex}
\centering
```
```{=latex}
\subfloat[Logits for MLP output in Fourier space]{\includegraphics[width=0.5\textwidth]{figures/language/fourier/layer_logit_fourier_mlp_low.png}}
```
```{=latex}
\subfloat[Logits for attention output in Fourier space]{\includegraphics[width=0.5\textwidth]{figures/language/fourier/layer_logit_fourier_attn_low.png}}
```
Following Definition `\ref{def:filter}`{=latex}, we define single-pass filter as follows:

```{=latex}
\begin{definition}[Single-Pass Filter]\label{def:single_filter}
Let $x \in \R^{D}$ denote the output of MLP or attention modules. Let $F$ denote the Fourier Basis (Definition  \ref{def:fourier_basis}). Let $\gamma \in R$ denote the $\gamma$-th Fourier component (Definition \ref{def:fourier_basis}) that we want to retain. Let $ W^{U} \in R^{V \times D}$ denote the output embedding.
We define a diagonal binary matrix $B \in \{0,1\}^{V \times V}$ as 
% \begin{align*}
$
    b_{ii} = 
\begin{cases} 
0 & \text{if } \lfloor \frac{i+1}{2} \rfloor = \gamma \text{ or } i = 0,\\
1 & \text{otherwise}.
\end{cases}
$
% \end{align*}

The output of the filter $\mathcal{F}_\gamma(x): \R^{D} \rightarrow \R^{D}$ is defined as the following objective function:
    \begin{align*}
    \min_{y} \quad & \|x-y\|_2^2 \\
    \mathrm{subject~to} \quad &  BF  W^{U} y = 0
    \end{align*}
% We define $\mathcal{F}_{:\tau}$ as high-pass filter and $\mathcal{F}_{\tau:}$ as low-pass filter.
\end{definition}
```
```{=latex}
\begin{remark}
    The result of the optimization problem defined in Definition \ref{def:single_filter} is the projection of $x$ to the null space of $BF W^{U}$. Let $\mathcal{N}(BF W^{U})$ denote the null space of $BF W^{U}$. We have
    \begin{align*}
        \mathcal{F}_\gamma(x) = \mathcal{N}(BF W^{U}) \cdot \mathcal{N}(BF W^{U})^\top \cdot x^\top
    \end{align*}
\end{remark}
```
For the single-pass filter, we only retrain one Fourier component and analyze how this component affects the model's prediction. The residual stream is then updated as follows: $$\begin{aligned}
h^{(\ell)}  = h^{(\ell-1)}  + \mathcal{F}_\gamma(\mathrm{Attn}^{(\ell-1)}) + \mathcal{F}_\gamma(\mathrm{MLP}^{(\ell-1)} )\end{aligned}$$

We evaluated the fine-tuned GPT-2-XL model on the addition dataset with the Fourier components period $520$ and $2$. Given that $T_k := \frac{V-1}{k}$ (Definition `\ref{def:fourier_basis}`{=latex}), we retained only the Fourier components with $\gamma = 1$ and $260$, respectively.

As shown in Figure `\ref{fig:single_pass_error}`{=latex}a, with only one frequency component, whose period is $2$, the model accurately predicts the parity with $99.59\%$ accuracy. As depicted in Figure `\ref{fig:single_pass_error}`{=latex}b, with a single frequency component of period $520$, the model fails to accurately predict with $96.51\%$ accuracy. We consider the frequency component with a period of $2$ as the model's prediction for the `mod 2` task, and the frequency component with a period of $520$ as its prediction for the `mod 520` task. Figures `\ref{fig:single_pass_error}`{=latex} and `\ref{fig:histogram_mod520}`{=latex} suggest that the model effectively learns the `mod 2` task, as it involves a two-class classification, but struggles with the `mod 520` task, which requires classifying among $520$ classes. As the model does not need to be trained to converge to the optimal for these low-frequency components as explained at the end of Section `\ref{sec:fourier_feature}`{=latex}, predicting with the period-$520$ component leads to predictions that normally distributed around the correct answers.

```{=latex}
\centering
```
```{=latex}
\subfloat[Retaining Period-2 Frequency Components]{\includegraphics[height=4cm]{figures/language/filter/mod2_ablation.png}}
```
```{=latex}
\hspace{1mm}
```
```{=latex}
\subfloat[Retaining Period-520 Frequency Components]{\includegraphics[height=4cm]{figures/language/filter/mod520_ablation.png}}
```
```{=latex}
\centering
```
![ Retaining only the Period-$520$ Fourier component makes the model's predictions normally distributed around the correct answers. ](figures/language/filter/mod520_ablation_distribution.png "fig:"){#fig:histogram_mod520 width="50%"}

The Fourier components with larger periods present greater difficulty in solving the corresponding modular addition task compared to those with smaller periods. As demonstrated in Figure `\ref{fig:histogram_mod520}`{=latex}, components with large periods serve primarily as approximations of the correct answer. Consequently, we categorize the Fourier components into low-frequency and high-frequency groups. The low-frequency components approximate the magnitude of the answer, whereas the high-frequency components are employed to enhance the precision of the predictions.

To elucidate how these distinct Fourier components contribute to the final prediction, and why we separate them, consider the example in Figure `\ref{fig:final_logit_topk}`{=latex}: \`\`Put together $15$ and $93$. Answer: $108$". We selected the top-10 Fourier components of $\hat{\logits}^{(L)}$ by magnitude and converted them back to logits in the numerical space by multiplying with $F^\top$. We plotted the components with index less than $50$ in Figure `\ref{fig:choose_of_tau}`{=latex}a and those with index greater than $50$ in Figure `\ref{fig:choose_of_tau}`{=latex}b. Through constructive and destructive interference between the different waves, the low-period components assign more weight to the correct answer, $108$, and less weight to nearby numbers. These high-frequency (low-period) components ensure the prediction's accuracy at the unit digit. For the low-frequency (large-period) components, the model fails to precisely learn the magnitude of the factor between the $\cos$ and $\sin$ components, so these components do not peak at the correct answer. Thus, the low-frequency (large-period) components serve to approximate the magnitude of the addition result.
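The top-10 reconstruction can be illustrated with a small NumPy sketch (the toy logits and function name are ours; in this sketch the inverse real FFT plays the role of the $F^\top$ projection back to numerical space):

```python
import numpy as np

def topk_fourier(logits: np.ndarray, k: int) -> np.ndarray:
    """Reconstruct logits from their k largest-magnitude Fourier components."""
    spec = np.fft.rfft(logits)
    keep = np.argsort(np.abs(spec))[-k:]   # indices of the top-k components
    out = np.zeros_like(spec)
    out[keep] = spec[keep]
    return np.fft.irfft(out, n=logits.shape[-1])

# Toy logits peaked at the correct answer 108. The dominant components are
# waves whose phases align at 108: they interfere constructively there and
# destructively elsewhere, so even a 10-component reconstruction peaks at 108.
n = 520
x = np.arange(n)
logits = np.exp(-0.5 * ((x - 108) / 5.0) ** 2)
approx = topk_fourier(logits, k=10)
```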

```{=latex}
\centering
```
```{=latex}
\subfloat[Final logits for components with index less than $50$]{\includegraphics[width=0.48\textwidth]{figures/language/filter/component_less_100.png}}
```
```{=latex}
\hspace{1mm}
```
```{=latex}
\subfloat[Final logits for components with index greater than $50$]{\includegraphics[width=0.48\textwidth]{figures/language/filter/component_greater_100.png}}
```
```{=latex}
\centering
```
![Visualization of the period-$520$ Fourier component's contribution to the final logits.](figures/language/logit_wave/final_logits_for15+93_topk_T520.png "fig:"){#fig:final_logits_T520 width="48%"}

```{=latex}
\clearpage
```
```{=latex}
\newpage
```
Do Fourier Features Generalize? {#sec:other_task}
=================================

Token Embeddings for Other LMs
-----------------------------

We first show that other pre-trained LMs also exhibit Fourier features in their token embeddings for the numbers $[0, 520]$.
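The embedding analysis amounts to an FFT along the number axis of the embedding matrix. The sketch below runs it on a synthetic matrix with planted periodic dimensions (all names, sizes, and constants are illustrative) to show how sparse spectral peaks emerge:

```python
import numpy as np

def embedding_spectrum(E: np.ndarray) -> np.ndarray:
    """FFT of each embedding dimension across the number axis (rows = numbers
    0..N-1, columns = hidden dims); returns magnitudes averaged over dims."""
    E = E - E.mean(axis=0, keepdims=True)   # drop the per-dimension offset
    spec = np.abs(np.fft.rfft(E, axis=0))
    return spec.mean(axis=1)

# Synthetic embedding: three dimensions carry clean Fourier features
# (periods 2, 5, and 10); the remaining dimensions are noise.
rng = np.random.default_rng(0)
n, d = 520, 64
x = np.arange(n)
E = 0.1 * rng.standard_normal((n, d))
for j, period in enumerate([2, 5, 10]):
    E[:, j] += np.cos(2 * np.pi * x / period)
mag = embedding_spectrum(E)
top = np.argsort(mag)[-3:]   # the planted periods dominate the spectrum
```

A period-$T$ feature shows up at frequency bin $n/T$; here the three peaks land at bins $260$, $104$, and $52$.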

```{=latex}
\centering
```
+:-------------------------------------------------------------------------------------------------------------------------------------------:+:-----------------------------------------------------------------------------------------------------------------------------------------:+
| `\subfloat[pre-trained GPT-2-XL]{\includegraphics[width=0.45\textwidth]{figures/language/fourier/embedding_pretrained_gpt2xl.png}}`{=latex} | `\subfloat[fine-tuned GPT-2-XL]{\includegraphics[width=0.45\textwidth]{figures/language/fourier/embedding_finetuned_gpt2xl.png}}`{=latex} |
+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+
| `\subfloat[pre-trained RoBERTa]{\includegraphics[width=0.45\textwidth]{figures/language/fourier/embedding_pretrained_roberta.png}}`{=latex} | ```{=latex}                                                                                                                               |
|                                                                                                                                             | \subfloat[pre-trained Phi2]{\includegraphics[width=0.45\textwidth]{figures/language/fourier/embedding_pretrained_phi2.png}}               |
|                                                                                                                                             | ```                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+

```{=latex}
\newpage
```
Multiplication Task
-------------------

A key question is whether pre-trained models use Fourier features solely for solving addition tasks or whether they generalize to other arithmetic tasks. We hypothesize the latter: since numbers are represented by their Fourier features in the token embeddings after pre-training, this Fourier representation should be leveraged across a variety of number-related tasks. To validate this hypothesis, we perform a Fourier analysis on a GPT-2-XL model fine-tuned for the multiplication task.

Capping the numbers at $520$, as in the addition task, would yield an insufficient dataset size for multiplication. Therefore, we set the maximum allowable product to $10000$. For each pair of numbers whose product does not exceed this limit, we generate various phrasings of multiplication questions and their corresponding answers in base $10$. The phrasings used are: \`\`What is the product of num1 and num2?\", \`\`Find the product of num1 multiplied by num2.\", \`\`Calculate num1 times num2.\", \`\`num1 multiplied by num2 equals what?\", and \`\`Multiplication of num1 with num2.\" The dataset is then shuffled to ensure randomness and split into training ($80\%$), validation ($10\%$), and test ($10\%$) sets. We fine-tune the model for $25$ epochs with a learning rate of $1 \times 10^{-4}$. Upon convergence, the validation accuracy reaches $74.58\%$.
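The dataset construction can be sketched as follows. The operand cap of $520$ below is our assumption (the text only fixes the product cap of $10000$), and the seed and split implementation are illustrative:

```python
import random

# The five question templates quoted above; {a}/{b} stand in for num1/num2.
TEMPLATES = [
    "What is the product of {a} and {b}?",
    "Find the product of {a} multiplied by {b}.",
    "Calculate {a} times {b}.",
    "{a} multiplied by {b} equals what?",
    "Multiplication of {a} with {b}.",
]

def build_mult_dataset(max_operand=520, max_product=10000, seed=0):
    """Phrase every admissible pair with every template, shuffle, and
    split 80/10/10."""
    pairs = [(a, b) for a in range(max_operand + 1)
             for b in range(max_operand + 1) if a * b <= max_product]
    data = [(t.format(a=a, b=b), str(a * b))
            for a, b in pairs for t in TEMPLATES]
    random.Random(seed).shuffle(data)
    n = len(data)
    return (data[:int(0.8 * n)],
            data[int(0.8 * n):int(0.9 * n)],
            data[int(0.9 * n):])
```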

As the primary objective is to determine whether the Fourier features are utilized in tasks other than addition, Figure `\ref{fig:logit_fourier_multiplication}`{=latex} displays the logits in Fourier space for each layer, as in Figure `\ref{fig:logit_fourier}`{=latex}. It is evident that the logits are sparse in Fourier space.

```{=latex}
\centering
```
```{=latex}
\subfloat[Logits for MLP output in Fourier space]{\includegraphics[width=0.5\textwidth]{figures/multiplication/layer_logit_fourier_mlp_multiplication.png}}
```
```{=latex}
\subfloat[Logits for attention output in Fourier space]{\includegraphics[width=0.5\textwidth]{figures/multiplication/layer_logit_fourier_attn_multiplication.png}}
```
```{=latex}
\newpage
```
Consistent Results for Another Format {#sec:format}
-----------------------------

To demonstrate that our observations are not confined to a specific description of the mathematical problem, we conducted experiments on another format of addition problem and obtained consistent results. From Figure `\ref{fig:mlp_logit_lens_format}`{=latex}, we can see that there are also periodic structures in the intermediate logits.

```{=latex}
\centering
```
```{=latex}
\subfloat[MLP output logits for the last $15$ layers]{\includegraphics[width=0.5\textwidth]{figures/format/logit_lens/prediction_vs_mlp_layer_15,93+.png}}
```
```{=latex}
\subfloat[Attention output logits for the last $15$ layers]{\includegraphics[width=0.5\textwidth]{figures/format/logit_lens/prediction_vs_attn_layer_15,93+.png}}
```
From Figure `\ref{fig:logit_fourier_format}`{=latex}, we can also see the Fourier features for the MLP and attention output. These two experiments validate that our observations are not confined to a specific format of the addition problems.

```{=latex}
\centering
```
```{=latex}
\subfloat[Logits for MLP output in Fourier space]{\includegraphics[width=0.5\textwidth]{figures/format/fourier/mlp.png}}
```
```{=latex}
\subfloat[Logits for attention output in Fourier space]{\includegraphics[width=0.5\textwidth]{figures/format/fourier/attn.png}}
```
```{=latex}
\iffalse
```
We trained GPT-2 from scratch using the \`format-math-dataset'. In Figure `\ref{fig:embedding_validation_acc_from_scratch_format}`{=latex}a, we plot the number embedding in Fourier space for this GPT-2. The GPT-2 trained from scratch achieves $99.8\%$ test accuracy, and all mispredictions were off by $1$. When we froze the (pre-trained) token embedding layer and randomly initialized the weights of all other layers before retraining the model on the \`format-math-dataset', the model achieved $100\%$ accuracy on the validation dataset with a faster convergence rate, as illustrated in Figure `\ref{fig:embedding_validation_acc_from_scratch}`{=latex}b. We concatenated all the weights of three models: 1) fine-tuned GPT-2, 2) GPT-2 trained from scratch, and 3) GPT-2 with pre-trained token embeddings trained from scratch, and compared the cosine similarity between the weight matrices of these models. The similarity between models 1 and 3 was $0.7058$, whereas the similarity between models 1 and 2 was $-0.0053$, and between models 2 and 3 was $0.0073$. These two experiments demonstrate that, given number embeddings with Fourier features learned during pre-training, the model can effectively leverage these features to solve the addition task. `\fi`{=latex} `\iffalse`{=latex}

```{=latex}
\centering
```
```{=latex}
\subfloat[Embedding: GPT-2 Trained from Scratch]{\includegraphics[height=4.5cm]{figures/pretrained/embedding_fromscratch_gpt2.png}
}
```
```{=latex}
\subfloat[Validation Accuracy Comparison for GPT-2]{\includegraphics[height=4.3cm]{figures/pretrained/with_pre-trained_embedding_acc_gpt2.png}}
```
```{=latex}
\fi
```
Fourier Features in Other Pre-trained LMs
----------------------------------------

Using the Fourier analysis framework proposed in Section `\ref{sec:fourier_feature}`{=latex}, we demonstrate that for GPT-J, the outputs of the MLP and attention modules exhibit approximate sparsity in Fourier space across the last $15$ layers (Figure `\ref{fig:logit_fourier_gptj}`{=latex}).

```{=latex}
\centering
```
```{=latex}
\subfloat[Logits for MLP output in Fourier space]{\includegraphics[width=0.5\textwidth]{figures/pretrained/gptj/gptj_mlp_FFT_heatmap.png}}
```
```{=latex}
\subfloat[Logits for attention output in Fourier space]{\includegraphics[width=0.5\textwidth]{figures/pretrained/gptj/gptj_attn_FFT_heatmap.png}}
```
```{=latex}
\newpage
```
Supporting Evidence For the Fourier Features {#sec:support_evidence_fourier}
============================================

We selected the layers that clearly show the periodic pattern in Figures `\ref{fig:error_accuracy_skip_logit_lens}`{=latex}b and `\ref{fig:error_accuracy_skip_logit_lens}`{=latex}c and plotted their logits in Figure `\ref{fig:mlp_logit_wave}`{=latex}.

```{=latex}
\centering
```
```{=latex}
\subfloat[Logits for the $33$-th layer's MLP output]{\includegraphics[width=0.5\textwidth]{figures/language/logit_wave/mlp_logit_wave/layer_33_prediction_vs_mlp_layer_Put_together_15_and_93._wave33.png}}
```
```{=latex}
\subfloat[Logits for the $40$-th layer's attention output]{\includegraphics[width=0.5\textwidth]{figures/language/logit_wave/attn_logit_wave/layer_40_prediction_vs_attn_layer_Put_together_15_and_93._wave40.png}}
```
Figure `\ref{fig:histogram_error}`{=latex} illustrates that the errors resulting from the ablation study (Section `\ref{sec:filter}`{=latex}) are consistent with our theoretical insights. Removing the low-frequency components from the MLP output results in errors such as off by $10$, $50$, and $100$: without these components, the MLP cannot accurately approximate the magnitude, although it still correctly predicts the unit digit. In contrast, removing the high-frequency components from the attention modules results in smaller errors, all less than $6$ in magnitude. These findings support our claim that low-frequency components are essential for accurate approximation, whereas high-frequency components are key for precise classification. Consequently, the primary function of the MLP modules is to approximate numerical magnitudes using low-frequency components, and the essential function of the attention modules is to enable precise classification by identifying the correct unit digit.
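The frequency filtering used in this ablation can be sketched as a mask in Fourier space. The cutoff value, the split into "low" and "high", and the toy logits below are illustrative, not the paper's exact implementation:

```python
import numpy as np

def filter_by_period(logits: np.ndarray, period_cutoff: float,
                     keep: str) -> np.ndarray:
    """Keep only long-period ("low") or short-period ("high") Fourier
    components of the logits; the DC term is always retained."""
    n = logits.shape[-1]
    freqs = np.fft.rfftfreq(n)             # cycles per index
    spec = np.fft.rfft(logits)
    low = freqs < 1.0 / period_cutoff      # period greater than the cutoff
    mask = low if keep == "low" else ~low
    mask[0] = True                         # keep the mean
    spec[~mask] = 0.0
    return np.fft.irfft(spec, n=n)

# Toy logits: a broad hump locating the answer near 108 plus a period-2
# parity wave. Low-pass keeps the magnitude; high-pass keeps the parity.
n = 520
x = np.arange(n)
logits = np.exp(-0.5 * ((x - 108) / 20.0) ** 2) + 0.3 * np.cos(np.pi * x)
low_only = filter_by_period(logits, 10.0, keep="low")
high_only = filter_by_period(logits, 10.0, keep="high")
```

Mirroring the histogram analysis: `low_only` still peaks at the right magnitude but loses the parity wave, while `high_only` retains the parity signal but not the location of the answer.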

```{=latex}
\centering
```
```{=latex}
\subfloat[Filtering out low-frequency components of MLP]{\includegraphics[height=4.5cm]{figures/language/filter/differences_histogram_low_mlp.png}}
```
```{=latex}
\subfloat[Filtering out high-frequency components of attention]{\includegraphics[height=4.5cm]{figures/language/filter/differences_histogram_high_attn.png}}
```
```{=latex}
\newpage
```
More Experiments on GPT-2-XL Trained from Scratch {#sec:more_from_scratch}
=================================================

Following the methodology proposed in Section `\ref{sec:fourier_analysis}`{=latex}, we plotted the logits of the MLP and attention modules for each layer, as shown in Figure `\ref{fig:mlp_logit_lens_from_scratch}`{=latex}. The prediction is solely determined by the $40$-th layer MLP. Unlike Figure `\ref{fig:logit_fourier}`{=latex}, no periodic structure is observable in any layer.

```{=latex}
\centering
```
```{=latex}
\subfloat[MLP output logits for the last $15$ layers]{\includegraphics[width=0.5\textwidth]{figures/language/logit_lens/mlp_logit_lens/prediction_vs_mlp_layer_Put_together_15_and_93_fromscratch.png}}
```
```{=latex}
\subfloat[Attention output logits for the last $15$ layers]{\includegraphics[width=0.5\textwidth]{figures/language/logit_lens/attn_logit_lens/prediction_vs_attn_layer_Put_together_15_and_93_fromscratch.png}}
```
For the model trained from scratch on the created addition dataset, all predictions on the test dataset deviate from the correct answer by at most $2$, as shown in Figure `\ref{fig:error_embedding_from_scratch}`{=latex}.

```{=latex}
\centering
```
![Error distribution for GPT-2-XL trained from scratch.](figures/pretrained/error_dis_from_scratch_gpt2xl.png "fig:"){#fig:error_embedding_from_scratch height="4.5cm"}

```{=latex}
\newpage
```
Details of Experimental Settings {#sec:detail_exp_setting}
================================

#### Fine-tuned GPT-2-XL

We fine-tune GPT-2-XL on the \`\`language-math-dataset" for $50$ epochs with a batch size of $16$. The dataset consists of $27,400$ training samples, $3,420$ validation samples, and $3,420$ test samples. We use the AdamW optimizer, scheduling the learning rate linearly from $1 \times 10^{-5}$ to $0$ without warmup.

#### Train GPT-2-XL from scratch

We train GPT-2-XL on the \`\`language-math-dataset" from scratch for $500$ epochs with a batch size of $16$. The dataset consists of $27,400$ training samples, $3,420$ validation samples, and $3,420$ test samples. We use the AdamW optimizer, scheduling the learning rate linearly from $1 \times 10^{-4}$ to $0$ without warmup.

#### Train GPT-2 from scratch

For both settings, with and without the pre-trained token embedding, we train GPT-2 on the \`\`language-math-dataset" for $700$ epochs with a batch size of $16$. The dataset consists of $27,400$ training samples, $3,420$ validation samples, and $3,420$ test samples. We use the AdamW optimizer, scheduling the learning rate linearly from $5 \times 10^{-5}$ to $0$ without warmup. In Figure `\ref{fig:embedding_validation_acc_from_scratch}`{=latex}b, we train the model with five different seeds and plot their mean and standard deviation.
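The three training runs share the same optimizer and schedule shape; the summary below is a sketch, where the step-wise schedule function is our reading of "linearly from the peak learning rate to $0$ without warmup" and the run names are ours:

```python
# Hyperparameters reported above; the 27,400/3,420/3,420 split is shared.
RUNS = {
    "finetune_gpt2xl": {"epochs": 50,  "peak_lr": 1e-5, "batch_size": 16},
    "scratch_gpt2xl":  {"epochs": 500, "peak_lr": 1e-4, "batch_size": 16},
    "scratch_gpt2":    {"epochs": 700, "peak_lr": 5e-5, "batch_size": 16},
}

def linear_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """AdamW learning rate at `step`: linear decay from peak_lr to 0,
    with no warmup."""
    return peak_lr * max(0.0, 1.0 - step / total_steps)
```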

#### Create the addition dataset in main content

We consider numbers in base $10$ up to a maximum value of $260$. For each pair of numbers between $0$ and $260$, we generate various phrasings of addition questions and their corresponding answers. The different phrasings used are: \`\`Total of num1 and num2.\", \`\`Add together num1 and num2.\", \`\`Calculate num1 + num2.\", \`\`What is the sum of num1 and num2?\", and \`\`Put together num1 and num2.\". The dataset is shuffled to ensure randomness and then split into training ($80\%$), validation ($10\%$), and test ($10\%$) sets.

#### Create the addition dataset in Appendix `\ref{sec:format}`{=latex} with different format

We consider numbers in base $10$ up to a maximum value of $260$. We generate all possible pairs of numbers within this range using combinations with replacement. For each pair, we convert the numbers to the specified base and create questions formatted as \`\`num1,num2+\" with their corresponding answers. The dataset is then split into training ($80\%$), validation ($10\%$), and test ($10\%$) sets.
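The steps above can be sketched as follows; the seed, shuffle, and split implementation are illustrative assumptions (the text does not specify them):

```python
import random
from itertools import combinations_with_replacement

def build_format_dataset(max_num=260, seed=0):
    """All unordered pairs of numbers in [0, max_num], phrased as
    "num1,num2+" with the base-10 sum as the answer, split 80/10/10."""
    data = [(f"{a},{b}+", str(a + b))
            for a, b in combinations_with_replacement(range(max_num + 1), 2)]
    random.Random(seed).shuffle(data)
    n = len(data)
    return (data[:int(0.8 * n)],
            data[int(0.8 * n):int(0.9 * n)],
            data[int(0.9 * n):])
```

With `max_num=260`, combinations with replacement yield $\binom{261}{2} + 261 = 34{,}191$ question--answer pairs before splitting.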

#### Experiments Compute Resources

All experiments involving fine-tuning and training from scratch in this paper were conducted on one NVIDIA A6000 GPU with 48GB of video memory. The fine-tuning process required less than 10 hours, while training from scratch took less than 3 days. Other experiments, such as those involving Logit Lens, were completed in less than 1 hour.

#### Licenses for Existing Assets & Open Access to Data and Code

For the following models, we use the checkpoints provided by Hugging Face. For all trained models, we use the default hyperparameters throughout training, varying only the random seed.

-   GPT-2-XL: <https://huggingface.co/openai-community/gpt2-xl>, [Modified MIT License](https://github.com/openai/gpt-2/blob/master/LICENSE)

-   GPT-2: <https://huggingface.co/openai-community/gpt2>, [Modified MIT License](https://github.com/openai/gpt-2/blob/master/LICENSE)

-   GPT-J: <https://huggingface.co/EleutherAI/gpt-j-6b>, Apache-2.0 License

-   Phi2: <https://huggingface.co/microsoft/phi-2>, MIT License

-   GPT-3.5 and GPT-4: <https://chatgpt.com/> or <https://openai.com/index/openai-api/>

-   PaLM-2: <https://ai.google/discover/palm2/>

Limitations {#sec:limitations}
===========

We note that our contributions are limited by the size of the dataset. Because the maximum number that GPT-2-XL can represent with a single token is $520$, we restrict our analysis to a dataset whose operands are less than $260$. However, since Fourier features commonly exist across many different pre-trained models, as shown in Section `\ref{sec:effect_of_pretraining}`{=latex}, we believe that models still use Fourier features for larger numbers, though with a more complicated strategy.

Impact Statement {#sec:impact}
================

Our work aims to understand the potential of large language models in solving arithmetic tasks. Our paper is an interpretability paper and thus we foresee no immediate negative ethical impact. We believe improved understanding and enhancement of LLMs can lead to more robust AI systems that are capable of performing complex tasks more reliably. This can benefit areas such as automated data analysis, financial forecasting, and more.

```{=latex}
\clearpage
```
```{=latex}
\newpage
```
NeurIPS Paper Checklist {#neurips-paper-checklist .unnumbered}
=======================

1.  **Claims**

2.  Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

3.  Answer: `\answerYes{}`{=latex}

4.  Justification: In the introduction (Section `\ref{sec:intro}`{=latex}), we explicitly state the observations and their implications.

5.  Guidelines:

    -   The answer NA means that the abstract and introduction do not include the claims made in the paper.

    -   The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.

    -   The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.

    -   It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

6.  **Limitations**

7.  Question: Does the paper discuss the limitations of the work performed by the authors?

8.  Answer: `\answerYes{}`{=latex}

9.  Justification: The limitations are discussed in Section `\ref{sec:limitations}`{=latex}.

10. Guidelines:

    -   The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.

    -   The authors are encouraged to create a separate \"Limitations\" section in their paper.

    -   The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.

    -   The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.

    -   The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.

    -   The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.

    -   If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.

    -   While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

11. **Theory Assumptions and Proofs**

12. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

13. Answer: `\answerNA{}`{=latex}

14. Justification: This is an interpretation paper without any theoretical results.

15. Guidelines:

    -   The answer NA means that the paper does not include theoretical results.

    -   All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.

    -   All assumptions should be clearly stated or referenced in the statement of any theorems.

    -   The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.

    -   Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.

    -   Theorems and Lemmas that the proof relies upon should be properly referenced.

16. **Experimental Result Reproducibility**

17. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

18. Answer: `\answerYes{}`{=latex}

19. Justification: We provide the details of the experimental settings in Section `\ref{sec:detail_exp_setting}`{=latex}; by following these settings, our results can be reproduced. A detailed description of the interpretability methods can be found in Section `\ref{sec:formal_definition}`{=latex}.

20. Guidelines:

    -   The answer NA means that the paper does not include experiments.

    -   If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.

    -   If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.

    -   Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.

    -   While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example

        1.  If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.

        2.  If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.

        3.  If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).

        4.  We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

21. **Open access to data and code**

22. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

23. Answer: `\answerNo{}`{=latex}

24. Justification: The goal of this paper is to understand how LLMs compute addition. We believe the code is not central to our contribution.

25. Guidelines:

    -   The answer NA means that paper does not include experiments requiring code.

    -   Please see the NeurIPS code and data submission guidelines (<https://nips.cc/public/guides/CodeSubmissionPolicy>) for more details.

    -   While we encourage the release of code and data, we understand that this might not be possible, so "No" is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).

    -   The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (<https://nips.cc/public/guides/CodeSubmissionPolicy>) for more details.

    -   The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.

    -   The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.

    -   At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).

    -   Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

26. **Experimental Setting/Details**

27. Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

28. Answer: `\answerYes{}`{=latex}

29. Justification: We show all the details in Section `\ref{sec:detail_exp_setting}`{=latex}.

30. Guidelines:

    -   The answer NA means that the paper does not include experiments.

    -   The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.

    -   The full details can be provided either with the code, in appendix, or as supplemental material.

31. **Experiment Statistical Significance**

32. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

33. Answer: `\answerYes{}`{=latex}

34. Justification: We run the experiments with $5$ random seeds. In Figure `\ref{fig:embedding_validation_acc_from_scratch}`{=latex}, we plot error bars showing the mean and the standard deviation of the validation accuracy.

35. Guidelines:

    -   The answer NA means that the paper does not include experiments.

    -   The authors should answer \"Yes\" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.

    -   The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).

    -   The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)

    -   The assumptions made should be given (e.g., Normally distributed errors).

    -   It should be clear whether the error bar is the standard deviation or the standard error of the mean.

    -   It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.

    -   For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).

    -   If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

36. **Experiments Compute Resources**

37. Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

38. Answer: `\answerYes{}`{=latex}

39. Justification: Details of the compute resources are provided at the end of Section `\ref{sec:detail_exp_setting}`{=latex}.

40. Guidelines:

    -   The answer NA means that the paper does not include experiments.

    -   The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.

    -   The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.

    -   The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

41. **Code Of Ethics**

42. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics <https://neurips.cc/public/EthicsGuidelines>?

43. Answer: `\answerYes{}`{=latex}

44. Justification: The authors have read the NeurIPS Code of Ethics and made sure the paper follows the NeurIPS Code of Ethics in every aspect.

45. Guidelines:

    -   The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.

    -   If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.

    -   The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

46. **Broader Impacts**

47. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

48. Answer: `\answerYes{}`{=latex}

49. Justification: The potential societal impact is discussed in Section `\ref{sec:impact}`{=latex}.

50. Guidelines:

    -   The answer NA means that there is no societal impact of the work performed.

    -   If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.

    -   Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

    -   The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.

    -   The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.

    -   If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

51. **Safeguards**

52. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

53. Answer: `\answerNA{}`{=latex}

54. Justification: Our paper studies a simple addition task and dataset. We believe it poses no such risks.

55. Guidelines:

    -   The answer NA means that the paper poses no such risks.

    -   Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.

    -   Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.

    -   We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

56. **Licenses for existing assets**

57. Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

58. Answer: `\answerYes{}`{=latex}

59. Justification: The details are listed in Section `\ref{sec:detail_exp_setting}`{=latex}.

60. Guidelines:

    -   The answer NA means that the paper does not use existing assets.

    -   The authors should cite the original paper that produced the code package or dataset.

    -   The authors should state which version of the asset is used and, if possible, include a URL.

    -   The name of the license (e.g., CC-BY 4.0) should be included for each asset.

    -   For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.

    -   If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, [paperswithcode.com/datasets](paperswithcode.com/datasets) has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.

    -   For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

    -   If this information is not available online, the authors are encouraged to reach out to the asset's creators.

61. **New Assets**

62. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

63. Answer: `\answerNA{}`{=latex}

64. Justification: We did not introduce any new assets in this paper.

65. Guidelines:

    -   The answer NA means that the paper does not release new assets.

    -   Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.

    -   The paper should discuss whether and how consent was obtained from people whose asset is used.

    -   At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

66. **Crowdsourcing and Research with Human Subjects**

67. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

68. Answer: `\answerNA{}`{=latex}

69. Justification: This paper does not involve crowdsourcing nor research with human subjects.

70. Guidelines:

    -   The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

    -   Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.

    -   According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

71. **Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects**

72. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

73. Answer: `\answerNA{}`{=latex}

74. Justification: This paper does not involve crowdsourcing nor research with human subjects.

75. Guidelines:

    -   The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

    -   Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.

    -   We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.

    -   For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.

```{=latex}
\fi
```
