---
abstract:
- |
  Language models treat numbers in the same way as ordinary word tokens, which introduces two major issues: (1) embeddings of numerical tokens primarily reflect their frequency in text corpora rather than their inherent numerical properties, leading to frequency bias, and (2) numbers are often split into multiple tokens, forcing the model to aggregate these pieces to recover their values. Inspired by the observation that pre-trained Large Language Models (LLMs) internally learn Fourier-like features for number tokens, we propose **Fo**urier **N**umber **E**mbedding **(FoNE)**, a novel method that directly maps numbers into the embedding space with their Fourier features. FoNE encodes each number as a single token with only two embedding dimensions per digit, effectively capturing numerical values without fragmentation. Compared to traditional subword and digit-wise embeddings, FoNE achieves higher accuracy on arithmetic tasks, requires significantly less training data, and offers more efficient training and inference. A $38$M-parameter Transformer trained from scratch with FoNE outperforms a fine-tuned Llama-3.2-1B model on addition, subtraction, and multiplication. FoNE is also the only method that achieves $100\%$ accuracy on over 100,000 test examples across these tasks. On 6-digit decimal addition, FoNE needs 64$\times$ less data than subword and digit-wise embeddings to reach $\ge 99\%$ accuracy, while using 3$\times$ and 6$\times$ fewer tokens per number, respectively.

  ```{=latex}
  \ifdefined
  ```
  ```{=latex}
  \isarxiv
  ```
  The code and visualizations are available at <https://fouriernumber.github.io/>. `\fi`{=latex}

  ```{=latex}
  \ifdefined
  ```
  ```{=latex}
  \isarxiv
  ```
  ```{=latex}
  \centering
  ```
  ![(a) We extract all the numbers from the input sequence. (b) For each number, we use FoNE to directly map the number to its embedding. The first two entries in the embedding represent $18 \bmod 10$, while the next two entries represent $18 \bmod 100$. (c) We pad the FoNE with zeros, add it to the word embeddings, and then feed the combined embeddings into the model. (d) For each digit, we take every two entries from the last hidden state and find the number whose representation is closest to these two entries.](figures/teaserarxiv.png){#fig:teaser1 width="0.99\\linewidth"}

  ```{=latex}
  \else
  ```
  ```{=latex}
  \fi
  ```
- |
  Language models treat numbers in the same way as ordinary word tokens, which introduces two major issues: (1) embeddings of numerical tokens primarily reflect their frequency in text corpora rather than their inherent numerical properties, leading to frequency bias, and (2) numbers are often split into multiple tokens, forcing the model to aggregate these pieces to recover their values. Inspired by the observation that pre-trained Large Language Models (LLMs) internally learn Fourier-like features for number tokens, we propose **Fo**urier **N**umber **E**mbedding **(FoNE)**, a novel method that directly maps numbers into the embedding space with their Fourier features. FoNE encodes each number as a single token with only two embedding dimensions per digit, effectively capturing numerical values without fragmentation. Compared to traditional subword and digit-wise embeddings, FoNE achieves higher accuracy on arithmetic tasks, requires significantly less training data, and offers more efficient training and inference. A $38$M-parameter Transformer trained from scratch with FoNE outperforms a fine-tuned Llama-3.2-1B model on addition, subtraction, and multiplication. FoNE is also the only method that achieves $100\%$ accuracy on over 100,000 test examples across these tasks. On 6-digit decimal addition, FoNE needs 64$\times$ less data than subword and digit-wise embeddings to reach $\ge 99\%$ accuracy, while using 3$\times$ and 6$\times$ fewer tokens per number, respectively.

  ```{=latex}
  \ifdefined
  ```
  ```{=latex}
  \isarxiv
  ```
  The code and visualizations are available at <https://fouriernumber.github.io/>. `\fi`{=latex}

  ```{=latex}
  \ifdefined
  ```
  ```{=latex}
  \isarxiv
  ```
  ```{=latex}
  \centering
  ```
  ![(a) We extract all the numbers from the input sequence. (b) For each number, we use FoNE to directly map the number to its embedding. The first two entries in the embedding represent $18 \bmod 10$, while the next two entries represent $18 \bmod 100$. (c) We pad the FoNE with zeros, add it to the word embeddings, and then feed the combined embeddings into the model. (d) For each digit, we take every two entries from the last hidden state and find the number whose representation is closest to these two entries.](figures/teaserarxiv.png){#fig:teaser1 width="0.99\\linewidth"}

  ```{=latex}
  \else
  ```
  ```{=latex}
  \fi
  ```
author:
- |
  **Tianyi Zhou**    **Deqing Fu**    **Mahdi Soltanolkotabi**    **Robin Jia**    **Vatsal Sharan** `\vspace{2mm}`{=latex}\
  Department of Computer Science\
  University of Southern California\
  Los Angeles, CA 90089\
  `{tzhou029,deqingfu,soltanol,robinjia,vsharan}@usc.edu`\
- |
  Tianyi Zhou$^{1}$, Deqing Fu$^{1}$, Mahdi Soltanolkotabi$^{1}$, Robin Jia$^{1}$, Vatsal Sharan$^{1}$\
  \
  $^{1}$Department of Computer Science, University of Southern California\
  `{tzhou029,deqingfu,soltanol,robinjia,vsharan}@usc.edu`
bibliography:
- ref.bib
- ref.bib
title:
- |
  FoNE: Precise Single-Token\
  Number Embeddings via Fourier Features
- |
  FoNE: Precise Single-Token Number\
  Embeddings via Fourier Features
---

```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
```{=latex}
\usepackage{lineno}
```
```{=latex}
\def\linenumberfont{\normalfont\small}
```

```{=latex}
\else
```
```{=latex}
\maketitle
```
```{=latex}
\fi
```
```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
```{=latex}
\begin{titlepage}
  \maketitle
  \begin{abstract}


Language models treat numbers in the same way as ordinary word tokens, which introduces two major issues: (1) embeddings of numerical tokens primarily reflect their frequency in text corpora rather than their inherent numerical properties, leading to frequency bias, and (2) numbers are often split into multiple tokens, forcing the model to aggregate these pieces to recover their values.  Inspired by the observation that pre-trained Large Language Models (LLMs) internally learn Fourier-like features for number tokens, we propose \textbf{Fo}urier \textbf{N}umber \textbf{E}mbedding \textbf{(FoNE)}, a novel method that directly maps numbers into the embedding space with their Fourier features. FoNE encodes each number as a single token with only two embedding dimensions per digit, effectively capturing numerical values without fragmentation. 
Compared to traditional subword and digit-wise embeddings, FoNE achieves higher accuracy on arithmetic tasks, requires significantly less training data, and offers more efficient training and inference. 
A $38$M-parameter Transformer trained from scratch with FoNE outperforms a fine-tuned Llama-3.2-1B model on addition, subtraction, and multiplication. FoNE is also the only method that achieves $100\%$ accuracy on over 100,000 test examples across these tasks. On 6-digit decimal addition, FoNE needs 64$\times$ less data than subword and digit-wise embeddings to reach $\ge 99\%$ accuracy, while using 3$\times$ and 6$\times$ fewer tokens per number, respectively.




\ifdefined\isarxiv
The code and visualizations are available at \url{https://fouriernumber.github.io/}.
\fi




\ifdefined\isarxiv
\begin{figure}[!ht]
    \centering
    \includegraphics[width=0.99\linewidth]{figures/teaserarxiv.pdf}
    \caption{(a) We extract all the numbers from the input sequence.
     (b) For each number, we use FoNE to directly map the number to its embedding. The first two entries in the embedding represent \( 18 \bmod 10 \), while the next two entries represent \( 18 \bmod 100 \).
     (c) We pad the FoNE with zeros, add it to the word embeddings, and then feed the combined embeddings into the model.
     (d) For each digit, we take every two entries from the last hidden state and find the number whose representation is closest to these two entries.}
    \label{fig:teaser1}
\end{figure}
\else
\fi

  \end{abstract}
  \thispagestyle{empty}
\end{titlepage}
```
```{=latex}
\newpage
```
```{=latex}
\else
```
```{=latex}
\fi
```
Introduction
============

LLMs require precise representations of numerical data to perform number-related tasks effectively. However, since LLMs treat numbers just like any other token, embeddings of numerical tokens do not systematically capture important numerical features. As a result, it is challenging for even billion-parameter models to achieve perfect accuracy on simple arithmetic tasks [^1] [@saxton2019analysing; @dziri2024faith; @lee2023teaching; @shen2023positional; @zhou2023algorithms]. While generating code can be a useful workaround, relying solely on this capability highlights a fundamental limitation: without a proper understanding of numbers, the model cannot fully grasp concepts critical to domains like mathematical theorems, physics laws, or quantitative reasoning. Even with approaches like Chain-of-Thought (CoT) prompting [@wei2022chain], perfect accuracy on basic arithmetic remains an important foundation for more complex reasoning.

Standard tokenization approaches, such as subword tokenization (e.g., GPT-4o [@achiam2023gpt], Llama-3 [@dubey2024llama], Phi-2 [@abdin2024phi]) or digit-wise tokenization (e.g., Llama-2 [@touvron2023llama], Mistral [@jiang2023mistral]), require the model to aggregate multiple tokens to understand numbers and introduce inefficiencies by splitting one number into several tokens. Beyond this inefficiency, tokenization creates a deeper challenge for representation. Numbers, unlike words, require systematic, frequency-agnostic representations, yet LLMs often exhibit a frequency bias [@razeghi2022impact; @shrestha2025mathematical; @shao2025benford], predicting numbers based on their prevalence in the training data rather than their mathematical properties.

We draw inspiration from interpretability analyses of LLMs, which reveal that models internally develop Fourier-like features. Specifically, pre-trained models embed number tokens using a sparse set of features in the Fourier domain [@zhou2024pre]. These features capture both the magnitude and the exact value of a number, both of which are critical for solving arithmetic tasks [@zhou2024pre]. However, because numbers are split into subwords and their embeddings are learned from co-occurrence statistics in text during pre-training, current LLMs fail to learn precise numerical representations and struggle to extend these mechanisms to larger numbers, underscoring the need for more systematic approaches to numerical representation.

In this paper, we propose a novel approach called Fourier Number Embedding (FoNE), which directly maps numbers to their Fourier representations, bypassing the tokenization step entirely. By representing each digit using *cosine and sine functions with different periods*, as shown in the left panel of Figure `\ref{fig:teaser-three}`{=latex}, FoNE ensures precise representation of numbers. FoNE represents each digit using only two dimensions in the embedding vector. This compact design not only reduces computational overhead but also creates opportunities for future extensions by incorporating additional features to better capture numeric properties. By embedding and predicting numbers directly as single tokens, our method eliminates the need for multiple forward passes and token aggregation, significantly enhancing computational efficiency. Furthermore, we provide a theoretical justification for why FoNE can represent numbers accurately as single tokens, leveraging the modular encoding properties of trigonometric functions to ensure exact recovery of each digit through periodic embeddings.

Beyond theoretical justification, we demonstrate the effectiveness of FoNE through extensive experiments on arithmetic tasks, including addition, subtraction, and multiplication. Our results show that FoNE is the only approach which---when used to train a Transformer from scratch---achieves perfect accuracy on multiple arithmetic tasks while requiring significantly less training data and fewer model parameters compared to existing methods. Moreover, FoNE offers faster training and inference times by encoding each number into a single token. On 6-digit decimal addition, it achieves $\geq 99\%$ accuracy using $64\times$ less data than subword or digit-wise embeddings, while cutting token usage per number by $3\times$ and $6\times$, respectively. These findings underscore FoNE's capacity to represent and manipulate numerical data both efficiently and precisely within large language models.

```{=latex}
\subfigure{%
    \begin{minipage}[t]{0.52\textwidth}
      \includegraphics[width=\linewidth]{figures/encoder.pdf}
      \label{fig:enc}
    \end{minipage}
  }
```
```{=latex}
\hspace{-7mm}
```
```{=latex}
\hfill
```
```{=latex}
\subfigure{%
    \begin{minipage}[t]{0.47\textwidth}
      \includegraphics[width=0.5\linewidth]{figures/add_decimal_datasize_error.png}\hspace{-2mm}\hfill
      \includegraphics[width=0.5\linewidth]{figures/add_decimal_modelsize_error.png}
      \label{fig:err-both}
    \end{minipage}
  }
```
```{=latex}
\vspace{-5mm}
```
```{=latex}
\vspace{-4mm}
```
`\label{fig:teaser-three}`{=latex}

```{=latex}
\vspace{-2mm}
```
Related Work
============

```{=latex}
\vspace{-2mm}
```
#### Arithmetic and Number-Related Tasks in LLMs.

Using language models for number-related tasks, including solving math problems [@saxton2019analysing; @yu2023metamath; @meidani2023snip], time-series prediction [@tan2024language; @ma2024survey; @zhou2023one; @liu2024taming; @jin2023time; @cao2023tempo; @li2025climatellm], quantitative reasoning [@mcleish2024benchmarking; @liu2024llms; @chen2023theoremqa; @jin2024cladder; @cobbe2021training], and handling tabular data [@gao2024raw; @fang2024large; @sahakyan2021explainable], remains a significant challenge. Despite advancements in transformer-based models, LLMs such as Qwen3-235B and GPT-5, with billions of parameters, struggle to solve simple arithmetic problems involving multi-digit addition and multiplication across multiple forward passes [@dziri2024faith; @feng2024numerical], even when using scratchpads [@nye2021show].

@golkar2023xval [@sundararaman2020methods; @jiang2019learning; @sivakumar2024leverage] introduce number embedding methods to enhance model performance on number-related tasks. However, the range of numbers these methods can accurately represent is typically limited to fewer than five digits, and they fail to achieve perfect accuracy. Additionally, a line of research [@mcleish2024transformers; @shen2023positional] incorporates the positional information of digits into embeddings or adds it as extra tokens [@nogueira2021investigating]. @lee2023teaching demonstrate that smaller transformer models can successfully handle multiplication when equipped with carefully designed scratchpads. However, these approaches are tailored specifically for arithmetic tasks and are difficult to integrate seamlessly into general-purpose LLM training. @thawani2021representing explores encoding strategies like digit-by-digit, scientific notation, and base-10 formats, while @jiang2019learning maps numbers to finite "prototype numerals". These methods help the model align digits of equal significance but often require digit-wise tokenization and introduce additional tokens, reducing training and prediction efficiency. In contrast, the method proposed in this paper precisely encodes each number as a single token, eliminating range limitations and avoiding the efficiency drawbacks associated with previous approaches (see Section `\ref{sec:discussion}`{=latex} for further details).

Methods
=======

Building on insights from prior studies [@zhou2024pre] that highlight the importance of Fourier features in numerical embeddings, we propose Fourier Number Embedding. Unlike existing methods that often require digit-wise tokenization or pre-training to handle numeric tasks, FoNE directly maps numbers into compact Fourier representations. Sections `\ref{sec:fne}`{=latex}, `\ref{sec:fnp}`{=latex}, and `\ref{sec:inco}`{=latex} describe our embedding, decoding, and integration methods, respectively. The complete process is shown in Figure `\ref{fig:decoder}`{=latex}.

Fourier Number Embedding (FoNE) {#sec:fne}
-------------------------------

We first introduce the following function that maps each $x \in \R$ to a point on the unit circle.

```{=latex}
\begin{definition}[Circular embedding]\label{def:circular}
Let $T > 0$ be a given period. We define the function $\phi: \R \times \R_{>0} \rightarrow \R^2$ as $$
\phi(x, T) :=  \left(\cos \left(\tfrac{2\pi}{T}x \right),\sin \left(\tfrac{2\pi}{T}x \right) \right).$$

\end{definition}
```
Next, we formally define FoNE, which directly maps any floating point number $x$ to an embedding. We predefine $m$ and $n$ as the maximum number of digits before and after the decimal point, respectively. Note that $m$ and $n$ are global hyperparameters fixed for the entire dataset or model, rather than chosen per number.

```{=latex}
\begin{definition}[Fourier Number Embedding]Let $m$ be the integer digit length, and $n$ be the decimal digit length. We define the Fourier Number Embedding function $\FoNE: \R \rightarrow \R^{2(m+n)}$ for an input number $x$ as follows:
\begin{align*}
    \FoNE(x,m,n) := \bigl[\phi(x,T_{-n+1});\phi(x,T_{-n+2}) ;\dots;\phi(x,T_m)\bigr],
\end{align*}
where $T_i = 10^{i}$ for each integer $i$ in the range $-n+1$ to $m$.
\end{definition}
```
To align the embedding dimensions of FoNE with the model's input embedding dimension $d$, we map the Fourier Number Embedding, which lies in $\mathbb{R}^{2(m+n)}$, to $\mathbb{R}^d$. This mapping can be achieved in two ways: (1) by applying a learnable linear transformation $\mathbf{W} \in \mathbb{R}^{d \times 2(m+n)}$, or (2) by appending zeros to the embedding vector to match the dimensionality $d$. As demonstrated in Section `\ref{sec:ablation}`{=latex}, both approaches achieve comparable results.
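
To make the construction concrete, the following NumPy sketch implements the two definitions above with the zero-padding variant; the helper names (`phi`, `fone`, `pad_to_model_dim`) are ours for illustration and this is a minimal sketch, not a reference implementation.

```python
import numpy as np

def phi(x: float, T: float) -> np.ndarray:
    """Circular embedding: map x to a point on the unit circle with period T."""
    angle = 2 * np.pi * x / T
    return np.array([np.cos(angle), np.sin(angle)])

def fone(x: float, m: int, n: int) -> np.ndarray:
    """Fourier Number Embedding: concatenate phi(x, 10^i) for i = -n+1, ..., m."""
    return np.concatenate([phi(x, 10.0 ** i) for i in range(-n + 1, m + 1)])

def pad_to_model_dim(e: np.ndarray, d: int) -> np.ndarray:
    """Zero-pad the 2(m+n)-dimensional FoNE to the model embedding dimension d."""
    out = np.zeros(d)
    out[: e.shape[0]] = e
    return out

# Embed 4.17 with m = 2 integer digits and n = 2 decimal digits:
# periods 0.1, 1, 10, 100 give a 2 * (m + n) = 8-dimensional embedding.
e = fone(4.17, m=2, n=2)
print(e.shape, pad_to_model_dim(e, d=64).shape)   # (8,) (64,)
```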

FoNE's Representational Properties
----------------------------------

We now introduce an elementary lemma and use it to show why FoNE preserves numeracy.

```{=latex}
\begin{lemma}[Informal version of Lemma \ref{lem:fne_preserve_numeracy:formal}]\label{lem:fne_preserve_numeracy:informal}
    Given the pair 
    $ 
    \left(\cos\left(\tfrac{2\pi}{T}x\right), \sin\left(\tfrac{2\pi}{T}x\right)\right)
    $, we can recover 
    % \begin{align*}
$x \bmod T$.
    % \end{align*}
\end{lemma}
```
Hence, by applying Lemma `\ref{lem:fne_preserve_numeracy:informal}`{=latex} to each frequency component in FoNE, we immediately obtain the following result.

```{=latex}
\begin{lemma}[FoNE preserves numeracy]\label{lem:fne_numeracy}
    Given a number's Fourier Number Embedding $\FoNE(x)$, its integer digit length $m$, and the decimal digit length $n$, by using Lemma~\ref{lem:fne_preserve_numeracy:informal}, we can recover $x \bmod 10^{i}$ for each integer $i$ in the range $-n+1$ to $m$.

\end{lemma}
```
A natural question is why we need $x \bmod 10$ if we already know $x \bmod 100$. The reason is that although knowing $x \bmod 100$ exactly suffices to recover $x \bmod 10$, this recovery is noisy in practice. When $T$ becomes very large in a circular embedding (Definition `\ref{def:circular}`{=latex}), the difference $\frac{2\pi}{T} (x+1) -\frac{2\pi}{T} x$ approaches zero, causing the embedded representations of $x$ and $x+1$ to become arbitrarily close on the unit circle. Consequently, a single large $T$ cannot sufficiently distinguish adjacent values in the embedding. Hence, one must choose $T$ across a broad range of scales to ensure that the embedding remains adequately distinguishable for all values of $x$. In this paper, we choose $T = 10^i$ for every digit position $i$, so that each $T$ effectively captures one digit of $x$.

To provide a clear illustration of our method, we present a detailed example demonstrating how we map the number $4.17$ to its embedding.

```{=latex}
\begin{example}
Consider $x = 4.17$. Its Fourier Number Embedding is given by 
$$
[\phi(4.17,0.1); \phi(4.17,1); \phi(4.17,10)],
$$
where $\phi$ is defined in Definition \ref{def:circular}.
From these components, by using Lemma \ref{lem:fne_preserve_numeracy:informal}, we can recover
$$
[4.17 \bmod 0.1, 4.17 \bmod 1, 4.17 \bmod 10],\footnote{  For real \(x\) and positive real \(m\), \(x \bmod m\) is defined as 
$ x - m \cdot \left\lfloor \frac{x}{m} \right\rfloor,$
yielding a value in the range \([0, m)\)}$$
which simplifies to $[0.07,0.17,4.17]$. If we used only $T = 10$, then $\phi(4.17,10)$ would be nearly indistinguishable from $\phi(4.18,10)$, causing the embedding to lose fine-grained information about less significant digits. However, with these chosen periods $T$, we can capture all the digits.
\end{example}
```
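
As a quick numerical check of Lemma `\ref{lem:fne_preserve_numeracy:informal}`{=latex} and the example above, the sketch below recovers $x \bmod T$ from each $(\cos, \sin)$ pair via the angle of the point on the unit circle; `recover_mod` is an illustrative helper, not part of the method itself.

```python
import numpy as np

def recover_mod(pair: np.ndarray, T: float) -> float:
    """Recover x mod T from (cos(2*pi*x/T), sin(2*pi*x/T)) using the angle of the point."""
    angle = np.arctan2(pair[1], pair[0])        # angle in (-pi, pi]
    return (angle * T / (2 * np.pi)) % T        # map the angle back to [0, T)

x = 4.17
for T in (0.1, 1.0, 10.0):
    pair = np.array([np.cos(2 * np.pi * x / T), np.sin(2 * np.pi * x / T)])
    print(T, round(recover_mod(pair, T), 4))    # 0.07, 0.17, 4.17 (up to float error)
```
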
Decoding {#sec:fnp}
--------

```{=latex}
\centering
```
![Left: Each number in the input sequence is replaced by a special token \[NUM\] and embedded as the sum of the \[NUM\] token embedding, its FoNE (see Figure `\ref{fig:enc}`{=latex}), and the standard position embedding (if used by the architecture; Llama-3.2 does not use an explicit position embedding). Right: At decoding time, every pair of hidden-state entries corresponds to one digit, e.g., the first two entries $h_0$ and $h_1$ correspond to the units digit. The model identifies the digit whose circular Fourier representation best matches those two entries. Digits are then combined according to their positional weights. ](figures/decoder.png){#fig:decoder width="0.99\\linewidth"}

```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
```{=latex}
\begin{algorithm}[!ht]\caption{Fourier Number Loss \& Prediction}
\label{alg:fourier_loss_prediction}
\begin{algorithmic}[1]

    %--------------------------------------------------
    % Digit-wise Loss
    %--------------------------------------------------
    \Function{FourierNumberLossFunction}{$h, y, i$}
        \State $y_i \gets \text{the $i$-th digit of } y$
        \State $a \gets \bigl[h[2i],h[2i+1]\bigr]$
        \State $b \gets  [\phi(0,10),
                \phi(1,10),
                \cdots,
                \phi(9,10)]^\top$
        \State $\text{logits} \gets a \cdot b$
        \State $\text{loss} \gets L_{\mathrm{CE}}(y_i,\ \text{logits})$ \Comment{\textit{Cross-entropy loss for digit $i$}}
        \State \Return \text{loss}
    \EndFunction

    %--------------------------------------------------
    % Digit-wise Prediction
    %--------------------------------------------------
    \Function{FourierNumberPrediction}{$h, i$}
    \Comment{\textit{Prediction for digit $i$}}
        \State $\text{logits} \gets 
        \bigl[h[2i],h[2i+1]\bigr]
            \cdot
            \bigl[
                \phi(j,10)
            \bigr]_{j=0,\dots,9}$
        \State $\hat{y}_i \gets \arg\max_{j \in \{0,\dots,9\}} \text{logits}[j]$
        \State \Return $\hat{y}_i$
    \EndFunction

\end{algorithmic}
\end{algorithm}
```
```{=latex}
\else
```
```{=latex}
\fi
```
As each number has its own FoNE, calculating the logits for all possible numbers becomes computationally infeasible. Therefore, we introduce a novel decoding head that maps hidden states from Fourier space to number space as shown in Figure `\ref{fig:decoder}`{=latex}. Below, we explicitly define the loss function and prediction function for each digit and then show how to combine these to obtain the final loss and prediction.

```{=latex}
\begin{definition}[Fourier Number Loss Function]Let \(h \in \mathbb{R}^{d}\) denote the last-layer hidden state of the model. Let $y_i$ denote the $i$-th digit of the label number $y$. Let $L_{\mathrm{CE}}$ denote the cross entropy loss. For digit $i$, we define the Fourier Number Loss Function $L_{\FoNE}$ as:
\vspace{-1.5mm}
\begin{align*}
    L_{\FoNE} (h, y, i) :=    L_{\mathrm{CE}}\!\Bigl(y_i,  (\underbrace{[
                h[2i], ~h[2i+1]
            ]}_{1 \times 2}
            \cdot
            \underbrace{\begin{bmatrix}
                \phi(0,10);
                \dotsb; \phi(9,10)
            \end{bmatrix}^{\top}}_{2\times 10})
       \Bigr).
\end{align*}
\end{definition}
```
```{=latex}
\vspace{-3mm}
```
Note that the comparison set $\{\phi(0,10), \ldots, \phi(9,10)\}$ is the same for all digit positions $i$. This is because each digit, regardless of its positional significance, must be classified as one of the 10 possible values $\{0, 1, \ldots, 9\}$. This design enables parallel decoding where each digit position is treated as a separate classification task over the same 10-class space. The final training loss is obtained by averaging $L_{\FoNE}(h, y, i)$ over all digit positions $i$.

```{=latex}
\begin{definition}[Fourier Number Prediction for the \(i\)-th digit]Let \(h \in \mathbb{R}^{d}\) denote the last-layer hidden state of the model. For digit $i$, we define the Fourier Number Prediction as:
\vspace{-1mm}
\begin{align*}
 \hat{y}_i := 
    \arg\max_{j \in \{0,\dots,9\}}
    \Bigl(
        \big[   h[2i], ~h[2i+1]
            \big]
            \cdot
            \big[\phi(j,10)\big]
    \Bigr).
\end{align*}
\end{definition}
```
```{=latex}
\vspace{-2mm}
```
`\noindent`{=latex} Here, $\hat{y}_i$ is determined by the similarity between the hidden states and the circular embeddings of the numbers in $\{0,\dots,9\}$, as illustrated in Figure `\ref{fig:teaser1}`{=latex}(d). Once we have computed $\hat{y}_i$ for each digit $i$, the final prediction for the entire number is formed by concatenating these digit-wise predictions. We defer the detailed algorithms to Appendix `\ref{app:fourier_final_loss_prediction}`{=latex}.
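
The digit-wise loss and prediction above can be sketched in a few lines of NumPy. The names `DIGIT_PROTOS` and `digit_logits` are ours for illustration, and the hidden state `h` is assumed to already hold one $(\cos, \sin)$ pair per digit in its leading entries, with $i = 0$ the least-significant encoded digit, following the ordering in the definitions above.

```python
import numpy as np

# phi(j, 10) for j = 0..9: the shared comparison set for every digit position.
DIGIT_PROTOS = np.stack([[np.cos(2 * np.pi * j / 10), np.sin(2 * np.pi * j / 10)]
                         for j in range(10)])                 # shape (10, 2)

def digit_logits(h, i):
    """Logits for digit i: inner products of (h[2i], h[2i+1]) with the 10 prototypes."""
    return DIGIT_PROTOS @ h[2 * i: 2 * i + 2]                 # shape (10,)

def fourier_number_loss(h, y_digits):
    """Training loss: average cross-entropy over all digit positions."""
    losses = []
    for i, y_i in enumerate(y_digits):
        logits = digit_logits(h, i)
        log_probs = logits - (logits.max() + np.log(np.exp(logits - logits.max()).sum()))
        losses.append(-log_probs[y_i])
    return float(np.mean(losses))

def fourier_number_prediction(h, num_digits):
    """Prediction: pick, for each digit, the prototype closest to its hidden pair."""
    return [int(np.argmax(digit_logits(h, i))) for i in range(num_digits)]
```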

Incorporating FoNE into Input Sequences {#sec:inco}
---------------------------------------

To incorporate FoNE, we create one special token \[NUM\] and add it to the vocabulary. To generate any number, the model must first generate this token. We can then remove the tokens corresponding to numbers from the vocabulary. In practice, numbers appear in text in varying formats (e.g., "1,234.56", "\$99.99", "3.14e-2"). FoNE handles this diversity by first applying a standard numeric-string parser that detects numbers, ensuring consistent representation regardless of the original textual format.

**Integration Procedure.** The integration of FoNE into input sequences proceeds as follows, as illustrated in Figure 2 and Figure 6:

1.  Extract all numbers from the input sequence using the numeric parser to create a number list. Each detected number is replaced with \[NUM\] and its parsed canonical value. Tokenize the modified sequence to obtain a token list.

2.  Embed the token list using standard word embedding methods.

3.  Map each canonical number value from the number list to its FoNE representation using Algorithm 1 (Section 3.1).

4.  Add the FoNE to the word embedding of the corresponding \[NUM\] token.

5.  Feed the combined embeddings into the model.

6.  Use the model's output embeddings to predict the next token in the sequence.

7.  If the predicted token is \[NUM\], decode the numerical value using the method described in Section 3.3, or compute the loss during training.

This procedure ensures that FoNE embeddings are seamlessly integrated into the input sequence. Transformer architectures can effectively aggregate information across tokens; with FoNE, each numeric token carries a precise numerical representation, enabling the model to aggregate and interpret numerical information more reliably than with frequency-biased standard embeddings.
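
A schematic sketch of steps 1-5 of this procedure is shown below. It assumes a simple regex-based numeric parser, a hypothetical `tokenizer` and `word_embed` lookup, and reuses `fone` and `pad_to_model_dim` from the earlier embedding sketch; real text requires the more robust parser described above.

```python
import re
import numpy as np

NUM_RE = re.compile(r"\d+(?:\.\d+)?")   # simplified parser: plain integers and decimals only

def embed_with_fone(text, tokenizer, word_embed, num_token_id, d, m=6, n=3):
    numbers = [float(s) for s in NUM_RE.findall(text)]        # step 1: extract the numbers
    token_ids = tokenizer(NUM_RE.sub("[NUM]", text))           # step 1: tokenize with [NUM]
    embeds = np.stack([word_embed[t] for t in token_ids])      # step 2: word embeddings
    num_iter = iter(numbers)
    for pos, t in enumerate(token_ids):
        if t == num_token_id:                                   # steps 3-4: add padded FoNE
            embeds[pos] += pad_to_model_dim(fone(next(num_iter), m, n), d)
    return embeds                                               # step 5: feed to the model
```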

Empirical Evaluation
====================

Experimental Setting {#sec:exp_setting}
--------------------

We evaluate the performance of our proposed *FoNE* method on arithmetic tasks designed to benchmark different number embedding methods. The dataset includes tasks such as 6-digit integer addition, 6-digit decimal addition (with 3 digits after the decimal), 5-digit integer subtraction, 3-digit integer multiplication, and 4-digit integer multiplication. These tasks are curated to measure model capabilities in accurate numeric computation, while remaining within ranges where baseline embedding methods are still competitive---since their performance degrades rapidly for larger numbers. To further probe the scalability of our approach, we additionally evaluate FoNE on 60-digit addition in Section `\ref{sec:application}`{=latex}, which highlights its ability to handle much larger operands where other embeddings fail.

#### Dataset.

Each example in the dataset is formatted as `[operand a][operator][operand b]=`, where the operands `a` and `b` are sampled based on the operation type. For addition and multiplication, we ensure $a \leq b$ to avoid duplication (e.g., $a + b$ and $b + a$ are treated as identical and included only once). For subtraction, we enforce $a \geq b$ to ensure non-negative results. For an $x$-digit dataset, each operand can have up to $x$ digits. The dataset is divided into training, validation, and test subsets as shown in Table `\ref{tab:dataset_sizes}`{=latex} in Appendix `\ref{app:exp}`{=latex}.
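
For concreteness, a minimal sketch of this sampling scheme might look as follows; the uniform operand sampling here is purely illustrative, not the exact distribution used in our experiments.

```python
import random

def make_example(op: str, x_digits: int) -> str:
    """One example in the format `[operand a][operator][operand b]=` (illustrative sampling)."""
    a = random.randint(0, 10 ** x_digits - 1)
    b = random.randint(0, 10 ** x_digits - 1)
    if op in ("+", "*") and a > b:   # keep a <= b so a+b and b+a are not both included
        a, b = b, a
    if op == "-" and a < b:          # keep a >= b so the result is non-negative
        a, b = b, a
    return f"{a}{op}{b}="

print(make_example("+", 6))   # e.g. "40318+812643="
```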

#### Baselines.

We compare our proposed FoNE method against several baseline methods for numeric embeddings. First, we consider digit-wise tokenization, where each digit in a number is treated as an individual token. Second, we evaluate subword tokenization, where numeric values are tokenized into subword units based on the Llama-3.2-1B tokenizer's default vocabulary. Third, we include the [xVal]{.smallcaps} method [@golkar2023xval], which leverages explicit value-based representations for numeric computation. As [xVal]{.smallcaps} predicts floating-point numbers, its predictions are rounded before computing accuracy. Finally, we fine-tune pre-trained LLMs on the same dataset for comparison.

#### Setup.

Our experiments involve multiple configurations of randomly initialized transformer-based models. We evaluate models of varying sizes, ranging from small to large architectures, as defined in Table `\ref{tab:model_config}`{=latex}. For the accuracy vs. training data size experiments, we use a configuration similar to Llama-3.2 but with 38M parameters. In Appendix `\ref{sec:gpt2}`{=latex}, we conduct more experiments on a transformer model with a different configuration and observe consistent results.

Learning rates were determined through an extensive search, with the best rate selected separately for each method based on validation performance. Model evaluation used exact-match accuracy to assess numeric prediction correctness. All models were trained from random initialization, except the fine-tuned Llama-3.2-1B baseline. We varied the training data size by uniformly sampling subsets and adjusted model sizes to compare accuracy across methods.

Experiment Results {#sec:exp_results}
------------------

```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
```{=latex}
\else
```
```{=latex}
\vspace{-3mm}
```
```{=latex}
\begin{figure*}[!ht]\centering
  \subfigure[6-digit integer addition]{
    \begin{minipage}{0.48\textwidth}
      \centering
      \includegraphics[width=0.49\textwidth]{figures/add_datasize_error}
      \includegraphics[width=0.49\textwidth]{figures/add_modelsize_error.png}
    \end{minipage}
    \label{fig:addition}
  }
  \subfigure[5-digit integer subtraction]{
    \begin{minipage}{0.48\textwidth}
      \centering
      \includegraphics[width=0.49\textwidth]{figures/sub_datasize_error.png}
      \includegraphics[width=0.49\textwidth]{figures/sub_modelsize_error.png}
    \end{minipage}
    \label{fig:decimal_addition}
  }
  \subfigure[3-digit integer multiplication]{
    \begin{minipage}{0.48\textwidth}
      \centering
      \includegraphics[width=0.49\textwidth]{figures/mul_datasize_error.png}
      \includegraphics[width=0.49\textwidth]{figures/mul_modelsize_error.png}
    \end{minipage}
    \label{fig:subtraction}
  }
  \subfigure[4-digit integer multiplication]{
    \begin{minipage}{0.48\textwidth}
      \centering
      \includegraphics[width=0.49\textwidth]{figures/mul4_datasize_error.png}
      \includegraphics[width=0.49\textwidth]{figures/mul4_modelsize_error.png}
    \end{minipage}
    \label{fig:multiplication}
  }
  \caption{Comparison of accuracy for various arithmetic tasks with respect to model and data size.}
  \label{fig:result} 
\end{figure*}
```
```{=latex}
\fi
```
```{=latex}
\vspace{-3mm}
```
```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
```{=latex}
\else
```
```{=latex}
\fi
```
```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
```{=latex}
\begin{figure*}[!ht]\centering
\subfigure[ 6-digit decimal addition: Acc. vs. Training Data Size]{\includegraphics[width=0.49\textwidth]{figures/results/adddecimaldataacc.png}}
\subfigure[ 6-digit decimal addition: Acc. vs. Model Size]{\includegraphics[width=0.49\textwidth]{figures/results/adddecimalmodelacc.png}}
\caption{
We train Llama-3.2-1B from scratch with random initialization using different number embedding methods on 6-digit decimal addition. The test accuracy is compared across varying data sizes and model sizes.
}
\label{fig:6digitdecimalacc}
\end{figure*}
```
```{=latex}
\begin{figure*}[!ht]\centering
  \subfigure[6-digit integer addition: Model\&Data size vs. Acc.]{
    \begin{minipage}{0.48\textwidth}
      \centering
      \includegraphics[width=0.49\textwidth]{figures/results/add1dataacc_small.png}
      \hspace{-2mm}
      \includegraphics[width=0.49\textwidth]{figures/results/add1modelacc_small.png}
    \end{minipage}
    \label{fig:addition}
  } 
  \hspace{-1mm}
  \subfigure[5-digit integer subtraction: Model\&Data size vs. Acc.]{
    \begin{minipage}{0.48\textwidth}
      \centering
      \includegraphics[width=0.49\textwidth]{figures/results/sub1dataacc_small.png}
      \hspace{-2mm}
      \includegraphics[width=0.49\textwidth]{figures/results/sub1modelacc_small.png}
    \end{minipage}
    \label{fig:decimal_addition}
  }
  \hspace{-1mm}
  \subfigure[3-digit integer multiplication: Model\&Data size vs. Acc.]{
    \begin{minipage}{0.48\textwidth}
      \centering
      \includegraphics[width=0.49\textwidth]{figures/results/mul1dataacc_small.png}
      \hspace{-2mm}
      \includegraphics[width=0.49\textwidth]{figures/results/mul1modelacc_small.png}
    \end{minipage}
    \label{fig:subtraction}
  }
  \hspace{-1mm}
  \subfigure[4-digit integer multiplication: Model\&Data size vs. Acc.]{
    \begin{minipage}{0.48\textwidth}
      \centering
      \includegraphics[width=0.49\textwidth]{figures/results/mul2dataacc_small.png}
      \hspace{-2mm}
      \includegraphics[width=0.49\textwidth]{figures/results/mul2modelacc_small.png}
    \end{minipage}
    \label{fig:multiplication}
  }
  \caption{Comparison of accuracy trends for various arithmetic tasks with respect to model size and data size.}
  \label{fig:result} 
\end{figure*}
```
```{=latex}
\else
```
```{=latex}
\fi
```
#### Data Efficiency.

The middle panel of Figure `\ref{fig:teaser-three}`{=latex} illustrates the accuracy trends of different embedding methods as the data size increases for the 6-digit decimal addition task. Remarkably, our model achieves $99\%$ accuracy with just $6,400$ training samples and $37.55$ million parameters, requiring $64\times$ less training data than traditional embedding methods ($409,600 / 6,400 = 64$). Even with only $3,200$ training samples, our method outperforms the fine-tuned Llama-3.2 model. Additionally, it achieves perfect accuracy with $51,200$ training samples.

Beyond synthetic tasks, our approach also improves compute efficiency in real-world scenarios. For instance, FoNE requires only $149.25$ tokens on average to represent numerical values from a table in the WikiTableQuestions dataset [@pasupat-liang-2015-compositional], compared to $329.7$ tokens used by a digit-wise tokenizer. This significant reduction in token usage highlights the efficiency of our method in encoding numerical data, making it more scalable for number-heavy tasks.

#### Parameter Efficiency.

The right panel of Figure `\ref{fig:teaser-three}`{=latex} shows the accuracy trends of different embedding methods as the model size increases for the 6-digit decimal addition task. Our method achieves $97\%$ accuracy with just $1$ layer and $8.31$ million parameters using $200k$ examples for training. Furthermore, with $26.62$ million parameters, it surpasses the fine-tuned Llama-3.2 model and achieves $100\%$ accuracy.

#### Different Tasks.

We conducted the same experiments across all different datasets. As shown in Figure `\ref{fig:result}`{=latex}, our method consistently demonstrates superior data and parameter efficiency compared to other approaches. Notably, it is the only method that achieves perfect accuracy on 6-digit decimal addition, 6-digit integer addition, 5-digit subtraction, and 3-digit multiplication. We also show that our method performs better on a binary classification task that involves numerical values in Figure `\ref{fig:classacc}`{=latex} and Figure `\ref{fig:classacc2}`{=latex}. Specifically, the task requires predicting a label based on a linear equation applied to three integers. In addition, we evaluate FoNE on a modular addition task. In Table `\ref{tab:modular_results}`{=latex}, we show that it outperforms standard tokenization methods, especially under large moduli where conventional approaches fail. Due to space limitations, we defer the details to Appendix `\ref{app:class}`{=latex}.

#### Training and Inference Efficiency.

Table `\ref{tab:time_comparison}`{=latex} compares the training and test times for one epoch across different embedding methods. Our method is consistently faster than digit-wise and subword embedding methods, as it uses one token to embed each number. Compared with [xVal]{.smallcaps}, our method consistently achieves higher accuracy. Additionally, we report the number of tokens each method requires to tokenize the largest number in the task, highlighting the efficiency of our approach. `\vspace{-4mm}`{=latex}

```{=latex}
\begin{table*}[h!]\centering
\small
\setlength{\tabcolsep}{2pt}  % Reduce horizontal spacing between columns
\caption{Training and inference efficiency comparison across three arithmetic tasks. The times are reported in minutes (\('\)) and seconds (\(''\)).}
\begin{tabular}{lcccc cccc cccc}
\toprule
 & \multicolumn{4}{c}{\textbf{Decimal Addition}} & \multicolumn{4}{c}{\textbf{Subtraction}} & \multicolumn{4}{c}{\textbf{Multiplication}} \\
\cmidrule(lr){2-5}\cmidrule(lr){6-9}\cmidrule(lr){10-13}
\textbf{Method} & \textbf{Train Time} & \textbf{Test Time} & \textbf{Tokens} & \textbf{Accuracy} & \textbf{Train.} & \textbf{Test.} & \textbf{Toks.} & \textbf{Acc.} & \textbf{Train.} & \textbf{Test.} & \textbf{Toks.} & \textbf{Acc.} \\
\midrule
\textbf{Ours}       & 3\(^{\prime}\)18\(^{\prime\prime}\) & 29\(^{\prime\prime}\) & 1 & 100 &
             2\(^{\prime}\)42\(^{\prime\prime}\) & 29\(^{\prime\prime}\) & 1 &  100 &
             2\(^{\prime}\)56\(^{\prime\prime}\) & 33\(^{\prime\prime}\) & 1 &  98.56   \\
Digit-wise & 11\(^{\prime}\)48\(^{\prime\prime}\) & 1\(^{\prime}\)25\(^{\prime\prime}\) & 7 & 99.85 &
             9\(^{\prime}\)41\(^{\prime\prime}\) & 1\(^{\prime}\)15\(^{\prime\prime}\) & 5 & 99.71 &
             10\(^{\prime}\)11\(^{\prime\prime}\) & 1\(^{\prime}\)18\(^{\prime\prime}\) & 8 &   81.21 \\
Subword    & 6\(^{\prime}\)46\(^{\prime\prime}\) & 58\(^{\prime\prime}\) & 3 & 97.94 &
             5\(^{\prime}\)47\(^{\prime\prime}\) & 54\(^{\prime\prime}\) & 2 & 91.66 &
             6\(^{\prime}\)20\(^{\prime\prime}\) & 58\(^{\prime\prime}\) & 3 & 8.05  \\
\textsc{xVal}       & 3\(^{\prime}\)17\(^{\prime\prime}\) & 27\(^{\prime\prime}\) & 1 & 0.44 &
             2\(^{\prime}\)54\(^{\prime\prime}\) & 27\(^{\prime\prime}\) & 1 & 3.41 &
             2\(^{\prime}\)56\(^{\prime\prime}\) & 26\(^{\prime\prime}\) & 1 & 0  \\
\bottomrule
\end{tabular}

\label{tab:time_comparison}
\end{table*}
```
```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
```{=latex}
\begin{table*}[h!]\centering
\small
\setlength{\tabcolsep}{2pt}  % Reduce horizontal spacing between columns
\caption{Training and inference efficiency comparison across three arithmetic tasks. The times are reported in minutes (\('\)) and seconds (\(''\)).}
\begin{tabular}{lcccc cccc cccc}
\toprule
 & \multicolumn{4}{c}{\textbf{Decimal Addition}} & \multicolumn{4}{c}{\textbf{Subtraction}} & \multicolumn{4}{c}{\textbf{Multiplication}} \\
\cmidrule(lr){2-5}\cmidrule(lr){6-9}\cmidrule(lr){10-13}
\textbf{Method} & \textbf{Train Time} & \textbf{Test Time} & \textbf{Tokens} & \textbf{Accuracy} & \textbf{Train.} & \textbf{Test.} & \textbf{Toks.} & \textbf{Acc.} & \textbf{Train.} & \textbf{Test.} & \textbf{Toks.} & \textbf{Acc.} \\
\midrule
Ours       & 3\(^{\prime}\)18\(^{\prime\prime}\) & 29\(^{\prime\prime}\) & 1 & 100 &
             2\(^{\prime}\)42\(^{\prime\prime}\) & 29\(^{\prime\prime}\) & 1 &  100 &
             2\(^{\prime}\)56\(^{\prime\prime}\) & 33\(^{\prime\prime}\) & 1 &  98.56   \\
Digit-wise & 11\(^{\prime}\)48\(^{\prime\prime}\) & 1\(^{\prime}\)25\(^{\prime\prime}\) & 7 & 99.85 &
             9\(^{\prime}\)41\(^{\prime\prime}\) & 1\(^{\prime}\)15\(^{\prime\prime}\) & 5 & 99.71 &
             10\(^{\prime}\)11\(^{\prime\prime}\) & 1\(^{\prime}\)18\(^{\prime\prime}\) & 8 &   81.21 \\
Subword    & 6\(^{\prime}\)46\(^{\prime\prime}\) & 58\(^{\prime\prime}\) & 3 & 97.94 &
             5\(^{\prime}\)47\(^{\prime\prime}\) & 54\(^{\prime\prime}\) & 2 & 91.66 &
             6\(^{\prime}\)20\(^{\prime\prime}\) & 58\(^{\prime\prime}\) & 3 & 8.05  \\
\textsc{xVal}       & 3\(^{\prime}\)17\(^{\prime\prime}\) & 27\(^{\prime\prime}\) & 1 & 0.44 &
             2\(^{\prime}\)54\(^{\prime\prime}\) & 27\(^{\prime\prime}\) & 1 & 3.41 &
             2\(^{\prime}\)56\(^{\prime\prime}\) & 26\(^{\prime\prime}\) & 1 & 0  \\
\bottomrule
\end{tabular}

\label{tab:time_comparison}
\end{table*}
```
```{=latex}
\else
```
```{=latex}
\fi
```
```{=latex}
\vspace{-3mm}
```
Ablation Studies {#sec:ablation}
----------------

#### Linear Layer after FoNE.

As discussed in Section `\ref{sec:fne}`{=latex}, we evaluate the use of a linear layer applied after FoNE and compare it with the approach of appending zeros to align the embedding dimensions with the model's input requirements. As shown in Table `\ref{tab:ablation_linear}`{=latex}, both configurations achieve almost the same accuracy. Hence, either technique can be used to align FoNE with the embedding dimension.

```{=latex}
\vspace{-4mm}
```
```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
```{=latex}
\centering
```
`\setlength{\tabcolsep}{4pt}`{=latex}

::: {#tab:ablation_linear}
  **Task**            **Linear Layer**   **Zero Padding**
  ------------------ ------------------ ------------------
  Decimal Addition          100%               100%
  Integer Addition          100%               100%
  Multiplication           99.95%             99.91%
  Subtraction               100%               100%

  : Accuracy Comparison Across Datasets
:::

```{=latex}
\else
```
```{=latex}
\centering
```
```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{4pt}
```
```{=latex}
\centering
```
::: {#tab:ablation_linear}
  **Task**            **Linear**   **Zero Padding**
  ------------------ ------------ ------------------
  Decimal Addition       100%            100%
  Integer Addition       100%            100%
  Multiplication        99.95%          99.91%
  Subtraction            100%            100%

  : Accuracy Comparison Across Datasets
:::

```{=latex}
\hfill
```
```{=latex}
\centering
```
::: {#tab:modular_task_comparison}
  **Dataset**         **2,5,10**   **10**   **5**   **7**
  ------------------ ------------ -------- ------- -------
  Decimal Addition       100        100     1.52    3.64
  Integer Addition       100        100     1.55    0.02
  Multiplication        99.99      99.95    3.67    1.91
  Subtraction            100        100     4.64    0.24

  : Accuracy Comparison Across Periods
:::

```{=latex}
\fi
```
```{=latex}
\vspace{-2mm}
```
#### Effect of Different Periods.

As discussed in Section `\ref{sec:fne}`{=latex}, the modular group captures the necessary information for each digit, ensuring the effectiveness of our approach. We test the model with base periods of $[2, 5, 10]$, $[5]$, and $[7]$, as shown in Table `\ref{tab:modular_task_comparison}`{=latex}. The $[2, 5, 10]$ configuration achieves accuracy comparable to that of the $10$-period setup across different datasets. In this paper, we use the single period $10$ for parameter efficiency. However, configurations using only $\bmod 5$ or $\bmod 7$ exhibit significantly lower accuracy. This is because neither $\bmod 5$ nor $\bmod 7$ can fully represent the required information for all digits.

```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
```{=latex}
\centering
```
`\setlength{\tabcolsep}{4pt}`{=latex}

::: {#tab:modular_task_comparison}
  **Dataset**         **2,5,10**   **10**   **5**   **7**
  ------------------ ------------ -------- ------- -------
  Decimal Addition       100        100     1.52    3.64
  Integer Addition       100        100     1.55    0.02
  Multiplication        99.99      99.95    3.67    1.91
  Subtraction            100        100     4.64    0.24

  : Accuracy Comparison Across Datasets and Periods
:::

```{=latex}
\else
```
```{=latex}
\fi
```
The mispredictions are attributed to the absence of critical modular information. As illustrated in Table `\ref{tab:misprediction}`{=latex} in Appendix `\ref{app:exp}`{=latex}, in the decimal addition task, using only a $\bmod 5$ representation prevents the model from distinguishing between certain digits, such as $2$ and $7$, which results in errors.

#### Necessity of Sine and Cosine Encoding.

A natural question arises: are sinusoidal encodings truly necessary for arithmetic tasks? One could directly encode each digit into a separate dimension of the embedding, representing a number like 567 as $[5,6,7]$. However, this approach fails to achieve perfect accuracy. For instance, numbers such as $999$ and $888$ become nearly indistinguishable after layer normalization, which reduces their differences and can lead to confusion during training. We evaluate this direct encoding method on 6-digit decimal addition and, after performing a learning rate search, find that the best accuracy is 99.3% with a learning rate of 0.01 and training for 100 epochs. In contrast, FoNE achieves better accuracy in just 6 epochs with the same dataset and model size. This suggests that naive direct encoding does not adequately preserve numerical distinctions for reliable arithmetic operations. As illustrated in Table `\ref{tab:misprediction2}`{=latex} in the appendix, the model frequently mispredicts 8 as 9, further demonstrating the limitations of direct encoding in preserving numerical structure.
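
The collapse under normalization is easy to see in a toy computation. The simplified LayerNorm below (no learned scale or shift) is only meant to illustrate the effect, not to reproduce the model's exact normalization.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Simplified LayerNorm without learned scale/shift."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

# Direct digit encoding: constant vectors collapse to (nearly) zero after normalization.
print(layer_norm(np.array([9.0, 9.0, 9.0])))   # ~[0, 0, 0]
print(layer_norm(np.array([8.0, 8.0, 8.0])))   # ~[0, 0, 0]  -> 999 and 888 look identical

# FoNE pairs for 999 and 888 (periods 10, 100, 1000) remain clearly distinct.
def fone3(x):
    return np.concatenate([[np.cos(2 * np.pi * x / T), np.sin(2 * np.pi * x / T)]
                           for T in (10, 100, 1000)])
print(np.round(fone3(999) - fone3(888), 3))     # nonzero in every pair
```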

Applications and Complementarity of FoNE
----------------------------------------

```{=latex}
\vspace{-3mm}
```
`\label{sec:application}`{=latex}

```{=latex}
\begin{figure*}[ht]\centering
  \subfigure[Impact of combining FoNE with Abacus embedding]{
    \includegraphics[width=0.49\textwidth]{figures/results/lengendiff.png}
  }
    \subfigure[Test accuracy of 60-digit addition with FoNE]{
\includegraphics[width=0.46\textwidth]{figures/results/abacus60.png} 
  }
  \caption{
(a) Performance improvements achieved by combining FoNE with the Abacus embedding method across various random seeds. The transformer is trained on addition tasks with up to 10-digit numbers (represented by the smaller square) and tested on up to 50-digit numbers.
(b) Average accuracy of an 8-layer transformer model on 60-digit addition tasks using FoNE for chunked input.} 
    \label{fig:lengen} 
\end{figure*}
```
```{=latex}
\vspace{-3mm}
```
#### Combining FoNE with Abacus.

FoNE can be combined with other positional embedding methods, even under different tokenization schemes, with only minor modifications. For instance, we integrated FoNE with the Abacus embedding method [@mcleish2024transformers], which operates on digit-wise tokenization. In this setup, the embeddings for each digit (0--9) are replaced with their corresponding FoNE. We trained an 8-layer transformer model on integer addition tasks with up to 10 digits and tested it on addition tasks involving up to 50-digit numbers. The results, as illustrated in Figure `\ref{fig:lengen}`{=latex}(a) and Figure `\ref{fig:diff}`{=latex} in Appendix `\ref{app:abacus}`{=latex}, show that incorporating FoNE consistently improves the performance of the Abacus method across various random seeds. This highlights the complementary benefits of combining FoNE with other positional embedding strategies.

#### How Does FoNE Handle Numbers with Longer Digit Sequences?

A `float64` value can exactly represent only about 15 significant decimal digits. When $x$ exceeds 15 digits in length, applying $\text{FoNE}(x)$ directly may therefore lose precision. To address this, $x$ can be divided into smaller chunks, and $\text{FoNE}$ can be applied to each chunk independently. For example, $x$ can be split into groups of five digits. The FoNE is then calculated for each chunk, resulting in a representation of length 10 per chunk, since each digit is encoded in two dimensions. These embeddings are subsequently concatenated to obtain the final number embedding for $x$. Note that we still use one token for each number. Using this method, as shown in Figure `\ref{fig:lengen}`{=latex}(b), an 8-layer transformer trained on 60-digit addition achieves an average accuracy of 97.42% across different operand lengths with just one forward pass. This demonstrates the effectiveness of FoNE in handling long sequences.
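
A minimal sketch of this chunking scheme is given below; the 5-digit chunk size follows the description above, and the `fone` helper is redefined locally so the snippet is self-contained.

```python
import numpy as np

def fone(x: float, m: int, n: int = 0) -> np.ndarray:
    """FoNE of x with m integer and n decimal digits (two dimensions per digit)."""
    return np.concatenate([[np.cos(2 * np.pi * x / 10 ** i), np.sin(2 * np.pi * x / 10 ** i)]
                           for i in range(-n + 1, m + 1)])

def fone_chunked(x_str: str, chunk: int = 5) -> np.ndarray:
    """Split an integer string into 5-digit chunks (least significant first) and
    concatenate each chunk's FoNE, so no single chunk exceeds float64 precision."""
    digits, pieces = x_str, []
    while digits:
        digits, tail = digits[:-chunk], digits[-chunk:]
        pieces.append(fone(float(tail), m=chunk))
    return np.concatenate(pieces)

e = fone_chunked("1234567890" * 6)   # a 60-digit number
print(e.shape)                        # (120,): two dims per digit, still a single [NUM] token
```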

Discussion {#sec:discussion}
==========

```{=latex}
\centering
```
![Pretraining a 117M-parameter GPT-2 model on 10 billion tokens with various number tokenization methods. Our method (FoNE) achieves the lowest validation perplexity. This experiment shows that FoNE does not harm the language model's semantic abilities.](figures/results/ppl_comparison.png){#fig:perplexity width="0.9\\linewidth"}

`\noindent`{=latex}***Q1:*** **Why not use a regression loss instead of a classification loss, since regression minimizes RMSE and can yield smaller prediction errors [@zausinger2024regress]?** The key limitation is that regression produces continuous values, making it impossible to integrate number-related tasks with general language modeling. For example, when predicting the year "1997", regression may output "1996.9999", which is acceptable under regression metrics but unusable in sequence generation or token-based reasoning. In contrast, FoNE retains a classification-based loss, so it does not require explicitly identifying which numbers are used for arithmetic. This ensures seamless compatibility with standard LLM training while still delivering accurate numerical representations.

`\noindent`{=latex}***Q2:*** **Why does FoNE outperform other number embedding methods?** Note that since FoNE uses the ratio between entries to represent numbers as shown in Lemma `\ref{lem:fne_numeracy}`{=latex}, it is unaffected by layer normalization and RMS normalization (Lemma `\ref{lem:fne_preserve_numeracy_layer_norm}`{=latex}), in contrast to xVal [@golkar2023xval], which uses the magnitudes of entries. Other approaches, such as DICE [@sundararaman2020methods] and SALSA [@stevens2024salsa], map numbers onto a single unit circle, which can limit their capacity to distinguish between different magnitudes effectively. FoNE, by leveraging multiple sinusoidal components, captures both the magnitude and periodicity of numbers more precisely. This comprehensive representation enables FoNE to achieve higher accuracy and generalization in number-related tasks.
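
A quick check of this invariance: the digit read-out depends only on the direction of each $(\cos, \sin)$ pair, so a global rescaling of the hidden state, as applied by RMS normalization, leaves the decoded digit unchanged. The snippet below is an illustrative, self-contained demonstration.

```python
import numpy as np

protos = np.stack([[np.cos(2 * np.pi * j / 10), np.sin(2 * np.pi * j / 10)]
                   for j in range(10)])                                      # phi(j, 10)
pair = np.array([np.cos(2 * np.pi * 4 / 10), np.sin(2 * np.pi * 4 / 10)])    # encodes digit 4
print(np.argmax(protos @ pair))            # -> 4
print(np.argmax(protos @ (0.05 * pair)))   # -> 4: rescaling does not change the argmax
```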

***Q3:*** **Why do we choose components with periods that are powers of $10$, i.e., $10, 100, 1000, \cdots$?** As shown in @zhou2024pre and Figure `\ref{fig:embedding_fourier_models}`{=latex}, pre-trained LLMs trained on diverse datasets and strategies consistently learn nearly identical key frequency components. These components have periods of $10$ and its divisors, such as $2$ and $5$. Since $\bmod 10$ already suffices to represent a single digit, we believe that $\bmod 2$ and $\bmod 5$ contribute to enhancing robustness. Models trained on real-world text data---where numbers are almost always expressed in decimal---commonly learn frequency components that correspond to $\bmod 10$. In principle, we could choose alternative bases (such as 5, 16, etc.) to help the model better learn arithmetic in those bases, as demonstrated in Table `\ref{tab:base5_alignment}`{=latex}. However, since most large language models primarily encounter numbers in base 10, and our results show that base-10 FoNE already performs well on arithmetic tasks in other bases (Table `\ref{tab:base_selection_accuracy}`{=latex}), we adopt base 10 as the default. Additional experiments validating this choice are provided in Appendix `\ref{app:class}`{=latex}.

```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
```{=latex}
\centering
```
```{=latex}
\subfigure[Test accuracy of 60-digit addition with FoNE]{
\includegraphics[width=0.46\textwidth]{figures/results/abacus60.png} 
  }
```
```{=latex}
\subfigure[Impact of combining FoNE with Abacus embedding]{
    \includegraphics[width=0.46\textwidth]{figures/results/lengendiff.png}
  }
```
```{=latex}
\else
```
```{=latex}
\fi
```
`\noindent`{=latex}***Q4:*** **Can FoNE be integrated into pretrained LLMs without harming their semantic abilities?**

We study this question in two complementary settings. First, we perform continual pretraining of an existing LLM with a simplified FoNE adaptation. Second, we train a smaller GPT-2 (117M) model from scratch on 10B FineWeb tokens with different number tokenization schemes. In both cases, FoNE preserves, and in the latter case even improves, language modeling performance.

#### Continual pretraining on Llama-3.1-1B.

We demonstrate that our FoNE token embedding can be merged into any existing LLM with a small amount of continual pretraining. We use a simplified version of FoNE: instead of overriding all number embeddings, which could make continual pretraining harder, we build the simplified FoNE on top of the existing BPE tokenization. We compute the FoNE embedding of each subword and continually pretrain a linear projection layer from the FoNE embedding space to the original token embedding space, aligning the embeddings in a manner similar to the alignment phase of vision-language models. We continually pretrain both the original Llama-3.1-1B model and our FoNE-adapted version on 15B tokens from `MegaMath-Web-Pro` [@zhou2025megamathpushinglimitsopen] and evaluate on arithmetic tasks from @brown2020language and on MMLU [@hendrycks2021measuringmassivemultitasklanguage]. As shown in Table `\ref{tab:continue}`{=latex}, this simplified version of FoNE improves the model's zero-shot arithmetic on numbers with more digits without affecting its language abilities as measured by MMLU. `\vspace{-4mm}`{=latex}

```{=latex}
\centering
```
```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{4pt}
```
```{=latex}
\centering
```
::: {#tab:ppl_fineweb}
  [Task]{.smallcaps}     [Regular]{.smallcaps}   [FoNE (*Simplified*)]{.smallcaps}
  --------------------- ----------------------- -----------------------------------
  4-digit addition              51.35%                      **59.00**%
  4-digit subtraction           29.60%                      **39.90**%
  5-digit addition              29.40%                      **35.75**%
  5-digit subtraction           24.95%                      **33.95**%
  MMLU                          38.10%                        38.21%

  : Zero-shot accuracy on arithmetic tasks and MMLU for the original (Regular) Llama-3.1-1B and the simplified-FoNE version after continual pretraining.
:::

`\label{tab:continue}`{=latex}

```{=latex}
\hfill
```
```{=latex}
\centering
```
::: {#tab:ppl_fineweb}
  **Method**      Perplexity
  -------------- ------------
  BPE               67.390
  Single Digit      55.922
  xVal              49.160
  **FoNE**        **46.856**

  : Perplexity of text tokens only on sequences that contain both text and numbers. (evaluated on $\sim$5 million tokens in total).
:::

`\label{tab:ppl_number}`{=latex}

```{=latex}
\vspace{-4mm}
```
```{=latex}
\vspace{-2mm}
```
#### Pretraining from scratch on FineWeb.

To make the model learn text and numbers together, we also pretrain a GPT-2--117M model from scratch on 10 billion tokens from FineWeb [@penedo2024the] under four number tokenization schemes: BPE, Single Digit, xVal, and FoNE. As shown in Figure `\ref{fig:perplexity}`{=latex}, the FoNE model achieves the lowest validation perplexity on the FineWeb validation set. Furthermore, we evaluate the perplexity of text tokens only on sequences that contain both text and numbers. As shown in Table `\ref{tab:ppl_number}`{=latex}, FoNE again achieves the lowest perplexity, while the other methods degrade further relative to their perplexity on all sequences. Since numbers in FineWeb appear naturally interleaved with text, this experiment shows that using FoNE throughout pretraining not only preserves the model's semantic abilities but can in fact *improve* its language modeling quality by providing better number representations. `\vspace{-3mm}`{=latex}
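
For clarity, a minimal sketch of the text-token-only perplexity evaluation is given below; it assumes a Hugging Face-style causal LM that returns `.logits` and a precomputed Boolean `number_mask` marking positions whose target token belongs to a number (both are assumptions of this sketch, not the exact evaluation code).

```python
import math
import torch


def text_only_perplexity(model, input_ids: torch.Tensor, number_mask: torch.Tensor) -> float:
    """Perplexity over non-number target tokens in sequences mixing text and numbers."""
    with torch.no_grad():
        logits = model(input_ids).logits[:, :-1, :]   # predict token t+1 from the prefix up to t
    targets = input_ids[:, 1:]
    keep = ~number_mask[:, 1:]                        # keep only text targets
    log_probs = torch.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return math.exp(nll[keep].mean().item())
```
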

#### Semantic vs. numerical roles.

These findings are consistent with prior interpretability results. As discussed by @meng2022locating, semantic and factual associations of tokens---such as the historical or cultural meaning of a year---are often stored in the MLP layers of transformer models rather than in the token embeddings themselves. FoNE therefore plays a complementary role: it provides precise, numeracy-preserving input embeddings for numbers, while the model's MLP layers continue to store and retrieve semantic knowledge. Together, this separation of roles allows FoNE to enhance numerical reasoning without sacrificing semantic competence.

Conclusion
==========

In this paper, we introduced FoNE, a novel method for representing numbers in the embedding space of LLMs. By leveraging Fourier features, FoNE directly maps numbers into a compact and precise representation, bypassing tokenization inefficiencies and preserving essential numerical properties. FoNE has significant implications for pre-training LLMs. By incorporating FoNE, models can develop a robust understanding of numerical concepts, addressing a fundamental limitation in current architectures. We believe our work establishes a solid foundation for future research on a wide range of number-related tasks, including time-series analysis, quantitative reasoning, and complex operations in fields like physics and mathematics.

```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
```{=latex}
\else
```
```{=latex}
\fi
```
```{=latex}
\newpage
```
Acknowledgments {#acknowledgments .unnumbered}
===============

We are grateful to Wang Zhu for providing helpful feedback on our paper. This work was supported by AWS credits from a gift from the USC-Amazon Center on Secure and Trusted Machine Learning. DF, MS, and RJ were supported in part by a gift from the USC-Capital One Center for Responsible AI and Decision Making in Finance (CREDIF). RJ was also supported in part by the National Science Foundation under Grant No. IIS-2403436. The work of MS was also supported in part by AWS credits through an Amazon Faculty Research Award, a NAIRR Pilot Award, and generous funding by Coefficient Giving. MS is also supported by the Packard Fellowship in Science and Engineering, a Sloan Research Fellowship in Mathematics, NSF CAREER Award \#1846369, DARPA FastNICS program, NSF CIF Awards \#1813877 and \#2008443, and NIH Award DP2LM014564-01. VS was supported by NSF CAREER Award CCF-2239265, an Amazon Research Award, a Google Research Scholar Award, and an Okawa Foundation Award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Reproducibility Statement {#reproducibility-statement .unnumbered}
=========================

We have taken several steps to ensure the reproducibility of our results. The FoNE method is defined step by step in Section `\ref{sec:fne}`{=latex} and Section `\ref{app:fourier_final_loss_prediction}`{=latex}. Our experimental setup, including datasets, sampling rules, model configurations, and training details, is described in Section `\ref{sec:exp_setting}`{=latex} and Appendix `\ref{app:exp}`{=latex}.

```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
Acknowledgments {#acknowledgments-1 .unnumbered}
===============

This research is supported in part by AWS credits through an Amazon Faculty research award, a NAIRR Pilot award, a Microsoft accelerating foundations research grant, and a Google Research Scholar Award. M. Soltanolkotabi is also supported by the Packard Fellowship in Science and Engineering, a Sloan Research Fellowship in Mathematics, an NSF CAREER award \#1846369, NSF-CIF awards \#1813877 and \#2008443, an NSF SLES award \#2417075, and NIH DP2LM014564-01. R. Jia was also supported by the National Science Foundation under Grant No. IIS-2403436. V. Sharan was supported in part by an NSF CAREER Award CCF-2239265 and an Amazon Research Award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. The work was done in part while some of the authors were visiting the Simons Institute for the Theory of Computing.

```{=latex}
\bibliographystyle{plainnat}
```
```{=latex}
\else
```
```{=latex}
\bibliographystyle{iclr2026_conference}
```
```{=latex}
\fi
```
```{=latex}
\clearpage
```
```{=latex}
\newpage
```
```{=latex}
\onecolumn
```
```{=latex}
\appendix
```
```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
Appendix {#appendix .unnumbered}
========

```{=latex}
\startcontents[appendix]
```
```{=latex}
\addcontentsline{toc}{chapter}{Appendix}
```
```{=latex}
\renewcommand{\thesection}{\Alph{section}}
```
```{=latex}
\printcontents[appendix]{}{1}{\setcounter{tocdepth}{3}}
```
```{=latex}
\setcounter{section}{0}
```
```{=latex}
\newpage
```
Detailed Algorithms for Computing FoNE, and Making Predictions {#app:fourier_final_loss_prediction}
==============================================================

In this section, we provide the detailed pipeline and algorithms for computing FoNE and obtaining the final loss and final prediction, as defined in Section `\ref{sec:fne}`{=latex}.

We first show how FoNE and Fourier number decoding are integrated into the regular Transformer pipeline.

```{=latex}
\centering
```
![(a) We extract all the numbers from the input sequence. (b) For each number, we use FoNE to directly map the number to its embedding. The first two entries in the embedding represent $18 \bmod 10$, while the next two entries represent $18 \bmod 100$. (c) We pad the FoNE with zeros, add it to the word embeddings, and then feed the combined embeddings into the model. (d) For each digit, we take every two entries from the last hidden state and find the number whose representation is closest to these two entries.](figures/teaserarxiv.png){#fig:teaser1 width="0.99\\linewidth"}

```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
```{=latex}
\else
```
Next, we show the exact algorithms we use to compute FoNE, compute the loss, and make the final prediction.

```{=latex}
\begin{algorithm*}[!ht]\caption{Fourier Number Embedding (FoNE) Algorithm}\label{alg:fne_algorithm_fixed}
    \begin{algorithmic}[1]
    \Procedure{\textsc{FourierNumberEmbedding}}{$x \in \R, m \in \Z_{\geq 0}, n \in \Z_{\geq 0}, d \in \Z_{> 0}$} 
    \State{\textbf{Inputs}: Number $x$, integer digit length $m$, decimal digit length $n$, embedding dimension $d$}
    \State Initialize empty embedding vector $\text{FoNE} \gets []$
    \For{$i = -n+1 \to m$} \Comment{Loop over all scales from $10^{-n+1}$ to $10^m$}
        \State $T_i \gets 10^i$ \Comment{Set the period for the current scale}
        \State $\phi(x, T_i) \gets (\cos(\tfrac{2\pi}{T_i} x), \sin(\tfrac{2\pi}{T_i} x))$ \Comment{Compute the circular embedding for scale $T_i$}
        \State Append $\phi(x, T_i)$ to $\text{FoNE}$ \Comment{Add the embedding for this scale to the result}
    \EndFor
    \While{$\text{Length}(\text{FoNE}) < d$} \Comment{Ensure embedding dimension matches the target}
        \State Append $0$ to $\text{FoNE}$ \Comment{Zero-pad}
    \EndWhile
    \State \Return $\text{FoNE}$ 
    \EndProcedure
    \end{algorithmic}
\end{algorithm*}
```
```{=latex}
\fi
```
```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
```{=latex}
\begin{algorithm*}[!ht]\caption{Fourier Number Embedding (FoNE) Algorithm}\label{alg:fne_algorithm_fixed}
    \begin{algorithmic}[1]
    \Procedure{\textsc{FourierNumberEmbedding}}{$x \in \R, m \in \Z_{\geq 0}, n \in \Z_{\geq 0}, d \in \Z_{> 0}$} 
    \State{\textbf{Inputs}: Number $x$, integer digit length $m$, decimal digit length $n$, embedding dimension $d$}
    \State Initialize empty embedding vector $\text{FoNE} \gets []$
    \For{$i = -n+1 \to m$} \Comment{Loop over all scales from $10^{-n+1}$ to $10^m$}
        \State $T_i \gets 10^i$ \Comment{Set the period for the current scale}
        \State $\phi(x, T_i) \gets (\cos(\tfrac{2\pi}{T_i} x), \sin(\tfrac{2\pi}{T_i} x))$ \Comment{Compute the circular embedding for scale $T_i$}
        \State Append $\phi(x, T_i)$ to $\text{FoNE}$ \Comment{Add the embedding for this scale to the result}
    \EndFor
    \While{$\text{Length}(\text{FoNE}) < d$} \Comment{Ensure embedding dimension matches the target}
        \State Append $0$ to $\text{FoNE}$ \Comment{Zero-pad}
    \EndWhile
    \State \Return $\text{FoNE}$ 
    \EndProcedure
    \end{algorithmic}
\end{algorithm*}
```
```{=latex}
\else
```
```{=latex}
\fi
```
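
For readers who prefer code, the following NumPy sketch mirrors Algorithm `\ref{alg:fne_algorithm_fixed}`{=latex}; it is an illustrative re-implementation of the pseudocode rather than the released code.

```python
import numpy as np


def fourier_number_embedding(x: float, m: int, n: int, d: int) -> np.ndarray:
    """FoNE of a number with m integer digits and n decimal digits, zero-padded to length d."""
    feats = []
    for i in range(-n + 1, m + 1):                 # scales 10^{-n+1}, ..., 10^m
        period = 10.0 ** i
        angle = 2 * np.pi * x / period
        feats += [np.cos(angle), np.sin(angle)]    # circular embedding phi(x, T_i)
    assert len(feats) <= d, "embedding dimension d is too small for this digit length"
    return np.array(feats + [0.0] * (d - len(feats)))
```

For example, `fourier_number_embedding(18, m=2, n=0, d=4)` produces one cosine/sine pair with period $10$ followed by one pair with period $100$.
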
```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
```{=latex}
\else
```
```{=latex}
\begin{algorithm}[!ht]\caption{Fourier Number Loss \& Prediction}
\label{alg:fourier_loss_prediction}
\begin{algorithmic}[1]

    %--------------------------------------------------
    % Digit-wise Loss
    %--------------------------------------------------
    \Function{FourierNumberLossFunction}{$h, y, i$}
        \State $y_i \gets \text{the $i$-th digit of } y$
        \State $a \gets \bigl[h[2i],h[2i+1]\bigr]$
        \State $b \gets  [\phi(0,10),
                \phi(1,10),
                \cdots,
                \phi(9,10)]^\top$
        \State $\text{logits} \gets a \cdot b$
        \State $\text{loss} \gets L_{\mathrm{CE}}(y_i,\ \text{logits})$ \Comment{\textit{Cross-entropy loss for digit $i$}}
        \State \Return \text{loss}
    \EndFunction

    %--------------------------------------------------
    % Digit-wise Prediction
    %--------------------------------------------------
    \Function{FourierNumberPrediction}{$h, i$} \Comment{\textit{Prediction for digit $i$}}
        \State $\text{logits} \gets 
        \bigl[h[2i],h[2i+1]\bigr]
            \cdot
            \bigl[
                \phi(j,10)
            \bigr]_{j=0,\dots,9}$
        \State $\hat{y}_i \gets \arg\max_{j \in \{0,\dots,9\}} \text{logits}[j]$
        \State \Return $\hat{y}_i$
    \EndFunction

\end{algorithmic}
\end{algorithm}
```
```{=latex}
\fi
```
```{=latex}
\begin{algorithm*}[!htbp]\caption{Fourier Number Final Loss \& Prediction}
\label{alg:fourier_final_loss_prediction}
\begin{algorithmic}[1]

    %--------------------------------------------------
    % Final Loss
    %--------------------------------------------------
    \Function{FourierNumberFinalLoss}{$h, y, m, n$}
    \Comment{\textit{Compute average loss}}
        \State $\text{totalLoss} \gets 0$
        \State $\mathcal{I} \gets [m+n]$
        \For{$i \in \mathcal{I}$} 
            \State $\text{digitLoss} \gets \Call{FourierNumberLossFunction}{h,\ y,\ i}$
            \State $\text{totalLoss} \gets \text{totalLoss} + \text{digitLoss}$
        \EndFor
        \State $\text{finalLoss} \gets \frac{\text{totalLoss}}{|\mathcal{I}|}$
        \Comment{Average over all digit positions}
        \State \Return $\text{finalLoss}$
    \EndFunction

    %--------------------------------------------------
    % Final Prediction
    %--------------------------------------------------
    \Function{FourierNumberFinalPrediction}{$h, m, n$}
    \Comment{\textit{Compute final prediction}}
        \State $\hat{y} \gets 0$
        \State $\mathcal{I}_\text{frac} \gets [0, \dots, n-1]$ \Comment{Fractional digit indices}
        \State $\mathcal{I}_\text{int} \gets [n, \dots, m+n-1]$ \Comment{Integer digit indices}
        
        \For{$i \in \mathcal{I}_\text{frac}$} 
            \State $\text{logits}_i \gets 
                \bigl[h[2i], h[2i+1]\bigr]
                \cdot
                \bigl[\phi(j,10)\bigr]_{j=0,\dots,9}$
            \State $\hat{y}_i \gets \arg\max_{j \in \{0, \dots, 9\}} \text{logits}_i[j]$
            \State $\hat{y} \gets \hat{y} + \hat{y}_i \cdot 10^{-(n-i)}$ \Comment{Scale fractional part by $10^{-(n-i)}$}
        \EndFor
        
        \For{$j \in \mathcal{I}_\text{int}$}
            \State $\text{logits}_j \gets 
                \bigl[h[2j], h[2j+1]\bigr]
                \cdot
                \bigl[\phi(j,10)\bigr]_{j=0,\dots,9}$
            \State $\hat{y}_j \gets \arg\max_{j \in \{0, \dots, 9\}} \text{logits}_j[j]$
            \State $\hat{y} \gets \hat{y} + \hat{y}_j \cdot 10^{j-n}$ \Comment{Scale integer part by $10^{j}$}
        \EndFor
        
        \State \Return $\hat{y}$
    \EndFunction

\end{algorithmic}
\end{algorithm*}
```
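
The digit-wise decoding of Algorithm `\ref{alg:fourier_final_loss_prediction}`{=latex} can be sketched in a few lines of NumPy (training replaces the `argmax` with a cross-entropy loss over the same logits); this is an illustrative sketch, not the exact implementation.

```python
import numpy as np


def digit_phasors(base: int = 10) -> np.ndarray:
    """Reference embeddings phi(j, 10) for j = 0, ..., 9 (one row per digit)."""
    j = np.arange(base)
    return np.stack([np.cos(2 * np.pi * j / base), np.sin(2 * np.pi * j / base)], axis=1)


def predict_digit(h: np.ndarray, i: int) -> int:
    """Decode digit i by taking the nearest reference phasor to h[2i:2i+2]."""
    logits = digit_phasors() @ h[2 * i: 2 * i + 2]
    return int(np.argmax(logits))


def decode_number(h: np.ndarray, m: int, n: int) -> float:
    """Reassemble a number with n fractional and m integer digits from the hidden state h."""
    value = 0.0
    for i in range(n):                       # fractional digit indices 0, ..., n-1
        value += predict_digit(h, i) * 10.0 ** (-(n - i))
    for j in range(n, m + n):                # integer digit indices n, ..., m+n-1
        value += predict_digit(h, j) * 10.0 ** (j - n)
    return value
```
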
```{=latex}
\newpage
```
FoNE on Binary Classification and Modular Arithmetic Task {#app:class}
=========================================================

In this section, we demonstrate that FoNE outperforms other methods on binary classification tasks and modular arithmetic tasks, benefiting from its precise representation.

#### Binary Classification Task

Each example in the dataset is formatted as `[num1,num2,num3]`, where the integers `num1`, `num2`, and `num3` are sorted in ascending order ($\texttt{num1} \leq \texttt{num2} \leq \texttt{num3}$) to ensure uniqueness and eliminate duplicate representations of the same combination. The integers are uniformly sampled from the range $[0, 1000]$. The label for each example is determined by evaluating the linear expression $$a \cdot \texttt{num1} + b \cdot \texttt{num2} + c \cdot \texttt{num3} - d,$$ using predefined coefficients $a=1.5$, $b=-2$, $c=0.5$, with $d=10$ in the first setting and $d=-190$ in the second. If the result is greater than zero, the label is `1`; otherwise, it is `0`. The dataset is divided into training, validation, and test subsets, as outlined in Table `\ref{tab:dataset_sizes}`{=latex}. Figure `\ref{fig:classacc}`{=latex} and Figure `\ref{fig:classacc2}`{=latex} show that FoNE outperforms the regular embedding method, xVal, and even a fine-tuned Llama-3.2-1B model, requiring less data while achieving higher accuracy.
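
A minimal sketch of this data-generation rule (using the first coefficient setting, $d=10$) is shown below; the actual dataset construction may differ in minor details.

```python
import random


def make_example(a=1.5, b=-2.0, c=0.5, d=10.0):
    """Sample one labeled example [num1, num2, num3] with num1 <= num2 <= num3."""
    nums = sorted(random.randint(0, 1000) for _ in range(3))
    label = int(a * nums[0] + b * nums[1] + c * nums[2] - d > 0)
    return nums, label
```
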

```{=latex}
\centering
```
```{=latex}
\subfigure[Accuracy vs. Training Data Size]{\includegraphics[width=0.45\textwidth]{figures/results/classdataacc.png}}
```
```{=latex}
\hspace{2mm}
```
```{=latex}
\subfigure[Accuracy vs. Model Size]{\includegraphics[width=0.45\textwidth]{figures/results/classmodelacc.png}}
```
```{=latex}
\centering
```
```{=latex}
\subfigure[Accuracy vs. Training Data Size]{\includegraphics[width=0.45\textwidth]{figures/results/class2dataacc.png}}
```
```{=latex}
\hspace{2mm}
```
```{=latex}
\subfigure[Accuracy vs. Model Size]{\includegraphics[width=0.45\textwidth]{figures/results/class2modelacc.png}}
```
#### Modular Arithmetic and Base Selection

We conduct experiments varying the FoNE base from 7 to 13 and observe a clear "sweet spot" at base 10. For example, at base 7 the addition, subtraction, and multiplication accuracies are only 17.66%, 1.93%, and 0.54%, respectively, while at base 10 we achieve 100% accuracy on addition and subtraction after just 4 and 26 epochs, and 99.25% on multiplication after 50 epochs. Accuracy then falls off again at larger bases (e.g., at base 13: 1.06%, 5.71%, and 1.91%). This confirms that bases that are too small or too large degrade digit-level distinguishability.

```{=latex}
\centering
```
```{=latex}
\small
```
::: {#tab:base_selection_accuracy}
   **Base**   **Epochs (Add/Sub/Mul)**   **Accuracy (Add/Sub/Mul)**  
  ---------- -------------------------- ---------------------------- --
      7             50 / 50 / 50            17.66 / 1.93 / 0.54      
      8             50 / 50 / 50            40.91 / 6.07 / 0.33      
      9             50 / 50 / 50           65.49 / 65.63 / 65.78     
      10            4 / 26 / 50           100.00 / 100.00 / 99.25    
      11            50 / 50 / 50           99.99 / 99.99 / 78.76     
      12            50 / 50 / 50           99.78 / 99.99 / 54.14     
      13            50 / 50 / 50             1.06 / 5.71 / 1.91      

  : Accuracy (%) on arithmetic tasks under different bases.
:::

We also aligned FoNE's moduli to the numbers' representation base by preprocessing inputs into base 5. In that setting, only the FoNE embedding with period 5 reached perfect accuracy (100% in 6 epochs), whereas periods 2--4 and 7 yielded near-zero to sub-20% accuracy even after 50 epochs (e.g., period 2: 0.25%; period 3: 0.12%; period 7: 16.38%), and period 8 required the full 50 epochs to reach 99.97%. This again demonstrates that FoNE performs best when its moduli match the underlying digit base.
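
As a concrete illustration of why mismatched periods fail, with period $2$ the base-5 digits $0$, $2$, and $4$ all collapse onto the same embedding, $$\phi(0,2)=\phi(2,2)=\phi(4,2)=(\cos 0,\ \sin 0)=(1,0),$$ so they become indistinguishable, whereas period $5$ maps each base-5 digit to a distinct point on the unit circle.
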

```{=latex}
\centering
```
```{=latex}
\small
```
::: {#tab:base5_alignment}
   **FoNE Period**   **Epochs**   **Accuracy (%)**
  ----------------- ------------ ------------------
          2              50             0.25
          3              50             0.12
          4              50             0.17
          5              6             100.00
          7              50            16.38
          8              50            99.97

  : Accuracy (%) with base-5 number inputs, under varying FoNE periods.
:::

We additionally benchmarked FoNE on modular addition tasks with varying moduli. The goal is to predict $x + y \bmod m$, where $x, y \in [0, m)$. FoNE consistently outperforms standard tokenization, especially at higher moduli.

```{=latex}
\centering
```
```{=latex}
\small
```
::: {#tab:modular_results}
   **Modulus $m$**   **FoNE Accuracy**   **Standard Tokenization Accuracy**
  ----------------- ------------------- ------------------------------------
         11               100.00                       99.98
         60                99.27                       99.99
         97               100.00                        0.89
         100              100.00                       18.17
         113              100.00                       99.99
         121              100.00                        5.48
         225              100.00                       65.10
         256              100.00                       70.20
         257              100.00                       11.36

  : Accuracy (%) on modular addition tasks across various moduli.
:::

These experiments validate our claim that FoNE is most effective when its moduli align with the target numeral base, and it remains robust even in high-modulus arithmetic settings where standard tokenization breaks down.

More Related Work
=================

#### Fourier Features.

Fourier features are commonly observed in image models, particularly in the early layers of vision models [@olshausen1997sparse; @olah2020overview; @fiquet2024polar]. These features enable the model to detect edges, textures, and other spatial patterns effectively. However, Transformers struggle to capture high-frequency components [@bai2022improving; @tancik2020fourier]. Augmenting data with high-frequency components or explicitly encoding coordinates using Fourier features has been demonstrated to improve model performance [@tancik2020fourier; @he2024frequency; @hua2024fourier]. In fact, the original Transformers paper [@vaswani2017attention] uses Fourier features to encode the position information of tokens; however, it does not apply this idea to number tokens to aid with numerical tasks. In modular addition tasks, studies have revealed that after "grokking," a one-layer Transformer can learn to solve the task perfectly by leveraging Fourier features [@nanda2023progress; @gu2024fourier]. Furthermore, @zhou2024pre demonstrate that LLMs naturally encode numbers using Fourier features during pretraining, leveraging these representations for arithmetic tasks [@levy2024language; @kantamneni2025language; @lindsey2025biology]. Building on this insight, we propose using modular Fourier components to explicitly represent digits, enabling models to perform precise numerical computation. This allows algebraic operations to be carried out in a component-wise, parallel manner and overcomes the limitations of token-based number representations.

Preliminaries and Missing Proofs {#app:pre_proof}
================================

Preliminaries
-------------

In this section, we provide the necessary mathematical definitions and concepts used throughout the paper.

#### Period and Frequency.

A function $f(x)$ is periodic with period $T > 0$ if $f(x + T) = f(x)$ for all $x$. The period $T$ is the smallest positive value for which the function repeats. The frequency of a periodic function is the reciprocal of its period, $1/T$, and describes the number of cycles completed in one unit interval. For the sine and cosine functions $\cos\bigl(\frac{2\pi}{T}x\bigr)$ and $\sin\bigl(\frac{2\pi}{T}x\bigr)$, the period is $T$.

#### Unit Circle.

The unit circle is the set of points in the plane at a distance of 1 from the origin, given by $x^2 + y^2 = 1$. The coordinates of points on the unit circle can be parameterized as $(\cos\theta, \sin\theta)$, where $\theta$ is the angle measured counterclockwise from the positive $x$-axis. For any angle $\theta$, $\cos\theta$ represents the $x$-coordinate, and $\sin\theta$ represents the $y$-coordinate.

#### Two-Argument Inverse Tangent.

The two-argument inverse tangent function, $\operatorname{atan2}(y, x)$, determines the angle $\theta$ (modulo $2\pi$) given the coordinates $(x, y) = (\cos\theta, \sin\theta)$. Specifically, $$\theta = \operatorname{atan2}(y, x),$$ which resolves the angle $\theta$ uniquely based on the signs of $x$ and $y$.

#### Modular Arithmetic.

Modular arithmetic considers equivalence classes of numbers under a modulus $T > 0$. For integers $a$ and $b$, $a \equiv b \pmod{T}$ if $T \mid (a - b)$, meaning $a$ and $b$ differ by an integer multiple of $T$.

#### Fourier Representation.

Periodic functions with period $T$ can be represented using the fundamental frequencies $\frac{2\pi}{T}$. For example, the embeddings $\bigl(\cos\bigl(\frac{2\pi}{T}x\bigr), \sin\bigl(\frac{2\pi}{T}x\bigr)\bigr)$ capture the periodicity of $x$ modulo $T$ by mapping it to a unique point on the unit circle.

Missing Proofs
--------------

In this section, we provide some missing proofs.

```{=latex}
\begin{lemma}[Formal version of Lemma \ref{lem:fne_preserve_numeracy:informal}]\label{lem:fne_preserve_numeracy:formal}
    Given the pair $\bigl(\cos(\tfrac{2\pi}{T}x), \sin(\tfrac{2\pi}{T}x)\bigr)$, we can recover $x \bmod T$.
\end{lemma}
```
```{=latex}
\begin{proof}
Let $\theta = \frac{2\pi}{T} \, x$. Then the given pair becomes 
\[
\bigl(\cos(\theta), \sin(\theta)\bigr).
\]
From this pair, one can recover $\theta$ uniquely modulo $2\pi$. Concretely, $\theta$ can be obtained (modulo $2\pi$) using the two-argument inverse tangent function:
\[
\theta \equiv \operatorname{atan2}\bigl(\sin(\theta), \cos(\theta)\bigr) \quad (\bmod \; 2\pi).
\]
Since $\theta = \frac{2\pi}{T} \, x$, we have
\[
x \;=\; \frac{T}{2\pi} \, \theta.
\]
Hence $x$ is determined up to integer multiples of $T$, i.e., $x \bmod T$. 

In other words, if
\[
\bigl(\cos(\tfrac{2\pi}{T}x_1), \sin(\tfrac{2\pi}{T}x_1)\bigr)
\;=\;
\bigl(\cos(\tfrac{2\pi}{T}x_2), \sin(\tfrac{2\pi}{T}x_2)\bigr),
\]
then $\frac{2\pi}{T} x_1 \equiv \frac{2\pi}{T} x_2 \pmod{2\pi}$, which implies $x_1 \equiv x_2 \pmod{T}$. Therefore, from the pair $\bigl(\cos(\tfrac{2\pi}{T}x), \sin(\tfrac{2\pi}{T}x)\bigr)$, we can indeed recover $x \bmod T$.
\end{proof}
```
```{=latex}
\begin{lemma}[Layer-Normalized FoNE Preserves Numeracy]\label{lem:fne_preserve_numeracy_layer_norm}
    Given a number's Layer-Normalized Fourier Number Embedding $\mathrm{LN}(\FoNE(x) + \mathbf{p})$, where $\FoNE(x)$ is the Fourier Number Embedding of $x$ and $\mathbf{p}$ is an orthogonal positional encoding vector, assume the mean of $\FoNE(x) + \mathbf{p}$ is $0$. Let $m$ be the integer digit length of $x$ and $n$ be the decimal digit length of $x$. Then, using Lemma~\ref{lem:fne_preserve_numeracy:informal}, we can recover $x \bmod 10^{i}$ for each integer $i$ in the range $-n+1$ to $m$.
\end{lemma}
```
```{=latex}
\begin{proof}
Assume the mean of $\mathbf{x} = \FoNE(x) + \mathbf{p}$ is $0$, i.e., $\mu = 0$. Under this assumption, LayerNorm simplifies to:
\[
\mathrm{LN}(\mathbf{x}) = \frac{\mathbf{x}}{\sigma},
\]
where $\sigma$ is the standard deviation of $\mathbf{x}$.

Let $\mathbf{u} = \FoNE(x)$ encode the scalar $x$, and let $\mathbf{p}$ be an orthogonal positional encoding vector such that:
\[
\|\mathbf{u}\| = \|\mathbf{p}\| = 1 \quad \text{and} \quad \mathbf{u} \cdot \mathbf{p} = 0.
\]
Then, the input to LayerNorm is:
\[
\mathbf{x} = \mathbf{u} + \mathbf{p}.
\]

The standard deviation $\sigma$ of $\mathbf{x}$ is given by:
\[
\sigma = \sqrt{\frac{1}{d} \sum_{i=1}^d (\mathbf{x}_i - \mu)^2},
\]
where $d$ is the dimensionality of $\mathbf{x}$. Since $\mu = 0$, this simplifies to:
\[
\sigma = \sqrt{\frac{1}{d} \|\mathbf{x}\|^2}.
\]
Substitute $\mathbf{x} = \mathbf{u} + \mathbf{p}$:
\[
\|\mathbf{x}\|^2 = \|\mathbf{u} + \mathbf{p}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{p}\|^2 + 2\mathbf{u} \cdot \mathbf{p}.
\]
By orthogonality and unit norm, $\mathbf{u} \cdot \mathbf{p} = 0$, $\|\mathbf{u}\|^2 = 1$, and $\|\mathbf{p}\|^2 = 1$. Thus:
\[
\|\mathbf{x}\|^2 = 1 + 1 + 0 = 2.
\]
Therefore:
\[
\sigma = \sqrt{\frac{1}{d} \cdot 2} = \sqrt{\frac{2}{d}}.
\]

The LayerNorm operation simplifies to:
\[
\mathrm{LN}(\mathbf{x}) = \frac{\mathbf{x}}{\sigma} = \frac{\mathbf{u} + \mathbf{p}}{\sqrt{\frac{2}{d}}} = \sqrt{\frac{d}{2}} (\mathbf{u} + \mathbf{p}).
\]
This rescales $\mathbf{u}$ and $\mathbf{p}$ by a factor of $\sqrt{\frac{d}{2}}$.

The key observation is that LayerNorm applies a \textbf{uniform scaling} to all components of $\mathbf{x}$. Since $\mathbf{u}$ and $\mathbf{p}$ are orthogonal and their relative directions are preserved, the numerical relationships encoded in $\mathbf{u}$ (which represent $x$) are preserved up to a scaling factor. 

By Lemma~\ref{lem:fne_preserve_numeracy:informal}, the numeracy of $x$ is preserved. This means we can recover $x \bmod 10^i$ for all $i$ in the range $-n+1 \leq i \leq m$, as the normalized embedding retains the necessary information about $x$.

\end{proof}
```
The same result holds for RMSNorm because it also applies a uniform scaling (based on the root mean square of the input) while preserving the relative directions of the embedding components, thus maintaining the numeracy of $x$.
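
As a small numerical sanity check of Lemma `\ref{lem:fne_preserve_numeracy:formal}`{=latex}, the recovery of $x \bmod T$ from the cosine/sine pair can be written as follows (an illustrative sketch):

```python
import math


def recover_mod(cos_val: float, sin_val: float, T: float) -> float:
    """Recover x mod T from (cos(2*pi*x/T), sin(2*pi*x/T)) via the two-argument arctangent."""
    theta = math.atan2(sin_val, cos_val) % (2 * math.pi)
    return theta * T / (2 * math.pi)


x, T = 18.0, 10.0
pair = (math.cos(2 * math.pi * x / T), math.sin(2 * math.pi * x / T))
print(recover_mod(*pair, T))  # ~8.0, i.e. 18 mod 10
```
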

Note that standard sinusoidal positional encodings (PEs) [@vaswani2017attention] cannot serve as a substitute for numerical embeddings. PEs are explicitly constructed to distinguish token positions in a sequence, not to represent the magnitude or digit structure of real numbers. They use a fixed set of exponentially spaced frequencies to ensure each position has a unique yet non-invertible signature; as a result, there is no straightforward way to recover the original numeric value (or its individual digits) from a PE embedding. Moreover, the frequencies in PEs are chosen for positional uniqueness, not for digit-aligned modular arithmetic, so small changes in numeric value can produce large, non-monotonic changes in the embedding space, precisely the opposite of the smooth, digit-wise variation required for accurate number encoding. Unlike FoNE's digit-aligned sinusoidal components (e.g., mod 10, mod 100, ...), PEs do not guarantee invertibility with respect to numeric operations.

More Evidence {#app:moreevidence}
=============

LLMs Struggle with Multi-digit Arithmetic {#app:eval_llm}
-----------------------------------------

We evaluate five production LLMs (Claude 3.7 Sonnet, DeepSeek V3.1, Gemini 2.5 Flash, GPT-5, and Qwen3-235B) on direct arithmetic without chain-of-thought, code execution, or tools. For each setting, we sample 100 IID problem instances with operands drawn uniformly over the full range of $d$-digit numbers. Tasks include $d\in\{3,4,5,6\}$ for multiplication and $d\in\{7,8,50\}$ for addition. Models receive a two-shot, operation-matched prompt and must return a single integer only. Decoding uses temperature 0. We parse the first integer token from the response and score exact-match accuracy by numeric equality with the reference.
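
The scoring rule can be sketched as follows; the prompt construction and API calls are omitted, and the comma handling is an assumption of this sketch.

```python
import re


def exact_match(response: str, reference: int) -> bool:
    """Parse the first integer in the model response and compare it numerically to the reference."""
    match = re.search(r"-?\d+", response.replace(",", ""))
    return match is not None and int(match.group()) == reference
```
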

```{=latex}
\centering
```
::: {#tab:arithmetic_results}
  ------------------------------------------------------------ ---------------------------------------------- ---------------------------------------- ------ ------ ------ ------ ------
                                                                `\multicolumn{4}{c}{Multiplication}`{=latex}   `\multicolumn{3}{c}{Addition}`{=latex}                              
  `\cmidrule`{=latex}(lr)2-5`\cmidrule`{=latex}(lr)6-8 Model                         3                                           4                       5      6      7      8      50
  Claude 3.7 Sonnet                                                                 1.00                                        0.53                    0.02   0.00   1.00   0.97   0.82
  DeepSeek V3.1                                                                     0.94                                        0.25                    0.01   0.00   0.95   0.95   0.34
  Gemini 2.5 Flash                                                                  0.58                                        0.11                    0.00   0.00   0.99   0.96   0.33
  GPT-5                                                                             0.74                                        0.09                    0.00   0.00   0.98   0.92   0.66
  Qwen3-235B                                                                        0.94                                        0.62                    0.05   0.01   0.99   0.98   0.28
  ------------------------------------------------------------ ---------------------------------------------- ---------------------------------------- ------ ------ ------ ------ ------

  : Exact-match accuracy on direct multi-digit arithmetic; columns indicate digits per operand. Each entry averages 100 IID problems per setting with uniformly sampled $d$-digit operands. Two-shot prompting, single-integer output only.
:::

`\noindent`{=latex}***Results.*** Even the most recently released LLMs still struggle with *multi-digit multiplication*, while failing to achieve perfect accuracy on *addition*. For multiplication, accuracy drops sharply as operand length increases: models perform well on 3-digit multiplication ($\geq 0.58$), but fall below $0.10$ for 5-digit cases and nearly $0$ at 6 digits. Qwen3-235B is the most robust, reaching $0.62$ on 4-digit and $0.05$ on 5-digit multiplication, yet still fails on 6 digits. For addition, all models achieve $\geq 0.95$ on 7--8 digit tasks, but accuracy declines on long-sequence addition (50 digits), ranging from $0.28$ (Qwen3-235B) to $0.82$ (Claude 3.7 Sonnet). In summary, LLMs excel at addition with short to medium operands but remain brittle for both long-sequence addition and especially large-digit multiplication.

Emergence of Fourier Features during Pre-training
-------------------------------------------------

We follow @zhou2024pre and conduct the same Fourier analysis on the Pythia models. In Figure `\ref{fig:pythia_number_checkpoints}`{=latex}, we show how Pythia gradually learns Fourier features during pre-training. Across different model sizes, the models gradually learn the same frequency components.

```{=latex}
\begin{figure*}[htbp]\centering
  \includegraphics[width=1.0\textwidth]{figures/pretrained_embeddings/pythia_number_embedding_checkpoints.png}
  \caption{Fourier analysis of the Pythia model's number embeddings across pre-training checkpoints. The figure illustrates how the Fourier features are progressively learned during pre-training, showing the emergence of specific frequency components. Models of varying sizes exhibit a similar trend, gradually learning the same frequency components over time.}
  \label{fig:pythia_number_checkpoints}
\end{figure*}
```
We extend the work of @zhou2024pre to other pre-trained LLMs and observe similar findings: pre-trained LLMs, regardless of the dataset used, tend to learn the same outlier frequency components.

```{=latex}
\centering
```
+:------------------------------------------------------------------------------------------------------------------------------------------------:+:-----------------------------------------------------------------------------------------------------------------------------:+
| `\subfigure[pre-trained Pythia]{\includegraphics[width=0.45\textwidth]{figures/pretrained_embeddings/pythia_number_embedding_last.png}}`{=latex} | `\subfigure[fine-tuned Llama3.2]{\includegraphics[width=0.45\textwidth]{figures/pretrained_embeddings/llama3.2.png}}`{=latex} |
+--------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+
| `\subfigure[pre-trained OPT]{\includegraphics[width=0.45\textwidth]{figures/pretrained_embeddings/opt-1.3b.png}}`{=latex}                        | ```{=latex}                                                                                                                   |
|                                                                                                                                                  | \subfigure[pre-trained GPT2]{\includegraphics[width=0.45\textwidth]{figures/pretrained_embeddings/gpt2-large.png}}            |
|                                                                                                                                                  | ```                                                                                                                           |
+--------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+

```{=latex}
\newpage
```
```{=latex}
\newpage
```
FoNE for 60-digit Integer Addition in One Forward Pass {#app:60digit}
======================================================

```{=latex}
\begin{figure*}[!htbp]\centering
    \includegraphics[width=0.5\linewidth]{figures/results/6060addition.png}
    \caption{ Accuracy of an 8-layer transformer on 60-digit addition tasks, illustrating the effectiveness of FoNE embeddings in handling long sequences. The model achieves an average accuracy of 97.42\% across different operand lengths, showcasing its capability in numerical precision and sequence representation.}
    \label{fig:6060addition}
\end{figure*}
```
As discussed in Section `\ref{sec:discussion}`{=latex}, the maximum digit length that a `float64` data type can precisely represent is 15 digits. Consequently, even if we convert numbers to `float64` and then back to `float16` to match the model weight precision, numerical inaccuracies still arise when the input $x$ exceeds 15 digits. To mitigate this issue, we process $x$ by dividing it into smaller chunks, allowing FoNE to operate without precision loss.

Specifically, $x$ is split into groups of five digits, and $\text{FoNE}$ is applied independently to each chunk. Each digit within a chunk is encoded into two dimensions, resulting in an embedding of length 10 per chunk. These chunk embeddings are then concatenated to form the final representation of $x$. This method ensures that even for long inputs, FoNE still preserves the numeracy of the numbers.
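
A sketch of this chunking scheme is shown below; it reuses the illustrative `fourier_number_embedding` function from Appendix `\ref{app:fourier_final_loss_prediction}`{=latex} and is not the exact implementation.

```python
def chunked_fone(x_str: str, chunk_len: int = 5) -> list:
    """Embed a long integer (given as a digit string) chunk by chunk to avoid float64 precision limits."""
    pad = (-len(x_str)) % chunk_len
    x_str = "0" * pad + x_str                                   # left-pad so chunks align on 5-digit groups
    chunks = [x_str[i:i + chunk_len] for i in range(0, len(x_str), chunk_len)]
    # each chunk has at most 5 digits, so its FoNE (two dims per digit, length 10) stays exact
    return [fourier_number_embedding(float(c), m=chunk_len, n=0, d=2 * chunk_len) for c in chunks]
```
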

We adopt the data generation approach from [@mcleish2024transformers], which includes all combinations of operand lengths $(i, j)$ up to a maximum length $k$, generating 20 million stratified samples to ensure balanced representation across all length pairs. Training is conducted using a language model cramming approach [@geiping2023cramming], constrained to 8 exaFLOP (equivalent to 24 hours of training on a single Nvidia RTX A6000 GPU). Using this strategy, as depicted in Figure `\ref{fig:lengen}`{=latex}(a), an 8-layer transformer trained on 60-digit addition achieves an average accuracy of 97.42% across various operand lengths in just one forward pass. This result underscores the effectiveness of FoNE in processing long numbers with high precision and computational efficiency.

Combining FoNE with Abacus {#app:abacus}
==========================

We train decoder-only causal language models to solve arithmetic problems, following the setup described in @mcleish2024transformers. Inputs are formatted in a least-significant-digit-first order (e.g., $98282 + 3859172 = 2787472$), without padding between digits or operands. The training dataset includes all combinations of operand lengths $(i, j)$ up to a maximum length $k$, with 20 million stratified samples ensuring balanced representation across all length pairs.

For the input representation, we combine Fourier Number Embeddings (FoNE) with the Abacus method [@mcleish2024transformers], so that each digit is embedded with FoNE. Training is conducted using a language model cramming approach [@geiping2023cramming], constrained to 8 exaFLOP (equivalent to 24 hours of training on a single Nvidia RTX A6000 GPU).

We train and evaluate the models across three runs, each with a different random seed, as shown in Figure `\ref{fig:diff}`{=latex}. Results indicate that incorporating FoNE enables the Abacus method to achieve better generalization and higher accuracy.

```{=latex}
\centering
```
![Heatmaps of accuracy percentages for \`\`FoNE+Abacus\" (left column) and \`\`Abacus\" (right column) across three different random seeds. Each heatmap represents accuracy as a function of the first and second number lengths, with lighter blue shades indicating higher accuracy. The color scale ranges from white (low accuracy) to blue (high accuracy). These visualizations highlight that combining FoNE with Abacus improves performance.](figures/results/diff1.png){#fig:diff width="0.8\\linewidth"}

More Details on Experimental Setup {#app:exp}
==================================

In this section, we provide the experimental settings used in Section `\ref{sec:exp_setting}`{=latex}.

Learning rates were determined through an extensive search, with the best rates selected separately for each method based on validation performance. Final training hyperparameters include a learning rate of $0.005$ for regular and FoNE methods, and $0.0001$ for the xVal method, a batch size of $512$, and $100$ epochs. The fine-tuning process required less than 10 hours, while training from scratch took less than 3 days.

```{=latex}
\centering
```
::: {#tab:dataset_sizes}
            **Dataset**             **Train Size**   **Validation Size**   **Test Size**
  -------------------------------- ---------------- --------------------- ---------------
      6-digit decimal addition         720,000             80,000             200,000
      6-digit integer addition         720,000             80,000             200,000
      5-digit integer subtract         720,000             80,000             200,000
   3-digit integer multiplication      360,000             40,000             100,000
   4-digit integer multiplication      720,000             80,000             200,000
           classification               720,00              80,00             200,00

  : Dataset Sizes for Training, Testing, and Validation
:::

```{=latex}
\centering
```
```{=latex}
\small
```
::: {#tab:varysizes}
            **Dataset**             **Model Size for Varying Data Size**   **Data Size for Varying Model Size**
  -------------------------------- -------------------------------------- --------------------------------------
      6-digit decimal addition                     37.55M                                200,000
      6-digit integer addition                     37.55M                                200,000
      5-digit integer subtract                     37.55M                                200,000
   3-digit integer multiplication                  37.55M                                360,000
   4-digit integer multiplication                  37.55M                                360,000
           classification                          37.55M                                 50,000

  : Dataset and Configuration Sizes for Model and Data Variation Experiments
:::

```{=latex}
\centering
```
```{=latex}
\footnotesize
```
::: {#tab:model_config}
   **Model**   **Hidden Size**   **Intermediate Size**   **\# Hidden Layers**   **\# Attention Heads**   **\# Key-Value Heads**
  ----------- ----------------- ----------------------- ---------------------- ------------------------ ------------------------
       1             64                   256                     1                       4                        2
       2             128                  512                     2                       4                        2
       3             192                  768                     3                       6                        3
       4             256                 1024                     4                       8                        4
       5             320                 1280                     5                       8                        4
       6             384                 1536                     6                       8                        4

  : Model Configuration Table
:::

Ablation Study
--------------

In this section, we present the mispredictions of a model trained with a FoNE variant whose periods are multiples of $5$ instead of $10$. Table `\ref{tab:misprediction}`{=latex} demonstrates that, for each digit, the mispredictions consistently deviate from the true labels by $5$.

```{=latex}
\centering
```
::: {#tab:misprediction}
   **Index**   **Predicted Value**   **Actual Value**
  ----------- --------------------- ------------------
       1             934.03              934.585
       2              3.009              558.509
       3             912.311             917.366
       4            6201.003             1756.008
       5             1240.34             1290.84

  : Mispredictions in the Final Evaluation when we embed each digit with only $\bmod 5$.
:::

We also present the model's mispredictions in Table `\ref{tab:misprediction2}`{=latex}, where each digit is encoded into a separate dimension of the embedding. For example, the number $567$ is represented as $[5,6,7]$. During training, we compute the RMSE loss between the last hidden states and the labels. During prediction, we interpret each entry in the last hidden state as a single digit.

```{=latex}
\centering
```
::: {#tab:misprediction2}
   **Index**   **Predicted Value**   **Actual Value**
  ----------- --------------------- ------------------
       1             883.888             993.999
       2             787.878             898.989
       3             888.758             989.759
       4             748.785             849.895
       5             677.677             688.788
      10            1179.488             1189.499

  : Mispredictions in the Final Evaluation when directly encoding numbers into their embeddings.
:::

```{=latex}
\newpage
```
Replicating Results on GPT2-Large Based Model {#sec:gpt2}
=============================================

We conduct the same experiments on decimal addition using a GPT-2 Large-based model. The results indicate that changing the model architecture does not affect the outcomes. For instance, GPT-2 Large employs LayerNorm, while Llama 3.2 uses RMSNorm.

```{=latex}
\begin{figure*}[!htbp]\centering
\subfigure[ 6-digit decimal addition: Accuracy vs. Training Data Size]{\includegraphics[width=0.45\textwidth]{figures/results/add2datagpt.png}}
\hspace{2mm}
\subfigure[6-digit decimal addition: Accuracy vs. Model Size]{\includegraphics[width=0.45\textwidth]{figures/results/add2modelgpt.png}}
\caption{
We train GPT2-Large from scratch with random initialization using different number embedding methods on 6-digit decimal addition. The test accuracy is compared across varying data sizes and model sizes.
}
\label{fig:gpt}
\end{figure*}
```
```{=latex}
\newpage
```
$R^2$ Comparison for Different Arithmetic Tasks {#app:r2}
===============================================

xVal [@golkar2023xval] performs well on the $R^2$ metric $$\begin{aligned}
R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2},\end{aligned}$$ because it uses RMSE as its loss function. However, we demonstrate that FoNE outperforms xVal on $R^2$ in most tasks. We report the final $R^2$ on the test datasets from our experiments (Section `\ref{sec:exp_results}`{=latex}).
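
For reference, the metric can be computed as in the following minimal sketch.

```python
import numpy as np


def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination R^2 between true and predicted numeric answers."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot
```
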

```{=latex}
\begin{figure*}[!htbp]\centering
\subfigure[Data size vs. Accuracy]{\includegraphics[width=0.45\textwidth]{figures/results/adddecimaldatar2.png}}
\hspace{2mm}
\subfigure[Model size vs. Accuracy]{\includegraphics[width=0.45\textwidth]{figures/results/adddecimalmodelr2.png}}
\caption{
Comparison of $R^2$ trends for 6-digit decimal addition with respect to model size and data size.
}
\label{fig:r2decimal}
\end{figure*}
```
```{=latex}
\begin{figure*}[!htbp]\centering
  \subfigure[6-digit integer addition: Model\&Data size vs. Accuracy]{
    \begin{minipage}{0.48\textwidth}
      \centering
      \includegraphics[width=0.49\textwidth]{figures/results/add2datar2.png}
    \hspace{-2mm}
      \includegraphics[width=0.49\textwidth]{figures/results/add2modelr2.png}
    \end{minipage}
    \label{fig:decimal_addition_r2}
  }
  \hspace{-2mm}
    \subfigure[5-digit integer addition: Model\&Data size vs. Accuracy]{
    \begin{minipage}{0.48\textwidth}
      \centering
      \includegraphics[width=0.49\textwidth]{figures/results/adddatar2.png}
      \hspace{-2mm}
      \includegraphics[width=0.49\textwidth]{figures/results/addmodelr2.png}
    \end{minipage}
    \label{fig:addition_r2}
  }
  \hspace{-2mm}\\
  \subfigure[5-digit integer subtraction: Model\&Data size vs. Accuracy]{
    \begin{minipage}{0.48\textwidth}
      \centering
      \includegraphics[width=0.49\textwidth]{figures/results/subdatar2.png}
      \hspace{-2mm}
      \includegraphics[width=0.49\textwidth]{figures/results/submodelr2.png}
    \end{minipage}
    \label{fig:subtraction_r2}
  }
  \hspace{-1mm}
  \subfigure[3-digit integer multiplication: Model\&Data size vs. Accuracy]{
    \begin{minipage}{0.48\textwidth}
      \centering
      \includegraphics[width=0.49\textwidth]{figures/results/muldatar2.png}
      \hspace{-2mm}
      \includegraphics[width=0.49\textwidth]{figures/results/mulmodelr2.png}
    \end{minipage}
    \label{fig:multiplication_r2}
  }
  \caption{Comparison of $R^2$ trends for various arithmetic tasks with respect to model size and data size.}
  \label{fig:result_r2} 
\end{figure*}
```
Comparison with Number Token Loss (NTL)
=======================================

To validate the effectiveness of FoNE, we conduct comprehensive experiments comparing against Number Token Loss (NTL) [@zausinger2024regress], a method designed to improve numerical reasoning by adding an auxiliary loss on number tokens. We evaluate two NTL variants: NTL with digit-wise tokenization (denoted NTL-D), where numbers are decomposed into individual digit tokens, and NTL with subword tokenization (NTL-S), which uses the standard tokenizer with an extended number vocabulary covering tokens 0-999. For each variant, we test three lambda values ($\lambda \in \{0.15, 0.3, 0.8\}$) that control the relative weighting of the NTL loss component.

Experimental Setup
------------------

We conduct experiments on two arithmetic reasoning benchmarks: 4-digit integer addition and 3-digit integer multiplication. To assess data efficiency, we train models with varying amounts of data: 5K, 25K, and 50K samples for addition; 10K, 50K, and 75K samples for multiplication. All experiments use Llama-3.2-1B-Instruct's configuration, training from scratch with consistent architectural parameters (512 hidden dimensions, 8 layers, 8 attention heads) to ensure fair comparison. We train all models for 50 epochs with batch size 512, using learning rates of $5 \times 10^{-4}$ for FoNE and $5 \times 10^{-3}$ for NTL methods based on preliminary tuning. Models are evaluated on $1000$ held-out test samples using whole number accuracy, $R^2$, and RMSE.

Results
-------

Figure `\ref{fig:nlt}`{=latex} presents comprehensive performance comparisons across both tasks and all metrics. The results demonstrate FoNE's substantial and consistent superiority over all NTL variants.

#### Addition Task

FoNE reaches 98.6% accuracy with only 5,000 training samples and perfect 100% accuracy with 25,000 or more samples (Table `\ref{tab:addition_results}`{=latex}). The $R^2$ scores consistently exceed 0.999, and RMSE drops to effectively zero at larger training scales. NTL methods exhibit significantly degraded performance. The best-performing variant, NTL(D,0.15), achieves 97.4% accuracy with 50K samples, still below FoNE's performance at 5K samples. Other NTL configurations struggle considerably: NTL(D,0.3) peaks at only 51.4% accuracy, while higher lambda values ($\lambda=0.8$) yield near-zero performance (3.3-6.1%). Most critically, NTL with subword tokenization fails catastrophically across all configurations, achieving near-zero accuracy with negative $R^2$ values as low as -13,170, indicating performance substantially worse than a naive baseline. RMSE values for NTL(S) exceed $4.7 \times 10^5$, compared to 0.0 for FoNE.

#### Multiplication Task

The multiplication task proves more challenging, yet FoNE maintains strong performance. With 75,000 training samples, FoNE achieves 89.3% accuracy with $R^2$=0.9996 and RMSE=4,286, demonstrating consistent and reliable predictions (Table `\ref{tab:multiplication_results}`{=latex}). Performance scales smoothly with data: from 35.1% at 10K samples to 84.5% at 50K and 89.3% at 75K. NTL methods again significantly underperform. The best NTL configuration, NTL(D,0.15) with 75K samples, achieves only 16.3% accuracy---73 percentage points below FoNE. Most other NTL variants achieve less than 5% accuracy even at the largest training scale. Subword tokenization continues to fail, with negative $R^2$ scores (e.g., -0.534 for NTL(S,0.8)) and RMSE values exceeding $2 \times 10^5$, three orders of magnitude worse than FoNE.

#### Analysis

Several key observations emerge from our experiments. First, FoNE demonstrates superior data efficiency, achieving strong performance with minimal training data while NTL requires substantially more samples to reach even moderate accuracy levels. Second, NTL's performance is highly sensitive to the lambda hyperparameter: increasing $\lambda$ from 0.15 to 0.8 consistently degrades accuracy across both tasks, suggesting that heavily weighting the distance-based loss interferes with learning. Finally, FoNE also exhibits better training efficiency, requiring $2.5-11.7\times$ less training time than NTL methods while achieving far superior accuracy.

```{=latex}
\begin{figure*}[t]\centering
\includegraphics[width=\textwidth]{figures/ntlcomparison/comparison_all.pdf}
\caption{Performance comparison of FoNE vs. NTL methods on addition (top) and multiplication (bottom) tasks across three metrics: accuracy, $R^2$, and RMSE. FoNE (solid blue squares) consistently outperforms all NTL variants, achieving 100\% accuracy on addition and 89.3\% on multiplication vs. NTL's best of 16.3\%.}
\label{fig:nlt}
\end{figure*}
```
```{=latex}
\begin{table*}[]\centering
\caption{Addition task results comparing FoNE with NTL methods across different training data sizes. FoNE achieves perfect accuracy with $\geq$25K samples while all NTL variants struggle.}
\label{tab:addition_results}
\small
\begin{tabular}{lcccccc}
\toprule
\textbf{Method} & \multicolumn{2}{c}{\textbf{5K Samples}} & \multicolumn{2}{c}{\textbf{25K Samples}} & \multicolumn{2}{c}{\textbf{50K Samples}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7}
& Acc. (\%) & $R^2$ & Acc. (\%) & $R^2$ & Acc. (\%) & $R^2$ \\
\midrule
\textbf{FoNE} & \textbf{98.6} & \textbf{0.9993} & \textbf{100.0} & \textbf{1.0000} & \textbf{100.0} & \textbf{1.0000} \\
\midrule
NTL(D,0.15) & 0.9 & -3.57 & 96.3 & 0.9998 & 97.4 & 0.623 \\
NTL(D,0.3) & 41.2 & 0.107 & 51.4 & 0.506 & 49.1 & 0.887 \\
NTL(D,0.8) & 3.3 & -0.304 & 6.1 & 0.774 & 5.3 & 0.773 \\
NTL(S,0.15) & 0.0 & -2.41 & 4.6 & 0.957 & 8.6 & -39.3 \\
NTL(S,0.3) & 0.0 & -7763 & 0.1 & -13030 & 5.7 & -8884 \\
NTL(S,0.8) & 0.1 & -8418 & 3.6 & -13170 & 0.7 & -13730 \\
\bottomrule
\end{tabular}
\end{table*}
```
```{=latex}
\begin{table*}[]\centering
\caption{Multiplication task results. FoNE maintains strong performance (89.3\% at 75K samples) while the best NTL variant achieves only 16.3\%.}
\label{tab:multiplication_results}
\small
\begin{tabular}{lcccccc}
\toprule
\textbf{Method} & \multicolumn{2}{c}{\textbf{10K Samples}} & \multicolumn{2}{c}{\textbf{50K Samples}} & \multicolumn{2}{c}{\textbf{75K Samples}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7}
& Acc. (\%) & $R^2$ & Acc. (\%) & $R^2$ & Acc. (\%) & $R^2$ \\
\midrule
\textbf{FoNE} & \textbf{35.1} & \textbf{0.9932} & \textbf{84.5} & \textbf{0.9994} & \textbf{89.3} & \textbf{0.9996} \\
\midrule
NTL(D,0.15) & 0.3 & 0.955 & 5.1 & 0.993 & 16.3 & 0.9991 \\
NTL(D,0.3) & 0.2 & 0.958 & 3.1 & 0.978 & 8.2 & 0.990 \\
NTL(D,0.8) & 0.0 & 0.928 & 0.4 & 0.969 & 0.5 & 0.983 \\
NTL(S,0.15) & 0.2 & 0.957 & 2.2 & 0.944 & 4.7 & 0.983 \\
NTL(S,0.3) & 0.2 & -0.462 & 0.7 & -0.032 & 3.9 & -0.107 \\
NTL(S,0.8) & 0.3 & -0.132 & 1.9 & -0.525 & 1.6 & -0.534 \\
\bottomrule
\end{tabular}
\end{table*}
```
```{=latex}
\else
```
Appendix {#appendix-1 .unnumbered}
========

```{=latex}
\startcontents[appendix]
```
```{=latex}
\addcontentsline{toc}{chapter}{Appendix}
```
```{=latex}
\renewcommand{\thesection}{\Alph{section}}
```
```{=latex}
\printcontents[appendix]{}{1}{\setcounter{tocdepth}{3}}
```
```{=latex}
\setcounter{section}{0}
```
```{=latex}
\newpage
```
Detailed Algorithms for Computing FoNE, and Making Predictions {#app:fourier_final_loss_prediction}
==============================================================

In this section, we provide the detailed pipeline and algorithms for computing FoNE and obtaining the final loss and final prediction, as defined in Section `\ref{sec:fne}`{=latex}.

We first show how FoNE and Fourier number decoding are integrated into the regular Transformer pipeline.

```{=latex}
\centering
```
![(a) We extract all the numbers from the input sequence. (b) For each number, we use FoNE to directly map the number to its embedding. The first two entries in the embedding represent $18 \bmod 10$, while the next two entries represent $18 \bmod 100$. (c) We pad the FoNE with zeros, add it to the word embeddings, and then feed the combined embeddings into the model. (d) For each digit, we take every two entries from the last hidden state and find the number whose representation is closest to these two entries.](figures/teaserarxiv.png){#fig:teaser1 width="0.99\\linewidth"}

```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
```{=latex}
\else
```
Next, we show the exact algorithms we use to compute FoNE, compute the loss, and make the final prediction.

```{=latex}
\begin{algorithm*}[!ht]\caption{Fourier Number Embedding (FoNE) Algorithm}\label{alg:fne_algorithm_fixed}
    \begin{algorithmic}[1]
    \Procedure{\textsc{FourierNumberEmbedding}}{$x \in \R, m \in \Z_{\geq 0}, n \in \Z_{\geq 0}, d \in \Z_{> 0}$} 
    \State{\textbf{Inputs}: Number $x$, integer digit length $m$, decimal digit length $n$, embedding dimension $d$}
    \State Initialize empty embedding vector $\text{FoNE} \gets []$
    \For{$i = -n+1 \to m$} \Comment{Loop over all scales from $10^{-n+1}$ to $10^m$}
        \State $T_i \gets 10^i$ \Comment{Set the period for the current scale}
        \State $\phi(x, T_i) \gets (\cos(\tfrac{2\pi}{T_i} x), \sin(\tfrac{2\pi}{T_i} x))$ \Comment{Compute the circular embedding for scale $T_i$}
        \State Append $\phi(x, T_i)$ to $\text{FoNE}$ \Comment{Add the embedding for this scale to the result}
    \EndFor
    \While{$\text{Length}(\text{FoNE}) < d$} \Comment{Ensure embedding dimension matches the target}
        \State Append $0$ to $\text{FoNE}$ \Comment{Zero-pad}
    \EndWhile
    \State \Return $\text{FoNE}$ 
    \EndProcedure
    \end{algorithmic}
\end{algorithm*}
```
```{=latex}
\fi
```
```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
```{=latex}
\begin{algorithm*}[!ht]\caption{Fourier Number Embedding (FoNE) Algorithm}\label{alg:fne_algorithm_fixed}
    \begin{algorithmic}[1]
    \Procedure{\textsc{FourierNumberEmbedding}}{$x \in \R, m \in \Z_{\geq 0}, n \in \Z_{\geq 0}, d \in \Z_{> 0}$} 
    \State{\textbf{Inputs}: Number $x$, integer digit length $m$, decimal digit length $n$, embedding dimension $d$}
    \State Initialize empty embedding vector $\text{FoNE} \gets []$
    \For{$i = -n+1 \to m$} \Comment{Loop over all scales from $10^{-n+1}$ to $10^m$}
        \State $T_i \gets 10^i$ \Comment{Set the period for the current scale}
        \State $\phi(x, T_i) \gets (\cos(\tfrac{2\pi}{T_i} x), \sin(\tfrac{2\pi}{T_i} x))$ \Comment{Compute the circular embedding for scale $T_i$}
        \State Append $\phi(x, T_i)$ to $\text{FoNE}$ \Comment{Add the embedding for this scale to the result}
    \EndFor
    \While{$\text{Length}(\text{FoNE}) < d$} \Comment{Ensure embedding dimension matches the target}
        \State Append $0$ to $\text{FoNE}$ \Comment{Zero-pad}
    \EndWhile
    \State \Return $\text{FoNE}$ 
    \EndProcedure
    \end{algorithmic}
\end{algorithm*}
```
```{=latex}
\else
```
```{=latex}
\fi
```
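For concreteness, the following Python sketch mirrors the procedure in Algorithm `\ref{alg:fne_algorithm_fixed}`{=latex}. The function name and the NumPy-based implementation are illustrative choices of ours, not taken from a released codebase.

```python
import numpy as np

def fourier_number_embedding(x: float, m: int, n: int, d: int) -> np.ndarray:
    """Two entries (cos, sin) per digit scale, zero-padded to the model dimension d."""
    entries = []
    # Scales run from 10^(-n+1) (last decimal digit) up to 10^m (leading integer digit).
    for i in range(-n + 1, m + 1):
        T = 10.0 ** i                              # period for this digit position
        theta = 2.0 * np.pi * x / T                # angle that encodes x mod T on the unit circle
        entries.extend([np.cos(theta), np.sin(theta)])
    assert len(entries) <= d, "embedding dimension too small for the requested digit lengths"
    entries.extend([0.0] * (d - len(entries)))     # zero-pad to dimension d
    return np.asarray(entries)

# Example: for x = 18 with m = 2 integer digits and n = 0 decimal digits, the first two
# entries encode 18 mod 10 and the next two encode 18 mod 100.
emb = fourier_number_embedding(18, m=2, n=0, d=8)
```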
```{=latex}
\ifdefined
```
```{=latex}
\isarxiv
```
```{=latex}
\else
```
```{=latex}
\begin{algorithm}[!ht]\caption{Fourier Number Loss \& Prediction}
\label{alg:fourier_loss_prediction}
\begin{algorithmic}[1]

    %--------------------------------------------------
    % Digit-wise Loss
    %--------------------------------------------------
    \Function{FourierNumberLossFunction}{$h, y, i$}
        \State $y_i \gets \text{the $i$-th digit of } y$
        \State $a \gets \bigl[h[2i],h[2i+1]\bigr]$
        \State $b \gets  [\phi(0,10),
                \phi(1,10),
                \cdots,
                \phi(9,10)]^\top$
        \State $\text{logits} \gets a \cdot b$
        \State $\text{loss} \gets L_{\mathrm{CE}}(y_i,\ \text{logits})$ \Comment{\textit{Cross-entropy loss for digit $i$}}
        \State \Return \text{loss}
    \EndFunction

    %--------------------------------------------------
    % Digit-wise Prediction
    %--------------------------------------------------
    \Function{FourierNumberPrediction}{$h, i$} \Comment{\textit{Prediction for digit $i$}}
        \State $\text{logits} \gets 
        \bigl[h[2i],h[2i+1]\bigr]
            \cdot
            \bigl[
                \phi(j,10)
            \bigr]_{j=0,\dots,9}$
        \State $\hat{y}_i \gets \arg\max_{j \in \{0,\dots,9\}} \text{logits}[j]$
        \State \Return $\hat{y}_i$
    \EndFunction

\end{algorithmic}
\end{algorithm}
```
```{=latex}
\fi
```
```{=latex}
\begin{algorithm*}[!htbp]\caption{Fourier Number Final Loss \& Prediction}
\label{alg:fourier_final_loss_prediction}
\begin{algorithmic}[1]

    %--------------------------------------------------
    % Final Loss
    %--------------------------------------------------
    \Function{FourierNumberFinalLoss}{$h, y, m, n$}
    \Comment{\textit{Compute average loss}}
        \State $\text{totalLoss} \gets 0$
        \State $\mathcal{I} \gets \{0, 1, \dots, m+n-1\}$ \Comment{All digit positions}
        \For{$i \in \mathcal{I}$} 
            \State $\text{digitLoss} \gets \Call{FourierNumberLossFunction}{h,\ y,\ i}$
            \State $\text{totalLoss} \gets \text{totalLoss} + \text{digitLoss}$
        \EndFor
        \State $\text{finalLoss} \gets \frac{\text{totalLoss}}{|\mathcal{I}|}$
        \Comment{Average over all digit positions}
        \State \Return $\text{finalLoss}$
    \EndFunction

    %--------------------------------------------------
    % Final Prediction
    %--------------------------------------------------
    \Function{FourierNumberFinalPrediction}{$h, m, n$}
    \Comment{\textit{Compute final prediction}}
        \State $\hat{y} \gets 0$
        \State $\mathcal{I}_\text{frac} \gets [0, \dots, n-1]$ \Comment{Fractional digit indices}
        \State $\mathcal{I}_\text{int} \gets [n, \dots, m+n-1]$ \Comment{Integer digit indices}
        
        \For{$i \in \mathcal{I}_\text{frac}$} 
            \State $\text{logits}_i \gets 
                \bigl[h[2i], h[2i+1]\bigr]
                \cdot
                \bigl[\phi(j,10)\bigr]_{j=0,\dots,9}$
            \State $\hat{y}_i \gets \arg\max_{j \in \{0, \dots, 9\}} \text{logits}_i[j]$
            \State $\hat{y} \gets \hat{y} + \hat{y}_i \cdot 10^{-(n-i)}$ \Comment{Scale fractional part by $10^{-(n-i)}$}
        \EndFor
        
        \For{$i \in \mathcal{I}_\text{int}$}
            \State $\text{logits}_i \gets 
                \bigl[h[2i], h[2i+1]\bigr]
                \cdot
                \bigl[\phi(j,10)\bigr]_{j=0,\dots,9}$
            \State $\hat{y}_i \gets \arg\max_{j \in \{0, \dots, 9\}} \text{logits}_i[j]$
            \State $\hat{y} \gets \hat{y} + \hat{y}_i \cdot 10^{i-n}$ \Comment{Scale integer part by $10^{i-n}$}
        \EndFor
        
        \State \Return $\hat{y}$
    \EndFunction

\end{algorithmic}
\end{algorithm*}
```
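The loss and prediction procedures above translate almost directly into code. The snippet below is an illustrative NumPy sketch (variable and function names are ours), not the training implementation.

```python
import numpy as np

# The ten digit prototypes phi(j, 10) on the unit circle.
PROTOS = np.array([[np.cos(2 * np.pi * j / 10), np.sin(2 * np.pi * j / 10)] for j in range(10)])

def digit_logits(h: np.ndarray, i: int) -> np.ndarray:
    """Similarity of the i-th (cos, sin) pair of the last hidden state to the digit prototypes."""
    return PROTOS @ h[2 * i: 2 * i + 2]

def fourier_number_final_loss(h: np.ndarray, y_digits) -> float:
    """Average cross-entropy over digit positions; y_digits[i] is the i-th target digit."""
    losses = []
    for i, y_i in enumerate(y_digits):
        logits = digit_logits(h, i)
        log_probs = logits - np.log(np.sum(np.exp(logits)))   # log-softmax
        losses.append(-log_probs[y_i])
    return float(np.mean(losses))

def fourier_number_final_prediction(h: np.ndarray, m: int, n: int) -> float:
    """Assemble the predicted number from per-digit argmax decisions."""
    y_hat = 0.0
    for i in range(m + n):                 # positions 0..n-1 are fractional, n..m+n-1 integer
        digit = int(np.argmax(digit_logits(h, i)))
        y_hat += digit * 10.0 ** (i - n)   # position i carries place value 10^(i-n)
    return y_hat
```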
```{=latex}
\newpage
```
FoNE on Binary Classification and Modular Arithmetic Task {#app:class}
=========================================================

In this section, we demonstrate that FoNE outperforms other methods on binary classification tasks and modular arithmetic tasks, benefiting from its precise representation.

#### Binary Classification Task

Each example in the dataset is formatted as `[num1,num2,num3]`, where the integers `num1`, `num2`, and `num3` are sorted in ascending order ($\texttt{num1} \leq \texttt{num2} \leq \texttt{num3}$) to ensure uniqueness and eliminate duplicate representations of the same combination. The integers are uniformly sampled from the range $[0, 1000]$. The label for each example is determined by evaluating the linear expression $$a \cdot \texttt{num1} + b \cdot \texttt{num2} + c \cdot \texttt{num3} - d,$$ with two coefficient settings: $a=1.5$, $b=-2$, $c=0.5$, $d=10$ for the first task, and the same $a$, $b$, $c$ with $d=-190$ for the second. If the result is greater than zero, the label is `1`; otherwise, it is `0`. The dataset is divided into training, validation, and test subsets, as outlined in Table `\ref{tab:dataset_sizes}`{=latex}. Figure `\ref{fig:classacc}`{=latex} and Figure `\ref{fig:classacc2}`{=latex} show that FoNE outperforms the regular embedding method, xVal, and even a fine-tuned Llama-3.2-1B model, requiring less data while achieving higher accuracy.
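A minimal sketch of this data construction is shown below; the coefficient values come from the text, while the helper name and sampling call are illustrative.

```python
import random

def make_example(a=1.5, b=-2.0, c=0.5, d=10.0):
    """Sample one labeled example: three sorted integers in [0, 1000] and the sign of a linear score."""
    nums = sorted(random.randint(0, 1000) for _ in range(3))   # num1 <= num2 <= num3
    score = a * nums[0] + b * nums[1] + c * nums[2] - d
    return nums, int(score > 0)                                # label 1 if the score is positive

example_task1 = make_example()          # first task: d = 10
example_task2 = make_example(d=-190.0)  # second task: same a, b, c with d = -190
```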

```{=latex}
\centering
```
```{=latex}
\subfigure[Accuracy vs. Training Data Size]{\includegraphics[width=0.45\textwidth]{figures/results/classdataacc.png}}
```
```{=latex}
\hspace{2mm}
```
```{=latex}
\subfigure[Accuracy vs. Model Size]{\includegraphics[width=0.45\textwidth]{figures/results/classmodelacc.png}}
```
```{=latex}
\centering
```
```{=latex}
\subfigure[Accuracy vs. Training Data Size]{\includegraphics[width=0.45\textwidth]{figures/results/class2dataacc.png}}
```
```{=latex}
\hspace{2mm}
```
```{=latex}
\subfigure[Accuracy vs. Model Size]{\includegraphics[width=0.45\textwidth]{figures/results/class2modelacc.png}}
```
#### Modular Arithmetic and Base Selection

We conduct experiments varying the FoNE base from 7 to 13 and observe a clear "sweet spot" at base 10. For example, at base 7 the addition, subtraction, and multiplication accuracies are only 17.66%, 1.93%, and 0.54%, respectively, while at base 10 we reach 100% accuracy on addition and subtraction after just 4 and 26 epochs, and 99.25% on multiplication after 50 epochs. Accuracy then falls off again at larger bases (e.g., at base 13: 1.06%, 5.71%, and 1.91%). This confirms that bases that are too small or too large degrade digit-level distinguishability.

```{=latex}
\centering
```
```{=latex}
\small
```
::: {#tab:base_selection_accuracy}
   **Base**   **Epochs (Add/Sub/Mul)**   **Accuracy (Add/Sub/Mul)**  
  ---------- -------------------------- ---------------------------- --
      7             50 / 50 / 50            17.66 / 1.93 / 0.54      
      8             50 / 50 / 50            40.91 / 6.07 / 0.33      
      9             50 / 50 / 50           65.49 / 65.63 / 65.78     
      10            4 / 26 / 50           100.00 / 100.00 / 99.25    
      11            50 / 50 / 50           99.99 / 99.99 / 78.76     
      12            50 / 50 / 50           99.78 / 99.99 / 54.14     
      13            50 / 50 / 50             1.06 / 5.71 / 1.91      

  : Accuracy (%) on arithmetic tasks under different bases.
:::

We also aligned FoNE's moduli with the numbers' representation base by preprocessing inputs into base 5. In that setting, the FoNE embedding with period 5 reached perfect accuracy (100% after 6 epochs), whereas periods 2--4 and 7 yielded near-zero to sub-20% accuracy (e.g., period 2: 0.25%; period 3: 0.12%; period 7: 16.38%), and period 8 needed all 50 epochs to reach 99.97%. This again demonstrates that FoNE performs best when its moduli match the underlying digit base.

```{=latex}
\centering
```
```{=latex}
\small
```
::: {#tab:base5_alignment}
   **FoNE Period**   **Epochs**   **Accuracy (%)**
  ----------------- ------------ ------------------
          2              50             0.25
          3              50             0.12
          4              50             0.17
          5              6             100.00
          7              50            16.38
          8              50            99.97

  : Accuracy (%) with base-5 number inputs, under varying FoNE periods.
:::

We additionally benchmark FoNE on modular addition tasks with varying moduli. The goal is to predict $x + y \bmod m$, where $x, y \in [0, m)$. FoNE achieves near-perfect accuracy across all moduli, whereas standard tokenization degrades sharply on several of them (e.g., $m = 97$, $100$, $121$, and $257$).

```{=latex}
\centering
```
```{=latex}
\small
```
::: {#tab:modular_results}
   **Modulus $m$**   **FoNE Accuracy**   **Standard Tokenization Accuracy**
  ----------------- ------------------- ------------------------------------
         11               100.00                       99.98
         60                99.27                       99.99
         97               100.00                        0.89
         100              100.00                       18.17
         113              100.00                       99.99
         121              100.00                        5.48
         225              100.00                       65.10
         256              100.00                       70.20
         257              100.00                       11.36

  : Accuracy (%) on modular addition tasks across various moduli.
:::

These experiments validate our claim that FoNE is most effective when its moduli align with the target numeral base, and it remains robust even in high-modulus arithmetic settings where standard tokenization breaks down.
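The base and modulus ablations above amount to changing the periods used by the embedding. A minimal sketch of this generalization is given below; the function names are ours, and the paper's default setting corresponds to `base=10`.

```python
import numpy as np

def fone_with_base(x: float, num_digits: int, base: int = 10) -> np.ndarray:
    """FoNE variant whose periods are powers of `base` instead of powers of 10."""
    entries = []
    for i in range(1, num_digits + 1):
        T = float(base) ** i                  # period base^i encodes x mod base^i
        theta = 2.0 * np.pi * x / T
        entries.extend([np.cos(theta), np.sin(theta)])
    return np.asarray(entries)

def fone_single_modulus(x: float, modulus: int) -> np.ndarray:
    """For modular addition mod m, one (cos, sin) pair with period m already represents x mod m."""
    theta = 2.0 * np.pi * x / modulus
    return np.array([np.cos(theta), np.sin(theta)])
```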

More Related Work
=================

#### Fourier Features.

Fourier features are commonly observed in image models, particularly in the early layers of vision models [@olshausen1997sparse; @olah2020overview; @fiquet2024polar]. These features enable the model to detect edges, textures, and other spatial patterns effectively. However, Transformers struggle to capture high-frequency components [@bai2022improving; @tancik2020fourier]. Augmenting data with high-frequency components or explicitly encoding coordinates using Fourier features has been demonstrated to improve model performance [@tancik2020fourier; @he2024frequency; @hua2024fourier]. In fact, the original Transformers paper [@vaswani2017attention] uses Fourier features to encode the position information of tokens; however, it does not apply this idea to number tokens to aid with numerical tasks. In modular addition tasks, studies have revealed that after "grokking," a one-layer Transformer can learn to solve the task perfectly by leveraging Fourier features [@nanda2023progress; @gu2024fourier]. Furthermore, @zhou2024pre demonstrate that LLMs naturally encode numbers using Fourier features during pretraining, leveraging these representations for arithmetic tasks [@levy2024language; @kantamneni2025language; @lindsey2025biology]. Building on this insight, we propose using modular Fourier components to explicitly represent digits, enabling models to perform precise numerical computation. This allows algebraic operations to be carried out in a component-wise, parallel manner and overcomes the limitations of token-based number representations.

Preliminaries and Missing Proofs {#app:pre_proof}
================================

Preliminaries
-------------

In this section, we provide the necessary mathematical definitions and concepts used throughout the paper.

#### Period and Frequency.

A function $f(x)$ is periodic with period $T > 0$ if $f(x + T) = f(x)$ for all $x$. The period $T$ is the smallest positive value for which the function repeats. The frequency of a periodic function is the reciprocal of its period, $1/T$, and describes the number of cycles completed in one unit interval. For the sine and cosine functions $\cos\bigl(\frac{2\pi}{T}x\bigr)$ and $\sin\bigl(\frac{2\pi}{T}x\bigr)$, the period is $T$.

#### Unit Circle.

The unit circle is the set of points in the plane at a distance of 1 from the origin, given by $x^2 + y^2 = 1$. The coordinates of points on the unit circle can be parameterized as $(\cos\theta, \sin\theta)$, where $\theta$ is the angle measured counterclockwise from the positive $x$-axis. For any angle $\theta$, $\cos\theta$ represents the $x$-coordinate, and $\sin\theta$ represents the $y$-coordinate.

#### Two-Argument Inverse Tangent.

The two-argument inverse tangent function, $\operatorname{atan2}(y, x)$, determines the angle $\theta$ (modulo $2\pi$) given the coordinates $(x, y) = (\cos\theta, \sin\theta)$. Specifically, $$\theta = \operatorname{atan2}(y, x),$$ which resolves the angle $\theta$ uniquely based on the signs of $x$ and $y$.

#### Modular Arithmetic.

Modular arithmetic considers equivalence classes of numbers under a modulus $T > 0$. For integers $a$ and $b$, $a \equiv b \pmod{T}$ if $T \mid (a - b)$, meaning $a$ and $b$ differ by an integer multiple of $T$.

#### Fourier Representation.

Periodic functions with period $T$ can be represented using the fundamental frequencies $\frac{2\pi}{T}$. For example, the embeddings $\bigl(\cos\bigl(\frac{2\pi}{T}x\bigr), \sin\bigl(\frac{2\pi}{T}x\bigr)\bigr)$ capture the periodicity of $x$ modulo $T$ by mapping it to a unique point on the unit circle.

Missing Proofs
--------------

In this section, we provide some missing proofs.

```{=latex}
\begin{lemma}[Formal version of Lemma \ref{lem:fne_preserve_numeracy:informal}]\label{lem:fne_preserve_numeracy:formal}
    Given the pair $\bigl(\cos(\tfrac{2\pi}{T}x), \sin(\tfrac{2\pi}{T}x)\bigr)$, we can recover $x \bmod T$.
\end{lemma}
```
```{=latex}
\begin{proof}
Let $\theta = \frac{2\pi}{T} \, x$. Then the given pair becomes 
\[
\bigl(\cos(\theta), \sin(\theta)\bigr).
\]
From this pair, one can recover $\theta$ uniquely modulo $2\pi$. Concretely, $\theta$ can be obtained (modulo $2\pi$) using the two-argument inverse tangent function:
\[
\theta \equiv \operatorname{atan2}\bigl(\sin(\theta), \cos(\theta)\bigr) \quad (\bmod \; 2\pi).
\]
Since $\theta = \frac{2\pi}{T} \, x$, we have
\[
x \;=\; \frac{T}{2\pi} \, \theta.
\]
Hence $x$ is determined up to integer multiples of $T$, i.e., $x \bmod T$. 

In other words, if
\[
\bigl(\cos(\tfrac{2\pi}{T}x_1), \sin(\tfrac{2\pi}{T}x_1)\bigr)
\;=\;
\bigl(\cos(\tfrac{2\pi}{T}x_2), \sin(\tfrac{2\pi}{T}x_2)\bigr),
\]
then $\frac{2\pi}{T} x_1 \equiv \frac{2\pi}{T} x_2 \pmod{2\pi}$, which implies $x_1 \equiv x_2 \pmod{T}$. Therefore, from the pair $\bigl(\cos(\tfrac{2\pi}{T}x), \sin(\tfrac{2\pi}{T}x)\bigr)$, we can indeed recover $x \bmod T$.
\end{proof}
```
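The recovery step in this proof is easy to check numerically; the snippet below is a sanity-check sketch only, with helper names of our choosing.

```python
import numpy as np

def recover_mod(cos_val: float, sin_val: float, T: float) -> float:
    """Recover x mod T from (cos(2*pi*x/T), sin(2*pi*x/T)) using the two-argument inverse tangent."""
    theta = np.arctan2(sin_val, cos_val) % (2 * np.pi)   # angle in [0, 2*pi)
    return T * theta / (2 * np.pi)

x, T = 1234.56, 10.0
pair = (np.cos(2 * np.pi * x / T), np.sin(2 * np.pi * x / T))
assert abs(recover_mod(*pair, T) - (x % T)) < 1e-6       # recovers 4.56
```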
```{=latex}
\begin{lemma}[Layer-Normalized FoNE Preserves Numeracy]\label{lem:fne_preserve_numeracy_layer_norm}
    Given a number's Layer-Normalized Fourier Number Embedding $\mathrm{LN}(\FoNE(x) + \mathbf{p})$, where $\FoNE(x)$ is the Fourier Number Embedding of $x$ and $\mathbf{p}$ is an orthogonal positional encoding vector, assume the mean of $\FoNE(x) + \mathbf{p}$ is $0$. Let $m$ be the integer digit length of $x$ and $n$ be the decimal digit length of $x$. Then, using Lemma~\ref{lem:fne_preserve_numeracy:informal}, we can recover $x \bmod 10^{i}$ for each integer $i$ in the range $-n+1$ to $m$.
\end{lemma}
```
```{=latex}
\begin{proof}
Assume the mean of $\mathbf{x} = \FoNE(x) + \mathbf{p}$ is $0$, i.e., $\mu = 0$. Under this assumption, LayerNorm simplifies to:
\[
\mathrm{LN}(\mathbf{x}) = \frac{\mathbf{x}}{\sigma},
\]
where $\sigma$ is the standard deviation of $\mathbf{x}$.

Let $\mathbf{u} = \FoNE(x)$ encode the scalar $x$, and let $\mathbf{p}$ be an orthogonal positional encoding vector such that:
\[
\|\mathbf{u}\| = \|\mathbf{p}\| = 1 \quad \text{and} \quad \mathbf{u} \cdot \mathbf{p} = 0.
\]
Then, the input to LayerNorm is:
\[
\mathbf{x} = \mathbf{u} + \mathbf{p}.
\]

The standard deviation $\sigma$ of $\mathbf{x}$ is given by:
\[
\sigma = \sqrt{\frac{1}{d} \sum_{i=1}^d (\mathbf{x}_i - \mu)^2},
\]
where $d$ is the dimensionality of $\mathbf{x}$. Since $\mu = 0$, this simplifies to:
\[
\sigma = \sqrt{\frac{1}{d} \|\mathbf{x}\|^2}.
\]
Substitute $\mathbf{x} = \mathbf{u} + \mathbf{p}$:
\[
\|\mathbf{x}\|^2 = \|\mathbf{u} + \mathbf{p}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{p}\|^2 + 2\mathbf{u} \cdot \mathbf{p}.
\]
By orthogonality and unit norm, $\mathbf{u} \cdot \mathbf{p} = 0$, $\|\mathbf{u}\|^2 = 1$, and $\|\mathbf{p}\|^2 = 1$. Thus:
\[
\|\mathbf{x}\|^2 = 1 + 1 + 0 = 2.
\]
Therefore:
\[
\sigma = \sqrt{\frac{1}{d} \cdot 2} = \sqrt{\frac{2}{d}}.
\]

The LayerNorm operation simplifies to:
\[
\mathrm{LN}(\mathbf{x}) = \frac{\mathbf{x}}{\sigma} = \frac{\mathbf{u} + \mathbf{p}}{\sqrt{\frac{2}{d}}} = \sqrt{\frac{d}{2}} (\mathbf{u} + \mathbf{p}).
\]
This rescales $\mathbf{u}$ and $\mathbf{p}$ by a factor of $\sqrt{\frac{d}{2}}$.

The key observation is that LayerNorm applies a \textbf{uniform scaling} to all components of $\mathbf{x}$. Since $\mathbf{u}$ and $\mathbf{p}$ are orthogonal and their relative directions are preserved, the numerical relationships encoded in $\mathbf{u}$ (which represent $x$) are preserved up to a scaling factor. 

By Lemma~\ref{lem:fne_preserve_numeracy:informal}, the numeracy of $x$ is preserved. This means we can recover $x \bmod 10^i$ for all $i$ in the range $-n+1 \leq i \leq m$, as the normalized embedding retains the necessary information about $x$.

\end{proof}
```
The same result holds for RMSNorm because it also applies a uniform scaling (based on the root mean square of the input) while preserving the relative directions of the embedding components, thus maintaining the numeracy of $x$.
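The core of the argument, that a positive uniform rescaling never changes the argmax-based digit decoding, can be illustrated directly. This is a toy check under the lemma's assumptions, not the training-time implementation.

```python
import numpy as np

# Digit prototypes phi(j, 10) and an argmax decoder over a (cos, sin) pair.
protos = np.array([[np.cos(2 * np.pi * j / 10), np.sin(2 * np.pi * j / 10)] for j in range(10)])
decode = lambda pair: int(np.argmax(protos @ pair))

digit = 7
pair = np.array([np.cos(2 * np.pi * digit / 10), np.sin(2 * np.pi * digit / 10)])
for scale in (0.1, 1.0, np.sqrt(8 / 2)):   # includes the sqrt(d/2) factor from the proof with d = 8
    assert decode(scale * pair) == digit   # uniform scaling preserves the decoded digit
```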

Note that standard sinusoidal positional encodings (PEs) [@vaswani2017attention] cannot serve as a substitute for numerical embeddings. PEs are explicitly constructed to distinguish token positions in a sequence, not to represent the magnitude or digit structure of real numbers. They use a fixed set of exponentially spaced frequencies to ensure each position has a unique yet non-invertible signature; as a result, there is no straightforward way to recover the original numeric value (or its individual digits) from a PE embedding. Moreover, the frequencies in PEs are chosen for positional uniqueness, not for digit-aligned modular arithmetic, so small changes in numeric value can produce large, non-monotonic changes in the embedding space---precisely the opposite of the smooth, digit-wise variation required for accurate number encoding. Unlike FoNE's digit-aligned sinusoidal components (e.g., mod 10, mod 100, ...), PEs do not guarantee invertibility with respect to numeric operations.

More Evidence {#app:moreevidence}
=============

LLMs Struggle with Multi-digit Arithmetic {#app:eval_llm}
-----------------------------------------

We evaluate five production LLMs (Claude 3.7 Sonnet, DeepSeek V3.1, Gemini 2.5 Flash, GPT-5, and Qwen3-235B) on direct arithmetic without chain-of-thought, code execution, or tools. For each setting, we sample 100 IID problem instances with operands drawn uniformly over the full range of $d$-digit numbers. Tasks include $d\in\{3,4,5,6\}$ for multiplication and $d\in\{7,8,50\}$ for addition. Models receive a two-shot, operation-matched prompt and must return a single integer only. Decoding uses temperature 0. We parse the first integer in the response and score exact-match accuracy by numeric equality with the reference.
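A schematic of the sampling and scoring protocol is shown below; the model call itself is abstracted away, and we assume "full $d$-digit numbers" means operands drawn uniformly from $[10^{d-1}, 10^d - 1]$.

```python
import random
import re

def sample_problem(d: int, op: str):
    """Sample two uniform d-digit operands and the reference answer (assumes d-digit means [10^(d-1), 10^d - 1])."""
    lo, hi = 10 ** (d - 1), 10 ** d - 1
    a, b = random.randint(lo, hi), random.randint(lo, hi)
    return a, b, (a * b if op == "mul" else a + b)

def exact_match(response: str, reference: int) -> bool:
    """Parse the first integer in the model response and compare it numerically to the reference."""
    match = re.search(r"-?\d+", response.replace(",", ""))
    return match is not None and int(match.group()) == reference
```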

```{=latex}
\centering
```
::: {#tab:arithmetic_results}
   **Model**            **Mul. 3**   **Mul. 4**   **Mul. 5**   **Mul. 6**   **Add. 7**   **Add. 8**   **Add. 50**
  -------------------- ------------ ------------ ------------ ------------ ------------ ------------ -------------
   Claude 3.7 Sonnet       1.00         0.53         0.02         0.00         1.00         0.97          0.82
   DeepSeek V3.1           0.94         0.25         0.01         0.00         0.95         0.95          0.34
   Gemini 2.5 Flash        0.58         0.11         0.00         0.00         0.99         0.96          0.33
   GPT-5                   0.74         0.09         0.00         0.00         0.98         0.92          0.66
   Qwen3-235B              0.94         0.62         0.05         0.01         0.99         0.98          0.28

  : Exact-match accuracy on direct multi-digit arithmetic; "Mul. $d$" and "Add. $d$" denote $d$-digit multiplication and addition, respectively. Each entry averages 100 IID problems per setting with uniformly sampled $d$-digit operands. Two-shot prompting, single-integer output only.
:::

`\noindent`{=latex}**Results.** Even the most recently released LLMs still struggle with *multi-digit multiplication* and fail to achieve perfect accuracy on *addition*. For multiplication, accuracy drops sharply as operand length increases: models perform reasonably well on 3-digit multiplication ($\geq 0.58$), but fall below $0.10$ for 5-digit cases and to nearly $0$ at 6 digits. Qwen3-235B is the most robust, reaching $0.62$ on 4-digit and $0.05$ on 5-digit multiplication, yet still fails at 6 digits. For addition, all models achieve $\geq 0.92$ on 7--8 digit tasks, but accuracy declines on long-sequence addition (50 digits), ranging from $0.28$ (Qwen3-235B) to $0.82$ (Claude 3.7 Sonnet). In summary, LLMs handle addition with short to medium operands well but remain brittle on long-sequence addition and especially on multi-digit multiplication.

Emergence of Fourier Features during Pre-training
-------------------------------------------------

We follow @zhou2024pre and conduct the same Fourier analysis on the Pythia model. Figure `\ref{fig:pythia_number_checkpoints}`{=latex} shows how Pythia gradually learns Fourier features during pre-training; models of different sizes gradually converge to the same frequency components.
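One way to reproduce this kind of analysis is to stack the embeddings of the number tokens $0, 1, 2, \dots$ and take a Fourier transform along the number axis; the exact preprocessing in @zhou2024pre may differ, so the snippet below is an assumption-laden sketch.

```python
import numpy as np

def number_embedding_spectrum(emb_matrix: np.ndarray) -> np.ndarray:
    """emb_matrix[k] is the embedding of the token for integer k (k = 0..N-1).
    Returns the magnitude spectrum averaged over embedding dimensions.
    Note: the original analysis may center, project, or window differently."""
    centered = emb_matrix - emb_matrix.mean(axis=0, keepdims=True)  # remove the constant component
    spectrum = np.abs(np.fft.rfft(centered, axis=0))                # FFT along the number axis
    return spectrum.mean(axis=1)

# Outlier peaks at frequencies corresponding to periods such as 10 indicate Fourier-like
# number features in the embedding matrix.
```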

```{=latex}
\begin{figure*}[htbp]\centering
  \includegraphics[width=1.0\textwidth]{figures/pretrained_embeddings/pythia_number_embedding_checkpoints.png}
  \caption{Fourier analysis of the Pythia model's number embeddings across pre-training checkpoints. The figure illustrates how the Fourier features are progressively learned during pre-training, showing the emergence of specific frequency components. Models of varying sizes exhibit a similar trend, gradually learning the same frequency components over time.}
  \label{fig:pythia_number_checkpoints}
\end{figure*}
```
We extend the work of @zhou2024pre to other pre-trained LLMs and observe similar findings: pre-trained LLMs, regardless of the dataset used, tend to learn the same outlier frequency components.

```{=latex}
\centering
```
+:------------------------------------------------------------------------------------------------------------------------------------------------:+:-----------------------------------------------------------------------------------------------------------------------------:+
| `\subfigure[pre-trained Pythia]{\includegraphics[width=0.45\textwidth]{figures/pretrained_embeddings/pythia_number_embedding_last.png}}`{=latex} | `\subfigure[fine-tuned Llama3.2]{\includegraphics[width=0.45\textwidth]{figures/pretrained_embeddings/llama3.2.png}}`{=latex} |
+--------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+
| `\subfigure[pre-trained OPT]{\includegraphics[width=0.45\textwidth]{figures/pretrained_embeddings/opt-1.3b.png}}`{=latex}                        | ```{=latex}                                                                                                                   |
|                                                                                                                                                  | \subfigure[pre-trained GPT2]{\includegraphics[width=0.45\textwidth]{figures/pretrained_embeddings/gpt2-large.png}}            |
|                                                                                                                                                  | ```                                                                                                                           |
+--------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+

```{=latex}
\newpage
```
```{=latex}
\newpage
```
FoNE for 60-digit Integer Addition in One Forward Pass {#app:60digit}
======================================================

```{=latex}
\begin{figure*}[!htbp]\centering
    \includegraphics[width=0.5\linewidth]{figures/results/6060addition.png}
    \caption{ Accuracy of an 8-layer transformer on 60-digit addition tasks, illustrating the effectiveness of FoNE embeddings in handling long sequences. The model achieves an average accuracy of 97.42\% across different operand lengths, showcasing its capability in numerical precision and sequence representation.}
    \label{fig:6060addition}
\end{figure*}
```
As discussed in Section `\ref{sec:discussion}`{=latex}, the maximum digit length that a `float64` data type can precisely represent is 15 digits. Consequently, even if we convert numbers to `float64` and then back to `float16` to match the model weight precision, numerical inaccuracies still arise when the input $x$ exceeds 15 digits. To mitigate this issue, we split $x$ into smaller chunks, allowing FoNE to operate without precision loss.

Specifically, $x$ is split into groups of five digits, and $\text{FoNE}$ is applied independently to each chunk. Each digit within a chunk is encoded into two dimensions, resulting in an embedding of length 10 per chunk. These chunk embeddings are then concatenated to form the final representation of $x$. This method ensures that even for long inputs, FoNE still preserves the numeracy of the number.
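A hedged sketch of this chunking follows; the helper name and the right-aligned, least-significant-first chunk order are our illustrative choices.

```python
import numpy as np

def chunked_fone(digits: str, chunk_len: int = 5) -> np.ndarray:
    """Split a long integer's digit string into 5-digit chunks (least significant chunk first)
    and apply FoNE to each chunk independently, so no chunk exceeds float64 precision."""
    chunks = [digits[max(0, i - chunk_len):i] for i in range(len(digits), 0, -chunk_len)]
    parts = []
    for chunk in chunks:
        x = float(chunk)                        # at most 5 digits, exactly representable
        for i in range(1, len(chunk) + 1):      # (cos, sin) of x mod 10^i, two entries per digit
            theta = 2.0 * np.pi * x / 10.0 ** i
            parts.extend([np.cos(theta), np.sin(theta)])
    return np.asarray(parts)                    # 10 entries per full 5-digit chunk, concatenated

emb = chunked_fone("1234567890" * 6)            # a 60-digit number
```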

We adopt the data generation approach from [@mcleish2024transformers], which includes all combinations of operand lengths $(i, j)$ up to a maximum length $k$, generating 20 million stratified samples to ensure balanced representation across all length pairs. Training is conducted using a language model cramming approach [@geiping2023cramming], constrained to 8 exaFLOP (equivalent to 24 hours of training on a single Nvidia RTX A6000 GPU). Using this strategy, as depicted in Figure `\ref{fig:6060addition}`{=latex}, an 8-layer transformer trained on 60-digit addition achieves an average accuracy of 97.42% across various operand lengths in a single forward pass. This result underscores the effectiveness of FoNE in processing long numbers with high precision and computational efficiency.

Combining FoNE with Abacus {#app:abacus}
==========================

We train decoder-only causal language models to solve arithmetic problems, following the setup described in @mcleish2024transformers. Inputs are formatted in a least-significant-digit-first order (e.g., $98282 + 3859172 = 2787472$), without padding between digits or operands. The training dataset includes all combinations of operand lengths $(i, j)$ up to a maximum length $k$, with 20 million stratified samples ensuring balanced representation across all length pairs.
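The least-significant-digit-first formatting can be made explicit with a small helper (names ours); for example, the problem $28289 + 2719583 = 2747872$ is rendered as $98282 + 3859172 = 2787472$.

```python
def lsd_first(n: int) -> str:
    """Write a non-negative integer with its least significant digit first."""
    return str(n)[::-1]

def format_addition(a: int, b: int) -> str:
    """Render `a + b = a+b` with every number digit-reversed and no padding."""
    return f"{lsd_first(a)}+{lsd_first(b)}={lsd_first(a + b)}"

assert format_addition(28289, 2719583) == "98282+3859172=2787472"
```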

For the input representation, we combine Fourier Number Embeddings (FoNE) with the Abacus method [@mcleish2024transformers]; that is, each digit is embedded with FoNE. Training is conducted using a language model cramming approach [@geiping2023cramming], constrained to 8 exaFLOP (equivalent to 24 hours of training on a single Nvidia RTX A6000 GPU).

We train and evaluate the models across three runs, each with a different random seed, as shown in Figure `\ref{fig:diff}`{=latex}. Results indicate that incorporating FoNE enables the Abacus method to achieve better generalization and higher accuracy.

```{=latex}
\centering
```
![Heatmaps of accuracy percentages for "FoNE+Abacus" (left column) and "Abacus" (right column) across three different random seeds. Each heatmap represents accuracy as a function of the first and second number lengths, with lighter blue shades indicating higher accuracy. The color scale ranges from white (low accuracy) to blue (high accuracy). These visualizations highlight that FoNE can be combined with Abacus to improve performance.](figures/results/diff1.png){#fig:diff width="0.8\\linewidth"}

More Details on Experimental Setup {#app:exp}
==================================

In this section, we provide the experimental settings used in Section `\ref{sec:exp_setting}`{=latex}.

Learning rates were determined through an extensive search, with the best rates selected separately for each method based on validation performance. Final training hyperparameters include a learning rate of $0.005$ for regular and FoNE methods, and $0.0001$ for the xVal method, a batch size of $512$, and $100$ epochs. The fine-tuning process required less than 10 hours, while training from scratch took less than 3 days.

```{=latex}
\centering
```
::: {#tab:dataset_sizes}
            **Dataset**             **Train Size**   **Validation Size**   **Test Size**
  -------------------------------- ---------------- --------------------- ---------------
      6-digit decimal addition         720,000             80,000             200,000
      6-digit integer addition         720,000             80,000             200,000
      5-digit integer subtract         720,000             80,000             200,000
   3-digit integer multiplication      360,000             40,000             100,000
   4-digit integer multiplication      720,000             80,000             200,000
           classification              720,000             80,000             200,000

  : Dataset Sizes for Training, Testing, and Validation
:::

```{=latex}
\centering
```
```{=latex}
\small
```
::: {#tab:varysizes}
            **Dataset**             **Model Size for Varying Data Size**   **Data Size for Varying Model Size**
  -------------------------------- -------------------------------------- --------------------------------------
      6-digit decimal addition                     37.55M                                200,000
      6-digit integer addition                     37.55M                                200,000
      5-digit integer subtract                     37.55M                                200,000
   3-digit integer multiplication                  37.55M                                360,000
   4-digit integer multiplication                  37.55M                                360,000
           classification                          37.55M                                 50,000

  : Dataset and Configuration Sizes for Model and Data Variation Experiments
:::

```{=latex}
\centering
```
```{=latex}
\footnotesize
```
::: {#tab:model_config}
   **Model**   **Hidden Size**   **Intermediate Size**   **\# Hidden Layers**   **\# Attention Heads**   **\# Key-Value Heads**
  ----------- ----------------- ----------------------- ---------------------- ------------------------ ------------------------
       1             64                   256                     1                       4                        2
       2             128                  512                     2                       4                        2
       3             192                  768                     3                       6                        3
       4             256                 1024                     4                       8                        4
       5             320                 1280                     5                       8                        4
       6             384                 1536                     6                       8                        4

  : Model Configuration Table
:::

Ablation Study
--------------

In this section, we present the mispredictions of a model trained with a FoNE whose periods are multiples of $5$ instead of $10$. Table `\ref{tab:misprediction}`{=latex} shows that, for each digit, the mispredictions consistently deviate from the true labels by $5$: because digits $d$ and $d+5$ share the same representation modulo $5$, the embedding cannot distinguish them.

```{=latex}
\centering
```
::: {#tab:misprediction}
   **Index**   **Predicted Value**   **Actual Value**
  ----------- --------------------- ------------------
       1             934.03              934.585
       2              3.009              558.509
       3             912.311             917.366
       4            6201.003             1756.008
       5             1240.34             1290.84

  : Mispredictions in the Final Evaluation when we embed each digit with only $\bmod 5$.
:::

We also present the model's mispredictions in Table `\ref{tab:misprediction2}`{=latex}, where each digit is encoded into a separate dimension of the embedding. For example, the number $567$ is represented as $[5,6,7]$. During training, we compute the RMSE loss between the last hidden states and the labels. During prediction, we interpret each entry in the last hidden state as a single digit.

```{=latex}
\centering
```
::: {#tab:misprediction2}
   **Index**   **Predicted Value**   **Actual Value**
  ----------- --------------------- ------------------
       1             883.888             993.999
       2             787.878             898.989
       3             888.758             989.759
       4             748.785             849.895
       5             677.677             688.788
      10            1179.488             1189.499

  : Mispredictions in the Final Evaluation when directly encoding numbers into their embeddings.
:::

```{=latex}
\newpage
```
Replicating Results on GPT2-Large Based Model {#sec:gpt2}
=============================================

We conduct the same experiments on decimal addition using a GPT-2 Large-based model. The results indicate that changing the model architecture, for instance from Llama 3.2's RMSNorm to GPT-2 Large's LayerNorm, does not change the conclusions.

```{=latex}
\begin{figure*}[!htbp]\centering
\subfigure[ 6-digit decimal addition: Accuracy vs. Training Data Size]{\includegraphics[width=0.45\textwidth]{figures/results/add2datagpt.png}}
\hspace{2mm}
\subfigure[6-digit decimal addition: Accuracy vs. Model Size]{\includegraphics[width=0.45\textwidth]{figures/results/add2modelgpt.png}}
\caption{
We train GPT2-Large from scratch with random initialization using different number embedding methods on 6-digit decimal addition. The test accuracy is compared across varying data sizes and model sizes.
}
\label{fig:gpt}
\end{figure*}
```
```{=latex}
\newpage
```
$R^2$ Comparison for Different Arithmetic Tasks {#app:r2}
===============================================

xVal [@golkar2023xval] performs well on the $R^2$ metric $$\begin{aligned}
R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2},\end{aligned}$$ because it uses RMSE as its loss function. However, we demonstrate that FoNE outperforms xVal on $R^2$ in most tasks. We report the final $R^2$ on the test sets from the experiments in Section `\ref{sec:exp_results}`{=latex}.
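For reference, a direct implementation of this definition is given below; note that $R^2$ can be arbitrarily negative when predictions are worse than predicting the mean.

```python
import numpy as np

def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination, matching the formula above."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)
```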

```{=latex}
\begin{figure*}[!htbp]\centering
\subfigure[Data size vs. $R^2$]{\includegraphics[width=0.45\textwidth]{figures/results/adddecimaldatar2.png}}
\hspace{2mm}
\subfigure[Model size vs. $R^2$]{\includegraphics[width=0.45\textwidth]{figures/results/adddecimalmodelr2.png}}
\caption{
Comparison of $R^2$ trends for 6-digit decimal addition with respect to model size and data size.
}
\label{fig:r2decimal}
\end{figure*}
```
```{=latex}
\begin{figure*}[!htbp]\centering
  \subfigure[6-digit integer addition: Model\&Data size vs. $R^2$]{
    \begin{minipage}{0.48\textwidth}
      \centering
      \includegraphics[width=0.49\textwidth]{figures/results/add2datar2.png}
    \hspace{-2mm}
      \includegraphics[width=0.49\textwidth]{figures/results/add2modelr2.png}
    \end{minipage}
    \label{fig:decimal_addition_r2}
  }
  \hspace{-2mm}
    \subfigure[5-digit integer addition: Model\&Data size vs. $R^2$]{
    \begin{minipage}{0.48\textwidth}
      \centering
      \includegraphics[width=0.49\textwidth]{figures/results/adddatar2.png}
      \hspace{-2mm}
      \includegraphics[width=0.49\textwidth]{figures/results/addmodelr2.png}
    \end{minipage}
    \label{fig:addition_r2}
  }
  \hspace{-2mm}\\
  \subfigure[5-digit integer subtraction: Model\&Data size vs. $R^2$]{
    \begin{minipage}{0.48\textwidth}
      \centering
      \includegraphics[width=0.49\textwidth]{figures/results/subdatar2.png}
      \hspace{-2mm}
      \includegraphics[width=0.49\textwidth]{figures/results/submodelr2.png}
    \end{minipage}
    \label{fig:subtraction_r2}
  }
  \hspace{-1mm}
  \subfigure[3-digit integer multiplication: Model\&Data size vs. $R^2$]{
    \begin{minipage}{0.48\textwidth}
      \centering
      \includegraphics[width=0.49\textwidth]{figures/results/muldatar2.png}
      \hspace{-2mm}
      \includegraphics[width=0.49\textwidth]{figures/results/mulmodelr2.png}
    \end{minipage}
    \label{fig:multiplication_r2}
  }
  \caption{Comparison of $R^2$ trends for various arithmetic tasks with respect to model size and data size.}
  \label{fig:result_r2} 
\end{figure*}
```
Comparison with Number Token Loss (NTL)
=======================================

To validate the effectiveness of FoNE, we conduct comprehensive experiments comparing it against Number Token Loss (NTL) [@zausinger2024regress], a method designed to improve numerical reasoning by adding an auxiliary loss term on number tokens. We evaluate two NTL variants: NTL with digit-wise tokenization (denoted NTL-D), where numbers are decomposed into individual digit tokens, and NTL with subword tokenization (NTL-S), which uses the standard tokenizer with an extended number vocabulary covering tokens 0--999. For each variant, we test three lambda values ($\lambda \in \{0.15, 0.3, 0.8\}$) that control the relative weighting of the NTL loss component.
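For intuition only, the sketch below shows one common form of such an auxiliary objective: the usual cross-entropy plus $\lambda$ times a numeric distance between the target digit and the expected digit value under the softmax. This is a schematic illustration of the idea, not the exact formulation of @zausinger2024regress.

```python
import numpy as np

def number_token_style_loss(logits: np.ndarray, target_digit: int, lam: float = 0.15) -> float:
    """Schematic lambda-weighted numeric auxiliary loss over digit tokens 0..9
    (illustrative only; not the exact NTL formulation)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    ce = -np.log(probs[target_digit])                              # standard cross-entropy
    expected_digit = float(np.dot(probs, np.arange(len(logits))))  # expectation of the digit value
    return float(ce + lam * (expected_digit - target_digit) ** 2)
```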

Experimental Setup
------------------

We conduct experiments on two arithmetic reasoning benchmarks: 4-digit integer addition and 3-digit integer multiplication. To assess data efficiency, we train models with varying amounts of data: 5K, 25K, and 50K samples for addition; 10K, 50K, and 75K samples for multiplication. All experiments use Llama-3.2-1B-Instruct's configuration, training from scratch with consistent architectural parameters (512 hidden dimensions, 8 layers, 8 attention heads) to ensure fair comparison. We train all models for 50 epochs with batch size 512, using learning rates of $5 \times 10^{-4}$ for FoNE and $5 \times 10^{-3}$ for NTL methods based on preliminary tuning. Models are evaluated on $1000$ held-out test samples using whole number accuracy, $R^2$, and RMSE.

Results
-------

Figure `\ref{fig:nlt}`{=latex} presents comprehensive performance comparisons across both tasks and all metrics. The results demonstrate FoNE's substantial and consistent superiority over all NTL variants.

#### Addition Task

FoNE reaches 98.6% accuracy with only 5,000 training samples and perfect 100% accuracy with 25,000 or more samples (Table `\ref{tab:addition_results}`{=latex}). The $R^2$ scores consistently exceed 0.999, and RMSE drops to effectively zero at larger training scales. NTL methods exhibit significantly degraded performance. The best-performing variant, NTL(D,0.15), achieves 97.4% accuracy with 50K samples---still below FoNE's performance at 5K samples. Other NTL configurations struggle considerably: NTL(D,0.3) peaks at only 51.4% accuracy, while higher lambda values ($\lambda=0.8$) yield near-zero performance (3.3-6.1%). Most critically, NTL with subword tokenization fails catastrophically across all configurations, achieving near-zero accuracy with negative $R^2$ values as low as -13,170, indicating performance substantially worse than a naive baseline. RMSE values for NTL(S) exceed $4.7 \times 10^5$, compared to 0.0 for FoNE.

#### Multiplication Task

The multiplication task proves more challenging, yet FoNE maintains strong performance. With 75,000 training samples, FoNE achieves 89.3% accuracy with $R^2$=0.9996 and RMSE=4,286, demonstrating consistent and reliable predictions (Table `\ref{tab:multiplication_results}`{=latex}). Performance scales smoothly with data: from 35.1% at 10K samples to 84.5% at 50K and 89.3% at 75K. NTL methods again significantly underperform. The best NTL configuration, NTL(D,0.15) with 75K samples, achieves only 16.3% accuracy---73 percentage points below FoNE. Most other NTL variants achieve less than 5% accuracy even at the largest training scale. Subword tokenization continues to fail, with negative $R^2$ scores (e.g., -0.534 for NTL(S,0.8)) and RMSE values exceeding $2 \times 10^5$, three orders of magnitude worse than FoNE.

#### Analysis

Several key observations emerge from our experiments. First, FoNE demonstrates superior data efficiency, achieving strong performance with minimal training data while NTL requires substantially more samples to reach even moderate accuracy levels. Second, NTL's performance is highly sensitive to the lambda hyperparameter: increasing $\lambda$ from 0.15 to 0.8 consistently degrades accuracy across both tasks, suggesting that heavily weighting the distance-based loss interferes with learning. Finally, FoNE also exhibits better training efficiency, requiring $2.5-11.7\times$ less training time than NTL methods while achieving far superior accuracy.

```{=latex}
\begin{figure*}[t]\centering
\includegraphics[width=\textwidth]{figures/ntlcomparison/comparison_all.pdf}
\caption{Performance comparison of FoNE vs. NTL methods on addition (top) and multiplication (bottom) tasks across three metrics: accuracy, $R^2$, and RMSE. FoNE (solid blue squares) consistently outperforms all NTL variants, achieving 100\% accuracy on addition and 89.3\% on multiplication vs. NTL's best of 16.3\%.}
\label{fig:nlt}
\end{figure*}
```
```{=latex}
\begin{table*}[]\centering
\caption{Addition task results comparing FoNE with NTL methods across different training data sizes. FoNE achieves perfect accuracy with $\geq$25K samples while all NTL variants struggle.}
\label{tab:addition_results}
\small
\begin{tabular}{lcccccc}
\toprule
\textbf{Method} & \multicolumn{2}{c}{\textbf{5K Samples}} & \multicolumn{2}{c}{\textbf{25K Samples}} & \multicolumn{2}{c}{\textbf{50K Samples}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7}
& Acc. (\%) & $R^2$ & Acc. (\%) & $R^2$ & Acc. (\%) & $R^2$ \\
\midrule
\textbf{FoNE} & \textbf{98.6} & \textbf{0.9993} & \textbf{100.0} & \textbf{1.0000} & \textbf{100.0} & \textbf{1.0000} \\
\midrule
NTL(D,0.15) & 0.9 & -3.57 & 96.3 & 0.9998 & 97.4 & 0.623 \\
NTL(D,0.3) & 41.2 & 0.107 & 51.4 & 0.506 & 49.1 & 0.887 \\
NTL(D,0.8) & 3.3 & -0.304 & 6.1 & 0.774 & 5.3 & 0.773 \\
NTL(S,0.15) & 0.0 & -2.41 & 4.6 & 0.957 & 8.6 & -39.3 \\
NTL(S,0.3) & 0.0 & -7763 & 0.1 & -13030 & 5.7 & -8884 \\
NTL(S,0.8) & 0.1 & -8418 & 3.6 & -13170 & 0.7 & -13730 \\
\bottomrule
\end{tabular}
\end{table*}
```
```{=latex}
\begin{table*}[]\centering
\caption{Multiplication task results. FoNE maintains strong performance (89.3\% at 75K samples) while the best NTL variant achieves only 16.3\%.}
\label{tab:multiplication_results}
\small
\begin{tabular}{lcccccc}
\toprule
\textbf{Method} & \multicolumn{2}{c}{\textbf{10K Samples}} & \multicolumn{2}{c}{\textbf{50K Samples}} & \multicolumn{2}{c}{\textbf{75K Samples}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7}
& Acc. (\%) & $R^2$ & Acc. (\%) & $R^2$ & Acc. (\%) & $R^2$ \\
\midrule
\textbf{FoNE} & \textbf{35.1} & \textbf{0.9932} & \textbf{84.5} & \textbf{0.9994} & \textbf{89.3} & \textbf{0.9996} \\
\midrule
NTL(D,0.15) & 0.3 & 0.955 & 5.1 & 0.993 & 16.3 & 0.9991 \\
NTL(D,0.3) & 0.2 & 0.958 & 3.1 & 0.978 & 8.2 & 0.990 \\
NTL(D,0.8) & 0.0 & 0.928 & 0.4 & 0.969 & 0.5 & 0.983 \\
NTL(S,0.15) & 0.2 & 0.957 & 2.2 & 0.944 & 4.7 & 0.983 \\
NTL(S,0.3) & 0.2 & -0.462 & 0.7 & -0.032 & 3.9 & -0.107 \\
NTL(S,0.8) & 0.3 & -0.132 & 1.9 & -0.525 & 1.6 & -0.534 \\
\bottomrule
\end{tabular}
\end{table*}
```
```{=latex}
\clearpage
```
```{=latex}
\newpage
```
The Use of LLMs {#the-use-of-llms .unnumbered}
===============

LLMs were used only to polish language, such as grammar and wording. These models did not contribute to idea creation or writing, and the authors take full responsibility for this paper's content. `\fi`{=latex}

[^1]: Our evaluation (See Appendix `\ref{app:eval_llm}`{=latex}) of recently released LLMs on arithmetic confirms this limitation: they still struggle with multi-digit addition and multiplication.
