---
abstract: |
  To drive progress in science and engineering, large language models (LLMs) must be able to process large amounts of numerical data and solve long calculations efficiently. This is currently only possible through the use of external tools or extensive reasoning chains, either limiting the numerical intuition of LLMs or limiting the length of problems they can solve. We show that frontier LLMs require excessive amounts of reasoning tokens to solve even basic calculations, which is exacerbated by their tokenization strategies that split single numbers into multiple tokens. This motivates the need for efficient and effective single-token number encodings. We introduce a set of desiderata for such encodings and show that existing approaches fail to fulfill them. To address these shortcomings, we propose *BitTokens* [^1], a novel tokenization strategy that embeds any number into a single token using its IEEE 754 binary floating-point representation. Through extensive experiments we show that our BitTokens allow even small language models to learn algorithms that solve basic arithmetic operations nearly perfectly. This newly gained efficiency could expand the length and complexity of problems language models can solve.
author:
- |
  Linus Kreitner^1^[^2]`\hspace{0.5em}`{=latex}, Paul Hager^1^`\footnotemark[1]`{=latex}`\hspace{0.5em}`{=latex}, Jonathan Mengedoht^2^,\
  **Georgios Kaissis^1^, Daniel Rueckert^1,3,4^, Martin J. Menten^1,3,4^**\
  ^1^ Chair for AI in Healthcare and Medicine, Technical University of Munich (TUM)\
  and TUM University Hospital, Munich, Germany\
  ^2^ School of Computation, Information and Technology, TUM, Germany\
  ^3^ Munich Center for Machine Learning (MCML), Germany\
  ^4^ Department of Computing, Imperial College London, UK\
  `{linus.kreitner, paul.hager, martin.menten}@tum.de`
bibliography:
- references.bib
title: 'Efficient numeracy in language models through single-token number embeddings'
---

```{=latex}
\maketitle
```
Introduction
============

Many researchers share the common vision that large language models (LLMs) will not only alleviate routine work but also drive scientific and technological innovation. In many fields, such as physics and engineering, solving such complex tasks requires the processing of large amounts of numerical data and extensive calculations. Thus, to aid advancements in these fields, LLMs must possess efficient and effective numeracy, defined as the ability to understand and work with numbers. However, LLMs have historically struggled to solve even basic calculations [@ai4science2023impactlargelanguagemodels; @Hager2024; @cuicurie].

Research to improve the numeracy of LLMs has predominantly focused on two strategies: arithmetic tool use and reasoning chains. Tool-augmented LLMs leverage external calculators or code to bypass the need for internal arithmetic computation [@schick2023toolformer; @Qu2025; @gou2024tora; @zhang2023evaluating; @he-yueya2023solving; @parisi2022talm; @le2022coderl; @wang2024chainoftableevolvingtablesreasoning; @gu-etal-2024-middleware; @gao2023pal]. While this approach guarantees the correctness of the calculations, it forces the model to dynamically identify, construct, and offload all mathematical operations, introducing additional latency and complexity. Most importantly, outsourcing all calculations prevents the model from building an intuition for numbers and the results of calculations, which is required to interpret and contextualize information from complex domains [@zhou2024transformers; @spathis2024first; @nogueira2021investigatinglimitationstransformerssimple; @akhtar2023exploring].

Reasoning chains, on the other hand, prompt LLMs to generate logically consistent text, step-by-step, enabling them to solve complex problems by breaking them into smaller parts [@NEURIPS2022_9d560961; @yao2023tree; @wangself; @lightman2023let; @snell2024scaling]. This has resulted in large gains on many benchmarks, pushing the frontier of what is possible with LLMs. However, reasoning chains can be very inefficient, sometimes requiring tens of thousands of tokens to solve a single calculation, as illustrated in `\cref{fig:overview}`{=latex} and shown later. This limits the length and complexity of problems that LLMs can solve due to context window and cost constraints [@kwa2025measuring].

```{=latex}
\centering
```
![LLMs perform poorly on arithmetic tasks, requiring excessive reasoning tokens to achieve good performance. Our BitTokens tokenization strategy allows language models to solve arithmetic tasks both effectively and efficiently.](figures/Fig1_v6.drawio.png){#fig:overview width="0.75\\linewidth"}

We argue that both tooling and reasoning chains are merely crutches, allowing LLMs to solve arithmetic calculations but preventing them from obtaining the intrinsic numeracy skills required to efficiently solve complex tasks in advanced technical domains. We hypothesize that addressing this problem requires rethinking the way LLMs tokenize numbers. Existing tokenizers treat numbers the same as any other word, meaning they are parsed left to right and split based on a pre-defined vocabulary. This introduces an additional source of token inefficiency and is not well suited for comparing and calculating numbers. A series of works has tried to enhance this tokenization step by adding information about the magnitude of numbers via additional tokens [@schwartz2024numerologicnumberencodingenhanced] or positional embeddings [@mcleish2024transformers]. Others have aimed to address issues of left-to-right token prediction clashing with right-to-left carry-over logic by changing the order or formatting of digits during encoding [@baeumel2025lookaheadlimitationmultioperandaddition; @lee2024teaching; @singh2024tokenizationcountsimpacttokenization].

The most ambitious line of work aims to completely replace the way numbers are represented in LLMs, moving from encoding individual digits or triplets of digits to embedding each number via a single specialized number token. @han2023lunalanguageunderstandingnumber and @alberts2024interleaving both employ small neural networks to generate such a number token based on the number's digits and its surrounding textual context. While this improves numeracy, it suffers from drawbacks similar to those of conventional tokenization, requires auxiliary networks, and has only been tested for encoding. Ideally, one would have a deterministic algorithm to encode numbers into and decode numbers from a fixed structure, which would allow LLMs to learn calculation algorithms internally. Accordingly, @golkar2024xvalcontinuousnumericaltokenization introduce xVal, which scales a learned number token by the numeric value. @zhou2025foneprecisesingletokennumber propose FoNE, which encodes a number using sinusoidal functions of different frequencies. While both methods are conceptually interesting, we will demonstrate that they have fundamental shortcomings that render them ineffective number encodings.

With the aim of evaluating and improving both the efficiency and numeracy of LLMs, this work makes the following three major contributions:

-   We systematically evaluate the numeracy of eight frontier LLMs across nine different tasks. We find that frontier LLMs strongly rely on reasoning with a large number of tokens to solve calculations (see `\Cref{sec:frontier}`{=latex}).

-   We develop a set of desiderata for single-token number encoding that ensure the number embedding is efficient, trainable, and effective (see `\Cref{sec:single_token_strategies}`{=latex}). In an analysis of recent works, we show that existing single-token strategies fail to fulfill several key desiderata, limiting their potential to be used in LLMs (see `\Cref{sec:limitations}`{=latex}).

-   Guided by our desiderata, we propose BitTokens, a novel single-token embedding. Based on the IEEE 754 standard on binary floating-point numbers, BitTokens encode numbers as a sequence of bits representing the sign, exponent, and significand (see `\Cref{sec:bittokens}`{=latex}). We demonstrate our method's ability to generate structured, informative representations that allow even small language models to learn algorithms to solve basic arithmetic operations (see `\Cref{sec:experiments}`{=latex}).

Evaluation of numeracy in frontier LLMs {#sec:frontier}
=======================================

```{=latex}
\centering
```
![While simple tasks such as addition and comparing numbers are almost perfectly solved by frontier LLMs, other tasks such as multiplication, division, calculating the standard deviation, or exponentiation remain difficult and require extensive reasoning to solve. ](figures/radar_models.png){#fig:frontier_radar width="\\linewidth"}

**Numeracy benchmarks for language models** Mathematics is a core discipline on which LLMs are evaluated [@10.5555/3692070.3692401; @shao2024deepseekmathpushinglimitsmathematical; @zhang2024interpreting; @ahn2024large]. However, popular benchmarking datasets like GSM8K, MMLU-Pro, and Numeracy-600K are designed to test mathematical *reasoning* [@cobbe2021trainingverifierssolvemath; @mirzadeh2025gsmsymbolic; @wang2024mmlupro; @chen2019numeracy]. They primarily sample integers from narrow distributions and rely on text-based problems that conflate arithmetic with language understanding. Recently, some benchmarks have attempted to isolate numeracy. The NumberCookbook and NumericBench evaluate LLMs on many elementary tasks like arithmetic, number comparison, and digit understanding [@yang2025number; @li2025exposingnumeracygapsbenchmark]. While these benchmarks broadly investigate numeracy, they are limited in the arithmetic operations and number magnitudes they test, focusing primarily on integers. To properly evaluate numeracy, it is essential to use diverse magnitudes and precisions that reflect scientific and engineering reality while also limiting the impact of model memorization [@razeghi2022impact]. We therefore construct a new benchmark.

**Tasks** We design our tasks to isolate individual aspects of numeracy as well as core mathematical operations. The most basic task is to determine whether a numerical comparison of the form $n_1 \mathrel{\{<,>\}} n_2$ holds true. To increase the difficulty of number comparisons, we test the ability to: (1) return the minimum or maximum of a list, (2) select the correct interval containing a number from a list of options, and (3) sort a list of numbers in ascending or descending order. To test the calculating capabilities of LLMs, we use all basic arithmetic operations including (4) addition / subtraction, (5) multiplication, and (6) division. Finally, we test the multi-step operations (7) exponentiation, (8) mean, and (9) standard deviation. Examples can be found in `\Cref{app:tasks}`{=latex}.

**Number sampling** We utilize a custom number sampling strategy to ensure that we test a large diversity of inputs. Specifically, all generated numbers $n \in \mathbb{R}$ satisfy $n \in (-10^{15}, -10^{-14}] \cup [10^{-14}, 10^{15})$ and contain at most $15$ significant decimal digits, defined as the total number of digits excluding leading and trailing zeros. Number magnitudes, differences between numbers, and precision are all densely and uniformly sampled. Further information can be found in `\Cref{app:benchmark_distribution}`{=latex}.
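The sampling strategy can be sketched as follows; the function name and the exact digit-drawing procedure are illustrative assumptions rather than our benchmark's implementation:

```python
import random

def sample_number(max_sig_digits=15, min_exp=-14, max_exp=14):
    """Hypothetical sketch: draw precision and magnitude uniformly, then the digits."""
    n_digits = random.randint(1, max_sig_digits)   # number of significant decimal digits
    exponent = random.randint(min_exp, max_exp)    # order of magnitude of the result
    # First significant digit is non-zero; the remaining digits are drawn uniformly.
    digits = [random.randint(1, 9)] + [random.randint(0, 9) for _ in range(n_digits - 1)]
    mantissa = int("".join(map(str, digits)))
    sign = random.choice([-1, 1])
    return sign * mantissa * 10.0 ** (exponent - n_digits + 1)
```

Drawing the digit count and the exponent independently and uniformly is one way to densely cover both precision and magnitude, in contrast to sampling integers from narrow ranges.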

```{=latex}
\centering
```
![Difficult numeracy tasks such as multiplication, division, exponentiation, and standard deviation can only be solved by frontier models using excessive reasoning tokens.](figures/tokens_vs_performance_matrix.png){#fig:frontier_tokens width="\\linewidth"}

**Evaluation metrics** While only an exact match of the true result counts as a correct answer, the relative error is useful for comparing models. To measure the relative error between a prediction $\hat{y}$ and the target $y$, the symmetric mean absolute percentage error ($\mathrm{sMAPE}$) [@flores1986pragmatic] is defined as: $$\mathrm{sMAPE}(\hat y, y) = \frac{\lvert \hat y - y\rvert}
         {\lvert y\rvert + \lvert \hat y\rvert + \varepsilon},\quad \text{with }\varepsilon=10^{-100} \text{ for stability.}$$ While this measures the relative error and is thus well suited for numbers and errors of different magnitudes, the metric decreases exponentially with increasing accuracy in significant figures. An error in the 4th significant digit can thus already achieve a $\mathrm{sMAPE}$ on the order of $10^{-4}$ despite the error being potentially very relevant in many real-world contexts. To address this, we propose the $\mathrm{log\textnormal{-}sMAPE}$ score, which normalizes the negative base-10 logarithm of $\mathrm{sMAPE}$ by the maximum number of significant digits: $$\mathrm{log\textnormal{-}sMAPE}(\hat y, y) = \min\left(1,\;
       \frac{-\log_{10}\left(\mathrm{sMAPE}(\hat y, y)+\varepsilon\right)}{M}
     \right)$$ Here $M=15$ is the maximum number of significant digits tested. In essence, $\mathrm{log\textnormal{-}sMAPE}$ measures the fraction of significant figures correctly predicted before the first error. Thus, a $\mathrm{log\textnormal{-}sMAPE}$ value of 0.2 indicates that the first 3 out of 15 significant digits are correct. We use this metric in addition to exact match accuracy for the addition, multiplication, division, exponentiation, mean, and standard deviation tasks.
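Both metrics are straightforward to implement; the following is a minimal sketch using the constants from the definitions above:

```python
import math

EPS = 1e-100  # stability constant from the sMAPE definition
M = 15        # maximum number of significant digits tested

def smape(y_hat: float, y: float) -> float:
    """Symmetric mean absolute percentage error."""
    return abs(y_hat - y) / (abs(y) + abs(y_hat) + EPS)

def log_smape(y_hat: float, y: float) -> float:
    """Negative base-10 log of sMAPE, normalized by M and capped at 1."""
    return min(1.0, -math.log10(smape(y_hat, y) + EPS) / M)
```

An exact match yields a $\mathrm{log\textnormal{-}sMAPE}$ of 1.0 (via the cap), while an error in the fourth significant digit scores roughly $0.24$, i.e. about 3.6 of 15 digits.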

**Results** A radar plot of the results is shown in `\Cref{fig:frontier_radar}`{=latex} and an overview of the tested models and exact results can be found in `\Cref{app:frontier_benchmark_results}`{=latex}. Immediately apparent is the perfect accuracy of almost all models on the basic number comparison tasks of MinMax, Sorting, and Interval, which are effectively solved. When considering the basic arithmetic tasks, there is a clear separation between high reasoning and non- or minimal-reasoning models. While both perform excellently on addition and mean calculation, all non-reasoning models perform very poorly on multiplication, division, exponentiation, and standard deviation tasks, with none achieving above 60% $\mathrm{log\textnormal{-}sMAPE}$. Only once we allow a large number of reasoning tokens do we see improved performance on these tasks. This separation between thinking and non-thinking models is underscored in `\Cref{fig:frontier_tokens}`{=latex}. We see that to improve performance, LLMs use between 5 and 30 thousand tokens to solve a single calculation. Such high token counts for just a single calculation make multi-step calculations and longer reasoning chains infeasible, thus motivating the need for a single-token number representation.

Desiderata for single-token number representation strategies {#sec:single_token_strategies}
============================================================

Single-token number embeddings aim to explicitly encode numeracy in LLMs, while being efficient, trainable, and effective. In order to formalize these considerations, we postulate a novel set of nine desiderata that create a framework to analyze current and future strategies:

1.  **Token efficiency:** Every number is represented by a single token.`\label{D1}`{=latex}

2.  **Uniqueness:** Each value has exactly one valid encoding, with a unique inverse mapping.`\label{D2}`{=latex}

3.  **Structured:** The encoding geometry reflects numeric order and distance.`\label{D3}`{=latex}

4.  **Scale invariance:** A wide range of magnitudes can be represented with high precision.`\label{D4}`{=latex}

5.  **Normalization:** Encodings are bounded and information preserving under standard normalization functions used in language models (e.g., LayerNorm, RMSNorm).`\label{D5}`{=latex}

6.  **Numerical stability:** Representations remain accurate when using low-precision data type formats (e.g., fp8).`\label{D6}`{=latex}

7.  **Continuity:** Encodings vary smoothly with the underlying value, making them compatible with gradient-based optimization.`\label{D7}`{=latex}

8.  **Robustness:** Values can be decoded reliably even under stochastic prediction noise.`\label{D8}`{=latex}

9.  **Arithmetic:** Encodings admit learnable algorithms for core mathematical operations.`\label{D9}`{=latex}

Establishing the limitations of existing single-token strategies using our desiderata {#sec:limitations}
=====================================================================================

In this section we will analyze the two existing approaches -- xVal [@golkar2024xvalcontinuousnumericaltokenization] and FoNE [@zhou2025foneprecisesingletokennumber] -- for constructing single-token embeddings and demonstrate why they are unsuitable using our desiderata. xVal introduces a trainable number token, $[\mathtt{NUM}]$, that is scaled by the numerical value being encoded. If a $[\mathtt{NUM}]$ token is predicted, the number can be decoded from the final hidden state through a separate number head. This approach satisfies `\Cref{D1,D3,D2,D7}`{=latex}. However, as shown by the authors, this strategy must map all inputs to the range \[-5, 5\] to satisfy `\ref{D5}`{=latex} and avoid layer normalization interfering with the information encoded in the embedding magnitude. This causes it to be extremely limited in the range and precision of numbers it can encode, thus violating `\ref{D4}`{=latex}, `\ref{D6}`{=latex}, and by extension `\ref{D9}`{=latex}.

FoNE encodes a number in a single $[\mathtt{NUM}]$ token using Fourier features. Each dimension of the embedding corresponds to a sine or cosine function of base 10 at a different frequency; together, these encode the number and are added to a trainable $[\mathtt{NUM}]$ token. When a language model predicts a number token, the final hidden state can be directly interpreted as a sinusoidal encoding. The output number is predicted digit by digit via the maximum cosine similarity between each dimension of the final hidden state (corresponding to each digit of the output) and the embeddings of the digits $0$ through $9$. This fulfills `\Cref{D1,D4,D3,D2,D5,D6,D7,D8}`{=latex}. In contrast to simply scaling a $[\mathtt{NUM}]$ token, using sine and cosine functions in equal measure guarantees a constant RMS norm, satisfying `\ref{D5}`{=latex}. The digit-wise maximum similarity decoding also provides robustness to noise, satisfying `\ref{D8}`{=latex}.

The key limitation of sinusoidal embeddings is that they are poorly suited for solving certain arithmetic operations with neural networks, most notably multiplication. In order to satisfy condition `\ref{D9}`{=latex}, we require a learnable mapping $OP'$ that computes the encoding, $\xi$, of the result of an operation $OP$ from the encodings of its operands: $$\forall_{OP}\ \exists_{OP'}:\quad \xi\big( OP(x_1, \ldots, x_n) \big) \;=\; OP'\big( \xi(x_1), \ldots, \xi(x_n) \big)\ .$$ For addition, such a mapping exists and is computationally trivial.

```{=latex}
\begin{definition}[\textbf{sinusoidal encoding}]A sinusoidal encoding $\mathcal F:\mathbb R\to \mathbb T^{|\Phi|}$ maps real numbers to a $|\Phi|$-dimensional torus, which forms a compact abelian Lie group. Given frequencies $b^\phi$ with base $b>1$ and $\phi\in\Phi\subset\mathbb Z$, $\mathcal F$ is defined as:
\begin{equation}
    \mathcal F(x):=\left[\cos(2\pi b^\phi x),\, \sin(2\pi b^\phi x)\right]_{\phi \in \Phi}
= \left[e^{i2\pi b^\phi x}\right]_{\phi\in\Phi}
\end{equation}

\end{definition}
```
```{=latex}
\begin{lemma}[\textbf{additive homomorphism}]\label{lem:fourier_sum}
The encoding map $\mathcal F$  is a group homomorphism from the additive group of real numbers $(\mathbb R, +)$ to the multiplicative torus $(\mathbb T^{|\Phi|}, \odot)$, where $\odot$ denotes the element-wise (Hadamard) product.
\end{lemma}
```
```{=latex}
\begin{proof}
The claim follows directly from Euler’s formula:
\begin{equation}
    \mathcal F(x_1+x_2) 
    = \left[ e^{i2\pi b^\phi (x_1+x_2)} \right]_{\phi\in\Phi} 
    = \left[ e^{i2\pi b^\phi x_1} \cdot e^{i2\pi b^\phi x_2} \right]_{\phi\in\Phi} 
    = \mathcal F(x_1) \odot \mathcal F(x_2)
\end{equation}
Thus the encoding of the sum of two numbers $x_1$ and $x_2$ can be computed by a simple component-wise multiplication of their respective encodings.
\end{proof}
```
This homomorphism elegantly transforms addition in the number domain into a simple, local operation in the encoded domain. Notably, this operation does not require carry-over logic. However, for multiplication such a simple mapping does not exist.
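Both points can be checked numerically with a toy frequency set (chosen arbitrarily for illustration): the Hadamard product reproduces the encoding of a sum, but not of a product.

```python
import numpy as np

PHI = np.arange(-3, 4)  # illustrative frequency exponents, base b = 10

def F(x: float) -> np.ndarray:
    """Sinusoidal encoding F(x) = exp(i * 2*pi * b**phi * x) for phi in PHI."""
    return np.exp(1j * 2 * np.pi * (10.0 ** PHI) * x)

x1, x2 = 1.37, 2.5
# Addition in the number domain is an element-wise product in the encoded domain...
assert np.allclose(F(x1 + x2), F(x1) * F(x2))
# ...but the same local rule does not yield the encoding of the product.
assert not np.allclose(F(x1 * x2), F(x1) * F(x2))
```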

```{=latex}
\begin{proposition}[\textbf{non-locality and computational complexity of multiplication}]\label{prop:fourier_mult}
Let $\mathcal X:=\{\varepsilon,\dots,U\}$ be the set of input numbers with resolution $\varepsilon=b^{m}$ and choose $\Phi = \{m,\dots,n\}$ so that $\mathcal F$ uniquely encodes the entire number range. Assume each encoding component has a finite precision $P$ (i.e., can represent $P$ distinct states). Suppose there exists an operator $\otimes_\phi:\mathbb T^{|\Phi|}\times\mathbb T^{|\Phi|}\mapsto\mathbb T$ such that $\otimes_\phi(\mathcal F(x),\mathcal F(y)) := \mathcal F_\phi(xy)$ for each output frequency $\phi\in\Phi$. Let $S_\phi^x,S_\phi^y\subseteq\Phi$ be the subsets of input frequencies that $\otimes_\phi$ is required to read from $\mathcal F(x)$ and $\mathcal F(y)$, respectively. Then:
\begin{enumerate}
    \item \textbf{Non-locality} The operator $\otimes_\phi$ must access at least $d = \mathcal O\left(\log_P(U/\varepsilon)\right)$ components from each input vector with $|S_\phi^x|,\;|S_\phi^y|\;\ge\;\left\lceil \log_P |\mathcal X| \right\rceil$.
    \item \textbf{Computational complexity} The operator $\otimes$ must perform a computation functionally equivalent to polynomial multiplication.
\end{enumerate}
\end{proposition}
```
```{=latex}
\begin{proof}
\textbf{1. Non-locality.} The proof follows from a counting argument. Suppose $|S_\phi^x|<\log_P|\mathcal X|$. Then the projection of $\mathcal F(x)$ onto $S_\phi^x$ has at most $P^{|S_\phi^x|}$ states, so by the pigeonhole principle there exist two distinct $x\neq x'$ that share the same projection. Let $\Delta:=x-x'\neq 0$. The set $Y^*:=\{y:\; b^\phi \Delta y\in\mathbb Z\}$ is a proper subset of $\mathcal X$, so pick $y^\star\notin Y^*$. Then $\mathcal F_\phi(x y^\star)\neq \mathcal F_\phi(x' y^\star)$, yet $\otimes_\phi$ sees identical inputs from $\mathcal F(x)$ and $\mathcal F(x')$, which is a contradiction. Thus, $|S_\phi^x|\ge\lceil\log_P|\mathcal X|\rceil$. The same holds for $|S_\phi^y|$.

\textbf{2. Computational complexity.} 
Multiplication in any positional system is equivalent to a discrete convolution of the digit coefficients, followed by carry propagation.

Write $x=\sum_{\psi} k_\psi b^\psi$ and $y=\sum_{\varphi} l_\varphi b^\varphi$ in terms of their coefficient sequences $k$ and $l$. Then
\begin{equation}
xy=\sum_{\tau}(k*l)_\tau b^\tau,\qquad (k*l)_\tau=\sum_{\psi+\varphi=\tau}k_\psi l_\varphi.
\end{equation}

Interpreting each circular sinusoidal component in its wrap-around form on the interval $[0,1)$, we have $\hat {\mathcal F}_\psi(x) :=(x\ \textrm{mod}\ b^\psi)/b^{\psi} =\sum_{\psi'<\psi} k_{\psi'} b^{\psi'-\psi}$.
If we attempt to compute $\mathcal F_\phi(xy)$ directly from these entangled phases, the bilinear expansion produces
\begin{equation}
\hat {\mathcal F}_\phi(xy) = \Big(\hat {\mathcal F}(x)*\hat {\mathcal F}(y)\Big)_\phi=\sum_{\psi+\varphi=\phi}\Big(\sum_{\psi'<\psi}k_{\psi'}b^{\psi'-\psi}\Big)\Big(\sum_{\varphi'<\varphi}l_{\varphi'}b^{\varphi'-\varphi}\Big),
\end{equation}
which contains many redundant cross-terms. To recover the correct coefficients $k*l$, these redundancies must be canceled or reorganized—an operation that is at least as complex as performing the convolution explicitly, since it requires isolating the same coefficient interactions.
Moreover, under finite precision, postponing disentanglement amplifies error as intermediate sums of large wrapped phases cause quantization collapse.

Consequently, any successful operator $\otimes$ must learn the multi-stage procedure: (1) a non-local decoding of the sinusoidal inputs into internal coefficient sequences for $x$ and $y$, (2) a convolution of these sequences, followed by (3) a carry propagation step, and finally (4) a re-encoding of the result into the sinusoidal format.
\end{proof}
```
The preceding proposition demonstrates that performing multiplication in sinusoidal encoding space requires a transformation that is both computationally intensive and prone to precision errors. Any network implementing such an operation is forced to first decode, then calculate, and finally re-encode the embedding. This leads us to conclude that sinusoidal encodings alone are not well-suited as a general purpose number representation.

**Potential improvements to sinusoidal encoding** To address the aforementioned difficulties of performing multiplication in sinusoidal encoding space, we experimented with applying a logarithmic transformation to the input values before encoding them. In logarithmic space, multiplication of numbers reduces to a component-wise multiplication of their encodings, following from the identity $\ln(x_1x_2)=\ln x_1+\ln x_2$ and Lemma `\ref{lem:fourier_sum}`{=latex}.
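A quick numerical check of this identity, reusing a toy sinusoidal encoding with an arbitrarily chosen frequency set:

```python
import numpy as np

PHI = np.arange(-2, 3)  # illustrative frequency exponents, base b = 10

def F(x):
    return np.exp(1j * 2 * np.pi * (10.0 ** PHI) * x)

def F_log(x):
    """Encode ln(x) instead of x, so multiplication becomes a Hadamard product."""
    return F(np.log(x))

x1, x2 = 3.7, 42.0
assert np.allclose(F_log(x1 * x2), F_log(x1) * F_log(x2))
```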

However, this conversion to a logarithmic space trades one homomorphism for another: multiplication becomes Hadamard-linear, but addition no longer admits a simple local rule, which is a well-known drawback of logarithmic number systems [@1508538; @6810472]. Computing $\mathcal F_{\log}(x_1+x_2)$ from $\mathcal F_{\log}(x_1)$ and $\mathcal F_{\log}(x_2)$ requires learning an effective *exp--sum--log* transformation, reintroducing the same aliasing and inversion difficulties as discussed in Proposition `\ref{prop:fourier_mult}`{=latex}. Consequently, some operations are easiest to learn in linear space and others in log space. One could simultaneously encode both spaces and use routing mechanisms, such as mixture-of-experts, to enable models to selectively process either encoding and indicate the correct space for decoding. However, our attempts to implement such a solution were unsuccessful.

BitTokens {#sec:bittokens}
=========

Guided by the desiderata introduced in `\Cref{sec:single_token_strategies}`{=latex}, we propose BitTokens, a novel tokenization algorithm to encode numbers. BitTokens uses a dedicated, learnable $[\mathtt{NUM}]$ token to which a numeric encoding is added that is based on the IEEE 754 double-precision binary floating-point format (i.e. `float64`) [@IEEE754-1985; @IEEE754-2008; @IEEE754-2019]. The floating-point format writes a signed real value $v$ as $$v = (-1)^{s} \left( 1+\sum_{i=1}^{52}{b_{52-i}2^{-i}} \right) \times 2^{E-1023},$$ with $s \in \{0,1\}$ denoting the sign bit, $E \in \{0,\dots,2047\}$ the 11-bit exponent field offset by $1023$, and $b_j \in \{0,1\}$ for $j=0,\dots,51$ the 52 significand bits.

Each bit is mapped to a dimension of the embedding vector, satisfying `\Cref{D1,D3}`{=latex}. The resulting 64-dimensional number encoding provides a representational range of $[2.23\times10^{-308},\, 1.8\times10^{308}]$ with 15--17 significant decimal digits, satisfying `\Cref{D2,D4}`{=latex}. Additionally, it supports special values such as $\pm0$, $\pm\infty$, and NaN.

**Token construction:** The binary vector of a number can be efficiently constructed via type reinterpretation and bit shifts. To allow for the learning of more efficient division algorithms, we also concatenate the binary representation of the number with the binary representation of its reciprocal value. Scaling the bit vector to $[-1,1]$ yields unit RMS norm inputs, satisfying `\ref{D5}`{=latex}. We zero-pad the resulting encoding to match the network's embedding size and add the vector to the learned $[\mathtt{NUM}]$ token embedding. Using raw bit values ensures numerical stability, satisfying `\ref{D6}`{=latex}.
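The construction can be sketched in a few lines; function names are illustrative, and the edge case $v = 0$ (whose reciprocal is undefined in plain Python) is omitted for brevity:

```python
import struct

def float_to_bits(v: float) -> list[int]:
    """Reinterpret a float64 as its 64 raw bits, most significant first
    (1 sign bit, 11 exponent bits, 52 significand bits)."""
    (u,) = struct.unpack("<Q", struct.pack("<d", v))
    return [(u >> (63 - i)) & 1 for i in range(64)]

def bit_token(v: float) -> list[float]:
    """Sketch: concatenate the bits of v and of its reciprocal,
    then scale {0, 1} -> {-1, +1} for unit RMS norm."""
    bits = float_to_bits(v) + float_to_bits(1.0 / v)
    return [2.0 * b - 1.0 for b in bits]
```

For example, `float_to_bits(1.0)` yields a zero sign bit, the biased exponent `01111111111` ($E = 1023$), and an all-zero significand.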

**Decoding and loss computation:** During next token prediction, the language model outputs a $[\mathtt{NUM}]$ token to indicate when it is predicting a number. The last hidden state used to predict the $[\mathtt{NUM}]$ token is then passed through a dedicated number head, which predicts the binary encoding vector. The number head consists of a linear layer followed by a sigmoid activation; a fixed decision threshold of 0.5 generates the binary vector representation while increasing robustness to prediction noise, thus satisfying `\ref{D8}`{=latex}. During training, computing a regression loss directly on the number reconstructed from the bit vector is infeasible due to the magnitude of potential values as well as the thresholding operations involved in the process. We therefore employ a bit-wise binary cross entropy (BCE) loss with equal weighting for each dimension. This does not strictly satisfy condition `\ref{D7}`{=latex}, as illustrated by the predictions $\hat{y}_1 = 8 = 0b1000$ and $\hat{y}_2 = 3 = 0b0011$ for the label $y = 7 = 0b0111$. Although $\hat{y}_1$ is numerically closer to $y$, it incurs a higher loss than $\hat{y}_2$. In practice, we find it more important to weight all bits equally to ensure the more challenging low-significance bits receive sufficient attention from the overall loss.
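A minimal sketch of decoding and loss computation, under the simplifying assumption that the number head emits 64 pre-sigmoid logits for the value's own bits (the reciprocal bits are omitted here):

```python
import struct
import numpy as np

def decode_bits(logits: np.ndarray) -> float:
    """Thresholding sigmoid outputs at 0.5 is equivalent to thresholding
    logits at 0; the resulting bit vector is reinterpreted as a float64."""
    u = 0
    for logit in logits[:64]:              # most significant bit first
        u = (u << 1) | int(logit > 0.0)
    return struct.unpack("<d", struct.pack("<Q", u))[0]

def bce_loss(logits: np.ndarray, target_bits: np.ndarray) -> float:
    """Bit-wise binary cross entropy with equal weight on every dimension."""
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12  # numerical stability only
    return float(-np.mean(target_bits * np.log(p + eps)
                          + (1.0 - target_bits) * np.log(1.0 - p + eps)))
```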

**Arithmetic properties:** Our encoding possesses several properties that ease the learning of mathematical operators, satisfying `\ref{D9}`{=latex}. First, ever since Leibniz's introduction of binary arithmetic [@Leibniz1703Binary], researchers have developed many efficient algorithms for binary addition and multiplication on modern computing hardware that are compatible with IEEE 754 [@booth1951signed; @wallace2006suggestion; @brent1982regular; @kogge2009parallel; @brent2010modern]. Crucially, because each bit is represented independently in our encoding, models can learn such algorithms by operating directly on our BitTokens. As Proposition `\ref{prop:fourier_mult}`{=latex} shows, this is not possible with sinusoidal encodings, as any algorithm for multiplication must first decode, then compute, and finally re-encode the embeddings.

Second, the IEEE 754 structure separates values into a logarithmic exponent and a linear significand. The simplest algorithm for addition aligns significands by exponent difference and then adds or subtracts them. Multiplication is also simple as exponents are directly added, significands are multiplied, and the sign is determined by an XOR operation. The direct accessibility of individual bits simplifies the network's ability to learn such algorithms.
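The field-wise multiplication recipe can be made concrete. The sketch below handles only normal numbers and truncates rather than rounding, so it illustrates the structure of the algorithm rather than a faithful IEEE 754 multiplier:

```python
import struct

def fields(v: float):
    """Split a float64 into its sign bit, biased exponent, and significand bits."""
    (u,) = struct.unpack("<Q", struct.pack("<d", v))
    return u >> 63, (u >> 52) & 0x7FF, u & ((1 << 52) - 1)

def mul_fieldwise(a: float, b: float) -> float:
    sa, ea, fa = fields(a)
    sb, eb, fb = fields(b)
    sign = sa ^ sb                                # sign: XOR of sign bits
    exp = ea + eb - 1023                          # exponents add (one bias removed)
    sig = ((1 << 52) | fa) * ((1 << 52) | fb)     # significands (hidden 1s) multiply
    if sig >> 105:                                # product in [2, 4): renormalize
        sig >>= 1
        exp += 1
    frac = (sig >> 52) & ((1 << 52) - 1)          # truncate back to 52 bits
    u = (sign << 63) | (exp << 52) | frac
    return struct.unpack("<d", struct.pack("<Q", u))[0]
```

For products that are exactly representable, the field-wise result matches native multiplication, e.g. `mul_fieldwise(1.5, -1.5)` gives `-2.25`; for other inputs it is correct up to the truncated last bit.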

Third, bit-wise arithmetic over $\mathbb{Z}_2$ results in coefficient-wise operations reducing to Boolean gates: $$(x \cdot y) \bmod 2 = x \land y, \qquad (x+y) \bmod 2 = x \oplus y.$$ This further simplifies the calculations needed for arithmetic.

Finally, calculations using bits can be parallelized across multiple dimensions of the embedding, reducing the number of sequential steps needed and increasing parameter efficiency for learning algorithms.

Comparing BitTokens to other tokenization strategies {#sec:experiments}
====================================================

**Experimental setup** We quantify the impact of different tokenization schemes on numeracy in experiments using nanoGPT-2 models [@modded_nanogpt_2024]. The use of these small language models allows us to train models from scratch and conduct a diverse set of experiments in a controlled environment, as suggested by @AllenZhu-icml2024-tutorial. A full list of architectural details and hyperparameters is included in `\Cref{app:hyperparams}`{=latex}. We use the same number tasks and datasets as in `\Cref{sec:frontier}`{=latex}, generating 30M unique training samples for each task. Simultaneously, we include the FineWeb 10B dataset to model training dynamics with text [@penedo2024the].

We compare BitTokens to traditional single-digit and triple-digit (subword) tokenizers, as well as xVal and FoNE. To enable xVal to process the large number range in our dataset, we rescale the inputs logarithmically instead of linearly to the range \[-5, 5\]. FoNE only encodes the magnitude of the number; for negative numbers, a \[$-$\] token is therefore added before the encoding of the absolute value. We use 17 integer and 32 fraction frequencies with base $b=10$.

**Model training with curriculum learning and dynamic task balancing** Since each task category contains both easy and difficult problems, we observe that the performance of small language models, regardless of their tokenization method, can be improved by strategically choosing the order of training samples. We address this by defining a difficulty metric for each problem and employing a curriculum learning strategy with progressive difficulty thresholds, preview sampling, and adaptive advancement criteria, the details of which can be found in `\Cref{app:curriculum}`{=latex}.

The goal of our experiments is to train a model that can learn all ten tasks simultaneously, a setup we call multi-task. During multi-task training, we find that assigning an equal token budget to each of the nine numeracy tasks plus the language task yields suboptimal performance. This is due to varying task difficulties and their corresponding imbalanced impact on the overall loss. To address this, we dynamically control the composition of each minibatch. An increased sampling probability is assigned to tasks on which the model performs worse. Additional details and discussion of the dynamic task rebalancing are included in `\Cref{app:multitask}`{=latex}.
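One simple way to realize such performance-driven minibatch composition is to weight each task by its current error rate; this is an illustrative scheme, not the exact rule from the appendix:

```python
def task_sampling_probs(performance: dict[str, float],
                        temperature: float = 1.0) -> dict[str, float]:
    """Assign higher sampling probability to tasks with lower performance.
    Error-proportional weighting is an illustrative choice."""
    weights = {t: (1.0 - p) ** temperature for t, p in performance.items()}
    total = sum(weights.values()) or 1.0  # guard against all-perfect tasks
    return {t: w / total for t, w in weights.items()}

probs = task_sampling_probs({"addition": 0.95, "division": 0.55, "text": 0.80})
# The weakest task (division) receives the largest share of the minibatch.
assert probs["division"] > probs["text"] > probs["addition"]
```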

We find that the exponentiation, mean, and standard deviation tasks are intrinsically difficult for small language models. We attribute this to the fact that solving these tasks requires sequential application of multiple arithmetic operations. While dynamic rebalancing improves overall performance, a disproportionate token budget would be assigned to these three tasks, preventing convergence on the remaining seven tasks. For this reason, we conduct multi-task experiments on the seven comparison and single-step calculation tasks plus text. For exponentiation, mean, and standard deviation, we report the solo-task performance.

`\label{sec:results}`{=latex}

```{=latex}
\centering
```
![While single digit is the superior multi-token strategy, our BitTokens outperform it as well as all other single-token strategies on all 7 tasks in the multi-task setting.](figures/results.png){#fig:results width="\\linewidth"}

**Results**

The results of our experiments are shown in `\Cref{fig:results}`{=latex}. Multi-task results show that xVal performs poorly on all tasks, primarily due to its limited precision in representing both input and output. While FoNE is able to learn addition, it struggles with multiplication and division as predicted in `\Cref{sec:limitations}`{=latex}. Among the multi-token strategies, single-digit embeddings consistently outperform subword embeddings. Finally, our BitToken method outperforms all other methods and achieves near-perfect performance on comparison and single-step calculation tasks.

Solo-task results indicate that all models struggle with computing standard deviation and exponentiation. Notably, for the mean task, single digit and subword tokenizer methods outperform the single-token methods. We attribute this advantage to the fact that multi-token methods generate answers over multiple forward passes, which effectively enables a form of \`\`reasoning''. These models can refine their answers one digit at a time, making it possible to learn iterative algorithms. In contrast, single-token methods need to perform the same work in a single step. While such built-in iterative behavior can be beneficial in some cases, it also substantially increases computational costs and generation time for simpler tasks, as shown in `\cref{tab:lm_token_count}`{=latex}. The ablations in `\Cref{app:ablations}`{=latex} demonstrate the effectiveness of base-2 encodings over base-10, the benefit of appending the reciprocal, and that curriculum learning substantially boosts performance for all methods.

Discussion and conclusion
=========================

Single-token number embeddings efficiently encode numerical information in language models. This could reduce the dependency of current frontier LLMs on excessive amounts of reasoning tokens to solve arithmetic problems while simultaneously reducing the number of tokens needed to encode a single number. We substantially advance this promising line of research by introducing a set of nine desiderata that ensure a number encoding is efficient, trainable, and effective. Based on these, we propose BitTokens, overcoming the drawbacks of existing single-token approaches.

A limitation of our study is that we exclusively evaluate BitTokens on small language models. Although our experiments show that BitTokens integrate well with text during next-token prediction, incorporating them into larger models with broader pretraining data would allow for a more holistic evaluation of their capabilities. Key questions include whether BitTokens enhance the ability of LLMs to memorize and recite numerical facts and the interaction with text in math word problems. Moreover, we do not fully outline the steps required to integrate BitTokens in production-level LLMs. These include efficient strategies to robustly identify and parse numbers in input text, accounting for different arithmetic notations, and ensuring models output the correct level of precision for a given context.

While our BitTokens aim to replace excessively long reasoning chains for simple mathematical calculations, reasoning and single-token strategies are not mutually exclusive. Combining them could further improve LLM capabilities on the most complicated problems. By enabling efficient and accurate calculations on intermediate steps, more tokens can be dedicated to logical reasoning, expanding the complexity of problems language models can solve.

Reproducibility statement {#reproducibility-statement .unnumbered}
=========================

We took several steps to ensure our results are reproducible. Dataset construction is specified in `\Cref{sec:frontier}`{=latex}, with the exact number-sampling scheme and the task difficulty metric detailed in `\Cref{app:benchmark_distribution}`{=latex} and `\Cref{app:difficulty}`{=latex}. The full experimental setup for the BitToken method and all baselines is described in `\cref{sec:experiments}`{=latex}, with model architectures and hyperparameter configurations in `\Cref{app:experimental_setup}`{=latex}, and additional details for curriculum learning and multi-task training in `\Cref{app:curriculum}`{=latex} and `\cref{app:multitask}`{=latex}. We include an anonymized, downloadable repository at <https://github.com/AnonymousAuthor553/BitTokens>. This repository includes code for the frontier-model benchmark (including evaluation scripts) and the BitToken method and benchmark. These materials provide the code and configuration files needed to reproduce the reported experiments and regenerate the results referenced in the paper.

Acknowledgement {#acknowledgement .unnumbered}
===============

This work was partially funded by ERC Grant Deep4MI (Grant No. 884622) and by the German Research Foundation (DFG) under project 532139938. Furthermore, the authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR\@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) under the NHR project b247bb. NHR funding is provided by federal and Bavarian state authorities. NHR\@FAU hardware is partially funded by the DFG under project 440719683.

```{=latex}
\bibliographystyle{plainnat}
```
```{=latex}
\newpage
```
```{=latex}
\appendix
```
Benchmarking dataset {#app:FrontierBenchmark}
====================

Tasks {#app:tasks}
-----

### Comparing numbers

The most elementary numeracy task is comparing two random numbers, $n_1$ and $n_2$. As this task is trivial for modern LLMs, we increase the difficulty by extending it to multiple operands.

**Determining the minimum and maximum**

Determining the minimum or maximum of a list of numbers is in essence a series of number comparisons where one number must be held in memory. Each list is comprised of $l$ floats, with $l \in \{2,3,4,5\}$. We use the following phrasings:

```{=latex}
\begin{promptbox}[title=MinMax phrasing]\texttt{What is the maximum of the list [$n_1$, $n_2$]}\\
\texttt{What is the minimum of the list [$n_1$, $n_2$, $n_3$, $n_4$, $n_5$]}
\end{promptbox}
```
Our system prompt is:

```{=latex}
\begin{promptbox}[title=MinMax system prompt]\texttt{You are an expert in numeracy. For each problem, output only valid JSON in this format: \\
{"answer": <numeric\_answer>}\\
Do not explain, show steps, or add any extra text. Do not use code blocks to output the answer.\\
DO NOT CALL ANY external APIs or use ANY external tool to solve the problem. DO NOT USE a calculator tool. DO NOT USE python. DO NOT USE Wolfram Alpha.\\
The answer must be a single number, exactly as it appears in the list.}
\end{promptbox}
```
We evaluate by calculating the exact string match accuracy of the answer.

**Interval assignment**

Determining if a number falls within an interval can be achieved by a series of pairwise comparisons, where two numbers must be held in memory. Each list is comprised of $l$ floats, with $l \in \{2,3,4,5\}$. We use the following phrasing:

```{=latex}
\begin{promptbox}[title=Interval phrasing]\texttt{What interval does x=$n$ belong to? A: x < $n_1$, B: $n_1$ <= x < $n_2$, C: $n_2$ <= x}
\end{promptbox}
```
Our system prompt is:

```{=latex}
\begin{promptbox}[title=Interval system prompt]\texttt{You are an expert in numeracy. For each problem, output only valid JSON in this format: \\
{"answer": <interval\_multiple\_choice\_answer>}\\
Do not explain, show steps, or add any extra text. Do not use code blocks to output the answer.\\
DO NOT CALL ANY external APIs or use ANY external tool to solve the problem. DO NOT USE a calculator tool. DO NOT USE python. DO NOT USE Wolfram Alpha.\\
The answer must be one of the following: A, B, C, D, E, F.}
\end{promptbox}
```
We evaluate by calculating the exact string match accuracy of the answer.

**List sorting**

Sorting a list of numbers requires more numbers to be held in memory (depending on the \`\`algorithm'' used by the LLM). Each list consists of $l$ floats, with $l \in \{2,3,4,5\}$. We use the following phrasing:

```{=latex}
\begin{promptbox}[title=Sorting phrasing]\texttt{Sort the list [$n_1$, $n_2$, $n_3$, $n_4$, $n_5$] in ascending order.}\\
\texttt{Sort the list [$n_1$, $n_2$, $n_3$] in descending order.}
\end{promptbox}
```
Our system prompt is:

```{=latex}
\begin{promptbox}[title=Sorting system prompt]\texttt{You are an expert in numeracy. For each problem, output only valid JSON in this format: \\
{"answer": <sorted\_list>}\\
Do not explain, show steps, or add any extra text. Do not use code blocks to output the answer.\\
DO NOT CALL ANY external APIs or use ANY external tool to solve the problem. DO NOT USE a calculator tool. DO NOT USE python. DO NOT USE Wolfram Alpha.\\
The answer must be a list of numbers.}
\end{promptbox}
```
We evaluate by calculating the exact string match accuracy of the answer.

### Single-step arithmetic

We test both addition and subtraction capabilities in the addition task, but separate multiplication and division as the generation of the multiplicative inverse is not trivial. We use the following phrasing:

```{=latex}
\begin{samepage}
\begin{promptbox}[title=Single-step arithmetic phrasing]
\texttt{What is $n_1$ + $n_2$?}\\
\texttt{What is $n_1$ - $n_2$?}\\
\texttt{What is $n_1$ * $n_2$?}\\
\texttt{What is $n_1$ / $n_2$?}\\
\end{promptbox}
\end{samepage}
```
Our system prompt is:

```{=latex}
\begin{promptbox}[title=Single-step arithmetic system prompt]\texttt{You are an expert in numeracy. Return exactly one valid JSON object in this format: \\
{"answer": <numeric\_answer>}\\
Do not explain, show steps, or add any extra text. Do not use code blocks to output the answer.\\
DO NOT CALL ANY external APIs or use ANY external tool to solve the problem. DO NOT USE a calculator tool. DO NOT USE python. DO NOT USE Wolfram Alpha.\\
If the answer is not an integer, give it as a decimal (not a fraction), rounded to at most 15 significant digits.}
\end{promptbox}
```
We use $\mathrm{log\textnormal{-}sMAPE}$ and exact match accuracy as metrics.

### Multi-step arithmetic

The exponentiation task includes both positive and negative exponents as well as float exponents, thus testing both power and root operations. We also test the ability to calculate both the mean and the standard deviation. Calculating the mean requires adding all numbers in the list and then dividing by the length of the list. Calculating the standard deviation requires calculating the mean of the list, then calculating the average difference between the mean and each element squared, and then taking the square root of that average. We use a single set of number lists for both tasks. Each list is comprised of $l$ floats, with $l \in \{2,3,4,5\}$. We use the following phrasing:

```{=latex}
\begin{promptbox}[title=Mean and Std Phrasing]\texttt{What is the mean of the list [$n_1$, $n_2$, $n_3$]?} \\
\texttt{What is the std of the list [$n_1$, $n_2$, $n_3$, $n_4$]?} \\
\texttt{What is $n_1$ \string^ $n_2$?}
\end{promptbox}
```
Our system prompt is:

```{=latex}
\begin{promptbox}[title=Multi-step arithmetic system prompt]\texttt{You are an expert in numeracy. Return exactly one valid JSON object in this format: \\
{"answer": <numeric\_answer>}\\
Do not explain, show steps, or add any extra text. Do not use code blocks to output the answer.\\
DO NOT CALL ANY external APIs or use ANY external tool to solve the problem. DO NOT USE a calculator tool. DO NOT USE python. DO NOT USE Wolfram Alpha.\\
If the answer is not an integer, give it as a decimal (not a fraction), rounded to at most 15 significant digits.}
\end{promptbox}
```
We use $\mathrm{log\textnormal{-}sMAPE}$ and exact match accuracy as metrics.

```{=latex}
\FloatBarrier
```
Number sampling {#app:benchmark_distribution}
---------------

We sample all numbers from the interval $[10^{-14}, 10^{15})\cup \{0\}$ and ensure that the result is in the same interval. For each task, we sample the operands $n\sim\mathcal U(10^x,10^{x+1})$, where $x\sim\mathcal U(-14,15)$. We ensure that all numbers can be represented without loss of information by a `float64` tensor and round all answers to at most 15 significant digits. Further, we ensure that the operation on the numbers does not result in vanishing information. For instance, computing the mean of the list $[10^{14},10^{-5},-10^{14}]$ in Python would evaluate to $0$ because the intermediate result $10^{14}+10^{-5}$ cannot be represented given the limited precision of 15-17 digits.
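The magnitude sampling described above can be sketched as follows; reading $x\sim\mathcal U(-14,15)$ as a uniform draw over integer exponents (so that all samples stay within $[10^{-14},10^{15})$) is our interpretation, and sign and zero handling are omitted:

```python
import random

LOW_EXP, HIGH_EXP = -14, 15   # magnitudes span [10^-14, 10^15)

def sample_operand(rng: random.Random) -> float:
    """Sample n ~ U(10^x, 10^(x+1)) with an integer exponent x in [-14, 14]."""
    x = rng.randrange(LOW_EXP, HIGH_EXP)   # randrange excludes HIGH_EXP
    return rng.uniform(10.0 ** x, 10.0 ** (x + 1))

rng = random.Random(0)
assert all(1e-14 <= sample_operand(rng) <= 1e15 for _ in range(10_000))
```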

When used by humans, numbers often occur with a reduced number of decimal digits, while full precision is used in scientific and engineering environments. To reflect both worlds, we sample the number of significant digits of the values from a uniform distribution.

For each task, we sample 30M train samples, 10k validation and 10k test samples. In the following, we outline task specific generation parameters.

**MinMax / Interval / Sorting** We draw the list length $l\in\{2,3,4,5\}$ and the magnitude of the mean of all operands from $x\sim\mathcal U(-14,15)$. We then sample operands for each spread $s\in[\min (13,x-13), \min(x+2,13)]$. This increases the number of samples where the numbers share a common prefix and only differ in a few digits, thus increasing difficulty. We then draw $m\sim\mathcal U(10^x,10^{x+1})$ and generate a list of numbers $[n_1,\dots,n_l],\ n_i\in[m-s,m+s]$ such that $\frac{1}{l}\sum^l_{i=1}n_i=m$. We ensure that each list element only has a maximum of $p\sim\mathcal U(1,17)$ significant digits. Because of the limited precision, we cannot guarantee that the mean will be exactly $m$. We therefore draw 1000 lists and take the one whose mean is closest. We ensure that $\sigma\neq0$, as well as $m\neq0$ if $l\in\{2,4\}$, to avoid trivial samples.

For the MinMax task, we randomly choose whether we require the minimum or the maximum of the list. For the sorting task, we randomly choose whether the list should be sorted ascending or descending. For the interval task, we randomly choose a position index $pos\in[0,\dots,l]$. We then draw a reference value from the sorted list $L$: $$ref\sim\begin{cases} 
      L_0-10^{s}/l & pos=0 \\
      \mathcal U(L_{pos-1}, L_{pos}) & 0< pos< l \\
      L_{l-1}+10^{s}/l & pos=l 
   \end{cases}$$

**Addition** The difficulty of addition lies in the carry propagation. We therefore increase the number of samples where both operands are of similar magnitude (see `\Cref{fig:addition_dist}`{=latex}). We sample the combined operand precision $p_{12}\sim\mathcal U(2,34)$ from a uniform distribution. We then draw $p_1\sim\mathrm{triangular}(l,l+1, \min(p_{12}-1, 17))$ with $l=\lceil\frac{p_{12}}{2}\rceil$ from a triangular distribution and set $p_2=p_{12}-p_{1}$. We then round the operands to $p_1$ and $p_2$ significant digits, respectively, and randomly swap the operands. We choose the individual signs using the following schema: (1) both positive in $40\%$ of cases, (2) only one operand negative in $40\%$ of cases, and (3) both negative in $20\%$ of cases. We randomly choose an operator $op\in\{+,-\}$.
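A sketch of the precision-splitting step; interpreting $\mathrm{triangular}(l, l+1, \min(p_{12}-1,17))$ as (low, mode, high) and rounding to integer precisions are both our reading of the notation:

```python
import math
import random

def sample_precisions(rng: random.Random) -> tuple[int, int]:
    """Split a combined precision budget p12 between the two operands so
    that near-even splits (and hence long carry chains) are favored."""
    p12 = rng.randint(2, 34)        # combined operand precision
    low = math.ceil(p12 / 2)        # l = ceil(p12 / 2)
    high = min(p12 - 1, 17)         # a single operand has at most 17 digits
    mode = min(low + 1, high)
    # Python's random.triangular takes (low, high, mode).
    p1 = round(rng.triangular(low, high, mode)) if high > low else low
    return p1, p12 - p1

rng = random.Random(1)
for _ in range(1_000):
    p1, p2 = sample_precisions(rng)
    assert 1 <= p1 <= 17 and 1 <= p2 <= 17
```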

```{=latex}
\centering
```
![The magnitude distribution of addition pairs in our dataset. Operands with similar exponents are oversampled to increase difficulty.](./figures/Addition_decimal_uniform_train_30M_distribution.png){#fig:addition_dist width="0.8\\linewidth"}

```{=latex}
\centering
```
![The magnitude distribution of multiplication pairs in our dataset.](./figures/Multiplication_decimal_uniform_train_30M_distribution.png){#fig:multiplication_dist width="0.8\\linewidth"}

**Multiplication** We handle the precision of the operands and their signs similarly to the addition task; however, we do not swap the operands, as this could result in the product being out of bounds.

**Division** Similar to addition, we draw the combined precision of the quotient and the divisor and sample their individual precisions. We then compute the dividend from the quotient and divisor. We change signs similarly to the addition task; however, we do not swap the operands, as this could result in the dividend being out of bounds.

**Exponentiation** We draw the exponent $x$ from the interval $[-14,15)$ and increase the number of samples closer to zero. We choose base $b\sim\mathcal U(10^{x-1},10^x)$, choose $\xi\in\Big[\lceil\frac{10^{-13}}{|x|+1}\rceil, \lfloor\frac{10^{14}}{|x|+1}\rfloor\Big]$ and set for all $\xi\not\in\{0,1\}$: $$exp=\begin{cases}
        \xi & \xi>1 \\
        1/\xi & \textrm{else}
    \end{cases}$$ This procedure ensures that we only use integer powers and integer roots. If $b>1$, we randomly set $b$ negative. We also choose the sign of $exp$ randomly.

**Mean / Std** We generate a list similar to the method described for task MinMax but choose $s\in[\min(13,x-13),\ \min(x+17, 13)]$ since here, a smaller spread would simplify the task.

Problem difficulty {#app:difficulty}
------------------

Our experiments show that our small-scale transformer models benefit from curriculum learning. To maximize sample efficiency, we manually define a difficulty heuristic that is used for sampling. This avoids inefficient loss-based selection. Curriculum training is only used for the most difficult tasks: Multiplication, Division, Exponentiation, Mean, and Std. Most difficulty heuristics are defined using the digit representation of the operands. Our BitTokens work in base-2 instead of base-10. When rounding decimal numbers, the number of significant bits often remains high. To ensure a fair training dynamic, we therefore create a secondary train dataset generated identically to the decimal version described in `\Cref{app:benchmark_distribution}`{=latex}. This train set uniformly samples the bit precision instead of the decimal precision. We highlight that testing is still performed on the original decimal dataset for all methods.

In the following, we will describe the difficulty for each task depending on its base. We ensure difficulty $\delta_\tau\in\mathbb N_0$.

**Multiplication** Given the fixed-point number representations of the operands in base-2 or base-10, the difficulty $\delta_{\textrm{Multiplication}}$ is given by the sum of their non-zero digits. This directly correlates with the number of steps involved to solve the task and the numbers' precision.

**Division** Similar to Multiplication, the difficulty $\delta_{\textrm{Division}}$ is given by the sum of the non-zero digits of dividend, divisor, and quotient.
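The non-zero-digit count underlying these heuristics is straightforward to compute; a minimal sketch for integer operands (the paper applies it to fixed-point representations):

```python
def nonzero_digits(n: int, base: int) -> int:
    """Count the non-zero digits of |n| in the given base."""
    n = abs(n)
    count = 0
    while n:
        n, d = divmod(n, base)
        count += d != 0
    return count

def multiplication_difficulty(a: int, b: int, base: int = 10) -> int:
    """Difficulty as the sum of the operands' non-zero digits."""
    return nonzero_digits(a, base) + nonzero_digits(b, base)

assert nonzero_digits(1005, 10) == 2      # digits 1, 0, 0, 5
assert nonzero_digits(0b1010, 2) == 2     # bits 1, 0, 1, 0
assert multiplication_difficulty(1005, 20) == 3
```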

**Mean** For Mean we found the most reliable difficulty metric to be $\delta_\mathrm{Mean}=s-x+15$ with spread $s$ and exponent $x$ of the mean of the list. Intuitively, the difficulty increases for larger spreads. We see a clear drop in performance for all models where $s>x$, that is, where the numbers no longer share any common prefix.

**Exponentiation / Std** While the difficulty heuristics for Multiplication, Division, and Mean can be derived relatively straightforwardly, a heuristic for the remaining tasks is much harder to define. This is because of their inherent multi-step nature: each result can be computed using a large variety of different algorithms. For instance, consider the task of computing the standard deviation of a list $L=\{x_1,\dots,x_N\}$ with mean $\mu$. Possible solutions include $\sqrt{\frac1N \sum^N_{i=1}(x_i-\mu)^2}$, $\sqrt{\frac{1}{N}\sum_{i=1}^N x_i^2-\mu^2}$, Welford's online algorithm, and many more. We define the heuristic $\delta_\mathrm{exp}$ as the product of the numbers of non-zero digits of base, exponent, and result. For lack of better alternatives, we set $\delta_\mathrm{Std}=\delta_\mathrm{Mean}$.

Subsampling
-----------

Due to cost constraints we had to subset the full set of 10,000 test samples when evaluating the frontier LLMs. As shown in `\Cref{fig:difficulty_sd_per_task_per_benchmark}`{=latex}, the benchmark for the frontier LLMs and our BitTokens follow the same difficulty distribution.

```{=latex}
\centering
```
![The benchmark for the frontier LLMs is a 500 sample subset of the BitTokens test, but follows the same difficulty distributions.](./figures/difficulty_sd_distribution_per_task.png){#fig:difficulty_sd_per_task_per_benchmark width="\\linewidth"}

```{=latex}
\FloatBarrier
```
```{=latex}
\newpage
```
Frontier benchmark results {#app:frontier_benchmark_results}
==========================

Frontier large language models: {#app:frontier_models}
-------------------------------

We benchmark eight frontier LLMs [@deepseekai2024deepseekv3technicalreport; @comanici2025gemini; @Meta2025LLaMA4; @team2025kimi; @OpenAI2025GPT5; @agarwal2025gpt; @yang2025qwen3] -- both open and closed source -- under maximal and minimal reasoning settings. \`Maximal' reasoning means that we select \`high' reasoning for GPT 5 and GPT OSS 120B, set a thinking budget of 32768 tokens for Gemini 2.5 Pro, and set a budget of 24576 tokens for Gemini 2.5 Flash. All models with permanently enabled reasoning are also labeled \`Maximal'. \`Minimal' reasoning means that we set reasoning to \`minimal' for GPT 5 and set a thinking budget of 128 tokens for Gemini 2.5 Pro. Models without reasoning, or where it can be completely disabled, are denoted as \`None' reasoning models. All models use subword or triple digit tokenization for numbers, except Gemini 2.5 Flash and Gemini 2.5 Pro, which use single digit tokenization.

log-sMAPE performance
---------------------

```{=latex}
\centering
```
```{=latex}
\footnotesize
```
```{=latex}
\setlength{\tabcolsep}{4pt}
```
::: {#tab:frontier_performance_logSMAPE}
  Model              Reasoning       Add   Mult    Div   Mean    Std   MinMax   Interval    Sort    Exp
  ------------------ ----------- ------- ------ ------ ------ ------ -------- ---------- ------- ------
  GPT OSS 120B       maximal        97.6   95.8   80.0   95.6   46.6     99.2      100.0    99.0   56.0
  Qwen3 235B 2507    maximal        98.7   88.0   77.8   94.5   48.8    100.0      100.0    99.8   56.9
  Qwen3 235B 2507    none           87.5   39.0   46.4   59.4   12.5     95.8       98.6    98.8   21.7
  DeepSeek v3.1      maximal        97.0   85.3   84.2   95.0   40.9     99.8       99.8    99.6   48.6
  DeepSeek v3.1      none           94.1   38.5   40.8   66.2   15.5     99.8       81.4    99.6   21.0
  Kimi K2 0905       none           89.6   44.5   46.8   67.4   13.6     99.6       96.8    98.4   19.9
  Llama 4 Maverick   none           79.1   33.3   35.8   60.0   11.8     99.4       76.0    95.6   20.1
  GPT 5              maximal       100.0   99.5   96.7   96.1   74.4    100.0      100.0    99.8   78.6
  GPT 5              minimal        93.7   30.9   29.5   58.2   16.5     98.8       92.4    98.4   19.0
  Gemini 2.5 Pro     maximal        97.4   64.3   74.8   94.3   29.9    100.0      100.0    99.8   45.0
  Gemini 2.5 Pro     minimal        99.6   43.4   51.3   81.9   13.0    100.0       99.6   100.0   25.5
  Gemini 2.5 Flash   maximal        96.4   44.9   56.3   91.6   27.5    100.0      100.0   100.0   28.9
  Gemini 2.5 Flash   none           92.8   32.8   45.2   67.6   13.6    100.0       99.2    99.8   20.3

  : Average performance of each frontier model on each task. The performance of addition (Add), multiplication (Mult), division (Div), exponentiation (Exp), standard deviation (Std), and mean are all measured in $\mathrm{log\textnormal{-}sMAPE}$. The performance of MinMax, Interval, and Sorting are all measured in exact match accuracy.
:::

```{=latex}
\newpage
```
Exact match accuracy
--------------------

```{=latex}
\centering
```
::: {#tab:frontier_performance_exactMatch}
  Model              Reasoning      Add   Mult    Div   Mean    Std    Exp
  ------------------ ----------- ------ ------ ------ ------ ------ ------
  GPT OSS 120B       maximal       89.4   87.2   45.4   71.2   12.6   13.4
  Qwen3 235B 2507    maximal       92.0   64.2   47.0   71.0   12.8   10.4
  Qwen3 235B 2507    none          64.4   18.2   15.6   37.2    3.0    3.4
  DeepSeek v3.1      maximal       92.4   63.6   54.0   71.2   11.0   10.6
  DeepSeek v3.1      none          74.6   20.2   16.4   42.6    3.4    3.2
  Kimi K2 0905       none          70.2   23.2   21.0   44.0    2.6    2.8
  Llama 4 Maverick   none          50.0   15.6   13.6   34.6    2.2    2.8
  GPT 5              maximal       98.4   97.8   91.4   73.8   34.4   46.8
  GPT 5              minimal       76.6   14.2    6.6   33.2    3.8    2.4
  Gemini 2.5 Pro     maximal       91.0   42.2   48.0   67.6    8.8    8.2
  Gemini 2.5 Pro     minimal       82.2   20.6   21.8   55.8    4.4    3.4
  Gemini 2.5 Flash   maximal       84.0   27.8   31.6   63.4    7.0    5.6
  Gemini 2.5 Flash   none          72.2   13.6   17.4   40.8    2.8    2.2

  : Average performance of each frontier model on each task. The performance metric for addition (Add), multiplication (Mult), division (Div), exponentiation (Exp), mean, and standard deviation (Std) is exact match accuracy on the first 15 significant digits.
:::

```{=latex}
\FloatBarrier
```
Average tokens per problem
--------------------------

```{=latex}
\centering
```
```{=latex}
\footnotesize
```
```{=latex}
\setlength{\tabcolsep}{4pt}
```
::: {#tab:frontier_tokens}
  Model              Reasoning      Add    Mult     Div   Mean     Std   MinMax   Interval   Sorting     Exp
  ------------------ ----------- ------ ------- ------- ------ ------- -------- ---------- --------- -------
  GPT OSS 120B       maximal        708    1820    3133    446    2056      105        218       193    2815
  Qwen3 235B 2507    maximal       3899    5789    8161   3359    8565      965       1379      1248    9889
  Qwen3 235B 2507    none            22      22      23     20      23       21          7        63      39
  DeepSeek v3.1      maximal       5515    6921    9079   4821   11486      976       1490      1214    9459
  DeepSeek v3.1      none            13      12      12     12      13       12          7        34      94
  Kimi K2 0905       none            13      78      13     12      45       12          7        33      78
  Llama 4 Maverick   none            12      11      12     12      13       12          7        34      13
  GPT 5              maximal       3779    8749   16165   3613   19709      585        514       669   28086
  GPT 5              minimal         21      21      22     20      21       20         15        39      21
  Gemini 2.5 Pro     maximal       4761   11901   11462   4221    9281      574       1013       864    8748
  Gemini 2.5 Pro     minimal        108     101     100    111     165      111        173       117      95
  Gemini 2.5 Flash   maximal       3084   13698   15390   2494   17751      290        541       382   20149
  Gemini 2.5 Flash   none            22      22      22     19      22       20          6        63      22

  : Average number of tokens generated by each frontier model for each problem.
:::

```{=latex}
\FloatBarrier
```
```{=latex}
\newpage
```
Performance Stability
---------------------

```{=latex}
\centering
```
![Due to cost constraints we had to subset the full set of 10,000 test samples to 500 samples per task when evaluating the frontier LLMs. We observe performance to already be stable at 500 samples.](./figures/samples_vs_performance_matrix.png){#fig:methods width="\\linewidth"}

```{=latex}
\FloatBarrier
```
```{=latex}
\newpage
```
Experimental setup {#app:experimental_setup}
==================

Architecture and hyperparameter configuration {#app:hyperparams}
---------------------------------------------

We choose the nanoGPT-2 architecture with 6 layers, 6 attention heads, and an embedding dimension of 768, resulting in a model with 117M parameters. Following @modded_nanogpt_2024, we use the Muon optimizer [@jordan2024muon] and modernize the standard architecture of GPT-2 by adding rotary positional embeddings (RoPE) [@SU2024127063], QK-normalization, and flash attention 2 [@dao2024flashattention] with sequence packing. The dropout probability is set to $0$ to avoid interference with number encodings. Following [@modded_nanogpt_2024], all non-scalar parameters in the hidden layers are optimized using the Muon optimizer [@jordan2024muon] with a learning rate of $0.02$ and momentum of $0.95$. All other parameters are trained with the Adam optimizer (no weight decay) using $\beta_1 = 0.9$ and $\beta_2 = 0.95$. Scalar parameters share the same learning rate of $0.02$. The embedding layer weights are untied from the output layer and trained with a learning rate of $0.03$. All remaining parameters, including both the language and number heads, use a learning rate of $0.004$.

Training is performed with a maximum token budget of 10B tokens, a context length of 1024, and an effective batch size of 192. We employ a cosine learning rate scheduler and allocate 10% of the training tokens for warm-up. Additionally, the Muon optimizer uses 300 momentum warm-up steps. Model performance is evaluated every 32 steps for 2 steps on a small validation set, and the sampling ratio is dynamically adjusted based on current results. After training, we select the checkpoint with the best harmonic mean performance across all tasks.

The training objective is standard next-token prediction, excluding the question tokens from the loss computation. The per-token loss is defined as: $$\mathcal{L}_\mathrm{total} = \mathrm{CrossEntropy}(y_{\mathrm{pred}}, y_{\mathrm{true}}) + 10 \times \mathcal{L}_{\mathrm{num}}(n_{\mathrm{pred}}, n_{\mathrm{true}}),$$ where $\mathcal{L}_{\mathrm{num}}$ is either binary cross-entropy, cross entropy, or mean squared error (MSE), depending on the number encoding strategy.

For testing, we generate tokens by selecting the token with the highest probability.

Number parsing
--------------

To parse numerical values from text input, we use the following regexes:

```{=latex}
\begin{promptbox}[title=Python regex for signed values (BitToken / xVal)]\texttt{r"[-]?(?:(?:0(?!\textbackslash .[0-9]))|(?:[0-9]*[.][0-9]+)|(?:[1-9][0-9]*))"}
\end{promptbox}
```
```{=latex}
\begin{promptbox}[title=Python regex for absolute values (FoNE)]\texttt{r"(?:(?:0(?!\textbackslash .[0-9]))|(?:[0-9]*[.][0-9]+)|(?:[1-9][0-9]*))"}
\end{promptbox}
```
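As an illustration, the signed-value regex can be applied with Python's `re` module; the example sentence is invented:

```python
import re

# Signed-value regex from above, in raw-string form.
SIGNED_NUMBER = re.compile(
    r"[-]?(?:(?:0(?!\.[0-9]))|(?:[0-9]*[.][0-9]+)|(?:[1-9][0-9]*))"
)

text = "Temperatures rose from -3.5 to 0 degrees, an increase of 3.5."
matches = SIGNED_NUMBER.findall(text)
# Each matched span would later be replaced by a single number token
# carrying the parsed value.
```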
For values that exceed the numerical range of the `float64` format, we split the number into two parts separated by a dedicated $[\mathtt{overflow}]$ token. This ensures that all information is preserved.

The current regex does not parse numbers written in scientific notation as a single value, as the choice of scientific notation may itself carry information relevant to the context of the source text. Future work could investigate dedicated tokens for different number formats, such as scientific notation or integer representation.

Curriculum learning {#app:curriculum}
-------------------

We observe that small-scale transformer models benefit significantly from curriculum learning when trained on our numerical tasks. To optimize computational efficiency, we implement an automated curriculum scheduler that selects only those samples $\{s: \delta(s, \tau)\le\delta_\tau^{\text{frontier}}\}$ whose difficulty does not exceed the current frontier threshold for a given task $\tau$. For this purpose, we use the difficulty heuristic $\delta(s, \tau)$ defined in `\Cref{app:difficulty}`{=latex}.

The initial frontier difficulty is set to $\delta_\tau^{\text{frontier}} = 0.1 \cdot \delta_\tau^{\text{max}}$, and it advances once the model's performance surpasses a predefined threshold, $p(\delta_\tau^{\text{frontier}}, \tau) > \zeta$, where $\zeta = 0.9$. To enhance sample efficiency, we use Equation `\ref{eq:inv_gen_mean}`{=latex} with $\lambda = 0$ to dynamically compute sampling ratios for each difficulty level based on current performance. This strategy helps mitigate catastrophic forgetting while allocating training tokens to areas of greatest need.

To prevent abrupt shifts in the sample distribution during difficulty transitions, we reserve $20\%$ of the task's token budget for difficulty preview. The corresponding sampling ratios are computed using an exponential decay function:

$$r_{\tau,\delta} = 0.2 \cdot \frac{0.8^{\delta - \delta_\tau^{\text{frontier}}}}{\sum_{\delta > \delta_\tau^{\text{frontier}}} 0.8^{\delta - \delta_\tau^{\text{frontier}}}}$$
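This normalization can be sketched as follows; the integer difficulty levels and the function name are illustrative assumptions:

```python
def preview_ratios(difficulties, frontier, budget=0.2, decay=0.8):
    """Sampling ratios for difficulty levels beyond the current frontier.

    The exponential decay from the equation above is normalized so that
    the preview levels together receive `budget` of the task's tokens.
    """
    above = [d for d in difficulties if d > frontier]
    weights = {d: decay ** (d - frontier) for d in above}
    total = sum(weights.values())
    return {d: budget * w / total for d, w in weights.items()}

ratios = preview_ratios([1, 2, 3, 4], frontier=2)
# Levels 3 and 4 share the 20% preview budget, with level 3 favored.
```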

To ensure models do not stagnate at lower difficulty levels, we gradually reduce the advancement threshold $\zeta$ over time, starting from the lowest difficulties. This encourages progression to more complex tasks if mastery is not achieved promptly. The dynamic advancement threshold for a given task and difficulty, at any training step and learning rate, is defined as:

$$\zeta(\tau, \delta, \text{step}, \text{lr}) = \min \left(
        \frac{S}{\text{step}}, \frac{L}{\text{lr}^{\text{max}} - \text{lr}}
    \right) \cdot \zeta_0 \cdot \frac{\delta}{\delta_\tau^{\text{max}}}$$

Here, $S$ and $L$ denote the training step and learning rate, respectively, at which the threshold for the highest difficulty begins to decrease. We set this point at $50\%$ of the total training duration to ensure the model has adequate exposure to all difficulty levels in the dataset.

Multitask learning {#app:multitask}
------------------

Our final goal is to train a model that can solve every numeracy task simultaneously. This is non-trivial, since we must jointly optimize 9 tasks of varying difficulty. Training is formulated as a multi-task learning problem, aiming to optimize the model's performance $p(\mathcal{T})$ across all tasks $\tau \in \mathcal{T}$ according to the generalized mean: $$p(\mathcal{T}) = \left(\frac{1}{|\mathcal{T}|}\sum_{\tau\in \mathcal{T}}\left(p(\tau)+\epsilon\right)^\lambda\right)^{\frac{1}{\lambda}},$$ where $\epsilon$ is a small constant for numerical stability. The scaling factor $\lambda$ controls how strongly low-performing tasks influence the overall score. We set $\lambda=-1$ for the harmonic mean. Since the individual losses can have different magnitudes, we normalize all losses to 10 after warm-up. To account for the varying complexities of the tasks, we define a dynamic sampling ratio $r_\tau\in(0,1)$ with $\sum_\tau r_\tau=1$ for each mini-batch. Given the performance $p(\tau)\in[0,1]$ of each task $\tau$ at a given time step, we update the sampling ratio by $$\label{eq:inv_gen_mean}
    r_\tau^{new} = \alpha \cdot r_\tau^{old} + (1 - \alpha) \cdot \frac{(1 - p(\tau))^{1-\lambda}}{\sum_\tau (1 - p(\tau))^{1-\lambda}}\, ,$$ where $\alpha=0.5$ is a momentum parameter. Intuitively, $\lambda < 0$ leads to more aggressive oversampling of low-performing tasks. This multi-task sampling strategy aims to achieve balanced performance across all tasks without the need for expensive hyperparameter tuning for each setting. We note that this approach becomes less important for larger training budgets.
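The two formulas above can be sketched as follows; the function names and toy task values are illustrative:

```python
def generalized_mean(perfs, lam=-1.0, eps=1e-8):
    # Generalized (power) mean of task performances; lam = -1 yields
    # the harmonic mean used for checkpoint selection.
    n = len(perfs)
    return (sum((p + eps) ** lam for p in perfs) / n) ** (1.0 / lam)

def update_ratios(old_ratios, performance, lam=-1.0, alpha=0.5):
    # Momentum update of the per-task sampling ratios: low-performing
    # tasks receive a larger share, more aggressively for lam < 0.
    # (Assumes at least one task is below perfect performance.)
    weights = {t: (1.0 - p) ** (1.0 - lam) for t, p in performance.items()}
    total = sum(weights.values())
    return {t: alpha * r + (1.0 - alpha) * weights[t] / total
            for t, r in old_ratios.items()}

# The weaker task ("mult") is oversampled in the next mini-batches.
new = update_ratios({"mult": 0.5, "add": 0.5}, {"mult": 0.5, "add": 0.9})
```

Because the new weights are normalized to sum to one, the ratios remain a valid distribution after every update.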

```{=latex}
\newpage
```
BitTokens benchmark
===================

Multi-task results {#app:multitask_results}
------------------

```{=latex}
\begin{table*}[ht]\centering
\begin{tabular}{llccccc}\toprule
               & Metric & Subword & Single Digit & xVal & FoNE & BitToken (Ours)               \\
\cmidrule{1-7}
Min/Max         & $\uparrow$ Exact match acc &  0.991  &   0.991   & 0.001 &  0.859    & \textbf{0.999}       \\
Interval        & $\uparrow$ Exact match acc &  0.993  &   0.992      & 0.324 & 0.997     & \textbf{1.000}                \\
Sorting         & $\uparrow$ Exact match acc &  0.927 &  0.955    & 0.000 &   0.574   & \textbf{0.998} \\
\cmidrule{1-7}
Add        & $\uparrow$ $\mathrm{log\textnormal{-}sMAPE}$ &  0.954  &   0.974    &  0.092 & 0.809 & \textbf{0.989}  \\
Mult  & $\uparrow$ $\mathrm{log\textnormal{-}sMAPE}$ &  0.450  &    0.893  & 0.094 & 0.300 & \textbf{0.991} \\
Div & $\uparrow$ $\mathrm{log\textnormal{-}sMAPE}$ &  0.522  &  0.681    &  0.093 & 0.275 & \textbf{0.989} \\
\cmidrule{1-7}
Fineweb  & $\downarrow$ Perplexity &  67.60 &  50.89    & 92.91 &  72.11 & \textbf{43.28} \\
\bottomrule
\end{tabular}
\caption{This table shows the exact multi-task performance values reported in \cref{fig:results}.}
\label{tab:lm_performance_multi}
\end{table*}
```
```{=latex}
\begin{table*}[ht]\centering
\begin{tabular}{llllll}\toprule
               & Subword & Single Digit & xVal & FoNE & BitToken (Ours) \\
\cmidrule{1-6}
Min/Max        & 34.1 + 8.5 & 60.5 + 16.0 & 11.1 + 2.0 & 11.1 + 2.0 &  11.1 + 2.0\\
Interval       & 45.1 + 2.0 & 27.1 + 2.0 & 15.9 + 2.0 & 15.9 + 2.0 &  15.9 + 2.0\\
Sorting        & 34.0 + 23.1 & 60.0 + 58.0 & 11.1 + 9.5 & 11.1 + 9.5 &  11.1 + 9.5\\
\cmidrule{1-6}
Add            & 21.6 + 9.6 & 32.4 + 16.6 & 5.9 + 2.4 & 5.9 + 2.4 &  5.9 + 2.0 \\
Mult           & 21.2 + 10.3 & 32.0 + 17.7 & 5.9 + 2.4 & 5.9 + 2.4 &  5.9 + 2.0 \\
Div           & 22.8 + 9.6 & 37.9 + 15.4 & 5.9 + 2.4 & 6.1 + 2.4 &  6.1 + 2.0 \\
\bottomrule
\end{tabular}
\caption{This table shows the mean number of tokens required (input + output) to solve one sample for each task. The predictions were generated by the models trained in the multi-task setting. A large number of output tokens substantially increases generation time, as each output token requires a separate forward pass.}
\label{tab:lm_token_count}
\end{table*}
```
### Exact match accuracy {#app:multitask_exact_match}

```{=latex}
\begin{table*}[ht]\centering
\begin{tabular}{llccccc}\toprule
               & Metric & Subword & Single Digit & xVal & FoNE & BitToken (Ours)               \\
\cmidrule{1-7}
Add        & $\uparrow$ Exact match acc & 0.926  &  0.962     & 0.002 & 0.767 &  \textbf{0.977}  \\
Mult  & $\uparrow$ Exact match acc &  0.290  & 0.784      &   0.030 & 0.164 & \textbf{0.978} \\
Div & $\uparrow$ Exact match acc &  0.379 &  0.514     &  0.013 & 0.136 & \textbf{0.982} \\
\bottomrule
\end{tabular}
\caption{Here we report the multi-task exact match accuracy for the single-step calculation tasks.}
\end{table*}
```
```{=latex}
\FloatBarrier
```
```{=latex}
\newpage
```
Solo-task results {#app:solo_task}
-----------------

```{=latex}
\begin{table*}[ht]\centering
\begin{tabular}{llccccc}\toprule
               & Metric & Subword & Single Digit & xVal & FoNE & BitToken (Ours)               \\
\cmidrule{1-7}
Min/Max         & $\uparrow$ Exact match acc & 0.997 & \textbf{1.000} & 0.002 & 0.909 & 0.999 \\
Interval        & $\uparrow$ Exact match acc & 0.999 & \textbf{1.000} & 0.328 & \textbf{1.000} & \textbf{1.000} \\
Sorting         & $\uparrow$ Exact match acc & \textbf{0.999} & 0.998 & 0.000 & 0.718 & \textbf{0.999} \\
\cmidrule{1-7}
Add        & $\uparrow$ $\mathrm{log\textnormal{-}sMAPE}$ & 0.999 & \textbf{1.000} & 0.090 & 0.851 & 0.998\\
Mult  & $\uparrow$ $\mathrm{log\textnormal{-}sMAPE}$ & 0.564 & 0.964 & 0.096 & 0.328 & \textbf{0.985} \\
Div & $\uparrow$ $\mathrm{log\textnormal{-}sMAPE}$ & 0.564 & 0.773 & 0.094 & 0.236 & \textbf{0.995} \\
\cmidrule{1-7}
Exp & $\uparrow$ $\mathrm{log\textnormal{-}sMAPE}$ & 0.403 & \textbf{0.450} & 0.047 & 0.366 & 0.391 \\
Mean & $\uparrow$ $\mathrm{log\textnormal{-}sMAPE}$ & \textbf{0.976} & 0.965 & 0.060 & 0.444 & 0.534 \\
Std  & $\uparrow$ $\mathrm{log\textnormal{-}sMAPE}$ & 0.365 & \textbf{0.420} & 0.057 & 0.191 & 0.354 \\
\cmidrule{1-7}
Fineweb  & $\downarrow$ Perplexity & 26.84 & 29.23 & 26.88 & \textbf{25.83} & 26.07 \\
\bottomrule
\end{tabular}
\caption{This table shows the performance of each model trained on a single task with the same 10B token budget.}
\label{tab:lm_performance_solo}
\end{table*}
```
```{=latex}
\newpage
```
Ablations {#app:ablations}
---------

We tested multiple token combination strategies other than the \`\`Sum'' strategy described in the main text. Product multiplies the $[\mathtt{NUM}]$ token with the binary representation. Concat sets the $[\mathtt{NUM}]$ token to zero in the 64 binary-representation dimensions before adding. Zero pad removes the $[\mathtt{NUM}]$ token and inputs the binary representation directly. Weighted passes the binary representation through a linear layer. Weighted + sum passes the binary representation through a linear layer and then adds it to the $[\mathtt{NUM}]$ token. Across token combination strategies (`\Cref{tab:ablation_combination}`{=latex}), performance is stable, with the simple sum and zero pad strategies performing best. For our work, we chose the sum strategy as it provides extra flexibility compared to zero pad.
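A minimal sketch of the bit-level encoding and the sum combination; the 768-dimensional embedding and the placement of the 64 bits in the leading dimensions are illustrative assumptions:

```python
import struct

def float_to_bits(x):
    # IEEE 754 binary64 representation of x as 64 {0, 1} values,
    # most significant bit (sign) first.
    (u,) = struct.unpack(">Q", struct.pack(">d", x))
    return [(u >> (63 - i)) & 1 for i in range(64)]

def combine_sum(num_token_emb, bits):
    # "Sum" strategy: add the bit values onto the [NUM] token embedding
    # (placement in the leading dimensions is an assumption here).
    out = list(num_token_emb)
    for i, b in enumerate(bits):
        out[i] += b
    return out

emb = combine_sum([0.0] * 768, float_to_bits(1.0))
```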

Changing the numeric base (`\Cref{tab:ablation_base}`{=latex}) substantially degrades performance, particularly for multiplication and division, highlighting the importance of the base choice. Curriculum training (`\Cref{tab:ablation_curriculum}`{=latex}) yields clear improvements over direct multi-task training, especially on arithmetic tasks, confirming its necessity for stable convergence. Finally, appending the reciprocal to the encoding improves performance with negligible computational overhead (`\Cref{tab:ablation_reciprocal}`{=latex}).

```{=latex}
\begin{table*}[ht]\centering
    \footnotesize
\begin{tabular}{llcccccc}\toprule
               & Metric & Sum & Product & Concat & Zero Pad & Weighted & Weighted + Sum               \\
\cmidrule{1-8}
 Min/Max         &$\uparrow$ Exact match acc & 0.999 & 0.996 & 0.997 &  \textbf{1.000}    & 0.998 & 0.999\\
 Interval        &$\uparrow$ Exact match acc & \textbf{1.000} &  \textbf{1.000} &  \textbf{1.000} &  \textbf{1.000} & \textbf{1.000} &  \textbf{1.000}\\
 Sorting         &$\uparrow$ Exact match acc & 0.990 &  0.967 & 0.966 & \textbf{0.992} &  0.983 &  0.989 \\
\cmidrule{1-8}
Add        & $\uparrow$ $\mathrm{log\textnormal{-}sMAPE}$ & 0.989 & 0.980 & 0.977 & \textbf{0.990} & 0.985 & 0.984\\
Mult       & $\uparrow$ $\mathrm{log\textnormal{-}sMAPE}$& 0.991   & 0.828 & 0.871 & \textbf{0.992} & 0.988 & 0.986\\
Div      & $\uparrow$ $\mathrm{log\textnormal{-}sMAPE}$& \textbf{0.989}  & 0.849 &  0.888  & 0.981 & 0.947 & 0.923  \\
\cmidrule{1-8}
Fineweb  & $\downarrow$ Perplexity& 43.86 & 50.37 & 49.21 & \textbf{43.14} & 43.96 & 45.62\\
\bottomrule
\end{tabular}
\caption{Ablation: Performance in the multi-task setting with different combination strategies of the $[\mathtt{NUM}]$ token and the binary number representation.}
\label{tab:ablation_combination}
\end{table*}
```
```{=latex}
\begin{table*}[ht]\centering
\begin{tabular}{llc}\toprule
               & Metric & Base 10  \\
\cmidrule{1-3}
 Min/Max &$\uparrow$ Exact match acc     &  0.992\textsuperscript{\textcolor{red}{\scriptsize -0.007}}   \\
 Interval &$\uparrow$ Exact match acc       &  0.997\textsuperscript{\textcolor{red}{\scriptsize -0.003}}  \\
 Sorting  &$\uparrow$ Exact match acc       & 0.930\textsuperscript{\textcolor{red}{\scriptsize -0.068}}\\
\cmidrule{1-3}
 Add  & $\uparrow$ $\mathrm{log\textnormal{-}sMAPE}$    &   0.937\textsuperscript{\textcolor{red}{\scriptsize -0.052}} \\
 Mult & $\uparrow$ $\mathrm{log\textnormal{-}sMAPE}$ &  0.526\textsuperscript{\textcolor{red}{\scriptsize -0.465}} \\
 Div& $\uparrow$ $\mathrm{log\textnormal{-}sMAPE}$ &  0.437\textsuperscript{\textcolor{red}{\scriptsize -0.552}}  \\
\cmidrule{1-3}
 Fineweb& $\downarrow$ Perplexity  & 66.29\textsuperscript{\textcolor{red}{\scriptsize -23.01}}  \\
\bottomrule
\end{tabular}
\caption{Ablation: Performance in the multi-task setting with base 10 encoding. Deltas to baseline shown as red (drop) and green (gain).}
\label{tab:ablation_base}
\end{table*}
```
```{=latex}
\begin{table*}[ht]\centering
\begin{tabular}{llc}\toprule
               & Metric & BitTokens w/o reciprocal  \\
\cmidrule{1-3}
 Div& $\uparrow$ $\mathrm{log\textnormal{-}sMAPE}$ &  0.351\textsuperscript{\textcolor{red}{\scriptsize -0.644}}  \\
\bottomrule
\end{tabular}
\caption{Ablation: Removing the reciprocal from the encoding vector severely reduces performance on the division task.}
\label{tab:ablation_reciprocal}
\end{table*}
```
```{=latex}
\begin{table*}[ht]\centering
    \footnotesize
\begin{tabular}{llccccc}\toprule
 & Metric & Subword & Single Digit & xVal & FoNE & BitToken (Ours) \\
\cmidrule{1-7}
Min/Max   & $\uparrow$ Exact match acc &  0.990$^{\textcolor{red}{\scriptsize -0.001}}$ & 0.990$^{\textcolor{red}{\scriptsize -0.001}}$ & 0.002$^{\textcolor{green}{\scriptsize +0.001}}$ & 0.874$^{\textcolor{green}{\scriptsize +0.015}}$ & 0.991$^{\textcolor{red}{\scriptsize -0.008}}$ \\
Interval  & $\uparrow$ Exact match acc &  0.990$^{\textcolor{red}{\scriptsize -0.003}}$ & 0.991$^{\textcolor{red}{\scriptsize -0.001}}$ & 0.322$^{\textcolor{red}{\scriptsize -0.002}}$ & 0.997$^{\textcolor{gray}{\scriptsize 0.000}}$ & 1.000$^{\textcolor{gray}{\scriptsize 0.000}}$ \\
Sorting   & $\uparrow$ Exact match acc &  0.927$^{\textcolor{gray}{\scriptsize 0.000}}$ & 0.949$^{\textcolor{red}{\scriptsize -0.006}}$ & 0.000$^{\textcolor{gray}{\scriptsize 0.000}}$ & 0.622$^{\textcolor{green}{\scriptsize +0.048}}$ & 0.933$^{\textcolor{red}{\scriptsize -0.065}}$ \\
\cmidrule{1-7}
Add   & $\uparrow$ $\mathrm{log\textnormal{-}sMAPE}$ &  0.950$^{\textcolor{red}{\scriptsize -0.004}}$ & 0.967$^{\textcolor{red}{\scriptsize -0.007}}$ & 0.093$^{\textcolor{green}{\scriptsize +0.001}}$ & 0.813$^{\textcolor{green}{\scriptsize +0.004}}$ & 0.948$^{\textcolor{red}{\scriptsize -0.041}}$ \\
Mult  & $\uparrow$ $\mathrm{log\textnormal{-}sMAPE}$ &  0.339$^{\textcolor{red}{\scriptsize -0.111}}$ & 0.484$^{\textcolor{red}{\scriptsize -0.409}}$ & 0.094$^{\textcolor{gray}{\scriptsize 0.000}}$ & 0.285$^{\textcolor{red}{\scriptsize -0.015}}$ & 0.385$^{\textcolor{red}{\scriptsize -0.606}}$ \\
Div   & $\uparrow$ $\mathrm{log\textnormal{-}sMAPE}$ &  0.433$^{\textcolor{red}{\scriptsize -0.089}}$ & 0.560$^{\textcolor{red}{\scriptsize -0.121}}$ & 0.093$^{\textcolor{gray}{\scriptsize 0.000}}$ & 0.300$^{\textcolor{green}{\scriptsize +0.025}}$ & 0.376$^{\textcolor{red}{\scriptsize -0.613}}$ \\
\cmidrule{1-7}
Fineweb & $\downarrow$ Perplexity &  70.62$^{\textcolor{red}{\scriptsize -3.02}}$ & 57.53$^{\textcolor{red}{\scriptsize -6.64}}$ & 93.07$^{\textcolor{red}{\scriptsize -0.16}}$ & 68.63$^{\textcolor{green}{\scriptsize +3.48}}$ & 66.25$^{\textcolor{red}{\scriptsize -22.97}}$ \\
\bottomrule
\end{tabular}
\caption{Ablation: Performance in the multi-task setting without curriculum training. Deltas to baseline shown as red (drop), green (gain), and gray (no change).}
\label{tab:ablation_curriculum}
\end{table*}
```

[^1]: <https://github.com/AnonymousAuthor553/BitTokens>

[^2]: Contributed equally to this work as first authors.
