Efficient Numeracy In Language Models Through Single-Token Number Embeddings

Source

Core Claim

BitTokens argues that numbers should be represented as single specialized tokens whose embeddings expose the IEEE 754 sign, exponent, and significand bits, because this gives language models direct access to a representation that is efficient, wide-range, and algorithmically aligned with arithmetic.
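For concreteness, the bit layout in question is the standard IEEE 754 binary64 encoding: 1 sign bit, 11 exponent bits, and 52 significand bits. A minimal round-trip sketch in Python (illustrative only, not the paper's code):

```python
import struct

def float_to_bits(x: float) -> list[int]:
    # Pack x into its 8-byte IEEE 754 binary64 form (big-endian), then
    # unpack the 64 bits: bit 0 is the sign, bits 1-11 the exponent,
    # bits 12-63 the significand.
    (u,) = struct.unpack(">Q", struct.pack(">d", x))
    return [(u >> (63 - i)) & 1 for i in range(64)]

def bits_to_float(bits: list[int]) -> float:
    # Inverse of float_to_bits: reassemble the integer and reinterpret
    # its bytes as a float64.
    u = 0
    for b in bits:
        u = (u << 1) | int(b)
    (x,) = struct.unpack(">d", struct.pack(">Q", u))
    return x

assert bits_to_float(float_to_bits(3.25)) == 3.25
```

Because this mapping is exact and invertible, a single fixed-width token can cover the full double-precision range without losing precision, which is the efficiency and range argument behind the core claim.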

Key Contributions

  • Defines nine desiderata for single-token number encodings: token efficiency, uniqueness, structured geometry, scale invariance, normalization compatibility, numerical stability, continuity, robustness, and arithmetic support.
  • Critiques xVal and FoNE under those desiderata, especially range/precision limits for xVal and multiplication/division difficulty for sinusoidal encodings.
  • Encodes a number as an added vector on a dedicated [NUM] token using its IEEE 754 binary floating-point bits, optionally concatenated with the bit representation of the reciprocal (see the embedding sketch after this list).
  • Uses a number head that predicts the output bit-wise with a binary cross-entropy loss instead of regressing the scalar directly (sketched after this list).
  • Reports controlled small-LM experiments where BitTokens outperform subword, single-digit, xVal, and FoNE baselines on comparison and single-step arithmetic tasks.
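One way to realize the [NUM]-token bullet above is a learned projection of ±1-coded bits added onto a learned [NUM] embedding. The module below is a sketch under those assumptions; the layer names, the ±1 coding, and the sum-based combination are illustrative, not the paper's implementation (the paper also considers zero-padding the features into the embedding):

```python
import struct
import torch
import torch.nn as nn

def float_to_bit_tensor(x: float) -> torch.Tensor:
    # 64 IEEE 754 binary64 bits of x as a {0., 1.} tensor.
    (u,) = struct.unpack(">Q", struct.pack(">d", x))
    return torch.tensor([(u >> (63 - i)) & 1 for i in range(64)], dtype=torch.float32)

class BitNumEmbedding(nn.Module):
    """Sketch: numeric bit features added onto a dedicated [NUM] embedding."""
    def __init__(self, d_model: int, use_reciprocal: bool = True):
        super().__init__()
        self.use_reciprocal = use_reciprocal
        n_feats = 128 if use_reciprocal else 64
        self.num_embedding = nn.Parameter(torch.zeros(d_model))  # learned [NUM] vector
        self.proj = nn.Linear(n_feats, d_model, bias=False)      # bit features -> model dim

    def forward(self, value: float) -> torch.Tensor:
        bits = float_to_bit_tensor(value)
        if self.use_reciprocal:
            # Optional reciprocal features; the ablations report these
            # matter strongly for division.
            recip = 1.0 / value if value != 0.0 else 0.0
            bits = torch.cat([bits, float_to_bit_tensor(recip)])
        feats = bits * 2.0 - 1.0  # {0,1} -> {-1,+1} (an assumption here) so zero bits carry signal
        return self.num_embedding + self.proj(feats)

emb = BitNumEmbedding(d_model=512)
vec = emb(3.25)  # one d_model-sized vector for a single [NUM] token
```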
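The number head from the last bullet can likewise be sketched as 64 independent bit logits trained with binary cross-entropy; the class below is a hypothetical minimal version, not the paper's code:

```python
import struct
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitNumberHead(nn.Module):
    """Sketch: predict each of the 64 IEEE 754 bits independently and
    train with binary cross-entropy instead of regressing the scalar."""
    def __init__(self, d_model: int):
        super().__init__()
        self.to_bits = nn.Linear(d_model, 64)  # hidden state -> 64 bit logits

    def loss(self, hidden: torch.Tensor, target: float) -> torch.Tensor:
        (u,) = struct.unpack(">Q", struct.pack(">d", target))
        target_bits = torch.tensor([(u >> (63 - i)) & 1 for i in range(64)],
                                   dtype=torch.float32)
        return F.binary_cross_entropy_with_logits(self.to_bits(hidden), target_bits)

    @torch.no_grad()
    def decode(self, hidden: torch.Tensor) -> float:
        # Threshold each logit at zero, then reinterpret the bits as a float64.
        bits = (self.to_bits(hidden) > 0).int().tolist()
        u = 0
        for b in bits:
            u = (u << 1) | b
        (x,) = struct.unpack(">d", struct.pack(">Q", u))
        return x

head = BitNumberHead(d_model=512)
hidden = torch.randn(512)
print(head.loss(hidden, 3.25), head.decode(hidden))
```

Note that because each bit is predicted independently, a single flipped exponent bit changes the decoded value by orders of magnitude, which is the discontinuity concern raised in the open questions below.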

Method Notes

The important transfer lesson for time-series and world-model work is not that IEEE 754 is always the right representation for sensor values. It is that the embedding geometry should match the downstream computation. If the model must perform exact or near-exact arithmetic on auxiliary numeric values, exposing the bits can be more useful than hiding the value behind fragmented text tokens or a single smooth scalar magnitude.

For time-series models, BitTokens are most relevant to auxiliary numeric values such as control inputs, numeric actions, numeric interventions, exogenous variables, metadata values, or symbolic numeric prompts. Continuous observations may still need smooth local geometry, normalization, and uncertainty-aware heads rather than exact binary decoding.

BitTokens are a typed numeric-token path, not a drop-in tokenizer swap: the method requires number parsing, a dedicated [NUM] token, numeric features added to the embedding, a number head, and a bit-wise loss.
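A toy version of the parsing step, just to make the "not a drop-in tokenizer swap" point concrete; the regex and placeholder handling here are illustrative, and production parsing needs far more care, as the limitations note:

```python
import re

NUM_RE = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?")

def extract_numbers(text: str) -> tuple[str, list[float]]:
    # Replace each numeric literal with a [NUM] placeholder and return the
    # rewritten text plus the parsed values, aligned by order of appearance.
    values = [float(m) for m in NUM_RE.findall(text)]
    return NUM_RE.sub("[NUM]", text), values

text, values = extract_numbers("Add 12.5 to -3 and report the result.")
# text   == "Add [NUM] to [NUM] and report the result."
# values == [12.5, -3.0]
```

Downstream, each [NUM] position consumes one numeric embedding on the input side, and positions where a number is the expected output are trained through the number head and its bit-wise loss.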

Evidence And Results

The paper’s frontier-model benchmark motivates the problem by showing high reasoning-token use on arithmetic tasks. Its controlled experiments then compare number encodings in small language models trained from scratch on numeric tasks and FineWeb text. BitTokens are reported as the strongest method in the multi-task setting, leading on comparison and single-step arithmetic, while mean, standard deviation, and exponentiation remain difficult.

The ablations support the design choices: base-2 encoding performs better than base-10 in the reported multi-task setting, reciprocal features matter strongly for division, and simple sum or zero-padding schemes for combining the numeric features with the token embedding work better than more elaborate strategies.

Limitations

The strongest evidence is from small models and synthetic numeric tasks, not large-scale pretrained LLMs or time-series foundation models. The paper also notes that production integration still needs robust parsing of numbers, notation handling, and precision-aware output policies. The binary representation is exact and algorithm-friendly, but it may be a poor inductive bias when the desired behavior is smooth interpolation over noisy sensor observations.

Open Questions

  • Can BitTokens be mixed with continuous time-series value embeddings without creating brittle discontinuities around IEEE 754 bit flips?
  • Should auxiliary numeric values use different encodings depending on whether they are observations, exogenous variables, control inputs, or interventions?
  • Can a model route between bit-level, Fourier, logarithmic, and smooth scalar encodings according to the operation being performed?