Number Tokenization

Summary

Number tokenization covers how models encode scalar numeric values as tokens, embeddings, or coefficient-space representations. In this wiki, the topic matters for point-wise time-series embeddings and for auxiliary numeric values such as exogenous variables, control inputs, interventions, metadata, and numeric prompts.

What The Wiki Currently Believes

EIDOS is the time-series anchor: each univariate scalar sample is mapped to a point-wise latent token through a sine-activated gated linear unit. The sine activation supplies bounded periodic basis responses, the gate selects which responses pass through, and because the mapping is point-wise, the original temporal resolution is preserved.
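
A minimal PyTorch sketch of that point-wise map is below. The module name SineGLUTokenizer and the sigmoid gate are our assumptions for illustration; the exact parameterization in EIDOS may differ.

    import torch
    import torch.nn as nn

    class SineGLUTokenizer(nn.Module):
        # Point-wise scalar-to-token map: a gated linear unit whose value
        # branch uses a sine activation. Illustrative sketch, not the
        # paper's exact architecture.
        def __init__(self, d_token: int):
            super().__init__()
            self.value = nn.Linear(1, d_token)  # sine branch: bounded periodic basis responses
            self.gate = nn.Linear(1, d_token)   # gate branch: selects which responses pass

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, length) univariate samples -> (batch, length, d_token);
            # one token per sample, so temporal resolution is preserved.
            x = x.unsqueeze(-1)
            return torch.sin(self.value(x)) * torch.sigmoid(self.gate(x))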

FlowState is not a number tokenizer, but it belongs to the same design space. It encodes time-series histories into a coefficient space and uses a functional basis decoder to sample continuous forecasts at any requested temporal resolution.
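
The decoder side can be made concrete with a fixed Fourier basis on [0, 1]; the basis choice and the helper decode_basis are our illustration, not FlowState's actual parameterization. The point is that the sampling grid t is chosen at decode time, so temporal resolution is a decoding choice rather than a property of the representation.

    import torch

    def decode_basis(coeffs: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # coeffs: (batch, 2K + 1) weights over a fixed Fourier basis on [0, 1];
        # t: (n,) query times in [0, 1], chosen freely at decode time.
        K = (coeffs.shape[-1] - 1) // 2
        k = torch.arange(1, K + 1)
        phases = 2 * torch.pi * k[None, :] * t[:, None]  # (n, K)
        basis = torch.cat(
            [torch.ones(len(t), 1), torch.cos(phases), torch.sin(phases)], dim=-1
        )  # (n, 2K + 1)
        return coeffs @ basis.T  # (batch, n) forecast values at the queried times

Calling decode_basis with a 15-minute grid and again with an hourly grid resamples the same coefficients at two resolutions.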

"Pre-trained Large Language Models Use Fourier Features To Compute Addition" provides the mechanistic motivation for Fourier number embeddings: pretrained LLMs can use low-frequency components for magnitude approximation and high-frequency components for modular arithmetic.
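
A toy numpy construction (ours, not the paper's probing setup) shows why a Fourier feature bank can carry both kinds of information: a period-10 feature depends only on a number's last digit, while a feature whose period far exceeds the value range behaves as a magnitude proxy.

    import numpy as np

    n = np.arange(0, 200)
    high = np.cos(2 * np.pi * n / 10)    # period 10: depends only on n mod 10
    low = np.sin(2 * np.pi * n / 1000)   # period 1000: near-monotone here, tracks magnitude

    assert np.allclose(high[3], high[113])  # same residue mod 10 -> same feature
    assert low[150] > low[15] > low[3]      # ordered by magnitude within a quarter period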

FoNE turns that observation into an explicit single-token number embedding using digit-aligned sine/cosine features. It is a strong proposal for compact, smooth, periodic scalar representations, especially for addition-like structure.
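
A minimal sketch of the digit-aligned construction follows, with one cos/sin pair per decimal position; the number of positions and the absence of any further scaling are our simplifications rather than FoNE's published configuration.

    import numpy as np

    def fone_style_embedding(x: float, n_positions: int = 5) -> np.ndarray:
        # One cos/sin pair per decimal position: period 10 exposes the units
        # digit as a phase, period 100 the tens digit, and so on. The whole
        # number occupies a single embedding vector. Sketch only.
        feats = []
        for i in range(n_positions):
            angle = 2 * np.pi * x / 10.0 ** (i + 1)
            feats.extend([np.cos(angle), np.sin(angle)])
        return np.array(feats)

    # Numbers sharing their low-order digits share their low-period coordinates:
    a, b = fone_style_embedding(123.0), fone_style_embedding(623.0)
    assert np.allclose(a[:4], b[:4])  # same last two digits -> same period-10/100 features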

BitTokens takes the opposite route: it exposes the IEEE 754 binary representation directly so the model can learn bit-level algorithms for comparison and arithmetic. It argues that Fourier embeddings are elegant but do not generalize to multiplication and division.
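
The representation itself is easy to inspect. The helper below (float_to_bits, our name) unpacks a Python float into its 64 IEEE 754 double-precision bits; a BitTokens-style model consumes these bits, or embeddings of them, instead of a smooth scalar code.

    import struct

    def float_to_bits(x: float) -> list[int]:
        # Most significant bit first: 1 sign bit, 11 exponent bits, 52 mantissa bits.
        (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
        return [(as_int >> (63 - i)) & 1 for i in range(64)]

    bits = float_to_bits(-2.5)
    assert bits[0] == 1  # sign bit: negative
    # -2.5 = -1.25 * 2**1, so the biased exponent is 1 + 1023 = 1024 = 0b10000000000
    assert bits[1:12] == [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]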

TabM adds a static-tabular branch to the same design space. Its numerical feature embeddings are not text-number tokens: they are typed per-feature encoders for continuous table columns. The useful options are raw scalar inputs after preprocessing, LinearReLUEmbeddings, updated piecewise-linear embeddings with feature-specific bins, and periodic embeddings with learned frequencies and cosine/sine activations.

Continuous Feature Embedding Options

TabM and its rtdl_num_embeddings dependency are useful because they separate several choices that are often blurred together:

  • Raw scalar input keeps each numeric feature as one scalar after preprocessing. This is cheap and strong enough for many MLP-style baselines, but it gives the model no explicit local basis over a feature’s value range.
  • Linear-ReLU embeddings map each scalar feature through a per-feature linear projection plus ReLU. This is the lightest learned non-linear feature interface.
  • Piecewise-linear embeddings compute feature-specific bins, encode the scalar by its position within those bins, then learn an embedding over the piecewise-linear representation. This is attractive when local thresholds, monotone regions, or quantile structure matter.
  • Periodic embeddings / PLR (periodic, linear, ReLU) learn frequencies, apply cosine/sine features, then project them with an outer linear layer (sketched in code after this list). This is the tabular analogue closest to Fourier-style number embeddings, but its semantics are feature-specific rather than digit-specific.
  • Fixed piecewise-linear encoding and plain linear embeddings are available in the embedding package as related options, but the current TabM implementation only accepts LinearReLUEmbeddings, PiecewiseLinearEmbeddings, and PeriodicEmbeddings as num_embeddings.
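
The periodic option can be written from scratch in a few lines; the module below (PeriodicFeatureEmbeddings, our name) follows the PLR recipe of learned per-feature frequencies, cosine/sine features, a linear projection, and a ReLU. The packaged rtdl_num_embeddings.PeriodicEmbeddings differs in initialization and projection details, so treat this as schematic.

    import torch
    import torch.nn as nn

    class PeriodicFeatureEmbeddings(nn.Module):
        # PLR-style sketch: periodic features -> linear -> ReLU.
        def __init__(self, n_features: int, n_frequencies: int = 48, d_embedding: int = 24):
            super().__init__()
            # One learned frequency bank per feature: the semantics are
            # feature-specific, unlike digit-aligned number embeddings.
            self.frequencies = nn.Parameter(0.01 * torch.randn(n_features, n_frequencies))
            self.proj = nn.Linear(2 * n_frequencies, d_embedding)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, n_features) -> (batch, n_features, d_embedding)
            angles = 2 * torch.pi * self.frequencies[None] * x[..., None]
            periodic = torch.cat([torch.cos(angles), torch.sin(angles)], dim=-1)
            return torch.relu(self.proj(periodic))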

Design Implications

Numeric values should not be treated as one uniform modality. A time-series observation, an exogenous variable, a numeric control input, and a causal intervention can all be scalar numbers, but they impose different requirements on geometry, decoding, smoothness, and exactness.

Smooth point-wise embeddings are attractive for noisy observations and continuous trajectories. Fourier or other periodic bases are natural for cyclic values, phase, and digit/modular structure. Bit-level encodings are attractive when the model must perform exact arithmetic or preserve a wide numeric range. Continuous basis decoders are useful when the output itself should be a resampleable function rather than a fixed horizon of tokens.

Typed continuous feature embeddings are the more natural default for auxiliary numeric values whose identity is known in advance. A blood-glucose value, a price, a temperature, a drug dose, a control setpoint, and a calendar feature may all be numeric, but their bins, scales, periodicities, and safe extrapolation behavior differ. TabM-style per-feature embeddings make that distinction explicit, while FoNE and BitTokens focus on representing free-standing numerals.

Evidence

The current evidence is split by domain. EIDOS and FlowState support numeric time-series representation design through forecasting results. FoNE and BitTokens support specialized number representations through controlled language-model arithmetic experiments. TabM supports typed continuous-feature embeddings through static tabular prediction benchmarks. The 2024 Fourier-features paper explains why pretrained language models may already contain useful periodic number structure, but it does not show that the same mechanism suffices for time-series foundation models.

Open Questions

  • Which scalar encoding should be used for known future exogenous variables versus controllable numeric actions or interventions?
  • Can a time-series foundation model route between point-wise smooth embeddings, Fourier bases, bit-level number tokens, and learned continuous bases?
  • When should auxiliary numeric values use feature-specific piecewise-linear bins rather than universal Fourier or bit-level number tokens?
  • How should measurement units, missingness, uncertainty, normalization, and sign be represented in auxiliary numeric values?
  • Does exact arithmetic help forecasting and world modeling, or only symbolic numeric tasks?