What Should Change After The Number Tokenization Ingest?

Answer

The immediate wiki change is to treat number tokenization as a first-class representation interface, not as a narrow LLM arithmetic detail. The new Number Tokenization topic should become the hub for scalar numeric observations, known future exogenous variables, numeric control inputs, intervention intensities, timestamps, metadata, and numeric prompts.

The strongest local time-series updates are EIDOS and FlowState. EIDOS provides point-wise SiGLU scalar tokenization for time-series samples; FlowState provides coefficient-space continuous basis decoding. These should be kept adjacent but distinct: EIDOS is an input-side value-to-latent-token design, while FlowState is an output-side coefficient-to-continuous-forecast design.
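To make that input/output contrast concrete, here is a minimal numpy sketch, not either paper's actual code: `siglu_tokenize` assumes EIDOS's SiGLU tokenizer is roughly a sigmoid-gated linear lift of each scalar sample, and `basis_decode` assumes FlowState's decoder evaluates predicted coefficients against a fixed continuous basis (a truncated Fourier basis stands in here for whatever basis FlowState actually uses; both function names are illustrative).

```python
import numpy as np

def siglu_tokenize(x, w_gate, b_gate, w_val, b_val):
    """Lift each scalar sample x[t] into a d-dim latent token via a
    sigmoid-gated linear unit: (x*w_val + b_val) * sigmoid(x*w_gate + b_gate)."""
    gate = 1.0 / (1.0 + np.exp(-(np.outer(x, w_gate) + b_gate)))  # (T, d)
    value = np.outer(x, w_val) + b_val                            # (T, d)
    return value * gate                                           # one token per sample

def basis_decode(coeffs, t_query):
    """Decode a forecast as a continuous function: the coefficients are fixed
    once, and any query-time grid can be evaluated from them."""
    n_harmonics = (len(coeffs) - 1) // 2
    feats = [np.ones_like(t_query)]
    for k in range(1, n_harmonics + 1):
        feats.append(np.sin(2 * np.pi * k * t_query))
        feats.append(np.cos(2 * np.pi * k * t_query))
    Phi = np.stack(feats, axis=-1)  # (Q, 2K+1) basis matrix
    return Phi @ coeffs             # (Q,) forecast values

rng = np.random.default_rng(0)
tokens = siglu_tokenize(rng.normal(size=100), *rng.normal(size=(4, 16)))  # (100, 16)
forecast = basis_decode(rng.normal(size=7), np.linspace(0.0, 1.0, 50))    # (50,)
```

The point of keeping the two adjacent is visible here: the tokenizer's output resolution is tied to the sampling grid, while the basis decoder's output resolution is a free choice at query time.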

FoNE and BitTokens create a live design tension. FoNE is the Fourier-feature path to compact periodic number embeddings, motivated by the 2024 Fourier-features addition paper. BitTokens is the bit-level path to exact, arithmetic-friendly number tokens, and it explicitly challenges FoNE's generality for multiplication and division.
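The tension is easiest to see side by side. The following is a hedged sketch, not either paper's implementation: `fone_embed` assumes FoNE's core idea of one cos/sin pair per decimal scale, and `bit_tokens` assumes BitTokens exposes an IEEE-754-style bit pattern; the exact scale choices, normalization, and token granularity in both papers may differ.

```python
import struct
import numpy as np

def fone_embed(x, num_scales=5):
    """FoNE-style sketch: cos/sin pairs with periods 10, 100, ..., so each
    decimal digit of x corresponds to a recoverable phase."""
    feats = []
    for k in range(1, num_scales + 1):
        phase = 2 * np.pi * x / 10.0 ** k
        feats.extend([np.cos(phase), np.sin(phase)])
    return np.array(feats)

def bit_tokens(x):
    """BitTokens-style sketch: the IEEE-754 float64 bit pattern of x as 64
    binary tokens (sign, exponent, mantissa), one exact encoding per number."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return [(bits >> i) & 1 for i in reversed(range(64))]

assert len(fone_embed(4096.0)) == 10  # 5 scales x (cos, sin) pairs
assert len(bit_tokens(4096.0)) == 64  # fixed-length and exact
```

The contrast also explains the multiplication/division critique: phases compose additively under addition, which is what the Fourier-features addition result exploits, but they do not compose simply under multiplication, whereas bit-level tokens carry complete information at the cost of longer, less smooth representations.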

TabM adds a separate typed-feature branch. Its numerical embeddings are not universal text-number tokens; they are per-column encoders for static tabular continuous features. This is directly useful for auxiliary time-series values because exogenous variables, numeric control inputs, intervention intensities, timestamps, and metadata often have stable feature identities and feature-specific value ranges.
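A minimal sketch of the per-column idea follows; the class name and the random-frequency parameterization are assumptions borrowed from the periodic module of the 2022 numerical-feature-embeddings paper rather than TabM's exact code (TabM's options also include piecewise-linear embeddings).

```python
import numpy as np

class PerColumnPeriodicEmbedding:
    """Typed-feature sketch: every column owns its own frequency bank, so
    embeddings adapt to feature-specific value ranges instead of sharing a
    single universal text-number codec."""

    def __init__(self, n_columns, n_freqs=8, sigma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # per-column frequencies c[j, k] ~ N(0, sigma)
        self.freqs = rng.normal(0.0, sigma, size=(n_columns, n_freqs))

    def __call__(self, X):
        # X: (batch, n_columns), each continuous column already standardized
        phase = 2 * np.pi * X[:, :, None] * self.freqs                  # (B, C, F)
        return np.concatenate([np.cos(phase), np.sin(phase)], axis=-1)  # (B, C, 2F)

emb = PerColumnPeriodicEmbedding(n_columns=4)
out = emb(np.random.default_rng(1).normal(size=(32, 4)))  # (32, 4, 16)
```

For auxiliary time-series values, the same pattern applies: a known-future covariate or an intervention-intensity channel keeps its column identity, so its embedding can specialize to that channel's range and distribution.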

Updated In This Ingest

  • Added source pages for FlowState, BitTokens, FoNE, and the 2024 Fourier-features addition paper.
  • Added entity pages for FlowState, BitTokens, and FoNE.
  • Added Number Tokenization.
  • Updated EIDOS, latent tokenization, tokenizer transfer, JEPA, representation collapse, latent-space predictive learning, time-series foundation models, time-series scaling, benchmark hygiene, and contradictions.
  • Added TabM and updated Number Tokenization and Tabular Foundation Models with typed continuous-feature embedding options.

Existing Pages To Revisit Next

  • NuTime: connect its numerical-scale embedding to the new number-tokenization page.
  • Time-Series Classification Foundation Models: add a short distinction between shape/classification embeddings and arithmetic-oriented number embeddings.
  • Tabular Foundation Models: keep TabM as a non-foundation static-tabular baseline and avoid mixing it with TabPFN-style in-context learners.
  • World Models: add only a small pointer for numeric actions/control inputs, avoiding the claim that arithmetic tokenization alone makes a model action-conditioned.
  • Action-Conditioned Time-Series Datasets: revisit when numeric control inputs or intervention intensities become a dataset-level focus.
  • Byte-Level Language Models: add a contrast between raw byte handling and typed numeric tokens if byte-level numeric examples become central.

Candidate Sources To Add

  • numerical-feature-embeddings-2022: On Embeddings for Numerical Features in Tabular Deep Learning is now the most important missing source behind TabM’s piecewise-linear and periodic embedding options.
  • tabr-2023: TabR is useful because TabM references its periodic-embedding design and its lightweight PeriodicEmbeddings(lite=True) configuration.
  • excelformer-2023: ExcelFormer should be added if the wiki needs the GLU-style numerical embedding branch mentioned in TabM’s baseline discussion.
  • xval-2023: xVal: A Continuous Numerical Tokenization for Scientific Language Models is the main continuous single-token baseline criticized by BitTokens and should be ingested before making broad claims about numeric tokenization; a minimal sketch of the scheme follows this list.
  • transformers-arithmetic-right-embeddings-2024: Transformers Can Do Arithmetic with the Right Embeddings should be added as the digit-position and arithmetic-embedding baseline behind several number-representation discussions.
  • numerologic-2024: NumeroLogic: Number Encoding for Enhanced LLMs’ Numerical Reasoning should be added if the wiki needs the extra-token/magnitude-hint branch of numeric encoding.
  • A time-series-specific value-interface source should be added if found, ideally one that compares point-wise scalar embeddings, covariate embeddings, timestamp encodings, and control-input embeddings directly.
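For orientation before ingesting xval-2023, here is a minimal sketch of the continuous single-token scheme; the helper name and the rescaling constant are illustrative, not the paper's API. Magnitude lives in the embedding norm, and the bounded dynamic range and precision of that choice are exactly what the BitTokens critique targets.

```python
import numpy as np

def xval_encode(value, num_embedding, scale=1.0):
    """xVal-style sketch: every number shares one [NUM] vocabulary embedding,
    multiplicatively scaled by the (rescaled) value; a regression head is
    then expected to recover the value on the output side."""
    return (value / scale) * num_embedding

num_embedding = np.ones(8) / np.sqrt(8)   # stand-in for the learned [NUM] row
token = xval_encode(3.14, num_embedding)  # one continuous "number token"
```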

Open Questions

  • Should the wiki split number-tokenization into a more time-series-specific page such as time-series-value-interfaces once more sources accumulate?
  • Which numeric encoding should be used for known future exogenous variables versus controllable numeric actions or interventions?
  • Can a single model route among smooth scalar embeddings, Fourier features, bit-level tokens, and continuous bases without hurting calibration or multivariate alignment?
  • Should typed feature embeddings be learned per variable, per variable family, or shared across variables with explicit unit metadata?