Compute Optimal Tokenization

Source

Status And Credibility

Compute Optimal Tokenization was submitted to arXiv on 2026-05-02, published on Meta AI’s research site on 2026-05-04, and revised to v2 on 2026-05-26. The authors are mostly from FAIR at Meta, with University of Washington coauthors, and the release includes an official project page plus a facebookresearch repository containing result CSVs, scaling-law fitting code, and visualization code. Treat it as current, credible large-scale preprint evidence for tokenization-aware language-model scaling, but not yet as peer-reviewed or independently reproduced.

X API credentials were unavailable during ingest, so the author announcement is preserved through X oEmbed and public search results rather than an authenticated thread capture. No verified Meta AI or full author thread was found beyond Srini Iyer’s announcement during this pass.

Core Claim

The paper argues that compute-optimal language-model scaling should measure training data in bytes rather than tokens when comparing tokenization schemes. Tokens are not a stable unit because their information density changes with compression rate, vocabulary, tokenizer family, language, and domain.

For English text, the paper generalizes the Chinchilla-style data/model rule into an approximate byte-per-parameter rule:

$$B^* \approx 60\,N^*$$

where $B^*$ is compute-optimal training data measured in bytes and $N^*$ is compute-optimal parameter count. The claim is not “always use exactly 60” across all domains; it is that bytes are a more stable unit than tokens when changing tokenization.
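
The byte-per-parameter rule can be turned into a small budget calculator. The sketch below is illustrative only: the function names, the 1B-parameter model, and the listed bytes-per-token values are hypothetical, and the default ratio of 60 is the approximate English figure quoted above, not a universal constant.

```python
def optimal_bytes(n_params: float, bytes_per_param: float = 60.0) -> float:
    """Training bytes implied by a constant bytes-per-parameter ratio."""
    return bytes_per_param * n_params


def bytes_to_tokens(n_bytes: float, compression_rate: float) -> float:
    """Convert a byte budget into tokens for a tokenizer that averages
    `compression_rate` bytes per token."""
    if compression_rate <= 0:
        raise ValueError("compression_rate must be positive")
    return n_bytes / compression_rate


n_params = 1e9  # hypothetical 1B-parameter model
budget = optimal_bytes(n_params)
for rate in (1.0, 3.0, 4.5, 6.0):  # hypothetical bytes-per-token values
    tokens = bytes_to_tokens(budget, rate)
    print(f"T={rate:.1f} bytes/token -> {tokens / n_params:.1f} tokens per parameter")
```

The same byte budget maps to very different token counts as the compression rate changes, which is why a tokens-per-parameter rule does not transfer between tokenizers.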

Method

The authors train and fit scaling laws over both latent-tokenized BLT-style models and subword-tokenized isotropic Transformer models.

flowchart LR
    C[Compute budget C] --> Grid[Model-size x compression-rate grid]
    T[Compression rate T: bytes/token] --> Grid
    Grid --> Train[Train many language models]
    Train --> IsoFLOP[Fit IsoFLOP curves]
    IsoFLOP --> B[Optimal bytes B*]
    IsoFLOP --> N[Optimal parameters N*]
    B --> Rho[bytes per parameter rho*]
    N --> Rho
    IsoFLOP --> Loss[Optimal loss L*]
    Loss --> Tstar[Compute-dependent optimal compression T*]
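
The IsoFLOP step in the flowchart can be sketched as a parabola fit in log parameter count at a fixed compute budget. This is a minimal illustration and not the fitting code released with the paper: the function names are hypothetical, the quadratic form is the common Chinchilla-style choice, and the conversion to bytes assumes the standard dense-Transformer approximation C ≈ 6·N·D with D in tokens, which may not match the paper's FLOP accounting (especially for BLT-style models).

```python
import numpy as np


def fit_isoflop_minimum(n_params, losses):
    """Fit loss = a*x^2 + b*x + c with x = log10(N); return (N*, L*)."""
    x = np.log10(np.asarray(n_params, dtype=float))
    y = np.asarray(losses, dtype=float)
    a, b, c = np.polyfit(x, y, deg=2)
    if a <= 0:
        raise ValueError("fitted curve has no interior minimum")
    x_star = -b / (2.0 * a)             # vertex of the parabola
    loss_star = c - b * b / (4.0 * a)   # loss at the vertex
    return 10.0 ** x_star, loss_star


def optimal_bytes_at_budget(compute_flops, n_star, compression_rate):
    """Assumption: C ~= 6 * N * D (D in tokens), so B = D * T bytes."""
    tokens = compute_flops / (6.0 * n_star)
    return tokens * compression_rate
```

Dividing the resulting byte count by N* gives the bytes-per-parameter ratio for that budget and compression rate; repeating this over the grid is what lets the ratio be compared across budgets.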

The main experiments train 988 latent-tokenized models and 320 subword-tokenized models from 50M to 6.7B parameters, under compute budgets from to FLOPs. Models are evaluated in bits per byte with a fixed 8192-byte evaluation context so tokenizers are compared against the same byte-level information window.
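
Bits per byte makes losses comparable across tokenizers because the denominator is the raw text, not the token count. A minimal sketch, assuming per-token negative log-likelihoods in nats are already available from some model; the helper names are hypothetical, and the truncation helper is only one simple way to cap a context at a fixed byte length.

```python
import math


def truncate_to_bytes(text: str, max_bytes: int = 8192) -> str:
    """Cut text to at most `max_bytes` UTF-8 bytes, dropping any
    multi-byte character that would be split at the boundary."""
    return text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")


def bits_per_byte(token_nll_nats, text: str) -> float:
    """Total negative log-likelihood (nats) converted to bits and
    normalized by the UTF-8 byte length of the scored text."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must contain at least one byte")
    return sum(token_nll_nats) / (math.log(2) * n_bytes)
```

Two tokenizers that split the same text into different numbers of tokens produce different per-token losses, but the summed loss divided by the byte count is directly comparable.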

Key Contributions

  • Introduces compression rate $T$ as an explicit scaling variable, where $T$ is average bytes per token.
  • Finds that the compute-optimal byte-per-parameter ratio $\rho^*$ stays close to constant across compute budgets and compression rates for English, with $\rho^* \approx 60$.
  • Finds that optimal compression is non-monotonic: both too little and too much compression increase loss.
  • Fits a compute-dependent compression law where the optimal compression rate $T^*$ slowly decreases as training compute increases, for example roughly at FLOPs and at FLOPs in the BLT setting.
  • Reports similar scaling trends for latent tokenization and subword tokenization, including BPE, masked-vocabulary BPE, character-level tokenization, and SuperBPE.
  • Extends the analysis beyond English and finds that optimal byte-per-parameter ratio and optimal compression vary by language and correlate with byte parity.
  • Shows inference tradeoffs: higher compression can reduce inference FLOPs per byte, but compression close to the training optimum can improve quality under matched inference cost on harder tasks.
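
Compression rate and its effect on serving cost can be estimated with a few lines. This is a rough sketch under stated assumptions: `tokenize` is any hypothetical callable that returns a sequence of tokens, and the cost estimate uses the common approximation of about 2·N forward FLOPs per token for a dense Transformer, ignoring attention cost and any byte-level encoder or decoder that a latent-tokenized model would add.

```python
def compression_rate(texts, tokenize) -> float:
    """Average UTF-8 bytes per token over a corpus sample."""
    texts = list(texts)  # allow generators; the sample is scanned twice
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    if total_tokens == 0:
        raise ValueError("tokenizer produced no tokens")
    return total_bytes / total_tokens


def inference_flops_per_byte(n_params: float, rate: float) -> float:
    """Assumption: ~2 * N forward FLOPs per token, spread over `rate` bytes."""
    return 2.0 * n_params / rate
```

For example, `compression_rate(sample, str.split)` measures a whitespace tokenizer, and doubling the rate halves the estimated FLOPs per byte for a fixed parameter count.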

Why It Matters

This is a strong scaling-law anchor for the wiki’s tokenization branch. It says fixed tokenizers are not just preprocessing choices; they change the unit used to state scaling laws. A “20 tokens per parameter” rule is meaningful only relative to a tokenizer with a particular compression rate. When the tokenizer changes, a byte-based statement is more portable.

For multilingual language modeling, the source also turns tokenizer fairness into a compute-allocation problem. Popular multilingual subword tokenizers can over-compress some high-resource languages and under-compress lower-resource or byte-heavier languages. Latent tokenization becomes attractive because compression can be adapted per language without making vocabulary statistics the only control knob.
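
The byte parity mentioned in the contributions can be approximated from parallel text. The sketch below takes byte parity to mean the ratio of UTF-8 bytes needed to express the same content in a language relative to English; that reading is an assumption, and the paper's exact definition and data may differ. The function name and the example pair are hypothetical.

```python
def byte_parity(parallel_pairs) -> float:
    """Ratio of UTF-8 bytes in the other language to bytes in English,
    summed over (english, other) translation pairs."""
    english_bytes = 0
    other_bytes = 0
    for english, other in parallel_pairs:
        english_bytes += len(english.encode("utf-8"))
        other_bytes += len(other.encode("utf-8"))
    if english_bytes == 0:
        raise ValueError("no English bytes to compare against")
    return other_bytes / english_bytes


# Cyrillic letters take two bytes each in UTF-8, so this toy pair is byte-heavier.
print(byte_parity([("hello", "привет")]))
```

A value above 1 means the language needs more bytes than English for the same content, which is the situation where a fixed bytes-per-token target compresses less actual content per token.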

Limitations

  • The evidence is language modeling, not numeric time-series forecasting, graph time series, robotics trajectories, or action-conditioned world models.
  • The experiments fix several training hyperparameters rather than retuning per budget and tokenizer.
  • The largest reported models are 6.7B parameters; frontier-scale extrapolation remains an extrapolation.
  • The code repository releases result data and fitting/visualization code, but not full training code for every run.
  • Optimal compression can be task-dependent at inference: ARC-Easy, ARC-Challenge, C4, and HellaSwag do not all reward the same compression/inference-cost point.
  • Lower compression increases inference cost because more tokens must be processed per byte; compute-optimal training and cost-optimal serving can disagree.

Relation To Existing Tensions

This source sharpens the Tokenizer Removal Has Multiple Incompatible Paths tension. H-Net, Synergy, Bolmo, ConceptMoE, and BLT-style latent tokenization are not merely different ways to remove tokenizers; they imply different compute-allocation contracts. Compute Optimal Tokenization adds a quantitative test: compare them by bytes per parameter, compression rate, loss, and inference FLOPs per byte rather than by token count alone.

It also adds a methodological disagreement with “Scaling Laws with Vocabulary” style results. The paper argues that controlling compression through vocabulary size entangles compression with embedding-layer compute, while BLT and SuperBPE-style settings can vary compression with a more stable embedding-cost story.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Adaptive tokenization | adjacent | Shows compression rate should be a learned or chosen scaling variable, not a fixed preprocessing constant. | Needs direct numeric time-series patching and multivariate evidence. |
| Scaling and efficiency | adjacent | Turns token granularity into a compute-allocation variable with explicit byte-based scaling laws. | Needs TSFM scaling laws that include sample rate, horizon, channel count, and exogenous variables. |
| Information-density transfer | adjacent | Shows information density and compression optimum vary by language and byte parity. | Needs equivalent parity or information-density measures for numeric features, event streams, images, code, and actions. |
| Serving cost | warning | Lower compression can improve quality but raises inference FLOPs per byte. | Need latency, memory-bandwidth, batching, and cache-aware serving tests. |

Open Questions

  • Can byte-per-parameter scaling be generalized to non-text modalities through a measurable information-density unit?
  • Should a multilingual foundation model learn language-conditioned compression directly, or should the data sampler compensate for byte parity?
  • How should training-compute-optimal compression be traded against serving-cost-optimal compression?
  • Can adaptive time-series patching produce an equivalent scaling law in bytes, samples, events, or channel-time cells per parameter?
  • Which preservation probes catch cases where compression helps average loss but erases rare regimes, numeric detail, actions, or intervention effects?