Compute Optimal Tokenization
Summary
Compute Optimal Tokenization is a Meta FAIR and University of Washington scaling-law study that treats token compression rate as a first-class model-design variable. Its practical takeaway is that data/model scaling rules should be stated in bytes per parameter when tokenization changes.
Interface
- Scaling unit: bytes of training data per parameter.
- Tokenization variable: compression rate, measured as average bytes per token.
- Main law: for English text, compute-optimal training stays close to a constant number of bytes per parameter across several compute budgets and compression rates.
- Compression result: there is an optimal compression rate, and it slowly decreases as training compute increases.
- Released artifacts: arXiv paper, rendered project page, Meta AI publication page, and a facebookresearch result/fitting code repository.
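
The unit conversion behind these bullets can be sketched in a few lines of Python. This is a minimal illustration, not the paper's fitting code: the function names, the toy tokenizers, and every number in the demo are hypothetical and chosen only to show the arithmetic. It assumes compression rate is measured as UTF-8 bytes divided by token count over a text sample.

```python
from typing import Callable, Iterable, Sequence


def compression_rate(texts: Iterable[str],
                     tokenize: Callable[[str], Sequence]) -> float:
    """Average UTF-8 bytes per token over a sample of texts."""
    total_bytes = 0
    total_tokens = 0
    for text in texts:
        total_bytes += len(text.encode("utf-8"))
        total_tokens += len(tokenize(text))
    if total_tokens == 0:
        raise ValueError("tokenizer produced no tokens")
    return total_bytes / total_tokens


def bytes_per_parameter(train_tokens: float,
                        bytes_per_token: float,
                        n_params: float) -> float:
    """Restate a token budget as training bytes per model parameter."""
    return train_tokens * bytes_per_token / n_params


def tokens_for_byte_budget(target_bytes_per_param: float,
                           bytes_per_token: float,
                           n_params: float) -> float:
    """Tokens needed to reach a bytes-per-parameter target with a given tokenizer."""
    return target_bytes_per_param * n_params / bytes_per_token


def byte_tokenize(text: str) -> Sequence:
    """Toy byte-level tokenizer: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    sample = ["the cat sat"]  # hypothetical corpus sample, 11 bytes
    rate_bytes = compression_rate(sample, byte_tokenize)
    # str.split is a crude stand-in for a coarser tokenizer, not a real subword scheme
    rate_words = compression_rate(sample, str.split)
    print(f"byte-level rate: {rate_bytes:.2f} bytes/token")
    print(f"whitespace rate: {rate_words:.2f} bytes/token")

    # Hypothetical budget: the same token count covers different amounts of data
    n_params = 1e8
    train_tokens = 2e9
    for name, rate in [("byte-level", rate_bytes), ("whitespace", rate_words)]:
        bpp = bytes_per_parameter(train_tokens, rate, n_params)
        print(f"{name}: {bpp:.1f} bytes per parameter")
```

The point of the sketch is that a fixed token budget maps to different byte budgets once the compression rate changes, so a tokens-per-parameter rule fitted with one tokenizer has to be converted through bytes before it is applied to another.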
Role In The Wiki
This entity is the local object card for tokenization-aware scaling laws. It should be used when a page needs the quantitative claim that token counts are not portable across tokenizers, or when comparing byte-level, latent-token, subword, and superword schemes under a compute budget.
For time-series and world-model work, this is upstream language-model evidence rather than direct TSFM evidence. Its main transfer value is the design question: what is the right information-density unit for dense numeric streams, event streams, graph time series, and action trajectories?