Byte-Level Language Models
Summary
Byte-level language modeling removes the need for a fixed subword vocabulary, but the papers in this corpus take several incompatible approaches to making that practical.
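As a minimal illustration of the shared premise, the snippet below treats raw UTF-8 bytes as the entire "vocabulary": any string in any script maps to IDs in 0-255, with no learned tokenizer and no out-of-vocabulary tokens. The helper names are illustrative and not taken from any of the papers.

```python
# Minimal sketch: raw UTF-8 bytes as the token vocabulary (256 IDs).
# No learned subword vocabulary is needed, and nothing is ever out-of-vocabulary.

def bytes_to_ids(text: str) -> list[int]:
    """Encode text as a sequence of byte IDs in 0..255."""
    return list(text.encode("utf-8"))

def ids_to_text(ids: list[int]) -> str:
    """Decode byte IDs back to text; errors='replace' guards truncated multi-byte chars."""
    return bytes(ids).decode("utf-8", errors="replace")

if __name__ == "__main__":
    for sample in ["hello", "héllo", "ACGT", "你好"]:
        ids = bytes_to_ids(sample)
        print(sample, "->", ids, "->", ids_to_text(ids))
```

The cost of this simplicity is sequence length: multi-byte characters and long words expand into many positions, which is exactly the problem the chunking and routing approaches below try to address.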
What The Wiki Currently Believes
- Bolmo byteifies existing subword LMs through distillation, recovering much of their behavior with less than 1% of a typical pretraining token budget.
- H-Net learns dynamic hierarchical byte chunking end to end (see the sketch after this list).
- Synergy learns abstraction routing over bytes and reports emergent token-like concepts.
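The learned-chunking line of work replaces the fixed tokenizer with a trainable segmenter. The sketch below is a hypothetical toy version of that general idea, not H-Net's or Synergy's actual architecture: a per-byte scorer flags chunk starts, and byte states are mean-pooled within each predicted chunk to form a shorter sequence. The class and attribute names are invented, and a real system would make the boundary decision differentiable (soft or straight-through) rather than a hard threshold.

```python
# Hypothetical sketch of learned byte chunking (not H-Net's actual design):
# a per-position scorer predicts chunk boundaries, and byte embeddings are
# mean-pooled within each predicted chunk to produce a shorter sequence.
import torch
import torch.nn as nn

class ToyByteChunker(nn.Module):
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.byte_emb = nn.Embedding(256, d_model)    # one embedding per byte value
        self.boundary_scorer = nn.Linear(d_model, 1)  # scores "does a chunk start here?"

    def forward(self, byte_ids: torch.Tensor):
        # byte_ids: (seq_len,) int64 tensor of values in 0..255
        h = self.byte_emb(byte_ids)                                # (seq_len, d_model)
        p = torch.sigmoid(self.boundary_scorer(h)).squeeze(-1)    # boundary probabilities
        boundary = p > 0.5           # hard decision for illustration only;
        boundary[0] = True           # real systems keep this differentiable
        chunk_ids = torch.cumsum(boundary.long(), dim=0) - 1      # chunk index per byte
        n_chunks = int(chunk_ids.max().item()) + 1
        # Mean-pool byte states within each chunk via scatter-add.
        sums = torch.zeros(n_chunks, h.size(-1)).index_add_(0, chunk_ids, h)
        counts = torch.zeros(n_chunks).index_add_(0, chunk_ids, torch.ones_like(p))
        return sums / counts.unsqueeze(-1), p                     # pooled chunks + scores

if __name__ == "__main__":
    ids = torch.tensor(list("byte-level input".encode("utf-8")))
    chunks, probs = ToyByteChunker()(ids)
    print(f"{ids.numel()} bytes -> {chunks.size(0)} chunks")
```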
Evidence
Bolmo emphasizes transfer from strong subword models; H-Net and Synergy emphasize end-to-end learned segmentation or abstraction. All three reject fixed tokenization as a permanent foundation.
Open Questions
- Does byteification mainly solve the practical migration problem, while learned chunking addresses the long-term architecture question?
- Which approach scales best across multilingual text, code, and DNA?