Byte-Level Language Models

Summary

Byte-level language modeling removes the need for a fixed subword vocabulary, but the surveyed papers show several mutually incompatible ways to make that practical.

What The Wiki Currently Believes

  • Bolmo byteifies existing subword LMs through distillation, recovering much of their behavior with less than 1% of a typical pretraining token budget (see the distillation sketch under Evidence).
  • H-Net learns dynamic hierarchical byte chunking end to end (see the chunking sketch after this list).
  • Synergy learns abstraction routing over bytes and reports emergent token-like concepts (a routing sketch also follows).
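
As a concrete illustration of learned chunking, here is a minimal sketch in the spirit of H-Net's dynamic chunking: a boundary probability is derived from the cosine similarity of adjacent byte states, and a new chunk opens where that probability is high. The module names, the mean-pooling, and the fixed threshold are illustrative assumptions, not the published architecture.

    # Minimal dynamic-chunking sketch (H-Net-inspired; names are assumptions).
    import torch
    import torch.nn.functional as F

    class DynamicChunker(torch.nn.Module):
        def __init__(self, dim: int = 64, n_bytes: int = 256):
            super().__init__()
            self.embed = torch.nn.Embedding(n_bytes, dim)   # one vector per byte value
            self.q = torch.nn.Linear(dim, dim, bias=False)  # projections feeding the
            self.k = torch.nn.Linear(dim, dim, bias=False)  # boundary score

        def forward(self, byte_ids, threshold: float = 0.5):
            h = self.embed(byte_ids)                        # (L, dim)
            # Boundary probability is high where adjacent states disagree.
            sim = F.cosine_similarity(self.q(h[1:]), self.k(h[:-1]), dim=-1)
            p_boundary = (1.0 - sim) / 2.0                  # maps [-1, 1] -> [0, 1]
            # Position 0 always starts a chunk.
            is_boundary = torch.cat([torch.tensor([True]), p_boundary > threshold])
            chunk_ids = torch.cumsum(is_boundary.long(), dim=0) - 1
            n_chunks = int(chunk_ids[-1]) + 1
            # Mean-pool the bytes of each chunk into a single vector.
            chunks = [h[chunk_ids == c].mean(dim=0) for c in range(n_chunks)]
            return torch.stack(chunks), is_boundary

    ids = torch.tensor(list("byte-level models need no tokenizer".encode("utf-8")))
    chunks, boundaries = DynamicChunker()(ids)
    print(f"{len(ids)} bytes -> {chunks.shape[0]} chunks")

In a trained model the projections would be learned jointly with the downstream language model, so chunk boundaries come from the data rather than from a hand-written tokenizer.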
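
Synergy's routing mechanism is not detailed in this section, so the following is only a loosely analogous sketch: each byte position is softly assigned to one of K latent concept slots, and each slot summary is a routing-weighted average of the byte states. Every name and design choice here is an assumption.

    # Illustrative abstraction-routing sketch; not Synergy's actual design.
    import torch

    class ByteRouter(torch.nn.Module):
        def __init__(self, dim: int = 64, n_slots: int = 8, n_bytes: int = 256):
            super().__init__()
            self.embed = torch.nn.Embedding(n_bytes, dim)
            self.route = torch.nn.Linear(dim, n_slots)   # per-byte routing logits

        def forward(self, byte_ids):
            h = self.embed(byte_ids)                     # (L, dim)
            w = self.route(h).softmax(dim=-1)            # (L, n_slots) soft assignment
            # Slot summaries are routing-weighted averages of byte states, so
            # bytes that co-route share one "concept" vector.
            slot = (w.T @ h) / (w.sum(dim=0, keepdim=True).T + 1e-9)
            return slot, w

    ids = torch.tensor(list("routing over raw bytes".encode("utf-8")))
    slots, weights = ByteRouter()(ids)
    print(slots.shape, weights.argmax(dim=-1))           # dominant slot per byte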

Evidence

Bolmo emphasizes transfer from strong subword models; H-Net and Synergy emphasize end-to-end learned segmentation or abstraction. All three reject fixed tokenization as a permanent foundation.
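
Bolmo's exact byteification recipe is not reproduced in this wiki. As one hedged reading, the sketch below frames it as sequence-level distillation: a byte-level student is trained with ordinary cross-entropy on text the subword teacher produced, which sidesteps the teacher/student vocabulary mismatch entirely. The tiny GRU student and the canned "teacher text" are stand-in assumptions.

    # Sequence-level distillation sketch: byte student on teacher-generated text.
    import torch
    import torch.nn.functional as F

    class ByteStudent(torch.nn.Module):
        def __init__(self, dim: int = 64, n_bytes: int = 256):
            super().__init__()
            self.embed = torch.nn.Embedding(n_bytes, dim)
            self.rnn = torch.nn.GRU(dim, dim, batch_first=True)
            self.head = torch.nn.Linear(dim, n_bytes)

        def forward(self, byte_ids):                     # (B, L) -> (B, L, 256)
            h, _ = self.rnn(self.embed(byte_ids))
            return self.head(h)

    # Stand-in for text sampled from the strong subword teacher.
    teacher_text = "the teacher model wrote this sentence".encode("utf-8")
    ids = torch.tensor(list(teacher_text)).unsqueeze(0)  # (1, L)

    student = ByteStudent()
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for step in range(100):
        logits = student(ids[:, :-1])                    # predict the next byte
        loss = F.cross_entropy(logits.reshape(-1, 256), ids[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"final next-byte loss: {loss.item():.3f}")

Token-level distillation (matching the teacher's full output distribution) would be closer in spirit, but it requires mapping subword probabilities onto byte sequences, which is exactly the nontrivial part such papers have to solve.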

Open Questions

  • Does byteification mainly solve the practical migration problem, while learned chunking addresses the long-term architecture?
  • Which approach scales best across multilingual text, code, and DNA?