Bolmo: Byteifying The Next Generation Of Language Models
Source
- Raw Markdown: paper_bolmo-2025.md
- PDF: paper_bolmo-2025.pdf
Core Claim
Bolmo shows that competitive byte-level language models can be obtained by byteifying existing subword LMs through a purpose-built architecture and an exact distillation objective.
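The paper's exact distillation objective is not reproduced in this note. One plausible reading is that a subword teacher's next-token distribution can be converted into a next-byte distribution by summing the mass of every vocabulary token whose byte encoding is consistent with the bytes already emitted inside the current token. The sketch below illustrates that marginalization under this assumption; all names (`next_byte_distribution`, `vocab_bytes`, and so on) are hypothetical and not from the paper.

```python
from collections import defaultdict

def next_byte_distribution(token_probs, vocab_bytes, prefix):
    """Map a subword teacher's next-token distribution to a next-byte distribution.

    token_probs: dict token_id -> probability of that token being the next token.
    vocab_bytes: dict token_id -> the token's UTF-8 byte string.
    prefix:      bytes already emitted inside the current (partially generated) token.

    A token is consistent with `prefix` if its byte string starts with `prefix` and
    is strictly longer; its mass goes to the byte that immediately follows the prefix.
    A token exactly equal to `prefix` ends the current token, which a full
    implementation would handle by re-querying the teacher for the next token.
    """
    byte_probs = defaultdict(float)
    total = 0.0
    for tok_id, p in token_probs.items():
        tok = vocab_bytes[tok_id]
        if tok.startswith(prefix) and len(tok) > len(prefix):
            byte_probs[tok[len(prefix)]] += p
            total += p
    # Renormalize over the consistent tokens so the result is a valid distribution.
    return {b: p / total for b, p in byte_probs.items()} if total > 0 else {}

# Toy example: the teacher puts mass on "cat" and "car"; after emitting b"ca",
# the induced next-byte distribution splits between b"t" (116) and b"r" (114).
vocab = {0: b"cat", 1: b"car", 2: b"dog"}
probs = {0: 0.5, 1: 0.3, 2: 0.2}
print(next_byte_distribution(probs, vocab, b"ca"))  # {116: 0.625, 114: 0.375}
```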
Key Contributions
- Introduces a fully open byte-level LM family at 1B and 7B scales.
- Treats byteification as tokenizer transfer from a source subword LM.
- Claims conversion can use less than 1% of a typical pretraining token budget.
- Shows gains in character understanding and some coding settings while approaching source-LM performance elsewhere.
Method Notes
Bolmo is the main source for Tokenizer Transfer. It complements H-Net and Synergy, which emphasize end-to-end learned chunking rather than distillation from subword models.
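To make the distillation framing concrete, a byte-level student could be trained by minimizing soft-target cross-entropy against teacher-derived next-byte probabilities such as those sketched under Core Claim. This is an illustrative PyTorch-style step, not the paper's recipe; `student`, `byte_ids`, and `byte_targets` are assumed interfaces.

```python
import torch
import torch.nn.functional as F

def byte_distillation_step(student, optimizer, byte_ids, byte_targets):
    """One distillation step for a byte-level student model.

    byte_ids:     LongTensor [batch, seq] of input byte values (0..255).
    byte_targets: FloatTensor [batch, seq, 256] of teacher-derived next-byte
                  probabilities (rows sum to 1), e.g. from a marginalization
                  like the sketch above.
    """
    logits = student(byte_ids)                      # [batch, seq, 256]
    log_probs = F.log_softmax(logits, dim=-1)
    # Soft-target cross-entropy: -sum_b q(b) * log p_student(b), averaged over positions.
    loss = -(byte_targets * log_probs).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```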
Evidence And Results
The paper compares Bolmo against byte-level baselines and source subword models, with attention to character understanding, coding, general tasks, inference speed, and post-training transfer.
Limitations
Bolmo’s strength depends on the availability of strong source subword LMs and on the byteification recipe itself. It does not, by itself, settle whether future models should be trained end to end from raw bytes.
Links Into The Wiki
- Tokenizer Transfer
- H-Net
- Synergy
Open Questions
- Can byteification combine with dynamic learned chunking?
- Which capabilities remain bottlenecked by imperfect boundary prediction?