Bolmo: Byteifying The Next Generation Of Language Models
Source
- Raw Markdown: paper_bolmo-2025.md
- PDF: paper_bolmo-2025.pdf
Core Claim
Bolmo shows that competitive byte-level language models can be obtained by byteifying existing subword LMs through a purpose-built architecture and an exact distillation objective.
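The paper's exact distillation objective is not reproduced in this note. One plausible reading is that a subword teacher's next-token distribution can be converted into a next-byte distribution by summing the mass of every vocabulary token whose byte encoding is consistent with the bytes already emitted inside the current token. The sketch below illustrates that marginalization under this assumption; all names (`next_byte_distribution`, `vocab_bytes`, and so on) are hypothetical and not from the paper.

```python
from collections import defaultdict

def next_byte_distribution(token_probs, vocab_bytes, prefix):
    """Map a subword teacher's next-token distribution to a next-byte distribution.

    token_probs: dict token_id -> probability of that token being the next token.
    vocab_bytes: dict token_id -> the token's UTF-8 byte string.
    prefix:      bytes already emitted inside the current (partially generated) token.

    A token is consistent with `prefix` if its byte string starts with `prefix` and
    is strictly longer; its mass goes to the byte that immediately follows the prefix.
    A token exactly equal to `prefix` ends the current token, which a full
    implementation would handle by re-querying the teacher for the next token.
    """
    byte_probs = defaultdict(float)
    total = 0.0
    for tok_id, p in token_probs.items():
        tok = vocab_bytes[tok_id]
        if tok.startswith(prefix) and len(tok) > len(prefix):
            byte_probs[tok[len(prefix)]] += p
            total += p
    # Renormalize over the consistent tokens so the result is a valid distribution.
    return {b: p / total for b, p in byte_probs.items()} if total > 0 else {}

# Toy example: the teacher puts mass on "cat" and "car"; after emitting b"ca",
# the induced next-byte distribution splits between b"t" (116) and b"r" (114).
vocab = {0: b"cat", 1: b"car", 2: b"dog"}
probs = {0: 0.5, 1: 0.3, 2: 0.2}
print(next_byte_distribution(probs, vocab, b"ca"))  # {116: 0.625, 114: 0.375}
```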
Key Contributions
- Introduces a fully open byte-level LM family at 1B and 7B scales.
- Treats byteification as tokenizer transfer from a source subword LM.
- Claims conversion can use less than 1% of a typical pretraining token budget.
- Shows gains in character understanding and some coding settings while approaching source-LM performance elsewhere.
Method Notes
Bolmo is the main source for Tokenizer Transfer. It complements H-Net and Synergy, which emphasize end-to-end learned chunking rather than distillation from subword models.
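To make the distillation framing concrete, a byte-level student could be trained by minimizing soft-target cross-entropy against teacher-derived next-byte probabilities such as those sketched under Core Claim. This is an illustrative PyTorch-style step, not the paper's recipe; `student`, `byte_ids`, and `byte_targets` are assumed interfaces.

```python
import torch
import torch.nn.functional as F

def byte_distillation_step(student, optimizer, byte_ids, byte_targets):
    """One distillation step for a byte-level student model.

    byte_ids:     LongTensor [batch, seq] of input byte values (0..255).
    byte_targets: FloatTensor [batch, seq, 256] of teacher-derived next-byte
                  probabilities (rows sum to 1), e.g. from a marginalization
                  like the sketch above.
    """
    logits = student(byte_ids)                      # [batch, seq, 256]
    log_probs = F.log_softmax(logits, dim=-1)
    # Soft-target cross-entropy: -sum_b q(b) * log p_student(b), averaged over positions.
    loss = -(byte_targets * log_probs).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```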
Evidence And Results
The paper compares Bolmo against byte-level baselines and source subword models, with attention to character understanding, coding, general tasks, inference speed, and post-training transfer.
Limitations
Bolmo’s strength depends on the availability of strong source subword LMs and on the byteification recipe itself. It does not, by itself, settle whether future models should be trained end to end from raw bytes.
Links Into The Wiki
- Tokenizer Transfer
- H-Net
- Synergy
Open Questions
- Can byteification combine with dynamic learned chunking?
- Which capabilities remain bottlenecked by imperfect boundary prediction?