Bolmo: Byteifying The Next Generation Of Language Models

Source

Core Claim

Bolmo shows that competitive byte-level language models can be obtained by byteifying existing subword LMs through a purpose-built architecture and an exact distillation objective.
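
One way to read the "exact distillation" idea (a sketch consistent with the claim above; the notation and loss form are illustrative assumptions, not the paper's stated formulation): a subword token $t$ from the source model decomposes into UTF-8 bytes $b_1, \dots, b_k$, and the byte-level student's chained probabilities factorize the token probability exactly, so the student can in principle reproduce the source distribution at token boundaries rather than only approximate it:

$$
p_\theta^{\mathrm{byte}}(t \mid c) = \prod_{i=1}^{k} p_\theta\bigl(b_i \mid c,\, b_{<i}\bigr),
\qquad
\mathcal{L}(\theta) = \mathbb{E}_c\Bigl[\mathrm{KL}\bigl(p_{\mathrm{src}}(\cdot \mid c)\,\big\|\,p_\theta^{\mathrm{byte}}(\cdot \mid c)\bigr)\Bigr],
$$

where $p_{\mathrm{src}}$ is the source subword LM's next-token distribution and $c$ is the context.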

Key Contributions

  • Introduces a fully open byte-level LM family at 1B and 7B scales.
  • Treats byteification as tokenizer transfer from a source subword LM.
  • Claims conversion can use less than 1% of a typical pretraining token budget (rough scale worked out after this list).
  • Shows gains in character understanding and some coding settings while approaching source-LM performance elsewhere.
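
Rough scale for the sub-1% figure (the 4-trillion-token baseline below is an illustrative assumption about a typical modern pretraining budget, not a number from the paper):

$$
0.01 \times 4\times 10^{12}\ \text{tokens} = 4\times 10^{10}\ \text{tokens} \approx 40\text{B tokens},
$$

so the claimed conversion cost would sit roughly two orders of magnitude below a from-scratch pretraining run.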

Method Notes

Bolmo is the main source for Tokenizer Transfer. It complements H-Net and Synergy, which emphasize end-to-end learned chunking rather than distillation from subword models.
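
As a deliberately simplified illustration of the tokenizer-transfer view (the function names and toy token list are hypothetical, not Bolmo's code), the sketch below expands subword tokens into UTF-8 byte IDs and records the token boundaries at which a byte-level student's chained probabilities could be aligned with the source subword LM's per-token probabilities during distillation:

```python
# Sketch: flatten a subword-tokenized sequence into byte-level inputs while
# keeping the token boundaries needed to line up byte-level and subword-level
# probabilities. Illustrative only; not Bolmo's actual implementation.

def subword_to_bytes(token: str) -> list[int]:
    """Expand one subword token into its UTF-8 byte IDs (0-255)."""
    return list(token.encode("utf-8"))

def byteify_sequence(subword_tokens: list[str]) -> tuple[list[int], list[int]]:
    """Return the flattened byte IDs plus the index just past each token's bytes."""
    byte_ids: list[int] = []
    boundaries: list[int] = []
    for tok in subword_tokens:
        byte_ids.extend(subword_to_bytes(tok))
        boundaries.append(len(byte_ids))  # distillation target aligns here
    return byte_ids, boundaries

if __name__ == "__main__":
    tokens = ["Byte", "ification", " as", " tokenizer", " transfer"]
    byte_ids, boundaries = byteify_sequence(tokens)
    print(f"{len(byte_ids)} bytes, token boundaries at {boundaries}")
```

Chaining the student's byte probabilities up to each boundary yields a per-token probability that can be compared directly with the source model's, which is the alignment a distillation-based transfer relies on.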

Evidence And Results

The paper compares Bolmo against byte-level baselines and source subword models, with attention to character understanding, coding, general tasks, inference speed, and post-training transfer.

Limitations

Bolmo’s strength depends on the availability of strong source subword LMs and on the quality of the byteification recipe. It does not by itself settle whether future models should be trained from raw bytes end to end.

Open Questions

  • Can byteification combine with dynamic learned chunking?
  • Which capabilities remain bottlenecked by imperfect boundary prediction?