Byte-Level Language Models

Summary

Byte-level language modeling removes the need for a fixed subword vocabulary, but the surveyed papers show several mutually incompatible ways to make that practical.

What The Wiki Currently Believes

  • Bolmo byteifies existing subword LMs through distillation, recovering much of their behavior with less than 1% of a typical pretraining token budget (see the distillation sketch under Evidence).
  • H-Net learns dynamic hierarchical byte chunking end to end (see the chunking sketch after this list).
  • Synergy learns abstraction routing over bytes and reports emergent token-like concepts (a routing sketch also follows).
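
As a concrete illustration of learned chunking, here is a minimal sketch in the spirit of H-Net's dynamic chunking: a boundary probability is derived from the cosine similarity of adjacent byte states, and a new chunk opens where that probability is high. The module names, the mean-pooling, and the fixed threshold are illustrative assumptions, not the published architecture.

    # Minimal dynamic-chunking sketch (H-Net-inspired; names are assumptions).
    import torch
    import torch.nn.functional as F

    class DynamicChunker(torch.nn.Module):
        def __init__(self, dim: int = 64, n_bytes: int = 256):
            super().__init__()
            self.embed = torch.nn.Embedding(n_bytes, dim)   # one vector per byte value
            self.q = torch.nn.Linear(dim, dim, bias=False)  # projections feeding the
            self.k = torch.nn.Linear(dim, dim, bias=False)  # boundary score

        def forward(self, byte_ids, threshold: float = 0.5):
            h = self.embed(byte_ids)                        # (L, dim)
            # Boundary probability is high where adjacent states disagree.
            sim = F.cosine_similarity(self.q(h[1:]), self.k(h[:-1]), dim=-1)
            p_boundary = (1.0 - sim) / 2.0                  # maps [-1, 1] -> [0, 1]
            # Position 0 always starts a chunk.
            is_boundary = torch.cat([torch.tensor([True]), p_boundary > threshold])
            chunk_ids = torch.cumsum(is_boundary.long(), dim=0) - 1
            n_chunks = int(chunk_ids[-1]) + 1
            # Mean-pool the bytes of each chunk into a single vector.
            chunks = [h[chunk_ids == c].mean(dim=0) for c in range(n_chunks)]
            return torch.stack(chunks), is_boundary

    ids = torch.tensor(list("byte-level models need no tokenizer".encode("utf-8")))
    chunks, boundaries = DynamicChunker()(ids)
    print(f"{len(ids)} bytes -> {chunks.shape[0]} chunks")

In a trained model the projections would be learned jointly with the downstream language model, so chunk boundaries come from the data rather than from a hand-written tokenizer.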
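
Synergy's routing mechanism is not detailed in this section, so the following is only a loosely analogous sketch: each byte position is softly assigned to one of K latent concept slots, and each slot summary is a routing-weighted average of the byte states. Every name and design choice here is an assumption.

    # Illustrative abstraction-routing sketch; not Synergy's actual design.
    import torch

    class ByteRouter(torch.nn.Module):
        def __init__(self, dim: int = 64, n_slots: int = 8, n_bytes: int = 256):
            super().__init__()
            self.embed = torch.nn.Embedding(n_bytes, dim)
            self.route = torch.nn.Linear(dim, n_slots)   # per-byte routing logits

        def forward(self, byte_ids):
            h = self.embed(byte_ids)                     # (L, dim)
            w = self.route(h).softmax(dim=-1)            # (L, n_slots) soft assignment
            # Slot summaries are routing-weighted averages of byte states, so
            # bytes that co-route share one "concept" vector.
            slot = (w.T @ h) / (w.sum(dim=0, keepdim=True).T + 1e-9)
            return slot, w

    ids = torch.tensor(list("routing over raw bytes".encode("utf-8")))
    slots, weights = ByteRouter()(ids)
    print(slots.shape, weights.argmax(dim=-1))           # dominant slot per byte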

Evidence

Bolmo emphasizes transfer from strong subword models; H-Net and Synergy emphasize end-to-end learned segmentation or abstraction. All three reject fixed tokenization as a permanent foundation.
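
Bolmo's exact byteification recipe is not reproduced in this wiki. As one hedged reading, the sketch below frames it as sequence-level distillation: a byte-level student is trained with ordinary cross-entropy on text the subword teacher produced, which sidesteps the teacher/student vocabulary mismatch entirely. The tiny GRU student and the canned "teacher text" are stand-in assumptions.

    # Sequence-level distillation sketch: byte student on teacher-generated text.
    import torch
    import torch.nn.functional as F

    class ByteStudent(torch.nn.Module):
        def __init__(self, dim: int = 64, n_bytes: int = 256):
            super().__init__()
            self.embed = torch.nn.Embedding(n_bytes, dim)
            self.rnn = torch.nn.GRU(dim, dim, batch_first=True)
            self.head = torch.nn.Linear(dim, n_bytes)

        def forward(self, byte_ids):                     # (B, L) -> (B, L, 256)
            h, _ = self.rnn(self.embed(byte_ids))
            return self.head(h)

    # Stand-in for text sampled from the strong subword teacher.
    teacher_text = "the teacher model wrote this sentence".encode("utf-8")
    ids = torch.tensor(list(teacher_text)).unsqueeze(0)  # (1, L)

    student = ByteStudent()
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for step in range(100):
        logits = student(ids[:, :-1])                    # predict the next byte
        loss = F.cross_entropy(logits.reshape(-1, 256), ids[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"final next-byte loss: {loss.item():.3f}")

Token-level distillation (matching the teacher's full output distribution) would be closer in spirit, but it requires mapping subword probabilities onto byte sequences, which is exactly the nontrivial part such papers have to solve.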

Open Questions

  • Does byteification mainly solve the practical migration problem, while learned chunking addresses the long-term architecture?
  • Which approach scales best across multilingual text, code, and DNA?