Dynamic Chunking For End-To-End Hierarchical Sequence Modeling

Source

Core Claim

H-Net replaces the tokenizer-LM-detokenizer pipeline with an end-to-end hierarchical network that learns dynamic content- and context-dependent byte chunking.

Key Contributions

  • Introduces dynamic chunking learned jointly with the rest of the model.
  • Builds an explicit hierarchical network over byte-level inputs.
  • Shows a one-stage hierarchy can outperform a compute- and data-matched BPE Transformer.
  • Reports improved scaling with multiple hierarchy stages and strong gains in domains where tokenization heuristics are weak.

Method Notes

H-Net learns where to place chunk boundaries directly from byte-level inputs: an encoder produces byte representations, a routing step predicts boundaries between adjacent positions, boundary positions are downsampled into chunks for the main network, and the output is upsampled back to byte resolution, with the whole pipeline trained end to end. H-Net is a central reference for latent tokenization and byte-level language models.
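As a minimal sketch of the content-dependent boundary idea: place a chunk boundary wherever adjacent byte representations are dissimilar, then keep only boundary positions for the coarser level. The cosine-similarity scoring, fixed threshold, and function name below are illustrative simplifications, not the paper's exact routing module.

```python
import numpy as np

def dynamic_chunk(hidden: np.ndarray, threshold: float = 0.5):
    """Hypothetical sketch of content-dependent chunking.

    hidden: (seq_len, dim) byte-level representations.
    Returns the downsampled chunk-start representations and the
    boolean boundary mask.
    """
    # Cosine similarity between each position and its predecessor.
    a, b = hidden[:-1], hidden[1:]
    cos = np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    )
    # Dissimilar neighbors -> high boundary probability.
    p = (1.0 - cos) / 2.0
    # Position 0 always starts a chunk; elsewhere, threshold the score.
    boundaries = np.concatenate([[True], p > threshold])
    # Downsample: the main network sees only chunk-start positions.
    return hidden[boundaries], boundaries

rng = np.random.default_rng(0)
h = rng.standard_normal((16, 8)).astype(np.float32)
chunks, mask = dynamic_chunk(h)
print(f"{chunks.shape[0]} chunks from {h.shape[0]} bytes")
```

In the actual model the boundary decision is learned jointly with the rest of the network (with a smoothing mechanism keeping it differentiable), rather than fixed by a hand-set threshold as here.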

Evidence And Results

The abstract reports improved byte-level language modeling, increased robustness to character-level perturbations, learned chunk boundaries that align with meaningful units, and nearly 4x better data efficiency on DNA sequence modeling relative to baselines.

Limitations

H-Net emphasizes end-to-end chunking trained from scratch, and so does not directly reuse pretrained subword LMs, while Bolmo emphasizes practical transfer from existing subword models. The two address different deployment problems.

Open Questions

  • How should dynamic chunking scale to multimodal or action-conditioned data?
  • Can H-Net-like hierarchy inherit capabilities from pretrained subword systems?