Latent Tokenization

Summary

Latent tokenization is the broader pattern of replacing a fixed, externally defined tokenizer with chunks, concepts, or abstraction levels that the model learns itself.

What The Wiki Currently Believes

  • H-Net learns content- and context-dependent chunking inside a hierarchical network (see the chunking sketch after this list).
  • Synergy learns a routing mechanism that bridges byte-level and higher-level abstraction.
  • ConceptMoE merges semantically similar token sequences into concept representations before they reach the more expensive concept-level model.
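
None of these papers exposes the same interface, but the shared mechanism is easy to sketch. The toy module below (hypothetical names such as LearnedChunker and boundary_scorer; it is not the H-Net, Synergy, or ConceptMoE implementation) scores every byte position, opens a new chunk wherever the score crosses a threshold, and mean-pools each span into a single chunk vector for a downstream backbone. The hard threshold itself is non-differentiable; the boundary probabilities are returned so a training signal can flow through them.

    import torch
    import torch.nn as nn

    class LearnedChunker(nn.Module):
        """Toy content-dependent chunker: score each byte position, open a
        chunk where the score exceeds a threshold, and mean-pool each span
        into one chunk vector. Illustrative only, not any paper's method."""

        def __init__(self, d_model: int, threshold: float = 0.5):
            super().__init__()
            self.boundary_scorer = nn.Linear(d_model, 1)  # per-position boundary logit
            self.threshold = threshold

        def forward(self, byte_embeddings: torch.Tensor):
            # byte_embeddings: (seq_len, d_model) for a single sequence
            probs = torch.sigmoid(self.boundary_scorer(byte_embeddings)).squeeze(-1)
            is_boundary = probs > self.threshold
            is_boundary[0] = True  # the first position always opens a chunk
            # chunk_ids[i] = index of the chunk that position i belongs to
            chunk_ids = torch.cumsum(is_boundary.long(), dim=0) - 1
            num_chunks = int(chunk_ids.max().item()) + 1
            # Mean-pool byte embeddings into their chunks.
            chunks = torch.zeros(num_chunks, byte_embeddings.size(-1))
            counts = torch.zeros(num_chunks, 1)
            chunks.index_add_(0, chunk_ids, byte_embeddings)
            counts.index_add_(0, chunk_ids, torch.ones(byte_embeddings.size(0), 1))
            return chunks / counts, probs  # probs kept as a differentiable signal

    # Usage: 64 byte positions with 32-dim embeddings collapse into fewer chunks.
    x = torch.randn(64, 32)
    chunks, boundary_probs = LearnedChunker(32)(x)
    print(chunks.shape)  # often around (32, 32) with random weights and threshold 0.5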

Evidence

These papers suggest that segmentation is no longer just a preprocessing step: it can become a differentiable compute-allocation and abstraction problem solved inside the model.
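
One concrete reading of "compute allocation": the expected number of chunks is just the sum of the boundary probabilities, so the model can be regularized toward a compute budget with an ordinary gradient. A minimal sketch, assuming boundary probabilities like those returned by the chunker above (the function name compute_allocation_loss and the target ratio are made up for illustration):

    import torch

    def compute_allocation_loss(boundary_probs: torch.Tensor,
                                target_ratio: float = 0.25) -> torch.Tensor:
        """Penalize deviating from a target compression ratio. The expected
        chunk count is differentiable in the boundary probabilities, which is
        what lets segmentation double as compute allocation."""
        expected_chunks = boundary_probs.sum()
        ratio = expected_chunks / boundary_probs.numel()  # expected chunks per position
        return (ratio - target_ratio) ** 2

    # With ~25% of positions marked as boundaries, the expensive backbone sees
    # only ~1/4 as many positions as the byte-level encoder.
    print(compute_allocation_loss(torch.rand(64)))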

Open Questions

  • Should learned tokenization be byte-native, token-compressive, or concept-level?
  • How should learned chunking interact with attention, the KV cache, and MoE routing? (See the cache sketch below.)
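
The KV-cache part of that question has a concrete flavor: if the backbone attends over chunks instead of bytes, its cache shrinks in proportion to the compression ratio. A back-of-envelope sketch with made-up model dimensions (the helper kv_cache_bytes and the 4x compression are illustrative, not taken from any of the papers):

    def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                       head_dim: int, bytes_per_elem: int = 2) -> int:
        """Standard KV-cache footprint: K and V tensors per layer, each of
        shape (seq_len, n_kv_heads, head_dim), stored at bytes_per_elem."""
        return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

    # Byte-level vs chunk-level cache for a hypothetical 32-layer backbone,
    # assuming the chunker compresses 8192 bytes into 2048 chunks (4x).
    byte_level  = kv_cache_bytes(seq_len=8192, n_layers=32, n_kv_heads=8, head_dim=128)
    chunk_level = kv_cache_bytes(seq_len=2048, n_layers=32, n_kv_heads=8, head_dim=128)
    print(byte_level // 2**20, "MiB vs", chunk_level // 2**20, "MiB")  # 1024 MiB vs 256 MiB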