Latent Tokenization
Summary
Latent tokenization is the broader pattern of replacing a fixed, predefined token vocabulary with learned chunks, concepts, or levels of abstraction.
What The Wiki Currently Believes
- H-Net learns content- and context-dependent chunking inside a hierarchical network (a boundary-prediction sketch follows this list).
- Synergy learns a routing mechanism that bridges byte-level processing and higher-level abstraction.
- ConceptMoE merges semantically similar token sequences into concept representations before passing them to the expensive concept model.
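A minimal sketch of what content-dependent chunking can look like, loosely in the spirit of H-Net's dynamic chunking: learned projections score how similar adjacent byte-level states are, and low similarity is read as a chunk boundary. The module name, projections, and the similarity-to-probability mapping are illustrative assumptions, not any paper's exact mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryPredictor(nn.Module):
    """Scores a chunk boundary between each pair of adjacent byte-level states."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_model) byte-level hidden states
        q = F.normalize(self.q(h[:, 1:]), dim=-1)   # current position
        k = F.normalize(self.k(h[:, :-1]), dim=-1)  # previous position
        sim = (q * k).sum(-1)                       # cosine similarity of neighbours
        p_boundary = 0.5 * (1.0 - sim)              # dissimilar neighbours -> likely boundary
        first = torch.ones_like(p_boundary[:, :1])  # position 0 always starts a chunk
        return torch.cat([first, p_boundary], dim=1)  # (batch, seq_len)
```

Positions whose boundary probability crosses a threshold start a new chunk, so chunk lengths adapt to the content rather than to a fixed vocabulary.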
Evidence
These papers suggest that segmentation is no longer just preprocessing: it can become a differentiable compute-allocation and abstraction problem inside the model.
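To make "differentiable compute allocation" concrete: if hard boundary decisions are trained with a straight-through estimator, the downstream loss can push boundary probabilities toward segmentations that spend higher-level compute where it helps. The straight-through trick is a common choice assumed here for illustration, not a claim about any of the papers above.

```python
import torch

def straight_through_boundaries(p_boundary: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Hard 0/1 boundaries in the forward pass, identity gradient in the backward pass.

    p_boundary: (batch, seq_len) probabilities from a learned boundary predictor.
    More 1s means more chunks, i.e. more positions processed at the higher level,
    so the boundary predictor effectively learns where to allocate compute.
    """
    hard = (p_boundary > threshold).float()
    return hard + p_boundary - p_boundary.detach()  # straight-through estimator
```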
Open Questions
- Should learned tokenization be byte-native, token-compressive, or concept-level?
- How should learned chunking interact with attention, the KV cache, and MoE routing? (See the pooling sketch after this list for the KV-cache side.)
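On the attention and KV-cache question, the basic trade-off can be stated in code: pooling byte states into chunk states before attention shrinks the sequence, and therefore the KV cache, by the average chunk length, at the cost of having to recover byte-level detail from chunk vectors later. A hypothetical mean-pooling sketch for a single sequence; the pooling rule is an assumption, not any paper's mechanism.

```python
import torch

def pool_chunks(h: torch.Tensor, boundaries: torch.Tensor) -> torch.Tensor:
    """Mean-pool byte states into chunk states for one sequence.

    h:          (seq_len, d_model) byte-level hidden states
    boundaries: (seq_len,) 0/1 mask, 1 where a new chunk starts (position 0 must be 1)
    Returns     (num_chunks, d_model); attention and the KV cache now scale with
                num_chunks instead of seq_len.
    """
    chunk_id = boundaries.long().cumsum(0) - 1                 # chunk index per byte
    num_chunks = int(chunk_id[-1].item()) + 1
    sums = torch.zeros(num_chunks, h.size(-1), dtype=h.dtype, device=h.device)
    sums.index_add_(0, chunk_id, h)                             # sum byte states per chunk
    counts = torch.bincount(chunk_id, minlength=num_chunks).to(h.dtype).unsqueeze(-1)
    return sums / counts
```

How such pooled representations should then interact with MoE routing is left open here, as in the question above.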