ConceptMoE: Adaptive Token-To-Concept Compression For Implicit Compute Allocation
Source
- Raw Markdown: paper_conceptmoe-2026.md
- PDF: paper_conceptmoe-2026.pdf
Core Claim
ConceptMoE improves efficiency and effectiveness by merging semantically similar token sequences into concept representations before expensive MoE computation.
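A minimal sketch of the idea (not the paper's implementation): adjacent token embeddings whose cosine similarity exceeds a threshold are greedily merged and mean-pooled into one concept vector. The function name and threshold value here are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def chunk_tokens_to_concepts(x: torch.Tensor, threshold: float = 0.8) -> torch.Tensor:
    """x: (seq_len, d_model) token embeddings -> (n_concepts, d_model) concept vectors."""
    sims = F.cosine_similarity(x[:-1], x[1:], dim=-1)  # similarity of each adjacent token pair
    chunks, current = [], [x[0]]
    for t in range(1, x.size(0)):
        if sims[t - 1] >= threshold:        # similar enough: extend the current chunk
            current.append(x[t])
        else:                               # dissimilar: close the chunk at a concept boundary
            chunks.append(torch.stack(current).mean(dim=0))
            current = [x[t]]
    chunks.append(torch.stack(current).mean(dim=0))
    return torch.stack(chunks)

concepts = chunk_tokens_to_concepts(torch.randn(128, 512))
print(concepts.shape)  # (n_concepts, 512), with n_concepts <= 128
```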
Key Contributions
- Introduces learnable token-to-concept chunking based on semantic similarity.
- Compares architectures against MoE baselines under matched total parameters and activated FLOPs.
- Reports improvements on language pretraining, long-context understanding, multimodal benchmarks, and continual training conversion.
- Reduces attention computation and KV cache requirements at higher compression ratios.
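A back-of-the-envelope illustration of the last point, using assumed numbers rather than the paper's: with compression ratio r, attention cost falls roughly by r² and the KV cache roughly by r.

```python
# Assumed sequence length, model width, and compression ratio for illustration.
L, d, r = 8192, 1024, 4
attn_flops_tokens   = 2 * L * L * d          # ~O(L^2 * d) for score and value mixing
attn_flops_concepts = 2 * (L // r) ** 2 * d  # same cost computed at the concept level
kv_cache_tokens   = 2 * L * d                # keys + values per layer, in elements
kv_cache_concepts = 2 * (L // r) * d
print(attn_flops_tokens / attn_flops_concepts)  # ~r^2 = 16x fewer attention FLOPs
print(kv_cache_tokens / kv_cache_concepts)      # ~r   = 4x smaller KV cache
```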
Method Notes
ConceptMoE connects Latent Tokenization with Mixture Of Experts: compression is not only a preprocessing step but an implicit compute-allocation mechanism, since spans merged into fewer concepts pass through the expensive expert layers fewer times and therefore receive less compute.
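A rough sketch of that coupling, assuming a toy top-1 MoE layer (names and structure are illustrative, not the paper's architecture): each concept is routed through the experts once, so a token span compressed into fewer concepts implicitly receives less expert compute.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy top-1 MoE layer; expert compute scales with the number of positions it sees."""
    def __init__(self, d: int, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList([nn.Linear(d, d) for _ in range(n_experts)])

    def forward(self, concepts: torch.Tensor) -> torch.Tensor:
        # concepts: (n_concepts, d). Expert FLOPs are proportional to n_concepts,
        # so heavier compression of a span means less compute spent on that span.
        gates = self.router(concepts).softmax(dim=-1)
        top_gate, top_expert = gates.max(dim=-1)
        out = torch.zeros_like(concepts)
        for e, expert in enumerate(self.experts):
            mask = top_expert == e
            if mask.any():
                out[mask] = top_gate[mask].unsqueeze(-1) * expert(concepts[mask])
        return out

# Usage with the chunking sketch above: fewer concepts -> fewer MoE positions.
moe = TinyMoE(d=512)
out = moe(torch.randn(32, 512))  # 32 concepts standing in for a longer token span
print(out.shape)                 # (32, 512)
```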
Evidence And Results
Under controlled settings, the abstract reports gains of +0.9 points on language pretraining, +2.3 on long-context understanding, +0.6 on multimodal benchmarks, and +5.5 during continual training conversion.
Limitations
The method does not remove tokenization entirely; it compresses already-tokenized streams into concepts. It should therefore be compared with byte-native methods such as H-Net and Synergy.
Links Into The Wiki
Open Questions
- How stable are learned concept boundaries across domains?
- Can concept compression be combined with byte-level or pixel-level inputs?