ConceptMoE: Adaptive Token-To-Concept Compression For Implicit Compute Allocation

Source

Core Claim

ConceptMoE improves efficiency and effectiveness by merging semantically similar token sequences into concept representations before expensive MoE computation.

Key Contributions

  • Introduces learnable token-to-concept chunking based on semantic similarity (see the sketch after this list).
  • Uses MoE models as the testbed, comparing architectures under matched total parameters and activated FLOPs.
  • Reports improvements on language pretraining, long-context understanding, multimodal benchmarks, and continual training conversion.
  • Reduces attention computation and KV cache requirements at higher compression ratios.
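
The chunking module is described as learnable in the source; as a rough, non-authoritative sketch of the idea, the snippet below greedily merges adjacent token states whose cosine similarity exceeds a fixed threshold and mean-pools each chunk into one concept vector. The function name `chunk_tokens_to_concepts`, the fixed threshold, and mean pooling are illustrative assumptions, not the paper's mechanism.

```python
import numpy as np

def chunk_tokens_to_concepts(hidden, sim_threshold=0.8):
    """Greedily merge adjacent token states whose cosine similarity exceeds
    a threshold, mean-pooling each chunk into a single concept vector.

    hidden: (seq_len, d_model) array of token hidden states.
    Returns (num_concepts, d_model) concept vectors and the chunk sizes.
    """
    normed = hidden / np.linalg.norm(hidden, axis=-1, keepdims=True)
    chunks, current = [], [0]
    for t in range(1, hidden.shape[0]):
        # Similarity between this token and the previous one decides the boundary.
        if float(normed[t] @ normed[t - 1]) >= sim_threshold:
            current.append(t)
        else:
            chunks.append(current)
            current = [t]
    chunks.append(current)
    concepts = np.stack([hidden[idx].mean(axis=0) for idx in chunks])
    sizes = [len(idx) for idx in chunks]
    return concepts, sizes

# Toy usage: 10 token states of dimension 16.
rng = np.random.default_rng(0)
h = rng.normal(size=(10, 16))
concepts, sizes = chunk_tokens_to_concepts(h, sim_threshold=0.2)
print(len(sizes), "concepts from", h.shape[0], "tokens; chunk sizes:", sizes)
```

A learned variant would replace the fixed threshold with a predicted boundary probability trained end-to-end with the MoE backbone, which is what "learnable chunking" suggests.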

Method Notes

ConceptMoE connects Latent Tokenization with Mixture Of Experts: compression is not only a preprocessing step but also an implicit compute-allocation mechanism, since spans that compress into fewer concepts occupy fewer positions in the expensive MoE layers, leaving relatively more expert compute for information-dense spans.
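
To make the compute-allocation reading concrete, here is a back-of-the-envelope cost model (an assumption for illustration, not the source's accounting): at compression ratio r, attention cost shrinks roughly by r² and the KV cache by r, while per-position expert FLOPs stay fixed, so spans that resist compression implicitly receive more MoE compute. The sequence length, model width, layer count, and FLOP formulas below are illustrative choices.

```python
def compression_savings(seq_len, compression_ratio, d_model=4096, n_layers=32):
    """Illustrative effect of concept compression on attention cost and
    KV-cache size (simplified formulas, not the paper's exact accounting)."""
    concepts = seq_len / compression_ratio
    # Self-attention score/context FLOPs scale with the square of sequence length.
    attn_flops_tokens = n_layers * 2 * seq_len ** 2 * d_model
    attn_flops_concepts = n_layers * 2 * concepts ** 2 * d_model
    # KV cache scales linearly with the number of cached positions.
    kv_tokens = n_layers * 2 * seq_len * d_model
    kv_concepts = n_layers * 2 * concepts * d_model
    return attn_flops_tokens / attn_flops_concepts, kv_tokens / kv_concepts

attn_x, kv_x = compression_savings(seq_len=32_768, compression_ratio=2.0)
print(f"attention FLOPs reduced ~{attn_x:.0f}x, KV cache reduced ~{kv_x:.0f}x")
# Since MoE expert FLOPs are spent per position, spans that merge fewer tokens
# per concept implicitly receive a larger share of expert compute.
```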

Evidence And Results

The abstract reports +0.9 language pretraining points, +2.3 long-context points, +0.6 multimodal points, and +5.5 points during continual training conversion under controlled settings.

Limitations

The source does not remove tokenization entirely; it compresses already-tokenized streams into concepts. It should be compared with byte-native methods such as H-Net and Synergy.

Open Questions

  • How stable are learned concept boundaries across domains?
  • Can concept compression be combined with byte-level or pixel-level inputs?