Mixture of Experts

Summary

Mixture of Experts (MoE) appears in this corpus as a tool for scaling and compute allocation, not only as a parameter-count trick.
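
As a concrete illustration of what "compute allocation" means here, the sketch below routes each token to only a small subset of expert FFNs, so total parameters grow with the number of experts while per-token compute stays roughly fixed. This is a minimal PyTorch-style sketch under generic assumptions; names such as MoELayer, router, and top_k are illustrative and not taken from either paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoELayer(nn.Module):
        """Top-k routed mixture of expert FFNs (illustrative, not paper-specific)."""
        def __init__(self, d_model, d_hidden, num_experts, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, num_experts)   # per-token routing scores
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                              nn.Linear(d_hidden, d_model))
                for _ in range(num_experts)
            )

        def forward(self, x):                          # x: (num_tokens, d_model)
            weights, idx = self.router(x).topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)       # renormalize over the chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e           # tokens whose slot-th choice is expert e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out                                 # each token activated only top_k experts

In this framing, per-token FLOPs are set by top_k and the expert width, while the parameter count is set by num_experts, which is why the two can be tuned separately when scaling.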

What The Wiki Currently Believes

  • Beyond Language Modeling finds MoE useful for multimodal scaling and modality specialization.
  • ConceptMoE uses MoE to isolate the benefits of concept-level processing under matched FLOPs and total parameters (see the budget sketch after this list).
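
A minimal sketch of how such a matched-budget comparison is typically set up: hold total parameters and per-token active parameters (a common FLOPs proxy) fixed across the variants being compared. The helper name and the numbers below are hypothetical, not taken from ConceptMoE.

    def moe_ffn_budget(d_model, expert_hidden, num_experts, top_k):
        per_expert = 2 * d_model * expert_hidden        # up- and down-projection, biases ignored
        return {
            "total_params": num_experts * per_expert,   # what the model stores
            "active_params": top_k * per_expert,        # what each token actually uses
        }

    # Two hypothetical configurations with the same storage cost and the same
    # per-token compute; any quality difference between them can then be
    # attributed to how the compute is organized rather than to its amount.
    a = moe_ffn_budget(d_model=1024, expert_hidden=512,  num_experts=16, top_k=4)
    b = moe_ffn_budget(d_model=1024, expert_hidden=1024, num_experts=8,  top_k=2)
    assert a == b   # matched total parameters and matched active parameters per token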

Evidence

Both sources treat MoE as a way to separate or reallocate computation where uniform processing is wasteful.

Open Questions

  • Can MoE routing align naturally with modality boundaries, concept boundaries, and task difficulty at the same time?
  • How should MoE interact with learned token compression?