mHC: Manifold-Constrained Hyper-Connections
Source
- Raw Markdown: paper_mhc-2025.md
- PDF: paper_mhc-2025.pdf
- Preprint: arXiv 2512.24880
- Gonzo ML discussion: post 4497
Core Claim
The paper makes Hyper-Connections trainable at large scale by projecting the residual mixing matrix onto the manifold of doubly stochastic matrices (non-negative entries, with every row and every column summing to one) using Sinkhorn-Knopp normalization.
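As a minimal sketch of the Sinkhorn-Knopp normalization named in the core claim, the following pure-Python function alternates row and column rescaling on a strictly positive matrix. Exponentiating unconstrained entries to obtain positivity and the default of 20 iterations are assumptions of this sketch, not details confirmed on this page. The name `sinkhorn_knopp` and its parameters are illustrative.

```python
import math


def sinkhorn_knopp(logits, n_iters=20):
    """Approximately map a square matrix to a doubly stochastic one.

    Assumption of this sketch: positivity comes from exponentiating the
    raw entries; n_iters is a hypothetical default.
    """
    n = len(logits)
    # Subtract the global max before exp() for numerical stability.
    # A constant scale factor does not change the normalized result.
    peak = max(max(row) for row in logits)
    mat = [[math.exp(x - peak) for x in row] for row in logits]
    for _ in range(n_iters):
        # Rescale each row so that it sums to one.
        mat = [[x / sum(row) for x in row] for row in mat]
        # Rescale each column so that it sums to one.
        col_sums = [sum(mat[i][j] for i in range(n)) for j in range(n)]
        mat = [[mat[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return mat
```

After the final pass the columns sum to one up to rounding and the rows sum to one only approximately, with the gap shrinking as `n_iters` grows. A production version would operate on batched tensors inside a fused kernel rather than on Python lists.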
Relevance To This Wiki
mHC is architectural evidence that the residual stream can be widened into several parallel streams instead of carrying a single vector state between layers. It matters because it introduces a matrix-valued residual state as a third scaling axis alongside depth and hidden width.
It is also the direct upstream mechanism for Hyperloop Transformers, where hyper-connections are applied at loop boundaries rather than every layer.
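To make "matrix-valued residual state" concrete, here is a hedged sketch of one hyper-connection step over `n` parallel streams of width `d`. It assumes the usual read, mix, write form: the layer reads a weighted combination of streams, the streams are mixed by a residual matrix, and the layer output is written back to each stream. The names `hyper_connection_step`, `h_res`, `h_pre`, `h_post`, and `layer_fn` are illustrative, and the static (input-independent) coefficients are a simplification of this sketch.

```python
def hyper_connection_step(streams, h_res, h_pre, h_post, layer_fn):
    """One residual update on an n x d state (list of n streams, each length d).

    streams : n lists of d floats (the matrix-valued residual state)
    h_res   : n x n mixing matrix (doubly stochastic under the mHC constraint)
    h_pre   : n read weights, h_post : n write weights (hypothetical, static)
    layer_fn: maps a length-d list to a length-d list (attention, MLP, ...)
    """
    n, d = len(streams), len(streams[0])
    # Read: collapse the n streams into a single layer input.
    x_in = [sum(h_pre[i] * streams[i][k] for i in range(n)) for k in range(d)]
    y = layer_fn(x_in)
    # Mix: each new stream is a weighted combination of the old streams.
    mixed = [
        [sum(h_res[i][j] * streams[j][k] for j in range(n)) for k in range(d)]
        for i in range(n)
    ]
    # Write: add the layer output back into every stream.
    return [[mixed[i][k] + h_post[i] * y[k] for k in range(d)] for i in range(n)]
```

With `n = 1` and all coefficients equal to one, the step reduces to the ordinary residual update `x + f(x)`. When `h_res` is doubly stochastic, the mixing term leaves the per-feature sum across streams unchanged, which is one way to read the stability motivation for the constraint.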
Limitations
The evidence is language-model pretraining on DeepSeek-style MoE architectures, including in-house 3B, 9B, and 27B experiments. It is not direct evidence for multivariate time-series state tracking, action-conditioned rollouts, or always-on serving.
The method also makes memory access, recomputation, fused kernels, and communication overlap part of the architecture contract. Those costs need to be counted before treating mHC as an efficiency win in another domain.
Foundation TSFM Relevance
mHC is adjacent to the dynamic-compute and representation-quality slots. For a foundation time-series model, the interesting transfer hypothesis is whether a bounded set of parallel residual streams can preserve regimes, channel interactions, exogenous context, or action history better than a single residual stream under the same memory-bandwidth budget.
Links Into The Wiki
- mHC
- Hyperloop Transformers
- MoDA
- Intermediate-Layer Representations
- Looped Transformers And Test-Time Memory
- Efficient Recurrent Sequence Models
- Time-Series Scaling And Efficiency
- Foundation Time-Series Model Research Agenda
Open Questions
- Does a matrix-valued residual stream preserve time-series latent state, channel-local deviations, or intervention history better than depth-KV retrieval, memory tokens, or a wider ordinary hidden state?
- Can the Sinkhorn-constrained residual map stay efficient under always-on streaming inference, where memory bandwidth can dominate nominal FLOPs?
- Which public implementation or reproduction should be used before treating the DeepSeek kernel-level claims as portable?