ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models

Source

Core Claim

ParaRNN argues that nonlinear recurrent neural networks do not have to be trained by strictly sequential unrolling. It recasts the full hidden-state trajectory as a single nonlinear system of equations, solves each Newton linearization of that system with a parallel reduction, and demonstrates practical language-model training with adapted GRU and LSTM cells at up to 7B parameters.
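
A rough sketch of that reformulation, using notation chosen here rather than quoted from the paper: stack the recurrence over all time steps into one residual system and apply Newton's method to the whole trajectory.

```latex
% Illustrative notation (f, F, J_t, r_t, \delta h), not the paper's exact symbols.
% Recurrence: h_t = f(h_{t-1}, x_t) for t = 1, \dots, L, with h_0 given.
% Residual system over the whole trajectory h = (h_1, \dots, h_L):
%   F(h)_t := h_t - f(h_{t-1}, x_t) = 0.
\begin{align*}
  \delta h_t - J_t \,\delta h_{t-1} &= -F\!\left(h^{(k)}\right)_t ,
  & J_t &= \left.\frac{\partial f}{\partial h_{t-1}}\right|_{\left(h^{(k)}_{t-1},\, x_t\right)} ,\\
  \delta h_t &= J_t \,\delta h_{t-1} + r_t ,
  & r_t &= f\!\left(h^{(k)}_{t-1}, x_t\right) - h^{(k)}_t ,\\
  h^{(k+1)} &= h^{(k)} + \delta h .
\end{align*}
```

The middle line is a plain linear recurrence in the Newton update δh, which is exactly the structure the parallel reduction described under Method Notes can solve.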

Why It Matters

This is the landmark source for the wiki’s efficient recurrent sequence-model thread because it directly relaxes the assumption that only linear recurrences, selective SSMs, or attention-like mixers can be practical at large scale. It makes nonlinear recurrent latent-state dynamics a serious architecture option for language models and, by analogy, for time-series and world-model backbones where nonlinear state updates are often the natural modeling object.

Method Notes

  • ParaRNN writes the sequential RNN update as a system of equations over all time steps, then applies Newton iterations to solve for the hidden trajectory.
  • Each Newton step reduces to a linear recurrence over Jacobian and residual terms; that linear recurrence can be solved by an associative parallel reduction in O(log L) span instead of O(L) sequential span (see the first code sketch after this list).
  • The backward pass is already linear in the hidden-state adjoints, so the same parallel reduction machinery applies directly, without Newton iterations: a single reduction over the adjoint recurrence suffices (see the second code sketch after this list).
  • The paper introduces ParaGRU and ParaLSTM variants whose hidden-state Jacobians have diagonal or small block-diagonal structure, keeping the memory and matrix-multiplication costs of the parallel reduction tractable.
  • The implementation is a PyTorch plus CUDA library that parallelizes a nonlinear cell from its recurrence definition, rather than a single hard-coded architecture.
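
A minimal sketch of this pattern, assuming a toy elementwise cell with a diagonal hidden-state Jacobian; the names `cell`, `scan_linear`, and `newton_parallel_solve` are illustrative and are not the ParaRNN library's API.

```python
# Hedged sketch of the ParaRNN idea, not the library's actual API:
# solve h_t = f(h_{t-1}, x_t) for all t at once with Newton iterations,
# where each Newton step is a linear recurrence handled by a parallel scan.
import torch

def cell(h_prev, u, a):
    # Toy elementwise nonlinear cell: h_t = tanh(a * h_{t-1} + u_t).
    # Its Jacobian w.r.t. h_{t-1} is diagonal: diag((1 - h_t**2) * a).
    return torch.tanh(a * h_prev + u)

def scan_linear(A, b):
    # Solve d_t = A_t * d_{t-1} + b_t with d_0 = 0 (all elementwise, since the
    # Jacobians here are diagonal) using a Hillis-Steele doubling scan over the
    # associative operator (A2, b2) o (A1, b1) = (A2*A1, A2*b1 + b2).
    # log2(L) rounds of tensor shifts stand in for an O(log L)-span kernel.
    A, b = A.clone(), b.clone()
    step, L = 1, A.shape[0]
    while step < L:
        A_shift = torch.ones_like(A)
        b_shift = torch.zeros_like(b)
        A_shift[step:] = A[:-step]
        b_shift[step:] = b[:-step]
        b = A * b_shift + b
        A = A * A_shift
        step *= 2
    return b  # b[t] now equals d_t

def newton_parallel_solve(u, a, h0, num_iters=3):
    # Newton iterations on the stacked residual F(h)_t = h_t - f(h_{t-1}, u_t).
    L, d = u.shape
    h = torch.zeros(L, d)                        # initial guess for the trajectory
    for _ in range(num_iters):
        h_prev = torch.cat([h0[None], h[:-1]])   # shifted states h_{t-1}
        f = cell(h_prev, u, a)
        residual = f - h                         # equals -F(h)_t
        J = (1.0 - f**2) * a                     # diagonal Jacobian df/dh_{t-1}
        delta = scan_linear(J, residual)         # Newton step, solved in parallel
        h = h + delta
    return h

# Check against plain sequential unrolling. The paper reports ~3 Newton
# iterations for its cells; this toy run uses a few extra for a clean match.
torch.manual_seed(0)
L, d = 64, 16
u, a, h0 = torch.randn(L, d), 0.2 * torch.randn(d), torch.zeros(d)
h_seq, state = [], h0
for t in range(L):
    state = cell(state, u[t], a)
    h_seq.append(state)
print(torch.allclose(newton_parallel_solve(u, a, h0, num_iters=10),
                     torch.stack(h_seq), atol=1e-5))
```

The doubling scan here does O(L log L) work in O(log L) span and shifts whole tensors for clarity; a production implementation would use a fused GPU scan rather than this loop.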

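A hedged follow-on sketch for the backward pass, reusing `scan_linear` and the toy trajectory from the block above; `backward_adjoints` and `g` are illustrative names, not the library's API. Because the adjoint recurrence is already linear, one reverse scan is enough and no Newton iterations are needed.

```python
def backward_adjoints(h, g, a):
    # Adjoint recurrence for the toy cell above. With g_t = dLoss/dh_t (direct
    # term only) and diagonal Jacobians J_t = dh_t/dh_{t-1} = (1 - h_t**2) * a,
    # the total gradients obey
    #   lambda_t = g_t + J_{t+1} * lambda_{t+1},   lambda_L = g_L,
    # which is linear in lambda: a single reverse-time scan solves it.
    L, d = h.shape
    J = (1.0 - h**2) * a
    A_rev = torch.flip(torch.cat([J[1:], torch.zeros(1, d)]), dims=[0])
    b_rev = torch.flip(g, dims=[0])
    return torch.flip(scan_linear(A_rev, b_rev), dims=[0])

# Check against a plain sequential reverse-time accumulation.
g = torch.randn(L, d)
h_traj = torch.stack(h_seq)
J = (1.0 - h_traj**2) * a
lam_seq, lam = [], torch.zeros(d)
for t in reversed(range(L)):
    lam = g[t] + (J[t + 1] * lam if t + 1 < L else 0.0)
    lam_seq.append(lam)
lam_seq.reverse()
print(torch.allclose(backward_adjoints(h_traj, g, a),
                     torch.stack(lam_seq), atol=1e-5))
```
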
Evidence And Results

  • The paper reports that three Newton iterations suffice for the ParaGRU and ParaLSTM cells used in the language-model experiments, with residuals reaching machine precision within 3 to 4 iterations both at initialization and in the trained models.
  • It trains ParaGRU, ParaLSTM, Mamba2, and Transformer baselines at roughly 400M, 1B, 2.9B, and 7B parameters on SlimPajama with Books3 removed, using Chinchilla-style token budgets.
  • At the 7B scale, ParaGRU achieves lower perplexity than the Transformer baseline on the paper’s SlimPajama test setup, while Mamba2 remains the strongest of the compared architectures on perplexity.
  • Runtime results show that applying the fused nonlinear RNN cell can be comparable to or faster than the tested Mamba scan path at the measured sequence lengths, and the combined forward-plus-backward setting is especially favorable because the nonlinear backward pass needs only one parallel reduction.

Limitations

  • The method still depends on fast Newton convergence; the paper verifies this for its ParaGRU and ParaLSTM cells but does not prove it for arbitrary nonlinear recurrent cells.
  • Efficient scaling requires Jacobian structure. Dense hidden-state Jacobians would make the method impractical, so architecture design remains constrained by the parallel solver.
  • The 7B comparison is compelling as a feasibility result, but Mamba2 is still stronger on perplexity in the reported language-model table.
  • The work is about token-sequence language modeling, not directly about numeric time series or action-conditioned world models; transfer to those settings needs separate evidence.

Open Questions

  • Which nonlinear recurrent cells beyond ParaGRU and ParaLSTM converge in a small, stable number of Newton iterations?
  • Can ParaRNN-style nonlinear recurrence improve passive dynamics models for numeric time series where nonlinear state transitions are more natural than token mixing?
  • Can the same solver support action-conditioned world models with explicit actions or control inputs without losing the Jacobian structure that makes the method efficient?