Universal Transformers
Source
- Raw Markdown: paper_universal-transformers-2018.md
- PDF: paper_universal-transformers-2018.pdf
- Preprint: arXiv 1807.03819
- Official code: tensorflow/tensor2tensor
Core Claim
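The Universal Transformer applies one weight-shared Transformer block repeatedly across depth, combining parallel self-attention with the recurrent inductive bias of an RNN. It also adds Adaptive Computation Time (ACT), which lets each position halt after a different number of refinement steps.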
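The two mechanisms in the core claim can be sketched in a few lines of NumPy. This is a minimal illustration, not the tensor2tensor implementation: it uses single-head attention, omits the position and timestep embeddings, and applies a simplified form of ACT. All names (`init_params`, `shared_block`, `universal_transformer_act`, `w_halt`, `eps`) are hypothetical and chosen for this note.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_params(d_model, d_ff, seed=0):
    # One set of weights, reused at every depth step (hypothetical layout).
    rng = np.random.default_rng(seed)
    s = 1.0 / np.sqrt(d_model)
    return {
        "wq": rng.normal(0, s, (d_model, d_model)),
        "wk": rng.normal(0, s, (d_model, d_model)),
        "wv": rng.normal(0, s, (d_model, d_model)),
        "w1": rng.normal(0, s, (d_model, d_ff)),
        "w2": rng.normal(0, 1.0 / np.sqrt(d_ff), (d_ff, d_model)),
        "w_halt": rng.normal(0, s, (d_model,)),
        "b_halt": 0.0,
    }

def shared_block(h, p):
    # Single-head self-attention plus feed-forward, each with residual + norm.
    q, k, v = h @ p["wq"], h @ p["wk"], h @ p["wv"]
    attn = softmax(q @ k.T / np.sqrt(h.shape[-1])) @ v
    h = layer_norm(h + attn)
    ff = np.maximum(0.0, h @ p["w1"]) @ p["w2"]
    return layer_norm(h + ff)

def universal_transformer_act(x, p, max_steps=8, eps=0.01):
    # x: (positions, d_model). Returns the ACT-weighted state and the
    # number of refinement steps each position received.
    n = x.shape[0]
    state = x.copy()
    cum = np.zeros(n)               # cumulative halting probability
    halted = np.zeros(n, dtype=bool)
    updates = np.zeros(n)
    output = np.zeros_like(x)
    for step in range(max_steps):
        if halted.all():
            break
        new_state = shared_block(state, p)   # same weights at every step
        # Halted positions keep their state but remain visible to attention.
        state = np.where(halted[:, None], state, new_state)
        p_halt = 1.0 / (1.0 + np.exp(-(state @ p["w_halt"] + p["b_halt"])))
        last = step == max_steps - 1
        stops = ~halted & ((cum + p_halt > 1.0 - eps) | last)
        continues = ~halted & ~stops
        # Stopping positions contribute the remainder so weights sum to 1.
        weight = np.where(stops, 1.0 - cum, np.where(continues, p_halt, 0.0))
        output += weight[:, None] * state
        cum += weight
        updates += (~halted).astype(float)
        halted |= stops
    return output, updates

params = init_params(d_model=16, d_ff=32)
tokens = np.random.default_rng(1).normal(size=(5, 16))
refined, steps_per_position = universal_transformer_act(tokens, params)
```

The loop body is the whole model: depth comes from repeating `shared_block`, and `steps_per_position` shows how ACT can spend different amounts of compute on different positions.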
Relevance To This Wiki
This is the root source for the looped and recurrent-depth Transformer branch. It gives the ancestor interface: shared layers, iterative state refinement, optional dynamic compute, and a claim that depth recurrence improves systematic generalization.
Limitations
The evidence predates modern large-scale pretraining and does not settle whether shared-depth models beat unique-depth Transformers under matched memory, latency, and training compute.
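A rough accounting sketch shows why "matched" needs to be specified. It assumes a standard layer with four attention projections and a two-matrix feed-forward network, ignoring biases, LayerNorm, and embeddings; the function names and the example sizes are hypothetical.

```python
def layer_params(d_model, d_ff):
    # Approximate weight count for one Transformer layer.
    attention = 4 * d_model * d_model   # Q, K, V, and output projections
    feed_forward = 2 * d_model * d_ff   # two feed-forward matrices
    return attention + feed_forward

def compare_budgets(d_model, d_ff, depth):
    per_layer = layer_params(d_model, d_ff)
    return {
        "unique_depth_params": depth * per_layer,  # one weight set per layer
        "shared_depth_params": per_layer,          # one weight set, reused
        "layer_applications": depth,               # identical for both models
    }

budget = compare_budgets(d_model=512, d_ff=2048, depth=6)
```

At equal width and depth the two models run the same number of layer applications per forward pass, but the shared-depth model stores roughly 1/depth of the layer weights. Matching parameters, matching compute, and matching latency therefore select different baselines.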
Foundation TSFM Relevance
Use as architecture background for dynamic compute and compact latent-state updates, not as direct evidence for numeric time-series or action-conditioned world models.
Links Into The Wiki
- Universal Transformer
- Looped Transformers And Test-Time Memory
- Efficient Recurrent Sequence Models
- Time-Series Scaling And Efficiency
- Foundation Time-Series Model Research Agenda
Open Questions
- Which matched-budget baseline should this source be compared against: unique-depth Transformer layers, recurrent state, explicit memory, or extra inference steps?
- Which claims transfer from token-sequence reasoning to multivariate time-series state tracking, event streams, or action-conditioned world models?