Pretrained Transformers as Universal Computation Engines

Source

Core Claim

A Transformer pretrained on natural language can transfer useful computation to non-language sequence tasks with minimal fine-tuning, even when self-attention and feed-forward residual blocks are frozen.

Key Contributions

  • Introduces the Frozen Pretrained Transformer (FPT) setup.
  • Fine-tunes only the input/output layers, positional embeddings, and layer norms, leaving the core GPT-2 blocks frozen (see the sketch after this list).
  • Tests transfer to numerical computation, image classification, and protein fold prediction.
  • Compares language-pretrained Transformers with random Transformers and LSTMs.
  • Finds that language pretraining can improve performance and training efficiency on non-language tasks.
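A minimal sketch of the FPT-style freezing scheme, assuming Hugging Face's GPT2Model as the pretrained backbone; this is not the authors' code, and the input/output dimensions are placeholders for a hypothetical downstream task.

```python
# Sketch of FPT-style freezing: only layer norms, positional embeddings,
# and new task-specific input/output layers are trainable.
import torch.nn as nn
from transformers import GPT2Model

backbone = GPT2Model.from_pretrained("gpt2")

# Freeze everything, then re-enable the parts FPT fine-tunes:
# positional embeddings ("wpe") and all layer norms ("ln_").
for name, param in backbone.named_parameters():
    param.requires_grad = ("wpe" in name) or ("ln_" in name)

# New input/output layers are trained from scratch (placeholder sizes).
d_model = backbone.config.n_embd          # 768 for GPT-2 small
input_proj = nn.Linear(32, d_model)       # e.g. 32-dim input tokens
output_head = nn.Linear(d_model, 10)      # e.g. 10 output classes

trainable = [p for p in backbone.parameters() if p.requires_grad]
trainable += list(input_proj.parameters()) + list(output_head.parameters())
# Pass `trainable` to the optimizer; the frozen attention/MLP weights stay fixed.
```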

Method Notes

This is a cross-modality transfer source, not a time-series forecasting paper. Its relevance is the “training on structured data is better than random noise” lesson: pretrained sequence computation may transfer even when the source modality differs sharply from the target modality.

For TSFMs, the question is whether text-pretrained or vision-pretrained sequence backbones provide useful initialization for numeric temporal data, and which layers should stay frozen versus adapted.
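A hypothetical sketch of what such a transfer could look like for numeric series (not something the paper does): a `FrozenBackboneForecaster` that patches a univariate series into pseudo-tokens via a linear embedding, runs them through a frozen GPT-2 stack, and reads out a forecast. The class name, `patch_len`, and `horizon` are illustrative assumptions.

```python
# Hypothetical: reusing a frozen text-pretrained backbone for time series.
import torch
import torch.nn as nn
from transformers import GPT2Model

class FrozenBackboneForecaster(nn.Module):
    """Linear patch embedding -> frozen GPT-2 blocks -> linear forecast head."""

    def __init__(self, patch_len: int = 16, horizon: int = 24):
        super().__init__()
        self.patch_len = patch_len
        self.backbone = GPT2Model.from_pretrained("gpt2")
        for p in self.backbone.parameters():
            p.requires_grad = False   # freeze all blocks; an FPT-style variant
                                      # would also unfreeze layer norms / wpe
        d_model = self.backbone.config.n_embd
        self.embed = nn.Linear(patch_len, d_model)   # trained from scratch
        self.head = nn.Linear(d_model, horizon)      # trained from scratch

    def forward(self, series: torch.Tensor) -> torch.Tensor:
        # series: (batch, length), with length a multiple of patch_len
        b, L = series.shape
        patches = series.view(b, L // self.patch_len, self.patch_len)
        tokens = self.embed(patches)                           # (b, n_patches, d_model)
        hidden = self.backbone(inputs_embeds=tokens).last_hidden_state
        return self.head(hidden[:, -1])                        # forecast from last token

# Usage: model = FrozenBackboneForecaster(); yhat = model(torch.randn(8, 128))
```

Which pieces to leave frozen versus adapt (layer norms, positional embeddings, or whole blocks) is exactly the open design question for TSFMs noted above.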

Evidence And Results

The paper reports that Frozen Pretrained Transformers can match or approach strong task-specific baselines across several non-language tasks and often converge faster than from-scratch alternatives.

Alex Notes

  • From Kotenkov.
  • Alex note: “Training on cat videos is better than random noise” and many studies on jumpstarting training from a different modality.

Limitations

  • The transfer tasks are not modern large-scale time-series forecasting benchmarks.
  • It does not prove that natural-language pretraining is the best initialization for numeric time series.
  • Frozen transfer can hide whether gains come from architecture, optimization priors, data priors, or representation geometry.

Open Questions

  • Does language-pretrained attention help time-series forecasting after controlling for architecture and optimizer?
  • Which components transfer best: attention maps, MLPs, layer norms, positional embeddings, or only initialization statistics?
  • Can cross-modality initialization reduce TSFM pretraining cost without damaging numerical calibration?