Pretrained Transformers as Universal Computation Engines

Source

Core Claim

A Transformer pretrained on natural language can transfer useful computation to non-language sequence tasks with minimal fine-tuning, even when self-attention and feed-forward residual blocks are frozen.

Key Contributions

  • Introduces the Frozen Pretrained Transformer (FPT) setup.
  • Fine-tunes only the input/output layers, positional embeddings, and layer norms, leaving the core GPT-2 blocks frozen (see the sketch after this list).
  • Tests transfer to numerical computation, image classification, and protein fold prediction.
  • Compares language-pretrained Transformers with random Transformers and LSTMs.
  • Finds that language pretraining can improve performance and training efficiency on non-language tasks.
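A minimal sketch of the FPT-style freezing scheme, assuming Hugging Face's GPT2Model as the pretrained backbone; this is not the authors' code, and the input/output dimensions are placeholders for a hypothetical downstream task.

```python
# Sketch of FPT-style freezing: only layer norms, positional embeddings,
# and new task-specific input/output layers are trainable.
import torch.nn as nn
from transformers import GPT2Model

backbone = GPT2Model.from_pretrained("gpt2")

# Freeze everything, then re-enable the parts FPT fine-tunes:
# positional embeddings ("wpe") and all layer norms ("ln_").
for name, param in backbone.named_parameters():
    param.requires_grad = ("wpe" in name) or ("ln_" in name)

# New input/output layers are trained from scratch (placeholder sizes).
d_model = backbone.config.n_embd          # 768 for GPT-2 small
input_proj = nn.Linear(32, d_model)       # e.g. 32-dim input tokens
output_head = nn.Linear(d_model, 10)      # e.g. 10 output classes

trainable = [p for p in backbone.parameters() if p.requires_grad]
trainable += list(input_proj.parameters()) + list(output_head.parameters())
# Pass `trainable` to the optimizer; the frozen attention/MLP weights stay fixed.
```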

Method Notes

This is a cross-modality transfer source, not a time-series forecasting paper. Its relevance is the “training on structured data is better than random noise” lesson: pretrained sequence computation may transfer even when the source modality differs sharply from the target modality.

For TSFMs, the question is whether text-pretrained or vision-pretrained sequence backbones provide useful initialization for numeric temporal data, and which layers should stay frozen versus adapted.
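A hypothetical sketch of what such a transfer could look like for numeric series (not something the paper does): a `FrozenBackboneForecaster` that patches a univariate series into pseudo-tokens via a linear embedding, runs them through a frozen GPT-2 stack, and reads out a forecast. The class name, `patch_len`, and `horizon` are illustrative assumptions.

```python
# Hypothetical: reusing a frozen text-pretrained backbone for time series.
import torch
import torch.nn as nn
from transformers import GPT2Model

class FrozenBackboneForecaster(nn.Module):
    """Linear patch embedding -> frozen GPT-2 blocks -> linear forecast head."""

    def __init__(self, patch_len: int = 16, horizon: int = 24):
        super().__init__()
        self.patch_len = patch_len
        self.backbone = GPT2Model.from_pretrained("gpt2")
        for p in self.backbone.parameters():
            p.requires_grad = False   # freeze all blocks; an FPT-style variant
                                      # would also unfreeze layer norms / wpe
        d_model = self.backbone.config.n_embd
        self.embed = nn.Linear(patch_len, d_model)   # trained from scratch
        self.head = nn.Linear(d_model, horizon)      # trained from scratch

    def forward(self, series: torch.Tensor) -> torch.Tensor:
        # series: (batch, length), with length a multiple of patch_len
        b, L = series.shape
        patches = series.view(b, L // self.patch_len, self.patch_len)
        tokens = self.embed(patches)                           # (b, n_patches, d_model)
        hidden = self.backbone(inputs_embeds=tokens).last_hidden_state
        return self.head(hidden[:, -1])                        # forecast from last token

# Usage: model = FrozenBackboneForecaster(); yhat = model(torch.randn(8, 128))
```

Which pieces to leave frozen versus adapt (layer norms, positional embeddings, or whole blocks) is exactly the open design question for TSFMs noted above.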

Evidence And Results

The paper reports that Frozen Pretrained Transformers can match or approach strong task-specific baselines across several non-language tasks and often converge faster than from-scratch alternatives.

Alex Notes

  • From Kotenkov.
  • Alex note: “Training on cat videos is better than random noise” and many studies on jumpstarting training from a different modality.

Limitations

  • The transfer tasks are not modern large-scale time-series forecasting benchmarks.
  • It does not prove that natural-language pretraining is the best initialization for numeric time series.
  • Frozen transfer can hide whether gains come from architecture, optimization priors, data priors, or representation geometry.

Open Questions

  • Does language-pretrained attention help time-series forecasting after controlling for architecture and optimizer?
  • Which components transfer best: attention maps, MLPs, layer norms, positional embeddings, or only initialization statistics?
  • Can cross-modality initialization reduce TSFM pretraining cost without damaging numerical calibration?