---
abstract: |
  We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning -- in particular, without finetuning of the self-attention and feedforward layers of the residual blocks. We consider such a model, which we call a Frozen Pretrained Transformer (FPT), and study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction. In contrast to prior works, which investigate finetuning on the same modality as the pretraining dataset, we show that pretraining on natural language can improve performance and compute efficiency on non-language downstream tasks. Additionally, we perform an analysis of the architecture, comparing the performance of a randomly initialized transformer to a randomly initialized LSTM. Combining the two insights, we find that language-pretrained transformers can obtain strong performance on a variety of non-language tasks[^1].
author:
- |
  Kevin Lu\
  UC Berkeley\
  `kzl@berkeley.edu`\
  \
  **Pieter Abbeel**\
  UC Berkeley\
  `pabbeel@cs.berkeley.edu`\
  \
  **Aditya Grover**\
  Facebook AI Research\
  `adityagrover@fb.com`\
  \
  **Igor Mordatch**\
  Google Brain\
  `imordatch@google.com`
bibliography:
- citations.bib
title: |
  Pretrained Transformers As\
  Universal Computation Engines
---

```{=latex}
\maketitle
```
```{=latex}
\centering
```
![ A *frozen* language-pretrained transformer (FPT) -- without finetuning the self-attention and feedforward layers -- can achieve strong performance compared to a transformer fully trained from scratch on a downstream modality, on benchmarks from the literature [@tay2020lra; @rap2019tape]. We show results on diverse classification tasks (see Section `\ref{sec:tasks}`{=latex}): numerical computation (Bit Memory/XOR, ListOps), image classification (MNIST, CIFAR-10), and protein fold prediction (Homology). We also show results for a fully trained LSTM as a baseline. ](figures/main/performance.png "fig:"){#fig:main_result width="1\\linewidth"} `\vspace{0.5em}`{=latex} ![](figures/main/legend.png "fig:"){width="0.9\\linewidth"}

```{=latex}
\clearpage
```

```{=latex}
\clearpage
```
Introduction {#sec:intro}
============

The transformer architecture [@vaswani2017attention] has shown broad successes in deep learning, serving as the backbone of large models for tasks such as modeling natural language [@brown2020gpt3], images [@dosovitskiy2020vit], proteins [@jumper2021alphafold], behaviors [@abramson2020imitating], and multimodal tasks comprising both images and text [@lu2019vilbert; @radford2021clip]. Inspired by these successes, we seek to explore the generalization capabilities of a transformer in transferring from one modality to another.

Classical approaches to sequence processing relied on recurrent neural networks (RNNs) [@rumelhart1985rnn; @hochreiter1997lstm]. In contrast, transformers utilize self-attention layers to extract features across the tokens of a sequence, such as words [@vaswani2017attention] or image patches [@dosovitskiy2020vit]. Furthermore, it has become common practice to train large models on unsupervised or weakly supervised objectives before finetuning or evaluating zero-shot generalization on a downstream task. However, the downstream tasks that have been studied are generally restricted to the same modality as the original training set: for example, training GPT [@radford2018gpt] on a large language corpus and finetuning on a small task-specific dataset. Our goal in this work is to investigate finetuning on modalities distinct from the training modality.

We hypothesize that transformers, namely the self-attention layers, can be pretrained on a data-rich modality (i.e. where data is plentiful, such as a natural language corpus) and identify feature representations that are useful for *arbitrary* data sequences, enabling downstream transfer to different modalities. In particular, we seek to investigate what pretrained language models (LMs) are capable of in terms of generalizing to other modalities with sequential structure.

To investigate this hypothesis, we take a transformer model pretrained on natural language data, GPT-2 [@radford2019gpt2], and finetune only the linear input and output layers, as well as the positional embeddings and layer norm parameters. We call this model a Frozen Pretrained Transformer (FPT). On a range of tasks across a variety of modalities -- including numerical computation, image classification, and protein fold prediction -- FPT achieves performance comparable to transformer and LSTM models trained entirely from scratch, matching reported benchmarks for these tasks (Figure `\ref{fig:main_result}`{=latex}). Additionally, we find that FPT models also converge faster during training. Our results suggest that the self-attention layers learned by a language model may have properties amenable to efficient universal computation. Through a series of experiments, we seek to investigate what contributes to the performance of FPTs by isolating various sub-components of these models.

Methodology
===========

Tasks {#sec:tasks}
-----

We evaluate on a diverse set of classification tasks representative of different modalities. In particular, we are interested in whether language models are inherently capable of *universal computation*, by which we mean the ability to learn representations for predictive learning across diverse modalities.

**Bit memory.** Similar to the task proposed by [@miconi2018hebbian], we consider a bit memory task where the model is shown 5 bitstrings each of length 1000. Afterwards, the model is shown a masked version of one of the bitstrings, where each bit is masked with probability $0.5$, and the model is tasked with producing the original bitstring. The bitstrings are broken up into sequences of length 50, so that the models are fed 120 tokens of dimension 50.

**Bit XOR.** Similar to the bit memory task, the model is shown 2 bitstrings of length 5, where the model must predict the element-wise XOR of the two bitstrings. The bitstrings are shown 1 bit at a time, so the models are fed 10 tokens of dimension 1.
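The two bit-level tasks are straightforward to generate; the following is a minimal sketch (function and variable names are our own, not from the paper's released code):

```python
import numpy as np

def bit_memory_example(n_strings=5, length=1000, mask_p=0.5, seed=0):
    """One bit-memory instance: n_strings bitstrings, then a masked copy
    of one of them; the target is the unmasked original.  With the
    defaults, the 5 * 1000 + 1000 = 6000 bits are later chunked into
    length-50 tokens, giving 120 tokens of dimension 50."""
    rng = np.random.default_rng(seed)
    strings = rng.integers(0, 2, size=(n_strings, length))
    target = strings[rng.integers(n_strings)]
    mask = rng.random(length) < mask_p
    query = np.where(mask, -1, target)  # -1 marks a masked bit
    return strings, query, target

def bit_xor_example(length=5, seed=0):
    """One bit-XOR instance: two bitstrings and their element-wise XOR.
    Shown 1 bit at a time, this yields 2 * 5 = 10 tokens of dimension 1."""
    rng = np.random.default_rng(seed)
    a, b = rng.integers(0, 2, size=(2, length))
    return a, b, a ^ b
```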

**ListOps.** Taken from [@tay2020lra], the model is shown a sequence of list operations (ex. `[ MAX 4 3 [ MIN 2 3 ] 1 0 ]`) and tasked with predicting the resulting output digit (ex. `4`). This task evaluates the ability of a model to parse mathematical expressions and evaluate over a long context. The model is shown 1 token at a time, so the models are fed 512 tokens of dimension 15.
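To illustrate the task's structure, a short recursive evaluator suffices to compute ListOps labels. The sketch below assumes the standard ListOps operator set (MAX, MIN, MED, and SM, i.e. sum modulo 10) and whitespace-separated tokens:

```python
def eval_listops(tokens):
    """Recursively evaluate a tokenized ListOps expression, e.g.
    '[ MAX 4 3 [ MIN 2 3 ] 1 0 ]'.split() evaluates to 4."""
    ops = {'MAX': max, 'MIN': min,
           'MED': lambda xs: sorted(xs)[len(xs) // 2],
           'SM': lambda xs: sum(xs) % 10}  # sum modulo 10
    def parse(i):
        if tokens[i] == '[':
            op, args, i = ops[tokens[i + 1]], [], i + 2
            while tokens[i] != ']':
                val, i = parse(i)
                args.append(val)
            return op(args), i + 1          # skip the closing ']'
        return int(tokens[i]), i + 1        # a literal digit
    return parse(0)[0]
```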

**MNIST.** We use the standard MNIST benchmark, where the model must classify a handwritten digit from a $32 \times 32$ black-and-white image. The tokens given to the model are $4 \times 4$ image patches, so the models are fed 64 tokens of dimension 16.

**CIFAR-10.** We use the standard CIFAR-10 benchmark [@krizhevsky2009cifar], where the tokens given to the model are $4 \times 4$ image patches, so the models are fed 64 tokens of dimension 16.
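For both image tasks, the tokenization is a simple reshape of the image into non-overlapping patches; a minimal sketch (this helper is ours, for illustration):

```python
import numpy as np

def to_patch_tokens(image, patch=4):
    """Split an (H, W) image into non-overlapping patch x patch blocks,
    each flattened into one token.  A 32x32 image with 4x4 patches
    yields 64 tokens of dimension 16."""
    h, w = image.shape
    return (image.reshape(h // patch, patch, w // patch, patch)
                 .transpose(0, 2, 1, 3)       # group by patch position
                 .reshape(-1, patch * patch))  # one row per patch
```

For RGB inputs the same reshape is applied per channel (or the channel dimension is folded into the token dimension).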

**CIFAR-10 LRA.** This is a modified version of the above task taken from the Long Range Arena benchmark, where the images are converted to grayscale and flattened into a one-dimensional pixel sequence, each token being a single pixel [@tay2020lra]. As a result, the input sequence consists of 1024 tokens of dimension 1. This task is much more challenging than vanilla CIFAR-10 classification above, as the models must learn patterns over a significantly longer sequence length with minimal spatial inductive bias.

**Remote homology detection.** In this task, we are interested in predicting the fold for a protein, represented as an amino acid sequence. We use the datasets provided by TAPE [@rap2019tape; @fox2013scop; @hou2018deepsf], where the train/test split is generated by holding out certain evolutionary groups. Note that we do not pretrain on Pfam [@elgebali2019pfam], which is common in other works. There are 20 common and 5 uncommon amino acids (25 different types of inputs), and there are 1195 possible labels to predict. We only consider sequences of length less than 1024 for simplicity. The models are thus fed up to 1024 tokens of dimension 25.
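The amino acid inputs can be represented as one-hot tokens of dimension 25. A sketch of such an encoding follows; the exact 25-letter vocabulary used here is an assumption for illustration:

```python
import numpy as np

# Hypothetical 25-letter vocabulary: 20 common residues plus 5
# uncommon/ambiguity codes (the exact extra codes are an assumption).
VOCAB = 'ACDEFGHIKLMNPQRSTVWYXUBZO'

def encode_protein(seq):
    """One-hot encode an amino acid sequence into tokens of dimension 25."""
    idx = {aa: i for i, aa in enumerate(VOCAB)}
    tokens = np.zeros((len(seq), len(VOCAB)))
    for t, aa in enumerate(seq):
        tokens[t, idx[aa]] = 1.0
    return tokens
```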

Architecture {#sec:architecture}
------------

```{=latex}
\centering
```
![ Frozen Pretrained Transformer (FPT). The self-attention & feedforward layers are frozen. ](figures/architecture/architecture.png){#fig:architecture width="1\\linewidth"}

The architecture we use is summarized in Figure `\ref{fig:architecture}`{=latex}. Denote the embedding size/hidden dimension of the transformer as $n_{dim}$, the number of layers as $n_{layers}$ (note $n_{dim} = 768$ and $n_{layers} = 12$ for the base size models), the input dimension as $d_{in}$, the output dimension (number of classes) as $d_{out}$, and the maximum length of the sequence as $l$. We consider finetuning the following parameters of a pretrained GPT-2 model [@radford2019gpt2]:

-   **Output layer:** it is crucial to finetune the output layer since we are transferring to a completely new task. We use the simplest possible instantiation of an output network -- a single linear layer applied to the final token output by the transformer -- in order to highlight that almost all of the computation is performed by the frozen transformer. The output layer has $n_{dim} \times d_{out}$ parameters for the weight matrix. For example, for the base models on CIFAR-10, this comes out to $768 \cdot 10 = 7680$ parameters.

-   **Input layer:** it is important to reinitialize a new input layer since we are reading in a new modality; in essence, we are learning how to query the transformer. This contrasts with prior unsupervised embedding evaluation techniques, such as linear probing -- due to the change in modality, we must train the input layer as well, and evaluate whether the frozen intermediate transformer performs effective computation. Again, we use a linear layer to minimize the amount of computation outside the transformer. The input layer has $d_{in} \times n_{dim}$ parameters for the weight matrix/embeddings, plus an additional $n_{dim}$ parameters if there is a bias term. For the base models on CIFAR-10, this comes out to $(16 + 1) \cdot 768 = 13056$ parameters, including the bias.

-   **Layer norm parameters:** as is standard practice in other finetuning works [@rebuffi2017adapter; @houlsby2019adapter], we also finetune the affine layer norm parameters (scale and bias), which adapt to the statistics of the downstream task in a new domain. In GPT-2, layer norm is applied twice per block, so there are a total of $4 \times n_{dim} \times n_{layers}$ parameters. For the base models on CIFAR-10, these come out to $4 \cdot 768 \cdot 12 = 36864$ parameters.

-   **Positional embeddings:** while we observe that positional embeddings can be surprisingly universal between modalities (see Section `\ref{sec:params}`{=latex}), we generally see a small benefit to finetuning the positional embeddings, which have a cheap parameter cost of $l \times n_{dim}$. For the base models on CIFAR-10, these come out to $64 \cdot 768 = 49152$ parameters.

```{=latex}
\vspace{2em}
```
Given the cheap linear scaling of these parameters, the parameter counts of large transformer models are dominated by the self-attention and feedforward layers, whose parameter counts are quadratic in $n_{dim}$. For the base CIFAR-10 model with 124M parameters, the finetuned parameters come out to approximately $0.086\%$ of the network. Due to this scaling, the fraction decreases with larger model sizes, down to $0.029\%$ of the GPT-2 XL model. We further ablate the importance of each parameter in Section `\ref{sec:params}`{=latex}. For more details and a description of the architecture, see Appendix `\ref{app:architecture}`{=latex}.
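The per-component counts can be checked with a few lines of arithmetic; the helper below is ours, using the base CIFAR-10 hyperparameters:

```python
def finetuned_param_count(n_dim=768, n_layers=12, d_in=16, d_out=10, l=64):
    """Count the finetuned parameters for the base CIFAR-10 model:
    linear input/output layers, layer norms, positional embeddings."""
    output = n_dim * d_out            # output weight matrix
    inputs = d_in * n_dim + n_dim     # input weights plus bias
    ln = 4 * n_dim * n_layers         # two layer norms per block, scale + bias
    pos = l * n_dim                   # positional embeddings
    return output + inputs + ln + pos

total = finetuned_param_count()
print(total, total / 124e6)  # 106752 parameters, ~0.086% of the 124M model
```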

Note that, crucially, all communication between tokens in the model is frozen. The data in each datapoint is chunked into discrete tokens (bits, image patches, amino acids, etc.), which can only reference each other via the frozen attention connections, which are not trained; additionally, neither the input nor the output layer is connected to multiple tokens. Our key investigation is to analyze the computation already inherent in the language model, and hence we learn only a minimal amount of computation on the downstream modality.
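Operationally, freezing amounts to selecting which named parameters receive gradients. A minimal sketch of the selection rule, assuming GPT-2-style parameter names as exposed by, e.g., Hugging Face's `GPT2Model.named_parameters()` (the example names below are an assumption for illustration; the new linear input/output layers live outside the pretrained model and are always trained):

```python
def is_finetuned(name):
    """True if a parameter with this (GPT-2-style) name is finetuned
    under the FPT recipe: layer norms and positional embeddings.
    Everything in the self-attention and feedforward blocks stays
    frozen."""
    return ('ln' in name        # ln_1 / ln_2 in each block, final ln_f
            or 'wpe' in name)   # positional embeddings

# Example GPT-2-style parameter names (an assumption for illustration):
names = ['wte.weight', 'wpe.weight',
         'h.0.ln_1.weight', 'h.0.attn.c_attn.weight',
         'h.0.mlp.c_fc.weight', 'ln_f.bias']
print([n for n in names if is_finetuned(n)])
# → ['wpe.weight', 'h.0.ln_1.weight', 'ln_f.bias']
```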

Empirical Evaluations {#sec:experiments}
=====================

In this section, we review the results demonstrating transfer from language to other modalities, and seek to better understand why this occurs and what enables this transfer. All model sizes are the base model size (12 layers, 768 hidden dimension), unless stated otherwise. See Appendix `\ref{app:experimental_details}`{=latex} for more details on experiments.

Can pretrained language models transfer to different modalities? {#sec:transfer}
----------------------------------------------------------------

We investigate whether the self-attention and feedforward layers -- the main body -- of a pretrained transformer can be applied to a classification problem in a different modality without finetuning. To do this, we apply our base procedure as described above, where the input embedding layer, output readout layer, and layer norm parameters are finetuned.

Our results are shown in Figure `\ref{fig:main_result}`{=latex} and also summarized below in Table `\ref{table:main_result}`{=latex}. We compare to the state of the art from the literature when available (full transformer on ListOps, CIFAR-10 LRA, and Remote Homology; LSTM on Remote Homology). Note that the benchmarks from the literature are reported without decimal points, so we report those numbers without a decimal as well.

We find that across all seven tasks considered, FPT achieves comparable performance to the fully trained transformer benchmarks. We believe these results support the idea that these models are learning representations and performing computation that is agnostic to the modality. We also note that both transformer variants significantly outperform LSTMs on some tasks, particularly ListOps and CIFAR-10 LRA, which have long sequence lengths of 512 and 1024, respectively.

On the two bit tasks (Memory and XOR), the models achieve 100% accuracy, i.e. they are able to recover the exact algorithm. Although our tables show results for $n=5$, we find FPT can still recover the exact algorithm for sequence lengths up to $n=256$ (the elementwise XOR of two bitstrings, each of length $256$), hinting that FPT has a fairly large working memory.
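The Bit XOR task above can be sketched as follows; this is a minimal illustrative generator, not the authors' exact data pipeline:

```python
import random

def make_bit_xor_example(n=5, seed=0):
    """One Bit XOR example: two length-n bitstrings presented sequentially
    (one bit per token); the target is their elementwise XOR."""
    rng = random.Random(seed)
    a = [rng.randint(0, 1) for _ in range(n)]
    b = [rng.randint(0, 1) for _ in range(n)]
    return a + b, [x ^ y for x, y in zip(a, b)]

inputs, targets = make_bit_xor_example(n=5)
# Recovering the exact algorithm means predicting targets[i] from
# inputs[i] and inputs[i + n] for every position i.
```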

::: {#table:main_result}
   **Model**   `\multicolumn{1}{c}{\bf Bit Memory}`{=latex}   `\multicolumn{1}{c}{\bf XOR}`{=latex}   `\multicolumn{1}{c}{\bf ListOps}`{=latex}   `\multicolumn{1}{c}{\bf MNIST}`{=latex}   `\multicolumn{1}{c}{\bf CIFAR-10}`{=latex}   `\multicolumn{1}{c}{\bf C10 LRA}`{=latex}   `\multicolumn{1}{c}{\bf Homology}`{=latex}
  ----------- ---------------------------------------------- --------------------------------------- ------------------------------------------- ----------------------------------------- -------------------------------------------- ------------------------------------------- --------------------------------------------
      FPT                          100%                                       100%                                      38.4%                                      98.0%                                      72.1%                                        38.6%                                       12.7%
     Full                          100%                                       100%                                       38%                                       99.1%                                      70.3%                                         42%                                          9%
     LSTM                         60.9%                                       50.1%                                     17.1%                                      99.5%                                      73.6%                                        11.7%                                        12%

: Test accuracy of FPT vs. a fully trained transformer vs. a fully trained LSTM on the downstream tasks (results are transcribed from Figure `\ref{fig:main_result}`{=latex}).
:::

We highlight a few important points for contextualizing these results. We find that it can be difficult to fully train a 12-layer transformer on some of these (relatively small) datasets, as training can either diverge/overfit or be unstable. For CIFAR-10, we report the full transformer results for a 3-layer model; for ListOps and CIFAR-10 LRA we report the number given for the 3-layer model from [@tay2020lra]; for Remote Homology we report the number for a smaller 12-layer model from [@rap2019tape]. From an engineering perspective, this makes the full transformers harder to tune since we must choose model sizes that are stable and avoid overfitting -- see Section `\ref{sec:generalization}`{=latex} for more analysis. In particular, the numbers from [@tay2020lra] are generated from \`\`extensive sweeps over different hyper-parameters" and use task-specific hyperparameters, while we do not tune the hyperparameters for FPT (except for remote homology; see Appendix `\ref{app:experimental_details}`{=latex}). In contrast, we find it is easy to improve the performance of FPT by increasing model size (see Section `\ref{sec:size}`{=latex}) -- the CIFAR-10 number for FPT here is for the 36-layer large model.

Furthermore, unlike some other works utilizing transformers for vision, we use minimal spatial bias to emphasize the universal sequential aspect of the problem -- for instance, we do not interleave self-attention and convolution layers. Note that we also do not use 2D positional embeddings (or other domain-specific techniques), hence providing a very weak inductive prior to the model. Our reasoning for these decisions is to evaluate the ability of transformers to work on arbitrary sequential tasks.
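A minimal sketch of this tokenization (the 4x4 patch size is illustrative): an image is chunked into flattened patches ordered only by raster scan, with 1D sequence positions and no 2D spatial bias.

```python
import numpy as np

def image_to_tokens(img, patch=4):
    """Chunk an HxWxC image into flattened patches ordered by raster scan.
    Positions are 1D sequence indices -- no 2D positional bias is given."""
    h, w, c = img.shape
    rows, cols = h // patch, w // patch
    tokens = (img[:rows * patch, :cols * patch]
              .reshape(rows, patch, cols, patch, c)
              .transpose(0, 2, 1, 3, 4)    # group by (row-block, col-block)
              .reshape(rows * cols, patch * patch * c))
    return tokens, np.arange(rows * cols)  # raster-scan positions only

img = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
tokens, positions = image_to_tokens(img)   # 64 tokens of dimension 48
```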

What is the importance of the pretraining modality? {#sec:pretraining}
---------------------------------------------------

We now compare pretraining on language to other pretraining methods for base model sizes:

-   Random initialization (Random): initializing the frozen transformer parameters randomly using the default initialization choices for GPT-2, i.e. without pretraining.

-   Bit memory pretraining (Bit): pretraining from scratch on the Bit Memory task and then freezing the parameters before transferring. This allows the transformer to gain supervision working with arbitrary bit strings and performing memory/denoising on independent inputs.

-   Image pretraining (ViT): using a pretrained Vision Transformer [@dosovitskiy2020vit] pretrained on ImageNet-21k [@deng2009imagenet]. Note that the architecture is a bit different, notably not using the autoregressive masking of GPT-2, since ViT is only pretrained on classification tasks (for other details, see Appendix `\ref{app:details_pretraining}`{=latex}).

These experiments highlight the significance of pretraining -- as opposed to simply the transformer architecture -- and compare language to other methods of supervision. Our results are shown in Table `\ref{table:random}`{=latex}. Although the random transformers can achieve surprisingly strong accuracies, there is a considerable gap to natural language pretraining: on MNIST, for example, random transformers achieve similar performance to a linear classifier on top of raw features (92%). Thus we believe that while the transformer architecture might be naturally conducive to these evaluations, the attention patterns used for transfer are nontrivial and not fully specified by the architecture alone. We also find that, in addition to performance benefits, language pretraining improves convergence compared to the randomly initialized transformer (see Section `\ref{sec:compute_efficiency}`{=latex}).

::: {#table:random}
   **Model**   `\multicolumn{1}{c}{\bf Bit Memory}`{=latex}   `\multicolumn{1}{c}{\bf XOR}`{=latex}   `\multicolumn{1}{c}{\bf ListOps}`{=latex}   `\multicolumn{1}{c}{\bf MNIST}`{=latex}   `\multicolumn{1}{c}{\bf C10}`{=latex}   `\multicolumn{1}{c}{\bf C10 LRA}`{=latex}   `\multicolumn{1}{c}{\bf Homology}`{=latex}
  ----------- ---------------------------------------------- --------------------------------------- ------------------------------------------- ----------------------------------------- --------------------------------------- ------------------------------------------- --------------------------------------------
      FPT                          100%                                       100%                                      38.4%                                      98.0%                                    68.2%                                     38.6%                                       12.7%
    Random                        75.8%                                       100%                                      34.3%                                      91.7%                                    61.7%                                     36.1%                                        9.3%
      Bit                          100%                                       100%                                      35.4%                                      97.8%                                    62.6%                                     36.7%                                        7.8%
      ViT                          100%                                       100%                                      37.4%                                      97.8%                                    72.5%                                     43.0%                                        7.5%

  :  Test accuracy of language-pretrained (FPT) vs randomly initialized (Random) vs Bit Memory pretraining (Bit) vs pretrained Vision Transformer (ViT) models. The transformer is frozen.
:::

Pretraining on bit memory improves performance compared to the random models, but still lags behind pretraining on natural language data. Furthermore, measured by gradient steps, both pretrained models converge faster than the randomly initialized transformers (more details in Section `\ref{sec:compute_efficiency}`{=latex}), indicating that all modes of pretraining considered improve upon random initialization even before considering accuracy.

Additionally, while freezing a vision transformer yields better performance on CIFAR-10, pretraining on images is not uniformly better; e.g., ViT is worse on protein classification. One hypothesis is that protein sequences are structured like language, in terms of discrete units of information with a \`\`grammar", so transfer from language to proteins may be more natural. `\vspace{2em}`{=latex}

How important is the transformer architecture compared to the LSTM architecture? {#sec:architecture_results}
----------------------------------------------------------------------------

In Section `\ref{sec:pretraining}`{=latex} we found the transformer architecture can already be fairly effective in this regime, even with only random parameters. In this section, we instead use a random LSTM architecture, isolating the raw effect of architecture by ablating pretraining. As with FPT, we finetune the input, output, and layernorm parameters of the LSTMs.

::: {#table:random_architecture}
   **Model**   `\multicolumn{1}{c}{\bf Bit Memory}`{=latex}   `\multicolumn{1}{c}{\bf XOR}`{=latex}   `\multicolumn{1}{c}{\bf ListOps}`{=latex}   `\multicolumn{1}{c}{\bf MNIST}`{=latex}   `\multicolumn{1}{c}{\bf CIFAR-10}`{=latex}   `\multicolumn{1}{c}{\bf C10 LRA}`{=latex}   `\multicolumn{1}{c}{\bf Homology}`{=latex}
  ----------- ---------------------------------------------- --------------------------------------- ------------------------------------------- ----------------------------------------- -------------------------------------------- ------------------------------------------- --------------------------------------------
    Trans.                        75.8%                                       100%                                      34.3%                                      91.7%                                      61.7%                                        36.1%                                        9.3%
     LSTM                         50.9%                                       50.0%                                     16.8%                                      70.9%                                      34.4%                                        10.4%                                        6.6%
   LSTM$^*$                       75.0%                                       50.0%                                     16.7%                                      92.5%                                      43.5%                                        10.6%                                        8.6%

  : Test accuracy of randomly initialized transformers vs randomly initialized LSTM models. Note unlike in Figure `\ref{fig:main_result}`{=latex}, the LSTM here is frozen. Frozen LSTMs perform very poorly. LSTM$^*$ represents an LSTM with additional architecture improvements to match the transformers (see below).
:::

```{=latex}
\vspace{-.5em}
```
Our results are shown in Table `\ref{table:random_architecture}`{=latex}. \`\`LSTM" refers to a 3-layer \`\`standard" LSTM with a hidden dimension of 768, matching standard implementations of LSTMs, without residual connections or positional embeddings (see discussion below). This matches the width of the FPT models, but not the depth or total parameter count. We find that the self-attention architecture already serves as an effective inductive bias for universal computation, improving significantly over the recurrent LSTM model and accounting for most of the improvement in test accuracy from the random LSTM to FPT.

Here, we compare the 3-layer \`\`standard" LSTM to a 12-layer \`\`standard" LSTM. Note that most LSTM implementations, including the one used in Table `\ref{table:random_architecture}`{=latex}, feature neither residual connections nor positional embeddings. We include this comparison to represent the traditional method more faithfully, and add these architectural components below. In the same style as FPT and GPT-2, we do not use a bidirectional LSTM. Under these model choices, we report the performance of a frozen random 3-layer vs 12-layer LSTM in Table `\ref{table:lstm_layers}`{=latex}. Naively stacking layers, the 12-layer model is much worse than the 3-layer model, hinting at some loss of information across repeated LSTM layers.

::: {#table:lstm_layers}
   **Layers**   `\multicolumn{1}{c}{\bf ListOps}`{=latex}   `\multicolumn{1}{c}{\bf MNIST}`{=latex}   `\multicolumn{1}{c}{\bf CIFAR-10}`{=latex}   `\multicolumn{1}{c}{\bf C10 LRA}`{=latex}
  ------------ ------------------------------------------- ----------------------------------------- -------------------------------------------- -------------------------------------------
       12                         16.2%                                      11.7%                                      10.8%                                        10.4%
       3                          16.8%                                      70.9%                                      34.4%                                        10.4%

  : Test accuracy of randomly initialized \`\`standard" LSTMs varying number of layers with a hidden dimension of 768. The simple 12-layer LSTM achieves only near-trivial performance.
:::

```{=latex}
\vspace{-.5em}
```
We also experiment with ablating other architectural improvements included with the transformer architecture in Table `\ref{table:lstm_layers_residual}`{=latex}. Once residual connections [@he2016resnet] are added, the 12-layer LSTM recovers much of the performance drop, hinting that residual connections can compensate for the loss of information across repeated LSTM layers. We also add positional embeddings, which finishes bridging the gap between standard LSTM implementations and the transformer. Even with these additions, the LSTM still performs worse. Note that the final 12-layer LSTM has about the same number of trainable parameters as the transformer.
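These architectural modifications can be sketched as follows; this is a minimal illustrative stack (tiny hidden dimension, random frozen weights), not the authors' implementation: positional embeddings are added to the inputs, and each frozen LSTM layer is wrapped in a residual connection.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden dimension (768 in the paper; tiny here for illustration)

def make_params():
    """Random frozen weights for one LSTM layer (input, recurrent, bias)."""
    return (rng.normal(0, 0.1, (4 * d, d)),
            rng.normal(0, 0.1, (4 * d, d)),
            np.zeros(4 * d))

def lstm_layer(x, params):
    """Run one frozen LSTM layer over a sequence x of shape (T, d)."""
    W, U, b = params
    h, c, out = np.zeros(d), np.zeros(d), []
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    for t in range(x.shape[0]):
        z = W @ x[t] + U @ h + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        out.append(h)
    return np.stack(out)

T = 5
x = rng.normal(size=(T, d))
pos = rng.normal(0, 0.02, (T, d))      # positional embeddings, added once
h = x + pos                            # positions injected at the input
for params in [make_params() for _ in range(12)]:
    h = h + lstm_layer(h, params)      # residual connection per layer
```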

::: {#table:lstm_layers_residual}
          **Model**           `\multicolumn{1}{c}{\bf ListOps}`{=latex}   `\multicolumn{1}{c}{\bf MNIST}`{=latex}   `\multicolumn{1}{c}{\bf CIFAR-10}`{=latex}   `\multicolumn{1}{c}{\bf C10 LRA}`{=latex}
  -------------------------- ------------------------------------------- ----------------------------------------- -------------------------------------------- -------------------------------------------
        12-Layer LSTM                           16.2%                                      11.7%                                      10.8%                                        10.4%
   \+ Residual Connections                      16.8%                                      70.9%                                      34.4%                                        10.4%
   \+ Positional Embeddings                     16.7%                                      92.5%                                      43.5%                                        10.6%
      Random Transformer                        34.3%                                      91.7%                                      61.7%                                        36.1%

: Test accuracy of 12-layer randomly initialized \`\`standard" LSTMs with additional architectural modifications to match the transformers: residual connections and positional embeddings. With both modifications, the LSTM architecture is nearly identical to GPT-2, apart from recurrence replacing self-attention.
:::

Does language pretraining improve compute efficiency over random initialization? {#sec:compute_efficiency}
--------------------------------------------------------------------------------

We investigate compute efficiency by considering the number of gradient steps to converge for FPT vs random transformer models, shown in Table `\ref{table:convergence}`{=latex}. We generally find FPT converges faster, which indicates language pretraining can yield compute benefits for non-language tasks. While random transformer models achieve decent test accuracies, in particular when compared to random LSTMs, there is still a considerable gap in compute efficiency compared to using pretraining. Note that bit memory pretraining, introduced in Section `\ref{sec:pretraining}`{=latex}, generally falls between the two models; notably, it is $6 \times$ slower than FPT on Bit XOR, though still significantly faster than random.

::: {#table:convergence}
    **Model**    `\multicolumn{1}{c}{\bf Memory}`{=latex}   `\multicolumn{1}{c}{\bf XOR}`{=latex}   `\multicolumn{1}{c}{\bf ListOps}`{=latex}   `\multicolumn{1}{c}{\bf MNIST}`{=latex}   `\multicolumn{1}{c}{\bf C10}`{=latex}   `\multicolumn{1}{c}{\bf C10 LRA}`{=latex}   `\multicolumn{1}{c}{\bf Homology}`{=latex}
  ------------- ------------------------------------------ --------------------------------------- ------------------------------------------- ----------------------------------------- --------------------------------------- ------------------------------------------- --------------------------------------------
       FPT                   $1 \times 10^4$                           $5 \times 10^2$                           $2 \times 10^3$                            $5 \times 10^3$                          $4 \times 10^5$                           $3 \times 10^5$                             $1 \times 10^5$
     Random                  $4 \times 10^4$                           $2 \times 10^4$                           $6 \times 10^3$                            $2 \times 10^4$                          $4 \times 10^5$                           $6 \times 10^5$                             $1 \times 10^5$
   **Speedup**                  $4 \times$                               $40\times$                                $3 \times$                                 $4 \times$                               $1 \times$                                $2 \times$                                   $1 \times$

  : Approximate number of gradient steps until convergence for pretrained (FPT) vs randomly initialized (Random) models. Note that we use the same batch size and learning rate for both models.
:::

Do the frozen attention layers attend to modality-specific tokens? {#sec:attention_maps}
------------------------------------------------------------------

We investigate if FPT attends to semantically meaningful patterns in the data. We plot the attention weights (i.e. the values of the softmax of query-key dot product) from the first layer. We show the results in Figures `\ref{fig:attn_xor_pretrained}`{=latex} and `\ref{fig:attn_memory_pretrained}`{=latex} for the bit tasks. Note GPT-2 is autoregressive, so the upper right corner of the attention mask is zeroed out. On these tasks, FPT yields an interpretable attention pattern despite not training the self-attention layers themselves. We did not find easily interpretable patterns on the other tasks.
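The attention weights we plot can be computed as follows; a minimal sketch with random query/key vectors standing in for the first layer's projections:

```python
import numpy as np

def causal_attention_map(Q, K):
    """Softmax of the query-key dot product with GPT-2's autoregressive
    mask: entries above the diagonal (future tokens) are zeroed out."""
    T, dk = Q.shape
    scores = (Q @ K.T) / np.sqrt(dk)
    scores[np.triu_indices(T, k=1)] = -np.inf   # mask future positions
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(10, 16))   # e.g. the 10 tokens of Bit XOR, head dim 16
K = rng.normal(size=(10, 16))
A = causal_attention_map(Q, K)  # rows sum to 1; upper triangle is 0
```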

```{=latex}
\centering
```
![ On Bit XOR, the model must produce the element-wise XOR of two bitstrings presented sequentially (inputs 0-4 are the first bitstring, inputs 5-9 are the second). Each token is one bit. FPT learns to attend positionally to the two bits that are XOR'ed by the output token. ](figures/attention_maps/xor_pretrained.png){#fig:attn_xor_pretrained width="0.55\\linewidth"}

```{=latex}
\centering
```
![ On Bit Memory, the model must return one of five strings (inputs 0-99) given a masked version of one of the strings (inputs 100-119). Each token is 50 bits. FPT learns to attend to the correct string based on finding similarity to the inputs, not relying solely on position as in Bit XOR. ](figures/attention_maps/memory_pretrained.png){#fig:attn_memory_pretrained width=".95\\linewidth"}

We also include the attention map for Bit XOR using a randomly initialized transformer (which also solves the task) in Figure `\ref{fig:attn_xor_random}`{=latex}. This model also learns to exploit the diagonal pattern, although the pattern is weaker. This indicates that while the random transformer still learns to solve the task, it learns a weaker, less semantically interpretable attention pattern.

```{=latex}
\centering
```
![ A transformer with frozen randomly initialized self-attention layers also learns to correlate the two diagonal elements on Bit XOR, although the magnitude of the diagonals is lower (note the extra attention weights distributed in between the diagonals). ](figures/attention_maps/xor_random.png){#fig:attn_xor_random width="0.3\\linewidth"}

Does freezing the transformer prevent overfitting or underfitting? {#sec:generalization}
------------------------------------------------------------------

Our general findings are that -- in contrast to their fully trained counterparts -- FPT models underfit the data, which makes them amenable to further improvements from increased model capacity (see Section `\ref{sec:size}`{=latex}). For example, consider CIFAR-10 LRA, which is maximally difficult due to the lack of inductive prior over the sequence (each pixel is fed in as an arbitrary token, ordered only by a raster scan) and the relatively small dataset (50k images). In Table `\ref{table:generalization}`{=latex}, we show the train/test gap for FPT vs a 3-layer transformer from [@tay2020lra], which we find gives stronger results than our experiments. In particular, they are much better than training a 12-layer transformer, which works poorly. Our results indicate that FPT generally provides generalizable task representations without causing overfitting, whereas fully trained transformers can overfit arbitrarily poorly in low-data regimes (such as Linformer, which overfit the most out of the architectures tested by [@tay2020lra]). Future work could investigate how to increase model expressiveness, which could yield performance benefits.

::: {#table:generalization}
        **Model**        **\# Layers**   `\multicolumn{1}{c}{\bf Test Accuracy}`{=latex}   `\multicolumn{1}{c}{\bf Train Accuracy}`{=latex}
  --------------------- --------------- ------------------------------------------------- --------------------------------------------------
       FPT (GPT-2)            12                              38.6%                                             38.5%
   Vanilla Transformer         3                               42%                                               70%
        Linformer              3                               39%                                               97%

  : Train vs test accuracies on CIFAR-10 LRA task.
:::

Does performance scale with model size? {#sec:size}
---------------------------------------

We evaluate the efficacy of adding more parameters to these models on CIFAR-10. Most of the additional parameters are in the transformer layers and are trained during the natural language pretraining phase. Our results for pretrained and random models are in Table `\ref{table:larger_models}`{=latex}. Unlike fully training a transformer, which exhibits more overfitting and divergence during training with larger models, increasing model size stably increases the capacity of the models. This result indicates our observations and results are likely to scale as we move towards larger models and higher-data regimes.

::: {#table:larger_models}
   **Model Size**   `\multicolumn{1}{c}{\bf \# Layers}`{=latex}   `\multicolumn{1}{c}{\bf Total Params}`{=latex}   **Trained Params**   `\multicolumn{1}{c}{\bf FPT}`{=latex}   `\multicolumn{1}{c}{\bf Random}`{=latex}
  ---------------- --------------------------------------------- ------------------------------------------------ -------------------- --------------------------------------- ------------------------------------------
    Small (Base)                        12                                             117M                               106K                          68.2%                                    61.7%
       Medium                           24                                             345M                               190K                          69.8%                                    64.0%
       Large                            36                                             774M                               300K                          72.1%                                    65.7%

  : Test accuracy of larger frozen transformer models on CIFAR-10.
:::

Can performance be attributed simply to better statistics for initialization? {#sec:initialization}
-----------------------------------------------------------------------------

In this section, we take the layer-wise mean and standard deviation of the pretrained model and use them to initialize a random transformer, in order to test whether a better initialization scheme via \`\`oracle" statistics can recover the performance of FPT. Note that the GPT-2 initialization scheme initializes parameters as Gaussians, traditionally with a default standard deviation of $0.02$. For clarity, we show the standard deviation by layer for the weights and biases of the attention and feedforward layers of the pretrained models in Figure `\ref{fig:statistics}`{=latex}.
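A minimal sketch of this \`\`statistics only" initialization, using hypothetical parameter names and stand-in weight tensors:

```python
import numpy as np

rng = np.random.default_rng(0)

def statistics_only_init(pretrained):
    """Re-initialize every tensor as a Gaussian matching the layer-wise
    mean and standard deviation of the pretrained weights, instead of
    GPT-2's default (mean 0, std 0.02)."""
    return {name: rng.normal(w.mean(), w.std(), w.shape)
            for name, w in pretrained.items()}

# Stand-ins for real pretrained tensors (names and values are hypothetical):
pretrained = {"h.0.attn.weight": rng.normal(0.0, 0.05, (64, 64)),
              "h.0.mlp.weight":  rng.normal(0.0, 0.12, (64, 256))}
reinit = statistics_only_init(pretrained)
```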

```{=latex}
\centering
```
![ Standard deviation of the parameters by layer for the pretrained GPT-2 model versus default initialization hyperparameters ($0.02$ for weights and $0$ for biases). ](figures/statistics/statistics.png "fig:"){#fig:statistics width="1\\linewidth"} `\vspace{1em}`{=latex} ![ Legend for the plot above. ](figures/statistics/stats_legend.png "fig:"){width="0.6\\linewidth"}

```{=latex}
\vspace{-1em}
```
We show the results using this initialization scheme in Table `\ref{table:initialization}`{=latex} (note that all of the weights, biases, layer norm, and positional embeddings are initialized -- both mean and variance -- in this fashion). This yields better results on most tasks, but does poorly on CIFAR-10. As a result, we believe the benefits of language pretraining cannot be recovered with a simple better initialization scheme, although we believe future work in transformer initialization could yield different results.

::: {#table:initialization}
   **Initialization**   `\multicolumn{1}{c}{\bf Memory}`{=latex}   `\multicolumn{1}{c}{\bf XOR}`{=latex}   `\multicolumn{1}{c}{\bf ListOps}`{=latex}   `\multicolumn{1}{c}{\bf MNIST}`{=latex}   `\multicolumn{1}{c}{\bf C10}`{=latex}   `\multicolumn{1}{c}{\bf C10 LRA}`{=latex}   `\multicolumn{1}{c}{\bf Homology}`{=latex}
  -------------------- ------------------------------------------ --------------------------------------- ------------------------------------------- ----------------------------------------- --------------------------------------- ------------------------------------------- --------------------------------------------
       Pretrained                         100%                                     100%                                      38.4%                                      98.0%                                    68.2%                                     38.6%                                       12.7%
    Statistics Only                       100%                                     100%                                      37.4%                                      97.2%                                    56.5%                                     33.1%                                       11.0%
        Default                          75.8%                                     100%                                      34.3%                                      91.7%                                    61.7%                                     36.1%                                        9.3%

  : Test accuracy when initializing parameters with pretrained weights (i.e., FPT) vs randomly initializing parameters according to the mean and variance of the pretrained transformer (Statistics Only) vs random initialization with default parameters (Default).
:::

```{=latex}
\vspace{-1.8em}
```
Can we train a transformer by only finetuning the output layer? {#sec:reservoir}
---------------------------------------------------------------

We consider using FPT solely as a naive feature extractor for linear classification: we fix a randomly initialized input layer and freeze all parts of the model except the output. Note that this resembles reservoir computing/echo state networks (see Section `\ref{sec:gwt}`{=latex} for discussion). The model evaluates every example in the training set once and caches the features, after which we train a linear output layer. Subsequent epochs thus run extremely quickly, although this setup does not easily handle dropout/data augmentations, and it scales well in the number of epochs but not in dataset size. Note that this is mathematically equivalent to linear classification on the cached features. Our results are shown in Table `\ref{table:linear}`{=latex}. Although the speedups are extremely significant and the models obtain nontrivial performance, accuracy degrades substantially and the models also exhibit overfitting (likely due to lack of regularization; unlike the training of FPT, dropout is not applied).
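This feature-caching setup can be sketched as follows, with a small random feed-forward network standing in for the frozen transformer:

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_features(X, W_in, W_body):
    """Stand-in for the frozen model: a fixed random input layer followed
    by a fixed random body; only the output layer will be trained."""
    return np.tanh(np.tanh(X @ W_in) @ W_body)

n, d_in, d = 200, 10, 32
X = rng.normal(size=(n, d_in))
y = rng.integers(0, 2, size=n).astype(float)    # toy binary labels
W_in = rng.normal(size=(d_in, d))
W_body = rng.normal(size=(d, d))

F = frozen_features(X, W_in, W_body)  # computed once per example, then cached
W_out, *_ = np.linalg.lstsq(F, y, rcond=None)   # train only the linear readout
train_acc = ((F @ W_out > 0.5).astype(float) == y).mean()
```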

::: {#table:linear}
     **Task**       **Speedup**      `\multicolumn{1}{c}{\bf Output Only}`{=latex}   `\multicolumn{1}{c}{\bf FPT}`{=latex}   `\multicolumn{1}{c}{\bf Full Transformer}`{=latex}
  -------------- ------------------ ----------------------------------------------- --------------------------------------- ----------------------------------------------------
     ListOps      $500-2000\times$                       32.8%                                       38.4%                                          38%
   CIFAR-10 LRA   $500-2000\times$                       24.7%                                       38.6%                                          42%

  : Training only the output layer as a linear regression problem. Speedup refers to wall clock time per epoch (after the first). Larger models have larger speedups.
:::

What is the role of model depth in token mixing? {#sec:model_depth}
------------------------------------------------

One interesting question is the importance of the depth of the transformer for generating representations that \`\`mix" tokens: for instance, if there is only one layer and the parameters are random, the tokens are unlikely to be mixed well, whereas with many layers there are many chances for the tokens to mix and form interesting representations useful for downstream tasks. We investigate this on ListOps by considering pretrained vs random models, where we take only the first X layers of the 12-layer pretrained model (i.e. for X=3, we use the first 3 layers of the pretrained GPT-2 model and perform classification from those hidden states). Additionally, to maximally highlight the importance of the pretrained parameters, we randomly initialize the input layer and do not train the input or positional parameters. We first show results from finetuning the output layer and layernorm parameters, and then from finetuning only the output layer.

**With finetuning layernorm.** We first investigate this question with finetuning of the layernorm parameters (i.e. we finetune only the output layer and the layernorm parameters). Results are shown in Table `\ref{table:depth_ln}`{=latex}. Both models are unable to do well with only one layer, but the pretrained model performs significantly better than the random model at 2 layers, indicating that while the difference in performance at 12 layers is relatively small, there is a large benefit to using pretrained layers when the model is shallow, in that the tokens are \`\`mixed" faster.

::: {#table:depth_ln}
   **Number of Layers**   `\multicolumn{1}{c}{\bf Pretrained}`{=latex}   `\multicolumn{1}{c}{\bf Random}`{=latex}
  ---------------------- ---------------------------------------------- ------------------------------------------
            1                                 17%                                          17%
            2                                 36%                                          16%
            6                                 38%                                          35%

  : Test accuracy on ListOps while varying model depth and finetuning layernorm parameters. Pretrained layers \`\`mix" the tokens faster, performing better at low model depths.
:::

**Without finetuning layernorm.** We now investigate this question without finetuning the layernorm parameters, only finetuning the output parameters, as in the reservoir computing setup in Section `\ref{sec:reservoir}`{=latex}. Note this is equivalent to linear classification. This setting is the most challenging: all processing that is able to mix tokens is done by frozen (random or pretrained) parameters, and we only train a linear layer on top of the output of the last token, so the *only* token mixing is performed by the frozen self-attention layers. Results are shown in Table `\ref{table:depth_no_ln}`{=latex}. The random model does not do well even with a large number of layers, while the pretrained model can still do reasonably well, although it requires more layers than before.

::: {#table:depth_no_ln}
   **Number of Layers**   `\multicolumn{1}{c}{\bf Pretrained}`{=latex}   `\multicolumn{1}{c}{\bf Random}`{=latex}
  ---------------------- ---------------------------------------------- ------------------------------------------
            1                                 12%                                           \-
            3                                 18%                                           \-
            6                                 33%                                           \-
            12                                33%                                          17%
            24                                 \-                                          17%

  : Test accuracy on ListOps while varying model depth and only training output parameters. Even with a large number of layers, the random model does not learn to perform well.
:::

Can training more parameters improve performance? {#sec:moreparams}
-------------------------------------------------

Our focus in this work was primarily to investigate if and how efficient, general-purpose pretraining can transfer across modalities. For practical applications, however, it would be natural to choose a more specialized finetuning scheme or to add more trainable parameters. In this section, we investigate additionally finetuning parameters with various methods, to see if frozen language transformers can serve as a practical base for future work.

We first investigate additionally finetuning the self-attention and feedforward layers, which were previously frozen. We simply add them to the list of finetuned parameters without changing the optimization or learning rate scheme, although this is suboptimal. Our results are shown in Table `\ref{table:finetune_attn_ff}`{=latex}. Note that +Both corresponds to fully finetuning the 12-layer transformer (in other sections, we use \`\`full transformer" to denote fully finetuning a transformer from scratch with tuned depth, whereas here the depth is fixed). We find that finetuning the feedforward layers can improve performance, similar to techniques used in prior work [@houlsby2019adapter], whereas finetuning the attention layers can lead to divergence.
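
A sketch of how such a parameter list could be assembled (the toy module and name substrings below are illustrative stand-ins; GPT-2's actual parameter names differ):

```python
import torch.nn as nn

class ToyBlock(nn.Module):
    """Illustrative stand-in with GPT-2-style submodule names."""
    def __init__(self):
        super().__init__()
        self.ln_1 = nn.LayerNorm(4)   # layernorm (always finetuned)
        self.attn = nn.Linear(4, 4)   # stand-in for self-attention weights
        self.mlp = nn.Linear(4, 4)    # stand-in for feedforward weights
        self.head = nn.Linear(4, 2)   # output layer (always finetuned)

def select_finetuned(model, also_finetune=()):
    """Freeze everything except layernorm/output parameters, plus any
    extra name substrings (e.g. "mlp" or "attn") passed in also_finetune."""
    always = ("ln_", "head")
    params = []
    for name, p in model.named_parameters():
        trainable = any(k in name for k in always + tuple(also_finetune))
        p.requires_grad = trainable
        if trainable:
            params.append(p)
    return params

model = ToyBlock()
base = select_finetuned(model)               # layernorm + output only
plus_ff = select_finetuned(model, ("mlp",))  # additionally unfreeze feedforward
```

The returned list is what would be handed to the optimizer; adding `"attn"` to `also_finetune` corresponds to the +Attention row.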

::: {#table:finetune_attn_ff}
     **Model**      `\multicolumn{1}{c}{\bf Memory}`{=latex}   `\multicolumn{1}{c}{\bf XOR}`{=latex}   `\multicolumn{1}{c}{\bf ListOps}`{=latex}   `\multicolumn{1}{c}{\bf MNIST}`{=latex}   `\multicolumn{1}{c}{\bf C10}`{=latex}   `\multicolumn{1}{c}{\bf C10 LRA}`{=latex}   `\multicolumn{1}{c}{\bf Homology}`{=latex}
  ---------------- ------------------------------------------ --------------------------------------- ------------------------------------------- ----------------------------------------- --------------------------------------- ------------------------------------------- --------------------------------------------
        FPT                           100%                                     100%                                      38.4%                                      98.0%                                    68.2%                                     38.6%                                       12.7%
   \+ Feedforward                     100%                                     100%                                      36.0%                                      98.3%                                    76.6%                                     38.2%                                       13.1%
    \+ Attention                      100%                                     100%                                      36.8%                                 89.0%$^\dagger$                          47.7%$^\dagger$                                23.0%                                       10.9%
      \+ Both                         100%                                     100%                                      35.8%                                 93.1%$^\dagger$                               32.9%                                     21.0%                                       10.5%

  : Additionally finetuning the feedforward layers, the attention layers, or both. We do not use a per-layer learning rate scheme or other stabilization techniques. $^\dagger$ Training diverged; the number reported is from before divergence.
:::

On CIFAR-10, we experiment with additionally finetuning the last attention layer, shown in Table `\ref{table:morelayers}`{=latex}. Generally, we find that finetuning more parameters in a targeted way can yield better performance, so we are optimistic about the possibility of multimodal training/architectures *improving* performance in future work.

::: {#table:morelayers}
   **Task**   `\multicolumn{1}{c}{\bf Base (FPT)}`{=latex}   `\multicolumn{1}{c}{\bf + Finetuning All FF Layers}`{=latex}   `\multicolumn{1}{c}{\bf + Finetuning Last Attn Layer}`{=latex}
  ---------- ---------------------------------------------- -------------------------------------------------------------- ----------------------------------------------------------------
   CIFAR-10                      68.2%                                                  76.6%                                                           80.0%

  : Test accuracy on CIFAR-10 when finetuning additional parameters. In addition to FPT, if we finetune the feedforward layers and the last self-attention layer, we can achieve 80% accuracy.
:::

Which parameters of the model are important to finetune? {#sec:params}
--------------------------------------------------------

We now run ablations finetuning only select parameters to see which parameters are most sensitive. Note that for all experiments (including the previous ones), we initialize the input layer with a Gaussian distribution if embeddings are used, or with an orthogonal initialization for linear layers; in particular, we find orthogonal initialization to be very important when the input parameters are not trained. We highlight some results in Table `\ref{table:finetuning_add}`{=latex}; full results are shown on Page `\pageref{table:finetuning_indep}`{=latex}. Similar to a study of random CNNs by [@frankle2020batchnorm], we generally find the layer norm parameters to be the most important.
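
The orthogonal input initialization can be sketched as follows (the dimensions are placeholders; 768 matches GPT-2 base's embedding size). A semi-orthogonal weight preserves input norms, which is plausibly why it matters when the input layer is left untrained:

```python
import torch
import torch.nn as nn

# Input layer projecting raw tokens (e.g., image patches or bit vectors)
# into the transformer's embedding space.
input_layer = nn.Linear(16, 768, bias=False)
nn.init.orthogonal_(input_layer.weight)

x = torch.randn(4, 16)
y = input_layer(x)
# Since the 16 columns of the weight matrix are orthonormal, each row of y
# has the same norm as the corresponding row of x.
```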

::: {#table:finetuning_add}
     **Task**     `\multicolumn{1}{c}{\bf output only}`{=latex}   `\multicolumn{1}{c}{\bf + layernorm}`{=latex}   `\multicolumn{1}{c}{\bf + input}`{=latex}   `\multicolumn{1}{c}{\bf + positions}`{=latex}
  -------------- ----------------------------------------------- ----------------------------------------------- ------------------------------------------- -----------------------------------------------
    Bit Memory                         76%                                             94%                                          100%                                          100%
     Bit XOR                           56%                                             98%                                           98%                                          100%
     ListOps                           15%                                             36%                                           36%                                           38%
      MNIST                            23%                                             96%                                           98%                                           98%
     CIFAR-10                          25%                                             54%                                           60%                                           68%
   CIFAR-10 LRA                        17%                                             39%                                           39%                                           39%
     Homology                          2%                                              9%                                            10%                                           13%

  : Ablation by successively adding certain parameters to the list of finetuned parameters for pretrained frozen transformers.
:::

Is finetuning layer norm necessary for FPT to perform well?
-----------------------------------------------------------

While we previously showed performance gains from finetuning layer norm, we could instead only finetune the input and output layers, treating the entire GPT model as a black box. We show results on CIFAR-10 in Table `\ref{table:nolayernorm}`{=latex}. The model performs worse; note the accuracy is similar to that of not finetuning the positional embeddings (see Section `\ref{sec:params}`{=latex}). This suggests the internal modulation provided by the affine layer norm parameters helps, possibly by about as much as finer positional information.

::: {#table:nolayernorm}
   **Initialization**   `\multicolumn{1}{c}{\bf Frozen Layer Norm}`{=latex}   `\multicolumn{1}{c}{\bf Finetuned Layer Norm}`{=latex}
  -------------------- ----------------------------------------------------- --------------------------------------------------------
       Pretrained                              61.5%                                                  68.2%
         Random                                55.0%                                                  61.7%

  : Test accuracy on CIFAR-10 when only finetuning the input and output layer parameters.
:::

How well do the trends hold across other transformer models? {#sec:alternative_architectures}
------------------------------------------------------------

We also investigate how other transformer architectures perform when substituted for GPT-2: BERT [@devlin2019bert], T5 [@raffel2019t5], and Longformer [@beltagy2020longformer]. For T5, we use only the encoder, not the decoder. Our results are in Table `\ref{table:nlp_architectures}`{=latex}. We find the results roughly hold across architectures, with some differences -- T5 tends to be slightly worse than the other models. An interesting question for future work is whether subtle differences in architecture, pretraining objective, or dataset contribute to these differences.

::: {#table:nlp_architectures}
   **Task**   `\multicolumn{1}{c}{\bf GPT-2 (FPT Default)}`{=latex}   `\multicolumn{1}{c}{\bf BERT}`{=latex}   `\multicolumn{1}{c}{\bf T5}`{=latex}   `\multicolumn{1}{c}{\bf Longformer}`{=latex}
  ---------- ------------------------------------------------------- ---------------------------------------- -------------------------------------- ----------------------------------------------
   ListOps                            38.4%                                           38.3%                                   15.4%                                      17.0%
   CIFAR-10                           68.2%                                           68.8%                                   64.7%                                      66.8%

  : Test accuracy for frozen pretrained transformer variants (base model sizes).
:::

Related Work and Discussion {#sec:relatedwork}
===========================

Transformers in multimodal settings
-----------------------------------

Transformers [@vaswani2017attention] were first used successfully for natural language processing [@radford2018gpt; @devlin2019bert; @radford2019gpt2; @brown2020gpt3]. In recent years, they have also been shown to be effective architectures for other modalities. One modality of particular interest is computer vision [@chen2020imagegpt; @touvron2020deit]; notably, [@dosovitskiy2020vit] showed that transformers can outperform CNNs in the high-data regime on standard object recognition benchmarks such as ImageNet and CIFAR. Furthermore, transformers have also been used for prediction tasks over protein sequences [@jumper2021alphafold; @rao2021msa], reinforcement learning [@parisotto2020stabilizing], and imitation learning [@abramson2020imitating].

Works specifically tackling multimodal tasks include [@kaiser2017multitask], which showed a single model could learn a variety of multimodal tasks with an attention architecture. Recent work has utilized transformers for multimodal predictive tasks, such as images and text in ViLBERT [@lu2019vilbert] and CLIP [@radford2021clip]; these approaches generally use two distinct transformers to embed images and text. [@lu2020vilbertmulti] applies ViLBERT to train a single model for a variety of combined vision and language tasks. Recent work from OpenAI [@goh2021multimodal] finds that some neurons learned by CLIP are activated by a particular semantic concept, regardless of whether the concept is presented in language or picture form. Our work is most similar to DALL-E [@ramesh2021dalle], which uses a single transformer to embed both the image and text modalities; we consider this to be generating a \`\`universal latent space" that projects any type of input into a single latent space. Such a latent space would be useful for a model that could learn from many sources of supervision.

Transformers in transfer settings
---------------------------------

There are also many works looking at transformers specifically in the context of in-modality transfer, such as ViT for vision [@dosovitskiy2020vit], T5 for language [@raffel2019t5], and UDSMProt for protein sequences [@strodthoff2020udsmprot]. CLIP [@radford2021clip] showed that training on text in addition to images could allow for zero-shot classification via providing downstream labels as text. [@hernandez2021scaling] perform a thorough investigation of transfer with language pretraining, notably showing transfer from English to Python, which they consider to be reasonably distanced from English; many works have also looked at transferring from one language to another [@artetxe2019cross; @ponti2019towards]. Similar to our work, [@papadimitriou2020music] investigate transfer for LSTMs between modalities including code, different languages, and music, finding that pretraining on \`\`non-linguistic data with latent structure" can transfer to language and that grammatical structure in a modality is important; we generally investigate the other direction and explore more distanced modalities. [@kiela2019supervised] make similar observations for aligning representation spaces of language and vision. [@li2020communication] pretrain on a referential communication game where an emergent learned language is used to transfer to NLP tasks. [@wu2021lime] found that explicitly pretraining on computational primitives transfers to mathematics tasks.

Pretraining and finetuning of transformer models
------------------------------------------------

A common trend in deep learning is to first train a large model with an unsupervised objective on a large dataset [@dai2015semi; @radford2018gpt] and then finetune on a small downstream dataset (e.g., by freezing the model and only finetuning the output layer). A common method for finetuning transformers is adapter networks [@rebuffi2017adapter; @houlsby2019adapter], which add a fully connected residual block for each unique downstream task and also finetune the layer norm parameters. For simplicity, we do not add the full adapter block but only train the layer norm parameters, reducing the number of parameters we consider. These techniques are similar to prior approaches such as FiLM [@perez2018film] and self-modulation [@chen2018selfmodulation]. A recent direction of research has explored learning prompt templates for large models [@shin2020autoprompt] that simply require forward passes over the transformer. Unlike these works, we consider pretraining on one modality (language) and finetuning on others, whereas prior work investigates finetuning on the same modality as the pretraining task. Another interesting related work, although not investigating transformers, is that of [@frankle2020batchnorm], who find that randomly initialized CNNs which train only the batchnorm affine parameters work well on CIFAR-10. Their numbers on CIFAR-10 are stronger than ours, but include significantly more inductive bias via a convolutional architecture, so their takeaway is more relevant to image tasks than to arbitrary sequences.

Self-attention layers as optimization steps
-------------------------------------------

The nature of computation performed by self-attention layers has also been explored by other related works. [@bai2019deq] show that a single transformer self-attention block can be trained to perform an optimization step towards finding a stationary point, representing the solution to the task. [@ramsauer2020hopfield] show that the self-attention layer is a gradient step in a Hopfield network with a learning rate of 1, hinting that transformers are capable of storing and retrieving a large amount of patterns with an implicit energy function. An interesting discussion from [@goyal2020inductive] points out a connection in viewing the key-value queries used in attention as similar to function signatures in computer programming: the key maps the input to a type (e.g., float) and the value maps the input to its value (e.g., $3.14$), and if the type matches the function signature, the function can be applied to the value -- this may be particularly relevant when we consider using a single self-attention layer applied to different modalities, as the modality may be embedded in the type.

Global workspace theory {#sec:gwt}
-----------------------

A common technique for evaluating the embeddings learned by an unsupervised model is to train a linear layer on top of the embeddings for a downstream task [@donahue2016bigan; @oord2018cpc; @chen2020simclr], which is reasonable when finetuning on the same modality as the pretraining one. However, when finetuning on a different modality, as in our setting, we have to reframe this notion of generalizable embedding quality: instead of only finetuning the output layer, we also want to finetune the input layer, and instead evaluate the ability of the frozen intermediate model to perform generalizable *computation*. This is reminiscent of Global Workspace Theory [@baars1993gwt], which revolves around the notion that there is a \`\`blackboard" that different parts of the brain send data to; we might consider the frozen language model as being such a blackboard in this setting. Language might also be a natural choice for this blackboard, as there are hypotheses that language may serve as a good multipurpose high-level representation for cognitive behavior and conscious planning [@andreas2017l3; @goyal2020inductive].

Reservoir computing {#sec:resevoir}
-------------------

Similarly to the FPT setup and Global Workspace Theory, in reservoir computing [@tanaka2019reservoir] and echo state networks [@jaeger2001echo; @jaeger2004harnessing], a random recurrent network is frozen and only the output readout layer is trained. These models are very fast to train, using a setup similar to that in Section `\ref{sec:reservoir}`{=latex}, because the activations of the recurrent network can be cached and it is unnecessary to backpropagate over time. Somewhat differently from the FPT architecture, echo state networks are recurrent and thus feed back into themselves, which allows the outputs of the random frozen network to modulate future inputs. Unlike echo state networks, we also notably finetune the input and positional embeddings, which allow the inputs to the frozen network to adapt to a particular modality, i.e., for a query to the frozen network to be learned. Echo state networks are also similar to the perspective of self-attention applying a data-dependent filter to the inputs, as opposed to 1D convolutions, which are fixed filters regardless of the input modality.
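
A minimal echo state network sketch (the dimensions and the 0.9 spectral radius are illustrative choices; only the readout would be trained):

```python
import torch

torch.manual_seed(0)
d_in, d_res = 8, 64

# Frozen random input and recurrent weights; rescaling the recurrent matrix
# to spectral radius < 1 gives the "echo state" (fading memory) property.
W_in = 0.5 * torch.randn(d_res, d_in)
W_res = torch.randn(d_res, d_res)
W_res *= 0.9 / torch.linalg.eigvals(W_res).abs().max()

def reservoir_state(inputs):
    """inputs: (T, d_in) sequence; returns the final reservoir state.
    No gradients flow through the reservoir, so states can be cached."""
    h = torch.zeros(d_res)
    for x in inputs:
        h = torch.tanh(W_in @ x + W_res @ h)  # frozen recurrent update
    return h

readout = torch.nn.Linear(d_res, 2)  # the only trained parameters
```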

Conclusion
==========

We proposed transferring a pretrained transformer language model for downstream tasks in non-language modalities. Through extensive empirical evaluation, we showed that these models could achieve performance competitive with transformers fully trained on the downstream task without having to finetune the self-attention and feedforward layers, relying solely on frozen parameters from the language model to perform the bulk of the computation.

We believe this work can serve as the foundation for future work investigating transfer between modalities. In future work, we are interested in investigating the use of other data-rich modalities (e.g., vision), or a hybrid of multiple domains, to provide the necessary substrate for pretraining a universal computation engine. It would also be interesting to explore frozen pretrained models for tasks beyond predictive modeling, such as reinforcement learning [@abramson2020imitating].

We note that a limitation of our analysis is that we analyze specific models on a restricted set of tasks. More investigation can highlight whether or not similar behavior occurs for other models on other tasks. For instance, in Section `\ref{sec:alternative_architectures}`{=latex}, we find the architecture can have a significant impact on results. As training regimes for these models evolve, performing similar experiments may yield different results, and we are excited for more research in this direction.

For high-stakes applications in the real world, there are potential concerns with the transfer of harmful biases from one modality to another when using pretrained transformer models trained on vast quantities of unlabeled, uncurated data [@sheng2019woman; @bender2021dangers]. Mitigating these biases is an active area of research [@grover2019bias; @choi2020fair]. Conversely, there are also potential upsides of FPT models being able to better exploit representative datasets from one or more modalities, which merit future investigation as well.

Acknowledgements {#acknowledgements .unnumbered}
================

```{=latex}
\addcontentsline{toc}{section}{Acknowledgements}
```
We would like to thank Luke Metz, Kimin Lee, Fangchen Liu, Roshan Rao, Aravind Srinivas, Nikita Kitaev, Daniel Freeman, Marc'Aurelio Ranzato, Jacob Andreas, and Ashish Vaswani for valuable feedback and discussions. We would also like to thank members of the community for providing feedback online on an earlier version of this paper.

```{=latex}
\clearpage
```
Parameter ablations for pretrained models {#parameter-ablations-for-pretrained-models .unnumbered}
=========================================

::: {#table:finetuning_indep}
     **Task**     `\multicolumn{1}{c}{\bf output only}`{=latex}   `\multicolumn{1}{c}{\bf output + input}`{=latex}   `\multicolumn{1}{c}{\bf output + positions}`{=latex}   `\multicolumn{1}{c}{\bf output + layernorm}`{=latex}
  -------------- ----------------------------------------------- -------------------------------------------------- ------------------------------------------------------ ------------------------------------------------------
    Bit Memory                         76%                                            **98%**                                                93%                                                    94%
     Bit XOR                           56%                                              72%                                                  84%                                                  **98%**
     ListOps                           15%                                              17%                                                  35%                                                  **36%**
      MNIST                            23%                                              85%                                                  93%                                                  **96%**
     CIFAR-10                          25%                                              53%                                                  38%                                                  **54%**
   CIFAR-10 LRA                        17%                                              22%                                                  30%                                                  **39%**
     Homology                          2%                                                8%                                                   8%                                                   **9%**

  : Ablation by only finetuning individual types of parameters for pretrained frozen transformers. We bold the most important parameter (measured by highest test accuracy) for each task.
:::

::: {#table:finetuning_add}
     **Task**     `\multicolumn{1}{c}{\bf output only}`{=latex}   `\multicolumn{1}{c}{\bf + layernorm}`{=latex}   `\multicolumn{1}{c}{\bf + input}`{=latex}   `\multicolumn{1}{c}{\bf + positions}`{=latex}
  -------------- ----------------------------------------------- ----------------------------------------------- ------------------------------------------- -----------------------------------------------
    Bit Memory                         76%                                             94%                                          100%                                          100%
     Bit XOR                           56%                                             98%                                           98%                                          100%
     ListOps                           15%                                             36%                                           36%                                           38%
      MNIST                            23%                                             96%                                           98%                                           98%
     CIFAR-10                          25%                                             54%                                           60%                                           68%
   CIFAR-10 LRA                        17%                                             39%                                           39%                                           39%
     Homology                          2%                                              9%                                            10%                                           13%

  : Ablation by successively adding certain parameters to the list of finetuned parameters for pretrained frozen transformers.
:::

Parameter ablations for random models {#parameter-ablations-for-random-models .unnumbered}
=====================================

::: {#table:finetuning_random_indep}
     **Task**     `\multicolumn{1}{c}{\bf output only}`{=latex}   `\multicolumn{1}{c}{\bf output + input}`{=latex}   `\multicolumn{1}{c}{\bf output + positions}`{=latex}   `\multicolumn{1}{c}{\bf output + layernorm}`{=latex}
  -------------- ----------------------------------------------- -------------------------------------------------- ------------------------------------------------------ ------------------------------------------------------
    Bit Memory                         75%                                              75%                                                  75%                                                    75%
     Bit XOR                           50%                                              51%                                                  59%                                                  **100%**
     ListOps                           17%                                              17%                                                  18%                                                  **35%**
      MNIST                            25%                                              28%                                                  34%                                                  **83%**
     CIFAR-10                          20%                                              24%                                                  21%                                                  **46%**
   CIFAR-10 LRA                        11%                                              16%                                                  12%                                                  **34%**
     Homology                          2%                                                2%                                                   6%                                                   **9%**

  : Finetuning individual types of parameters for random frozen transformers.
:::

::: {#table:finetuning_random_add}
     **Task**     `\multicolumn{1}{c}{\bf output only}`{=latex}   `\multicolumn{1}{c}{\bf + layernorm}`{=latex}   `\multicolumn{1}{c}{\bf + input}`{=latex}   `\multicolumn{1}{c}{\bf + positions}`{=latex}
  -------------- ----------------------------------------------- ----------------------------------------------- ------------------------------------------- -----------------------------------------------
    Bit Memory                         75%                                             75%                                           75%                                           76%
     Bit XOR                           50%                                            100%                                          100%                                          100%
     ListOps                           17%                                             35%                                           36%                                           37%
      MNIST                            25%                                             83%                                           92%                                           92%
     CIFAR-10                          20%                                             46%                                           56%                                           62%
   CIFAR-10 LRA                        11%                                             34%                                           36%                                           36%
     Homology                          2%                                              9%                                            9%                                            9%

  : Ablation by successively adding certain parameters to the list of finetuned parameters for random frozen transformers.
:::

```{=latex}
\clearpage
```
```{=latex}
\addcontentsline{toc}{section}{References}
```
```{=latex}
\clearpage
```
```{=latex}
\appendix
```
`\LARGE`{=latex}**Appendix**

```{=latex}
\addcontentsline{toc}{section}{Appendix}
```
`\etocdepthtag`{=latex}.tocmtappendix `\etocsettagdepth{mtchapter}{none}`{=latex} `\etocsettagdepth{mtappendix}{subsection}`{=latex} `\tableofcontents`{=latex}

```{=latex}
\clearpage
```
Summary of arXiv Updates {#app:changelog}
========================

We summarize changes made in updated versions:

1.  (9 Mar 2021) Original release.

2.  (30 June 2021) Updated Section `\ref{sec:architecture_results}`{=latex} with more analysis of the frozen LSTM architecture and additional experimental details. Added new Section `\ref{sec:model_depth}`{=latex} discussing model depth and token mixing, new results in Section `\ref{sec:moreparams}`{=latex} discussing how different freezing strategies can improve performance, and an attention mask visualization for the random frozen transformer in Section `\ref{sec:attention_maps}`{=latex}. Included more details about experiments and hyperparameters, and added some new citations (notably [@wu2021lime] for related LIME work and [@frankle2020batchnorm] for a similar frozen analysis of CNNs). The GitHub repository was also updated to include the LSTM architecture, vision pretraining, and remote homology tasks. Minor writing updates.

Background on Transformers {#app:architecture}
==========================

In this section, we give a description of the transformer architecture used in our experiments, namely the GPT-2 architecture [@radford2019gpt2].

Self-Attention
--------------

The main subcomponent of the transformer architecture is the self-attention layer, which takes in $l$ input tokens and outputs $l$ output tokens, both of dimension $n_{dim}$. Each input token $x_i$ is mapped by linear transformations $Q$, $K$, and $V$ -- denoting query, key, and value, respectively -- into $q_i$, $k_i$, and $v_i$. Both $q_i$ and $k_i$ have dimension $d_k$, and $v_i$ has dimension $d_v$. To generate the output token $y_i$, dot products are calculated between query $q_i$ and keys $k_j$, and fed into a softmax operation to generate weights $w_{ij} \in [0, 1]$ (in practice, a scaling temperature factor of $\frac{1}{\sqrt{d_k}}$ is applied to the dot products to reduce the sharpness of the softmax). Then, the weights are used to generate $y_i$ as a weighted sum of all the values, i.e.: $$\label{eq:attention}
    y_i = \sum_{j=1}^l \frac{\text{exp}(q_i^\top k_j)}{\sum_{k=1}^l \text{exp}(q_i^\top k_k)} v_j$$

This is extended to *multi-head* attention over $n_{heads}$ heads by performing the above procedure $n_{heads}$ times in parallel and concatenating the results. To recover the original dimension, the concatenated vector (of dimension $d_v n_{heads}$) is multiplied by a projection matrix $W_{proj} \in \mathbb{R}^{d_v n_{heads} \times n_{dim}}$.

GPT-2 applies a causal mask to its inputs, i.e. the output token $i$ is only allowed to attend to the input tokens $j \leq i$, which changes the upper bounds of the sums in Equation `\ref{eq:attention}`{=latex} to $i$ instead of $l$. This allows for unsupervised pretraining methods like language modeling (see Appendix `\ref{app:objective}`{=latex}).
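The masked form of Equation `\ref{eq:attention}`{=latex} can be sketched as follows; this is an illustrative single-head NumPy implementation (the function and weight names are ours, not from any library), omitting the multi-head concatenation and projection described above.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention.

    x: (l, n_dim) input tokens; Wq, Wk: (n_dim, d_k); Wv: (n_dim, d_v).
    Returns (l, d_v) outputs, where token i attends only to tokens j <= i.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # (l, l) scaled dot products q_i . k_j
    # Causal mask: output i may not attend to future positions j > i.
    l = x.shape[0]
    scores = np.where(np.tril(np.ones((l, l), dtype=bool)), scores, -np.inf)
    # Softmax over j (numerically stabilized by subtracting the row max).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                  # y_i = sum_j w_ij v_j
```

Note that with the causal mask, the first output token reduces to its own value vector $v_1$, since it can attend to nothing else.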

A residual connection is used to connect the inputs with the outputs of the attention layer. Then, in the rest of the transformer block, a two-layer MLP is used, conventionally projecting the dimension upwards to $4 \cdot n_{dim}$ for the inner dimension and using the GELU activation function [@hendrycks2016gelu]. Another residual connection is used to connect the outputs of the MLP with the previous outputs of the attention layer.

This forms the basis of the transformer block. As it preserves the dimension $n_{dim}$, multiple blocks can be learned and stacked on top of each other $n_{layers}$ times, before feeding the final hidden states to the output layer. In our work, we only use the output of the last hidden state for classification, although in principle other methods are reasonable.
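The residual structure of the block described above can be sketched as below, again as an illustrative NumPy fragment with our own names (layer norms are omitted here for brevity; they are described in the following subsections):

```python
import numpy as np

def gelu(x):
    """tanh approximation of the GELU activation [Hendrycks & Gimpel, 2016]."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def transformer_block(x, attention, W1, b1, W2, b2):
    """One transformer block: residual around attention, then residual
    around a 2-layer MLP with inner dimension 4 * n_dim.

    `attention` is any function mapping (l, n_dim) -> (l, n_dim);
    W1: (n_dim, 4*n_dim), W2: (4*n_dim, n_dim).
    """
    x = x + attention(x)           # residual connection around self-attention
    h = gelu(x @ W1 + b1)          # project up to the 4 * n_dim inner dimension
    return x + (h @ W2 + b2)       # residual connection around the MLP
```

Because the block maps $(l, n_{dim})$ inputs to $(l, n_{dim})$ outputs, such blocks compose directly, which is what allows stacking them $n_{layers}$ times.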

Positional Embeddings
---------------------

As the self-attention blocks are permutation-invariant, positional embeddings are learned in order to capture positional information about sequences. For each position $i \in \{1, \dots, \text{max\_len}\}$, a vector $p_i$ is learned. At the front of the transformer, before feeding the inputs $x_i$ into the self-attention blocks, the positional embeddings are added to the input embeddings as $x_i := x_i + p_i$.

Layer Norm
----------

Layer norm [@ba2016layernorm] is frequently used in recurrent and transformer architectures as a means of normalizing the activations. In particular, for the activations of a training example $x$ of dimension $n_{dim}$, it normalizes by the mean and standard deviation over the features: $$\tilde{y}_i = \frac{x_i - \text{mean}(\{x_j\}_{j=1}^{n_{dim}})}{\text{std}(\{x_j\}_{j=1}^{n_{dim}})}$$

Then, affine scale and shift parameters each of dimension $n_{dim}$ -- $\gamma$ and $\beta$, respectively -- are learned to generate the outputs $y$. $$y_i = \gamma_i \tilde{y}_i + \beta_i$$

Layer norm is applied twice per self-attention block: once before the attention layer and once before the MLP. As each layer norm has $n_{dim}$ scale and $n_{dim}$ shift parameters, a total of $4 \cdot n_{layers} \cdot n_{dim}$ layer norm parameters are learned.
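The two equations above can be sketched together as a short NumPy fragment (illustrative names; a small `eps` is added to the denominator for numerical stability, as is standard in practice):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer norm for a single example x of shape (n_dim,):
    normalize over the features, then apply the learned affine
    scale (gamma) and shift (beta), each of shape (n_dim,)."""
    x_tilde = (x - x.mean()) / (x.std() + eps)
    return gamma * x_tilde + beta

# Two layer norms per block, each with n_dim scale + n_dim shift parameters:
n_layers, n_dim = 12, 768                # base model size
ln_params = 4 * n_layers * n_dim         # the count given in the text
```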

Pretraining Objective {#app:objective}
---------------------

GPT-2 is pretrained on an autoregressive language modeling objective, optimizing for the parameters $\theta$ that maximize the log-likelihood of the data: $\text{max}_\theta \mathbb{E}[\log p_\theta(x)]$. GPT-2 models sequences autoregressively, factorizing the probability distribution $p(x) = p(x_1, \dots, x_l)$ via the chain rule as: $$p(x) = \prod_{i=1}^l p(x_i | x_{i-1}, \dots, x_1)$$

For the language domain, this objective can be interpreted as \`\`given the previous $i-1$ words of a sentence, predict the next word".
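In log space, the product above becomes a sum of per-position conditional log-probabilities, which is how the objective is computed in practice. A minimal sketch (the function name and array layout are ours for illustration):

```python
import numpy as np

def sequence_log_likelihood(log_probs, tokens):
    """Chain-rule factorization: log p(x) = sum_i log p(x_i | x_{<i}).

    log_probs: (l, vocab) array, where row i holds the model's
               log p(. | x_{<i}) for position i.
    tokens:    length-l sequence of observed token indices x_1..x_l.
    """
    return sum(log_probs[i, t] for i, t in enumerate(tokens))
```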

Model Sizes
-----------

The model sizes from Section `\ref{sec:size}`{=latex} are as follows:

::: {#table:model_sizes}
   **Model Size**   $n_{layers}$   $n_{dim}$   $n_{heads}$   \# Parameters
  ---------------- -------------- ----------- ------------- ---------------
    Small (Base)         12           768          12            117M
       Medium            24          1024          16            345M
       Large             36          1280          20            774M

  : Hyperparameters for architectures for larger model sizes.
:::

The hyperparameters for the experiments with other architectures (Vision Transformer, BERT, Longformer, T5) are the same as for the base model size shown above.

Experimental Details {#app:experimental_details}
====================

We use implementations of and pretrained models from the Huggingface Transformers library [@wolf2020transformers]. We train all models using the Adam [@kingma2014adam] optimizer following Pytorch [@paszke2019pytorch] defaults. For all transformer models, we use a learning rate of $10^{-3}$ without learning rate scheduling. For the remote homology task only, we use a learning rate of $10^{-4}$ as we found it to give better performance than $10^{-3}$. We generally use the largest batch size that fits on an RTX 2080 Ti graphics card, somewhere between 2 and 16, without gradient accumulation. Note that except for the remote homology task, we did not tune the FPT hyperparameters. For all LSTMs, we use a lower learning rate of $3 \times 10^{-4}$ and the same batch sizes as transformers of the same size. Models are trained to convergence and evaluated on a heldout test set. `\vspace{2em}`{=latex}
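The FPT freezing protocol used throughout (train only the input layer, output layer, positional embeddings, and layer norms) can be sketched as a name-based filter over model parameters. The patterns below follow GPT-2-style naming conventions (e.g. as in Huggingface's `GPT2Model`); `input_layer` and `output_layer` stand in for the task-specific embedding and linear head and are hypothetical names, not identifiers from the codebase.

```python
def is_finetuned(param_name):
    """Return True if a parameter should be trained under the FPT protocol.

    Frozen: self-attention and MLP weights of every block.
    Trained: positional embeddings, layer norms, and the task-specific
    input/output layers.
    """
    trained_patterns = (
        "wpe",           # positional embeddings (GPT-2 naming)
        "ln_",           # layer norms inside each block (ln_1, ln_2) and ln_f
        "input_layer",   # task-specific input embedding (hypothetical name)
        "output_layer",  # task-specific linear head (hypothetical name)
    )
    return any(pattern in param_name for pattern in trained_patterns)

# In a PyTorch training loop, one would then set, for each named parameter:
#   param.requires_grad = is_finetuned(name)
# so the optimizer only updates the unfrozen subset.
```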

Details by Table
================

For clarity, we explicitly write out finer details for the experiment sections in which reported numbers can correspond to different model types.

Can pretrained language models transfer to different modalities? {#can-pretrained-language-models-transfer-to-different-modalities}
----------------------------------------------------------------

This section refers to Table `\ref{table:main_result}`{=latex} in Section `\ref{sec:transfer}`{=latex}.

**Bit Memory**

1.  FPT: 12-layer base size FPT model (finetuning input, output, position, and layernorm params).

2.  Full: 12-layer base size GPT-2 model (training all params).

3.  LSTM: 3-layer, 768 hidden dimension LSTM model (training all params).

**Bit XOR**

1.  FPT: 12-layer base size FPT model (finetuning input, output, position, and layernorm params).

2.  Full: 12-layer base size GPT-2 model (training all params).

3.  LSTM: 3-layer, 768 hidden dimension LSTM model (training all params).

**ListOps**

1.  FPT: 12-layer base size FPT model (finetuning input, output, position, and layernorm params).

2.  Full: number reported from [@tay2020lra] (3-layer vanilla transformer).

3.  LSTM: 3-layer, 768 hidden dimension LSTM model (training all params).

**CIFAR-10**

1.  FPT: 36-layer large size FPT model (finetuning input, output, position, and layernorm params).

2.  Full: 3-layer, 768 hidden dimension GPT-2 model (training all params).

3.  LSTM: 3-layer, 768 hidden dimension LSTM model (training all params).

**CIFAR-10 LRA**

1.  FPT: 12-layer base size FPT model (finetuning input, output, position, and layernorm params).

2.  Full: number reported from [@tay2020lra] (3-layer vanilla transformer).

3.  LSTM: 3-layer, 768 hidden dimension LSTM model (training all params).

**Remote Homology**

1.  FPT: 12-layer base size FPT model (finetuning input, output, position, and layernorm params).

2.  Full: number reported from [@rap2019tape] (12-layer, 512 hidden dimension vanilla transformer).

3.  LSTM: 3-layer, 768 hidden dimension LSTM model (training all params).

```{=latex}
\vspace{2em}
```
What is the importance of the pretraining modality? {#app:details_pretraining}
---------------------------------------------------

This section refers to Table `\ref{table:random}`{=latex} in Section `\ref{sec:pretraining}`{=latex}.

**All tasks**

1.  FPT: 12-layer base size FPT model (finetuning input, output, position, and layernorm params). This differs from Table `\ref{table:main_result}`{=latex}, Section `\ref{sec:transfer}`{=latex} only in the CIFAR-10 model size.

2.  Random: 12-layer randomly initialized (default scheme) base size GPT-2 model (training input, output, position, and layernorm params).

3.  Bit: 12-layer base size GPT-2 model (finetuning input, output, position, and layernorm params), after first being fully finetuned on Bit Memory following default random initialization.

4.  ViT: 12-layer, 768 hidden dimension base size ViT model (finetuning input, output, position, and layernorm params), pretrained on 224 $\times$ 224 ImageNet-21k with a patch size of 16. (`vit_base_patch16_224` from the `timm` Pytorch library [@wightman2019timm]). We reinitialize the input layer from scratch to match each task, and do not use a CLS token or an MLP as the output network -- instead using a linear layer from the last token -- matching the protocol for the other methods.

How important is the transformer architecture compared to LSTM architecture? {#how-important-is-the-transformer-architecture-compared-to-lstm-architecture}
----------------------------------------------------------------------------

The following refer to Section `\ref{sec:architecture_results}`{=latex}. In Table `\ref{table:random_architecture}`{=latex}:

**All tasks**

1.  Trans: 12-layer randomly initialized (default scheme) base size GPT-2 model (training input, output, and layernorm params). Note: same as \`\`Random" in Table `\ref{table:random}`{=latex}, Section `\ref{sec:pretraining}`{=latex}.

2.  LSTM: 3-layer, 768 hidden dimension \`\`standard" LSTM (training input, output, and layernorm params). Does not have residual connections or positional embeddings.

3.  LSTM$^*$: 12-layer, 768 hidden dimension LSTM (training input, output, position, and layernorm params).

Table `\ref{table:lstm_layers}`{=latex}:

**All tasks**

1.  12: 12-layer, 768 hidden dimension \`\`standard" LSTM (training input, output, and layernorm params).

2.  3: 3-layer, 768 hidden dimension \`\`standard" LSTM (training input, output, and layernorm params).

Table `\ref{table:lstm_layers_residual}`{=latex}:

**All tasks**

1.  12-layer LSTM: 12-layer, 768 hidden dimension \`\`standard" LSTM (training input, output, and layernorm params). Note: same as \`\`12" in Table `\ref{table:lstm_layers}`{=latex}, Section `\ref{sec:architecture_results}`{=latex}.

2.  \+ Residual Connections: 12-layer, 768 hidden dimension LSTM with residual connections (training input, output, and layernorm params).

3.  \+ Positional Embeddings: 12-layer, 768 hidden dimension LSTM with residual connections and positional embeddings (training input, output, position, and layernorm params). Note: same as \`\`LSTM$^*$" in Table `\ref{table:random_architecture}`{=latex}, Section `\ref{sec:architecture_results}`{=latex}.

```{=latex}
\vspace{4em}
```
Does language pretraining improve compute efficiency over random initialization? {#does-language-pretraining-improve-compute-efficiency-over-random-initialization}
--------------------------------------------------------------------------------

This section refers to Table `\ref{table:convergence}`{=latex} in Section `\ref{sec:compute_efficiency}`{=latex}.

**All tasks**

1.  FPT: 12-layer base size FPT model (finetuning input, output, position, and layernorm params). Note: same models as \`\`FPT" in Table `\ref{table:random}`{=latex}, Section `\ref{sec:pretraining}`{=latex}.

2.  Random: 12-layer randomly initialized (default scheme) base size GPT-2 model (training input, output, position, and layernorm params). Note: same models as \`\`Random" in Table `\ref{table:random}`{=latex}, Section `\ref{sec:pretraining}`{=latex}.

[^1]: Code available at [github.com/kzl/universal-computation](https://github.com/kzl/universal-computation). For a summary of changes made in the updated arXiv version, see Appendix `\ref{app:changelog}`{=latex}.
