---
author:
- Daniel Bolya
- 'Po-Yao Huang'
- Peize Sun
- Jang Hyun Cho
- Andrea Madotto
- Chen Wei
- Tengyu Ma
- Jiale Zhi
- Jathushan Rajasegaran
- Hanoona Rasheed
- Junke Wang
- Marco Monteiro
- Hu Xu
- Shiyu Dong
- Nikhila Ravi
- Daniel Li
- Piotr Dollár
- Christoph Feichtenhofer
bibliography:
- reference.bib
title: 'Perception Encoder: The best visual embeddings are not at the output of the network'
---

```{=latex}
\def\bf{\bfseries\sffamily}
```
```{=latex}
\newcommand{\expandref}[2]{\ref{#1}\hyperref[#1]{#2}}
```
```{=latex}
\newcommand{\kindatiny}{\fontsize{6pt}{7.2pt}\selectfont}
```
```{=latex}
\newcommand{\tablestyle}[2]{%
    \fontfamily{ptm}\selectfont%
    \let\itold\it%
    \def\it{\itold \fontfamily{ptm}\selectfont}%
    \setlength{\tabcolsep}{#1}\renewcommand{\arraystretch}{#2}\centering\kindatiny%
    \let\citeold\cite%
    \renewcommand{\cite}[1]{\normalfont\fontfamily{ptm}\selectfont\tiny\citeold{##1}}%
}
```
```{=latex}
\newcommand{\bigtablestyle}[2]{%
    \fontfamily{ptm}\selectfont%
    \let\itold\it%
    \def\it{\itold \fontfamily{ptm}\selectfont}%
    \setlength{\tabcolsep}{#1}\renewcommand{\arraystretch}{#2}\centering\footnotesize%
    \let\citeold\cite%
    \renewcommand{\cite}[1]{\normalfont\fontfamily{ptm}\selectfont\footnotesize\citeold{##1}}%
}
```
```{=latex}
\renewcommand{\paragraph}[1]{\vspace{1.25mm}\noindent\textbf{#1}}
```
```{=latex}
\newcommand\blfootnote[1]{\begingroup\renewcommand\thefootnote{}\footnote{#1}\addtocounter{footnote}{-1}\endgroup}
```
```{=latex}
\newcommand{\uu}[1]{{#1}}
```
```{=latex}
\newcommand{\bb}[1]{\textbf{#1}}
```
```{=latex}
\newcommand{\rp}[2]{{#1\textcolor{c0-item-text}{\tiny /#2}}}
```
```{=latex}
\newcommand{\addpadding}{%
  \rule{0pt}{\dimexpr\normalbaselineskip-1pt\relax}%
}
```
```{=latex}
\newcommand{\cc}[2][c0]{\cellcolor{#1-item-bkg}\textcolor{#1-item-text}{\it \tiny{#2}}}
```
```{=latex}
\newcommand{\ct}[2][c0]{\addpadding{\cellcolor{#1-item-bkg}\textcolor{#1-title-text}{#2}}}
```
```{=latex}
\newcommand{\promptbox}[2]{
    \begin{tcolorbox}[
        top=0.3em,bottom=0.3em,left=0.5em,right=0.5em,
        toptitle=0.3em,bottomtitle=0.2em,boxsep=0pt,
        colframe=promptcolorheader,colback=promptcolor!50,boxrule=0.5pt,
        title={\footnotesize \fontfamily{zi4}\selectfont #1}
    ]
        \fontfamily{zi4}\selectfont #2
    \end{tcolorbox}
}
```
```{=latex}
\DeclareRobustCommand{\PEcore}{PE$_\text{core}$}
```
```{=latex}
\DeclareRobustCommand{\PElang}{PE$_\text{lang}$}
```
```{=latex}
\DeclareRobustCommand{\PEspat}{PE$_\text{spatial}$}
```
```{=latex}
\DeclareRobustCommand{\PEplm}{PE$_\text{PLM}$}
```
```{=latex}
\newcommand{\ccustom}[3][c0]{%
    \cellcolor{#1-item-bkg}{%
        \rotbox[l,t]{90}{%
            \parbox[t]{\ccustomlen}{%
                \ifthenelse{\isempty{#3}}{%
                    \mbox{%
                        \kindatiny\textcolor{#1-title-text}{#2}%
                    }%
                }{%
                    \kindatiny\textcolor{#1-title-text}{#2} \\%
                    \tiny{\textcolor{#1-item-text}{\it #3}}%
                }%
            }%
        }%
    }%
}
```
```{=latex}
\newcommand{\cb}[3][c0]{%
    \setlength{\ccustomlen}{1.2cm}%
    \ccustom[#1]{#2}{#3}%
}
```
```{=latex}
\newcommand{\ca}[1]{\cellcolor{avg-item-bkg}{#1}}
```
```{=latex}
\newcommand{\cat}[1]{\addpadding{\cellcolor{avg-item-bkg}{#1}}}
```
```{=latex}
\newcommand{\cmark}{\ding{51}}
```
```{=latex}
\newcommand{\xmark}{\ding{55}}
```
```{=latex}
\def\eg{\emph{e.g.}\xspace}
```
```{=latex}
\def\ie{\emph{i.e.}\xspace}
```
```{=latex}
\def\vs{\emph{vs.}\xspace}
```
```{=latex}
\newcommand{\etc}{\emph{etc.}\xspace}
```
```{=latex}
\maketitle
```
```{=latex}
\vspace{-3pt}
```
Introduction {#sec:intro}
============

```{=latex}
\vspace{-2pt}
```
For the last decade in computer vision, pretrained vision encoders have been the core building block for most applications requiring *perception*. From million-scale ImageNet [@imagenet] pretrained convolutional networks [@alexnet; @vgg; @resnet; @efficientnet; @convnext] to billion-scale web-pretrained transformers [@vit; @align; @basic; @coca; @vit22b; @dfn; @metaclip; @internvl; @eva18b], the dominant strategy in vision has consistently been to adapt large-scale pretrained encoders to downstream tasks.

There are many pretraining objectives today, each with distinct characteristics and each yielding representations better suited for specific tasks: vision-language contrastive losses [@clip; @siglip] learn a global vision and language embedding well-suited for zero-shot classification and retrieval as well as provide vision-language alignment for open-world [@glip; @owlv1] and generative tasks [@ldm; @dalle2]; captioning losses [@cappa; @aimv2] learn to predict image descriptions using a language decoder, which transfers well to downstream multimodal language model (MLLM) tasks; and spatially self-supervised losses [@mae; @dinov2] learn dense spatial correspondences without language supervision, making them useful for tasks requiring precise localization like object detection.

Many works are now attempting to combine two or more of these techniques in different ways [@coca; @aimv2; @internvl; @eva; @eva2; @ranzinger2023radio; @heinrich2024radio2.5; @maninis2024tips]. While many have been successful, the complexity of these strategies grows exponentially with the number of use cases, which can make scaling difficult. No *single, simple, and easily scalable* pretraining technique has yet been shown to learn state-of-the-art features for all downstream tasks.

```{=latex}
\centering
```
```{=latex}
\begin{overpic}[width=\linewidth, trim=8.7in 0in 0in 17.48in, clip]{fig/teaser.pdf}
        \put(21.7,11.75){\fontsize{7}{6.2}\selectfont \hyperref[sec:core]{\bf \textcolor{c1-title-text}{\S}\textcolor{citecolor}{2}}}
        \put(46.2,12){\fontsize{7}{6.2}\selectfont \hyperref[sec:layerfinder]{\bf \textcolor[HTML]{5e757c}{\S}\textcolor{citecolor}{3}}}
        \put(63.9,17.35){\fontsize{7}{6.2}\selectfont \hyperref[sec:la]{\bf \textcolor{c4-title-text}{\S}\textcolor{citecolor}{4}}}
        \put(63.9,6.5){\fontsize{7}{6.2}\selectfont \hyperref[sec:sa]{\bf \textcolor{c6-title-text}{\S}\textcolor{citecolor}{5}}}
    \end{overpic}
```
In this work we discover that *global vision-language contrastive learning alone* can be one such approach. After building a state-of-the-art contrastive model for image and video, we found a surprising result: *inside the model were specific features aligned to OCR, VQA, grounding, detection, depth estimation, and tracking*. Compared to the state-of-the-art models with captioning [@aimv2] and spatially self-supervised [@dinov2] pretraining, our contrastive encoder has specific layers that, when used as frozen features, match or exceed the performance of the other two pretraining techniques *on tasks they should be the best at*. The only problem is---these features exist at *different layers* for each task. By exploiting this phenomenon with *alignment tuning*, we show it is possible to align these features to the end of the network in order to create state-of-the-art encoders for downstream MLLM and spatial tasks---all following the same easily scalable contrastive pretraining.

We begin by building `\PEcore{}`{=latex} (Fig. `\ref{fig:teaser}`{=latex}, left), a large-scale contrastively pretrained model with state-of-the-art zero-shot performance on *both* images and video (§`\ref{sec:core}`{=latex}). To accomplish this, we first focus on developing a strong *image-only* contrastive pretraining recipe to extract general knowledge from billion-scale image-text data. Keeping the data and training FLOPs fixed, this recipe significantly improves upon vanilla CLIP in both absolute performance and robustness (§`\ref{sec:core_image_pt}`{=latex}). We then use the resulting model as a frame-based encoder to develop a *video* data engine for generating well-aligned video captions. Finetuning on this synthetic video-text data substantially improves performance on *both image and video* classification and retrieval tasks (§`\ref{sec:video_data_engine}`{=latex}). Motivated by this success, we release a large portion of the data used to train the engine: PE Video Dataset (PVD), consisting of 1M diverse videos with 120K human-refined annotations (§`\ref{sec:pvd}`{=latex}). Finally, we scale our robust image pretraining and well-aligned video finetuning strategy to 2B parameters to produce `\PEcore{G}`{=latex} (§`\ref{sec:unified-encoder}`{=latex}), a single unified encoder that outperforms SigLIP2 [@siglip2] on zero-shot image tasks and InternVideo2 [@internvideo2] on most zero-shot video tasks. We further transfer this power to smaller model scales through distillation.

With the strongest image and video recognition model in hand, we shift our focus to downstream tasks. Remarkably, despite being pretrained with CLIP loss, we find that the *intermediate layers* of `\PEcore{G}`{=latex} can rival AIMv2-3B [@aimv2] on language tasks and DINOv2-g [@dinov2] on spatial tasks, both of which are among the strongest pretrained models in their respective domains. Upon investigation, we attribute this capability to our robust image pretraining strategy, which appears to have unlocked the potential of contrastive pretraining to scale effectively for downstream tasks (§`\ref{sec:layerfinder}`{=latex}). However, a challenge remains: the model does not naturally output these features, keeping them hidden internally. To address this, we introduce two *alignment tuning* methods (Fig. `\ref{fig:teaser}`{=latex}, right) to extract these strong, general features.

First, in §`\ref{sec:la}`{=latex}, we investigate the most effective technique to align features to the end of the network by adapting to a large language model. This *language alignment* enables us to construct `\PElang{G}`{=latex}, which individually outperforms all other popular vision encoders for MLLM tasks. Moreover, when paired with our Perception Language Model (PLM) [@PLM], the combination rivals the latest state-of-the-art MLLMs, like InternVL3 [@internvl3].

Second, in §`\ref{sec:sa}`{=latex}, we identify a dichotomy in the layers optimal for spatial tasks. By visualizing the features and pinpointing the explicit reason for this dichotomy, we develop a straightforward *spatial alignment* approach: distilling *from the model's own frozen features* to achieve most of the alignment, complemented by a novel use of SAM 2 [@sam2] for *spatial correspondence* distillation to refine the process. The resulting `\PEspat{G}`{=latex} not only outperforms other popular models in depth estimation, tracking, and semantic segmentation, but also sets a new absolute state-of-the-art on COCO [@coco] detection with a much simpler decoder.

With this family of checkpoints, Perception Encoder unlocks the potential to scale one simple pretraining method to solve many downstream vision tasks. We are releasing our models, code, and PE Video Dataset.

Perception Encoder: *Core* {#sec:core}
==========================

```{=latex}
\vspace{-5pt}
```
To build Perception Encoder (PE), we start by training a large-scale, robust, and highly performant vision-language contrastive model for image *and video*. We have two objectives: first, to enhance the scalability and data efficiency of contrastive training; and second, to create a unified model effective on both image and video.

These goals are somewhat conflicting: image-text data is plentiful and training on images is efficient, but video-text data is scarce and video training is expensive. Thus, we decouple image and video training into two stages. We first develop a strong *image* pretraining recipe (§`\ref{sec:core_image_pt}`{=latex}) with several regularization techniques to create a robust starting point. Then we use the resulting image model as a frame encoder to develop a *video data engine* (§`\ref{sec:video_data_engine}`{=latex}) supported by our novel human-refined video-text dataset (§`\ref{sec:pvd}`{=latex}) to generate aligned captions for video clips. Finally, we finetune the image encoder on the resulting aligned video data (§`\ref{sec:unified-encoder}`{=latex}). Using our data engine design, this short finetuning step substantially improves *both* image and video performance.

```{=latex}
\vspace{-5pt}
```
Robust Image Pretraining {#sec:core_image_pt}
------------------------

In the first stage of pretraining, we want to learn as much visual information as possible from a large set of image-text data. Notably, a unique quirk of contrastive training is that the loss for a given sample depends on the other samples in the batch. Because each batch is different, there is potential to learn new information every time an example is sampled, even if that sample has been seen before. Thus, we find that contrastive learning benefits from a long training schedule. To exploit this, we design our pretraining recipe with high regularization, stability, and training efficiency in mind.
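To make the batch dependence concrete, the symmetric contrastive (InfoNCE) objective can be sketched in a few lines of numpy. This is a minimal illustration, not our training implementation; the function name and temperature value are chosen for exposition only.

```python
import numpy as np

def clip_loss(img_emb: np.ndarray, txt_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Every other sample in the batch acts as a negative, which is why the
    loss for a given pair depends on the rest of the batch.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature   # (B, B); matching pairs on the diagonal
    labels = np.arange(len(logits))

    def cross_entropy(l):
        # Row-wise softmax cross-entropy against the diagonal labels.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

A larger batch makes each row of `logits` harder to classify, which is the "task difficulty" effect discussed below.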

```{=latex}
\begin{wrapfigure}{r}{0.545\textwidth}
\vspace{-27pt}
  \begin{center}
      
        \includegraphics[width=1\linewidth, trim = 14.4in 0in 0in 7.6in, clip]{fig/convnext.pdf}
    
    \end{center}
    \caption{{\bf Robust Image Pretraining.} We tune our pretraining recipe (\S\ref{sec:core_image_pt}) to maximize performance on a fixed set of data, starting with an OpenCLIP~\cite{openclip} ViT-L/14 model.
    We report cumulative zero-shot classification results for each modification.
    The inner bars show robustness evaluation, calculated as the average of 6 robustness benchmarks~\cite{imagenet,imagenetv2,objectnet,imagenet-a,imagenet-r,imagenet-sketch}, and the outer bars show ImageNet val~\cite{imagenet} alone. Several changes significantly improve robustness, indicating that ImageNet val scales more with data, while robustness can scale with refined training techniques.
    }
    \label{fig:core_pt_ablations}
\vspace{-20pt}
\end{wrapfigure}
```
#### Setup.

 (Fig. `\expandref{fig:core_pt_ablations}{.1}`{=latex}) We track our changes on a vanilla CLIP model using an OpenCLIP [@openclip] ViT-L/14 model at 224 resolution as a baseline. We keep the training budget fixed to around 1T GFLOPs (*i.e.*, a ZFLOP), and train on a fixed 2.3B image-text dataset curated using the MetaCLIP [@metaclip] text-only curation pipeline. For the baseline, we use a global batch size of 32K, class token, AdamW [@adamw], and train for 12B samples seen. To assess the *generality* of the information learned during pretraining, we report not only zero-shot ImageNet val [@imagenet] results but also the average performance across a range of robustness metrics, including ImageNet val [@imagenet], ImageNet v2 [@imagenetv2], ObjectNet [@objectnet], ImageNet Adversarial [@imagenet-a], ImageNet Rendition [@imagenet-r], and ImageNet Sketch [@imagenet-sketch]. As observed with other pure CLIP models [@clip; @dfn; @metaclip], the average robustness metric performance of this vanilla recipe is much lower than ImageNet val alone.

#### Progressive Resolution.

 (Fig. `\expandref{fig:core_pt_ablations}{.2}`{=latex}) To enable longer training, we first improve training efficiency. As shown in many works [@efficientnet; @li2023clipa; @swin; @touvron2022deit; @li2023clipav2], vision encoders work well with a *progressively increasing* resolution schedule. Thus, we *halve* the training FLOPs while maintaining performance by evenly splitting the baseline 12B-sample run into 98, 154, and 224 resolution stages, with 4B samples per stage.
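The three-stage schedule can be sketched as follows. This is a simple illustration with the stage boundaries from the text; the cost estimate uses tokens per image as a linear proxy for encoder FLOPs and ignores the quadratic attention term.

```python
# Stages of (resolution, samples seen), as in the text: 4B samples each
# at 98, 154, and 224 resolution for a ViT with patch size 14.
STAGES = [(98, 4_000_000_000), (154, 4_000_000_000), (224, 4_000_000_000)]

def resolution_at(samples_seen, stages=STAGES):
    """Return the training resolution after a given number of samples seen."""
    for res, n in stages:
        if samples_seen < n:
            return res
        samples_seen -= n
    return stages[-1][0]  # stay at the final resolution

def relative_cost(stages=STAGES, patch=14, full_res=224):
    """Approximate per-sample encoder cost as the token count (res // patch)**2,
    relative to running the whole schedule at full_res."""
    total = sum(n for _, n in stages)
    cost = sum((r // patch) ** 2 * n for r, n in stages)
    return cost / ((full_res // patch) ** 2 * total)
```

Under this token-count proxy, `relative_cost()` comes out to roughly 0.55, consistent with the claim of approximately halving training FLOPs at equal samples seen.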

#### Increasing Batch Size.

 (Fig. `\expandref{fig:core_pt_ablations}{.3}`{=latex}) We use the extra budget to double the batch size from 32K to 64K, increasing the total samples seen from 12B to 24B. A larger batch size means a higher likelihood of non-trivially novel pairs of samples, *i.e.*, hard negatives. This is akin to increasing the "task difficulty" of CLIP and improves ImageNet val by +0.6% and robustness by nearly double that, +1.1%.

#### LAMB Optimizer.

 (Fig. `\expandref{fig:core_pt_ablations}{.4}`{=latex}) We switch from AdamW to LAMB [@lamb], which is known to stabilize large-batch training. More importantly, LAMB allows us to train stably with a higher learning rate of $2 \times 10^{-3}$ compared to the original $5 \times 10^{-4}$. We observe that starting with a high learning rate is important to allow the model to adapt to different resolutions. These factors combine for +0.4% on ImageNet val and +0.7% on robustness.

#### Increasing Final Resolution.

 (Fig. `\expandref{fig:core_pt_ablations}{.5}`{=latex}) A classic finding is that parameters and resolution should be scaled together [@efficientnet; @feichtenhofer2020x3d]. Thus, we add a fourth 336 resolution stage at the end of training. To keep the training FLOPs the same, we adjust the training schedule to 10B samples at 98 resolution, 8B at 154, 4B at 224, and 2B at 336. While ImageNet val only increases by +0.5%, robustness improves threefold, rising by +1.4%.

#### RoPE.

 (Fig. `\expandref{fig:core_pt_ablations}{.6}`{=latex}) We add 2D RoPE [@rope] to each attention layer to improve extrapolation, keeping the original position embedding. 2D RoPE only improves ImageNet val by +0.3% but enhances robustness by +0.9%.
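For illustration, one common axial formulation of 2D RoPE rotates half the channels by the patch row index and the other half by the column index; the exact variant used here may differ in details. A minimal numpy sketch:

```python
import numpy as np

def rope_1d(x, positions, base=10000.0):
    """Standard 1D rotary embedding: rotate consecutive channel pairs of x
    by angles position * base**(-2i/d)."""
    d = x.shape[-1]
    assert d % 2 == 0
    freqs = base ** (-np.arange(0, d, 2) / d)     # (d/2,)
    angles = positions[:, None] * freqs[None, :]  # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, grid_h, grid_w):
    """Axial 2D RoPE over (grid_h * grid_w, d) patch tokens: the first half
    of the channels is rotated by the row index, the second by the column."""
    rows, cols = np.indices((grid_h, grid_w))
    rows = rows.ravel().astype(float)
    cols = cols.ravel().astype(float)
    d = x.shape[-1] // 2
    return np.concatenate([rope_1d(x[..., :d], rows),
                           rope_1d(x[..., d:], cols)], axis=-1)
```

The key property is that dot products between rotated tokens depend only on their relative (row, column) offset, which is what enables extrapolation to unseen resolutions.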

#### Attention Pooling.

 (Fig. `\expandref{fig:core_pt_ablations}{.7}`{=latex}) We follow [@siglip] in constructing the CLIP embedding using an attention probing transformer block. Surprisingly, we found keeping the class token as an input to this block is important for small model performance. Together, this improves ImageNet val by +0.3% and robustness by +0.9%.
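A single-head sketch of attention-probe pooling follows; the actual probe is a full transformer block with learned parameters, so this is only a minimal illustration of the mechanism (a learned query cross-attending over all tokens, class token included).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(tokens, query, wk, wv):
    """Pool an (n, d) token sequence (class token included as one of the
    tokens) into a single (d,) embedding with one cross-attention step
    from a learned query vector."""
    k = tokens @ wk                                    # (n, d) keys
    v = tokens @ wv                                    # (n, d) values
    attn = softmax(query @ k.T / np.sqrt(len(query)))  # (n,) attention weights
    return attn @ v                                    # weighted sum of values
```

Because the weights are a softmax over all tokens, the pooled embedding can emphasize whichever tokens the probe finds informative, unlike plain mean pooling.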

#### Tuned Data Augmentation.

 (Fig. `\expandref{fig:core_pt_ablations}{.8}`{=latex}) Despite training on billions of samples, we find data augmentation still important---especially for transfer to unlikely scenarios like in ObjectNet [@objectnet]. We add heavy random cropping, brightness/saturation jitter, and horizontal flip. Random cropping encourages using the entire caption, as not everything is in frame. Jitter helps low-light settings and documents. Horizontal flip improves natural images and does not hurt OCR (see §`\ref{sec:core_results}`{=latex}). These improve robustness by +0.7%, notably, ObjectNet by +2.4%.
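A minimal numpy stand-in for this augmentation pipeline is below. The parameters are illustrative rather than the ones used in training, and saturation jitter is omitted for brevity.

```python
import numpy as np

def augment(img, rng, crop_frac=(0.4, 1.0), jitter=0.2):
    """Heavy random crop (by area fraction), brightness jitter, and a random
    horizontal flip, applied to an (H, W, C) float image in [0, 1]."""
    h, w, _ = img.shape
    frac = rng.uniform(*crop_frac)                 # sampled crop area fraction
    ch = max(1, int(h * frac ** 0.5))
    cw = max(1, int(w * frac ** 0.5))
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    out = img[top:top + ch, left:left + cw]
    out = out * rng.uniform(1 - jitter, 1 + jitter)  # brightness jitter
    if rng.random() < 0.5:
        out = out[:, ::-1]                           # horizontal flip
    return np.clip(out, 0.0, 1.0)
```

In a real pipeline the crop would be resized back to the training resolution; the sketch keeps only the sampling logic.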

#### Mask Regularization.

 (Fig. `\expandref{fig:core_pt_ablations}{.9}`{=latex}) As regularization, we want the model to produce the same features if some patches are not visible. However, passing the CLIP gradients through masked images may negatively alter behavior on unmasked images. Thus, we convert MaskFeat [@maskfeat] into a regularization loss by duplicating and masking 1/16th of the batch. At the output, the masked tokens are aligned to their unmasked counterparts by maximizing cosine similarity. Care is taken to ensure that the CLIP and masked gradients are disjoint.
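The regularization can be sketched as follows. This is numpy pseudocode for the loss computation only: `encode` stands in for the vision tower, all output tokens are aligned for simplicity, and the stop-gradient on the clean branch is noted in a comment since numpy has no autograd.

```python
import numpy as np

def mask_reg_loss(encode, images, mask_ratio=0.75, dup_frac=1 / 16, rng=None):
    """Duplicate a small slice of the batch, mask most of its patches, and
    align the masked forward pass to the unmasked one by maximizing cosine
    similarity. `images` is (B, n, d) patch tokens; `encode` maps them to
    per-token features of the same leading shape."""
    if rng is None:
        rng = np.random.default_rng()
    b = max(1, int(len(images) * dup_frac))       # duplicate 1/16th of the batch
    sub = images[:b]
    feats_clean = encode(sub)                     # targets; stop-gradient in training
    keep = rng.random(sub.shape[:2]) >= mask_ratio  # (b, n) visibility mask
    feats_masked = encode(sub * keep[..., None])  # zero out masked patches
    a = feats_masked / np.linalg.norm(feats_masked, axis=-1, keepdims=True)
    t = feats_clean / np.linalg.norm(feats_clean, axis=-1, keepdims=True)
    return 1.0 - (a * t).sum(-1).mean()           # 1 - mean cosine similarity
```

Keeping this loss on a duplicated slice is what keeps the CLIP gradients and the masking gradients disjoint: the main contrastive branch never sees masked images.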

#### Scaling Behavior.

 (Figs. `\ref{fig:core_pt_scaling}`{=latex} and `\ref{fig:core_pt_scaling2}`{=latex}) In Fig. `\ref{fig:core_pt_scaling}`{=latex}, we show the performance of our recipe (Fig. `\expandref{fig:core_pt_ablations}{.9}`{=latex}) `\vs `{=latex}the original CLIP recipe (Fig. `\expandref{fig:core_pt_ablations}{.1}`{=latex}) across S/14, B/14, and L/14 models. For each benchmark, our recipe scales around the same rate or better than the original CLIP recipe. On some difficult datasets like ObjectNet [@objectnet] and ImageNet Adversarial [@imagenet-a], our recipe shows distinctly better scaling. This indicates that the improvements in performance were not at the cost of scalability, meaning we can further benefit from scaling the model size.

```{=latex}
\centering
```
![**Scaling Behavior (Model Size).** Results before and after our recipe changes (Fig. `\ref{fig:core_pt_ablations}`{=latex}) for S/14, B/14, and L/14 models. Our recipe improves scaling for difficult metrics like ObjectNet [@objectnet] and ImageNet Adversarial [@imagenet-a]. ](fig/model_scaling.png){#fig:core_pt_scaling width="1.0\\linewidth"}

In Fig. `\ref{fig:core_pt_scaling2}`{=latex}, we additionally show the performance of our recipe `\vs `{=latex}the original CLIP recipe across L/14 models trained with 120K steps (one-third schedule), 240K steps (two-thirds schedule), and 360K steps (full ablation schedule). Each model is a separate training run with full learning rate annealing and the progressive resolution schedule scaled proportionally. We see nearly linear trends for our recipe on most datasets. This suggests we can train longer for more performance, even at L scale and with 24B samples already seen.

```{=latex}
\centering
```
![**Scaling Behavior (Training Steps).** Results before and after our recipe changes for an L/14 model trained with 120K, 240K, and 360K steps, adjusting the learning rate and progressive resolution schedules accordingly. Despite our recipe being much stronger than the original, there is still room for further improvement by training longer. ](fig/model_scaling2.png){#fig:core_pt_scaling2 width="1.0\\linewidth"}

```{=latex}
\newpage
```
Bootstrapping a Video Data Engine with Perception Encoder {#sec:video_data_engine}
---------------------------------------------------------

```{=latex}
\begin{wrapfigure}{r}{0.65\textwidth}
\vspace{5pt}
    \centering
    \includegraphics[width=\linewidth, trim=2.1in 0in 0in 5.1in, clip]{fig/video_caption_pipeline.pdf}
    \caption{{\bf Video Data Engine.}
    To create aligned video-text data for contrastive training, we use a PE-based video captioner~\cite{PLM} to generate a holistic video caption and an image-level captioner~\cite{llama3} on sampled frames.
    We then provide those captions as well as the original video metadata to a text-only LLM~\cite{llama3} to synthesize a single short, aligned caption optimal for contrastive training.
    }
    \label{fig:video_caption_pipeline}
\vspace{-5pt}
\end{wrapfigure}
```
`\label{sec:core_video_ft}`{=latex} With a robust image pretraining recipe settled and its scaling behavior confirmed, our next step is to extend the image-only encoder to video and build a unified image-video model. Unlike web-scale image-text data, which in many cases comes with human-written, descriptive alt-text, videos with aligned language annotation are inherently scarce, and high-quality human-annotated video captions are rarer still. This scarcity presents a significant challenge in training encoders that can effectively process video inputs. Inspired by the recent success of image data engines [@sam; @sam2; @veclip; @Nguyen2023recap; @altogether], we extend this concept to develop a robust video data engine that generates well-aligned synthetic captions for a diverse set of videos, facilitating the training of a video encoder. To our knowledge, this is the first large-scale exploration of its kind. In the following sections, we describe the process of building our video data engine.

To bootstrap our contrastive video finetuning, we focus on synthesizing video captions. We build our data engine in three stages: (1) we create a strong baseline video captioner, which we call the Perception Language Model (PLM), described in [@PLM]; (2) we add additional high quality video data with human-refined captions to further enhance the captioner's quality; (3) we refine and summarize the generated video captions with an LLM to construct a large video dataset to use for the contrastive video finetuning of our Perception Encoder.

#### Phase 1: Base Video Captioner (PLM).

We build our data engine on an early version of PLM [@PLM], a multimodal large language model with PE as the vision encoder and Llama [@llama3] as the language decoder. We train PLM on a large-scale collection of open-access image and video datasets [@PLM]. In total, the training dataset consists of 64.7M images and videos covering natural images, charts, documents, exocentric and egocentric videos.

```{=latex}
\begin{wraptable}{r}{0.5\textwidth}
\centering
\vspace{-13pt}
{
\tablestyle{0pt}{1.05} 
\begin{tabular}{y{90} x{21}x{21}x{25}x{25}x{45}}
    \shline
    & \multicolumn{2}{c}{\ct[c3]{AuroraCap~\cite{auroracap}}} & \multicolumn{2}{c}{\ct[c4]{VCG Diverse~\cite{Maaz2024VideoGPT+}}} & \ct[c5]{VCG Bench~\cite{Maaz2023VideoChatGPT}} \\
    Captioner & \cc[c3]{Score} & \cc[c3]{Acc} & \cc[c4]{Score} & \cc[c4]{Acc} & \cc[c5]{Score} \\
    \hline
    \addpadding
    PLM & {2.2} & 51.9 & 3.1 & 65.1 & 34.3 \\
    PLM + Human-Refined Data & \textbf{3.4} & \textbf{71.1} & \textbf{3.6} & \textbf{79.4} & \textbf{35.2} \\
    \shline
\end{tabular}
}
\caption{{\bf Video Captioning.}
We use an early version of PLM-8B~\cite{PLM}, consisting of our image-only PE encoder and a Llama decoder, for captioning. Adding human-refined data greatly boosts captioning performance (higher is better).
}
\vspace{-15pt}
\label{tab:plm_ablation_for_caption}
\end{wraptable}
```
#### Phase 2: PLM + Refined Data.

To further boost captioning performance, we collect a set of 265K videos (105K from PVD which we release, see §`\ref{sec:pvd}`{=latex}), caption them with our base PLM model, and ask human raters to refine the captions[^1]. We then finetune our base PLM model with this data, significantly improving captioning quality (see Tab. `\ref{tab:plm_ablation_for_caption}`{=latex}).

#### Phase 3: LLM Summarization.

We synthesize the final aligned video captions by incorporating the PLM video captions, Llama 3.2 [@llama3] image-only frame captions, and the existing video metadata of video titles and descriptions (Fig. `\ref{fig:video_caption_pipeline}`{=latex}). Similar to image alt-text, video metadata contains knowledge often not covered by the image and video captioning models. Thus, combining the two leads to more comprehensive captions. We summarize video captions, frame captions, and video metadata together using the Llama 3.3 70B model to provide the final captions. The prompt used to generate the summary can be found in Appendix `\ref{sec:appx_video_caption}`{=latex}.
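The composition of the summarizer input can be sketched as below. The wording is purely illustrative; the actual prompt appears in the appendix, and this helper exists only to show which sources are combined.

```python
def build_summary_prompt(video_caption, frame_captions, title, description):
    """Compose the inputs given to the text-only summarizer LLM: the PLM
    video caption, per-frame image captions, and the video's metadata.
    The instruction text here is a hypothetical stand-in."""
    frames = "\n".join(f"- {c}" for c in frame_captions)
    return (
        "Summarize the following into one short, descriptive caption "
        "suitable for contrastive training.\n"
        f"Video title: {title}\n"
        f"Video description: {description}\n"
        f"Video caption: {video_caption}\n"
        f"Frame captions:\n{frames}"
    )
```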

#### Using the Engine.

Finally, we use the resulting data engine bootstrapped with an image-only checkpoint of PE to generate well-aligned, information-dense captions for a diverse set of 22M videos for contrastive finetuning.

#### Training with Recaptioned Videos.

Our goal is to develop a unified image *and* video encoder. To encode videos using our existing image encoder, we uniformly sample $N=8$ frames from each video clip and extract frame-level embeddings with the image encoder. We then apply average pooling over these frame embeddings to obtain video embeddings, which are used for contrastive learning against the video captions encoded by the text encoder. Despite its simplicity, we find this technique surprisingly effective in producing a strong joint image-video encoder. This finding is consistent with previous studies [@clip4clip; @internvl], which note that simple average pooling outperforms more complex pooling strategies like attention-based compression for video.
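This frame-averaging scheme can be sketched as follows (a minimal illustration; `encode_image` stands in for the frozen image tower):

```python
import numpy as np

def video_embedding(frames, encode_image, n_frames=8):
    """Uniformly sample n_frames from a clip, embed each frame with the
    image encoder, average the frame embeddings, and L2-normalize to get
    a single video CLIP embedding."""
    idx = np.linspace(0, len(frames) - 1, n_frames).round().astype(int)
    embs = np.stack([encode_image(frames[i]) for i in idx])  # (n_frames, d)
    v = embs.mean(axis=0)
    return v / np.linalg.norm(v)
```

The resulting vector drops into the same contrastive loss used for images, so no new video-specific parameters are needed.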

```{=latex}
\begin{wraptable}{r}{0.55\textwidth}
    \vspace{-10pt}
    \centering
    { % needs extra block because table stype changes citation size
    \tablestyle{0pt}{1.05} 
    \begin{tabular}{x{15}x{15}x{15}x{15}awwwww awwwww}
        \shline
        &&&  & \multicolumn{6}{c}{\ct[c1]{\it Image Zero-Shot}} & \multicolumn{5}{c}{\ct[c3]{\it Video Zero-Shot}} \\
              \cb{Title}{}
            & \cb{Description}{}
            & \cb{Video Caption}{}
            & \cb{Frame Caption}{}
            & \cb[c1]{\textit{\textbf{Average Image}}}{}
            & \cb[c1]{ImageNet}{val~\cite{imagenet}}
            & \cb[c1]{ImageNet}{v2~\cite{imagenetv2}}
            & \cb[c1]{ObjectNet}{IN Classes~\cite{objectnet}}
            & \cb[c2]{MS-COCO}{txt$\rightarrow$img~\cite{coco}}
            & \cb[c2]{MS-COCO}{img$\rightarrow$txt~\cite{coco}}
            & \cb[c3]{\textit{\textbf{Average Video}}}{}
            & \cb[c3]{Kinetics}{400~\cite{kay2017kinetics}}
            & \cb[c3]{Kinetics}{600~\cite{kay2017kinetics}}
            & \cb[c3]{MSR-VTT}{txt$\rightarrow$vid~\cite{vtt}}
            & \cb[c3]{MSR-VTT}{vid$\rightarrow$txt~\cite{vtt}}
            \\
            
        \hline
        &  &           &               
        & \cat{72.6} & 83.3 & 77.8 & 85.8  & 49.4 & 66.8
        & \cat{50.9} & 69.7 & 68.4  & 38.0 & 27.3
        \\
        
        $\checkmark$ & $\checkmark$ &           &               
        & \ca{75.4} & 83.2 & 78.2 & 87.1  & 47.3 & 66.0
        & \ca{56.0} & 74.1 & 73.5  &    39.0 &  37.3
        \\
        
        $\checkmark$ & $\checkmark$ & $\checkmark$ &            
        & \ca{78.2} & 83.5 & 78.4 & 86.8  & 56.0& 74.3
        & \ca{60.9} & 73.8 & 73.4    & 47.6 & 48.8
        \\

        $\checkmark$ & $\checkmark$ & \hphantom{\textsuperscript{*}}$\checkmark$\textsuperscript{*}  & $\checkmark$  
        & \ca{78.1} & 83.7 & 79.0 & 87.7  & 54.1 & 73.0
        & \ca{60.9} & 75.4 &    75.1 &      46.7 &  46.5
        \\

        
        $\checkmark$ & $\checkmark$ & $\checkmark$     & $\checkmark$ 
        & \ca{78.2} & 83.7 & 79.0 & 87.5  & 54.6 & 73.2 
        & \ca{61.6} & 75.8 &    75.5 &      47.4 &  48.1
        \\
        

        \shline
    \end{tabular}
    }
    \caption{{\bf Video Data Engine Ablation.}
    We ablate our video data engine in Fig.~\ref{fig:video_caption_pipeline} by finetuning on an in-development image-only version of PE by averaging the frame embeddings to create a single video CLIP embedding.
    Video captions are generated by PLM trained with or without\textsuperscript{*} human-refined data (see \S\ref{sec:pvd}).
    Frame captions are generated by the Llama 3.2 vision model.
    Each component helps on different metrics, overall culminating in a huge boost to \textit{both} image and video zero-shot performance. 
    }
    \label{tab:video-ft-ablation}
    \vspace{-10pt}
\end{wraptable}
```
#### Ablations.

In Tab. `\ref{tab:video-ft-ablation}`{=latex}, we conduct an ablation study on the components of the video data engine by finetuning an intermediate image-only checkpoint on 17M of the 22M videos recaptioned by our video data engine. The results show that the video data engine significantly enhances zero-shot classification and retrieval performance for both image and video benchmarks, compared to the image-only baseline encoder (first row). Notably, using the video data engine's video-level and frame-level captions provides significant improvements over relying solely on metadata such as video title and description (second row), highlighting the importance of building a robust video data engine to compensate for noise in web videos. Our analysis reveals that the most critical components are the video metadata and PLM's video caption; however, all components are necessary to achieve peak performance in our video data engine.

In Fig. `\ref{fig:core_video_scaling}`{=latex}, we investigate the impact of scaling recaptioned video data on a later checkpoint of the same image-only model as in Tab. `\ref{tab:video-ft-ablation}`{=latex}. Notably, scaling synthetic video data demonstrates consistent improvement on both image and video benchmarks. Full results of this scaling experiment can be found in the Appendix (Tab. `\ref{tbl_app:video_ft}`{=latex}).

```{=latex}
\centering
```
![**Video Data Scaling.** Finetuning on videos recaptioned by the PE video data engine from 0M (baseline image-only model) to 17M samples consistently improves both image and video performance, both classification and retrieval.](fig/video_scaling.png){#fig:core_video_scaling width="0.98\\linewidth"}

In the top row, scaling synthetic video data consistently improves performance on image benchmarks, with monotonic improvements of +1.1% in ObjectNet and +1.6% in ImageNet Adversarial. ImageNet val and ImageNet v2 have smaller gains, with accuracy increases of 0.3% to 0.5%, plateauing at $\sim$7M samples. We also observe a significant boost to zero-shot retrieval (here, COCO [@coco]) of +3.8% to +4.1% top-1 recall.

The video tasks in the bottom row tell a consistent story. We observe a significant jump in performance between 0 and 3M videos across all video classification tasks, indicating a domain gap that hinders image-only models from performing well on video out of the box. Further scaling synthetic video data leads to substantial gains in both video classification and retrieval: classification accuracy improves consistently by +5.6% to +11.7% without plateauing, while retrieval improves by a significant +7.7 to +15.3 top-1 recall.

These experiments highlight the quality of our video data engine and its ability to significantly improve encoder performance, even with only a relatively modest 17M videos compared to the billions of images seen during pretraining. Our video data engine is a vital component in building a strong, unified image-video encoder.

PE Video Dataset (PVD) {#sec:pvd}
----------------------

For the benefit of the community, we release a new video dataset: PE Video Dataset (PVD).[^2] PVD comprises 1M high-quality and diverse videos with accompanying tags and descriptions. The videos are motion-centered, covering both first-person and third-person views with a wide coverage of scenes.

We additionally select 120K of these videos with the highest degree of motion to annotate with detailed captions by generating synthetic captions using our video captioner (§`\ref{sec:video_data_engine}`{=latex}) and employing 200 annotators to verify and refine them. We ask the human annotators to improve the synthetic captions by removing any hallucinations, correcting words that describe the video inaccurately, eliminating repetitive or redundant words to make the caption more concise, and adding any missing actions being performed in the video.

```{=latex}
\begin{wraptable}{r}{0.25\textwidth}
\vspace{-10pt}
\centering
{
\tablestyle{4pt}{1.05} 
\begin{tabular}{z{60}y{40}}
    \shline
    \ct[c1]{Videos} & 998,862 \\
    \ct[c2]{Human Captions} & 118,862 \\

    \ct[c3]{Total Duration} & 4625 hrs \\
    \ct[c4]{Duration (s)} & 16.7$\pm$9.8  \\
    \ct[c5]{Human Caption Length} & 57.1$\pm$25.4\\
    \ct[c6]{Model Caption Length} & 111.7$\pm$43.2\\
    \shline
\end{tabular}
}

\captionsetup{justification=centering}
\caption{{\bf PVD Statistics.\label{tab:PEvideo_stat}}
}
\vspace{-10pt}
\end{wraptable}
```
We release two versions of annotations for the 120K PVD subset: (1) Human verified captions: extended summaries with an average length of 57.1 words that provide a high-level description of each video. These captions are suitable for CLIP-style training. (2) Long automated captions: detailed and fine-grained descriptions with an average length of 111.7 words that capture spatial and temporal events. These captions are ideal for fine-grained video understanding.

```{=latex}
\begin{figure*}
    \centering
    %\vspace{1cm}
    \includegraphics[width=\linewidth, trim=7.85in 0in 0in 6.38in, clip]{fig/pvd_video_example.pdf}
    \caption{{\bf PE Video Dataset Example.} A sample from PVD, our released video-text dataset. Initial captions are generated by our video captioning model and then refined by human annotators. Annotators are instructed to add details and remove model hallucination. In this example, the model hallucination ``a spoon'' is removed; and more details such as ``glass bowl'' and the action ``scraping'' are added. See Appendix Fig.~\ref{fig:video_data_example_more} for more.}
    \label{fig:video_data_example}
    %\vspace{2cm}
\end{figure*}
```
In Fig. `\ref{fig:video_data_example}`{=latex}, we visualize a video example from PE Video Dataset together with its model and human captions (see Fig. `\ref{fig:video_data_example_more}`{=latex} for more). The dataset statistics are summarized in Tab. `\ref{tab:PEvideo_stat}`{=latex}. Finally, we use 105K of these refined samples to improve the data engine (§`\ref{sec:video_data_engine}`{=latex} phase 2) and 15K as a high-quality video retrieval benchmark.

#### PVD Benchmark.

We use 15K of the human-refined video-caption pairs as a held-out test set, which we introduce as a new video retrieval benchmark, PVD Benchmark, to evaluate fine-grained video-caption alignment. We follow the format of MSR-VTT [@vtt] to construct the benchmark. We select videos from 10 different categories, including hand actions, object interactions, food preparation, work activities, outdoor scenes, animals, water scenes, object handling, close-up shots, and nature scenes, with an overall average caption length of 51.7 words (see Appendix `\ref{appx:pvd_bench_distribution}`{=latex} for statistics). We use PVD Benchmark to evaluate SigLIP [@siglip], SigLIP2 [@siglip2], InternVL [@internvl], and PE models, and the results can be found in Tab. `\ref{tab:core_pe_bench}`{=latex}.

A Unified Encoder for Image and Video {#sec:unified-encoder}
-------------------------------------

Using a robust, scalable image pretraining recipe and video-pretraining data recaptioned by the proposed video data engine, in this section we present **`\PEcore{}`{=latex}**, a unified image-and-video encoder.

```{=latex}
\begin{wraptable}{r}{0.4\textwidth}
\vspace{-14pt}
\centering
\tablestyle{0pt}{1.05} 
\begin{tabular}{x{20}x{30}x{25}x{20}x{20}x{20}x{20}x{30}}
    \shline
        \ct{Scale} & \ct{Tower} & \ct[c1]{Params} & \ct[c2]{Width} & \ct[c3]{Depth} & \ct[c4]{MLP} & \ct[c5]{Heads} & \ct[c6]{CLIP Dim} \\
        \hline
       \addpadding
        \multirow{2}{*}{B} & Vision & 0.09B & 768 & 12 & 3072 & 12 & \multirow{2}{*}{1024}\addpadding{} \\
                           & Text   & 0.31B & 1024 & 24 & 4096 & 16 &  \\
       \hline
       \addpadding
        \multirow{2}{*}{L} & Vision & 0.32B& 1024 & 24 & 4096 & 16 & \multirow{2}{*}{1024}\addpadding{} \\
                           & Text   & 0.31B & 1024 & 24 & 4096 & 16 &  \\
                           \hline
       \addpadding
        \multirow{2}{*}{G} & Vision & 1.88B & 1536 & 50 & 8960 & 16 & \multirow{2}{*}{1280}\addpadding{} \\
                           & Text   & 0.47B & 1280 & 24 & 5120 & 20 &  \\
    \shline
\end{tabular}
\captionsetup{justification=centering}
\caption{{\bf PE Model Configurations.} } 
\label{tab:pe2b}
\vspace{-20pt}
\end{wraptable}
```
#### Model Architecture.

To capitalize on the promising scaling behavior observed in §`\ref{sec:core_image_pt}`{=latex}, we scale the largest `\PEcore{}`{=latex} model to 2B parameters[^3] (G scale). Tab. `\ref{tab:pe2b}`{=latex} shows the detailed model configuration of the vision and text transformers and the dimension of the output CLIP embedding space.

#### Smaller Model Distillation.

To maximize the performance of smaller models (B and L scales in Tab. `\ref{tab:pe2b}`{=latex}), we employ a distillation finetuning approach [@distillation] using `\PEcore{G}`{=latex} as the teacher. This process involves a short finetuning schedule where both the student and teacher models encode image and text inputs separately to compute image-to-text and text-to-image similarity distributions, similar to CLIP training [@clip]. The student's distributions are then optimized to match those of the teacher by minimizing KL-divergence, distilling multimodal relational knowledge from the teacher into the student.

Notably, we find that using a smaller softmax temperature for the teacher's distributions, specifically 0.5$\times$ the temperature used for the student's distribution, significantly enhances the effectiveness of knowledge distillation. By leveraging the strong embeddings provided by `\PEcore{G}`{=latex}, our short distillation finetuning schedule significantly boosts the performance of both B and L scale models of `\PEcore{}`{=latex} (see Appendix `\ref{appx:core_smaller_models}`{=latex}).
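The teacher-sharpened distillation objective described above can be illustrated with a minimal NumPy sketch. This is a simplified, hypothetical rendering of one image-to-text direction only (the actual training also distills the text-to-image direction over full contrastive batches); `student_temp` and the 0.5$\times$ teacher ratio follow the text, while all function names are ours:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distill_kl_loss(student_img, student_txt, teacher_img, teacher_txt,
                    student_temp=0.07, teacher_temp_ratio=0.5):
    """KL(teacher || student) over image-to-text similarity rows.

    All embeddings are assumed L2-normalized, so the matrix products
    below are cosine similarities. The teacher uses a smaller softmax
    temperature (0.5x the student's), producing sharper target
    distributions for the student to match.
    """
    s_logits = student_img @ student_txt.T / student_temp
    t_logits = teacher_img @ teacher_txt.T / (student_temp * teacher_temp_ratio)
    p_teacher = softmax(t_logits)                 # sharpened targets
    log_p_student = np.log(softmax(s_logits))
    kl = (p_teacher * (np.log(p_teacher) - log_p_student)).sum(axis=1)
    return float(kl.mean())
```

With a teacher temperature ratio of 1.0 and identical embeddings, the loss is exactly zero; the 0.5$\times$ ratio instead pushes the student toward sharper, more confident similarity distributions.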

#### Model Training.

The training process of `\PEcore{}`{=latex} involves three stages:

1.  *Image pretraining.* We scale up image pretraining to 5.4B publicly available image alt-text pairs curated with MetaCLIP [@metaclip], with a total of 86B samples seen to ensure convergence (58B for B and L). We use a global batch size of 131K, with progressively increasing resolution from 98 up to 448, depending on the model.

2.  *Image and video finetuning.* Following the initial pretraining, we finetune the model at maximum resolution with a short schedule: 50M samples on the image pretraining data (as cooldown), followed by 22M samples on the recaptioned videos with a smaller learning rate and batch size. The video captions are produced by the proposed video data engine (§`\ref{sec:video_data_engine}`{=latex}). For each video clip, we uniformly sample 8 frames, encode them, and average the frame embeddings to produce a single video embedding, which is aligned with the corresponding video caption using the same contrastive objective as in image training.

3.  *Smaller model distillation.* We distill the 2B model (G scale) into smaller contrastive pretrained models at B and L scales under their final resolutions, using a short schedule that covers approximately 4B samples seen ($\sim$8% of the pretraining schedule) with a lower learning rate and no weight decay.
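The frame-averaging step in stage 2 can be sketched as follows. This is an illustrative NumPy sketch only: `encode_fn` is a placeholder for the vision tower, and the midpoint sampling rule is one common convention for uniform frame sampling, assumed here for concreteness:

```python
import numpy as np

def uniform_frame_indices(num_frames_total, num_samples=8):
    # Midpoints of num_samples equal temporal segments.
    return ((np.arange(num_samples) + 0.5)
            * num_frames_total / num_samples).astype(int)

def video_embedding(frames, encode_fn, num_samples=8):
    """Encode uniformly sampled frames and average into one embedding."""
    idx = uniform_frame_indices(len(frames), num_samples)
    embs = np.stack([encode_fn(frames[i]) for i in idx])
    v = embs.mean(axis=0)             # average-pool the frame embeddings
    return v / np.linalg.norm(v)      # re-normalize for the contrastive loss
```

The resulting unit vector plays the same role as an image embedding in the contrastive objective, so no architectural change is needed to handle video.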

The detailed training configuration and setups are listed in Appendix `\ref{sec:appx_joint_train}`{=latex}.

```{=latex}
\begin{table*}[t!]\centering
    \makebox[\linewidth][c]{
    \tablestyle{0pt}{1.15} 
    \begin{tabular}{y{55}www awwwwww awwwwwwww awwww}
        \shline
        \multirow{2}{*}{\vspace{-2.2cm} Model}  &&&& \multicolumn{7}{c}{\ct[c1]{\it Zero-Shot Classification}} %
        & \multicolumn{9}{c}{\ct[c2]{\it Zero-Shot Fine-Grained Classification}} %
        & \multicolumn{5}{c}{\ct[c3]{\it Zero-Shot Retrieval}}\\
            & \cb{Encoder Params}{}
            & \cb{Resolution}{}
            & \cb{Data}{}
            & \cb[c1]{\textit{\textbf{Avg Class.}}}{}
            & \cb[c1]{ImageNet}{val~\cite{imagenet}}
            & \cb[c1]{ImageNet}{v2~\cite{imagenetv2}}
            & \cb[c1]{ObjectNet}{IN Classes~\cite{objectnet}}
            & \cb[c1]{ImageNet}{Adversarial~\cite{imagenet-a}}
            & \cb[c1]{ImageNet}{Renditions~\cite{imagenet-r}}
            & \cb[c1]{ImageNet}{Sketch~\cite{imagenet-sketch}}
            %
            & \cb[c2]{\textit{\textbf{Avg Fine.}}}{}
            & \cb[c2]{Food}{101~\cite{food101}}
            & \cb[c2]{Flowers}{Oxford~\cite{flower102}}
            & \cb[c2]{Pets}{Oxford~\cite{pets}}
            & \cb[c2]{Cars}{Stanford~\cite{cars}}
            & \cb[c2]{Aircrafts}{FGVC~\cite{aircraft}}
            & \cb[c2]{Countries}{211~\cite{thomee2016yfcc100m}}
            & \cb[c2]{Scenes}{SUN397~\cite{sun397}}
            & \cb[c2]{Satellite}{RESISC~\cite{cheng_2017_resisc}}
            %
            & \cb[c3]{\textit{\textbf{Avg Retrieval}}}{}
            & \cb[c3]{MS-COCO}{txt$\rightarrow$img~\cite{coco}}
            & \cb[c3]{MS-COCO}{img$\rightarrow$txt~\cite{coco}}
            & \cb[c3]{Flickr-30k}{txt$\rightarrow$img~\cite{flickr}}
            & \cb[c3]{Flickr-30k}{img$\rightarrow$txt~\cite{flickr}}
            \\
        \hline
        \multicolumn{4}{l}{{\textit{Proprietary}}} & \cat{} &&&&&&& \cat{} &&&&&&&&& \cat{} \\
        BASIC~\cite{basic}          & 2.4B    & 224 & 6.6B    & \ca{\textcolor{black}{84.3}} & \textcolor{black}{85.7} & \textcolor{black}{80.6} & \textcolor{black}{82.3} & \textcolor{black}{85.6} & \textcolor{black}{95.7} & {\textcolor{black}{76.1}} %
                & \ca{\textcolor{black}-} & \textcolor{black}{95.1} & \textcolor{black}{91.2} & \textcolor{black}{97.9} &\textcolor{black}- &\textcolor{black}- &\textcolor{black}- & \textcolor{black}{76.2} & \textcolor{black}{72.7}
            %
            & \ca{\textcolor{black}-} & \textcolor{black}- & \textcolor{black}- & \textcolor{black}- & \textcolor{black}-  \\
        CoCa~\cite{coca}            & 1.0B    & 576 & 4.8B    & \ca{\textcolor{black}{85.7}} & {\textcolor{black}{86.3}} & {\textcolor{black}{80.7}} & \textcolor{black}{82.7} & {\textcolor{black}{90.2}} & {\textcolor{black}{96.5}} & {\textcolor{black}{77.6}}  %
                & \ca{\textcolor{black}-} &\textcolor{black}- &\textcolor{black}- &\textcolor{black}- &\textcolor{black}- &\textcolor{black}- &\textcolor{black}- &\textcolor{black}- &\textcolor{black}-
        %
        & \ca{\textcolor{black}{72.6}} & \textcolor{black}{51.2} & \textcolor{black}{66.3} & \textcolor{black}{80.4} & \textcolor{black}{92.5} \\
        LiT-22B~\cite{vit22b}       & 21.7B\,\,\,   & 224 & 15B     & \ca{\textcolor{black}-}    & {\textcolor{black}{85.9}} & {\textcolor{black}{80.9}} & {\textcolor{black}{87.6}} & \textcolor{black}{90.1} & \textcolor{black}{96.0} &  \textcolor{black}-  %
                & \ca{\textcolor{black}-} &\textcolor{black}- &\textcolor{black}- &\textcolor{black}- &\textcolor{black}- &\textcolor{black}- &\textcolor{black}- &\textcolor{black}- &\textcolor{black}- 
            %
            & \ca{\textcolor{black}-} & \textcolor{black}- & \textcolor{black}- & \textcolor{black}- & \textcolor{black}-  \\
        \hline
        
        \multicolumn{4}{l}{{\textit{B Scale}}} & \cat{} &&&&&&& \cat{} &&&&&&&&& \cat{} \\
        SigLIP-B/16$^\dagger$~\cite{siglip}                & 0.1B  & 224 & 10B   
        & \ca{69.9} & 76.2 & 69.5 & 70.7 & 45.1 & 90.2 & 67.9 %
        & \ca{69.5} & 91.6 & 85.2 & 94.2 & 90.8 & 44.0 & 15.9 & 70.0 & 64.6 %
        & \ca{69.8} & 47.2 & 64.5 & 77.9 & 89.6 \\
        SigLIP2-B/16$^\dagger$~\cite{siglip2}      & 0.1B & 224 & 10B   
        & \ca{{73.1}} & {78.2} & {71.4} & \textbf{73.6} & {55.0} & \textbf{91.7} & \textbf{68.9} %
        & \ca{{73.1}} & \textbf{92.8} & {85.7} & \textbf{95.4} & \textbf{93.4} & {54.8} & {19.2} & {72.7} & {71.1} %
        & \ca{{73.7}} & \textbf{52.1} & {68.9} & {80.7} & {93.0} \\
        {\bf \PEcore{B}}                & 0.1B  & 224 & 5.4B    
        & \ca{\textbf{73.2}} & \textbf{78.4} & \textbf{71.7} & {71.9} & \textbf{62.4} & {88.7} & {66.1} %
        & \ca{\textbf{75.0}} & {92.5} & \textbf{86.5} & {94.6} & {92.1} & \textbf{57.0} & \textbf{30.5} & \textbf{74.0} & \textbf{72.7} %
        & \ca{\textbf{74.3}} & {50.9} & \textbf{71.0} & \textbf{80.8} & \textbf{94.4} \\
        \hline
        \multicolumn{4}{l}{{\textit{L Scale}}} & \cat{} &&&&&&& \cat{} &&&&&&&&& \cat{} \\
        SigLIP-L/16$^\dagger$~\cite{siglip} & 0.3B  & 384 & 10B   
        & \ca{80.7} & 82.1 & 75.9 & 80.9 & 76.5 & 95.0 & {73.6} %
        & \ca{74.4} & 95.6 & {89.4} & \textbf{96.8} & {94.8} & 53.2 & 24.7 & 72.5 & 67.9 %
        & \ca{74.7} & 52.8 & 70.5 & 82.6 & 92.9 \\
        SigLIP2-L/16$^\dagger$~\cite{siglip2}   & 0.3B & 384 & 10B
        & \ca{{83.3}} & {83.1} & {77.4} & {84.4} & {84.3} & \textbf{95.7} & \textbf{75.5} % 
        & \ca{{78.4}}   & {96.1}    & \textbf{90.0} & {96.4}    & \textbf{95.8} & {67.0} & {31.6} & {74.8} & {75.5}
            %
        & \ca{{76.7}} & {55.3} & {71.4} & {85.0} & {95.2} \\
        {\bf \PEcore{L}}                & 0.3B  & 336 & 5.4B
        & \ca{\textbf{83.9}} & \textbf{83.5} & \textbf{77.9} & \textbf{84.7} & \textbf{89.0} & {95.2} & 73.4 %
        & \ca{\textbf{80.0}} & \textbf{96.2} & 87.2 & {96.4} & 93.7 & \textbf{67.8} & \textbf{45.6} & \textbf{77.4} & \textbf{75.7} %
        & \ca{\textbf{78.8}} & \textbf{57.1} & \textbf{75.9} & \textbf{85.5} & \textbf{96.6} \\
        \hline
        \multicolumn{4}{l}{{\textit{Unbounded Scale}}} & \cat{} &&&&&&& \cat{} &&&&&&&&& \cat{} \\
        
        DFN-H+$^\dagger$~\cite{dfn} & 0.6B  & 378 & 5B   & \ca{81.6} & 84.3 & 78.3 & 79.6 & 79.6 & 93.6 & 73.3 %
        & \ca{80.5} & 96.2 & \textbf{91.6} & 96.8 & \textbf{96.0} & 72.5 & 37.9 & 77.4 & {75.9} 
        & \ca{75.8} & 55.6 & 71.8 & 82.1 & 93.6  \\
        InternVL-C~\cite{internvl}  & 5.5B    & 224 & 5B      & \ca{82.5} & 83.2 & 77.3 & 80.6 & 83.8 & 95.7 & 74.3 %
            & \ca{76.4} & 95.3 & 85.8 & 96.3 & 94.4 & 53.3 & 35.1 & 76.3 & 74.4
            %
            & \ca{{78.6}} & {\bf 58.6} & {74.9} & {85.0} & 95.7 \\
        EVA 18B~\cite{eva18b}       & 17.5B\,\,\, & 224 & 2B       & \ca{83.6} & 83.8 & 77.9 & 82.2 & 87.3 & 95.7 & 74.7 %
            & \ca{78.8} & 95.8 & 86.0 & 96.1 & {94.9} & 59.7 & 43.1 & {77.7} & \textbf{76.9}
            & \ca{77.5} & 56.2 & 73.6 & 83.3 & {\bf 96.7} \\
        EVA 18B+~\cite{eva18b} & 17.5B\,\,\, & 336 & 2B 
            & \ca{84.1} & 83.9 & 78.2 & 83.6 & 88.9 & 95.6 & 74.3 %
            & \ca{-} & -&- &- &- &- &- &- &-
            & \ca{-} & - & - & - & - \\            
        SigLIP2-g-opt$^\dagger$~\cite{siglip2}      & 1.1B & 384 & 10B      
        & \ca{{86.2}} & 85.0 & {79.8} &  {88.0} & 90.5 & \textbf{96.6} & \textbf{77.4} %
                & \ca{81.0} & \textbf{97.0} & {91.5} & \textbf{97.8} & {95.9} & {73.6} & 40.1 & 76.3 & {75.9}
            & \ca{78.0} & 56.1 & 72.8 & \textbf{86.0} & 95.4 \\
        {\bf \PEcore{G}} {\tiny\it (image only)}                & 1.9B  & 448 & 5.4B    & \ca{86.0} & {85.2} & \textbf{80.2} & 87.1 & {91.2} & 96.1 & 76.1 %
             & \ca{{82.7}} & 96.6 & 91.0 & 96.4 & 94.6 & {76.7} & {57.3} & 77.5 & 71.8
             & \ca{74.9} & 53.1 & 70.9 & 81.6 & 93.9 \\
        {\bf \PEcore{G}}                & 1.9B  & 448 & 5.4B    & \ca{{\bf 86.6}} & \textbf{85.4} & \textbf{80.2} & \textbf{88.2} & \textbf{92.6} & {96.5} & {76.5} %
                & \ca{\bf 83.7} & {96.9} & 91.4 & {96.9} & 94.7 & \textbf{78.2} & \textbf{57.6} & \textbf{78.5} & 75.8
            & \ca{\bf 78.9} & {58.1} & {\bf 75.4} & {85.7} & {96.2} \\
        \shline
    \end{tabular}
    }
    \caption{{\bf Zero-Shot Image Results.} Image zero-shot performance of \PEcore{} compared to the state-of-the-art for \textit{both} proprietary and open models.  \PEcore{G} is the first vision encoder to outperform the best models trained on the proprietary JFT-3B~\cite{vit} and WebLI~\cite{pali} on general classification. 
    Moreover at all model sizes, \PEcore{} obtains state-of-the-art results across general classification, retrieval, and finegrained classification.
    $^\dagger$Re-evaluated: DFN by~\cite{eva18b}; SigLIP and SigLIP2 by us with the same benchmark settings if not reported in~\cite{siglip2} (see Appendix~\ref{appx:zeroshot_settings}).
    }
    \label{tab:core_general_image}
\end{table*}
```
Core Results {#sec:core_results}
------------

#### Zero-Shot Image Results.

In Tab. `\ref{tab:core_general_image}`{=latex}, we present `\PEcore{}`{=latex}'s performance on zero-shot image benchmarks for classification and retrieval `\vs `{=latex}the strongest existing models, including SigLIP2 [@siglip2] and proprietary models using JFT-3B [@vit], which is likely tuned for ImageNet. `\PEcore{}`{=latex} outperforms all other contrastive models across the board on all zero-shot tasks, including the highly competitive average of zero-shot ImageNet robustness metrics [@imagenet; @imagenetv2; @objectnet; @imagenet-a; @imagenet-r; @imagenet-sketch]. This marks a significant achievement, as we are the first to accomplish this in over 3 years without access to Google's internal JFT-3B [@vit] or WebLI [@pali] datasets. And *at the same time*, `\PEcore{}`{=latex} also exceeds the existing state-of-the-art on image-text retrieval and significantly improves on fine-grained classification---the first to simultaneously hold state-of-the-art on all common zero-shot categories.

By harnessing the power of our video data engine, training with a relatively small dataset of 22M videos and their corresponding synthetic captions leads to substantial *gains in image benchmarks*, with average general image classification improving by +0.6% with emphasis on more difficult benchmarks (notably +1.2% ObjectNet, +1.4% ImageNet Adversarial) and fine-grained classification by +1.0% on average. Furthermore, due to the high level of detail and alignment of our synthetic captions, zero-shot retrieval is significantly boosted by +3.6% on average. These results emphasize that training with well-aligned video text data does not just improve video performance---it creates a strictly better model for both videos *and* images.

```{=latex}
\begin{wraptable}{r}{0.7\textwidth}
    \vspace{-10pt}
    \centering
    \makebox[\linewidth][c]{
    \tablestyle{0pt}{1.2} 
    \begin{tabular}{y{51}wx{11}x{9}x{16} awwwww awwwwww}
        \shline
        \multirow{2}{*}{\vspace{-2.2cm} Model}  &&&&& \multicolumn{6}{c}{\ct[c4]{\it Zero-Shot Classification}} %
        & \multicolumn{7}{c}{\ct[c5]{\it Zero-Shot Retrieval}}\\
            & \cb{Encoder Params}{}
            & \cb{Resolution}{}
            & \cb{\# Frames}{}
            & \cb{Video Data}{}
            & \cb[c4]{\textit{\textbf{Avg Class.}}}{}
            & \cb[c4]{Kinetics}{400~\cite{kay2017kinetics}}
            & \cb[c4]{Kinetics}{600~\cite{kay2017kinetics}}
            & \cb[c4]{Kinetics}{700~\cite{kay2017kinetics}}
            & \cb[c4]{UCF}{101~\cite{soomro2012ucf101}}
            & \cb[c4]{HMDB}{51~\cite{kuehne2011hmdb}}
            %
            & \cb[c5]{\textit{\textbf{Avg Retrieval}}}{}
            & \cb[c5]{MSR-VTT}{txt$\rightarrow$video~\cite{coco}}
            & \cb[c5]{MSR-VTT}{video$\rightarrow$txt~\cite{coco}}
            & \cb[c5]{MSVD}{txt$\rightarrow$video~\cite{flickr}}
            & \cb[c5]{MSVD}{video$\rightarrow$txt~\cite{flickr}}
            & \cb[c5]{ActivityNet}{txt$\rightarrow$video~\cite{flickr}}
            & \cb[c5]{ActivityNet}{video$\rightarrow$txt~\cite{flickr}}
            \\
        \hline
        \multicolumn{5}{l}{{\textit{B Scale}}} & \cat{} &&&&&& \cat{} \\      
        CLIP~\cite{clip} & 0.1B  & 224 & 8 & n/a   & \ca{54.3} & 58.4 & {55.1} & 46.1 & 68.9 & 43.2 % 
            & \ca{29.2} & 30.4 & 24.2 & 40.5 & 57.2  & 9.1 & 13.2 \\
        CLIP4CLIP~\cite{clip4clip} & 0.1B  & 224 & 12 & n/a   & \ca{-} & - & - & - & - & -  % 
            & \ca{-} & 32.0 & - & 38.5 & -  & - & -  \\
        SigLIP2-B/16$^\dagger$~\cite{siglip2} & 0.1B  & 224 & 8 & n/a   
        & \ca{{{57.3}}} & {58.7} & 55.0 & {48.4}  & {82.0} & {42.3} %
            & \ca{{{39.9}}} & {38.5} & {30.1} & {49.0} & {67.2} & {28.6} & {25.8}  \\
        {\bf \PEcore{B}}                & 0.1B  & 224 & 8 & 22M    & \ca{{\textbf{63.9}}} & \textbf{65.6} & \textbf{65.1} & \textbf{55.8}  & \textbf{84.6} & \textbf{48.2} %
            & \ca{{\textbf{49.9}}} & \textbf{47.6} & \textbf{47.3} & \textbf{50.4} & \textbf{76.7} & \textbf{39.0} & \textbf{38.4}  \\
        \hline
        \multicolumn{5}{l}{{\textit{L Scale}}} & \cat{} &&&&&& \cat{} \\   

        
        UMT-L~\cite{umt} & 0.3B  & 224 & 8 & 25M   & \ca{-} & - & - & - & - & -  % 
            & \ca{{47.1}} & 40.7 & {37.1} & 49.0 & {74.5}  & {41.9} & {39.4}  \\
        
        SigLIP2-L/16$^\dagger$~\cite{siglip2} & 0.3B  & 384 & 8 & n/a   
        & \ca{{64.1}} & {65.3} & {62.5} & {56.8} & {86.7} & {49.3} % 
        & \ca{44.7} & {41.5} & 31.4 & {53.7} & 74.2 & 35.9 & 31.5    \\
        
        {\bf \PEcore{L}}                & 0.3B  & 336 & 8 & 22M    & \ca{\textbf{71.4}} & \textbf{73.4} & \textbf{72.7}  & \textbf{65.3}  & \textbf{87.1} & \textbf{58.5} %
            & \ca{\textbf{54.8}} & \textbf{50.3} & \textbf{50.1} & \textbf{57.2} & \textbf{82.4} & \textbf{46.4} & \textbf{42.1}  \\
        \hline
        \multicolumn{5}{l}{{\textit{Unbounded Scale}}} & \cat{} &&&&&& \cat{} \\          


                    
        InternVL~\cite{internvl} & 5.5B  & 224 & 8 & n/a   & \ca{-} & 69.1 & 68.9 & 60.6 & - & -  % 
            & \ca{-} & 44.7 & 40.2 & - & -  & - & -  \\


        InternVideo2 ~\cite{internvideo2} & 1.0B  & 224 & 8 & 102M   & \ca{70.7} & 73.1 & {72.8} & {64.9} & 88.8 & 53.9  % 
            & \ca{\textbf{59.9}} & \textbf{51.9} & {50.9} & {58.1} & {83.3}  & \textbf{60.4} & \textbf{54.8}  \\
        VideoPrism-g\textsuperscript{*}~\cite{videoprism} & 1.1B  & 288 & 16 & 619M   & \ca{-} & {76.4} & - & - & -& -  % 
            & \ca{-} & 39.7 & \textbf{71.0} & - & -  & 52.7 & 50.3  \\
                        
        
        SigLIP2-g-opt$^\dagger$~\cite{siglip2} & 1.1B  & 384 & 8 & n/a  
        & \ca{68.2} & 69.8 & 67.0 & 61.8 & 90.7 & 51.8 % 
        & \ca{46.6} & 43.1 & 34.2 & 55.8 & 74.6 & 38.3 & 33.4    \\

      
        {\bf \PEcore{G}} {\tiny\it (image only)}                  & 1.9B  & 448 & 8 & n/a    & \ca{{70.9}} & 73.1 & 72.2 & 64.3  & {89.5} & {55.5} %
            & \ca{47.6} & 44.3 & 35.2 & 54.3 & 73.9 & 41.4 & 36.3  \\
        
        {\bf \PEcore{G}}                & 1.9B  & 448 & 8 & 22M    & \ca{\textbf{74.8}} & \textbf{76.9} & \textbf{76.1} & \textbf{69.1}  & \textbf{90.7} & \textbf{61.1} %
            & \ca{{58.7}} & {51.2} & 49.9 & \textbf{59.7} & \textbf{85.4} & {54.7} & {51.2}  \\
        \shline
    \end{tabular}
    }
    \caption{{\bf Zero-Shot Video Results.} Video performance of \PEcore{} compared to recent video and image encoders. \PEcore{} obtains state-of-the-art in video classification and comparable performance on retrieval benchmarks while using only 22M videos. \textsuperscript{*}Proprietary models.
    $^\dagger$SigLIP2 is evaluated by us with the same zero-shot prompts and frame-embedding averaging strategy (as in~\cite{clip, internvl, clip4clip}). See Appendix~\ref{appx:zeroshot_settings}.
    }
    \label{tab:core_general_video}
    \vspace{-20pt}
\end{wraptable}
```
#### Zero-Shot Video Results.

We assess the performance of `\PEcore{}`{=latex} on zero-shot video benchmarks by employing the same model as a frame-based video encoder, utilizing 8 uniformly sampled frames, as described in §`\ref{sec:video_data_engine}`{=latex}.

We present the corresponding video results in Tab. `\ref{tab:core_general_video}`{=latex}. Our base image encoder already outperforms all other image-only encoders on both zero-shot classification and retrieval, including SigLIP2-g-opt. With video finetuning, `\PEcore{G}`{=latex} significantly outperforms even native video models that use full temporal attention on video classification, and nearly matches the state-of-the-art on video retrieval using a simple frame-level encoder. This result underscores the importance of our video data engine, which yields +3.9% on average zero-shot video classification and a massive +11.1% on retrieval. Moreover, `\PEcore{}`{=latex} does this with much less video data than other video-based approaches like InternVideo2 [@internvideo2] and VideoPrism [@videoprism], highlighting the benefits of a joint image-video encoder.

#### Additional Zero-Shot Benchmarks.

We further evaluate `\PEcore{}`{=latex} on an additional set of zero-shot classification and retrieval benchmarks (Tab. `\ref{tab:core_pe_bench}`{=latex}) that we construct to address key gaps in common benchmarks. For comparison, we also evaluate SigLIP2 [@siglip2] and InternVL-C [@internvl] on these benchmarks.

First, we note that the version of ObjectNet [@objectnet] standardly used to benchmark robustness (e.g., in Tab. `\ref{tab:core_general_image}`{=latex}) is *not* the full set. ObjectNet consists of 313 classes of objects in challenging and uncommon orientations, locations, and viewpoints. However, the standard version used for benchmarking is a 113-class subset that overlaps with ImageNet-1k [@imagenet]. Naturally, benchmarking in this way rewards performing well on ImageNet classes over generality. To remove this bias, we construct the full ObjectNet set with all classes and compare it to the reduced ObjectNet set in Tab. `\ref{tab:core_pe_bench}`{=latex}. Surprisingly, we find that while `\PEcore{G}`{=latex} performs +7.6% over InternVL-C and only +0.2% over SigLIP2-g-opt on the reduced ObjectNet set, it performs +11.8% over InternVL-C and +0.9% over SigLIP2-g-opt on the full set of classes, highlighting PE's generality.

Next, we include iNaturalist [@inat2017] as a *zero-shot* benchmark because of its level of specificity with 2,101 fine-grained long-tail classes. `\PEcore{G}`{=latex} outperforms the next best SigLIP2-g-opt model by *+9.6%*, emphasizing PE's long-tail knowledge. We then evaluate PE's cultural diversity on Dollar Street [@dollar_st][^4], which consists of images of under-represented populations. Here too we find `\PEcore{G}`{=latex} to outperform existing methods, with +3.0% over SigLIP2-g-opt. Further, we test OCR performance by setting up TextCaps [@sidorov2020textcaps] as a retrieval dataset. Notably, `\PEcore{}`{=latex} performs on par with or better than SigLIP, which is known for good OCR performance. This is potentially surprising, as the horizontal flip augmentation we used during robust pretraining (§`\ref{sec:core_image_pt}`{=latex}) is typically thought to hurt OCR performance. Instead, it seems to have given `\PEcore{}`{=latex} the ability to read backwards: we test the same TextCaps retrieval but with all images horizontally flipped. Other models suffer from this, but `\PEcore{G}`{=latex}'s performance drops by only 0.1%. Finally, we evaluate `\PEcore{G}`{=latex} on the PVD benchmark (§`\ref{sec:pvd}`{=latex}), a challenging video retrieval task on 15K diverse and human-refined videos. Here, `\PEcore{G}`{=latex} significantly outperforms InternVL [@internvl] by +13.6% on text$\rightarrow$video and SigLIP2 [@siglip2] by +9.5% on video$\rightarrow$text.

#### Frozen Encoder Probing Results.

To compare against models that are not capable of zero-shot classification, we additionally evaluate `\PEcore{}`{=latex} using k-nearest neighbor classification (following [@dinov2]), linear probing (following [@internvl]), and attention probing (following [@aimv2]) on top of the ImageNet-1k [@imagenet] train set. We present these results in Tab. `\ref{tab:core_frozen_features}`{=latex} and compare to other encoders using their reported numbers. In every case, `\PEcore{G}`{=latex} outperforms all existing open encoders, including those with significantly more parameters.
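As a reference point, the k-NN evaluation can be sketched in a few lines of NumPy. This is a simplified majority-vote version over cosine similarities (the protocol of [@dinov2] additionally weights votes by temperature-scaled similarity); the function name and the choice of `k` are ours:

```python
import numpy as np

def knn_predict(train_emb, train_labels, query_emb, k=20):
    """Cosine-similarity k-NN classification on frozen embeddings.

    Both embedding matrices are assumed L2-normalized row-wise, so the
    matrix product gives cosine similarities between queries and the
    labeled training set.
    """
    sims = query_emb @ train_emb.T
    # Indices of the k most similar training examples per query.
    nn = np.argsort(-sims, axis=1)[:, :k]
    preds = []
    for row in nn:
        votes = np.bincount(train_labels[row])  # majority vote over labels
        preds.append(votes.argmax())
    return np.array(preds)
```

Because the encoder stays frozen and only nearest neighbors are computed, this probe directly measures the quality of the embedding space rather than the capacity of a learned head.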

#### Summary.

`\PEcore{}`{=latex}, a unified image-video encoder, achieves state-of-the-art performance across zero-shot classification and retrieval on both images and videos on a wide variety of benchmarks. This synergy is made possible by our robust image pretraining recipe (§`\ref{sec:core_image_pt}`{=latex}) and powerful video data engine (§`\ref{sec:core_video_ft}`{=latex}), which together enable the model to effectively leverage the strengths of both image and video data at scale.

```{=latex}
\centering
```
```{=latex}
\vspace{0pt}
```
```{=latex}
\centering
```
```{=latex}
\tablestyle{0pt}{1.2}
```
```{=latex}
\begin{tabular}{y{60}www wwww wwww}
        \shline
        \multirow{2}{*}{\vspace{-2.2cm} Model}  &&&& \multicolumn{4}{c}{\ct[c1]{\it Zero-Shot Classification}} %
        & \multicolumn{4}{c}{\ct[c4]{\it Zero-Shot Retrieval}}\\
            & \cb{Encoder Params}{}
            & \cb{Resolution}{}
            & \cb{Data}{}
            & \cb[c1]{ObjectNet~\cite{objectnet}}{IN Overlap (113)}
            & \cb[c1]{ObjectNet~\cite{objectnet}}{All Classes (313)}
            & \cb[c1]{iNaturalist}{2017~\cite{inat2017}}
            & \cb[c1]{Dollar St}{58~\cite{dollar_st, datacomp}}
            & \cb[c4]{TextCaps}{img$\rightarrow$txt~\cite{sidorov2020textcaps}}
            & \cb[c4]{TextCaps {\tiny \bf Flip}}{img$\rightarrow$txt~\cite{sidorov2020textcaps}}
            & \cb[c4]{PVD Bench}{text$\rightarrow$vid}
            & \cb[c4]{PVD Bench}{vid$\rightarrow$txt}  
            \\
        \hline
        \addpadding
        %\multicolumn{4}{l}{{\textit{B Scale}}} &&&&&&&&  \\         
        {SigLIP2-B/16~\cite{siglip2}} & 0.1B  & 224 & 10B & \textbf{73.6} & \textbf{59.1} & {16.9}  & \textbf{55.9} & {72.0} & {69.8} & {53.9} & {60.1} \\
        {\bf \PEcore{B}} & 0.1B  & 224 & 5.4B & {71.9} & {58.3} & \textbf{25.9} & {52.1} & \textbf{72.3} & \textbf{71.9} & \textbf{59.8} & \textbf{61.1} \\
        \hline
        \addpadding
        {SigLIP2-L/16~\cite{siglip2}} & 0.3B  & 384 & 10B & {84.4} & {73.2}
        & {26.7} & {57.6} & {78.0} & {76.2} & {61.9} & \textbf{67.1} \\
        {\bf \PEcore{L}} & 0.3B  & 336 & 5.4B & \textbf{84.7} & \textbf{74.3} & \textbf{35.3}   & \textbf{59.6} & \textbf{78.5} & \textbf{78.3} & \textbf{64.7} & {65.2} \\
        \hline
        \addpadding
        {InternVL-C~\cite{internvl}} & 5.5B & 224 & 5B & 80.6 & 67.2 & 19.4 & 58.2 &  72.3 & 67.8 & {63.4} & 65.1 \\
        {SigLIP2-g-opt~\cite{siglip2}} & 1.1B  & 384 & 10B & {88.0} & {78.1} & {31.5} &  {59.3} & \textbf{78.8} & {76.9} & {62.5} & {67.1} \\
        {\bf \PEcore{G}} & 1.9B  & 448 & 5.4B & \textbf{88.2} & \textbf{79.0} & \textbf{41.1}   & \textbf{62.3} & \textbf{78.8} 
        & \textbf{78.7} & \textbf{77.0} & \textbf{76.6} \\
        \shline
    \end{tabular}
```
```{=latex}
\hfill
```
```{=latex}
\vspace{0pt}
```
```{=latex}
\centering
```
```{=latex}
\tablestyle{0pt}{1.2}
```
```{=latex}
\begin{tabular}{y{60}www www}
        \shline
        \multirow{2}{*}{\vspace{-2.2cm} Model} &&&& \multicolumn{3}{c}{\ct[c1]{\it Encoder Probing}} \\ %
            & \cb{Encoder Params}{}
            & \cb{Resolution}{}
            & \cb{Data}{}
            & \cb[c1]{ImageNet~\cite{imagenet}}{KNN}
            & \cb[c1]{ImageNet~\cite{imagenet}}{Linear}
            & \cb[c1]{ImageNet~\cite{imagenet}}{Attention}  \\    
            \hline
            \addpadding
            DINOv2-g~\cite{dinov2} & 1.1B & 224 & 145M & 83.5 & 86.5 & \hphantom{$^\dagger$}87.2$^\dagger$ \\
            RADIOv2.5-g~\cite{heinrich2024radio2.5} & 1.1B & 518 & - & {85.3} & - & - \\
            AIMv2 3B~\cite{aimv2} & 2.7B & 448 & 7.2B& -  & - & {89.5} \\
            InternVL-C~\cite{internvl} & 5.5B & 224 & 5B & - & 88.2  & - \\
            EVA 18B~\cite{eva18b} & 17.5B\,\,\, & 224 & 2B & - & {88.9} & - \\
            {\bf \PEcore{G}} & 1.9B & 448 & 5.4B & {\bf 86.8}  & {\textbf{89.5}} & {\bf 89.8} \\
            \shline
        \end{tabular}
```
```{=latex}
\clearpage
```
General Features in a Contrastive Disguise {#sec:layerfinder}
==========================================

`\PEcore{}`{=latex} puts up strong results on the tasks contrastive encoders are known for, like zero-shot classification and retrieval. But while those tasks are useful, they are only a small part of the vision ecosystem. What *really matters* is whether the features learned with our pretraining recipe are useful for downstream tasks.

Today's common wisdom in the vision community holds that different pretraining methods result in features useful for different tasks: e.g., contrastive for classification, captioning for language modeling, and self-supervised learning for spatial tasks. To see how `\PEcore{}`{=latex} stacks up against models with different pretraining techniques, we compare its *frozen features* to the state-of-the-art large-scale models for captioning (AIMv2-3B [@aimv2]) and self-supervised learning (DINOv2-g [@dinov2]) on a variety of downstream tasks.

```{=latex}
\begin{wrapfigure}{r}{0.6\textwidth}
\vspace{-10pt}
    \centering
    \includegraphics[width=\linewidth, trim = 12.7in 0in 0in 9.8in, clip]{fig/layerfinder.pdf}
    \caption{
        {\bf Layer Analysis.} Evaluating intermediate layers as frozen features across tasks for different pretraining methods: captioning (AIMv2-3B~\cite{aimv2}, left), spatially self-supervised (DINOv2-g~\cite{dinov2}, middle), and our contrastive recipe (\PEcore{G}, right). Vertical lines denote the best layer and horizontal lines the best performance across models. As expected, AIMv2 performs well on language but not spatial, and DINOv2 performs well on spatial but not language. 
        But surprisingly, \textit{intermediate layers} of \PEcore{G} perform well on \textit{both} language modeling and spatial tasks. 
    }
    \label{fig:layerfinder}
\vspace{-20pt}
\end{wrapfigure}
```
#### Layerwise Feature Analysis.

We summarize the results of our frozen feature analysis in Fig. `\ref{fig:layerfinder}`{=latex} for several downstream benchmarks in 3 categories: classification, language modeling, and spatial tasks. For classification, we probe each model using a randomly initialized cross-attention transformer block. For language alignment, we use the Perception Language Model (PLM) [@PLM] frozen encoder evaluation setup, learning a projector and finetuning a decoder-only LLM (see §`\ref{sec:la}`{=latex}), and for spatial tasks we train with several different decoders (ViTDet [@vitdet] Mask-RCNN [@maskrcnn] with Absolute Win [@abswin] for detection, DPT [@dpt] for depth, and zero-shot feature correspondence for tracking [@jabri2020space]). For each experiment, we sweep over the layers of the model, as the optimal features are not necessarily at the last layer [@chen2024internvit2p5]. In each case, we use an equivalent image size (window size for detection) of $32\times32$ tokens. In each plot, we normalize performance by the maximum and minimum performance across models on that task.
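To make the protocol concrete, here is a minimal sketch of the per-task normalization and best-layer selection; the scores below are made-up placeholders, not values from the figure:

```python
import numpy as np

# Made-up per-layer probe scores for three models on one task (not real results)
scores = {
    "AIMv2":  np.array([0.40, 0.55, 0.62, 0.58]),
    "DINOv2": np.array([0.45, 0.60, 0.71, 0.69]),
    "PEcore": np.array([0.42, 0.72, 0.70, 0.50]),
}

# Normalize by the min/max performance across all models on this task
lo = min(float(s.min()) for s in scores.values())
hi = max(float(s.max()) for s in scores.values())
normed = {m: (s - lo) / (hi - lo) for m, s in scores.items()}

# The optimal layer is not necessarily the last: sweep layers and take the argmax
best_layer = {m: int(s.argmax()) for m, s in scores.items()}
```

Note that in this toy example the best layer for each model sits in the middle of the sweep, which is exactly the pattern the figure highlights.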

#### An Alignment Problem.

This analysis reveals several insights. First, as expected, AIMv2 performs well at classification and the best at visual Q&A language tasks. Similarly, DINOv2 performs well on spatial tasks like detection and depth, and even performs the best at grounding through an LLM. And, as other works have already established, DINOv2 underperforms on OCR tasks [@cambrian]. This is no secret, but what is interesting is that its performance *peaks in the middle of the network* and then drops significantly by the end. The same holds for other models on other downstream tasks (AIMv2: tracking, grounding, detection; DINOv2: VQ&A, grounding).

`\PEcore{}`{=latex} exhibits similar behavior, but with unexpected results. Unlike the others, in earlier layers of the network `\PEcore{}`{=latex} *performs well on all tasks, often matching or exceeding the leading models*. Remarkably, PE has intermediate layers that perform close to or on par with AIMv2 for language tasks and DINOv2 for spatial tasks, despite being trained with a contrastive loss. Depth estimation is particularly noteworthy, as contrastive encoders are not typically considered state-of-the-art in that area.

However, in almost all cases this strong performance *diminishes rapidly* towards the end of the network. In fact, the performance of `\PEcore{}`{=latex} in the final layer is *abysmal* for certain tasks, such as LLM-based grounding (the reason for which will become apparent in §`\ref{sec:sa}`{=latex}). This behavior is less pronounced the closer the downstream task is to the pretraining method, suggesting an *alignment problem*. Specifically, a well-tuned large-scale contrastive model can learn general embeddings in the process of fitting its objective, *but it fails to output them*. Therefore, to reveal these embeddings, the model must be subsequently aligned to downstream tasks.

#### Analysis.

The finding that pure CLIP models possess features which match the performance of state-of-the-art pretraining methods in their specialized domains is new. In fact, recent work [@webdino] has shown the opposite---that CLIP models fail to scale on downstream tasks. We next investigate how our approach yields these results.

```{=latex}
\begin{wrapfigure}{r}{0.545\textwidth}
\vspace{-10pt}
  \begin{center}
      
        \includegraphics[width=1\linewidth, trim = 14.4in 0in 0in 7.6in, clip]{fig/coco_convnext.pdf}
    
    \end{center}
    \caption{{\bf The Downstream Effects of Robust Pretraining.} The ViT-L/14 checkpoints from Fig.~\ref{fig:core_pt_ablations} evaluated as frozen features on COCO~\cite{coco} using Mask R-CNN~\cite{maskrcnn}. We report the last layer performance, best layer performance, and the best layer's index.
    }
    \label{fig:layerfinder_ablations}
\vspace{-10pt}
\end{wrapfigure}
```
To start, we perform layerwise frozen feature analysis on COCO detection. `\PEcore{}`{=latex} was particularly \`\`peaky" on this task in Fig. `\ref{fig:layerfinder}`{=latex}, with its best layer on par with DINOv2, but last layer significantly worse. We already ablated each change we made from vanilla CLIP in Fig. `\ref{fig:core_pt_ablations}`{=latex} using a ViT-L/14 model. So to retrace our steps, we run frozen feature analysis on those checkpoints. For efficiency, we perform this experiment at a lower resolution and only sample even layers. In Fig. `\ref{fig:layerfinder_ablations}`{=latex}, we report COCO box mAP for the last and best layers for each cumulative ablation, along with the index of the best layer. Further, we plot the layerwise performance for each change in Fig. `\ref{fig:layerfinder_ablations_plot}`{=latex}.

Surprisingly, the simple changes we made in §`\ref{sec:core_image_pt}`{=latex} to construct our pretraining recipe overall improved the best layer's performance by *almost 10 mAP* over vanilla CLIP! That changes like high resolution (5) and RoPE (6) improve spatial features is to be expected, but unexpectedly data augmentation (8) and *especially* progressive resolution (2) help considerably. It is possible that contrastive pretraining is prone to overfitting the \`\`global" nature of the task through \`\`global tokens" [@vitsneedregisters]. However, because the progressively changing resolution prevents the model from keeping global tokens in the same place, it is forced to be more robust. Also of note is that both progressive resolution (2) and attention pooling (7) move the argmax layer deeper into the network (rightmost column of Fig. `\ref{fig:layerfinder_ablations}`{=latex}). Attention pooling in particular alters the whole shape of the layerwise performance curve (Fig. `\ref{fig:layerfinder_ablations_plot}`{=latex}), while the other changes typically only raise or lower it.

```{=latex}
\begin{wrapfigure}{l}{0.3\textwidth}
\vspace{-15pt}
  \begin{center}
      
        \includegraphics[width=1\linewidth, trim = 16.5in 0in 0in 8.6in, clip]{fig/layerfinder_ablation.pdf}
    
    \end{center}
    \caption{{\bf Layer Analysis} corresponding to the results presented in Fig.~\ref{fig:layerfinder_ablations}.}
    \label{fig:layerfinder_ablations_plot}
\vspace{-10pt}
\end{wrapfigure}
```
Potentially more interesting is what did not improve performance: specifically, increasing the batch size (3) and using LAMB with a high learning rate (4). Both of these changes explicitly help the model fit the CLIP loss better, which after a certain point may not improve the general features. Moreover, while the best layer overall improved significantly, the last layer performance stagnated after (2). This suggests that constructing the global CLIP token requires a substantial \`\`decoder" (in this case, 6 layers for the final L/14 model). Although the features of this decoder are beneficial for some tasks (e.g., Visual Q&A as shown in Fig. `\ref{fig:layerfinder}`{=latex}), they are not general. Nevertheless, this does not prevent the model from learning general features; it merely limits their expression in the output.

```{=latex}
\begin{wrapfigure}{r}{0.6\textwidth}
\vspace{-14pt}
  \begin{center}
      
        \includegraphics[width=1\linewidth, trim = 10.65in 0in 0in 17.1in, clip]{fig/layerfinder_scaling_coco.pdf}
    
    \end{center}
    \caption{{\bf The Downstream Scalability of Robust Pretraining.} Left: frozen feature layer analysis of the S/14, B/14, and L/14 models from Fig.~\ref{fig:core_pt_scaling} using the same setup as Fig.~\ref{fig:layerfinder_ablations}. Right: scaling behavior of the \textit{best layer} for each model. Note: G is our final model and has a different schedule.
    }
    \label{fig:layerfinder_scaling_coco}
\vspace{-10pt}
\end{wrapfigure}
```
#### Scaling Behavior.

Finding a simple, easily scalable vision pretraining method that produces generally useful features has been the white whale of the vision community for a while. Evidently, our robust recipe can enable contrastive pretraining to produce general features. This raises the question: \`\`does it scale?"

```{=latex}
\begin{wrapfigure}{r}{0.6\textwidth}
\vspace{18pt}
  \begin{center}
      
        \includegraphics[width=1\linewidth, trim = 10.7in 0in 0in 5.52in, clip]{fig/layerfinder_scaling_beeeg.pdf}
    
    \end{center}
    \caption{{\bf Further Scalability Analysis.} We repeat the analysis from Fig.~\ref{fig:layerfinder_scaling_coco} on a wide range of downstream tasks by adapting to a language model. Each category is an average of several downstream tasks (see \S\ref{sec:la}).
    }
    \label{fig:layerfinder_scaling_lang}
\vspace{-30pt}
\end{wrapfigure}
```
We can answer this question in the same way: by performing frozen feature layer analysis of our S/14, B/14, and L/14 scaling ablation checkpoints from Fig. `\ref{fig:core_pt_scaling}`{=latex}. We report the result of that analysis in Fig. `\ref{fig:layerfinder_scaling_coco}`{=latex}. We also include our final `\PEcore{G}`{=latex} model using the same setup, but note this is an estimate as our ablation and final schedules differ.

Immediately, we see a stark contrast between the scaling behavior of the vanilla CLIP recipe and ours. While the vanilla recipe quickly plateaus at L scale (300M), the best layer of our robust pretraining recipe demonstrates scaling to G scale (2B) and potentially beyond---despite being trained with a decidedly non-spatially aligned global contrastive loss. However, this is the *best* layer. The *last* layer performance still stagnates for both the vanilla recipe and ours. This may be why prior work [@webdino] finds contrastive pretraining to not scale for downstream tasks---CLIP loss obfuscates its general features even with our recipe, placing them several layers deep.

However, this is just for a single spatial task. To see whether the trend is consistent, we repeat this scaling analysis on a wide variety of downstream language modeling tasks using the same frozen evaluation setup as Fig. `\ref{fig:layerfinder}`{=latex} and report the results in Fig. `\ref{fig:layerfinder_scaling_lang}`{=latex}. Surprisingly, the simple change in pretraining recipe improves scaling for most language tasks as well---including output-side grounding (RefCOCO). Note that in this benchmarking setup, the LLM never sees videos during training so the Video Q&A per-layer results are noisy. Yet, the best layer trend is still the same.

Clearly, contrastive pretraining with our robust recipe produces strong general features that scale. However, these features are not of much use stuck in the middle of the network. To remedy this, in the remaining sections we discuss methods for *aligning* these general features to the output of the network for both language modeling and spatial tasks.

```{=latex}
\clearpage
```
Perception Encoder: *Language Alignment* {#sec:la}
========================================

```{=latex}
\vspace{-5pt}
```
In §`\ref{sec:layerfinder}`{=latex} we have seen that `\PEcore{}`{=latex} already possesses useful features for vision-language modeling. In this section, we *lift* these features through *alignment tuning* to construct a new encoder, `\PElang{}`{=latex}, specialized for multimodal large language models (MLLMs). Our principle is to design not only the most performant, but also the most *general* vision encoder for use in MLLM development. To this end, we want a single language-aligned encoder that performs well across language models, across input resolutions, and for a wide variety of MLLM tasks.

#### MLLM Evaluation Tasks.

In this section, our main testbed is to adapt vision encoders to MLLMs and test on various MLLM tasks. We evaluate the downstream performance of each MLLM across five task categories: (1) *OCR*, *Chart*, *Document Q&A* on ChartQA [@zheng2024chartqa], DocVQA [@mathew2021docvqa], InfoVQA [@mathew2022infographicvqa] and AI2D [@kembhavi2016ai2d]; (2) *Visual Q&A* on TextVQA [@singh2019textvqa], OK-VQA [@schwenk2022okvqa], POPE [@li2023popebenchmark], and VQAv2 [@goyal2017vqav2]; (3) *Captioning* on Flickr [@flickr], COCO [@coco], and NoCaps [@agrawal2019nocaps]; (4) *Video Understanding* on VideoMME [@fu2024videomme], STAR [@wu2021star], TGIF-QA [@jang2017tgif], EgoSchema [@mangalam2024egoschema], MVBench [@li2024mvbench], and PerceptionTest [@patraucean2024perceptiontest]; and finally (5) *Grounding* on RefCOCO [@kazemzadeh2014referitgame].

```{=latex}
\vspace{-5pt}
```
Language Alignment Method {#sec:la_method}
-------------------------

We begin by searching for the optimal language alignment method. We design our alignment tuning based on the *midtraining* stage of the Perception Language Model (PLM) [@PLM], which adapts `\PEcore{}`{=latex} to a pretrained decoder-only LLM (Llama 3 [@llama3]) connected by a vision projector. We start with a \`\`warmup" training stage with an autoregressive next-token prediction loss on 1M image-text samples from pretraining, where everything but the projector is frozen. Then, we proceed to finetune all parameters on 70M data samples [@PLM] covering natural images, documents/charts/diagrams, and videos, using the same next-token prediction loss. After completing this language alignment, we extract the vision encoder from the model and refer to it as `\PElang{}`{=latex}.
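A minimal sketch of this two-stage freezing schedule (the component names are illustrative; the actual PLM training code differs):

```python
# Illustrative component names; the actual PLM training code differs.
COMPONENTS = ("vision_encoder", "projector", "llm")

def trainable_mask(stage: str) -> dict:
    """Which components receive gradients in each alignment stage."""
    if stage == "warmup":
        # Stage 1: next-token prediction on 1M image-text samples,
        # everything frozen except the vision projector.
        return {c: (c == "projector") for c in COMPONENTS}
    if stage == "finetune":
        # Stage 2: all parameters unfrozen on the full 70M-sample mix.
        return {c: True for c in COMPONENTS}
    raise ValueError(f"unknown stage: {stage}")
```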

```{=latex}
\begin{wraptable}{r}{0.375\textwidth}
\vspace{-8pt}
    \centering
    \makebox[\linewidth][c]{
    \tablestyle{0pt}{1.05} 
    \setlength{\ccustomlen}{1.5cm}
    \begin{tabular}{wwwwwww awwww}
        \shline
            \ccustom{LLM scale}{}
            & \ccustom{LLM unfrozen}{}
            & \ccustom{Regularization?}{}
            & \ccustom{Projector}{}
            & \ccustom{Layer}{}
            & \ccustom[c4]{\textbf{Avg.}}{}
            & \ccustom[c4]{{OCR Q\&A}}{Average of 4}
            & \ccustom[c4]{Captioning}{Average of 3}
            & \ccustom[c4]{Visual Q\&A}{Average of 4}
            & \ccustom[c4]{Video Q\&A}{Average of 6}
            \\
        \hline
        \multicolumn{5}{l}{\it \addpadding LLM Setup}&\ca{}\\
        1B &    &  & MLP     & 47  & \ca{76.5} & 60.7 & 115.1 & 76.0 & 54.0 \\
        3B &   &  & MLP     & 47  & \ca{78.1} & 65.9 & 115.7 & 76.6 & 54.1 \\
        3B & $\checkmark$  &  & MLP  & 47  & \ca{78.4} & 65.8 & 117.6 & 76.3 & 53.7 \\
        \hline
        \multicolumn{5}{l}{\it \addpadding Vision Projector}&\ca{}\\
        3B &    &  & Linear  & 47  & \ca{77.2} & 64.5 & 114.1 & 76.5 & 53.7 \\
        3B &    &  & MLP    & 47  & \ca{78.1} & 65.9 & 115.7 & 76.6 & 54.1 \\
        \hline
        \multicolumn{5}{l}{\it \addpadding PE Output Layer}&\ca{}\\
        3B &    &  & MLP     & 50  & \ca{75.9} & 56.6 & 116.7 & 76.5 & 53.7 \\
        3B &    &  & MLP     & 47  & \ca{78.1} & 65.9 & 115.7 & 76.6 & 54.1 \\
        3B &    &  & MLP     & 41 & \ca{76.9} & 65.5 & 112.8 & 75.4 & 53.9 \\
        \hline
        \multicolumn{5}{l}{\it \addpadding PE Regularization}&\ca{}\\
        3B &    & $\checkmark$ & MLP  & 47 & \ca{79.9} & 69.0 & 117.5 & 77.4 & 55.6 \\% & 68.4 \\
        3B & $\checkmark$      & $\checkmark$ & MLP & 47 & \ca{\textbf{80.1}} & 68.7 & 118.3 & 77.0 & 56.3 \\
        \shline
    \end{tabular}
    }
    \caption{{\bf Language Alignment.} 
    We find the best configuration to language align \PEcore{G} using autoregressive language training. 
    }
    \vspace{-30pt}
    \label{tab:align_ablation}
\end{wraptable}
```
To arrive at the optimal training configuration presented in PLM [@PLM], we first conduct ablation studies using a 20M subset of the data. In Tab. `\ref{tab:align_ablation}`{=latex}, we ablate the LLM sizes, training parameters, vision projector types, output layers to project, and encoder regularization. We evaluate across OCR Q&A, Captioning, Visual Q&A, and Video Q&A and find the best configuration.

**LLM Setup.** We explore different *scales* (1B or 3B parameters) and *freezing* the weights of the LLM. We observe that going from 1B to 3B parameters increases the average score by 1.6 points (76.5$\rightarrow$78.1). Unfreezing the LLM boosts this number to 78.4.

**Vision Projector.** Using a *2-layer MLP* vision projector instead of a *linear layer* improves the average score from 77.2 to 78.1, while adding only a few parameters (13.5M $\rightarrow$ 27M).
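As one illustration, a 2-layer MLP projector can be sketched as below; the widths, GELU nonlinearity, and initialization here are placeholder assumptions, not the configuration used in the paper:

```python
import numpy as np

# Hypothetical widths chosen only for the sketch (not the real model dimensions)
d_vis, d_hid, d_llm = 64, 128, 96
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.02, size=(d_vis, d_hid)), np.zeros(d_hid)
W2, b2 = rng.normal(scale=0.02, size=(d_hid, d_llm)), np.zeros(d_llm)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_projector(tokens):
    """2-layer MLP mapping vision tokens (N, d_vis) into the LLM space (N, d_llm)."""
    return gelu(tokens @ W1 + b1) @ W2 + b2

out = mlp_projector(rng.normal(size=(5, d_vis)))
```

The second weight matrix is what a linear projector lacks, which is why the MLP roughly doubles the projector's parameter count.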

**PE Output Layer.** As shown in §`\ref{sec:layerfinder}`{=latex}, `\PEcore{G}`{=latex} has intermediate layers that perform significantly better than the last layer when used as features for certain tasks. However, it is not clear if that same behavior applies when finetuning. We test applying the projector to layers 41, 47, and 50 (the last layer), and find that layer 47 works best. Incidentally, this is also the optimal layer for frozen VQ&A in Fig. `\ref{fig:layerfinder}`{=latex}.

**PE Regularization.** We apply LayerScale [@layerscale] and DropPath [@droppath] to the vision encoder during alignment to stabilize training. This improves the 78.1 average score to 79.9 ($+1.8$ points). Unfreezing the LLM boosts this number further to 80.1. We choose this configuration (last row) as our final alignment setup.
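Both regularizers are simple to sketch; this minimal NumPy version (shapes and the $10^{-5}$ init are illustrative choices) shows how they act on a residual branch:

```python
import numpy as np

def layer_scale(x, gamma):
    """LayerScale: learnable per-channel scale on a residual branch,
    initialized small (e.g. 1e-5) so the branch starts near-identity."""
    return x * gamma

def drop_path(x, p, training, rng):
    """DropPath (stochastic depth): zero the entire residual branch per
    sample with probability p during training, rescaling survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return x
    keep = rng.random((x.shape[0],) + (1,) * (x.ndim - 1)) >= p
    return x * keep / (1.0 - p)

# A regularized residual block: y = x + drop_path(layer_scale(branch(x), gamma), ...)
rng = np.random.default_rng(0)
x = np.ones((4, 3))
gamma = np.full(3, 1e-5)
y_eval = x + drop_path(layer_scale(x, gamma), p=0.2, training=False, rng=rng)
```

At evaluation time DropPath is the identity, so only the small LayerScale contribution remains on the residual branch.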

To construct `\PElang{}`{=latex}, we scale this recipe up to the 70M samples mentioned above (more details in [@PLM]). In summary, we use a pretrained Llama 3.2 3B, unfrozen, with a 2-layer MLP as a vision projector on top of `\PEcore{G}`{=latex} layer 47 (with the last 3 layers discarded), and regularize the encoder with LayerScale and DropPath. Compared to the 20M-sample ablation setting in Tab. `\ref{tab:align_ablation}`{=latex}, the final `\PElang{}`{=latex} trained on 70M total samples gains another +2.1 points, reaching 82.2 on the average across OCR Q&A, Captioning, Visual Q&A, and Video Q&A.

`\label{sec:lang_layerfinder}`{=latex}

#### Effects.

The goal of alignment tuning is to *lift* the strong features found in intermediate layers of `\PEcore{}`{=latex} described in §`\ref{sec:layerfinder}`{=latex} to the end of the network. To see if we actually accomplished that, we perform the same layerwise

```{=latex}
\begin{wrapfigure}{r}{0.45\textwidth}
\vspace{-0pt}
  \begin{center}
      
        \includegraphics[width=1\linewidth, trim = 13.8in 0in 0in 14.59in, clip]{fig/layerfinder_lang.pdf}
    
    \end{center}
    \caption{{\bf Language Alignment.} We analyze how language alignment changes the internal features of PE. 
    Similar to our \PEcore{} analysis in Fig.~\ref{fig:layerfinder_scaling_lang}, we extract \PElang{} and adapt each layer to a new LLM. 
    }
    \label{fig:lang_language_alignment_analysis}
\vspace{-10pt}
\end{wrapfigure}
```
analysis as in Fig. `\ref{fig:layerfinder}`{=latex} on our final `\PElang{G}`{=latex} model and compare it to the original `\PEcore{G}`{=latex} checkpoint it was initialized from. We present the results of this analysis in Fig. `\ref{fig:lang_language_alignment_analysis}`{=latex}, and immediately we see that language alignment was a success: across all categories, the best-performing layer of the aligned model was the last, regardless of how the original checkpoint behaved. Notably, our `\PElang{}`{=latex} training mix did *not* contain grounding data, which means this significant lift in grounding performance is entirely due to the strong intermediate grounding features of `\PEcore{}`{=latex} now being aligned to the end of the network. Moreover, specific domains such as OCR Q&A that *were* represented in the training mix see a significant boost compared to even the best layer of `\PEcore{}`{=latex}, which was already strong. Thus, with an order of magnitude fewer samples than pretraining, we were able to *language align* `\PEcore{G}`{=latex} to create a single, strong encoder for all visual language modeling tasks. Following this success, we align `\PEcore{L}`{=latex} in a similar manner to construct `\PElang{L}`{=latex} (see [@PLM]). `\label{sec:la_layer_find}`{=latex}

```{=latex}
\begin{table*}[b!]\centering
    \vspace{-10pt}
    \makebox[\linewidth][c]{
    \tablestyle{0pt}{1.05} 
        \begin{tabular}{y{50}wx{20} awwww awwww awww a awwwwww}
        \shline
        \multirow{2}{*}{\vspace{-2.2cm} Model}  &&& \multicolumn{5}{c}{\ct[c3]{\it OCR / Chart / Doc. Q\&A}} %
        & \multicolumn{5}{c}{\ct[c4]{\it Visual Q\&A}} & \multicolumn{4}{c}{\ct[c5]{\it Captioning}} & \multicolumn{1}{c}{\ct[c6]{}} & \multicolumn{7}{c}{\ct[c7]{\it Video}} \\
            & \cb{Encoder Params}{}
            & \cb{Resolution}{Patch Size}
            & \cb[c3]{\textit{\textbf{Avg. OCR QA}}}{}
            & \cb[c3]{ChartQA}{Acc.~\cite{zheng2024chartqa}}
            & \cb[c3]{DocVQA}{Acc.~\cite{mathew2021docvqa}}
            & \cb[c3]{Info. QA}{Acc.~\cite{mathew2022infographicvqa}}
            & \cb[c3]{AI2D}{Acc.~\cite{kembhavi2016ai2d}}
            %
            & \cb[c4]{\textit{\textbf{Avg. VQA}}}{}
            & \cb[c4]{TextVQA}{Acc.~\cite{singh2019textvqa}}
            & \cb[c4]{OK-VQA}{Acc.~\cite{schwenk2022okvqa}}
            & \cb[c4]{POPE}{Acc. ~\cite{li2023popebenchmark}}
            & \cb[c4]{VQAv2}{Acc.~\cite{goyal2017vqav2}}
            %
            & \cb[c5]{\textit{\textbf{Avg. Cap.}}}{}
            & \cb[c5]{Flickr}{CIDEr~\cite{flickr}}
            & \cb[c5]{COCO}{CIDEr ~\cite{coco}}
            & \cb[c5]{NoCaps}{CIDEr~\cite{agrawal2019nocaps}}
            % <--
            % <--
            & \cb[c6]{\textit{\textbf{Avg. Ground.}}}{RefCOCO/g/+~\cite{kazemzadeh2014referitgame}} 
            & \cb[c7]{\textit{\textbf{Avg. Video}}}{}
            & \cb[c7]{VideoMME}{Acc.~\cite{fu2024videomme}}
            & \cb[c7]{STAR}{Acc.~\cite{wu2021star}}
            & \cb[c7]{TGIF-QA}{Acc.~\cite{jang2017tgif}}
            & \cb[c7]{EgoSchema}{Acc.~\cite{mangalam2024egoschema}}
            & \cb[c7]{MVBench}{Acc.~\cite{li2024mvbench}}
            & \cb[c7]{PerceptionTest}{Acc.~\cite{patraucean2024perceptiontest}} \\
        \hline
\multicolumn{1}{l}{{\textit{256 Tokens per Image}}}                & & & \cat{} &&&&& \ca{} &&&&& \ca{} &&&& \ca{} & \ca{} &&&&&&  \\
MetaCLIP-L~\cite{metaclip}                  & 0.3B           & \rp{224}{14} & \ca{44.9} & 47.9 & 33.0 & 28.7 & 70.2 & \ca{68.4} & 47.6 & 62.5 & 86.9 & 76.5 & \ca{110.5} & 87.5 & 130.0 & 114.1& \ca{60.6}  & \ca{53.9} & 46.1 & 51.0 & 66.4 & 58.6 & 49.4 & 51.9  \\
MetaCLIP-G~\cite{metaclip}                  & 1.8B             & \rp{224}{14} & \ca{44.8} & 47.6 & 33.1 & 27.9 & 70.6 & \ca{68.8} & 48.2 & 63.5 & 86.5 & 76.9 & \ca{111.1} & 86.5 & 132.1 & 114.8& \ca{60.5}  & \ca{53.1} & 45.0 & 50.7 & 66.4 & 56.0 & 48.7 & 51.9  \\
\textbf{\PElang{} G}$^\dagger$                 & \,\,\,1.7B$^*$          & \rp{224}{14} & \ca{53.7} & 61.3 & 47.1 & 32.2 & 74.1 & \ca{71.8} & 55.1 & 65.3 & 86.8 & 79.8 & \ca{116.4} & 91.0 & 136.9 & 121.2 & \ca{65.7} & \ca{55.5} & 47.3 & 55.7 & 68.9 & 59.6 & 48.6 & 52.9 \\

\hline
\multicolumn{1}{l}{{\textit{576 Tokens per Image}}}                 & & & \cat{} &&&&& \ca{} &&&&& \ca{} &&&& \ca{} & \ca{} &&&&&&  \\
CLIP~\cite{clip}                          & 0.3B           & \rp{336}{14} & \ca{53.5} & 61.7 & 49.5 & 32.8 & 70.1 & \ca{72.7} & 60.7 & 63.9 & 87.3 & 78.9 & \ca{113.3} & 92.0 & 132.9 & 115.0& \ca{65.0}  & \ca{54.2} & 46.3 & 52.1 & 68.6 & 57.4 & 48.5 & 52.3  \\
AIMv2-L~\cite{aimv2}                        & 0.3B           & \rp{336}{14} & \ca{53.3} & 61.6 & 48.0 & 32.1 & 71.4 & \ca{73.7} & 62.7 & 64.3 & 87.7 & 80.1 & \ca{115.2} & 90.9 & 135.6 & 119.2 & \ca{63.3} & \ca{52.5} & 44.3 & 50.9 & 67.5 & 54.4 & 44.9 & 53.2  \\
AIMv2 L Dist.~\cite{aimv2}                      & 0.3B           & \rp{336}{14} & \ca{53.7} & 61.1 & 49.4 & 31.5 & 72.7 & \ca{74.1} & 62.8 & 64.8 & 88.3 & 80.3 & \ca{117.8} & 94.7 & 137.5 & 121.2& \ca{62.6}  & \ca{53.8} & 44.3 & 52.4 & 65.0 & 57.4 & 50.0 & 53.6  \\
SigLIP2-so~\cite{siglip2}                 & 0.4B           & \rp{384}{16} & \ca{58.9} & 69.0 & 58.3 & 35.2 & 73.1 & \ca{76.8} & 69.8 & \textbf{67.2} & 88.7 & 81.6 & \ca{116.5} & 92.1 & 137.7 & 119.8& \ca{67.4}  & \ca{54.5} & 45.5 & 53.1 & 67.2 & 57.6 & 49.3 & 54.5  \\
SigLIP2-g-opt~\cite{siglip2}                 & 1.1B                       & \rp{384}{16} & \ca{56.2} & 63.1 & 55.3 & 34.0 & 72.4 & \ca{77.0} & 70.3 & 66.7 & 89.6 & 81.6 & \ca{117.7} & 94.9 & 137.8 & 120.3& \ca{66.5}  & \ca{53.9} & 46.2 & 53.9 & 66.6 & 53.8 & 48.5 & 54.7  \\
\textbf{\PElang{} G}$^\dagger$                 & \,\,\,1.7B$^*$           & \rp{336}{14} & \ca{66.9} & 76.8 & 73.6 & 41.1 & 76.1 & \ca{76.2} & 68.5 & 66.0 & 89.1 & 81.3 & \ca{119.7} & 96.1 & 139.6 & 123.4 & \ca{68.9} & \ca{58.1} & 48.7 & 58.9 & 70.5 & 61.8 & 52.7 & 55.9 \\
\hline
\multicolumn{1}{l}{{\textit{1024 Tokens per Image}}}                 & & & \cat{} &&&&& \ca{} &&&&& \ca{} &&&& \ca{} & \ca{} &&&&&&  \\
InternViT 2.5 L~\cite{chen2024internvit2p5}  & 0.3B           & \rp{448}{14} & \ca{60.6} & 74.1 & 59.2 & 35.9 & 73.1 & \ca{74.2} & 65.4 & 64.4 & 87.6 & 79.6 & \ca{112.3} & 88.4 & 133.7 & 114.9& \ca{66.9}  & \ca{50.6} & 45.2 & 44.8 & 62.7 & 54.2 & 46.0 & 50.5  \\
SigLIP2-so~\cite{siglip2}                 & 0.4B           & \rp{512}{16} & \ca{63.3} & 72.1 & 69.3 & 39.0 & 72.7 & \ca{77.9} & 74.8 & 66.0 & 89.0 & \textbf{81.8} & \ca{117.4} & 93.5 & 138.3 & 120.2& \ca{69.6}  & \ca{55.8} & 46.2 & 55.4 & 67.0 & \textbf{62.0} & 50.0 & 54.5  \\
\textbf{\PEcore{L}}                      & 0.3B & \rp{448}{14}           & \ca{59.4} & 68.7 & 62.5 & 36.6 & 69.7 & \ca{74.7} & 67.7 & 64.3 & 88.3 & 78.7 & \ca{112.7} & 89.6 & 133.4 & 114.9& \ca{59.7}  & \ca{50.9} & 41.7 & 51.2 & 61.6 & 52.6 & 47.4 & 50.6  \\
\textbf{\PElang{L}}                      & 0.3B & \rp{448}{14}           & \ca{71.1} & 81.0 & 81.9 & 46.4 & 75.0 & \ca{77.1} & 73.0 & 65.5 & 89.3 & 80.8 & \ca{117.3} & 94.3 & 137.3 & 120.1& \ca{70.5}  & \ca{56.5} & 47.0 & 57.2 & 68.0 & 59.8 & 52.3 & 54.7  \\
\hline
DINOv2-g~\cite{dinov2}                      & 1.1B             & \rp{448}{14} & \cat{30.0} & 19.6 & 14.7 & 24.2 & 61.5 & \ca{61.0} & 19.3 & 60.4 & 88.6 & 75.8 & \ca{109.4} & 86.5 & 131.6 & 110.1& \ca{64.9}  & \ca{49.5} & 39.7 & 52.1 & 60.1 & 46.8 & 47.4 & 50.8 \\
AIMv2 3B~\cite{aimv2}                     & 2.7B             & \rp{448}{14} & \ca{48.9} & 40.5 & 53.9 & 33.9 & 67.2 & \ca{73.0} & 64.1 & 64.0 & 85.2 & 78.9 & \ca{115.7} & 93.8 & 135.2 & 118.1& \ca{36.1}  & \ca{54.6} & 45.1 & 54.5 & 66.7 & 55.4 & 51.7 & 54.3 \\
InternViT2.5-6B~\cite{chen2024internvit2p5}  & 5.5B             & \rp{448}{14} & \ca{59.9} & 72.3 & 59.4 & 35.2 & 72.5 & \ca{75.5} & 68.9 & 64.9 & 88.2 & 80.2 & \ca{115.0} & 92.2 & 136.3 & 116.3& \ca{68.0}  & \ca{49.6} & 44.5 & 47.0 & 62.6 & 45.8 & 48.9 & 48.5 \\
\textbf{\PEcore{G}}                       & 1.9B           & \rp{448}{14} & \ca{60.8} & 69.9 & 65.4 & 36.7 & 71.1 & \ca{73.3} & 65.9 & 60.7 & 88.4 & 78.0 & \ca{112.5} & 91.6 & 133.6 & 112.4 & \ca{66.6}  & \ca{52.0} & 42.3 & 53.1 & 62.9 & 51.4 & 48.8 & 53.6 \\ 
\textbf{\PElang{G}}                 & \,\,\,1.7B$^*$ & \rp{448}{14} & \ca{\textbf{72.4}} & \textbf{80.5} & \textbf{84.4} & \textbf{48.3} & \textbf{76.4} & \ca{\textbf{78.1}} & \textbf{75.2} & 65.4 & \textbf{90.1} & \textbf{81.8} & \ca{\textbf{120.1}} & \textbf{96.6} & \textbf{140.0} & \textbf{123.6} & \ca{\textbf{71.3}} & \ca{\textbf{58.0}} & \textbf{48.0} & \textbf{60.1} & \textbf{69.4} & \textbf{62.0} & \textbf{52.4} & \textbf{56.0} \\



        \shline
    \end{tabular}
    }
    \caption{{\bf MLLM Results with Llama 3.1 8B.} We compare various vision encoders at their native resolution using Llama 3.1-instruct 8B~\cite{llama3} as the language model. The table compares models of similar class in number of vision tokens and parameters. \PElang{} shows strong performance across all benchmarks, including against models 3$\times$ its size. $^*$\PElang{} has 1.7B parameters since we discard the last 3 layers during language alignment. $^\dagger$Interpolated without extra training. }
    \label{tab:lang_mllm_bench}
\end{table*}
```
Comparisons with Existing Vision Encoders {#sec:la_main_results}
-----------------------------------------

We compare `\PEcore{}`{=latex} and `\PElang{}`{=latex} with other vision encoders that are popular choices in the MLLM literature: MetaCLIP [@metaclip], SigLIP2 [@siglip2], CLIP [@clip], AIMv2 [@aimv2], DINOv2 [@dinov2], and InternViT2.5 [@chen2024internvit2p5]. Overall, these encoders span several different pretraining losses (e.g., contrastive, captioning, self-supervised, and mixed supervision), encoder sizes (from 300M to 6B parameters), and resolutions (from 224 to 512). *For all vision encoders, we find the best intermediate layers to train the MLLM for a fair comparison* (more in Appendix `\ref{appx:mmlm_benchmark_set}`{=latex}).

#### MLLM Benchmarking Setup. {#sec:mllm_bench_setting}

We connect each vision encoder, including `\PElang{}`{=latex}, to a language decoder with a freshly initialized 2-layer MLP projector. As in the alignment stage, we first train only the projector on a subset of 1M image-text pairs from pretraining. Then, we train both the projector and the LLM on 2.6M visual Q&A pairs, image captions, and image grounding samples (see Appendix `\ref{appx:mmlm_benchmark_set}`{=latex} for details). We benchmark at the native resolution of each encoder (with higher-resolution tiling results in Appendix `\ref{appx:mmlm_benchmark_results}`{=latex}). Finally, we ablate over two language decoders, Llama 3.1 8B [@llama3] and QwenLM 2.5 7B [@qwen2.5], to measure generalization across LLMs.
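The projector stage above can be sketched in a few lines. This is a minimal illustration, not the actual training configuration: the hidden sizes, ReLU nonlinearity, and initialization here are placeholders.

```python
import numpy as np

class MLPProjector:
    """2-layer MLP that maps vision-encoder patch tokens into the LLM
    embedding space. Dimensions and nonlinearity are illustrative only."""

    def __init__(self, vision_dim, llm_dim, rng=None):
        rng = rng or np.random.default_rng(0)
        self.w1 = rng.normal(0.0, 0.02, (vision_dim, llm_dim))
        self.b1 = np.zeros(llm_dim)
        self.w2 = rng.normal(0.0, 0.02, (llm_dim, llm_dim))
        self.b2 = np.zeros(llm_dim)

    def __call__(self, tokens):
        # tokens: (num_tokens, vision_dim) patch embeddings from the encoder
        h = np.maximum(tokens @ self.w1 + self.b1, 0.0)  # nonlinearity
        return h @ self.w2 + self.b2                     # (num_tokens, llm_dim)

proj = MLPProjector(vision_dim=1024, llm_dim=4096)
vision_tokens = np.zeros((1024, 1024))   # e.g. a 32x32 grid of patch tokens
llm_tokens = proj(vision_tokens)         # one LLM-space embedding per token
```

The projected tokens are then prepended or interleaved with the text tokens fed to the language decoder; during the first stage only these projector weights receive gradients.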

#### Results.

Tab. `\ref{tab:lang_mllm_bench}`{=latex} shows benchmark results for native resolution input across existing encoders, `\PEcore{}`{=latex}, and `\PElang{}`{=latex}. Notably, AIMv2 [@aimv2], InternViT2.5 [@chen2024internvit2p5], SigLIP2 [@siglip2], and `\PElang{}`{=latex} are trained jointly with a language decoder using a next-token prediction objective, and thus they perform better overall than the purely contrastive and self-supervised models across all metrics. However, `\PElang{}`{=latex} uses only a fraction of the training FLOPs for language alignment tuning, while still outperforming all other vision encoders by a large margin (an average of +3.5 points for G and +2.0 points for L). Similarly, when tiling with 4 tiles and 1 thumbnail (see Appendix Tab. `\ref{tab:lang_mllm_bench_tiling}`{=latex}), both `\PElang{L}`{=latex} and `\PElang{G}`{=latex} outperform all existing vision encoders, including InternViT2.5 [@chen2024internvit2p5], which was specifically pretrained in a tiling setting and with grounding data. Appendix `\ref{appx:mmlm_benchmark_results}`{=latex} shows a breakdown of the RefCOCO results, as well as results for tiling at higher resolution.

#### Transferability.

As `\PElang{}`{=latex} is aligned with Llama 3.2-instruct 3B, we conduct a separate set of experiments to check whether our model performs well with a different base LLM. In Tab. `\ref{tab:lang_mllm_bench_qwen}`{=latex} we repeat the native resolution comparison with QwenLM 2.5 7B [@qwen2.5]. Interestingly, `\PElang{}`{=latex} not only outperforms all vision encoders in this setting, it also outperforms InternViT2.5 [@chen2024internvit2p5], which is specifically aligned to QwenLM 2 [@qwen2] throughout midtraining. In fact, `\PElang{G}`{=latex} with QwenLM even improves over its performance with Llama in some cases, such as OCR Q&A and the video benchmarks, emphasizing the generality of our language alignment.

```{=latex}
\begin{table*}[ht]\centering
    \vspace{-3pt}
    \makebox[\linewidth][c]{
    \tablestyle{0pt}{1.05} 
        \begin{tabular}{y{50}wx{20} awwww awwww awww a awwwwww}
        \shline
        \multirow{2}{*}{\vspace{-2.2cm} Model}  &&& \multicolumn{5}{c}{\ct[c3]{\it OCR / Chart / Doc. Q\&A}} %
        & \multicolumn{5}{c}{\ct[c4]{\it Visual Q\&A}} & \multicolumn{4}{c}{\ct[c5]{\it Captioning}} & \multicolumn{1}{c}{\ct[c6]{}} & \multicolumn{7}{c}{\ct[c7]{\it Video}}\\
            & \cb{Encoder Params}{}
            & \cb{Resolution}{Patch Size}
            & \cb[c3]{\textit{\textbf{Avg. OCR QA}}}{}
            & \cb[c3]{ChartQA}{Acc.~\cite{zheng2024chartqa}}
            & \cb[c3]{DocVQA}{Acc.~\cite{mathew2021docvqa}}
            & \cb[c3]{Info. QA}{Acc.~\cite{mathew2022infographicvqa}}
            & \cb[c3]{AI2D}{Acc.~\cite{kembhavi2016ai2d}}
            %
            & \cb[c4]{\textit{\textbf{Avg. VQA}}}{}
            & \cb[c4]{TextVQA}{Acc.~\cite{singh2019textvqa}}
            & \cb[c4]{OK-VQA}{Acc.~\cite{schwenk2022okvqa}}
            & \cb[c4]{POPE}{Acc. ~\cite{li2023popebenchmark}}
            & \cb[c4]{VQAv2}{Acc.~\cite{goyal2017vqav2}}
            %
            & \cb[c5]{\textit{\textbf{Avg. Cap.}}}{}
            & \cb[c5]{Flickr}{CIDEr~\cite{flickr}}
            & \cb[c5]{COCO}{CIDEr~\cite{coco}}
            & \cb[c5]{NoCaps}{CIDEr~\cite{agrawal2019nocaps}}
            % <--
            % <--
            & \cb[c6]{\textit{\textbf{Avg. Ground.}}}{RefCOCO/g/+~\cite{kazemzadeh2014referitgame}} 
            & \cb[c7]{\textit{\textbf{Avg. Video}}}{}
            & \cb[c7]{VideoMME}{Acc.~\cite{fu2024videomme}}
            & \cb[c7]{STAR}{Acc.~\cite{wu2021star}}
            & \cb[c7]{TGIF-QA}{Acc.~\cite{jang2017tgif}}
            & \cb[c7]{EgoSchema}{Acc.~\cite{mangalam2024egoschema}}
            & \cb[c7]{MVBench}{Acc.~\cite{li2024mvbench}}
            & \cb[c7]{PerceptionTest}{Acc.~\cite{patraucean2024perceptiontest}} \\
        \hline
\multicolumn{1}{l}{{\textit{576 Tokens per Image}}}                & & & \cat{} &&&&& \ca{} &&&&& \ca{} &&&& \ca{} & \ca{} &&&&&& \\
SigLIP2-so~\cite{siglip2}                 & 0.4B           & \rp{384}{16} & \ca{60.5} & 72.0 & 59.1 & 36.7 & 74.3 & \ca{76.2} & 69.0 & 65.4 & 89.2 & 81.1 & \ca{116.3} & 91.6 & 137.3 & 120.0 & \ca{70.0} & \ca{57.0} & 51.3 & 55.8 & 66.0 & 61.0 & 51.9 & 55.7 \\
SigLIP2-g-opt~\cite{siglip2}                 & 1.1B             & \rp{384}{16} & \ca{60.8} & 71.0 & 60.4 & 36.7 & 75.2 & \ca{76.8} & 70.3 & 65.6 & 89.5 & 81.8 & \ca{118.8} & 96.4 & 139.0 & 121.1 & \ca{69.9} & \ca{58.3} & 52.0 & 57.6 & 68.1 & 62.0 & 52.8 & 57.4 \\
\textbf{\PElang{} G}$^\dagger$                 & \,\,\,1.7B$^*$ & \rp{336}{14} & \ca{66.8} & 77.5 & 72.4 & 41.1 & 76.4 & \ca{76.0} & 67.9 & 65.4 & 89.1 & 81.5 & \ca{118.8} & 94.6 & 139.5 & 122.3 & \ca{70.1} & \ca{60.2} & 54.6 & 61.7 & 69.8 & 63.6 & 54.3 & 57.2 \\
\hline
\multicolumn{1}{l}{{\textit{1024 Tokens per Image}}}                & & & \cat{} &&&&& \ca{} &&&&& \ca{} &&&& \ca{} & \ca{} &&&&&& \\
InternViT2.5~\cite{chen2024internvit2p5}  & 0.3B           & \rp{448}{14} & \ca{60.3} & 75.4 & 61.1 & 36.2 & 68.4 & \ca{74.2}          & 65.6 & 63.7 & 87.8 & 79.5 & \ca{112.1} & 88.5 & 133.5 & 114.1 & \ca{68.1} & \ca{55.8} & 50.3 & 54.7 & 66.6 & 59.0 & 50.6 & 53.8 \\
SigLIP2-so~\cite{siglip2}                 & 0.4B         & \rp{512}{16} & \ca{66.3} & 77.2 & 71.9 & 42.4 & 73.9 & \ca{\textbf{77.9}} & 74.2 & 65.6 & 89.9 & 81.8 & \ca{117.1} & 93.0 & 138.0 & 120.3 & \ca{70.5} & \ca{55.9} & 50.3 & 57.3 & 67.2 & 62.6 & 50.3 & 47.4 \\
\textbf{\PEcore{L}} & 0.3B & \rp{448}{14} & \ca{63.5} & 73.9 & 67.4 & 40.5 & 72.2 & \ca{75.7} & 69.2 & 64.0 & 89.4 & 80.2 & \ca{113.3} & 88.7 & 135.2 & 115.9 & \ca{66.5} & \ca{57.3} & 49.6 & 57.8 & 67.7 & 60.8 & 52.3 & 55.5 \\
\textbf{\PElang{L}} & 0.3B & \rp{448}{14} & \ca{70.2} & 80.6 & 80.7 & 46.0 & 73.5 & \ca{76.8} & 72.8 & 64.1 & 89.4 & 81.0 & \ca{116.4} & 93.4 & 137.6 & 118.1 & \ca{70.4} & \ca{58.3} & 51.6 & 59.8 & 67.4 & 62.2 & 53.4 & 55.4 \\
\hline
\addpadding
DINOv2~\cite{dinov2}                      & 1.1B             & \rp{448}{14} & \cat{31.3} & 21.7 & 14.7 & 24.6 & 64.3 & \ca{61.0} & 18.9 & 59.5 & 88.9 & 76.9 & \ca{110.1} & 87.3 & 132.1 & 110.8 & \ca{69.3} & \ca{54.3} & 46.9 & 56.5 & 63.4 & 56.8 & 49.7 & 52.2 \\
AIMv2 3B~\cite{aimv2}                     & 2.7B             & \rp{448}{14} & \ca{66.0} & 76.7 & 70.5 & 41.4 & 75.2 & \ca{\textbf{77.9}} & 74.2 & \textbf{66.2} & 89.4 & \textbf{81.9} & \ca{\textbf{119.2}} & \textbf{96.4} & 139.2 & 122.0 & \ca{67.6} & \ca{56.3} & 45.9 & 58.0 & 67.8 & 60.8 & 51.4 & 53.9 \\
InternViT2.5~\cite{chen2024internvit2p5}  & 5.5B             & \rp{448}{14} & \ca{64.2} & 78.2 & 65.3 & 39.6 & 73.6 & \ca{76.4} & 70.1 & 64.5 & 89.3 & 81.7 & \ca{117.6} & 95.9 & 138.4 & 118.6 & \ca{\textbf{72.8}} & \ca{56.1} & 50.3 & 59.1 & 67.3 & 56.6 & 51.1 & 52.2 \\
\textbf{\PEcore{G}}                       & 1.9B           & \rp{448}{14} & \ca{64.8} & 75.9 & 68.8 & 41.6 & 72.9 & \ca{75.2} & 67.9 & 62.4 & 89.7 & 80.7 & \ca{113.1} & 91.7 & 135.2 & 112.3 & \ca{70.5} & \ca{57.0} & 48.7 & 58.3 & 66.9 & 60.8 & 52.9 & 54.5 \\
\textbf{\PElang{G}}                    & \,\,\,1.7B$^*$ & \rp{448}{14} & \ca{\textbf{72.9}} & \textbf{81.6} & \textbf{83.7} & \textbf{49.5} & \textbf{76.7} & \ca{\textbf{77.9}} & \textbf{74.9} & 64.5 & \textbf{90.3} & \textbf{81.9} & \ca{118.9} & 94.6 & \textbf{139.8} & \textbf{122.3} & \ca{72.1} & \ca{\textbf{60.4}} & \textbf{54.1} & \textbf{62.5} & \textbf{68.3} & \textbf{66.6} & \textbf{54.2} & \textbf{56.8} \\
        \shline
    \end{tabular}
   }
    \caption{{\bf MLLM Results with QwenLM 2.5 7B.} Same setting as Tab.~\ref{tab:lang_mllm_bench}, but with QwenLM2.5 7B~\cite{qwen2.5} as the language model. Although \PElang{} is aligned to Llama3.2 3B, the language alignment transfers well to a different language model.   }
    \label{tab:lang_mllm_bench_qwen}
    \vspace{-14pt}
\end{table*}
```
```{=latex}
\vspace{10pt}
```
#### System-Level MLLM Comparison.

In Tab. `\ref{tab:lang_mllm_system_level}`{=latex}, we conduct a system-level comparison against state-of-the-art open-access MLLMs: LLaVA-OneVision 7B [@llava-onevision], Gemma3 12B [@gemma3], Molmo-D 7B [@molmo], Qwen2 VL 7B [@qwen2vl], InternVL 2.5 8B [@chen2024internvit2p5], and the very recent InternVL 3 8B [@internvl3]. Each baseline uses a contrastively pretrained ViT (SigLIP-so400M [@siglip], CLIP-L [@clip], DFN-H [@dfn], or InternViT 2.5 300M [@chen2024internvit2p5]). For our PLM-8B, we use `\PElang{G}`{=latex} as the vision encoder with 36 tiles for images and 32 frames for video, and Llama 3.1-instruct 8B as the language decoder (more details in [@PLM]). We show numbers from the respective works or evaluate the models ourselves where results are not reported (except for Gemma and InternVL 3). PLM-8B outperforms all other models tested, emphasizing that `\PElang{G}`{=latex} can drive strong results across a wide range of tasks.

```{=latex}
\begin{table*}[ht]\centering
    \vspace{-2pt}
    \makebox[\linewidth][c]{
    \tablestyle{0pt}{1.05} 
        \begin{tabular}{y{52} y{52} awwww awwww awww awwwwww}
        \shline
        \multirow{2}{*}{\vspace{-2.3cm} Model}  & & \multicolumn{5}{c}{\ct[c3]{\it OCR / Chart / Doc. Q\&A}} %
        & \multicolumn{5}{c}{\ct[c4]{\it Visual Q\&A}} & \multicolumn{4}{c}{\ct[c5]{\it Captioning}} & \multicolumn{7}{c}{\ct[c6]{\it Video}}\\
            & Encoder 
            & \cb[c3]{\textit{\textbf{Avg. OCR QA}}}{}
            & \cb[c3]{ChartQA}{Acc.~\cite{zheng2024chartqa}}
            & \cb[c3]{DocVQA}{Acc. (test)~\cite{mathew2021docvqa}}
            & \cb[c3]{Info. QA}{ Acc. (test)~\cite{mathew2022infographicvqa}}
            & \cb[c3]{AI2D}{w/o mask~\cite{kembhavi2016ai2d}}
            %
            & \cb[c4]{\textit{\textbf{Avg. VQA}}}{}
            & \cb[c4]{TextVQA}{Acc.~\cite{singh2019textvqa}}
            & \cb[c4]{OK-VQA}{Acc.~\cite{schwenk2022okvqa}}
            & \cb[c4]{POPE}{Acc. ~\cite{li2023popebenchmark}}
            & \cb[c4]{VQAv2}{Acc. (val)~\cite{goyal2017vqav2}}
            %
            & \cb[c5]{\textit{\textbf{Avg. Cap.}}}{}
            & \cb[c5]{Flickr}{CIDEr~\cite{flickr}}
            & \cb[c5]{COCO}{CIDEr~\cite{coco}}
            & \cb[c5]{NoCaps}{CIDEr~\cite{agrawal2019nocaps}}
            % <--
            % <--
            & \cb[c6]{\textit{\textbf{Avg. Video}}}{}
            & \cb[c6]{VideoMME}{Acc.~\cite{fu2024videomme}}
            & \cb[c6]{STAR}{Acc.~\cite{wu2021star}}
            & \cb[c6]{TGIF-QA}{Acc.~\cite{jang2017tgif}}
            & \cb[c6]{EgoSchema}{(test) Acc.~\cite{mangalam2024egoschema}}
            & \cb[c6]{MVBench}{Acc.~\cite{li2024mvbench}}
            & \cb[c6]{PerceptionTest}{Acc. (test)~\cite{patraucean2024perceptiontest}}            \\
        \hline
        \addpadding
LLaVA-OV 7B~\cite{llava-onevision} & SigLIP-so400M & \ca{81.4} & 80.0 & 86.7 & 68.8 & 90.1 & \ca{79.9} & 77.3  & \textbf{69.6} & 89.2 & 83.5 & \ca{79.5} & 55.7 & 70.7 & 112.1 & \ca{63.8} & 57.7 & 66.0 & 77.2 & 65.2 & 57.1 & 58.1 \\
Gemma3 12B~\cite{gemma3} & SigLIP-so400M & \ca{-} & 75.7 & 87.1 & 64.9 & - & \ca{-} & 67.7 & - &  - & 71.6 & \ca{ -} & - & - & - & \ca{-} & - & - & - & - & - & 54.9   \\
Qwen2 VL 7B~\cite{qwen2vl} & DFN-H & \ca{86.6} & 83.6 & {94.5} & 76.5 & {91.7} & \ca{80.9} & 83.6 & 67.9 & 88.3 & 83.8 & \ca{93.7} & 79.9 & 102.5 & 98.7 & \ca{67.7} & {62.9} & 67.3 & 81.8 & 65.4 & 61.6 & 66.9 \\
InternVL 2.5 8B~\cite{chen2024internvit2p5} & InternViT 2.5-300M & \ca{87.0} & 84.6 & 93.0 & 77.6 & \textbf{92.8} & \ca{79.9} & 79.3 & {69.2} & {90.6} & 80.6 & \ca{113.0} & 96.5 & 125.8 & 116.7 & \ca{72.9} & 60.6 & 77.6 & 91.3 & 66.2 & 72.6 & 68.9   \\
InternVL 3 8B~\cite{internvl3} & InternViT 2.5-300M & \ca{87.2} & \textbf{86.6} & 92.7 & 76.8 & 92.6 & \ca{-} & 80.2 & - & \textbf{91.1} & - & \ca{-} & - & - & - & \ca{-} & \textbf{66.3} & - & - & - & 75.4 & -   \\
\textbf{PLM-8B} & \textbf{\PElang{G}} & \ca{\textbf{88.4}} & {85.5} & \bb{94.6} & \textbf{{80.9}} & {92.7} & \ca{\textbf{82.9}} & \textbf{86.5} & \textbf{69.6} & 89.9 & \textbf{85.6} & \ca{\textbf{127.4}} & \textbf{105.6} & \textbf{146.7} & \textbf{129.9} & \ca{\textbf{77.9}} & {58.3} & \textbf{84.9} & \textbf{95.5} & \textbf{{68.8}} & \textbf{77.1} & \textbf{82.7} \\
\shline
    \end{tabular}
   }
    \caption{{\bf MLLM System-Level Comparison.} 
    We show a system-level comparison between PLM-8B based on \PElang{G} and popular open-access models of similar LLM scale using existing encoders. 
    We report test set results where specified.
    }
    \vspace{-10pt}
    \label{tab:lang_mllm_system_level}
\end{table*}
```
```{=latex}
\clearpage 
```
Perception Encoder: *Spatial Alignment* {#sec:sa}
=======================================

While language alignment with a pretrained LLM decoder is well-established, the best way to spatially align a model is not obvious. As shown in §`\ref{sec:layerfinder}`{=latex}, `\PEcore{}`{=latex} already has features that perform well for spatial tasks. However, the layer that performs best for higher-level spatial tasks like detection or depth estimation (layer $\sim$40) differs substantially from the layer that performs best for a purely spatial task like tracking (layer $\sim$30). While we were able to ignore this disparity during language alignment by aligning to an LLM decoder that could do all tasks, classical spatial tasks have decoders that come in all shapes and sizes, and it would be impractical to mirror language alignment by aligning the model to every downstream decoder. Thus, we must first answer the question: what is happening in the features at those layers that makes them useful for spatial tasks?

Core Feature Analysis {#sec:core_feature_analysis}
---------------------

```{=latex}
\begin{wrapfigure}{r}{0.67\textwidth}
    \vspace{-35pt}
    \centering
  \begin{center}
      
    \includegraphics[width=1\linewidth, trim=10.45in 0in 0in 15.1in, clip]{fig/core_feature_example.pdf}
    
    \end{center}
    \caption{\textbf{\PEcore{}G Feature Analysis.} To understand the dichotomy between optimal \PEcore{} features for spatial tasks observed in Fig.~\ref{fig:layerfinder}, we analyze the spatial properties of the features between layers 30 and 34.}
    \label{fig:core_feature_example}
    \vspace{-15pt}
\end{wrapfigure}
```
We begin by analyzing the spatial properties of the features of `\PEcore{}`{=latex}G in the range of layers where it performed optimally for zero-shot tracking in §`\ref{sec:layerfinder}`{=latex}. In Fig. `\ref{fig:core_feature_example}`{=latex}, we plot (1) the pairwise feature cosine similarity between the pink token and all others, (2) the head-averaged attention map for that token, and (3) the full attention matrix ($HW\times HW$).
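The first and third of these visualizations can be reproduced from a layer's patch tokens with a few lines of generic code; this is a sketch of the analysis, not the paper's actual plotting pipeline.

```python
import numpy as np

def cosine_similarity_map(features, query_idx):
    """Cosine similarity between one query token and all HW tokens.

    features: (HW, C) patch tokens from a single layer.
    Reshape the (HW,) result to (H, W) to display as a heatmap."""
    f = features / np.linalg.norm(features, axis=-1, keepdims=True)
    return f @ f[query_idx]

def full_similarity_matrix(features):
    """The full HW x HW token-to-token cosine-similarity matrix.

    Vertical stripes in this matrix indicate tokens that every other
    token is similar (or attends) to, i.e. candidate global tokens."""
    f = features / np.linalg.norm(features, axis=-1, keepdims=True)
    return f @ f.T
```

Running `full_similarity_matrix` on consecutive layers is how one would spot the abrupt appearance of global tokens described next.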

#### An 18 Layer Decoder.

Remarkably, the cause of the tracking performance peak at layer 32 is immediately clear from the visualizations. Up until layer 32, the attention maps remain local. That changes abruptly at layer 33, at which point several tokens in the background of the image become \`\`global" tokens. As shown by the vertical lines in the full attention matrix, starting from layer 33 every token attends to them. Thus, layers 33 and up effectively become part of a *decoder* for global information.

This is not a new phenomenon: recent work [@vitsneedregisters] shows it happening in all modern vision transformers above L scale. Notably, however, these \`\`global tokens" are not necessarily harmful. Given that the optimal layer for most tasks in Fig. `\ref{fig:layerfinder}`{=latex} lies within the global token region, the information they aggregate is useful downstream. However, tracking in §`\ref{sec:layerfinder}`{=latex} is zero-shot and relies purely on spatial correspondences, meaning it cannot make use of the global tokens. This explains why tracking peaks right before their introduction, while tasks that rely on semantic understanding or have larger decoders that can exploit them do well with the later layers.

Spatial Alignment Method {#sec:sa_method}
------------------------

Given the analysis in §`\ref{sec:core_feature_analysis}`{=latex}, we have two objectives in creating a spatial alignment method: (1) we must preserve the optimal *semantic information* of the model (including the global tokens) that peaks around layer 40, and (2) we must do so while emphasizing *local alignment* in service of spatial tasks with shallow decoders. The first can be easily achieved by aligning with the model's own features (e.g., with MaskFeat [@maskfeat]), but the second is more challenging. To accomplish this, we employ the Segment Anything Model (SAM) 2.1 [@sam2] in a novel way to enforce spatial correspondence information in PE.

#### Retaining Semantics.

To retain the strong semantic features from `\PEcore{}`{=latex}, we finetune the model with itself as a teacher. Specifically, we train the model to minimize the cosine distance between its *last layer* features and the frozen layer 41 features of its initialization (a layer around the peak for many tasks in Fig. `\ref{fig:layerfinder}`{=latex}). On its own this would be a tautology, so we apply heavy regularization to the student: DropPath [@droppath] and LayerScale [@layerscale] as in language alignment, as well as MaskFeat [@maskfeat] with 75% masking. We keep the teacher fixed, in contrast to other state-of-the-art spatial models, which all employ an EMA teacher [@dinov2; @siglip2]. An EMA teacher could potentially help, but we opt for simplicity.
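The self-distillation objective can be sketched as follows: a per-token cosine-distance loss against the frozen teacher, plus a MaskFeat-style random token mask hiding 75% of the student's input. The exact masking scheme and loss weighting in the paper may differ; this is a minimal illustration.

```python
import numpy as np

def self_distill_loss(student_feats, teacher_feats):
    """Mean cosine distance between student last-layer tokens and the frozen
    teacher's layer-41 tokens. Both are (HW, C); the teacher receives no
    gradient and stays fixed throughout alignment (no EMA update)."""
    s = student_feats / np.linalg.norm(student_feats, axis=-1, keepdims=True)
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def random_token_mask(num_tokens, mask_ratio=0.75, rng=None):
    """MaskFeat-style mask: True marks tokens hidden from the student,
    which must then reconstruct the teacher's features for them."""
    rng = rng or np.random.default_rng(0)
    mask = np.zeros(num_tokens, dtype=bool)
    idx = rng.choice(num_tokens, int(num_tokens * mask_ratio), replace=False)
    mask[idx] = True
    return mask
```

Because the masked student can no longer copy its input, the loss is no longer a tautology even though teacher and student start from the same weights.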

```{=latex}
\begin{wrapfigure}{r}{0.48\textwidth}
    \vspace{-26pt}
    \centering
  \begin{center}
      
    \includegraphics[width=1\linewidth, trim=13.44in 0in 0in 16.83in, clip]{fig/sam_feature_example.pdf}
    
    \end{center}
    \caption{\textbf{SAM 2.1 Feature Similarity.} The cosine similarity between the pink marked token and all others for SAM 2.1-L~\cite{sam2} features \vs our proposed mask logit features.}
    \label{fig:sam_feature_example}
    \vspace{-10pt}
\end{wrapfigure}
```
#### Encouraging Locality.

While we could \`\`retain" locality by self-distilling from layer 32 features, that may be less effective, as we are already distilling from another layer of the model. Instead, we turn to a model that is explicitly tuned for locality: SAM [@sam; @sam2]. Notably, several works [@ranzinger2023radio; @shang2024theia; @sariyildiz2024unic] have shown SAM to *not* be an effective teacher when distilling from multiple sources (though recently [@heinrich2024radio2.5] has shown it can help with some tricks). However, upon inspecting the raw features of SAM 2.1-L (Fig. `\ref{fig:sam_feature_example}`{=latex}), the main problem may be the same one we are currently trying to solve: *SAM has global tokens as well*! In this case, they appear as dark spots in a grid-like arrangement across all of the raw-feature examples in Fig. `\ref{fig:sam_feature_example}`{=latex}.

Using the features of a model that itself has global tokens to mitigate the effect of global tokens is dubious at best. But we don't have to use SAM's *features* to learn locality. At its core, SAM is a model that transforms points into spatially contiguous masks of a selected object. If what we want is smooth, locally consistent features, we can use the *mask predictions* themselves. Specifically, we query SAM 2.1-L with 1024 points arranged in a $32 \times 32$ grid. For each point, SAM returns an $H \times W$ mask logit map the size of the image, which it would normally threshold and apply NMS to. Instead, we concatenate those logits into an $H \times W \times 1024$ tensor and use *that* as the feature map for alignment. Compared to the underlying feature space, this explicitly produces locally well-aligned features with no spatial artifacts caused by global tokens, as shown in Fig. `\ref{fig:sam_feature_example}`{=latex}.
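Building this teacher feature map might look like the sketch below. Note that `sam_predict(image, point) -> (h, w) logit map` stands in for a call to the SAM 2.1 predictor and is a hypothetical wrapper, not the real API.

```python
import numpy as np

def grid_points(h, w, n=32):
    """(x, y) centers of an n x n grid of query points over an h x w image."""
    ys = (np.arange(n) + 0.5) * h / n
    xs = (np.arange(n) + 0.5) * w / n
    gx, gy = np.meshgrid(xs, ys)
    return np.stack([gx.ravel(), gy.ravel()], axis=-1)  # (n*n, 2)

def mask_logit_features(sam_predict, image, h, w, n=32):
    """Stack the per-point mask logits into an (h, w, n*n) teacher feature
    map: each spatial location gets an n*n-dim descriptor of which query
    points produce masks covering it.

    sam_predict: hypothetical wrapper, (image, point) -> (h, w) mask logits.
    """
    logits = [sam_predict(image, p) for p in grid_points(h, w, n)]
    return np.stack(logits, axis=-1)  # (h, w, 1024) for n = 32
```

Two pixels on the same object are covered by the same set of predicted masks, so their 1024-dim logit descriptors agree, which is exactly the local consistency the alignment needs.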

Then, to align, we distill the spatial correspondences between tokens by computing their pairwise cosine similarity for both the student and the teacher (creating an $HW \times HW$ matrix for each) and aligning the two matrices with an MSE loss. Unlike SAM's underlying feature space (which [@heinrich2024radio2.5] shows may be brittle to interpolation), the mask logit features are robust to interpolation, so we simply interpolate them down and train at the `\PEcore{}`{=latex} model's original 448px resolution. Finally, as with self-distillation, we add the same masking and regularization. For both teachers, we apply the loss to all tokens and add no extra parameters other than LayerScale.
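The correspondence-distillation loss just described can be sketched directly; details such as loss weighting are omitted.

```python
import numpy as np

def correspondence_loss(student_feats, teacher_feats):
    """MSE between the student's and teacher's HW x HW pairwise token
    cosine-similarity matrices.

    student_feats: (HW, C) last-layer student tokens.
    teacher_feats: (HW, 1024) per-token SAM mask-logit descriptors,
        interpolated down to the student's token grid.
    """
    def pairwise_sim(f):
        f = f / np.linalg.norm(f, axis=-1, keepdims=True)
        return f @ f.T  # (HW, HW)

    diff = pairwise_sim(student_feats) - pairwise_sim(teacher_feats)
    return float(np.mean(diff ** 2))
```

Because only the similarity *structure* is matched, the student and teacher feature spaces can have different dimensionality, which is what lets the 1024-dim mask logits supervise the encoder's own feature width.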

```{=latex}
\begin{wrapfigure}{r}{0.48\textwidth}
    \vspace{-18pt}
    \centering
    \includegraphics[width=1\linewidth, trim = 13.8in 0in 0in 14.05in, clip]{fig/layerfinder_spatial.pdf}
    \caption{\textbf{Spatial Alignment.}
    We analyze how our two spatial alignment methods individually change the internal features of \PEcore{G}. Then we combine both alignment methods to create \PEspat{G} (see Appendix~\ref{appx:spatial_align_details}).
    }
    \label{fig:spatial_alignment}
    \vspace{-20pt}
\end{wrapfigure}
```
#### Effects.

Again, the goal of alignment is to *lift* the strong features already learned by the core model, as shown in §`\ref{sec:layerfinder}`{=latex}. Thus, as we did for language alignment in §`\ref{sec:la_method}`{=latex}, we perform layerwise frozen feature analysis on spatial tasks in Fig. `\ref{fig:spatial_alignment}`{=latex}. This time, we evaluate the original `\PEcore{G}`{=latex} checkpoint as well as `\PEcore{G}`{=latex} aligned to its own layer 41, to SAM 2.1 mask logits, and finally to both. We denote the model aligned to both as `\PEspat{G}`{=latex}.

Aligning purely to the original model's layer 41 features performs well on detection, depth, and semantic segmentation, but falls short for zero-shot tracking, where precise locality is necessary to define boundaries between objects. In contrast, aligning to SAM 2.1 mask logits lowers last-layer performance on every task except tracking, where it significantly improves performance. Understandably, this is because the mask logits carry little semantics (see Fig. `\ref{fig:feature_viz}`{=latex}). Thus, the optimal approach is to combine both teachers. As a result, `\PEspat{G}`{=latex} not only lifts the features for all tasks to the end of the network, but also improves over self-alignment alone. Notably, `\PEspat{G}`{=latex}'s tracking performance is lower than that of the SAM-only aligned model, but it is still ahead of other methods while remaining a generally strong model; see §`\ref{sec:sa_results}`{=latex}.

```{=latex}
\begin{wrapfigure}{r}{0.5\textwidth}
    \vspace{-10pt}
    \centering
    \includegraphics[width=1\linewidth, trim=12.4in 0in 0in 15.03in, clip]{fig/feature_viz.pdf}
    \caption{\textbf{Last Layer Visualization}
    for the models in Fig.~\ref{fig:spatial_alignment} using 3 dimensional PCA to map features to LCh color space (see Appendix~\ref{appx:feature_viz}). More examples in Appendix~\ref{appx:more_feature_viz}.
    }
    \label{fig:feature_viz}
    \vspace{-20pt}
\end{wrapfigure}
```
#### Last Layer Feature Visualization.

In Fig. `\ref{fig:feature_viz}`{=latex}, we visualize the last layer features for `\PEcore{G}`{=latex} and the three aligned models, with similar colors denoting similar features. In the first column, we see why the last layer performance of `\PEcore{}`{=latex} is so poor: while the last layer features contain information about the salient objects, they seem to have lost spatial coherence. Aligning to the model's own layer 41 features fixes this, but the spatial quality is still lacking. In contrast, the model aligned to SAM 2.1 mask logits has locally crisp features, but without semantics (similar objects have dissimilar features; see the cats in row 1 and the cows in row 2). `\PEspat{}`{=latex}, using both teachers at once, retains the semantics of `\PEcore{}`{=latex} while producing high-quality spatial features.

```{=latex}
\vspace{10pt}
```
```{=latex}
\centering
```
```{=latex}
\vspace{0pt}
```
```{=latex}
\tablestyle{0pt}{1.1}
```
```{=latex}
\newcommand{\loo}[2]{\ca{\tiny \textcolor{gray}{#1/#2}}}
```
```{=latex}
\begin{tabular}{y{47}x{25}x{36} x{16}x{16}x{16} x{16}x{16}x{16} x{16}x{16}x{16}}
        \shline
        &&& 
        \multicolumn{3}{c}{\ct[c3]{Tracking}} & 
        \multicolumn{3}{c}{\ct[c4]{Segmentation}} & 
        \multicolumn{3}{c}{\ct[c5]{Depth}} \\
        &&&
        \multicolumn{3}{c}{\ct[c3]{DAVIS ($\uparrow$)~\cite{davis2017}}} &
        \multicolumn{3}{c}{\ct[c4]{ADE20k ($\uparrow$)~\cite{ade20k}}} & 
        \multicolumn{3}{c}{\ct[c5]{NYU ($\downarrow$)~\cite{nyu_depth}}} \\
        Encoder & Params & Resolution &
        \cc[c3]{Best} & \cc[c3]{Last} & \cc[c3]{Idx} &
        \cc[c4]{Best} & \cc[c4]{Last} & \cc[c4]{Idx} &
        \cc[c5]{Best} & \cc[c5]{Last} & \cc[c5]{Idx} \\
        \hline
        \addpadding
        OAI CLIP-L~\cite{clip}       & 0.3B              & \rp{224}{14} &     39.4  &     37.1  & \loo{17}{24} &     39.4  &     38.3  & \loo{19}{24} &     .366  &     .397  & \loo{19}{24} \\
        AIMv2-3B~\cite{aimv2}        & 2.7B              & \rp{448}{14} &     54.7  &     29.3  & \loo{13}{24} &     41.6  &     31.9  & \loo{20}{24} &     .311  &     .326  & \loo{16}{24} \\
        SigLIP-so~\cite{siglip}      & 0.4B              & \rp{384}{14} &     48.7  &     36.3  & \loo{16}{27} &     40.1  &     38.3  & \loo{22}{27} &     .339  &     .369  & \loo{21}{27} \\
        SigLIP2-so~\cite{siglip2}    & 0.4B              & \rp{512}{16} &     51.4  &     45.3  & \loo{15}{27} &     44.0  &     42.9  & \loo{24}{27} &     .306  &     .329  & \loo{25}{27} \\
        SigLIP2-g-opt~\cite{siglip2} & 1.1B              & \rp{384}{16} &     43.5  &     38.8  & \loo{32}{40} &     42.1  &     41.3  & \loo{34}{40} &     .302  &     .324  & \loo{34}{40} \\
        DINOv2-L~\cite{dinov2}       & 0.3B              & \rp{448}{14} & \uu{58.7} &     58.2  & \loo{23}{24} &     47.3  &     47.3  & \loo{24}{24} &     .297  &     .308  & \loo{23}{24} \\
        DINOv2-g~\cite{dinov2}       & 1.1B              & \rp{448}{14} &     58.5  & \uu{58.5} & \loo{40}{40} & \uu{48.7} & \uu{48.4} & \loo{37}{40} &     .279  & \uu{.290} & \loo{27}{40} \\
        \textbf{\PEcore{G}}          & 1.9B              & \rp{448}{14} &     56.8  &     42.8  & \loo{32}{50} &     41.5  &     38.6  & \loo{44}{50} & \bb{.249} &     .309  & \loo{39}{50} \\
        \textbf{\PEspat{G}}          & 1.9B              & \rp{448}{14} & \bb{61.5} & \bb{61.5} & \loo{50}{50} & \bb{49.3} & \bb{48.9} & \loo{49}{50} & \uu{.262} & \bb{.275} & \loo{46}{50} \\
        \shline
    \end{tabular}
```
`\hfill`{=latex}

```{=latex}
\centering
```
```{=latex}
\vspace{0pt}
```
```{=latex}
\tablestyle{0pt}{1.1}
```
```{=latex}
\begin{tabular}{y{47}x{25}x{30}x{24}x{24}x{24}x{24}}
        \shline
        &&\multirow{2}{*}{\parbox{1.0cm}{\centering Pretrain Resolution}}& \multicolumn{2}{c}{\ct[c6]{ LVIS~\cite{lvis}}} & \multicolumn{2}{c}{\ct[c7]{COCO~\cite{coco}}} \\
        Encoder & Params &  
        & \ct[c6]{$\text{AP}_\text{box}$}{}
        & \ct[c6]{$\text{AP}_\text{mask}$}{}
        & \ct[c7]{$\text{AP}_\text{box}$}{}
        & \ct[c7]{$\text{AP}_\text{mask}$}{}\\
        \hline
        \addpadding
        OAI CLIP-L~\cite{clip}  & 0.3B & \rp{224}{14} &45.0 & 41.9 & 54.0 & 47.5 \\
        MetaCLIP-G~\cite{metaclip} & 1.8B & \rp{224}{14} & 45.1 & 41.9 & 53.2 & 46.7   \\
        SigLIP-so~\cite{siglip} & 0.4B & \rp{224}{14}&  45.0 & 41.9 & 54.4 & 47.6 \\
        MAE-L~\cite{mae} & 0.3B & \rp{224}{14} & 46.1 & 43.9 & 55.6 & 49.3 \\
        EVA02-L~\cite{eva2}  & 0.3B &  \rp{224}{14} & 49.3 & 45.2 & 54.9 & 48.2 \\
        SigLIP2-so~\cite{siglip2} & 0.4B & \rp{512}{16} & 49.3 & 45.6 & 56.0 & 49.4 \\
        SigLIP2-g-opt~\cite{siglip2} & 1.1B & \rp{384}{16} & {52.9} & {48.5} & 57.1 & {50.2} \\
        DINOv2-L~\cite{dinov2} & 0.3B & \rp{518}{14} & 46.7 & 43.5 & 55.7 & 49.0 \\
        DINOv2-g~\cite{dinov2}  & 1.1B & \rp{518}{14} & 51.5 & 47.3 & {57.2} & 50.0 \\
        \textbf{\PEcore{G}} & 1.9B & \rp{448}{14} &  51.9 & 47.9 & 57.0 & 49.8 \\
        \textbf{\PEspat{G}} & 1.9B & \rp{448}{14} & \textbf{54.2} & \textbf{49.3} & \textbf{57.8} & \textbf{50.3} \\
        \shline
    \end{tabular}
```
```{=latex}
\vspace{-5pt}
```
Comparisons with Existing Vision Encoders {#sec:sa_results}
-----------------------------------------

#### Frozen Feature Dense Prediction.

In Tab. `\ref{tbl:dense_prediction}`{=latex}, we compare different vision encoders' frozen features on three dense prediction tasks: DAVIS tracking [@davis2017] (J&F) following the training-free setting from [@jabri2020space; @vgpt], ADE20k semantic segmentation [@ade20k] (mIoU) with linear probing, and NYU depth estimation [@nyu_depth] (RMSE) with a DPT head [@dpt]. For each model, we report both its best-layer and last-layer performance. Across the board, `\PEspat{}`{=latex} outperforms other state-of-the-art spatial models, with its best features being much better aligned to the last layer than those of the `\PEcore{}`{=latex} it started from. Notably, SigLIP2, which combines spatial, captioning, and contrastive losses during pretraining [@siglip2], is *not* well aligned to the last layer in comparison.
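The training-free tracking protocol evaluates features by propagating mask labels frame-to-frame via feature correspondence. A bare-bones sketch of the idea (the actual benchmark setting additionally uses spatial windowing and multiple context frames):

```python
import numpy as np

def propagate_labels(prev_feats, next_feats, prev_labels, k=5):
    """Training-free label propagation: each token in the next frame takes a
    majority vote over the labels of its k most similar tokens (by cosine
    similarity) in the previous frame. No parameters are learned.

    prev_feats, next_feats: (HW, C) frozen tokens; prev_labels: (HW,) ids."""
    def norm(f):
        return f / np.linalg.norm(f, axis=-1, keepdims=True)

    sim = norm(next_feats) @ norm(prev_feats).T       # (HW, HW)
    topk = np.argsort(-sim, axis=-1)[:, :k]           # k nearest neighbors
    votes = prev_labels[topk]                         # (HW, k) label votes
    return np.array([np.bincount(v).argmax() for v in votes])
```

Since the propagation is purely correspondence-based, global tokens corrupt it directly, which is why tracking is the task most sensitive to their appearance.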

#### End-to-End Finetuning Detection and Segmentation.

In Tab. `\ref{tbl:det_SFT}`{=latex}, we compare `\PEcore{}`{=latex} and `\PEspat{}`{=latex} with other popular vision encoders in the standard *full-finetuning* ViTDet [@vitdet] Mask R-CNN [@maskrcnn] setting, using COCO [@coco] and LVIS [@lvis] as benchmarks. In this controlled experiment, `\PEspat{}`{=latex} is state-of-the-art among the various vision backbones. This is significant, as contrastive encoders (especially large ones like MetaCLIP-G [@metaclip]) usually perform very poorly on detection, with smaller models often performing better. Typically, encoders scale for detection only when spatial pretraining is used or when a significant amount of detection data [@dinov2] aligns them directly to the downstream task. In contrast, `\PEspat{}`{=latex} *uses no detection data for alignment*, making it general.

```{=latex}
\begin{wraptable}{r}{0.4\textwidth}
\vspace{-15pt}
\centering
{
    \tablestyle{0pt}{1.1} 
    \begin{tabular}{y{60}x{35}x{35}x{45}}
        \shline
        Encoder  & Params & Detector
        & \ct[c7]{COCO AP$_\text{box}$}{}\\
        \hline
        \addpadding{}%
        SwinV2-G~\cite{swin2} & 3.0B & HTC++~\cite{htc}  & 62.5 \\
        Swin-L~\cite{swin}& 0.3B & DINO~\cite{dino_det}  & 63.2 \\
        % MAE-H~\cite{mae}& 632M & Cascade~\cite{cascadercnn} & 61.3 \\
        EVA02-L~\cite{eva2} & 0.3B & Cascade~\cite{cascadercnn} & 64.1 \\
        InternImage-G~\cite{internimage} & 3.0B & DINO~\cite{dino_det}   & {65.3} \\
        EVA02-L~\cite{eva2} & 0.3B & CoDETR~\cite{codetr} & 65.9 \\
        \textbf{\PEspat{G}} & 1.9B & DETA~\cite{deta} & \textbf{66.0} \\
        \shline
    \end{tabular}
}
\caption{{\bf System-Level Comparison on Detection.} Comparing to the leading results on COCO~\cite{coco} val2017. See Appendix~\ref{appx:det_sota_setting} for training recipe.
}
\label{tbl:det_coco}
\vspace{-20pt}
\end{wraptable}
```
#### System-Level Detection.

In Tab. `\ref{tbl:det_coco}`{=latex}, we provide a system-level end-to-end finetuning comparison `\vs `{=latex}the absolute state-of-the-art in COCO detection. With only Object365 [@o365] as extra detection data, `\PEspat{}`{=latex} can match the performance of more complex models tuned for detection, while only using a simple DETR-style decoder [@carion2020detr; @deta]. `\PEspat{}`{=latex} marks the first general, contrastively pretrained model to accomplish this.

```{=latex}
\clearpage
```
```{=latex}
\newpage
```
Related Work
============

Learning vision-semantic representations has long been the leading approach for developing foundational models in perception. By aligning visual and textual representations, these models excel not only in vision tasks such as zero-shot image classification and image-text retrieval [@clip; @openclip; @laion], open-vocabulary detection [@owlv1; @fvlm; @owlv2] and segmentation [@ding2023zss; @cho2024catseg], but also serve as the basis for multi-modal large language models (MLLMs) [@qwen-vl; @kosmos-2; @llava; @paligemma; @mm1; @cambrian].

#### Contrastive Language-Image Pretraining.

The early works of VirTex [@desai2021virtex], ICMLM [@sariyildiz2020icmlm], and ConVIRT [@pmlr-v182-zhang22a] developed techniques for learning via contrastive objectives between the vision and language modalities. Subsequently, vision encoders such as CLIP [@clip; @openclip] and ALIGN [@align] scaled these techniques to much larger datasets and model sizes, popularizing vision-language contrastive learning. A series of open-weight contrastive models have been developed to enhance the performance and robustness of CLIP [@EVA-CLIP; @siglip; @li2023clipav2; @dfn; @metaclip; @laion]. For instance, SigLIP [@siglip] replaces the traditional softmax with a sigmoid in contrastive learning, while FLIP [@flip] employs masking to speed up training. We join this effort and build a state-of-the-art open Perception Encoder (PE) (§`\ref{sec:core_image_pt}`{=latex}). Other objectives that have proven useful for building visual encoders include captioning loss, which learns to predict image descriptions with a language-model decoder and transfers well to downstream multimodal language modeling tasks [@aimv2; @cappa]. Many works now combine two or more objectives to address different downstream tasks, either pretraining with multiple objectives [@aimv2; @coca] or training sequentially [@internvl; @llava-onevision].
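
The softmax and sigmoid contrastive objectives discussed above can be sketched in a few lines. The numpy snippet below illustrates the two loss forms (InfoNCE as in CLIP, pairwise sigmoid as in SigLIP) on toy embeddings; the scale and bias values and the batch size are illustrative assumptions, not any model's trained parameters.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_loss(img, txt, scale=10.0):
    """Symmetric softmax (InfoNCE) contrastive loss in the style of CLIP."""
    logits = scale * img @ txt.T                  # (N, N); matched pairs on the diagonal
    def xent(l):                                  # numerically stable cross-entropy
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()
    return 0.5 * (xent(logits) + xent(logits.T))  # image->text and text->image

def siglip_loss(img, txt, scale=10.0, bias=-10.0):
    """Pairwise sigmoid contrastive loss in the style of SigLIP: every (i, j)
    pair is an independent binary decision, so no batch-wide softmax is needed."""
    logits = scale * img @ txt.T + bias
    z = 2.0 * np.eye(len(img)) - 1.0              # +1 for matched pairs, -1 otherwise
    return np.log1p(np.exp(-z * logits)).mean()   # mean binary log-loss

rng = np.random.default_rng(0)
img = normalize(rng.normal(size=(4, 8)))
txt = normalize(img + 0.1 * rng.normal(size=(4, 8)))  # noisy "matching" captions
print(clip_loss(img, txt), siglip_loss(img, txt))
```

Both losses drop when matched image-text pairs are more similar than mismatched ones; the sigmoid form simply removes the batch-wide normalization.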

#### Efficient Training.

Various axes of efficient training for CLIP models have been explored. BASIC [@basic] and LAION [@laion] scaled the batch size up to 160K and showed the benefits of large batches during training. EVA-CLIP [@eva18b] uses the LAMB optimizer [@lamb] for large-batch training of CLIP models. Rotary positional embedding (RoPE) [@rope] has been successfully adopted in large language models, and [@heo2024rotary; @agrawal2024pixtral] adopted 2D rotary positional embeddings in vision transformers. On the data engine side, a series of works focus on large-scale sourcing and filtering through efficient data curation [@datacomp; @laion; @dfn; @metaclip] and explore recaptioning training images using MLLMs or VLMs [@rewrite; @veclip; @Nguyen2023recap; @altogether]. We extend these concepts to build a video data engine and scale our model to function as one strong model for both image and video (§`\ref{sec:core_video_ft}`{=latex}).

#### Best Embedding Layer Inside the Network.

Typically, most vision encoders rely on the last layer to extract features for the task they were trained on. However, when trained on proxy or self-supervised tasks, the last layer is often not the ideal candidate for other tasks [@zheng2016good; @bordes2022guillotine; @shekhar2023objectives; @igpt; @aimv1; @vgpt; @repa; @ma2024sit; @sun2024cliper; @walmer2023teaching; @chen2020simclr]. For example, when using image colorization as a pretraining objective, [@zhang2016colorful; @zheng2016good] showed that the middle layers were better at image classification than the last layers. Subsequently, iGPT [@igpt] found that, when trained for next token prediction, intermediate layers performed better at image classification. AIMv1 [@aimv1] showed similar behavior for image-based next token prediction with a patch-normalized MSE loss. Toto [@vgpt] extended this to next token prediction in videos, where intermediate layers are best for image classification, video classification, tracking, and robotics. REPA [@repa] showed this behavior for image generation models, where the intermediate layers of SiT [@ma2024sit] have better linear probing accuracy than earlier or later layers. In CLIP models, CLIPer [@sun2024cliper] identified that early layers possess good spatial understanding. In contrast to these lines of work, we first show this behavior is not limited to one class of encoders: it exists in a spatially self-supervised model [@dinov2], a generative captioning model [@aimv2], and our own PE. We then study this behavior for the PE encoder in depth and show it is possible for CLIP training to produce rich spatial and semantic features in intermediate layers (§`\ref{sec:layerfinder}`{=latex}).

#### Alignment Tuning.

We explore alignment tuning for language (§`\ref{sec:la}`{=latex}) and for spatial understanding (§`\ref{sec:sa}`{=latex}). For language alignment, we focus on adapting to multimodal large language models (MLLMs); for spatial alignment, we employ self-distillation of the model's own features combined with a teacher for locality. In the MLLM literature, *midtraining*---i.e., a middle stage of training used to exploit large-scale multimodal data---has been actively studied. LLaVA-OneVision [@llava-onevision], the InternVL series [@internvl; @chen2024internvit2p5], the QwenVL series [@qwen-vl; @qwen2vl], and several other leading MLLMs [@llama3; @gemma3] adopt this paradigm. Our `\PElang{}`{=latex} can be seen as a variant of midtraining, but with one critical difference in principle: our goal is *not* to build the best MLLM, but to make the vision encoder the most *general*. Throughout §`\ref{sec:la}`{=latex}, we benchmark `\PElang{}`{=latex} across different language models and input resolutions on various image and video tasks to show this generality. For spatial tasks, we utilize the hidden embeddings in the intermediate layers. Recently, several works showed the effectiveness of distilling a teacher model via representation alignment with cosine similarity: REPA [@repa] distilled early-layer features of DINO for image diffusion models, and RADIO [@ranzinger2023radio] used multi-teacher distillation (DINO, CLIP, and SAM). The key idea is to borrow the semantic understanding (*e.g.*, CLIP) and spatial understanding (*e.g.*, SAM, DINO) of pretrained vision encoders. In our `\PEspat{}`{=latex}, we exploit the intermediate features of `\PEcore{}`{=latex} for semantics and introduce a novel way to use SAM for spatial understanding.
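
The representation alignment with cosine similarity used by REPA and RADIO reduces to a simple per-token objective between student and teacher features. Below is a minimal numpy sketch; the feature shapes and the assumption that the student is already projected to the teacher's width are ours for illustration, not a recipe from any of the cited works.

```python
import numpy as np

def cosine_align_loss(student, teacher):
    """Mean (1 - cosine similarity) over per-token features: low when the
    student's intermediate-layer features point in the same directions as the
    frozen teacher's. Shapes (tokens, dim) are illustrative."""
    s = student / np.linalg.norm(student, axis=-1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=-1, keepdims=True)
    return float(np.mean(1.0 - (s * t).sum(axis=-1)))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(16, 32))                   # frozen teacher features
student = teacher + 0.05 * rng.normal(size=(16, 32))  # nearly aligned student
print(cosine_align_loss(student, teacher))
```

Because only direction matters, the loss is invariant to feature magnitude, which makes it a convenient target when student and teacher were trained with different losses.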

Conclusion
==========

We have presented Perception Encoders (PE), a family of best-in-class foundation models comprising `\PEcore{}`{=latex}, `\PElang{}`{=latex}, and `\PEspat{}`{=latex}. We have shown that `\PEcore{}`{=latex} can outperform models trained with WebLI and JFT-3B, which were previously the undisputed leaders in zero-shot image recognition, while also excelling in zero-shot video recognition. We have demonstrated that `\PElang{}`{=latex} can be used to build a multimodal language model [@PLM] at the forefront of the field. We have established that `\PEspat{}`{=latex} can match the long-standing state-of-the-art in object detection with a significantly simpler decoder. Throughout all of this, one conclusion is abundantly clear: Perception Encoder unlocks the potential to scale simple contrastive vision-language pretraining to address a wide range of downstream vision tasks.

#### Additional Contributors and Acknowledgments.

We would like to thank Abhimanyu Dubey, Adel Ahmadyan, Andrew Westbury, Arkabandhu Chowdhury, Azita Shokrpour, Babak Damavandi, Chay Ryali, Cyprien de Lichy, Didac Suris Coll-Vinent, Dong Wang, Filip Radenovic, George Orlin, Han Zou, Harry Tran, Jitendra Malik, Joelle Pineau, Joseph Greer, Kavya Srinet, Kirmani Ahmed, Laura Gustafson, Lu Zhang, Muhammad Maaz, Natalia Neverova, Nicolas Carion, Oleksandr Maksymets, Ramya Raghavendra, Romy Luo, Ronghang Hu, Sam Doud, Sasha Mitts, Sean Bell, Shane Moon, Shuming Hu, Soerian Lieve, Stephane Kasriel, Valentin Gabeur, Vanessa Stark, Vignesh Ramanathan, Vivian Lee, Xuan Hu, Yang Li, and Ziyang Wang for their contributions and support for the project. And we thank you, the reader, for reading this far.

```{=latex}
\clearpage
```
```{=latex}
\appendix
```
Video Data Engine
=================

Video Caption {#sec:appx_video_caption}
-------------

#### LLM Summarization prompt

```{=latex}
\promptbox{LLM Summarization prompt 72 tokens}{Create a concise caption of a video using the provided metadata, video caption, and frame captions.

TASK: Extract key information from the captions and combine it into an alt text format using a single phrase or set of phrases that includes all relevant details.

Steps to Follow:

1. Review the metadata (title and description) for general context; you can rely on it for entity names, but do not rely on it as the primary source of information for your caption.

2. Blend the title/description with the video caption and frame captions for the main storyline.

3. Extract the most relevant and concise information.

4. Combine the extracted information into an alt text format using a short phrase or set of phrases of approximately 120 tokens, counting special characters like commas as part of the token count.

5. Prioritize including all key information over sentence structure or grammar.

6. Minimize the use of special characters and focus on key information.

What to Avoid:

- Avoid adding or inferring information not present in the original metadata and captions.

- Avoid using complex sentence structures or prioritizing sentence flow.

Create a concise caption of the video based on the metadata, video caption, and frame captions. }
```
PE Video Dataset Details {#sec:appx_video_datasets}
------------------------

PE Video is a dataset that we collected and curated from a licensed data source. The videos are high-resolution and high-quality with a focus on motion. The total number of videos is 1M. Among these, 120K videos have human-refined video captions, and we selected 15K from the 120K videos as a benchmark.

### Video Data Filtering Pipeline

The goal of video data filtering is to identify videos that contain motion, such as object motion, camera motion, interactions between objects, human actions, sequences of actions, and manipulation of objects, while rejecting videos with static scenes, like landscapes, or videos that are artificial or highly edited.

To achieve this, we created a video filtering pipeline consisting of the following steps:

#### Step 1

: Compute motion features. For each video, we compute a list of features from video frames, including frames per second (fps), number of frames, number of I-frames, motion vector magnitude, and motion vector variance, using off-the-shelf tools like OpenCV [@opencv].
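
Step 1 can be approximated without codec tooling. The sketch below computes a frame-difference motion proxy in numpy; the actual pipeline reads codec-level signals (fps, I-frame counts, motion vectors) with tools like OpenCV, so treat this as a stand-in rather than the production code.

```python
import numpy as np

def motion_stats(frames):
    """Per-video motion statistics from a (T, H, W) grayscale frame stack.

    Mean absolute frame difference serves as a simple proxy for motion-vector
    magnitude; its variance over time stands in for motion-vector variance.
    """
    diffs = np.abs(np.diff(frames.astype(np.float64), axis=0))  # (T-1, H, W)
    per_frame = diffs.mean(axis=(1, 2))          # motion magnitude per transition
    return {
        "num_frames": frames.shape[0],
        "motion_mean": float(per_frame.mean()),  # proxy for motion-vector magnitude
        "motion_var": float(per_frame.var()),    # proxy for motion-vector variance
    }

static = np.ones((8, 16, 16)) * 128              # static clip: zero motion
moving = np.stack([np.roll(static[0] + np.arange(16), t, axis=1) for t in range(8)])
print(motion_stats(static)["motion_mean"], motion_stats(moving)["motion_mean"])
```

A static clip scores exactly zero, while any panning or object motion produces a positive motion mean.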

#### Step 2

: Extract video frame features. For each video, we uniformly sample three frames and encode them using a DINOv2 model [@dinov2] and a SigLIP model [@siglip].
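
The uniform sampling in Step 2 amounts to evenly spaced frame indices; a small sketch follows, with the DINOv2 and SigLIP encoding calls omitted since only the sampling logic is of interest here.

```python
import numpy as np

def uniform_frame_indices(num_frames, num_samples=3):
    """Evenly spaced frame indices across a video, first and last included.
    The selected frames would then be embedded with DINOv2 and SigLIP."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

print(uniform_frame_indices(301))
```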

#### Step 3

: LLM Features. For each video, we also run a multimodal large language model (MLLM), Llava-Onevision QwenLM 2 0.5B [@llava-onevision], to extract MLLM features. We composed a list of 26 questions and performed MLLM inference on the videos. The questions are listed in §`\ref{text:llm_feature_extraction_qeustions}`{=latex}.

#### Step 4

: Video Quality Scoring. We combine all the features collected so far and use a random forest model to predict a score between 0 and 5. To train the model, we manually annotated approximately 1,000 videos with scores between 0 and 5. A low score indicates that the video is almost static and can be nearly summarized by a single frame, while a high score indicates that there are multiple temporal events in the video, requiring several frames to accurately caption it. We use these annotated videos as training data to fit a random forest model for video quality score prediction.
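
Step 4 can be sketched with scikit-learn's `RandomForestRegressor`. The synthetic feature vectors and the toy scoring rule below are assumptions for illustration only; the real model is fit on the roughly 1,000 human-annotated videos described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the annotated videos: 32-dim feature vectors
# (motion stats + frame + MLLM features concatenated) and a toy
# "temporal richness" score in [0, 5]. Both are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))
y = np.clip(2.5 + 2.0 * X[:, 0], 0.0, 5.0)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
scores = model.predict(rng.normal(size=(5, 32)))   # score unseen videos
print(np.round(scores, 2))
```

Because a random forest averages training targets within leaves, its predictions stay inside the annotated score range by construction.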

#### Step 5

: We apply k-means clustering to the videos and rank them within each cluster. By selecting the top-ranked videos from each cluster, we effectively reduce the number of duplicated videos in the final dataset.
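
Step 5, clustering then keeping the top-ranked videos per cluster, might look like the following sketch (scikit-learn `KMeans` on synthetic embeddings; the cluster count and top-k value are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def dedup_topk(embeddings, scores, n_clusters, top_k=1):
    """Cluster video embeddings, then keep the `top_k` highest-scored videos
    from each cluster so near-duplicates collapse to one representative."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    keep = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        best_first = members[np.argsort(scores[members])[::-1]]
        keep.extend(best_first[:top_k].tolist())
    return sorted(keep)

rng = np.random.default_rng(0)
centers = 5.0 * rng.normal(size=(3, 8))
emb = np.concatenate([c + 0.1 * rng.normal(size=(4, 8)) for c in centers])  # 3 duplicate groups
scores = rng.uniform(0.0, 5.0, size=12)                                     # quality scores
print(dedup_topk(emb, scores, n_clusters=3))
```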

```{=latex}
\newpage
```
### LLM Feature Extraction {#text:llm_feature_extraction_qeustions}

```{=latex}
\promptbox{LLM Feature extraction question list}{
Is the camera capturing the scene static? Reply yes or no.

Is the camera capturing the scene moving? Reply yes or no.

Is the video capturing a landscape? Reply yes or no.

Is the video capturing a static scene? Reply yes or no.

Is the scene captured from a distance? Reply yes or no.

Is the video captured with a drone? Reply yes or no.

Is the video computer-generated? Reply yes or no.

Is the video content abstract? Reply yes or no.

Is there something moving through the scene? Reply yes or no.

Is there someone doing something in the video? Reply yes or no.

Are there several things moving in the video? Reply yes or no.

Is there an object that is being manipulated? Reply yes or no.

Are there animals in the video? Reply yes or no.

Is the scene mostly static? Reply yes or no.

Are things occluding each other in this video? Reply yes or no.

Is there something obstructing the view apart from the watermark? Reply yes or no.

Is there a large number of things in the video? Reply yes or no.

Are there more than 5 different objects in the video? Reply yes or no.

Is it hard to keep track of some entities because they are moving so much? Reply yes or no.

Is someone looking at a phone, a tablet or a computer screen? Reply yes or no.

Are they looking at a phone, a tablet or a computer screen during the whole video? Reply yes or no.

Are there several moving persons in this video? Reply yes or no.

Are there several moving animals in this video? Reply yes or no.

Are there several objects in this video? Reply yes or no.

Are there several similar-looking objects in the video? Reply yes or no.

Do they look similar? Reply yes or no.
}
```
We use the LLaVA-OneVision [@llava] model to extract MLLM features from the videos. For each video, we prompt with 26 different questions, ranging from \`\`is the video a landscape video?'' to \`\`are there any moving objects in the video?'' The features are then used by a random forest model to determine the video quality score.

### PVD Benchmark Distribution {#appx:pvd_bench_distribution}

```{=latex}
\centering
```
```{=latex}
\tablestyle{4pt}{1.05}
```
```{=latex}
\begin{tabular}{x{45}x{40}x{40}}
    \shline
    \addpadding
    \multirow{2}{*}{Category} & Number of videos & Avg. Caption Length\\
    
    \hline
    \addpadding
    \cc[c1]{Hand Actions} & 2143 & 54.2 \\
    \cc[c2]{Object Interactions} & 1864 & 42.6 \\
    \cc[c3]{Food Preparation} & 1691 & 56.8 \\
    \cc[c4]{Work Activities} & 1689 & 47.8 \\
    \cc[c5]{Outdoor Scenes} & 1558 & 50.7 \\
    \cc[c6]{Animals} & 1423 & 50.9 \\
    \cc[c7]{Water Scenes} & 1337 & 44.6 \\
    \cc[c8]{Object Handling} & 1307 & 51.6 \\
    \cc[c9]{Close-up Shots} & 1122 & 45.1 \\
    \cc[c10]{Nature Scenes} & 866 & 38.4 \\
    \shline
\end{tabular}
```
```{=latex}
\begin{figure*}[t!]\centering
    \vspace{1cm}
    \includegraphics[width=\linewidth, trim=10.5in 0in 0in 10in, clip]{fig/pvd_video_example_more.pdf}
    \caption{{\bf More PE Video Dataset Examples.} For each of the ten categories, we randomly pick one video and show its video caption. The captions were generated by our video data pipeline and then refined by human annotators.}
    \label{fig:video_data_example_more}
    \vspace{2cm}
\end{figure*}
```
```{=latex}
\clearpage
```
Implementation Details
======================

PE Core
-------

We provide additional implementation details for building `\PEcore{}`{=latex}. Our implementation is based on OpenCLIP[^5].

### Architecture and Training Setups

#### Model Architecture.

Following CLIP, `\PEcore{}`{=latex} comprises a Transformer-based [@transformer] vision encoder and text encoder. We employ customized Transformer configurations as detailed in Tab. `\ref{tab:app_model_arch}`{=latex}. For pooling, we use an attention pooling block in the style of SigLIP [@siglip] *with 8 heads* on the last-layer features to construct image and video embeddings. For positional embeddings, we use 2D RoPE [@rope] for relative positions and 2D learnable absolute positional embeddings (abs) of the same size as the model's input resolution. We interpolate positional embeddings to support resolutions beyond the default. The text context length is 72 for the G-scale model and 32 for the B- and L-scale models. Originally a bug, we find it optimal to *not disable the class token* when using attention pooling for smaller models. Thus, the B and L models use a class token, and the attention pooling layer probes all features at once (class token included). Finally, we use an input mean and standard deviation of $(0.5, 0.5, 0.5)$ for simplicity.

```{=latex}
\centering
```
```{=latex}
\tablestyle{0pt}{1.05}
```
```{=latex}
\begin{tabular}{x{20}x{30}x{25}x{20}x{20}x{20}x{20}x{20}x{40}x{40}x{40}x{40}x{40}x{40}}
    \shline
        \ct{Scale} & \ct{Tower} & \ct[c1]{Params} & \ct[c2]{Width} & \ct[c3]{Depth} & \ct[c4]{MLP} & \ct[c5]{Heads} & \ct[c6]{CLIP Dim} & \ct[c7]{Pooling} & \ct[c8]{Positional Embedding} & \ct[c9]{Resolution \& Context Len}  & \ct[c10]{Patch Size} & \ct[c10]{Class Token Register}\\
        \hline
       \addpadding
        \multirow{2}{*}{B} & Vision & 0.09B & 768 & 12 & 3072 & 12 & \multirow{2}{*}{1024} & Attn Pool & RoPE+Abs & 224 & 16 & \cmark \\
                           & Text   & 0.31B & 1024 & 24 & 4096 & 16 & &  EOS Token & Abs & 32 & - & - \\
       \hline
       \addpadding
        \multirow{2}{*}{L} & Vision & 0.32B& 1024 & 24 & 4096 & 16 & \multirow{2}{*}{1024}& Attn Pool & RoPE+Abs & 336 & 14 & \cmark \\
                           & Text   & 0.31B & 1024 & 24 & 4096 & 16 & &  EOS Token & Abs & 32 & - & - \\
                           \hline
       \addpadding
        \multirow{2}{*}{G} & Vision & 1.88B & 1536 & 50 & 8960 & 16 & \multirow{2}{*}{1280} & Attn Pool & RoPE+Abs & 448 & 14 & \xmark \\
                           & Text   & 0.47B & 1280 & 24 & 5120 & 20 & &  EOS Token & Abs & 72 & - & - \\
    \shline
\end{tabular}
```
```{=latex}
\captionsetup{justification=centering}
```
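
The positional-embedding interpolation mentioned in the architecture details above can be sketched as a separable bilinear resize of the learned $(H, W, C)$ grid. Real implementations typically call `torch.nn.functional.interpolate`; the numpy version below is only an illustrative sketch.

```python
import numpy as np

def interpolate_abs_pos_embed(pos, new_hw):
    """Bilinearly resize a (H, W, C) grid of learned absolute positional
    embeddings to `new_hw`, so the encoder accepts resolutions beyond the
    default. Done as two 1D linear interpolations (height, then width)."""
    H, W, C = pos.shape
    Hn, Wn = new_hw
    ys, xs = np.linspace(0, H - 1, Hn), np.linspace(0, W - 1, Wn)
    tmp = np.empty((Hn, W, C))
    for j in range(W):                 # interpolate along height
        for c in range(C):
            tmp[:, j, c] = np.interp(ys, np.arange(H), pos[:, j, c])
    out = np.empty((Hn, Wn, C))
    for i in range(Hn):                # then along width
        for c in range(C):
            out[i, :, c] = np.interp(xs, np.arange(W), tmp[i, :, c])
    return out

pos = np.random.default_rng(0).normal(size=(14, 14, 8))  # e.g. 224/16 = 14 patches per side
print(interpolate_abs_pos_embed(pos, (21, 21)).shape)

```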
#### PE Core Training. {#sec:appx_joint_train}

As discussed in §`\ref{sec:unified-encoder}`{=latex}, the training of `\PEcore{}`{=latex} involves three stages: 1) image pretraining; 2) image and video finetuning; and 3) an additional model distillation for smaller models. These three stages work together to develop a robust and effective `\PEcore{}`{=latex} model.

We first provide training recipes for 1) image pretraining in Tab. `\ref{tbl_app:pretrain}`{=latex} and 2) video finetuning in Tab. `\ref{tbl_app:video_ft}`{=latex}.

`\label{sec:appx_image_pretrain}`{=latex}

```{=latex}
\vspace{0pt}
```
```{=latex}
\centering
```
```{=latex}
\tablestyle{5pt}{1.1}
```
::: {#tbl_app:pretrain}
+--------------------------------------------------------------------------------+------------------------+
| config                                                                         | values                 |
+:===============================================================================+:======================:+
| ```{=latex}                                                                    | LAMB                   |
| \addpadding                                                                    |                        |
| ```                                                                            |                        |
| optimizer                                                                      |                        |
+--------------------------------------------------------------------------------+------------------------+
| $\beta_1, \beta_2$                                                             | (0.9, 0.95)            |
+--------------------------------------------------------------------------------+------------------------+
| weight decay                                                                   | 0.05                   |
+--------------------------------------------------------------------------------+------------------------+
| learning rate                                                                  | 2e-3                   |
+--------------------------------------------------------------------------------+------------------------+
| batch size                                                                     | 131,072                |
+--------------------------------------------------------------------------------+------------------------+
| warm-up steps                                                                  | 2K                     |
+--------------------------------------------------------------------------------+------------------------+
| training steps                                                                 | 443K (B, L) / 656K (G) |
+--------------------------------------------------------------------------------+------------------------+
| data quantity                                                                  | 5.4B                   |
+--------------------------------------------------------------------------------+------------------------+
| samples seen                                                                   | 58B (B, L) / 86B (G)   |
+--------------------------------------------------------------------------------+------------------------+
| max logit scale                                                                | 100                    |
+--------------------------------------------------------------------------------+------------------------+
|                                                                                |                        |
+--------------------------------------------------------------------------------+------------------------+
| mask reg ratio                                                                 | 0.4                    |
+--------------------------------------------------------------------------------+------------------------+
| mask reg batch                                                                 | 8192                   |
+--------------------------------------------------------------------------------+------------------------+
|                                                                                |                        |
+--------------------------------------------------------------------------------+------------------------+
| progressive res                                                                |                        |
+--------------------------------------------------------------------------------+------------------------+
| 98-154-224-336 (L)                                                             |                        |
+--------------------------------------------------------------------------------+------------------------+
| 98-154-224-336-448 (G)                                                         |                        |
+--------------------------------------------------------------------------------+------------------------+
|                                                                                |                        |
+--------------------------------------------------------------------------------+------------------------+
| data aug                                                                       |                        |
+--------------------------------------------------------------------------------+------------------------+
| `\text{rand crop}`{=latex} `\text{\texttt{\tiny s(0.08,1)}}`{=latex}           |                        |
+--------------------------------------------------------------------------------+------------------------+
| `\text{color jitter}`{=latex} `\text{\texttt{\tiny j(0.32,0,0.32,0)}}`{=latex} |                        |
+--------------------------------------------------------------------------------+------------------------+
| `\text{hflip}`{=latex} `\text{\texttt{\tiny p(0.5)}}`{=latex}                  |                        |
+--------------------------------------------------------------------------------+------------------------+

: **Image Pretraining.**
:::

\

```{=latex}
\captionsetup{justification=centering}
```
```{=latex}
\hfill
```
```{=latex}
\vspace{0pt}
```
```{=latex}
\centering
```
```{=latex}
\tablestyle{5pt}{1.1}
```
::: {#tbl_app:video_ft}
+--------------------------------------------------------------------------------+-------------+
| config                                                                         | values      |
+:===============================================================================+:===========:+
| ```{=latex}                                                                    | LAMB        |
| \addpadding                                                                    |             |
| ```                                                                            |             |
| optimizer                                                                      |             |
+--------------------------------------------------------------------------------+-------------+
| $\beta_1, \beta_2$                                                             | (0.9, 0.95) |
+--------------------------------------------------------------------------------+-------------+
| weight decay                                                                   | 0.05        |
+--------------------------------------------------------------------------------+-------------+
| learning rate                                                                  | 1e-6        |
+--------------------------------------------------------------------------------+-------------+
| batch size                                                                     | 4096        |
+--------------------------------------------------------------------------------+-------------+
| warm-up steps                                                                  | 2K          |
+--------------------------------------------------------------------------------+-------------+
| training steps                                                                 | 5.4K        |
+--------------------------------------------------------------------------------+-------------+
| data quantity                                                                  | 22M         |
+--------------------------------------------------------------------------------+-------------+
| samples seen                                                                   | 22M         |
+--------------------------------------------------------------------------------+-------------+
| max logit scale                                                                | 100         |
+--------------------------------------------------------------------------------+-------------+
|                                                                                |             |
+--------------------------------------------------------------------------------+-------------+
| number of frames                                                               | 8           |
+--------------------------------------------------------------------------------+-------------+
|                                                                                |             |
+--------------------------------------------------------------------------------+-------------+
| data aug                                                                       |             |
+--------------------------------------------------------------------------------+-------------+
| `\text{rand crop}`{=latex} `\text{\texttt{\tiny s(0.08,1)}}`{=latex}           |             |
+--------------------------------------------------------------------------------+-------------+
| `\text{color jitter}`{=latex} `\text{\texttt{\tiny j(0.32,0,0.32,0)}}`{=latex} |             |
+--------------------------------------------------------------------------------+-------------+
| `\text{hflip}`{=latex} `\text{\texttt{\tiny p(0.5)}}`{=latex}                  |             |
+--------------------------------------------------------------------------------+-------------+

: **Video Finetuning.**
:::

\

```{=latex}
\captionsetup{justification=centering}
```
```{=latex}
\hfill
```
```{=latex}
\vspace{0pt}
```
```{=latex}
\centering
```
```{=latex}
\tablestyle{5pt}{1.1}
```
::: {#tbl_app:distillation}
+---------------------+-------------------------------------------------+
| config              | values                                          |
+:====================+:===============================================:+
| ```{=latex}         | LAMB                                            |
| \addpadding         |                                                 |
| ```                 |                                                 |
| optimizer           |                                                 |
+---------------------+-------------------------------------------------+
| $\beta_1, \beta_2$  | (0.9, 0.95)                                     |
+---------------------+-------------------------------------------------+
| weight decay        | 0.05                                            |
+---------------------+-------------------------------------------------+
| learning rate       | 1e-6                                            |
+---------------------+-------------------------------------------------+
| batch size          | 16384                                           |
+---------------------+-------------------------------------------------+
| warm-up steps       | 2K                                              |
+---------------------+-------------------------------------------------+
| training steps      | 269K                                            |
+---------------------+-------------------------------------------------+
| data quantity       | 5.4B                                            |
+---------------------+-------------------------------------------------+
| samples seen        | 4.4B                                            |
+---------------------+-------------------------------------------------+
| max logit scale     | 100                                             |
+---------------------+-------------------------------------------------+
|                     |                                                 |
+---------------------+-------------------------------------------------+
| teacher logit scale | 200 (§`\ref{appx:core_smaller_models}`{=latex}) |
+---------------------+-------------------------------------------------+
|                     |                                                 |
+---------------------+-------------------------------------------------+
| data aug            | None                                            |
+---------------------+-------------------------------------------------+

: **Distillation.**
:::

\

```{=latex}
\captionsetup{justification=centering}
```
After training the largest G-scale model, we train the smaller models with image pretraining, distill them with image distillation as in Tab. `\ref{tbl_app:distillation}`{=latex}, and finally apply video finetuning.
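
As a sketch of how the teacher logit scale of 200 listed in Tab. `\ref{tbl_app:distillation}`{=latex} might enter the loss, the snippet below computes a cross-entropy between the student's image-to-text distribution and a sharper teacher distribution. Reading the teacher's soft labels as the target is our assumption for illustration, not a verbatim description of the recipe.

```python
import numpy as np

def distill_loss(student_sim, teacher_sim, student_scale=100.0, teacher_scale=200.0):
    """Cross-entropy between the student's image->text distribution and the
    teacher's sharper one (teacher logit scale 200 vs. a max student scale of 100)."""
    def log_softmax(l):
        l = l - l.max(axis=1, keepdims=True)   # numerically stable
        return l - np.log(np.exp(l).sum(axis=1, keepdims=True))
    p_teacher = np.exp(log_softmax(teacher_scale * teacher_sim))
    return float(-(p_teacher * log_softmax(student_scale * student_sim)).sum(axis=1).mean())

rng = np.random.default_rng(0)
teacher = rng.uniform(-1.0, 1.0, size=(4, 4))   # teacher image-text cosine similarities
print(distill_loss(teacher.copy(), teacher))    # lower than for a mismatched student
```

The higher teacher scale sharpens the target distribution toward the teacher's top match, giving the student a near-hard label with soft-label gradients.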

### Zero-Shot Classification and Retrieval {#appx:zeroshot_settings}

#### Zero-Shot Evaluation on Images and Videos.

We use CLIPBench[^6] for zero-shot classification and retrieval benchmarking. The benchmark datasets and splits are obtained from the original dataset websites or HuggingFace. We extend the CLIPBench zero-shot evaluation to include video datasets such as MSR-VTT and Kinetics, and will release our model checkpoints, evaluation code, and scripts for reproducibility.

#### Prompt Design.

For zero-shot image-text and video-text retrieval, we rely solely on the original captions without any additional prompts. In contrast, for zero-shot classification, we utilize task-specific prompts graciously provided by the InternVL [@internvl] authors. All additional prompts will be released.

For example, we employ specific prompts for zero-shot image classification on various ImageNet benchmarks (e.g., ImageNet val, ImageNet v2) and video classification on Kinetics datasets (e.g., K400, K600, K700). `\label{text:Zero-shot Image Classification Prompts}`{=latex} `\promptbox{Zero-Shot Image Classification Prompts - ImageNet}{
a bad photo of a \{c\}. a photo of many \{c\}. a sculpture of a \{c\}. a photo of the hard to see \{c\}. a low resolution photo of the \{c\}. a rendering of a \{c\}. graffiti of a \{c\}. a bad photo of the \{c\}. a cropped photo of the \{c\}. a tattoo of a \{c\}. the embroidered \{c\}. a photo of a hard to see \{c\}. a bright photo of a \{c\}. a photo of a clean \{c\}. a photo of a dirty \{c\}. a dark photo of the \{c\}. a drawing of a \{c\}. a photo of my \{c\}. the plastic \{c\}. a photo of the cool \{c\}. a close-up photo of a \{c\}. a black and white photo of the \{c\}. a painting of the \{c\}. a painting of a \{c\}. a pixelated photo of the \{c\}. a sculpture of the \{c\}. a bright photo of the \{c\}. a cropped photo of a \{c\}. a plastic \{c\}. a photo of the dirty \{c\}. a jpeg corrupted photo of a \{c\}. a blurry photo of the \{c\}. a photo of the \{c\}. a good photo of the \{c\}. a rendering of the \{c\}. a \{c\} in a video game. a photo of one \{c\}. a doodle of a \{c\}. a close-up photo of the \{c\}. a photo of a \{c\}. the origami \{c\}. the \{c\} in a video game. a sketch of a \{c\}. a doodle of the \{c\}. a origami \{c\}. a low resolution photo of a \{c\}. the toy \{c\}. a rendition of the \{c\}. a photo of the clean \{c\}. a photo of a large \{c\}. a rendition of a \{c\}. a photo of a nice \{c\}. a photo of a weird \{c\}. a blurry photo of a \{c\}. a cartoon \{c\}. art of a \{c\}. a sketch of the \{c\}. a embroidered \{c\}. a pixelated photo of a \{c\}. itap of the \{c\}. a jpeg corrupted photo of the \{c\}. a good photo of a \{c\}. a plushie \{c\}. a photo of the nice \{c\}. a photo of the small \{c\}. a photo of the weird \{c\}. the cartoon \{c\}. art of the \{c\}. a drawing of the \{c\}. a photo of the large \{c\}. a black and white photo of a \{c\}. the plushie \{c\}. a dark photo of a \{c\}. itap of a \{c\}. graffiti of the \{c\}. a toy \{c\}. itap of my \{c\}. a photo of a cool \{c\}. a photo of a small \{c\}. a tattoo of the \{c\}.
}`{=latex}

`\label{text:Zero-shot Video Classification Prompts}`{=latex} `\promptbox{Zero-Shot Video Classification Prompts - Kinetics}{
a photo of \{c\}. a photo of a person \{c\}. a photo of a person using \{c\}. a photo of a person doing \{c\}. a photo of a person during \{c\}. a photo of a person performing \{c\}. a photo of a person practicing \{c\}. a video of \{c\}. a video of a person \{c\}. a video of a person using \{c\}. a video of a person doing \{c\}. a video of a person during \{c\}. a video of a person performing \{c\}. a video of a person practicing \{c\}. a example of \{c\}. a example of a person \{c\}. a example of a person using \{c\}. a example of a person doing \{c\}. a example of a person during \{c\}. a example of a person performing \{c\}. a example of a person practicing \{c\}. a demonstration of \{c\}. a demonstration of a person \{c\}. a demonstration of a person using \{c\}. a demonstration of a person doing \{c\}. a demonstration of a person during \{c\}. a demonstration of a person performing \{c\}. a demonstration of a person practicing \{c\}.
}`{=latex}
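These templates are combined via standard CLIP-style prompt ensembling: each template is filled with the class name, embedded, and the normalized embeddings are averaged into a single classifier weight per class. Below is a minimal sketch of that procedure; the `encode_text` stand-in and the template subset are illustrative assumptions, not our released code.

```python
import numpy as np

# an illustrative subset of the templates listed above
TEMPLATES = ["a photo of {c}.", "a video of a person {c}.", "a demonstration of {c}."]

def class_embeddings(class_names, encode_text):
    """CLIP-style prompt ensembling: embed every template per class,
    L2-normalize, average, and re-normalize into one weight per class."""
    embs = []
    for c in class_names:
        e = np.stack([encode_text(t.format(c=c)) for t in TEMPLATES])
        e /= np.linalg.norm(e, axis=-1, keepdims=True)
        mean = e.mean(axis=0)
        embs.append(mean / np.linalg.norm(mean))
    return np.stack(embs)

def toy_encoder(text):
    # deterministic stand-in for a real text encoder, just for shape checking
    rng = np.random.default_rng(sum(map(ord, text)))
    return rng.normal(size=16)

W = class_embeddings(["juggling", "surfing"], toy_encoder)
```

Zero-shot logits are then the (scaled) inner products between normalized image embeddings and the rows of `W`.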

#### Evaluation Method. {#appx:zeroshot_eval_method}

Several works use different input transformations for different datasets when evaluating zero-shot performance (e.g., [@eva18b; @dfn; @siglip; @siglip2]). To be as fair as possible, we follow [@eva18b] in evaluating with two transformations---center crop and non-aspect-ratio-preserving resize (\`\`squash")---and report the max between the two for all models and all datasets we evaluate. Additionally, ObjectNet has a red border around every image to facilitate deduplication, which we remove for evaluation. Finally, we follow [@internvl] in using *retrieval reweighting* (DSL), applying the softmax score distribution to the similarities used for retrieval: $$\texttt{scores = scores * softmax(scores, dim=0)}$$ This slightly improves retrieval for most models, so we apply it to all models we evaluate for fairness. Notably, we were able to reproduce the reported numbers of most papers with these techniques; in the cases where we could not, we default to the reported number.
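As a concrete illustration, the reweighting step can be sketched in a few lines of numpy (function names are ours; in practice it is applied to the similarity matrices produced during evaluation):

```python
import numpy as np

def softmax(x, axis=0):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dsl_rerank(scores):
    """Retrieval reweighting (DSL): multiply the similarity matrix by its
    softmax over dim 0, sharpening confident matches before ranking."""
    return scores * softmax(scores, axis=0)

# toy similarity matrix: rows index gallery items, columns index queries
scores = np.array([[0.9, 0.1],
                   [0.2, 0.8]])
reranked = dsl_rerank(scores)
```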

PE: Language Alignment {#appx:mmlm_benchmark_set}
----------------------

We provide details of the MLLM experimental setup in § `\ref{sec:la}`{=latex}. We describe *data*, *model*, and *training* separately.

#### Data.

Our MLLM training contains *warmup* data and *supervised finetuning (SFT)* data. Our warmup data is a 1M-sample subset of image-text pairs from our `\PEcore{}`{=latex} pretraining dataset. For SFT data, we use a diverse data mix consisting of 2.6M unique samples. This dataset is composed of 1.7M[^7] visual QA samples from the Cauldron [@laurençon2024matters]; 0.5M grounded QA pairs from Visual Genome [@krishna2017visual], Flickr-Entities [@plummer2015flickr30k], and Densely Captioned Images [@Urbanek_2024_CVPR]; 0.1M image-captioning pairs from COCO [@coco]; and 0.3M text-only samples. This comprehensive data mix allows us to thoroughly assess our model's capabilities across various MLLM tasks.

#### Model.

As described in § `\ref{sec:la_method}`{=latex}, we use a simple vision-language model architecture in which a vision encoder and a pretrained decoder-only LLM are connected by a vision projector. For all tables, we use either Llama3.1-instruct 8B or QwenLM 2.5-instruct 7B as the language model, and a 2-layer MLP as the vision projector. For fair comparison, we use the native resolution for image input. During inference, we evaluate the models on video tasks in a *zero-shot* manner: we concatenate all video frames into a sequence and feed it to the language model, without the model having seen video samples during SFT. For all video tasks, we use $8$ frames at the same native resolution. For `\PEcore{}`{=latex} and `\PElang{}`{=latex}, this yields a $448\times 448\times8$ input and $32\times 32\times 8$ vision tokens.

#### Training.

MLLM training consists of *warmup* and *supervised finetuning (SFT)* stages. In both stages, we freeze the vision encoder and train the vision projector and LLM. During the warmup stage, we use a global batch size of $128$ with a learning rate of $1\times 10^{-4}$: we gradually increase the learning rate from $1\times 10^{-6}$ to $1\times 10^{-4}$ over 120 steps, then follow a cosine learning rate decay schedule for a total of 8,000 steps. During the SFT stage, we use a global batch size of $256$ with a learning rate of $1\times 10^{-5}$. Similar to the warmup, we gradually increase the learning rate from $1\times 10^{-7}$ to $1\times 10^{-5}$ over 300 steps, then follow a cosine learning rate decay schedule for a total of 12.5K steps. We truncate text sequences longer than 2,048 tokens on top of the visual tokens. This makes the maximum sequence length $\texttt{(num. vision tokens)} + 2,048$. With a $448\times 448$ input resolution and a patch size of $14$, we set the maximum sequence length to $1,024 + 2,048 = 3,072$. To represent bounding boxes on the output side for image grounding tasks, we simply use text tokens to represent each bounding box: each coordinate is normalized between `000` and `999`, in \`\``[x, y, x, y]`" box format for the top-left and bottom-right corners (*e.g.*, `[012, 122, 633, 782]`).
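For illustration, this box serialization can be sketched as below. The helper name and exact rounding are our assumptions; the text above only specifies the `[x, y, x, y]` format with coordinates normalized to `000`--`999`.

```python
def box_to_tokens(x0, y0, x1, y1, width, height):
    """Serialize a top-left/bottom-right box as text tokens, with each
    coordinate normalized to an integer in [000, 999]."""
    def norm(v, extent):
        # hypothetical rounding; the paper does not specify it
        return min(999, max(0, round(v / extent * 999)))
    coords = [norm(x0, width), norm(y0, height), norm(x1, width), norm(y1, height)]
    return "[" + ", ".join(f"{c:03d}" for c in coords) + "]"

tokens = box_to_tokens(50, 100, 300, 400, 448, 448)  # a box on a 448x448 image
```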

For all baselines, we search for the **best** intermediate layer features to adapt to the LLM. We search over $\{-1, -2, -4, -6, -8, -10, -12, -14, -16, -18, -20, -40\}$ layers (counting from the last) and report the best result averaged over OCR/Chart/Document Q&A, Visual Q&A, Image Captioning, and Video Understanding.

PE: Spatial Alignment
---------------------

### Training Details {#appx:spatial_align_details}

#### Loss Functions.

For self-aligning to frozen `\PEcore{G}`{=latex} layer 41 features ($L_\text{core}$), we minimize the negative cosine similarity: $$\label{eq:loss_core}
    L_\text{core} = -\frac{1}{n_\text{tok}}\sum\left(\frac{(S_{50})(T_{41})^T}{||S_{50}||\cdot||T_{41}||}\right)$$ where $S_{50}$ denotes the last layer features of the student, $T_{41}$ denotes frozen layer 41 features from `\PEcore{G}`{=latex}, and $n_\text{tok}$ denotes the number of tokens. Note that we chose 41 fairly arbitrarily (it is layer 40 when written with 0-based indexing). Judging by Fig. `\ref{fig:layerfinder}`{=latex}, any layer around 40 should work (and 39 may be slightly better).

For the locality-encouraging loss ($L_\text{loc}$), we compute the pairwise cosine similarity between the SAM teacher's own tokens. This forms a \`\`spatial correspondence map" of which tokens should be considered similar. We then compute the same for the student, and minimize the difference between the two with an MSE loss: $$\label{eq:loss_sam}
    L_\text{loc} = \frac{1}{n_\text{tok}^2} \sum\left(\frac{{(S_{50})} ({S_{50}})^T}{||{S_{50}}||^2} - \frac{(T_\text{SAM}) (T_\text{SAM})^T}{||T_\text{SAM}||^2}\right)^2$$ where $T_\text{SAM}$ denotes the \`\`SAM Mask Logits" constructed in §`\ref{sec:sa_method}`{=latex}. We also find it useful to apply a temperature ($t$) to the SAM teacher's pairwise cosine similarity term ($x$): $e^{t(x - 1)}$. The full loss is $L_\text{spatial} = L_\text{core} + L_\text{loc}$.
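As a toy numpy sketch of the two loss terms (shapes and the feature dimension are illustrative; we write the alignment term as a negative cosine similarity so that minimizing it pulls the student toward the teacher):

```python
import numpy as np

def cosine_align_loss(student, teacher):
    """Alignment term: negative mean cosine similarity between student and
    frozen teacher tokens (minimizing this pulls the two together)."""
    s = student / np.linalg.norm(student, axis=-1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=-1, keepdims=True)
    return -np.mean(np.sum(s * t, axis=-1))

def locality_loss(student, sam_feats, temp=20.0):
    """Locality term: MSE between the student's token-token cosine-similarity
    map and the SAM teacher's map, with e^{t(x-1)} applied to the teacher."""
    def sim_map(x):
        x = x / np.linalg.norm(x, axis=-1, keepdims=True)
        return x @ x.T
    teacher_map = np.exp(temp * (sim_map(sam_feats) - 1.0))
    return np.mean((sim_map(student) - teacher_map) ** 2)

rng = np.random.default_rng(0)
s = rng.normal(size=(16, 8))      # 16 tokens with a toy feature dim of 8
align = cosine_align_loss(s, s)   # identical features give exactly -1
```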

#### Hyperparameters.

In Tab. `\ref{tbl_app:spatial_train}`{=latex} we show the training hyperparameters for spatial alignment, finetuned on top of the initial `\PEcore{G}`{=latex} checkpoint. Then in Tab. `\ref{tbl_app:spatial_sam}`{=latex} and Tab. `\ref{tbl_app:spatial_core}`{=latex}, we show the settings for the two teachers and losses. Note that when running the teachers, we run them on the exact same image as the student (same data aug and all). Additionally, because the SAM 2.1 teacher operates at a resolution of 1024, we upsample the image, generate the mask logits, and then downsample the result. Both teachers are frozen.

```{=latex}
\vspace{0pt}
```
```{=latex}
\centering
```
```{=latex}
\tablestyle{5pt}{1.1}
```
::: {#tbl_app:spatial_train}
+--------------------------------------------------------------------------------+----------------------------------------------------+
| config                                                                         | values                                             |
+:===============================================================================+:==================================================:+
| ```{=latex}                                                                    | LAMB                                               |
| \addpadding                                                                    |                                                    |
| ```                                                                            |                                                    |
| optimizer                                                                      |                                                    |
+--------------------------------------------------------------------------------+----------------------------------------------------+
| $\beta_1, \beta_2$                                                             | (0.9, 0.95)                                        |
+--------------------------------------------------------------------------------+----------------------------------------------------+
| weight decay                                                                   | 0.05                                               |
+--------------------------------------------------------------------------------+----------------------------------------------------+
| learning rate                                                                  | 5e-4                                               |
+--------------------------------------------------------------------------------+----------------------------------------------------+
| batch size                                                                     | 12,288                                             |
+--------------------------------------------------------------------------------+----------------------------------------------------+
| warm-up steps                                                                  | 0                                                  |
+--------------------------------------------------------------------------------+----------------------------------------------------+
| training steps                                                                 | 24K                                                |
+--------------------------------------------------------------------------------+----------------------------------------------------+
| data quantity                                                                  | 5.4B (`\tiny `{=latex}`\PEcore{}`{=latex} PT Data) |
+--------------------------------------------------------------------------------+----------------------------------------------------+
| samples seen                                                                   | 300M                                               |
+--------------------------------------------------------------------------------+----------------------------------------------------+
|                                                                                |                                                    |
+--------------------------------------------------------------------------------+----------------------------------------------------+
| resolution                                                                     | 448                                                |
+--------------------------------------------------------------------------------+----------------------------------------------------+
| mask ratio                                                                     | 0.75                                               |
+--------------------------------------------------------------------------------+----------------------------------------------------+
| mask size                                                                      | 2$\times$2 tokens                                  |
+--------------------------------------------------------------------------------+----------------------------------------------------+
|                                                                                |                                                    |
+--------------------------------------------------------------------------------+----------------------------------------------------+
| droppath                                                                       | 0.4                                                |
+--------------------------------------------------------------------------------+----------------------------------------------------+
| layerscale                                                                     | 0.1                                                |
+--------------------------------------------------------------------------------+----------------------------------------------------+
|                                                                                |                                                    |
+--------------------------------------------------------------------------------+----------------------------------------------------+
| data aug                                                                       |                                                    |
+--------------------------------------------------------------------------------+----------------------------------------------------+
| `\text{color jitter}`{=latex} `\text{\texttt{\tiny j(0.32,0,0.32,0)}}`{=latex} |                                                    |
+--------------------------------------------------------------------------------+----------------------------------------------------+
| `\text{hflip}`{=latex} `\text{\texttt{\tiny p(0.5)}}`{=latex}                  |                                                    |
+--------------------------------------------------------------------------------+----------------------------------------------------+

: **Spatial Alignment Training.**
:::

\

```{=latex}
\captionsetup{justification=centering}
```
```{=latex}
\hfill
```
```{=latex}
\vspace{0pt}
```
```{=latex}
\centering
```
```{=latex}
\tablestyle{5pt}{1.1}
```
::: {#tbl_app:spatial_sam}
+---------------------------+---------------------------------+
| config                    | values                          |
+:==========================+:===============================:+
| ```{=latex}               | SAM 2.1-L                       |
| \addpadding               |                                 |
| ```                       |                                 |
| model                     |                                 |
+---------------------------+---------------------------------+
| layer                     | mask logits                     |
+---------------------------+---------------------------------+
| resolution                | 1024 (`interp`$\rightarrow$448) |
+---------------------------+---------------------------------+
|                           |                                 |
+---------------------------+---------------------------------+
| loss                      | Eq. `\ref{eq:loss_sam}`{=latex} |
+---------------------------+---------------------------------+
| loss weight               | 1                               |
+---------------------------+---------------------------------+
| temperature               | 20                              |
+---------------------------+---------------------------------+
|                           |                                 |
+---------------------------+---------------------------------+
| sample points             | 32$\times$32 (1024)             |
+---------------------------+---------------------------------+
| pred iou threshold        | 0                               |
+---------------------------+---------------------------------+
| stability score threshold | 0                               |
+---------------------------+---------------------------------+
| mask threshold            | 0                               |
+---------------------------+---------------------------------+

: **SAM 2.1 Teacher.**
:::

\

```{=latex}
\captionsetup{justification=centering}
```
```{=latex}
\hfill
```
```{=latex}
\vspace{0pt}
```
```{=latex}
\centering
```
```{=latex}
\tablestyle{5pt}{1.1}
```
::: {#tbl_app:spatial_core}
+-------------+----------------------------------+
| config      | values                           |
+:============+:================================:+
| ```{=latex} | `\PEcore{G}`{=latex}             |
| \addpadding |                                  |
| ```         |                                  |
| model       |                                  |
+-------------+----------------------------------+
| layer       | 41                               |
+-------------+----------------------------------+
| resolution  | 448                              |
+-------------+----------------------------------+
|             |                                  |
+-------------+----------------------------------+
| loss        | Eq. `\ref{eq:loss_core}`{=latex} |
+-------------+----------------------------------+
| loss weight | 1                                |
+-------------+----------------------------------+

: **`\PEcore{G}`{=latex} Teacher.**
:::

\

```{=latex}
\captionsetup{justification=centering}
```
### Visualization Method {#appx:feature_viz}

To visualize the features in Fig. `\ref{fig:feature_viz}`{=latex} and Fig. `\ref{fig:more_feature_viz}`{=latex}, our goal is to map a 1536-dimensional space down to 3 dimensions to view how the model encodes each token in relation to the others. One naive approach would be to apply PCA with 3 dimensions across all tokens in the image. However, we find this alone can be misleading.

Specifically, if the model has rich semantics, it should be the case that most of those 1536 features have some useful information in them. Some of that information could be spatially contiguous, some of it not. We want PCA to only select the *spatially contiguous* information, since we are trying to evaluate the spatial quality of the features. However, naively applying PCA will not necessarily do that, especially for models with information aggregated in \`\`global tokens" (§`\ref{sec:core_feature_analysis}`{=latex}). Despite these tokens carrying important information, they are not spatially contiguous. Thus, if PCA dedicates a large portion of its 3 dimensions to global tokens, the features will *look* like their spatial quality is bad, despite the features containing good spatial information.

```{=latex}
\centering
```
![**Feature Visualization Ablation.** With raw features (top row), PCA misses spatially contiguous parts of the feature space and instead focuses on global tokens (which carry information but are not spatially coherent). By applying a simple low pass filter (bottom row), we can reveal spatial information that PCA originally missed (see column 2: with raw features, the background looks like a mess, with the low pass filter the tiles become visible).](fig/feature_viz_comparison.png){#fig:feature_viz_comparison width="0.5\\linewidth"}

So, how do we select only the *spatially contiguous* information to visualize? The answer is simple: by definition, the spatially contiguous information will be... spatially contiguous. To keep the spatially contiguous information while lowering the impact of the global tokens, we can simply apply a low pass filter to the features (specifically, a gaussian blur with kernel size 3 and a $\sigma$ of 1). To retain the detail of the original features, we average the two together. Thus, to visualize features, we use the 3D PCA of the following, where $x$ denotes the model's output features and $g(x)$ denotes a gaussian blur: $$0.5 x + 0.5g(x,k=3,\sigma=1)$$ We show the impact of this in Fig. `\ref{fig:feature_viz_comparison}`{=latex}. Blurring the features makes them appear more detailed! In reality, that information was always there; PCA simply did not show it. Thus, great care must be taken when visualizing high dimensional feature spaces. If they were easy to map to 3 dimensions---you wouldn't need 1536 of them!

Then, to map the PCA dimensions to RGB pixel values, we map each PCA component to a corresponding channel in LCh color space, then convert those LCh colors to RGB to get the final image. Note that we use LCh instead of RGB directly for aesthetic reasons, and also because LCh is a cylindrical color space---where smooth changes in values look like smooth changes in color to humans---and is thus easier to discern.
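A minimal numpy sketch of the blur-then-PCA procedure (helper names are ours; the final LCh-to-RGB mapping is omitted for brevity):

```python
import numpy as np

def blur3x3(feat_map, sigma=1.0):
    """Separable gaussian blur (kernel size 3) over the spatial dims of an
    HxWxC token feature map, with edge padding."""
    xs = np.array([-1.0, 0.0, 1.0])
    k = np.exp(-xs ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    p = np.pad(feat_map, ((1, 1), (0, 0), (0, 0)), mode="edge")
    out = sum(k[i] * p[i:i + feat_map.shape[0]] for i in range(3))
    p = np.pad(out, ((0, 0), (1, 1), (0, 0)), mode="edge")
    return sum(k[i] * p[:, i:i + feat_map.shape[1]] for i in range(3))

def viz_pca(feat_map):
    """Average raw and blurred features, then project each token to 3D with
    PCA (the three components would then be mapped to LCh channels)."""
    x = 0.5 * feat_map + 0.5 * blur3x3(feat_map)
    flat = x.reshape(-1, x.shape[-1])
    flat = flat - flat.mean(axis=0)
    _, _, vt = np.linalg.svd(flat, full_matrices=False)  # rows of vt = PCs
    return (flat @ vt[:3].T).reshape(x.shape[0], x.shape[1], 3)

rng = np.random.default_rng(0)
components = viz_pca(rng.normal(size=(32, 32, 64)))  # 32x32 tokens, dim 64
```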

### Frozen Feature Dense Prediction  {#appx:dense_pred}

We discuss the detailed settings of the results for dense prediction with frozen features in Tab. `\ref{tbl:dense_prediction}`{=latex}. Each model is evaluated at its native resolution or at 448, whichever is optimal.

#### Zero-Shot Tracking.

We evaluate our pretrained models on the label propagation task using the protocols of [@jabri2020space; @vgpt] on the DAVIS dataset [@davis2017]. This evaluation requires no finetuning or probing, and therefore preserves the spatial features of the model. Following Toto [@vgpt], we use the features from the last $n = 7$ frames to find the nearest neighbor patch in the current frame, and then propagate the masks from the previous frames to the current frame.
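A simplified numpy sketch of the propagation step (this omits the spatial locality window and multi-frame feature queue used in the full protocol; names are ours):

```python
import numpy as np

def propagate_labels(context_feats, context_masks, query_feats, topk=5):
    """Each query token takes a softmax-weighted vote over the mask labels of
    its top-k most similar context tokens (cosine similarity)."""
    c = context_feats / np.linalg.norm(context_feats, axis=-1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=-1, keepdims=True)
    sim = q @ c.T                                # (n_query, n_context)
    idx = np.argsort(-sim, axis=1)[:, :topk]     # top-k neighbors per token
    w = np.take_along_axis(sim, idx, axis=1)
    w = np.exp(w) / np.exp(w).sum(axis=1, keepdims=True)
    votes = context_masks[idx]                   # (n_query, topk, n_classes)
    return (w[..., None] * votes).sum(axis=1)
```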

#### Semantic Segmentation.

For semantic segmentation, we evaluate our pretrained models on the ADE20K [@ade20k] semantic segmentation task. Following [@dinov2], we use a linear layer and a convolutional layer to map intermediate spatial features to segmentation masks. The models are evaluated with inputs resized to 518 $\times$ 518, and we only use features from a single layer. The probing layers are finetuned with AdamW [@adamw] with a learning rate of 0.001.

#### Depth Estimation.

For depth estimation on NYUv2 [@nyu_depth], we follow [@li2024binsformer; @dinov2]. We use a DPT head [@dpt] on top of our frozen pretrained model and use only single-layer features. We scale the size of the DPT head for each model based on the hidden size of its architecture. Because NYUv2 is a small dataset and the models we evaluate are large, we observe that the results for most models are noisy and prone to overfitting. Thus, for a fair comparison, we train *all models* for 20 epochs and for *all models* take the lowest validation loss over all epochs.

#### Frozen Detection.

For the frozen feature detection results presented in §`\ref{sec:layerfinder}`{=latex}, we evaluated using Mask R-CNN [@maskrcnn] as a probe. We used a resolution of 1024 for Fig. `\ref{fig:layerfinder}`{=latex} and 768 for the remaining experiments in §`\ref{sec:layerfinder}`{=latex}. Because the backbones were frozen, we did not add any global attention and instead simply tiled the input image with a window size of 32 for the 1024px experiments and 24 for the 768px experiments. All models were interpolated to patch size 16. Finally, with the backbones frozen, only the FPN and R-CNN heads were trained, for 15 epochs on COCO with a stepwise-decay LR and without drop path.

### End-to-End Finetuning Detection and Segmentation {#appx:det}

We provide a detailed discussion of the settings for end-to-end finetuning on detection and segmentation presented in Tab. `\ref{tbl:det_SFT}`{=latex}. The hyperparameters can be found in Tab. `\ref{tbl_app:coco}`{=latex}. We find that the default 100-epoch protocol in ViTDet [@vitdet; @d2] causes overfitting in our COCO experiments, especially for billion-parameter vision encoders, so we tune the training epochs, learning rate, drop path, and learning rate decay accordingly.

The LVIS experiment setting is the same as COCO, except all L-size models use a learning rate of 2e-4 and all g-size and G-size models use 75 epochs.

```{=latex}
\centering
```
```{=latex}
\tablestyle{5pt}{1.1}
```
::: {#tbl_app:coco}
  config                         values
  ------------------------ ------------------
  optimizer                      AdamW
  optimizer momentum          (0.9, 0.999)
  weight decay                    0.1
  learning rate              $\rightarrow$
  learning rate schedule    Step-wise decay
  learning rate decay        $\rightarrow$
  batch size                       64
  image size                1024$\times$1024
  augmentation              LSJ \[0.1, 2.0\]
  epochs                     $\rightarrow$
  drop path                  $\rightarrow$
  positional embedding      abswin [@abswin]
  patch size                       16
  window size                $\rightarrow$
  global window index        $\rightarrow$

  : **Settings for End-to-End Finetuning Detection and Segmentation.**
:::

```{=latex}
\quad
```
```{=latex}
\quad
```
```{=latex}
\quad
```
```{=latex}
\quad
```
::: {#tbl_app:coco}
  model                        lr    epochs   drop path   lr decay   layers   global window index   window size
  -------------------------- ------ -------- ----------- ---------- -------- --------------------- -------------
  OpenAI CLIP-L               1e-4    100        0.4        0.8        24       (5, 11, 17, 23)         14
  MetaCLIP-L                  1e-4    100        0.4        0.8        24       (5, 11, 17, 23)         14
  MetaCLIP-G                  5e-5     75        0.5        0.9        48      (11, 23, 35, 47)         14
  SigLIP-so                   1e-4    100        0.4        0.8        27       (2, 10, 18, 26)         14
  EVA02-L                     1e-4    100        0.4        0.8        24       (5, 11, 17, 23)         14
  MAE-L                       1e-4    100        0.4        0.8        24       (5, 11, 17, 23)         14
  SigLIP2-so                  1e-4    100        0.4        0.8        27       (2, 10, 18, 26)         14
  SigLIP2-g                   5e-5     75        0.5        0.9        40       (9, 19, 29, 39)         14
  DINOv2-L                    1e-4    100        0.4        0.8        24       (5, 11, 17, 23)         32
  DINOv2-g                    5e-5     36        0.5        0.9        40       (9, 19, 29, 39)         32
  **`\PEcore{}`{=latex}G**    5e-5     75        0.5        0.9        50      (12, 24, 36, 49)         32
  **`\PEspat{}`{=latex}G**    5e-5     36        0.5        0.9        50      (12, 24, 36, 49)         32

  : **Settings for End-to-End Finetuning Detection and Segmentation.**
:::

```{=latex}
\captionsetup{justification=centering}
```
```{=latex}
\clearpage
```
### System-Level Comparison on Detection {#appx:det_sota_setting}

```{=latex}
\begin{wraptable}{r}{0.35\textwidth}
\vspace{-30pt}
\centering
\tablestyle{3pt}{1.1} 
\begin{tabular}{y{80}x{25}}
    \shline
    Test-Time Aug & \ct[c7]{AP$_\text{box}$}{}\\
    \hline
    \addpadding{}
    No TTA & 65.2 \\
    + More Queries & 65.3 \\
    + SoftNMS~\cite{softnms} & 65.8 \\
    + Flip Aug &  65.8 \\
    + Multiscale Aug & \textbf{66.0} \\
    \shline
\end{tabular}
\caption{{\bf Test-Time Aug} for system-level comparison on COCO in Tab.~\ref{tbl:det_coco}.}
\label{tab:det_tta}
\vspace{-10pt}
\end{wraptable}
```
We describe our implementation for the system-level comparison to the state of the art on COCO object detection in Tab. `\ref{tbl:det_coco}`{=latex}. Our implementation is based on the DETA repository[^8]. We replace the vision encoder with our `\PEspat{}`{=latex} and maintain the same hyperparameters as in the end-to-end finetuning settings, while keeping the detector unchanged. The training process consists of four stages:

1.  **Initial Training**: Train on Objects365 for 12 epochs with an image resolution of 1024 $\times$ 1024, a total batch size of 256, and a learning rate of 2e-4, which is divided by 10 at the 10th epoch.

2.  **Increasing Resolution**: Continue training on Objects365 for 6 epochs with a resolution of 1536 $\times$ 1536, a total batch size of 128, and a learning rate of 5e-5, which is divided by 10 at the 5th epoch.

3.  **Finetuning**: Finetune on COCO dataset for 12 epochs with an image resolution of 1728 $\times$ 1728, a total batch size of 64, and a learning rate of 5e-5, which is divided by 10 at the 8th epoch.

4.  **Further Increasing Resolution**: Further finetune on the COCO dataset for 3 epochs with a resolution of 1824 $\times$ 1824 and a total batch size of 64. To save GPU memory, we use the SGD optimizer instead of Adam, with a learning rate of 5e-3, which is divided by 10 at the 2nd epoch.

We apply a series of test-time augmentation techniques to further improve the performance, detailed in Tab. `\ref{tab:det_tta}`{=latex}.

Additional Results
==================

`\PEcore{}`{=latex}: Robust Image Pretraining {#appx:core_img_pt}
---------------------------------------------

In Tab. `\ref{tab:core_img_pretraing_raw}`{=latex}, we present the raw data for the robustness metrics in Fig. `\ref{fig:core_pt_ablations}`{=latex}. Across the board, each change improved almost all metrics (with the exception of progressive resolution slightly hurting the average and mask regularization slightly hurting ImageNet Adversarial). The fact that there were no tradeoffs to these changes indicates that their improvements to the features are general. This could be why most of these changes improved performance on downstream tasks as well.

Note that in §`\ref{sec:core_image_pt}`{=latex}, we only discuss changes that we found to work. There are several changes that we tried that did not work (i.e., they either did not improve or actively lowered performance). For instance: average pooling instead of using a class token, increasing the text tower size, using hue or contrast jitter, and maintaining the same resolution throughout training while dropping tokens instead of using progressive resolution (FLIP-style).

We also find that increasing the batch size and increasing the number of training iterations have equivalent effects for an L-scale model. This contrasts with the batch size scaling observed by [@siglip], but it is possible that this difference stems from a hyperparameter issue.

```{=latex}
\begin{table*}[!h]\centering
    \makebox[\linewidth][c]{
    \tablestyle{0pt}{1.15} 
    \begin{tabular}{wy{90} awwwwww}
        \shline
         & \multirow{2}{*}{\vspace{-2.2cm} Step} & \multicolumn{7}{c}{\ct[c1]{\it Zero-Shot Classification}} \\
            & 
            & \cb[c1]{\textit{\textbf{Avg Class.}}}{}
            & \cb[c1]{ImageNet}{val~\cite{imagenet}}
            & \cb[c1]{ImageNet}{v2~\cite{imagenetv2}}
            & \cb[c1]{ObjectNet}{IN Classes~\cite{objectnet}}
            & \cb[c1]{ImageNet}{Adversarial~\cite{imagenet-a}}
            & \cb[c1]{ImageNet}{Renditions~\cite{imagenet-r}}
            & \cb[c1]{ImageNet}{Sketch~\cite{imagenet-sketch}}
            \\
        \hline
        \addpadding
        1 & Baseline               & \ca{75.3} & 78.9 & 71.9 & 73.7 & 68.3 & 91.1 & 67.8 \\
        2 & Progressive Resolution & \ca{75.1} & 78.9 & 71.8 & 72.4 & 69.9 & 90.5 & 67.0 \\
        3 & High Batch Size        & \ca{76.2} & 79.5 & 72.8 & 74.1 & 71.8 & 91.0 & 68.1 \\
        4 & LAMB and High LR       & \ca{76.9} & 79.9 & 73.3 & 74.3 & 73.5 & 91.5 & 68.6 \\
        5 & High Resolution (336)  & \ca{78.3} & 80.4 & 73.8 & 75.6 & 79.2 & 92.0 & 68.8 \\
        6 & 2D RoPE                & \ca{79.2} & 80.7 & 74.1 & 77.4 & 80.9 & 92.7 & 69.4 \\
        7 & Attention Pooling      & \ca{80.1} & 81.0 & 74.8 & 78.4 & 82.9 & 93.4 & 69.9 \\
        8 & Data Augmentation      & \ca{80.8} & 81.1 & 75.2 & 80.8 & 83.1 & 93.5 & 71.2 \\
        9 & Mask Regularization    & \ca{80.9} & 81.3 & 75.3 & 80.9 & 82.8 & 93.8 & 71.2 \\
        \shline
    \end{tabular}
    }
    \caption{{\bf Robust Image Pretraining Full Results.} Raw results for the robustness metrics in Fig.~\ref{fig:core_pt_ablations}. Almost every change improves every metric, but some metrics are improved more than others (e.g., ObjectNet and ImageNet-A).
    }
    \label{tab:core_img_pretraing_raw}
\end{table*}
```
`\PEcore{}`{=latex}: Video Data Scaling
---------------------------------------

```{=latex}
\centering
```
```{=latex}
\tablestyle{0pt}{1.05}
```
```{=latex}
\begin{tabular}{x{20} awwwwww awwwwwww}
        \shline
          & \multicolumn{7}{c}{\ct[c1]{\it Image Zero-Shot}} & \multicolumn{8}{c}{\ct[c3]{\it Video Zero-Shot}} \\
              \cb{Video Data Size}{}
            & \cb[c1]{\textit{\textbf{Average Image}}}{}
            & \cb[c1]{ImageNet}{val~\cite{imagenet}}
            & \cb[c1]{ImageNet}{v2~\cite{imagenetv2}}
            & \cb[c1]{ObjectNet}{IN Classes~\cite{objectnet}}
            & \cb[c1]{ImageNet}{Adversarial~\cite{imagenet-a}}
            & \cb[c2]{MS-COCO}{txt$\rightarrow$img~\cite{coco}}
            & \cb[c2]{MS-COCO}{img$\rightarrow$txt~\cite{coco}}
            & \cb[c3]{\textit{\textbf{Average Video}}}{}
            & \cb[c3]{Kinetics}{400~\cite{kay2017kinetics}}
            & \cb[c3]{Kinetics}{600~\cite{kay2017kinetics}}
            & \cb[c3]{Kinetics}{700~\cite{kay2017kinetics}}
            & \cb[c3]{UCF 101}{\cite{soomro2012ucf101}}
            & \cb[c3]{HMDB 51}{\cite{kuehne2011hmdb}}
            & \cb[c3]{MSR-VTT}{txt$\rightarrow$vid~\cite{vtt}}
            & \cb[c3]{MSR-VTT}{vid$\rightarrow$txt~\cite{vtt}}
            \\
            
        \hline
        \addpadding
        0M              
        & \cat{77.0} & 83.9 & 78.6 & 86.6 & 90.3 & 52.1 & 70.3
        & \cat{57.0} & 70.3 & 69.4 & 61.6 & 78.5 & 47.4 & 40.5 & 31.4
        \\        
        3M              
        & \ca{77.7} & 84.1 & 78.8 & 86.6 & 90.9 & 53.3 & 74.2
        & \ca{61.6} & 72.4 & 72.2 & 64.2 & 88.5 & 53.8 & 42.8 & 37.6
        \\
        6M
        & \ca{78.0} & 84.2 & 79.0 & 86.7 & 91.1 & 54.0 & 72.7
        & \ca{63.6} & 73.5 & 73.4 & 66.0 & 88.9 & 54.6 & 44.9 & 43.6 
        \\
        8M           
        & \ca{78.4} & 84.2 & 79.2 & 87.0 & 91.6 & 54.9 & 73.6
        & \ca{64.8} & 74.5 & 74.5 & 67.7 & 89.5 & 55.3 & 46.9 & 45.5
        \\
        11M 
        & \ca{78.6} & 84.2 & 79.2 & 87.2 & 91.8 & 55.4 & 73.8
        & \ca{65.2} & 75.1 & 75.0 & 67.6 & 89.7 & 55.6 & 47.7 & 45.8
        \\
        14M
        & \ca{78.8} & 84.2 & 79.2 & 87.5 & 91.9 & 55.7 & 74.3
        & \ca{65.5} & 75.4 & 75.3 & 67.9 & 89.9 & 55.8 & 47.8 & 46.3
        \\
        17M
        & \ca{78.9} & 84.2 & 79.2 & 87.7 & 92.0  & 55.8 & 74.3 
        & \ca{65.8} & 75.7 & 75.5 & 68.2 & 90.2 & 56.0 & 48.3 & 46.7
        \\      
        \shline
    \end{tabular}
```
The detailed video data scaling results are presented in Tab. `\ref{tab:video-ft-ablation-details}`{=latex}. Our experiments demonstrate that increasing the amount of synthetic video data generated by the proposed video data engine improves classification and retrieval performance on both image and video benchmarks. On image benchmarks, improvements on ImageNet val and v2 plateau earlier than those on ObjectNet and ImageNet Adversarial, while MS-COCO retrieval performance continues to show gains. On video benchmarks, scaling synthetic video data consistently yields better performance on both classification and retrieval tasks. We expect that further scaling up the video data with our video data engine will continue to drive performance improvements.

`\PEcore{}`{=latex}: Smaller Models {#appx:core_smaller_models}
-----------------------------------

```{=latex}
\begin{table*}[!h]\centering
    \makebox[\linewidth][c]{
    \tablestyle{0pt}{1.15} 
    \begin{tabular}{y{105} ww awwwwww}
        \shline
        \multirow{2}{*}{\vspace{-2.2cm} Model}  &&& \multicolumn{7}{c}{\ct[c1]{\it Zero-Shot Classification}} \\
            & \cb{Teacher's Temp}{}
            & \cb{Model Scale}{}
            & \cb[c1]{\textit{\textbf{Avg Class.}}}{}
            & \cb[c1]{ImageNet}{val~\cite{imagenet}}
            & \cb[c1]{ImageNet}{v2~\cite{imagenetv2}}
            & \cb[c1]{ObjectNet}{IN Classes~\cite{objectnet}}
            & \cb[c1]{ImageNet}{Adversarial~\cite{imagenet-a}}
            & \cb[c1]{ImageNet}{Renditions~\cite{imagenet-r}}
            & \cb[c1]{ImageNet}{Sketch~\cite{imagenet-sketch}}
            \\
        \hline
        \addpadding
        vanilla pretrained model & -  & B  & \ca{66.2} & 74.2 & 67.4 & 62.5 & 50.2 & 83.0 & 59.8 \\ 
        \hline
        \addpadding
        \multirow{4}{*}{distillation} & $\times$2 & B & \ca{65.2} & 71.8 & 65.5 & 61.4 & 50.2 & 83.6 & 58.6 \\
         & $\times$1 & B & \ca{68.0} & 74.9 & 68.1 & 64.7 & 54.1 & 85.3 & 61.1\\
         & $\times$0.7 & B & \ca{68.2} & 75.1 & 68.2 & 65.3 & 54.4 & 85.1 & 61.3 \\
         & $\times$0.5 & B & \ca{\textbf{68.3}} & 75.2 & 68.2 & 65.3 & 54.2 & 85.2 & 61.4 \\
        \hline
    \end{tabular}
    }
    \caption{{\bf Ablation Study on Teacher's Distribution Temperature.} We evaluate the effect of varying temperatures on the teacher's distribution, using a pretrained vanilla CLIP model (ViT-B/14, resolution 224) as a baseline (details in \S\ref{sec:core_image_pt}). The models are finetuned via distillation with a short schedule of 50K steps.
    }
    \label{tab:distillation-temperatrue-ablation}
\end{table*}
```
#### Ablation: Distillation Temperature.

To optimize the performance of smaller models (B and L-scales in Tab. `\ref{tab:pe2b}`{=latex}), we utilize a distillation finetuning approach with `\PEcore{G}`{=latex} as the teacher model. During this process, both student and teacher models encode image and text inputs to compute image-to-text and text-to-image similarity distributions, similar to CLIP training [@clip]. The student's distributions are then optimized to match those of the teacher by minimizing KL-divergence loss on both image-to-text and text-to-image similarity distributions.

We find that using a fixed, smaller temperature for the teacher (i.e., a higher logit scale, which controls the range of logits in the softmax) significantly enhances the effectiveness of distillation, as it sharpens the teacher's distributions. The student's temperature remains learnable, consistent with our pretraining procedure and CLIP training.
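As a rough illustration, the following NumPy sketch computes the KL-divergence between sharpened teacher similarity distributions and student distributions in both directions. All names and the logit scale values are ours for illustration; this is a sketch of the objective described above, not our exact implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distill_kl(img_t, txt_t, img_s, txt_s, teacher_scale, student_scale):
    """Mean KL(teacher || student) over image->text and text->image similarity rows.

    Embeddings are assumed L2-normalized. A larger logit scale corresponds to a
    smaller temperature, i.e., sharper distributions. The teacher_scale is fixed;
    in training, the student_scale would be a learnable parameter.
    """
    losses = []
    for a_t, b_t, a_s, b_s in [(img_t, txt_t, img_s, txt_s),
                               (txt_t, img_t, txt_s, img_s)]:
        p = softmax(teacher_scale * a_t @ b_t.T)            # sharp teacher targets
        log_q = np.log(softmax(student_scale * a_s @ b_s.T))
        losses.append((p * (np.log(p) - log_q)).sum(axis=-1).mean())
    return float(np.mean(losses))
```

When the student matches the teacher exactly (same embeddings and scale), the loss is zero; sharpening only the teacher makes the target distributions more peaked around the matching pair.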

In Tab. `\ref{tab:distillation-temperatrue-ablation}`{=latex}, we present an ablation study examining the impact of temperature on the teacher's distribution. For this analysis, we utilize a pretrained *vanilla* CLIP model (ViT-B/14, resolution 224), which serves as a baseline for comparison (see §`\ref{sec:core_image_pt}`{=latex} for details). The models are finetuned via distillation with a short schedule of 50K steps. Notably, our results show that employing a smaller temperature for the teacher's distributions yields improved performance on zero-shot ImageNet benchmarks.

```{=latex}
\centering
```
```{=latex}
\tablestyle{0pt}{1.05}
```
```{=latex}
\begin{tabular}{x{70} y{90} awwwwww awwwwwww}
        \shline
          && \multicolumn{7}{c}{\ct[c1]{\it Image Zero-Shot}} 
           & \multicolumn{8}{c}{\ct[c3]{\it Video Zero-Shot}} \\
              Model
           & Stage & \cb[c1]{\textit{\textbf{Average Image}}}{}
            & \cb[c1]{ImageNet}{val~\cite{imagenet}}
            & \cb[c1]{ImageNet}{v2~\cite{imagenetv2}}
            & \cb[c1]{ObjectNet}{IN Classes~\cite{objectnet}}
            & \cb[c1]{ImageNet}{Adversarial~\cite{imagenet-a}}
            & \cb[c2]{MS-COCO}{txt$\rightarrow$img~\cite{coco}}
            & \cb[c2]{MS-COCO}{img$\rightarrow$txt~\cite{coco}}
            & \cb[c3]{\textit{\textbf{Average Video}}}{}
            & \cb[c3]{Kinetics}{400~\cite{kay2017kinetics}}
            & \cb[c3]{Kinetics}{600~\cite{kay2017kinetics}}
            & \cb[c3]{Kinetics}{700~\cite{kay2017kinetics}}
            & \cb[c3]{UCF 101}{\cite{soomro2012ucf101}}
            & \cb[c3]{HMDB 51}{\cite{kuehne2011hmdb}}
            & \cb[c3]{MSR-VTT}{txt$\rightarrow$vid~\cite{vtt}}
            & \cb[c3]{MSR-VTT}{vid$\rightarrow$txt~\cite{vtt}}
            \\
            
        \hline
        SigLIP2-L/16~\cite{siglip2} & -
        & \cat{76.0} & 83.1 & 77.4 & 84.4 & 84.3 & 55.3 & 71.4
        & \cat{56.2} & 65.3 & 62.5 & 56.8 & 86.7 & 49.3 & 41.5 & 31.4 \\        
        \PEcore{L} & image pretraining
        & \cat{75.1} & 82.9 & 76.8 & 81.8 & 85.6 & 53.0 & 70.4
        & \cat{59.0} & 68.0 & 67.7 & 58.5 & 85.5 & 57.7 & 42.0 & 33.4 \\
        \PEcore{L} & +image distillation from \PEcore{G}
        & \cat{77.6} & \textbf{83.6} & \textbf{78.1} & 84.4 & 88.9 & 56.0 & 74.7
        & \cat{64.5} & 73.0 & 72.6 & 64.8 & 86.5 & 58.0 & 47.9 & 48.4 \\  
        \PEcore{L} & +video finetuning
        & \cat{\textbf{78.0}} & 83.5 & 77.9 & \textbf{84.7} & \textbf{89.0} & \textbf{57.1} & \textbf{75.9}
        & \cat{\textbf{65.3}} & \textbf{73.4} & \textbf{72.7} & \textbf{65.3} & \textbf{87.1} & \textbf{58.5} & \textbf{50.3} & \textbf{50.1} \\              
        \shline
    \end{tabular}
```
#### Building strong smaller models.

In Tab. `\ref{tab:smaller-models}`{=latex}, we demonstrate our step-by-step training strategy for building strong smaller models at the L scale, as discussed in §`\ref{sec:unified-encoder}`{=latex}. Specifically, we outline our approach to image pretraining, image distillation, and video finetuning. Leveraging the robust foundation established by our pretraining techniques (§`\ref{sec:core_image_pt}`{=latex}), we show that distilling from `\PEcore{G}`{=latex}, our strongest unified perception encoder, yields improvements on both image and video benchmarks. Furthermore, a short-scheduled video finetuning provides an additional boost in performance on both benchmarks.

`\PElang{}`{=latex}: Additional Results {#appx:mmlm_benchmark_results}
---------------------------------------

Analogous to Tab. `\ref{tab:lang_mllm_bench}`{=latex}, in Tab. `\ref{tab:lang_mllm_bench_tiling}`{=latex} we compare `\PEcore{}`{=latex} and `\PElang{}`{=latex} in the *dynamic resolution* setting [@liu2024llavanext; @llama3]. More specifically, we use up to 4 tiles in addition to a *thumbnail*, the whole image resized to $448\times 448$. With a maximum of 4 tiles, the model can cover the $\{1\times 1, 1\times 2, 1\times 3, 1\times 4, 2\times 1, 2\times 2, 3 \times 1, 4\times 1\}$ tile ratios. As in Tabs. `\ref{tab:lang_mllm_bench}`{=latex}, `\ref{tab:lang_mllm_bench_qwen}`{=latex}, and `\ref{tab:lang_mllm_system_level}`{=latex} in the main paper, `\PElang{}`{=latex} outperforms the baseline vision encoders by large margins across all categories of MLLM tasks. Note that `\PElang{}`{=latex} was alignment-tuned with native-resolution input, as opposed to, *e.g.*, InternViT 2.5, which was midtrained with dynamic tiling; this demonstrates `\PElang{}`{=latex}'s strong generality across input formats.
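The admissible tile grids can be enumerated directly: every $r \times c$ grid whose tile count does not exceed 4. A minimal sketch (the function name is ours for illustration):

```python
def tile_ratios(max_tiles: int):
    """All (rows, cols) grids whose total tile count does not exceed max_tiles."""
    return sorted((r, c)
                  for r in range(1, max_tiles + 1)
                  for c in range(1, max_tiles + 1)
                  if r * c <= max_tiles)
```

`tile_ratios(4)` reproduces exactly the set of tile ratios listed above.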

Next, in Tabs. `\ref{tab:lang_refcoco}`{=latex}, `\ref{tab:lang_refcoco_qwen2}`{=latex}, and `\ref{tab:lang_refcoco_tiling}`{=latex}, we show the breakdowns of RefCOCO/+/g [@kazemzadeh2014referitgame] with Llama 3.1-instruct 8B as the language model, with Qwen2.5 LM 7B as the language model, and with Llama 3.1-instruct 8B plus dynamic tiling ($4+1$), respectively. Our SFT data includes VisualGenome [@krishna2017visual], DCI [@Urbanek_2024_CVPR], and Flickr30K [@plummer2015flickr30k] as grounding datasets, while RefCOCO/+/g are unseen. We therefore report zero-shot performance of the MLLMs to evaluate the spatial understanding capability of the vision encoders. Overall, `\PElang{}`{=latex} L or G shows the best performance across all RefCOCO splits, except with Qwen2.5 LM. This is because (1) InternViT 2.5 6B is midtrained with Qwen2 LM, and (2) the RefCOCO/+/g training data are seen during its pre/mid-training.

```{=latex}
\begin{table*}[ht]\centering
    \makebox[\linewidth][c]{
    \tablestyle{0pt}{1.05} 
        \begin{tabular}{y{50}wx{20} awwww awwww awww a awwwwww}
        \shline
        \multirow{2}{*}{\vspace{-2.2cm} Model}  &&& \multicolumn{5}{c}{\ct[c3]{\it OCR / Chart / Doc. Q\&A}} %
        & \multicolumn{5}{c}{\ct[c4]{\it Visual Q\&A}} & \multicolumn{4}{c}{\ct[c5]{\it Captioning}} & \multicolumn{1}{c}{\ct[c6]{}} & \multicolumn{7}{c}{\ct[c7]{\it Video}}\\
            & \cb{Encoder Params}{}
            & \cb{Resolution}{Patch Size}
            & \cb[c3]{\textit{\textbf{Avg. OCR QA}}}{}
            & \cb[c3]{ChartQA}{Acc.~\cite{zheng2024chartqa}}
            & \cb[c3]{DocVQA}{Acc.~\cite{mathew2021docvqa}}
            & \cb[c3]{Info. QA}{Acc.~\cite{mathew2022infographicvqa}}
            & \cb[c3]{AI2D}{Acc.~\cite{kembhavi2016ai2d}}
            %
            & \cb[c4]{\textit{\textbf{Avg. VQA}}}{}
            & \cb[c4]{TextVQA}{Acc.~\cite{singh2019textvqa}}
            & \cb[c4]{OK-VQA}{Acc.~\cite{schwenk2022okvqa}}
            & \cb[c4]{POPE}{Acc. ~\cite{li2023popebenchmark}}
            & \cb[c4]{VQAv2}{Acc.~\cite{goyal2017vqav2}}
            %
            & \cb[c5]{\textit{\textbf{Avg. Cap.}}}{}
            & \cb[c5]{Flicker}{CIDEr~\cite{flickr}}
            & \cb[c5]{COCO}{CIDEr ~\cite{coco}}
            & \cb[c5]{No Cap}{CIDEr~\cite{agrawal2019nocaps}}
            % <--
            % <--
            & \cb[c6]{\textit{\textbf{Avg. Ground.}}}{RefCOCO/g/+~\cite{kazemzadeh2014referitgame}} 
            & \cb[c7]{\textit{\textbf{Avg. Video}}}{}
            & \cb[c7]{VideoMME}{Acc.~\cite{fu2024videomme}}
            & \cb[c7]{STAR}{Acc.~\cite{wu2021star}}
            & \cb[c7]{TGIF-QA}{Acc.~\cite{jang2017tgif}}
            & \cb[c7]{EgoSchema}{Acc.~\cite{mangalam2024egoschema}}
            & \cb[c7]{MVBench}{Acc.~\cite{li2024mvbench}}
            & \cb[c7]{PerceptionTest}{Acc.~\cite{patraucean2024perceptiontest}}            \\
        \hline
\multicolumn{1}{l}{{\textit{256 Tokens per Tile}}}                & & & \ca{} &&&&& \ca{} &&&&& \ca{} &&&& \cat{} & \cat{} &&&&&& \\

MetaCLIP-L~\cite{metaclip}                  & 0.3B           & \rp{224}{14} & \ca{61.8} & 71.1 & 62.5 & 40.2 & 73.3 & \ca{74.6} & 65.3 & 64.9 & 88.5 & 79.8 & \ca{113.4} & 90.4 & 133.5 & 116.2 & \ca{67.1} & \ca{48.0} & 44.8 & 47.1 & 62.7 & 39.0 & 46.0 & 48.3 \\
MetaCLIP-G~\cite{metaclip}                  & 1.8B             & \rp{224}{14} & \ca{60.3} & 68.1 & 61.3 & 39.1 & 72.8 & \ca{74.9} & 65.4 & 65.9 & 88.2 & 80.1 & \ca{114.2} & 91.8 & 134.4 & 116.5 & \ca{66.0} & \ca{49.0} & 46.5 & 46.5 & 62.5 & 45.0 & 44.7 & 48.9 \\
\textbf{\PElang{} G}$^\dagger$                 & \,\,\,1.7B$^*$ & \rp{224}{14} & \ca{70.2} & 79.8 & 79.1 & 47.5 & 74.6 & \ca{76.0} & 70.6 & 64.3 & 88.3 & 80.6 & \ca{116.3} & 92.0 & 136.4 & 120.5 & \ca{69.5} & \ca{56.6} & 49.0 & 55.9 & 69.9 & 61.2 & 50.0 & 53.6 \\
 \hline
\multicolumn{1}{l}{{\textit{576 Tokens per Tile}}}                & & & \ca{} &&&&& \ca{} &&&&& \ca{} &&&& \cat{} & \cat{} &&&&&& \\
CLIP~\cite{clip}                          & 0.3B           & \rp{336}{14} & \ca{69.6} & 76.8 & 78.2 & 50.3 & 72.9 & \ca{76.3} & 71.8 & 64.9 & 88.0 & 80.4 & \ca{114.0} & 90.9 & 134.4 & 116.6 &  \ca{68.5} & \ca{50.8} & 46.6 & 52.2 & 65.0 & 44.6 & 46.3 & 49.9 \\
AIMv2-L~\cite{aimv2}                        & 0.3B           & \rp{336}{14} & \ca{66.7} & 74.1 & 74.9 & 45.2 & 72.4 & \ca{77.4} & 73.5 & 65.6 & 89.0 & 81.7 & \ca{116.4} & 92.5 & 137.1 & 119.5 & \ca{66.6} &  \ca{54.1} & 43.4 & 54.3 & 70.6 & 56.0 & 47.3 & 52.7 \\
SigLIP2-so~\cite{siglip2}                 & 0.4B           & \rp{384}{16} & \ca{55.5} & 61.4 & 54.9 & 33.3 & 72.3 & \ca{76.5} & 70.1 & 66.0 & 88.6 & 81.2 & \ca{118.0} & \textbf{95.8} & 138.3 & 119.8 & \ca{66.5} &  \ca{54.3} & 44.9 & 52.8 & 66.8 & 58.6 & 49.6 & 53.3 \\
SigLIP2-g-opt~\cite{siglip2}                 & 1.1B             & \rp{384}{16} & \ca{56.2} & 63.1 & 55.3 & 34.0 & 72.4 & \ca{77.0} & 70.3 & \textbf{66.7} & 89.6 & 81.6 & \ca{117.7} & 94.9 & 137.8 & 120.3 & \ca{66.5} &  \ca{53.9} & 46.2 & 53.9 & 66.6 & 53.8 & 48.5 & 54.7 \\
\textbf{\PElang{} G}$^\dagger$                 & \,\,\,1.7B$^*$ & \rp{336}{14} & \ca{77.5} & 82.1 & 88.5 & 61.8 & 77.4 & \ca{79.7} & 80.2 & 66.4 & 89.8 & 82.5 & \ca{120.3} & 97.4 & 140.2 & 123.2 & \ca{71.9} & \ca{59.8} & 49.4 & 62.7 & 74.1 & 64.0 & 53.1 & 55.6 \\
\hline
\multicolumn{1}{l}{{\textit{1024 Tokens per Tile}}}                & & & \ca{} &&&&& \ca{} &&&&& \ca{} &&&& \cat{} & \cat{} &&&&&& \\
SigLIP2-so~\cite{siglip2}                 & 0.4B           & \rp{512}{16} & \ca{56.9} & 66.0 & 56.5 & 34.3 & 70.9 & \ca{76.4} & 69.9 & 66.2 & 88.4 & 81.2 & \ca{117.8} & 94.7 & 137.8 & 120.9 & \ca{67.8} &   \ca{46.2} & 47.0 & 44.9 & 66.7 & 39.2 & 34.5 & 45.1 \\
\textbf{\PEcore{L}} & 0.3B & \rp{448}{14}                                & \ca{67.1} & 72.4 & 78.3 & 46.4 & 71.2 & \ca{76.4} & 74.0 & 63.7 & 88.8 & 79.0 & \ca{113.9} & 91.5 & 134.5 & 115.7 & \ca{62.9} &   \ca{51.4} & 47.0 & 51.2 & 62.7 & 49.6 & 47.8 & 50.1 \\
\textbf{\PElang{L}} & 0.3B & \rp{448}{14}                                & \ca{78.3} & 82.8 & 89.3 & 65.2 & 75.9 & \ca{78.5} & 78.8 & 64.4 & 89.6 & 81.3 & \ca{117.8} & 94.7 & 138.1 & 120.7 & \ca{71.6} & \ca{56.5} & 47.0 & 57.2 & 68.0 & 59.8 & 52.3 & 54.7 \\
\hline
AIMv2 3B~\cite{aimv2}                     & 2.7B             & \rp{448}{14} & \ca{67.5} & 73.0 & 78.2 & 46.5 & 72.2 & \ca{78.8} & 79.2 & 66.2 & 88.3 & 81.7 & \ca{119.0} & \textbf{95.8} & 139.7 & 121.5 &  \ca{65.1} & \ca{54.0} & \textbf{49.6} & 55.4 & 67.3 & 49.6 & 49.9 & 52.5 \\
InternViT2.5 6B~\cite{chen2024internvit2p5}  & 5.5B             & \rp{448}{14} & \ca{67.4} & 74.6 & 74.3 & 47.6 & 72.9 & \ca{75.9} & 71.3 & 64.8 & 87.7 & 79.7 & \ca{110.4} & 85.3 & 132.5 & 113.5 &  \ca{56.8} & \ca{52.0} & 46.0 & 49.6 & 65.0 & 50.6 & 49.6 & 51.3 \\
\textbf{\PEcore{G}}                       & 1.9B           & \rp{448}{14} & \ca{68.0} & 73.4 & 81.2 & 47.6 & 69.7 & \ca{76.4} & 74.3 & 62.5 & 89.1 & 79.6 & \ca{113.0} & 91.6 & 134.5 & 112.9 &  \ca{67.6} & \ca{53.2} & 46.0 & 54.3 & 67.0 & 51.2 & 48.7 & 52.0 \\
\textbf{\PElang{G}}                       & \,\,\,1.7B$^*$ & \rp{448}{14} & \ca{\textbf{78.6}} & \textbf{81.8} & \textbf{89.8} & \textbf{67.8} & \textbf{75.0} & \ca{\textbf{80.3}} & \textbf{82.3} & \textbf{66.7} & \textbf{89.6} & \textbf{82.8} & \ca{\textbf{119.6}} & 95.2 & \textbf{140.3} & \textbf{123.4} &  \ca{\textbf{71.8}} & \ca{\textbf{59.0}} & \textbf{49.6} & \textbf{61.8} & \textbf{73.9} & \textbf{60.0} & \textbf{52.6} & \textbf{56.3} \\
        \shline
    \end{tabular}
    }
    \caption{{\bf 4+1 Tile Llama 8B MLLM Results.} Llama 3.1-instruct 8B~\cite{llama3} is used as a language model.  $^*$\PElang{} has 1.7B parameters since we discard the last 3 layers during language alignment. 
    All MLLMs are trained with dynamic tiling for different image sizes and aspect ratios. 
    We use up to 4 image tiles of $448 \times 448$ (or the corresponding resolution for each encoder).
    The image tiles follow after a \textit{thumbnail} input, similar to prior work~\cite{liu2024llavanext}. $^\dagger$Evaluation on a model that was interpolated without additional training (i.e., \textit{zero-shot} resolution).}
    \label{tab:lang_mllm_bench_tiling}
\end{table*}
```
```{=latex}
\centering
```
```{=latex}
\makebox[\linewidth][c]{
    \tablestyle{0pt}{1.05} 
    \begin{tabular}{y{50}www awwwwww awwwwww awwwwwwww}
        \shline
        \multirow{2}{*}{\vspace{-2.2cm} Model}  &&& \multicolumn{9}{c}{\ct[c6]{\it Grounding}}\\
            & \cb{Encoder Params}{}
            & \cb{Resolution}{Patch Size}
            & \cb[c6]{\textit{\textbf{Avg. Ground.}}}{}
            & \cb[c6]{RefCOCO}{val~\cite{kazemzadeh2014referitgame}}
            & \cb[c6]{RefCOCO}{testA~\cite{kazemzadeh2014referitgame}}
            & \cb[c6]{RefCOCO}{testB~\cite{kazemzadeh2014referitgame}}
            & \cb[c6]{RefCOCO+}{val~\cite{kazemzadeh2014referitgame}}
            & \cb[c6]{RefCOCO+}{testA~\cite{kazemzadeh2014referitgame}}
            & \cb[c6]{RefCOCO+}{testB~\cite{kazemzadeh2014referitgame}}
            & \cb[c6]{RefCOCOg}{val~\cite{kazemzadeh2014referitgame}}
            & \cb[c6]{RefCOCOg}{test~\cite{kazemzadeh2014referitgame}}
            \\
        \hline
\multicolumn{1}{l}{{\textit{256 Tokens per Image}}}                & & & \cat{} &&& &&& &&  \\
MetaCLIP-L~\cite{metaclip}                  & 0.3B & \rp{224}{14} & \ca{60.6} & 63.6 & 56.7 & 67.5 & 54.1 & 58.9 & 48.8 & 67.2 & 67.8 \\
MetaCLIP-G~\cite{metaclip}                  & 1.8B & \rp{224}{14} & \ca{60.5} & 62.0 & 56.5 & 67.8 & 53.5 & 58.7 & 49.2 & 68.2 & 68.3 \\
\textbf{\PElang{} G}$^\dagger$                 & \,\,\,1.7B$^*$ & \rp{224}{14} & \ca{65.7} & 67.7 & 64.4 & 70.9 & 58.3 & 62.0 & 56.6 & 73.2 & 74.4 \\
\hline
\multicolumn{1}{l}{{\textit{576 Tokens per Image}}}                & & & \cat{} &&& &&& &&  \\
CLIP~\cite{clip}                            & 0.3B & \rp{336}{14} & \ca{65.0} & 66.7 & 61.4 & 71.6 & 57.6 & 62.5 & 54.5 & 73.2 & 72.8 \\
AIMv2-L~\cite{aimv2}                        & 0.3B & \rp{336}{14} & \ca{63.3} & 65.4 & 61.6 & 69.6 & 55.0 & 60.0 & 52.0 & 71.1 & 71.5 \\
AIMv2-L Dist. ~\cite{aimv2}                 & 0.3B & \rp{336}{14} & \ca{62.6} & 64.8 & 61.0 & 69.4 & 54.4 & 59.0 & 51.3 & 70.8 & 70.0 \\
SigLIP2-so~\cite{siglip2}                   & 0.4B & \rp{384}{16} & \ca{67.4} & 68.8 & 66.5 & 71.0 & 60.3 & 61.8 & 58.5 & 76.2 & 76.0 \\
SigLIP2-g-opt~\cite{siglip2}                & 1.1B & \rp{384}{16} & \ca{66.5} & 67.9 & 66.1 & 70.1 & 58.8 & 61.7 & 57.1 & 75.5 & 75.0 \\
\textbf{\PElang{} G}$^\dagger$                 & \,\,\,1.7B$^*$ & \rp{336}{14} & \ca{68.9} & 69.8 & 67.5 & 73.2 & 61.5 & 64.0 & 60.8 & 77.3 & 77.7 \\
\hline
\multicolumn{1}{l}{{\textit{1024 Tokens per Image}}}                & & & \cat{} &&& &&& &&  \\
InternViT2.5 L~\cite{chen2024internvit2p5}  & 0.3B & \rp{448}{14} & \ca{66.9} & 69.3 & 66.7 & 72.6 & 58.3 & 63.1 & 57.2 & 74.2 & 74.0 \\
SigLIP2-so~\cite{siglip2}                   & 0.4B & \rp{512}{16} & \ca{69.6} & 71.4 & 69.2 & 74.4 & 61.3 & 64.8 & 60.3 & 77.9 & 77.2 \\
\textbf{\PEcore{} L}                        & 0.3B & \rp{448}{14} & \ca{59.7} & 61.7 & 55.3 & 66.9 & 53.1 & 58.8 & 48.0 & 68.5 & 67.5 \\
\textbf{\PElang{} L}                        & 0.3B & \rp{448}{14} & \ca{70.5} & 71.8 & \textbf{70.2} & 73.0 & 63.7 & 66.1 & 62.7 & 78.8 & 78.9 \\
\hline
    \addpadding
DINOv2~\cite{dinov2}                        & 1.1B & \rp{448}{14} & \ca{64.9} & 67.2 & 62.5 & 70.5 & 57.0 & 61.0 & 54.5 & 73.1 & 73.1 \\
AIMv2 3B~\cite{aimv2}                       & 2.7B & \rp{448}{14} & \ca{36.1} & 37.6 & 34.1 & 40.7 & 32.7 & 36.2 & 32.0 & 36.9 & 38.6 \\
InternViT2.5 6B~\cite{chen2024internvit2p5} & 5.5B & \rp{448}{14} & \ca{68.0} & 70.2 & 67.6 & 72.2 & 60.6 & 64.0 & 58.7 & 75.3 & 75.2 \\
\textbf{\PEcore{} G}                        & 1.9B & \rp{448}{14} & \ca{66.6} & 68.3 & 64.4 & 72.3 & 58.7 & 62.7 & 56.0 & 75.1 & 75.0 \\
\textbf{\PElang{} G}                        & \,\,\,1.7B$^*$ & \rp{448}{14} & \ca{\textbf{71.3}} & \textbf{71.9} & 69.9 & \textbf{75.1} & \textbf{64.2} & \textbf{67.3} & \textbf{63.0} & \textbf{79.4} & \textbf{79.2} \\
        \shline
    \end{tabular}
    }
```
```{=latex}
\centering
```
```{=latex}
\makebox[\linewidth][c]{
        \tablestyle{0pt}{1.05} 
        \begin{tabular}{y{55}www awwwwww awwwwww awwwwwwww}
            \shline
            \multirow{2}{*}{\vspace{-2.2cm} Model}  &&& \multicolumn{9}{c}{\ct[c6]{\it Grounding}}\\
                & \cb{Encoder Params}{}
                & \cb{Resolution}{Patch Size}
                & \cb[c6]{\textit{\textbf{Avg. Ground.}}}{}
                & \cb[c6]{RefCOCO}{val~\cite{kazemzadeh2014referitgame}}
                & \cb[c6]{RefCOCO}{testA~\cite{kazemzadeh2014referitgame}}
                & \cb[c6]{RefCOCO}{testB~\cite{kazemzadeh2014referitgame}}
                & \cb[c6]{RefCOCO+}{val~\cite{kazemzadeh2014referitgame}}
                & \cb[c6]{RefCOCO+}{testA~\cite{kazemzadeh2014referitgame}}
                & \cb[c6]{RefCOCO+}{testB~\cite{kazemzadeh2014referitgame}}
                & \cb[c6]{RefCOCOg}{val~\cite{kazemzadeh2014referitgame}}
                & \cb[c6]{RefCOCOg}{test~\cite{kazemzadeh2014referitgame}}
                \\
            \hline
    \multicolumn{1}{l}{{\textit{576 Tokens per Image}}}                & & & \cat{} &&& &&& &&  \\
    SigLIP2-so~\cite{siglip2}                   & 0.4B & \rp{384}{16} & \ca{70.0} & 73.6 & 73.0 & 74.3 & 60.9 & 62.7 & 59.9 & 78.4 & 77.2 \\
    SigLIP2-g-opt~\cite{siglip2}                & 1.1B & \rp{384}{16} & \ca{69.9} & 73.3 & 72.4 & 73.6 & 60.5 & 62.3 & 60.7 & 78.4 & 78.2 \\
    \textbf{\PElang{} G}$^\dagger$                 & \,\,\,1.7B$^*$ & \rp{336}{14} & \ca{70.1} & 73.4 & 72.0 & 75.3 & 62.0 & 64.2 & 61.2 & 78.4 & 77.7 \\
    \hline
    \multicolumn{1}{l}{{\textit{1024 Tokens per Image}}}                & & & \cat{} &&& &&& &&  \\
    InternViT2.5 L~\cite{chen2024internvit2p5}  & 0.3B & \rp{448}{14} & \ca{68.1} & 72.4 & 69.1 & 74.1 & 59.3 & 62.4 & 56.6 & 75.2 & 75.5 \\
    SigLIP2-so~\cite{siglip2}                   & 0.4B & \rp{512}{16} & \ca{70.5} & 74.1 & 73.7 & 74.4 & 61.7 & 62.9 & 61.0 & 78.6 & 77.9 \\
    \textbf{\PEcore{L}}                         & 0.3B & \rp{448}{14} & \ca{66.5} & 70.4 & 67.8 & 71.5 & 57.7 & 61.1 & 56.2 & 75.8 & 75.3 \\
    \textbf{\PElang{L}}                         & 0.3B & \rp{448}{14} & \ca{70.4} & 74.4 & 72.6 & 74.6 & 62.2 & 64.0 & 62.0 & 79.0 & 78.7 \\
    \hline
    \addpadding 
    DINOv2~\cite{dinov2}                        & 1.1B & \rp{448}{14} & \ca{69.3} & 73.4 & 71.1 & 73.9 & 60.0 & 63.9 & 59.0 & 76.4 & 76.7 \\
    AIMv2 3B~\cite{aimv2}                       & 2.7B & \rp{448}{14} & \ca{67.6} & 71.4 & 67.7 & 72.3 & 59.2 & 61.2 & 56.3 & 76.4 & 76.4 \\
    InternViT2.5 6B$^\ddagger$~\cite{chen2024internvit2p5} & 5.5B   & \rp{448}{14} & \ca{\textbf{72.8}} & \textbf{77.7} & \textbf{76.5} & \textbf{77.1} & 63.6 & \textbf{66.0} & 62.2 & \textbf{80.0} & 79.5 \\
    \textbf{\PEcore{G}}                         & 1.9B & \rp{448}{14} & \ca{70.5} & 74.0 & 71.8 & 75.8 & 61.5 & 64.8 & 60.1 & 78.5 & 77.3 \\
    \textbf{\PElang{G}}                         & \,\,\,1.7B$^*$ & \rp{448}{14} & \ca{72.1} & 75.4 & 72.9 & 76.3 & \textbf{64.2} & 65.9 & \textbf{62.9} & 79.7 & \textbf{79.7} \\
            \shline
        \end{tabular}
        }
```
```{=latex}
\centering
```
```{=latex}
\makebox[\linewidth][c]{
        \tablestyle{0pt}{1.05} 
        \begin{tabular}{y{55}www awwwwww awwwwww awwwwwwww}
            \shline
            \multirow{2}{*}{\vspace{-2.2cm} Model}  &&& \multicolumn{9}{c}{\ct[c6]{\it Grounding}}\\
                & \cb{Encoder Params}{}
                & \cb{Resolution}{Patch Size}
                & \cb[c6]{\textit{\textbf{Avg. Ground.}}}{}
                & \cb[c6]{RefCOCO}{val~\cite{kazemzadeh2014referitgame}}
                & \cb[c6]{RefCOCO}{testA~\cite{kazemzadeh2014referitgame}}
                & \cb[c6]{RefCOCO}{testB~\cite{kazemzadeh2014referitgame}}
                & \cb[c6]{RefCOCO+}{val~\cite{kazemzadeh2014referitgame}}
                & \cb[c6]{RefCOCO+}{testA~\cite{kazemzadeh2014referitgame}}
                & \cb[c6]{RefCOCO+}{testB~\cite{kazemzadeh2014referitgame}}
                & \cb[c6]{RefCOCOg}{val~\cite{kazemzadeh2014referitgame}}
                & \cb[c6]{RefCOCOg}{test~\cite{kazemzadeh2014referitgame}}
                \\
            \hline
    \multicolumn{1}{l}{{\textit{256 Tokens per Tile}}}                & & & \cat{} &&& &&& &&  \\
    MetaCLIP-L~\cite{metaclip}                  & 0.3B & \rp{224}{14} & \ca{67.1} & 69.3 & 65.0 & 73.2 & 60.5 & 64.9 & 56.5 & 74.3 & 73.4 \\ 
    MetaCLIP-G~\cite{metaclip}                  & 1.8B & \rp{224}{14} & \ca{66.0} & 67.9 & 63.2 & 71.9 & 59.2 & 62.9 & 55.8 & 73.8 & 73.1 \\ 
    \textbf{\PElang{} G}$^\dagger$                 & \,\,\,1.7B$^*$ & \rp{224}{14} & \ca{70.3} & 71.6 & 69.6 & 73.7 & 63.3 & 66.2 & 62.6 & 78.6 & 78.2 \\

    \hline
    \multicolumn{1}{l}{{\textit{576 Tokens per Tile}}}                & & & \cat{} &&& &&& &&  \\
    CLIP~\cite{clip}                            & 0.3B & \rp{336}{14} & \ca{68.5} & 70.7 & 66.6 & 74.1 & 61.1 & 65.9 & 58.1 & 76.0 & 75.1 \\ 
    AIMv2-L~\cite{aimv2}                        & 0.3B & \rp{336}{14} & \ca{66.6} & 68.4 & 65.5 & 71.4 & 59.3 & 63.4 & 56.5 & 74.2 & 74.2 \\ 
    SigLIP2-so~\cite{siglip2}                   & 0.4B & \rp{384}{16} & \ca{66.5} & 67.9 & 66.1 & 70.1 & 58.8 & 61.7 & 57.1 & 75.5 & 75.0 \\ 
    SigLIP2-g-opt~\cite{siglip2}                & 1.1B & \rp{384}{16} & \ca{66.5} & 68.2 & 65.6 & 70.1 & 59.0 & 62.3 & 58.0 & 74.8 & 74.0 \\ 
    \textbf{\PElang{} G}$^\dagger$                 & \,\,\,1.7B$^*$ & \rp{336}{14} & \ca{71.9} & 73.6 & 71.5 & 74.9 & 64.8 & 67.3 & 63.9 & 80.4 & 80.6 \\
    \hline
    \multicolumn{1}{l}{{\textit{1024 Tokens per Tile}}}                & & & \cat{} &&& &&& &&  \\
    SigLIP2-so~\cite{siglip2}                   & 0.4B & \rp{512}{16} & \ca{67.8} & 69.2 & 67.8 & 71.2 & 59.9 & 62.5 & 59.0 & 76.9 & 76.0 \\ 
    \textbf{\PEcore{L}}                         & 0.3B & \rp{448}{14} & \ca{62.9} & 65.3 & 59.9 & 69.2 & 56.6 & 62.2 & 52.0 & 70.1 & 70.0 \\ 
    \textbf{\PElang{L}}                         & 0.3B & \rp{448}{14} & \ca{71.6}          & \textbf{73.0} & \textbf{70.8} & 74.3          & \textbf{65.2} & \textbf{67.2} & 62.9          & 79.7          & 79.7 \\ 
                                                                                                    
    \hline
    \addpadding
    AIMv2 3B~\cite{aimv2}                       & 2.7B & \rp{448}{14} & \ca{65.1} & 66.9 & 62.9 & 71.1 & 58.1 & 62.4 & 55.6 & 71.8 & 72.2 \\ 
    InternViT2.5 6B$^\ddagger$~\cite{chen2024internvit2p5} & 5.5B  & \rp{448}{14} & \ca{56.8} & 61.0 & 56.4 & 65.8 & 51.0 & 57.0 & 46.1 & 58.0 & 58.9 \\ 
    \textbf{\PEcore{G}}                        & 1.9B & \rp{448}{14} & \ca{67.6} & 69.2 & 65.8 & 72.4 & 59.9 & 64.1 & 58.3 & 75.1 & 75.6 \\ 
    \textbf{\PElang{G}}                        & \,\,\,1.7B$^*$ & \rp{448}{14} & \ca{\textbf{71.8}} & 72.6          & 70.7          & \textbf{74.6} & 64.8          & 66.6          & \textbf{64.6} & \textbf{80.4} & \textbf{80.3} \\ 
            \shline
        \end{tabular}
        }
```
```{=latex}
\clearpage
```
`\PEspat{}`{=latex}: Additional Qualitative Results {#appx:more_feature_viz}
---------------------------------------------------

```{=latex}
\centering
```
![**More Visualizations** of the feature space following Fig. `\ref{fig:feature_viz}`{=latex}. After the image itself, column 1 is `\PEcore{G}`{=latex} last layer features, column 2 is `\PEcore{G}`{=latex} aligned to its own layer 41, column 3 is `\PEcore{G}`{=latex} aligned to SAM 2.1-L [@sam2] mask logits, and column 4 is `\PEcore{G}`{=latex} aligned to both, denoted `\PEspat{G}`{=latex}. See §`\ref{appx:feature_viz}`{=latex} for visualization method.](fig/more_feature_viz.png){#fig:more_feature_viz width="1\\linewidth"}

```{=latex}
\clearpage
```
```{=latex}
\small
```
```{=latex}
\bibliographystyle{ieeenat_fullname}
```

[^1]: The annotators are instructed to remove, correct, and add information in the captions.

[^2]: PVD available at <https://ai.meta.com/datasets/pe-video/>

[^3]: We employ the setup described in §`\ref{sec:core_image_pt}`{=latex} except for the additional class token (only used for L and B). Interestingly, we find *using the same high learning rate* ($2\times10^{-3}$) to perform well for G. We also did not find scaling the text encoder to be beneficial.

[^4]: We use the version provided by [@datacomp] and re-evaluate all models to ensure a fair comparison.

[^5]: <https://github.com/mlfoundations/open_clip>

[^6]: <https://github.com/LAION-AI/CLIP_benchmark>

[^7]: We excluded multi-image samples.

[^8]: <https://github.com/jozhang97/DETA>
