---
abstract: |
  ```{=latex}
  \vspace{-0.1cm}
  ```
  Forecasting is a critical task in decision-making across numerous domains. While historical numerical data provide a starting point, they fail to convey the complete context required for reliable and accurate predictions. Human forecasters frequently rely on additional information, such as background knowledge and constraints, which can be efficiently communicated through natural language. However, despite recent progress with LLM-based forecasters, their ability to effectively integrate this textual information remains an open question. To address this, we introduce "Context is Key" (CiK), a time series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context, requiring models to integrate both modalities; crucially, every task in CiK requires understanding textual context to be solved successfully. We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters, and propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark. Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and reveal some of their critical shortcomings. This benchmark aims to advance multimodal forecasting by promoting models that are both accurate and accessible to decision-makers with varied technical expertise. The benchmark can be visualized at <https://servicenow.github.io/context-is-key-forecasting/v0/>.
bibliography:
- icml2025.bib
---

```{=latex}
\newcommand{\theHalgorithm}{\arabic{algorithm}}
```
```{=latex}
\newcommand{\figleft}{{\em (Left)}}
```
```{=latex}
\newcommand{\figcenter}{{\em (Center)}}
```
```{=latex}
\newcommand{\figright}{{\em (Right)}}
```
```{=latex}
\newcommand{\figtop}{{\em (Top)}}
```
```{=latex}
\newcommand{\figbottom}{{\em (Bottom)}}
```
```{=latex}
\newcommand{\captiona}{{\em (a)}}
```
```{=latex}
\newcommand{\captionb}{{\em (b)}}
```
```{=latex}
\newcommand{\captionc}{{\em (c)}}
```
```{=latex}
\newcommand{\captiond}{{\em (d)}}
```
```{=latex}
\newcommand{\newterm}[1]{{\bf #1}}
```
```{=latex}
\newcommand{\eqdef}{\mathrel{\mathop:}=}
```
```{=latex}
\def\figref#1{figure~\ref{#1}}
```
```{=latex}
\def\Figref#1{Figure~\ref{#1}}
```
```{=latex}
\def\twofigref#1#2{figures \ref{#1} and \ref{#2}}
```
```{=latex}
\def\quadfigref#1#2#3#4{figures \ref{#1}, \ref{#2}, \ref{#3} and \ref{#4}}
```
```{=latex}
\def\secref#1{section~\ref{#1}}
```
```{=latex}
\def\Secref#1{Section~\ref{#1}}
```
```{=latex}
\def\twosecrefs#1#2{sections \ref{#1} and \ref{#2}}
```
```{=latex}
\def\secrefs#1#2#3{sections \ref{#1}, \ref{#2} and \ref{#3}}
```
```{=latex}
\def\eqref#1{equation~\ref{#1}}
```
```{=latex}
\def\Eqref#1{Equation~\ref{#1}}
```
```{=latex}
\def\plaineqref#1{\ref{#1}}
```
```{=latex}
\def\chapref#1{chapter~\ref{#1}}
```
```{=latex}
\def\Chapref#1{Chapter~\ref{#1}}
```
```{=latex}
\def\rangechapref#1#2{chapters\ref{#1}--\ref{#2}}
```
```{=latex}
\def\algref#1{algorithm~\ref{#1}}
```
```{=latex}
\def\Algref#1{Algorithm~\ref{#1}}
```
```{=latex}
\def\twoalgref#1#2{algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\Twoalgref#1#2{Algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\partref#1{part~\ref{#1}}
```
```{=latex}
\def\Partref#1{Part~\ref{#1}}
```
```{=latex}
\def\twopartref#1#2{parts \ref{#1} and \ref{#2}}
```
```{=latex}
\def\ceil#1{\lceil #1 \rceil}
```
```{=latex}
\def\floor#1{\lfloor #1 \rfloor}
```
```{=latex}
\def\1{\bm{1}}
```
```{=latex}
\newcommand{\train}{\mathcal{D}}
```
```{=latex}
\newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}}
```
```{=latex}
\newcommand{\test}{\mathcal{D_{\mathrm{test}}}}
```
```{=latex}
\def\eps{{\epsilon}}
```
```{=latex}
\def\reta{{\textnormal{$\eta$}}}
```
```{=latex}
\def\ra{{\textnormal{a}}}
```
```{=latex}
\def\rb{{\textnormal{b}}}
```
```{=latex}
\def\rc{{\textnormal{c}}}
```
```{=latex}
\def\rd{{\textnormal{d}}}
```
```{=latex}
\def\re{{\textnormal{e}}}
```
```{=latex}
\def\rf{{\textnormal{f}}}
```
```{=latex}
\def\rg{{\textnormal{g}}}
```
```{=latex}
\def\rh{{\textnormal{h}}}
```
```{=latex}
\def\ri{{\textnormal{i}}}
```
```{=latex}
\def\rj{{\textnormal{j}}}
```
```{=latex}
\def\rk{{\textnormal{k}}}
```
```{=latex}
\def\rl{{\textnormal{l}}}
```
```{=latex}
\def\rn{{\textnormal{n}}}
```
```{=latex}
\def\ro{{\textnormal{o}}}
```
```{=latex}
\def\rp{{\textnormal{p}}}
```
```{=latex}
\def\rq{{\textnormal{q}}}
```
```{=latex}
\def\rr{{\textnormal{r}}}
```
```{=latex}
\def\rs{{\textnormal{s}}}
```
```{=latex}
\def\rt{{\textnormal{t}}}
```
```{=latex}
\def\ru{{\textnormal{u}}}
```
```{=latex}
\def\rv{{\textnormal{v}}}
```
```{=latex}
\def\rw{{\textnormal{w}}}
```
```{=latex}
\def\rx{{\textnormal{x}}}
```
```{=latex}
\def\ry{{\textnormal{y}}}
```
```{=latex}
\def\rz{{\textnormal{z}}}
```
```{=latex}
\def\rvepsilon{{\mathbf{\epsilon}}}
```
```{=latex}
\def\rvtheta{{\mathbf{\theta}}}
```
```{=latex}
\def\rva{{\mathbf{a}}}
```
```{=latex}
\def\rvb{{\mathbf{b}}}
```
```{=latex}
\def\rvc{{\mathbf{c}}}
```
```{=latex}
\def\rvd{{\mathbf{d}}}
```
```{=latex}
\def\rve{{\mathbf{e}}}
```
```{=latex}
\def\rvf{{\mathbf{f}}}
```
```{=latex}
\def\rvg{{\mathbf{g}}}
```
```{=latex}
\def\rvh{{\mathbf{h}}}
```
```{=latex}
\def\rvu{{\mathbf{i}}}
```
```{=latex}
\def\rvj{{\mathbf{j}}}
```
```{=latex}
\def\rvk{{\mathbf{k}}}
```
```{=latex}
\def\rvl{{\mathbf{l}}}
```
```{=latex}
\def\rvm{{\mathbf{m}}}
```
```{=latex}
\def\rvn{{\mathbf{n}}}
```
```{=latex}
\def\rvo{{\mathbf{o}}}
```
```{=latex}
\def\rvp{{\mathbf{p}}}
```
```{=latex}
\def\rvq{{\mathbf{q}}}
```
```{=latex}
\def\rvr{{\mathbf{r}}}
```
```{=latex}
\def\rvs{{\mathbf{s}}}
```
```{=latex}
\def\rvt{{\mathbf{t}}}
```
```{=latex}
\def\rvu{{\mathbf{u}}}
```
```{=latex}
\def\rvv{{\mathbf{v}}}
```
```{=latex}
\def\rvw{{\mathbf{w}}}
```
```{=latex}
\def\rvx{{\mathbf{x}}}
```
```{=latex}
\def\rvy{{\mathbf{y}}}
```
```{=latex}
\def\rvz{{\mathbf{z}}}
```
```{=latex}
\def\erva{{\textnormal{a}}}
```
```{=latex}
\def\ervb{{\textnormal{b}}}
```
```{=latex}
\def\ervc{{\textnormal{c}}}
```
```{=latex}
\def\ervd{{\textnormal{d}}}
```
```{=latex}
\def\erve{{\textnormal{e}}}
```
```{=latex}
\def\ervf{{\textnormal{f}}}
```
```{=latex}
\def\ervg{{\textnormal{g}}}
```
```{=latex}
\def\ervh{{\textnormal{h}}}
```
```{=latex}
\def\ervi{{\textnormal{i}}}
```
```{=latex}
\def\ervj{{\textnormal{j}}}
```
```{=latex}
\def\ervk{{\textnormal{k}}}
```
```{=latex}
\def\ervl{{\textnormal{l}}}
```
```{=latex}
\def\ervm{{\textnormal{m}}}
```
```{=latex}
\def\ervn{{\textnormal{n}}}
```
```{=latex}
\def\ervo{{\textnormal{o}}}
```
```{=latex}
\def\ervp{{\textnormal{p}}}
```
```{=latex}
\def\ervq{{\textnormal{q}}}
```
```{=latex}
\def\ervr{{\textnormal{r}}}
```
```{=latex}
\def\ervs{{\textnormal{s}}}
```
```{=latex}
\def\ervt{{\textnormal{t}}}
```
```{=latex}
\def\ervu{{\textnormal{u}}}
```
```{=latex}
\def\ervv{{\textnormal{v}}}
```
```{=latex}
\def\ervw{{\textnormal{w}}}
```
```{=latex}
\def\ervx{{\textnormal{x}}}
```
```{=latex}
\def\ervy{{\textnormal{y}}}
```
```{=latex}
\def\ervz{{\textnormal{z}}}
```
```{=latex}
\def\rmA{{\mathbf{A}}}
```
```{=latex}
\def\rmB{{\mathbf{B}}}
```
```{=latex}
\def\rmC{{\mathbf{C}}}
```
```{=latex}
\def\rmD{{\mathbf{D}}}
```
```{=latex}
\def\rmE{{\mathbf{E}}}
```
```{=latex}
\def\rmF{{\mathbf{F}}}
```
```{=latex}
\def\rmG{{\mathbf{G}}}
```
```{=latex}
\def\rmH{{\mathbf{H}}}
```
```{=latex}
\def\rmI{{\mathbf{I}}}
```
```{=latex}
\def\rmJ{{\mathbf{J}}}
```
```{=latex}
\def\rmK{{\mathbf{K}}}
```
```{=latex}
\def\rmL{{\mathbf{L}}}
```
```{=latex}
\def\rmM{{\mathbf{M}}}
```
```{=latex}
\def\rmN{{\mathbf{N}}}
```
```{=latex}
\def\rmO{{\mathbf{O}}}
```
```{=latex}
\def\rmP{{\mathbf{P}}}
```
```{=latex}
\def\rmQ{{\mathbf{Q}}}
```
```{=latex}
\def\rmR{{\mathbf{R}}}
```
```{=latex}
\def\rmS{{\mathbf{S}}}
```
```{=latex}
\def\rmT{{\mathbf{T}}}
```
```{=latex}
\def\rmU{{\mathbf{U}}}
```
```{=latex}
\def\rmV{{\mathbf{V}}}
```
```{=latex}
\def\rmW{{\mathbf{W}}}
```
```{=latex}
\def\rmX{{\mathbf{X}}}
```
```{=latex}
\def\rmY{{\mathbf{Y}}}
```
```{=latex}
\def\rmZ{{\mathbf{Z}}}
```
```{=latex}
\def\ermA{{\textnormal{A}}}
```
```{=latex}
\def\ermB{{\textnormal{B}}}
```
```{=latex}
\def\ermC{{\textnormal{C}}}
```
```{=latex}
\def\ermD{{\textnormal{D}}}
```
```{=latex}
\def\ermE{{\textnormal{E}}}
```
```{=latex}
\def\ermF{{\textnormal{F}}}
```
```{=latex}
\def\ermG{{\textnormal{G}}}
```
```{=latex}
\def\ermH{{\textnormal{H}}}
```
```{=latex}
\def\ermI{{\textnormal{I}}}
```
```{=latex}
\def\ermJ{{\textnormal{J}}}
```
```{=latex}
\def\ermK{{\textnormal{K}}}
```
```{=latex}
\def\ermL{{\textnormal{L}}}
```
```{=latex}
\def\ermM{{\textnormal{M}}}
```
```{=latex}
\def\ermN{{\textnormal{N}}}
```
```{=latex}
\def\ermO{{\textnormal{O}}}
```
```{=latex}
\def\ermP{{\textnormal{P}}}
```
```{=latex}
\def\ermQ{{\textnormal{Q}}}
```
```{=latex}
\def\ermR{{\textnormal{R}}}
```
```{=latex}
\def\ermS{{\textnormal{S}}}
```
```{=latex}
\def\ermT{{\textnormal{T}}}
```
```{=latex}
\def\ermU{{\textnormal{U}}}
```
```{=latex}
\def\ermV{{\textnormal{V}}}
```
```{=latex}
\def\ermW{{\textnormal{W}}}
```
```{=latex}
\def\ermX{{\textnormal{X}}}
```
```{=latex}
\def\ermY{{\textnormal{Y}}}
```
```{=latex}
\def\ermZ{{\textnormal{Z}}}
```
```{=latex}
\def\vzero{{\bm{0}}}
```
```{=latex}
\def\vone{{\bm{1}}}
```
```{=latex}
\def\vmu{{\bm{\mu}}}
```
```{=latex}
\def\vtheta{{\bm{\theta}}}
```
```{=latex}
\def\va{{\bm{a}}}
```
```{=latex}
\def\vb{{\bm{b}}}
```
```{=latex}
\def\vc{{\bm{c}}}
```
```{=latex}
\def\vd{{\bm{d}}}
```
```{=latex}
\def\ve{{\bm{e}}}
```
```{=latex}
\def\vf{{\bm{f}}}
```
```{=latex}
\def\vg{{\bm{g}}}
```
```{=latex}
\def\vh{{\bm{h}}}
```
```{=latex}
\def\vi{{\bm{i}}}
```
```{=latex}
\def\vj{{\bm{j}}}
```
```{=latex}
\def\vk{{\bm{k}}}
```
```{=latex}
\def\vl{{\bm{l}}}
```
```{=latex}
\def\vm{{\bm{m}}}
```
```{=latex}
\def\vn{{\bm{n}}}
```
```{=latex}
\def\vo{{\bm{o}}}
```
```{=latex}
\def\vp{{\bm{p}}}
```
```{=latex}
\def\vq{{\bm{q}}}
```
```{=latex}
\def\vr{{\bm{r}}}
```
```{=latex}
\def\vs{{\bm{s}}}
```
```{=latex}
\def\vt{{\bm{t}}}
```
```{=latex}
\def\vu{{\bm{u}}}
```
```{=latex}
\def\vv{{\bm{v}}}
```
```{=latex}
\def\vw{{\bm{w}}}
```
```{=latex}
\def\vx{{\bm{x}}}
```
```{=latex}
\def\vy{{\bm{y}}}
```
```{=latex}
\def\vz{{\bm{z}}}
```
```{=latex}
\def\evalpha{{\alpha}}
```
```{=latex}
\def\evbeta{{\beta}}
```
```{=latex}
\def\evepsilon{{\epsilon}}
```
```{=latex}
\def\evlambda{{\lambda}}
```
```{=latex}
\def\evomega{{\omega}}
```
```{=latex}
\def\evmu{{\mu}}
```
```{=latex}
\def\evpsi{{\psi}}
```
```{=latex}
\def\evsigma{{\sigma}}
```
```{=latex}
\def\evtheta{{\theta}}
```
```{=latex}
\def\eva{{a}}
```
```{=latex}
\def\evb{{b}}
```
```{=latex}
\def\evc{{c}}
```
```{=latex}
\def\evd{{d}}
```
```{=latex}
\def\eve{{e}}
```
```{=latex}
\def\evf{{f}}
```
```{=latex}
\def\evg{{g}}
```
```{=latex}
\def\evh{{h}}
```
```{=latex}
\def\evi{{i}}
```
```{=latex}
\def\evj{{j}}
```
```{=latex}
\def\evk{{k}}
```
```{=latex}
\def\evl{{l}}
```
```{=latex}
\def\evm{{m}}
```
```{=latex}
\def\evn{{n}}
```
```{=latex}
\def\evo{{o}}
```
```{=latex}
\def\evp{{p}}
```
```{=latex}
\def\evq{{q}}
```
```{=latex}
\def\evr{{r}}
```
```{=latex}
\def\evs{{s}}
```
```{=latex}
\def\evt{{t}}
```
```{=latex}
\def\evu{{u}}
```
```{=latex}
\def\evv{{v}}
```
```{=latex}
\def\evw{{w}}
```
```{=latex}
\def\evx{{x}}
```
```{=latex}
\def\evy{{y}}
```
```{=latex}
\def\evz{{z}}
```
```{=latex}
\def\mA{{\bm{A}}}
```
```{=latex}
\def\mB{{\bm{B}}}
```
```{=latex}
\def\mC{{\bm{C}}}
```
```{=latex}
\def\mD{{\bm{D}}}
```
```{=latex}
\def\mE{{\bm{E}}}
```
```{=latex}
\def\mF{{\bm{F}}}
```
```{=latex}
\def\mG{{\bm{G}}}
```
```{=latex}
\def\mH{{\bm{H}}}
```
```{=latex}
\def\mI{{\bm{I}}}
```
```{=latex}
\def\mJ{{\bm{J}}}
```
```{=latex}
\def\mK{{\bm{K}}}
```
```{=latex}
\def\mL{{\bm{L}}}
```
```{=latex}
\def\mM{{\bm{M}}}
```
```{=latex}
\def\mN{{\bm{N}}}
```
```{=latex}
\def\mO{{\bm{O}}}
```
```{=latex}
\def\mP{{\bm{P}}}
```
```{=latex}
\def\mQ{{\bm{Q}}}
```
```{=latex}
\def\mR{{\bm{R}}}
```
```{=latex}
\def\mS{{\bm{S}}}
```
```{=latex}
\def\mT{{\bm{T}}}
```
```{=latex}
\def\mU{{\bm{U}}}
```
```{=latex}
\def\mV{{\bm{V}}}
```
```{=latex}
\def\mW{{\bm{W}}}
```
```{=latex}
\def\mX{{\bm{X}}}
```
```{=latex}
\def\mY{{\bm{Y}}}
```
```{=latex}
\def\mZ{{\bm{Z}}}
```
```{=latex}
\def\mBeta{{\bm{\beta}}}
```
```{=latex}
\def\mPhi{{\bm{\Phi}}}
```
```{=latex}
\def\mLambda{{\bm{\Lambda}}}
```
```{=latex}
\def\mSigma{{\bm{\Sigma}}}
```
```{=latex}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
```
```{=latex}
\def\tA{{\tens{A}}}
```
```{=latex}
\def\tB{{\tens{B}}}
```
```{=latex}
\def\tC{{\tens{C}}}
```
```{=latex}
\def\tD{{\tens{D}}}
```
```{=latex}
\def\tE{{\tens{E}}}
```
```{=latex}
\def\tF{{\tens{F}}}
```
```{=latex}
\def\tG{{\tens{G}}}
```
```{=latex}
\def\tH{{\tens{H}}}
```
```{=latex}
\def\tI{{\tens{I}}}
```
```{=latex}
\def\tJ{{\tens{J}}}
```
```{=latex}
\def\tK{{\tens{K}}}
```
```{=latex}
\def\tL{{\tens{L}}}
```
```{=latex}
\def\tM{{\tens{M}}}
```
```{=latex}
\def\tN{{\tens{N}}}
```
```{=latex}
\def\tO{{\tens{O}}}
```
```{=latex}
\def\tP{{\tens{P}}}
```
```{=latex}
\def\tQ{{\tens{Q}}}
```
```{=latex}
\def\tR{{\tens{R}}}
```
```{=latex}
\def\tS{{\tens{S}}}
```
```{=latex}
\def\tT{{\tens{T}}}
```
```{=latex}
\def\tU{{\tens{U}}}
```
```{=latex}
\def\tV{{\tens{V}}}
```
```{=latex}
\def\tW{{\tens{W}}}
```
```{=latex}
\def\tX{{\tens{X}}}
```
```{=latex}
\def\tY{{\tens{Y}}}
```
```{=latex}
\def\tZ{{\tens{Z}}}
```
```{=latex}
\def\gA{{\mathcal{A}}}
```
```{=latex}
\def\gB{{\mathcal{B}}}
```
```{=latex}
\def\gC{{\mathcal{C}}}
```
```{=latex}
\def\gD{{\mathcal{D}}}
```
```{=latex}
\def\gE{{\mathcal{E}}}
```
```{=latex}
\def\gF{{\mathcal{F}}}
```
```{=latex}
\def\gG{{\mathcal{G}}}
```
```{=latex}
\def\gH{{\mathcal{H}}}
```
```{=latex}
\def\gI{{\mathcal{I}}}
```
```{=latex}
\def\gJ{{\mathcal{J}}}
```
```{=latex}
\def\gK{{\mathcal{K}}}
```
```{=latex}
\def\gL{{\mathcal{L}}}
```
```{=latex}
\def\gM{{\mathcal{M}}}
```
```{=latex}
\def\gN{{\mathcal{N}}}
```
```{=latex}
\def\gO{{\mathcal{O}}}
```
```{=latex}
\def\gP{{\mathcal{P}}}
```
```{=latex}
\def\gQ{{\mathcal{Q}}}
```
```{=latex}
\def\gR{{\mathcal{R}}}
```
```{=latex}
\def\gS{{\mathcal{S}}}
```
```{=latex}
\def\gT{{\mathcal{T}}}
```
```{=latex}
\def\gU{{\mathcal{U}}}
```
```{=latex}
\def\gV{{\mathcal{V}}}
```
```{=latex}
\def\gW{{\mathcal{W}}}
```
```{=latex}
\def\gX{{\mathcal{X}}}
```
```{=latex}
\def\gY{{\mathcal{Y}}}
```
```{=latex}
\def\gZ{{\mathcal{Z}}}
```
```{=latex}
\def\sA{{\mathbb{A}}}
```
```{=latex}
\def\sB{{\mathbb{B}}}
```
```{=latex}
\def\sC{{\mathbb{C}}}
```
```{=latex}
\def\sD{{\mathbb{D}}}
```
```{=latex}
\def\sF{{\mathbb{F}}}
```
```{=latex}
\def\sG{{\mathbb{G}}}
```
```{=latex}
\def\sH{{\mathbb{H}}}
```
```{=latex}
\def\sI{{\mathbb{I}}}
```
```{=latex}
\def\sJ{{\mathbb{J}}}
```
```{=latex}
\def\sK{{\mathbb{K}}}
```
```{=latex}
\def\sL{{\mathbb{L}}}
```
```{=latex}
\def\sM{{\mathbb{M}}}
```
```{=latex}
\def\sN{{\mathbb{N}}}
```
```{=latex}
\def\sO{{\mathbb{O}}}
```
```{=latex}
\def\sP{{\mathbb{P}}}
```
```{=latex}
\def\sQ{{\mathbb{Q}}}
```
```{=latex}
\def\sR{{\mathbb{R}}}
```
```{=latex}
\def\sS{{\mathbb{S}}}
```
```{=latex}
\def\sT{{\mathbb{T}}}
```
```{=latex}
\def\sU{{\mathbb{U}}}
```
```{=latex}
\def\sV{{\mathbb{V}}}
```
```{=latex}
\def\sW{{\mathbb{W}}}
```
```{=latex}
\def\sX{{\mathbb{X}}}
```
```{=latex}
\def\sY{{\mathbb{Y}}}
```
```{=latex}
\def\sZ{{\mathbb{Z}}}
```
```{=latex}
\def\emLambda{{\Lambda}}
```
```{=latex}
\def\emA{{A}}
```
```{=latex}
\def\emB{{B}}
```
```{=latex}
\def\emC{{C}}
```
```{=latex}
\def\emD{{D}}
```
```{=latex}
\def\emE{{E}}
```
```{=latex}
\def\emF{{F}}
```
```{=latex}
\def\emG{{G}}
```
```{=latex}
\def\emH{{H}}
```
```{=latex}
\def\emI{{I}}
```
```{=latex}
\def\emJ{{J}}
```
```{=latex}
\def\emK{{K}}
```
```{=latex}
\def\emL{{L}}
```
```{=latex}
\def\emM{{M}}
```
```{=latex}
\def\emN{{N}}
```
```{=latex}
\def\emO{{O}}
```
```{=latex}
\def\emP{{P}}
```
```{=latex}
\def\emQ{{Q}}
```
```{=latex}
\def\emR{{R}}
```
```{=latex}
\def\emS{{S}}
```
```{=latex}
\def\emT{{T}}
```
```{=latex}
\def\emU{{U}}
```
```{=latex}
\def\emV{{V}}
```
```{=latex}
\def\emW{{W}}
```
```{=latex}
\def\emX{{X}}
```
```{=latex}
\def\emY{{Y}}
```
```{=latex}
\def\emZ{{Z}}
```
```{=latex}
\def\emSigma{{\Sigma}}
```
```{=latex}
\newcommand{\etens}[1]{\mathsfit{#1}}
```
```{=latex}
\def\etLambda{{\etens{\Lambda}}}
```
```{=latex}
\def\etA{{\etens{A}}}
```
```{=latex}
\def\etB{{\etens{B}}}
```
```{=latex}
\def\etC{{\etens{C}}}
```
```{=latex}
\def\etD{{\etens{D}}}
```
```{=latex}
\def\etE{{\etens{E}}}
```
```{=latex}
\def\etF{{\etens{F}}}
```
```{=latex}
\def\etG{{\etens{G}}}
```
```{=latex}
\def\etH{{\etens{H}}}
```
```{=latex}
\def\etI{{\etens{I}}}
```
```{=latex}
\def\etJ{{\etens{J}}}
```
```{=latex}
\def\etK{{\etens{K}}}
```
```{=latex}
\def\etL{{\etens{L}}}
```
```{=latex}
\def\etM{{\etens{M}}}
```
```{=latex}
\def\etN{{\etens{N}}}
```
```{=latex}
\def\etO{{\etens{O}}}
```
```{=latex}
\def\etP{{\etens{P}}}
```
```{=latex}
\def\etQ{{\etens{Q}}}
```
```{=latex}
\def\etR{{\etens{R}}}
```
```{=latex}
\def\etS{{\etens{S}}}
```
```{=latex}
\def\etT{{\etens{T}}}
```
```{=latex}
\def\etU{{\etens{U}}}
```
```{=latex}
\def\etV{{\etens{V}}}
```
```{=latex}
\def\etW{{\etens{W}}}
```
```{=latex}
\def\etX{{\etens{X}}}
```
```{=latex}
\def\etY{{\etens{Y}}}
```
```{=latex}
\def\etZ{{\etens{Z}}}
```
```{=latex}
\newcommand{\pdata}{p_{\rm{data}}}
```
```{=latex}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
```
```{=latex}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
```
```{=latex}
\newcommand{\pmodel}{p_{\rm{model}}}
```
```{=latex}
\newcommand{\Pmodel}{P_{\rm{model}}}
```
```{=latex}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
```
```{=latex}
\newcommand{\pencode}{p_{\rm{encoder}}}
```
```{=latex}
\newcommand{\pdecode}{p_{\rm{decoder}}}
```
```{=latex}
\newcommand{\precons}{p_{\rm{reconstruct}}}
```
```{=latex}
\newcommand{\laplace}{\mathrm{Laplace}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\newcommand{\Ls}{\mathcal{L}}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\emp}{\tilde{p}}
```
```{=latex}
\newcommand{\lr}{\alpha}
```
```{=latex}
\newcommand{\reg}{\lambda}
```
```{=latex}
\newcommand{\rect}{\mathrm{rectifier}}
```
```{=latex}
\newcommand{\softmax}{\mathrm{softmax}}
```
```{=latex}
\newcommand{\sigmoid}{\sigma}
```
```{=latex}
\newcommand{\softplus}{\zeta}
```
```{=latex}
\newcommand{\KL}{D_{\mathrm{KL}}}
```
```{=latex}
\newcommand{\Var}{\mathrm{Var}}
```
```{=latex}
\newcommand{\standarderror}{\mathrm{SE}}
```
```{=latex}
\newcommand{\Cov}{\mathrm{Cov}}
```
```{=latex}
\newcommand{\normlzero}{L^0}
```
```{=latex}
\newcommand{\normlone}{L^1}
```
```{=latex}
\newcommand{\normltwo}{L^2}
```
```{=latex}
\newcommand{\normlp}{L^p}
```
```{=latex}
\newcommand{\normmax}{L^\infty}
```
```{=latex}
\newcommand{\parents}{Pa}
```
```{=latex}
\DeclareMathOperator*{\argmax}{arg\,max}
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\DeclareMathOperator{\sign}{sign}
```
```{=latex}
\DeclareMathOperator{\Tr}{Tr}
```
```{=latex}
\let\ab\allowbreak
```
```{=latex}
\renewcommand \thepart{}
```
```{=latex}
\renewcommand \partname{}
```
```{=latex}
\newcommand{\pointer}[1]{\textcolor[HTML]{0754B3}{\textbf{\textcircled{\raisebox{-0.9pt}{\textbf{#1}}}}}}
```
```{=latex}
\newcommand{\CiKnumtasks}{71}
```
```{=latex}
\newcommand{\alex}[1]{\textbf{\textcolor{magenta}{AD: #1}}}
```
```{=latex}
\newcommand{\arjun}[1]{\textbf{\textcolor{orange}{AA: #1}}}
```
```{=latex}
\newcommand{\andrew}[1]{\textcolor{teal}{AW: #1}}
```
```{=latex}
\newcommand{\vale}[1]{\textbf{\textcolor{orange}{VZ: #1}}}
```
```{=latex}
\newcommand{\etienne}[1]{\textbf{\textcolor{red}{EM: #1}}}
```
```{=latex}
\newcommand{\james}[1]{\textbf{\textcolor{violet}{JR: #1}}}
```
```{=latex}
\newcommand{\nicolas}[1]{\textbf{\textcolor{cyan}{NC: #1}}}
```
```{=latex}
\newcommand{\irina}[1]{\textbf{\textcolor{yellow}{IR: #1}}}
```
```{=latex}
\renewcommand{\alex}[1]{}
```
```{=latex}
\renewcommand{\arjun}[1]{}
```
```{=latex}
\renewcommand{\andrew}[1]{}
```
```{=latex}
\renewcommand{\vale}[1]{}
```
```{=latex}
\renewcommand{\etienne}[1]{}
```
```{=latex}
\renewcommand{\james}[1]{}
```
```{=latex}
\renewcommand{\nicolas}[1]{}
```
```{=latex}
\renewcommand{\irina}[1]{}
```
```{=latex}
\newcommand{\timellm}{Time-LLM (ETTh1)}
```
```{=latex}
\newcommand{\contextpointer}[1]{\raisebox{.5pt}{\textcolor{dodgerblue}{\textbf{\textcircled{\raisebox{-.9pt}{\scalebox{0.9}{\textsf{#1}}}}}}}}
```
```{=latex}
\newcommand{\orange}[1]{\textcolor{orange}{#1}}
```
```{=latex}
\newcommand{\paragraphtight}[1]{\par\textbf{#1}~}
```
```{=latex}
\twocolumn[
\icmltitle{Context is Key: A Benchmark for Forecasting with Essential Textual Information}



\icmlsetsymbol{equal}{*}

\begin{icmlauthorlist}
\icmlauthor{{$^{\bigstar}$}Andrew R. Williams}{snow,mila,udem}
\icmlauthor{{$^{\bigstar}$}Arjun Ashok}{snow,mila,udem}

\icmlauthor{{$^{\dag}$}Étienne Marcotte}{snow}
\icmlauthor{{$^{\dag}$}Valentina Zantedeschi}{snow,laval}
\icmlauthor{Jithendaraa Subramanian}{snow,mila,mcgill}
\icmlauthor{Roland Riachi}{mila}
\icmlauthor{James Requeima}{uoft}
\icmlauthor{Alexandre Lacoste}{snow}
\icmlauthor{Irina Rish}{mila,udem}
\icmlauthor{{$^{\dag}$}Nicolas Chapados}{snow,mila,poly}
\icmlauthor{{$^{\dag}$}Alexandre Drouin}{snow,mila,laval}
\end{icmlauthorlist}

\icmlaffiliation{snow}{ServiceNow Research}
\icmlaffiliation{mila}{Mila - Québec AI Institute}
\icmlaffiliation{udem}{Université de Montréal}
\icmlaffiliation{poly}{Polytechnique Montréal}
\icmlaffiliation{laval}{Université Laval}
\icmlaffiliation{uoft}{University of Toronto}
\icmlaffiliation{mcgill}{McGill University}

\icmlcorrespondingauthor{Andrew R. Williams}{andrew.williams1@servicenow.com}
\icmlcorrespondingauthor{Arjun Ashok}{arjun.ashok@servicenow.com}
\icmlkeywords{Machine Learning, ICML}

\vskip 0.3in
]
```
`\printAffiliationsAndNotice{\icmlEqualContribution}`{=latex}

```{=latex}
\newcommand{\model}[1]{\scalebox{0.9}[1]{\texttt{#1}}}
```
```{=latex}
\newcommand{\tablemodel}[1]{\scalebox{1}[1]{\texttt{#1}}}
```
```{=latex}
\newcommand{\directprompt}{\scalebox{0.9}{\textsc{direct prompt}}\xspace}
```
```{=latex}
\newcommand{\llmp}{\textsc{llmp}\xspace}
```
```{=latex}
\newcommand{\fix}{\marginpar{FIX}}
```
```{=latex}
\newcommand{\new}{\marginpar{NEW}}
```
`\doparttoc `{=latex} `\faketableofcontents `{=latex}

```{=latex}
\vspace{-0.8cm}
```
### Introduction

The prediction of future states of the world is a cornerstone of decision making [@hyndman2018forecasting] and intelligence [@wang2019defining]. Articulated as time series forecasting, this problem pervades much of science and commerce.

Accurate forecasting relies on several decisions left to the practitioner [@hyndman2018forecasting]: (1) *model selection*: choosing an appropriate forecasting model for a given problem; and (2) *incorporating prior information*: determining what relevant information to integrate into the model and how to do so effectively. The latter involves decisions about statistical priors, inductive biases in the model architecture, and other forms of domain knowledge integration, all of which traditionally rely on expert knowledge and manual intervention. However, recent advances in machine learning have shown promise in automating both model selection and the incorporation of prior information, accelerating the democratization of time series forecasting.

In the wake of the foundation model paradigm shift [@bommasani2021opportunities], several works, such as @liang2024foundation [@chen2023long; @lim2021time], have addressed automatic model selection by learning flexible, adaptable models applicable across various problem scenarios. However, these approaches are far more costly than traditional statistical methods and provide debatable improvements in performance [@NixtlaFTSA2024]. Typically, these models process purely numerical time series, excluding the context that human forecasters rely on to incorporate prior information.

```{=latex}
\begin{figure*}[t]\centering
    \includegraphics[width=0.85\textwidth]{figures/fig1/forecast_comparison_pretty.pdf}
    \vspace{-1em}
    \caption{An example task from the proposed Context is Key (CiK) benchmark with \model{GPT-4o} forecasts in blue and the ground truth in yellow. \textbf{Left:} Forecasts based on the numerical history alone are inaccurate, as nothing indicates a reversion to zero. \textbf{Right:} The context enables better forecasts because it reveals that the series represents photovoltaic power production. Hence, the model can deduce that no power will be produced at night. The context also enables better estimation of the peak hour of production by providing statistics from the history.
    }
    \label{fig:key-example}
    \vspace*{-0.5cm}
\end{figure*}
```
An alternative class of recent approaches [@jin2024timellm; @liu2024unitime; @requeima2024llm] adapt large language models (LLMs) for forecasting and leverage natural language (NL) as an intuitive interface to integrate side information. These methods overcome a significant limitation of traditional forecasting techniques by eliminating the need to manually encode priors or design specialized models. They further hold the promise of capturing a broader range of prior knowledge and context, potentially leading to more comprehensive and accurate forecasts.

Unfortunately, there are so far no systematic evaluations of these models' abilities to jointly leverage historical observations and natural language for forecasting. While several benchmarks for context-aided forecasting have recently been released [@zhang2023insight; @liu2024time; @xu2024beyond; @emami2024syscaps; @merrill2024language], their contexts are not guaranteed to be useful for improving performance. As such, it remains unknown whether existing models can enhance their forecasts by leveraging crucial textual context.

To this end, we propose the Context is Key (CiK, pronounced *kick*) benchmark of forecasting tasks. CiK consists of tasks designed to assess a forecasting model's ability to use both (1) numerical input-output pairs and (2) essential textual context. As shown in `\cref{fig:key-example}`{=latex}, accurate forecasts in CiK are made possible only by effectively leveraging both the numerical data and key information contained within the accompanying text.

Our contributions are:

-   *CiK Benchmark*: A collection of `\CiKnumtasks{}`{=latex} manually designed forecasting tasks spanning seven real-world domains, each requiring the integration of diverse contextual information that has a non-trivial impact on forecasts (`\cref{sec:benchmark}`{=latex}).

-   *Region of Interest CRPS (RCRPS)*: A scoring rule to evaluate context-aided forecasting performance, which prioritizes context-sensitive windows and accounts for constraint satisfaction (`\cref{subsec:evaluation}`{=latex}).

-   *Direct Prompt Forecasters*: A simple yet effective prompt-based approach to using LLMs as context-aided forecasters, which serves as a surprisingly strong baseline on CiK (`\cref{subsec:baselines}`{=latex}).

-   *Extensive evaluation* of diverse models on CiK, including statistical models, time series foundation models that use only numerical data, and LLM-based forecasters capable of incorporating natural language context. Our analysis explores key factors, such as the impact of context conditioning and the tradeoff between model size and performance, and discusses model failure modes (`\cref{sec: results}`{=latex}).
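To make the RCRPS contribution above concrete, its general shape — a CRPS that up-weights a context-sensitive window and penalizes violations of constraints stated in the context — can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the weighting scheme, `roi_weight`, `penalty_scale`, and the sample-based CRPS estimator are all hypothetical choices, not the benchmark's exact formula.

```python
import numpy as np

def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    samples = np.asarray(samples, dtype=float)
    return (np.mean(np.abs(samples - y))
            - 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :])))

def rcrps_sketch(forecast_samples, y_future, roi_mask, upper_bound=None,
                 roi_weight=0.5, penalty_scale=10.0):
    """Illustrative region-of-interest CRPS (not the benchmark's formula).

    forecast_samples: array of shape (horizon, n_samples).
    roi_mask: boolean array marking context-sensitive timesteps.
    upper_bound: optional constraint the forecast should satisfy.
    """
    forecast_samples = np.asarray(forecast_samples, dtype=float)
    y_future = np.asarray(y_future, dtype=float)
    roi_mask = np.asarray(roi_mask, dtype=bool)

    per_step = np.array([crps_from_samples(s, y)
                         for s, y in zip(forecast_samples, y_future)])
    # Split the total weight between the region of interest and the rest.
    w = np.where(roi_mask, roi_weight / max(roi_mask.sum(), 1),
                 (1 - roi_weight) / max((~roi_mask).sum(), 1))
    score = float(np.sum(w * per_step))
    # Penalize samples that violate a constraint stated in the context.
    if upper_bound is not None:
        violation = np.clip(forecast_samples - upper_bound, 0.0, None).mean()
        score += penalty_scale * float(violation)
    return score
```

A forecast that both tracks the ground truth in the context-sensitive window and respects the stated bound scores lower (better) than one that ignores either.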

```{=latex}
\vspace*{-0.3cm}
```
### Problem Setting {#sec:problem-setting}

```{=latex}
\vspace*{-0.1cm}
```
###### Context-Aided Forecasting

This work addresses the problem of *context-aided forecasting*, where the goal is to produce statistical forecasts by incorporating relevant side information (i.e., context). Let $\Xb_H = [X_1, \ldots, X_t]$ represent a series of random variables corresponding to historical observations in discrete time, where each $X_{\tau} \in \mathcal{X} \subseteq \mathbb{R}$, and let $\Xb_F = [X_{t+1}, \ldots, X_T]$ represent future observations. In classical statistical forecasting, the goal is to estimate the joint distribution of future observations given the historical ones, $P(\Xb_F \mid \Xb_H).$ We further assume access to *context*, denoted $\Cb$, which consists of additional data of arbitrary nature containing information relevant for predicting $\Xb_{F}$ and complementary to the history $\Xb_H$. The task then becomes estimating the distribution $P(\Xb_F \mid \Xb_H, \Cb).$ Crucially, we restrict our focus to *relevant context*, which we define as context that does not degrade forecasts. Formally, for $\xb_F \sim \Xb_F \mid \Xb_H, \Cb,$ given some loss function $\mathcal{L}$ assessing a predictive distribution over $\Xb_F$ against a realization $\xb_F$, `\mbox{$\mathcal{L}: P(\Xb_F) \times \xb_F \rightarrow \mathbb{R}$}`{=latex}, we are interested in models where, in expectation, forecasts that leverage context perform better: $$\begin{aligned}
    \expect_{\xb_F} \mathcal{L}(P(\Xb_F \mid \Xb_H, \Cb),\xb_F) \leq \expect_{\xb_F} \mathcal{L}(P(\Xb_F \mid \Xb_H),\xb_F).\end{aligned}$$ Furthermore, although the nature of the context can vary widely, we concentrate specifically on *context communicated through natural language*, which we refer to as "context" or "contextual information".
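The inequality above can be checked empirically with any proper scoring rule. As a minimal sketch — using a sample-based CRPS estimator as $\mathcal{L}$, an illustrative choice rather than a prescription of the benchmark's metric — one might write:

```python
import numpy as np

def crps_from_samples(samples, y):
    """Sample-based CRPS estimate for a scalar observation y:
    CRPS(F, y) ~= E|X - y| - 0.5 * E|X - X'|, with X, X' ~ F i.i.d."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return float(term1 - term2)

def context_is_relevant(samples_with_ctx, samples_no_ctx, y_future):
    """Empirical version of the relevance criterion: the context should
    not degrade the average forecast loss over the horizon."""
    loss_ctx = np.mean([crps_from_samples(s, y)
                        for s, y in zip(samples_with_ctx, y_future)])
    loss_no_ctx = np.mean([crps_from_samples(s, y)
                           for s, y in zip(samples_no_ctx, y_future)])
    return loss_ctx <= loss_no_ctx
```

For instance, forecast samples concentrated around a zero-valued future (as in the nighttime solar production example) score better than a diffuse no-context forecast, so such context passes the relevance test.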

```{=latex}
\vspace*{-0.35cm}
```
### Context is Key: a Natural Language Context-Aided Forecasting Benchmark {#sec:benchmark}

```{=latex}
\vspace*{-0.1cm}
```
```{=latex}
\centering
```
![The tasks in the CiK benchmark rely on real-world numerical data from 7 domains.](figures/benchmark_overview/domain_distribution.png "fig:"){#fig:domains-summary width="0.7\\linewidth"} `\vspace*{-0.5cm}`{=latex}

```{=latex}
\vspace*{-0.5cm}
```
We introduce the *Context is Key* (CiK) benchmark, a collection of *probabilistic forecasting* tasks where accurate predictions require integrating both numerical data and natural language contextual information. CiK comprises `\CiKnumtasks{}`{=latex} distinct tasks spanning seven real-world application domains (see `\cref{subsec:data-sources}`{=latex}), each featuring stochastic components (e.g., the specific time series, time window, and natural language formulation) that allow it to be instantiated into thousands of task instances. These tasks encompass diverse types of contextual information that reveal various aspects of dynamical processes (see `\cref{subsec:types-of-context}`{=latex}). Moreover, the tasks are designed so that *context is key*: the text non-trivially *unlocks* accurate forecasts, e.g., by conveying causal relationships that reveal the effect of a covariate on the time series of interest. An example task is shown in `\cref{fig:key-example}`{=latex} and others can be found in `\cref{app:task-examples}`{=latex}.

**Availability:** CiK is open source. The complete set of tasks can be explored at <https://servicenow.github.io/context-is-key-forecasting/v0/> and the source code, at <https://github.com/ServiceNow/context-is-key-forecasting>. All data sources used for CiK's tasks are openly available (see `\cref{app:data-sources}`{=latex}).

```{=latex}
\vspace*{-0.25cm}
```
#### Domains and Numerical Data Sources {#subsec:data-sources}

```{=latex}
\vspace*{-0.1cm}
```
As illustrated in `\cref{fig:domains-summary}`{=latex}, the majority (95%) of tasks in CiK are based on real-world application domains. We leverage 2,644 time series sourced from publicly available datasets across seven domains: Climatology (solar irradiance and cloud coverage [@sengupta2018national]); Economics (unemployment rates across states and counties [@fred_stlouisfed]); Energy (electricity consumption and production [@godahewa2021monash]); Mechanics (experimental properties of physical systems [@gamella2024chamber]); Public Safety (fire department intervention counts [@ville_de_montreal_2020]); Transportation (highway segment occupancy rates and average speeds [@chen2001freeway]); and Retail (cash withdrawals from various ATMs [@godahewa2021monash]). The remaining 5% of tasks use simulated data from dynamical systems crafted specifically for the tasks. Overall, the time series in CiK exhibit diverse sampling frequencies, with observations ranging from every 10 minutes to monthly intervals; additional details on data sources can be found in `\cref{app:data-sources}`{=latex}.

```{=latex}
\vspace{-0.1cm}
```
`\paragraphtight{Memorization mitigation:}`{=latex} `\label{par:memorization}`{=latex} Building tasks with publicly available data introduces contamination risk: pretrained LLMs and time series foundation models may have memorized portions of the data, potentially inflating evaluation performance. We employ several mitigation strategies. First, we prioritize live data sources that are continuously updated, such as @chen2001freeway and @ville_de_montreal_2020, ensuring the data is collected after the training cut-off dates of the models that we evaluate. Second, where applicable, we use derived series that are not directly available in the raw data, such as incident logs converted into time series [@ville_de_montreal_2020]. Finally, as a last resort, we apply minor transformations, such as adding noise or shifting timestamps, but use these sparingly to limit their potential impact on the tasks, and to avoid misalignment between common-sense knowledge (e.g., holiday dates) and the numerical data. The exact mitigation methods used, per data source, together with the number of tasks on which they are applied are given in `\cref{app:data-sources}`{=latex}.

```{=latex}
\vspace*{-0.3cm}
```
#### Natural Language Context {#subsec:types-of-context}

```{=latex}
\vspace*{-0.1cm}
```
```{=latex}
\centering
```
![Number of tasks per context type in CiK.](figures/benchmark_overview/context_sources.png "fig:"){#fig:context-sources-summary width="0.7\\linewidth"} `\vspace*{-0.55cm}`{=latex}

```{=latex}
\vspace*{-0.5cm}
```
For each task in the benchmark, we jointly sample numerical data from one of the series described in `\cref{subsec:data-sources}`{=latex} and then *manually* craft the natural language context necessary to unlock accurate forecasts. In some cases, this context is purely descriptive, providing information about the general nature of the target variable and its historical behavior, as seen in the task illustrated in `\cref{fig:key-example}`{=latex}. In other cases, the raw numerical data is adjusted to reflect the influence of the context. For example, in one task based on data from @godahewa2021monash, an ATM is expected to be inaccessible during a specific time period in the future, leading to zero withdrawals (see `\cref{app:atmexample}`{=latex}). In another task, electricity demand is projected to surge due to an incoming weather event (see `\cref{app:electricityincrease}`{=latex}). For such cases, we modify the series to incorporate patterns described by the context.

Overall, we include diverse forms of natural language context, each capturing a different aspect of the process underlying the time series and providing complementary knowledge that a human expert could leverage for more accurate forecasting. The types of context are described below and exemplified in the task illustrated in `\cref{fig:context-overview}`{=latex}. For additional clarity, further examples are provided in `\cref{app:task-examples}`{=latex} and the distribution of tasks per context type is shown in `\cref{fig:context-sources-summary}`{=latex}.

```{=latex}
\vspace*{-0.1cm}
```
`\paragraphtight{Intemporal information ($\cb_{I}$)}`{=latex} Information about the process that remains invariant in time. For example, a description of the process and the nature of the target variable, as in `\cref{fig:context-overview}`{=latex} (point `\contextpointer{2}`{=latex}). This includes patterns that cannot be inferred from the available numerical data, such as long-period seasonalities, or constraints on values, such as positivity.

```{=latex}
\vspace*{-0.1cm}
```
`\paragraphtight{Future information ($\cb_{F}$)}`{=latex} Information about the future behavior of the time series. For example, a scenario to be simulated as in `\cref{fig:context-overview}`{=latex} point `\contextpointer{3}`{=latex}, or expected events along with any entailed constraints, such as an inventory shortage restricting future sales amounts.

```{=latex}
\vspace*{-0.1cm}
```
`\paragraphtight{Historical information ($\cb_{H}$)}`{=latex} Information about the past behavior of the series that the available numerical history does not reveal. For example, statistics on past values of the series, as in `\cref{fig:context-overview}`{=latex} (point `\contextpointer{4}`{=latex}), or an explanation for spurious patterns that should be disregarded at inference, such as periodic anomalies caused by sensor maintenance.

```{=latex}
\vspace*{-0.1cm}
```
```{=latex}
\paragraphtight{Covariate information ($\cb_{\text{cov}}$)}{Information about additional variables that are statistically associated with the variable of interest. For example, a series correlated with the target values (as in \cref{fig:context-overview} point \contextpointer{5}).
}
```
```{=latex}
\vspace*{-0.1cm}
```
```{=latex}
\paragraphtight{Causal information ($\cb_{\text{causal}}$)}{Information about causal relationships between covariates and the target variable. For example, if the covariates are known to cause or are confounded with the target variable, as in \cref{fig:context-overview} point \contextpointer{6}.}
```
Finally, for completeness, `\cref{app:context_length}`{=latex} provides the distributions of lengths of the numerical historical data, prediction horizons and natural language context.

```{=latex}
\centering
```
![ Illustration of a CiK task annotated with types of natural language context: `\contextpointer{1}`{=latex} The short numerical history is misleading, suggesting an increasing trend. However, contextual information compensates and enables accurate forecasts: `\contextpointer{2}`{=latex} The intemporal information ($\cb_I$) reveals the nature of the series, implying a seasonal pattern with greater prevalence in the summer months due to weather. `\contextpointer{3}`{=latex} The future information ($\cb_{F}$) reveals that the series cannot continue its increasing trend. `\contextpointer{4}`{=latex} The historical information ($\cb_H$) complements the short history by providing high-level statistics on past values. `\contextpointer{5}`{=latex} The covariate information ($\cb_{\text{cov}}$) reveals an association with another quantity: field fires. Could its behavior impact future values of the target series? `\contextpointer{6}`{=latex} No, the causal information ($\cb_{\text{causal}}$) provides the answer. ](figures/context_types_big_font_2.png "fig:"){#fig:context-overview width="\\linewidth"} `\captionsetup{format=plain, labelfont=bf, belowskip=0pt}`{=latex} `\setlength{\belowcaptionskip}{-10pt}`{=latex} `\vspace*{-0.7cm}`{=latex}

#### Validating the Relevance of the Context

Related efforts on context-aided forecasting, outlined in `\cref{sec:related-work}`{=latex}, rely on scraping and/or LLMs to obtain natural language context [@zhang2023insight; @merrill2024language; @liu2024time; @emami2024syscaps]. In contrast, to ensure both the quality and relevance of tasks in CiK, we manually craft all contextual information and associated data transformations according to the procedure described in `\cref{app:context_crafting}`{=latex}. To validate the importance of the context, we subject each task to review by a panel of human and LLM evaluators who assess whether the context enables better forecasts. The results are overwhelmingly positive, with humans finding the context to be relevant for $95\%$ of evaluated instances (details in `\cref{subsec:llm-critique}`{=latex}).

```{=latex}
\begin{table*}[t]\caption{Results of selected models on the CiK benchmark. Starting from the left, the first column shows the RCRPS averaged over all tasks. The second column shows the rank of each method w.r.t. other models, averaged over all tasks. The remaining columns show the average RCRPS stratified by types of context (\cref{subsec:types-of-context}). All averages are weighted according to the scheme described in \cref{subsec:protocol} and accompanied by standard errors. Lower is better and the best averages are in bold.
An asterisk (*) denotes models that do not use natural language context.
For results on all models and with alternative aggregation strategies, see \cref{sec:addnl-res-sec}.
}
\label{table:main-results}
\centering
\resizebox{\textwidth}{!}{%
\begin{tabular}{lccccccc}
\toprule
 \textsc{Model} & \multirow{2}{*}{\shortstack{\textsc{Average} \\ \textsc{RCRPS}}} & \multirow{2}{*}{\shortstack{\textsc{Average} \\ \textsc{Rank}}} & \multirow{2}{*}{\shortstack{\textsc{Intemporal} \\ \textsc{Information}}} & \multirow{2}{*}{\shortstack{\textsc{Historical} \\ \textsc{Information}}} & \multirow{2}{*}{\shortstack{\textsc{Future} \\ \textsc{Information}}}  & \multirow{2}{*}{\shortstack{\textsc{Covariate} \\ \textsc{Information}}} & \multirow{2}{*}{\shortstack{\textsc{Causal} \\ \textsc{Information}}} \\
\\
\midrule
\multicolumn{7}{l}{\textsc{Direct Prompt} (ours)} \\
~~~\tablemodel{Llama-3.1-405B-Inst} & \textbf{0.159 $\pm$ 0.008} & \textbf{4.516 $\pm$ 0.233} & \textbf{0.174 $\pm$ 0.010} & 0.146 $\pm$ 0.001 & \textbf{0.075 $\pm$ 0.005} & \textbf{0.164 $\pm$ 0.010} & 0.398 $\pm$ 0.045 \\
~~~\tablemodel{Llama-3-70B-Inst} & 0.286 $\pm$ 0.004 & 7.803 $\pm$ 0.106 & 0.336 $\pm$ 0.006 & 0.180 $\pm$ 0.003 & 0.194 $\pm$ 0.006 & 0.228 $\pm$ 0.004 & 0.629 $\pm$ 0.019 \\
~~~\tablemodel{Mixtral-8x7B-Inst} & 0.523 $\pm$ 0.023 & 14.473 $\pm$ 0.147 & 0.723 $\pm$ 0.037 & 0.236 $\pm$ 0.002 & 0.241 $\pm$ 0.001 & 0.359 $\pm$ 0.028 & 0.875 $\pm$ 0.128 \\
~~~{\tablemodel{Qwen-2.5-7B-Inst}} & 0.290 $\pm$ 0.003 & 11.330 $\pm$ 0.253 & 0.290 $\pm$ 0.004 & 0.176 $\pm$ 0.003 & 0.287 $\pm$ 0.007 & 0.240 $\pm$ 0.002 & 0.525 $\pm$ 0.003 \\
~~~{\tablemodel{Qwen-2.5-0.5B-Inst}} & 0.463 $\pm$ 0.012 & 12.694 $\pm$ 0.173 & 0.609 $\pm$ 0.019 & 0.165 $\pm$ 0.004 & 0.218 $\pm$ 0.012 & 0.476 $\pm$ 0.015 & 0.429 $\pm$ 0.006 \\
~~~\tablemodel{GPT-4o} & 0.274 $\pm$ 0.010 & \textbf{4.381 $\pm$ 0.159} & 0.218 $\pm$ 0.007 & \textbf{0.118 $\pm$ 0.001} & 0.121 $\pm$ 0.001 & 0.250 $\pm$ 0.011 & 0.858 $\pm$ 0.053 \\
~~~\tablemodel{GPT-4o-mini} & 0.354 $\pm$ 0.022 & 9.056 $\pm$ 0.194 & 0.475 $\pm$ 0.035 & 0.139 $\pm$ 0.002 & 0.143 $\pm$ 0.002 & 0.341 $\pm$ 0.028 & 0.644 $\pm$ 0.128 \\
\midrule
\multicolumn{7}{l}{\textsc{LLMP}} \\
~~~\tablemodel{Llama-3-70B-Inst} & 0.539 $\pm$ 0.013 & 8.243 $\pm$ 0.231 & 0.438 $\pm$ 0.017 & 0.516 $\pm$ 0.028 & 0.847 $\pm$ 0.024 & 0.546 $\pm$ 0.016 & 0.392 $\pm$ 0.028 \\
~~~\tablemodel{Llama-3-70B} & 0.236 $\pm$ 0.006 & 6.522 $\pm$ 0.244 & 0.212 $\pm$ 0.005 & 0.121 $\pm$ 0.008 & 0.299 $\pm$ 0.017 & 0.193 $\pm$ 0.004 & \textbf{0.360 $\pm$ 0.011} \\
~~~\tablemodel{Mixtral-8x7B-Inst} & 0.264 $\pm$ 0.004 & 8.519 $\pm$ 0.264 & 0.242 $\pm$ 0.007 & 0.173 $\pm$ 0.004 & 0.324 $\pm$ 0.005 & 0.219 $\pm$ 0.005 & 0.437 $\pm$ 0.007 \\
~~~\tablemodel{Mixtral-8x7B} & 0.262 $\pm$ 0.008 & 8.540 $\pm$ 0.198 & 0.250 $\pm$ 0.008 & \textbf{0.119 $\pm$ 0.003} & 0.310 $\pm$ 0.019 & 0.229 $\pm$ 0.006 & 0.457 $\pm$ 0.011 \\
~~~\tablemodel{Qwen-2.5-7B-Inst} & 1.974 $\pm$ 0.027 & 18.443 $\pm$ 0.276 & 2.509 $\pm$ 0.044 & 2.857 $\pm$ 0.056 & 1.653 $\pm$ 0.008 & 1.702 $\pm$ 0.035 & 1.333 $\pm$ 0.144 \\
~~~\tablemodel{Qwen-2.5-7B} & 0.910 $\pm$ 0.037 & 16.051 $\pm$ 0.341 & 1.149 $\pm$ 0.047 & 1.002 $\pm$ 0.053 & 0.601 $\pm$ 0.071 & 0.639 $\pm$ 0.047 & 0.928 $\pm$ 0.129 \\
~~~\tablemodel{Qwen-2.5-0.5B-Inst} & 1.937 $\pm$ 0.024 & 20.136 $\pm$ 0.191 & 2.444 $\pm$ 0.038 & 1.960 $\pm$ 0.063 & 1.443 $\pm$ 0.010 & 1.805 $\pm$ 0.030 & 1.199 $\pm$ 0.129 \\
~~~\tablemodel{Qwen-2.5-0.5B} & 1.995 $\pm$ 0.024 & 19.686 $\pm$ 0.275 & 2.546 $\pm$ 0.039 & 2.083 $\pm$ 0.052 & 1.579 $\pm$ 0.015 & 1.821 $\pm$ 0.030 & 1.225 $\pm$ 0.128 \\
\midrule
\multicolumn{7}{l}{{\textsc{Multimodal Models}}} \\
~~~\tablemodel{UniTime} & 0.370 $\pm$ 0.001 & 14.675 $\pm$ 0.091 & 0.457 $\pm$ 0.002 & 0.155 $\pm$ 0.000 & 0.194 $\pm$ 0.003 & 0.395 $\pm$ 0.001 & 0.423 $\pm$ 0.001 \\
~~~\tablemodel{Time-LLM (ETTh1)} & 0.476 $\pm$ 0.001 & 17.932 $\pm$ 0.075 & 0.518 $\pm$ 0.002 & 0.183 $\pm$ 0.000 & 0.403 $\pm$ 0.002 & 0.441 $\pm$ 0.001 & 0.482 $\pm$ 0.001 \\
\midrule
\multicolumn{7}{l}{\textsc{TS Foundation Models}*} \\
~~~\tablemodel{Lag-Llama} & 0.327 $\pm$ 0.004 & 13.370 $\pm$ 0.233 & 0.330 $\pm$ 0.005 & 0.167 $\pm$ 0.005 & 0.292 $\pm$ 0.009 & 0.294 $\pm$ 0.004 & 0.495 $\pm$ 0.014 \\
~~~\tablemodel{Chronos-Large} & 0.326 $\pm$ 0.002 & 12.298 $\pm$ 0.148 & 0.314 $\pm$ 0.002 & 0.179 $\pm$ 0.003 & 0.379 $\pm$ 0.003 & 0.255 $\pm$ 0.002 & 0.460 $\pm$ 0.004 \\
~~~\tablemodel{TimeGEN} & 0.353 $\pm$ 0.000 & 15.047 $\pm$ 0.095 & 0.332 $\pm$ 0.000 & 0.177 $\pm$ 0.000 & 0.405 $\pm$ 0.000 & 0.292 $\pm$ 0.000 & 0.474 $\pm$ 0.000 \\
~~~\tablemodel{Moirai-Large}  & 0.520 $\pm$ 0.006 & 12.873 $\pm$ 0.263 & 0.596 $\pm$ 0.009 & 0.140 $\pm$ 0.001 & 0.431 $\pm$ 0.002 & 0.499 $\pm$ 0.007 & 0.438 $\pm$ 0.011 \\
\midrule
\multicolumn{7}{l}{\textsc{Statistical Models}*} \\
~~~\tablemodel{ARIMA} & 0.475 $\pm$ 0.006 & 12.721 $\pm$ 0.167 & 0.557 $\pm$ 0.009 & 0.200 $\pm$ 0.007 & 0.350 $\pm$ 0.003 & 0.375 $\pm$ 0.006 & 0.440 $\pm$ 0.011 \\
~~~\tablemodel{ETS} & 0.530 $\pm$ 0.009 & 15.001 $\pm$ 0.198 & 0.639 $\pm$ 0.014 & 0.362 $\pm$ 0.014 & 0.315 $\pm$ 0.006 & 0.402 $\pm$ 0.010 & 0.508 $\pm$ 0.017 \\
~~~\tablemodel{Exp-Smoothing} & 0.605 $\pm$ 0.013 & 15.689 $\pm$ 0.152 & 0.702 $\pm$ 0.020 & 0.493 $\pm$ 0.016 & 0.397 $\pm$ 0.006 & 0.480 $\pm$ 0.015 & 0.827 $\pm$ 0.060 \\
\bottomrule
\end{tabular}
}
\vspace*{-0.2cm}
\end{table*}
```
```{=latex}
\vspace{-0.4cm}
```
### Region of Interest CRPS

```{=latex}
\vspace{-0.1cm}
```
`\label{subsec:evaluation}`{=latex} Alongside the tasks, we introduce the Region of Interest Continuous Ranked Probability Score (RCRPS), a novel proper scoring rule designed specifically for context-aided probabilistic forecasting. It extends the Continuous Ranked Probability Score (CRPS; @gneiting2007strictly), a proper scoring rule that assesses the quality of the entire predictive distribution rather than of point forecasts. Since it is based on the CRPS, the RCRPS can be computed using only samples from the predictive distribution, so it remains applicable even when closed-form distributions are unavailable. The RCRPS augments the CRPS with two key components: a *region of interest* and a measure of *constraint satisfaction*.

```{=latex}
\vspace*{-0.1cm}
```
`\paragraphtight{Region of interest (RoI):}`{=latex} The RCRPS assigns more weight to errors in a task's RoI, which is a subset of time steps $\Ical \subseteq [t\!+\!1, \dots, T]$ for which the context is particularly relevant. For example, in the ATM task from `\cref{subsec:types-of-context}`{=latex} (visualized in `\cref{app:atmexample}`{=latex}), the RoI denotes the time steps during which the ATM is expected to be unavailable. In other tasks, such as those in `\cref{fig:key-example,fig:context-overview}`{=latex}, where the context informs the value of all future time points, the RCRPS assigns equal weights to all time steps (for readability, we report the definition of RCRPS for this special case in `\Cref{app:metric}`{=latex}).

```{=latex}
\vspace*{-0.1cm}
```
`\paragraphtight{Constraint satisfaction:}`{=latex} The RCRPS penalizes constraint violations according to a task-specific function $v_\Cb$ whose value is positive for any trajectory that violates the constraints. Concrete examples are given in `\cref{app:constraint_functions}`{=latex}.

Given an inferred forecast distribution $\widetilde{\Xb}_F$ and a ground truth $\xb_F$, the scoring rule is defined as:

```{=latex}
\vspace*{-0.3cm}
```
```{=latex}
\scalebox{0.85}{
\parbox{\linewidth}{
\begin{multline*}
\text{RCRPS}(\widetilde{\Xb}_F, \xb_F) \;\eqdef\; \alpha \,\cdot
\Biggl[
    \frac{1}{2 |\mathcal{I}|} \sum_{i \in \mathcal{I}} \text{CRPS}\!\bigl(\widetilde{X}_i, x_i\bigr) +
\\
    \qquad\qquad
    \frac{1}{2 |\neg \mathcal{I}|} \sum_{i \in \neg \mathcal{I}} \text{CRPS}\!\bigl(\widetilde{X}_i, x_i\bigr)
    \;+\; \beta \,\cdot\, \text{CRPS}\!\bigl(v_\Cb(\widetilde{\Xb}_F), 0\bigr)
\Biggr],
\end{multline*}
}}
```
```{=latex}
\vspace*{-0.3cm}
```
where the terms respectively account for the CRPS inside the RoI, the CRPS outside of the RoI, and the constraint violation penalty. The last term, which is inspired by the threshold-weighted CRPS of @gneiting2011threshold, vanishes when all constraints are satisfied. The $\alpha$ term is a task-dependent normalization factor to make the RCRPS scale-independent, which enables fair RCRPS aggregation across tasks; its calculation is described in `\cref{app:metric-scaling}`{=latex}. Finally, $\beta$ is a scaling factor that controls the impact of constraint violation on the score; we use $\beta = 10$ in our experiments. We refer the reader to `\cref{app:metric}`{=latex} for additional details.
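The score above can be computed directly from forecast samples. Below is a minimal sketch that mirrors the three terms of the definition, using the standard empirical CRPS estimator $\mathbb{E}|X - x| - \tfrac{1}{2}\mathbb{E}|X - X'|$ and taking $\alpha$ as given (its task-dependent calculation is described in the appendix); function names and signatures are ours, not the benchmark's API.

```python
import numpy as np

def crps_from_samples(samples, x):
    """Empirical CRPS estimate: E|X - x| - 0.5 * E|X - X'|."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - x))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

def rcrps(forecast_samples, target, roi_mask, violation_fn, alpha=1.0, beta=10.0):
    """Sample-based RCRPS sketch.

    forecast_samples: (n_samples, horizon) array of forecast trajectories.
    target: (horizon,) ground-truth values.
    roi_mask: boolean (horizon,) array marking the region of interest.
    violation_fn: maps one trajectory to a scalar, positive iff a constraint
                  is violated (the benchmark's v_C is task-specific).
    """
    roi_mask = np.asarray(roi_mask, dtype=bool)
    roi, non_roi = np.flatnonzero(roi_mask), np.flatnonzero(~roi_mask)
    # CRPS averaged inside and outside the RoI, each weighted 1/2
    crps_in = np.mean([crps_from_samples(forecast_samples[:, i], target[i])
                       for i in roi]) if len(roi) else 0.0
    crps_out = np.mean([crps_from_samples(forecast_samples[:, i], target[i])
                        for i in non_roi]) if len(non_roi) else 0.0
    # Constraint penalty: CRPS of the violation amounts against zero
    violations = np.array([violation_fn(traj) for traj in forecast_samples])
    penalty = crps_from_samples(violations, 0.0)
    return alpha * (0.5 * crps_in + 0.5 * crps_out + beta * penalty)
```

A forecast whose samples all equal the ground truth and satisfy all constraints scores exactly zero, as expected from the definition.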

```{=latex}
\vspace{-0.35cm}
```
### Experiments and Results {#sec: results}

```{=latex}
\vspace{-0.05cm}
```
In this section, we define our evaluation protocol (`\cref{subsec:protocol}`{=latex}) and outline the models that we evaluate on CiK (`\cref{subsec:baselines}`{=latex}). We then present results on the benchmark (`\cref{subsec:mainresults}`{=latex}), along with an analysis of factors affecting model performance. Finally, we look at areas for improvement by analyzing forecasting errors (`\cref{subsec:areas-of-improvement}`{=latex}) and inference cost (`\cref{subsec:inference_cost}`{=latex}). `\vspace{-0.25cm}`{=latex}

#### Evaluation Protocol {#subsec:protocol}

```{=latex}
\vspace{-0.1cm}
```
Each task in CiK has many unique instances arising from the selection of time series and windows in the associated numerical data, as well as minor variations in natural language context. We deterministically sample five instances of each task to make the evaluation reproducible and affordable. For every instance, we generate 25 independent forecasts per model for evaluation. Many tasks in the benchmark are similar because they draw on the same data sources or use variants of the same context. We therefore identify clusters of similar tasks and design a weighting scheme such that each cluster has equal total weight in our aggregate score (see `\cref{app:weighting-scheme}`{=latex} for more details).
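The cluster-based aggregation amounts to averaging within each cluster, then averaging across clusters. A minimal sketch, assuming the cluster assignment is given (the actual clusters are defined in the appendix, and names here are illustrative):

```python
from collections import defaultdict

def cluster_weighted_average(task_scores, task_cluster):
    """Aggregate per-task scores so each cluster has equal total weight:
    a task's effective weight is 1 / (n_clusters * cluster_size)."""
    clusters = defaultdict(list)
    for task, score in task_scores.items():
        clusters[task_cluster[task]].append(score)
    # mean within each cluster, then mean across clusters
    per_cluster = [sum(s) / len(s) for s in clusters.values()]
    return sum(per_cluster) / len(per_cluster)
```

For example, with scores `{"a": 0.0, "b": 2.0}` in one cluster and `{"c": 4.0}` in another, the aggregate is the mean of 1.0 and 4.0, i.e., 2.5, rather than the unweighted mean 2.0.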

```{=latex}
\vspace{-0.25cm}
```
#### Methods

```{=latex}
\vspace{-0.1cm}
```
`\label{subsec:baselines}`{=latex}

We evaluate a wide variety of models including methods based on LLMs, state-of-the-art numerical time series foundation models (TSFMs) and classical statistical forecasting methods. Since CiK is meant to be an evaluation benchmark, it does not have a corresponding training set. We therefore only evaluate models that can produce forecasts directly based on the history of a series. This includes LLMs and TSFMs that support zero-shot inference, and traditional statistical models that can be fit directly to the history of a given instance. We outline these methods below and refer the reader to `\cref{app:models}`{=latex} for additional details.

`\paragraphtight{LLM-based Forecasters:}`{=latex} We consider two prompt-based approaches. We propose \`\``\directprompt`{=latex}", a simple approach where we instruct the model to directly output a forecast, in a structured format, for all required timestamps (see `\cref{subsec:direct-prompt}`{=latex} for more details). We also evaluate LLM Processes (`\llmp`{=latex}; @requeima2024llm), a method which autoregressively prompts the LLM multiple times to output a forecast. For each of these, we evaluate a variety of LLMs with diverse architectures and sizes, such as `\model{GPT-4o}`{=latex}, `\model{GPT-4o-mini}`{=latex} [@achiam2023gpt], `\model{Qwen-2.5-\{0.5B, 1.5B, 7B\}}`{=latex}  [@yang2024qwen2], `\model{Mixtral-8x7B}`{=latex} [@jiang2024mixtralexperts], `\model{Llama-3-\{8B, 70B\}}`{=latex} [@dubey2024llama], `\model{Llama-3.1-405B}`{=latex} [@dubey2024llama].[^1] We also evaluate multimodal forecasting models, `\model{ChatTime}`{=latex} [@wang2025chattime] (zero-shot), `\model{UniTime}`{=latex} [@liu2024unitime] and `\model{\timellm}`{=latex} [@jin2024timellm] (trained according to their respective authors' guidelines) (details in `\cref{app:unitimellm}`{=latex}). For all of these approaches, inference is performed zero-shot on the benchmark and we compare their performance with and without the natural language context.
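To make the contrast between the two prompting strategies concrete, here is a hypothetical sketch of a Direct Prompt-style request; the benchmark's exact prompt template is given in its appendix, and the wording, function names, and JSON reply format below are our own assumptions.

```python
import json

def build_direct_prompt(context, history, future_timestamps):
    """Hypothetical one-shot prompt: the model is asked to return a structured
    forecast for every required timestamp at once (unlike LLMP, which queries
    the model autoregressively, one value at a time)."""
    lines = ["Background information:", context, "", "History (timestamp, value):"]
    lines += [f"({t}, {v})" for t, v in history]
    lines += ["", "Forecast the value at each timestamp below.",
              "Reply with a JSON object mapping timestamp to value, and nothing else:"]
    lines += list(future_timestamps)
    return "\n".join(lines)

def parse_direct_prompt_reply(reply):
    """Parse the structured reply; assumes the model returned valid JSON."""
    return json.loads(reply)
```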

`\paragraphtight{Quantitative Forecasting Models:}`{=latex} We also evaluate a number of models that process only numerical data, but cannot integrate natural language. We evaluate `\model{Exponential Smoothing}`{=latex} (@expsmoothing), `\model{ETS}`{=latex} [@hyndman2008forecasting], and `\model{ARIMA}`{=latex} [@box2015time], three simple, but time-tested statistical approaches. We also evaluate four state-of-the-art TSFMs: `\model{Lag-Llama}`{=latex} [@rasul2023lag], `\model{Chronos}`{=latex} [@ansari2024chronos], `\model{Moirai}`{=latex} [@woo2024unified] and `\model{TimeGEN}`{=latex} [@garza2023timegpt]. We fit `\model{Exponential Smoothing}`{=latex}, `\model{ETS}`{=latex} and `\model{ARIMA}`{=latex} to the individual numerical history of each task instance, which is the same input that the TSFMs process to produce forecasts zero-shot.

#### Results on CiK {#subsec:mainresults}

`\cref{table:main-results}`{=latex} shows our main results. At a high level, we observe that the best-performing methods combine pretrained LLMs with prompting strategies like `\directprompt `{=latex}and `\llmp`{=latex}, with a bias towards the largest models. In terms of RCRPS, `\model{Llama-3.1-405B-Inst}`{=latex} (`\directprompt`{=latex}) significantly outperforms all of its counterparts. `\model{GPT-4o}`{=latex} (`\directprompt`{=latex}) performs worse with respect to RCRPS, but compares favorably in terms of average rank. This discrepancy is due to strong failures on some of the tasks, which we discuss in `\cref{subsec:areas-of-improvement}`{=latex}. Other models like `\model{Llama-3-70B}`{=latex} (`\llmp`{=latex}), `\model{Mixtral-8x7B-Inst}`{=latex} (`\llmp`{=latex}) and `\model{Mixtral-8x7B}`{=latex} (`\llmp`{=latex}) are on par with `\model{GPT-4o}`{=latex} (`\directprompt`{=latex}), `\model{Llama-3-70B-Inst}`{=latex} (`\directprompt`{=latex}) and `\model{Qwen-2.5-7B-Inst}`{=latex} (`\directprompt`{=latex}) in terms of RCRPS. Interestingly, all of these methods outperform `\model{UniTime}`{=latex} and `\model{Time-LLM}`{=latex}, which also rely on LLMs (GPT-2 & LLaMA-7B). We discuss this gap in `\cref{app:why-unitime}`{=latex}. Finally, as emphasized in `\cref{fig:llm_wins}`{=latex}, we observe that the best-performing LLM methods significantly outperform purely quantitative models. In what follows, we examine various aspects of these results (and refer to `\cref{sec:addnl-res-sec}`{=latex} for additional results).

```{=latex}
\centering
```
![Proportion of tasks for which LLM-based methods outperform the 7 quantitative forecasting methods (see `\cref{subsec:baselines}`{=latex}). A method is considered to outperform another on a task if its average RCPRS is lower on said task. Results are shown for variants that use (left) and do not use (right) the natural language context. A full green bar would indicate that the method is better on all tasks, whereas a full red bar would indicate that it is worse everywhere. Tasks are weighted according to `\cref{subsec:protocol}`{=latex}. ](figures/llm_wins_against_pure_numerical_methods.png "fig:"){#fig:llm_wins width="\\linewidth"} `\vspace*{-0.6cm}`{=latex}

```{=latex}
\vspace{-0.65cm}
```
```{=latex}
\paragraphtight{Explaining the performance of LLM-based approaches:}
```
The strong performance of LLM-based methods could be due to two factors: (i) properly leveraging the natural language context and (ii) being more proficient at numerical forecasting. We thus aim to disentangle their contributions.

On the one hand, `\cref{fig:context-improvement}`{=latex} shows clear evidence that methods with access to the context improve their forecasts. For example, `\model{Llama-3.1-405B-Inst}`{=latex} (`\directprompt`{=latex}) improves by $67.1\%$ with context. We find these differences statistically significant across many of the models (see `\cref{app:stat-signif}`{=latex}). This is reflected in the quality of the example forecasts in `\cref{app:viz-success-forecasts}`{=latex}, where we observe clear improvements in regions of interest, as well as improved constraint satisfaction. Other models show lesser improvements and, in some cases, even a degradation in performance. Our analysis in `\cref{subsec:areas-of-improvement}`{=latex} shows that this can be explained either by the context being ignored, or by significant failures in using context, worsening overall performance.

On the other hand, `\cref{fig:llm_wins}`{=latex} (right) shows that LLM-based forecasters, when evaluated *without context*, no longer dominate the quantitative forecasting models. However, some LLM-based forecasters remain surprisingly competitive. For instance, multiple `\model{Llama-3}`{=latex} (`\llmp`{=latex}) models outperform at least $5$ of the quantitative models on the majority of tasks. The extended results in `\cref{app:extended-aggregate-results-on-all-models}`{=latex} further substantiate this. In contrast, models such as `\model{Llama-3.1-405B-Inst}`{=latex} (`\directprompt`{=latex}) and `\model{GPT-4o}`{=latex} (`\directprompt`{=latex}) show significantly weaker forecasting performance without context. This suggests that such models are especially preferable in cases where context is available. This is also reflected in their aggregate scores without context (in `\cref{table:full-table-results-appdx}`{=latex}).

```{=latex}
\paragraphtight{Comparing \textnormal{\llmp} and \textnormal{\directprompt}:}
```
`\cref{fig:llm_wins}`{=latex} (right) shows that without context, `\llmp `{=latex}models exhibit stronger numerical forecasting performance than `\directprompt `{=latex}models. This advantage likely stems from `\llmp`{=latex}'s closer alignment with the forecasting task: `\llmp `{=latex}simply prompts the LLM to autoregressively predict the next value in the time series, a task well suited to base models with no instruction tuning. In contrast, `\directprompt `{=latex}requires forecasts to be output in a specific structure, which makes the task comparatively more difficult.

This line of reasoning leads us to another observation on the impact of instruction tuning: as reflected in `\cref{table:main-results}`{=latex} and `\cref{fig:llm_wins}`{=latex}, instruction tuning generally appears to degrade `\llmp `{=latex}performance. `\model{Llama-3}`{=latex} models show a roughly twofold increase in RCRPS after instruction tuning, a behavior previously observed by @gruver2024large. Interestingly, instruction tuning does not degrade the performance of `\model{Mixtral-8x7B}`{=latex}. Finally, whereas `\llmp `{=latex}mostly suffers from instruction tuning, `\directprompt `{=latex}requires forecasts to be produced in a specific structure, a skill at which instruction-tuned models might be better (see `\cref{app:instr-tune-dir-prompt}`{=latex} for details).

```{=latex}
\paragraphtight{No Method Excels Across All Context Types:}
```
`\cref{table:main-results}`{=latex} shows that some methods effectively leverage the provided contextual information to produce better forecasts. However, no single method is the best across context types. `\model{Llama-3.1-405B-Inst}`{=latex} (`\directprompt`{=latex}), the top-performing method on average, outperforms its counterparts on only 3 out of 5 context types. This finding indicates that the benchmark remains unsolved, leaving significant room for advancements from the research community.

```{=latex}
\centering
```
![RCRPS with and without context (log scale, lower is better). Full bars show performance with context; striped bars show performance without context. In general, larger models outperform smaller models and benefit much more from context. `\directprompt `{=latex}models all improve with context, other than `\model{Qwen-2.5-0.5B-Instruct}`{=latex}. For `\llmp`{=latex}, larger models benefit from context, but smaller models fail to do so and perform worse in general. ](figures/icml_fixed/model_performance_two_bar_icml_Performance_with_Context.png "fig:"){#fig:context-improvement width="0.95\\linewidth"} `\vspace*{-3mm}`{=latex}

```{=latex}
\vspace{-1.5em}
```
```{=latex}
\vspace*{-0.3cm}
```
#### Error Analysis {#subsec:areas-of-improvement}

```{=latex}
\vspace*{-0.1cm}
```
Foundation models are known to make mistakes or return factually inaccurate information [@bommasani2021opportunities]. We find that models occasionally return forecasts that miss the ground truth by a large margin. We use the term *significant failure* to denote a forecast that over- or undershoots by at least $500\%$ of the range of the ground truth; we clip the RCRPS of such instances to 5 to avoid them disproportionately skewing the aggregate score. Despite this, such significant failures impact the results in `\cref{table:main-results}`{=latex}: `\model{GPT-4o}`{=latex} with `\directprompt`{=latex}, while emerging as a top performer in most tasks (as reflected in its average rank), has a significantly higher aggregate RCRPS than models ranked worse, such as `\model{Mixtral-8x7B}`{=latex} with `\llmp`{=latex}. As an example, `\model{GPT-4o}`{=latex} with `\directprompt `{=latex}fails significantly in a task with a context involving scientific notation (see `\cref{fig:dp-gpt4-failure-1}`{=latex} and more examples in `\cref{app:viz-failure-forecasts}`{=latex}). Notably, while a model may generally achieve a high win rate, a few significant failures can dominate its aggregate performance, as observed for `\model{GPT-4o}`{=latex} with `\directprompt`{=latex}. In the case of `\model{Qwen-2.5-0.5B-Inst}`{=latex} with `\directprompt`{=latex}, this leads to an aggregate RCRPS that is worse with context than without. We analyze this in detail in `\cref{app:dp_mixtral_results}`{=latex}. These findings underscore the need for future work to develop more robust models that can handle context effectively while avoiding significant failures.
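The failure criterion and the clipping rule can be sketched as follows. This is our reading of the criterion stated above (deviation of at least 500% of the ground-truth range, clipped to an RCRPS of 5), not the benchmark's exact implementation:

```python
import numpy as np

def is_significant_failure(forecast, target, factor=5.0):
    """A trajectory is a significant failure if it over- or undershoots the
    ground truth by at least 500% of the ground-truth range."""
    spread = float(np.max(target) - np.min(target))
    too_high = np.any(forecast > np.max(target) + factor * spread)
    too_low = np.any(forecast < np.min(target) - factor * spread)
    return bool(too_high or too_low)

def clip_rcrps(score, cap=5.0):
    """Clip the per-instance RCRPS so a few significant failures cannot
    disproportionately skew the aggregate score."""
    return min(float(score), cap)
```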

```{=latex}
\vspace{-0.25cm}
```
#### Inference Cost {#subsec:inference_cost}

```{=latex}
\vspace{-0.1cm}
```
`\label{paragraph:cost}`{=latex}

```{=latex}
\centering
```
```{=latex}
\raisebox{20pt}{\includegraphics[width=\linewidth]{figures/icml_fixed/pareto_front_new_shapes.pdf}}
```
```{=latex}
\vspace{-1.5cm}
```
```{=latex}
\vspace{-0.3cm}
```
`\label{fig:cost}`{=latex}

A key practical aspect for forecasting applications is the inference time of models and their associated cost. `\cref{fig:cost}`{=latex} shows that, while `\model{Llama-3.1-405B-Instruct}`{=latex} performs the best on average, it comes at the cost of a significantly higher parameter count than the quantitative forecasters. This emphasizes that while LLMs can be powerful context-aware forecasters, they come with steep computational costs, which highlights the need for efficient models that balance accuracy and resource demands. Also of note is that many LLM-based forecasters are Pareto-dominated by quantitative forecasters such as `\model{Lag-Llama}`{=latex} and `\model{Chronos}`{=latex}. This suggests that, beyond the ability to process text, a careful choice of prompting strategy and LLM is crucial for Pareto efficiency. Further, due to their high parameter count, the LLMs have inference times that are orders of magnitude longer than quantitative forecasters, which are far more efficient for sustained usage (see `\cref{fig:cost-inference-time}`{=latex} for a comparison). LLMs require significant computational power, making them ill-suited to real-world forecasting at scale, where speed and cost matter. However, additional research could improve the cost-effectiveness of context-aided forecasting models to match that of traditional models.

```{=latex}
\vspace{-0.35cm}
```
### Related Work {#sec:related-work}

```{=latex}
\vspace{-0.1cm}
```
We review two streams of related work: (i) works that introduce related benchmarks and datasets, and (ii) works that repurpose LLMs for forecasting. `\paragraphtight{Benchmarks and Datasets}`{=latex} @merrill2024language present a benchmark designed to evaluate LLMs' ability to reason about time series, with context-aided forecasting as one assessed capability. They focus on purely synthetic time series, which may not accurately reflect real-world dynamics, whereas our benchmark is based primarily on real-world data. Further, their evaluation is limited to point forecasting metrics, which do not measure the quality of the full forecast distribution. In contrast, we adopt probabilistic forecasting metrics, e.g., the continuous ranked probability score (CRPS; *cf.* @gneiting2007strictly), to assess the quality of entire forecast distributions. Other related datasets include Time-MMD [@liu2024time], which integrates text extracted from reports and web searches, TGTSF [@xu2024beyond], which incorporates information such as weather reports and news articles, SysCaps [@emami2024syscaps], which includes LLM-generated descriptions of building energy consumption systems, TS-Insights [@zhang2023insight], which includes LLM-generated descriptions of trends and seasonalities, and Dual-Forecaster [@wu2025dualforecaster], in which time series are captioned with trend and seasonality information. Several works [@sawhney2021fast; @liu2024echo; @wang2024from; @wang2024chattime; @kim2024multi] automatically construct datasets of paired textual and numerical information.
The key distinction between these works and ours lies in the importance of the textual information: while in the above works, the text is not guaranteed to be essential for high-quality forecasts, in CiK, all tasks are handcrafted to ensure that accurate forecasts *cannot be achieved* without relying on the provided natural language context, thereby making it a high-quality evaluation benchmark for context-aided forecasting.

```{=latex}
\paragraphtight{Repurposing LLMs for Forecasting}
```
A natural approach to context-aided forecasting is to build methods based on LLMs. @xue2023promptcast showed that forecasting could be framed as a question-answering problem. Subsequently, @gruver2024large and @requeima2024llm showed that LLMs could generate accurate forecasts with sequence completion, and that textual side-information could be used to influence forecasts. However, their analysis is limited to illustrative examples rather than a comprehensive evaluation. Some works have explored the ability of LLMs to reason about time series [@chow2024towards; @kong2025position; @aksu2024xforecast; @potosnak2024implicit; @ye2024beyond]. Other approaches have incorporated time series into pretrained LLMs [@jin2024timellm; @liu2024unitime; @zhang2024dualtime] by introducing special tokens used to represent patched time series patterns, or by modifying their encoders to account for time series data [@GPT4MTS_Jia_Wang_Zheng_Cao_Liu_2024]. While these methods show promising results, their evaluations primarily rely on datasets where the contextual information is not guaranteed to improve forecasts over numerical data alone. As a result, it remains unclear whether their performance stems from accurate numerical forecasting or from effectively incorporating context; this shortcoming motivates our investigation into this question.

```{=latex}
\vspace{-0.35cm}
```
### Discussion

```{=latex}
\vspace{-0.1cm}
```
In this work, we propose the Context is Key (CiK) benchmark: a collection of forecasting tasks that require processing historical data with essential natural language context. We evaluate a range of models on CiK, including our proposed LLM prompting method, `\directprompt`{=latex}, which achieves the best aggregate performance. We analyze and discuss the failure modes of these models, and our findings underscore the critical role of contextual information in improving forecasts, while also revealing both the unexpected strengths and notable limitations of the investigated LLM-based forecasters. `\paragraphtight{Limitations:}`{=latex} While our benchmark provides valuable insights into the integration of contextual information in time series forecasting, it is important to acknowledge its limitations. Our study restricts context to the natural language modality and excludes multivariate time series scenarios. Although we deliberately designed the tasks to assess how well forecasting models can integrate contextual information, our benchmark does not evaluate whether models can leverage latent relationships that might elude human observation. While we have taken steps to mitigate memorization concerns, as discussed in `\cref{subsec:data-sources}`{=latex}, achieving absolute certainty in this regard is challenging without strictly held-out data. Finally, the performance of dataset-specific methods such as ChatTime [@wang2024chattime], UniTime [@liu2024unitime] and Time-LLM [@jin2024timellm] may improve in the presence of a dataset-specific training set.

`\paragraphtight{Future work:}`{=latex} There are several promising avenues for future work. Extensions to CiK could include multivariate forecasting tasks, or tasks that incorporate other modalities such as images, databases or spatiotemporal data. Tasks that deliberately challenge context length limitations, probe specific weaknesses of language models or include domain-specific expert knowledge would also be valuable additions. More research is also needed to better understand what drives the catastrophic failures that LLMs exhibit. The analysis of catastrophic failures could benefit from searching for systematic failure patterns, or failures associated with specific linguistic patterns. Training datasets for context-aided forecasting would enable a better evaluation of dataset-specific methods. In fact, methods to improve the automatic generation of large, high-quality datasets for context-aided forecasting could complement CiK. Furthermore, this motivates research into developing more accurate and efficient multimodal forecasting models, which our benchmark is well-positioned to support. Other avenues of research include exploring different input/output structures for forecasting with LLMs, finetuning specialized LLMs for context-aided forecasting, allowing models to scale test-time computation, and compressing the context to reduce the required amount of computation. Lastly, as models become more robust, they could be integrated into agentic systems with conversational interfaces, allowing forecasts to be augmented with human expertise and automatically retrieved information. Such advancements would represent a significant step toward automating and democratizing access to powerful forecasting tools.

### Impact Statement {#impact-statement .unnumbered}

This paper's goal is to advance the development of forecasting methods that can leverage contextual information, which can increase the likelihood that such methods will be adopted by various organizations. We therefore outline the potential positive and negative impacts of such methods. They could increase forecasting accuracy and efficiency, improving the ability to anticipate future outcomes and plan accordingly. A secondary effect is an increased democratization of access to high-quality forecasts. On the flip side, this democratization could increase reliance on automated methods for decision-making purposes.

### Acknowledgements {#acknowledgements .unnumbered}

The authors are grateful to Andrei Dinin, Christian Hudon, Ethan Honey, Gabrielle Gauthier Melançon, Ghazwa Darwiche, Kiarash Mohammadi, Léo Boisvert, Loic Mandine, Megh Vipul Thakkar, Oluwanifemi Isaac Bamgbose, Orlando Marquez, Raymond Li, Thibault Le Sellier de Chezelles and Thomas Lai for participating in the human study on the relevance of context. The authors are grateful to Midan Kim, Torsten Scholak, Mohammad Reza Samsami, Oussama Boussif and Can Chen for their valuable feedback and suggestions. This research was supported by Mitacs Accelerate Grants and enabled by compute resources provided by ServiceNow Research and the Frontier supercomputer. The latter resources were awarded through the Frontier DD allocation and INCITE 2023 program for the project \`\`Scalable Foundation Models for Transferable Generalist AI" and were supplied by the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, with support from the Office of Science of the U.S. Department of Energy.

```{=latex}
\bibliographystyle{icml2025}
```
```{=latex}
\onecolumn
```
```{=latex}
\newpage
```
```{=latex}
\appendix
```
```{=latex}
\addcontentsline{toc}{section}{Appendix}
```
Appendix
========

`\parttoc `{=latex}

### Additional Details on the Benchmark

#### Data Sources {#app:data-sources}

We list here the domains and the respective sources of time series data we use in the various tasks in the CiK benchmark. We also show the number of tasks that use each source's data and list any memorization mitigation strategies used for each dataset.

-   **Traffic** (11 tasks):

    -   **Traffic occupancy rate**: We use traffic occupancy rate (%) data from the California Performance Measurement System (PeMS) [@chen2001freeway], at an hourly frequency. This dataset contains a total of 446 series.

        -   As this is a live dataset (updated frequently), we use data from 2024 (i.e. data after the cutoff dates of LLMs used) and do not apply any memorization mitigation strategy.

-   **Climatology** (12 tasks):

    -   **Solar irradiance and cloud cover data** (9 tasks): We use solar irradiance and cloud cover data for the Americas in 2022 [@sengupta2018national], with frequency either 10 minutes or hourly. We extract a subset of 45 series from this dataset for the benchmark.

        -   To mitigate memorization, we shift the dates by one day ahead.

    -   **Solar photovoltaic power production** (3 tasks): Time series reflecting solar power production in Alabama during 2006 [@godahewa2021monash], at a 10-minute frequency. This dataset contains a total of 137 series, but our tasks only use a single aggregated series generated from them.

        -   To mitigate memorization, we add Gaussian noise to the data with a standard deviation of 3% of the standard deviation of the data in each respective sampled window.

-   **Public Safety** (26 tasks):

    -   **Fire Department Intervention Logs**: Logs of the number of interventions carried out by the Montreal Fire Department due to the occurrence of various kinds of incidents (such as trash fires, field fires, nautical accidents, and bike accidents) [@ville_de_montreal_2020]. The data was processed from a raw log and aggregated to a monthly frequency. This dataset contains a total of 48 series.

        -   Due to it being processed, we do not apply any special memorization mitigation strategy on top.

-   **Mechanics** (3 tasks):

    -   **Causal Chambers**: Experimental data collected from the wind tunnel physical system from @gamella2024chamber, released in April 2024. We make use of the `load_in`, `pressure_downwind`, `pressure_ambient` and `speed_in` series (downsampling them to a 1s frequency) to build out-of-distribution forecasting tasks where the target values can be inferred from the driver variate provided as a covariate and the description of the physical system given in the context. We select a subset of 17 series from this dataset for the benchmark.

        -   Since the data is released in 2024 and after the cutoff dates of the LLMs used, we do not apply any memorization mitigation technique to transform the data.

-   **Economics** (3 tasks):

    -   **FRED**: American unemployment data at the state and county levels, from the Federal Reserve Bank of St. Louis [@fred_stlouisfed], at a monthly frequency. We extract a subset of 1769 series from this dataset for the benchmark.

        -   As this is a live dataset (updated frequently), we use data from 2024 (i.e. data after the cutoff dates of LLMs used) and do not apply any memorization mitigation strategy.

-   **Retail** (6 tasks):

    -   **NN5 ATM cash withdrawals**: The NN5 dataset of ATM cash withdrawals in the UK from the Monash Time Series Forecasting Repository [@godahewa2021monash], at a daily frequency. This dataset contains a total of 111 series.

        -   To mitigate memorization, we add Gaussian noise to the data with a standard deviation of 3% of the standard deviation of the data in each respective sampled window.

-   **Energy** (7 tasks):

    -   **Electricity consumption**: Electricity usage from 2012 to 2014 from the Monash Time Series Forecasting Repository [@godahewa2021monash], at a daily frequency. This dataset contains a total of 321 series.

        -   To mitigate memorization, we add Gaussian noise to the data with a standard deviation of 3% of the standard deviation of the data in each respective sampled window.

-   **Synthetic Data** (3 tasks): We employ a bivariate setup where the parent variable is drawn from a categorical distribution, and the child variable is generated using a continuous linear Structural Vector Autoregressive (SVAR) model with Gaussian noise, with a lag of $3$ and a noise scale of $0.1$.

    -   Since this data is synthetic, we do not apply any memorization mitigation technique. Since our models assume timestamps, we use dates from $2025$ and a daily frequency when inputting this data to our models.
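
The two memorization mitigation strategies used above (shifting dates by one day, and adding window-scaled Gaussian noise) can be sketched as simple transforms. This is an illustration under the stated parameters; the function names are ours, not the benchmark's:

```python
import numpy as np
import pandas as pd

def shift_dates_one_day(series):
    """Date-shift mitigation: move every timestamp one day ahead,
    leaving the values untouched."""
    return pd.Series(series.to_numpy(), index=series.index + pd.Timedelta(days=1))

def add_window_scaled_noise(window, rel_std=0.03, seed=None):
    """Noise mitigation: add Gaussian noise whose standard deviation is
    3% of the sampled window's own standard deviation."""
    rng = np.random.default_rng(seed)
    return window + rng.normal(0.0, rel_std * window.std(), size=window.shape)
```

Scaling the noise by each window's own standard deviation keeps the perturbation proportionate to the local variability of the series, regardless of its units.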

```{=latex}
\centering
```
::: {#tab:transformations_domain}
  -------------------------------------- ----------------------------------------------- ----------------------------------------------- ----------------
  `\multicolumn{1}{c}{Domain}`{=latex}   `\multicolumn{1}{c}{Number of tasks}`{=latex}   `\multicolumn{2}{c}{Transformations}`{=latex}
                                                                                         Date shift                                      Gaussian noise
  Public Safety                          26                                              None                                            None
  Traffic                                11                                              None                                            None
  Mechanics                              3                                               None                                            None
  Economics                              3                                               None                                            None
  Synthetic                              3                                               None                                            None
  **Total**                              **46**                                          **None**                                        **None**
  Climatology                            12                                              9                                               3
  Energy                                 7                                               None                                            7
  Retail                                 6                                               None                                            6
  **Total**                              **25**                                          **9**                                           **16**
  -------------------------------------- ----------------------------------------------- ----------------------------------------------- ----------------

  : Summary of Transformations Applied to Tasks Per Domain
:::

Appropriate history and prediction lengths are chosen for each task, depending on the task and the context it uses.

A summary of the number of tasks with each type of memorization mitigation strategy (shifting the dates by one day, or adding Gaussian noise) is presented in `\cref{tab:transformations_domain}`{=latex}.

#### Task Creation Process {#app:context_crafting}

All tasks were manually designed, from scratch, by the authors of this work without resorting to external annotators, crowdsourcing, or LLMs. We used the following procedure to create the tasks in the benchmark.

`\paragraphtight{Task Diversity:}`{=latex} First, we identified high-quality sources of public time series data from various application domains (`\cref{subsec:data-sources}`{=latex}). Special care was taken to find data sources that are continuously updated to facilitate future benchmark updates. Second, we established the categorization for types of context (`\cref{subsec:types-of-context}`{=latex}). Third, we posited various reasoning capabilities that one could potentially use to infer non-trivial information about numerical forecasts from contextual information (e.g., using common sense, making analogies to covariates, etc.; `\cref{app:capabilities}`{=latex}). The task ideation process that followed aimed to ensure sufficient coverage of these three aspects.

`\paragraphtight{Task Ideation:}`{=latex} With this framework established, all authors contributed to the ideation of new tasks. In summary, the process consisted of:

1.  Selecting a data source

2.  Implementing a time series window selection strategy (e.g., short or long history)

3.  Brainstorming about the nature of contextual information that could help achieve better forecasts (e.g., information about the past) and the capabilities that might potentially serve to apply it to the forecasting problem.

4.  Writing code to verbalize the context (e.g., calculating statistics of the series beyond the observed numerical history, creating a template to render such statistics as text, etc.), and

5.  Finally, if required, writing code to modify the time series data to reflect the context (e.g., introducing some spikes in future values).

###### Peer Review:

Then, the tasks were peer-reviewed by a committee composed of all other authors (each with time series research experience). The creator of each task was not allowed to participate in the review. The review ensured that the contextual information was of high quality, that it undoubtedly enabled a better forecast, and that the context types used in the task were tagged correctly. If a task was deemed to be of insufficient quality, it was either returned for revisions or excluded from the benchmark.

`\paragraphtight{Code availability:}`{=latex} The code for all tasks is available here: <https://github.com/ServiceNow/context-is-key-forecasting>. An example task can be found here: <https://github.com/ServiceNow/context-is-key-forecasting/blob/main/cik_benchmark/tasks/montreal_fire/short_history.py>, where the time series window selection occurs from L94-112 and the context generation occurs from L114-158.

#### Model Capabilities {#app:capabilities}

As mentioned in `\cref{app:context_crafting}`{=latex}, we designed the tasks with consideration of the capabilities a model might potentially use to incorporate contextual information into its forecasts. All tasks in CiK are tagged with such capabilities. However, these tags are inherently subjective and not intended as formal attributions. Rather, they serve as broad categories to help readers identify examples of interest within the benchmark. These are as follows:

`\noindent `{=latex}**Common-Sense** (24 Tasks): Using direct instructions available in the context. Instructions could express constraints to be satisfied, or the exact expected effect of an event, for example.

```{=latex}
\vspace{-0.1cm}
```
`\noindent `{=latex}**Retrieval**: Retrieving facts from memory or context. `\vspace{-0.1cm}`{=latex}

-   **Retrieval from memory** (35 Tasks): Retrieving from memory facts that enable interpretation of the context, such as relevant physical constants or quantitative laws.

-   **Retrieval from context** (25 Tasks): Retrieving relevant information from context and distinguishing it from irrelevant information.

```{=latex}
\vspace{-0.1cm}
```
`\noindent `{=latex}**Reasoning**: Reasoning about information in context or memory. `\vspace{-0.1cm}`{=latex}

-   **Analogical Reasoning** (6 tasks): Making analogies between entities or events, for instance, applying knowledge from a past event that is similar to an upcoming one.

-   **Mathematical Reasoning** (32 tasks): Performing calculations over the context, e.g. solving an equation.

-   **Deductive Reasoning** (39 tasks): Inferring new facts not explicitly mentioned in the context, e.g. inferring from the context that certain values are logically impossible to occur.

-   **Causal Reasoning** (22 tasks): Deriving or using causal information from the context to reason about actions (such as interventions).

`\paragraphtight{Example:}`{=latex} To illustrate the rationale, we provide the following example. To solve the task in `\cref{fig:context-overview}`{=latex}, one could *retrieve from memory* that Montreal experiences snowfall and cold weather during the winter months. It could then infer that trash fires are less likely to occur during this period through *deductive reasoning*. This chain of thought reveals a seasonal pattern that is not apparent in the short numerical history. Additionally, through *causal reasoning*, it is apparent that, despite a strong association between field fires and trash fires, the intervention described in `\contextpointer{5}`{=latex} is unlikely to reduce the frequency of the latter. Failure to recognize this distinction would lead to inaccurate forecasts.

#### Validating the Relevance of the Context {#subsec:llm-critique}

To evaluate the relevance of the contextual information for tasks in CiK, we query both human and LLM annotators on the relevance of the context. Our findings confirm that the context is relevant for all tasks (see `\cref{fig:annotation-heatmap}`{=latex}).

```{=latex}
\centering
```
![Ratings of the relevance of the context for both human annotations (x-axis) and LLM annotations (y-axis). There are 5 ratings per task for the LLM, and between 4 and 10 ratings per task for the human annotators. Each cell represents the number of tasks that correspond to a given pair of (human, LLM) ratings for the relevance of the context. For example, 53 (top right) of the 71 tasks have over 95% of annotators tagging the context as useful, as well as more than 95% of the LLM annotations tagging the context as useful. All tasks are considered relevant by either the LLM or the human annotators. Furthermore, the vast majority of tasks are considered relevant across more than 95% of ratings.](figures/annotations/cross_product_heatmap.png){#fig:annotation-heatmap width="0.75\\linewidth"}

```{=latex}
\centering
```
![The task-wise proportion of annotators (n=11) who tag the context as useful. The overall rate of tasks tagged as useful across all annotators is 94.7%.](figures/annotations/task_wise_proportion_of_annotators_who_find_context_useful.png){#fig:annotations width="1\\linewidth"}

```{=latex}
\centering
```
![A histogram of results from the LLM-based critique of the relevance of context. Given the historical data, the future data and the associated context of tasks, GPT-4o is asked to assess whether its predictions would be \`\`significantly better", \`\`slightly better", \`\`unchanged", or \`\`worse" (see `\cref{subsec:llm-critique}`{=latex} for the details). The context in all tasks is considered as enabling better forecasts, with the majority of tasks having context that enable \`\`significantly better" forecasts.](figures/llm_critique.png){#fig:llm_critique_fig width="0.5\\linewidth"}

###### Human Evaluation of the Relevance of Context

To ensure that the context used in the tasks is relevant, we ask 11 human annotators to evaluate the relevance of the context across 5 seeds. After a brief presentation of two example tasks based on examples from [@hyndman2018forecasting], we ask the annotators whether the context enables better forecasts. Annotators are instructed that the tasks are designed for the purpose of context-aided forecasting, and that we are seeking to identify tasks for which the context is not useful.

The results of this study can be found in `\cref{fig:annotations}`{=latex}: the vast majority of tasks are always annotated as useful, while annotators disagree on a small minority of tasks, such as `FullCausalContextImplicitEquationBivarLinSVAR` (visualized at <https://servicenow.github.io/context-is-key-forecasting/v0/FullCausalContextImplicitEquationBivarLinSVAR.html>), likely due to the highly statistical nature of such tasks. Overall, $94.7\%$ of annotations report that the context is useful.

###### An LLM-based Critique of the Relevance of Context

To further validate the quality of the tasks, we build an LLM-based critique by prompting `\model{GPT-4o}`{=latex} with the historical and future numerical data, as well as the context, and asking it to assess whether its estimation of future values would be \`\`significantly better", \`\`slightly better", \`\`unchanged", or \`\`worse" when the context is provided compared to when it is not provided. Note that this experiment was run after the benchmark was created, as an analysis tool to further validate the quality of the tasks.

We run this critique on 5 instances of each of the `\CiKnumtasks{}`{=latex} tasks and report results in `\cref{fig:llm_critique_fig}`{=latex}.[^2] All tasks are assessed as enabling better forecasts when given context, with the majority of tasks assessed as having contexts that enable \`\`significantly better" forecasts. The prompt used in the critique is in `\cref{fig:critique-prompt}`{=latex}.

+:---------------------------------------------------------------------------------------------------------------------:+
| ```{=latex}                                                                                                           |
| \centering                                                                                                            |
| ```                                                                                                                   |
|                                                                                                                       |
|     You are a critic whose role is to evaluate the quality of tasks in the "context is key" time series forecasting   |
|     benchmark.                                                                                                        |
|                                                                                                                       |
|     "Context is Key" (CiK) is a time series forecasting benchmark that pairs numerical data with diverse types of     |
|     carefully crafted textual context, requiring models to integrate both modalities to arrive at accurate            |
|     predictions.                                                                                                      |
|                                                                                                                       |
|     Here is a task to evaluate.                                                                                       |
|                                                                                                                       |
|     <history>                                                                                                         |
|     ((history))                                                                                                       |
|     </history>                                                                                                        |
|                                                                                                                       |
|     <context>                                                                                                         |
|         <background>                                                                                                  |
|             ((background))                                                                                            |
|         </background>                                                                                                 |
|         <scenario>                                                                                                    |
|             ((scenario))                                                                                              |
|         </scenario>                                                                                                   |
|         <constraints>                                                                                                 |
|             ((constraints))                                                                                           |
|         </constraints>                                                                                                |
|     </context>                                                                                                        |
|     <future>                                                                                                          |
|     ((future))                                                                                                        |
|     </future>                                                                                                         |
|                                                                                                                       |
|     Assume the following two scenarios:                                                                               |
|     1) You are given only the numerical data in <history> and have no additional information about the nature of the  |
|     time series. You must ignore the <context> section completely.                                                    |
|                                                                                                                       |
|     2) You are given the <context> section in addition to the numerical data in <history>.                            |
|                                                                                                                       |
|     Now, assume you had to estimate the probability distribution of the <future> values given the information         |
|     available in each scenario. How would the quality of your estimation change in scenario 2 compared to             |
|     scenario 1?                                                                                                       |
|                                                                                                                       |
|     First show your reasoning in <reason></reason> tags, then answer in <answer></answer> tags with either            |
|     "significantly better", "slightly better", "unchanged", "worse" (no other responses are allowed).                 |
+-----------------------------------------------------------------------------------------------------------------------+

#### Weighting scheme for tasks {#app:weighting-scheme}

To take full advantage of the available data, we create multiple tasks using each data source, by varying the specific contextual information we provide to the models. Since we do not want our aggregate results to be dominated by the few datasets for which there are a larger number of tasks, we weight the contribution of each task to the various aggregated results.

To define the weight of each task, we first group the tasks into clusters. These clusters are primarily defined based on the original data source used to create the tasks. However, when tasks are fundamentally different, because they do not test the same capabilities, we put them in different clusters despite them using the same data source. For example, the tasks created using the solar irradiance and cloud cover data, all of which ask models to forecast the irradiance, form three distinct clusters: one for tasks asking models to forecast with a very short history (less than a day), one for tasks giving the cloud cover as a covariate, and a final one for tasks where the models are given a tight upper bound on the possible irradiance. Once we define these clusters, we simply give equal weight to each cluster, and equal weight to each task inside each cluster.
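
This scheme can be sketched as follows (a minimal illustration; the cluster assignments themselves are defined manually, as described above):

```python
from collections import defaultdict

def task_weights(task_to_cluster):
    """Equal weight to each cluster, then equal weight to each task
    within its cluster, so that data sources with many tasks do not
    dominate aggregate results."""
    clusters = defaultdict(list)
    for task, cluster in task_to_cluster.items():
        clusters[cluster].append(task)
    n_clusters = len(clusters)
    return {
        task: 1.0 / (n_clusters * len(tasks))
        for tasks in clusters.values()
        for task in tasks
    }
```

For instance, with two tasks in one cluster and one task in another, the lone task receives weight 1/2 while each of the paired tasks receives 1/4, and the weights always sum to 1.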

#### Standard errors and average ranks

To get the standard errors shown in `\cref{table:main-results}`{=latex}, we first compute the standard error of each task using the method described in `\cref{app:stderr_formula}`{=latex}. We then aggregate them according to each task's weight, assuming that the errors are independent and thus using the formula for the variance of a weighted sum of independent variables.
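Concretely, if task $i$ has weight $w_i$ and estimated score $\hat{s}_i$ with standard error $\mathrm{SE}(\hat{s}_i)$, independence gives the standard error of the weighted average as

```latex
\mathrm{SE}\left(\sum_i w_i \hat{s}_i\right) = \sqrt{\sum_i w_i^2 \, \mathrm{SE}(\hat{s}_i)^2}.
```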

To account for the uncertainty in the scores, we compute average ranks through a simple simulation. In this simulation, we replace the RCRPS of each task and model pair by an independent Gaussian variable whose mean equals the measured score and whose standard deviation equals the standard error. We then draw from these distributions and compute the weighted average rank of each model. The results shown in `\cref{table:main-results}`{=latex} are the mean and standard deviation over 10,000 repetitions of this simulation.
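The simulation can be sketched as follows. This is a minimal illustration of the procedure described above, not the benchmark's code; the array shapes and toy values are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_average_ranks(scores, stderr, weights, n_sims=10_000):
    """scores[m, t]: measured RCRPS of model m on task t;
    stderr[m, t]: its standard error; weights[t]: task weight.
    Returns the mean and std of each model's weighted average rank."""
    n_models, _ = scores.shape
    avg_ranks = np.empty((n_sims, n_models))
    for s in range(n_sims):
        # Perturb each (model, task) score with independent Gaussian noise.
        draw = rng.normal(scores, stderr)
        # Rank models within each task (1 = best, i.e. lowest RCRPS).
        ranks = draw.argsort(axis=0).argsort(axis=0) + 1
        avg_ranks[s] = ranks @ weights  # weighted average rank per model
    return avg_ranks.mean(axis=0), avg_ranks.std(axis=0)

# Toy example: model 0 is clearly better on both tasks.
scores = np.array([[0.1, 0.2], [0.3, 0.4]])
stderr = np.full_like(scores, 0.01)
mean_rank, std_rank = simulate_average_ranks(scores, stderr, np.array([0.5, 0.5]))
```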

#### Task lengths {#app:context_length}

```{=latex}
\centering
```
![Histograms depicting the distribution of lengths of the natural language context, numerical history, and target for a set of five instances of each task in CiK. We measure the length of the natural language context in characters, and that of the numerical sequences in floats.](figures/appendix/context_lengths/lengths.png){#fig:context_lengths width="0.9\\linewidth"}

`\cref{fig:context_lengths}`{=latex} provides an overview of the distribution of the lengths of the natural language context, numerical history and target (prediction horizon) for a set of five instances for each task in the CiK benchmark.

```{=latex}
\FloatBarrier
```
### Examples of tasks from the benchmark {#app:task-examples}

In this section, we feature multiple examples from the benchmark to exemplify exactly what a task is, what context types represent (`\cref{subsec:types-of-context}`{=latex}), and how we tag these tasks with descriptive capabilities (`\cref{app:capabilities}`{=latex}). To visualize all tasks in the benchmark, we refer the reader to <https://servicenow.github.io/context-is-key-forecasting/v0/>.

#### Task: Constrained Predictions {#subsec:boundedPredExample}

```{=latex}
\centering
```
```{=latex}
\fbox{
        \parbox{0.90 \textwidth}{
                \textbf{Domain:} Traffic \\
                \textbf{Context types:} Future information

                \vspace{0.5em}
                \textbf{Context:} ``Suppose that in the forecast, the values are bounded above by 11.88, the values are bounded below by 7.06.''
            }
    }
```
![image](figures/example_tasks/bounded_pred.png){width="0.7\\linewidth"} `\label{fig:boundedPredExample}`{=latex}

This task, which we refer to as \`\`Bounded Prediction Constraint Based On Prediction Quantiles", is a forecasting task where we modify the forecast horizon (in green in the plot) by bounding one or both of its extremes according to its unmodified ground truth's quantile values. We verbalize these bounds in the context, and the model is expected to interpret and respect them.

Since we draw this series from the PeMS dataset [@chen2001freeway], we tag its domain as \`\`Traffic". The context directly refers to the future, hence we tag its context type as \`\`Future information".

Since the context contains constraints, the Region of Interest CRPS metric that we introduce (`\cref{subsec:evaluation}`{=latex}) heavily penalizes forecasts that violate them. Models that do not incorporate the bounds given in the context, such as purely quantitative forecasting models, would be unable to predict the ground truth (orange line), because its lower bound is much higher than that of the history. In this case, the region of interest for the metric is the entire forecast horizon, because the context applies everywhere. Although statistical forecasters may pick up on the seasonality present in the history (black line), they would obtain worse scores than models capable of processing the context and adjusting the lower bound of their predictions.
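To make the constraint penalty concrete, here is a hedged illustration (not the benchmark's exact metric) of a penalty term that grows with the amount by which forecast samples violate the verbalized bounds, using the bounds 7.06 and 11.88 from the example context above.

```python
import numpy as np

def constraint_violation(samples, lower=7.06, upper=11.88):
    """Average amount by which forecast samples fall outside [lower, upper].

    Zero when all samples respect the verbalized bounds; grows with the
    magnitude of the violations. Illustrative only.
    """
    below = np.clip(lower - samples, 0.0, None)  # shortfall under the lower bound
    above = np.clip(samples - upper, 0.0, None)  # excess over the upper bound
    return float(np.mean(below + above))

within = np.array([8.0, 9.5, 11.0])   # all samples inside the bounds
outside = np.array([5.0, 9.5, 13.0])  # two samples violate the bounds
```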

#### Task: Electrical Consumption Increase {#app:electricityincrease}

```{=latex}
\centering
```
```{=latex}
\fbox{
        \parbox{0.90 \textwidth}{
                \textbf{Domain:} Energy

                \textbf{Context types:} Future information, Covariate information


                \vspace{0.5em}
                \textbf{Context:} ``This is the electricity consumption recorded in Kilowatt (kW) in city A. A heatwave struck the city, which began on 2012-10-09 18:00:00 and lasted for approximately 3 hours, saw temperatures soar to unprecedented levels. According to the city's electricity provider, power consumption during the peak of the heatwave reached approximately 5 times the typical usage for this time of year.''
            }
    }
```
![image](figures/example_tasks/elec_short_news.png){width="0.7\\linewidth"} `\label{fig:elecIncreaseExample}`{=latex}

The \`\`Short News Electricity Increase" task introduces a large shock in the forecast horizon that is only referred to in the context. Hence, the model must interpret the context appropriately to forecast the spike.

Since this series represents electricity consumption (`\cref{subsec:data-sources}`{=latex}), we tag it as coming from the \`\`Energy" domain. The context types for this task are twofold: the first is \`\`Future information", which represents knowledge of the five-fold increase in typical usage during the shock. The second is \`\`Covariate information", which represents the occurrence of a heatwave coinciding with the timing and duration of the shock. The model must therefore interpret both the magnitude of the shock from the future information and its timing and duration from the covariate information. Together, these pieces of information enable an accurate forecast despite the absence of any sign of the shock in the task's numerical history.

In this task, we also see a \`\`Region of Interest" (RoI), shown as a darker region of the forecast horizon. The RoI is the region of the forecast horizon for which the context is relevant, i.e. the period during which the increased power consumption occurred. As detailed in `\cref{subsec:evaluation}`{=latex}, this region of interest is taken into account by the RCRPS metric.

#### Task: ATM Maintenance {#app:atmexample}

```{=latex}
\centering
```
```{=latex}
\fbox{
        \parbox{0.90\textwidth}{
                \textbf{Domain:} Retail

                \textbf{Context types:} Intemporal information, Covariate information


                \vspace{0.5em}
                \textbf{Context:} ``This is the number of cash withdrawals from an automated teller machine (ATM) in an arbitrary location in England. The ATM was under maintenance for 7 days, periodically every 14 days, starting from 1996-11-30 00:00:00. Assume that the ATM will not be in maintenance in the future.''
            }
    }
```
![image](figures/example_tasks/atm.png){width="0.7\\linewidth"} `\label{fig:atmexample}`{=latex}

The \`\`Automated Teller Machine (ATM) Under Period Maintenance" task represents the history of withdrawals from an ATM that undergoes regular maintenance. This maintenance introduces a periodic, easily forecastable signal into the history. However, the context explicitly states that the ATM will not be in maintenance during the forecast horizon, so forecasting models are expected to ignore this signal.

Since this series represents ATM withdrawals, we tag it as \`\`Retail". The context includes information such as the location of the ATM, and therefore provides \`\`Intemporal information". As the maintenance frequency and duration are also described, the context types include \`\`Covariate information".

The RoI covers the periods in the forecast horizon during which maintenance would have occurred, which is where forecasting models that do not leverage the context are likely to forecast values near zero. While a quantitative forecasting model would find such a periodic signal irresistible, context-aware models should avoid repeating the pattern in the forecast.

We also note that the series is not quite 0 during the maintenance periods. This is a consequence of using one of our memorization mitigation schemes (`\cref{app:data-sources}`{=latex}, paragraph \`\`Memorization mitigation").

#### Task: Montreal Fire High Season

```{=latex}
\centering
```
```{=latex}
\fbox{
        \parbox{0.90\textwidth}{
                \textbf{Domain:}  Public Safety

                \textbf{Context types:} Intemporal information, Historical information


                \vspace{0.5em}
                \textbf{Context:} ``The Montreal Fire Department is in charge of responding to various kind of public safety incidents. This is the number of field fire incidents responded to by Montreal firefighters in the borough of Rivière-des-Prairies-Pointe-aux-Trembles. In other years, the yearly average number of incidents was 106 with the busiest month being June.''
            }
    }
```
![image](figures/example_tasks/montreal_fire.png){width="0.7\\linewidth"} `\label{fig:montreaLFireExample}`{=latex}

The \`\`Montreal Field Fire With Explicit Short History" task requires predicting the number of field fire incidents during the summer, so we tag it as being part of the \`\`Public Safety" domain.

The context contains information from two different sources: it contains \`\`Intemporal information", such as the location and nature of the incidents. It also contains \`\`Historical information", which verbalizes statistics about past values of the series beyond the numerical data: namely, the yearly average number of incidents and the knowledge that June is the month with the most incidents.

#### Task: Solar Prediction

```{=latex}
\centering
```
```{=latex}
\fbox{
        \parbox{0.90\textwidth}{
                \textbf{Domain:} Climatology

                \textbf{Context types:} Intemporal information


                \vspace{0.5em}
                \textbf{Context:} ``This series estimates the power production for a given day of a new solar power plant located in the state of Georgia, which has a climate similar to Alabama's.''
            }
    }
```
![image](figures/example_tasks/solar.png){width="0.7\\linewidth"} `\label{fig:solarExample}`{=latex}

The \`\`Explicit Similar Location and Day Solar Forecast" task requires forecasting the power production of a solar power plant based on a very short history and information about the similarity between its climate and that of an adjacent location. We therefore tag the domain of this series as \`\`Climatology".

Without the \`\`Intemporal information" that the context provides, accurate forecasts of the parabola-like shape of the ground truth are unlikely: the history contains very few defining characteristics, which makes it interchangeable with that of many potential processes and therefore many possible forecasts.

#### Task: Speed From Load

```{=latex}
\centering
```
```{=latex}
\fbox{
        \parbox{0.90\textwidth}{
                \textbf{Domain:} Mechanics

                \textbf{Context types:} Causal information, Intemporal information, Covariate information


                \vspace{0.5em}
                \textbf{Context:} ``The wind tunnel is a chamber with one controllable fan that pushes air through it. We can control the load of the fan (corresponding to the duty cycle of the pulse-width-modulation signal) and measure its speed (in revolutions per minute). The fan is designed so its steady-state speed scales broadly linearly with the load. Unless completely powered off, the fan never operates below a certain speed, corresponding to a minimum effective load between 0.1 and 0.2. The task is to forecast the speed of the fan. The load is between 0 and 1. At full load (=1), the fan turns at a maximum speed of 3000 rpm. The load is set to: 0.0 until 05:47:09, 0.1 from 05:47:09 until 05:47:29, 0.0 from 05:47:29 until 05:48:01, 0.2 from 05:48:01 until 05:48:27, 0.1 from 05:48:27 until 05:48:49, 0.0 from 05:48:49 until 05:49:00.''
            }
    }
```
![image](figures/example_tasks/speed_from_load.png){width="0.7\\linewidth"} `\label{fig:speedFromLoadExample}`{=latex}

The \`\`Speed From Load" task combines many different context types and capabilities to produce a forecast of the revolutions per minute (RPM) of a fan in a wind tunnel based on its load. This task, based on the Causal Chambers dataset [@gamella2024chamber], is therefore tagged as part of the \`\`Mechanics" domain.

As the plot shows, producing an accurate forecast of the ground truth (orange line) from the numerical history alone (black line) is essentially impossible. However, the context of the task is quite rich: it provides \`\`Intemporal information" on the nature of the task, such as the limits of the load and of the fan, \`\`Covariate information" that describes the load during the history and the forecast horizon, and \`\`Causal information" on the control that the load exerts on the fan and the proportionality of their relationship. With this information, it is possible to forecast the series nearly perfectly, except for some noise.
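The steady-state relationship described in the context can be sketched directly. This is a hedged illustration of the stated rules only (the real series also includes transients and noise, which it ignores), and the value 0.15 for the minimum effective load is an assumption we pick inside the stated 0.1 to 0.2 range.

```python
MAX_RPM = 3000.0           # speed at full load (=1), from the context
MIN_EFFECTIVE_LOAD = 0.15  # assumption: midpoint of the stated 0.1-0.2 range

def steady_state_rpm(load):
    """Steady-state fan speed implied by the context's description.

    Speed scales linearly with load, except that the fan is off at zero
    load and never runs below the speed of the minimum effective load.
    """
    if load == 0.0:
        return 0.0  # completely powered off
    return MAX_RPM * max(load, MIN_EFFECTIVE_LOAD)
```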

### Additional Results {#sec:addnl-res-sec}

The full unaggregated results from our benchmark can be found at the following location: <https://github.com/ServiceNow/context-is-key-forecasting/blob/main/results/results_complete.csv>.

#### Extended results on all models {#app:extended-aggregate-results-on-all-models}

```{=latex}
\captionsetup{justification=centering}
```
```{=latex}
\centering
```
```{=latex}
\resizebox{0.32\textwidth}{!}{%
\begin{tabular}{lcc}
\toprule
Model & Average RCRPS & Average Rank \\
\midrule
\multicolumn{3}{l}{\textbf{With Context}} \\
\midrule
\multicolumn{3}{l}{\directprompt (ours)} \\
~~~\tablemodel{Llama-3.1-405B-Inst} & \textbf{0.159 $\pm$ 0.008} & \textbf{7.962 $\pm$ 0.591} \\
~~~\tablemodel{Llama-3-70B-Inst} & 0.286 $\pm$ 0.004 & 14.806 $\pm$ 0.200 \\
~~~\tablemodel{Llama-3-8B-Inst} & 0.461 $\pm$ 0.008 & 28.020 $\pm$ 0.579 \\
~~~\tablemodel{Mixtral-8x7B-Inst}& 0.523 $\pm$ 0.023 & 33.066 $\pm$ 0.419 \\
~~~\tablemodel{Qwen-2.5-7B-Inst} & 0.290 $\pm$ 0.003 & 22.840 $\pm$ 0.731 \\
~~~\tablemodel{Qwen-2.5-1.5B-Inst} & 0.575 $\pm$ 0.014 & 32.643 $\pm$ 0.860 \\
~~~\tablemodel{Qwen-2.5-0.5B-Inst} & 0.463 $\pm$ 0.012 & 27.216 $\pm$ 0.530 \\
~~~\tablemodel{GPT-4o} & 0.274 $\pm$ 0.010 & 8.632 $\pm$ 0.441 \\
~~~\tablemodel{GPT-4o-mini} & 0.354 $\pm$ 0.022 & 17.574 $\pm$ 0.498 \\
\midrule
\multicolumn{3}{l}{\llmp} \\
~~~\tablemodel{Llama-3-70B-Inst} & 0.539 $\pm$ 0.013 & 18.039 $\pm$ 0.569 \\
~~~\tablemodel{Llama-3-70B} & 0.236 $\pm$ 0.006 & 12.383 $\pm$ 0.727 \\
~~~\tablemodel{Llama-3-8B-Inst} & 0.483 $\pm$ 0.010 & 18.597 $\pm$ 0.471 \\
~~~\tablemodel{Llama-3-8B} & 0.311 $\pm$ 0.023 & 18.640 $\pm$ 1.042 \\
~~~\tablemodel{Mixtral-8x7B-Inst} & 0.264 $\pm$ 0.004 & 16.078 $\pm$ 0.668 \\
~~~\tablemodel{Mixtral-8x7B} & 0.262 $\pm$ 0.008 & 16.296 $\pm$ 0.516 \\
~~~\tablemodel{Qwen-2.5-7B-Inst} & 1.974 $\pm$ 0.027 & 45.227 $\pm$ 0.747 \\
~~~\tablemodel{Qwen-2.5-7B} & 0.910 $\pm$ 0.037 & 38.133 $\pm$ 1.049 \\
~~~\tablemodel{Qwen-2.5-1.5B-Inst} & 2.158 $\pm$ 0.027 & 50.647 $\pm$ 0.865 \\
~~~\tablemodel{Qwen-2.5-1.5B} & 1.731 $\pm$ 0.036 & 45.116 $\pm$ 0.525 \\
~~~\tablemodel{Qwen-2.5-0.5B-Inst} & 1.937 $\pm$ 0.024 & 50.480 $\pm$ 0.611 \\
~~~\tablemodel{Qwen-2.5-0.5B} & 1.995 $\pm$ 0.024 & 48.511 $\pm$ 0.823 \\
\midrule
\multicolumn{3}{l}{\tablemodel{Multimodal Models}} \\
~~~\tablemodel{UniTime} & 0.370 $\pm$ 0.001 & 35.459 $\pm$ 0.155 \\
~~~\tablemodel{Time-LLM (ETTh1)} & 0.476 $\pm$ 0.001 & 44.086 $\pm$ 0.149 \\
~~~\tablemodel{ChatTime-Base} & 0.735 $\pm$ 0.002 & 39.033 $\pm$ 0.312 \\
~~~\tablemodel{ChatTime-Chat} & 0.747 $\pm$ 0.005 & 34.182 $\pm$ 0.389 \\
\midrule
\multicolumn{3}{l}{\textbf{Without Context}} \\
\midrule
\multicolumn{3}{l}{\directprompt (ours)} \\
~~~\tablemodel{Llama-3.1-405B-Inst} & 0.473 $\pm$ 0.005 & 34.332 $\pm$ 0.292 \\
~~~\tablemodel{Llama-3-70B-Inst} & 0.470 $\pm$ 0.008 & 35.151 $\pm$ 0.361 \\
~~~\tablemodel{Llama-3-8B-Inst} & 0.602 $\pm$ 0.006 & 41.872 $\pm$ 0.406 \\
~~~\tablemodel{Mixtral-8x7B-Inst} & 0.712 $\pm$ 0.021 & 46.805 $\pm$ 0.384 \\
~~~\tablemodel{Qwen-2.5-7B-Inst} & 0.421 $\pm$ 0.022 & 37.162 $\pm$ 0.539 \\
~~~\tablemodel{Qwen-2.5-1.5B-Inst} & 0.450 $\pm$ 0.007 & 36.314 $\pm$ 0.621 \\
~~~\tablemodel{Qwen-2.5-0.5B-Inst} & 0.305 $\pm$ 0.007 & 26.817 $\pm$ 0.351 \\
~~~\tablemodel{GPT-4o} & 0.441 $\pm$ 0.008 & 31.511 $\pm$ 0.380 \\
~~~\tablemodel{GPT-4o-mini} & 0.423 $\pm$ 0.006 & 35.716 $\pm$ 0.278 \\
\midrule
\multicolumn{3}{l}{\llmp} \\
~~~\tablemodel{Llama-3-70B-Inst} & 0.378 $\pm$ 0.004 & 26.036 $\pm$ 0.477 \\
~~~\tablemodel{Llama-3-70B} & 0.311 $\pm$ 0.006 & 21.819 $\pm$ 0.472 \\
~~~\tablemodel{Llama-3-8B-Inst} & 0.503 $\pm$ 0.009 & 31.156 $\pm$ 0.435 \\
~~~\tablemodel{Llama-3-8B} & 0.345 $\pm$ 0.003 & 25.063 $\pm$ 0.379 \\
~~~\tablemodel{Mixtral-8x7B-Inst} & 0.383 $\pm$ 0.015 & 24.595 $\pm$ 0.471 \\
~~~\tablemodel{Mixtral-8x7B} & 0.306 $\pm$ 0.007 & 22.565 $\pm$ 0.514 \\
~~~\tablemodel{Qwen-2.5-7B-Inst} & 1.020 $\pm$ 0.026 & 39.936 $\pm$ 1.236 \\
~~~\tablemodel{Qwen-2.5-7B} & 0.732 $\pm$ 0.030 & 38.092 $\pm$ 1.194 \\
~~~\tablemodel{Qwen-2.5-1.5B-Inst} & 1.515 $\pm$ 0.033 & 47.068 $\pm$ 1.080 \\
~~~\tablemodel{Qwen-2.5-1.5B} & 1.070 $\pm$ 0.028 & 41.194 $\pm$ 1.057 \\
~~~\tablemodel{Qwen-2.5-0.5B-Inst} & 1.318 $\pm$ 0.037 & 44.391 $\pm$ 0.759 \\
~~~\tablemodel{Qwen-2.5-0.5B} & 1.821 $\pm$ 0.027 & 47.759 $\pm$ 0.706 \\
\midrule
\multicolumn{3}{l}{\tablemodel{Multimodal Models}} \\
~~~\tablemodel{UniTime} & 0.405 $\pm$ 0.002 & 37.247 $\pm$ 0.178 \\
~~~\tablemodel{Time-LLM (ETTh1)} & 0.458 $\pm$ 0.002 & 43.019 $\pm$ 0.162 \\
~~~\tablemodel{ChatTime-Base} & 0.725 $\pm$ 0.002 & 38.771 $\pm$ 0.351 \\
~~~\tablemodel{ChatTime-Chat} & 0.781 $\pm$ 0.015 & 35.260 $\pm$ 0.446 \\
\midrule
\multicolumn{3}{l}{\tablemodel{TS Foundation Models}} \\
~~~\tablemodel{Lag-Llama} & 0.327 $\pm$ 0.004 & 30.441 $\pm$ 0.810 \\
~~~\tablemodel{Chronos-Tiny} & 0.328 $\pm$ 0.001 & 27.487 $\pm$ 0.442 \\
~~~\tablemodel{Chronos-Mini} & 0.341 $\pm$ 0.001 & 28.895 $\pm$ 0.421 \\
~~~\tablemodel{Chronos-Small} & 0.328 $\pm$ 0.002 & 26.523 $\pm$ 0.368 \\
~~~\tablemodel{Chronos-Base} & 0.672 $\pm$ 0.003 & 30.601 $\pm$ 0.379 \\
~~~\tablemodel{Chronos-Large} & 0.326 $\pm$ 0.002 & 25.602 $\pm$ 0.399 \\
~~~\tablemodel{TimeGEN} & 0.353 $\pm$ 0.000 & 35.928 $\pm$ 0.167 \\
~~~\tablemodel{Moirai-Small} & 0.565 $\pm$ 0.031 & 36.037 $\pm$ 0.426 \\
~~~\tablemodel{Moirai-Base} & 0.624 $\pm$ 0.013 & 35.267 $\pm$ 0.408 \\
~~~\tablemodel{Moirai-Large} & 0.520 $\pm$ 0.006 & 28.641 $\pm$ 0.864 \\
\midrule
\multicolumn{3}{l}{\tablemodel{Statistical Models}} \\
~~~\tablemodel{ARIMA} & 0.475 $\pm$ 0.006 & 27.041 $\pm$ 0.482 \\
~~~\tablemodel{ETS} & 0.530 $\pm$ 0.009 & 33.781 $\pm$ 0.632 \\
~~~\tablemodel{Exp-Smoothing}  & 0.605 $\pm$ 0.013 & 36.424 $\pm$ 0.350 \\
\bottomrule
\end{tabular}
}
```
`\cref{table:full-table-results-appdx}`{=latex} provides the extended results with all models evaluated on the CiK benchmark, aggregated over all tasks.

#### Full results partitioned by types of context

```{=latex}
\centering
```
```{=latex}
\resizebox{0.68\textwidth}{!}{%
\begin{tabu}{lccccccc}
\toprule
Model & \multirow{2}{*}{\shortstack{Average \\ RCRPS}} & \multirow{2}{*}{\shortstack{Average \\ Rank}} & \multirow{2}{*}{\shortstack{Intemporal \\ Information}} & \multirow{2}{*}{\shortstack{Historical \\ Information}} & \multirow{2}{*}{\shortstack{Future \\ Information}}  & \multirow{2}{*}{\shortstack{Covariate \\ Information}} & \multirow{2}{*}{\shortstack{Causal \\ Information}} \\
\\
\midrule
\multicolumn{7}{l}{\textbf{With Context}} \\
\midrule
\multicolumn{7}{l}{\directprompt (ours)} \\
~~~\tablemodel{Llama-3.1-405B-Inst} & \textbf{0.159 $\pm$ 0.008} & \textbf{7.967 $\pm$ 0.587} & \textbf{0.174 $\pm$ 0.010} & 0.146 $\pm$ 0.001 & \textbf{0.075 $\pm$ 0.005} & \textbf{0.164 $\pm$ 0.010} & 0.398 $\pm$ 0.045 \\
~~~\tablemodel{Llama-3-70B-Inst} & 0.286 $\pm$ 0.004 & 14.806 $\pm$ 0.201 & 0.336 $\pm$ 0.006 & 0.180 $\pm$ 0.003 & 0.194 $\pm$ 0.006 & 0.228 $\pm$ 0.004 & 0.629 $\pm$ 0.019 \\
~~~\tablemodel{Llama-3-8B-Inst} & 0.461 $\pm$ 0.008 & 28.013 $\pm$ 0.588 & 0.572 $\pm$ 0.011 & 0.313 $\pm$ 0.008 & 0.253 $\pm$ 0.017 & 0.262 $\pm$ 0.003 & 0.531 $\pm$ 0.005 \\
~~~\tablemodel{Mixtral-8x7B-Inst} & 0.523 $\pm$ 0.023 & 33.068 $\pm$ 0.415 & 0.723 $\pm$ 0.037 & 0.236 $\pm$ 0.002 & 0.241 $\pm$ 0.001 & 0.359 $\pm$ 0.028 & 0.875 $\pm$ 0.128 \\
~~~\tablemodel{Qwen-2.5-7B-Inst}  & 0.290 $\pm$ 0.003 & 22.842 $\pm$ 0.733 & 0.290 $\pm$ 0.004 & 0.176 $\pm$ 0.003 & 0.287 $\pm$ 0.007 & 0.240 $\pm$ 0.002 & 0.525 $\pm$ 0.003 \\
~~~\tablemodel{Qwen-2.5-1.5B-Inst} & 0.575 $\pm$ 0.014 & 32.642 $\pm$ 0.864 & 0.684 $\pm$ 0.023 & 0.284 $\pm$ 0.006 & 0.370 $\pm$ 0.010 & 0.450 $\pm$ 0.004 & 1.270 $\pm$ 0.009 \\
~~~\tablemodel{Qwen-2.5-0.5B-Inst} & 0.463 $\pm$ 0.012 & 27.218 $\pm$ 0.530 & 0.609 $\pm$ 0.019 & 0.165 $\pm$ 0.004 & 0.218 $\pm$ 0.012 & 0.476 $\pm$ 0.015 & 0.429 $\pm$ 0.006 \\
~~~\tablemodel{GPT-4o} & 0.274 $\pm$ 0.010 & 8.631 $\pm$ 0.443 & 0.218 $\pm$ 0.007 & \textbf{0.118 $\pm$ 0.001} & 0.121 $\pm$ 0.001 & 0.250 $\pm$ 0.011 & 0.858 $\pm$ 0.053 \\
~~~\tablemodel{GPT-4o-mini} & 0.354 $\pm$ 0.022 & 17.565 $\pm$ 0.506 & 0.475 $\pm$ 0.035 & 0.139 $\pm$ 0.002 & 0.143 $\pm$ 0.002 & 0.341 $\pm$ 0.028 & 0.644 $\pm$ 0.128 \\
\midrule
\multicolumn{7}{l}{\llmp} \\
~~~\tablemodel{Llama-3-70B-Inst} & 0.539 $\pm$ 0.013 & 18.034 $\pm$ 0.571 & 0.438 $\pm$ 0.017 & 0.516 $\pm$ 0.028 & 0.847 $\pm$ 0.024 & 0.546 $\pm$ 0.016 & \textbf{0.392 $\pm$ 0.028} \\
~~~\tablemodel{Llama-3-70B} & 0.236 $\pm$ 0.006 & 12.382 $\pm$ 0.726 & 0.212 $\pm$ 0.005 & 0.121 $\pm$ 0.008 & 0.299 $\pm$ 0.017 & 0.193 $\pm$ 0.004 & 0.360 $\pm$ 0.011 \\
~~~\tablemodel{Llama-3-8B-Inst} & 0.483 $\pm$ 0.010 & 18.597 $\pm$ 0.482 & 0.476 $\pm$ 0.013 & 0.161 $\pm$ 0.006 & 0.326 $\pm$ 0.003 & 0.304 $\pm$ 0.008 & 0.878 $\pm$ 0.035 \\
~~~\tablemodel{Llama-3-8B} & 0.311 $\pm$ 0.023 & 18.647 $\pm$ 1.030 & 0.332 $\pm$ 0.035 & 0.123 $\pm$ 0.004 & 0.271 $\pm$ 0.010 & 0.288 $\pm$ 0.029 & 0.739 $\pm$ 0.134 \\
~~~\tablemodel{Mixtral-8x7B-Inst} & 0.264 $\pm$ 0.004 & 16.087 $\pm$ 0.667 & 0.242 $\pm$ 0.007 & 0.173 $\pm$ 0.004 & 0.324 $\pm$ 0.005 & 0.219 $\pm$ 0.005 & 0.437 $\pm$ 0.007 \\
~~~\tablemodel{Mixtral-8x7B} & 0.262 $\pm$ 0.008 & 16.282 $\pm$ 0.523 & 0.250 $\pm$ 0.008 & 0.119 $\pm$ 0.003 & 0.310 $\pm$ 0.019 & 0.229 $\pm$ 0.006 & 0.457 $\pm$ 0.011 \\
~~~\tablemodel{Qwen-2.5-7B-Inst} & 1.974 $\pm$ 0.027 & 45.235 $\pm$ 0.742 & 2.509 $\pm$ 0.044 & 2.857 $\pm$ 0.056 & 1.653 $\pm$ 0.008 & 1.702 $\pm$ 0.035 & 1.333 $\pm$ 0.144 \\
~~~\tablemodel{Qwen-2.5-7B} & 0.910 $\pm$ 0.037 & 38.144 $\pm$ 1.039 & 1.149 $\pm$ 0.047 & 1.002 $\pm$ 0.053 & 0.601 $\pm$ 0.071 & 0.639 $\pm$ 0.047 & 0.928 $\pm$ 0.129 \\
~~~\tablemodel{Qwen-2.5-1.5B-Inst} & 2.158 $\pm$ 0.027 & 50.652 $\pm$ 0.866 & 2.614 $\pm$ 0.041 & 1.672 $\pm$ 0.055 & 1.413 $\pm$ 0.029 & 2.057 $\pm$ 0.033 & 2.448 $\pm$ 0.128 \\
~~~\tablemodel{Qwen-2.5-1.5B} & 1.731 $\pm$ 0.036 & 45.108 $\pm$ 0.519 & 2.337 $\pm$ 0.049 & 2.982 $\pm$ 0.052 & 0.942 $\pm$ 0.065 & 1.435 $\pm$ 0.046 & 1.304 $\pm$ 0.129 \\
~~~\tablemodel{Qwen-2.5-0.5B-Inst} & 1.937 $\pm$ 0.024 & 50.493 $\pm$ 0.602 & 2.444 $\pm$ 0.038 & 1.960 $\pm$ 0.063 & 1.443 $\pm$ 0.010 & 1.805 $\pm$ 0.030 & 1.199 $\pm$ 0.129 \\
~~~\tablemodel{Qwen-2.5-0.5B} & 1.995 $\pm$ 0.024 & 48.499 $\pm$ 0.834 & 2.546 $\pm$ 0.039 & 2.083 $\pm$ 0.052 & 1.579 $\pm$ 0.015 & 1.821 $\pm$ 0.030 & 1.225 $\pm$ 0.128 \\
\midrule
\multicolumn{7}{l}{\tablemodel{Multimodal Models}} \\
~~~\tablemodel{UniTime} & 0.370 $\pm$ 0.001 & 35.456 $\pm$ 0.152 & 0.457 $\pm$ 0.002 & 0.155 $\pm$ 0.000 & 0.194 $\pm$ 0.003 & 0.395 $\pm$ 0.001 & 0.423 $\pm$ 0.001 \\
~~~\tablemodel{Time-LLM (ETTh1)} & 0.476 $\pm$ 0.001 & 44.087 $\pm$ 0.148 & 0.518 $\pm$ 0.002 & 0.183 $\pm$ 0.000 & 0.403 $\pm$ 0.002 & 0.441 $\pm$ 0.001 & 0.482 $\pm$ 0.001 \\
~~~\tablemodel{ChatTime-Base} & 0.735 $\pm$ 0.002 & 39.037 $\pm$ 0.311 & 0.663 $\pm$ 0.002 & 0.181 $\pm$ 0.001 & 0.374 $\pm$ 0.003 & 0.794 $\pm$ 0.002 & 2.727 $\pm$ 0.003 \\
~~~\tablemodel{ChatTime-Chat} & 0.747 $\pm$ 0.005 & 34.186 $\pm$ 0.391 & 0.693 $\pm$ 0.007 & 0.405 $\pm$ 0.038 & 0.347 $\pm$ 0.007 & 0.832 $\pm$ 0.006 & 2.971 $\pm$ 0.018 \\
\midrule
\multicolumn{7}{l}{\textbf{Without Context}} \\
\midrule
\multicolumn{7}{l}{\directprompt (ours)} \\
~~~\tablemodel{Llama-3.1-405B-Inst} & 0.473 $\pm$ 0.005 & 34.334 $\pm$ 0.296 & 0.527 $\pm$ 0.007 & 0.713 $\pm$ 0.014 & 0.392 $\pm$ 0.003 & 0.320 $\pm$ 0.002 & 0.587 $\pm$ 0.005 \\
~~~\tablemodel{Llama-3-70B-Inst} & 0.470 $\pm$ 0.008 & 35.150 $\pm$ 0.356 & 0.532 $\pm$ 0.013 & 0.676 $\pm$ 0.018 & 0.389 $\pm$ 0.003 & 0.317 $\pm$ 0.002 & 0.615 $\pm$ 0.005 \\
~~~\tablemodel{Llama-3-8B-Inst} & 0.602 $\pm$ 0.006 & 41.874 $\pm$ 0.404 & 0.748 $\pm$ 0.009 & 0.679 $\pm$ 0.015 & 0.345 $\pm$ 0.008 & 0.335 $\pm$ 0.003 & 0.604 $\pm$ 0.004 \\
~~~\tablemodel{Mixtral-8x7B-Inst} & 0.712 $\pm$ 0.021 & 46.807 $\pm$ 0.389 & 0.906 $\pm$ 0.035 & 0.758 $\pm$ 0.015 & 0.400 $\pm$ 0.001 & 0.485 $\pm$ 0.028 & 0.893 $\pm$ 0.128 \\
~~~\tablemodel{Qwen-2.5-7B-Inst} & 0.421 $\pm$ 0.022 & 37.158 $\pm$ 0.547 & 0.479 $\pm$ 0.035 & 0.515 $\pm$ 0.017 & 0.322 $\pm$ 0.008 & 0.357 $\pm$ 0.028 & 0.830 $\pm$ 0.128 \\
~~~\tablemodel{Qwen-2.5-1.5B-Inst} & 0.450 $\pm$ 0.007 & 36.312 $\pm$ 0.615 & 0.494 $\pm$ 0.011 & 0.324 $\pm$ 0.007 & 0.368 $\pm$ 0.008 & 0.315 $\pm$ 0.006 & 0.498 $\pm$ 0.009 \\
~~~\tablemodel{Qwen-2.5-0.5B-Inst} & 0.305 $\pm$ 0.007 & 26.817 $\pm$ 0.353 & 0.341 $\pm$ 0.007 & 0.185 $\pm$ 0.004 & 0.236 $\pm$ 0.016 & 0.255 $\pm$ 0.005 & 0.396 $\pm$ 0.002 \\
~~~\tablemodel{GPT-4o} & 0.441 $\pm$ 0.008 & 31.505 $\pm$ 0.387 & 0.492 $\pm$ 0.013 & 0.280 $\pm$ 0.007 & 0.376 $\pm$ 0.002 & 0.276 $\pm$ 0.001 & 0.504 $\pm$ 0.002 \\
~~~\tablemodel{GPT-4o-mini} & 0.423 $\pm$ 0.006 & 35.711 $\pm$ 0.273 & 0.480 $\pm$ 0.009 & 0.391 $\pm$ 0.007 & 0.335 $\pm$ 0.004 & 0.280 $\pm$ 0.001 & 0.531 $\pm$ 0.003 \\
\midrule
\multicolumn{7}{l}{\llmp} \\
~~~\tablemodel{Llama-3-70B-Inst}  & 0.378 $\pm$ 0.004 & 26.030 $\pm$ 0.469 & 0.405 $\pm$ 0.006 & 0.186 $\pm$ 0.004 & 0.353 $\pm$ 0.004 & 0.253 $\pm$ 0.002 & 0.481 $\pm$ 0.004 \\
~~~\tablemodel{Llama-3-70B} & 0.311 $\pm$ 0.006 & 21.812 $\pm$ 0.470 & 0.311 $\pm$ 0.004 & 0.142 $\pm$ 0.004 & 0.321 $\pm$ 0.018 & 0.245 $\pm$ 0.002 & 0.479 $\pm$ 0.006 \\
~~~\tablemodel{Llama-3-8B-Inst}  & 0.503 $\pm$ 0.009 & 31.147 $\pm$ 0.437 & 0.598 $\pm$ 0.014 & 0.262 $\pm$ 0.009 & 0.365 $\pm$ 0.004 & 0.266 $\pm$ 0.002 & 0.510 $\pm$ 0.001 \\
~~~\tablemodel{Llama-3-8B} & 0.345 $\pm$ 0.003 & 25.063 $\pm$ 0.375 & 0.387 $\pm$ 0.004 & 0.162 $\pm$ 0.006 & 0.271 $\pm$ 0.007 & 0.250 $\pm$ 0.001 & 0.491 $\pm$ 0.002 \\
~~~\tablemodel{Mixtral-8x7B-Inst} & 0.383 $\pm$ 0.015 & 24.587 $\pm$ 0.471 & 0.420 $\pm$ 0.024 & 0.162 $\pm$ 0.008 & 0.340 $\pm$ 0.004 & 0.349 $\pm$ 0.019 & 0.470 $\pm$ 0.005 \\
~~~\tablemodel{Mixtral-8x7B} & 0.306 $\pm$ 0.007 & 22.567 $\pm$ 0.516 & 0.295 $\pm$ 0.004 & 0.150 $\pm$ 0.004 & 0.336 $\pm$ 0.021 & 0.242 $\pm$ 0.001 & 0.489 $\pm$ 0.003 \\
~~~\tablemodel{Qwen-2.5-7B-Inst} & 1.020 $\pm$ 0.026 & 39.951 $\pm$ 1.232 & 1.435 $\pm$ 0.041 & 0.889 $\pm$ 0.032 & 0.376 $\pm$ 0.018 & 0.812 $\pm$ 0.032 & 0.810 $\pm$ 0.128 \\
~~~\tablemodel{Qwen-2.5-7B} & 0.732 $\pm$ 0.030 & 38.091 $\pm$ 1.180 & 0.923 $\pm$ 0.045 & 0.403 $\pm$ 0.034 & 0.441 $\pm$ 0.029 & 0.545 $\pm$ 0.034 & 0.792 $\pm$ 0.128 \\
~~~\tablemodel{Qwen-2.5-1.5B-Inst} & 1.515 $\pm$ 0.033 & 47.085 $\pm$ 1.063 & 2.108 $\pm$ 0.047 & 0.607 $\pm$ 0.038 & 0.971 $\pm$ 0.050 & 1.300 $\pm$ 0.041 & 0.926 $\pm$ 0.128 \\
~~~\tablemodel{Qwen-2.5-1.5B} & 1.070 $\pm$ 0.028 & 41.199 $\pm$ 1.063 & 1.296 $\pm$ 0.044 & 0.272 $\pm$ 0.019 & 0.650 $\pm$ 0.011 & 0.855 $\pm$ 0.036 & 0.785 $\pm$ 0.128 \\
~~~\tablemodel{Qwen-2.5-0.5B-Inst} & 1.515 $\pm$ 0.033 & 47.085 $\pm$ 1.063 & 2.108 $\pm$ 0.047 & 0.607 $\pm$ 0.038 & 0.971 $\pm$ 0.050 & 1.300 $\pm$ 0.041 & 0.926 $\pm$ 0.128 \\
~~~\tablemodel{Qwen-2.5-0.5B} & 1.821 $\pm$ 0.027 & 47.768 $\pm$ 0.703 & 2.252 $\pm$ 0.042 & 1.480 $\pm$ 0.054 & 1.484 $\pm$ 0.024 & 1.642 $\pm$ 0.034 & 1.004 $\pm$ 0.129 \\
\midrule
\multicolumn{7}{l}{\tablemodel{Multimodal Models}} \\
~~~\tablemodel{UniTime} & 0.405 $\pm$ 0.002 & 37.250 $\pm$ 0.178 & 0.460 $\pm$ 0.003 & 0.178 $\pm$ 0.001 & 0.330 $\pm$ 0.003 & 0.384 $\pm$ 0.002 & 0.443 $\pm$ 0.003 \\
~~~\tablemodel{Time-LLM (ETTh1)} & 0.458 $\pm$ 0.002 & 43.016 $\pm$ 0.164 & 0.487 $\pm$ 0.002 & 0.174 $\pm$ 0.000 & 0.406 $\pm$ 0.004 & 0.419 $\pm$ 0.001 & 0.465 $\pm$ 0.001 \\
~~~\tablemodel{ChatTime-Base} & 0.725 $\pm$ 0.002 & 38.762 $\pm$ 0.353 & 0.658 $\pm$ 0.003 & 0.171 $\pm$ 0.001 & 0.367 $\pm$ 0.003 & 0.783 $\pm$ 0.003 & 2.719 $\pm$ 0.005 \\
~~~\tablemodel{ChatTime-Chat} & 0.781 $\pm$ 0.015 & 35.267 $\pm$ 0.436 & 0.741 $\pm$ 0.024 & 0.160 $\pm$ 0.001 & 0.425 $\pm$ 0.035 & 0.791 $\pm$ 0.001 & 2.882 $\pm$ 0.000 \\
\midrule
\multicolumn{7}{l}{\tablemodel{TS Foundation Models}} \\
~~~\tablemodel{Lag-Llama} & 0.327 $\pm$ 0.004 & 30.446 $\pm$ 0.818 & 0.330 $\pm$ 0.005 & 0.167 $\pm$ 0.005 & 0.292 $\pm$ 0.009 & 0.294 $\pm$ 0.004 & 0.495 $\pm$ 0.014 \\
~~~\tablemodel{Chronos-Tiny} & 0.328 $\pm$ 0.001 & 27.495 $\pm$ 0.440 & 0.302 $\pm$ 0.002 & 0.163 $\pm$ 0.002 & 0.393 $\pm$ 0.002 & 0.264 $\pm$ 0.002 & 0.486 $\pm$ 0.003 \\
~~~\tablemodel{Chronos-Mini} & 0.341 $\pm$ 0.001 & 28.892 $\pm$ 0.425 & 0.318 $\pm$ 0.002 & 0.171 $\pm$ 0.003 & 0.407 $\pm$ 0.002 & 0.272 $\pm$ 0.002 & 0.481 $\pm$ 0.004 \\
~~~\tablemodel{Chronos-Small} & 0.328 $\pm$ 0.002 & 26.528 $\pm$ 0.371 & 0.308 $\pm$ 0.002 & 0.179 $\pm$ 0.002 & 0.393 $\pm$ 0.003 & 0.257 $\pm$ 0.002 & 0.453 $\pm$ 0.007 \\
~~~\tablemodel{Chronos-Base} & 0.672 $\pm$ 0.003 & 30.592 $\pm$ 0.377 & 0.570 $\pm$ 0.002 & 0.211 $\pm$ 0.005 & 0.392 $\pm$ 0.002 & 0.697 $\pm$ 0.003 & 2.481 $\pm$ 0.013 \\
~~~\tablemodel{Chronos-Large} & 0.326 $\pm$ 0.002 & 25.600 $\pm$ 0.401 & 0.314 $\pm$ 0.002 & 0.179 $\pm$ 0.003 & 0.379 $\pm$ 0.003 & 0.255 $\pm$ 0.002 & 0.460 $\pm$ 0.004 \\
~~~\tablemodel{TimeGEN} & 0.353 $\pm$ 0.000 & 35.925 $\pm$ 0.168 & 0.332 $\pm$ 0.000 & 0.177 $\pm$ 0.000 & 0.405 $\pm$ 0.000 & 0.292 $\pm$ 0.000 & 0.474 $\pm$ 0.000 \\
~~~\tablemodel{Moirai-Small} & 0.565 $\pm$ 0.031 & 36.048 $\pm$ 0.438 & 0.662 $\pm$ 0.050 & 0.195 $\pm$ 0.010 & 0.434 $\pm$ 0.002 & 0.558 $\pm$ 0.040 & 0.464 $\pm$ 0.013 \\
~~~\tablemodel{Moirai-Base} & 0.624 $\pm$ 0.013 & 35.261 $\pm$ 0.410 & 0.629 $\pm$ 0.021 & 0.172 $\pm$ 0.002 & 0.399 $\pm$ 0.004 & 0.630 $\pm$ 0.017 & 0.486 $\pm$ 0.015 \\
~~~\tablemodel{Moirai-Large} & 0.520 $\pm$ 0.006 & 28.636 $\pm$ 0.870 & 0.596 $\pm$ 0.009 & 0.140 $\pm$ 0.001 & 0.431 $\pm$ 0.002 & 0.499 $\pm$ 0.007 & 0.438 $\pm$ 0.011 \\
\midrule
\multicolumn{7}{l}{\tablemodel{Statistical Models}} \\
~~~\tablemodel{ARIMA} & 0.475 $\pm$ 0.006 & 27.039 $\pm$ 0.483 & 0.557 $\pm$ 0.009 & 0.200 $\pm$ 0.007 & 0.350 $\pm$ 0.003 & 0.375 $\pm$ 0.006 & 0.440 $\pm$ 0.011 \\
~~~\tablemodel{ETS} & 0.530 $\pm$ 0.009 & 33.786 $\pm$ 0.635 & 0.639 $\pm$ 0.014 & 0.362 $\pm$ 0.014 & 0.315 $\pm$ 0.006 & 0.402 $\pm$ 0.010 & 0.508 $\pm$ 0.017 \\
~~~\tablemodel{Exp-Smoothing} & 0.605 $\pm$ 0.013 & 36.426 $\pm$ 0.349 & 0.702 $\pm$ 0.020 & 0.493 $\pm$ 0.016 & 0.397 $\pm$ 0.006 & 0.480 $\pm$ 0.015 & 0.827 $\pm$ 0.060 \\
\bottomrule
\end{tabu}
}
```
```{=latex}
\vspace*{-0.5cm}
```
`\cref{table:main-results-ctx-types-full}`{=latex} provides the results of all tested models, partitioned by the types of context.

#### Full results partitioned by model capabilities

We provide an additional view of the results of all models in `\cref{table:main-results-caps-full}`{=latex}, partitioned by model capabilities.

```{=latex}
\centering
```
```{=latex}
\resizebox{0.87\textwidth}{!}{%
\begin{tabu}{lccccccccc}
\toprule
 & \multirow{2}{*}{\shortstack{Average \\ RCRPS}} & \multirow{2}{*}{\shortstack{Average \\ Rank}}& \multirow{2}{*}{\shortstack{Common-Sense}} & \multicolumn{2}{c}{Retrieval} & \multicolumn{4}{c}{Reasoning}  \\
\cmidrule(lr){5-6} \cmidrule(lr){7-10}
 Model & & & & From Context& From Memory& Deductive& Analogical& Mathematical& Causal \\
\midrule
\multicolumn{9}{l}{\textbf{With Context}} \\
\midrule
\multicolumn{9}{l}{\directprompt (ours)} \\
~~~\tablemodel{Llama-3.1-405B-Inst}& \textbf{0.159 $\pm$ 0.008} & \textbf{7.971 $\pm$ 0.585} & \textbf{0.140 $\pm$ 0.013} & 0.109 $\pm$ 0.002 & \textbf{0.191 $\pm$ 0.006} & 0.132 $\pm$ 0.001 & \textbf{0.167 $\pm$ 0.008} & 0.316 $\pm$ 0.028 & 0.376 $\pm$ 0.039 \\
~~~\tablemodel{Llama-3-70B-Inst} & 0.286 $\pm$ 0.004 & 14.802 $\pm$ 0.203 & 0.323 $\pm$ 0.008 & 0.122 $\pm$ 0.003 & 0.408 $\pm$ 0.012 & 0.168 $\pm$ 0.002 & 0.492 $\pm$ 0.019 & 0.473 $\pm$ 0.012 & 0.577 $\pm$ 0.017 \\
~~~\tablemodel{Llama-3-8B-Inst} & 0.461 $\pm$ 0.008 & 28.016 $\pm$ 0.584 & 0.323 $\pm$ 0.010 & 0.174 $\pm$ 0.003 & 0.849 $\pm$ 0.021 & 0.407 $\pm$ 0.014 & 1.245 $\pm$ 0.039 & 0.437 $\pm$ 0.004 & 0.494 $\pm$ 0.004 \\
~~~\tablemodel{Mixtral-8x7B-Inst} & 0.523 $\pm$ 0.023 & 33.069 $\pm$ 0.413 & 0.433 $\pm$ 0.043 & 0.204 $\pm$ 0.000 & 0.864 $\pm$ 0.029 & 0.426 $\pm$ 0.024 & 1.245 $\pm$ 0.006 & 0.644 $\pm$ 0.080 & 0.789 $\pm$ 0.112 \\
~~~\tablemodel{Qwen-2.5-7B-Inst} & 0.290 $\pm$ 0.003 & 22.852 $\pm$ 0.734 & 0.343 $\pm$ 0.005 & 0.127 $\pm$ 0.002 & 0.324 $\pm$ 0.008 & 0.205 $\pm$ 0.005 & 0.281 $\pm$ 0.014 & 0.409 $\pm$ 0.002 & 0.480 $\pm$ 0.002 \\
~~~\tablemodel{Qwen-2.5-1.5B-Inst} & 0.575 $\pm$ 0.014 & 32.631 $\pm$ 0.861 & 0.610 $\pm$ 0.007 & 0.214 $\pm$ 0.004 & 0.988 $\pm$ 0.049 & 0.344 $\pm$ 0.009 & 1.077 $\pm$ 0.122 & 0.896 $\pm$ 0.006 & 1.151 $\pm$ 0.008 \\
~~~\tablemodel{Qwen-2.5-0.5B-Inst} & 0.463 $\pm$ 0.012 & 27.221 $\pm$ 0.533 & 0.267 $\pm$ 0.008 & 1.029 $\pm$ 0.055 & 0.744 $\pm$ 0.039 & 0.244 $\pm$ 0.007 & 2.043 $\pm$ 0.104 & 0.330 $\pm$ 0.004 & 0.392 $\pm$ 0.005 \\
~~~\tablemodel{GPT-4o} & 0.274 $\pm$ 0.010 & 8.640 $\pm$ 0.436 & 0.179 $\pm$ 0.004 & \textbf{0.087 $\pm$ 0.003} & 0.519 $\pm$ 0.029 & \textbf{0.110 $\pm$ 0.006} & 0.447 $\pm$ 0.029 & 0.590 $\pm$ 0.033 & 0.769 $\pm$ 0.046 \\
~~~\tablemodel{GPT-4o-mini} & 0.354 $\pm$ 0.022 & 17.573 $\pm$ 0.505 & 0.296 $\pm$ 0.043 & 0.419 $\pm$ 0.014 & 0.471 $\pm$ 0.012 & 0.219 $\pm$ 0.005 & 1.024 $\pm$ 0.033 & 0.475 $\pm$ 0.080 & 0.578 $\pm$ 0.112 \\
\midrule
\multicolumn{9}{l}{\llmp} \\
~~~\tablemodel{Llama-3-70B-Inst} & 0.539 $\pm$ 0.013 & 18.042 $\pm$ 0.572 & 0.641 $\pm$ 0.018 & 0.284 $\pm$ 0.015 & 0.392 $\pm$ 0.014 & 0.495 $\pm$ 0.025 & 0.312 $\pm$ 0.019 & 0.453 $\pm$ 0.020 & 0.495 $\pm$ 0.028 \\
~~~\tablemodel{Llama-3-70B} & 0.236 $\pm$ 0.006 & 12.377 $\pm$ 0.723 & 0.309 $\pm$ 0.011 & 0.126 $\pm$ 0.009 & 0.217 $\pm$ 0.007 & 0.132 $\pm$ 0.003 & 0.241 $\pm$ 0.019 & 0.294 $\pm$ 0.008 & \textbf{0.329 $\pm$ 0.010} \\
~~~\tablemodel{Llama-3-8B-Inst} & 0.483 $\pm$ 0.010 & 18.585 $\pm$ 0.477 & 0.345 $\pm$ 0.002 & 0.138 $\pm$ 0.004 & 0.910 $\pm$ 0.030 & 0.242 $\pm$ 0.008 & 1.278 $\pm$ 0.069 & 0.617 $\pm$ 0.022 & 0.787 $\pm$ 0.030 \\
~~~\tablemodel{Llama-3-8B} & 0.311 $\pm$ 0.023 & 18.634 $\pm$ 1.039 & 0.403 $\pm$ 0.043 & 0.124 $\pm$ 0.003 & 0.280 $\pm$ 0.026 & 0.177 $\pm$ 0.014 & 0.267 $\pm$ 0.015 & 0.530 $\pm$ 0.084 & 0.661 $\pm$ 0.117 \\
~~~\tablemodel{Mixtral-8x7B-Inst} & 0.264 $\pm$ 0.004 & 16.078 $\pm$ 0.666 & 0.344 $\pm$ 0.004 & 0.127 $\pm$ 0.003 & 0.224 $\pm$ 0.005 & 0.179 $\pm$ 0.010 & 0.173 $\pm$ 0.009 & 0.348 $\pm$ 0.005 & 0.405 $\pm$ 0.007 \\
~~~\tablemodel{Mixtral-8x7B} & 0.262 $\pm$ 0.008 & 16.302 $\pm$ 0.523 & 0.348 $\pm$ 0.012 & 0.146 $\pm$ 0.022 & 0.230 $\pm$ 0.016 & 0.153 $\pm$ 0.002 & 0.230 $\pm$ 0.041 & 0.354 $\pm$ 0.007 & 0.414 $\pm$ 0.009 \\
~~~\tablemodel{Qwen-2.5-7B-Inst}  & 1.974 $\pm$ 0.027 & 45.233 $\pm$ 0.739 & 1.816 $\pm$ 0.048 & 1.022 $\pm$ 0.054 & 2.215 $\pm$ 0.046 & 2.758 $\pm$ 0.024 & 1.723 $\pm$ 0.092 & 2.025 $\pm$ 0.093 & 1.607 $\pm$ 0.127 \\
~~~\tablemodel{Qwen-2.5-7B} & 0.910 $\pm$ 0.037 & 38.157 $\pm$ 1.041 & 0.691 $\pm$ 0.063 & 0.794 $\pm$ 0.083 & 1.558 $\pm$ 0.062 & 0.893 $\pm$ 0.028 & 2.328 $\pm$ 0.153 & 0.878 $\pm$ 0.084 & 0.881 $\pm$ 0.113 \\
~~~\tablemodel{Qwen-2.5-1.5B-Inst} & 2.158 $\pm$ 0.027 & 50.654 $\pm$ 0.863 & 2.056 $\pm$ 0.046 & 1.566 $\pm$ 0.033 & 2.671 $\pm$ 0.038 & 2.165 $\pm$ 0.035 & 3.635 $\pm$ 0.053 & 2.480 $\pm$ 0.085 & 2.323 $\pm$ 0.113 \\
~~~\tablemodel{Qwen-2.5-1.5B} & 1.731 $\pm$ 0.036 & 45.118 $\pm$ 0.528 & 1.343 $\pm$ 0.061 & 1.737 $\pm$ 0.074 & 2.594 $\pm$ 0.042 & 2.256 $\pm$ 0.042 & 3.275 $\pm$ 0.132 & 2.036 $\pm$ 0.083 & 1.526 $\pm$ 0.114 \\
~~~\tablemodel{Qwen-2.5-0.5B-Inst}  & 1.937 $\pm$ 0.024 & 50.482 $\pm$ 0.612 & 1.740 $\pm$ 0.043 & 1.800 $\pm$ 0.021 & 2.193 $\pm$ 0.025 & 2.305 $\pm$ 0.028 & 3.439 $\pm$ 0.004 & 1.685 $\pm$ 0.084 & 1.398 $\pm$ 0.114 \\
~~~\tablemodel{Qwen-2.5-0.5B} & 1.995 $\pm$ 0.024 & 48.507 $\pm$ 0.840 & 1.829 $\pm$ 0.045 & 0.950 $\pm$ 0.025 & 1.967 $\pm$ 0.020 & 2.809 $\pm$ 0.023 & 1.804 $\pm$ 0.036 & 1.695 $\pm$ 0.085 & 1.443 $\pm$ 0.113 \\
\midrule
\multicolumn{9}{l}{\tablemodel{Multimodal Models}} \\
~~~\tablemodel{UniTime} & 0.370 $\pm$ 0.001 & 35.453 $\pm$ 0.152 & 0.267 $\pm$ 0.002 & 0.179 $\pm$ 0.001 & 0.321 $\pm$ 0.001 & 0.511 $\pm$ 0.003 & 0.337 $\pm$ 0.001 & 0.333 $\pm$ 0.001 & 0.385 $\pm$ 0.001 \\
~~~\tablemodel{Time-LLM (ETTh1)} & 0.476 $\pm$ 0.001 & 44.084 $\pm$ 0.150 & 0.448 $\pm$ 0.002 & 0.192 $\pm$ 0.000 & 0.373 $\pm$ 0.000 & 0.538 $\pm$ 0.003 & 0.397 $\pm$ 0.001 & 0.382 $\pm$ 0.001 & 0.440 $\pm$ 0.001 \\
~~~\tablemodel{ChatTime-Base} & 0.735 $\pm$ 0.002 & 39.033 $\pm$ 0.312 & 0.843 $\pm$ 0.002 & 0.216 $\pm$ 0.002 & 1.099 $\pm$ 0.002 & 0.263 $\pm$ 0.004 & 0.374 $\pm$ 0.004 & 1.788 $\pm$ 0.002 & 2.407 $\pm$ 0.002 \\
~~~\tablemodel{ChatTime-Chat} & 0.747 $\pm$ 0.005 & 34.182 $\pm$ 0.389 & 0.825 $\pm$ 0.004 & 0.299 $\pm$ 0.020 & 1.198 $\pm$ 0.015 & 0.305 $\pm$ 0.012 & 0.277 $\pm$ 0.003 & 2.015 $\pm$ 0.016 & 2.691 $\pm$ 0.022 \\
\midrule
\multicolumn{9}{l}{\textbf{Without Context}} \\
\midrule
\multicolumn{9}{l}{\directprompt (ours)} \\
~~~\tablemodel{Llama-3.1-405B-Inst} & 0.473 $\pm$ 0.005 & 34.336 $\pm$ 0.294 & 0.393 $\pm$ 0.002 & 0.325 $\pm$ 0.006 & 0.752 $\pm$ 0.015 & 0.494 $\pm$ 0.009 & 0.720 $\pm$ 0.027 & 0.594 $\pm$ 0.006 & 0.617 $\pm$ 0.006 \\
~~~\tablemodel{Llama-3-70B-Inst} & 0.470 $\pm$ 0.008 & 35.143 $\pm$ 0.357 & 0.404 $\pm$ 0.002 & 0.304 $\pm$ 0.007 & 0.717 $\pm$ 0.015 & 0.488 $\pm$ 0.022 & 0.694 $\pm$ 0.024 & 0.606 $\pm$ 0.007 & 0.631 $\pm$ 0.007 \\
~~~\tablemodel{Llama-3-8B-Inst} & 0.602 $\pm$ 0.006 & 41.873 $\pm$ 0.398 & 0.390 $\pm$ 0.005 & 0.322 $\pm$ 0.004 & 1.123 $\pm$ 0.018 & 0.643 $\pm$ 0.012 & 1.446 $\pm$ 0.035 & 0.581 $\pm$ 0.005 & 0.617 $\pm$ 0.005 \\
~~~\tablemodel{Mixtral-8x7B-Inst} & 0.712 $\pm$ 0.021 & 46.809 $\pm$ 0.376 & 0.624 $\pm$ 0.043 & 0.324 $\pm$ 0.006 & 1.053 $\pm$ 0.007 & 0.783 $\pm$ 0.005 & 1.237 $\pm$ 0.007 & 0.858 $\pm$ 0.080 & 0.872 $\pm$ 0.112 \\
~~~\tablemodel{Qwen-2.5-7B-Inst} & 0.421 $\pm$ 0.022 & 37.154 $\pm$ 0.546 & 0.447 $\pm$ 0.043 & 0.259 $\pm$ 0.008 & 0.505 $\pm$ 0.010 & 0.375 $\pm$ 0.007 & 0.411 $\pm$ 0.016 & 0.692 $\pm$ 0.080 & 0.792 $\pm$ 0.112 \\
~~~\tablemodel{Qwen-2.5-1.5B-Inst}  & 0.450 $\pm$ 0.007 & 36.308 $\pm$ 0.619 & 0.377 $\pm$ 0.006 & 0.232 $\pm$ 0.004 & 0.661 $\pm$ 0.018 & 0.387 $\pm$ 0.013 & 0.939 $\pm$ 0.039 & 0.423 $\pm$ 0.007 & 0.476 $\pm$ 0.008 \\
~~~\tablemodel{Qwen-2.5-0.5B-Inst} & 0.305 $\pm$ 0.007 & 26.819 $\pm$ 0.351 & 0.267 $\pm$ 0.010 & 0.162 $\pm$ 0.001 & 0.384 $\pm$ 0.008 & 0.300 $\pm$ 0.011 & 0.440 $\pm$ 0.016 & \textbf{0.315 $\pm$ 0.002} & 0.367 $\pm$ 0.002 \\
~~~\tablemodel{GPT-4o} & 0.441 $\pm$ 0.008 & 31.507 $\pm$ 0.385 & 0.381 $\pm$ 0.002 & 0.179 $\pm$ 0.002 & 0.692 $\pm$ 0.028 & 0.357 $\pm$ 0.007 & 0.953 $\pm$ 0.067 & 0.422 $\pm$ 0.003 & 0.471 $\pm$ 0.002 \\
~~~\tablemodel{GPT-4o-mini} & 0.423 $\pm$ 0.006 & 35.715 $\pm$ 0.275 & 0.359 $\pm$ 0.003 & 0.214 $\pm$ 0.002 & 0.649 $\pm$ 0.019 & 0.391 $\pm$ 0.008 & 0.771 $\pm$ 0.041 & 0.461 $\pm$ 0.003 & 0.511 $\pm$ 0.003 \\
\midrule
\multicolumn{9}{l}{\llmp} \\
~~~\tablemodel{Llama-3-70B-Inst} & 0.378 $\pm$ 0.004 & 26.031 $\pm$ 0.474 & 0.368 $\pm$ 0.003 & 0.150 $\pm$ 0.003 & 0.513 $\pm$ 0.012 & 0.292 $\pm$ 0.006 & 0.668 $\pm$ 0.025 & 0.384 $\pm$ 0.003 & 0.440 $\pm$ 0.003 \\
~~~\tablemodel{Llama-3-70B} & 0.311 $\pm$ 0.006 & 21.810 $\pm$ 0.464 & 0.349 $\pm$ 0.011 & 0.141 $\pm$ 0.002 & 0.351 $\pm$ 0.008 & 0.215 $\pm$ 0.004 & 0.395 $\pm$ 0.017 & 0.372 $\pm$ 0.004 & 0.434 $\pm$ 0.005 \\
~~~\tablemodel{Llama-3-8B-Inst} & 0.503 $\pm$ 0.009 & 31.156 $\pm$ 0.438 & 0.385 $\pm$ 0.003 & 0.159 $\pm$ 0.002 & 0.914 $\pm$ 0.030 & 0.431 $\pm$ 0.010 & 1.271 $\pm$ 0.070 & 0.424 $\pm$ 0.003 & 0.467 $\pm$ 0.001 \\
~~~\tablemodel{Llama-3-8B} & 0.345 $\pm$ 0.003 & 25.067 $\pm$ 0.381 & 0.326 $\pm$ 0.004 & 0.150 $\pm$ 0.002 & 0.497 $\pm$ 0.009 & 0.266 $\pm$ 0.005 & 0.640 $\pm$ 0.017 & 0.381 $\pm$ 0.002 & 0.444 $\pm$ 0.002 \\
~~~\tablemodel{Mixtral-8x7B-Inst} & 0.383 $\pm$ 0.015 & 24.582 $\pm$ 0.471 & 0.357 $\pm$ 0.003 & 0.550 $\pm$ 0.072 & 0.459 $\pm$ 0.047 & 0.211 $\pm$ 0.008 & 1.027 $\pm$ 0.133 & 0.371 $\pm$ 0.004 & 0.428 $\pm$ 0.004 \\
~~~\tablemodel{Mixtral-8x7B}  & 0.306 $\pm$ 0.007 & 22.560 $\pm$ 0.512 & 0.360 $\pm$ 0.013 & 0.146 $\pm$ 0.002 & 0.327 $\pm$ 0.008 & 0.202 $\pm$ 0.005 & 0.340 $\pm$ 0.016 & 0.382 $\pm$ 0.002 & 0.445 $\pm$ 0.003 \\
~~~\tablemodel{Qwen-2.5-7B-Inst} & 1.020 $\pm$ 0.026 & 39.942 $\pm$ 1.233 & 0.521 $\pm$ 0.045 & 1.157 $\pm$ 0.022 & 1.634 $\pm$ 0.017 & 1.061 $\pm$ 0.034 & 3.319 $\pm$ 0.037 & 0.853 $\pm$ 0.081 & 0.769 $\pm$ 0.112 \\
~~~\tablemodel{Qwen-2.5-7B} & 0.732 $\pm$ 0.030 & 38.109 $\pm$ 1.185 & 0.649 $\pm$ 0.048 & 0.359 $\pm$ 0.036 & 0.974 $\pm$ 0.037 & 0.751 $\pm$ 0.039 & 1.433 $\pm$ 0.096 & 0.728 $\pm$ 0.084 & 0.730 $\pm$ 0.112 \\
~~~\tablemodel{Qwen-2.5-1.5B-Inst} & 1.515 $\pm$ 0.033 & 47.071 $\pm$ 1.066 & 1.316 $\pm$ 0.055 & 1.159 $\pm$ 0.057 & 1.802 $\pm$ 0.031 & 1.652 $\pm$ 0.042 & 3.383 $\pm$ 0.106 & 1.108 $\pm$ 0.082 & 0.848 $\pm$ 0.112 \\
~~~\tablemodel{Qwen-2.5-1.5B} & 1.070 $\pm$ 0.028 & 41.187 $\pm$ 1.053 & 1.005 $\pm$ 0.048 & 0.287 $\pm$ 0.026 & 1.339 $\pm$ 0.023 & 1.264 $\pm$ 0.040 & 1.798 $\pm$ 0.047 & 0.771 $\pm$ 0.086 & 0.720 $\pm$ 0.112 \\
~~~\tablemodel{Qwen-2.5-0.5B-Inst} & 1.318 $\pm$ 0.037 & 44.393 $\pm$ 0.750 & 1.464 $\pm$ 0.064 & 0.239 $\pm$ 0.031 & 1.192 $\pm$ 0.019 & 1.433 $\pm$ 0.047 & 1.675 $\pm$ 0.072 & 0.930 $\pm$ 0.082 & 0.743 $\pm$ 0.112 \\
~~~\tablemodel{Qwen-2.5-0.5B} & 1.821 $\pm$ 0.027 & 47.763 $\pm$ 0.715 & 1.705 $\pm$ 0.045 & 0.572 $\pm$ 0.040 & 1.722 $\pm$ 0.033 & 2.498 $\pm$ 0.036 & 1.671 $\pm$ 0.064 & 1.492 $\pm$ 0.083 & 1.113 $\pm$ 0.114 \\
\midrule
\multicolumn{9}{l}{\tablemodel{Multimodal Models}} \\
~~~\tablemodel{UniTime} & 0.405 $\pm$ 0.002 & 37.248 $\pm$ 0.177 & 0.361 $\pm$ 0.002 & 0.166 $\pm$ 0.001 & 0.319 $\pm$ 0.001 & 0.496 $\pm$ 0.005 & 0.314 $\pm$ 0.002 & 0.352 $\pm$ 0.002 & 0.409 $\pm$ 0.003 \\
~~~\tablemodel{Time-LLM (ETTh1)} & 0.458 $\pm$ 0.002 & 43.014 $\pm$ 0.164 & 0.440 $\pm$ 0.003 & 0.191 $\pm$ 0.000 & 0.371 $\pm$ 0.000 & 0.499 $\pm$ 0.002 & 0.399 $\pm$ 0.001 & 0.368 $\pm$ 0.001 & 0.424 $\pm$ 0.001 \\
~~~\tablemodel{ChatTime-Base} & 0.725 $\pm$ 0.002 & 38.771 $\pm$ 0.351 & 0.837 $\pm$ 0.002 & 0.205 $\pm$ 0.003 & 1.090 $\pm$ 0.004 & 0.250 $\pm$ 0.004 & 0.365 $\pm$ 0.006 & 1.779 $\pm$ 0.003 & 2.398 $\pm$ 0.004 \\
~~~\tablemodel{ChatTime-Chat} & 0.781 $\pm$ 0.015 & 35.260 $\pm$ 0.446 & 0.865 $\pm$ 0.022 & 0.165 $\pm$ 0.002 & 1.217 $\pm$ 0.036 & 0.282 $\pm$ 0.020 & 0.445 $\pm$ 0.064 & 1.896 $\pm$ 0.001 & 2.536 $\pm$ 0.000 \\
\midrule
\multicolumn{9}{l}{\tablemodel{TS Foundation Models}} \\
~~~\tablemodel{Lag-Llama} & 0.327 $\pm$ 0.004 & 30.451 $\pm$ 0.819 & 0.353 $\pm$ 0.007 & 0.181 $\pm$ 0.003 & 0.324 $\pm$ 0.003 & 0.269 $\pm$ 0.006 & 0.342 $\pm$ 0.006 & 0.386 $\pm$ 0.009 & 0.449 $\pm$ 0.012 \\
~~~\tablemodel{Chronos-Tiny} & 0.328 $\pm$ 0.001 & 27.487 $\pm$ 0.441 & 0.400 $\pm$ 0.002 & 0.144 $\pm$ 0.003 & 0.297 $\pm$ 0.002 & 0.229 $\pm$ 0.002 & 0.290 $\pm$ 0.005 & 0.382 $\pm$ 0.002 & 0.440 $\pm$ 0.003 \\
~~~\tablemodel{Chronos-Mini} & 0.341 $\pm$ 0.001 & 28.893 $\pm$ 0.428 & 0.412 $\pm$ 0.002 & 0.147 $\pm$ 0.002 & 0.302 $\pm$ 0.002 & 0.248 $\pm$ 0.002 & 0.305 $\pm$ 0.004 & 0.378 $\pm$ 0.003 & 0.436 $\pm$ 0.004 \\
~~~\tablemodel{Chronos-Small} & 0.328 $\pm$ 0.002 & 26.524 $\pm$ 0.372 & 0.388 $\pm$ 0.003 & 0.144 $\pm$ 0.002 & 0.287 $\pm$ 0.002 & 0.248 $\pm$ 0.003 & 0.290 $\pm$ 0.003 & 0.358 $\pm$ 0.005 & 0.412 $\pm$ 0.006 \\
~~~\tablemodel{Chronos-Base} & 0.672 $\pm$ 0.003 & 30.601 $\pm$ 0.375 & 0.702 $\pm$ 0.002 & 0.143 $\pm$ 0.002 & 1.023 $\pm$ 0.006 & 0.261 $\pm$ 0.003 & 0.299 $\pm$ 0.004 & 1.643 $\pm$ 0.009 & 2.187 $\pm$ 0.012 \\
~~~\tablemodel{Chronos-Large} & 0.326 $\pm$ 0.002 & 25.602 $\pm$ 0.399 & 0.385 $\pm$ 0.002 & 0.138 $\pm$ 0.002 & 0.288 $\pm$ 0.002 & 0.249 $\pm$ 0.002 & 0.295 $\pm$ 0.003 & 0.362 $\pm$ 0.003 & 0.417 $\pm$ 0.004 \\
~~~\tablemodel{TimeGEN} & 0.353 $\pm$ 0.000 & 35.924 $\pm$ 0.167 & 0.401 $\pm$ 0.000 & 0.176 $\pm$ 0.000 & 0.308 $\pm$ 0.000 & 0.278 $\pm$ 0.000 & 0.324 $\pm$ 0.000 & 0.377 $\pm$ 0.000 & 0.431 $\pm$ 0.000 \\
~~~\tablemodel{Moirai-small}  & 0.565 $\pm$ 0.031 & 36.038 $\pm$ 0.438 & 0.429 $\pm$ 0.005 & 0.671 $\pm$ 0.146 & 0.468 $\pm$ 0.076 & 0.566 $\pm$ 0.017 & 1.204 $\pm$ 0.271 & 0.376 $\pm$ 0.009 & 0.426 $\pm$ 0.012 \\
~~~\tablemodel{Moirai-base} & 0.624 $\pm$ 0.013 & 35.263 $\pm$ 0.407 & 0.410 $\pm$ 0.006 & 0.600 $\pm$ 0.053 & 0.680 $\pm$ 0.028 & 0.690 $\pm$ 0.019 & 1.147 $\pm$ 0.099 & 0.375 $\pm$ 0.010 & 0.441 $\pm$ 0.013 \\
~~~\tablemodel{Moirai-large} & 0.520 $\pm$ 0.006 & 28.635 $\pm$ 0.862 & 0.414 $\pm$ 0.004 & 0.155 $\pm$ 0.004 & 0.260 $\pm$ 0.003 & 0.751 $\pm$ 0.015 & 0.276 $\pm$ 0.008 & 0.337 $\pm$ 0.007 & 0.397 $\pm$ 0.010 \\
\midrule
\multicolumn{9}{l}{\tablemodel{Statistical Models}} \\
~~~\tablemodel{ARIMA} & 0.475 $\pm$ 0.006 & 27.047 $\pm$ 0.485 & 0.395 $\pm$ 0.005 & 0.160 $\pm$ 0.002 & 0.517 $\pm$ 0.012 & 0.513 $\pm$ 0.012 & 0.706 $\pm$ 0.026 & 0.354 $\pm$ 0.007 & 0.403 $\pm$ 0.010 \\
~~~\tablemodel{ETS} & 0.530 $\pm$ 0.009 & 33.786 $\pm$ 0.635 & 0.418 $\pm$ 0.009 & 0.228 $\pm$ 0.010 & 0.682 $\pm$ 0.018 & 0.577 $\pm$ 0.019 & 0.855 $\pm$ 0.035 & 0.453 $\pm$ 0.012 & 0.479 $\pm$ 0.015 \\
~~~\tablemodel{Exp-Smoothing} & 0.605 $\pm$ 0.013 & 36.425 $\pm$ 0.346 & 0.569 $\pm$ 0.021 & 0.334 $\pm$ 0.013 & 0.743 $\pm$ 0.018 & 0.563 $\pm$ 0.016 & 0.899 $\pm$ 0.035 & 0.673 $\pm$ 0.038 & 0.782 $\pm$ 0.053 \\
\bottomrule
\end{tabu}
}
```
```{=latex}
\vspace*{-0.5cm}
```
#### Inference Time {#app:inference_time}

```{=latex}
\vspace*{-2mm}
```
```{=latex}
\centering
```
![Inference time in seconds, for all baselines, averaged over all tasks. Several quantitative methods are much faster on average than LLM-based methods. However, there are significant differences in inference time between the LLM-based forecasters: for the `\model{Llama}`{=latex} models, `\llmp `{=latex}takes about an order of magnitude more time to run on average than `\directprompt`{=latex}. ](figures/icml_fixed/model_runtime_striped_icml.png){#fig:cost-inference-time width="\\linewidth"}

`\cref{fig:cost-inference-time}`{=latex} provides the inference time of all tested models on the benchmark. Note that these values have not been normalized based on the computing resources made available to each model during inference; please refer to `\cref{app:models}`{=latex} for details on the compute resources allocated to each model.

#### Significant failures per model {#subsec:significant-failures}

```{=latex}
\captionsetup{justification=centering}
```
```{=latex}
\centering
```
```{=latex}
\resizebox{0.39\textwidth}{!}{%
\begin{tabular}{lc}
\toprule
 Model & Number of instances with significant failures \\
\midrule
\multicolumn{2}{l}{\textbf{With Context}} \\
\midrule
\multicolumn{2}{l}{\directprompt (ours)} \\
~~~\tablemodel{Llama-3.1-405B-Inst} & 0 \\
~~~\tablemodel{Llama-3-70B-Inst} & 1 \\
~~~\tablemodel{Llama-3-8B-Inst} & 2 \\
~~~\tablemodel{Mixtral-8x7B-Inst} & 7 \\
~~~\tablemodel{Qwen-2.5-7B-Inst} & 1 \\
~~~\tablemodel{Qwen-2.5-1.5B-Inst} & 10 \\
~~~\tablemodel{Qwen-2.5-0.5B-Inst} & 7 \\
~~~\tablemodel{GPT-4o} & 5 \\
~~~\tablemodel{GPT-4o-mini} & 0 \\
\midrule
\multicolumn{2}{l}{\llmp} \\
~~~\tablemodel{Llama-3-70B-Inst} & 18 \\
~~~\tablemodel{Llama-3-70B} & 0 \\
~~~\tablemodel{Llama-3-8B-Inst} & 12 \\
~~~\tablemodel{Llama-3-8B} & 0 \\
~~~\tablemodel{Mixtral-8x7B-Inst} & 1 \\
~~~\tablemodel{Mixtral-8x7B} & 0 \\
~~~\tablemodel{Qwen-2.5-7B-Inst} & 107 \\
~~~\tablemodel{Qwen-2.5-7B} & 27 \\
~~~\tablemodel{Qwen-2.5-1.5B-Inst} & 93 \\
~~~\tablemodel{Qwen-2.5-1.5B} & 99 \\
~~~\tablemodel{Qwen-2.5-0.5B-Inst} & 100 \\
~~~\tablemodel{Qwen-2.5-0.5B} & 109 \\
\midrule
\multicolumn{2}{l}{Multimodal Models} \\
~~~\tablemodel{UniTime} & 0 \\
~~~\tablemodel{Time-LLM (ETTh1)} & 2 \\
~~~\tablemodel{ChatTime-Base} & 21 \\
~~~\tablemodel{ChatTime-Chat} & 23 \\
\midrule
\multicolumn{2}{l}{\textbf{Without Context}} \\
\midrule
\multicolumn{2}{l}{\directprompt (ours)} \\
~~~\tablemodel{Llama-3.1-405B-Inst} & 9 \\
~~~\tablemodel{Llama-3-70B-Inst} & 4 \\
~~~\tablemodel{Llama-3-8B-Inst} & 8 \\
~~~\tablemodel{Mixtral-8x7B-Inst} & 14 \\
~~~\tablemodel{Qwen-2.5-7B-Inst} & 3 \\
~~~\tablemodel{Qwen-2.5-1.5B-Inst} & 5 \\
~~~\tablemodel{Qwen-2.5-0.5B-Inst} & 0 \\
~~~\tablemodel{GPT-4o} & 2 \\
~~~\tablemodel{GPT-4o-mini} & 1 \\
\midrule
\multicolumn{2}{l}{\llmp} \\
~~~\tablemodel{Llama-3-70B-Inst} & 1 \\
~~~\tablemodel{Llama-3-70B} & 0 \\
~~~\tablemodel{Llama-3-8B-Inst} & 9 \\
~~~\tablemodel{Llama-3-8B} & 3 \\
~~~\tablemodel{Mixtral-8x7B-Inst} & 3 \\
~~~\tablemodel{Mixtral-8x7B} & 0 \\
~~~\tablemodel{Qwen-2.5-7B-Inst} & 35 \\
~~~\tablemodel{Qwen-2.5-7B} & 15 \\
~~~\tablemodel{Qwen-2.5-1.5B-Inst} & 52 \\
~~~\tablemodel{Qwen-2.5-1.5B} & 36 \\
~~~\tablemodel{Qwen-2.5-0.5B-Inst} & 42 \\
~~~\tablemodel{Qwen-2.5-0.5B} & 74 \\
\midrule
\multicolumn{2}{l}{\tablemodel{Multimodal Models}} \\
~~~\tablemodel{UniTime} & 1 \\
~~~\tablemodel{Time-LLM (ETTh1)} & 1 \\
~~~\tablemodel{ChatTime-Base} & 21 \\
~~~\tablemodel{ChatTime-Chat} & 23 \\
\midrule
\multicolumn{2}{l}{\tablemodel{TS Foundation Models}} \\
~~~\tablemodel{Lag-Llama} & 1 \\
~~~\tablemodel{Chronos-Tiny} & 2 \\
~~~\tablemodel{Chronos-Mini} & 2 \\
~~~\tablemodel{Chronos-Small} & 1 \\
~~~\tablemodel{Chronos-Base} & 18 \\
~~~\tablemodel{Chronos-Large} & 1 \\
~~~\tablemodel{TimeGEN} & 2 \\
~~~\tablemodel{Moirai-Small} & 3 \\
~~~\tablemodel{Moirai-Base} & 8 \\
~~~\tablemodel{Moirai-Large} & 7 \\
\midrule
\multicolumn{2}{l}{\tablemodel{Statistical Models}} \\
~~~\tablemodel{ARIMA} & 2 \\
~~~\tablemodel{ETS} & 1 \\
~~~\tablemodel{Exp-Smoothing} & 5 \\
\bottomrule
\end{tabular}
}
```
We observe that on a few instances in the benchmark, some models obtain significantly worse performance when evaluated with context. In our evaluation, we term any instance where the RCRPS value of a model exceeds $5$ a significant failure of the model on that instance. We found $5$ to be a suitable threshold for analyzing such failures, as it intuitively represents the score a forecast would receive if the distance between the forecast and the ground truth were $5$ times the range of the ground truth for the task. When we aggregate the RCRPS over instances of the benchmark (such as in `\cref{table:main-results}`{=latex}), we cap the RCRPS of such significant failures at $5$, so that outliers with much higher RCRPS do not dominate the aggregate score. In `\cref{table:cat-failures-per-model}`{=latex}, we report the number of instances (out of a total of $355$ evaluated instances) on which each model incurred a significant failure. Interestingly, some models, such as `\directprompt `{=latex}with `\model{Llama-3.1-405B-Instruct}`{=latex} and `\llmp `{=latex}with `\model{Llama-3-70B}`{=latex} and `\model{Llama-3-8B}`{=latex}, are robust to such failures and incur none. On the other hand, the `\model{Qwen}`{=latex} family of models (notably much smaller than the rest) with `\llmp `{=latex}incurs the most significant failures, followed by `\model{Llama-3-70B-Instruct}`{=latex} and `\model{Llama-3-8B-Instruct}`{=latex} with `\llmp`{=latex}. We postulate that these failures stem from models misinterpreting the context. How to make models more robust, so as to prevent or reduce such significant failures, remains an open question. We visualize examples of significant failures in `\cref{app:viz-failure-forecasts}`{=latex}.
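The capping rule described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual implementation, and the per-instance scores below are made up:

```python
import numpy as np

# RCRPS above this threshold counts as a significant failure.
SIGNIFICANT_FAILURE_THRESHOLD = 5.0

def aggregate_rcrps(scores):
    """Cap significant failures at the threshold, then average.

    Returns the aggregate RCRPS and the number of capped instances.
    """
    scores = np.asarray(scores, dtype=float)
    n_failures = int((scores > SIGNIFICANT_FAILURE_THRESHOLD).sum())
    capped = np.minimum(scores, SIGNIFICANT_FAILURE_THRESHOLD)
    return capped.mean(), n_failures

# Hypothetical per-instance RCRPS values: one outlier would dominate the raw mean.
scores = [0.2, 0.3, 0.25, 42.0]
agg, n_fail = aggregate_rcrps(scores)  # the outlier is capped at 5.0
```

Without the cap, the single outlier would drive the mean to over 10; with it, the aggregate stays interpretable while the failure is still counted separately.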

#### Testing the Statistical Significance of the Relevance of Context {#app:stat-signif}

```{=latex}
\centering
```
::: {#tab:paired-t-test}
  ---------------------------------------------------------------------- --------------- ----------------- ----------------------------------
                                                                         Average RCRPS   Average RCRPS
  Model                                                                  With context    Without context   $p$-value
  `\directprompt`{=latex} - `\tablemodel{Llama-3.1-405B-Inst}`{=latex}   0.165 ± 0.005   0.544 ± 0.007     6.92 $\times$ 10$^\text{-13}$
  `\llmp`{=latex} - `\tablemodel{Llama-3-70B}`{=latex}                   0.191 ± 0.004   0.249 ± 0.004     1.85 $\times$ 10$^\text{-9}$
  `\llmp`{=latex} - `\tablemodel{Mixtral-8x7B}`{=latex}                  0.202 ± 0.005   0.245 ± 0.004     9.17 $\times$ 10$^\text{-8}$
  `\llmp`{=latex} - `\tablemodel{Llama-3-8B}`{=latex}                    0.214 ± 0.007   0.283 ± 0.003     4.66 $\times$ 10$^\text{-4}$
  `\llmp`{=latex} - `\tablemodel{Mixtral-8x7B-Inst}`{=latex}             0.223 ± 0.002   0.290 ± 0.009     0.002
  `\directprompt`{=latex} - `\tablemodel{Qwen-2.5-7B-Inst}`{=latex}      0.244 ± 0.003   0.403 ± 0.009     7.99 $\times$ 10$^\text{-8}$
  `\directprompt`{=latex} - `\tablemodel{Llama-3-70B-Inst}`{=latex}      0.246 ± 0.003   0.529 ± 0.010     1.07 $\times$ 10$^\text{-10}$
  `\directprompt`{=latex} - `\tablemodel{GPT-4o-mini}`{=latex}           0.250 ± 0.003   0.403 ± 0.005     2.85 $\times$ 10$^\text{-8}$
  `\directprompt`{=latex} - `\tablemodel{GPT-4o}`{=latex}                0.252 ± 0.010   0.387 ± 0.007     6.21 $\times$ 10$^\text{-4}$
  `\tablemodel{UniTime}`{=latex}                                         0.290 ± 0.001   0.321 ± 0.001     0.016
  `\directprompt`{=latex} - `\tablemodel{Qwen-2.5-0.5B-Inst}`{=latex}    0.343 ± 0.011   0.258 ± 0.004     0.987
  `\tablemodel{Time-LLM (ETTh1)}`{=latex}                                0.378 ± 0.001   0.364 ± 0.001     1 - 8.08 $\times$ 10$^\text{-7}$
  `\directprompt`{=latex} - `\tablemodel{Mixtral-8x7B-Inst}`{=latex}     0.413 ± 0.007   0.699 ± 0.006     1.88 $\times$ 10$^\text{-6}$
  `\llmp`{=latex} - `\tablemodel{Llama-3-8B-Inst}`{=latex}               0.413 ± 0.009   0.432 ± 0.010     0.287
  `\directprompt`{=latex} - `\tablemodel{Llama-3-8B-Inst}`{=latex}       0.416 ± 0.007   0.631 ± 0.007     3.31 $\times$ 10$^\text{-10}$
  `\directprompt`{=latex} - `\tablemodel{Qwen-2.5-1.5B-Inst}`{=latex}    0.481 ± 0.016   0.406 ± 0.006     0.975
  `\tablemodel{ChatTime-Chat}`{=latex}                                   0.557 ± 0.001   0.554 ± 0.001     0.086
  `\tablemodel{ChatTime-Base}`{=latex}                                   0.568 ± 0.001   0.556 ± 0.002     1 - 1.35 $\times$ 10$^\text{-4}$
  `\llmp`{=latex} - `\tablemodel{Llama-3-70B-Inst}`{=latex}              0.579 ± 0.019   0.313 ± 0.003     1 - 1.93 $\times$ 10$^\text{-5}$
  `\llmp`{=latex} - `\tablemodel{Qwen-2.5-7B}`{=latex}                   0.909 ± 0.025   0.618 ± 0.018     1 - 8.48 $\times$ 10$^\text{-6}$
  `\llmp`{=latex} - `\tablemodel{Qwen-2.5-1.5B-Inst}`{=latex}            2.038 ± 0.025   1.181 ± 0.022     1
  `\llmp`{=latex} - `\tablemodel{Qwen-2.5-0.5B-Inst}`{=latex}            2.067 ± 0.025   1.047 ± 0.017     1
  `\llmp`{=latex} - `\tablemodel{Qwen-2.5-0.5B}`{=latex}                 2.144 ± 0.021   1.766 ± 0.025     1 - 4.83 $\times$ 10$^\text{-8}$
  `\llmp`{=latex} - `\tablemodel{Qwen-2.5-1.5B}`{=latex}                 2.162 ± 0.028   0.861 ± 0.016     1
  `\llmp`{=latex} - `\tablemodel{Qwen-2.5-7B-Inst}`{=latex}              2.275 ± 0.025   0.895 ± 0.014     1
  ---------------------------------------------------------------------- --------------- ----------------- ----------------------------------

  : $p$-value of the one-sided paired $t$-test between the RCRPS values with and without context, for models that can use it. Since this test is performed on the unweighted RCRPS values, the average RCRPS presented in this table is also unweighted.
:::

To assess whether the lower average RCRPS with context than without, observed for the best-performing models in our benchmark, is statistically significant, we ran a one-sided paired $t$-test, using the `scipy.stats.ttest_rel` method from the `scipy` Python package with `alternative="less"`. As can be seen in `\cref{tab:paired-t-test}`{=latex}, the improvement in RCRPS when using the context is statistically significant for many models, with $p$-values lower than 10$^\text{-6}$ for 7 of the 25 models under consideration. Furthermore, the best-performing models are those for which the improvement is statistically significant, with the 9 best models all having $p$-values below 0.01.
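The test can be reproduced with `scipy.stats.ttest_rel` as described above; the sketch below runs it on synthetic paired scores (the arrays here are illustrative, not the benchmark's actual per-instance values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical unweighted per-instance RCRPS for one model,
# with and without context, paired by benchmark instance.
without_ctx = rng.uniform(0.2, 0.8, size=355)
with_ctx = without_ctx - rng.uniform(0.0, 0.2, size=355)  # context helps on average

# One-sided test: the alternative hypothesis is that
# the with-context scores are lower than the without-context scores.
result = stats.ttest_rel(with_ctx, without_ctx, alternative="less")
print(f"p-value: {result.pvalue:.3g}")
```

Pairing by instance matters here: the per-instance scores with and without context are strongly correlated, so the paired test has far more power than an unpaired comparison of the two means.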

#### Cost of API-based models {#app:cost-of-api-based-models}

`\cref{tab:openai-costs}`{=latex} provides the cost incurred in evaluating `\model{GPT-4o}`{=latex} (version gpt-4o-2024-05-13) and `\model{GPT-4o-mini}`{=latex} (version gpt-4o-mini-2024-07-18) with the Direct Prompt method on CiK (as per the evaluation protocol used, described in `\cref{subsec:protocol}`{=latex}).

```{=latex}
\centering
```
```{=latex}
\resizebox{0.5\textwidth}{!}{%
\begin{tabular}{@{}lccc@{}}
\toprule
\textbf{Model}           & \textbf{Total} & \textbf{Per-instance average} & \textbf{Per-instance median} \\ \midrule
\tablemodel{GPT-4o}                   & \$143.83             & \$0.288                           & \$0.170                             \\
\tablemodel{GPT-4o} (no context)      & \$139.50             & \$0.279                           & \$0.160                             \\
\tablemodel{GPT-4o-mini}              & \$13.79              & \$0.040                           & \$0.040                             \\
\tablemodel{GPT-4o-mini} (no context) & \$13.32              & \$0.038                           & \$0.040                             \\ \bottomrule
\end{tabular}
}
```
#### Impact of Relevant and Irrelevant Information in Context

```{=latex}
\centering
```
![A comparison of RCRPS (lower is better) for two tasks on predicting the Unemployment Rate of a county. Both contain the context needed to solve the task. However, the `UnemploymentCountyUsingSingleStateData` task (dark green) is filtered to contain only the relevant context. On the other hand, the `UnemploymentCountyUsingExplicitMultipleStateData` task (light green) also contains other, unrelated context. We visualize three models here, all of which perform better when the context includes only the most relevant information.](figures/county_filtered.png){#fig:county_filtered width="0.4\\linewidth"}

We study here whether models perform better when the context has already been filtered to contain only relevant information. To assess this, we compare two tasks on predicting the Unemployment Rate of a county.

1.  For the `UnemploymentCountyUsingSingleStateData` task, the context contains the unemployment rate of the state which the county belongs to, tagged with the name of the state. See <https://servicenow.github.io/context-is-key-forecasting/v0/UnemploymentCountyUsingSingleStateData.html> for a visualization.

2.  In the `UnemploymentCountyUsingExplicitMultipleStateData` task, in addition to the county's state unemployment rate, the context includes the unemployment rates of 2 other randomly selected states, also tagged with state names. See <https://servicenow.github.io/context-is-key-forecasting/v0/UnemploymentCountyUsingExplicitMultipleStateData.html> for a visualization.

Results of three randomly picked models from the benchmark are visualized in `\cref{fig:county_filtered}`{=latex}. We find that models perform much better when only the relevant state's data is provided, as opposed to when the context also contains data from other states.

#### Impact of Solely Irrelevant Information in Context {#app:impact-irrelevant-info}

Many of our tasks include covariates in their context that are highly useful for accurately predicting the target time series. One question is: do LLM-based models perform well on such tasks because they correctly understand that these covariates are helpful, or because they blindly use the provided data without considering whether it is actually relevant?

As a way to gain some insight into this question, we took a task where the models have to forecast the unemployment data of an American county, given the unemployment data of the state the county is in (the `UnemploymentCountyUsingSingleStateData` task). We then modified this task in three ways: first, by trying to mislead the model by wrongly stating that the state-level data was from another state (without changing the data itself); second, by giving the data from the other state (while explicitly telling the model that the data is from said other state); and finally, by removing the state-level data altogether. The results of this experiment, with 5 instances per task for `\directprompt`{=latex} - `\model{GPT-4o}`{=latex}, are shown in `\cref{tab:unemployment_experiment}`{=latex}, while the forecasts for a single instance are shown in `\cref{fig:unemployment_experiment}`{=latex}. From these, we see that the model aggressively used data that was marked as being from another state, even though, had the data actually been from that state, the performance would have been closer to not having any state-level data at all. This shows that the model is liable to treat any provided information as useful, even when its usefulness is marginal.

```{=latex}
\centering
```
  Available data                                        RCRPS
  -------------------------------------------------- --------
  Data from the correct state, accurately tagged       0.0583
  Data from the correct state, inaccurately tagged     0.0557
  Data from an incorrect state, accurately tagged      0.1966
  No state-level data                                  0.2630

  : Ability of the `\directprompt`{=latex} - `\model{GPT-4o}`{=latex} model to accurately predict the unemployment level of an American county, given various covariates. These results are averaged over 5 instances.

`\label{tab:unemployment_experiment}`{=latex}

```{=latex}
\centering
```
```{=latex}
\centering
```
![The task in our benchmark: the context contains the unemployment rate of the state the county is in, correctly tagged with the state name.](figures/appendix/Effect_of_context_for_Unemployment/UnemploymentCountyUsingSingleStateData.png){width="\\columnwidth"}

```{=latex}
\hspace{0.05\textwidth}
```
```{=latex}
\centering
```
![The context only mentions that this time series is an unemployment rate, and of which county it is. No state-level unemployment data is provided.](figures/appendix/Effect_of_context_for_Unemployment/UnemploymentCountyNoInfo.png){width="\\columnwidth"}

```{=latex}
\vspace{0.2cm}
```
```{=latex}
\centering
```
![The state-level unemployment rate is incorrectly tagged as being from another state.](figures/appendix/Effect_of_context_for_Unemployment/UnemploymentCountyWrongState.png){width="\\columnwidth"}

```{=latex}
\hspace{0.05\textwidth}
```
```{=latex}
\centering
```
![The context contains the unemployment rate of another state than the one the county is in, which is correctly tagged.](figures/appendix/Effect_of_context_for_Unemployment/UnemploymentCountyWrongDataAndState.png){width="\\columnwidth"}

#### The effect of significant failures on the aggregate performance of models {#app:dp_mixtral_results}

```{=latex}
\centering
```
![Histogram of the RCRPS (lower is better) of the `\directprompt`{=latex} - `\model{Qwen-2.5-0.5B-Instruct}`{=latex} model on each task, with and without context. With context, the RCRPS is close to zero for a large number of tasks, but there is also a long tail of tasks with high RCRPS values, dominating and worsening the model's aggregate RCRPS.](figures/icml_fixed/rcrps-histogram-qwen2.5-0.5b.png){#fig:rcrps_hist_mixtral_dp width="0.5\\linewidth"}

As discussed in `\cref{subsec:areas-of-improvement}`{=latex}, in a few instances from the benchmark, some models return forecasts that miss the ground truth by a large margin, which we term significant failures (detailed in `\cref{subsec:significant-failures}`{=latex}). We analyse the effect of such significant failures here, using the `\directprompt`{=latex} - `\model{Mixtral 8x7B}`{=latex} model as an example; the same phenomenon may apply to other models. In `\cref{fig:context-improvement}`{=latex}, we can see that the aggregate RCRPS of `\directprompt`{=latex} - `\model{Mixtral 8x7B}`{=latex} *worsens* when it uses context. However, in `\cref{fig:llm_wins}`{=latex} (left), the win rate of the model against the quantitative baselines *improves* when it uses context. These two figures seem contradictory, but are in fact compatible: adding context improves the model's RCRPS on most tasks, but greatly worsens it on a minority of tasks where the model produces significant failures.

To further illustrate this effect, we visualize the task-wise RCRPS of the `\directprompt`{=latex} - `\model{Mixtral-8x7B-Instruct}`{=latex} model, both with and without context, in `\cref{fig:rcrps_hist_mixtral_dp}`{=latex}. With context, the model achieves an RCRPS close to zero on a large number of tasks. However, there is also a long tail of tasks with high RCRPS values, which dominates and worsens the model's aggregate RCRPS.
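To see how a handful of significant failures can dominate an aggregate mean while the win rate improves, consider a minimal sketch with synthetic scores (illustrative numbers only, not actual benchmark values):

```python
# Synthetic illustration: most task scores improve with context, but a long
# tail of large errors dominates the aggregate mean.
no_ctx = [0.5] * 95 + [0.6] * 5       # without context: uniformly mediocre
with_ctx = [0.05] * 95 + [20.0] * 5   # with context: mostly near zero, 5 blow-ups

mean = lambda xs: sum(xs) / len(xs)
print(mean(no_ctx))    # ~0.505
print(mean(with_ctx))  # ~1.05: worse on aggregate despite 95% of tasks improving

# The win rate tells the opposite story:
win_rate = sum(c < n for c, n in zip(with_ctx, no_ctx)) / len(no_ctx)
print(win_rate)  # 0.95
```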

#### Visualizations of successful context-aware forecasts {#app:viz-success-forecasts}

```{=latex}
\centering
```
```{=latex}
\fbox{
        \parbox{0.9\textwidth}{

                \textbf{Context:} ``
This series represents the occupancy rate (\%) captured by a highway sensor.

Consider that the meter will be offline for maintenance between 2024-04-11 13:00:00 and 2024-04-11 15:00:00, which results in zero readings.
''
            }
    }
```
```{=latex}
\centering
```
![Without Context](figures/new-appendix-examples/Sucesses/llama405b/noctx/SensorMaintenanceInPredictionTask/5/forecast.png "fig:"){width="0.7\\linewidth"} `\captionsetup{justification=centering}`{=latex}

```{=latex}
\centering
```
![With Context](figures/new-appendix-examples/Sucesses/llama405b/ctx/SensorMaintenanceInPredictionTask/5/forecast.png "fig:"){width="0.7\\linewidth"} `\captionsetup{justification=centering}`{=latex}

```{=latex}
\centering
```
```{=latex}
\fbox{
        \parbox{0.9\textwidth}{
                \textbf{Context:} ``
This series contains Diffuse Horizontal Irradiance for a location in Sinaloa, Mexico.
The Diffuse Horizontal Irradiance is the total amount of sun energy (in Watts per squared meter) arriving indirectly on a horizontal surface, ignoring the direct sunlight. Even when there are no clouds to scatter the sun light, there will still be some Diffuse Horizontal Irradiance, since clouds are not the only cause of light scattering. When there are no clouds, the Diffuse Horizontal Irradiance is mostly a function of the position of the sun in the sky, with only small variations from factors such as water vapour and dust particles levels. If the cloud cover is light, the Diffuse Horizontal Irradiance will increase due to the increase scattering of sun light, but heavy cloud cover will decrease it due to some sun light no longer being able to reach the ground.


At the beginning of the series, the weather was cloudy.\\
At 2022-07-12 11:00:00, the weather became clear.\\
At 2022-07-12 19:00:00, the weather became cloudy.\\
At 2022-07-13 12:00:00, the weather became clear.\\
At 2022-07-13 13:00:00, the weather became cloudy.\\
At 2022-07-14 06:00:00, we expect that the weather will become clear.\\
At 2022-07-14 07:00:00, we expect that the weather will become cloudy.\\
At 2022-07-14 10:00:00, we expect that the weather will become clear.\\
At 2022-07-14 18:00:00, we expect that the weather will become cloudy.
''
            }
    }
```
```{=latex}
\centering
```
![Without Context](figures/new-appendix-examples/Sucesses/llama405b/noctx/ExplicitDiffuseHorizontalIrradianceFromCloudStatus/1/forecast.png "fig:"){width="0.7\\linewidth"} `\captionsetup{justification=centering}`{=latex}

```{=latex}
\centering
```
![With Context](figures/new-appendix-examples/Sucesses/llama405b/ctx/ExplicitDiffuseHorizontalIrradianceFromCloudStatus/1/forecast.png "fig:"){width="0.7\\linewidth"} `\captionsetup{justification=centering}`{=latex}

```{=latex}
\centering
```
```{=latex}
\fbox{
        \parbox{0.9\textwidth}{

                \textbf{Context:} ``
This is the number of cash withdrawals from an automated teller machine (ATM) in an arbitrary location in England.

Consider that the building which contains the ATM is closed from 1997-09-05 00:00:00, for 8 days.
''
            }
    }
```
```{=latex}
\centering
```
![Without Context](figures/new-appendix-examples/Sucesses/GPT4/noctx/ATMBuildingClosedTask/4/forecast.png "fig:"){width="0.7\\linewidth"} `\captionsetup{justification=centering}`{=latex}

```{=latex}
\centering
```
![With Context](figures/new-appendix-examples/Sucesses/GPT4/ctx/ATMBuildingClosedTask/4/forecast.png "fig:"){width="0.7\\linewidth"} `\captionsetup{justification=centering}`{=latex}

```{=latex}
\centering
```
```{=latex}
\fbox{
        \parbox{0.9\textwidth}{

                \textbf{Context:} ``
The Montreal Fire Department is in charge of responding to various kind of public safety incidents. This is the number of field fire incidents responded to by Montreal firefighters in the Rivière-des-Prairies-Pointe-aux-Trembles borough. In other years, the yearly average number of incidents was 106 with the busiest month being June.

The Mayor is determined to completely eradicate this kind of incident.
Fortunately, the city's public safety research group identified that field fires and trash fires tend to co-occur.
When the amount of field fires increases, the amount of trash fires also tends to increase. The same holds when they decrease.


The Mayor has a plan: they will implement daily spraying of all piles of trash with water starting on 2022-06.
''
            }
    }
```
```{=latex}
\centering
```
![Without Context](figures/new-appendix-examples/Sucesses/GPT4/noctx/MontrealFireFieldAndTrashNeutralToneImplicitCausalConfoundingTask/5/forecast.png "fig:"){width="0.7\\linewidth"} `\captionsetup{justification=centering}`{=latex}

```{=latex}
\centering
```
![With Context](figures/new-appendix-examples/Sucesses/GPT4/ctx/MontrealFireFieldAndTrashNeutralToneImplicitCausalConfoundingTask/5/forecast.png "fig:"){width="0.7\\linewidth"} `\captionsetup{justification=centering}`{=latex}

```{=latex}
\begin{figure*}[]\centering
    \fbox{
        \parbox{0.9\textwidth}{ { \scriptsize
                \textbf{Context:} ``
This is the Unemployment Rate for Okaloosa County, in Florida.

For reference, here is the Unemployment Rate for a few American states during the same period:

Pennsylvania\\
--------------------\\
(2023-08-01 00:00:00, 4.2)\\
(2023-09-01 00:00:00, 3.0)\\
(2023-10-01 00:00:00, 3.1)\\
(2023-11-01 00:00:00, 2.9)\\
(2023-12-01 00:00:00, 2.9)\\
(2024-01-01 00:00:00, 3.5)\\
(2024-02-01 00:00:00, 3.7)\\
(2024-03-01 00:00:00, 3.4)\\
(2024-04-01 00:00:00, 2.9)\\
(2024-05-01 00:00:00, 3.2)\\
(2024-06-01 00:00:00, 3.7)\\
(2024-07-01 00:00:00, 4.0)\\

Florida\\
--------------------\\
(2023-08-01 00:00:00, 3.3)\\
(2023-09-01 00:00:00, 3.1)\\
(2023-10-01 00:00:00, 3.1)\\
(2023-11-01 00:00:00, 3.0)\\
(2023-12-01 00:00:00, 2.9)\\
(2024-01-01 00:00:00, 3.1)\\
(2024-02-01 00:00:00, 3.1)\\
(2024-03-01 00:00:00, 3.3)\\
(2024-04-01 00:00:00, 3.1)\\
(2024-05-01 00:00:00, 2.9)\\
(2024-06-01 00:00:00, 3.5)\\
(2024-07-01 00:00:00, 3.8)\\

Wisconsin\\
--------------------\\
(2023-08-01 00:00:00, 3.4)\\
(2023-09-01 00:00:00, 2.9)\\
(2023-10-01 00:00:00, 2.8)\\
(2023-11-01 00:00:00, 2.7)\\
(2023-12-01 00:00:00, 2.9)\\
(2024-01-01 00:00:00, 2.8)\\
(2024-02-01 00:00:00, 3.3)\\
(2024-03-01 00:00:00, 3.5)\\
(2024-04-01 00:00:00, 3.0)\\
(2024-05-01 00:00:00, 3.0)\\
(2024-06-01 00:00:00, 3.3)\\
(2024-07-01 00:00:00, 3.3)
''
}
            }
    }

    \begin{subfigure}[t]{0.45\linewidth}
        \centering
        \includegraphics[trim={0 0 0 0.3cm},clip,width=\linewidth]{figures/new-appendix-examples/Sucesses/llmp-mixtral-8x7b-instruct/noctx/UnemploymentCountyUsingExplicitMultipleStateData/2/forecast.pdf}
        \captionsetup{justification=centering}
        \caption{Without Context}
    \end{subfigure}%
    \begin{subfigure}[t]{0.45\linewidth}
        \centering
        \includegraphics[trim={0 0 0 0.3cm},clip,width=\linewidth]{figures/new-appendix-examples/Sucesses/llmp-mixtral-8x7b-instruct/ctx/UnemploymentCountyUsingExplicitMultipleStateData/2/forecast.pdf}
        \captionsetup{justification=centering}
        \caption{With Context}
    \end{subfigure}

    \caption{Example of successful context-aware forecasts by \llmp with \model{Mixtral-8x7B-Instruct}}
    \label{fig:llmp-mixtral-success-1}
\end{figure*}
```
```{=latex}
\begin{figure*}[]\centering
    \fbox{
        \parbox{0.9\textwidth}{
                \textbf{Context:} ``
Suppose that in the forecast, the values are bounded below by 0.80.
''
            }
    }

    \begin{subfigure}[t]{1\linewidth}
        \centering
        \includegraphics[trim={0 0 0 0.3cm},clip,width=0.7\linewidth]{figures/new-appendix-examples/Sucesses/llmp-mixtral-8x7b-instruct/noctx/OraclePredUnivariateConstraintsTask/3/forecast.pdf}
        \captionsetup{justification=centering}
        \caption{Without Context}
    \end{subfigure}%

    \begin{subfigure}[t]{1\linewidth}
        \centering
        \includegraphics[trim={0 0 0 0.3cm},clip,width=0.7\linewidth]{figures/new-appendix-examples/Sucesses/llmp-mixtral-8x7b-instruct/ctx/OraclePredUnivariateConstraintsTask/3/forecast.pdf}
        \captionsetup{justification=centering}
        \caption{With Context}
    \end{subfigure}

    \caption{Example of successful context-aware forecasts by \llmp with \model{Mixtral-8x7B-Instruct}}
    \label{fig:llmp-mixtral-success-2}
\end{figure*}
```
```{=latex}
\begin{figure*}[]\centering
    \fbox{
        \parbox{0.9\textwidth}{
                \textbf{Context:} ``
This series contains the amount of sunlight (in Watts per squared meter) arriving on a horizontal surface, for a location in Alaska, United States.
''
            }
    }

    \begin{subfigure}[t]{1\linewidth}
        \centering
        \includegraphics[trim={0 0 0 0.3cm},clip,width=0.7\linewidth]{figures/new-appendix-examples/Sucesses/llmp-llama-70-b/noctx/LocaleInfoHalfDaySolarForecastTask/4/forecast.pdf}
        \captionsetup{justification=centering}
        \caption{Without Context}
    \end{subfigure}%

    \begin{subfigure}[t]{1\linewidth}
        \centering
        \includegraphics[trim={0 0 0 0.3cm},clip,width=0.7\linewidth]{figures/new-appendix-examples/Sucesses/llmp-llama-70-b/ctx/LocaleInfoHalfDaySolarForecastTask/4/forecast.pdf}
        \captionsetup{justification=centering}
        \caption{With Context}
    \end{subfigure}

    \caption{Example of successful context-aware forecasts by \llmp with \model{Llama-3-70B}}
    \label{fig:llmp-llama70b-success-1}
\end{figure*}
```
```{=latex}
\begin{figure*}[]\centering
    \fbox{
        \parbox{0.9\textwidth}{
                \textbf{Context:} ``
The Montreal Fire Department is in charge of responding to various kind of public safety incidents. This series contains the number of field fire incidents responded to by the Montreal Fire Department in the Rosemont-La Petite-Patrie borough. On average, they respond to 58 incidents per year and the month with the most incidents was June.

The Mayor is determined to completely eradicate this kind of incident.
Fortunately, the city's public safety research group, a team of highly qualified experts, identified that field fires and gas leaks tend to co-occur.
When the amount of field fires increases, the amount of gas leaks also tends to increase. The same holds when they decrease.


The Mayor has a plan: they will implement a strict prohibition of using any form of combustible gas in the city starting on 2023-06.
In a recent interview, they claimed, "This is a bulletproof plan, and I am certain it will immediately put an end to field fires."
''
            }
    }

    \begin{subfigure}[t]{0.90\linewidth}
        \centering
        \includegraphics[trim={0 0 0 0.3cm},clip,width=0.7\linewidth]{figures/new-appendix-examples/Sucesses/llmp-llama-70-b/noctx/MontrealFireFieldAndGasConvincingToneImplicitCausalConfoundingTask/2/forecast.pdf}
        \captionsetup{justification=centering}
        \caption{Without Context}
    \end{subfigure}%

    \begin{subfigure}[t]{0.90\linewidth}
        \centering
        \includegraphics[trim={0 0 0 0.3cm},clip,width=0.7\linewidth]{figures/new-appendix-examples/Sucesses/llmp-llama-70-b/ctx/MontrealFireFieldAndGasConvincingToneImplicitCausalConfoundingTask/2/forecast.pdf}
        \captionsetup{justification=centering}
        \caption{With Context}
    \end{subfigure}

    \caption{Example of successful context-aware forecasts by \llmp with \model{Llama-3-70B}}
    \label{fig:llmp-llama70b-success-2}
\end{figure*}
```
#### Visualizations of significant failures {#app:viz-failure-forecasts}

```{=latex}
\centering
```
```{=latex}
\fbox{
        \parbox{0.8\textwidth}{
                \textbf{Context:} ``
Given are variables X\_0 and X\_1, where X\_0 is a covariate and X\_1 is the variable to forecast. Variables are generated from a linear Structural Vector Autoregressive (SVAR) model with additive gauss noise and a noise scale of 1.487e-03, with lag = 3.

The task is to forecast the value of the variable X\_1 at time t, given the values of the covariate X\_0 and the variable X\_1 itself at times t-1, ... t-3.
For the first 128 days, the covariate X\_0 takes a value of 8 from 2024-02-21 to 2024-03-11, 12 from 2024-03-12 to 2024-05-06, 12 from 2024-05-07 to 2024-06-27.
For the next 32 days, the covariate X\_0 takes a value of 30 from 2024-06-28 to 2024-07-13, 60 from 2024-07-14 to 2024-07-14, 60 from 2024-07-15 to 2024-07-29. Each day can be treated as a timestep for the forecasting task. The causal parents affect the child variables at different lags.

The causal parents for each variable is given below:\\
No parents for X\_0 at any lag.\\
Parents for X\_1 at lag 1: ['X\_0', 'X\_1'] affect the forecast variable as 0.527 * X\_0 + -0.895 * X\_1.\\
Parents for X\_1 at lag 2: ['X\_0', 'X\_1'] affect the forecast variable as 1.380 * X\_0 + -0.758 * X\_1.\\
Parents for X\_1 at lag 3: ['X\_0', 'X\_1'] affect the forecast variable as -0.661 * X\_0 + -0.793 * X\_1.
''
            }
    }
```
```{=latex}
\centering
```
![Without Context](figures/new-appendix-examples/Failures/gPT4/noctx/Untitled/forecast.png "fig:"){width="0.7\\linewidth"} `\captionsetup{justification=centering}`{=latex}

```{=latex}
\centering
```
![With Context](figures/new-appendix-examples/Failures/gPT4/ctx/4/forecast.png "fig:"){width="0.7\\linewidth"} `\captionsetup{justification=centering}`{=latex}

```{=latex}
\centering
```
```{=latex}
\fbox{
        \parbox{0.9\textwidth}{
                \textbf{Context:} ``
This series contains the road occupancy rates on a freeway in the San Francisco Bay area. The days for which the forecast is required are Thursday 2024-07-04, Friday 2024-07-05, Saturday 2024-07-06. Note that 2024-07-04 is a holiday due to Independence Day. Note that traffic on this freeway typically reduces on holidays.
''
            }
    }
```
```{=latex}
\centering
```
![Without Context](figures/new-appendix-examples/Failures/llmp_llama/noctx/Untitled/forecast.png "fig:"){width="0.7\\linewidth"} `\captionsetup{justification=centering}`{=latex}

```{=latex}
\centering
```
![With Context](figures/new-appendix-examples/Failures/llmp_llama/ctx/1/forecast.png "fig:"){width="0.7\\linewidth"} `\captionsetup{justification=centering}`{=latex}

```{=latex}
\centering
```
```{=latex}
\fbox{
        \parbox{0.9\textwidth}{
                \textbf{Context:} ``
This series represents the occupancy rate (\%) captured by a highway sensor. The sensor had a calibration problem starting from 2024-04-20 13:00:00 which resulted in an additive trend in the series that increases by 0.0072 at every hour. At timestep 2024-04-24 13:00:00, the sensor was repaired and this additive trend will disappear.
''
            }
    }
```
```{=latex}
\centering
```
![Without Context](figures/new-appendix-examples/Failures/llmp_llama/noctx/figures/forecast.png "fig:"){width="0.7\\linewidth"} `\captionsetup{justification=centering}`{=latex}

```{=latex}
\centering
```
![With Context](figures/new-appendix-examples/Failures/llmp_llama/ctx/2/forecast.png "fig:"){width="0.7\\linewidth"} `\captionsetup{justification=centering}`{=latex}

```{=latex}
\begin{figure*}[]\centering
    \fbox{
        \parbox{0.9\textwidth}{
                \textbf{Context:} ``
The Montreal Fire Department is in charge of responding to various kind of public safety incidents. This series contains the number of field fire incidents responded to by the Montreal Fire Department in the L'Île-Bizard-Sainte-Geneviève borough. On average, they respond to 19 incidents per year with the busiest month being June.

The Mayor is determined to completely eradicate this kind of incident.
Fortunately, the city's public safety research group, a team of highly qualified experts, identified that field fires and trash fires tend to co-occur.
When the amount of field fires increases, the amount of trash fires also tends to increase. The same holds when they decrease.


The Mayor has a plan: they will implement daily spraying of all piles of trash with fire retardant foam starting on 2023-06.
In a recent interview, they claimed, "This is a bulletproof plan, and I am certain it will immediately put an end to field fires."
''
            }
    }

    \begin{subfigure}[t]{0.90\linewidth}
        \centering
        \includegraphics[trim={0 0 0 0.3cm},clip,width=0.7\linewidth]{figures/new-appendix-examples/Failures/dp-llama-8b-instruct/noctx/MontrealFireFieldAndTrashConvincingToneImplicitCausalConfoundingTask/4/forecast.pdf}
        \captionsetup{justification=centering}
        \caption{Without Context}
    \end{subfigure}%

    \begin{subfigure}[t]{0.90\linewidth}
        \centering
        \includegraphics[trim={0 0 0 0.26cm},clip,width=0.7\linewidth]{figures/new-appendix-examples/Failures/dp-llama-8b-instruct/ctx/Untitled/4/forecast.pdf}
        \captionsetup{justification=centering}
        \caption{With Context}
    \end{subfigure}

    \caption{Example of a significant failure by \directprompt with \model{Llama-3-8B-Instruct}, where it misinterprets the context}
    \label{fig:dp-llama-3-8b-instruct-failure-1}
\end{figure*}
```
### Implementation Details of Models {#app:models}

#### `\directprompt`{=latex} {#subsec:direct-prompt}

##### Method

`\noindent`{=latex}

For `\directprompt`{=latex}, we use a simple prompt template, described below, where **$(($context$))$** is replaced with the context of the respective task, **$(($history$))$** is replaced with the historical values in the given format, and **$(($pred\_time$))$** is replaced with the prediction timestamps. The prompted model is expected to output predictions in the given template style (i.e., within the given forecast tags, in the given format) for all prediction timestamps in the prompt. Notably, unlike `\llmp`{=latex}, which predicts one digit at a time in a loop, `\directprompt`{=latex} expects models to produce the entire forecast in a single pass in a highly structured format, which requires them to understand and adhere to the template. `\newpage`{=latex}

+:------------------------------------------------------------------------------------------------:+
| ```{=latex}                                                                                      |
| \centering                                                                                       |
| ```                                                                                              |
|     I have a time series forecasting task for you.                                               |
|                                                                                                  |
|     Here is some context about the task. Make sure to factor in any background knowledge,        |
|     satisfy any constraints, and respect any scenarios.                                          |
|     <context>                                                                                    |
|     ((context))                                                                                  |
|     </context>                                                                                   |
|                                                                                                  |
|     Here is a historical time series in (timestamp, value) format:                               |
|     <history>                                                                                    |
|     ((history))                                                                                  |
|     </history>                                                                                   |
|                                                                                                  |
|     Now please predict the value at the following timestamps: ((pred_time)).                     |
|                                                                                                  |
|     Return the forecast in (timestamp, value) format in between <forecast> and </forecast> tags. |
|     Do not include any other information (e.g., comments) in the forecast.                       |
|                                                                                                  |
|     Example:                                                                                     |
|     <history>                                                                                    |
|     (t1, v1)                                                                                     |
|     (t2, v2)                                                                                     |
|     (t3, v3)                                                                                     |
|     </history>                                                                                   |
|     <forecast>                                                                                   |
|     (t4, v4)                                                                                     |
|     (t5, v5)                                                                                     |
|     </forecast>                                                                                  |
|     "                                                                                            |
+--------------------------------------------------------------------------------------------------+

To constrain the model's output to follow this specific structure, we use constrained decoding with the lm-format-enforcer tool (<https://github.com/noamgat/lm-format-enforcer>) and a regular expression that only allows models to output the values corresponding to the prediction timestamps. Without constrained decoding, we observe that models often produce samples that fail to adhere to the structure and are therefore rejected. Specifically, the larger models (`\model{Llama-3.1-405B-Instruct}`{=latex}, `\model{GPT-4o}`{=latex} and `\model{GPT-4o-mini}`{=latex}) can produce $25$ valid forecasts within $1$ to $3$ trials, whereas the smaller models (such as `\model{Llama-3-70B-Instruct}`{=latex}, `\model{Llama-3-8B-Instruct}`{=latex} and `\model{Mixtral-8x7B-Instruct}`{=latex}) can require up to $10$ trials to obtain $25$ valid forecasts.
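The following sketch illustrates the kind of regular expression involved; it is not our exact implementation, and the helper name is ours. A tool such as lm-format-enforcer can consume such a regex to mask token logits during decoding, so the model cannot deviate from the required structure; here we only use it to validate completions with the standard `re` module:

```python
import re

# Sketch (not the benchmark's exact regex): admit only a forecast with one
# "(timestamp, value)" line per prediction timestamp, inside <forecast> tags.
def forecast_regex(pred_timestamps):
    number = r"-?\d+(?:\.\d+)?"  # a plain decimal value
    lines = r"\n".join(
        re.escape(f"({ts}, ") + number + re.escape(")") for ts in pred_timestamps
    )
    return re.compile(r"<forecast>\n" + lines + r"\n</forecast>")

pattern = forecast_regex(["2024-07-04 00:00:00", "2024-07-05 00:00:00"])
good = "<forecast>\n(2024-07-04 00:00:00, 0.51)\n(2024-07-05 00:00:00, 0.47)\n</forecast>"
bad = "<forecast>\n(2024-07-04 00:00:00, 0.51)\nSure, here is my reasoning...\n</forecast>"
print(bool(pattern.fullmatch(good)))  # True
print(bool(pattern.fullmatch(bad)))   # False: extra text is rejected
```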

Further, we found that without the explicit instruction \`\`Do not include any other information (e.g., comments) in the forecast.", models often included unwanted information along with the forecasts.
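Once a valid completion is obtained, recovering the numerical forecast is a small parsing step. A minimal sketch, assuming the output template above (the helper name is ours, not from the benchmark code):

```python
import re

# Sketch: extract (timestamp, value) pairs from a completion that follows the
# Direct Prompt output template; invalid samples are rejected and retried.
def parse_forecast(completion):
    block = re.search(r"<forecast>(.*?)</forecast>", completion, re.DOTALL)
    if block is None:
        raise ValueError("no <forecast> block; reject the sample and retry")
    pairs = re.findall(r"\(([^,]+),\s*(-?\d+(?:\.\d+)?)\)", block.group(1))
    return [(ts.strip(), float(v)) for ts, v in pairs]

completion = "<forecast>\n(2024-07-04 00:00:00, 0.51)\n(2024-07-05 00:00:00, 0.47)\n</forecast>"
print(parse_forecast(completion))
# [('2024-07-04 00:00:00', 0.51), ('2024-07-05 00:00:00', 0.47)]
```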

```{=latex}
\vspace{-0.3cm}
```
###### Instruction-tuned models are more amenable to `\directprompt`{=latex} {#app:instr-tune-dir-prompt}

`\directprompt `{=latex}requires forecasts to be produced in a specific structure. To generate structured outputs, models need to be steerable [@dubey2024llama], a capability that is typically elicited from base models with post-training methods such as instruction tuning [@wei2021finetuned]. We observe this in our evaluations: several base models, including `\model{Llama-3-8B}`{=latex}, `\model{Llama-3-70B}`{=latex}, `\model{Mixtral-8x7B}`{=latex}, and even the largest base model we tried, `\model{Llama-3.1-405B}`{=latex}, are incapable of generating outputs that adhere to the structure required for `\directprompt`{=latex}, even when the number of retries is increased to as many as $50$. With `\directprompt`{=latex}, these models often output irrelevant information, sometimes merely continuing the context as a text completion task, and in other cases regurgitating forecasting datasets that they have memorized.

###### Extensions of `\directprompt`{=latex}

While very simple, such prompt templates can be powerful tools for understanding how LLMs perform context-aided forecasting: since the prompt gives control over the structure and content of the output (particularly for instruction-tuned models), one may construct other, more involved template structures. For instance, a prompt template could ask LLMs to explain the reasoning behind their (context-aided) forecasts. We leave it to future work to explore how such prompt-based techniques can enable more detailed evaluations and better insights into what these models are capable of.

##### Implementation Details

We used a single H100 GPU to run the `\directprompt `{=latex}approach for `\model{Llama-3-8B-Instruct}`{=latex}, and 2 H100 GPUs for `\model{Qwen-2.5-\{0.5B-Instruct, 1.5B-Instruct, 7B-Instruct\}}`{=latex}, `\model{Llama-3-70B-Instruct}`{=latex} and `\model{Mixtral-8x7B-Instruct}`{=latex}. We queried `\model{Llama-3.1-405B-Instruct}`{=latex} from an externally-hosted server running on 8 H100s. We used the OpenAI API to perform inference on the proprietary `\model{GPT-4o}`{=latex} and `\model{GPT-4o-mini}`{=latex} models. We provide the cost incurred in the inference of these models with the `\directprompt `{=latex}method in `\cref{app:cost-of-api-based-models}`{=latex}.

##### Example Prompt

A prompt used in an example task from the benchmark is given below.

+:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:+
| ```{=latex}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| \centering                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| ```                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|     I have a time series forecasting task for you.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|     Here is some context about the task. Make sure to factor in any background knowledge,satisfy any constraints, and respect any scenarios.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|     <context>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|     Background: This is hourly traffic data.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|     Scenario: Suppose that there is an accident on the road and there is 40.0%                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|     </context>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|     Here is a historical time series in (timestamp, value) format:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
|     <history>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|     (2024-04-23 00:00:00, 0.1)(2024-04-23 01:00:00, 0)(2024-04-23 02:00:00, 0)(2024-04-23 03:00:00, 0)(2024-04-23 04:00:00, 0.1)(2024-04-23 05:00:00, 0.2)(2024-04-23 06:00:00, 0.3)(2024-04-23 07:00:00, 0.5)(2024-04-23 08:00:00, 0.5)(2024-04-23 09:00:00, 0.4)(2024-04-23 10:00:00, 0.5)(2024-04-23 11:00:00, 0.5)(2024-04-23 12:00:00, 0.4)(2024-04-23 13:00:00, 0.6)(2024-04-23 14:00:00, 0.8)(2024-04-23 15:00:00, 1.2)(2024-04-23 16:00:00, 1.2)(2024-04-23 17:00:00, 1.3)(2024-04-23 18:00:00, 0.6)(2024-04-23 19:00:00, 0.3)(2024-04-23 20:00:00, 0.3)(2024-04-23 21:00:00, 0.3)(2024-04-23 22:00:00, 0.1)(2024-04-23 23:00:00, 0.1)(2024-04-24 00:00:00, 0.1)(2024-04-24 01:00:00, 0)(2024-04-24 02:00:00, 0)(2024-04-24 03:00:00, 0.1)(2024-04-24 04:00:00, 0.1)(2024-04-24 05:00:00, 0.2)(2024-04-24 06:00:00, 0.3)(2024-04-24 07:00:00, 0.5)(2024-04-24 08:00:00, 0.6)(2024-04-24 09:00:00, 0.5)(2024-04-24 10:00:00, 0.4)(2024-04-24 11:00:00, 0.5)(2024-04-24 12:00:00, 0.6) |
|     </history>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|     Now please predict the value at the following timestamps: ['2024-04-24 13:00:00' '2024-04-24 14:00:00' '2024-04-24 15:00:00' '2024-04-24 16:00:00' '2024-04-24 17:00:00' '2024-04-24 18:00:00' '2024-04-24 19:00:00' '2024-04-24 20:00:00' '2024-04-24 21:00:00' '2024-04-24 22:00:00' '2024-04-24 23:00:00' '2024-04-25 00:00:00' '2024-04-25 01:00:00' '2024-04-25 02:00:00' '2024-04-25 03:00:00' '2024-04-25 04:00:00' '2024-04-25 05:00:00' '2024-04-25 06:00:00' '2024-04-25 07:00:00' '2024-04-25 08:00:00' '2024-04-25 09:00:00' '2024-04-25 10:00:00' '2024-04-25 11:00:00' '2024-04-25 12:00:00'].                                                                                                                                                                                                                                                                                                                                                                             |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|     Return the forecast in (timestamp, value) format in between <forecast> and </forecast> tags.Do not include any other information (e.g., comments) in the forecast.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|     Example:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|     <history>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|     (t1, v1)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|     (t2, v2)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|     (t3, v3)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|     </history>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|     <forecast>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|     (t4, v4)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|     (t5, v5)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|     </forecast>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|     "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

```{=latex}
\vspace{1cm}
```
#### `\llmp`{=latex} {#subsec:llmp-additional-details}

##### Method

In this section, we outline LLM Processes (`\llmp`{=latex}; @requeima2024llm), one of the prompt-based baselines evaluated in `\cref{subsec:mainresults}`{=latex}. Prompts are constructed by first providing the textual information, followed by the numerical history. The context may include background knowledge, a scenario description, and task constraints, which fill the **$(($background$))$**, **$(($scenario$))$** and **$(($constraints$))$** slots, respectively, in the prompt template below. The numerical history (**$(($history$))$**) is provided by converting the numerical data to text, with timestamps and values separated by commas (,) and tuples separated by newline characters (\\n). The LLM then outputs the continuation of the string prompt, forecasting the value for the next time index (**$(($next index$))$**). This forecast and the subsequent time index are appended to the prompt, allowing the LLM to autoregressively complete the entire forecast. Numerical samples are rejected if they do not adhere to a decimal representation format. See @requeima2024llm for full details.
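The autoregressive sampling loop with rejection of malformed numbers can be sketched as follows (a minimal illustration, not the authors' implementation; `complete` stands in for a call to the underlying LLM, which returns the text generated up to the next newline):

```python
import re

# Accept only plain decimal completions, e.g. "0.5" or "-12.3".
DECIMAL_RE = re.compile(r"^-?\d+(\.\d+)?$")

def autoregressive_forecast(history_prompt, timestamps, complete, max_retries=10):
    """Forecast one value per timestamp by repeatedly extending the prompt.

    `complete` is a stand-in for the LLM call; samples that are not valid
    decimals are rejected and re-sampled.
    """
    prompt, forecast = history_prompt, []
    for ts in timestamps:
        prompt += f"{ts},"  # the "next index" the model must continue
        for _ in range(max_retries):
            raw = complete(prompt).strip()
            if DECIMAL_RE.match(raw):  # rejection step
                break
        else:
            raise ValueError(f"no valid numeric completion at {ts}")
        forecast.append((ts, float(raw)))
        prompt += raw + "\n"  # feed the sample back for the next step
    return forecast
```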

The following is the prompt template used to construct prompts for the `\llmp `{=latex}baseline:

```{=latex}
\begin{tabular}{c}
\centering
\begin{lstlisting}[literate={\\}{\textbackslash}1]
"
Forecast the future values of this time series, while considering the following background knowledge, scenario, and constraints.

Background knowledge:
((background))

Scenario:
((scenario))

Constraints:
((constraints))

((history))
((next index))
"
\end{lstlisting}
\end{tabular}
```
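Filling this template amounts to straightforward string substitution; a minimal sketch (our illustration, with a hypothetical function name) that also shows the comma/newline serialization of the history:

```python
def build_llmp_prompt(background, scenario, constraints, history):
    """Fill the template above; the arguments mirror the ((background)),
    ((scenario)), and ((constraints)) placeholders."""
    header = (
        "Forecast the future values of this time series, while considering "
        "the following background knowledge, scenario, and constraints.\n\n"
        f"Background knowledge:\n{background}\n\n"
        f"Scenario:\n{scenario}\n\n"
        f"Constraints:\n{constraints}\n\n"
    )
    # Timestamps and values separated by commas, tuples by newlines.
    series = "\n".join(f"{ts},{val}" for ts, val in history)
    return header + series + "\n"
```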
A prompt used in an example task from the benchmark is given below:

```{=latex}
\begin{tabular}{c}
\centering
\begin{lstlisting}[literate={\\}{\textbackslash}1]
"
Forecast the future values of this time series, while considering the following background knowledge, scenario, and constraints.

Background knowledge:
This is hourly traffic data.

Scenario:
Suppose that there is an accident on the road and there is 40.0%

Constraints:

2024-04-23 00:00:00,0.1\n2024-04-23 01:00:00,0\n2024-04-23 02:00:00,0\n2024-04-23 03:00:00,0\n2024-04-23 04:00:00,0.1\n2024-04-23 05:00:00,0.2\n2024-04-23 06:00:00,0.3\n2024-04-23 07:00:00,0.5\n2024-04-23 08:00:00,0.5\n2024-04-23 09:00:00,0.4\n2024-04-23 10:00:00,0.5\n2024-04-23 11:00:00,0.5\n2024-04-23 12:00:00,0.4\n2024-04-23 13:00:00,0.6\n2024-04-23 14:00:00,0.8\n2024-04-23 15:00:00,1.2\n2024-04-23 16:00:00,1.2\n2024-04-23 17:00:00,1.3\n2024-04-23 18:00:00,0.6\n2024-04-23 19:00:00,0.3\n2024-04-23 20:00:00,0.3\n2024-04-23 21:00:00,0.3\n2024-04-23 22:00:00,0.1\n2024-04-23 23:00:00,0.1\n2024-04-24 00:00:00,0.1\n2024-04-24 01:00:00,0\n2024-04-24 02:00:00,0\n2024-04-24 03:00:00,0.1\n2024-04-24 04:00:00,0.1\n2024-04-24 05:00:00,0.2\n2024-04-24 06:00:00,0.3\n2024-04-24 07:00:00,0.5\n2024-04-24 08:00:00,0.6\n2024-04-24 09:00:00,0.5\n2024-04-24 10:00:00,0.4\n2024-04-24 11:00:00,0.5\n2024-04-24 12:00:00,0.6\n2024-04-24 13:00:00,
"
\end{lstlisting}
\end{tabular}
```
##### Implementation Details {#subsubsec:llmp-implementation-details}

We used a single H100 GPU to run the `\llmp `{=latex}approach for `\model{Llama-3-8B}`{=latex} and `\model{Llama-3-8B-Instruct}`{=latex}, 2 H100 GPUs for the `\model{Qwen-2.5}`{=latex} family of models, `\model{Mixtral-8x7B}`{=latex}, and `\model{Mixtral-8x7B-Instruct}`{=latex}, and 8 H100 GPUs for `\model{Llama-3-70B}`{=latex} and `\model{Llama-3-70B-Instruct}`{=latex}.

Since the `\llmp `{=latex}codebase (<https://github.com/requeima/llm_processes/>) only supports open-source models (such as those available on HuggingFace) and requires loading the weights into memory, it cannot be used with the `\model{GPT-4o}`{=latex} and `\model{GPT-4o-mini}`{=latex} models. Further, due to the memory requirements of `\llmp`{=latex}, we were unable to experiment with the `\model{Llama-3.1-405B}`{=latex} and `\model{Llama-3.1-405B-Instruct}`{=latex} models, which would have required more than 24 H100 GPUs in parallel to process a single instance from the benchmark, exceeding our available resources.
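As a rough back-of-the-envelope check (our estimate, not a figure reported by the `\llmp`{=latex} authors), the weights of a 405B-parameter model in 16-bit precision alone occupy

$$405 \times 10^9 \text{ parameters} \times 2 \text{ bytes} \approx 810 \text{ GB},$$

i.e., more than ten 80 GB H100s before accounting for activations and the key-value cache, which grow further with the long prompts used in the benchmark.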

#### `\model{ChatTime}`{=latex} {#app:chattime_details}

We evaluate the released ChatTime-Base (<https://huggingface.co/ChengsenWang/ChatTime-1-7B-Base>) and ChatTime-Chat (<https://huggingface.co/ChengsenWang/ChatTime-1-7B-Chat>) models zero-shot, as per the instructions in the authors' GitHub repository (<https://github.com/ForestsKing/ChatTime>).

#### `\model{UniTime}`{=latex} and `\model{Time-LLM}`{=latex} {#app:unitimellm}

For multimodal models, we jointly train `\model{UniTime}`{=latex} [@liu2024unitime] on its ensemble of datasets: ETTm1, ETTm2, ETTh1, ETTh2, Electricity, Weather, Exchange, and Illness.

We also evaluate `\model{Time-LLM}`{=latex} [@jin2024timellm], another multimodal model built on top of the Llama architecture. We train `\model{Time-LLM}`{=latex} on ETTh1 according to the authors' suggested specifications, and we compare the performance of both models with and without context.

`\paragraphtight{\model{UniTime}:}`{=latex} We train `\model{UniTime}`{=latex} [@liu2024unitime] with their codebase (<https://github.com/liuxu77/UniTime>) using a single seed on one AMD Instinct MI200 GPU for approximately 14 hours. The model features a lightweight transformer with a maximum context length of 210 and a pre-trained GPT2 language model as its backbone, of which only the first half of the transformer layers is used. The time series component employs non-overlapping patch embeddings, generated with a kernel size and stride of 16, and a maximum input sequence length of 96. When the total tokenized length exceeds the architecture's capacity, we truncate the context.
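Because the kernel size equals the stride, the patch embedding's input reduces to a simple reshape into non-overlapping windows; a minimal sketch (our illustration, not UniTime's code):

```python
import numpy as np

def patchify(series, patch_len=16):
    """Split a 1-D series into non-overlapping patches of length `patch_len`
    (equivalent to a conv embedding whose kernel size equals its stride)."""
    n = (len(series) // patch_len) * patch_len  # drop any incomplete final patch
    return np.asarray(series[:n], dtype=float).reshape(-1, patch_len)

# With a maximum input length of 96 and patch length 16,
# a series is tokenized into 6 patch tokens of 16 values each.
patches = patchify(np.arange(96))
```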

Unlike `\model{Time-LLM}`{=latex}, `\model{UniTime}`{=latex} is jointly trained on all datasets simultaneously. Batches are generated by first choosing a dataset uniformly at random and then returning a batch from the associated data loader. To account for imbalances in convergence speed across domains, a mask rate of 0.5 is used and the training batch size varies by dataset (details in the data config directory of the `\model{UniTime}`{=latex} GitHub repository). Training was conducted for 10 epochs of the mixed dataset, with cosine decay from an initial learning rate of 1e-4 to a minimum of 1e-6 over a maximum period of 20 epochs. The results of our training on the original datasets are given in `\cref{tab:unitime_results_original}`{=latex}.
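The cosine decay described above can be written in closed form; a small sketch of the stated schedule (our illustration, not the repository's scheduler code):

```python
import math

def cosine_lr(epoch, lr_max=1e-4, lr_min=1e-6, period=20):
    """Cosine decay from lr_max (epoch 0) to lr_min (epoch `period`),
    matching the 1e-4 -> 1e-6 schedule over a 20-epoch period."""
    t = min(epoch, period) / period
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```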

Finally, in order to accelerate training, we added BF16 automatic mixed precision training and gradient accumulation to the original training procedure.

```{=latex}
\centering
```
  **Dataset**          `\multicolumn{4}{c}{\textbf{Mean Squared Error (MSE)}}`{=latex}
  ------------------- ----------------------------------------------------------------- ------- ------- -------
  Prediction Length                                  96                                   192     336     720
  ETTh1                                             0.395                                0.435   0.469   0.468
  ETTh2                                             0.291                                0.368   0.413   0.422
  ETTm1                                             0.336                                0.377   0.409   0.465
  ETTm2                                             0.181                                0.248   0.315   0.417
  Exchange                                          0.090                                0.180   0.322   0.862
  Weather                                           0.179                                0.224   0.278   0.354
  Electricity                                       0.198                                0.202   0.217   0.257
                                                     24                                   36      48      60
  Illness                                           2.284                                2.515   2.572   2.455

  : Evaluation results for `\model{UniTime}`{=latex} on their test splits. Results are comparable to the original paper, although the MSE on Illness is approximately 20% higher for prediction lengths 36, 48, and 60.

`\label{tab:unitime_results_original}`{=latex}

`\paragraphtight{\model{Time-LLM}:}`{=latex} We train `\model{Time-LLM}`{=latex} [@jin2024timellm] with their codebase (<https://github.com/KimMeen/Time-LLM>) on the ETTh1 dataset [@haoyietal-informer-2021] with a prediction length of 96. We train using a single seed on four AMD Instinct MI200 GPUs, with an average training time per run of approximately 13 hours. Training was conducted using a batch size of 8 per device and 4 gradient accumulation steps, along with a 1Cycle learning rate schedule with a maximum learning rate of 1e-3. In addition, runs were accelerated using DeepSpeed Stage 2 and BF16 automatic mixed precision.

Training was conducted over a maximum of 50 epochs with early stopping, and a time-based split of 70% for training, 10% for validation, and 20% for testing, where the most recent windows were reserved for the test set. All runs were trained with an input sequence length of 512, with overlapping patch embeddings generated with a kernel size of 16 and a stride of 8. The results on the ETTh1 test set are given in `\cref{tab:timellm_results_original}`{=latex}.

When evaluating on CiK tasks which do not conform to `\model{Time-LLM}`{=latex}'s requirements, we make the following modifications to the method:

-   For short history tasks where the history length $|\mathbf{X_H}|$ is less than 5, we change the `topk` operator's $k$ value from 5 to $|\mathbf{X_H}|$ in the `calculate_lags()` function.

-   For tasks where the length of the prediction window $|\mathbf{X_F}|$ exceeds the trained projection head's output dimension (in our case, 96), we repeat the last predicted value $|\mathbf{X_F}| - 96$ times. This occurs for only 3 tasks, with prediction windows of 97 or 98 steps depending on the sampled instance, which we assume has a negligible impact on the evaluated results.
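
These two adjustments can be sketched as follows (an illustrative sketch with hypothetical helper names, not `\model{Time-LLM}`{=latex}'s actual code, in which the first adjustment lives inside `calculate_lags()`):

```python
import numpy as np

def clamp_lag_topk(history_length: int, default_k: int = 5) -> int:
    # For short-history tasks, the top-k used for lag selection can
    # never exceed the number of available history steps.
    return min(default_k, history_length)

def pad_forecast(forecast: np.ndarray, target_length: int) -> np.ndarray:
    # If the prediction window exceeds the projection head's output
    # length, repeat the last predicted value to fill the gap.
    if len(forecast) >= target_length:
        return forecast[:target_length]
    padding = np.full(target_length - len(forecast), forecast[-1])
    return np.concatenate([forecast, padding])
```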

```{=latex}
\centering
```
```{=latex}
\captionsetup{justification=centering}
```
  **Time-LLM**   **MSE**                                   **MAE**
  -------------- ----------------------------------------- -----------------------------------------
  ETTh1-pl96     `\multicolumn{1}{r}{0.3846123}`{=latex}   `\multicolumn{1}{r}{0.4149854}`{=latex}

  : ETTh1 test set results for Time-LLM trained on ETTh1.

`\label{tab:timellm_results_original}`{=latex}

`\paragraphtight{Why Do \model{Time-LLM} and \model{UniTime} Not Benefit (More) From Context?}`{=latex} `\label{app:why-unitime}`{=latex} Looking at the results in `\cref{app:extended-aggregate-results-on-all-models}`{=latex}, we see that context actually harms the performance of `\model{Time-LLM}`{=latex}'s forecasts. Two possible reasons for this are: 1) `\model{Time-LLM}`{=latex}'s adaptation procedure is unlikely to retain the backbone LLM's language-processing capabilities, and 2) `\model{Time-LLM}`{=latex}'s single-dataset training procedure is unlikely to generalize to unseen time series patterns. Part of `\model{Time-LLM}`{=latex}'s model adaptation involves training linear layers at the input and output of the language model. Although the backbone LLM remains frozen, these linear layers must be trained, and `\model{Time-LLM}`{=latex} opts for a highly structured prompting format that incorporates domain knowledge, task instructions, and input statistics. Since the training data for the linear layers consists of output representations based on these highly structured prompts, it is not evident that the resulting architecture will generalize to more diverse contextual descriptions such as those found in CiK. Furthermore, although we have not conducted a formal analysis of the diversity of the ETTh1 dataset, it is not a priori obvious that it contains a sufficient diversity of patterns to train a time series foundation model.

Interestingly, `\model{UniTime}`{=latex}'s performance does benefit from context for some tasks (see `\cref{fig:unitime_example}`{=latex}). However, the aggregate RCRPS and rank of `\model{UniTime}`{=latex} with respect to other models indicate that it still struggles to produce forecasts competitive with even quantitative forecasting methods.

```{=latex}
\fbox{
        \parbox{0.97\textwidth}{
                \textbf{Context:} ``Suppose that in the forecast, the values are bounded above by 6.29.''
            }
    }
```
![A comparison of forecasts from `\model{UniTime}`{=latex} without context (left) and with context (right). On average across 5 instances, `\model{UniTime}`{=latex}'s RCRPS is 64% better with context than without on the \`\`Bounded Prediction Constraint Based On Prediction Quantiles" task.](figures/appendix/unitime/unitime_constr_example.png){#fig:unitime_example width="\\linewidth"}

#### `\model{Lag-Llama}`{=latex} {#section}

We use the publicly available implementation of `\model{Lag-Llama}`{=latex} [@rasul2023lag] located at <https://github.com/time-series-foundation-models/>, and its associated pre-trained weights. The model inference was done on a single H100 GPU.

#### `\model{Chronos}`{=latex} {#section-1}

We use the publicly available implementation of `\model{Chronos}`{=latex} [@ansari2024chronos] located at <https://github.com/amazon-science/chronos-forecasting>. We evaluated all $5$ available models on our tasks: chronos-tiny, chronos-mini, chronos-small, chronos-base and chronos-large (see `\cref{app:extended-aggregate-results-on-all-models}`{=latex}), and reported the results of the best-performing model, chronos-large, in `\cref{table:main-results}`{=latex}. The model inference was done on a single H100 GPU.

#### `\model{Moirai}`{=latex} {#section-2}

We use the publicly available implementation of `\model{Moirai}`{=latex} [@woo2024unified] located at <https://github.com/SalesforceAIResearch/uni2ts>. We evaluated the following $3$ models on our tasks: moirai-1.0-R-small (located at <https://huggingface.co/Salesforce/moirai-1.0-R-small>), moirai-1.0-R-base (located at <https://huggingface.co/Salesforce/moirai-1.0-R-base>) and moirai-1.0-R-large (located at <https://huggingface.co/Salesforce/moirai-1.0-R-large>) (see `\cref{app:extended-aggregate-results-on-all-models}`{=latex}), and reported the results of the best-performing model, moirai-1.0-R-large, in `\cref{table:main-results}`{=latex}. The model inference was done on a single H100 GPU.

#### `\model{TimeGEN}`{=latex} {#section-3}

We access `\model{TimeGEN-1}`{=latex}, an optimization of the `\model{TimeGPT}`{=latex} model [@garza2023timegpt], using the API made available through the `nixtla` Python package. Unlike all other baselines, we only generate point forecasts with `\model{TimeGEN}`{=latex}, because its probabilistic mode requires much longer historical data than is available in the instances evaluated in the benchmark. This is why the RCRPS values for `\model{TimeGEN}`{=latex} have zero standard error.

#### `\model{Exponential Smoothing}`{=latex} {#section-4}

We use the `\model{Exponential Smoothing}`{=latex} implementation from the `statsmodels` Python package, namely the `statsmodels.tsa.holtwinters.ExponentialSmoothing` class. Both the trend and seasonal components of the model are set to be additive. The seasonal period length is inferred from the time series frequency, and is set manually for tasks where this simple guess is incorrect. If there are not at least two full seasonal periods in the history window of the time series, we disable the seasonal component of the model. Since some of the benchmark tasks can have as few as 3 time steps in the history window, we also disable the trend component if there are fewer than 5 time steps in said window.
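
These fallback rules can be sketched as follows (the helper and its return format are our own illustration; the resulting values would be passed to `ExponentialSmoothing` as its `trend`, `seasonal` and `seasonal_periods` arguments):

```python
def smoothing_components(history_length: int, seasonal_period: int) -> dict:
    # The seasonal component requires at least two full seasonal
    # periods of history; the trend requires at least 5 history steps.
    seasonal = "add" if history_length >= 2 * seasonal_period else None
    trend = "add" if history_length >= 5 else None
    return {
        "trend": trend,
        "seasonal": seasonal,
        "seasonal_periods": seasonal_period if seasonal else None,
    }
```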

#### `\model{ETS}`{=latex} and `\model{ARIMA}`{=latex} {#and}

We used the implementations of `\model{ETS}`{=latex} and `\model{ARIMA}`{=latex} from the `forecast` R package, using `rpy2` for compatibility with Python. For `\model{ETS}`{=latex}, we use the `ets` method, which we call with automatic error, trend, and seasonality components. In the rare cases where the `\model{ETS}`{=latex} forecast contains NaN values, we manually switch off the trend component and rerun the forecast. The `\model{ARIMA}`{=latex} results are computed using the `auto.arima` method. If the `\model{ARIMA}`{=latex} fit fails, we rerun it with restricted parameters and seasonality disabled.

### Details of the proposed metric {#app:metric}

The CiK benchmark is designed to determine whether models can improve their probabilistic forecasts by leveraging associated textual information (see `\cref{sec:problem-setting}`{=latex}). To support this goal, the evaluation metric:

1.  should be a **proper scoring rule**, such that a model that knows the correct forecast distribution perfectly has no reason to favor any other prediction;

2.  must be **easy to compute** using a finite sample from the forecast distribution, since many models do not provide a functional form for their forecasts.

To account for the importance of leveraging relevant context, the metric should also:

1.  **penalize obviously impossible forecasts**, i.e., forecasts that can be inferred to be implausible from the contextual information;

2.  **take a similar range of values across different tasks**, to prevent some tasks from dominating the score when results are averaged across tasks;

3.  **prioritize forecast quality for timesteps with relevant context**, even if these timesteps are a small portion of the forecast horizon.

To satisfy the first two properties, we start with the Continuous Ranked Probability Score (CRPS) [@gneiting2007strictly], a reliable strictly proper scoring rule for univariate probability distributions, and take its mean over all time steps. To compute the CRPS from a finite number of samples, we use the estimator based on its probability weighted moment form [@taillardat2016], since it is unbiased [@zamo2018]. See `\cref{app:crps_estimator}`{=latex} for more details about this estimator.

Many of our tasks are built to include information about a hard constraint on $\Xb_F$ in their $\Ccal$, which can be written as $v_\Ccal(\xb_F)=0$. If we were only interested in measuring by how much a forecast breaks the constraint, we could take inspiration from the threshold-weighted CRPS [@gneiting2011threshold] by using $v_\Ccal$ as its chaining function [@allen2023]: $$\text{twCRPS}_{v_\Ccal}(\widetilde{\Xb}_F, \xb_F) \equiv \text{CRPS}\left(v_\Ccal(\widetilde{\Xb}_F), v_\Ccal(\xb_F)\right),$$ where $\widetilde{\Xb}_F$ is the forecast of $\Xb_F$ to be evaluated. Since, by construction, the ground truth $\xb_F$ always satisfies the constraints, we have $v_\Ccal(\xb_F) = 0$. But since we care about more than whether forecasts break constraints, we sum both the original CRPS and this twCRPS, weighting the latter by a factor of $\beta=10$ to reflect the additional importance we assign to such errors. See `\cref{app:constraint_functions}`{=latex} for the various $v_\Ccal$ used in the benchmark.

One common approach to normalizing the CRPS to get similar ranges across multiple problems is to divide it by the mean absolute value of the ground-truth target of the forecasted series [@gluonts_jmlr]. This has two issues: the metric is no longer proper, and it leads to much larger values for series close to zero than for those far from it. To solve the first issue, we take advantage of the fact that we can generate many more instances of each of our tasks, by computing a normalization factor $\alpha$ from 25 instances not included in the benchmark. The details of this calculation are in `\cref{app:metric-scaling}`{=latex}.

Many tasks in our benchmark contain contextual information that is highly relevant for a small fraction of the time steps in the forecasting window, while being only marginally relevant for the majority of them. If we were to weight these two categories equally, the score of a model that ignores the context would be hard to distinguish from the score of one that does not. We correct this issue by identifying the subset of time steps with relevant information, which we call the Region of Interest (RoI). We then weight the CRPS to give half weight to the RoI time steps and half weight to the non-RoI time steps. We thus obtain our metric, which we call the Region-of-Interest CRPS (RCRPS):

```{=latex}
\resizebox{\linewidth}{!}{
$
\text{RCRPS}(\widetilde{\Xb}_F, \xb_F) \eqdef
    \begin{cases}
        \mathcal{\alpha} \, \cdot \left[\frac{1}{2 |\mathcal{I}|} \cdot \sum\limits_{i \in \mathcal{I}} \text{CRPS}\!\!\left(\widetilde{X}_i, x_i\right)
        + \frac{1}{2 | \neg \mathcal{I}|} \cdot \sum\limits_{i \in \neg \mathcal{I}} \text{CRPS}\!\!\left(\widetilde{X}_i, x_i\right)
        + \beta \cdot\, \text{CRPS}\!\!\left(v_\Cb(\widetilde{\Xb}_F), 0\right)\right] & \text{if}\; |\mathcal{I}| > 0\\
        \mathcal{\alpha} \, \cdot \left[\frac{1}{| \neg \mathcal{I}|} \cdot \sum\limits_{i \in \neg \mathcal{I}} \text{CRPS}\!\!\left(\widetilde{X}_i, x_i\right)
        + \beta \cdot\, \text{CRPS}\!\!\left(v_\Cb(\widetilde{\Xb}_F), 0\right)\right], & \text{if}\; |\mathcal{I}| = 0
     \end{cases}
$
}
```
where $\mathcal{I}$ is the set of time steps in the RoI, $\neg \mathcal{I}$ is the set of time steps in the forecast but not in the RoI, $\alpha$ is the aforementioned scaling factor, and we drop the factor of two and the first sum for tasks where there is no meaningful RoI.
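
The RCRPS can be sketched in a few lines of NumPy (an illustrative implementation, not the benchmark code; it assumes sample-based forecasts and uses the probability weighted moment CRPS estimator of `\cref{app:crps_estimator}`{=latex}):

```python
import numpy as np

def crps_pwm(samples: np.ndarray, x: float) -> float:
    # Probability-weighted-moment estimator of the CRPS.
    s = np.sort(samples)
    M = len(s)
    return (np.abs(s - x).mean() + s.mean()
            - 2.0 / (M * (M - 1)) * np.sum(np.arange(M) * s))

def rcrps(forecast, target, roi_mask, v, alpha, beta=10.0):
    # forecast: (M, T) array of sample paths; target: (T,) ground truth;
    # roi_mask: boolean (T,) mask of Region-of-Interest steps;
    # v: constraint-violation function mapping a (T,) series to a scalar.
    crps_t = np.array([crps_pwm(forecast[:, t], target[t])
                       for t in range(len(target))])
    if roi_mask.any():
        base = 0.5 * crps_t[roi_mask].mean() + 0.5 * crps_t[~roi_mask].mean()
    else:
        base = crps_t.mean()
    # CRPS between the distribution of violations and the ideal value 0.
    violations = np.array([v(path) for path in forecast])
    return alpha * (base + beta * crps_pwm(violations, 0.0))
```

A forecast whose samples all equal the ground truth and satisfy the constraint scores exactly zero, as expected from a proper scoring rule.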

#### Scaling for cross-task aggregation {#app:metric-scaling}

The rationale behind scaling the RCRPS is to allow us to average its value over diverse tasks without the average being dominated by the forecast quality on tasks whose time series have large values. An alternative argument is: all other conditions being equal, a forecaster that is wrong by 10 in its forecast for a time series going from 25 to 30 is worse than one that is wrong by 100 for a time series going from 2500 to 3000. Furthermore, we have multiple tasks for which some instances have constant, or nearly constant, $\xb_F$, often with values close to zero. Due to these tasks, we cannot simply use a scaling that depends only on each instance's $\xb_F$. Instead, we take advantage of our benchmark's ability to generate a very large number of instances for each task, by using $M=25$ instances not included in our benchmark. Given the ground-truth future values $\xb_F^m$ for these instances, the scaling factor $\alpha$ for an individual task is as follows: $$\alpha = \left[ \frac{\sum_m \left( \max_i x_i^m - \min_i x_i^m \right)}{M} \right]^{-1}. \label{eq:beta_scaling}$$
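
A sketch of this computation (illustrative code, not the benchmark implementation):

```python
import numpy as np

def scaling_factor(ground_truth_futures) -> float:
    # alpha is the inverse of the average range (max minus min) of the
    # ground-truth future values over the M held-out instances.
    ranges = [np.max(x) - np.min(x) for x in ground_truth_futures]
    return 1.0 / np.mean(ranges)
```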

###### Properness {#sec:proper}

In an ideal scenario, all instances of a task would be fully independent. In that case, `\cref{eq:beta_scaling}`{=latex} would not contain any information about the target time series in the benchmark instances, making the RCRPS a proper scoring rule. However, due to possible overlaps in the time windows used when creating the instances, and to auto-correlations, we cannot guarantee independence between instances, and thus we cannot guarantee that the RCRPS is actually a proper scoring rule. Note that this deviation from a proper scoring rule is minor, and has a much smaller effect than the one caused by the common approach of normalizing the CRPS using the mean absolute value of the ground truth.

#### CRPS and twCRPS

Given a univariate forecast $\widetilde{X}$ and a ground-truth realization $x$, the Continuous Ranked Probability Score (CRPS) can be defined in its integral form as follows: $$\text{CRPS}(\widetilde{X}, x) = \int_{-\infty}^{\infty} dy \left[ \Phi_{\widetilde{X}}(y) - \mathds{1}(y \ge x) \right]^2,
    \label{eq:crps_int}$$ where $\Phi_{\widetilde{X}}(y)$ is the Cumulative Distribution Function of $\widetilde{X}$, and $\mathds{1}$ is the indicator function.

There are multiple ways to compute the CRPS, but a particularly interesting one which showcases its link to the Mean Absolute Error is the energy form of the CRPS: $$\text{CRPS}(\widetilde{X}, x) = \mathbb{E}_{X \sim \widetilde{X}} \left| X - x \right| - \frac{1}{2} \mathbb{E}_{X,X' \sim \widetilde{X}} \left| X - X' \right|.
    \label{eq:crps_energy}$$

We get the threshold-weighted CRPS (twCRPS) from `\cref{eq:crps_int}`{=latex} by adding a weighting function $w$ to it: $$\text{twCRPS}(\widetilde{X}, x) = \int_{-\infty}^{\infty} dy\, w(y) \left[ \Phi_{\widetilde{X}}(y) - \mathds{1}(y \ge x) \right]^2.
    \label{eq:twcrps_int}$$ To get the energy form of the twCRPS, we must compute the chaining function $v(x)$ from $w(x)$: $$v(x) - v(x') = \int_{[x,x')} dy w(y).$$ Using $v(x)$, we can write the twCRPS as: $$\text{twCRPS}(\widetilde{X}, x) =
    \mathbb{E}_{X \sim \widetilde{X}} \left| v(X) - v(x) \right|
    - \frac{1}{2} \mathbb{E}_{X,X' \sim \widetilde{X}} \left| v(X) - v(X') \right|.
    \label{eq:twcrps_energy}$$ `\cref{eq:twcrps_energy}`{=latex} can readily be generalized to a multivariate forecast, by using any $\mathbb{R}^d \rightarrow \mathbb{R}$ chaining function.

#### Estimating the CRPS using samples {#app:crps_estimator}

Computing the CRPS using `\cref{eq:crps_int}`{=latex} or `\cref{eq:crps_energy}`{=latex} directly would be extremely hard for most of the baselines included in our experiments. Instead, it is more computationally convenient to use an estimator of the CRPS which uses a finite number of samples $x_1$, \..., $x_M$ from the forecasting distribution. An unbiased estimator of the CRPS created from `\cref{eq:crps_energy}`{=latex} is: $$\text{CRPS}(\widetilde{X}, x) \approx
    \frac{1}{M} \sum_{n=1}^M \left| x_n - x \right| %
    - \frac{1}{2 M (M-1)} \sum_{n=1}^M \sum_{n'=1}^M \left| x_n - x_{n'} \right|.
    \label{eq:crps_estimator_energy}$$ However, this estimator is relatively costly, having an $O(M^2)$ time complexity.

A faster estimator which gives the same result as `\cref{eq:crps_estimator_energy}`{=latex} (up to numerical accuracy) is the one based on the probability weighted moment form of the CRPS [@taillardat2016; @zamo2018]: $$\text{CRPS}(\widetilde{X}, x) \approx
    \frac{1}{M} \sum_{n=1}^M \left| x_n - x \right| +
    \frac{1}{M} \sum_{n=1}^M x_n %
    - \frac{2}{M (M-1)} \sum_{n=1}^M (n-1) x_n,
    \label{eq:crps_estimator_pwm}$$ where the $x_n$ have been sorted in ascending order. We used `\cref{eq:crps_estimator_pwm}`{=latex} in our metric, since it is as accurate as `\cref{eq:crps_estimator_energy}`{=latex}, while only having an $O(M \log M)$ time complexity, dominated by the sorting step.
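
The agreement between `\cref{eq:crps_estimator_energy}`{=latex} and `\cref{eq:crps_estimator_pwm}`{=latex} is easy to verify numerically (an illustrative sketch, not the benchmark code):

```python
import numpy as np

def crps_energy(samples: np.ndarray, x: float) -> float:
    # O(M^2) unbiased energy-form estimator.
    M = len(samples)
    pairwise = np.abs(samples[:, None] - samples[None, :]).sum()
    return np.abs(samples - x).mean() - pairwise / (2 * M * (M - 1))

def crps_pwm(samples: np.ndarray, x: float) -> float:
    # O(M log M) probability-weighted-moment estimator; the samples are
    # sorted in ascending order for the last term.
    s = np.sort(samples)
    M = len(s)
    return (np.abs(s - x).mean() + s.mean()
            - 2.0 / (M * (M - 1)) * np.sum(np.arange(M) * s))
```

Both return the same value up to numerical accuracy for any sample set and ground truth.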

#### Constraint-violation functions {#app:constraint_functions}

In selecting constraint-violation functions $v_\Ccal$ for our various tasks, we have the following requirements: it should be invariant to the number of timesteps in the forecasting window, and it should be multiplied by $\alpha$ if all numerical data in a task are transformed using $x \rightarrow \alpha x + \beta$. Here are the $v_\Ccal$ we use in some of our benchmark tasks:

-   *Constant upper-bound constraint* $x_i \le \tau^+$: $$v_\Ccal(\xb_F) = \frac{1}{T - t}\sum_{i=t+1}^T \max(0, x_i - \tau^+),$$

-   *Constant lower-bound constraint* $x_i \ge \tau^-$: $$v_\Ccal(\xb_F) = \frac{1}{T - t}\sum_{i=t+1}^T \max(0, \tau^- - x_i),$$

-   *Constant lower-bound and upper-bound constraints* $\tau^- \le x_i \le \tau^+$: $$v_\Ccal(\xb_F) = \frac{1}{T - t}\sum_{i=t+1}^T \max(0, \tau^- - x_i) + \max(0, x_i - \tau^+),$$

-   and *variable upper-bound constraints, on a subset of time steps* $x_i \le \tau^+_i \ \forall i \in C$: $$v_\Ccal(\xb_F) = \frac{1}{|C|}\sum_{i \in C} \max(0, x_i - \tau^+_i).$$
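
For illustration, these violation functions translate directly into NumPy (a sketch with names of our choosing; the inputs are the forecast-window values only):

```python
import numpy as np

def v_upper(x_f: np.ndarray, tau_plus: float) -> float:
    # Constant upper bound: mean excess above tau_plus.
    return np.maximum(0.0, x_f - tau_plus).mean()

def v_lower(x_f: np.ndarray, tau_minus: float) -> float:
    # Constant lower bound: mean shortfall below tau_minus.
    return np.maximum(0.0, tau_minus - x_f).mean()

def v_band(x_f: np.ndarray, tau_minus: float, tau_plus: float) -> float:
    # Both bounds at once.
    return (np.maximum(0.0, tau_minus - x_f)
            + np.maximum(0.0, x_f - tau_plus)).mean()

def v_variable_upper(x_f: np.ndarray, tau_plus: np.ndarray, subset) -> float:
    # Variable upper bounds tau_i on a subset C of time steps.
    x = np.asarray(x_f)[subset]
    return np.maximum(0.0, x - np.asarray(tau_plus)).mean()
```

Averaging over time steps (rather than summing) gives the invariance to the forecast-window length required above.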

#### Covariance of two CRPS estimators {#app:stderr_formula}

One approach to computing the standard error of the RCRPS is to compute the empirical standard deviation based on the 5 instances we use for each task. However, such a method would overestimate the standard error, since it would include both the variance coming from the selection of instances of a given task and the variance coming from the models' sampling processes. Since all models are tested using the exact same instances, the variance coming from their selection is not relevant, and thus we need a way to ignore it.

To do so, we take advantage of the fact that the RCRPS is a weighted sum of multiple CRPS estimates. Since those estimates are not independent from one another, we can compute an estimate of the variance of the RCRPS under the sampling process by estimating the covariance matrix between the various CRPS estimates, followed by the appropriate weighted sum.

Say we want to compute the covariance between the CRPS for variable $i$ and the CRPS for variable $j$, using $M$ independent and identically distributed samples from the joint distribution of $\widetilde{X}_i$ and $\widetilde{X}_j$: $$\begin{gathered}
    \mathrm{Cov}\left(
        \text{CRPS}\!\!\left(\widetilde{X}_i, x_i\right), \text{CRPS}\!\!\left(\widetilde{X}_j, x_j\right)
    \right) = \\
    \mathrm{Cov}\Biggl(
        \frac{1}{M} \sum_n | \widetilde{X}_{i,n} - x_i | - \frac{1}{2 M (M-1)} \sum_{n \neq n'} | \widetilde{X}_{i,n} - \widetilde{X}_{i,n'} |, \\
        \frac{1}{M} \sum_n | \widetilde{X}_{j,n} - x_j | - \frac{1}{2 M (M-1)} \sum_{n \neq n'} | \widetilde{X}_{j,n} - \widetilde{X}_{j,n'} |
    \Biggr),\end{gathered}$$ where the sums are over the various samples $n$ and $x_i$ and $x_j$ are the ground-truth values.

After some tedious algebraic manipulations, we obtain the final formula for the covariance of two CRPS estimates: $$\begin{aligned}
    \mathrm{Cov}\left(
        \text{CRPS}\!\!\left(\widetilde{X}_i, x_i\right), \text{CRPS}\!\!\left(\widetilde{X}_j, x_j\right)
    \right) =
    & - \frac{1}{M} \Esp_{\widetilde{X}_i} | \widetilde{X}_i - x_i | \Esp_{\widetilde{X}'_j} | \widetilde{X}'_j - x_j | \\
    & + \frac{1}{M} \Esp_{\widetilde{X}_i} | \widetilde{X}_i - x_i | \Esp_{\widetilde{X}'_j}\Esp_{\widetilde{X}''_j} | \widetilde{X}'_j - \widetilde{X}''_j | \\
    & + \frac{1}{M} \Esp_{\widetilde{X}_i}\Esp_{\widetilde{X}'_i} | \widetilde{X}_i - \widetilde{X}'_i | \Esp_{\widetilde{X}''_j} | \widetilde{X}''_j - x_j | \\
    & - \frac{2 M - 3}{2 M (M - 1)} \Esp_{\widetilde{X}_i}\Esp_{\widetilde{X}'_i} | \widetilde{X}_i - \widetilde{X}'_i | \Esp_{\widetilde{X}''_j}\Esp_{\widetilde{X}'''_j} | \widetilde{X}''_j - \widetilde{X}'''_j | \\
    & + \frac{1}{M} \Esp_{(\widetilde{X}_i, \widetilde{X}_j)} | \widetilde{X}_i - x_i | \cdot | \widetilde{X}_j - x_j | \\
    & - \frac{1}{M} \Esp_{(\widetilde{X}_i, \widetilde{X}_j)}\Esp_{\widetilde{X}'_i} | \widetilde{X}_i - \widetilde{X}'_i | \cdot | \widetilde{X}_j - x_j | \\
    & - \frac{1}{M} \Esp_{(\widetilde{X}_i, \widetilde{X}_j)}\Esp_{\widetilde{X}'_j} | \widetilde{X}_i - x_i | \cdot | \widetilde{X}_j - \widetilde{X}'_j | \\
    & + \frac{1}{2 M (M - 1)} \Esp_{(\widetilde{X}_i, \widetilde{X}_j)}\Esp_{(\widetilde{X}'_i, \widetilde{X}'_j)} | \widetilde{X}_i - \widetilde{X}'_i | \cdot | \widetilde{X}_j - \widetilde{X}'_j | \\
    & + \frac{M - 1}{M (M - 1)} \Esp_{(\widetilde{X}_i, \widetilde{X}_j)}\Esp_{\widetilde{X}'_i}\Esp_{\widetilde{X}''_j} | \widetilde{X}_i - \widetilde{X}'_i | \cdot | \widetilde{X}_j - \widetilde{X}''_j |, \\\end{aligned}$$ where variables with the same number of apostrophes ($'$) are drawn together and those with different number of apostrophes are independent variables.

To get an estimate of the covariance using our $M$ samples, we can estimate each of these terms using their respective unbiased estimators. Once we have computed an estimate of the variance for a single task instance, the overall variance for a full task is computed using the formula for the variance of the average of multiple independent variables. One slight disadvantage of this method is that it offers no guarantee that the RCRPS variance estimate will be non-negative, so in the rare cases where the estimate for the variance of a full task is negative, we clip it to 0.

#### Comparison of Statistical Properties of Various Scoring Rules

```{=latex}
\centering
```
::: {#tab:crps-stat-properties}
  ---------------------------------------------------------- ------------------------------------------------ ----------------------------------- ------------------------------------------ ----------------
  `\multicolumn{1}{c}{\multirow{2}{*}{Metric}}`{=latex}       `\multirow{2}{*}{Proper Scoring Rule}`{=latex}   `\multirow{2}{*}{Domain}`{=latex}   `\multicolumn{2}{c}{Invariance}`{=latex}
  `\cmidrule`{=latex}(l)4-5 `\multicolumn{1}{c}{}`{=latex}                                                                                                         Additive                   Multiplicative
  Brier Score                                                                      Yes                                     Discrete                                  Yes                           Yes
  CRPS                                                                             Yes                                    Continuous                                 Yes                            No
  twCRPS                                                                           Yes                                    Continuous                                 Yes                            No
  CRPS skill score                                                                  No                                    Continuous                                 Yes                           Yes
  MAV-Scaled CRPS                                                                   No                                    Continuous                                  No                           Yes
  RCRPS                                                                            No\*                                   Continuous                                 Yes                           Yes
  ---------------------------------------------------------- ------------------------------------------------ ----------------------------------- ------------------------------------------ ----------------

  : Comparison of Statistical Properties of Various Scoring Rules. The \* indicates that, to be proper, the RCRPS would require different seeds to be independent, which cannot be guaranteed by CiK, but could hold in other applications.
:::

`\cref{tab:crps-stat-properties}`{=latex} describes a few statistical properties of both commonly used scoring rules and our RCRPS. For the invariance (additive and multiplicative) properties, we indicate whether the scoring rule remains unchanged if all relevant quantities (forecast, ground truth, threshold, and constraint parameters) are modified by adding a constant to them or by multiplying them by a constant. By MAV-Scaled CRPS, we denote the common approach in the forecasting literature of normalizing the CRPS by dividing it by the Mean Absolute Value of the ground truth, instead of reporting the original CRPS values.

```{=latex}
\vspace{-2mm}
```

[^1]: For `\llmp`{=latex}, we do not consider `\model{Llama-3.1-405B}`{=latex} and `\model{GPT}`{=latex} models, as `\llmp`{=latex} requires loading model weights into memory, which is infeasible due to resource limitations and the confidentiality of the respective models (see `\cref{subsubsec:llmp-implementation-details}`{=latex} for more details).

[^2]: For `\cref{fig:annotation-heatmap}`{=latex}, we consider both \`\`significantly better" and \`\`slightly better" as meaning the context is useful.
