---
abstract: |
  Recent time series modeling faces a sharp divide between numerical generation and semantic understanding: research shows that generation models often rely on superficial pattern matching, while understanding-oriented models struggle with high-fidelity numerical output. Although unified multimodal models (UMMs) have bridged this gap in vision, their potential for time series remains untapped. We propose TimeOmni-VL, the first vision-centric framework that unifies time series understanding and generation through two key innovations: (1) Fidelity-preserving bidirectional mapping between time series and images (Bi-TSI), which advances Time Series-to-Image (TS2I) and Image-to-Time Series (I2TS) conversions to ensure near-lossless transformations. (2) Understanding-guided generation. We introduce TSUMM-Suite, a novel dataset consisting of six understanding tasks rooted in time series analytics and coupled with two generation tasks. With a calibrated Chain-of-Thought (CoT), TimeOmni-VL is the first to leverage time series understanding as an explicit control signal for high-fidelity generation. Experiments confirm that this unified approach significantly improves both semantic understanding and numerical precision, establishing a new frontier for multimodal time series modeling.
bibliography:
- reference.bib
---

```{=latex}
\newcommand{\theHalgorithm}{\arabic{algorithm}}
```
```{=latex}
\newcommand{\ICML@preprint}{%
  \textit{Preprint. \today.}%
}
```
```{=latex}
\newcommand{\ICML@appearing}{\textit{Proceedings of the
$\mathit{43}^{rd}$ International Conference on Machine Learning},
Seoul, South Korea. PMLR 306, 2026.
Copyright 2026 by the author(s).}
```
```{=latex}
\newcommand{\Notice@String}{Preliminary work.  Under review by the
International Conference on Machine Learning (ICML)\@.  Do not distribute.}
```
```{=latex}
\newcommand{\yrcite}[1]{\citeyearpar{#1}}
```
```{=latex}
\renewcommand{\cite}[1]{\citep{#1}}
```
```{=latex}
\def\ftype@copyrightbox{8}
```
```{=latex}
\def\@copyrightspace{
\@float{copyrightbox}[b]
\begin{center}
\setlength{\unitlength}{1pc}
\begin{picture}(20,1.5)
\put(0,2.5){\line(1,0){4.818}}
\put(0,0){\parbox[b]{19.75pc}{\small \Notice@String}}
\end{picture}
\end{center}
\end@float}
```
```{=latex}
\def\addcontentsline#1#2#3{}
```
```{=latex}
\renewcommand{\headrulewidth}{1pt}
```
```{=latex}
\def\icmltitlerunning#1{\gdef\@icmltitlerunning{#1}}
```
```{=latex}
\newcommand\addstringtofullauthorlist{\g@addto@macro\icmlfullauthorlist}
```
```{=latex}
\newcommand\addtofullauthorlist[1]{%
  \ifdefined\icmlanyauthors%
    \addstringtofullauthorlist{, #1}%
  \else%
    \addstringtofullauthorlist{#1}%
    \gdef\icmlanyauthors{1}%
  \fi%
  \ifdefined\hypersetup%
    \hypersetup{pdfauthor=\icmlfullauthorlist}%
  \fi
}
```
```{=latex}
\def\toptitlebar{\hrule height1pt \vskip .25in}
```
```{=latex}
\def\bottomtitlebar{\vskip .22in \hrule height1pt \vskip .3in}
```
```{=latex}
\newenvironment{icmlauthorlist}{%
  \setlength\topsep{0pt}
  \setlength\parskip{0pt}
  \begin{center}
    }{%
  \end{center}
}
```
```{=latex}
\newcommand{\@pa}[1]{%
  \ifcsname the@affil#1\endcsname
    % do nothing
  \else
    \ifcsname @icmlsymbol#1\endcsname
      % nothing
    \else
      \stepcounter{@affiliationcounter}%
      \newcounter{@affil#1}%
      \setcounter{@affil#1}{\value{@affiliationcounter}}%
    \fi
  \fi%
  \ifcsname @icmlsymbol#1\endcsname
    \textsuperscript{\csname @icmlsymbol#1\endcsname\,}%
  \else
    \textsuperscript{\arabic{@affil#1}\,}%
  \fi
}
```
```{=latex}
\newcommand{\icmlauthor}[2]{%
  \ificmlshowauthors
    \mbox{\bf #1}\,\@for\theaffil:=#2\do{\@pa{\theaffil}} \addtofullauthorlist{#1}%
  \else
    \ifdefined\@icmlfirsttime\else
      \gdef\@icmlfirsttime{1}
      \mbox{\bf Anonymous Authors}\@pa{@anon} \addtofullauthorlist{Anonymous Authors}
    \fi
  \fi
}
```
```{=latex}
\newcommand{\icmlsetsymbol}[2]{%
  \expandafter\gdef\csname @icmlsymbol#1\endcsname{#2}
}
```
```{=latex}
\newcommand{\icmlaffiliation}[2]{%
  \ificmlshowauthors
    \ifcsname the@affil#1\endcsname
      \expandafter\gdef\csname @affilname\csname the@affil#1\endcsname\endcsname{#2}%
    \else
      {\bf AUTHORERR: Error in use of \textbackslash{}icmlaffiliation command. Label ``#1'' not mentioned in some \textbackslash{}icmlauthor\{author name\}\{labels here\} command beforehand. }
      \typeout{}%
      \typeout{}%
      \typeout{*******************************************************}%
      \typeout{Affiliation label undefined. }%
      \typeout{Make sure \string\icmlaffiliation\space follows }%
      \typeout{all of \string\icmlauthor\space commands}%
      \typeout{*******************************************************}%
      \typeout{}%
      \typeout{}%
    \fi
  \else
    \expandafter\gdef\csname @affilname1\endcsname{Anonymous Institution, Anonymous City, Anonymous Region, Anonymous Country}
  \fi
}
```
```{=latex}
\newcommand{\icmlcorrespondingauthor}[2]{%
  \ificmlshowauthors
    \ifdefined\icmlcorrespondingauthor@text
      \g@addto@macro\icmlcorrespondingauthor@text{, #1 \textless{}#2\textgreater{}}
    \else
      \gdef\icmlcorrespondingauthor@text{#1 \textless{}#2\textgreater{}}
    \fi
  \else
    \gdef\icmlcorrespondingauthor@text{Anonymous Author \textless{}anon.email@domain.com\textgreater{}}
  \fi
}
```
```{=latex}
\newcommand{\icmlEqualContribution}{\textsuperscript{*}Equal contribution}
```
```{=latex}
\newcommand{\printAffiliationsAndNotice}[1]{\global\icml@noticeprintedtrue%
  \stepcounter{@affiliationcounter}%
  {\let\thefootnote\relax\footnotetext{\hspace*{-\footnotesep}\ificmlshowauthors #1\fi%
      \forloop{@affilnum}{1}{\value{@affilnum} < \value{@affiliationcounter}}{
        \textsuperscript{\arabic{@affilnum}}\ifcsname @affilname\the@affilnum\endcsname%
          \csname @affilname\the@affilnum\endcsname%
        \else
          {\bf AUTHORERR: Missing \textbackslash{}icmlaffiliation.}
        \fi
      }.%
      \ifdefined\icmlcorrespondingauthor@text
         { }Correspondence to: \icmlcorrespondingauthor@text.
      \else
        {\bf AUTHORERR: Missing \textbackslash{}icmlcorrespondingauthor.}
      \fi

      \ \\
      \Notice@String
    }
  }
}
```
```{=latex}
\def\icmlkeywords#1{%
  \ifdefined\nohyperref\else\ifdefined\hypersetup
      \hypersetup{pdfkeywords={#1}}
    \fi\fi
}
```
```{=latex}
\renewenvironment{abstract}
{%
  \centerline{\large\bf Abstract}
  \vspace{-0.12in}\begin{quote}}
    {\par\end{quote}\vskip 0.12in}
```
```{=latex}
\def\@startsection#1#2#3#4#5#6{\if@noskipsec \leavevmode \fi
  \par \@tempskipa #4\relax
  \@afterindenttrue
  \ifdim \@tempskipa <\z@ \@tempskipa -\@tempskipa \fi
  \if@nobreak \everypar{}\else
    \addpenalty{\@secpenalty}\addvspace{\@tempskipa}\fi \@ifstar
  {\@ssect{#3}{#4}{#5}{#6}}{\@dblarg{\@sict{#1}{#2}{#3}{#4}{#5}{#6}}}}
```
```{=latex}
\def\@sict#1#2#3#4#5#6[#7]#8{\ifnum #2>\c@secnumdepth
    \def\@svsec{}\else
    \refstepcounter{#1}\edef\@svsec{\csname the#1\endcsname}\fi
  \@tempskipa #5\relax
  \ifdim \@tempskipa>\z@
    \begingroup #6\relax
    \@hangfrom{\hskip #3\relax\@svsec.~}{\interlinepenalty \@M #8\par}
    \endgroup
    \csname #1mark\endcsname{#7}\addcontentsline
    {toc}{#1}{\ifnum #2>\c@secnumdepth \else
        \protect\numberline{\csname the#1\endcsname}\fi
      #7}\else
    \def\@svsechd{#6\hskip #3\@svsec #8\csname #1mark\endcsname
      {#7}\addcontentsline
      {toc}{#1}{\ifnum #2>\c@secnumdepth \else
          \protect\numberline{\csname the#1\endcsname}\fi
        #7}}\fi
  \@xsect{#5}}
```
```{=latex}
\def\@sect#1#2#3#4#5#6[#7]#8{\ifnum #2>\c@secnumdepth
    \def\@svsec{}\else
    \refstepcounter{#1}\edef\@svsec{\csname the#1\endcsname\hskip 0.4em }\fi
  \@tempskipa #5\relax
  \ifdim \@tempskipa>\z@
    \begingroup #6\relax
    \@hangfrom{\hskip #3\relax\@svsec}{\interlinepenalty \@M #8\par}
    \endgroup
    \csname #1mark\endcsname{#7}\addcontentsline
    {toc}{#1}{\ifnum #2>\c@secnumdepth \else
        \protect\numberline{\csname the#1\endcsname}\fi
      #7}\else
    \def\@svsechd{#6\hskip #3\@svsec #8\csname #1mark\endcsname
      {#7}\addcontentsline
      {toc}{#1}{\ifnum #2>\c@secnumdepth \else
          \protect\numberline{\csname the#1\endcsname}\fi
        #7}}\fi
  \@xsect{#5}}
```
```{=latex}
\def\thesection {\arabic{section}}
```
```{=latex}
\def\thesubsection {\thesection.\arabic{subsection}}
```
```{=latex}
\def\section{\@startsection{section}{1}{\z@}{-0.12in}{0.02in}
  {\large\bf\raggedright}}
```
```{=latex}
\def\subsection{\@startsection{subsection}{2}{\z@}{-0.10in}{0.01in}
  {\normalsize\bf\raggedright}}
```
```{=latex}
\def\subsubsection{\@startsection{subsubsection}{3}{\z@}{-0.08in}{0.01in}
  {\normalsize\sc\raggedright}}
```
```{=latex}
\def\paragraph{\@startsection{paragraph}{4}{\z@}{1.5ex plus
    0.5ex minus .2ex}{-1em}{\normalsize\bf}}
```
```{=latex}
\def\subparagraph{\@startsection{subparagraph}{5}{\z@}{1.5ex plus
    0.5ex minus .2ex}{-1em}{\normalsize\bf}}
```
```{=latex}
\def\footnoterule{\kern-3pt \hrule width 0.8in \kern 2.6pt }
```
```{=latex}
\def\@listi{\leftmargin\leftmargini}
```
```{=latex}
\def\@listii{\leftmargin\leftmarginii
  \labelwidth\leftmarginii\advance\labelwidth-\labelsep
  \topsep 2pt plus 1pt minus 0.5pt
  \parsep 1pt plus 0.5pt minus 0.5pt
  \itemsep \parsep}
```
```{=latex}
\def\@listiii{\leftmargin\leftmarginiii
  \labelwidth\leftmarginiii\advance\labelwidth-\labelsep
  \topsep 1pt plus 0.5pt minus 0.5pt
  \parsep \z@ \partopsep 0.5pt plus 0pt minus 0.5pt
  \itemsep \topsep}
```
```{=latex}
\def\@listiv{\leftmargin\leftmarginiv
  \labelwidth\leftmarginiv\advance\labelwidth-\labelsep}
```
```{=latex}
\def\@listv{\leftmargin\leftmarginv
  \labelwidth\leftmarginv\advance\labelwidth-\labelsep}
```
```{=latex}
\def\@listvi{\leftmargin\leftmarginvi
  \labelwidth\leftmarginvi\advance\labelwidth-\labelsep}
```
```{=latex}
\def\@normalsize{\@setsize\normalsize{11pt}\xpt\@xpt}
```
```{=latex}
\def\small{\@setsize\small{10pt}\ixpt\@ixpt}
```
```{=latex}
\def\footnotesize{\@setsize\footnotesize{10pt}\ixpt\@ixpt}
```
```{=latex}
\def\scriptsize{\@setsize\scriptsize{8pt}\viipt\@viipt}
```
```{=latex}
\def\tiny{\@setsize\tiny{7pt}\vipt\@vipt}
```
```{=latex}
\def\large{\@setsize\large{14pt}\xiipt\@xiipt}
```
```{=latex}
\def\Large{\@setsize\Large{16pt}\xivpt\@xivpt}
```
```{=latex}
\def\LARGE{\@setsize\LARGE{20pt}\xviipt\@xviipt}
```
```{=latex}
\def\huge{\@setsize\huge{23pt}\xxpt\@xxpt}
```
```{=latex}
\def\Huge{\@setsize\Huge{28pt}\xxvpt\@xxvpt}
```
```{=latex}
\def\fnum@figure{Figure \thefigure}
```
```{=latex}
\def\fnum@table{Table \thetable}
```
```{=latex}
\def\abovestrut#1{\rule[0in]{0in}{#1}\ignorespaces}
```
```{=latex}
\def\belowstrut#1{\rule[-#1]{0in}{#1}\ignorespaces}
```
```{=latex}
\def\abovespace{\abovestrut{0.20in}}
```
```{=latex}
\def\aroundspace{\abovestrut{0.20in}\belowstrut{0.10in}}
```
```{=latex}
\def\belowspace{\belowstrut{0.10in}}
```
```{=latex}
\def\texitem#1{\par\noindent\hangindent 12pt
  \hbox to 12pt {\hss #1 ~}\ignorespaces}
```
```{=latex}
\def\icmlitem{\texitem{$\bullet$}}
```
```{=latex}
\def\fillzeros[#1]#2{\cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi
  \cv@tmpc=1 %
  \loop\ifnum\cv@tmpc@<10 \else \divide\cv@tmpc@ by 10 \advance\cv@tmpc by 1 \fi
  \ifnum\cv@tmpc@=10\relax\cv@tmpc@=11\relax\fi \ifnum\cv@tmpc@>10 \repeat
  \ifnum#2<0\advance\cv@tmpc1\relax-\fi
  \loop\ifnum\cv@tmpc<#1\relax0\advance\cv@tmpc1\relax\fi \ifnum\cv@tmpc<#1 \repeat
  \cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi \relax\the\cv@tmpc@}
```
```{=latex}
\def\makevruler[#1][#2][#3][#4][#5]{
  \begingroup\offinterlineskip
  \textheight=#5\vbadness=10000\vfuzz=120ex\overfullrule=0pt%
  \global\setbox\icmlrulerbox=\vbox to \textheight{%
    {
        \parskip=0pt\hfuzz=150em\cv@boxheight=\textheight
        \cv@lineheight=#1\global\icmlrulercount=#2%
        \cv@tot\cv@boxheight\divide\cv@tot\cv@lineheight\advance\cv@tot2%
        \cv@refno1\vskip-\cv@lineheight\vskip1ex%
        \loop\setbox\cv@tmpbox=\hbox to0cm{\hfil {\hfil\fillzeros[#4]\icmlrulercount}}%
        \ht\cv@tmpbox\cv@lineheight\dp\cv@tmpbox0pt\box\cv@tmpbox\break
        \advance\cv@refno1\global\advance\icmlrulercount#3\relax
        \ifnum\cv@refno<\cv@tot\repeat
      }
  }
  \endgroup
}
```
```{=latex}
\def\icmlruler#1{\makevruler[12pt][#1][1][3][\textheight]\usebox{\icmlrulerbox}}
```
```{=latex}
\newcommand{\tg}[1]{\textcolor{MidnightBlue}{#1}}
```
```{=latex}
\newcommand{\bestres}[1]{{\textbf{\textcolor{red}{#1}}}}
```
```{=latex}
\newcommand{\secondres}[1]{\textcolor{blue}{\uline{#1}}}
```
```{=latex}
\newcommand{\dataset}{\textsc{TSUMM-Suite}\xspace}
```
```{=latex}
\newcommand{\method}{\textsc{TimeOmni-VL}\xspace}
```
```{=latex}
\newcommand{\projectlead}{$^\dagger$Project lead}
```
Introduction
============

Time series are pervasive in modern systems and everyday life, underpinning decision-making across healthcare, transportation, industrial monitoring, and finance [@ShapeX; @TrafficR1; @ITFormer; @BeyondForecasting]. With advances in time series modeling at scale, recent progress has largely followed two parallel threads: (1) *Generation models*. Led by time series foundation models (TSFMs), this thread prioritizes high-fidelity numerical sequence generation, excelling in tasks such as forecasting [@TimeMoE] and data imputation [@moment] (Figure [1](#fig:intro){reference-type="ref" reference="fig:intro"}b). (2) *Understanding models*. Influenced by the rise of large language models (LLMs), this thread focuses on temporal reasoning [@TimeOmni-1] by providing explicit, human-readable interpretations of complex dynamics [@ChatTS] (Figure [1](#fig:intro){reference-type="ref" reference="fig:intro"}a). However, a significant divide remains: generation models often lack explicit structural understanding despite offering representation analysis of signal components [@TimeMixer++], while understanding-oriented models frequently struggle with high-fidelity numerical generation because text-native tokenizers disrupt numerical continuity (e.g., "123" → "1", "2", "3"). Bridging this gap with a unified model capable of both understanding and generation is an urgent need for time series processing (Figure [1](#fig:intro){reference-type="ref" reference="fig:intro"}c).

![Comparison of architectures for **(a)** time series understanding models that produce textual answers only, **(b)** time series generation models that output time series only, and **(c)** a unified time series understanding and generation model that supports both answering queries and generating time series.](figs/fig1.png){#fig:intro width="1\\linewidth"}

The vision domain has followed a similar trajectory, with models specialized for visual generation [@GLIDE; @VQ-VAE-2] and others focused on visual understanding [@clip; @Qwen2VL]. Recently, however, the vision community has witnessed the rise of unified multimodal models (UMMs) that excel at both image understanding and generation. A key emerging insight is that robust understanding serves as a foundation for superior generation, since structured semantic guidance improves controllability and fidelity [@UMMs-survey]. Meanwhile, an emerging line of work points to an affinity between the time series and vision modalities: pixel-level variations in natural images can be viewed as sequential signals and share intrinsic structure with time series [@VisionTS]. By reframing time series modeling as a visual inpainting problem, visual generative models [@MAE] achieve impressive time series forecasting [@VisionTS++] and imputation [@FM2I; @Hinge-FM2I] performance, even in a training-free manner. Despite their effectiveness, these vision-based approaches largely rely on superficial texture imitation rather than genuine temporal understanding. They lack a mechanism to interpret the underlying signal dynamics from a time series perspective, such as identifying trend shifts or seasonal dependencies within the visual space. Motivated by these observations, we ask a natural question: *Is it possible to represent time series in the vision modality and thereby internalize time series understanding and generation as native capabilities of UMMs, so that time series performance improves naturally as UMMs continue to advance?*

However, achieving this integration is non-trivial, as two fundamental challenges remain: **(1) Fidelity-preserving bidirectional mappings between time series and images are still lacking.** Although VisionTS-style [@VisionTS; @VisionTS++] converters offer a practical interface for vision models, we find that the front-end conversion can already discard numerical information, so the model may never observe the complete series content. Once information is lost at the input stage, it cannot be recovered downstream, making high-fidelity generation fundamentally unattainable. **(2) Understanding-guided generation remains underexplored for time series.** While UMMs possess strong semantic capabilities, they are not yet grounded in time series properties such as inherent periodicity and structural changepoints. As a result, they cannot leverage semantics to guide time series generation, preventing the system from achieving the precise and controllable results commonly observed in other multimodal tasks.

To address these challenges, we build TimeOmni-VL around two core design objectives: (1) fidelity-preserving bidirectional mappings between time series and images and (2) understanding-guided generation (our primary goal is precise generation, where understanding serves as the necessary control signal, not vice versa). We advance existing converters [@VisionTS] into fidelity-oriented **Bi**directional **T**ime **S**eries $\Leftrightarrow$ **I**mage mappings (**Bi-TSI**) that avoid information loss at the input stage. Concretely, we introduce robust fidelity normalization (RFN) to stabilize dynamic-range projection and preserve peak geometry under realistic signals, alongside encoding capacity control to prevent implicit downsampling when rendering time series onto a fixed time series image (TS-image) canvas. Building on Bi-TSI, we construct TSUMM-Suite, a new dataset, by specifying forecasting and imputation as generation tasks and deriving six understanding tasks from the same generation instances, organized into layout-level and signal-level analysis. These tasks encourage UMMs to interpret TS-images from a temporal perspective rather than relying on superficial textures. Finally, we present TimeOmni-VL, the first vision-centric framework that internalizes time series understanding and generation as native capabilities of UMMs. To enable understanding-guided generation, we form a generation Chain-of-Thought (CoT) by organizing the understanding QAs of each generation instance into a calibrated reasoning chain, making temporal understanding an explicit control signal for precise and controllable time series generation. Our contributions lie in three aspects:

**1. New Models.** We present TimeOmni-VL, the first vision-centric framework that unifies time series understanding and generation. TimeOmni-VL integrates: (1) fidelity-preserving bidirectional Time Series $\Leftrightarrow$ Image mappings that prevent implicit information loss, and (2) a generation CoT that organizes instance-level understanding into a calibrated reasoning chain and serves as an explicit control signal for numerical generation tasks such as forecasting and imputation.

**2. New Datasets and Testbed.** We introduce TSUMM-Suite, a benchmark comprising two generation tasks and six understanding tasks. The understanding tasks are tailored to the TS-image representation produced by Bi-TSI and are organized into **layout-level** and **signal-level** analyses to encourage temporal interpretation rather than superficial texture matching.

**3. Comprehensive Evaluation.** Results demonstrate that the understanding tasks effectively teach the base model to interpret TS-images: TimeOmni-VL boosts the base model from near-zero accuracy to near-perfect scores (approaching 1.0) on four understanding tasks. On generation, TimeOmni-VL achieves top-tier results on forecasting and state-of-the-art performance on imputation. Moreover, the proposed generation CoT consistently improves generation quality, yielding an average gain of 8.2%.

![image](figs/fig2.png){width="1\\linewidth"}

Related Work
============

**Time Series Generation Models.** In this context, time series generation refers specifically to forecasting and imputation rather than synthetic data generation. Existing models fall primarily into two paradigms. **(1) Time series-based models.** Early efforts focused on developing domain-specific architectures, which often lacked cross-dataset generalization [@MTGNN; @GraphSTAGE; @TimeMixer++]. With the increasing availability of large-scale datasets, training TSFMs from scratch has become the mainstream approach to achieving superior zero-shot generalization [@moirai; @Chronos; @TimeMoE]. **(2) Image-based models.** Early researchers explored convolutional [@TS2I_1] and patch-based [@FM2I; @Hinge-FM2I] methods to reconstruct time series as images, revealing shared properties between the two modalities. Following the success of general visual generative models, the TS2I paradigm has resurged through models like VisionTS [@VisionTS; @VisionTS++], which demonstrate impressive zero-shot capabilities. However, their reliance on pixel-level pattern matching lacks genuine temporal understanding.

**Time Series Understanding Models.** Time-LLM [@Time-LLM] leverages the generalization capabilities of LLMs for time series, yet its understanding of temporal patterns remains largely implicit. To achieve explicit understanding, existing research has branched into two primary directions. The first involves **time series language models (TSLMs)**, which utilize synthetic datasets to align temporal signals with textual descriptions to ground temporal semantics [@ChatTS; @TimeMQA; @ITFormer; @OpenTSLM]. The second encompasses **time series reasoning models (TSRMs)**, which leverage the R1-paradigm [@DeepSeek-R1] to enhance temporal reasoning [@TimeOmni-1; @STReasoner]. Despite these advancements, both categories are constrained by the text-centric nature of LLMs. Standard vocabularies typically fragment multi-digit numbers into discrete tokens, thereby disrupting numerical continuity and undermining the precision required for high-fidelity generation.

**Unified Multimodal Models.** UMMs have recently emerged in the vision community to integrate understanding and generation within a single framework. These models generally follow either a unified auto-regressive architecture [@Chameleon; @MetaMorph; @Janus; @BLIP3-o; @Emu3.5] or a hybrid paradigm combining auto-regression with diffusion [@JanusFlow; @Bagel; @Qwen-Image]. Currently, the hybrid approach often yields superior results because image understanding prioritizes high-level semantics while generation requires fine-grained pixel details [@UMMs-survey; @Bagel]. Since the time series community lacks universal pre-trained encoders equivalent to ViT [@NaViT] or VAE [@VAE] in vision, recent studies [@TsLLM; @SciTS] attempting unified modeling with auto-regressive LLMs typically rely on shallow MLP layers. However, the effectiveness of such simple layers in projecting time series into the latent space remains unverified. This gap motivates combining TS2I methods with UMMs: by utilizing images as a modality-specific enhancement, we leverage UMMs to achieve a unified framework for temporal understanding and generation.

![image](figs/fig3.png){width="1\\linewidth"}

Methodology
===========

In this section, we first establish a unified problem formulation for both tasks. We then present TimeOmni-VL, the first vision-centric framework that unifies time series understanding and generation. Finally, we introduce TSUMM-Suite and its construction pipeline, which formalizes both generation and understanding tasks and bridges them by deriving the generation CoT directly from understanding QAs.

**Problem Definition.** We formulate unified time series understanding and generation as a conditional *think-then-output* process within UMMs. Unlike in TSRMs [@TimeOmni-1], where CoT mainly serves as a textual explanation, here we treat CoT as a control signal that conditions generation. Given (1) the observed time series input $\mathbf{X}\in\mathbb{R}^{T\times N}$, and (2) an auxiliary context $C$ (e.g., task instructions), the model first generates a CoT $R = (r_1, \dots, r_K)$, and then produces the task target $o$ using $R$ as additional context. Formally, $$p_\theta(R, o \mid \mathbf{X}, C) = p_\theta(R \mid \mathbf{X}, C)
p_\theta(o \mid R, \mathbf{X}, C).$$ To standardize the inference process, we explicitly instruct the model to enclose the CoT $R$ within `<think></think>` tags across all tasks.
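The factorization above amounts to a two-stage decoding loop, sketched minimally below. `gen_cot` and `gen_answer` are hypothetical stand-ins for the UMM's decoder (not part of any released code); the sketch only illustrates the conditioning order and the `<think></think>` wrapping.

```python
def think_then_output(x, context, gen_cot, gen_answer):
    """Two-stage decoding: first sample the CoT R ~ p(R | X, C),
    then the task target o ~ p(o | R, X, C), with R wrapped in
    <think></think> tags per the standardized inference format."""
    cot = gen_cot(x, context)             # R ~ p_theta(R | X, C)
    answer = gen_answer(cot, x, context)  # o ~ p_theta(o | R, X, C)
    return f"<think>{cot}</think>{answer}"
```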

In this paper, we transform time series into the TS-image $I=\mathcal{V}(\mathbf{X})$. For understanding tasks on the TS-image $I$, the output produces a textual answer. For generation tasks (e.g., forecasting or imputation), we formulate the problem as editing the input TS-image: given a source image $I_{\mathrm{src}}$ and a generation instruction $C_{\mathrm{gen}}$, the model outputs an edited image $I_{\mathrm{tgt}}$, which is then decoded back into numerical values.

#### Overall Framework.

As illustrated in Figure [\[fig:1\]](#fig:1){reference-type="ref" reference="fig:1"}, we design TimeOmni-VL to handle both time series understanding and generation tasks. We use Bagel [@Bagel] as the backbone UMM. While our framework is backbone-agnostic, we choose Bagel because it is a widely recognized, lightweight base model with superior performance among comparable options. To adapt UMMs to temporal data, we introduce fidelity-preserving **Bi**directional **T**ime **S**eries $\Leftrightarrow$ **I**mage mappings (**Bi-TSI**), consisting of a TS2I converter and an I2TS converter (Section [3.1](#sec:Bi-TSI){reference-type="ref" reference="sec:Bi-TSI"}). Specifically, the TS2I converter transforms raw time series into a high-fidelity visual representation (TS-image $I$), which is then fed into the backbone model. Within the backbone, the data flow differs by task (the data construction pipeline is described in Section [3.2](#sec:tasks){reference-type="ref" reference="sec:tasks"}): (1) **Understanding tasks**: Given an understanding instruction $C_{\mathrm{und}}$ and the TS-image $I$, the *Understanding Model* first generates an understanding CoT $R$, followed by the final understanding answer $o$. (2) **Generation tasks**: The process follows an "understand-then-generate" paradigm. The model first feeds a generation instruction alongside the TS-image $I_{\mathrm{src}}$ into the *Understanding Model* to derive a generation-oriented CoT $R_{\mathrm{gen}}$. This CoT then serves as a conditional guide: the TS-image $I_{\mathrm{src}}$ is fed again into the *Generation Module*, which synthesizes the target TS-image $I_{\mathrm{tgt}}$. The output TS-image is converted back to a numerical time series $o$ via the I2TS converter.

#### Training Objectives.

We jointly train the *Understanding Model* and the *Generation Module*. For generation tasks, the generation CoT is produced by the understanding model and is therefore supervised by the understanding loss.

**Understanding Loss (Text).** Given a TS-image $I$ and an instruction $C$, we optimize next-token prediction over a text sequence $y$ (understanding: $y=[R;o]$; generation: $y=R_{\mathrm{gen}}$): $$\mathcal{L}_{\mathrm{und}}
= - \sum_{i=1}^{|y|} \log P_{\theta}\!\left(y_i \mid y_{<i}, I, C\right).$$

**Generation Loss (Image).** We train the generation module as a diffusion denoiser. Given $I_{\mathrm{tgt}}$, we sample $s$ and add Gaussian noise $\epsilon$ to obtain $I_s$. Here $F_{\mathrm{gen}}(\cdot)$ predicts the injected noise conditioned on $(I_{\mathrm{src}},R_{\mathrm{gen}})$: $$\mathcal{L}_{\mathrm{gen}}
=\mathbb{E}_{s,\epsilon}\!\left[\left\|F_{\mathrm{gen}}(I_s; I_{\mathrm{src}}, R_{\mathrm{gen}}, s)-\epsilon\right\|_2^2\right].$$

Ultimately, we minimize a weighted sum of the above losses during training: $$\mathcal{L}=\lambda_{\mathrm{und}}\,\mathcal{L}_{\mathrm{und}}+\lambda_{\mathrm{gen}}\,\mathcal{L}_{\mathrm{gen}}.$$
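As a concrete reading of the three loss equations, the NumPy sketch below computes both terms and their weighted sum. The function names are ours, and we use a per-element mean for the noise-prediction term (a common implementation choice for the $\|\cdot\|_2^2$ objective); default weights are illustrative, not the paper's settings.

```python
import numpy as np

def understanding_loss(token_logprobs):
    # L_und: negative log-likelihood of the supervised text sequence y.
    return -np.sum(token_logprobs)

def generation_loss(pred_noise, true_noise):
    # L_gen: squared error between predicted and injected Gaussian noise
    # (mean rather than sum, a common implementation choice).
    return np.mean((pred_noise - true_noise) ** 2)

def total_loss(token_logprobs, pred_noise, true_noise,
               lam_und=1.0, lam_gen=1.0):
    # L = lambda_und * L_und + lambda_gen * L_gen.
    return (lam_und * understanding_loss(token_logprobs)
            + lam_gen * generation_loss(pred_noise, true_noise))
```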

![image](figs/fig4.png){width="1\\linewidth"}

Fidelity-Preserving "Time Series $\Leftrightarrow$ Image" {#sec:Bi-TSI}
------------------------------------------------------------

To unlock UMMs for time series, we require fidelity-preserving bidirectional mappings that enable near-lossless transformations between time series and TS-images. We therefore introduce **Bi-TSI**, which consists of two components: a Time Series-to-Image (TS2I) converter that encodes numerical sequences into a TS-image (Figure [\[fig:1\]](#fig:1){reference-type="ref" reference="fig:1"}a) and an Image-to-Time Series (I2TS) converter that decodes a TS-image back to numerical values (Figure [\[fig:1\]](#fig:1){reference-type="ref" reference="fig:1"}b).

#### Quick Overview of TS2I and I2TS.

Given a multivariate time series $\mathbf{X}\in\mathbb{R}^{T\times N}$ with periodicity $f$, we set the TS-image ${I}$ to have resolution $H\times W$. (1) TS2I first normalizes $\mathbf{X}$ and folds each variable $\tilde{\mathbf{x}}^{(n)}\in\mathbb{R}^{T}$ into a periodic grid $\mathbf{S}^{(n)}\in\mathbb{R}^{f\times C}$ with $C=T/f$. Each grid is then rendered into a band of size $h\times W$, where $h=\lfloor H/N \rfloor$, and all bands are stacked vertically to form a TS-image of resolution $H\times W$; a task-specific masking scheme is applied so that the unmasked region provides the observed context while the masked region is completed by the backbone model. (2) I2TS reverses this process by taking the backbone output TS-image, extracting each variable band according to its vertical location, resizing the decoded region back to the $f\times C$ grid, unrolling it along the temporal axis, and applying denormalization to recover numerical values. Our conversion pipeline follows that of VisionTS++ [@VisionTS++], with a step-by-step description provided in Appendix [8](#app:ts2i_i2ts){reference-type="ref" reference="app:ts2i_i2ts"}. In this section, we present two key improvements that make the TS2I/I2TS round-trip mapping reliable for UMMs.
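The core fold/unfold step of TS2I and I2TS can be sketched in NumPy as follows (function names are ours; the rendering to an $h\times W$ pixel band and the resize back to the $f\times C$ grid are omitted):

```python
import numpy as np

def fold(x, f):
    """Fold a length-T series into an (f x C) periodic grid, C = T // f.
    Row i is phase i within the period; column c is the c-th cycle.
    Each grid is subsequently rendered into an image band."""
    C = len(x) // f
    return x[: f * C].reshape(C, f).T    # shape (f, C)

def unfold(grid):
    """Inverse used by I2TS: read the columns back out in temporal order."""
    return grid.T.reshape(-1)
```

With `x = [0, 1, ..., 7]` and `f = 4`, column 0 of `fold(x, 4)` holds the first cycle `[0, 1, 2, 3]`, and `unfold(fold(x, 4))` recovers `x` exactly, making the round trip lossless whenever $f$ divides $T$.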

#### Robust Fidelity Normalization (RFN). {#RFN}

A key step in TS2I is normalization when projecting values into the image space, but common choices can distort the TS-image. Standard Deviation (Std)-based scaling [@VisionTS] is sensitive to extreme spikes; a single outlier can compress normal samples into a narrow range, pushing the spike to the boundary. Consequently, the spike geometry may appear saturated in the TS-image (Figure [\[fig:rfn\]](#fig:rfn){reference-type="ref" reference="fig:rfn"}a). Meanwhile, Median Absolute Deviation (MAD)-based scaling [@Chronos2] fails when many samples share the same value; a near-zero MAD leads to overly aggressive normalization, amplifying minor fluctuations. To address this, RFN combines robust scaling with bounded compression. Given $\mathbf{X}\in\mathbb{R}^{T\times N}$, we compute a per-variable median location $\boldsymbol{\mu}\in\mathbb{R}^{N}$. For robust scaling $\boldsymbol{\sigma}$, we combine a MAD-based estimate with the standard deviation: $$\label{eq:RFN_sigma}
\boldsymbol{\sigma} = \alpha \frac{\mathrm{Median}\!\left(\left|\mathbf{X}-\boldsymbol{\mu}\right|\right)}{c_{\mathrm{MAD}}} + (1-\alpha)\,\mathrm{Std}\!\left(\mathbf{X}\right).$$ We then apply a smooth bounded mapping via $\tanh$: $$\label{eq:RFN_tanh}
\mathbf{X}_{\mathrm{norm}} = \tanh\!\left(\frac{\mathbf{X}-\boldsymbol{\mu}}{\kappa\,\boldsymbol{\sigma}}\right),$$ where $\alpha\in[0,1]$, $c_{\mathrm{MAD}}$ is the consistency constant, and $\kappa$ controls saturation. See Appendix [9](#app:norm_comparison){reference-type="ref" reference="app:norm_comparison"} for further comparisons of Std-based and MAD-based normalization under two challenging regimes (extreme outliers and step-like signals), and how RFN avoids signal washout and noise amplification.
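A minimal NumPy sketch of RFN for a single variable follows; it implements the blended scale $\boldsymbol{\sigma}$ and the bounded $\tanh$ mapping above. The `alpha` and `kappa` defaults are illustrative rather than the paper's tuned values, the small floor on `sigma` is our own guard for constant series, and the `arctanh` inverse is what an I2TS denormalization step would use.

```python
import numpy as np

C_MAD = 0.6745  # consistency constant: MAD / c_MAD estimates sigma for Gaussian data

def rfn_normalize(x, alpha=0.5, kappa=3.0):
    """Robust fidelity normalization: blend MAD- and Std-based scales,
    then apply a smooth bounded tanh compression."""
    mu = np.median(x)
    mad_scaled = np.median(np.abs(x - mu)) / C_MAD
    sigma = alpha * mad_scaled + (1 - alpha) * np.std(x)
    sigma = max(sigma, 1e-8)  # our guard against constant series
    z = np.tanh((x - mu) / (kappa * sigma))
    return z, mu, sigma

def rfn_denormalize(z, mu, sigma, kappa=3.0):
    """Inverse mapping for I2TS; clipping keeps arctanh finite at |z| = 1."""
    z = np.clip(z, -1 + 1e-12, 1 - 1e-12)
    return np.arctanh(z) * kappa * sigma + mu
```

Because `tanh` is bounded, an extreme spike lands inside $(-1, 1)$ instead of saturating at the canvas boundary, while the MAD/Std blend keeps the scale finite for step-like signals where the plain MAD collapses to zero.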

#### Avoiding Downsampling via Encoding Capacity Control.

Without explicit constraints on variables or length, VisionTS++ [@VisionTS++] maps oversized periodic grids to the target TS-image resolution, triggering downsampling and loss of temporal details. As shown in Figure [\[fig:rfn\]](#fig:rfn){reference-type="ref" reference="fig:rfn"}b, once information is lost at the input stage, even perfect completion fails to recover it, as the backbone cannot restore details removed by the initial mapping. To avoid this failure mode, we make two changes: **(1) capacity constraints to eliminate downsampling** by requiring $H/N\ge f$ and $W\ge L/f$, where $H$ is the available vertical height, $W$ is the horizontal width allocated to the encoded segment, $f$ is the periodicity, $N$ is the number of variables, and $L$ is the total encoded length (including masked portions). These constraints ensure at least one pixel per timestep during rendering, preserving high-fidelity inputs for the backbone model. **(2) higher-resolution TS-images to retain practical capacity**. We use $896\times896$ images, providing $16\times$ more area than $224\times224$ in VisionTS++, which allows Bi-TSI to encode more variables and longer sequences.
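The two capacity constraints amount to a simple feasibility check before rendering; a minimal sketch (the helper name is ours, not part of the pipeline):

```python
def fits_without_downsampling(H, W, N, f, L):
    """Return True iff the periodic grids need no downsampling:
    each variable band can host all f intra-period rows (H / N >= f)
    and the width can host all ceil(L / f) periods (W >= L / f),
    guaranteeing at least one pixel per timestep."""
    return (H // N) >= f and W >= -(-L // f)  # -(-L // f) == ceil(L / f)
```

When the check fails, the encoded length or the number of variables must be reduced rather than silently resized.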

Formulating Generation and Understanding Tasks {#sec:tasks}
----------------------------------------------

We introduce . To leverage understanding for superior generation, we adopt a generation-first pipeline: we first specify generation tasks and then construct understanding samples grounded in them. Detailed case studies can be found in Appendix [11](#app:case_study){reference-type="ref" reference="app:case_study"}.

#### Generation Tasks.

We focus on two key time series generation tasks: forecasting and imputation (Figure [\[fig: task\]](#fig: task){reference-type="ref" reference="fig: task"}). The pretraining dataset is derived from GIFT-Eval [@GIFT-Eval]. For forecasting, we follow the GIFT-Eval evaluation protocol to adjust the prediction length $P$ based on the series frequency. To ensure the visual loss focuses sufficiently on the completion region, we constrain the context length $H$ to between $P$ and $2P$ for forecasting, and set the masking ratio between $10\%$ and $50\%$ of the total sequence $\mathbf{X}$ for imputation. We constructed $40k$ samples for forecasting and $40k$ for imputation. Within each task category, the ratio of univariate, multi-attribute, and multi-node samples is $2:1:1$. For multivariate samples, the maximum number of input variables in our training set is $21$, consistent with the maximum target variates required in the GIFT-Eval testbed.

#### Understanding Tasks.

We design six types of understanding tasks tailored to the generation samples (Figure [\[fig: task\]](#fig: task){reference-type="ref" reference="fig: task"}). They span two levels: (1) **Layout-level tasks** for locating specific variables and periods, and (2) **Signal-level tasks** for detailed intra-period and inter-period pattern analysis. This hierarchical design compels the model to interpret the TS-image as structured temporal signals rather than superficial textures. Based on the generation samples, we constructed $9,409$ QA pairs accompanied by detailed understanding CoTs generated via rules and LLMs [@Gemini2.5]. To further enhance temporal reasoning, we also incorporate the TSR-Suite dataset [@TimeOmni-1], providing $2,339$ CoT-guided temporal reasoning samples to inject essential temporal priors into the understanding model.

#### Bridging Generation and Understanding tasks.

To implement understanding-guided generation, we derive the generation CoT $R_{\mathrm{gen}}$ by composing the analytical logic from the understanding tasks (Figure [\[fig: task\]](#fig: task){reference-type="ref" reference="fig: task"}). This is feasible because our understanding QAs are constructed on the same generation instances: while layout-level QAs identify the temporal coordinates of variables and periods, signal-level QAs analyze the patterns within these regions. Consequently, the derived $R_{\mathrm{gen}}$ integrates these analyses to provide a structured context for the input TS-image $I_{\mathrm{src}}$. We structure the training samples as an interleaved sequence: $$\mathbf{seq} = P_{\mathrm{sys}} \oplus I_{\mathrm{src}} \oplus C_{\mathrm{gen}} \oplus R_{\mathrm{gen}} \oplus I_{\mathrm{tgt}},$$ where $P_{\mathrm{sys}}$ denotes the system prompt, $C_{\mathrm{gen}}$ is the generation instruction, and $I_{\mathrm{tgt}}$ is the ground-truth target TS-image. Through this construction, $R_{\mathrm{gen}}$ serves as a conditioning context, tightly linking the two task families.
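The interleaving above can be sketched as a typed list of segments; the pair representation and helper name below are our own illustration, not the backbone's actual tokenizer interface:

```python
def build_training_sequence(p_sys, i_src, c_gen, r_gen, i_tgt):
    """Assemble seq = P_sys + I_src + C_gen + R_gen + I_tgt as
    (modality, payload) pairs; the multimodal tokenizer of the
    backbone is responsible for turning each segment into tokens."""
    return [("text", p_sys), ("image", i_src), ("text", c_gen),
            ("text", r_gen), ("image", i_tgt)]
```

Placing $R_{\mathrm{gen}}$ immediately before $I_{\mathrm{tgt}}$ is what lets the generation CoT act as conditioning context for the target image.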

Experiments {#sec:exp}
===========

**Implementation.** In our experiments, the understanding model and the generation module are initialized from the pretrained Bagel-7B [@Bagel]. All training data come from the proposed . Although we constructed $40k$ interleaved sequences for each generation task, we only use $5k$ for training in each task and leave the remaining data for further community exploration. For understanding tasks, we use the full $9,409$ QA pairs with detailed understanding CoT. The model is trained on a node with 8$\times$ NVIDIA A100 GPUs. We use a base learning rate of $3\times10^{-5}$ with a warm-up phase covering 5% of the total training iterations. All input TS-images have a resolution of $896\times896$, resulting in approximately $3,000$ visual tokens per image. In the main comparisons, we use the same checkpoint to evaluate all tasks.

**Evaluation Metrics.** We evaluate using standard metrics spanning numerical and textual outputs. For forecasting, we report the normalized Mean Absolute Scaled Error (nMASE) in accordance with common practice on the GIFT-Eval testbed [@GIFT-Eval]; for imputation, we also report nMASE under various masking ratios. For TS-image understanding, scores are normalized to $[0,1]$ (higher is better) based on task-specific criteria in Appendix [10.1](#app: understanding_metrics){reference-type="ref" reference="app: understanding_metrics"}. For reasoning tasks, we follow the TSR-Suite benchmark [@TimeOmni-1], reporting Accuracy (ACC) for text-output tasks and Mean Absolute Error (MAE) for sequence-output tasks. All reported results are obtained under zero-shot, out-of-distribution evaluation. Because LLMs have limited counting abilities (a problem especially for generation tasks) and tend to produce repetitive or garbled outputs (especially for understanding tasks), we compute all subsequent evaluation metrics only on model outputs that yield a valid and extractable answer. This protocol reduces confounding effects from differences in instruction-following abilities across models. "--" indicates a Success Rate (SR) below 10%; such results are omitted due to insufficient statistical reliability.

Main Results
------------

**Time Series Understanding**\
**[Setup.]{.underline}** We evaluate six TS-image understanding tasks and find that general-purpose VLMs are not directly applicable without dedicated adaptation. For example, Gemini2.5-Flash achieves zero accuracy on the signal-level QA5 task; a detailed comparison with two Gemini variants is reported in Table [7](#tab:understanding_result){reference-type="ref" reference="tab:understanding_result"} (Appendix [10.2](#app:gemini Performance on Understanding Tasks){reference-type="ref" reference="app:gemini Performance on Understanding Tasks"}). This is expected because our understanding tasks are tailored to the TS-images in . We therefore conduct a controlled comparison between and Bagel-7B across all six tasks to test whether post-training enables the base model to understand our TS-images. **[Results.]{.underline}** Figure [2](#fig:understanding){reference-type="ref" reference="fig:understanding"} shows that, while the base model attains zero accuracy on three tasks, consistently improves answer accuracy on both layout-level tasks, which evaluate localization of variables and periods, and signal-level tasks, which require value comparison and temporal pattern interpretation. In particular, accuracy on QA1 through QA4 approaches 1.0. These results indicate that post-training substantially strengthens temporal understanding of our TS-images, providing a solid foundation for the subsequent understanding-guided generation.

![Performance on TS-image understanding tasks.](figs/fig5.png){#fig:understanding width="1\\linewidth"}

width=,center `\renewcommand{\arraystretch}{1}`{=latex}

::: {#tab:forecasting_result}
  ------------------------------ ---------------- -------------- ---------------
                                                                 
                                  **Short-term**   **Med-term**   **Long-term**
  **LLMs**                                                       
  Gemini-2.5-flash                    1.295           1.201           1.279
  Qwen2.5-Instruct-7B                 1.445             \-             \-
  **Time Series-based Models**                                   
  ChatTime                            0.983           1.439           4.164
  Time-R1                             1.162             \-             \-
  TimeOmni-1                          1.298             \-             \-
  **Image-based Models**                                         
  VisionTS++                                                     
  VisionTS                            1.263                           0.794
  Bagel                               16.303          17.840         16.530
                                                      0.816      
  ------------------------------ ---------------- -------------- ---------------

  : Forecasting performance (nMASE) across different prediction lengths. : the best, : the 2nd best. "--" denotes SR below 10%; not statistically significant.
:::

[\[tab:forecasting\_result\]]{#tab:forecasting_result label="tab:forecasting_result"}

**Time Series Forecasting**\
**[Setup.]{.underline}** Evaluating the full GIFT-Eval involves over $140k$ sequences, which is impractical for assessing LLMs and UMMs. We adopt a representative subset of 685 instances (419 short-, 137 medium-, and 129 long-term), which is substantially larger than prior TSLM testbeds [@TimeMQA]. **[Results.]{.underline}** Table [1](#tab:forecasting_result){reference-type="ref" reference="tab:forecasting_result"} reports the forecasting results. Among text-output models, Gemini-2.5-Flash [@Gemini2.5] is the only one maintaining reasonable performance on long-horizon prediction. Other models (Qwen2.5-7B [@Qwen2.5], Time-R1 [@timer1], and TimeOmni-1) fail to reliably forecast at horizons of 480 to 900 steps. This highlights a common bottleneck: deficient counting abilities prevent these models from generating the required sequence length, which precludes quantitative evaluation due to length mismatch. ChatTime [@ChatTime] is an exception; by mapping each numeric value to a single token, it preserves numerical continuity and improves counting reliability. Even so, these text-based models typically yield nMASE above 1, indicating worse performance than the [Naive]{.smallcaps} baseline. In contrast, and the VisionTS series achieve top-tier accuracy. Our base model Bagel-7B fails to forecast without specialized tuning (see Table [\[tab:Bagel\_forecasting\]](#tab:Bagel_forecasting){reference-type="ref" reference="tab:Bagel_forecasting"} of Appendix [11](#app:case_study){reference-type="ref" reference="app:case_study"} for a failure case). The results show that with dedicated post-training, time series forecasting can be effectively internalized as a capability of UMMs.

width=,center `\renewcommand{\arraystretch}{1}`{=latex}

::: {#tab:imputation_result}
  ------------------------------ ----------------- ----------------- ----------------- ------------------
                                                                                       
                                  **\[0.1, 0.2)**   **\[0.2, 0.3)**   **\[0.3, 0.4)**   **\[0.4, 0.5\]**
  **LLMs**                                                                             
  Gemini-2.5-flash                                       2.028             2.434             1.160
  Qwen2.5-Instruct-7B                  4.878             1.854              \-                 \-
  **Statistics Baselines**                                                             
  Nearest                              0.975             0.958             1.003       
  Linear                               0.943                                                 0.968
  **Time Series-based Models**                                                         
  Moment-large                         1.220             1.400             1.630             2.100
  Moment-base                          1.510             1.600             1.700             2.130
  **Image-based Models**                                                               
  Bagel                               17.411            12.239            11.849             11.032
                                                                                       
  ------------------------------ ----------------- ----------------- ----------------- ------------------

  : Imputation Performance (nMASE) under different masking ratios. : the best, : the 2nd best. "--" denotes SR below 10%; not statistically significant.
:::

[\[tab:imputation\_result\]]{#tab:imputation_result label="tab:imputation_result"}

**Time Series Imputation**\
**[Setup.]{.underline}** To ensure zero-shot evaluation, we also use GIFT-Eval and construct a subset of 855 test instances with varying missing ratios: 87 samples with 10%--20% missing, 163 with 20%--30%, 306 with 30%--40%, and 279 with 40%--50%. **[Results.]{.underline}** Table [2](#tab:imputation_result){reference-type="ref" reference="tab:imputation_result"} reports the imputation results. achieves state-of-the-art performance, likely because imputation can leverage both past and future contexts to guide reconstruction, unlike pure forecasting. The untuned Bagel backbone still fails to follow time series-specific task instructions, with representative failure cases provided in Table [\[tab:Bagel\_imputation\]](#tab:Bagel_imputation){reference-type="ref" reference="tab:Bagel_imputation"} of Appendix [11](#app:case_study){reference-type="ref" reference="app:case_study"}. Interestingly, simple statistical baselines outperform both the time series-finetuned Moment models [@moment] and text-only LLM baselines in the imputation task.

**Time Series Reasoning**\
**[Setup.]{.underline}** To examine whether time series domain knowledge can be effectively injected into UMMs, we follow the out-of-distribution evaluation protocol of TimeOmni-1 [@TimeOmni-1] on text-only reasoning tasks. **[Results.]{.underline}** Table [8](#tab:reasoning_result){reference-type="ref" reference="tab:reasoning_result"} in Appendix [10.3](#app:Performance on Reasoning Tasks){reference-type="ref" reference="app:Performance on Reasoning Tasks"} reports the reasoning results. Although we do not use reinforcement learning to explicitly enhance the model's reasoning ability, achieves top-2 performance on Task 1, Task 2, and Task 4. These results indicate that our post-training successfully incorporates essential time series domain knowledge into UMMs.

![Ablation on TS2I strategies. Comparison between our TS2I and the heatmap representation for forecasting (left) and imputation (right). Red arrows indicate the performance gap.](figs/fig7.png){#fig:ours V.s. heatmap width="1\\linewidth"}

![Visual comparison of TS-image construction. Original time series (left). Our TS2I strategy (middle), which aligns periodic cycles explicitly. Standard heatmap representation (right).](figs/fig6.png){#fig:illustrition of heatmap width="1\\linewidth"}

More Analysis
-------------

**Ablation on TS2I Strategies**\
**[Setup.]{.underline}** We compare our TS2I strategy in Bi-TSI with the widely adopted "time series to heatmap" representation [@Vision_Models_ts_Survey] (Figure [4](#fig:illustrition of heatmap){reference-type="ref" reference="fig:illustrition of heatmap"}). Except for the imaging procedure, all experimental settings are kept identical. We report performance on the generation tasks. **[Results.]{.underline}** Figure [3](#fig:ours V.s. heatmap){reference-type="ref" reference="fig:ours V.s. heatmap"} summarizes the ablation results. Replacing our TS2I with the heatmap representation consistently degrades performance across all tasks; in fact, the heatmap variant yields nMASE worse than the [Naive]{.smallcaps} baseline on nearly all tasks. This highlights that generation performance is highly sensitive to the choice of TS-image construction strategy. We attribute the degradation to two main factors. (1) **Information loss under limited image resolution.** When the total length (context + prediction) exceeds the TS-image width (896), the heatmap must downsample along the temporal axis, which discards fine-grained information. (2) **Higher modeling difficulty.** Heatmaps require the model to implicitly align periodic patterns across the 2D layout, whereas our TS2I rearranges the series by cycles, making the periodic alignment explicit. We also include a discussion on why we do not use line plots in Appendix [10.4](#app: discussion of line plot){reference-type="ref" reference="app: discussion of line plot"}.

**Ablation on Understanding Model**\
**[Setup.]{.underline}** To verify whether understanding can facilitate generation, we freeze the understanding model during training and disable CoT generation during inference. **[Results.]{.underline}** Figure [5](#fig:ablation_on_understanding){reference-type="ref" reference="fig:ablation_on_understanding"} summarizes the ablation results. Without CoT as context, generation performance drops consistently across all cases, yielding an average 8.2% increase in nMASE. This suggests that the shared self-attention in our backbone model enables effective interaction between the understanding model and the generation module, allowing the generation module to leverage the semantics provided by the understanding model and consequently produce more controllable time series generations.

![Ablation on the understanding model. Comparison between generation-only and understanding-guided generation for forecasting (left) and imputation (right).](figs/fig8.png){#fig:ablation_on_understanding width="1\\linewidth"}

**Case Studies**\
Detailed case studies across all tasks (six understanding, two generation) are provided in Appendix [11](#app:case_study){reference-type="ref" reference="app:case_study"}. Additionally, we present two representative failure cases of the base model in Table [\[tab:Bagel\_forecasting\]](#tab:Bagel_forecasting){reference-type="ref" reference="tab:Bagel_forecasting"} and Table [\[tab:Bagel\_imputation\]](#tab:Bagel_imputation){reference-type="ref" reference="tab:Bagel_imputation"}. These comparisons further demonstrate that our post-training internalizes time series understanding and generation as inherent capabilities of UMMs.

Conclusion
==========

We introduced , a vision-centric framework that unifies temporal understanding and generation. We first develop Bi-TSI, a fidelity-oriented mapping that ensures near-lossless time series-to-image conversion. Building on this, we introduce , a benchmark comprising comprehensive understanding tasks that advance the model from basic periodic localization to complex pattern analytics, alongside downstream generation tasks. Through an understanding-guided generation mechanism formulated as a CoT-conditioned process, links semantic understanding to high-fidelity generation. Experimental results demonstrate that performs strongly on both understanding and generation, providing a new perspective on vision-centric unified time series modeling.

Impact Statement {#impact-statement .unnumbered}
================

This paper presents work whose goal is to advance the field of machine learning and time series analytics. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Acknowledgment {#acknowledgment .unnumbered}
--------------

This work is partially supported by NVIDIA Academic Grant in Higher Education and Developer program.

Dataset Details
===============

Data Statistics {#app.Data Statistics}
---------------

This section reports the quantitative statistics of the proposed . As summarized in Table [3](#tab:stats){reference-type="ref" reference="tab:stats"}, is constructed for post-training to equip our model with unified time series understanding and generation capabilities. It comprises two generation tasks (forecasting and imputation), one TS-image understanding task suite, and one reasoning task set. For generation, we provide $40{,}000$ training instances for each of forecasting and imputation, together with testbeds of $685$ and $855$ instances, respectively. For understanding, we include $9{,}409$ training QA pairs tailored to our TS-image representation and a $685$-instance test set for evaluation. For reasoning, we incorporate the TSR-Suite [@TimeOmni-1] split, with $2{,}339$ training and $2{,}448$ test samples, which serves as high-quality instruction tuning data to improve generalizable temporal reasoning.

::: {#tab:stats}
                      **Forecasting**   **Imputation**   **Understanding**   **Reasoning**     
  ------------------ ----------------- ---------------- ------------------- --------------- -- --
  **Training Set**        40,000            40,000             9,409             2,339         
  **Testbed**               685              855                685              2,448         

  : Detailed quantitative statistics for the four time series tasks in across training sets and testbeds.
:::

[\[tab:stats\]]{#tab:stats label="tab:stats"}

Statistics on Sequence Length and Token Budget
----------------------------------------------

In this section, we report the actual sequence lengths used in in Table [4](#tab:seq-len){reference-type="ref" reference="tab:seq-len"} and the corresponding token budgets computed with the tokenizer of our base model Bagel in Table [5](#tab:token-budget){reference-type="ref" reference="tab:token-budget"}.

As shown in Table [4](#tab:seq-len){reference-type="ref" reference="tab:seq-len"}, covers a wide range of temporal scales. Forecasting, imputation, and understanding involve long-range dependencies, with a maximum length of $2{,}592$ and an average of about $950$ time points.

```{=latex}
\renewcommand{\arraystretch}{1.15}
```
::: {#tab:seq-len}
                    **Forecasting**   **Imputation**   **Understanding**   **Reasoning**
  ---------------- ----------------- ---------------- ------------------- ---------------
  **MAX length**         2,592            2,592              2,592              792
  **MIN length**           8                8                  8                10
  **AVG length**          957              957                947               109

  : Maximum, minimum, and average time series lengths across four tasks.
:::

[\[tab:seq-len\]]{#tab:seq-len label="tab:seq-len"}

Table [5](#tab:token-budget){reference-type="ref" reference="tab:token-budget"} reports the average token usage for the time series input ($\mathbf{X}$) and textual context ($C$) across tasks. For forecasting and imputation, inputs are predominantly visual, with an average of $7{,}236$ image tokens and about $130$ context tokens. For understanding, the textual component increases to $479$ context tokens on average, alongside $4{,}096$ visual tokens. Finally, the reasoning task uses $860$ time series tokens and $246$ context tokens on average.

```{=latex}
\renewcommand{\arraystretch}{1.15}
```
::: {#tab:token-budget}
                                                **Forecasting**   **Imputation**   **Understanding**   **Reasoning**  
  -------------------------------------------- ----------------- ---------------- ------------------- --------------- --
  **AVG tokens of time series $\mathbf{X}$**         7,236            7,236              4,096              860       
  **AVG tokens of context $C$**                       116              130                479               246       
  **AVG total tokens**                               7,352            7,366              4,575             1,106      

  : Average token budgets computed using the tokenizer of our base model Bagel [@Bagel].
:::

[\[tab:token-budget\]]{#tab:token-budget label="tab:token-budget"}

Prompt Used in this Paper {#app.prompt}
=========================

Prompt to Gemini for Generating Time Series Pattern Analyses.
-------------------------------------------------------------

You will be given a single-cycle time series for variable $\texttt{\{var\_idx\}}$ (cycle $\texttt{\{cycle\_idx\}}$, $\texttt{\{color\}}$ channel).\
Time series (length=$\texttt{\{T\}}$):\
$\texttt{[21.6, 21.7, 31.7,...]}$\
Helpful stats (do not ignore the raw series):\
- min=$\texttt{\{arr.min():.3f\}}$ at $t=\texttt{\{valley\_i\}}$, max=$\texttt{\{arr.max():.3f\}}$ at $t=\texttt{\{peak\_i\}}$\
- start=$\texttt{\{arr[0]:.3f\}}$, end=$\texttt{\{arr[-1]:.3f\}}$\
Write a concise 2--3 sentence description of the trend/shape using only what is evident from the series.\
Guidelines:\
- Describe the overall trend direction (increasing/decreasing/flat). If the behavior changes over time, you may describe it in 2--3 phases (e.g., early/mid/late), but do not force segmentation if the series is stable.\
- Mention whether the series fluctuates and whether fluctuations are small/moderate/large (relative to its range).\
- Mention any notable peak, valley, or plateau, and roughly when it occurs (early/mid/late or with an index $t=\ldots$).\
- If there is no clear peak/valley/plateau, explicitly say so.\
Return a single paragraph. Do NOT invent events not supported by the numbers.

System Prompt for Training and Evaluation {#app:SystemPrompt}
-----------------------------------------

This section presents the system prompts used for training and evaluation (Section [4](#sec:exp){reference-type="ref" reference="sec:exp"}). We categorize them into two types: understanding task system prompts and generation task system prompts.

You should first think about the reasoning process in the mind and then provide the user with the answer.\
The reasoning process is enclosed within `<think> </think>` tags, i.e. `<think>` reasoning process here `</think>` answer here

You should first think about the planning process in the mind and then generate the image.\
The planning process is enclosed within `<think> </think>` tags, i.e. `<think>` planning process here `</think>` image here

Details of the TS2I and I2TS Process {#app:ts2i_i2ts}
====================================

In this section, we provide a detailed description of the bidirectional mappings between time series and images utilized throughout . Our goal is a fidelity-preserving Time Series $\Leftrightarrow$ Image transformation that is as close to lossless as possible. This requirement is crucial because the TS-image is fed into the UMM backbone as the model input. If the TS2I conversion discards numerical information, the backbone cannot recover it, and the entire vision-centric pipeline would fail to produce high-fidelity time series outputs. Likewise, the image generated by the backbone must be decoded back to a numerical sequence without losing the information contained in the output image. Therefore, we design TS2I and I2TS as a deterministic round-trip mapping and treat it as near-lossless in practice, with residual errors primarily arising from spatial interpolation and finite numerical precision.

Time Series to Image (TS2I) Converter {#app:ts2i}
-------------------------------------

#### Periodicity-based segmentation.

Given a multivariate time series $\mathbf{X}\in\mathbb{R}^{T\times N}$ with periodicity $f\in\mathbb{Z}^+$, we adopt a periodicity-consistent setting in our experiments, where both the context length and the prediction horizon are integer multiples of $f$. If the available length is not an exact multiple of $f$, we truncate it to the nearest valid length. Consequently, $T$ is divisible by $f$ and the series can be decomposed into $C=T/f$ periodic blocks without padding. Prior to periodic segmentation, $\mathbf{X}$ is normalized using robust fidelity normalization (RFN) in Section [3.1.0.2](#RFN){reference-type="ref" reference="RFN"} to ensure numerically stable and geometry-consistent rendering.

#### Rearrangement into a periodic grid.

For each variable $n$, let $\tilde{\mathbf{x}}^{(n)}\in\mathbb{R}^{T}$ denote the normalized sequence after applying RFN, where $\tilde{(\cdot)}$ indicates values in the normalized space used for image rendering. We fold the normalized sequence $\tilde{\mathbf{x}}^{(n)}$ into an $f\times C$ matrix $\mathbf{S}^{(n)}\in\mathbb{R}^{f\times C}$, where $C=T/f$: $$\mathbf{S}^{(n)}_{i,j}=\tilde{\mathbf{x}}^{(n)}_{jf+i},\qquad i=0,\ldots,f-1,\; j=0,\ldots,C-1.$$ Here, the row index $i$ corresponds to the intra-period position, while the column index $j$ indexes successive periods. This construction maps intra-period structure to vertical locality and inter-period evolution to horizontal progression.
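In NumPy, the folding $\mathbf{S}^{(n)}_{i,j}=\tilde{\mathbf{x}}^{(n)}_{jf+i}$ is a single reshape-and-transpose; a minimal sketch:

```python
import numpy as np

def fold_to_periodic_grid(x_norm, f):
    """Fold a normalized length-T series into an f x C grid (C = T / f).
    Row i is the intra-period position, column j indexes successive periods:
    S[i, j] = x[j * f + i], i.e. reshape to (C, f) then transpose."""
    T = x_norm.shape[0]
    assert T % f == 0, "length must be an integer multiple of the periodicity"
    C = T // f
    return x_norm.reshape(C, f).T
```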

#### Rendering.

Given $\mathbf{S}^{(n)}\in\mathbb{R}^{f\times C}$, the rendering step upsamples the periodic grid into the image coordinate space. Specifically, we allocate each variable a vertical band of height $h=\lfloor H/N \rfloor$ and resize $\mathbf{S}^{(n)}$ to $h\times W_{\mathrm{in}}$, where $W_{\mathrm{in}}$ denotes the width of the unmasked region and the remaining width $W_{\mathrm{out}}=W-W_{\mathrm{in}}$ is masked. For forecasting, the mask occupies the right side so the model completes future periods from left to right; for imputation, masked regions can be placed at arbitrary locations within the TS-image.
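The upsampling from the $f\times C$ grid to the $h\times W_{\mathrm{in}}$ band can be sketched with nearest-neighbour replication; this assumes integer scale factors for simplicity, whereas the actual pipeline may use a standard image resize:

```python
import numpy as np

def render_band(S, h, W_in):
    """Nearest-neighbour upsample of an f x C periodic grid to an
    h x W_in band (sketch; assumes h and W_in are integer multiples
    of the grid dimensions)."""
    f, C = S.shape
    assert h % f == 0 and W_in % C == 0
    return np.repeat(np.repeat(S, h // f, axis=0), W_in // C, axis=1)
```

Nearest-neighbour replication keeps every grid cell's value intact, which is why the capacity constraints of Section 3.1 (at least one pixel per timestep) make the rendering step lossless.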

#### Supporting multivariate inputs via band stacking and color assignment.

For the multivariate time series input $\mathbf{X}$, TS2I renders each variable into one band and stacks the $N$ bands along the vertical axis to construct the complete TS-image, whose overall resolution is $H\times W$ with the visible context occupying the left width $W_{\mathrm{in}}$. To distinguish different variables within a single image, we follow the setting of VisionTS++ [@VisionTS++] and assign each band an RGB color, while enforcing that adjacent bands do not share the same color. This simple color assignment preserves the band geometry and helps the backbone model separate variable-specific patterns in the visual space.

Image to Time Series (I2TS) Converter {#app:i2ts}
-------------------------------------

#### Recovering the completed region and inverse rearrangement.

Given the output TS-image $\hat{{I}}\in\mathbb{R}^{H\times W}$, I2TS decodes numerical values from the completed region. We first recover each variable band according to its vertical location using the same band height $h=\lfloor H/N \rfloor$ as in TS2I. For variable $n$, we crop its band from $\hat{I}$ and resize the decoded region back to the periodic grid resolution $f\times C$, yielding $\hat{\mathbf{S}}^{(n)}\in\mathbb{R}^{f\times C}$. Finally, we invert the TS2I folding step to obtain the normalized sequence $\hat{\mathbf{x}}^{(n)}\in\mathbb{R}^{T}$: $$\hat{\mathbf{x}}^{(n)}_{j f + i} = \hat{\mathbf{S}}^{(n)}_{i,j},\qquad i=0,\ldots,f-1,\; j=0,\ldots,C-1.$$ Concatenating all variables gives the normalized multivariate sequence $\hat{\mathbf{U}}\in\mathbb{R}^{T\times N}$ for the decoded region.
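The inverse rearrangement $\hat{\mathbf{x}}^{(n)}_{jf+i}=\hat{\mathbf{S}}^{(n)}_{i,j}$ is again a one-line transpose-and-reshape; a minimal sketch that round-trips with the folding step:

```python
import numpy as np

def unfold_periodic_grid(S_hat):
    """Invert the TS2I folding: x_hat[j * f + i] = S_hat[i, j],
    i.e. transpose back to (C, f) and flatten row-major."""
    return S_hat.T.reshape(-1)
```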

#### Inverse normalization and value restoration.

I2TS applies the exact inverse of the RFN mapping defined in Equation [\[eq:RFN\_tanh\]](#eq:RFN_tanh){reference-type="ref" reference="eq:RFN_tanh"}. Let $\hat{\mathbf{U}}$ denote the decoded values in the normalized space. We first apply inverse hyperbolic tangent: $$\hat{\mathbf{Z}} = \kappa\,\mathrm{arctanh}\!\left(\hat{\mathbf{U}}\right),$$ where values in $\hat{\mathbf{U}}$ are implicitly clamped within the valid domain $(-1, 1)$ for numerical stability. Finally, we restore the original numerical scale using the per-variable statistics $(\boldsymbol{\mu}, \boldsymbol{\sigma})$ recorded during the encoding stage: $$\hat{\mathbf{X}} = \hat{\mathbf{Z}} \odot \boldsymbol{\sigma} + \boldsymbol{\mu}.$$
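The two restoration equations can be sketched as follows; `kappa` must match the value used during encoding (3.0 below is an illustrative default, as is the clamping margin `eps`):

```python
import numpy as np

def rfn_denormalize(U_hat, mu, sigma, kappa=3.0, eps=1e-6):
    """Exact inverse of the bounded tanh mapping in RFN.
    Decoded values are clamped into (-1 + eps, 1 - eps) before arctanh
    for numerical stability, then rescaled with the per-variable
    statistics (mu, sigma) recorded during encoding."""
    U_hat = np.clip(U_hat, -1.0 + eps, 1.0 - eps)
    Z_hat = kappa * np.arctanh(U_hat)   # Z_hat = kappa * arctanh(U_hat)
    return Z_hat * sigma + mu           # X_hat = Z_hat * sigma + mu
```

Note that $\kappa\,\mathrm{arctanh}(\tanh((x-\mu)/(\kappa\sigma)))=(x-\mu)/\sigma$, so the round trip recovers $x$ exactly up to clamping and floating-point precision.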

In summary, TS2I and I2TS form a deterministic round-trip mapping that is near-lossless in practice. Any residual reconstruction error mainly comes from spatial interpolation introduced in rendering and resizing, rather than from stochasticity in the transformation itself.

Comparison of Different Normalization Strategies {#app:norm_comparison}
================================================

In this section, we provide an intuitive explanation of why existing normalization methods (for example, standard deviation (Std)-based [@VisionTS] and median absolute deviation (MAD)-based normalization [@Chronos2]) fall short and how our robust fidelity normalization (RFN) addresses these issues. We focus on two extreme yet common regimes: signals with extreme outliers and signals with step-like patterns.

Case I: Extreme Outliers {#sub:case_outlier}
------------------------

**Scenario.** Assume a clean informative signal (e.g., a sine wave) contaminated by a single massive outlier of amplitude $\Delta$, which appears as one abrupt spike. The standard deviation $\sigma$ is highly sensitive to extreme values: a single massive spike inflates it in proportion to the outlier size ($\sigma \approx \Delta/\sqrt{T}$). Consequently, for the normal part of the signal $x_t$, the normalized value $\hat{x}_t$ collapses: $$\hat{x}_t \approx \frac{x_t}{\Delta / \sqrt{T}} \xrightarrow{\Delta \to \infty} 0.$$

When applied to TS2I conversion, the informative signal is compressed toward zero. As a result, the outlier is mapped to a single bright pixel in the TS-image, while the underlying temporal patterns collapse into a nearly uniform dark background. The vision backbone consequently focuses almost exclusively on the outlier.
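This collapse is easy to verify numerically. The following toy sketch (the sine period, outlier position, and magnitudes are arbitrary choices of ours) shows that the standard deviation is dominated by the outlier while the MAD stays at the sine's own scale:

```python
import numpy as np

T, delta = 1024, 1e4
x = np.sin(2 * np.pi * np.arange(T) / 64)  # clean informative signal
x[100] += delta                            # one massive outlier

std = x.std()
mad = np.median(np.abs(x - np.median(x)))
# std is dominated by the outlier (roughly delta / sqrt(T) here), so the
# sine collapses toward zero after Std-based normalization, while the
# MAD remains at the sine's own scale.
```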

**RFN Solution.** RFN uses the MAD, which ignores the single outlier, keeping the denominator stable. The outlier is smoothly saturated by the bounded $\tanh$ function, preserving the visibility of the main signal.

Case II: Signals with Step-like Patterns {#sub:case_step}
----------------------------------------

**Scenario.** Consider a "step function" signal $\mathbf{x}$ that stays constant for a long period. In these flat regions, the value at time $t$ can be written as $x_t = c + \eta_t$, where $c$ is a constant and $\eta_t$ denotes microscopic noise. For a signal that is constant for more than half of its length, the MAD is exactly zero. Division by this zero scale amplifies the microscopic noise to massive magnitudes: $$\hat{x}_t \approx \frac{\eta_t}{0} \to \infty.$$ When applied to TS2I conversion, the normalization artificially amplifies negligible sensor noise into high-amplitude, pixel-level fluctuations. As a result, the TS-image becomes dominated by high-contrast artifacts, falsely suggesting violent temporal variability in the input signal.

**RFN Solution.** RFN prevents this collapse by incorporating the standard deviation as a regularizing term. Even if the MAD is zero, the standard deviation of a step function remains non-zero, providing a "safety floor": $$\sigma_{\mathrm{RFN}} = \alpha \cdot \underbrace{\text{MAD}(\mathbf{x})}_{\approx 0} + (1-\alpha)\underbrace{\text{Std}(\mathbf{x})}_{> 0},$$ where $\sigma_{\mathrm{RFN}}$ denotes the robust scaling factor used by RFN. This ensures that the resulting image correctly depicts flat regions with clear transitions.
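A minimal sketch of the blended scaling factor on a step signal; the blend weight $\alpha$ and the function name are illustrative, not the paper's actual configuration:

```python
import numpy as np

def rfn_scale(x, alpha=0.5):
    """Blended robust scale: sigma_RFN = alpha * MAD + (1 - alpha) * Std.
    The blend weight alpha is an illustrative default."""
    mad = np.median(np.abs(x - np.median(x)))
    return alpha * mad + (1 - alpha) * x.std()

# A step signal that is flat for more than half its length: the MAD is
# exactly zero, but the Std term keeps the denominator away from zero.
step = np.concatenate([np.zeros(60), np.ones(40)])
```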

Table [6](#tab:norm_comparison){reference-type="ref" reference="tab:norm_comparison"} summarizes the behavior of each normalization strategy across representative regimes. RFN is the only method that consistently performs ideal TS2I conversion, remaining effective in both outlier-dominated signals and step-like signals with extended flat regions.

::: {#tab:norm_comparison}
  **Regime**         **Std-based**       **MAD-based**      **RFN (Ours)**
  ----------------- ---------------- --------------------- ----------------
  Gaussian signal        Ideal               Ideal              Ideal
  Heavy outliers     Signal washout          Ideal              Ideal
  Step / flat            Ideal        Noise amplification       Ideal

  : Qualitative behavior of different normalization methods under representative challenging regimes. [Ideal]{.underline} indicates faithful visual preservation of the underlying signal structure.
:::

Additional Experimental Results
===============================

The Scoring Criteria for Understanding Tasks {#app:understanding_metrics}
--------------------------------------------

To ensure a rigorous evaluation of the model's ability to interpret TS-images, we design specific scoring metrics for each understanding task. All scores are normalized to the range $[0, 1]$. The detailed criteria are defined as follows:

-   **Understanding QA1: Variable Counting.** We utilize exact match (EM). The score is $1$ if the predicted integer representing the number of variables exactly matches the groundtruth, and $0$ otherwise.

-   **Understanding QA2: Variable Y-Range.** We evaluate the model's ability to localize variables vertically using the intersection over union (IoU) metric. For each variable, its vertical span is represented as a rectangular region covering the full width of the segment. Let $B_{pred}$ and $B_{gt}$ denote the predicted and groundtruth bounding boxes, respectively. The score is calculated as: $$\text{Score} = \text{IoU}(B_{pred}, B_{gt}) = \frac{\text{Area}(B_{pred} \cap B_{gt})}{\text{Area}(B_{pred} \cup B_{gt})}.$$

-   **Understanding QA3: Cycle Bounding Box.** Similarly, we utilize bounding box IoU. The model outputs the specific coordinates $[(x_1, y_1), (x_2, y_2)]$ for a cycle. The score is the IoU between the predicted bounding box $B_{pred}$ and the groundtruth box $B_{gt}$, calculated using the same formula as QA2.

-   **Understanding QA4: Mean Comparison.** We utilize EM. The task requires identifying which of two specific cycles has a higher mean value. The score is $1$ if the predicted cycle index exactly matches the groundtruth index (e.g., correctly selecting "Cycle 7" over "Cycle 9"), and $0$ otherwise.

-   **Understanding QA5: Anomaly Detection.** We utilize weighted accuracy. We parse the output to extract three key count statistics: the total count of anomalous cycles, the count of bright anomalies, and the count of dark anomalies. The final score is the average of the match results for these three components (each contributing $1/3$). For example, if the groundtruth is "2 anomalous cycles (1 bright, 1 dark)" and the model correctly predicts all three counts, the score is $1$; if it correctly predicts the total and bright counts but misses the dark count, the score is $2/3$.

-   **Understanding QA6: Trend Analysis.** We utilize a composite score consisting of three equally weighted sub-components ($1/3$ each):

    1.  **Color Consistency:** We use EM. The score is $1$ if the predicted color channel (e.g., "Blue") exactly matches the groundtruth, and $0$ otherwise.

    2.  **Localization Accuracy:** We use bounding box IoU between the predicted bounding box and the groundtruth box (between 0 and 1).

    3.  **Trend Description Quality:** We use BERTScore [@BERTScore] to measure the semantic similarity between the generated textual description and the groundtruth analysis.

    The final score is the arithmetic mean of these three sub-scores: $\text{Score} = \frac{1}{3} (\text{EM}_{\text{color}} + \text{IoU}_{\text{bbox}} + \text{BERTScore}_{\text{text}})$.
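The bounding-box IoU used by QA2, QA3, and the localization sub-score of QA6 can be sketched as follows; this is a generic implementation, and the benchmark's exact parsing and evaluation code may differ:

```python
def bbox_iou(b1, b2):
    """IoU between axis-aligned boxes given as (x1, y1, x2, y2).
    A generic sketch of the localization score, not the benchmark's
    exact evaluation code."""
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(b1) + area(b2) - inter
    return inter / union if union > 0 else 0.0
```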

Results of Understanding Tasks {#app:understanding_results}
------------------------------

```{=latex}
\renewcommand{\arraystretch}{1}
```
::: {#tab:understanding_result}
  ---------------------- --------- --------- --------- --------- --------- ---------
  **Method**              **QA1**   **QA2**   **QA3**   **QA4**   **QA5**   **QA6**
  **Proprietary VLMs**                                                     
  Gemini2.5-flash          0.540     0.640     0.004     0.535       0       0.342
  Gemini2.0-flash          0.230     0.290     0.261     0.279       0       0.220
  **Base Model**                                                           
  Bagel                      0       0.502     0.012     0.182       0       0.254
  (Ours)                     1         1       0.931       1       0.667     0.841
  ---------------------- --------- --------- --------- --------- --------- ---------

  : Performance on Understanding Tasks. The table reports scores for layout-level tasks (QA1--3) and signal-level tasks (QA4--6).
:::

[\[tab:understanding\_result\]]{#tab:understanding_result label="tab:understanding_result"}

Results of Reasoning Tasks {#app:reasoning_results}
--------------------------

```{=latex}
\renewcommand{\arraystretch}{1}
```
::: {#tab:reasoning_result}
  -------------------------- ----------- ----------- ----------------------- -----------
  **Method**                  **Task1**   **Task2**   **Task3$\downarrow$**   **Task4**
  **LLMs**                                                                   
  Gemini2.5-flash               77.5        25.9             170.78             36.6
  Qwen2.5-Instruct-7B           42.8        26.3                                24.9
  **TSLMs**                                                                  
  Time-MQA-8B [@TimeMQA]        25.1        31.2               \-               11.6
  ChatTS [@ChatTS]              39.2        18.6               \-               11.1
  ITFormer [@ITFormer]          47.5        14.6             230.04             41.7
  Time-R1 [@timer1]             34.0        31.4             160.47             32.2
  TimeOmni-1 [@TimeOmni-1]                                   163.79          
  -------------------------- ----------- ----------- ----------------------- -----------

  : Performance on Reasoning Tasks. Accuracy (ACC) is the default metric; Task 3 reports MAE (lower is better). Best and second-best results are highlighted. "--" denotes a success rate (SR) below 10%, i.e., not statistically significant.
:::

[\[tab:reasoning\_result\]]{#tab:reasoning_result label="tab:reasoning_result"}

Discussion on Line Plot Representations {#app:line_plot_discussion}
---------------------------------------

We exclude line plots due to four practical limitations. (1) Information sparsity. Most pixels correspond to background, while the signal is confined to thin strokes, which limits representational capacity. (2) Variable overlap. In multivariate settings, intersecting curves create ambiguity, making it difficult to uniquely identify and disentangle variables. (3) Misaligned attention. General-purpose vision-language models (VLMs) and UMMs tend to focus on textual labels and legends rather than the fine geometry of thin lines [@CaTSBench]. (4) Decoding complexity. Recovering precise values from rendered curves is an ill-posed inverse problem that is sensitive to stroke width, aliasing, and line overlap, leading to unstable decoding.

Case Study {#app:case_study}
==========

Comprehensive Task Demonstrations of 
------------------------------------

In this section, we provide detailed case studies across the six understanding tasks (Tables [\[tab:variable\_counting\_example\]](#tab:variable_counting_example){reference-type="ref" reference="tab:variable_counting_example"} to [\[tab:q6\_trend\_analysis\]](#tab:q6_trend_analysis){reference-type="ref" reference="tab:q6_trend_analysis"}) and two generation tasks (Tables [\[tab:task2\_forecasting\]](#tab:task2_forecasting){reference-type="ref" reference="tab:task2_forecasting"} and [\[tab:task3\_imputation\]](#tab:task3_imputation){reference-type="ref" reference="tab:task3_imputation"}) within the benchmark.

```{=latex}
\newcommand{\usericon}{\raisebox{-0.25ex}{\includegraphics[height=1.3em]{figs/user.png}}}
```
```{=latex}
\newcommand{\modelicon}{\raisebox{-0.25ex}{\includegraphics[height=1.3em]{figs/logo.png}}}
```
```{=latex}
\newcommand{\bagelmodel}{\raisebox{-0.25ex}{\includegraphics[height=1.3em]{figs/bagel.png}}}
```
Comparative Analysis and Failure Cases of the Base Model: Bagel {#app:base_case_bagel}
---------------------------------------------------------------

To further validate the necessity of our time series-specific post-training, we present representative failure cases from our base model, Bagel [@Bagel], on the same generation tasks. Specifically, Table [\[tab:Bagel\_forecasting\]](#tab:Bagel_forecasting){reference-type="ref" reference="tab:Bagel_forecasting"} illustrates a failure in the forecasting task, while Table [\[tab:Bagel\_imputation\]](#tab:Bagel_imputation){reference-type="ref" reference="tab:Bagel_imputation"} demonstrates an unsuccessful case for the imputation task.
