---
abstract: |
  Recent advances in multimodal time series learning mark a paradigm shift from analytics centered on basic patterns toward genuine time series understanding and reasoning. However, existing multimodal time series datasets mostly stop at surface-level alignment and question answering, without reaching the depth of genuine reasoning. The absence of well-defined tasks that truly require time series reasoning, together with the scarcity of high-quality data, has limited progress in building practical time series reasoning models (TSRMs). To this end, we introduce the **T**ime **S**eries **R**easoning **Suite** (), which formalizes four atomic tasks spanning three fundamental capabilities for reasoning with time series: (1) *perception*, acquired through scenario understanding and causality discovery; (2) *extrapolation*, realized via event-aware forecasting; and (3) *decision-making*, developed through deliberation over perception and extrapolation. It is the first comprehensive time series reasoning suite that supports not only thorough evaluation but also the data pipeline and training of TSRMs. It contains more than 23K samples, of which 2.3K are carefully curated through a human-guided hierarchical annotation process. Building on this foundation, we introduce , the first unified reasoning model designed to address diverse real-world problems that demand time series reasoning. The model is trained in multiple stages, integrating a mixture of task scenarios, novel reward functions, and tailored optimizations. Experiments show that it delivers strong out-of-distribution generalization across all tasks and achieves a high rate of valid responses. It significantly improves causality discovery accuracy (64.0% vs. 35.9% for GPT-4.1) and raises the valid response rate by over 6% compared to GPT-4.1 on the event-aware forecasting task. Code[^1] and checkpoints[^2] are publicly available.
author:
- |
  Tong Guan$^{1,2}$, Zijie Meng$^{2}$, Dianqi Li, Shiyu Wang, Chao-Han Huck Yang$^{3}$\
  **Qingsong Wen$^{4}$, Zuozhu Liu$^{2}$, Sabato Marco Siniscalchi$^{5,6}$, Ming Jin$^{1}$[^3], Shirui Pan$^{1}$**\
  \
  $^{1}$Griffith University $^{2}$Zhejiang University $^{3}$NVIDIA $^{4}$Squirrel Ai Learning\
  $^{5}$University of Palermo $^{6}$Norwegian University of Science and Technology
bibliography:
- reference.bib
title: '![image](figs/logo.png){height="1em"} : Incentivizing Complex Reasoning with Time Series in Large Language Models'
---
\def\emN{{N}}
```
```{=latex}
\def\emO{{O}}
```
```{=latex}
\def\emP{{P}}
```
```{=latex}
\def\emQ{{Q}}
```
```{=latex}
\def\emR{{R}}
```
```{=latex}
\def\emS{{S}}
```
```{=latex}
\def\emT{{T}}
```
```{=latex}
\def\emU{{U}}
```
```{=latex}
\def\emV{{V}}
```
```{=latex}
\def\emW{{W}}
```
```{=latex}
\def\emX{{X}}
```
```{=latex}
\def\emY{{Y}}
```
```{=latex}
\def\emZ{{Z}}
```
```{=latex}
\def\emSigma{{\Sigma}}
```
```{=latex}
\newcommand{\etens}[1]{\mathsfit{#1}}
```
```{=latex}
\def\etLambda{{\etens{\Lambda}}}
```
```{=latex}
\def\etA{{\etens{A}}}
```
```{=latex}
\def\etB{{\etens{B}}}
```
```{=latex}
\def\etC{{\etens{C}}}
```
```{=latex}
\def\etD{{\etens{D}}}
```
```{=latex}
\def\etE{{\etens{E}}}
```
```{=latex}
\def\etF{{\etens{F}}}
```
```{=latex}
\def\etG{{\etens{G}}}
```
```{=latex}
\def\etH{{\etens{H}}}
```
```{=latex}
\def\etI{{\etens{I}}}
```
```{=latex}
\def\etJ{{\etens{J}}}
```
```{=latex}
\def\etK{{\etens{K}}}
```
```{=latex}
\def\etL{{\etens{L}}}
```
```{=latex}
\def\etM{{\etens{M}}}
```
```{=latex}
\def\etN{{\etens{N}}}
```
```{=latex}
\def\etO{{\etens{O}}}
```
```{=latex}
\def\etP{{\etens{P}}}
```
```{=latex}
\def\etQ{{\etens{Q}}}
```
```{=latex}
\def\etR{{\etens{R}}}
```
```{=latex}
\def\etS{{\etens{S}}}
```
```{=latex}
\def\etT{{\etens{T}}}
```
```{=latex}
\def\etU{{\etens{U}}}
```
```{=latex}
\def\etV{{\etens{V}}}
```
```{=latex}
\def\etW{{\etens{W}}}
```
```{=latex}
\def\etX{{\etens{X}}}
```
```{=latex}
\def\etY{{\etens{Y}}}
```
```{=latex}
\def\etZ{{\etens{Z}}}
```
```{=latex}
\newcommand{\pdata}{p_{\rm{data}}}
```
```{=latex}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
```
```{=latex}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
```
```{=latex}
\newcommand{\pmodel}{p_{\rm{model}}}
```
```{=latex}
\newcommand{\Pmodel}{P_{\rm{model}}}
```
```{=latex}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
```
```{=latex}
\newcommand{\pencode}{p_{\rm{encoder}}}
```
```{=latex}
\newcommand{\pdecode}{p_{\rm{decoder}}}
```
```{=latex}
\newcommand{\precons}{p_{\rm{reconstruct}}}
```
```{=latex}
\newcommand{\laplace}{\mathrm{Laplace}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\newcommand{\Ls}{\mathcal{L}}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\emp}{\tilde{p}}
```
```{=latex}
\newcommand{\lr}{\alpha}
```
```{=latex}
\newcommand{\reg}{\lambda}
```
```{=latex}
\newcommand{\rect}{\mathrm{rectifier}}
```
```{=latex}
\newcommand{\softmax}{\mathrm{softmax}}
```
```{=latex}
\newcommand{\sigmoid}{\sigma}
```
```{=latex}
\newcommand{\softplus}{\zeta}
```
```{=latex}
\newcommand{\KL}{D_{\mathrm{KL}}}
```
```{=latex}
\newcommand{\Var}{\mathrm{Var}}
```
```{=latex}
\newcommand{\standarderror}{\mathrm{SE}}
```
```{=latex}
\newcommand{\Cov}{\mathrm{Cov}}
```
```{=latex}
\newcommand{\normlzero}{L^0}
```
```{=latex}
\newcommand{\normlone}{L^1}
```
```{=latex}
\newcommand{\normltwo}{L^2}
```
```{=latex}
\newcommand{\normlp}{L^p}
```
```{=latex}
\newcommand{\normmax}{L^\infty}
```
```{=latex}
\newcommand{\parents}{Pa}
```
```{=latex}
\DeclareMathOperator*{\argmax}{arg\,max}
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\DeclareMathOperator{\sign}{sign}
```
```{=latex}
\DeclareMathOperator{\Tr}{Tr}
```
```{=latex}
\let\ab\allowbreak
```
```{=latex}
\renewcommand\thesubfigure{(\alph{subfigure})}
```
```{=latex}
\newcommand{\usericon}{\raisebox{-0.25ex}{\includegraphics[height=1.3em]{figs/user.png}}}
```
```{=latex}
\newcommand{\modelicon}{\raisebox{-0.25ex}{\includegraphics[height=1.3em]{figs/logo.png}}}
```
```{=latex}
\newcommand{\bestres}[1]{{\textbf{\textcolor{red}{#1}}}}
```
```{=latex}
\newcommand{\secondres}[1]{\textcolor{blue}{\uline{#1}}}
```
```{=latex}
\newcommand{\taskone}{Scenario Understanding\xspace}
```
```{=latex}
\newcommand{\tasktwo}{Causality Discovery\xspace}
```
```{=latex}
\newcommand{\taskthree}{Event-aware Forecasting\xspace}
```
```{=latex}
\newcommand{\taskfour}{Decision Making\xspace}
```
```{=latex}
\newcommand{\dataset}{\textsc{TSR-Suite}\xspace}
```
```{=latex}
\newcommand{\method}{\textsc{TimeOmni-1}\xspace}
```
Introduction
============

Time series data underpin a wide range of real-world systems, including energy, transportation, finance, and healthcare [@lu2024trnn; @liu2023sadiselfadaptivedecomposedinterpretable; @guan2023spatial; @lan2025gem]. Comprehending real-world time series extends beyond mere pattern recognition; it necessitates multi-step and multi-hop reasoning to identify external factors driving temporal changes and to support downstream tasks that inherently build upon upstream pattern understanding and extrapolation [@kongPositionEmpoweringTime2025b]. For instance, effectively scheduling energy demand requires integrating external knowledge such as extreme weather events, inferring causal mechanisms, anticipating event-driven variations, and ultimately supporting downstream decisions [@mackinlayEventStudiesEconomics1997; @liangFoundationModelsSpatioTemporal2025]. However, most existing time series approaches remain centered on basic pattern analytics and fall short of addressing such complex reasoning requirements, restricting their effectiveness in scenarios that demand a deeper understanding of context and robust decision-making support.

Large language models (LLMs) have recently demonstrated impressive multi-step reasoning abilities across text, code, and mathematics [@cot; @grpo]. This potential for time series reasoning, however, remains largely untapped. The primary obstacle is the scarcity of large-scale multimodal time series alignment, instruction, and labeled chain-of-thought data during pretraining, which hinders the development of corresponding time series reasoning abilities. Recent benchmarks further evidence this capability gap, even in leading LLMs (e.g., GPT-4.1) [@merrillLanguageModelsStill2024a; @appleTimeSeriesReasoning; @jinPositionWhatCan2024]. Furthermore, time series specific architectures such as Time-MoE [@Time-moe] and Moirai [@moirai] remain largely confined to forecasting tasks and lack the generalized reasoning capabilities required for broader applications. These gaps underscore the urgent need for dedicated time series reasoning models (TSRMs) that advance time series understanding, strengthen reasoning, and facilitate temporal analytics and knowledge generation, paving the way toward general-purpose time series intelligence.

However, two key limitations hinder the development of TSRMs: **(1) The scarcity of high-quality data to support general-purpose reasoning over time series.** Early efforts, such as constructing TSQA datasets [@TimeMQA], remain largely at the level of surface time series question answering and suffer from insufficient input context. Moreover, the formulation of time series reasoning tasks in existing multimodal datasets has not been systematically studied, leaving them unable to capture genuine reasoning depth with time series data. **(2) The lack of a validated and feasible pathway for effective time series reasoning across tasks.** It remains unclear which tasks genuinely demand reasoning capabilities over time series, as this question has not been systematically studied. This gap, combined with data scarcity, has confined existing research to narrow, task-specific settings. Many current approaches are trained independently for each task or even each dataset; for example, TimeMaster [@TimeMaster] employs six distinct models for six datasets. Such fragmentation hinders the transfer of reasoning capabilities across tasks and leaves the development of general-purpose time series reasoning an open challenge. These challenges naturally raise a pivotal question: How can we take a solid step toward fully *incentivizing reasoning capabilities in LLMs over time series*, so they can tackle complex real-world problems that inherently demand such reasoning?

Answering this question first drives us to tackle the challenge of data scarcity. Based on the limitations of existing time series QA datasets, we argue that time series reasoning tasks should adhere to two key principles. First, they should reward genuine reasoning rather than superficial pattern matching by systematically incorporating multi-step reasoning tasks and complete reasoning chains. Second, they should ensure context sufficiency to enable unambiguous answering or response generation, thereby strengthening the model's reasoning capacity and generalization across diverse scenarios. Guided by these principles, we formalize four atomic tasks that genuinely require reasoning with time series and introduce TSR-Suite, which covers three fundamental time series reasoning capabilities: (1) *perception*, acquired through scenario understanding and causality discovery, reveals key temporal patterns; (2) *extrapolation*, realized via event-aware forecasting, predicts future trends and anomalies; and (3) *decision-making*, developed through perception and extrapolation, supports informed, adaptive actions. Building on this foundation, we present TimeOmni-1, the first generalized reasoning model for time series. The central premise is that effective time series reasoning requires internalizing fundamental temporal priors. To this end, TimeOmni-1 first injects the above three capabilities identified by TSR-Suite into LLMs through supervised fine-tuning (SFT) as priors. We then design novel time series task-grounded rewards to cultivate genuine reasoning from these priors via policy optimization. Finally, to validate that these capabilities represent complementary facets of general time series reasoning, we unify all task capabilities within a single model with joint training.

Our contributions lie in three aspects:

1.  **New Datasets and Testbed.** We introduce TSR-Suite, the first comprehensive time series reasoning suite that formalizes four core tasks spanning three capabilities: perception, extrapolation, and decision-making. It contains more than 23K samples, of which 2.3K are carefully curated through a human-guided hierarchical annotation process. The suite serves not only as a testbed for thorough evaluation but also as a foundational data pipeline for training TSRMs.

2.  **New Models.** We present TimeOmni-1, the first generalized reasoning model on time series data. It unifies diverse reasoning tasks within a two-stage curriculum: Stage 1 employs supervised fine-tuning with human-guided reasoning traces to inject temporal priors across the three key capabilities of perception, extrapolation, and decision-making; Stage 2 leverages reinforcement learning with novel task-grounded rewards to go beyond mimicking priors toward robust reasoning.

3.  **Comprehensive Evaluation and Key Insights.** TimeOmni-1 achieves Top-2 performance under both in-distribution (ID) and out-of-distribution (OOD) testbeds. Notably, it surpasses GPT-4.1 by 40.6% (ID) and 28.1% (OOD) in causality discovery accuracy while maintaining high valid-response rates across all tasks. Further experiments provide the first evidence that joint training across diverse time series reasoning tasks yields mutual gains across capabilities.

Related Work
============

**Large Time Series Models.** Early efforts primarily aimed to endow time series models with zero-shot capability, mitigating domain-specific limitations when large-scale time series data were scarce. Time-LLM [@Time-LLM] sought to transfer the generalization ability of LLMs into the time series domain. Prior developments of time series models are discussed in Appendix [7](#app.:related_work){reference-type="ref" reference="app.:related_work"}. With the increasing availability of large-scale time series datasets, training **time series foundation models (TSFMs)** from scratch emerged as the mainstream approach [@moirai; @Chronos; @Time-moe; @yaotowards]. These models demonstrated promising zero-shot performance but still fell short of supporting multi-task and multimodal capabilities. With the advent of multimodal models [@Flamingo; @GPT4] and reasoning-centric models [@o1; @deepseek-r1], the intersection of time series and LLMs has re-emerged at the forefront. Broadly, existing approaches can be categorized into two main groups. The first are **time series language models (TSLMs)**, which primarily adapt the language modeling paradigm to temporal data, focusing on supervised pattern fitting and QA without reasoning [@ChatTS; @chattime; @TimeMQA; @TempoGPT; @TimeRA; @ITFormer]. The second are **time series reasoning models (TSRMs)**, distinguished by their attempt to employ reinforcement learning (RL) to cultivate genuine reasoning ability rather than pattern matching, and by their use of explicit reasoning to improve accuracy while providing interpretable, step-by-step explanations that enhance trustworthiness [@ts_Slowthinking_LLMs; @InferringEvents; @timer1]. However, this line of research is still in its infancy: the definition and necessity of reasoning remain vague, existing QA datasets are of limited quality, and most current works restrict themselves to single-task experiments under the R1 paradigm [@deepseek-r1], without yet establishing a general-purpose reasoning framework for time series analogous to that of LLMs.

**Reasoning with Large Language Models.** Generative LLMs offer greater flexibility and generalization than traditional deep learning models, making them effective in complex tasks that require multi-step reasoning [@cot; @zero-shot-cot]. However, reasoning errors can propagate and degrade performance in some cases. To address this, @lightman2023let introduces a step-level reward mechanism into both data construction and model fine-tuning to enhance reasoning. DeepSeek-R1-Zero [@deepseek-r1] shows that RL using only format and final-answer rewards can also improve reasoning, and this paradigm has since expanded to math, code, translation, and multimodal tasks [@zhang2025srpo; @zhang2025right; @feng2025mt; @huang2025vision; @bimark; @zhan2025vision; @visual-rft]. However, LLM-based reasoning for time series remains underexplored due to data scarcity.

Methodology
===========

**Problem Definition.** We define time series reasoning as the process in which a reasoning model (RM) $p_\theta$ first generates a sequence of intermediate rationales $R=(r_1,\ldots,r_K)$ and then produces a final answer $y$, conditioned on (1) observed time series inputs $X=\{x^{(m)}_{1:T}\}_{m=1}^M$ and (2) auxiliary context $C$ (e.g., task instructions or external knowledge). Formally: $$(R, y) \sim p_\theta(R, y \mid X, C) = p_\theta(R \mid X, C)\, p_\theta(y \mid R, X, C).$$ This formulation covers both discrete-output tasks, where $y$ is a categorical option, and sequence-output tasks, where $y$ is a numerical sequence, under a unified reasoning framework. To standardize outputs, RMs must generate rationales $R$ within `<think></think>` tags, followed by the final answer $y$ enclosed in an `<answer></answer>` block. In contrast, non-reasoning models directly predict $y \sim p_\theta(y \mid X, C)$ without rationales, producing only the `<answer></answer>` block.
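
As a concrete illustration of this output schema, the following minimal Python sketch (ours, not part of the released code) extracts the rationale and answer from a response and rejects malformed outputs:

```python
import re

def parse_response(text):
    """Split a response into (rationale, answer); return None if it
    violates the <think></think><answer></answer> schema."""
    m = re.match(r"^<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$",
                 text.strip(), re.DOTALL)
    if m is None:
        return None
    return m.group(1).strip(), m.group(2).strip()

resp = "<think>Demand rises with temperature.</think><answer>B</answer>"
print(parse_response(resp))  # ('Demand rises with temperature.', 'B')
```

Non-reasoning baselines would be validated against the `<answer></answer>` block alone.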

Formulating Reasoning-Critical Time Series Tasks {#2principles}
------------------------------------------------

**Limitations of Existing QA Tasks.** We use Time-MQA [@TimeMQA], the largest existing time series question answering (TSQA) dataset, to highlight two limitations. (1) *Many questions are overly simple and straightforward, where invoking reasoning leads to over-thinking.* At the aggregate level, as shown in Figure [1](#fig:1a){reference-type="ref" reference="fig:1a"}, the accuracy gap between stronger models (GPT-4.1) and smaller baselines (Qwen2.5-14B) is marginal, and in some cases reversed, indicating that additional reasoning capacity brings no benefit. Furthermore, all models achieve accuracy above 75%, highlighting that the tasks are not sufficiently challenging. At the instance level, as illustrated by the True/False QA from Time-MQA in Figure [2](#fig:1b){reference-type="ref" reference="fig:1b"}, the question can be directly answered by non-reasoning models, while reasoning complicates the process. (2) *Questions often lack sufficient input information, either in the time series $X$ or the context $C$, which prevents well-grounded answers and introduces ambiguity.* As shown in Figure [3](#fig:1c){reference-type="ref" reference="fig:1c"}, even advanced models plateau below 65% accuracy and show no gains after SFT. To investigate this, we conducted a human evaluation, which revealed numerous ambiguous cases caused by missing context. As exemplified in Figure [4](#fig:1d){reference-type="ref" reference="fig:1d"}, the options are not clearly distinguished (e.g., no explicit thresholds for high, moderate, and low volatility), forcing the model to guess rather than reason toward a verifiably correct answer [@silver2025welcome]. Consequently, errors reflect chance rather than insufficient time series reasoning ability (see Appendix [8](#app.:exsiting_tsqa_shortcomings){reference-type="ref" reference="app.:exsiting_tsqa_shortcomings"} for detailed analysis of existing TSQA datasets).
To address these issues, we propose two design principles for formulating time series QA tasks that require genuine reasoning.

![](figs/fig1a.png){#fig:1a width="\\linewidth"}

![](figs/fig1b.png){#fig:1b width="\\linewidth"}

![](figs/fig1c.png){#fig:1c width="\\linewidth"}

![](figs/fig1d.png){#fig:1d width="\\linewidth"}

**Principle 1 --- QA-pairs must reward reasoning.** A reasoning model $M_{\text{RM}}$ explicitly generates rationales $R$ before producing the answer $y$, whereas a non-reasoning baseline $M_{\text{NRM}}$ directly outputs $y$. To determine whether a task requires reasoning, RMs should (significantly) outperform non-reasoning models (NRMs): $\bar{S}(M_{\text{RM}}) \gg \bar{S}(M_{\text{NRM}})$, where $\bar{S}(\cdot)$ denotes the mean score across tasks (e.g., accuracy for categorical prediction tasks or regression metrics for regression tasks).

**Principle 2 --- QA-pairs must ensure context sufficiency.** Both the time series input $X$ and auxiliary context $C$ constitute the basis for reasoning. Unlike coding and mathematical problem solving, where a well-posed problem typically admits a unique solution [@MathPrompter], time series problem solving is especially sensitive to the sufficiency of $X$ and $C$. Let $K$ denote the number of candidate options. Even an ideal reasoner with infinite reasoning capacity ($RC \to \infty$) will be forced to guess if $X$ or $C$ is underspecified (e.g., missing thresholds for distinguishing high vs. low variance), while it should substantially exceed random guessing once $X$ and $C$ are sufficient: $$\lim_{RC \to \infty} P(\text{correct} \mid X, C) \;\begin{cases}
       \approx \tfrac{1}{K}, & \text{if $X$ or $C$ is underspecified}, \\
       \gg \tfrac{1}{K}, & \text{if $X$ and $C$ are sufficient}.
    \end{cases}$$ Therefore, ensuring context sufficiency is a critical design principle for formulating reasoning-critical time series tasks, as it prevents ambiguity and enables reasoning to be applied meaningfully.

**Reasoning-Critical Tasks.** The two principles motivate us to directly address the unique challenges of time series reasoning (ensuring QA requires reasoning and context sufficiency). To this end, we design a suite of tasks that form a progressive pathway covering three fundamental time series reasoning capabilities: (1) perception, (2) extrapolation, and (3) decision-making.

As shown in Figure [5](#fig:2){reference-type="ref" reference="fig:2"}, the foundation of time series reasoning capabilities is perception, where the model first recognizes temporal patterns and then uncovers their underlying causes. This includes **Task 1: Scenario Understanding**, which focuses on single-series attribution by linking fluctuations to generative scenarios or external events (e.g., higher temperatures leading to increased ice-cream sales). It also encompasses **Task 2: Causality Discovery**, which extends attribution to the multi-series setting, requiring the model to compare trends across sequences and identify causal relations (e.g., upstream discharge influencing downstream flow). Together, these tasks ensure the model not only observes time series but also interprets them in a context-aware and causal manner. **Task 3: Event-aware Forecasting** requires the model to build on its perception ability to extrapolate future trajectories under explicit event perturbations. Accurate extrapolation depends on leveraging intrinsic temporal knowledge to analyze external events and infer their impact on temporal dynamics. Finally, **Task 4: Decision Making** represents the culmination of this chain. Building on the perception of temporal patterns (Task 1), causal relations (Task 2), and extrapolation (Task 3), the model must integrate these to select actions (Task 4) that maximize downstream utility (e.g., maximizing profits).

![Illustrative examples of the four reasoning-critical time series tasks in TSR-Suite.](figs/fig2.png){#fig:2 width="1\\linewidth"}


By following the progressive capabilities of perception, extrapolation, and decision-making in formulating reasoning-critical tasks, we ensure that reasoning is an intrinsic requirement. Solving these tasks demands explicit reasoning from the outset, unlike conventional analytical tasks such as interpolation, where models often succeed through implicit fitting without reasoning.

TSR-Suite {#sec:tsr-suite}
--------------------------

To mitigate the scarcity of data in the field, we construct the **T**ime **S**eries **R**easoning **Suite** (TSR-Suite), the first unified dataset suite tailored for time series reasoning. Unlike prior benchmarks designed purely for evaluation, TSR-Suite is built as a training-and-evaluation suite that supports TSRM development. The dataset spans 10 diverse domains and contains 23,605 curated QA pairs. Among them, 2,339 samples are annotated through a human-guided hierarchical annotation process. Detailed statistics for each task are provided in Appendix [9.2](#app.:data_statistics){reference-type="ref" reference="app.:data_statistics"}. As shown in Figure [6](#fig:3){reference-type="ref" reference="fig:3"}(a), the data organization comprises three components as follows.

**Raw Data Collection.** Guided by the “perception--extrapolation--decision-making” pathway underlying our four tasks, we systematically collect publicly available time series data across 10 domains. Figure [6](#fig:3){reference-type="ref" reference="fig:3"}(a) provides an overview of the domain distribution; see Appendix [9.1](#app.:raw_data_source){reference-type="ref" reference="app.:raw_data_source"} for data source details.

**Task Formulation.** To align with our task design and support RL, we standardize the QA format across tasks. Specifically, Tasks 1, 2, and 4 are framed as discrete-output selection problems, while Task 3 is formulated as a sequence-output forecasting task, as shown in Figure [5](#fig:2){reference-type="ref" reference="fig:2"}. Each task adopts a customized data construction pipeline and is extensible to support further scaling with new input series. A key improvement over prior datasets is that our dataset, with over 23K QA pairs, is sufficiently large to support both training and evaluation (e.g., CiK with $355$ samples [@CiK], TSAIA with $1,054$ samples [@TSAIA]), rather than serving solely as a testbed.
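
To make the standardized QA format concrete, here is a minimal sketch of the two output types; all field names are illustrative assumptions, not the suite's actual schema:

```python
# Illustrative sample layouts; field names are our assumption, not TSR-Suite's schema.
discrete_sample = {                       # Tasks 1, 2, 4: discrete-output selection
    "series": [[0.2, 0.5, 0.9], [0.1, 0.4, 0.8]],   # multivariate input X
    "context": "Series A is upstream discharge; Series B is downstream flow.",
    "options": {"A": "A causes B", "B": "B causes A", "C": "No causal link"},
    "answer": "A",                        # categorical option y
}

forecast_sample = {                       # Task 3: sequence-output forecasting
    "series": [[10.0, 12.0, 11.5, 13.0]],
    "context": "A heatwave is expected over the next two steps.",
    "horizon": 2,
    "answer": [15.2, 16.8],               # numerical target sequence y
}
```

Keeping both variants in one layout is what lets the discrete and forecasting tasks share a single training and reward pipeline.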

**Hierarchical Chain-of-Thoughts.** Existing time series QA datasets typically provide only labels [@TimeMQA], overlooking the fact that LLMs lack temporal priors for time series reasoning. To fill the gap, we design a hierarchical annotation pipeline involving an LLM Analyzer, Human Reviewers, and an LLM Rewriter (Figure [6](#fig:3){reference-type="ref" reference="fig:3"}(b)). **(1) Human-guided solvable annotation.** Instead of asking the LLM analyzer to directly solve the problems, we guide it with structured templates to elicit consistent reasoning, and we retain correctly solved samples as *Step-1 CoT data*. **(2) Context sufficiency verification.** For questions answered incorrectly in the first step, human experts use a customized evaluation interface (see Appendix [9.5](#app.:interface){reference-type="ref" reference="app.:interface"}) to examine whether the provided context is sufficient to disambiguate the answer. If a question is solvable by human reviewers, expert-written reasoning chains are subsequently polished by the rewriter to follow our structured templates, and the resulting samples are collected as *Step-2 CoT data*. **Task 3 () is treated as a special case**: unlike tasks with unique answers, forecasting outputs cannot perfectly match the ground truth due to inherent noise in real-world time series data. Human reviewers examine the cases and select 400 samples with relatively low mean absolute error (MAE). As a result, annotated predictions in Task 3 may not coincide exactly with the ground truth, but they capture plausible and well-justified reasoning. Additional analysis of Task 3 is provided in Appendix [9.4](#app.:task3_special_notes){reference-type="ref" reference="app.:task3_special_notes"}.
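
The MAE-based screening used for Task 3 can be sketched as follows (a simplification of the human review step; function and field names are ours):

```python
def mae(pred, gold):
    """Mean absolute error between equal-length sequences."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)

def screen_candidates(samples, k=400):
    """Rank annotated forecasts by MAE against the ground truth and keep
    the k best as candidate Task 3 CoT data for reviewer confirmation."""
    return sorted(samples, key=lambda s: mae(s["pred"], s["gold"]))[:k]
```

In the actual pipeline, reviewers additionally judge whether the reasoning behind each retained forecast is plausible, so low MAE is necessary but not sufficient.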

![Overview of data and training pipeline. **(a)** Construction of , including *domain distribution* and *sample statistics*. **(b)** Hierarchical CoT annotation pipeline with outputs from each step for all tasks. **(c)** Two-stage training of : Stage 1 injects temporal priors via SFT; Stage 2 refines reasoning with task-grounded reward signals under RL.](figs/fig3.png){#fig:3 width="\\textwidth"}


TimeOmni-1 {#section}
---------------------

Developing time series reasoning poses unique challenges compared to other domains. Pretrained LLMs lack temporal priors, as they are rarely exposed to time series data during pretraining. To bridge this gap, we propose a two-stage training paradigm: (1) injecting temporal priors to anchor the model in a temporal knowledge space, and (2) refining these priors for robust reasoning through task-grounded rewards (Figure [6](#fig:3){reference-type="ref" reference="fig:3"}(c)). All experiments in this section use in-distribution (ID) testbeds.


![image](figs/fig4.png){width="\\linewidth"}

![image](figs/fig5.png){width="\\linewidth"}

**Stage 1: Injecting Time Series Reasoning Priors.** Human-guided reasoning priors instruct LLMs on how to decompose time series tasks into meaningful components. These traces narrow the exploration space to focus on three key capabilities (i.e., perception, extrapolation, and decision-making) instead of drifting toward commonsense heuristics or generic algebraic QA. We inject this knowledge through supervised fine-tuning (SFT). Implementation details of SFT are provided in Appendix [11.1](#app.:sft){reference-type="ref" reference="app.:sft"}.

**Finding 1:** Time series reasoning ability need not be innate; it can be effectively cultivated via supervised fine-tuning on a small set of high-quality, curated reasoning traces.

Base models without temporal priors collapse to chance-level accuracy when questions require fundamental temporal understanding (e.g., Task 2: 21.6% vs. 33.3% random guess in Figure [\[fig:4\]](#fig:4){reference-type="ref" reference="fig:4"}). Injecting reasoning traces, even with $<$1K seeds, boosts Task 2 accuracy by 46.1% after Stage 1, with comparable gains across other tasks. This demonstrates that time series reasoning is not inherent to LLMs but can be systematically established through temporal priors.

**Finding 2:** Human-guided traces establish decomposition priors critical for time series reasoning.

Without guidance, LLMs tend to produce unstable, generic math-style reasoning traces that are inconsistent across samples and fail to capture temporal dependencies. In contrast, when prompted with human-guided templates, pretrained LLMs generate structured traces that explicitly follow decomposition strategies and achieve substantially higher accuracy. As shown in Figure [\[fig:5\]](#fig:5){reference-type="ref" reference="fig:5"}, on GPT-4.1, human-guided templates improve zero-shot consistency accuracy from $28.7$% to $71.1$% on Task 2, with improvements also observed across all four tasks. These results further confirm that pretrained LLMs lack temporal priors and must be enhanced through Stage 1 training.

**Stage 2: Refining Reasoning with Task-grounded Rewards.** While Stage 1 provides priors, they remain insufficient for robust reasoning. Stage 2 employs RL through group relative policy optimization [@grpo] to turn mimicked priors into stable and generalizable reasoning behaviors (Figure [6](#fig:3){reference-type="ref" reference="fig:3"}(c)). Implementation details of the RL stage are provided in Appendix [11.2](#app.:rl){reference-type="ref" reference="app.:rl"}.

![image](figs/fig6.png){width="\\linewidth"}

Here we focus on designing task-grounded, outcome-based rewards for time series reasoning, with detailed reward design provided in Appendix [12](#app.:reward_function){reference-type="ref" reference="app.:reward_function"}. Each sample receives a reward composed of format verification and task correctness. $\mathcal{R}_{\text{format}}$ enforces the `<think></think><answer></answer>` schema. For correctness, we distinguish task types: for Tasks 1, 2, and 4, $\mathcal{R}_{\text{discrete}} \in \{0,1\}$ denotes exact-match accuracy ($1$ if correct, $0$ otherwise). For Task 3, we add a counting bonus $\mathcal{R}_{\text{count}}=0.1$ if the predicted sequence length matches the required horizon. This structural reward is essential since LLMs still struggle with counting; for example, our Stage 1 checkpoint achieves only $55.7$% success on matching the required sequence length. In addition, we use an exponential decay to map the unbounded MAE into a normalized range, which compresses arbitrarily large MAE toward zero and ensures higher rewards for smaller MAE.
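
A minimal sketch of these reward components (the decay constant and the exact composition are our assumptions; the precise design is in Appendix 12):

```python
import math
import re

def format_reward(text):
    """1.0 iff the response follows the <think></think><answer></answer> schema."""
    ok = re.match(r"^<think>.*</think>\s*<answer>.*</answer>$", text.strip(), re.DOTALL)
    return 1.0 if ok else 0.0

def discrete_reward(pred, gold):
    """Tasks 1, 2, 4: exact-match accuracy in {0, 1}."""
    return 1.0 if pred == gold else 0.0

def forecast_reward(pred, gold, decay=1.0):
    """Task 3: counting bonus plus an exponentially decayed MAE term."""
    r = 0.1 if len(pred) == len(gold) else 0.0        # R_count bonus
    n = min(len(pred), len(gold))
    mae = sum(abs(p - g) for p, g in zip(pred[:n], gold[:n])) / n
    return r + math.exp(-decay * mae)                 # maps MAE in [0, inf) to (0, 1]
```

A perfect forecast of the right length scores $1.1$; an arbitrarily poor one decays toward the length bonus alone.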

**Finding 3:** Reinforcement learning works reliably only once the base model is anchored with fundamental temporal priors, which prevent collapse into spurious exploration.


![](figs/fig7a.png){#fig:7a width="\\linewidth"}

![](figs/fig7b.png){#fig:7b width="\\linewidth"}

![](figs/fig7c.png){#fig:7C width="\\linewidth"}

[\[fig:joint\_training\]]{#fig:joint_training label="fig:joint_training"}

Applying Stage 2 directly to a base model yields only marginal or even negative improvements (as shown in Figure [\[fig:6\]](#fig:6){reference-type="ref" reference="fig:6"}, with a 5.3% drop on Task 4), since the rewards cannot distinguish genuine temporal knowledge from exploration within the pretraining corpus space. In contrast, when preceded by Stage 1, the same rewards refine temporal priors, which progressively develop into robust reasoning.

**Joint Training for Time Series Reasoning.** Unlike prior single-task (or single-dataset) pipelines [@reasonrft; @TimeMaster], we investigate whether unifying perception, extrapolation, and decision-making objectives through joint training yields mutual benefits. We design the following two complementary experimental settings to systematically study the synergistic gains among the three reasoning capabilities.

**Finding 4:** Joint training turns perception, extrapolation, and decision-making from silos into complementary capabilities, supporting a train-once, use-across-tasks paradigm for TSRMs.

*Progressive Capability Transfer.* This evaluates whether precursor reasoning capabilities transfer to downstream decision-making in a zero-shot manner. We evaluate three conditions on the ID decision-making testbed: (1) base model without precursor training, (2) model trained only on perception tasks, and (3) model trained on both perception and extrapolation tasks. As shown in Figure [7](#fig:7a){reference-type="ref" reference="fig:7a"}, accuracy on decision-making tasks increases from $25.5$% to $26.2$% and further to $31.3$%, indicating that precursor capabilities enhance downstream reasoning even without direct supervision.

*Progressive Capability Supplement.* This assesses supervised joint training by gradually incorporating precursor tasks. We compare: (1) training solely on decision-making, (2) joint training on extrapolation and decision-making, and (3) full joint training across all four tasks, covering all three capabilities. Decision-making accuracy rises from $40.9$% to $45.7$% and peaks at $47.9$%, as shown in Figure [8](#fig:7b){reference-type="ref" reference="fig:7b"}, confirming that progressively supplementing related tasks creates complementary learning benefits.

*Scaling to All Tasks.* Building on the above complementary settings, we compare single-task training against joint training across all four tasks. As shown in Figure [9](#fig:7C){reference-type="ref" reference="fig:7C"}, joint training consistently outperforms single-task training on the ID testbed. These results support a \`\`train-once, use-across-tasks" paradigm for time series reasoning, where joint training effectively captures intrinsic connections within the temporal reasoning capabilities without task interference.

Experiments
===========

**Evaluation Metrics.** We observe that different models vary significantly in instruction-following ability, sometimes generating repetitive or malformed outputs. To ensure fair comparison, we adopt the standardized system prompt shown in Appendix [10.2](#app.:system_prompt){reference-type="ref" reference="app.:system_prompt"} and apply regular expressions to extract answers. We report the Success Rate (SR), which is the proportion of model outputs that yield a valid and extractable answer. All subsequent evaluation metrics are computed only on these valid cases, ensuring that performance reflects time series reasoning ability rather than instruction-following compliance. For discrete-output tasks (, , ), we use Accuracy (ACC) via exact match. For the sequence-output task (), we use Mean Absolute Error (MAE) to assess forecasting precision. Higher ACC and lower MAE indicate better performance. The hyperparameters used are provided in Appendix [13](#app.:training_configuration){reference-type="ref" reference="app.:training_configuration"}.
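
A minimal sketch of how SR can be computed, assuming answers are wrapped in the `<answer></answer>` tags of our schema (the actual extraction patterns are model-specific):

```python
import re

def extract_answer(output: str):
    """Return the content of the last <answer>...</answer> block,
    or None if no well-formed block is found."""
    matches = re.findall(r"<answer>(.*?)</answer>", output, re.DOTALL)
    return matches[-1].strip() if matches else None

def success_rate(outputs):
    """SR: fraction of model outputs with a valid, extractable answer.
    Task metrics (ACC, MAE) are then computed only over these cases."""
    return sum(extract_answer(o) is not None for o in outputs) / len(outputs)
```

Separating SR from ACC/MAE in this way ensures that a model is not penalized twice for one malformed output, and that accuracy reflects reasoning quality on the answers it does produce.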

width=,center `\renewcommand{\arraystretch}{1.25}`{=latex}

  -------------------------------------------------------------------- --------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ----------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                                                                                                                                                                  
   (lr)3-4(lr)5-6(lr)7-8(lr)9-10 (lr)11-12(lr)13-14(lr)15-16(lr)17-18                         **ACC**   [**SR%**]{style="color: mygray"}   **ACC**   [**SR%**]{style="color: mygray"}   **ACC**   [**SR%**]{style="color: mygray"}   **ACC**   [**SR%**]{style="color: mygray"}   **MAE**   [**SR%**]{style="color: mygray"}   **MAE**   [**SR%**]{style="color: mygray"}   **ACC**   [**SR%**]{style="color: mygray"}   **ACC**   [**SR%**]{style="color: mygray"}
                         **Proprietary Models**                                                                                                                                                                                                                                                                                                                                                                                   
                           GPT-4.1-2025-04-14                                                            [100.0]{style="color: mygray"}               [100.0]{style="color: mygray"}     28.7      [99.9]{style="color: mygray"}                [100.0]{style="color: mygray"}               [97.4]{style="color: mygray"}     170.78     [76.1]{style="color: mygray"}      25.5      [100.0]{style="color: mygray"}     27.8      [100.0]{style="color: mygray"}
                              GPT-4.1-Nano                                                     66.2      [97.5]{style="color: mygray"}      62.6      [98.7]{style="color: mygray"}      29.8      [98.6]{style="color: mygray"}      28.0      [98.4]{style="color: mygray"}      18.98     [92.8]{style="color: mygray"}     170.78     [76.1]{style="color: mygray"}      28.9      [99.5]{style="color: mygray"}      34.1      [97.8]{style="color: mygray"}
                         **Open-Source Models**                                                                                                                                                                                                                                                                                                                                                                                   
                         Llama-3.1-70B-Instruct                                                56.4      [100.0]{style="color: mygray"}     59.6      [100.0]{style="color: mygray"}     23.4      [100.0]{style="color: mygray"}     28.9      [99.9]{style="color: mygray"}      24.67     [92.8]{style="color: mygray"}     238.98     [97.0]{style="color: mygray"}      20.3      [96.8]{style="color: mygray"}      17.7      [97.4]{style="color: mygray"}
                       Mistral-Small-3.1-24B-Ins                                               64.8      [100.0]{style="color: mygray"}     69.2      [100.0]{style="color: mygray"}     24.6      [100.0]{style="color: mygray"}     25.8      [100.0]{style="color: mygray"}     17.28     [72.0]{style="color: mygray"}                [43.4]{style="color: mygray"}                [100.0]{style="color: mygray"}               [100.0]{style="color: mygray"}
                         Llama-3.1-8B-Instruct                                                 36.6      [46.5]{style="color: mygray"}      32.1      [46.8]{style="color: mygray"}       \-        [3.7]{style="color: mygray"}       \-        [1.9]{style="color: mygray"}      27.68     [52.91]{style="color: mygray"}    186.80     [29.8]{style="color: mygray"}       7.4      [28.7]{style="color: mygray"}      16.2      [42.9]{style="color: mygray"}
                            Mistral-7B-v0.3                                                    40.5      [92.2]{style="color: mygray"}      34.7      [87.6]{style="color: mygray"}      29.0      [86.0]{style="color: mygray"}      26.9      [82.6]{style="color: mygray"}       \-        [5.3]{style="color: mygray"}       \-        [0.0]{style="color: mygray"}      24.3      [94.2]{style="color: mygray"}      16.7      [96.7]{style="color: mygray"}
                          Qwen2.5-Instruct-7B                                                  48.5      [100.0]{style="color: mygray"}     42.8      [100.0]{style="color: mygray"}     21.6      [99.8]{style="color: mygray"}      26.3      [100.0]{style="color: mygray"}     23.28     [53.1]{style="color: mygray"}     146.12     [55.46]{style="color: mygray"}     25.5      [100.0]{style="color: mygray"}     24.9      [100.0]{style="color: mygray"}
                    **Time Series Language Models**                                                                                                                                                                                                                                                                                                                                                                               
                                Time-MQA                                     Llama3-8B         32.2      [29.5]{style="color: mygray"}      25.1      [32.6]{style="color: mygray"}      30.1      [44.3]{style="color: mygray"}      31.2      [37.2]{style="color: mygray"}       \-        [1.4]{style="color: mygray"}       \-        [0.4]{style="color: mygray"}      12.0      [13.3]{style="color: mygray"}      11.6      [15.8]{style="color: mygray"}
                                Time-MQA                                  Mistral-7B-v0.3      15.1      [21.5]{style="color: mygray"}      27.8      [22.1]{style="color: mygray"}       8.4      [50.2]{style="color: mygray"}       4.0      [52.2]{style="color: mygray"}       \-        [0.2]{style="color: mygray"}       \-        [0.0]{style="color: mygray"}       5.4      [36.1]{style="color: mygray"}      10.0      [47.3]{style="color: mygray"}
                                Time-MQA                                    Qwen2.5-7B         25.0      [14.0]{style="color: mygray"}      37.5      [22.7]{style="color: mygray"}      29.5      [33.0]{style="color: mygray"}      30.5      [32.0]{style="color: mygray"}      19.76     [12.2]{style="color: mygray"}       \-        [6.5]{style="color: mygray"}      23.8      [58.0]{style="color: mygray"}      26.4      [44.3]{style="color: mygray"}
                                 ChatTS                                                         \-        [6.0]{style="color: mygray"}       \-        [6.9]{style="color: mygray"}      18.2      [30.1]{style="color: mygray"}      18.6      [26.7]{style="color: mygray"}       \-        [0.0]{style="color: mygray"}       \-        [0.0]{style="color: mygray"}       5.8      [27.1]{style="color: mygray"}      11.1      [27.1]{style="color: mygray"}
                    **Time Series Reasoning Models**                                                                                                                                                                                                                                                                                                                                                                              
                                Time-R1                                 Qwen2.5-Instruct-7B    30.9      [94.0]{style="color: mygray"}      34.0      [92.5]{style="color: mygray"}                [53.8]{style="color: mygray"}      31.4      [48.9]{style="color: mygray"}      17.61     [38.7]{style="color: mygray"}       \-        [6.3]{style="color: mygray"}      27.8      [95.7]{style="color: mygray"}      32.2      [93.1]{style="color: mygray"}
                                **Ours**                                                                                                                                                                                                                                                                                                                                                                                          
                                                                        Qwen2.5-Instruct-7B              [97.5]{style="color: mygray"}                [98.3]{style="color: mygray"}                [99.8]{style="color: mygray"}                [99.8]{style="color: mygray"}                [93.8]{style="color: mygray"}                [82.3]{style="color: mygray"}                 [100]{style="color: mygray"}                 [100]{style="color: mygray"}
  -------------------------------------------------------------------- --------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ----------------------------------

[\[tab:1\]]{#tab:1 label="tab:1"}

**Baselines.** We compare against up-to-date models in two categories: **(1) Time series language models:** Time-R1 [@timer1] (TSRMs for classical forecasting), Time-MQA [@TimeMQA] (fine-tuned 7B-8B models tailored for TSQA), and ChatTS [@ChatTS] (fine-tuned 14B model for time series understanding). **(2) General-purpose LLMs:** two GPT-4.1 variants as proprietary representatives, and five open-source LLMs (7B to 70B) for comprehensive evaluation.

Main Results
------------

As shown in Table [\[tab:1\]](#tab:1){reference-type="ref" reference="tab:1"}, **consistently ranks among the top-2 models across all time series reasoning tasks.** Notably, it exceeds GPT-4.1 by **40.6%** (ID) and **28.1%** (OOD) on causal discovery. While achieving comparable accuracy on scenario understanding, surpasses GPT-4.1 by a wide margin on tasks requiring deeper temporal priors (e.g., decision-making). Existing time series specialized models, however, exhibit weaker instruction-following ability than general LLMs (consistently lower SR). For example, ChatTS achieves 0% SR on the event-aware forecasting task; upon inspection, we found it fails to produce the required numeric sequences, generating only free-form text. This highlights a critical limitation of existing time series task-specific models: over-specialization compromises generalization ability.

More Analysis {#ablation}
-------------

r0.3 ![image](figs/fig8.png){width="0.95\\linewidth"}

**General Reasoning Capability.** We evaluate whether our time series specialization diminishes general reasoning ability. We compare the base model, Stage 1 SFT model, and on three general reasoning benchmarks: DROP [@Dua2019DROP], GPQA [@rein2024gpqa], and ReClor [@yu2020reclor], which focus respectively on numerical reasoning, graduate-level knowledge reasoning, and logical reasoning. As shown in Figure [\[fig:8\]](#fig:8){reference-type="ref" reference="fig:8"}, improves average accuracy by 16.5% over the base model and 1.3% over the Stage 1 model. This indicates our approach not only maintains but also enhances general reasoning capabilities while specializing in time series tasks, avoiding the instruction-following degradation observed in other specialized models.

**Ablation on Training Stage.** We evaluate two configurations: (1) **Stage 1 models**, including NRMs via answer-only fine-tuning (ANS-SFT) and RMs via CoT fine-tuning (CoT-SFT); (2) **Stage 1 + Stage 2 models** (CoT-SFT+RL), which first activate reasoning via CoT-SFT and then apply RL. We analyze performance under multi-task joint training. As shown in Table [\[tab:2\]](#tab:2){reference-type="ref" reference="tab:2"}, complete two-stage training (CoT-SFT+RL) delivers the most balanced performance, ranking Top-2. In causal discovery, CoT-SFT reaches 67.7% accuracy compared to only 30.5% for ANS-SFT, showing that answer-only supervision merely fits answer distributions without fostering reasoning. On decision-making, the CoT-SFT vs. ANS-SFT gap narrows from 10.1% (ID) to 5.5% (OOD), further confirming that ANS-SFT fails to foster reasoning, whereas CoT-SFT establishes transferable reasoning skills that are consolidated by RL in Stage 2.

**Ablation on Training Strategy.** We compare single-task fine-tuning against multi-task joint training under identical training budgets. As shown in Table [\[tab:2\]](#tab:2){reference-type="ref" reference="tab:2"}, multi-task joint training often enhances performance across all tasks and training stages. On the ID testbed, the jointly trained CoT-SFT+RL model () achieves accuracy gains of 8.2%, 1.8%, 2.46 (MAE), and 7.0% across the four tasks compared to single-task training. Together with Figure [\[fig:joint\_training\]](#fig:joint_training){reference-type="ref" reference="fig:joint_training"}, which demonstrates progressive capability *transfer* and *supplement*, these results validate that joint training effectively integrates temporal reasoning capabilities, reinforcing the \`\`train-once, use-across-tasks" paradigm.

width=,center `\renewcommand{\arraystretch}{1.25}`{=latex}

  -------------------------------------------------------------------- ----------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ----------------------------------
                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                                                                                                                                                                                                    
   (lr)3-4(lr)5-6(lr)7-8(lr)9-10 (lr)11-12(lr)13-14(lr)15-16(lr)17-18                           **ACC**   [**SR%**]{style="color: mygray"}   **ACC**   [**SR%**]{style="color: mygray"}   **ACC**   [**SR%**]{style="color: mygray"}   **ACC**   [**SR%**]{style="color: mygray"}   **MAE**   [**SR%**]{style="color: mygray"}   **MAE**   [**SR%**]{style="color: mygray"}   **ACC**   [**SR%**]{style="color: mygray"}   **ACC**   [**SR%**]{style="color: mygray"}
                             **Base Model**                             *Qwen2.5-Instruct-7B*    48.5      [100.0]{style="color: mygray"}     42.8      [100.0]{style="color: mygray"}     21.6      [99.8]{style="color: mygray"}      26.3      [100.0]{style="color: mygray"}     23.28     [53.1]{style="color: mygray"}                [55.5]{style="color: mygray"}      25.5      [100.0]{style="color: mygray"}     24.9      [100.0]{style="color: mygray"}
                                                                             Single-task         77.5      [100.0]{style="color: mygray"}     73.9      [100.0]{style="color: mygray"}     35.7      [100.0]{style="color: mygray"}     33.8      [100.0]{style="color: mygray"}     23.87     [39.7]{style="color: mygray"}     150.42      [0.6]{style="color: mygray"}      20.2      [100.0]{style="color: mygray"}     24.2      [100.0]{style="color: mygray"}
                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                             Single-task         73.9      [100.0]{style="color: mygray"}     85.6      [83.9]{style="color: mygray"}      66.3      [96.0]{style="color: mygray"}                [92.4]{style="color: mygray"}      15.10     [64.6]{style="color: mygray"}     157.21     [34.5]{style="color: mygray"}      39.4      [98.40]{style="color: mygray"}     47.3      [94.87]{style="color: mygray"}
                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                             Single-task         82.5      [100.0]{style="color: mygray"}               [98.7]{style="color: mygray"}      67.5      [99.6]{style="color: mygray"}      61.7      [99.0]{style="color: mygray"}      16.76     [79.2]{style="color: mygray"}     169.88     [66.0]{style="color: mygray"}      40.9      [100.0]{style="color: mygray"}               [99.6]{style="color: mygray"}
                                                                                                                                                                                                                                                                                                                                                                                                                                    
  -------------------------------------------------------------------- ----------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ---------------------------------- --------- ----------------------------------

[\[tab:2\]]{#tab:2 label="tab:2"}

Conclusion
==========

In this paper, we introduce , which addresses the scarcity of reasoning-critical time series data. It formalizes four tasks across three fundamental capabilities for time series reasoning: perception, extrapolation, and decision-making. On this basis, we present , the first generalized, unified model for time series reasoning. It first injects temporal priors through supervised fine-tuning. Then, reinforcement learning with task-grounded rewards guides the model beyond mimicking priors toward robust reasoning. Experiments show that achieves top-tier performance while preserving the general reasoning ability of the base model. Finally, we demonstrate that joint training across diverse reasoning tasks yields mutual gains, supporting a \`\`train-once, use-across-tasks" paradigm for future time series reasoning models.

Ethics Statement {#ethics-statement .unnumbered}
================

Our work focuses solely on scientific challenges and does not involve human subjects, animals, or environmentally sensitive materials. We foresee no ethical risks or conflicts of interest. We are committed to upholding the highest standards of scientific integrity and ethical conduct to ensure the validity and reliability of our findings.

Reproducibility Statement {#reproducibility-statement .unnumbered}
=========================

To ensure reproducibility, we provide detailed hyperparameters for all training stages (Appendix [13](#app.:training_configuration){reference-type="ref" reference="app.:training_configuration"}) and all system prompts used in annotation, training, and evaluation (Appendix [10](#app.prompt){reference-type="ref" reference="app.prompt"}). Our code and model checkpoints are publicly available.

Acknowledgment {#acknowledgment .unnumbered}
==============

S. Pan was partially supported by Australian Research Council (ARC) under grants FT210100097 and DP240101547 and the CSIRO -- National Science Foundation (US) AI Research Collaboration Program. This work was also supported by the NVIDIA Academic Grant in Higher Education and Developer program.

The Use of Large Language Models
================================

During the preparation of this manuscript, we employed large language models only as auxiliary tools for non-substantive tasks. Their applications were limited to assisting in code debugging, checking grammar and formatting consistency, and improving the fluency of written text. The research design, experimental analysis, and conceptual contributions were independent of the LLMs' output. All scientific insights and conclusions presented in this work are solely attributable to the authors.

Further Related Work {#app.:related_work}
====================

Time series analysis has underpinned applications in finance, energy, transportation, and healthcare, among others, over the past decade [@lu2024trnn; @liu2023sadiselfadaptivedecomposedinterpretable; @guan2023spatial; @lan2025gem]. Most existing studies still concentrate on a single specific task, such as forecasting, classification, anomaly detection, or imputation [@ST-DCAN; @ts_classification_2017; @anomaly_detection_ming; @Bipartite_imputation]. These systems are typically **task-specific** and lack generality across tasks. Attempts such as TimesNet [@Timesnet] and UNITS [@UNITS] replace the output layer and loss to reuse a common backbone across tasks, but the resulting models still exhibit limited out-of-distribution (OOD) robustness. Transfer-learning approaches [@TransferLearning_bigdata; @TransferLearning_kdd] and pre-trained models [@GPT-ST] seek to mitigate OOD shifts; however, empirical evaluations typically remain **domain-specific** (e.g., different districts of one city) rather than achieving genuine transfer from domain A to domain B. A growing line of work further argues that a core bottleneck in time series analytics lies in the lack of integration with supplementary textual knowledge [@CiK; @EventTSF], yet current models remain **modality-locked**, being unable to ingest such event information in textual form. In addition, most existing models adopt fixed output formats and depend on black-box computation, providing **limited interpretability**, even though some efforts rely on attention maps [@liu2024itransformer], causal inference [@NuwaDynamics], or visualization of hidden representations [@timemixer++; @cai2024forecastgrapher; @yi2024fouriergnn] to offer implicit explanations. Such latent interpretability, however, is often difficult for non-experts to understand or trust.

In contrast to these models, mitigates domain-specific brittleness by curating cross-domain, reasoning-critical time series data and expanding the task space beyond surface QA. Through multi-task joint training, it further improves reasoning accuracy and OOD generalization at scale across diverse domains. Finally, it yields step-by-step rationales that decompose temporal priors, event effects, and decision criteria, turning black-box predictions into transparent, reproducible reasoning.

Limitations of existing time series reasoning datasets {#app.:exsiting_tsqa_shortcomings}
======================================================

![Evidence for the necessity of reasoning and the sufficiency of context on Time-MQA dataset: multiple-choice and true/false tasks saturate in in-distribution settings, while the anomaly-detection task exhibits apparent guessing under out-of-distribution shift.](figs/appendix_fig9a.png){#fig:appendix_fig9 width="\\linewidth"}

![](figs/appendix_fig9b.png){width="\\linewidth"}

![](figs/appendix_fig9c.png){width="\\linewidth"}

![](figs/appendix_fig9d.png){width="\\linewidth"}

As a supplement to Section [3.1](#2principles){reference-type="ref" reference="2principles"}, we also conducted additional experiments on existing datasets to assess how these limitations affect reasoning performance (shown in Figure [13](#fig:appendix_fig9){reference-type="ref" reference="fig:appendix_fig9"}): **(1) Reasoning necessity.** Across most Time-MQA tasks, pre-trained LLMs and fine-tuned TS-LLMs do not exhibit clear performance separation, indicating that the difficulty distribution lacks sufficient granularity to discriminate model capabilities. Moreover, CoT-based SFT does not outperform ANS-based SFT, suggesting that explicit reasoning is not required to solve these tasks. In fact, when tasks can be addressed through surface-level pattern matching, extra reasoning capacity yields only marginal gains. Therefore, we should construct datasets with calibrated difficulty gradients that genuinely require reasoning to solve. **(2) Context sufficiency.** For anomaly detection, current datasets lack relevant contextual information, forcing models to decide solely from the input sequences. As a result, both Time-MQA and pre-trained LLMs achieve only 50--60% accuracy, barely above random selection. After training, the performance of ANS-based SFT improves by more than 20% on the ID subset, whereas CoT-based gains are less than half of that, indicating that improvements primarily come from directly mapping encoded sequence features to anomalous outcomes, rather than from reasoning about the causes of anomalies. Consequently, for TSR with LLMs, we should prioritize providing sufficient context to support reasoning, rather than pushing models to overfit to time-series signals.

Dataset Details
===============

Raw Data Source {#app.:raw_data_source}
---------------

**Task 1: .** We collect diverse time series data across different scenarios from the work of @merrillLanguageModelsStill2024a. The out-of-distribution (OOD) test set is constructed based on the original domains of the time series data: specifically, samples from the *Agricultural*, *Education* and *Healthcare* domains are used as OOD test data.

**Task 2: .** We perform causal discovery on river discharge time series from the CausalRivers dataset [@CausalRivers], aiming to uncover causal relationships from observational data. Ground-truth causal directions are determined according to river flow: the amount of water measured at an upstream station directly influences the amount measured downstream at a later time, and we thus consider such relations causal; if two rivers are not connected, we consider them non-causal. For the OOD test set, we split by geographical regions: training and in-distribution (ID) test data are taken from Eastern Germany, while the OOD test set is sourced from Bavaria.

**Task 3: .** We use a human mobility dataset as the primary training scenario, specifically taxi drop-off data near the Barclays Center in New York City. We collect the raw time series together with aligned events from @event-aware-forecasting [@EventTSF]. For the OOD test set, we adopt electricity load time series paired with weather events from the EWELD dataset [@EWELD].

**Task 4: .** To evaluate decision-making with counterfactual reasoning ability (i.e., reasoning about the outcomes of unobserved actions), we adopt a sandbox environment based on real building load data. Specifically, we use the CityLearn dataset [@nweye2023citylearn], which provides building load profiles and battery charge/discharge operations under a dynamic pricing scheme. Given 48 hours of historical building load and peak-valley pricing information, models are required to determine charge/discharge strategies for the next 24 hours. For the OOD test set, we select two buildings whose load patterns differ significantly from those in the training and ID test sets.

For all four tasks, we construct clear data pipelines, as detailed in Section [3.2](#sec:tsr-suite){reference-type="ref" reference="sec:tsr-suite"}, thereby facilitating future dataset expansions and task extensions.

Data Statistics {#app.:data_statistics}
---------------

This section provides the detailed quantitative breakdown of , complementing the high-level overview in Section [3.2](#sec:tsr-suite){reference-type="ref" reference="sec:tsr-suite"}. Table [1](#tab:stats){reference-type="ref" reference="tab:stats"} lists the number of samples available for each reasoning task, stratified by their use in the two-stage training (Stage 1 SFT and Stage 2 RL) as well as the in-distribution (ID) and out-of-distribution (OOD) testbeds. The statistics confirm a substantial scale for SFT (Stage 1) and an even larger set for RL (Stage 2), ensuring robust learning and generalization evaluation for each task.

::: {#tab:stats}
  **Task**    **\#Stage 1 Train**   **\#Stage 2 Train**   **\#ID Test**   **\#OOD Test**     
  ---------- --------------------- --------------------- --------------- ---------------- -- --
                      609                  5104                200             899           
                      778                  6044                800             800           
                      400                  2780                418             476           
                      552                  3284                188             273           

  : Detailed sample count statistics for the four time series reasoning tasks in across training stages and testbed.
:::

Statistics on Sequence Length and Token Budget
----------------------------------------------

In this section, we provide statistics for the actual sequence lengths used in in Table [2](#tab:seq-len){reference-type="ref" reference="tab:seq-len"}, as well as the corresponding token budgets computed using the tokenizer of our base model (Qwen2.5-Instruct-7B) in Table [3](#tab:token-budget){reference-type="ref" reference="tab:token-budget"}. These results clarify that our tasks involve substantially longer sequences than the illustrative examples in Figure [5](#fig:2){reference-type="ref" reference="fig:2"}. The average total token budget (1,106 tokens) remains far below the maximum supported input length of Qwen2.5-Instruct-7B (32,768 tokens).

width=,center `\renewcommand{\arraystretch}{1.15}`{=latex}

::: {#tab:seq-len}
                                                                 
  ---------------------- ------------- ------------- ----------- -----------
  **MAX / AVG length**    800 / 316.3   124 / 121.7   96 / 78.3   48 / 48.0

  : Maximum and average time series lengths across four tasks.
:::

[\[tab:seq-len\]]{#tab:seq-len label="tab:seq-len"}

width=,center `\renewcommand{\arraystretch}{1.15}`{=latex}

::: {#tab:token-budget}
                                                            **Overall**
  ------------------------------- ------ ----- ----- ----- -------------
  **AVG tokens of series $X$**     1701   698   357   281       860
  **AVG tokens of context $C$**    261    160   216   408       246
  **AVG total tokens**             1962   858   573   689    **1106**

  : Average token budgets computed using the Qwen2.5-Instruct-7B tokenizer.
:::

[\[tab:token-budget\]]{#tab:token-budget label="tab:token-budget"}

Task 3 () Special Notes {#app.:task3_special_notes}
-----------------------

::: {#tab:t3_notes}
                             ID      OOD
  ------------------------ ------- --------
   LLM Analyzer Generated   15.10   157.2
    Ground Truth Guided     24.53   395.56

  : MAE($\downarrow$) of CoT-SFT with different chain construction on Task 3.
:::

[\[tab:t3\_notes\]]{#tab:t3_notes label="tab:t3_notes"}

Unlike multiple-choice tasks, where correct answers are explicitly listed among the options, Task 3 requires forecasting a future sequence within a fixed output window. This open-ended formulation significantly increases the difficulty of constructing coherent reasoning chains and prevents the LLM Analyzer from perfectly predicting results that are fully aligned with the ground truth. Rather than guiding the generation of reasoning chains with the ground truth, however, we allowed the LLM Analyzer to generate predictions based on its own understanding. In our experiments, reasoning chains generated with ground-truth hints consistently resulted in worse CoT-based SFT performance than those produced directly by the LLM Analyzer, as shown in Table [4](#tab:t3_notes){reference-type="ref" reference="tab:t3_notes"}. This result also aligns with recent findings [@zhao2025automatic; @gao2025principleddataselectionalignment; @li2025curriculumrlaifcurriculumalignmentreinforcement], which suggest that the most effective training data are instances slightly beyond a model's current ability but not prohibitively difficult. Furthermore, ground-truth--guided chains tend to obscure the task's inherent difficulty and deviate from the base model's natural data distribution. Therefore, we examined the cases and selected 400 samples with relatively low-MAE chains, generated by the LLM Analyzer without relying on ground-truth hints, as supervision for Stage 1 training to balance difficulty and quality.

Human Evaluation Interface {#app.:interface}
--------------------------

As described in Figure [6](#fig:3){reference-type="ref" reference="fig:3"}, when the LLM Analyzer (GPT-4.1 in our case) fails to solve a sample in Step 1, the instance proceeds to Step 2. In this step, human reviewers use the interface shown in Figure [\[fig:interface\]](#fig:interface){reference-type="ref" reference="fig:interface"} to examine whether the provided context is sufficient to disambiguate the answer. If the question is solvable by human reviewers, their reasoning chains are further polished by the LLM Rewriter to follow our structured templates, and the resulting samples are collected as Step 2 CoT data.

Prompt Used in this Paper {#app.prompt}
=========================

Human-Guided Reasoning Template for Hierarchical CoT Annotation {#app.:human-guided_reasoning_template}
----------------------------------------------------------------

As detailed in Section [3.2](#sec:tsr-suite){reference-type="ref" reference="sec:tsr-suite"}, our hierarchical annotation pipeline relies on structured reasoning templates to ensure consistency and quality in the Chain-of-Thought (CoT) generation process. These templates serve as explicit guidelines for the LLM Analyzer in the initial solving phase, as well as for human experts during verification and the LLM Rewriter in the refinement phase. The templates defined in this section are specifically designed for CoT annotation only. They provide a systematic framework for breaking down each reasoning task into logical steps, ensuring that all annotated traces follow a consistent structure while capturing the essential temporal reasoning processes. This approach guarantees that the resulting CoT data maintains high quality and facilitates effective model learning.

**Step 1. Series length check**\
Observed length={L}. Expected per option: A{exp\_len\_A}; B{exp\_len\_B}; C{exp\_len\_C}; D{exp\_len\_D}. Retain option(s) whose expected length $\approx$ L.\
**Step 2. Magnitude & unit sanity**\
Value range={min}--{max}. Typical ranges: A{range\_A}; B{range\_B}; C{range\_C}; D{range\_D}. Eliminate options whose units/ranges mismatch.\
**Step 3. Shape & temporal pattern**\
Note trends/seasonality/spikes: {key\_patterns}. Compare to option narratives: A{match\_or\_not}; B{match\_or\_not}; C{match\_or\_not}; D{match\_or\_not}. Keep best‑matching narratives.\
**Step 4. External‑event alignment**\
Identify clear events (e.g., single‑day surge, mid‑series drop): {events}. Which option explicitly explains this?\
**Step 5. Final elimination & plausibility**\
Remaining candidates: {remaining}. Choose the scenario that satisfies all of length, magnitude, pattern, and event consistency.\
**Step 6. Double‑check length consistency**\
Confirm {tentative\_choice} expected length =={L}? → {yes/no}. If "no", revert to next best candidate; else accept. Final choice: {chosen\_option}.

**Step 1. Identify the baseline patterns**\
You should first identify the baseline patterns and trend from the historical series.\
**Step 2. Estimate the impact of any events**\
Next, estimate the incremental impact of any special events (pre-event buildup, during-event lift, post-event dispersal) as an overlay on the baseline.\
**Step 3. Combine the baseline and event effects**\
Finally, given the current context, combine the baseline and event effects to generate the forecast sequence.

**Step 1. Trend Consistency**

Check whether the two series demonstrate structurally consistent trends, such as shared "stable → rise → fall" shapes, both rise/fall at similar points (within ±1--2 time steps), and flat/stable periods aligned in time. It's OK for their absolute values to differ --- match shape, not magnitude.

Red Flag: If one rises while the other stays flat or falls → stop and choose the option that the two series are not causal.

Be perceptually flexible. Flatness doesn't require perfect constancy --- as long as fluctuations are very small relative to the scale of the full time series, they can still be considered flat.\
**Step 2. Key Fluctuation Alignment**

Check whether the two time series have notable peaks, dips, or inflection points at the same or nearly the same time.

You must ensure: Spikes/dips occur within ±1--2 steps (which means $\leq$24 hours lag if sampling is 12h). For time series with low overall discharge (maybe near 0), even modest changes can be meaningful if they represent a clear pattern change relative to baseline. If peaks differ by $\geq$3 steps, it's too much lag to infer causality → the two time series are not causal. Do not confuse visually similar shapes with causality if key changes happen at clearly different times.\
**Step 3. Direction of Causality**

Only perform this step if both Step 1 and 2 pass. Use the domain principle: \"Small rivers flow into big rivers\" --- not the reverse.

Rule: If \`mean(A) $<$ mean(B)\`, then \`A → B\`. If \`mean(B) $<$ mean(A)\`, then \`B → A\`.

Notes: If Step 1 or 2 fails, skip Step 3. Timing matters: 3 steps of lag (36h) is already too much. Matching is about structure and timing, not numbers.
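Steps 2--3 can be sketched as a simple check, using the series argmax as the "key fluctuation" and the mean-discharge rule for direction. This is a simplification of the full template (it omits the trend-consistency test of Step 1), and the discharge values below are hypothetical:

```python
def causal_direction(a, b, max_lag=2):
    """Step 2: peaks must align within max_lag steps, otherwise not causal.
    Step 3: 'small rivers flow into big rivers' -- direction by mean discharge."""
    peak_a = a.index(max(a))
    peak_b = b.index(max(b))
    if abs(peak_a - peak_b) > max_lag:        # too much lag to infer causality
        return "not causal"
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return "A -> B" if mean_a < mean_b else "B -> A"

upstream = [1.0, 1.2, 3.5, 2.0, 1.1]            # hypothetical small-river discharge
downstream = [10.0, 11.0, 12.0, 18.0, 12.5]     # peak one step later, larger mean
print(causal_direction(upstream, downstream))   # → A -> B
```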

**Step 1. Forecast the next 24-hour load**

Use the historical 48-hour load pattern to generate a forecast for tomorrow's 24-hour load. Pay special attention to the peak-price hours and estimate the likely loads during those hours.\
**Step 2. Principles for evaluating strategies**

Charging should take place during off-peak hours when electricity price is low. Discharging should take place during peak hours when electricity price is high and forecasted load is significant. Avoid charging during peak hours or discharging during off-peak hours, as these operations increase cost instead of saving it.\
**Step 3. Cost calculation and strategy comparison**

For each strategy, compute the expected saving using: $$\begin{aligned}
\text{Saving} &= 
\sum_{h \in \text{peak}}
\min\bigl(\hat{L}(h), P_{\max}^{\text{dis}}\bigr)
\cdot
\left(p_{\text{peak}} - p_{\text{valley}}\right)\end{aligned}$$ where $\hat{L}(h)$ is the forecasted load at hour $h$, $P_{\max}^{\text{dis}}$ is the maximum discharging power, and $p_{\text{peak}}$, $p_{\text{valley}}$ are the peak and valley electricity prices. Select the strategy with the highest saving.
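The saving formula above can be evaluated directly per strategy. In the sketch below, the 24-hour load forecast, peak hours, discharge limit, and prices are hypothetical illustrations:

```python
def expected_saving(forecast, peak_hours, p_max_dis, p_peak, p_valley):
    """Saving = sum over peak hours of min(forecasted load, max discharge power)
    times the peak-valley price spread."""
    return sum(min(forecast[h], p_max_dis) for h in peak_hours) * (p_peak - p_valley)

# Hypothetical 24-hour load forecast (kW), evening peak hours 18-21.
load_hat = [30.0] * 18 + [80.0, 90.0, 85.0, 70.0] + [30.0] * 2
print(expected_saving(load_hat, peak_hours=range(18, 22),
                      p_max_dis=75.0, p_peak=0.54, p_valley=0.22))
```

The strategy with the highest saving under this formula is the one selected in Step 3.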

System Prompt for Training and Evaluation {#app.:system_prompt}
-----------------------------------------

This section presents the system prompts used in the ablation study on training stages (Section [4.2](#ablation){reference-type="ref" reference="ablation"}). The prompts are categorized into two types: **Chain-of-Thought (CoT)** prompts that require models to generate reasoning traces before answers, and **Answer-only (ANS)** prompts that directly output final answers without explicit reasoning.

For Tasks 1, 2, and 4, the CoT prompts enforce a structured output format where models must provide step-by-step reasoning within `<think>` tags before the final answer in `<answer>` tags. The ANS prompts for these tasks skip the reasoning step and output only the final answer. For Task 3, the prompts are adapted to accommodate sequence predictions while maintaining the same CoT/ANS distinction. These prompts ensure consistent evaluation across different training configurations: ANS-SFT uses ANS prompts, CoT-SFT uses CoT prompts, and CoT-SFT+RL uses CoT prompts during both training stages.

[\[System Prompt for Training and Evaluation\]]{#System Prompt for Training and Evaluation label="System Prompt for Training and Evaluation"}

Output Format:\
`<think>`Your step-by-step reasoning process that justifies your answer`</think>`\
`<answer>`Your final answer(Note: Only output a single uppercase letter of the correct option)`</answer>`

You should think the impact of the event first, then output the predicted sequence.\
Output Format:\
`<think>`Your step-by-step reasoning process`</think>`\
`<answer>`\[Your predicted sequence\]`</answer>`

Output Format:\
`<answer>`Your final answer(Note: Only output a single uppercase letter of the correct option)`</answer>`

You should output the predicted sequence directly.\
Output Format:\
`<answer>`\[Your predicted sequence\]`</answer>`
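Final answers under these formats can be recovered with a simple tag parser. A minimal sketch, where the fallback branch for free-form responses without tags and the option-letter pattern are our assumptions, not the exact evaluation script:

```python
import re

def parse_answer(response):
    """Extract the final answer from <answer>...</answer>; fall back to the
    last standalone uppercase option letter for free-form responses."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if m:
        return m.group(1).strip()
    letters = re.findall(r"\b([A-D])\b", response)
    return letters[-1] if letters else None

print(parse_answer("<think>steps...</think><answer>C</answer>"))  # → C
print(parse_answer("The pattern suggests a commute scenario, so the answer is B."))  # → B
```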

Model Robustness to Prompt Variations
-------------------------------------

In this section, we investigate the robustness of to prompt variations. We design three prompt perturbations:

-   **Paraphrased Question.** We use ChatGPT to rewrite the original question while keeping the semantic meaning unchanged.

-   **Paraphrased System Prompt.** We use ChatGPT to paraphrase the system prompt while keeping the semantic meaning unchanged.

-   **w/o System Prompt.** We entirely remove the system prompt to simulate the extreme case.

The experimental results are summarized in Table [\[tab:prompt-var\]](#tab:prompt-var){reference-type="ref" reference="tab:prompt-var"}. The results show that paraphrasing either the question or the system prompt leads to minimal performance degradation, indicating that the model does not rely on specific wording or phrasing.

Surprisingly, even in the absence of the system prompt, the model remains highly robust. Although it no longer outputs the explicit `<think>` and `<answer>` tags, we adjust the evaluation script to parse the free-form responses and observe that still maintains strong performance, sometimes even outperforming the original prompt. Manual inspection further reveals that the model continues to follow the reasoning template injected during Stage 1 training, suggesting that the temporal prior has been deeply internalized.

Overall, these results demonstrate that is highly robust to prompt variations and resistant to perturbed instructions.

width=,center `\renewcommand{\arraystretch}{1.2}`{=latex}

  --------------------------------------------------------------------- ------ ------------------------------- ------ ------------------------------- ------ ------------------------------- ------ ------------------------------- ------- ------------------------------- ------- ------------------------------- ------ -------------------------------- ------ --------------------------------
                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                   
   (lr)2-3(lr)4-5 (lr)6-7(lr)8-9 (lr)10-11(lr)12-13 (lr)14-15(lr)16-17   ACC    [SR%]{style="color: mygray"}    ACC    [SR%]{style="color: mygray"}    ACC    [SR%]{style="color: mygray"}    ACC    [SR%]{style="color: mygray"}     MAE    [SR%]{style="color: mygray"}     MAE    [SR%]{style="color: mygray"}    ACC     [SR%]{style="color: mygray"}    ACC     [SR%]{style="color: mygray"}
                           Paraphrase Question                           86.2   [97.8]{style="color: mygray"}   85.3   [97.5]{style="color: mygray"}   65.8   [99.5]{style="color: mygray"}   60.2   [98.8]{style="color: mygray"}           [90.1]{style="color: mygray"}   150.9   [81.5]{style="color: mygray"}          [99.8]{style="color: mygray"}    56.1   [99.5]{style="color: mygray"}
                        Paraphrase System Prompt                                [99.5]{style="color: mygray"}          [98.6]{style="color: mygray"}   66.8   [99.8]{style="color: mygray"}   62.8   [99.5]{style="color: mygray"}   15.28   [93.3]{style="color: mygray"}   152.1   [84.0]{style="color: mygray"}   43.6   [100.0]{style="color: mygray"}          [100.0]{style="color: mygray"}
                            w/o System Prompt                            86.5    [--]{style="color: mygray"}    87.1    [--]{style="color: mygray"}            [--]{style="color: mygray"}            [--]{style="color: mygray"}             [--]{style="color: mygray"}             [--]{style="color: mygray"}    40.4    [--]{style="color: mygray"}             [--]{style="color: mygray"}
                           **Original Prompt**                                  [97.5]{style="color: mygray"}          [98.3]{style="color: mygray"}          [99.8]{style="color: mygray"}          [99.8]{style="color: mygray"}   14.30   [93.8]{style="color: mygray"}           [82.3]{style="color: mygray"}          [100.0]{style="color: mygray"}   58.9   [100.0]{style="color: mygray"}
  --------------------------------------------------------------------- ------ ------------------------------- ------ ------------------------------- ------ ------------------------------- ------ ------------------------------- ------- ------------------------------- ------- ------------------------------- ------ -------------------------------- ------ --------------------------------

[\[tab:prompt-var\]]{#tab:prompt-var label="tab:prompt-var"}

Implementation Details of the Training Stages
=============================================

Since the time series domain currently lacks a pre-trained encoder analogous to Vision Transformers (ViT) [@VIT] in computer vision, we follow the common practice of tokenizing time series into text inputs, in line with the approaches adopted in Time-R1 [@timer1] and Time-MQA [@TimeMQA].
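As a concrete illustration, a numeric series can be rendered as plain text before being passed to the LLM tokenizer. The fixed-precision, comma-separated format below is only an assumed sketch; the exact serialization used by Time-R1 and Time-MQA may differ:

```python
def serialize_series(values, precision=2):
    """Render a numeric series as plain text so a standard LLM tokenizer
    can consume it alongside the textual context."""
    return ", ".join(f"{v:.{precision}f}" for v in values)

series = [12.0, 12.4, 13.1, 25.7, 13.0]
print(serialize_series(series))  # → 12.00, 12.40, 13.10, 25.70, 13.00
```

Longer serialized sequences directly inflate the token budget, which is why the statistics in Appendix tables report token counts separately for the series and the context.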

Stage 1: Supervised Fine-Tuning (SFT) {#app.:sft}
-------------------------------------

SFT is a process where a pre-trained model is further trained on a labeled dataset to adapt it to a specific task. This is achieved by minimizing the negative log-likelihood of the target output given the input data. In the context of time series reasoning, the model learns to generate intermediate rationales and final answers based on observed time series data and auxiliary context. Specifically, given a carefully curated dataset $\mathcal{D}=\{(X_i, C_i, R_i, y_i)\}_{i=1}^N$, the model's parameters $\theta$ are updated by minimizing the loss function: $$\mathcal{L}(\theta; \mathcal{D}) = -\frac{1}{N}\sum_{i=1}^N\log\pi_\theta((R_i, y_i)|(X_i, C_i))$$ where, for the $i$-th example, $X_i$ and $C_i$ denote the observed time series and auxiliary context, and $R_i$ and $y_i$ denote the annotated rationale and final answer.
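Minimizing this objective reduces to summing the negative log-probabilities of the target rationale and answer tokens. A minimal sketch with hypothetical per-token probabilities (mean-reduced over tokens for readability, whereas the equation averages over samples):

```python
import math

def sft_nll(token_probs):
    """Negative log-likelihood of the target (rationale, answer) tokens,
    averaged over tokens; token_probs[t] = pi_theta(target token t | prefix)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities assigned to the annotated trace.
probs = [0.9, 0.8, 0.95, 0.7]
print(round(sft_nll(probs), 4))
```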

Stage 2: Reinforcement Learning (RL) {#app.:rl}
------------------------------------

In reinforcement learning, we employ the group relative policy optimization (GRPO) algorithm [@grpo] to refine our post-SFT model using carefully designed reward functions. Given an input pair $(X,C)$, GRPO samples $N$ rationale-answer trajectories $\{(R_i, y_i)\}^N_{i=1}$ from the policy model $\pi_\theta$, and organizes them into groups $\{\mathcal{G}_b\}^B_{b=1}$. The group-relative advantage for trajectory $i \in \mathcal{G}_b$ is computed as $$\hat{A}_i = \mathcal{R}(R_i, y_i) - \frac{1}{|\mathcal{G}_b|} \sum_{j \in \mathcal{G}_b} \mathcal{R}(R_j, y_j),$$ where $\mathcal{R}$ combines both the correctness of the answer and the quality of the response format. The policy is then updated using the following objective: $$\begin{aligned}
    \mathcal{L}^{\text{GRPO}}(\theta) &= \frac{1}{N} \sum_{i=1}^N \Bigg[ \min \Bigg( \frac{\pi_\theta(R_i, y_i \mid X, C)}{\pi_{\theta_{\text{refer}}}(R_i, y_i \mid X, C)} \hat{A}_i, \\
    &\quad \text{clip}\left( \frac{\pi_\theta}{\pi_{\theta_{\text{refer}}}}, 1-\epsilon, 1+\epsilon \right) \hat{A}_i \Bigg) \notag - \beta \, D_{\text{KL}}\big( \pi_\theta \parallel \pi_{\theta_{\text{refer}}} \big) \Bigg]\end{aligned}$$ Here, $\pi_{\theta_{\text{refer}}}$ indicates the post-SFT model. $\epsilon$ and $\beta$ are hyperparameters that control the clipping threshold of the PPO update and the weight of the Kullback-Leibler (KL) divergence penalty, respectively.
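The advantage computation above is a mean-centering of rewards within each group, which can be sketched directly (the reward values are hypothetical):

```python
def group_advantages(rewards):
    """Group-relative advantage: each trajectory's reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Hypothetical rewards for 4 rollouts sampled from the same (X, C) input.
print(group_advantages([1.0, 0.0, 1.0, 0.5]))  # → [0.375, -0.625, 0.375, -0.125]
```

Because the advantage is relative within the group, rollouts that beat their siblings are reinforced even when absolute rewards are low, which is what makes the per-group sampling central to GRPO.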

Discussion of Tokenizing Time Series into Text Inputs
-----------------------------------------------------

employs a text-based strategy, representing time series as textual sequences. For a direct comparison, we also implement an embedding-based approach by training a time series encoder from scratch following the ChatTS [@ChatTS] architecture, which is conceptually similar to OpenTSLM [@opentslm] and Time-LLM [@Time-LLM]. The same base model (Qwen2.5-Instruct-14B) is used in both settings to ensure a fair comparison. Results across all four tasks are summarized in Table [\[tab:ts\_vs\_embed\]](#tab:ts_vs_embed){reference-type="ref" reference="tab:ts_vs_embed"}, while inference efficiency under identical test conditions is provided in Table [5](#tab:efficiency_ts_vs_embed){reference-type="ref" reference="tab:efficiency_ts_vs_embed"}.

Our results indicate that a simple MLP-style encoder, similar to those used in ChatTS, OpenTSLM, and Time-LLM, does not provide a clear advantage over text serialization. Although the encoder improves Task 1 accuracy in-distribution, it leads to substantially lower success rates for Tasks 2 and 4, suggesting interference with the base model's instruction-following ability. A potential reason for the performance gap is the higher data requirement of the embedding-based method: training a time series encoder from scratch is data-intensive, and our 2.3K CoT samples may be inadequate.

The embedding-based design also slows inference by a factor of three because it introduces an additional neural path that cannot benefit from the kernel-level optimizations available in vLLM. For memory usage, the text-based approach requires a larger KV cache due to its longer tokenized representation; however, this overhead remains manageable and scales reasonably with input length. In our comparison in Table [5](#tab:efficiency_ts_vs_embed){reference-type="ref" reference="tab:efficiency_ts_vs_embed"}, the average time series length exceeds 300 time steps, yet the peak GPU memory remains below 52GB.

Taken together, the text-based strategy is both effective and efficient for the settings considered in this work, where sequence lengths and input dimensionality are modest. While dedicated time series encoders may offer advantages for higher-dimensional or extremely long sequences, the field currently lacks a widely adopted pretrained encoder analogous to ViT [@VIT] for vision. Developing such a general-purpose time series encoder is a promising direction for future work.

width=,center `\renewcommand{\arraystretch}{1.2}`{=latex}

  --------------------------------------------------------------------- ------ -------------------------------- ------ -------------------------------- ------ -------------------------------- ------ ------------------------------- ------- ------------------------------- -------- ------------------------------- ------ -------------------------------- ------ --------------------------------
                                                                                                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                                                                                       
   (lr)2-3(lr)4-5 (lr)6-7(lr)8-9 (lr)10-11(lr)12-13 (lr)14-15(lr)16-17   ACC     [SR%]{style="color: mygray"}    ACC     [SR%]{style="color: mygray"}    ACC     [SR%]{style="color: mygray"}    ACC    [SR%]{style="color: mygray"}     MAE    [SR%]{style="color: mygray"}     MAE     [SR%]{style="color: mygray"}    ACC     [SR%]{style="color: mygray"}    ACC     [SR%]{style="color: mygray"}
                               Base Model                                54.0   [100.0]{style="color: mygray"}   54.5   [100.0]{style="color: mygray"}   30.5   [100.0]{style="color: mygray"}   31.4   [99.8]{style="color: mygray"}           [88.0]{style="color: mygray"}            [64.1]{style="color: mygray"}          [100.0]{style="color: mygray"}          [100.0]{style="color: mygray"}
                            Text-based (Ours)                                   [100.0]{style="color: mygray"}          [99.3]{style="color: mygray"}           [100.0]{style="color: mygray"}          [98.2]{style="color: mygray"}           [90.2]{style="color: mygray"}            [84.9]{style="color: mygray"}          [100.0]{style="color: mygray"}          [99.0]{style="color: mygray"}
                             Embedding-based                                    [92.9]{style="color: mygray"}           [97.7]{style="color: mygray"}           [68.7]{style="color: mygray"}           [62.9]{style="color: mygray"}   22.15   [75.1]{style="color: mygray"}   162.23   [58.4]{style="color: mygray"}   18.2   [59.5]{style="color: mygray"}    20.0   [51.9]{style="color: mygray"}
  --------------------------------------------------------------------- ------ -------------------------------- ------ -------------------------------- ------ -------------------------------- ------ ------------------------------- ------- ------------------------------- -------- ------------------------------- ------ -------------------------------- ------ --------------------------------

[\[tab:ts\_vs\_embed\]]{#tab:ts_vs_embed label="tab:ts_vs_embed"}

width=, center `\renewcommand{\arraystretch}{1.2}`{=latex}

::: {#tab:efficiency_ts_vs_embed}
      **Method**       **Samples**   **Avg Inference Speed per Sample (s)**   **Peak GPU Memory (GB)**
  ------------------- ------------- ---------------------------------------- --------------------------
   Text-based (Ours)       100                        7.07                             51.77
    Embedding-based        100                       22.01                             32.04

  : Inference efficiency comparison.
:::

[\[tab:efficiency\_ts\_vs\_embed\]]{#tab:efficiency_ts_vs_embed label="tab:efficiency_ts_vs_embed"}

Time Series Task-grounded Reward Design {#app.:reward_function}
=======================================

All samples are required to follow a *basic format reward* $\mathcal{R}_{\text{format}}$, which checks whether outputs comply with the schema `<think>...</think><answer>...</answer>`. For discrete-output tasks (scenario understanding, causality discovery, decision-making), correctness is directly measurable by $$\mathcal{R}_{\text{discrete}} =
    \begin{cases}
    1, & \hat{y} = y, \\
    0, & \text{otherwise}.
    \end{cases}$$

For the sequence-output task (event-aware forecasting), we define a continuous reward based on the exponential decay of the mean absolute error (MAE): $$\text{MAE} = \frac{1}{T}\sum_{t=1}^T \big|\hat{y}_{t} - y_{t}\big|,$$ $$\mathcal{R}_{\text{sequence}} =
    \begin{cases}
    0, & \text{if } \; \mathrm{len}(\hat{y}) \neq \mathrm{len}(y), \\
    \exp\!\left(-\alpha \cdot \text{MAE}\right) + \mathcal{R}_{\text{count}}, & \text{if } \; \mathrm{len}(\hat{y}) = \mathrm{len}(y),
    \end{cases}$$ where $\mathcal{R}_{\text{count}} = 0.1$ is a horizon-matching bonus. The final per-sample reward integrates all components as: $$\mathcal{R}_i = \lambda \mathcal{R}_{\text{format}} + (1-\lambda)\mathcal{R}_{\text{task}},$$ where $\mathcal{R}_{\text{task}}$ refers to $\mathcal{R}_{\text{discrete}}$ for discrete-output tasks and $\mathcal{R}_{\text{sequence}}$ for the sequence-output task. We set $\lambda = 0.1$ in all experiments.
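A minimal sketch of this reward design follows; $\mathcal{R}_{\text{count}}$ and $\lambda$ take the values stated above, while the value of $\alpha$ (swept in the sensitivity analysis) and the exact format regex are our assumptions:

```python
import math
import re

ALPHA, R_COUNT, LAM = 0.1, 0.1, 0.1   # ALPHA is a hypothetical choice; R_COUNT and LAM follow the text

def format_reward(response):
    """Basic format reward: 1 if the response follows the
    <think>...</think><answer>...</answer> schema, else 0."""
    pattern = r"<think>.*</think>\s*<answer>.*</answer>"
    return 1.0 if re.fullmatch(pattern, response, re.DOTALL) else 0.0

def discrete_reward(y_hat, y):
    """Exact-match reward for the discrete-output tasks."""
    return 1.0 if y_hat == y else 0.0

def sequence_reward(pred, target, alpha=ALPHA):
    """Exponential-decay-of-MAE reward for event-aware forecasting,
    plus a horizon-matching bonus when the predicted length is correct."""
    if len(pred) != len(target):
        return 0.0
    mae = sum(abs(p - t) for p, t in zip(pred, target)) / len(target)
    return math.exp(-alpha * mae) + R_COUNT

def total_reward(response, task_reward, lam=LAM):
    """Final per-sample reward combining format and task components."""
    return lam * format_reward(response) + (1 - lam) * task_reward

resp = "<think>baseline plus event lift</think><answer>[10, 12]</answer>"
print(total_reward(resp, sequence_reward([10.0, 12.0], [10.0, 12.0])))
```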

Reward Sensitivity Analysis
---------------------------

In this section, we conduct controlled sensitivity sweeps over the key reward components ($\alpha$, $R_{\text{count}}$, $\lambda$) and evaluate their effects on training stability. All sweeps are trained for 1 epoch. For $\alpha$ and $R_{\text{count}}$, which only affect the reward design of Task 3, we report the Task 3 MAE curves. For $\lambda$, we track the proportion of responses with a positive task reward $\mathcal{R}_{\text{task}}$, where $\mathcal{R}_{\text{task}}$ denotes $\mathcal{R}_{\text{discrete}}$ for discrete-output tasks and $\mathcal{R}_{\text{sequence}}$ for the sequence-output task.

**$\alpha$ and $R_{\text{count}}$ exhibit low sensitivity.** Varying these coefficients leads to only mild changes in the MAE curves, indicating that the RL process remains stable under a broad range of settings. **Format reward is essential.** When the format component is disabled ($\lambda = 0$), the model collapses after 70 steps, showing a sharp drop in the proportion of responses with positive $\mathcal{R}_{\text{task}}$. This confirms that format consistency is necessary for stable RL optimization.

Overall, the current hyperparameter choices yield a favorable trade-off between stability and performance, and these analyses provide clear guidance for reproducing and tuning the reward design.

![Reward sensitivity to $\alpha$: varying the decay coefficient leads to only mild changes in the Task 3 MAE curves.](figs/rebuttal/fig_hparam_alpha.pdf "fig:"){#fig:reward_sensitivity width="\\linewidth"} [\[fig:sensitivity\_alpha\]]{#fig:sensitivity_alpha label="fig:sensitivity_alpha"}

![Reward sensitivity to $R_{\text{count}}$: the horizon-matching bonus has little effect on training stability.](figs/rebuttal/fig_hparam_rcount.pdf "fig:"){#fig:reward_sensitivity width="\\linewidth"} [\[fig:sensitivity\_rcount\]]{#fig:sensitivity_rcount label="fig:sensitivity_rcount"}

![Reward sensitivity to $\lambda$: disabling the format reward ($\lambda = 0$) causes training collapse.](figs/rebuttal/fig_hparam_lambda.pdf "fig:"){#fig:reward_sensitivity width="\\linewidth"} [\[fig:sensitivity\_lambda\]]{#fig:sensitivity_lambda label="fig:sensitivity_lambda"}

Training Configuration {#app.:training_configuration}
======================

Our training process follows a two-stage procedure: supervised fine-tuning (SFT) followed by reinforcement learning (RL).

In the SFT stage, we fine-tune Qwen2.5-7B-Instruct [@qwen2.5] with full-parameter updates for 1 epoch, using DeepSpeed ZeRO-3 [@rasley2020deepspeed] for efficient training. Fine-tuning is performed in BF16 precision with FlashAttention-2 enabled to accelerate attention operations. The maximum sequence length is 8192, the per-device batch size is 1, and gradients are accumulated over 32 steps. Optimization uses a peak learning rate of $1.0 \times 10^{-5}$ with a cosine learning rate scheduler and a warm-up ratio of 0.1. All SFT training is carried out with the LLaMA-Factory repository [@zheng2024llamafactory] on a system equipped with a single NVIDIA H200-140G GPU.

In the RL stage, we continue training from the Stage 1 checkpoint using the verl repository [@sheng2024hybridflow] and FSDP [@zhao2023pytorchfsdp] under BF16 precision. The maximum sequence length is reduced to 2048, and training is performed across 8 NVIDIA A100-80G GPUs. The training batch size is 128, with RL minibatches of size 32 and per-GPU micro-batches of 8. Gradient clipping is applied with a maximum global norm of 3.0, and a KL-penalty coefficient of 0.04 regularizes the model. For rollouts, 8 trajectories are collected per update, with a sampling temperature of 0.7. The learning rate is set to $1.0 \times 10^{-6}$, and training continues for 3 epochs.
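For quick reference, the hyperparameters above can be consolidated into a single schematic summary. The key names below are illustrative and do not map one-to-one onto the LLaMA-Factory or verl configuration schemas.

```python
# Schematic summary of the two-stage training setup (illustrative key names).
SFT_CONFIG = {
    "base_model": "Qwen2.5-7B-Instruct",
    "epochs": 1,
    "precision": "bf16",
    "max_seq_len": 8192,
    "per_device_batch_size": 1,
    "gradient_accumulation_steps": 32,
    "learning_rate": 1.0e-5,
    "lr_scheduler": "cosine",
    "warmup_ratio": 0.1,
    "parallelism": "DeepSpeed ZeRO-3",
    "hardware": "1x NVIDIA H200-140G",
}

RL_CONFIG = {
    "init_from": "stage1_sft_checkpoint",
    "epochs": 3,
    "precision": "bf16",
    "max_seq_len": 2048,
    "train_batch_size": 128,
    "minibatch_size": 32,
    "micro_batch_size_per_gpu": 8,
    "grad_clip_norm": 3.0,
    "kl_coef": 0.04,
    "rollouts_per_update": 8,
    "temperature": 0.7,
    "learning_rate": 1.0e-6,
    "parallelism": "FSDP",
    "hardware": "8x NVIDIA A100-80G",
}

# Effective SFT batch size = per-device batch x gradient accumulation steps.
effective_sft_batch = (SFT_CONFIG["per_device_batch_size"]
                       * SFT_CONFIG["gradient_accumulation_steps"])
```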

Reinforcement Learning Training Dynamics
========================================

This section reports the full reinforcement learning (RL) training dynamics in Figure [23](#fig:training_curve){reference-type="ref" reference="fig:training_curve"}. The plots track both the overall reward and the format correct ratio for the training split and the OOD testbed throughout the RL stage. To reduce evaluation cost, we cap the maximum response length at 2048 tokens during training. As a result, the performance in these curves may appear slightly lower than the full results reported in Table [\[tab:1\]](#tab:1){reference-type="ref" reference="tab:1"}, which is expected. Overall, the curves provide a clear view of the optimization behavior and demonstrate that RL training of TimeOmni-1 remains stable and continues to improve over training steps.

![RL training dynamics: reward progression on the training split.](figs/rebuttal/fig_training_curve_training_reward.png){#fig:training_curve width="\\linewidth"}

![RL training dynamics: format correct ratio on the training split.](figs/rebuttal/fig_training_curve_training_format.png){#fig:training_curve width="\\linewidth"}

![RL training dynamics: reward progression on the OOD testbed.](figs/rebuttal/fig_training_curve_test_reward.png){#fig:training_curve width="\\linewidth"}

![RL training dynamics: format correct ratio on the OOD testbed.](figs/rebuttal/fig_training_curve_test_format.png){#fig:training_curve width="\\linewidth"}

Scaling of Training Dataset
===========================

In this section, we analyze the scaling behavior of TimeOmni-1 by varying the amount of training data used in Stage 1 (SFT) and Stage 2 (RL). All experiments strictly follow the training configurations reported in Appendix [13](#app.:training_configuration){reference-type="ref" reference="app.:training_configuration"}. All Stage 1 (SFT) data scaling runs are trained for 1 epoch, and all Stage 2 (RL) data scaling runs are trained for 3 epochs. Figure [24](#fig:sft_data_scaling){reference-type="ref" reference="fig:sft_data_scaling"} presents the scaling behavior of Stage 1 (SFT) as we vary the amount of CoT-Data (25%, 50%, 75%, 100%). Figure [25](#fig:rl_data_scaling){reference-type="ref" reference="fig:rl_data_scaling"} presents the scaling behavior of Stage 2 (RL) as we increase the amount of RL-Data (25%, 50%, 75%, 100%).

![Scaling of Stage 1 (SFT) Training Dataset.](figs/rebuttal/sft_scaling_grid.png){#fig:sft_data_scaling width="1\\linewidth"}

![Scaling of Stage 2 (RL) Training Dataset.](figs/rebuttal/rl_scaling_grid.png){#fig:rl_data_scaling width="1\\linewidth"}

Across all four tasks and both ID/OOD testbeds, the primary ACC/MAE metrics follow a clear scaling pattern: larger training sets monotonically improve ACC on Tasks 1, 2, and 4 and reduce the MAE on Task 3. Notably, the Task 3 MAE does not perfectly follow a scaling trend during SFT, which suggests that sample quality may influence the SFT phase more strongly than sample quantity. However, once the model enters Stage 2, where it performs its own exploration and refinement, the Task 3 MAE follows a clearer scaling pattern and continues to decrease. In Stage 1, the SR for Tasks 1, 2, and 4 is already saturated near 100%; in Stage 2, RL continues to improve the SR for Task 3. Together, these curves show that both training stages benefit from increased data.

Additional Evaluation on External Time Series QA Benchmarks
===========================================================

To further validate the generalization ability of TimeOmni-1, we conduct experiments on **three external time series QA benchmarks** spanning **13 tasks and 3,406 samples**:

-   **MTBench** [@MTBench] (real-world time series; 4 tasks; 2,380 samples),

-   **TimeSeriesExam** [@TimeSeriesExam] (synthetic time series; 5 tasks; 746 samples),

-   **CaTS-Bench** [@CaTSBench] (real-world time series; 4 tasks; 280 samples).

We evaluate TimeOmni-1 alongside two baselines: (1) **Qwen2.5-Instruct-7B**, the base model of TimeOmni-1, and (2) the **original results** reported in each benchmark's paper.


  **Method**                              **FinanceQA (7-Day)**   **WeatherQA (7-Day)**   **FinanceQA (30-Day)**   **WeatherQA (14-Day)**   **AVG**
  -------------------------------------- ----------------------- ----------------------- ------------------------ ------------------------ ---------
  ***Num. Samples***                               484                     666                     523                      707               --
  **Base Model (Qwen2.5-Instruct-7B)**            62.2                                                                      36.2             49.2
  **Original Paper (GPT-4o)**                                             41.7                     52.8                                    
                                                                                                                                           

[\[tab:mtbenchts\]]{#tab:mtbenchts label="tab:mtbenchts"}


  **Method**                              **Anomaly Detection**   **Causality**   **Noise Understanding**   **Pattern Recognition**   **Similarity Analysis**   **AVG**
  -------------------------------------- ----------------------- --------------- ------------------------- ------------------------- ------------------------- ---------
  ***Num. Samples***                               108                 72                   84                        362                       120               --
  **Base Model (Qwen2.5-Instruct-7B)**                                                                                                                         
  **Original Paper (Phi-3.5)**                    28.0                                     26.0                      47.0                      45.0              39.6
                                                                      37.5                                                                                     

[\[tab:timeseriesexam\]]{#tab:timeseriesexam label="tab:timeseriesexam"}


  **Method**                                 **Caption Retrieval**   **TimeSeries Retrieval**   **Amplitude Comparison**   **Mean Comparison**   **AVG**
  ----------------------------------------- ----------------------- -------------------------- -------------------------- --------------------- ---------
  ***Num. Samples***                                  100                      100                         40                      40              --
  **Base Model (Qwen2.5-Instruct-7B)**                                                                                                          
  **Original Paper (LLaVA v1.6 Mistral)**            44.0                      29.0                       43.0                    35.0            37.8
                                                                                                                                                

[\[tab:catsbench\]]{#tab:catsbench label="tab:catsbench"}

MTBench is a multimodal time series QA benchmark in which the auxiliary text $C$ contains task-relevant information (e.g., external events) rather than only task instructions. This setting closely matches the formulation of TimeOmni-1, where the joint distribution $$(R, y) \sim p_\theta(R, y \mid X, C) 
= p_\theta(R \mid X, C)\, p_\theta(y \mid R, X, C)$$ ensures that both the reasoning path $R$ and the final answer $y$ depend on the interaction between the time series input $X$ and the contextual information $C$. Owing to this strong alignment, TimeOmni-1 achieves the largest gains on MTBench, even surpassing the originally reported GPT-4o performance.

By contrast, the textual input in TimeSeriesExam and CaTS-Bench serves primarily as task instructions, meaning that the reasoning is dominated by the time series modality and is less aligned with the multimodal reasoning design of TimeOmni-1. Even under this more restricted setup, TimeOmni-1 outperforms the base model, demonstrating that it has acquired time series reasoning abilities that transfer robustly across domains and tasks.

Case Study
==========

[^1]:  <https://github.com/AntonGuan/TimeOmni-1>

[^2]: ![image](figs/rebuttal/huggingface_logo.png){height="1em"} <https://huggingface.co/anton-hugging/TimeOmni-1-7B>

[^3]: Correspondence to: M. Jin $<$mingjinedu\@gmail.com$>$ and S. Pan $<$s.pan\@griffith.edu.au$>$
