---
abstract: |
  Understanding time series is crucial for its application in real-world scenarios. Recently, large language models (LLMs) have been increasingly applied to time series tasks, leveraging their strong language capabilities to enhance various applications. However, research on multimodal LLMs (MLLMs) for time series understanding and reasoning remains limited, primarily due to the scarcity of high-quality datasets that align time series with textual information. This paper introduces ChatTS, a novel MLLM designed for time series analysis. ChatTS treats time series as a modality, similar to how vision MLLMs process images, enabling it to perform both understanding and reasoning with time series. To address the scarcity of training data, we propose an attribute-based method for generating synthetic time series with detailed attribute descriptions. We further introduce Time Series Evol-Instruct, a novel approach that generates diverse time series Q&As, enhancing the model's reasoning capabilities. To the best of our knowledge, ChatTS is the first MLLM that takes multivariate time series as input for understanding and reasoning, which is fine-tuned exclusively on synthetic datasets. We evaluate its performance using benchmark datasets with real-world data, including six alignment tasks and four reasoning tasks. Our results show that ChatTS significantly outperforms existing vision-based MLLMs (e.g., GPT-4o) and text/agent-based LLMs, achieving a 46.0% improvement in alignment tasks and a 25.8% improvement in reasoning tasks. We have open-sourced the source code, model checkpoint and datasets at <https://github.com/NetManAIOps/ChatTS>.
author:
- Zhe Xie
- Zeyan Li
- Xiao He
- Longlong Xu
- Xidao Wen
- Tieying Zhang
- Jianjun Chen
- Rui Shi
- Dan Pei
bibliography:
- ref.bib
title: |
  ChatTS: Aligning Time Series with LLMs via Synthetic Data\
  for Enhanced Understanding and Reasoning
---

```{=latex}
\newcommand{\modelname}{ChatTS}
```
```{=latex}
\newcommand{\todo}[1]{\textcolor{red}{TODO: #1}}
```
```{=latex}
\newcommand{\zm}[1]{\textcolor{green}{zm: #1}}
```
Introduction {#sec:introduction}
============

Multimodal large language models (MLLMs) have recently achieved significant progress in vision-language tasks, showing exceptional performance even in scenarios requiring complex understanding and reasoning [@liu2024visual; @bai2023qwen; @li2023blip; @yin2024survey]. However, this success has not been replicated in the time series domain. Even though some studies have attempted to integrate LLMs with time series, such as TimeLLM [@jin2023time], they usually only focus on specific classical time series tasks (e.g., forecasting) rather than understanding, reasoning, and dialogue based on time series attributes, as well as integrating into existing LLM workflows. Moreover, recent studies indicate that LLMs still struggle with zero-shot reasoning about time series [@merrill2024language]. This is particularly significant because time series analysis, widely applied in domains such as electricity [@wolde2006electricity], healthcare [@penfold2013use], traffic [@li2015trend], weather [@lim2021time], and finance [@sezer2020financial], frequently requires understanding and reasoning about time series patterns. Therefore, the ability to reason using both text and time series data is a critical capability for MLLMs, enabling them to support human decision-making by providing natural language explanations that align with human logic. Figure [1](#fig:tsqa_intro){reference-type="ref" reference="fig:tsqa_intro"} illustrates such an example in an AIOps [@zhong2023survey] scenario where understanding and reasoning about multivariate system monitoring time series are achieved through natural language dialogue, thereby improving the diagnostic and troubleshooting process.

![Example of an AIOps application of time series-related dialogue.](figures/tsqa_intro.png){#fig:tsqa_intro width="\\linewidth"}

Existing LLM-based methods for understanding and reasoning about time series attributes can be broadly categorized into text-based, vision-based, and agent-based approaches. Text-based methods directly use LLMs by structuring historical observations as raw text [@alnegheimish2024large]. However, these methods are often constrained by the limitation of prompt length and generally perform poorly in understanding the global features of time series compared to vision-based methods. Vision-based methods utilize vision MLLMs, which accept plot figures of time series data [@merrill2024language], such as GPT-4o [@gpt4o] or Qwen-VL [@bai2023qwen]. While these methods can better capture global features, they are limited by the resolution of the plotted figures and face challenges in accurately interpreting the details. Recent works also show how agents can leverage time series analysis tools to interact with LLMs [@dbot; @rcagent]. However, the ability of agents to understand time series is restricted by the functionality of the tools.

![image](figures/tsqa_methods.png){width="0.95\\linewidth"}

Therefore, there is a strong need for TS-MLLM, an MLLM that can natively handle the time series modality, akin to how vision MLLMs process images. Such models have the potential to unlock valuable insights from time series by providing intuitive, question-driven analysis capabilities. Specifically, TS-MLLMs can capture global and local features and relationships between multivariate time series (MTS), areas where existing LLMs and MLLMs have struggled. By incorporating textual modalities as input, these models can broaden their applicability and better contextualize time series data, aligning the analysis with user queries. If successful, TS-MLLMs could perform novel tasks such as citing patterns and events in time series as evidence for observations and inferences, drawing interpretable conclusions from complex dynamical systems, and recognizing and responding to temporal patterns [@merrill2024language].

However, developing TS-MLLMs with effective understanding and reasoning ability for time series attributes faces several core challenges. First, multimodal time series data, especially language-time series pair data, is extremely scarce [@merrill2024language; @chow2024towards; @jin2024position]. Unlike modalities such as images and audio, almost no research focuses on the language alignment of time series. As a result, there is a significant lack of time-series + text data, which makes the construction of time-series dialogue and reasoning datasets challenging. This is fundamental for TS-MLLMs to develop temporal understanding and reasoning capabilities. Second, time-series data contains abundant shape and numerical attributes (*i.e.*, the types of local fluctuations and their amplitudes). Therefore, a diverse range of text is needed to comprehensively describe these attributes while ensuring accuracy to achieve effective alignment. Third, real-world time-series data are usually variable in length, multivariate, and of uncertain quantity. The correlations among MTS are often a focus of attention (as illustrated in Figure [1](#fig:tsqa_intro){reference-type="ref" reference="fig:tsqa_intro"}). In MLLMs for other modalities, such as images, few methods emphasize the relationships between multiple samples. However, such relationships are indispensable for understanding and reasoning about time series. Finally, there is a lack of evaluation data and methods for TS-MLLMs. Developing comprehensive and reasonable datasets and methodologies to evaluate their performance is necessary.

To address the challenges above, we innovatively propose a method to fine-tune a pre-trained LLM for TS-MLLMs solely using synthetic time series and text data. An important reason is that synthetic time series data has already shown good results for training time series models [@fu2024synthetic]. However, current methods are difficult to apply directly because time series-text alignment tasks require both *precise* and *diverse* time series attribute descriptions. Therefore, we propose an attribute-based method for generating synthetic time series and precise text attributes to facilitate the modal alignment of time series with LLMs. Compared with existing studies on synthetic time-series generation [@fu2024synthetic; @zhang2018generative], the proposed attribute-based time-series generation method provides precise textual attributes for each detailed pattern of the time series, laying a foundation for generating diverse text data. Furthermore, to equip the MLLM with enhanced time series understanding and reasoning capabilities, we propose the Time Series Evol-Instruct (TSEvol) algorithm. By combining diverse attributes and tasks, TSEvol generates varied time series Q&A datasets through successive evolutions, thereby enhancing the model's overall performance. To handle multivariate time-series inputs and fully preserve semantic information, we propose ChatTS, trained using the generated synthetic datasets. ChatTS employs a context-aware time-series encoder capable of encoding time series of (theoretically) arbitrary length and quantity while retaining their original numerical information. Finally, to support comprehensive evaluation regarding both language alignment and time series reasoning, we have collected evaluation datasets comprising both real and synthetic time series. These datasets include both alignment and reasoning tasks with uni/multivariate time series, ensuring a thorough assessment of the model's performance.

**Our contributions.** This paper makes the following contributions.

-   We propose to align LLMs with time series using attribute-based synthetic time series and text data. Building on this, we further introduce Time Series Evol-Instruct (TSEvol), an algorithm that generates diverse, accurate, and multimodal training datasets of time series and text entirely through synthetic data generation.

-   We propose a context-aware TS-MLLM, ChatTS, designed for variable-length, multivariate time series input and trained using the generated synthetic data. To the best of our knowledge, ChatTS is *the first TS-MLLM with multivariate time series as input for understanding and reasoning about time series attributes*.

-   We have collected evaluation datasets containing real-world time series data, including six alignment tasks and four reasoning tasks. Evaluation results across multiple datasets demonstrate that ChatTS significantly outperforms baseline models, including GPT-4o, in both time series alignment and reasoning tasks.

-   We have open-sourced the model, source code, and evaluation datasets to support future research: <https://github.com/NetManAIOps/ChatTS>.

Preliminary and Motivation {#sec:motivation}
==========================

Problem Definition
------------------

The task of a TS-MLLM is to generate text-based responses based on the input textual query and MTS array. Given a set of time series $\mathcal{T} = \{T_1, T_2, \dots, T_n\}$, where each $T_i = \{t_{i,1}, t_{i,2}, \dots, t_{i,m_i}\}$ represents a sequence of $m_i$ observed values over time for the $i$-th metric, and a natural language question $Q$, the goal is to generate an answer $A$ that captures relevant patterns or relationships across $\mathcal{T}$ based on the context of $Q$. Formally, it can be defined as follows:

-   **Input:**

    -   A set of time series $\mathcal{T} = \{T_1, T_2, \dots, T_n\}$, where $T_i \in \mathbb{R}^{m_i}$ represents the values of the $i$-th metric over $m_i$ time points.

    -   A natural language query $Q$ specifies the information of interest within the time series data.

-   **Output:** A text answer $A$ derived from the $\mathcal{T}$ analysis, providing insights based on $Q$.

The task of TS-MLLM can be expressed as a function: $$f(Q, \mathcal{T}) \rightarrow A,$$ where $f$ denotes the model or algorithm responsible for interpreting the text query $Q$ and generating the text answer $A$ by analyzing relevant patterns and relationships across the time series in $\mathcal{T}$.

Existing Methods
----------------

Although mainstream LLMs currently do not support the direct input of time series modality data, time series information can be provided to LLMs through alternative methods that enable basic understanding and reasoning about time series attributes, as shown in Figure [\[fig:tsqa\_methods\]](#fig:tsqa_methods){reference-type="ref" reference="fig:tsqa_methods"}. Existing approaches can be broadly categorized into text-based, vision-based, and agent-based, each with distinct limitations.

Text-based methods encode time series values as raw text [@alnegheimish2024large]. However, these methods are constrained by the length of prompts, limiting their global analysis capabilities and often resulting in an incomplete understanding of the data context (refer to Section [4](#sec:evaluation){reference-type="ref" reference="sec:evaluation"}).

Vision-based approaches, which use visual representations of time series data (e.g., time series plots) processed by vision MLLMs [@gpt4o; @bai2023qwen], may face challenges in accurately capturing detailed information in time series, resulting in lower accuracy for data-intensive tasks and high computational overhead (refer to Section [4](#sec:evaluation){reference-type="ref" reference="sec:evaluation"}).

Agent-based methods employ a reasoning and action strategy, breaking down complex tasks into a sequence of thoughts, observations, and actions conducted by external tools to analyze time series. While potentially more flexible, this approach is heavily dependent on expert knowledge and the effectiveness of the tools; it is also token-intensive and time-consuming, often requiring extensive token chains to handle MTS data. Additionally, hallucination becomes a significant problem [@yoffe2024debunc] as the chains grow longer, reducing reliability in complex analytical tasks.

Time Series Multimodal LLM
--------------------------

TS-MLLM is a new type of MLLM that aims at overcoming the limitations of existing methods by *natively* integrating both textual and time series inputs (see Figure [\[fig:tsqa\_methods\]](#fig:tsqa_methods){reference-type="ref" reference="fig:tsqa_methods"}). It can process multiple time series data and textual descriptions, enabling a unified analysis that captures complex, multivariate relationships. Unlike previous methods, it does not rely on lengthy token chains or visual representations, thereby reducing computational overhead and mitigating issues with hallucination. Through the alignment of time series and text, TS-MLLM can perform both global and local analysis of the shape and numerical information of time series. This capability allows it to achieve higher accuracy and greater potential than existing methods.

Methodology {#sec:method}
===========

Overview
--------

![Overview of ChatTS.](figures/overview.png){#fig:overview width="\\linewidth"}

Due to the scarcity of high-quality datasets that align time series with textual information, we propose to generate synthetic text-time series pairs for model training. Synthetic data is a common approach when there is a lack of sufficient real training data, and its effectiveness has been well validated in various fields [@fu2024synthetic; @savage2023synthetic; @luo2023time]. However, as discussed earlier, "time series + text" data for TS-MLLM requires sufficient accuracy to ensure alignment precision, comprehensive coverage of time series attributes to guarantee effective multimodal alignment, and task diversity in the text to enhance QA and reasoning abilities. Unfortunately, existing time series generation methods [@fu2024synthetic; @zhang2018generative] fail to achieve these goals. A key reason is that we need a *diverse* set of time series and *precise, detailed* descriptions of time series patterns. Therefore, in this paper, we propose an attribute-based method to generate time series + text data, as illustrated in Figure [2](#fig:overview){reference-type="ref" reference="fig:overview"}:

![image](figures/feature_generator.png){width="1.0\\linewidth"}

-   **Attribute Selector** (Section [3.2](#sec:time_series_generator){reference-type="ref" reference="sec:time_series_generator"}): To produce highly controllable time-series data with precise attributes, we use a detailed feature set to describe time series. These attributes are aligned with real-world settings through LLM-based selection.

-   **Attribute-Based Time Series Generator** (Section [3.2](#sec:time_series_generator){reference-type="ref" reference="sec:time_series_generator"}): Construct time series that correspond exactly to the attribute pool using a rule-based approach.

-   **Time Series Evol-Instruct** (Section [3.3](#sec:evol_instruct){reference-type="ref" reference="sec:evol_instruct"}): A novel Time Series Evol-Instruct module for creating large, diverse, and accurate datasets of time-series and text question-answering pairs for complex reasoning.

-   **Model Design** (Section [3.4](#sec:mllm){reference-type="ref" reference="sec:mllm"}): To handle MTS, we design a context-aware MLLM encoding for multiple time series input, along with a value-preserved time series encoding method.

-   **Model Training** (Section [3.5](#sec:model_training){reference-type="ref" reference="sec:model_training"}): Large-scale alignment training followed by supervised fine-tuning (SFT) is conducted to achieve language alignment and improve time series-related reasoning ability.

As shown in Figure [2](#fig:overview){reference-type="ref" reference="fig:overview"}, the ChatTS framework integrates synthetic data generation and model training into a pipeline that ensures effective understanding of and reasoning about time series attributes using only synthetic data. First, building on the attribute-based time series generation and TSEvol described in Sections 3.2 and 3.3, the pipeline generates synthetic data that captures intricate numerical and textual information for effective multimodal alignment. This data is then used to train the model (Section 3.4), where the context-aware time series encoder preserves the time series values while aligning attributes with textual semantics accurately. Finally, through alignment training and SFT (Section 3.5), ChatTS achieves precise alignment between time series encoding and text embeddings, along with enhanced reasoning capabilities.

Attribute-Based Time Series Generator {#sec:time_series_generator}
-------------------------------------

Diverse time series and precise, detailed textual attribute descriptions are essential to achieve accurate time series language alignment. Time series have rich pattern attributes, which can be roughly categorized into trend, periodicity, and remainder [@rb1990stl; @DBLP:journals/pvldb/HeLTWL23]. Much existing research on the generation of time series [@fu2024synthetic; @fons2024evaluating] also adopts similar approaches to classify these attributes. Therefore, following existing studies, we classify time series attributes into four major categories, Trend, Periodicity, Noise, and Local Fluctuation, to construct the corresponding attribute set for time series.

Based on this, we propose an attribute selector and an attribute-based time series generator that produces synthetic time series data (see Figure [\[fig:feature\_generator\]](#fig:feature_generator){reference-type="ref" reference="fig:feature_generator"}). First, we define an "All Attribute Set", which includes many specific attributes under different attribute categories. The All Attribute Set includes 4 types of Trend, 7 types of Seasonality, 3 types of Noise, and 19 types of Local Fluctuation. The complete list can be found in the source code. Different attributes within the same category can be combined. A time series can include multiple segments of trends and several local fluctuations by combining the same type of attributes (see Figure [\[fig:feature\_generator\]](#fig:feature_generator){reference-type="ref" reference="fig:feature_generator"}). Additionally, by combining sine waves, we can generate a diverse range of periodic fluctuation patterns. Therefore, the proposed time series generator can theoretically generate an infinite number of different time series, ensuring the richness of attributes. We also introduce a GPT Selector. Specifically, when generating an attribute set for a time series, we randomly sample a metric from a large "Metric Set" that contains 567 predefined metric names from real-world scenarios and use GPT to choose an *attribute subset* from the All Attribute Set, based on the actual physical meaning of the metric and the predefined scenario. This helps align time series with real-world physical meanings.

Then, the *Attribute Sampler* randomly samples a combination of attributes from the Attribute Subset. It also assigns specific numerical values, like position and amplitude, based on rules and constraints from the GPT Selector. These details are stored in the "Attribute Pool", which records all the detailed information about a time series. The *Time Series Generator* finally creates time series arrays that *exactly* match the attributes from the pool in a rule-based manner (more details can be found in the source code). This process allows us to generate diverse synthetic time series with precise attribute descriptions.
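The generation process above can be sketched as follows. This is a minimal illustration, not the paper's actual generator: the attribute names, parameter ranges, and pool format are hypothetical examples of how a rule-based generator can compose a series that exactly matches its recorded attributes.

```python
import numpy as np

def generate_series(length=256, seed=0):
    """Illustrative attribute-based generation: sample attribute parameters,
    compose the series from them, and record everything in an attribute pool."""
    rng = np.random.default_rng(seed)
    t = np.arange(length)

    # Trend attribute: linear increase with a sampled slope
    slope = rng.uniform(0.01, 0.05)
    trend = slope * t

    # Seasonality attribute: sine wave with a sampled period
    period = int(rng.choice([24, 48, 64]))
    season = 2.0 * np.sin(2 * np.pi * t / period)

    # Noise attribute: Gaussian noise with a fixed amplitude
    noise = rng.normal(0, 0.3, length)

    # Local fluctuation attribute: an upward spike at a sampled position
    spike_pos = int(rng.integers(32, length - 32))
    spike_amp = float(rng.uniform(3.0, 6.0))
    local = np.zeros(length)
    local[spike_pos] = spike_amp

    series = trend + season + noise + local
    # The attribute pool records every detail used to build the series,
    # which later becomes its precise textual description.
    pool = {
        "trend": {"type": "linear increase", "slope": float(slope)},
        "seasonality": {"period": period, "amplitude": 2.0},
        "noise": {"type": "gaussian", "std": 0.3},
        "local": [{"type": "upward spike", "position": spike_pos,
                   "amplitude": spike_amp}],
    }
    return series, pool
```

Because the series is composed directly from the sampled parameters, the textual description derived from the pool is exact by construction, which is the property the alignment data relies on.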

Time Series Evol-Instruct {#sec:evol_instruct}
-------------------------

![Time Series Evol-Instruct](figures/evol_instruct.png){#fig:evol_instruct width="\\linewidth"}

![image](figures/model.png){width="0.9\\linewidth"}

To improve the model's question-answering and reasoning abilities, it is essential to have high-quality SFT training data that is diverse in format and tasks. However, due to the lack of time-series + text data, it is challenging to obtain sufficiently diverse time-series-related training data directly. To generate accurate time-series + text SFT data with rich question-answering formats, inspired by Evol-Instruct [@xu2023wizardlm] and its multimodal version MMEvol [@luo2024mmevol], we innovatively propose Time Series Evol-Instruct (TSEvol).

Evol-Instruct [@xu2023wizardlm] is a data generation approach that incrementally evolves instructional prompts and their outputs to enhance the diversity and complexity of training datasets for LLMs. TSEvol builds upon Evol-Instruct by introducing a mechanism to incorporate time series attributes dynamically into each evolutionary step (see Figure [3](#fig:evol_instruct){reference-type="ref" reference="fig:evol_instruct"}). TSEvol relies on *attribute pools* of multivariate time series (see Section [3.2](#sec:time_series_generator){reference-type="ref" reference="sec:time_series_generator"}). Additionally, to enhance the model's ability to analyze correlations, we introduce a correlation pool, which records time series with related attributes (refer to the source code for details). During each step of the evolution process, a subset of attributes is randomly selected from the *attribute pool* and added as *additional context*, guiding the LLMs to generate Q&As about a broader set of time series attributes according to the *evolution type*. With TSEvol, generated Q&As can cover more attributes in the time series and avoid repetitive questions. We also added an attribute-based eliminator to ensure the Q&As match the time series attributes. In addition to the commonly used evolution types, we also add two more types, reasoning (reasoning-based questions) and situation (situation-based questions), to enhance the model's ability to handle complex questions.
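A single TSEvol round can be sketched as below. This is a simplified stand-in, not the paper's implementation: the `llm` callable, the `eliminator` predicate, the prompt wording, and the attribute-pool format are all illustrative assumptions.

```python
import random

EVOLUTION_TYPES = [
    "in-depth", "in-breadth",   # evolution types inherited from Evol-Instruct
    "reasoning", "situation",   # the two types added by TSEvol
]

def evolve_qa(llm, eliminator, seed_qa, attribute_pool, rounds=3):
    """Evolve a seed Q&A over several rounds, injecting randomly sampled
    time series attributes as additional context at each step."""
    qa = seed_qa
    for _ in range(rounds):
        evo_type = random.choice(EVOLUTION_TYPES)
        # Randomly sampled attributes steer the evolved Q&A toward a
        # broader set of time series attributes and avoid repetition.
        context = random.sample(attribute_pool, k=min(2, len(attribute_pool)))
        prompt = (f"Evolve this time-series Q&A ({evo_type} evolution), "
                  f"grounded in these attributes: {context}\n"
                  f"Current Q&A: {qa}")
        candidate = llm(prompt)
        # Attribute-based eliminator: keep the candidate only if it is
        # consistent with the recorded attributes of the time series.
        if eliminator(candidate, attribute_pool):
            qa = candidate
    return qa
```

The eliminator step is what keeps evolved Q&As faithful to the synthetic time series: any candidate that contradicts the attribute pool is simply discarded.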

Time Series Multimodal LLM {#sec:mllm}
--------------------------

In this subsection, we introduce the model structure of the proposed ChatTS, as shown in Figure [\[fig:model\]](#fig:model){reference-type="ref" reference="fig:model"}. ChatTS takes multivariate time series and text, along with their *contextual information*, as the input.

### Context-Aware Time-Series Multimodal LLM

To handle the multimodal inputs, ChatTS first separates the input time series arrays and the text. Following the established practice in encoding time series for LLMs [@jin2023time], the input time series arrays are divided into fixed-size patches, which enables the model to handle and encode temporal patterns more effectively. We employ a simple 5-layer MLP to encode each patch of the time series; since time series inherently have sequential patterns, a simple structure suffices to map the patch features to a space aligned with the text embedding. Text inputs are tokenized and then encoded through a text embedding layer. In this way, each patch of the time series and each text token are mapped to the same space.

To fully retain the contextual information of multivariate time series, we perform token-level concatenation based on the position of the time series in the original input. Specifically, the encoded patches corresponding to each time series are inserted between the surrounding text tokens. Unlike the method used in TimeLLM [@jin2023time], this approach ensures that the contextual information of the time series is fully preserved. This is especially important in multivariate scenarios, where referencing the corresponding time series in textual form is often necessary. This process results in a sequence that reflects the multivariate structure of the data, enabling the LLM to capture both temporal and contextual dependencies across different metrics. This sequence is then fed into the LLM, which generates an answer that incorporates insights from both the time series data and the natural language query, achieving a multimodal understanding suited for complex question-answering tasks.
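The patching and token-level interleaving described above can be sketched as follows. This is a toy illustration under assumed shapes: patch size 16 and an 8-dimensional embedding are arbitrary, and the single linear map stands in for the actual 5-layer MLP encoder.

```python
import numpy as np

PATCH_SIZE, DIM = 16, 8  # illustrative values, not the paper's settings

def patchify(ts: np.ndarray) -> np.ndarray:
    """Split a 1-D series into fixed-size patches, zero-padding the tail."""
    pad = (-len(ts)) % PATCH_SIZE
    ts = np.pad(ts, (0, pad))
    return ts.reshape(-1, PATCH_SIZE)

def encode_patches(patches: np.ndarray, w: np.ndarray) -> np.ndarray:
    # Stand-in for the MLP encoder: one linear map into the text space.
    return patches @ w

def interleave(text_embeds, ts_embeds_list, ts_positions):
    """Insert each encoded series at its original position in the token
    sequence, so the surrounding textual context is fully preserved.
    Positions are assumed to be in ascending order."""
    out, cursor = [], 0
    for pos, ts_emb in zip(ts_positions, ts_embeds_list):
        out.append(text_embeds[cursor:pos])
        out.append(ts_emb)
        cursor = pos
    out.append(text_embeds[cursor:])
    return np.concatenate(out, axis=0)
```

The resulting sequence mixes text-token embeddings and time series patch embeddings in their original order, which is what lets the LLM attend to a time series and the text that refers to it jointly.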

### Value-Preserved Time Series Normalization

![Value-Preserved Time Series Normalization](figures/encode.png){#fig:encode width="\\linewidth"}

The numerical features of time series are essential, as real-world applications often involve specific numerical queries (e.g., asking for the maximum CPU utilization). However, normalization of time series data can lead to losing original numerical information. To address this, we introduce a value-preserved time series normalization scheme (as shown in Figure [4](#fig:encode){reference-type="ref" reference="fig:encode"}). First, we apply standard min-max normalization (0-1 scaling) to each time series array. Then, for each time series, we include the normalization parameters, "Value Scaling" (the scaling factor applied during normalization) and "Value Offset" (the offset applied during normalization), in the text **as part of the prompt**. This approach leverages the numerical understanding capabilities of LLMs, enabling us to normalize time series features while preserving the original numerical information. To further enhance numerical understanding, numerical tasks are included in the training dataset (see Section [3.5](#sec:model_training){reference-type="ref" reference="sec:model_training"}).
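The scheme can be sketched as below; the prompt wording is an illustrative assumption, but the scaling arithmetic follows the min-max normalization described above.

```python
import numpy as np

def normalize_with_prompt(ts: np.ndarray):
    """Min-max normalize a series and carry the scaling parameters in the
    text prompt so the LLM can recover original values."""
    offset = float(ts.min())
    scaling = float(ts.max() - ts.min())
    scaled = (ts - offset) / scaling if scaling > 0 else ts - offset
    prompt = (f"This time series is min-max scaled with "
              f"Value Scaling = {scaling:.4g} and Value Offset = {offset:.4g}.")
    return scaled, prompt

def denormalize(value: float, scaling: float, offset: float) -> float:
    # Recover an original value, e.g. the true maximum from the scaled one.
    return value * scaling + offset
```

Because the scaling and offset appear in the prompt as plain numbers, answering "what is the maximum?" reduces to arithmetic the LLM can perform on its own: scaled maximum times Value Scaling plus Value Offset.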

![image](figures/evaluation_examples.png){width="0.8\\linewidth"}

Model Training {#sec:model_training}
--------------

::: {#tab:training_datasets}
  | **Stage**      | Alignment | Alignment | Alignment | SFT    | SFT             |
  |----------------|-----------|-----------|-----------|--------|-----------------|
  | **Dataset**    | UTS       | MTS-Shape | MTS-Local | TSEvol | Instruct Follow |
  | **\# Samples** | 35,000    | 35,000    | 35,000    | 24,270 | 5,050           |

  : Training Datasets
:::

ChatTS is trained based on Qwen2.5-14B-Instruct [@yang2024qwen2][^1], with a two-stage fine-tuning process: large-scale alignment training and supervised fine-tuning (SFT). Table [1](#tab:training_datasets){reference-type="ref" reference="tab:training_datasets"} shows the datasets we use during training.

### Large-Scale Alignment Training

In the first stage, we perform large-scale alignment training using the attribute-based synthetic time series data to establish an initial alignment between the text and time series modalities within the LLM. This stage enables ChatTS to align textual descriptions with time series attributes effectively. During the alignment stage, we created three datasets for large-scale training based on a series of manually designed templates and LLM refinement. The *UTS* dataset includes tasks for basic attribute descriptions of univariate time series (both global and local attribute tasks are included). The *MTS-Shape* dataset consists of multivariate data with *global* trend correlations designed to enhance the model's ability to analyze multivariate correlations. The *MTS-Local* dataset contains multivariate data with correlated *local* fluctuations, aiming to improve the model's capability in analyzing local features of multivariate data. Given MTS's more complex feature combinations, we set the training data size for MTS and UTS at an approximately 2:1 ratio. We conduct a dataset scaling study in Section [4.5](#sec:dataset_scaling){reference-type="ref" reference="sec:dataset_scaling"} to investigate the impact of training dataset size.

### Supervised Fine-Tuning

In the second stage, we use SFT to develop the LLM's ability to perform complex question-answering and reasoning tasks. This stage utilizes two main types of training data: the datasets generated with TSEvol, designed to enhance the model's question answering and reasoning ability about time series, and an instruction-following (IF) dataset, constructed based on a series of predefined templates, designed to enhance the model's ability to follow specific response formats. For TSEvol, we used the dataset from alignment training along with LLM-generated QAs as the seed data. Together, these datasets train the multimodal LLM to respond accurately to time series-specific queries and follow task instructions, strengthening its capacity for complex, context-driven question-answering and reasoning tasks. In both Alignment and SFT stages, we enhance ChatTS's numerical capabilities through a series of numerical tasks. Specifically, we explicitly train the model to learn various aspects, such as maximum/minimum values, segmented averages, local features (e.g., spike positions and amplitudes), seasonality and trend amplitudes, and raw numerical values at individual time points. The numerical evaluation metrics in our experimental results further demonstrate ChatTS's strong performance in time series numerical analysis.

### Training Settings

We use QA pairs as the data format for both training stages. During alignment training, we mixed in a small amount of IF data and found that this mitigates the decline in the model's IF ability. In the SFT stage, we mixed in 30% of the alignment training dataset to reduce overfitting. The training dataset includes time series with lengths ranging from 64 to 1024 to ensure that ChatTS can handle varying time series lengths. Full-parameter SFT is used for ChatTS with DeepSpeed [@deepspeed] and LLaMA-Factory [@zheng2024llamafactory], with Qwen2.5-14B-Instruct [@yang2024qwen2; @qwen14b] as the base model. Inference for both Qwen and ChatTS is also conducted with DeepSpeed.

Evaluation {#sec:evaluation}
==========

In this section, we comprehensively evaluate the performance of ChatTS by answering the following research questions (RQs):

-   **RQ1.** How well does ChatTS align with time series?

-   **RQ2.** How does ChatTS perform in time series reasoning tasks?

-   **RQ3.** Are attribute-based data and TSEvol effective?

-   **RQ4.** How does the training set size affect model performance?

-   **RQ5.** Is the time series modality in ChatTS truly useful?

-   **RQ6.** Does ChatTS, with its native time-series multimodal capabilities, have advantages over agent-based methods?

Experimental Setup
------------------

### Evaluation Tasks

To comprehensively evaluate the model's performance, we set two categories of evaluation tasks: alignment tasks and reasoning tasks, following the general evaluation methods of multimodal LLMs [@liu2024visual; @luo2024mmevol; @cai2024timeseriesexam]. For each type of evaluation task, we designed a series of subtasks based on existing work. Some example QAs are shown in Figure [\[fig:evaluation\_examples\]](#fig:evaluation_examples){reference-type="ref" reference="fig:evaluation_examples"} (more details can be found in the source code). Specific tasks that rely heavily on domain-specific knowledge (e.g., classification and etiological reasoning) were excluded due to the lack of high-quality datasets that provide sufficient background information. Therefore, we primarily focused on the following tasks:

Alignment tasks are divided into univariate and multivariate:

-   **Univariate tasks.** Identify trends, seasonality, noise, and local fluctuations. These tasks include both *categorical* subtasks and *numerical* subtasks.

-   **Multivariate tasks.** Correlation and clustering. These tasks are all categorical.

The reasoning tasks include inductive reasoning, deductive reasoning, causal reasoning, and comparison reasoning (MCQ2):

-   **Inductive reasoning.** Q&A task. Inductive summarization of the physical meaning reflected by a uni/multivariate time series.

-   **Deductive reasoning.** True/False (T/F) task. Reasoning based on a predefined condition in conjunction with univariate time series.

-   **Causal reasoning.** Multiple-choice task. Based on univariate time series, select the most likely cause.

-   **Comparison reasoning (MCQ2).** Multiple-choice task. Compare two time series and select the correct answer.

More details about the evaluation tasks can be found in the source code and the evaluation dataset.

### Evaluation Metrics

For categorical tasks in alignment evaluation, we match labels in the LLM responses using rule-based matching and use F1-Score as the metric. For numerical tasks in alignment evaluation, we extract numbers from the LLM responses and use *relative accuracy* (1.0 minus relative error) as the metric: $$relative\_accuracy = \max\left(1.0 - \frac{\left| V_{answer} - V_{label} \right|}{\left| V_{label} \right|}, 0.0 \right)$$ We set a minimum value of 0.0 for relative accuracy to mitigate the impact of outlier results. For Q&A tasks in inductive reasoning, answers are evaluated using RAGAS [@es2023ragas], which performs keyword matching via LLM-based fuzzy matching. T/F and MC tasks are evaluated directly through choice matching, and accuracy is reported. For all evaluation metrics, higher is better.
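The relative accuracy metric above can be sketched in a few lines of Python (a minimal illustration of the formula, not the exact evaluation code):

```python
def relative_accuracy(v_answer: float, v_label: float) -> float:
    """Relative accuracy = 1.0 - relative error, floored at 0.0
    to damp the impact of outlier answers."""
    return max(1.0 - abs(v_answer - v_label) / abs(v_label), 0.0)

print(relative_accuracy(95.0, 100.0))   # close answer, high score (~0.95)
print(relative_accuracy(100.0, 100.0))  # exact answer -> 1.0
print(relative_accuracy(500.0, 100.0))  # far-off answer clipped to 0.0
```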

### Evaluation Datasets

::: {#tab:tasks}
| **Dataset** | **Tasks** | **# Questions** |
|---|---|---|
| **A** | Alignment (Trend, Season, Noise, Local, Correlation, Cluster), Reasoning (Inductive, Deductive, Causal) | 525 |
| **B** | Alignment (Trend, Season, Noise, Local, Correlation, Cluster), Reasoning (Inductive) | 1,616 |
| **MCQ2** | Reasoning (Comparison - MCQ2) | 100 |

  : Evaluation datasets and tasks.
:::

Our evaluation is conducted on three datasets (see Table [\[tab:tasks\]](#tab:tasks){reference-type="ref" reference="tab:tasks"}) to test the model's performance across both real-world and synthetic time series scenarios. Datasets A and B were collected by us, and Dataset MCQ2 is an open-source dataset [@merrill2024language].

*Dataset A* includes real-world time series data collected from multiple domains, including AIOps [@li2022constructing], weather [@weatherdataset], the NAB (Numenta Anomaly Benchmark) [@ahmad2017unsupervised], and Oracle system metrics [@li2022actionable]. We manually collected and labeled a total of 525 questions, covering both alignment tasks and reasoning tasks.

To expand the size of the evaluation set, we used the attribute-based time series generator introduced earlier to generate a series of time series and created alignment Q&As by applying a set of templates. We also developed a set of reasoning questions with an LLM, resulting in a larger-scale *Dataset B* containing 1,616 questions. Considering the complexity of reasoning tasks, we included only inductive reasoning tasks in the reasoning portion of this dataset to ensure the quality of the questions.
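As an illustration, template-based alignment Q&A generation from attribute descriptions might look like the following sketch (the template wording and attribute field names here are hypothetical, not the exact ones used in our dataset):

```python
# Hypothetical sketch: attribute labels emitted by the time series
# generator are filled into fixed question/answer templates.
TEMPLATES = {
    "trend": (
        "What is the overall trend of this time series?",
        "The trend of the time series is {direction}, "
        "with a slope of about {slope:.2f}.",
    ),
}

def make_alignment_qa(attributes: dict) -> list[tuple[str, str]]:
    """Build (question, answer) pairs from per-attribute labels."""
    qas = []
    for name, attr in attributes.items():
        if name in TEMPLATES:
            question, answer = TEMPLATES[name]
            qas.append((question, answer.format(**attr)))
    return qas

attrs = {"trend": {"direction": "increasing", "slope": 0.35}}
for q, a in make_alignment_qa(attrs):
    print(q)
    print(a)
```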

*MCQ2* [@merrill2024language] is an open-source dataset [@mcq2dataset] that includes comparison reasoning tasks. The questions, answers, and time series in this dataset are all generated by LLMs. We did not use the etiological reasoning and forecasting datasets as they are not aligned with our evaluation settings. Furthermore, [@merrill2024language] suggests that the settings of the MCQ1 dataset are unsuitable for evaluating the performance of time series reasoning, so we also did not adopt it. Considering the inference cost, we randomly sampled 100 questions.

::: {#tab:align}
| **Dataset** | **Type** | **Model** | Trend (Cate.) | Trend (Num.) | Season (Cate.) | Season (Num.) | Noise (Cate.) | Noise (Num.) | Local (Cate.) | Local (Num.) | Corr. (Cate.) | Clus. (Cate.) | Avg. (Cate.) | Avg. (Num.) | **Tokens** | **Est. Cost (\$)** |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | Text | GPT-4o-mini | 0.585 | 0.752 | 0.649 | 0.264 | **0.952** | 0.312 | 0.263 | 0.187 | 0.357 | 0.254 | 0.464 | 0.310 | 1.3M | 0.20 |
| | | GPT-4o | 0.585 | **0.882** | 0.811 | 0.768 | 0.905 | 0.153 | 0.379 | 0.256 | 0.476 | 0.333 | 0.542 | 0.371 | 1.3M | 3.25 |
| | | GPT-4-Turbo | 0.526 | 0.699 | 0.649 | 0.131 | 0.900 | 0.339 | 0.303 | 0.247 | 0.417 | 0.269 | 0.490 | 0.353 | 1.3M | 13.0 |
| | | QWen2.5-14B | 0.707 | 0.709 | 0.622 | 0.205 | 0.833 | 0.231 | 0.137 | 0.099 | 0.571 | 0.349 | 0.464 | 0.241 | 1.3M | 0.35 |
| | Vision | GPT-4o-mini | 0.610 | 0.501 | 0.432 | 0.205 | 0.667 | 0.201 | 0.242 | 0.184 | 0.357 | 0.330 | 0.404 | 0.248 | 2.2M\* | 0.33 |
| | | GPT-4o | 0.659 | 0.613 | 0.811 | 0.559 | 0.810 | 0.248 | 0.537 | 0.414 | 0.476 | 0.480 | 0.609 | 0.436 | 0.13M\* | 0.32 |
| | Agent | GPT-4o-mini | 0.559 | 0.773 | 0.595 | 0.270 | 0.714 | 0.105 | 0.400 | 0.212 | 0.381 | 0.361 | 0.469 | 0.309 | 3.0M | 0.45 |
| | | GPT-4o | 0.537 | 0.650 | 0.405 | 0.000 | 0.595 | 0.088 | 0.232 | 0.136 | 0.429 | 0.417 | 0.390 | 0.220 | 2.7M | 6.75 |
| | Ours | **ChatTS** | **0.927** | 0.874 | **0.973** | **0.849** | 0.857 | **0.511** | **0.895** | **0.805** | **0.905** | **0.782** | **0.889** | **0.788** | **0.08M** | **0.02** |
| B | Text | GPT-4o-mini | 0.619 | 0.716 | 0.711 | 0.317 | 0.427 | 0.198 | 0.145 | 0.091 | 0.335 | 0.269 | 0.336 | 0.217 | 4.5M | 0.67 |
| | | GPT-4o | 0.690 | 0.825 | 0.732 | 0.474 | 0.573 | 0.331 | 0.191 | 0.136 | 0.324 | 0.281 | 0.366 | 0.284 | 4.5M | 11.3 |
| | | GPT-4-Turbo | 0.667 | 0.732 | 0.667 | 0.345 | 0.348 | 0.067 | 0.188 | 0.133 | 0.438 | 0.369 | 0.385 | 0.259 | 4.5M | 45.0 |
| | | QWen2.5-14B | 0.711 | 0.669 | 0.705 | 0.217 | 0.256 | 0.094 | 0.111 | 0.082 | 0.402 | 0.276 | 0.339 | 0.193 | 4.5M | 1.22 |
| | Vision | GPT-4o-mini | 0.679 | 0.240 | 0.814 | 0.453 | 0.305 | 0.238 | 0.141 | 0.081 | 0.327 | 0.307 | 0.347 | 0.142 | 11.4M\* | 1.71 |
| | | GPT-4o | 0.702 | 0.361 | 0.938 | 0.589 | 0.610 | 0.398 | 0.375 | 0.265 | 0.367 | 0.389 | 0.472 | 0.311 | 0.56M\* | 1.40 |
| | Agent | GPT-4o-mini | 0.612 | 0.591 | 0.455 | 0.605 | 0.375 | 0.000 | 0.043 | 0.022 | 0.654 | 0.585 | 0.372 | 0.125 | 8.5M | 1.27 |
| | | GPT-4o | 0.532 | 0.586 | 0.619 | 0.658 | 0.391 | 0.262 | 0.551 | 0.287 | 0.500 | 0.464 | 0.490 | 0.370 | 7.2M | 10.8 |
| | Ours | **ChatTS** | **0.976** | **0.902** | **1.000** | **0.930** | **0.927** | **0.572** | **0.828** | **0.752** | **0.818** | **0.834** | **0.862** | **0.787** | **0.34M** | **0.09** |

  : Evaluation results on alignment tasks (Datasets A and B). Cate.: F1-Score on categorical subtasks; Num.: relative accuracy on numerical subtasks.
:::

### Baselines

Based on different modalities, we categorized the baseline methods into the following types:

-   **Text-Based:** These methods convert time series arrays into textual prompts as inputs for LLMs. We choose several mainstream LLMs (GPT-4o/GPT-4o-mini/GPT-4-Turbo/QWen2.5-14B-Instruct) as base models for evaluation.

-   **Vision-Based:** These methods plot time series and input them into visual MLLMs. We choose mainstream vision MLLMs (GPT-4o/GPT-4o-mini) for evaluation.

-   **Agent-Based:** These methods employ the ReAct [@yao2022react] framework to interact with multiple tools to analyze the time series. The tools used include single-point/range query, STL decomposition, anomaly detection (autoregression AD in adtk [@adtk]), and classification (Rocket [@dempster2020rocket]) for UTS; trend/fluctuation correlation (based on Pearson correlation & rules), multivariate version of AD and classification for MTS. We choose GPT-4o/GPT-4o-mini for the agent. More details about the tools' implementation can be found in the source code. *We also conducted additional experiments to explore further the capabilities of agent-based methods (Section [4.7](#sec:agent_study){reference-type="ref" reference="sec:agent_study"}), which studies the impact of tool accuracy*.
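As an illustration of the agent tools, the following is a minimal sketch of a trend-correlation tool based on Pearson correlation over moving-average-smoothed series (the smoothing window and correlation thresholds are assumptions, not the exact rules in our source code):

```python
import numpy as np

def trend_correlation_tool(ts_a, ts_b, threshold: float = 0.8) -> str:
    """Hypothetical agent tool: report whether the trend components of
    two series are correlated, via Pearson correlation on
    moving-average-smoothed series."""
    ts_a, ts_b = np.asarray(ts_a, float), np.asarray(ts_b, float)
    window = max(len(ts_a) // 10, 1)          # assumed smoothing window
    kernel = np.ones(window) / window
    trend_a = np.convolve(ts_a, kernel, mode="valid")
    trend_b = np.convolve(ts_b, kernel, mode="valid")
    r = np.corrcoef(trend_a, trend_b)[0, 1]
    if r > threshold:
        return f"positively correlated (r={r:.2f})"
    if r < -threshold:
        return f"negatively correlated (r={r:.2f})"
    return f"not clearly correlated (r={r:.2f})"

t = np.arange(128)
rising = 0.1 * t + np.random.default_rng(0).normal(0, 0.5, 128)
falling = -0.1 * t + np.random.default_rng(1).normal(0, 0.5, 128)
print(trend_correlation_tool(rising, falling))  # negatively correlated
```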

### Implementation

For GPT-based models, we used OpenAI's API for inference and tracked token consumption. For ChatTS and QWen-based models, training and inference are conducted locally on 8$\times$A800 GPUs. The token consumption of ChatTS is calculated after the "Reorder & Concat" step.

RQ1. Alignment Tasks
--------------------

The evaluation results on alignment tasks are shown in Table [\[tab:align\]](#tab:align){reference-type="ref" reference="tab:align"}. ChatTS consistently outperforms all baseline models across nearly all tasks and datasets, achieving 46.0%--75.9% improvement in categorical metrics and 80.7%--112.7% in numerical metrics compared to industry-leading models like GPT-4o. This demonstrates that synthetic training data can effectively enable strong alignment with real-world time series.

Among the baselines, GPT-4o (Vision) performs best, suggesting vision-based MLLMs possess some capability to analyze shape characteristics of time series, though they remain limited by image resolution when interpreting details. Text-based methods struggle with the constraints of prompt length, while agent-based approaches perform below expectations (see Section [4.7](#sec:agent_study){reference-type="ref" reference="sec:agent_study"} for detailed analysis).

::: {#tab:reason}
| **Type** | **Model** | Induct. | Deduct. | Causal | MCQ2 | **Average** |
|---|---|---|---|---|---|---|
| Text | GPT-4o-mini | 0.333 | 0.326 | 0.576 | 0.480 | 0.429 |
| | GPT-4o | 0.336 | 0.628 | 0.685 | 0.470 | 0.530 |
| | GPT-4-Turbo | 0.280 | 0.581 | 0.644 | 0.490 | 0.499 |
| | QWen2.5-14B | 0.184 | 0.605 | 0.348 | 0.320 | 0.364 |
| Vision | GPT-4o-mini | 0.323 | 0.442 | 0.495 | 0.480 | 0.435 |
| | GPT-4o | 0.322 | 0.605 | 0.652 | 0.490 | 0.517 |
| Agent | GPT-4o-mini | 0.219 | 0.357 | 0.692 | 0.340 | 0.402 |
| | GPT-4o | 0.167 | 0.553 | 0.696 | 0.380 | 0.449 |
| Ours | **ChatTS** | **0.518** | **0.744** | **0.804** | **0.600** | **0.667** |

  : Reasoning tasks. Inductive Reasoning is in the form of Q&A, evaluated with RAGAS. Other tasks are MC or T/F questions, which are evaluated with accuracy.
:::

ChatTS's advantages are particularly pronounced in multivariate tasks, where text-based models face challenges with excessively long prompts and vision-based models struggle to distinguish features across multiple time series plotted simultaneously. In contrast, ChatTS's context-aware time series encoding accurately analyzes referenced time series based on contextual information.

From the efficiency perspective, ChatTS's native multimodal encoding requires significantly fewer tokens to represent time series data, resulting in much lower costs than the baselines (see Table [\[tab:align\]](#tab:align){reference-type="ref" reference="tab:align"}). This shows both the effectiveness and the efficiency of treating time series as a native modality.

RQ2. Reasoning Tasks
--------------------

The comparison between our model and the baseline models on reasoning tasks is shown in Table [2](#tab:reason){reference-type="ref" reference="tab:reason"}. Reasoning tasks are typically more complex and better aligned with real-world application scenarios than alignment tasks. ChatTS achieves consistent improvements over the baseline models across all reasoning tasks. In the Inductive Reasoning task, ChatTS achieves a 34.5% improvement compared to the baseline models, indicating that it can accurately associate time series attributes with their physical meanings in the real world. This demonstrates that the proposed attribute-based time series generation effectively enables the model to understand the patterns of the physical world reflected in time series. Moreover, ChatTS also achieves notable improvements in the other reasoning tasks, indicating that even with only synthetic training data, the model can be equipped with strong time-series-related reasoning capabilities. This further demonstrates the effectiveness of the proposed attribute-based time series generation method and TSEvol.

![Dataset A](figures/ablation_study_a.png){#fig:ablation_study_a width="0.95\\linewidth"}

![Dataset B](figures/ablation_study_b.png){#fig:ablation_study_b width="0.95\\linewidth"}

![Ablation studies on reasoning tasks.](figures/ablation_study_reasoning.png){#fig:ablation_study_reasoning width="0.85\\linewidth"}

RQ3. Studies of Synthetic Training Data
---------------------------------------

To evaluate the effectiveness of attribute-based time series generation and TSEvol, we conducted ablation studies with two variants: (1) *w/o Attribute-Based*, where all training datasets were replaced by GPT-generated datasets from [@merrill2024language], containing time series directly generated using GPT-produced Python code with corresponding GPT-generated Q&As; (2) *w/o TSEvol*, where SFT datasets were replaced with data directly generated using an LLM without the evolutionary approach, though with prompts designed to encourage diversity. Both variants included the instruct-following dataset to ensure fair comparison.

The evaluation results in Figures [\[fig:ablation\_study\]](#fig:ablation_study){reference-type="ref" reference="fig:ablation_study"} and [7](#fig:ablation_study_reasoning){reference-type="ref" reference="fig:ablation_study_reasoning"} reveal that models trained on GPT-generated data performed significantly worse across alignment tasks, particularly for local fluctuation detection and numerical analysis. This suggests the attribute-based generation method better captures precise feature details and numerical values. Meanwhile, models trained with TSEvol demonstrated substantial improvements in reasoning capabilities and modest gains in alignment tasks, indicating that TSEvol effectively diversifies question formats and generates tailored Q&As for different time series attributes, enhancing overall model performance.

RQ4. Scaling of Training Dataset {#sec:dataset_scaling}
--------------------------------

![Train Data](figures/dataset_scaling_study.png){#fig:dataset_scaling_study width="1.0\\linewidth"}

![SFT on Text-Based LLMs](figures/text_model_study.png){#fig:text_model_study width="1.0\\linewidth"}

Figure [8](#fig:dataset_scaling_study){reference-type="ref" reference="fig:dataset_scaling_study"} illustrates the relationship between ChatTS's performance and training data size. The results show that increasing the Phase 1 training dataset from 10% to 100% of its current size significantly improves performance, but further expansion yields minimal gains. Thus, our chosen training set size is well-balanced, providing sufficient data for effective alignment while avoiding excessive resource consumption during training.

RQ5. Study of Time Series Modality
----------------------------------

To investigate the effectiveness of the time series modality in ChatTS, we performed an ablation study on a text-only variant of ChatTS (w/o TS Modality). We removed the time series encoder from ChatTS (*i.e.*, using the original QWen-2.5 model) and trained it on the same data as ChatTS, with the time series arrays encoded as text. The experimental results are shown in Figure [\[fig:ablation\_study\]](#fig:ablation_study){reference-type="ref" reference="fig:ablation_study"} and Figure [7](#fig:ablation_study_reasoning){reference-type="ref" reference="fig:ablation_study_reasoning"}. Overall, the text-only model performs significantly worse than the original model, indicating that multimodal encoding is crucial for accurately capturing both shape and numerical information. However, on certain sub-metrics (e.g., noise), the text-only model outperforms the multimodal ChatTS, suggesting that text-modality models still have strong capabilities for identifying small fluctuations. In MTS tasks, the text-only model is nearly incapable of answering any questions: even with extensive multivariate training data, text-only LLMs still struggle with multivariate problems, as the excessively long contexts lead to severe hallucinations and inaccurate responses. Additionally, to compare the performance gains between text-based LLMs and TS-MLLMs, we fine-tuned Qwen2.5 text-based LLMs of various sizes on the text version of the ChatTS training dataset (as shown in Figure [9](#fig:text_model_study){reference-type="ref" reference="fig:text_model_study"}). The results show that even the fine-tuned Qwen2.5-32B text model does not outperform ChatTS (14B), which has native multimodal capabilities.
This further validates the importance of native multimodal capabilities in ChatTS, both in accuracy for MTS analysis and in cost efficiency (see Table [\[tab:align\]](#tab:align){reference-type="ref" reference="tab:align"}).

RQ6. Study of Agent-Based Methods {#sec:agent_study}
---------------------------------

![Agent with different tool accuracy.](figures/agent_study.png){#fig:agent_study_accuracy width="1.0\\linewidth"}

![Agent w/o different tools.](figures/agent_tool_study.png){#fig:agent_tool_study width="1.0\\linewidth"}

![Study of error cases.](figures/agent_error_study.png){#fig:agent_error_study width="1.0\\linewidth"}

Agent-based methods are widely applied but showed suboptimal performance in our evaluations (RQ1, RQ2) due to several main issues: (1) tool inaccuracy, (2) incorrect tool use, and (3) response formatting problems that caused parsing failures. To explore their performance upper bound, we conducted the following detailed analyses:

1.  **Parsing Failures:** We excluded responses that failed to parse, ensuring that all evaluated outputs were valid.

2.  **Perfect Tools:** We design "perfect tools" whose accuracy can be strictly controlled, by leveraging the ground-truth time series labels in the synthetic dataset (Figure [10](#fig:agent_study_accuracy){reference-type="ref" reference="fig:agent_study_accuracy"}).

3.  **Tool Ablation:** We perform ablation studies (Figure [11](#fig:agent_tool_study){reference-type="ref" reference="fig:agent_tool_study"}) to evaluate the impact of individual tools on accuracy.

4.  **Error Analysis:** We categorize agent errors into three types: Error Tool Using, Misunderstanding, and Hallucination. We analyze their impact on performance (Figure [12](#fig:agent_error_study){reference-type="ref" reference="fig:agent_error_study"}).
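As an illustration, a "perfect tool" with strictly controlled accuracy can be built from the ground-truth labels of the synthetic dataset, answering correctly with a fixed probability (a hypothetical sketch; the actual tool interfaces are in the source code):

```python
import random

def make_controlled_tool(ground_truth: dict, accuracy: float,
                         wrong_answers: dict, seed: int = 0):
    """Hypothetical sketch: a tool that answers from ground-truth labels
    with probability `accuracy`, and otherwise returns a plausible
    wrong answer, so tool accuracy is strictly controlled."""
    rng = random.Random(seed)

    def tool(ts_id: str) -> str:
        if rng.random() < accuracy:
            return ground_truth[ts_id]
        return wrong_answers[ts_id]

    return tool

labels = {"ts1": "upward spike at t=42"}
wrong = {"ts1": "no anomaly detected"}
perfect = make_controlled_tool(labels, accuracy=1.0, wrong_answers=wrong)
print(perfect("ts1"))  # upward spike at t=42
```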

Our sensitivity analysis (Figure [10](#fig:agent_study_accuracy){reference-type="ref" reference="fig:agent_study_accuracy"}) shows that agent performance is highly sensitive to tool accuracy, especially in the \[0.9, 1.0\] range. For categorical and numerical tasks, the agent with perfect tools slightly outperforms ChatTS on UTS tasks but lags behind on MTS tasks. For agents, MTS tasks typically require more tool calls and reasoning steps, which places higher demands on LLMs' tool-using and summarization capabilities. In contrast, ChatTS processes multiple time series natively, reducing complexity and improving accuracy on MTS tasks. The tool ablation study (Figure [11](#fig:agent_tool_study){reference-type="ref" reference="fig:agent_tool_study"}) shows that agents depend heavily on both tool precision and tool completeness, particularly for MTS tasks. Even with all tools available, agents frequently fail to invoke the correct tool at the right time (e.g., using the classification tool rather than the anomaly detection tool to identify the position of a spike), limiting their effectiveness. Error analysis (Figure [12](#fig:agent_error_study){reference-type="ref" reference="fig:agent_error_study"}) reveals "Error Tool Using" as the largest source of errors. When these cases are excluded, agent accuracy exceeds 95%, surpassing ChatTS. This validates the implementation of the perfect tools and the model, while also exposing the agents' key limitation: they struggle with tool selection and reasoning.

In summary, while perfect tools improve agent performance, challenges such as tool selection errors, misunderstandings, and hallucinations persist, leaving agents less effective than ChatTS for complex time series tasks.

Case Studies and Applications {#sec:case_study}
=============================

Case Studies on Real-World Data
-------------------------------

![Case studies on real-world time series data.](figures/case_study_real.png){#fig:case_study_real width="1.0\\linewidth"}

To investigate the performance of ChatTS on *real-world* time series, we perform several case studies with challenging questions; the results are shown in Figure [13](#fig:case_study_real){reference-type="ref" reference="fig:case_study_real"}.

### Shape and Statistical Analysis

The "Basic Shape Analysis" case demonstrates ChatTS's capability to analyze an NYC taxi passenger time series with complex periodic fluctuations and local anomalies. ChatTS accurately identifies multiple trend segments, the periodicity, and the upward spike along with their amplitudes. This shows ChatTS's capability to capture both global patterns and localized features. In the "Statistic Analysis" case, ChatTS analyzes advertisement CPC data with misleading scaling. Despite the potentially confusing minimum value, ChatTS correctly identifies the max/min values and their positions. These cases show ChatTS's robustness in statistical analysis of complex real-world time series.

### OOD Fluctuation Recognition

The "OOD Fluctuation" case presents ChatTS with a traffic occupancy time series containing an OOD fluctuation pattern *absent from its training data*. Nevertheless, ChatTS accurately describes it as a "Convex-Shaped Elevation", characterized by a gradual rise followed by a sharper decline, forming an overall convex shape. This demonstrates ChatTS's inherent understanding of time series patterns themselves, rather than simply repeating representations from the training set, and indicates that ChatTS has a certain capacity to generalize to real-world data despite being trained exclusively on synthetic data.

Real-World Application: DB Operation
------------------------------------

![An application case of in a failure diagnosis with an Oracle database system.](figures/case_study_aiops.png){#fig:case_study_aiops width="1.0\\linewidth"}

To illustrate the performance of ChatTS with its native time series multimodal capability in real-world applications, we present a typical Oracle DB operation case through an MTS-related multi-turn dialogue with ChatTS. In this case study, an Oracle DB operator has identified a recent anomaly and retrieves several time series metrics from the monitoring system, inputting them into ChatTS for analysis (as shown in Figure [14](#fig:case_study_aiops){reference-type="ref" reference="fig:case_study_aiops"})[^2]. By querying ChatTS, the operator obtains the names of all metrics with anomalies. Then, to accurately pinpoint the root cause, the operator provides ChatTS with a textual document titled "Oracle Database Troubleshooting Rulebook" and requests ChatTS to analyze the root cause and propagation of the system failure step-by-step, combining insights from the rulebook and the time series anomalies. Notably, the rulebook is entirely in *text* form, without a strictly structured format, which helps operators share their expert experience effectively. The responses of ChatTS show that it can accurately identify anomalies and their amplitudes in multivariate time series. By leveraging "the metric with the largest fluctuation" from the rulebook, ChatTS can further reason about the root cause and failure propagation path. This further shows that ChatTS can effectively utilize its *alignment capability* to analyze time series and perform complex analysis in real-world applications with its robust *reasoning ability*.

Real-World Application: Detailed Analysis
-----------------------------------------

![An application case in detailed time series.](figures/case_study_aapl.png){#fig:case_study_aapl width="1.0\\linewidth"}

Another typical application of ChatTS is conducting a detailed analysis of time series features, combining LLMs' knowledge and reasoning capabilities to perform simple reasoning and question answering. In Figure [15](#fig:case_study_aapl){reference-type="ref" reference="fig:case_study_aapl"}, we present a case study of time series analysis on the discussion intensity of AAPL-related topics on Twitter, using data from NAB [@ahmad2017unsupervised]. Notably, even without explicit instructions from the user to identify local fluctuations, ChatTS can accurately infer the user's intent and determine the timestamps of all three "hot events" from the time series. Furthermore, ChatTS can precisely identify the highest point and its position in the time series based on the numerical values of the local peaks, and perform event analysis according to the physical meaning of the series. This demonstrates that ChatTS can accurately recognize both shape and numerical characteristics of time series and perform reasoning and analysis based on vague user input.

Baseline Comparison
-------------------

![Case Study on Complex Time Series Caption](figures/case_study_caption.png){#fig:case_study_caption width="1.0\\linewidth"}

![Case Study on Trend](figures/case_study_2.png){#fig:case_study_2 width="1.0\\linewidth"}

### Seasonality, Trend and Fluctuations

As shown in Figure [16](#fig:case_study_caption){reference-type="ref" reference="fig:case_study_caption"}, due to tool inaccuracy, the Agent fails to identify the periodicity. This further leads the Agent to misinterpret periodic patterns as distinct trend changes, resulting in errors in trend analysis; moreover, the LLM neither recognizes the problem nor attempts to correct it. Similarly, the Vision-based model also exhibits errors in analyzing local fluctuations and periodicity. In contrast, ChatTS, with its time-series modality awareness, accurately captures the periodicity and trend transitions. This case highlights a key limitation of agent-based methods: tool precision alone cannot overcome the cascading errors caused by an initial misreading of time series patterns.

### Detailed Trend Analysis

Figure [17](#fig:case_study_2){reference-type="ref" reference="fig:case_study_2"} presents a misleading case where the original image suggests a steady trend. However, in the cropped plot that excludes the two spikes, there is a significant decreasing trend. Due to the subtle nature of the trend displayed in the time series image, the Vision-based model incorrectly classifies it as steady. Similarly, text-based models, while identifying the starting values, fail to capture the global shape of the time series. In contrast, ChatTS accurately captures both the overall trend and the numerical details due to its native time series encoding capabilities.

Related Work
============

**Multimodal LLMs (MLLMs).** MLLMs have developed rapidly in recent years and found extensive applications [@zhang2024mm; @yin2024survey]. A significant body of research integrates different types of data to achieve multimodal fusion, including images [@liu2024visual; @bai2023qwen; @li2023blip], videos [@maaz2023video; @zhang2023video; @li2023videochat], audio [@chu2023qwen; @rubenstein2023audiopalm], and graphs [@zhang2024graphtranslator; @pan2024unifying]. These models have been applied across diverse domains, with image-based question answering and reasoning representing an important research direction. Many studies leverage vision-based LLMs for image reasoning tasks [@luo2024mmevol; @jiang2023cross], fully utilizing the natural language understanding and reasoning capabilities of large language models. However, in the field of time series, despite numerous works (discussed below) that combine time series data with LLMs, research on aligning LLMs with the time-series modality remains limited. This limitation is primarily due to the scarcity of high-quality multimodal datasets that combine time series with textual information [@merrill2024language; @chow2024towards; @jin2024position]. As a result, the development of time series-specific MLLMs for question-answering and reasoning tasks has lagged behind other modalities.

**Time Series Question Answering (TSQA).** With the rapid development of LLMs, TSQA systems have combined the reasoning capabilities of LLMs with time series analysis to enable more efficient cross-domain decision-making and complex task handling [@jin2024position]. Time series question-answering systems have been explored in various fields, such as AIOps [@dbot; @rcagent], IoT [@xing2021deepsqa; @gallo2023conversational], healthcare [@yu2023zero; @oh2024ecg], finance [@maitre2020event; @kurisinkel2024text2timeseries], and traffic [@da2024open; @lai2023large]. However, these methods are often limited to agent-based [@yao2022react] and retrieval-augmented generation (RAG) [@lewis2020retrieval] approaches, lacking a comprehensive understanding of time series and sufficient reasoning capabilities. Although some recent studies [@chow2024towards] have attempted to leverage temporal multimodal approaches for time series reasoning tasks, they typically rely on task-specific corpora. They are trained and evaluated on specific tasks (e.g., classification tasks or forecasting tasks), lacking multivariate analysis capabilities. Compared to the research on multimodal question answering in fields like images and videos, time series question answering still lacks robust multimodal alignment methods and evaluation frameworks [@merrill2024language; @cai2024timeseriesexam]. Therefore, in contrast to existing studies, this paper is the first to propose a comprehensive time series modality alignment and fine-tuning process, evaluated using multiple alignment and reasoning tasks.

**LLM + Time Series.** In addition to the research above, many studies have combined LLMs with time series for various downstream tasks, leveraging the powerful capabilities of LLMs [@jin2023time; @gruver2024large; @su2024large; @chang2023llm4ts; @liu2024time; @cao2023tempo; @zhou2023one; @cai2023jolt]. However, while these models use LLMs as backbones, they are designed for specific downstream tasks and lack language alignment capabilities, making them unsuitable for question answering and reasoning applications. Moreover, some studies employ vision-based multimodal LLMs for time series forecasting [@chen2024visionts] and anomaly detection [@zhuang2024see]. This approach aligns with the vision-based LLM methods discussed in this paper but is significantly constrained in its ability to analyze time series.

Limitations and Future Work
==========================

Given the limited existing research on time series understanding and reasoning, although ChatTS explores an effective approach, we believe it still has a number of limitations. First, while our experiments demonstrate that synthetic data can achieve satisfactory alignment and reasoning performance, we believe that real-world data is essential for further enhancing the capabilities of TS-MLLMs. We hope more relevant datasets will emerge in the future. Second, although we found that a simple MLP encoder performs well due to the relatively simple structure of time series data, exploring more effective methods for multimodal encoding and integration remains a valuable research direction. Third, despite labeling hundreds of real-world time series and evaluating with 14 metrics, we believe this is still insufficient for a comprehensive evaluation of TS-MLLMs; more labeled real-world data is needed. Finally, while this work focuses on *understanding tasks* like language alignment and reasoning, MLLM-based time series *generation* is also worth exploring. Thus, developing a multimodal model that can generate time series based on textual input is an important area for future research.

Conclusion
==========

Understanding and reasoning are important for real-world time series applications, but research remains limited due to the lack of paired time series-text data. In this paper, we propose ChatTS, the first TS-MLLM that takes multivariate time series as input for complex time series QA and reasoning, fine-tuned exclusively on synthetic data. We introduce an attribute-based time series generation method, which not only generates diverse time series but also provides complete and precise attribute descriptions. Building on this, we further propose TSEvol, which leverages rich attribute combinations from the attribute pool and Evol-Instruct to generate diverse and accurate QAs, enhancing the model's capabilities in complex question answering and reasoning. To comprehensively evaluate the capabilities of our model, we collect datasets that include real-world time series data, covering both alignment tasks and reasoning tasks. Evaluation results show that our model achieves significant improvements, outperforming baselines by 46.0% in alignment tasks and 25.8% in reasoning tasks. These findings demonstrate the effectiveness of our approach in bridging the gap between time series data and natural language understanding. We have open-sourced the source code, trained model weights, and the evaluation datasets for reproduction and future research: <https://github.com/NetManAIOps/ChatTS>.

[^1]: https://huggingface.co/Qwen/Qwen2.5-14B-Instruct

[^2]: DB metrics can be input into ChatTS through its API. The system implementation details are beyond the scope of this paper.
