TSL-JEPA
Status: draft research idea extracted from internal discussion notes.
Collaboration
If this direction resonates with you, I would be happy to talk with like-minded people, collaborate on research, and work on use-cases together.
Ideas are not the bottleneck. Hands are. Time-series modeling should be moving at least as fast as vision, audio, and robotics.
- Email: alexander.chemeris@gmail.com
- X: @chemeris
- Telegram: @alexanderchemeris
Summary
TSL-JEPA is a query-conditioned JEPA idea for time series. The model should learn a useful time-series embedding first, then use a query to predict a target embedding, label, structured value, or human-readable answer only when a readout is needed.
The core claim is not that time-series models should become better free-form text generators. The sharper claim is that time series need a structured readout interface: retrieval, alerting, captioning, classification, shape matching, numeric property extraction, and later richer reasoning should all sit on top of a shared representation-space prediction objective.
VL-JEPA is the nearest interface analogy: predict target embeddings and decode selectively. ChatTS is the nearest next-token-prediction comparison point. Florence-2 is the data and output-contract analogy: a compact model becomes practical when dense labels and structured task prompts make the output space usable.
Motivation
Many time-series-language systems frame the task as:

    time series + text query -> answer tokens

That is useful, but it makes language generation the main training interface. TSL-JEPA changes the center of gravity:

    time series + query -> predicted target representation -> optional readout

The readout can be a text caption, but it can also be a class, an alert, a retrieval embedding, a scalar value, an interval, a segment boundary, or another typed target. This matters because production time-series systems usually need parseable outputs before they need fluent prose.
The strongest comparison against next-token prediction is therefore not “can it chat?” The more important comparison is whether TSL-JEPA is more robust when the same underlying question is expressed through different query forms, candidate labels, domains, or structured output contracts.
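One way to make this robustness comparison measurable is a reformulation-consistency score: ask the same underlying question in several phrasings and check how often the model gives the same answer. The sketch below is only an illustration. The function name `reformulation_consistency`, the `toy_model` stand-in, and the example paraphrases are all hypothetical and are not part of any existing benchmark.

```python
from collections import Counter
from typing import Callable, Sequence


def reformulation_consistency(
    model: Callable[[Sequence[float], str], str],
    window: Sequence[float],
    paraphrases: Sequence[str],
) -> float:
    """Fraction of paraphrased queries that agree with the majority answer.

    Answers must be hashable (for example strings or label ids).
    1.0 means the answer never changes with the phrasing.
    """
    answers = [model(window, query) for query in paraphrases]
    _, majority_count = Counter(answers).most_common(1)[0]
    return majority_count / len(answers)


def toy_model(window: Sequence[float], query: str) -> str:
    """Hypothetical brittle model: it only answers if the word 'trend' appears."""
    if "trend" not in query:
        return "unknown"
    return "up" if window[-1] > window[0] else "down"


paraphrases = [
    "what is the trend?",
    "is the trend rising?",
    "which direction is it moving?",
]
score = reformulation_consistency(toy_model, [1.0, 2.0, 3.0], paraphrases)
```

This score measures agreement across phrasings, not correctness, so it would need to be reported next to ordinary accuracy. The same function can be applied to candidate-label reformulations by varying the label wording instead of the query wording.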
Interface
    flowchart LR
        X[time-series window or trajectory] --> XE[x-encoder]
        Q[query or task token] --> P[predictor]
        XE --> P
        Y[text, label, numeric, or structured target] --> YE[y-encoder]
        YE --> L[JEPA target space]
        P --> L
        P --> R{readout needed?}
        R -- no --> E[embedding, retrieval, alert score, or class score]
        R -- yes --> D[selective decoder or formatter]
        D --> O[text or structured output]
The x-encoder can start as a frozen time-series encoder, but the idea is not tied to one backbone. The durable object is the interface: a time-series representation, a query-conditioned predictor, a target representation, and selective readout.
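The interface can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not a proposed implementation: the hand-written `x_encoder` features, the `QUERY_GATES` table, and the `trend_caption` decoder are hypothetical stand-ins for learned components, and the y-encoder and training loss are omitted entirely.

```python
import math
from typing import Callable, List, Optional, Sequence, Union

Vector = List[float]


def x_encoder(window: Sequence[float]) -> Vector:
    """Toy stand-in for a time-series encoder: mean, spread, and net change."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / n)
    return [mean, std, window[-1] - window[0]]


# Hypothetical query conditioning; a real predictor would learn this.
QUERY_GATES = {
    "level": [1.0, 0.0, 0.0],
    "volatility": [0.0, 1.0, 0.0],
    "trend": [0.0, 0.0, 1.0],
}


def predictor(x_emb: Vector, query: str) -> Vector:
    """Query-conditioned prediction, here reduced to an element-wise gate."""
    return [g * x for g, x in zip(QUERY_GATES[query], x_emb)]


def answer(
    window: Sequence[float],
    query: str,
    readout: Optional[Callable[[Vector], str]] = None,
) -> Union[Vector, str]:
    """Return the predicted embedding; decode only when a readout is supplied."""
    predicted = predictor(x_encoder(window), query)
    return readout(predicted) if readout is not None else predicted


def trend_caption(predicted: Vector) -> str:
    """Hypothetical selective decoder for the 'trend' query."""
    change = predicted[2]
    direction = "rising" if change > 0 else "falling" if change < 0 else "flat"
    return f"{direction} (net change {change:+.2f})"


embedding = answer([1.0, 2.0, 4.0], "trend")               # no decoding
caption = answer([1.0, 2.0, 4.0], "trend", trend_caption)  # selective readout
```

The point of the sketch is the call shape: the same predicted vector serves both paths, and the decoder runs only when a caller asks for it.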
Output Contract
TSL-JEPA should support several output surfaces without turning every task into open-ended text generation.
- Retrieval: embed a time-series window and query so similar windows, regimes, or examples can be found.
- Alerting and classification: compare the predicted target embedding with candidate labels or alert types.
- Captioning: decode text when a human-readable explanation is needed.
- Structured readouts: emit typed values such as amplitude, frequency, trend, shape class, jump flags, segment boundaries, or event labels.
The structured-output branch is important. If the target is “amplitude,” a numeric decoder or typed formatter may be a better output contract than a free-form sentence. If the target is “which alert should fire?”, candidate-label scoring may be better than unconstrained generation.
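Candidate-label scoring can be illustrated with cosine similarity between a predicted target embedding and a set of label embeddings. The sketch below assumes cosine similarity and a fixed threshold purely for illustration; the two-dimensional vectors, the label names, and the `threshold` default are hypothetical, and a real system would take the label vectors from a y-encoder.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_candidates(
    predicted: List[float], candidates: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Rank candidate labels by similarity to the predicted target embedding."""
    scored = [(label, cosine(predicted, emb)) for label, emb in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def select_alert(
    predicted: List[float],
    candidates: Dict[str, List[float]],
    threshold: float = 0.5,
) -> Optional[str]:
    """Return the best label, or None when nothing is similar enough to fire."""
    label, score = score_candidates(predicted, candidates)[0]
    return label if score >= threshold else None


# Hypothetical label embeddings.
alert_types = {"spike": [1.0, 0.0], "drift": [0.0, 1.0]}
fired = select_alert([0.9, 0.1], alert_types)
```

Because the output is always one of the candidate labels or `None`, downstream code can branch on it without parsing generated text.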
Data Foundation
The main scaling bottleneck is the lack of an ImageNet-like time-series label foundation. Time-series labels are heterogeneous: shape labels, statistical properties, regimes, anomalies, events, forecasts, captions, domain labels, and operational alerts do not naturally live in one small ontology.
The near-term path should therefore narrow the scope before claiming general zero-shot time-series understanding. A practical first version can use two complementary data sources:
- Synthetic labeled series: generators such as CauKer can expose known factors, labels, and query targets directly.
- Real domain series with generated annotations: real multivariate time series or event streams can receive model-generated, heuristic, or human-audited labels.
This is where Florence-2 is useful. The lesson is not to copy its vision pipeline literally. The lesson is to shape the target distribution with dense, multi-granularity labels. A single time-series window should have many query-target views, not one loose caption or one class label.
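To make "many query-target views per window" concrete, here is a minimal synthetic generator whose known factors are turned directly into several query-target pairs. It is not CauKer and does not reproduce its method; the sinusoid-plus-trend-plus-jump model, the factor names, and the view names are assumptions chosen only to show the data shape.

```python
import math
import random
from typing import Dict, List


def make_series(n: int, factors: Dict, noise: float = 0.0, seed: int = 0) -> List[float]:
    """Sinusoid + linear trend + optional level jump + Gaussian noise."""
    rng = random.Random(seed)
    series = []
    for t in range(n):
        value = factors["amplitude"] * math.sin(2 * math.pi * t / factors["period"])
        value += factors["slope"] * t
        if factors["jump_at"] is not None and t >= factors["jump_at"]:
            value += factors["jump_size"]
        series.append(value + rng.gauss(0.0, noise))
    return series


def dense_views(factors: Dict) -> Dict[str, object]:
    """Derive several query -> target pairs from the same generating factors."""
    slope = factors["slope"]
    trend = "up" if slope > 0 else "down" if slope < 0 else "flat"
    has_jump = factors["jump_at"] is not None
    jump_text = "one jump" if has_jump else "no jump"
    return {
        "amplitude": factors["amplitude"],   # numeric target
        "period": factors["period"],         # numeric target
        "trend": trend,                      # class target
        "has_jump": has_jump,                # flag target
        "jump_index": factors["jump_at"],    # boundary target (or None)
        "caption": f"{trend} trend, period {factors['period']}, {jump_text}",
    }


factors = {"amplitude": 2.0, "period": 20, "slope": 0.1, "jump_at": 50, "jump_size": 3.0}
series = make_series(100, factors)
views = dense_views(factors)
```

Each `(series, query, target)` triple from `views` is one training example, so a single window yields numeric, class, flag, boundary, and caption supervision at once.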
Florence-2 Translation
Florence-2 suggests two reusable rules for TSL-JEPA.
First, dense labels are a modeling tool. Multiple labels for the same observation reduce ambiguity and make the target distribution easier to learn. For time series, this means producing many structured targets per segment: trends, shapes, jumps, frequencies, regimes, alerts, and caption-like summaries.
Second, structured output is often better than free text. Florence-2 serializes boxes, points, OCR, captions, and grounding tasks through a promptable sequence interface. TSL-JEPA can use the same principle without forcing everything through chat: queries can select typed output contracts, and a separate formatter can turn structured answers into prose when needed.
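A typed output contract with a separate formatter might look like the following sketch. The three readout classes, their field names, and the prose templates are hypothetical; they only illustrate keeping the parseable record as the primary output and treating prose as a secondary view of it.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class ScalarReadout:
    name: str
    value: float
    unit: str


@dataclass(frozen=True)
class IntervalReadout:
    name: str
    start: int  # index of the first sample in the segment
    end: int    # index of the last sample in the segment


@dataclass(frozen=True)
class LabelReadout:
    name: str
    label: str
    score: float


Readout = Union[ScalarReadout, IntervalReadout, LabelReadout]


def to_record(readout: Readout) -> Dict[str, object]:
    """Parseable form: the contract type plus its typed fields."""
    return {"type": type(readout).__name__, **asdict(readout)}


def to_prose(readout: Readout) -> str:
    """Separate formatter: prose is derived from the structured answer."""
    if isinstance(readout, ScalarReadout):
        return f"{readout.name} is {readout.value:g} {readout.unit}".strip()
    if isinstance(readout, IntervalReadout):
        return f"{readout.name} spans samples {readout.start} to {readout.end}"
    return f"{readout.name}: {readout.label} (score {readout.score:.2f})"


amplitude = ScalarReadout("amplitude", 2.5, "V")
record = to_record(amplitude)
sentence = to_prose(amplitude)
```

With this split, the model is trained and evaluated on the typed fields, and changes to the prose templates never affect the learned targets.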
Scope
The first scope should be intentionally limited. TSL-JEPA does not need to prove universal time-series understanding in one step.
A good first public story would show that a query-conditioned JEPA interface can cover a small but coherent set of time-series tasks better than a pure next-token-prediction baseline under matched data and model budgets. The task set should be broad enough to show the interface, but narrow enough that the label ontology remains stable.
Useful initial surfaces:
- retrieval over similar time-series windows;
- alert or class selection from candidate labels;
- structured property extraction;
- optional captioning as a selective readout.
Relation To Existing Wiki Threads
TSL-JEPA belongs next to JEPA: it is a time-series extension of predictive embedding learning, not a generic language-modeling trick.
It also belongs next to Vision-Language Models, because VL-JEPA and Florence-2 show two different ways to avoid making open-ended token generation the only interface: predicted embeddings with selective decoding, and structured task-prompted outputs.
The idea also depends on Synthetic Data For Time Series and Iterative Dataset Bootstrapping, because the model will not scale without dense labels or generated annotation layers.
Relation To Foundation TSFM Agenda
This is an idea page, so the verdicts below describe the intended contribution if the proposed system works. Evidence status is recorded separately in the Evidence and Missing pieces columns.
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Context interface | partially closes | Proposes a query-conditioned interface where task, label, or output contract conditions the time-series representation. Evidence is an internal design proposal grounded in VL-JEPA and ChatTS analogies. | Define a stable query/target schema and test whether context is necessary rather than decorative. |
| Representation quality | partially closes | Proposes predicting target representations before optional decoding, so labels, retrieval, captions, and typed outputs can share one latent interface. Evidence is not yet public for time series. | Show that the representation preserves both semantic state and dense numeric detail. |
| Data diversity, curriculum, and long tail | adjacent | Florence-2 motivates dense multi-label supervision; CauKer motivates scalable synthetic labeled series. | Build a dense time-series query-target corpus and test transfer to real domains. |
| Dynamic compute allocation | partially closes | The interface decodes only when a human-readable answer is needed, following the VL-JEPA selective-decoding pattern. | Define readout triggers for events, alerts, retrieval, and captioning. |
| Benchmark level | warning | The comparison should be against next-token prediction on matched data, not only against weak fixed-label baselines. | Add held-out query reformulations, candidate-label reformulations, and domain shifts to the evaluation. |
Open Questions
- Which query-target ontology is stable enough for the first public TSL-JEPA version?
- Should the first target space be text embeddings, label embeddings, typed numeric outputs, or a mixture?
- How should structured outputs be serialized so the model learns values and events rather than prose style?
- Which tasks should define the first scope: retrieval, alerting, captioning, classification, property extraction, or regime labeling?
- How much data should come from synthetic generators versus generated annotations over real time series?
- Can TSL-JEPA outperform a ChatTS-style next-token pipeline under query and candidate-label reformulation?
- How should SIGReg or another anti-collapse regularizer be used when the target space mixes labels, text, and numeric values?