Robotics Time-Series Modeling
Summary
Robotics foundation models usually do not treat sensor data as a plain forecasting table. They model action-conditioned multimodal trajectories: recent observations, language or task context, proprioceptive state when available, and actions or control inputs over a short control horizon. The time-series object is a trajectory, not a standalone sequence of scalar measurements.
The dominant interface is therefore:
    context, observation_history, optional action_history -> action_chunk

or, for world models:

    context, observation_history, action_sequence -> future_observation_or_latent_trajectory

This page uses Terminology: robot commands are actions or numeric control inputs; camera, proprioception, force, tactile, and IMU samples are observations; task text and embodiment metadata are context; uncontrolled scene changes are events or exogenous variables.
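A minimal sketch of these two interfaces, with hypothetical container types, field names, and shapes chosen for illustration only (the 8-step, 7-DoF action chunk and 256-dim latent are assumptions, not any model's API):

```python
# Hypothetical container types for the two interfaces above. Field names, shapes,
# and the placeholder outputs are illustrative assumptions, not a specific model.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class TrajectoryContext:
    instruction: str          # task/language context
    embodiment_id: str        # embodiment / control-mode metadata
    control_hz: float         # control frequency, kept explicit

@dataclass
class ObservationHistory:
    rgb: np.ndarray           # (T_hist, H, W, 3) camera frames
    proprio: np.ndarray       # (T_hist, D_state) joint and gripper state

def policy_interface(ctx: TrajectoryContext,
                     obs: ObservationHistory,
                     action_history: Optional[np.ndarray] = None) -> np.ndarray:
    """context, observation_history[, action_history] -> action_chunk of shape (H, D_action)."""
    return np.zeros((8, 7))   # placeholder chunk; a real policy would be a learned model

def world_model_interface(ctx: TrajectoryContext,
                          obs: ObservationHistory,
                          action_sequence: np.ndarray) -> np.ndarray:
    """context, observation_history, action_sequence -> future latent trajectory (H, D_latent)."""
    return np.zeros((action_sequence.shape[0], 256))  # placeholder latent rollout
```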
What The Wiki Currently Believes
Robotics Data Is Trajectory Data
The closest local sources are embodied trajectory datasets and robotic world-model papers. Open X-Embodiment standardizes robot-learning datasets across embodiments and trains RT-X policies over image history, language instructions, and discretized end-effector actions. DROID adds in-the-wild manipulation episodes with synchronized cameras, calibration, depth, and language. BridgeData V2 is a frequent substrate for language-conditioned manipulation policies and action-conditioned world-model evaluation. RoboTurk provides earlier teleoperation context: the raw signal is still a time-ordered demonstration with human-generated control inputs.
This differs from most Time-Series Foundation Models. A forecasting model predicts future observations from history; a robot policy chooses a control input; an action-conditioned world model predicts how future observations change under candidate actions.
Common Encoding Pattern
Robotics models tend to encode each timestep as a multimodal bundle:
| Component | Usual Encoding | Time-Series Meaning |
|---|---|---|
| Image/video | CNN, ViT, VLM, or latent encoder tokens per frame | High-dimensional observation of partial state |
| Proprioception | Normalized numeric features projected to tokens or fused into the action head | Observed robot state, not the whole environment state |
| Force, tactile, IMU | High-frequency numeric streams, often downsampled, windowed, or encoded by a small temporal encoder | Contact and motion observations; usually more time-sensitive than RGB |
| Language instruction | Text tokens or a frozen text/VLM embedding | Task context, not a time series unless it changes over the episode |
| Embodiment/control metadata | Robot ID, control mode, action-space metadata, control frequency | Static or slowly changing context that disambiguates action semantics |
| Actions/control inputs | Discrete bins, text-like action tokens, continuous chunks, diffusion/flow trajectories, or frequency-space tokens | Controllable inputs used for policy output or world-model conditioning |
The dominant VLA recipe is late fusion: images, language, and robot-state vectors are encoded by modality-specific frontends, projected to a shared token width, and then concatenated into one Transformer sequence. Proprioception usually enters as one or more continuous state tokens after an MLP or linear projection rather than through scalar binning.
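A minimal late-fusion sketch of that pattern, assuming stand-in linear encoders and illustrative dimensions and token counts (none of the sizes are drawn from a specific VLA):

```python
# Minimal late-fusion sketch in PyTorch. Encoders are stand-in projections;
# dimensions and token counts are illustrative assumptions, not a specific VLA.
import torch
import torch.nn as nn

class LateFusionBackbone(nn.Module):
    def __init__(self, d_model=512, d_img=768, d_txt=384, d_proprio=14):
        super().__init__()
        self.img_proj = nn.Linear(d_img, d_model)        # per-patch image features -> shared width
        self.txt_proj = nn.Linear(d_txt, d_model)        # language-token embeddings -> shared width
        self.state_proj = nn.Sequential(                 # proprioception enters as a continuous state token
            nn.Linear(d_proprio, d_model), nn.GELU(), nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, img_feats, txt_feats, proprio):
        # img_feats: (B, N_patches, d_img), txt_feats: (B, N_text, d_txt), proprio: (B, d_proprio)
        tokens = torch.cat([
            self.img_proj(img_feats),
            self.txt_proj(txt_feats),
            self.state_proj(proprio).unsqueeze(1),       # one continuous state token, no scalar binning
        ], dim=1)
        return self.trunk(tokens)                        # one fused sequence, consumed by an action head

fused = LateFusionBackbone()(torch.randn(2, 196, 768), torch.randn(2, 16, 384), torch.randn(2, 14))
```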
The cross-embodiment problem is mostly an interface problem. Open X-Embodiment coarsely aligns heterogeneous robots by choosing a canonical camera stream and mapping controls into a normalized end-effector action representation, but this loses embodiment-specific details. Newer physical-intelligence-style models increasingly add metadata for control mode, speed, quality, visual subgoals, or embodiment so the same model can interpret the same numeric action representation differently.
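A hedged sketch of that interface problem, assuming a canonical 7-D end-effector-delta layout and hypothetical field names; the point is that the conversion is lossy and the metadata should record what was normalized or dropped:

```python
# Illustrative cross-embodiment conversion: map robot-specific controls into a shared
# 7-D end-effector-delta vector and record what was dropped. Field names are assumptions.
import numpy as np

def to_canonical_action(raw_action: dict, embodiment: str, control_hz: float) -> dict:
    # Assumed canonical layout: [dx, dy, dz, droll, dpitch, dyaw, gripper]
    canon = np.zeros(7, dtype=np.float32)
    dropped = []
    if "ee_delta" in raw_action:
        canon[:6] = raw_action["ee_delta"][:6]
    else:
        dropped.append("joint-space command (no ee_delta provided)")
    canon[6] = raw_action.get("gripper", 0.0)
    for key in raw_action:
        if key not in ("ee_delta", "gripper"):
            dropped.append(key)                      # e.g. torso lift, torque limits
    return {
        "action": canon,
        "metadata": {                                # keep the lossy interface explicit
            "embodiment": embodiment,
            "control_hz": control_hz,
            "dropped_fields": dropped,
        },
    }
```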
Time Handling
Robotics time is mostly control time, not calendar time. The model sees a finite history window and outputs either the next control input, a short action chunk, or a future latent/video trajectory. The key design choices are:
- History length: RT-X-style policies use a short image history; world models condition on a finite visual-action history; diffusion and flow policies condition on recent observations before generating action chunks.
- Action horizon: ACT, diffusion policies, RDT-like models, and pi0-like flow policies predict multiple future control inputs at once, then execute part of the chunk before replanning.
- Receding horizon: diffusion and flow policies usually replan repeatedly rather than executing an entire long rollout open-loop.
- Temporal smoothing: ACT-style temporal ensembling averages overlapping action chunks to reduce jitter and compounding errors.
- Frame rate and control frequency: cross-robot data may use different control rates; models either normalize actions into a shared representation, include control-frequency metadata, or rely on the dataset conversion layer.
- Generative timestep: diffusion and flow policies have a second notion of time: the denoising or flow timestep. This is usually encoded explicitly, often with sinusoidal features, and is separate from physical control time.
- Elapsed time versus token index: when sampling is irregular, token position is not enough. This is the same failure mode that Kairos addresses for time-series forecasting with mixed patch sizes and physical-time calibration.
The practical rule is: keep delta_t, control frequency, frame subsampling, and action horizon explicit whenever data crosses robots, sensors, or datasets.
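A sketch of the receding-horizon pattern with ACT-style temporal ensembling, assuming an 8-step chunk, a 20 Hz control rate, a stand-in policy, and an exponential weighting constant chosen for illustration:

```python
# Sketch of receding-horizon chunk execution with temporal ensembling over
# overlapping chunks. Chunk length, rate, weights, and the dummy policy are assumptions.
import numpy as np

CHUNK, D_ACTION, STEPS, DT = 8, 7, 24, 1.0 / 20.0       # 20 Hz control, delta_t kept explicit

def policy(obs_t: float) -> np.ndarray:
    return np.full((CHUNK, D_ACTION), obs_t)             # stand-in for a chunking policy

chunk_buffer = {}                                        # target step -> predictions that cover it
for t in range(STEPS):
    chunk = policy(obs_t=t * DT)                         # replan every step (receding horizon)
    for k in range(CHUNK):
        chunk_buffer.setdefault(t + k, []).append(chunk[k])
    preds = chunk_buffer.pop(t)                          # all overlapping predictions for step t
    weights = np.exp(-0.1 * np.arange(len(preds)))       # oldest prediction weighted most; 0.1 is assumed
    weights /= weights.sum()
    action_t = np.average(np.stack(preds), axis=0, weights=weights)
    # send action_t to the robot here; the spacing between steps stays DT by construction
```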
Action Encoding
There are three main action interfaces:
- Discretized action tokens. RT-1 and RT-X-style models bucket each action dimension and train with categorical cross-entropy. RT-2-style VLAs cast action bins into text-like tokens so a VLM can output robot controls.
- Continuous action chunks. ACT, Diffusion Policy, RDT, and pi0-style models generate a sequence of future continuous controls, often for dexterous manipulation where one-step actions are too myopic or noisy.
- Compressed action tokens. FAST-style action tokenization moves an action chunk into frequency space before discrete tokenization, making high-frequency continuous controls more compatible with autoregressive VLA training.
For this wiki, all three are representations of control inputs. They should not be confused with passive events or exogenous variables.
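A minimal sketch of the first interface, discretized action tokens, assuming 256 uniform bins over a normalized [-1, 1] range (bin count and range are illustrative conventions). FAST-style frequency-space tokenization would additionally transform the chunk, for example with a DCT, before quantization and is not shown here.

```python
# Per-dimension uniform action binning and its inverse, in the spirit of discretized
# action tokens. The 256-bin count and [-1, 1] range are assumed conventions.
import numpy as np

N_BINS, LOW, HIGH = 256, -1.0, 1.0

def actions_to_tokens(actions: np.ndarray) -> np.ndarray:
    """actions: (..., D) continuous values in [LOW, HIGH] -> integer bin ids."""
    clipped = np.clip(actions, LOW, HIGH)
    ids = np.floor((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1) + 0.5)
    return ids.astype(np.int64)

def tokens_to_actions(ids: np.ndarray) -> np.ndarray:
    """Inverse map: bin ids -> bin-center continuous values."""
    return LOW + ids.astype(np.float64) / (N_BINS - 1) * (HIGH - LOW)

chunk = np.random.uniform(-1, 1, size=(8, 7))            # an 8-step, 7-DoF action chunk
tokens = actions_to_tokens(chunk)                         # what a discretized-token policy predicts
recovered = tokens_to_actions(tokens)                     # quantization error bounded by the bin width
assert np.max(np.abs(recovered - chunk)) <= (HIGH - LOW) / (N_BINS - 1)
```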
Attention And Sequence Mixers
Robotics models reuse Transformer components, but the attention pattern is shaped by the embodied interface:
| Pattern | Typical Use | Attention / Mixer |
|---|---|---|
| Decoder-only policy Transformer | RT-1/RT-X-style policies over image-history and language tokens | Causal or autoregressive self-attention over compressed observation/context tokens before action-token prediction |
| VLM-as-policy | RT-2/OpenVLA-style policies | Pretrained vision-language attention reused; action tokens become part of the output vocabulary |
| Action Chunking Transformer | Bimanual imitation learning and low-data dexterous tasks | Transformer over observations and latent action sequence, with temporal ensembling at inference |
| Diffusion Transformer policy | Diffusion Policy and RDT-style policies | Denoising network over action trajectories, conditioned on visual/language/proprioceptive observations by cross-attention or token fusion |
| Flow-matching action expert | pi0-style physical-intelligence policies | VLM backbone supplies semantic context; a separate continuous action expert generates fluent control trajectories |
| Blockwise multimodal policy | pi0-style physical-intelligence policies | Separate blocks for image/language context, robot state, and future action tokens; causal masking prevents later action tokens from leaking into observation/state processing |
| Action-conditioned latent world model | Robotic video/latent rollout for planning and policy evaluation | DiT-style transition model; Reconstruction Or Semantics? uses factorized spatial attention within frames and causal temporal attention across frames |
| Recurrent or SSM mixer | Low-latency control, long sensor histories, or deployment-constrained loops | Mamba-2-style state-space mixers are attractive when recurrent inference matters more than full attention over long histories |
The attention axis is not just “global versus local.” Robotics often needs structured mixing across space, time, sensor channel, language context, and action horizon. Factorized attention is common because full attention over all video patches, timesteps, views, and action tokens becomes expensive quickly.
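A minimal sketch of one common factorization, assuming frame tokens shaped (batch, time, space, width): full attention within each frame, causal attention across frames at each spatial position. Shapes and layer sizes are illustrative.

```python
# Factorized spatio-temporal mixing: spatial attention within frames, causal temporal
# attention across frames. Dimensions and the single-block structure are assumptions.
import torch
import torch.nn as nn

class FactorizedSTBlock(nn.Module):
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.temporal = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, x):                                 # x: (B, T, S, D) frame tokens
        B, T, S, D = x.shape
        xs = x.reshape(B * T, S, D)                       # spatial attention within each frame
        xs = xs + self.spatial(xs, xs, xs, need_weights=False)[0]
        xt = xs.reshape(B, T, S, D).permute(0, 2, 1, 3).reshape(B * S, T, D)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        xt = xt + self.temporal(xt, xt, xt, attn_mask=causal, need_weights=False)[0]
        return xt.reshape(B, S, T, D).permute(0, 2, 1, 3)  # back to (B, T, S, D)

out = FactorizedSTBlock()(torch.randn(2, 6, 64, 256))     # 6 frames of 64 patch tokens each
```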
Latent World Models
Reconstruction Or Semantics? is the strongest local anchor for robotic world models. It frames action-conditioned video world models as predictors of future observations from observation and action histories, but argues that the latent space matters more than pixel fidelity alone. Semantic latents can preserve action-relevant object state, task progress, and controllability better than reconstruction latents.
That is directly relevant to sensor time series: the model should not merely reconstruct the next frame or next sensor value. It should preserve the latent state variables that make action consequences predictable.
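A minimal sketch of an action-conditioned latent transition, with stand-in MLP encoder and dynamics and illustrative dimensions. What the sketch deliberately omits is the training objective; the argument above is precisely that the objective should shape the latent toward action-relevant state rather than pixel reconstruction.

```python
# Action-conditioned latent rollout: encode an observation into a latent, then roll the
# latent forward under a candidate action sequence. Sizes and modules are placeholders.
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    def __init__(self, d_obs=128, d_latent=64, d_action=7):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_obs, d_latent), nn.Tanh())
        self.dynamics = nn.Sequential(nn.Linear(d_latent + d_action, 128), nn.GELU(),
                                      nn.Linear(128, d_latent))

    def rollout(self, obs, actions):                      # obs: (B, d_obs), actions: (B, H, d_action)
        z = self.encoder(obs)
        latents = []
        for h in range(actions.shape[1]):
            z = z + self.dynamics(torch.cat([z, actions[:, h]], dim=-1))  # residual latent step
            latents.append(z)
        return torch.stack(latents, dim=1)                # (B, H, d_latent) predicted latent trajectory

traj = LatentWorldModel().rollout(torch.randn(4, 128), torch.randn(4, 5, 7))
```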
Numeric State Trajectories
Not all robotics time series are vision-heavy. D4RL and CausalWorld are cleaner numeric-control anchors: they expose state-action trajectories and are closer to classical action-conditioned dynamics learning. They are useful when the question is about action-conditioned multivariate time series rather than visual manipulation.
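A generic sketch of how such trajectories become one-step dynamics training pairs; the array layout and delta-state target are common conventions for this setting, not the API of any particular benchmark.

```python
# Turn a numeric state-action trajectory into (s_t, a_t) -> s_{t+1} training pairs.
# Shapes and the delta-state target are illustrative assumptions.
import numpy as np

states = np.random.randn(1000, 17)      # e.g. joint positions/velocities over an episode
actions = np.random.randn(1000, 6)      # control inputs applied at each step

s_t, a_t, s_next = states[:-1], actions[:-1], states[1:]
X = np.concatenate([s_t, a_t], axis=-1)                  # inputs to a dynamics model
Y = s_next - s_t                                         # predicting state deltas is a common choice
```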
Design Heuristics
- Preserve the distinction between observation, state, latent state, action, control input, and exogenous variable.
- Treat RGB/video as an observation stream, not as the state itself.
- Keep proprioception and action history even when vision dominates; contact-rich manipulation often needs short-term motion and gripper state.
- Represent high-frequency tactile, force, and IMU streams as windowed multivariate time series with explicit sampling rates before fusing them with lower-rate camera frames (see the windowing sketch after this list). The common adaptation path is to add them as new state or observation tokens during fine-tuning.
- Use action chunks when one-step controls are noisy, delayed, or too locally ambiguous.
- Use action-conditioned world models when the task is planning or policy evaluation, not direct behavior cloning.
- Use semantic latent spaces when downstream control cares about object identity, task progress, and action recoverability more than pixel-level reconstruction.
- Treat cross-embodiment normalization as a lossy modeling decision; record what was normalized, discretized, resampled, or dropped.
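A sketch of the windowing heuristic referenced above, assuming illustrative sampling rates (a 333 Hz wrench stream, 10 Hz camera frames) and a fixed 32-sample window per frame:

```python
# Align a high-rate force/torque stream to lower-rate camera frames: one fixed-length
# window of recent samples per frame, with both rates kept explicit. Rates are assumed.
import numpy as np

FT_HZ, CAM_HZ, WINDOW = 333.0, 10.0, 32                  # samples per force window
ft_t = np.arange(0, 5.0, 1.0 / FT_HZ)                    # force/torque timestamps (s)
ft = np.random.randn(ft_t.size, 6)                       # 6-axis wrench samples
cam_t = np.arange(0, 5.0, 1.0 / CAM_HZ)                  # camera frame timestamps (s)

windows = []
for t in cam_t:
    idx = np.searchsorted(ft_t, t, side="right")          # most recent sample at or before t
    start = max(idx - WINDOW, 0)
    w = ft[start:idx]
    if w.shape[0] < WINDOW:                                # left-pad early frames
        w = np.concatenate([np.zeros((WINDOW - w.shape[0], 6)), w], axis=0)
    windows.append(w)
windows = np.stack(windows)                                # (N_frames, WINDOW, 6), ready for a small temporal encoder
```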
External Anchors To Ingest Next
These sources are useful next candidates for full source pages if this topic becomes a major robotics branch of the wiki:
- RT-1: Robotics Transformer for Real-World Control at Scale
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
- Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware / ACT
- Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
- Octo: An Open-Source Generalist Robot Policy
- OpenVLA: An Open-Source Vision-Language-Action Model
- RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
- pi0: A Vision-Language-Action Flow Model for General Robot Control
- FAST: Efficient Action Tokenization for Vision-Language-Action Models
- pi0.7: a Steerable Model with Emergent Capabilities
Open Questions
- Should this wiki treat robot manipulation datasets as a separate branch of Action-Conditioned Time-Series Datasets, or keep them in a robotics-specific topic because vision and embodiment dominate the interface?
- What is the best canonical representation for proprioception, force, tactile, and IMU streams when combining them with VLM-style image tokens?
- When should robot action chunks be represented as continuous trajectories, discrete tokens, or frequency-space tokens?
- Can factorized spatial-temporal attention preserve contact dynamics as well as it preserves visual task progress?
- Which benchmarks evaluate action recoverability, closed-loop success, and latent-state faithfulness rather than only video fidelity or imitation loss?