Robotics Time-Series Modeling
Summary
Robotics foundation models usually do not treat sensor data as a plain forecasting table. They model action-conditioned multimodal trajectories: recent observations, language or task context, proprioceptive state when available, and actions or control inputs over a short control horizon. The time-series object is a trajectory, not a standalone sequence of scalar measurements.
The dominant interface is therefore:
    context, observation_history, optional action_history -> action_chunk

or, for world models:

    context, observation_history, action_sequence -> future_observation_or_latent_trajectory

This page uses Terminology: robot commands are actions or numeric control inputs; camera, proprioception, force, tactile, and IMU samples are observations; task text and embodiment metadata are context; uncontrolled scene changes are events or exogenous variables.
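A minimal sketch of these two interfaces, with hypothetical container types, field names, and shapes chosen for illustration only (the 8-step, 7-DoF action chunk and 256-dim latent are assumptions, not any model's API):

```python
# Hypothetical container types for the two interfaces above. Field names, shapes,
# and the placeholder outputs are illustrative assumptions, not a specific model.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class TrajectoryContext:
    instruction: str          # task/language context
    embodiment_id: str        # embodiment / control-mode metadata
    control_hz: float         # control frequency, kept explicit

@dataclass
class ObservationHistory:
    rgb: np.ndarray           # (T_hist, H, W, 3) camera frames
    proprio: np.ndarray       # (T_hist, D_state) joint and gripper state

def policy_interface(ctx: TrajectoryContext,
                     obs: ObservationHistory,
                     action_history: Optional[np.ndarray] = None) -> np.ndarray:
    """context, observation_history[, action_history] -> action_chunk of shape (H, D_action)."""
    return np.zeros((8, 7))   # placeholder chunk; a real policy would be a learned model

def world_model_interface(ctx: TrajectoryContext,
                          obs: ObservationHistory,
                          action_sequence: np.ndarray) -> np.ndarray:
    """context, observation_history, action_sequence -> future latent trajectory (H, D_latent)."""
    return np.zeros((action_sequence.shape[0], 256))  # placeholder latent rollout
```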
What The Wiki Currently Believes
Robotics Data Is Trajectory Data
The closest local sources are embodied trajectory datasets and robotic world-model papers. Open X-Embodiment standardizes robot-learning datasets across embodiments and trains RT-X policies over image history, language instructions, and discretized end-effector actions. DROID adds in-the-wild manipulation episodes with synchronized cameras, calibration, depth, and language. BridgeData V2 is a frequent substrate for language-conditioned manipulation policies and action-conditioned world-model evaluation. RoboTurk provides earlier teleoperation context: the raw signal is still a time-ordered demonstration with human-generated control inputs.
This differs from most Time-Series Foundation Models. A forecasting model predicts future observations from history; a robot policy chooses a control input; an action-conditioned world model predicts how future observations change under candidate actions.
Common Encoding Pattern
Robotics models tend to encode each timestep as a multimodal bundle:
| Component | Usual Encoding | Time-Series Meaning |
|---|---|---|
| Image/video | CNN, ViT, VLM, or latent encoder tokens per frame | High-dimensional observation of partial state |
| Proprioception | Normalized numeric features projected to tokens or fused into the action head | Observed robot state, not the whole environment state |
| Force, tactile, IMU | High-frequency numeric streams, often downsampled, windowed, or encoded by a small temporal encoder | Contact and motion observations; usually more time-sensitive than RGB |
| Language instruction | Text tokens or a frozen text/VLM embedding | Task context, not a time series unless it changes over the episode |
| Embodiment/control metadata | Robot ID, control mode, action-space metadata, control frequency | Static or slowly changing context that disambiguates action semantics |
| Actions/control inputs | Discrete bins, text-like action tokens, continuous chunks, diffusion/flow trajectories, or frequency-space tokens | Controllable inputs used for policy output or world-model conditioning |
The dominant VLA recipe is late fusion: images, language, and robot-state vectors are encoded by modality-specific frontends, projected to a shared token width, and then concatenated into one Transformer sequence. Proprioception usually enters as one or more continuous state tokens after an MLP or linear projection rather than through scalar binning.
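A minimal late-fusion sketch of that pattern, assuming stand-in linear encoders and illustrative dimensions and token counts (none of the sizes are drawn from a specific VLA):

```python
# Minimal late-fusion sketch in PyTorch. Encoders are stand-in projections;
# dimensions and token counts are illustrative assumptions, not a specific VLA.
import torch
import torch.nn as nn

class LateFusionBackbone(nn.Module):
    def __init__(self, d_model=512, d_img=768, d_txt=384, d_proprio=14):
        super().__init__()
        self.img_proj = nn.Linear(d_img, d_model)        # per-patch image features -> shared width
        self.txt_proj = nn.Linear(d_txt, d_model)        # language-token embeddings -> shared width
        self.state_proj = nn.Sequential(                 # proprioception enters as a continuous state token
            nn.Linear(d_proprio, d_model), nn.GELU(), nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, img_feats, txt_feats, proprio):
        # img_feats: (B, N_patches, d_img), txt_feats: (B, N_text, d_txt), proprio: (B, d_proprio)
        tokens = torch.cat([
            self.img_proj(img_feats),
            self.txt_proj(txt_feats),
            self.state_proj(proprio).unsqueeze(1),       # one continuous state token, no scalar binning
        ], dim=1)
        return self.trunk(tokens)                        # one fused sequence, consumed by an action head

fused = LateFusionBackbone()(torch.randn(2, 196, 768), torch.randn(2, 16, 384), torch.randn(2, 14))
```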
The cross-embodiment problem is mostly an interface problem. Open X-Embodiment coarsely aligns heterogeneous robots by choosing a canonical camera stream and mapping controls into a normalized end-effector action representation, but this loses embodiment-specific details. Newer physical-intelligence-style models increasingly add metadata for control mode, speed, quality, visual subgoals, or embodiment so the same model can interpret the same numeric action representation differently.
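A hedged sketch of that interface problem, assuming a canonical 7-D end-effector-delta layout and hypothetical field names; the point is that the conversion is lossy and the metadata should record what was normalized or dropped:

```python
# Illustrative cross-embodiment conversion: map robot-specific controls into a shared
# 7-D end-effector-delta vector and record what was dropped. Field names are assumptions.
import numpy as np

def to_canonical_action(raw_action: dict, embodiment: str, control_hz: float) -> dict:
    # Assumed canonical layout: [dx, dy, dz, droll, dpitch, dyaw, gripper]
    canon = np.zeros(7, dtype=np.float32)
    dropped = []
    if "ee_delta" in raw_action:
        canon[:6] = raw_action["ee_delta"][:6]
    else:
        dropped.append("joint-space command (no ee_delta provided)")
    canon[6] = raw_action.get("gripper", 0.0)
    for key in raw_action:
        if key not in ("ee_delta", "gripper"):
            dropped.append(key)                      # e.g. torso lift, torque limits
    return {
        "action": canon,
        "metadata": {                                # keep the lossy interface explicit
            "embodiment": embodiment,
            "control_hz": control_hz,
            "dropped_fields": dropped,
        },
    }
```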
Time Handling
Robotics time is mostly control time, not calendar time. The model sees a finite history window and outputs either the next control input, a short action chunk, or a future latent/video trajectory. The key design choices are:
- History length: RT-X-style policies use a short image history; world models condition on a finite visual-action history; diffusion and flow policies condition on recent observations before generating action chunks.
- Action horizon: ACT, diffusion policies, RDT-like models, and pi0-like flow policies predict multiple future control inputs at once, then execute part of the chunk before replanning.
- Receding horizon: diffusion and flow policies usually replan repeatedly rather than executing an entire long rollout open-loop.
- Temporal smoothing: ACT-style temporal ensembling averages overlapping action chunks to reduce jitter and compounding errors.
- Frame rate and control frequency: cross-robot data may use different control rates; models either normalize actions into a shared representation, include control-frequency metadata, or rely on the dataset conversion layer.
- Generative timestep: diffusion and flow policies have a second notion of time: the denoising or flow timestep. This is usually encoded explicitly, often with sinusoidal features, and is separate from physical control time.
- Elapsed time versus token index: when sampling is irregular, token position is not enough. This is the same failure mode that Kairos addresses for time-series forecasting with mixed patch sizes and physical-time calibration.
The practical rule is: keep delta_t, control frequency, frame subsampling, and action horizon explicit whenever data crosses robots, sensors, or datasets.
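A sketch of the receding-horizon pattern with ACT-style temporal ensembling, assuming an 8-step chunk, a 20 Hz control rate, a stand-in policy, and an exponential weighting constant chosen for illustration:

```python
# Sketch of receding-horizon chunk execution with temporal ensembling over
# overlapping chunks. Chunk length, rate, weights, and the dummy policy are assumptions.
import numpy as np

CHUNK, D_ACTION, STEPS, DT = 8, 7, 24, 1.0 / 20.0       # 20 Hz control, delta_t kept explicit

def policy(obs_t: float) -> np.ndarray:
    return np.full((CHUNK, D_ACTION), obs_t)             # stand-in for a chunking policy

chunk_buffer = {}                                        # target step -> predictions that cover it
for t in range(STEPS):
    chunk = policy(obs_t=t * DT)                         # replan every step (receding horizon)
    for k in range(CHUNK):
        chunk_buffer.setdefault(t + k, []).append(chunk[k])
    preds = chunk_buffer.pop(t)                          # all overlapping predictions for step t
    weights = np.exp(-0.1 * np.arange(len(preds)))       # oldest prediction weighted most; 0.1 is assumed
    weights /= weights.sum()
    action_t = np.average(np.stack(preds), axis=0, weights=weights)
    # send action_t to the robot here; the spacing between steps stays DT by construction
```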
Action Encoding
There are three main action interfaces:
- Discretized action tokens. RT-1 and RT-X-style models bucket each action dimension and train with categorical cross-entropy. RT-2-style VLAs cast action bins into text-like tokens so a VLM can output robot controls.
- Continuous action chunks. ACT, Diffusion Policy, RDT, and pi0-style models generate a sequence of future continuous controls, often for dexterous manipulation where one-step actions are too myopic or noisy.
- Compressed action tokens. FAST-style action tokenization moves an action chunk into frequency space before discrete tokenization, making high-frequency continuous controls more compatible with autoregressive VLA training.
For this wiki, all three are representations of control inputs. They should not be confused with passive events or exogenous variables.
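A minimal sketch of the first interface, discretized action tokens, assuming 256 uniform bins over a normalized [-1, 1] range (bin count and range are illustrative conventions). FAST-style frequency-space tokenization would additionally transform the chunk, for example with a DCT, before quantization and is not shown here.

```python
# Per-dimension uniform action binning and its inverse, in the spirit of discretized
# action tokens. The 256-bin count and [-1, 1] range are assumed conventions.
import numpy as np

N_BINS, LOW, HIGH = 256, -1.0, 1.0

def actions_to_tokens(actions: np.ndarray) -> np.ndarray:
    """actions: (..., D) continuous values in [LOW, HIGH] -> integer bin ids."""
    clipped = np.clip(actions, LOW, HIGH)
    ids = np.floor((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1) + 0.5)
    return ids.astype(np.int64)

def tokens_to_actions(ids: np.ndarray) -> np.ndarray:
    """Inverse map: bin ids -> bin-center continuous values."""
    return LOW + ids.astype(np.float64) / (N_BINS - 1) * (HIGH - LOW)

chunk = np.random.uniform(-1, 1, size=(8, 7))            # an 8-step, 7-DoF action chunk
tokens = actions_to_tokens(chunk)                         # what a discretized-token policy predicts
recovered = tokens_to_actions(tokens)                     # quantization error bounded by the bin width
assert np.max(np.abs(recovered - chunk)) <= (HIGH - LOW) / (N_BINS - 1)
```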
Attention And Sequence Mixers
Robotics models reuse Transformer components, but the attention pattern is shaped by the embodied interface:
| Pattern | Typical Use | Attention / Mixer |
|---|---|---|
| Decoder-only policy Transformer | RT-1/RT-X-style policies over image-history and language tokens | Causal or autoregressive self-attention over compressed observation/context tokens before action-token prediction |
| VLM-as-policy | RT-2/OpenVLA-style policies | Pretrained vision-language attention reused; action tokens become part of the output vocabulary |
| Action Chunking Transformer | Bimanual imitation learning and low-data dexterous tasks | Transformer over observations and latent action sequence, with temporal ensembling at inference |
| Diffusion Transformer policy | Diffusion Policy and RDT-style policies | Denoising network over action trajectories, conditioned on visual/language/proprioceptive observations by cross-attention or token fusion |
| Flow-matching action expert | pi0-style physical-intelligence policies | VLM backbone supplies semantic context; a separate continuous action expert generates fluent control trajectories |
| Blockwise multimodal policy | pi0-style physical-intelligence policies | Separate blocks for image/language context, robot state, and future action tokens; causal masking prevents later action tokens from leaking into observation/state processing |
| Action-conditioned latent world model | Robotic video/latent rollout for planning and policy evaluation | DiT-style transition model; Reconstruction Or Semantics? uses factorized spatial attention within frames and causal temporal attention across frames |
| Recurrent or SSM mixer | Low-latency control, long sensor histories, or deployment-constrained loops | Mamba-2-style state-space mixers are attractive when recurrent inference matters more than full attention over long histories |
The attention axis is not just “global versus local.” Robotics often needs structured mixing across space, time, sensor channel, language context, and action horizon. Factorized attention is common because full attention over all video patches, timesteps, views, and action tokens becomes expensive quickly.
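A minimal sketch of one common factorization, assuming frame tokens shaped (batch, time, space, width): full attention within each frame, causal attention across frames at each spatial position. Shapes and layer sizes are illustrative.

```python
# Factorized spatio-temporal mixing: spatial attention within frames, causal temporal
# attention across frames. Dimensions and the single-block structure are assumptions.
import torch
import torch.nn as nn

class FactorizedSTBlock(nn.Module):
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.temporal = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, x):                                 # x: (B, T, S, D) frame tokens
        B, T, S, D = x.shape
        xs = x.reshape(B * T, S, D)                       # spatial attention within each frame
        xs = xs + self.spatial(xs, xs, xs, need_weights=False)[0]
        xt = xs.reshape(B, T, S, D).permute(0, 2, 1, 3).reshape(B * S, T, D)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        xt = xt + self.temporal(xt, xt, xt, attn_mask=causal, need_weights=False)[0]
        return xt.reshape(B, S, T, D).permute(0, 2, 1, 3)  # back to (B, T, S, D)

out = FactorizedSTBlock()(torch.randn(2, 6, 64, 256))     # 6 frames of 64 patch tokens each
```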
Latent World Models
Reconstruction Or Semantics? is the strongest local anchor for robotic world models. It frames action-conditioned video world models as predictors of future observations from observation and action histories, but argues that the latent space matters more than pixel fidelity alone. Semantic latents can preserve action-relevant object state, task progress, and controllability better than reconstruction latents.
That is directly relevant to sensor time series: the model should not merely reconstruct the next frame or next sensor value. It should preserve the latent state variables that make action consequences predictable.
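A minimal sketch of an action-conditioned latent transition, with stand-in MLP encoder and dynamics and illustrative dimensions. What the sketch deliberately omits is the training objective; the argument above is precisely that the objective should shape the latent toward action-relevant state rather than pixel reconstruction.

```python
# Action-conditioned latent rollout: encode an observation into a latent, then roll the
# latent forward under a candidate action sequence. Sizes and modules are placeholders.
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    def __init__(self, d_obs=128, d_latent=64, d_action=7):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_obs, d_latent), nn.Tanh())
        self.dynamics = nn.Sequential(nn.Linear(d_latent + d_action, 128), nn.GELU(),
                                      nn.Linear(128, d_latent))

    def rollout(self, obs, actions):                      # obs: (B, d_obs), actions: (B, H, d_action)
        z = self.encoder(obs)
        latents = []
        for h in range(actions.shape[1]):
            z = z + self.dynamics(torch.cat([z, actions[:, h]], dim=-1))  # residual latent step
            latents.append(z)
        return torch.stack(latents, dim=1)                # (B, H, d_latent) predicted latent trajectory

traj = LatentWorldModel().rollout(torch.randn(4, 128), torch.randn(4, 5, 7))
```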
Numeric State Trajectories
Not all robotics time series are vision-heavy. D4RL and CausalWorld are cleaner numeric-control anchors: they expose state-action trajectories and are closer to classical action-conditioned dynamics learning. They are useful when the question is about action-conditioned multivariate time series rather than visual manipulation.
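A generic sketch of how such trajectories become one-step dynamics training pairs; the array layout and delta-state target are common conventions for this setting, not the API of any particular benchmark.

```python
# Turn a numeric state-action trajectory into (s_t, a_t) -> s_{t+1} training pairs.
# Shapes and the delta-state target are illustrative assumptions.
import numpy as np

states = np.random.randn(1000, 17)      # e.g. joint positions/velocities over an episode
actions = np.random.randn(1000, 6)      # control inputs applied at each step

s_t, a_t, s_next = states[:-1], actions[:-1], states[1:]
X = np.concatenate([s_t, a_t], axis=-1)                  # inputs to a dynamics model
Y = s_next - s_t                                         # predicting state deltas is a common choice
```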
Design Heuristics
- Preserve the distinction between observation, state, latent state, action, control input, and exogenous variable.
- Treat RGB/video as an observation stream, not as the state itself.
- Keep proprioception and action history even when vision dominates; contact-rich manipulation often needs short-term motion and gripper state.
- Represent high-frequency tactile, force, and IMU streams as windowed multivariate time series with explicit sampling rates before fusing them with lower-rate camera frames (see the windowing sketch after this list). The common adaptation path is to add them as new state or observation tokens during fine-tuning.
- Use action chunks when one-step controls are noisy, delayed, or too locally ambiguous.
- Use action-conditioned world models when the task is planning or policy evaluation, not direct behavior cloning.
- Use semantic latent spaces when downstream control cares about object identity, task progress, and action recoverability more than pixel-level reconstruction.
- Treat cross-embodiment normalization as a lossy modeling decision; record what was normalized, discretized, resampled, or dropped.
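A sketch of the windowing heuristic referenced above, assuming illustrative sampling rates (a 333 Hz wrench stream, 10 Hz camera frames) and a fixed 32-sample window per frame:

```python
# Align a high-rate force/torque stream to lower-rate camera frames: one fixed-length
# window of recent samples per frame, with both rates kept explicit. Rates are assumed.
import numpy as np

FT_HZ, CAM_HZ, WINDOW = 333.0, 10.0, 32                  # samples per force window
ft_t = np.arange(0, 5.0, 1.0 / FT_HZ)                    # force/torque timestamps (s)
ft = np.random.randn(ft_t.size, 6)                       # 6-axis wrench samples
cam_t = np.arange(0, 5.0, 1.0 / CAM_HZ)                  # camera frame timestamps (s)

windows = []
for t in cam_t:
    idx = np.searchsorted(ft_t, t, side="right")          # most recent sample at or before t
    start = max(idx - WINDOW, 0)
    w = ft[start:idx]
    if w.shape[0] < WINDOW:                                # left-pad early frames
        w = np.concatenate([np.zeros((WINDOW - w.shape[0], 6)), w], axis=0)
    windows.append(w)
windows = np.stack(windows)                                # (N_frames, WINDOW, 6), ready for a small temporal encoder
```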
External Anchors To Ingest Next
These sources are useful next candidates for full source pages if this topic becomes a major robotics branch of the wiki:
- RT-1: Robotics Transformer for Real-World Control at Scale
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
- Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware / ACT
- Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
- Octo: An Open-Source Generalist Robot Policy
- OpenVLA: An Open-Source Vision-Language-Action Model
- RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
- pi0: A Vision-Language-Action Flow Model for General Robot Control
- FAST: Efficient Action Tokenization for Vision-Language-Action Models
- pi0.7: a Steerable Model with Emergent Capabilities
Open Questions
- Should this wiki treat robot manipulation datasets as a separate branch of Action-Conditioned Time-Series Datasets, or keep them in a robotics-specific topic because vision and embodiment dominate the interface?
- What is the best canonical representation for proprioception, force, tactile, and IMU streams when combining them with VLM-style image tokens?
- When should robot action chunks be represented as continuous trajectories, discrete tokens, or frequency-space tokens?
- Can factorized spatial-temporal attention preserve contact dynamics as well as it preserves visual task progress?
- Which benchmarks evaluate action recoverability, closed-loop success, and latent-state faithfulness rather than only video fidelity or imitation loss?