OpenVLA: An Open-Source Vision-Language-Action Model
Source
- Raw Markdown: paper_openvla-2024.md
- PDF: paper_openvla-2024.pdf
- Preprint: arXiv 2406.09246
- Project page: openvla.github.io
- Official code: github.com/openvla/openvla
- Official weights: openvla/openvla-7b
Core Claim
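OpenVLA makes the RT-2-style VLA recipe open: a pretrained 7B-parameter VLM (a Llama 2 language backbone with DINOv2 and SigLIP visual encoders) is fine-tuned on robot trajectories from Open X-Embodiment so that image observations and language instructions map to discretized end-effector action tokens.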
Method Notes
- The model uses an autoregressive action-token interface rather than a diffusion or flow action expert.
- It is a strong baseline for semantic transfer and open reproducibility, but the action representation is still quantized: each action dimension is discretized into 256 bins, so precision is bounded by the bin width.
- The paper is a useful counterweight to the diffusion/flow trend: standard Transformer/VLM next-token prediction remains a viable policy interface when action-precision and control-frequency demands are manageable.
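The quantized action interface can be illustrated with a minimal sketch. It assumes per-dimension uniform binning into 256 bins between the 1st and 99th percentiles of the training actions, which is how the paper describes the scheme. The function names (`fit_bounds`, `encode`, `decode`) are hypothetical and this is not the official OpenVLA tokenizer code, which also maps bin indices onto reserved vocabulary tokens.

```python
import numpy as np

N_BINS = 256  # bins per action dimension, as described in the OpenVLA paper


def fit_bounds(actions, low_q=1.0, high_q=99.0):
    """Per-dimension bounds from the 1st/99th percentiles of training actions.

    Assumes every dimension has non-zero spread (high > low).
    """
    actions = np.asarray(actions, dtype=np.float64)
    low = np.percentile(actions, low_q, axis=0)
    high = np.percentile(actions, high_q, axis=0)
    return low, high


def encode(action, low, high, n_bins=N_BINS):
    """Map a continuous action vector to integer bin indices in [0, n_bins - 1]."""
    clipped = np.clip(action, low, high)        # outliers saturate at the bounds
    scaled = (clipped - low) / (high - low)     # now in [0, 1]
    return np.minimum((scaled * n_bins).astype(np.int64), n_bins - 1)


def decode(tokens, low, high, n_bins=N_BINS):
    """Map bin indices back to continuous values at the bin centers."""
    centers = (np.asarray(tokens, dtype=np.float64) + 0.5) / n_bins
    return low + centers * (high - low)
```

Under this sketch, the round-trip error for an in-range action is at most half a bin width per dimension, i.e. `(high - low) / (2 * N_BINS)`, which makes the precision limit of the token interface concrete.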
Evidence And Limitations
OpenVLA provides open code and weights and reports competitive results across several robot policy evaluations; the paper's headline figure is a 16.5% absolute gain in task success rate over RT-2-X (55B) across 29 tasks, with 7x fewer parameters. Reported limitations include single-image observations, limited native history/proprioception support in the initial version, and lower inference rates than compact control-specialized policies.
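The inference-rate limitation follows partly from autoregressive decoding: if each action dimension is emitted as one token, every control step pays for one image/prompt prefill plus one decoding pass per dimension. The sketch below is back-of-envelope arithmetic only. The function name `control_rate_hz` and the latency values in the usage note are hypothetical placeholders, not measurements from the paper.

```python
def control_rate_hz(action_dims, per_token_s, prefill_s=0.0):
    """Rough upper bound on control frequency for a token-per-dimension policy.

    action_dims: number of action tokens decoded per control step (assumed
                 to be one token per action dimension).
    per_token_s: hypothetical latency of one decoding pass, in seconds.
    prefill_s:   hypothetical latency of encoding the image and prompt.
    """
    if action_dims <= 0 or per_token_s <= 0 or prefill_s < 0:
        raise ValueError("latencies and action_dims must be positive")
    step_latency = prefill_s + action_dims * per_token_s
    return 1.0 / step_latency
```

For example, with placeholder values of a 7-dimensional action, 20 ms per token, and 60 ms of prefill, `control_rate_hz(7, 0.02, 0.06)` gives roughly 5 Hz; a policy that emits the whole action in a single forward pass avoids the per-dimension term.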
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Control and action interface | adjacent | Fine-tunes a VLM backbone to emit discretized robot action tokens conditioned on image and language. | Does not model future trajectories under candidate actions or provide counterfactual rollouts. |
| Context interface | adjacent | Combines visual observations and natural-language instructions as policy context. | Lacks channel context, topology, event streams, and numeric system context for TSFMs. |
| Benchmarks | adjacent | Evaluates multi-robot out-of-the-box control and fine-tuning across BridgeData, Google Robot, and Franka tasks. | Physical manipulation metrics do not test latent-state time-series modeling or digital-world control. |
Links Into The Wiki
- OpenVLA
- Foundation Time-Series Model Research Agenda
- Robotics Text Conditioning
- Robotics Time-Series Modeling
Open Questions
- Can action-token VLAs match diffusion/flow action experts after better tokenization such as FAST?
- Which robotics tasks are bottlenecked by semantic understanding rather than continuous-control fidelity?