GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Source
- Raw Markdown: paper_gr00t-n1-2025.md
- PDF: paper_gr00t-n1-2025.pdf
- Preprint: arXiv 2503.14734
- Official code: github.com/NVIDIA/Isaac-GR00T
- Official dataset: PhysicalAI-Robotics-GR00T-X-Embodiment-Sim
Core Claim
GR00T N1 is an open vision-language-action (VLA) policy for humanoid robots with a dual-system design: a vision-language module interprets observations and instructions, while a Diffusion Transformer trained with action flow matching generates high-frequency continuous motor actions.
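The fast/slow split can be illustrated with a minimal dual-rate loop. This is a sketch under stated assumptions, not the released Isaac-GR00T code: `run_dual_rate`, `slow_fn`, `fast_fn`, and `slow_period` are hypothetical names, the slow module is assumed to refresh a cached context every `slow_period` control ticks, and the fast module is assumed to act on the most recent context at every tick.

```python
def run_dual_rate(observations, instruction, slow_period, slow_fn, fast_fn):
    """Run a fast controller at every tick and a slow interpreter every `slow_period` ticks.

    observations: iterable of (image, robot_state) pairs, one per control tick.
    slow_fn(image, instruction) -> context   (stand-in for the vision-language module)
    fast_fn(context, robot_state) -> action  (stand-in for the action module)
    All names here are hypothetical and chosen for illustration only.
    """
    context = None
    slow_calls = 0
    actions = []
    for tick, (image, robot_state) in enumerate(observations):
        # Refresh the semantic context only on slow ticks.
        if tick % slow_period == 0:
            context = slow_fn(image, instruction)
            slow_calls += 1
        # The fast module always conditions on the latest cached context.
        actions.append(fast_fn(context, robot_state))
    return actions, slow_calls


if __name__ == "__main__":
    # Toy stand-ins: the "context" records which frame it was computed from.
    frames = [(i, float(i)) for i in range(12)]
    acts, n_slow = run_dual_rate(
        frames,
        instruction="pick up the cup",
        slow_period=4,
        slow_fn=lambda image, text: image,
        fast_fn=lambda ctx, state: (ctx, state),
    )
    print(n_slow, acts[5])
```

The design point is that the action at tick 5 is computed from a context that was last refreshed at tick 4, so the fast loop never blocks on the slow module.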
Method Notes
- The source explicitly separates slower semantic processing from faster action generation.
- System 2 is a VLM for image/language interpretation; System 1 is a DiT-style action module conditioned on VLM outputs and robot state.
- The model is a control-input generator over short trajectories, not a future-observation world model.
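The flow-matching idea behind the action module can be sketched in a few lines. This is a generic illustration, not the paper's exact formulation or the released implementation: it assumes a straight-line path from Gaussian noise (tau = 0) to the clean action chunk (tau = 1), a velocity target of `action - noise`, and a small fixed number of Euler steps at inference. The function names, the flattened-list representation of an action chunk, and `oracle_velocity` (which replaces the trained, VLM- and state-conditioned network with the exact field for a single known chunk) are all hypothetical.

```python
import random


def interpolate(noise, action, tau):
    """Point on the straight-line path: noise at tau=0, clean action chunk at tau=1."""
    return [(1.0 - tau) * n + tau * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Regression target a network would be trained to predict along that path."""
    return [a - n for n, a in zip(noise, action)]


def oracle_velocity(x, tau, action):
    """Stand-in for a trained network: exact velocity when the data is one known chunk."""
    return [(a - xi) / (1.0 - tau) for xi, a in zip(x, action)]


def sample_chunk(velocity_fn, dim, num_steps=4, seed=0):
    """Integrate the velocity field from noise to an action chunk with forward Euler."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    for k in range(num_steps):
        tau = k / num_steps  # tau stays below 1, so oracle_velocity never divides by zero
        v = velocity_fn(x, tau)
        x = [xi + vi / num_steps for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    demo_chunk = [0.1, -0.2, 0.3, 0.0]  # toy flattened action chunk (illustrative values)
    sampled = sample_chunk(lambda x, tau: oracle_velocity(x, tau, demo_chunk), dim=4)
    print(sampled)
```

With the oracle field, the Euler steps follow the straight path exactly, so the sampler recovers the target chunk up to floating-point error. A trained network would only approximate this field and would also take the VLM outputs and robot state as conditioning inputs.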
Evidence And Limitations
The source reports simulation and real GR-1 humanoid evaluations, cross-embodiment data use, and public release artifacts. Limitations include bounded task suites, post-training requirements for specific embodiments, and safety constraints that must be handled outside the released checkpoint.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Causal structure, counterfactuals, and control | partially closes | Generates short-horizon motor trajectories conditioned on observations, instructions, VLM outputs, and robot state. Its fast/slow control architecture also serves as an analogy for digital-world robots. | It is an action generator, not a future-observation world model that could compare candidate interventions, and it is not a state model for observability or digital systems. |
| Multi-modal future distributions | adjacent | Uses action flow matching (diffusion-style trajectory generation) for continuous control. | Does not expose multiple calibrated future system states for planning under uncertainty. |
Links Into The Wiki
- GR00T N1
- Robotics Time-Series Modeling
- Foundation Time-Series Model Research Agenda
- Robotics Text Conditioning
Open Questions
- How much does actionless-video pretraining help once pseudo-actions introduce labeling error?
- Can the System 2/System 1 split become a standard interface for humanoid fast/slow control?