Robotics Text Conditioning

Summary

Text in robotics has moved from a single language instruction embedding to a multi-level control interface. It can now serve as task context, planner output, executable code, subtask handoff, action-token output space, metadata, strategy hints, visual-subgoal descriptions, progress explanation, and a safety/debugging channel.

The current best mental model is not one path from text to motors, but a stack:

human instruction
  -> embodied reasoning / planning text
  -> subtask, skill, metadata, or subgoal prompt
  -> vision-language-action policy or action expert
  -> continuous control inputs

In this wiki’s terminology, most text is context, not an action by itself. It becomes an action only when the system turns it into a selected skill, generated policy program, action token, control input, or subtask command passed to a controller.

What The Wiki Currently Believes

Established Pattern: Instruction Embeddings

The older language-conditioned policy interface is simple: pair each robot trajectory with a natural-language instruction, encode the instruction, fuse it with visual and robot-state observations, and train a policy to predict actions. Open X-Embodiment is the local source anchor for this line because it standardizes robot trajectories with language instructions and trains RT-X models over image histories and discretized end-effector actions.

This pattern treats text as task context:

image_history + instruction_embedding -> next action or action token

It is useful for object selection, task identity, and simple semantic generalization, but it usually does not expose the model’s intermediate plan.
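
A minimal sketch of this interface in PyTorch, assuming a frozen text encoder has already produced the instruction embedding; the class name, dimensions, and fusion MLP are illustrative assumptions, not taken from Open X-Embodiment:

import torch
import torch.nn as nn

class InstructionConditionedPolicy(nn.Module):
    # Sketch: fuse pooled image-history features with an instruction
    # embedding and regress the next end-effector action.
    def __init__(self, img_dim=512, text_dim=384, action_dim=7):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, image_features, instruction_embedding):
        # image_features: (B, img_dim); instruction_embedding: (B, text_dim)
        return self.fuse(torch.cat([image_features, instruction_embedding], dim=-1))

policy = InstructionConditionedPolicy()
action = policy(torch.randn(1, 512), torch.randn(1, 384))  # (1, 7) action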

Planner Pattern: Text Selects Skills

The SayCan-style pattern keeps the language model above the motor policy. A high-level instruction is decomposed into textual skills or subtasks, while robot-specific affordance or value functions decide which skill is feasible in the current state. This makes text a planning interface over a library of low-level behaviors.

The core shape is:

instruction + current state + skill list -> selected textual skill
selected skill -> low-level controller

This pattern remains valuable when end-to-end learned control is too brittle, when safety gates are needed, or when developers want interpretable plans. A minimal sketch of the selection rule follows, assuming two externally supplied callables, llm_skill_logprob and affordance_value, whose names are hypothetical:
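
import math

def select_skill(instruction, state, skills, llm_skill_logprob, affordance_value):
    # SayCan-style scoring: the LLM rates how useful each textual skill is
    # for the instruction, the value function rates how feasible it is in
    # the current state, and their product picks a skill that is both.
    best_skill, best_score = None, float("-inf")
    for skill in skills:
        usefulness = math.exp(llm_skill_logprob(instruction, skill))
        feasibility = affordance_value(state, skill)
        score = usefulness * feasibility
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill  # handed off to the low-level controller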

Program Pattern: Text Becomes Policy Code

Code-as-policies systems use an LLM to transform a natural-language command into executable policy code that calls perception and control APIs. This gives text a precise operational form: loops, conditionals, geometry, object references, and feedback can be written explicitly.

This pattern is strong for structured tabletop tasks, geometric instructions, or environments with reliable symbolic/perception APIs. Its weakness is that it depends on API design and runtime guardrails rather than learning the full policy end to end.
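
The generated program tends to look like the sketch below, using hypothetical perception and control APIs (detect, pick, place_on) passed in as callables; real systems define their own API surface:

# Hypothetical LLM output for "put each block in the matching-colored bowl".
# detect, pick, and place_on are assumed robot APIs, not a real library.
def put_blocks_in_matching_bowls(detect, pick, place_on):
    blocks = detect("block")                      # perception call
    bowls = {b.color: b for b in detect("bowl")}
    for block in blocks:
        bowl = bowls.get(block.color)
        if bowl is None:
            continue                              # explicit branching in code
        pick(block)                               # control calls
        place_on(bowl)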

Multimodal Sentence Pattern

PaLM-E-style models inject continuous embodied observations into a language-model embedding space, creating “multimodal sentences” that interleave text, image encodings, and state estimates. The important shift is that robot observations are not merely side inputs to a policy head; they are projected into the same sequence interface as language tokens.

The interface looks like:

text tokens + visual tokens + state tokens -> textual plan or robot-relevant completion

This is a bridge between language-model reasoning and embodied state, but by itself it is not a complete solution for low-level continuous control.
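
A minimal sketch of the sequence construction, assuming learned linear projectors into the language model's embedding space; all dimensions are illustrative:

import torch
import torch.nn as nn

d_model = 1024
image_proj = nn.Linear(512, d_model)  # ViT patch features -> LM embedding space
state_proj = nn.Linear(14, d_model)   # proprioceptive state -> LM embedding space

text_emb = torch.randn(1, 8, d_model)             # embedded instruction tokens
img_tokens = image_proj(torch.randn(1, 16, 512))  # 16 visual tokens
state_tokens = state_proj(torch.randn(1, 1, 14))  # one state token

# The "multimodal sentence": a single sequence the LM backbone attends over,
# in whatever order the prompt template interleaves the modalities.
multimodal_sentence = torch.cat([text_emb, img_tokens, state_tokens], dim=1)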

VLA Pattern: Text And Actions Share The Token Interface

RT-2 and OpenVLA-style models turn robot actions into text-like output tokens so a vision-language model can be fine-tuned for robot control. The text instruction remains an ordinary language input, image features are projected into the language-model token stream, and the model autoregressively emits action tokens.

This makes robot control look like conditional language modeling:

image tokens + instruction tokens -> action tokens

Open X-Embodiment records the RT-X version of this idea: RT-1-X and RT-2-X both take images plus text instructions and output tokenized end-effector actions, with RT-2-X using a VLM backbone and action-as-language representation.

The strength is semantic transfer from web-scale vision-language pretraining. The weakness is that discrete action tokens can be awkward for high-frequency, dexterous, continuous control.
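
A sketch of action-as-language tokenization, assuming 256 uniform bins per action dimension mapped onto reserved vocabulary ids; the bin count, range, and id offset are illustrative, and real systems differ in detail:

import numpy as np

N_BINS, VOCAB_OFFSET = 256, 32000   # assumed bin count and reserved-id offset
LOW, HIGH = -1.0, 1.0               # normalized action range

def action_to_tokens(action):
    # Quantize each action dimension into a bin, then shift into vocab-id space.
    bins = np.clip(np.round((action - LOW) / (HIGH - LOW) * (N_BINS - 1)), 0, N_BINS - 1)
    return (bins.astype(int) + VOCAB_OFFSET).tolist()

def tokens_to_action(tokens):
    # Invert the mapping when the model emits action tokens autoregressively.
    bins = np.array(tokens) - VOCAB_OFFSET
    return LOW + bins / (N_BINS - 1) * (HIGH - LOW)

tokens = action_to_tokens(np.array([0.1, -0.5, 0.9]))
recovered = tokens_to_action(tokens)  # close to the original, up to bin width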

Diffusion And Flow Pattern: Text Conditions Continuous Action Experts

Newer robotics foundation models increasingly keep the semantic backbone but move low-level action generation into a diffusion or flow-matching action expert. Text remains part of the observation/context block, but actions are generated as continuous chunks rather than ordinary language tokens.

The interface becomes:

image tokens + instruction tokens + state tokens -> continuous action chunk

This pattern is visible in the π0, RDT, Octo, GR00T, and related lines. The key modeling change is that text guides the action distribution, while the action decoder is optimized for physical control rather than next-token language modeling.
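
A minimal sketch of a flow-matching action expert, assuming a fused context embedding (text, image, state) and Euler integration of a learned velocity field; the network shape, sizes, and step count are illustrative, not any one model's architecture:

import torch
import torch.nn as nn

CHUNK, ACT_DIM, CTX_DIM = 16, 7, 1024  # assumed chunk length and dimensions

velocity_net = nn.Sequential(  # velocity field v(x, context, t); trained with
    nn.Linear(CHUNK * ACT_DIM + CTX_DIM + 1, 512), nn.ReLU(),  # flow matching
    nn.Linear(512, CHUNK * ACT_DIM),                           # in practice
)

def sample_action_chunk(context, steps=10):
    # Integrate from noise toward a continuous action chunk.
    x = torch.randn(1, CHUNK * ACT_DIM)
    for i in range(steps):
        t = torch.full((1, 1), i / steps)
        v = velocity_net(torch.cat([x, context, t], dim=-1))
        x = x + v / steps                # Euler step along the learned flow
    return x.view(CHUNK, ACT_DIM)

chunk = sample_action_chunk(torch.randn(1, CTX_DIM))  # 16 x 7 action chunk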

Hierarchical Reasoning Pattern

The newest trend is to separate high-level embodied reasoning from low-level action generation. A reasoning model consumes text, images, state, task constraints, and sometimes tool outputs; it then emits natural-language subtasks, progress assessments, or strategy prompts for a lower-level policy.

The Gemini Robotics-ER / Gemini Robotics split is the clearest public version of this pattern: the embodied reasoning model plans and emits natural-language step instructions, while the VLA/action model executes them. GR00T’s System 2 / System 1 split has the same broad shape: a vision-language module interprets the environment and instructions, then a diffusion transformer generates continuous actions.

This turns text into a handoff protocol between model layers:

mission text -> plan text -> subtask text -> action expert
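
The handoff protocol reduces to a loop like the following sketch, where observe, reasoner, policy, and execute are assumed interfaces rather than any specific system's API:

def run_mission(mission_text, observe, reasoner, policy, execute, max_steps=100):
    # Sketch of the text handoff loop: the reasoning layer emits subtask
    # strings, the action layer turns each one into motion.
    for _ in range(max_steps):
        obs = observe()
        subtask = reasoner(mission_text, obs)  # e.g. "pick up the left sock"
        if subtask == "done":
            return True
        execute(policy(subtask, obs))          # subtask text -> action expert
    return False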

Steerable Prompt Pattern

π0.7-style prompting broadens text from “what to do” into “what strategy to use.” The prompt may include:

  • task instruction;
  • subtask instruction from a high-level policy;
  • metadata such as quality, speed, or control mode;
  • visual subgoal or world-model output;
  • embodiment or control-modality hints.

This is important because it lets a single policy consume heterogeneous data: demonstrations, autonomous rollouts, failures, human videos, and curated specialist data can be made more usable when prompt metadata tells the model how to interpret the behavior.
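
A minimal sketch of prompt assembly under these assumptions; the field names and tag syntax are illustrative, not a published prompt format:

def build_steerable_prompt(task, subtask=None, quality=None, speed=None,
                           control_mode=None, embodiment=None):
    # Fold optional metadata into one prompt string the policy conditions on.
    parts = [f"task: {task}"]
    if subtask:
        parts.append(f"subtask: {subtask}")
    if quality:
        parts.append(f"quality: {quality}")        # e.g. "expert demo"
    if speed:
        parts.append(f"speed: {speed}")            # e.g. "slow"
    if control_mode:
        parts.append(f"control: {control_mode}")   # e.g. "joint-space"
    if embodiment:
        parts.append(f"embodiment: {embodiment}")  # e.g. "bimanual arm"
    return " | ".join(parts)

prompt = build_steerable_prompt("fold the laundry",
                                subtask="pick up the shirt",
                                quality="expert demo", speed="slow")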

Action-Level Reasoning Pattern

Action Chain-of-Thought style work points to a limit of pure language reasoning: intermediate language subtasks may be too coarse for precise control. A newer direction is to express reasoning as coarse action intents, reference trajectories, or latent action priors that condition the final action head.

This suggests a likely convergence:

language reasoning for semantic structure
+ action-space reasoning for physical detail
+ continuous action decoder for execution
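
One way to read this convergence in code: the action head conditions on both a semantic context embedding and a coarse action intent, such as a few reference waypoints. Everything below (shapes, the fusion MLP, the waypoint format) is an illustrative assumption:

import torch
import torch.nn as nn

CTX_DIM, N_WAYPOINTS, ACT_DIM = 1024, 4, 7  # assumed sizes

action_head = nn.Sequential(
    nn.Linear(CTX_DIM + N_WAYPOINTS * 3, 256), nn.ReLU(),
    nn.Linear(256, ACT_DIM),
)

context = torch.randn(1, CTX_DIM)            # language/semantic reasoning output
waypoints = torch.randn(1, N_WAYPOINTS * 3)  # coarse xyz intent from reasoning
action = action_head(torch.cat([context, waypoints], dim=-1))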

Interface Taxonomy

Text Role | Input Or Output | Typical Consumer | Main Benefit | Failure Mode
Task instruction | Input | Policy or VLA | Human-friendly task specification | Ambiguous or underspecified commands
Skill label | Intermediate output | Low-level controller | Interpretable planning | Skill library bottleneck
Policy code | Output | Runtime/API layer | Explicit logic and geometry | API fragility and safety risks
Multimodal sentence text | Input/output | LLM/VLM backbone | Shared sequence interface for reasoning | Weak low-level control
Action token string | Output | Robot action decoder | Reuses language-model training machinery | Discretization and high-frequency control limits
Subtask instruction | Intermediate output | VLA/action expert | Long-horizon decomposition | Error propagation between planner and executor
Metadata prompt | Input | Generalist policy | Behavior steering across data quality, speed, or embodiment | Metadata can become inconsistent or underspecified
Visual subgoal caption/prompt | Input | Policy or world model | Bridges semantic goal and physical target | May omit contact/geometry details
Natural-language rationale | Output | Human/operator/debugger | Transparency and monitoring | Rationale can be unfaithful to action cause

Practical Design Guidance

  • Use plain instruction embeddings when the task set is narrow, the language is simple, and closed-loop behavior is short-horizon.
  • Use planner-over-skills when safety, interpretability, or discrete skill reuse matters more than dexterity.
  • Use code generation only when perception/control APIs are reliable and runtime verification is strong.
  • Use VLA action tokens when semantic generalization is more important than high-frequency precision.
  • Use diffusion or flow action experts when continuous dexterous control, multimodal action distributions, or action chunks matter.
  • Use hierarchical reasoning when tasks require tools, long-horizon planning, progress estimation, or changing constraints.
  • Use steerable metadata prompts when training data mixes high-quality demonstrations, failures, autonomous data, different control modes, or different robot embodiments.

External Anchors To Ingest Next

These sources SHOULD be ingested as full source pages if robotics text conditioning becomes a durable branch of the wiki:

  • Open X-Embodiment (RT-1-X, RT-2-X);
  • SayCan;
  • Code as Policies;
  • PaLM-E;
  • RT-2 and OpenVLA;
  • π0, RDT, Octo, and GR00T;
  • Gemini Robotics and Gemini Robotics-ER.

Open Questions

  • Which information should remain natural language, and which should be converted into action-space, latent, or visual-subgoal representations?
  • How can a policy verify that a natural-language plan is faithful to its actual action trajectory?
  • Should metadata prompts such as quality, speed, control mode, and embodiment be human-authored, learned, or generated by a world model?
  • How much language capability is lost when a VLM is fine-tuned only on robot action data?
  • Can action-level reasoning replace language chain-of-thought for precise contact-rich manipulation?