Robotics Foundation Models
Vision-language-action models combine perception, language, and motor control. The 2025-2026 wave (RT-2, OpenVLA, Octo, π0) is shaping up to be the foundation-model moment for robotics.
The VLA idea
One model takes camera input plus a natural-language instruction (“put the cup on the shelf”) and outputs motor commands directly: no separate vision pipeline, no scripted planner, no hand-engineered controller. That end-to-end mapping is what Vision-Language-Action (VLA) means.
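To make the interface concrete, here is a minimal control-loop sketch in Python. Everything in it is illustrative rather than taken from any library: VLAPolicy, camera.get_rgb(), robot.send_joint_deltas(), and the 7-DoF action layout are hypothetical stand-ins for whatever the real stack provides.

```python
import time
import numpy as np

class VLAPolicy:
    """Hypothetical wrapper around a vision-language-action model."""

    def predict(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # Returns a 7-DoF action: (dx, dy, dz, droll, dpitch, dyaw, gripper).
        # In a real system this is one forward pass of the VLA model.
        raise NotImplementedError

def control_loop(policy: VLAPolicy, camera, robot, instruction: str, hz: float = 5.0) -> None:
    """Run the robot with one model call per control step.

    `camera` and `robot` are placeholders for the hardware interface;
    there is no separate detector, planner, or scripted controller.
    """
    dt = 1.0 / hz
    while not robot.task_done():
        image = camera.get_rgb()                  # H x W x 3 uint8 frame
        action = policy.predict(image, instruction)
        robot.send_joint_deltas(action)           # execute the predicted motion
        time.sleep(dt)
```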
The 2025-2026 wave
- RT-2 / RT-X: the Google DeepMind pioneers, trained on a mix of internet-scale image-text data and robot demonstrations.
- OpenVLA: an open-weight 7B VLA from Stanford/MIT and collaborators. A reasonable research baseline (see the quickstart sketch after this list).
- Octo: small generalist policy across robot types.
- π0 (Physical Intelligence): the state-of-the-art generalist robot policy as of 2025, handling manipulation across many tasks.
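For OpenVLA specifically, inference looks roughly like the published quickstart below. The checkpoint name openvla/openvla-7b, the predict_action call, and the unnorm_key argument are taken from the project's model card as I understand it; the blank placeholder image and the comments are mine, so treat this as a sketch and verify against the current release.

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load the processor and the open-weight 7B policy (requires a CUDA GPU).
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

# Placeholder observation; in practice this is the current camera frame.
image = Image.new("RGB", (224, 224))
instruction = "put the cup on the shelf"
prompt = f"In: What action should the robot take to {instruction}?\nOut:"

# One forward pass -> a 7-DoF action (position deltas, rotation deltas, gripper),
# un-normalized with the statistics of the chosen training dataset.
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)
```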
The data problem
Internet text is essentially free. Robot demonstrations require physical robots, teleoperators, and time. The Open X-Embodiment dataset (1M+ demonstrations across many robots) was a 2024 milestone but is still tiny next to text corpora. Simulation, sim-to-real transfer, and synthetic generation are filling the gap.
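One concrete version of that gap-filling is domain randomization: vary the physics and appearance of every simulated episode so a policy trained on synthetic demonstrations does not overfit to one simulated world. The sketch below assumes a generic simulator with reset/rollout methods and a scripted expert policy; none of the names or parameter ranges come from a specific library.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_sim_params() -> dict:
    """Sample physics and appearance parameters for one simulated episode.

    The parameter names and ranges are illustrative; real setups tune them
    per robot and per task.
    """
    return {
        "object_mass_kg":    rng.uniform(0.05, 1.0),
        "table_friction":    rng.uniform(0.3, 1.2),
        "light_intensity":   rng.uniform(0.4, 1.6),
        "camera_jitter_deg": rng.uniform(-5.0, 5.0),
    }

def generate_synthetic_demos(sim, scripted_policy, n_episodes: int = 10_000) -> list:
    """Collect (image, instruction, action) trajectories from a randomized simulator.

    `sim` and `scripted_policy` are hypothetical stand-ins for any simulator
    plus expert controller; the resulting demos supplement scarce real-robot data.
    """
    demos = []
    for _ in range(n_episodes):
        sim.reset(**randomized_sim_params())        # new physics/appearance each episode
        demos.append(sim.rollout(scripted_policy))  # roll out the expert and record it
    return demos
```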
Realistic 2026 capabilities
- Pick-and-place across dozens of objects: works in the lab and in warehouse pilots.
- Long-horizon kitchen tasks (cooking from a recipe): impressive demos, fragile in production.
- Generalisation to new environments: improving fast but still limited.
The trajectory: the GPT-3 moment for robotics is approaching. Expect a 2027-2028 inflection where robot generalists become economically deployable.