Robotics Foundation Models
Vision-language-action models combine perception, language, and motor control. The 2025-2026 wave (RT-2, OpenVLA, Octo, π0) is shaping up to be the foundation-model moment for robotics.
The VLA idea
One model takes camera input plus a natural-language instruction (“put the cup on the shelf”) and outputs motor commands directly: no separate vision pipeline, no scripted planner, no hand-engineered controller. That end-to-end mapping is what Vision-Language-Action (VLA) means.
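To make the interface concrete, here is a minimal control-loop sketch in Python. Everything in it is illustrative rather than taken from any library: VLAPolicy, camera.get_rgb(), robot.send_joint_deltas(), and the 7-DoF action layout are hypothetical stand-ins for whatever the real stack provides.

```python
import time
import numpy as np

class VLAPolicy:
    """Hypothetical wrapper around a vision-language-action model."""

    def predict(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # Returns a 7-DoF action: (dx, dy, dz, droll, dpitch, dyaw, gripper).
        # In a real system this is one forward pass of the VLA model.
        raise NotImplementedError

def control_loop(policy: VLAPolicy, camera, robot, instruction: str, hz: float = 5.0) -> None:
    """Run the robot with one model call per control step.

    `camera` and `robot` are placeholders for the hardware interface;
    there is no separate detector, planner, or scripted controller.
    """
    dt = 1.0 / hz
    while not robot.task_done():
        image = camera.get_rgb()                  # H x W x 3 uint8 frame
        action = policy.predict(image, instruction)
        robot.send_joint_deltas(action)           # execute the predicted motion
        time.sleep(dt)
```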
The 2025-2026 wave
- RT-2 / RT-X: the Google DeepMind pioneers, trained on a mix of internet-scale image-text data and robot demonstrations.
- OpenVLA: an open-weight 7B VLA from Stanford/MIT and collaborators. A reasonable research baseline (see the quickstart sketch after this list).
- Octo: small generalist policy across robot types.
- π0 (Physical Intelligence): the state-of-the-art generalist robot policy as of 2025, handling manipulation across many tasks.
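For OpenVLA specifically, inference looks roughly like the published quickstart below. The checkpoint name openvla/openvla-7b, the predict_action call, and the unnorm_key argument are taken from the project's model card as I understand it; the blank placeholder image and the comments are mine, so treat this as a sketch and verify against the current release.

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load the processor and the open-weight 7B policy (requires a CUDA GPU).
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

# Placeholder observation; in practice this is the current camera frame.
image = Image.new("RGB", (224, 224))
instruction = "put the cup on the shelf"
prompt = f"In: What action should the robot take to {instruction}?\nOut:"

# One forward pass -> a 7-DoF action (position deltas, rotation deltas, gripper),
# un-normalized with the statistics of the chosen training dataset.
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)
```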
The data problem
Internet text is essentially free. Robot demonstrations require physical robots, teleoperators, and time. The Open X-Embodiment dataset (1M+ demonstrations across many robots) was a 2024 milestone but is still tiny next to text corpora. Simulation, sim-to-real transfer, and synthetic generation are filling the gap.
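One concrete version of that gap-filling is domain randomization: vary the physics and appearance of every simulated episode so a policy trained on synthetic demonstrations does not overfit to one simulated world. The sketch below assumes a generic simulator with reset/rollout methods and a scripted expert policy; none of the names or parameter ranges come from a specific library.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_sim_params() -> dict:
    """Sample physics and appearance parameters for one simulated episode.

    The parameter names and ranges are illustrative; real setups tune them
    per robot and per task.
    """
    return {
        "object_mass_kg":    rng.uniform(0.05, 1.0),
        "table_friction":    rng.uniform(0.3, 1.2),
        "light_intensity":   rng.uniform(0.4, 1.6),
        "camera_jitter_deg": rng.uniform(-5.0, 5.0),
    }

def generate_synthetic_demos(sim, scripted_policy, n_episodes: int = 10_000) -> list:
    """Collect (image, instruction, action) trajectories from a randomized simulator.

    `sim` and `scripted_policy` are hypothetical stand-ins for any simulator
    plus expert controller; the resulting demos supplement scarce real-robot data.
    """
    demos = []
    for _ in range(n_episodes):
        sim.reset(**randomized_sim_params())        # new physics/appearance each episode
        demos.append(sim.rollout(scripted_policy))  # roll out the expert and record it
    return demos
```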
Realistic 2026 capabilities
- Pick-and-place across dozens of objects: works in the lab and in warehouse pilots.
- Long-horizon kitchen tasks (cooking from a recipe): impressive demos, fragile in production.
- Generalisation to new environments: improving fast but still limited.
The trajectory: the GPT-3 moment for robotics is approaching. Expect a 2027-2028 inflection where robot generalists become economically deployable.