Building general-purpose robots that work anywhere, on any task

Executive overview

Most robotics companies fail because each application requires a new company built from scratch — new hardware, new software, new motion primitives. Physical Intelligence is building a single foundation model that can run any robot, on any task, in any environment.

The key finding: scale alone doesn't solve robotics. Real-world diversity, a pre-train/post-train recipe borrowed from language models, and hierarchical vision-language-action models together unlock genuine generalization.

Pre-training on broad robot data and then fine-tuning on curated, high-quality demonstrations outperforms training on either all the data or the curated data alone.

Why scale isn't enough

  • Industrial robot data has volume but no behavioral diversity — useless for open-world tasks
  • YouTube videos have scale but an embodiment gap; watching doesn't transfer to doing
  • Simulation data lacks realism, so policies trained on it face a sim-to-real gap
  • Scale is necessary but not sufficient; data diversity and recipe matter more

The laundry-folding case study

  • Started with the simplest version: fold one shirt, single brand, single size
  • 100M-parameter model mapping camera images to target joint positions, running at 50 Hz
  • Crumpled starting state caused months of 0% success rate — variability was the hard problem
  • Breakthrough: pre-train on all robot data, then fine-tune on a curated, consistent, high-quality subset
  • This recipe cut the time to fold five items from 20 min to 12 min, and then to 9 min
  • Switching to a 3B-parameter vision-language model (PaliGemma) with a diffusion action head further improved speed and fold consistency
  • The same recipe transferred to unrelated tasks: table cleanup, coffee scooping, cardboard box construction, candle lighting
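
To make the 50 Hz figure concrete, here is a minimal sketch of a fixed-rate control loop. It is an illustration only, not the team's runtime: the function names (run_control_loop, read_camera, send_joint_targets), the stub policy, and the 7-joint arm are all hypothetical. The one number taken from the summary is the rate, which leaves 1/50 s = 20 ms per step for reading the camera, running the model, and sending joint targets.

```python
import time
from typing import Callable, List, Sequence

CONTROL_HZ = 50               # rate quoted in the summary
PERIOD_S = 1.0 / CONTROL_HZ   # 0.02 s budget per control step


def run_control_loop(
    policy: Callable[[Sequence[float]], List[float]],
    read_camera: Callable[[], Sequence[float]],
    send_joint_targets: Callable[[List[float]], None],
    num_steps: int,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `policy` once per period; return how many steps missed their deadline."""
    overruns = 0
    next_deadline = clock()
    for _ in range(num_steps):
        image = read_camera()            # hypothetical sensor read
        targets = policy(image)          # image -> target joint positions
        send_joint_targets(targets)      # hypothetical actuator write
        next_deadline += PERIOD_S
        remaining = next_deadline - clock()
        if remaining > 0:
            sleep(remaining)             # wait out the rest of the 20 ms slot
        else:
            overruns += 1                # inference took longer than the budget
    return overruns


# Stub wiring: a fake camera frame and a policy that holds a 7-joint arm still.
sent: List[List[float]] = []
missed = run_control_loop(
    policy=lambda image: [0.0] * 7,
    read_camera=lambda: [0.0, 0.0, 0.0],
    send_joint_targets=sent.append,
    num_steps=5,
)
print(f"sent {len(sent)} commands, {missed} overruns")
```

Scheduling against an absolute deadline rather than sleeping a fixed 20 ms after each step keeps the average rate at 50 Hz even when individual inference calls vary in duration.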

Pre-training and post-training recipe

  • Omitting pre-training: robot gets items out of a bin but makes no further progress
  • Omitting post-training (fine-tuning on curated data): similar failure
  • Combined recipe: reliable flattening, folding, and stacking
  • Nothing in the recipe is task-specific — fine-tuning on a new task reuses the same approach
  • Recipe worked on a robot the team had never physically seen; they fine-tuned on the partner's data remotely
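
The two-stage structure described above can be sketched on a toy problem. Everything here is an assumption made for illustration: the "policy" is a one-dimensional linear model, the "demonstrations" are synthetic (observation, action) pairs, and the dataset sizes, noise levels, step counts, and learning rates are arbitrary. The sketch shows only the shape of the recipe (one training routine, run first on broad noisy data and then on a small consistent subset); it does not reproduce the reported results.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[float, float]  # (observation, action): scalar stand-ins for images and joint targets


def make_data(rng: random.Random, n: int, noise: float) -> List[Example]:
    """Synthetic demonstrations: action = 2 * observation + Gaussian noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 2.0 * x + rng.gauss(0.0, noise)))
    return data


def mse(w: float, b: float, data: List[Example]) -> float:
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def sgd(w: float, b: float, data: List[Example], steps: int, lr: float,
        rng: random.Random) -> Tuple[float, float]:
    """Plain stochastic gradient descent on squared error, one example per step."""
    for _ in range(steps):
        x, y = rng.choice(data)
        err = w * x + b - y
        w -= lr * 2.0 * err * x
        b -= lr * 2.0 * err
    return w, b


def two_stage_recipe(seed: int = 0) -> Dict[str, float]:
    rng = random.Random(seed)
    broad = make_data(rng, n=2000, noise=0.5)    # ASSUMED: large, noisy, diverse
    curated = make_data(rng, n=50, noise=0.05)   # ASSUMED: small, consistent

    w, b = 0.0, 0.0
    initial = mse(w, b, curated)
    w, b = sgd(w, b, broad, steps=5000, lr=0.05, rng=rng)    # stage 1: pre-train
    after_pretrain = mse(w, b, curated)
    w, b = sgd(w, b, curated, steps=2000, lr=0.01, rng=rng)  # stage 2: post-train
    after_posttrain = mse(w, b, curated)
    return {"initial": initial, "after_pretrain": after_pretrain,
            "after_posttrain": after_posttrain, "w": w, "b": b}


if __name__ == "__main__":
    for name, value in two_stage_recipe().items():
        print(f"{name}: {value:.4f}")
```

Because the second stage reuses the same sgd routine and only swaps the dataset and learning rate, nothing in the sketch is specific to one task, which mirrors the point that fine-tuning on a new task reuses the same approach.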

Generalizing to unseen environments

  • Collected tidying data across 100+ unique rooms (mock kitchens, mock bedrooms, real San Francisco homes)
  • Mobile manipulation data was only 2.4% of the overall pre-training mix — the rest was static manipulation, web, and instructional data
  • Tested in three Airbnbs the team had never visited: robot closed cabinets, put away dishes, wiped spills, tidied bedrooms
  • Excluding static-lab robot data dropped novel-home performance by more than 20%
  • Increasing location diversity in training data raised performance until it matched training in the target environment itself — the generalization gap was mostly closed
  • Remaining failure modes: items not fully placed, thin objects flush to surfaces, spatula placed in oven instead of drawer
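
The pre-training mix mentioned above can be pictured as weighted sampling over data sources. In this sketch only the 2.4% mobile-manipulation share comes from the summary; the split of the remaining 97.6% across static manipulation, web, and instructional data is not given, so those three weights are placeholders chosen purely for illustration, as are the source names and the sample_sources helper.

```python
import random
from collections import Counter
from typing import Dict

MIXTURE: Dict[str, float] = {
    "mobile_manipulation": 0.024,  # from the summary
    "static_manipulation": 0.600,  # ASSUMED placeholder
    "web_data": 0.250,             # ASSUMED placeholder
    "instructional_data": 0.126,   # ASSUMED placeholder
}


def sample_sources(mixture: Dict[str, float], n: int, seed: int = 0) -> Counter:
    """Draw n training examples' source labels according to the mixture weights."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=n))


counts = sample_sources(MIXTURE, n=100_000)
for name in MIXTURE:
    print(f"{name}: {counts[name] / 100_000:.3%}")
```

Under these assumed weights, roughly one batch element in forty comes from the target setting (mobile manipulation in homes); the rest is data from other settings, which is the situation the bullet on excluding static-lab data describes.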

Language following and the gradient-stopping fix

  • Early models ignored language instructions — asked to pick up a cutting board, they picked up a plate
  • Root cause: the randomly initialized diffusion action head corrupts the pre-trained VLM backbone during training
  • Fix: predict tokenized actions alongside diffusion actions; stop gradients from the diffusion head to protect the VLM
  • Result: language-following rate rose from 20% to 80%; training also converged faster
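
What "stopping gradients" means can be shown with a toy, hand-differentiated model. This is a sketch under heavy assumptions and not the team's implementation: the backbone, token head, and diffusion head are each reduced to a single scalar weight, the diffusion objective is replaced by plain squared error, and the names (backbone_and_head_grads, stop_diffusion_grad) are hypothetical. In an autodiff framework the same effect is usually obtained with a primitive such as jax.lax.stop_gradient in JAX or Tensor.detach() in PyTorch.

```python
from typing import Dict


def backbone_and_head_grads(params: Dict[str, float], x: float, y: float,
                            stop_diffusion_grad: bool) -> Dict[str, float]:
    """Gradients of (token loss + diffusion loss) for a scalar toy model.

    h = backbone * x is the shared feature. Each head predicts head * h and is
    trained with squared error against y (a stand-in for the real objectives).
    """
    w = params["backbone"]
    a = params["token_head"]
    d = params["diffusion_head"]

    h = w * x
    token_err = a * h - y
    diff_err = d * h - y

    grads = {
        "token_head": 2.0 * token_err * h,
        "diffusion_head": 2.0 * diff_err * h,   # the head itself always trains
        "backbone": 2.0 * token_err * a * x,    # token loss always reaches the backbone
    }
    if not stop_diffusion_grad:
        # Without the fix, the diffusion loss also flows back into the backbone.
        grads["backbone"] += 2.0 * diff_err * d * x
    return grads


# A "pre-trained" backbone and token head that already fit the target,
# next to a badly initialised diffusion head.
params = {"backbone": 1.0, "token_head": 1.0, "diffusion_head": 5.0}

with_stop = backbone_and_head_grads(params, x=1.0, y=1.0, stop_diffusion_grad=True)
without_stop = backbone_and_head_grads(params, x=1.0, y=1.0, stop_diffusion_grad=False)

print(with_stop["backbone"])          # 0.0
print(without_stop["backbone"])       # 40.0
print(with_stop["diffusion_head"])    # 8.0
```

In this toy setup the diffusion head receives the same gradient either way, so it still learns, but with the stop in place its large initial error no longer pushes the backbone away from weights that already work.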

Responding to open-ended prompts and interjections

  • Hierarchical vision-language-action (VLA) model: a high-level policy breaks prompts into atomic subtask commands; a low-level policy executes them as joint-angle targets
  • Scaling human-robot interaction data is impractical, so synthetic prompts are generated: a VLM watches robot video and generates plausible user requests that would lead to that action
  • High-level policy trained on synthetic prompts can handle: "make me a ham and cheese sandwich", "make me a vegan sandwich, no pickles", "clean up only the trash but not the dishes"
  • Robot responds correctly to mid-task interjections and situated corrections (e.g., user asks for something sweet not already in the basket)
  • Frontier LLMs used directly as high-level planners scored substantially lower — they lack physical-world visual understanding
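
The division of labour between the two policies can be sketched as a simple loop. This is a hedged illustration with hypothetical names (run_hierarchy, toy_high_level, toy_low_level): the real high-level policy is a learned model that also sees camera images, whereas the stub below is a hand-written rule over the prompt text, and the stub low-level policy returns a fixed 7-value joint target rather than a learned action.

```python
from typing import Callable, Dict, List, Optional

HighLevel = Callable[[str, List[str]], Optional[str]]  # (prompt, completed) -> next subtask or None
LowLevel = Callable[[str], List[float]]                # subtask -> joint-angle targets


def run_hierarchy(prompt: str, high_level: HighLevel, low_level: LowLevel,
                  interjections: Optional[Dict[int, str]] = None,
                  max_steps: int = 20) -> List[str]:
    """Alternate between choosing a subtask and executing it; fold in user interjections."""
    interjections = interjections or {}
    completed: List[str] = []
    for step in range(max_steps):
        if step in interjections:
            # A mid-task remark is appended so the next high-level call sees it.
            prompt = f"{prompt}; {interjections[step]}"
        subtask = high_level(prompt, completed)
        if subtask is None:          # high-level policy considers the request done
            break
        low_level(subtask)           # execute the atomic subtask
        completed.append(subtask)
    return completed


def toy_high_level(prompt: str, completed: List[str]) -> Optional[str]:
    """Rule-based stand-in for the learned high-level policy (ignores images)."""
    plan = ["pick up bread", "add ham", "add cheese", "add pickles", "close sandwich"]
    if "no pickles" in prompt:
        plan.remove("add pickles")
    for subtask in plan:
        if subtask not in completed:
            return subtask
    return None


def toy_low_level(subtask: str) -> List[float]:
    """Stand-in for the low-level policy: a fixed, hypothetical 7-joint target."""
    return [0.0] * 7


steps = run_hierarchy("make me a ham and cheese sandwich", toy_high_level,
                      toy_low_level, interjections={2: "no pickles"})
print(steps)  # ['pick up bread', 'add ham', 'add cheese', 'close sandwich']
```

Re-querying the high-level policy before every subtask, instead of planning once up front, is what lets an interjection at step 2 change the remaining steps in this sketch.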

Key takeaways

  • General-purpose models beat specialist models for the same reason LLM-based coding assistants beat purpose-built ones: pre-training transfers
  • Real-world robot data at scale is necessary; simulation and synthetic data are useful for evaluation and as an analog to RL-generated data
  • Reinforcement learning (online data from the robot's own attempts) is the most promising lever for post-training performance and speed gains
  • The bottleneck for home-environment tasks is now reliability, not data diversity
  • Open problems remain: speed, partial observability, long-horizon planning, real-time inference infrastructure
