Building general-purpose robots that work anywhere, on any task
Executive overview
Most robotics companies fail because each application requires a new company built from scratch — new hardware, new software, new motion primitives. Physical Intelligence is building a single foundation model that can run any robot, on any task, in any environment.
The key finding: scale alone doesn't solve robotics. Real-world diversity, a pre-train/post-train recipe borrowed from language models, and hierarchical vision-language-action models together unlock genuine generalization.
Pre-training on broad robot data, then fine-tuning on curated, high-quality demonstrations, outperforms training on all the data at once or on the curated data alone.
Why scale isn't enough
- Industrial robot data has volume but no behavioral diversity — useless for open-world tasks
- YouTube videos have scale but an embodiment gap; watching doesn't transfer to doing
- Simulation data lacks realism, so policies trained on it face a sim-to-real gap
- Scale is necessary but not sufficient; data diversity and recipe matter more
The laundry-folding case study
- Started with the simplest version: fold one shirt, single brand, single size
- 100M-parameter model mapping camera images to target joint positions, running at 50 Hz
- Crumpled starting state caused months of 0% success rate — variability was the hard problem
- Breakthrough: pre-train on all robot data, then fine-tune on a curated, consistent, high-quality subset
- This recipe cut the time to fold five items from 20 minutes to 12 minutes, and then to 9 minutes
- Switching to a 3B-parameter vision-language model (PaliGemma) with a diffusion action head further improved speed and fold consistency
- The same recipe transferred to unrelated tasks: table cleanup, coffee scooping, cardboard box construction, candle lighting
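
A policy that maps camera images to target joint positions at 50 Hz has a budget of 1/50 s = 20 ms per control step. The following is a minimal sketch of such a fixed-rate loop, not the team's implementation: `policy`, `read_camera`, and `send_joint_targets` are hypothetical callables, and the injectable `clock` and `sleep` exist only so the loop can be exercised without real hardware.

```python
import time
from typing import Callable, List, Sequence

CONTROL_HZ = 50                    # rate stated in the summary
STEP_BUDGET_S = 1.0 / CONTROL_HZ   # 0.02 s available per control step


def run_control_loop(
    policy: Callable[[Sequence[float]], List[float]],       # hypothetical: image -> joint targets
    read_camera: Callable[[], Sequence[float]],              # hypothetical sensor interface
    send_joint_targets: Callable[[List[float]], None],       # hypothetical actuator interface
    num_steps: int,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run num_steps control steps and return how many missed the 20 ms budget."""
    overruns = 0
    next_deadline = clock() + STEP_BUDGET_S
    for _ in range(num_steps):
        image = read_camera()
        targets = policy(image)
        send_joint_targets(targets)
        remaining = next_deadline - clock()
        if remaining > 0:
            sleep(remaining)       # wait out the rest of this step's budget
        else:
            overruns += 1          # inference plus I/O took longer than one step
        next_deadline += STEP_BUDGET_S
    return overruns
```

Counting overruns rather than silently slipping the schedule makes it visible when a larger model no longer fits the per-step budget.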
Pre-training and post-training recipe
- Omitting pre-training: robot gets items out of a bin but makes no further progress
- Omitting post-training (fine-tuning on curated data): similar failure
- Combined recipe: reliable flattening, folding, and stacking
- Nothing in the recipe is task-specific — fine-tuning on a new task reuses the same approach
- Recipe worked on a robot the team had never physically seen; they fine-tuned on the partner's data remotely
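
The two-stage structure of the recipe can be sketched in a few lines. This is a toy illustration under loud assumptions: a single scalar weight stands in for the model, `(observation, action)` pairs of floats stand in for images and joint targets, and the step counts and learning rates are arbitrary. It shows only the shape of the procedure (the same training routine run twice, with stage two starting from stage one's weights), not the reported results.

```python
import random
from typing import List, Tuple

Example = Tuple[float, float]  # (observation, action); scalars stand in for real data


def sgd_fit(weight: float, data: List[Example], steps: int, lr: float,
            rng: random.Random) -> float:
    """Fit action ~= weight * observation with plain SGD on squared error."""
    for _ in range(steps):
        obs, act = rng.choice(data)
        grad = 2.0 * (weight * obs - act) * obs
        weight -= lr * grad
    return weight


def pretrain_then_finetune(broad: List[Example], curated: List[Example],
                           seed: int = 0) -> Tuple[float, float]:
    """Return (weight after pre-training, weight after fine-tuning)."""
    rng = random.Random(seed)
    # Stage 1: train from scratch on all available data, including inconsistent demos.
    w_pre = sgd_fit(0.0, broad, steps=2000, lr=0.01, rng=rng)
    # Stage 2: continue from the stage-1 weight on the small, consistent subset.
    w_ft = sgd_fit(w_pre, curated, steps=200, lr=0.005, rng=rng)
    return w_pre, w_ft
```

Nothing in `pretrain_then_finetune` refers to a specific task: moving to a new task changes only the `curated` argument, which mirrors the claim that the recipe itself is not task-specific.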
Generalizing to unseen environments
- Collected tidying data across 100+ unique rooms (mock kitchens, mock bedrooms, real San Francisco homes)
- Mobile manipulation data was only 2.4% of the overall pre-training mix — the rest was static manipulation, web, and instructional data
- Tested in three Airbnbs the team had never visited: robot closed cabinets, put away dishes, wiped spills, tidied bedrooms
- Excluding static-lab robot data dropped novel-home performance by more than 20%
- Increasing location diversity in training data raised performance until it matched training in the target environment itself — the generalization gap was mostly closed
- Remaining failure modes: items not fully placed, thin objects flush to surfaces, spatula placed in oven instead of drawer
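
One common way to realize a pre-training mix like the one described above is to sample each training example's source according to fixed weights. The sketch below assumes this approach; only the 2.4% mobile-manipulation share comes from the summary. The split of the remaining 97.6% across the other sources is not given, so the other three weights are hypothetical placeholders, as are the source names.

```python
import random
from typing import Dict, List

MIXTURE: Dict[str, float] = {
    "mobile_manipulation": 0.024,  # share stated in the summary
    "static_manipulation": 0.600,  # hypothetical placeholder
    "web_data": 0.276,             # hypothetical placeholder
    "instructional_data": 0.100,   # hypothetical placeholder
}


def sample_sources(mixture: Dict[str, float], batch_size: int,
                   rng: random.Random) -> List[str]:
    """Draw the data source for each example in a batch, proportional to its weight."""
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=batch_size)
```

With weights like these, a batch of 1,000 examples would contain about 24 mobile-manipulation examples on average, which illustrates how small the target-domain share of the mix is.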
Language following and the gradient-stopping fix
- Early models ignored language instructions — asked to pick up a cutting board, they picked up a plate
- Root cause: the randomly initialized diffusion action head corrupts the pre-trained VLM backbone during training
- Fix: predict tokenized actions alongside diffusion actions; stop gradients from the diffusion head to protect the VLM
- Result: language-following rate rose from 20% to 80%; training also converged faster
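
The effect of stopping gradients can be seen in a hand-derived scalar example. This is a toy stand-in, not the actual architecture: a single weight `w` plays the role of the VLM backbone, a linear head `a` plays the tokenized-action head, and a second linear head `b` with squared error stands in for the diffusion action head. In an autodiff framework the same effect is obtained with an operation such as `jax.lax.stop_gradient` in JAX or `Tensor.detach()` in PyTorch.

```python
from typing import Tuple


def backbone_and_head_grads(
    w: float, a: float, b: float, x: float,
    y_token: float, y_action: float,
    stop_gradient: bool = True,
) -> Tuple[float, float, float]:
    """Gradients of the summed squared-error loss with respect to (w, a, b).

    Backbone feature: h = w * x
    Token head:       a * h, trained toward y_token
    Action head:      b * h, trained toward y_action (stand-in for the diffusion head)
    """
    h = w * x
    err_token = a * h - y_token
    err_action = b * h - y_action

    grad_a = 2.0 * err_token * h    # token head always learns
    grad_b = 2.0 * err_action * h   # action head always learns

    grad_w_from_token = 2.0 * err_token * a * x
    grad_w_from_action = 2.0 * err_action * b * x
    # With stop_gradient, the action head's loss never reaches the backbone,
    # so a badly initialized head cannot overwrite the pre-trained features.
    grad_w = grad_w_from_token + (0.0 if stop_gradient else grad_w_from_action)
    return grad_w, grad_a, grad_b
```

In this toy, a backbone that already satisfies the token objective receives zero gradient when `stop_gradient=True`, while the action head still receives its full gradient and can train normally on the backbone's features.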
Responding to open-ended prompts and interjections
- Hierarchical vision-language-action (VLA) model: a high-level policy breaks prompts into atomic subtask commands; a low-level policy executes them as joint-angle targets
- Scaling human-robot interaction data is impractical, so synthetic prompts are generated: a VLM watches robot video and generates plausible user requests that would lead to that action
- High-level policy trained on synthetic prompts can handle: "make me a ham and cheese sandwich", "make me a vegan sandwich, no pickles", "clean up only the trash but not the dishes"
- Robot responds correctly to mid-task interjections and situated corrections (e.g., user asks for something sweet not already in the basket)
- Frontier LLMs used directly as high-level planners scored substantially lower — they lack physical-world visual understanding
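
The division of labor between the two policies, including mid-task interjections, can be sketched as a small controller. Every name here is hypothetical, and the two `toy_*` functions are keyword-matching stand-ins for learned models; the sketch only shows how a running dialogue context feeds the high-level policy, whose subtask command then conditions the low-level policy.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class HierarchicalController:
    # (dialogue so far, observation) -> atomic subtask, or None when nothing is left to do
    high_level: Callable[[str, str], Optional[str]]
    # (subtask, observation) -> joint-angle targets
    low_level: Callable[[str, str], List[float]]
    context: List[str] = field(default_factory=list)
    last_subtask: Optional[str] = None

    def say(self, utterance: str) -> None:
        """Record a prompt or a mid-task interjection from the user."""
        self.context.append(utterance)

    def step(self, observation: str) -> Optional[List[float]]:
        """Re-plan from the full dialogue, then execute one low-level action."""
        self.last_subtask = self.high_level(" | ".join(self.context), observation)
        if self.last_subtask is None:
            return None
        return self.low_level(self.last_subtask, observation)


def toy_high_level(context: str, observation: str) -> Optional[str]:
    # Stand-in for a learned high-level policy.
    if "stop" in context:
        return None
    return "pick up bread" if "sandwich" in context else "wait"


def toy_low_level(subtask: str, observation: str) -> List[float]:
    # Stand-in for a learned low-level policy; returns fake joint targets.
    return [0.1, 0.2] if subtask == "pick up bread" else [0.0, 0.0]
```

Because `step` re-queries the high-level policy with the whole dialogue each time, an interjection added through `say` changes the next subtask without restarting the task.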
Key takeaways
- General-purpose models beat specialist models for the same reason LLM-based coding assistants beat purpose-built ones: pre-training transfers
- Real-world robot data at scale is necessary; simulation and synthetic data are useful for evaluation and as an analog to RL-generated data
- Reinforcement learning (online data from the robot's own attempts) is the most promising lever for post-training performance and speed gains
- The bottleneck for home-environment tasks is now reliability, not data diversity
- Open problems remain: speed, partial observability, long-horizon planning, real-time inference infrastructure