Building general-purpose robots that work anywhere, on any task

Executive overview

Most robotics companies fail because each application requires a new company built from scratch — new hardware, new software, new motion primitives. Physical Intelligence is building a single foundation model that can run any robot, on any task, in any environment.

The key finding: scale alone doesn't solve robotics. Real-world diversity, a pre-train/post-train recipe borrowed from language models, and hierarchical vision-language-action models together unlock genuine generalization.

Pre-training on broad robot data and then fine-tuning on curated, high-quality demonstrations outperforms both alternatives: training on all the data in a single stage, and training on the curated data alone.

Why scale isn't enough

Industrial robot data has volume but no behavioral diversity — useless for open-world tasks
YouTube videos have scale but an embodiment gap; watching doesn't transfer to doing
Simulation data lacks realism, so policies trained on it face a sim-to-real gap
Scale is necessary but not sufficient; data diversity and recipe matter more

The laundry-folding case study

Started with the simplest version: fold one shirt, single brand, single size
100M-parameter model mapping camera images to target joint positions, running at 50 Hz
Starting from a crumpled shirt led to months at a 0% success rate; the variability of the starting state was the hard problem
Breakthrough: pre-train on all robot data, then fine-tune on a curated, consistent, high-quality subset
This recipe cut the time to fold five items from 20 min to 12 min, and then to 9 min
Switching to a 3B-parameter vision-language model (PaliGemma) with a diffusion action head further improved speed and fold consistency
The same recipe transferred to unrelated tasks: table cleanup, coffee scooping, cardboard box construction, candle lighting
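
A policy that maps camera images to target joint positions at 50 Hz has a budget of 20 ms per step. The sketch below shows one way such a fixed-rate loop might be structured. The names get_image, policy and send_joint_targets are hypothetical placeholders, not part of any real robot API, and the timing logic is an assumption rather than the team's implementation.

```python
import time
from typing import Callable, List, Sequence

CONTROL_HZ = 50                # rate quoted in the summary
PERIOD_S = 1.0 / CONTROL_HZ    # 0.02 s available per control step


def run_control_loop(
    get_image: Callable[[], Sequence[float]],             # hypothetical camera read
    policy: Callable[[Sequence[float]], List[float]],     # hypothetical model call
    send_joint_targets: Callable[[List[float]], None],    # hypothetical robot write
    n_steps: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run a fixed-rate loop and return how many steps overran the period."""
    overruns = 0
    next_tick = time.monotonic()
    for _ in range(n_steps):
        image = get_image()
        targets = policy(image)            # inference has to fit inside PERIOD_S
        send_joint_targets(targets)
        next_tick += PERIOD_S
        remaining = next_tick - time.monotonic()
        if remaining > 0:
            sleep(remaining)               # wait out the rest of the period
        else:
            overruns += 1                  # the step was too slow for 50 Hz
            next_tick = time.monotonic()   # resynchronise instead of catching up
    return overruns


# Usage with stand-in functions: a fake 4-value "image" and 7 joint targets.
sent: List[List[float]] = []
overruns = run_control_loop(lambda: [0.0] * 4, lambda img: [0.1] * 7, sent.append, n_steps=5)
```

Counting overruns is a design choice for this sketch: it makes it visible when a larger model no longer fits the 20 ms budget.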

Pre-training and post-training recipe

Omitting pre-training: robot gets items out of a bin but makes no further progress
Omitting post-training (fine-tuning on curated data): similar failure
Combined recipe: reliable flattening, folding, and stacking
Nothing in the recipe is task-specific — fine-tuning on a new task reuses the same approach
Recipe worked on a robot the team had never physically seen; they fine-tuned on the partner's data remotely
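
The two-stage structure of the recipe can be sketched with a toy model. Everything here is an illustrative assumption: the "policy" is a single weight fitted by SGD, the broad data mixes many inconsistent strategies, and the curated data follows one consistent strategy. It shows the order of the stages only and makes no claim about reproducing the reported results.

```python
import random
from typing import List, Tuple

Example = Tuple[float, float]  # (observation, demonstrated action), 1-D stand-ins


def sgd(weight: float, data: List[Example], steps: int, lr: float, rng: random.Random) -> float:
    """Fit action = weight * observation by SGD on squared error."""
    for _ in range(steps):
        x, y = rng.choice(data)
        grad = 2.0 * (weight * x - y) * x
        weight -= lr * grad
    return weight


def pretrain_then_finetune(broad: List[Example], curated: List[Example], rng: random.Random) -> float:
    weight = 0.0
    weight = sgd(weight, broad, steps=2000, lr=0.01, rng=rng)    # stage 1: all robot data
    weight = sgd(weight, curated, steps=500, lr=0.01, rng=rng)   # stage 2: curated subset only
    return weight


rng = random.Random(0)

# Toy broad data: the same kind of observation paired with many different strategies (slopes).
broad: List[Example] = []
for _ in range(200):
    x = rng.uniform(0.5, 1.5)
    slope = rng.uniform(0.5, 1.5)
    broad.append((x, slope * x))

# Toy curated data: one consistent strategy (slope 1.0).
curated: List[Example] = []
for _ in range(20):
    x = rng.uniform(0.5, 1.5)
    curated.append((x, 1.0 * x))

final_weight = pretrain_then_finetune(broad, curated, rng)
```

Nothing in pretrain_then_finetune refers to a specific task, which mirrors the point above: a new task changes the curated set, not the procedure.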

Generalizing to unseen environments

Collected tidying data across 100+ unique rooms (mock kitchens, mock bedrooms, real San Francisco homes)
Mobile manipulation data was only 2.4% of the overall pre-training mix — the rest was static manipulation, web, and instructional data
Tested in three Airbnbs the team had never visited: robot closed cabinets, put away dishes, wiped spills, tidied bedrooms
Excluding static-lab robot data dropped novel-home performance by more than 20%
Increasing location diversity in training data raised performance until it matched training in the target environment itself — the generalization gap was mostly closed
Remaining failure modes: items not fully placed, thin objects flush to surfaces, spatula placed in oven instead of drawer
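
To give a feel for how small a 2.4% share of the pre-training mix is, the sketch below samples data sources by weight. Only the 2.4% figure comes from the summary; the remaining 97.6% is lumped into one bucket because its internal split is not given, and the bucket names are made up for illustration.

```python
import random
from collections import Counter

# Only the 0.024 weight is taken from the summary. The rest is a single
# placeholder bucket covering static manipulation, web and instructional data.
MIX_WEIGHTS = {
    "mobile_manipulation": 0.024,
    "static_web_instructional": 0.976,
}


def sample_sources(n: int, rng: random.Random) -> Counter:
    """Draw n training examples' sources according to MIX_WEIGHTS."""
    names = list(MIX_WEIGHTS)
    weights = [MIX_WEIGHTS[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=n))


counts = sample_sources(100_000, random.Random(0))
mobile_fraction = counts["mobile_manipulation"] / 100_000
print(f"mobile manipulation share in sample: {mobile_fraction:.2%}")
```

Under these weights, a batch of 1,024 examples would contain about 25 mobile manipulation examples on average (1,024 × 0.024 ≈ 24.6).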

Language following and the gradient-stopping fix

Early models ignored language instructions — asked to pick up a cutting board, they picked up a plate
Root cause: the randomly initialized diffusion action head corrupts the pre-trained VLM backbone during training
Fix: predict tokenized actions alongside diffusion actions; stop gradients from the diffusion head to protect the VLM
Result: language-following rate rose from 20% to 80%; training also converged faster
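
The effect of stopping gradients can be seen in a toy scalar model with hand-written derivatives. This is a sketch under loud assumptions: the shared backbone is a single weight, the tokenized-action head and the stand-in for the diffusion head are both linear with squared-error losses, and no real diffusion objective is involved. In an autodiff framework the same effect is usually obtained with an operation such as jax.lax.stop_gradient or a tensor's detach() in PyTorch.

```python
def backbone_grad(
    x: float,
    w_backbone: float,
    w_token: float,
    w_action: float,
    token_target: float,
    action_target: float,
    stop_action_grad: bool,
) -> float:
    """Gradient of the total loss with respect to the shared backbone weight.

    Toy model: h = w_backbone * x.
    Token head predicts w_token * h; action head predicts w_action * h.
    Both losses are squared errors (an assumption for illustration).
    """
    h = w_backbone * x
    token_err = w_token * h - token_target
    action_err = w_action * h - action_target
    grad_from_tokens = 2.0 * token_err * w_token * x
    grad_from_actions = 2.0 * action_err * w_action * x
    if stop_action_grad:
        # The action head treats h as a constant, so nothing flows back
        # into the backbone from its loss. The head itself still trains.
        grad_from_actions = 0.0
    return grad_from_tokens + grad_from_actions


# A backbone that already fits the token objective, next to an action head
# whose weight is far from useful (standing in for random initialisation).
kwargs = dict(x=1.0, w_backbone=0.5, w_token=1.0, w_action=3.0,
              token_target=0.5, action_target=0.0)
grad_without_stop = backbone_grad(stop_action_grad=False, **kwargs)
grad_with_stop = backbone_grad(stop_action_grad=True, **kwargs)
```

In this toy setup the token loss is already zero, so any non-zero backbone gradient comes entirely from the untrained action head. With the stop in place the backbone is left alone and is updated only by the tokenized-action objective.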

Responding to open-ended prompts and interjections

Hierarchical vision-language-action (VLA) model: a high-level policy breaks prompts into atomic subtask commands; a low-level policy executes them as joint-angle targets
Scaling human-robot interaction data is impractical, so synthetic prompts are generated: a VLM watches robot video and generates plausible user requests that would lead to that action
High-level policy trained on synthetic prompts can handle: "make me a ham and cheese sandwich", "make me a vegan sandwich, no pickles", "clean up only the trash but not the dishes"
Robot responds correctly to mid-task interjections and situated corrections (e.g., user asks for something sweet not already in the basket)
Frontier LLMs used directly as high-level planners scored substantially lower — they lack physical-world visual understanding
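
The split between a high-level and a low-level policy can be sketched as a simple loop. All names here are hypothetical, the canned subtask list is invented for illustration, and camera observations are left out for brevity even though a real high-level policy would condition on them.

```python
from typing import Callable, Dict, List, Optional


def run_hierarchical(
    prompt: str,
    high_level: Callable[[str, List[str]], Optional[str]],
    low_level: Callable[[str], List[float]],
    poll_user: Callable[[], Optional[str]] = lambda: None,
    max_subtasks: int = 20,
) -> List[str]:
    """Alternate between choosing an atomic subtask and executing it.

    high_level(prompt, done) returns the next command, or None when finished.
    low_level(subtask) returns joint-angle targets (a stub in this sketch).
    poll_user() returns a new instruction if the user interjected, else None.
    """
    done: List[str] = []
    for _ in range(max_subtasks):
        interjection = poll_user()
        if interjection is not None:
            prompt = interjection      # plan against the latest request
        subtask = high_level(prompt, done)
        if subtask is None:
            break
        low_level(subtask)             # a real system would stream targets to the robot
        done.append(subtask)
    return done


# Hypothetical canned plan standing in for a learned high-level policy.
CANNED_PLANS: Dict[str, List[str]] = {
    "make me a ham and cheese sandwich": [
        "pick up bread", "place ham", "place cheese", "place top slice",
    ],
}


def toy_high_level(prompt: str, done: List[str]) -> Optional[str]:
    plan = CANNED_PLANS.get(prompt, [])
    return plan[len(done)] if len(done) < len(plan) else None


def toy_low_level(subtask: str) -> List[float]:
    return [0.0] * 7   # placeholder joint-angle targets; 7 joints is an assumption


steps = run_hierarchical("make me a ham and cheese sandwich", toy_high_level, toy_low_level)
```

Polling for user input before every subtask is one simple way to model mid-task interjections; the toy planner does not handle them meaningfully, and how the actual system does so is not described in this summary.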

Key takeaways

General-purpose models beat specialist models for the same reason LLM-based coding assistants beat purpose-built ones: pre-training transfers
Real-world robot data at scale is necessary; simulation and synthetic data are useful for evaluation and as an analog to RL-generated data
Reinforcement learning (online data from the robot's own attempts) is the most promising lever for post-training performance and speed gains
The bottleneck for home-environment tasks is now reliability, not data diversity
Open problems remain: speed, partial observability, long-horizon planning, real-time inference infrastructure
