Fei-Fei Li on spatial intelligence as the next frontier in AI
Executive overview
Language took less than a million years to evolve; vision has been evolving for about 540 million. Fei-Fei Li cites that gap to explain why 3D spatial understanding remains unsolved while LLMs already pass the Turing test. She argues that AGI cannot be complete without spatial intelligence: the ability to understand, generate, and reason about the 3D world.
She founded World Labs to build world models that go beyond flat pixels and language tokens. The goal: a foundation model for spatial intelligence with applications spanning robotics, design, gaming, and the metaverse.
Li's thesis: spatial intelligence is the hardest open problem in AI, and the most consequential.
Why spatial intelligence is harder than language
- Language is 1D and purely generative; the real world is 3D and physically constrained.
- Visual sensing is a projection that collapses 3D to 2D, so recovering 3D structure from images is mathematically ill-posed.
- World models must fluidly span generation (gaming, metaverse) and reconstruction (robotics).
- Language data is abundant on the internet; spatial data largely exists only in embodied experience.
- The visual cortex occupies far more of the human brain than the language areas do.
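The projection point in the list above can be made concrete with a minimal sketch. It assumes an ideal pinhole camera with focal length 1 and points given in the camera's own coordinate frame; the `project` function and the sample points are illustrative, not taken from the talk. Two different 3D points on the same ray from the camera land on the same pixel, so a single image cannot say which one was observed.

```python
from typing import Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[float, float]


def project(point: Point3D, focal_length: float = 1.0) -> Pixel:
    """Ideal pinhole projection of a camera-frame 3D point onto the image plane."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    # Dividing by depth is the step that discards one dimension.
    return (focal_length * x / z, focal_length * y / z)


# Hypothetical sample points: `far` lies on the same ray as `near`, twice as distant.
near = (1.0, 2.0, 4.0)
far = (2.0, 4.0, 8.0)

pixel_near = project(near)
pixel_far = project(far)

print(pixel_near, pixel_far)     # (0.25, 0.5) (0.25, 0.5)
print(pixel_near == pixel_far)   # True: depth cannot be recovered from one view
```

Recovering depth therefore needs extra information such as a second viewpoint, motion, or learned priors about the world, which is one way to read the reconstruction side of the world-model problem.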
The ImageNet origin story
- In 2007, Li and her students bet that AI needed a paradigm shift toward data-driven methods.
- They downloaded about a billion images from the internet and curated them into a labeled dataset organized by a full visual taxonomy.
- The project was open-sourced from day one and paired with an annual public challenge.
- For three years (2009–2012) there was little signal that it was working.
- In 2012, Geoffrey Hinton's group (team SuperVision, with the model now known as AlexNet) trained a CNN on two GPUs and achieved a step change in error rate. It was the first moment data, GPUs, and neural networks converged.
From objects to scenes to worlds
- ImageNet solved object recognition; the next problem was scene understanding.
- Li's lifelong goal was machine storytelling — describing a full scene the way humans do.
- Around 2015, Andrej Karpathy and Li published some of the first image-captioning papers, combining vision and natural language.
- The reverse problem — generating images from text — seemed like a joke in 2015; it is now generative AI.
- The arc: objects → scenes → 3D world models.
World Labs and what spatial AI enables
- Co-founded with Justin Johnson, Ben Mildenhall (NeRF), and Christoph Lassner (precursor to Gaussian Splatting).
- Target use cases: 3D content creation for designers, architects, game developers, and artists.
- Longer-term: robotics, metaverse, marketing, and entertainment.
- World models must obey physics and support both generative and reconstructive use cases.
- Data strategy is hybrid — quality matters as much as quantity; details not public.
Hiring and what makes great researchers
- Li's single hiring criterion: intellectual fearlessness — the willingness to embrace hard problems and go all in.
- Applies equally to PhD students, researchers, and engineering hires at World Labs.
- World Labs is actively hiring across engineering, product, 3D, and generative AI.
Advice for founders and PhD students
- Grad school is for burning curiosity; startups require a more focused commercial goal — know which you're in.
- PhD students should target problems where compute and scale alone won't win: interdisciplinary AI, theory, explainability, small-data regimes.
- Academia no longer leads on compute or data; find the North Stars industry can't easily reach.
- Immigrant and minority founders: learn not to over-index on being the outsider; focus on building.
- "Forget what you've done. Forget what others think. Hunker down and build."