Building generative AI foundation models without billions of dollars

Executive overview

The assumption that foundation model training requires massive funding and large teams is wrong. YC Winter 2024 companies built functional video, audio, and domain-specific foundation models during a single batch on roughly $500K in YC funding, with most of the training cost covered by GPU credits from YC's Azure deal.

The three levers are data, compute, and expertise. Hacking any one of them is enough to make progress. Expertise is more accessible than it appears: several founders simply read the papers and taught themselves.

You can train competitive foundation models as a small team by trading compute for data quality, using synthetic data, or narrowing the task scope.

How Sora works

Sora combines a transformer model (text) with a diffusion model (image generation) plus a temporal component for frame consistency.
The training unit is the space-time patch: a 3D block of pixels spanning two spatial dimensions and time, analogous to a token in a text model.
Visual transformer foundation: Google's 2020 paper "An Image is Worth 16x16 Words" showed transformers could handle image recognition, replacing expensive convolutional neural networks.
Temporal coherence draws from the 2018 "World Models" paper: separate modules for perception, temporal memory (RNNs), and a controller that combines them.
Scale estimate: if GPT-4 is ~1 trillion parameters for text, a video model that must also cover space and time is likely 10x that, requiring roughly 10x the GPU count.
Likely uses game engine footage (Unreal Engine / Unity) for synthetic training data: controllable physics, infinite camera angles, no rights issues.
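
To make the space-time patch idea concrete, here is a minimal NumPy sketch that cuts a video tensor into patches and flattens each one into a token-like vector. The function name, clip size, and patch sizes are illustrative assumptions, not details of Sora's implementation, which has not been published at this level.

```python
import numpy as np

def to_spacetime_patches(video, pt, ph, pw):
    """Split a video of shape (T, H, W, C) into flattened space-time patches.

    Returns an array of shape (num_patches, pt * ph * pw * C), one row per patch.
    """
    T, H, W, C = video.shape
    # This sketch requires sizes that divide evenly; real pipelines pad or crop.
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # Carve each axis into (number of blocks, block size).
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Move the three block indices to the front, the within-block axes after.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, pt * ph * pw * C)

# Hypothetical clip: 16 frames of 64x64 RGB.
video = np.random.rand(16, 64, 64, 3).astype(np.float32)
patches = to_spacetime_patches(video, pt=4, ph=16, pw=16)
print(patches.shape)  # (64, 3072)
```

Each row plays the role a token plays in a text model: the transformer sees a sequence of 64 vectors rather than raw frames.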

How YC companies hack the three constraints

Compute

YC's Azure deal provides a dedicated GPU cluster: instant access within 24 hours, no resource contention, over $500K in credits.
Companies trained foundation models without touching YC investment money — GPU credits covered it entirely.
Smaller model choice reduces compute: Metalware used a GPT-2-class model (about 1B parameters) instead of GPT-4 scale, pairing it with high-quality narrow data.
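
A rough way to see why model size dominates the compute bill is the common rule of thumb that training a dense transformer costs about 6 x parameters x tokens in FLOPs. The sketch below applies it to a ~1B and a ~1T parameter model. The token count and the 40% utilization figure are assumptions chosen for illustration, not numbers from the talk; 312 TFLOPS is the A100's published dense BF16 peak.

```python
def training_flops(params, tokens):
    """Approximate training FLOPs for a dense transformer: about 6 * N * D."""
    return 6.0 * params * tokens

def gpu_hours(flops, peak_flops_per_s=312e12, utilization=0.4):
    """Convert FLOPs to GPU-hours, assuming a fixed fraction of peak throughput."""
    return flops / (peak_flops_per_s * utilization) / 3600.0

tokens = 20e9  # assumed dataset size, not a figure from the talk
small = training_flops(1e9, tokens)   # ~1B-parameter model
large = training_flops(1e12, tokens)  # ~1T-parameter model on the same data

print(f"1B model: {small:.2e} FLOPs, about {gpu_hours(small):,.0f} A100-hours")
print(f"1T model needs {large / small:.0f}x the compute on the same data")
```

Under these assumptions the small model fits comfortably inside a credit allocation, while the large one is three orders of magnitude more expensive before any increase in data.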

Data

Sync Labs (real-time lip syncing) trained on a single A100 using low-resolution video. The savings are quadratic because resolution shrinks along two spatial dimensions: halving width and height cuts the pixels per frame by 4x.
Metalware (hardware copilot) scanned textbook figures and equations — small dataset, but domain-specific and high quality.
Phind (software copilot) generated synthetic data from programming competition problems, producing effectively unlimited high-quality training examples.
Synthetic data works because capable LLMs can reason, not just pattern-match — that reasoning ability bootstraps the flywheel.
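
The programming-competition approach above can be sketched as a generate-and-verify loop: sample a task, produce a candidate answer, and keep the pair only if a trusted checker agrees. This is a toy illustration under stated assumptions, not any company's actual pipeline. Every function name here is hypothetical, and `candidate_solution` stands in for a solution that an LLM would write in a real system.

```python
import random

def make_problem(rng):
    """Sample one hypothetical competition-style task."""
    nums = [rng.randint(-50, 50) for _ in range(rng.randint(3, 8))]
    prompt = f"Return the maximum sum of a contiguous subarray of {nums}."
    return prompt, nums

def reference_solution(nums):
    """Trusted brute-force checker, used only to verify answers."""
    n = len(nums)
    return max(sum(nums[i:j]) for i in range(n) for j in range(i + 1, n + 1))

def candidate_solution(nums):
    """Stand-in for a model-written solution (Kadane's algorithm)."""
    best = cur = nums[0]
    for x in nums[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

def build_dataset(n, seed=0):
    rng = random.Random(seed)
    dataset = []
    for _ in range(n):
        prompt, nums = make_problem(rng)
        answer = candidate_solution(nums)
        # Keep only examples whose answer the trusted checker confirms.
        if answer == reference_solution(nums):
            dataset.append({"prompt": prompt, "answer": answer})
    return dataset

data = build_dataset(100)
print(len(data), data[0]["prompt"])
```

The verification step is what keeps quality high as volume grows: wrong candidates are filtered out automatically, so the dataset can scale without manual review.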

Expertise

Sonauto (text-to-song) was built in two months by 21-year-old recent graduates who taught themselves from papers.
Playground AI's founder Suhail Doshi (previously Mixpanel) locked himself in his apartment for a month reading AI papers, then built a model that rivals Stable Diffusion on a fraction of the budget.
K-Scale Labs (consumer humanoid robots) was founded by the engineer who built Tesla's foundation robotics model for Optimus.

YC company spotlights

Infinity AI: deepfake video of a specific person trained on ~1 hour of YouTube footage; fine-tuning a pre-trained foundation model requires little data to adapt to a new identity.
Sync Labs: real-time lip sync API, single A100, extremely accurate cross-language sync; compression and low-res training were the key hacks.
Sonauto: text-to-song with artist voice cloning; one of only two or three models in the world doing this; best-in-class results.
Metalware: hardware design copilot for SpaceX-style engineering workflows; a GPT-2-class model plus high-quality textbook data outperforms general-purpose models for narrow CAD tasks.
Guide Labs: explainable foundation model that predicts outputs and surfaces interpretable feature weights, unlike standard deep learning black boxes.
Phind: software copilot for developers; answers outperform Stack Overflow; synthetic data from programming competitions was the data hack.

Applications beyond entertainment

Weather (Atmo, YC 2020): trained on 90 terabytes of weather data with physics equations baked in; result is 1,000,000x more compute-efficient than NOAA's traditional physics simulation, with higher accuracy — at seed-stage cost.
Drug discovery (Diffuse Bio): generative AI for proteins and molecules for new drugs and gene therapies; founder published Nature papers; used custom CUDA kernels to reduce compute overhead.
Brain signals (Piramidal): foundation model predicting EEG signals for stroke detection and brain-computer interfaces. EEG is temporal like video, so the same space-time chunking approach applies, and it brought per-iteration runtime complexity down from quadratic; a single training run needed only 800 GPU-hours.
CAD design (DraftAid): AI short-circuits expensive Fortran-based physics kernels in SolidWorks/AutoCAD; replaces force/shear polynomial solvers with faster learned approximations.
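
The idea of swapping an expensive solver for a learned approximation can be shown with a small surrogate-model sketch. Here the "slow solver" is a stand-in that numerically integrates the curvature of a tip-loaded cantilever beam, and the surrogate is a cubic polynomial fitted to a dozen solver runs. The load, rigidity, and function names are illustrative assumptions; this is not how DraftAid or any CAD kernel is actually implemented.

```python
import numpy as np

E_I = 2.0e5    # assumed flexural rigidity E*I in N*m^2 (illustrative value)
FORCE = 100.0  # assumed tip load in N (illustrative value)

def slow_tip_deflection(length, steps=20001):
    """Stand-in for an expensive solver: integrate curvature twice along the beam."""
    x = np.linspace(0.0, length, steps)
    curvature = FORCE * (length - x) / E_I  # M(x) / (E*I) for a tip-loaded cantilever
    dx = x[1] - x[0]
    # Cumulative trapezoid rule, applied twice: curvature -> slope -> deflection.
    slope = np.concatenate(([0.0], np.cumsum((curvature[1:] + curvature[:-1]) * dx / 2)))
    deflection = np.concatenate(([0.0], np.cumsum((slope[1:] + slope[:-1]) * dx / 2)))
    return deflection[-1]

# Run the slow solver offline on a handful of designs, then fit a cheap surrogate.
train_lengths = np.linspace(0.5, 3.0, 12)
train_targets = np.array([slow_tip_deflection(L) for L in train_lengths])
surrogate = np.poly1d(np.polyfit(train_lengths, train_targets, deg=3))

test_length = 1.7
print(slow_tip_deflection(test_length), surrogate(test_length))
```

Once fitted, the surrogate answers in a single polynomial evaluation instead of a 20,001-point integration. Real CAD problems have many more inputs, so the learned model would be a neural network rather than a cubic, but the train-offline, query-cheaply pattern is the same.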

The broader opportunity

General-purpose models (GPT-4) will not specialize deeply into niche physics domains — that's the moat for vertical foundation model startups.
The concern that OpenAI will out-compete domain-specific AI startups is mostly unfounded for narrow, high-value verticals.
Physics simulation is the thread connecting video generation, robotics, weather, biology, and CAD — a world model that understands physics is a platform, not just a product.
Self-driving parallel: simulation-generated training data already outpaces real-world data by 10:1 or more; the same approach will apply to generative video.
Robotics is the next convergence point: a physics-accurate world simulator plugged into a robot is the long-sought path to real-world AI agents.
