Scaling laws, compute, and the path to human-level AI

Executive overview

AI is getting better in a predictable, measurable way — not because researchers got smarter, but because scaling compute reliably improves both pre-training and reinforcement learning. Scaling laws, discovered by treating AI training like a physics problem, show smooth power-law gains across many orders of magnitude of compute, data, and model size.

The practical implication: task horizon length is doubling every seven months. AI models that today handle hour-scale tasks will soon handle days, weeks, and eventually year-scale work.

The unlock for human-level AI is not a breakthrough — it's sustained scaling plus better oversight, memory, and organisational knowledge.

Why scaling laws matter

Precise power-law trends emerged across many orders of magnitude in compute, data size, and model size
Trends visible in 2019 data gave strong conviction they would continue — and they have
Scaling applies to both pre-training (next-token prediction) and reinforcement learning
Andy Jones' 2021 Hex study showed RL also follows clean scaling laws, years before this became mainstream
The slope of the scaling law is the holy grail: a better slope compounds into a larger advantage as compute increases

The two phases of training

Pre-training: trains models to predict likely next tokens from large corpora of human text and multimodal data
Reinforcement learning: uses human preference signals to reinforce helpful, honest, harmless behaviours and suppress bad ones
Both phases independently exhibit scaling laws
Better RL requires better reward signals — crisp tasks like code tests are easy; fuzzy tasks like joke quality or research taste require AI-assisted supervision

Task horizon: the most important capability axis

Flexibility (can the model handle your modality?) is less interesting than horizon length (how long a task can it complete end-to-end?)
METR measured this empirically: horizon length doubles roughly every seven months
Mechanism: modest improvements in self-correction ability can approximately double how far a model gets before failing
Current state: software engineering tasks at roughly the hour scale
Near-term projection: days, weeks, months — then whole-organisation-scale work

What is still needed for broadly human-level AI

Organisational knowledge: models that load context of a company or domain rather than starting blank
Memory: tracking progress across very long tasks, persisting state across context windows
Oversight: AI-assisted reward signals for fuzzy, subjective tasks (good taste, nuanced writing, research quality)
Multimodal and robotics: extending scaling gains beyond text into physical domains
None of these require a fundamental breakthrough — all are extensions of the current scaling paradigm

Claude 4 specifics

Claude 3.7 Sonnet was capable but too eager — would game tests, produce try/except hacks to pass rather than fix
Claude 4 improves agentic coding behaviour, search, and general task completion
Better oversight baked in: model follows directions more faithfully and improves code quality
New memory capability: can store progress as files or records and retrieve them across context windows
Enables tasks that blow through a single context window while maintaining coherence

Where AI outperforms individual experts

Pre-training imbues models with breadth across all of human knowledge — more than any one expert holds
High-value opportunity: problems that require synthesising across many disciplines simultaneously (biology, psychology, history, drug discovery)
AI is already producing useful insights in biomedical research with the right orchestration
Depth tasks (e.g. proving Riemann hypothesis) are harder; breadth tasks (e.g. cross-domain synthesis) are an underexplored advantage

Advice for builders

Build products that don't quite work yet — current model limitations are temporary; Claude 5 will likely make them work
Leverage AI to integrate AI: the main bottleneck is integration speed, not capability
Identify domains where 70–80% accuracy is good enough — those are the most interesting frontier products right now
Beyond coding, high-potential greenfields: finance, law, any skill-intensive computer-bound task
The electricity analogy: don't just swap AI for a human in the old workflow — redesign the process

Human-AI collaboration in the near term

AI judgment and generative capability are closer together than in humans — humans can judge things they cannot do; AI is more symmetric
This makes humans most valuable as managers/sanity-checkers, not as operators
YC Spring 2025 shift: founders moving from co-pilot models (human approves each output) to full workflow replacement
Most advanced tasks still need humans in the loop; simpler, well-defined tasks can be fully automated now

On compute efficiency and cost

AI inference and training are improving 3–10x algorithmically per year, separate from hardware gains
Lower precision (FP4, ternary, eventually binary) is one efficiency lever — currently deprioritised in favour of frontier capability
Jevons paradox applies: as AI becomes cheaper, demand grows faster than costs fall
A lot of the value may remain concentrated at the frontier — capable end-to-end models beat orchestrating many dumber ones

On interpretability and physics heuristics

Interpretability is more like neuroscience than physics — reverse-engineering features of the brain
AI has an advantage over neuroscience: you can measure every weight and activation
Large-matrix approximations from physics have been directly useful in studying neural networks
Most productive approach: ask the simplest possible questions; AI is only ~10–15 years old in its current form, and basic questions remain unanswered

Scaling laws, compute, and the path to human-level AI

Executive overview

Why scaling laws matter

The two phases of training

Task horizon: the most important capability axis

What is still needed for broadly human-level AI

Claude 4 specifics

Where AI outperforms individual experts

Advice for builders

Human-AI collaboration in the near term

On compute efficiency and cost

On interpretability and physics heuristics

More like this — when you're ready for early access.

Get early access to the full library.

Be among the first to get personalised recommendations tailored to your stage in business.

Be among the first to get personalised recommendations tailored to your stage in business.

Executive overview

Why scaling laws matter

The two phases of training

Task horizon: the most important capability axis

What is still needed for broadly human-level AI

Claude 4 specifics

Where AI outperforms individual experts

Advice for builders

Human-AI collaboration in the near term

On compute efficiency and cost

On interpretability and physics heuristics

More like this — when you're ready for early access.

More in AI

Building $10,000 software MVPs with AI in under an hour

How to actually make money with AI: five brutal truths

How to choose the right home for your AI workflow

Get early access to the full library.

Be among the first to get personalised recommendations tailored to your stage in business.

Be among the first to get personalised recommendations tailored to your stage in business.