Scaling laws, compute, and the path to human-level AI
Executive overview
AI is getting better in a predictable, measurable way — not because researchers got smarter, but because scaling compute reliably improves both pre-training and reinforcement learning. Scaling laws, discovered by treating AI training like a physics problem, show smooth power-law gains across many orders of magnitude of compute, data, and model size.
The practical implication: task horizon length is doubling roughly every seven months, according to METR's measurements. AI models that today handle hour-scale tasks will soon handle days, weeks, and eventually year-scale work.
The unlock for human-level AI is not a breakthrough — it's sustained scaling plus better oversight, memory, and organisational knowledge.
Why scaling laws matter
- Precise power-law trends emerged across many orders of magnitude in compute, data size, and model size
- Trends visible in 2019 data gave strong conviction they would continue — and they have
- Scaling applies to both pre-training (next-token prediction) and reinforcement learning
- Andy Jones' 2021 Hex study showed RL also follows clean scaling laws, years before this became mainstream
- The slope of the scaling law is the holy grail: a better slope compounds into a larger advantage as compute increases
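To make the point about slope concrete, here is a minimal sketch that assumes loss follows a power law of the form L(C) = a * C^(-b). The constants (a = 5.0, b = 0.05, and the "better" exponent 0.06) are invented for illustration and are not measured values from any real training run. The sketch fits the exponent by a straight-line fit in log-log space, then shows how a slightly steeper slope yields an advantage that grows with compute.

```python
import math

def power_law_loss(compute, a, b):
    """Loss under an assumed power law L(C) = a * C**(-b)."""
    return a * compute ** (-b)

def fit_power_law(computes, losses):
    """Least-squares straight-line fit in log-log space; returns (a, b)."""
    xs = [math.log(c) for c in computes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log L = log a - b * log C, so b is the negated slope.
    return math.exp(intercept), -slope

def loss_ratio(compute, b_base, b_better):
    """Baseline loss divided by better-slope loss, with the same prefactor a."""
    return compute ** (b_better - b_base)

# Hypothetical, noise-free data spanning six orders of magnitude of compute.
computes = [10.0 ** k for k in range(0, 7)]
losses = [power_law_loss(c, a=5.0, b=0.05) for c in computes]
a_hat, b_hat = fit_power_law(computes, losses)
print(f"recovered a={a_hat:.3f}, b={b_hat:.3f}")

# The advantage from a steeper slope widens as compute grows.
for c in (1e3, 1e6, 1e9):
    print(f"compute={c:.0e}: baseline loss is {loss_ratio(c, 0.05, 0.06):.3f}x the better-slope loss")
```

On a log-log plot a power law is a straight line, which is why the fit reduces to ordinary linear regression. Real scaling data is noisy and typically includes an irreducible loss term that this sketch leaves out.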
The two phases of training
- Pre-training: trains models to predict likely next tokens from large corpora of human text and multimodal data
- Reinforcement learning: uses human preference signals to reinforce helpful, honest, harmless behaviours and suppress bad ones
- Both phases independently exhibit scaling laws
- Better RL requires better reward signals — crisp tasks like code tests are easy; fuzzy tasks like joke quality or research taste require AI-assisted supervision
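As an illustration of why code tests count as a crisp reward signal, the sketch below scores a candidate function by the fraction of test cases it passes. The function name `unit_test_reward` and the example cases are hypothetical and are not taken from any real training pipeline. A fuzzy task such as joke quality has no equivalent of the `==` check, which is where AI-assisted supervision would have to come in.

```python
from typing import Any, Callable, Iterable, Tuple

def unit_test_reward(candidate: Callable[..., Any],
                     cases: Iterable[Tuple[tuple, Any]]) -> float:
    """Return the fraction of (args, expected) cases the candidate passes."""
    cases = list(cases)
    if not cases:
        return 0.0
    passed = 0
    for args, expected in cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            # A crash counts as a failed case, never as a pass.
            pass
    return passed / len(cases)

# Hypothetical test cases for an "add two numbers" task.
ADD_CASES = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0), ((2, 2), 4)]

print(unit_test_reward(lambda a, b: a + b, ADD_CASES))  # 1.0
print(unit_test_reward(lambda a, b: a * b, ADD_CASES))  # 0.5
```

A reward like this is only as good as the test cases behind it: a candidate can score well by special-casing the listed inputs, so broader or hidden test cases are the usual mitigation.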
Task horizon: the most important capability axis
- Flexibility (can the model handle your modality?) is less interesting than horizon length (how long a task can it complete end-to-end?)
- METR measured this empirically: horizon length doubles roughly every seven months
- Mechanism: modest improvements in self-correction ability can approximately double how far a model gets before failing
- Current state: software engineering tasks at roughly the hour scale
- Near-term projection: days, weeks, months — then whole-organisation-scale work
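The doubling claim can be turned into simple arithmetic. The sketch below assumes a one-hour starting horizon (the "hour scale" mentioned above), a 40-hour working week, and that the seven-month doubling continues unchanged. All three are assumptions made for illustration, not forecasts. The second part is a toy model, also an assumption: if a task fails at the first unrecovered error and each step carries an independent probability p of one, the expected number of steps before failure is 1/p, so halving p doubles the horizon.

```python
import math

DOUBLING_MONTHS = 7.0  # doubling period quoted above from METR

def projected_horizon_hours(current_hours, months_ahead,
                            doubling_months=DOUBLING_MONTHS):
    """Horizon after `months_ahead` months of steady exponential growth."""
    return current_hours * 2 ** (months_ahead / doubling_months)

def months_until(target_hours, current_hours,
                 doubling_months=DOUBLING_MONTHS):
    """Months needed for the horizon to grow from current to target."""
    return doubling_months * math.log2(target_hours / current_hours)

def expected_steps_before_failure(p_unrecovered_error):
    """Toy model: mean of a geometric distribution with per-step failure rate p."""
    return 1.0 / p_unrecovered_error

print(projected_horizon_hours(1.0, 14.0))   # 4.0 (two doublings)
print(round(months_until(40.0, 1.0), 1))    # roughly 37 months to a 40-hour week

# Moving from a 1% to a 0.5% unrecovered-error rate doubles the expected horizon.
print(expected_steps_before_failure(0.01), expected_steps_before_failure(0.005))
```

The toy model shows why a small absolute gain in self-correction (99% to 99.5% per-step reliability) can translate into a doubling of how far a model gets.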
What is still needed for broadly human-level AI
- Organisational knowledge: models that load context of a company or domain rather than starting blank
- Memory: tracking progress across very long tasks, persisting state across context windows
- Oversight: AI-assisted reward signals for fuzzy, subjective tasks (good taste, nuanced writing, research quality)
- Multimodal and robotics: extending scaling gains beyond text into physical domains
- None of these require a fundamental breakthrough — all are extensions of the current scaling paradigm
Claude 4 specifics
- Claude 3.7 Sonnet was capable but too eager: it would game tests, for example by adding try/except hacks that made tests pass instead of fixing the underlying problem
- Claude 4 improves agentic coding behaviour, search, and general task completion
- Better oversight baked in: model follows directions more faithfully and improves code quality
- New memory capability: can store progress as files or records and retrieve them across context windows
- Enables tasks that blow through a single context window while maintaining coherence
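The idea of storing progress as files and retrieving it in a later context window can be sketched in a few lines. This is not a description of how Claude implements memory. `ProgressStore`, `run_session`, the JSON layout, and the step names are all hypothetical, and each call to `run_session` stands in for one context window that starts with no state other than what is on disk.

```python
import json
import tempfile
from pathlib import Path

class ProgressStore:
    """Hypothetical file-backed record of progress on a long task."""

    def __init__(self, path):
        self.path = Path(path)

    def load(self):
        # A missing file means the task has not been started yet.
        if not self.path.exists():
            return {"done": [], "notes": []}
        return json.loads(self.path.read_text(encoding="utf-8"))

    def save(self, state):
        # Write to a temporary file, then rename it over the real one,
        # so an interrupted write does not leave a half-written record.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
        tmp.replace(self.path)

def run_session(store, steps, budget):
    """Simulate one context window: complete up to `budget` pending steps."""
    state = store.load()
    pending = [s for s in steps if s not in state["done"]]
    for step in pending[:budget]:
        state["done"].append(step)
        state["notes"].append(f"finished {step}")
    store.save(state)
    return state

steps = ["plan", "implement", "test", "document"]
with tempfile.TemporaryDirectory() as workdir:
    path = Path(workdir) / "progress.json"
    first = run_session(ProgressStore(path), steps, budget=2)
    # A fresh store object models a new context window with no in-memory state.
    second = run_session(ProgressStore(path), steps, budget=2)
    print(first["done"])   # ['plan', 'implement']
    print(second["done"])  # ['plan', 'implement', 'test', 'document']
```

The second session picks up where the first stopped only because the record was persisted. Nothing is carried over in memory between the two calls.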
Where AI outperforms individual experts
- Pre-training imbues models with breadth across all of human knowledge — more than any one expert holds
- High-value opportunity: problems that require synthesising across many disciplines simultaneously (biology, psychology, history, drug discovery)
- AI is already producing useful insights in biomedical research with the right orchestration
- Depth tasks (e.g. proving the Riemann hypothesis) are harder; breadth tasks (e.g. cross-domain synthesis) are an underexplored advantage
Advice for builders
- Build products that don't quite work yet — current model limitations are temporary; Claude 5 will likely make them work
- Leverage AI to integrate AI: the main bottleneck is integration speed, not capability
- Identify domains where 70–80% accuracy is good enough — those are the most interesting frontier products right now
- Beyond coding, high-potential greenfields: finance, law, any skill-intensive computer-bound task
- The electricity analogy: don't just swap AI for a human in the old workflow — redesign the process
Human-AI collaboration in the near term
- In AI, judgment and generative capability sit closer together than in humans: humans can often judge work they could not produce themselves, whereas a model's ability to evaluate is closer to its ability to generate
- This makes humans most valuable as managers/sanity-checkers, not as operators
- YC Spring 2025 shift: founders moving from co-pilot models (human approves each output) to full workflow replacement
- Most advanced tasks still need humans in the loop; simpler, well-defined tasks can be fully automated now
On compute efficiency and cost
- Algorithmic improvements are making AI training and inference roughly 3–10x more efficient per year, separate from hardware gains
- Lower precision (FP4, ternary, eventually binary) is one efficiency lever — currently deprioritised in favour of frontier capability
- Jevons paradox applies: as AI becomes cheaper, demand grows faster than costs fall
- A lot of the value may remain concentrated at the frontier — capable end-to-end models beat orchestrating many dumber ones
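The efficiency and Jevons-paradox points reduce to compounding arithmetic. The sketch below takes the 3–10x yearly range quoted above and shows how the cost of a fixed workload shrinks over three years. The demand multiplier in the second part (30x) is a made-up number used only to show how total spend can rise even as unit cost falls.

```python
def relative_cost(years, gain_per_year):
    """Cost of a fixed workload relative to today, given a yearly efficiency gain."""
    return 1.0 / gain_per_year ** years

def total_spend_ratio(unit_cost_ratio, demand_ratio):
    """Total spend relative to today: unit cost change times demand change."""
    return unit_cost_ratio * demand_ratio

# Three years at the low and high ends of the quoted range.
print(relative_cost(3, 3.0))    # about 1/27 of today's cost
print(relative_cost(3, 10.0))   # 0.001, i.e. 1/1000 of today's cost

# Hypothetical Jevons scenario: unit cost falls 10x while demand grows 30x.
print(total_spend_ratio(relative_cost(1, 10.0), 30.0))  # spend roughly triples
```

Whether demand really outpaces falling cost is an empirical question. The sketch only shows that the two effects multiply.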
On interpretability and physics heuristics
- Interpretability is more like neuroscience than physics: the work is reverse-engineering the features a network has learned, much as neuroscientists reverse-engineer features of the brain
- AI has an advantage over neuroscience: you can measure every weight and activation
- Large-matrix approximations from physics have been directly useful in studying neural networks
- Most productive approach: ask the simplest possible questions; AI is only ~10–15 years old in its current form, and basic questions remain unanswered