Scaling laws, compute, and the path to human-level AI

Executive overview

AI is getting better in a predictable, measurable way — not because researchers got smarter, but because scaling compute reliably improves both pre-training and reinforcement learning. Scaling laws, discovered by treating AI training like a physics problem, show smooth power-law gains across many orders of magnitude of compute, data, and model size.

The practical implication: the length of task a model can complete end-to-end (its task horizon) is doubling roughly every seven months, according to METR's measurements. AI models that today handle hour-scale tasks will soon handle days, weeks, and eventually year-scale work.

The unlock for human-level AI is not a breakthrough — it's sustained scaling plus better oversight, memory, and organisational knowledge.

Why scaling laws matter

  • Precise power-law trends emerged across many orders of magnitude in compute, data size, and model size
  • Trends visible in 2019 data gave strong conviction they would continue — and they have
  • Scaling applies to both pre-training (next-token prediction) and reinforcement learning
  • Andy Jones' 2021 study of the board game Hex showed that RL also follows clean scaling laws, years before this became mainstream
  • The slope of the scaling law is the holy grail: a better slope compounds into a larger advantage as compute increases
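To make the slope point concrete, here is a minimal sketch that assumes loss follows loss = a * compute^(-b). The synthetic data, the prefactor 5.0, and the exponents 0.05 and 0.06 are illustrative assumptions, not measured values. It fits a straight line in log-log space and then shows how a slightly better exponent opens a gap that widens as compute grows.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares line fit in log-log space.

    Returns (a, b) for the assumed form loss = a * compute ** (-b).
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def loss_ratio(compute, b_baseline, b_better):
    """Baseline loss divided by improved loss, assuming an equal prefactor."""
    return compute ** (b_better - b_baseline)

# Synthetic, noise-free data spanning six orders of magnitude (illustrative only).
compute = [10.0 ** k for k in range(1, 7)]
loss = [5.0 * c ** -0.05 for c in compute]
a, b = fit_power_law(compute, loss)

# Hypothetical exponents: the same small slope improvement is worth more at scale.
gap_small = loss_ratio(1e3, 0.05, 0.06)
gap_large = loss_ratio(1e6, 0.05, 0.06)
print(f"fitted a={a:.3f}, b={b:.3f}")
print(f"advantage at 1e3: {gap_small:.3f}x, at 1e6: {gap_large:.3f}x")
```

Because the ratio is compute raised to the difference in exponents, any positive difference grows without bound as compute increases, which is one way to read the "holy grail" remark above.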

The two phases of training

  • Pre-training: trains models to predict likely next tokens from large corpora of human text and multimodal data
  • Reinforcement learning: uses human preference signals to reinforce helpful, honest, harmless behaviours and suppress bad ones
  • Both phases independently exhibit scaling laws
  • Better RL requires better reward signals — crisp tasks like code tests are easy; fuzzy tasks like joke quality or research taste require AI-assisted supervision
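As an illustration of what a "crisp" reward can look like, the sketch below scores a candidate function by the fraction of test cases it passes. The function name `unit_test_reward` and the toy candidates are hypothetical and are not taken from any particular training setup.

```python
from typing import Callable, Sequence, Tuple

# Each case pairs positional arguments with the expected return value.
Case = Tuple[tuple, object]

def unit_test_reward(candidate: Callable, cases: Sequence[Case]) -> float:
    """Return the fraction of cases the candidate passes (0.0 to 1.0)."""
    if not cases:
        raise ValueError("at least one test case is required")
    passed = 0
    for args, expected in cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            # A crash simply counts as a failed case.
            pass
    return passed / len(cases)

def add_ok(a, b):
    return a + b

def add_buggy(a, b):
    return a - b  # only correct when b == 0

cases = [((1, 2), 3), ((0, 0), 0), ((5, 0), 5), ((2, 2), 4)]
reward_ok = unit_test_reward(add_ok, cases)
reward_buggy = unit_test_reward(add_buggy, cases)
print(reward_ok, reward_buggy)
```

No comparably mechanical check exists for joke quality or research taste, which is why the bullet above points to AI-assisted supervision for those tasks.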

Task horizon: the most important capability axis

  • Flexibility (can the model handle your modality?) is less interesting than horizon length (how long a task can it complete end-to-end?)
  • METR measured this empirically: horizon length doubles roughly every seven months
  • Mechanism: modest improvements in self-correction ability can approximately double how far a model gets before failing
  • Current state: software engineering tasks at roughly the hour scale
  • Near-term projection: days, weeks, months — then whole-organisation-scale work
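The arithmetic behind these bullets can be sketched as follows. The seven-month doubling time is the figure quoted above; the one-hour starting point, the per-step error probabilities, and the independent-steps model are simplifying assumptions made only for illustration.

```python
import math

DOUBLING_MONTHS = 7.0  # doubling time quoted above

def projected_horizon_hours(current_hours, months_ahead,
                            doubling_months=DOUBLING_MONTHS):
    """Extrapolate the horizon, assuming the doubling trend simply continues."""
    return current_hours * 2.0 ** (months_ahead / doubling_months)

def months_until(target_hours, current_hours,
                 doubling_months=DOUBLING_MONTHS):
    """Months needed to reach a target horizon under the same assumption."""
    return doubling_months * math.log2(target_hours / current_hours)

def expected_steps_before_failure(p_unrecovered_error):
    """Toy model: every step independently derails the task with this
    probability, so the mean number of steps up to and including the
    first unrecovered error is 1 / p (geometric distribution)."""
    return 1.0 / p_unrecovered_error

# Three doublings (21 months) turn an assumed 1-hour horizon into 8 hours.
horizon_21_months = projected_horizon_hours(1.0, 21.0)

# Halving the unrecovered-error rate doubles the expected run length.
steps_before = expected_steps_before_failure(0.02)
steps_after = expected_steps_before_failure(0.01)
print(horizon_21_months, steps_before, steps_after)
```

In this toy model, going from a 2% to a 1% chance of an unrecovered error per step doubles how far the model gets, which matches the mechanism described above: a modest gain in self-correction yields a large gain in horizon.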

What is still needed for broadly human-level AI

  • Organisational knowledge: models that load context of a company or domain rather than starting blank
  • Memory: tracking progress across very long tasks, persisting state across context windows
  • Oversight: AI-assisted reward signals for fuzzy, subjective tasks (good taste, nuanced writing, research quality)
  • Multimodal and robotics: extending scaling gains beyond text into physical domains
  • None of these require a fundamental breakthrough — all are extensions of the current scaling paradigm

Claude 4 specifics

  • Claude 3.7 Sonnet was capable but too eager: it would game tests, for example adding try/except hacks so that tests pass rather than fixing the underlying problem
  • Claude 4 improves agentic coding behaviour, search, and general task completion
  • Better oversight baked in: model follows directions more faithfully and improves code quality
  • New memory capability: can store progress as files or records and retrieve them across context windows
  • Enables tasks that blow through a single context window while maintaining coherence
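One way to picture file-based memory is the sketch below. It is not a description of how Claude 4 implements the capability; the file name `progress.json`, the state layout, and the `run_session` helper are all hypothetical. Each call to `run_session` stands in for one context window that reads the saved state, does a bounded amount of work, and writes the state back.

```python
import json
import tempfile
from pathlib import Path

def save_progress(path: Path, state: dict) -> None:
    """Write state to a temp file, then rename it over the target."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(path)

def load_progress(path: Path) -> dict:
    """Load saved state, or start fresh if nothing has been saved yet."""
    if not path.exists():
        return {"completed": [], "notes": ""}
    return json.loads(path.read_text())

def run_session(path: Path, todo: list, budget: int) -> dict:
    """Simulate one context window: finish up to `budget` pending steps."""
    state = load_progress(path)
    pending = [step for step in todo if step not in state["completed"]]
    for step in pending[:budget]:
        state["completed"].append(step)  # placeholder for real work
    state["notes"] = f"{len(state['completed'])}/{len(todo)} steps done"
    save_progress(path, state)
    return state

with tempfile.TemporaryDirectory() as tmp_dir:
    progress_file = Path(tmp_dir) / "progress.json"
    todo = ["parse", "refactor", "test", "document"]
    first = run_session(progress_file, todo, budget=2)
    second = run_session(progress_file, todo, budget=2)

print(first["notes"])
print(second["notes"])
```

The second session shares no in-memory state with the first. It picks up where the first stopped only because the progress file survived, which is the property that lets a task outlast a single context window.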

Where AI outperforms individual experts

  • Pre-training imbues models with breadth across all of human knowledge — more than any one expert holds
  • High-value opportunity: problems that require synthesising across many disciplines simultaneously (biology, psychology, history, drug discovery)
  • AI is already producing useful insights in biomedical research with the right orchestration
  • Depth tasks (e.g. proving the Riemann hypothesis) are harder; breadth tasks (e.g. cross-domain synthesis) are an underexplored advantage

Advice for builders

  • Build products that don't quite work yet — current model limitations are temporary; Claude 5 will likely make them work
  • Leverage AI to integrate AI: the main bottleneck is integration speed, not capability
  • Identify domains where 70–80% accuracy is good enough — those are the most interesting frontier products right now
  • Beyond coding, high-potential greenfields: finance, law, any skill-intensive computer-bound task
  • The electricity analogy: don't just swap AI for a human in the old workflow — redesign the process

Human-AI collaboration in the near term

  • In AI, judgment and generative capability are closer together than in humans: humans can often judge work they could not produce themselves, whereas AI is more symmetric
  • This makes humans most valuable as managers/sanity-checkers, not as operators
  • YC Spring 2025 shift: founders moving from co-pilot models (human approves each output) to full workflow replacement
  • Most advanced tasks still need humans in the loop; simpler, well-defined tasks can be fully automated now

On compute efficiency and cost

  • Algorithmic efficiency of AI training and inference is improving roughly 3–10x per year, separate from hardware gains
  • Lower precision (FP4, ternary, eventually binary) is one efficiency lever — currently deprioritised in favour of frontier capability
  • Jevons paradox applies: as AI becomes cheaper, demand grows faster than costs fall
  • A lot of the value may remain concentrated at the frontier — capable end-to-end models beat orchestrating many dumber ones
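The compounding and the Jevons effect can be put into rough numbers. The sketch below assumes a constant yearly efficiency gain and a constant-elasticity demand curve, a textbook simplification rather than anything measured; the elasticity values 1.5 and 0.5 are arbitrary illustrations.

```python
def relative_cost(years, gain_per_year):
    """Cost of a fixed workload relative to today after compounding gains."""
    return float(gain_per_year) ** -years

def relative_spend(price_ratio, elasticity):
    """Total spend relative to today under constant-elasticity demand,
    where quantity demanded scales as price ** (-elasticity)."""
    quantity_ratio = price_ratio ** -elasticity
    return price_ratio * quantity_ratio

# Two years at the low and high ends of the 3-10x range quoted above.
cost_low = relative_cost(2, 3)
cost_high = relative_cost(2, 10)

# Price falls to one tenth; spend rises only if demand is elastic enough.
spend_elastic = relative_spend(0.1, 1.5)    # assumed elasticity > 1
spend_inelastic = relative_spend(0.1, 0.5)  # assumed elasticity < 1
print(cost_low, cost_high, spend_elastic, spend_inelastic)
```

Under these assumptions, total spend grows when elasticity exceeds 1 and shrinks when it is below 1, so the Jevons claim above amounts to a claim that demand for AI is highly elastic.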

On interpretability and physics heuristics

  • Interpretability is more like neuroscience than physics: it reverse-engineers the features of a network much as neuroscience reverse-engineers the brain
  • AI has an advantage over neuroscience: you can measure every weight and activation
  • Large-matrix approximations from physics have been directly useful in studying neural networks
  • Most productive approach: ask the simplest possible questions; AI is only ~10–15 years old in its current form, and basic questions remain unanswered
