How Anthropic trains frontier AI: pre-training, scaling, and engineering reality

Executive overview

Scaling laws predict that more compute, data, and parameters reliably reduce loss, and lower loss generally tracks more capable models. The core loop is simple: train a model, sell it, buy more compute, repeat. But executing that loop at frontier scale is a deep engineering problem, not a research one.

The central insight: scale is the dominant variable — everything else is noise until it isn't, and the hardest part is finding the bugs you don't know you have.

What pre-training is and why next-token prediction won

Pre-training is training on internet-scale text by predicting the next token; every token is a training signal, so each document yields a very dense signal.
Autoregressive (GPT-style) training won over masked approaches (BERT/BART) largely empirically; a key advantage is it enables open-ended text generation directly.
Scaling laws quantify the relationship: more compute → lower loss, in a power law, predictably.
The commercial feedback loop reinforces this: better model → product revenue → more compute → better model.
Most architectural hyperparameter choices matter less than people expect; throwing more compute at a suboptimal architecture still improves it.
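
To make the power-law claim concrete, here is a minimal sketch of fitting loss ≈ a · compute^(−b) by least squares in log-log space. The pure power-law form (with no irreducible-loss term), the function names, and the data points are assumptions for illustration; the points are generated from a known law, not taken from any real training run.

```python
import math

def fit_power_law(compute, loss):
    """Fit loss ~= a * compute**(-b) with ordinary least squares on logs."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # In log space the line is log(loss) = log(a) - b * log(compute).
    return math.exp(intercept), -slope

def predict_loss(a, b, compute):
    """Extrapolate the fitted law to a new compute budget."""
    return a * compute ** (-b)

# Synthetic points from a known law (a=10, b=0.05); hypothetical values only.
compute_budgets = [1e18, 1e19, 1e20, 1e21]
observed_loss = [10.0 * c ** -0.05 for c in compute_budgets]

a, b = fit_power_law(compute_budgets, observed_loss)
next_loss = predict_loss(a, b, 1e22)
print(f"a={a:.3f}, b={b:.3f}, predicted loss at 1e22 FLOPs={next_loss:.3f}")
```

On a log-log plot a law of this form is a straight line, which is why a fit on small runs can be extrapolated to a larger budget.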

Early infrastructure and getting efficient with limited compute

Anthropic started small but believed scaling laws were clear when others didn't — training budgets comparable to GPT-3 (~$5M) were achievable for a well-capitalised startup.
Cloud providers were used from the start, but low-level network topology still mattered: the team reverse-engineered cluster layout by clustering latency measurements to identify cross-building bottlenecks.
Distributed training frameworks (data parallelism, pipeline parallelism, tensor sharding) were written from scratch rather than depending on packages, because Anthropic planned to exceed the scales those packages were built for.
Efficiency came from modelling GPU utilisation on paper first, as model FLOPs utilisation (MFU), then profiling to close the gap between theoretical and actual performance.
Single-GPU profilers existed; multi-thousand-GPU profiling required hacking traces together manually.
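
The paper-first MFU estimate described above might look like the sketch below. It assumes the common approximation of about 6 FLOPs per parameter per token for a forward and backward pass (attention FLOPs ignored); the model size, throughput, chip count, and peak FLOP rate are hypothetical numbers, not figures from the source.

```python
def training_flops_per_token(n_params):
    """Approximate training cost: ~6 FLOPs per parameter per token.

    Assumption: forward + backward pass, attention FLOPs ignored.
    """
    return 6.0 * n_params

def mfu(n_params, tokens_per_second, n_chips, peak_flops_per_chip):
    """Model FLOPs utilisation: achieved model FLOP/s over peak hardware FLOP/s."""
    achieved = training_flops_per_token(n_params) * tokens_per_second
    peak = n_chips * peak_flops_per_chip
    return achieved / peak

# Hypothetical run: 10B parameters, 400k tokens/s, 256 chips at 300 TFLOP/s peak.
example_mfu = mfu(
    n_params=10e9,
    tokens_per_second=400_000,
    n_chips=256,
    peak_flops_per_chip=300e12,
)
print(f"MFU = {example_mfu:.1%}")
```

The gap between this number and 100% is what profiling then has to explain, for example time lost to communication or idle pipeline stages.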

Architecture choices and hyperparameter search

Hundreds of hyperparameters govern a training run; the standard approach is to test at small scale, then extrapolate to large scale.
When a run curves off the expected power-law loss trajectory, it signals something is wrong — but distinguishing a fundamental ceiling from a correctable bug is hard, because you rarely have the counterfactual.
Attention is the operation most requiring low-level optimisation; most other operations were handled at the torch.matmul level without going into CUDA kernels.
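
One way to notice a run curving off its expected trajectory is to compare observed loss against a previously fitted power law and flag a sustained gap. The sketch below is an assumption-laden illustration: the law is expressed in training steps rather than compute, and the tolerance, patience, and all data are hypothetical. It can only say that a deviation exists, not whether the cause is a bug or a real ceiling.

```python
def expected_loss(a, b, step):
    """Loss predicted by a previously fitted power law, a * step**(-b)."""
    return a * step ** (-b)

def first_divergence(steps, losses, a, b, rel_tol=0.02, patience=3):
    """Return the step where loss first exceeds the prediction by more than
    rel_tol and stays above it for `patience` consecutive checks, else None."""
    run_length = 0
    run_start = None
    for step, loss in zip(steps, losses):
        if loss > expected_loss(a, b, step) * (1.0 + rel_tol):
            if run_length == 0:
                run_start = step
            run_length += 1
            if run_length >= patience:
                return run_start
        else:
            # A single noisy point should not trigger an alarm.
            run_length = 0
            run_start = None
    return None

# Hypothetical fitted law and two synthetic runs.
A, B = 8.0, 0.1
steps = [1000 * i for i in range(1, 9)]
healthy = [expected_loss(A, B, s) for s in steps]
# Simulate a regression: from step 5000 the loss sits 5% above prediction.
broken = [l if s < 5000 else l * 1.05 for s, l in zip(steps, healthy)]

healthy_flag = first_divergence(steps, healthy, A, B)
broken_flag = first_divergence(steps, broken, A, B)
print(healthy_flag, broken_flag)
```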

Scaling challenges: hardware failures and novel stacks

At scale, any single chip failure can crash a job — the standard early approach had no fault tolerance.
As cluster size grows, failure probability grows; checkpointing and fast restarts became essential mitigations.
GPUs themselves can be incorrect — wrong outputs, not just failures — requiring engineers to consider hardware as a possible bug source, not just code.
Everything from data-centre power delivery to chip interconnects is novel; this hardware has been through only a few generations, so bugs surface constantly.
Close collaboration with chip providers (mostly via shared Slack) includes building minimal reproducers of failures to share without exposing full codebases.
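
The link between cluster size, failure probability, and checkpoint frequency can be sketched with two textbook formulas. This assumes independent chip failures and uses Young's first-order approximation for the checkpoint interval; neither is stated in the source, and every number below is hypothetical.

```python
import math

def job_failure_probability(n_chips, per_chip_failure_prob):
    """P(at least one chip fails in the window), assuming independent failures."""
    return 1.0 - (1.0 - per_chip_failure_prob) ** n_chips

def cluster_mtbf_hours(n_chips, chip_mtbf_hours):
    """Mean time between failures for the whole job if any chip failure kills it."""
    return chip_mtbf_hours / n_chips

def young_checkpoint_interval_hours(checkpoint_cost_hours, mtbf_hours):
    """Young's approximation: interval ~= sqrt(2 * checkpoint cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)

# Hypothetical: each chip has a 0.1% chance of failing on a given day.
p_small = job_failure_probability(n_chips=8, per_chip_failure_prob=0.001)
p_large = job_failure_probability(n_chips=10_000, per_chip_failure_prob=0.001)

# Hypothetical: 50,000-hour chip MTBF, 6-minute (0.1 h) checkpoint write.
mtbf = cluster_mtbf_hours(n_chips=10_000, chip_mtbf_hours=50_000)
interval = young_checkpoint_interval_hours(checkpoint_cost_hours=0.1, mtbf_hours=mtbf)

print(f"daily failure chance: 8 chips {p_small:.2%}, 10k chips {p_large:.2%}")
print(f"cluster MTBF {mtbf:.1f} h, checkpoint roughly every {interval:.1f} h")
```

Under these assumptions a failure that is rare on a single node becomes close to certain every day at cluster scale, which is why fast restarts matter as much as the checkpoints themselves.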

Data: quantity, quality, and synthetic risk

The internet is very large and its useful size is genuinely unknown; there is no reliable counter for how much quality data exists.
Data growth rate versus compute growth rate is uncertain — it is not obvious which is outpacing the other.
PageRank-style quality signals may not map to AI training quality; tail data (rarely linked) may be valuable precisely because it fills gaps that the common corpus does not cover.
Synthetic data distillation — training a smaller model on a larger model's outputs — can work for capability transfer but cannot produce a model smarter than the source.
LLM-generated internet content is hard to detect and its effect on training is unclear; moderate contamination may still push toward correct distributions if people mostly publish the good outputs and discard the bad ones.
Intentional poisoning (adversarial web content designed to harm model behaviour) is a real but hard-to-quantify risk.

Evals: what actually matters

Loss on next-token prediction is a surprisingly strong signal and should not be dismissed.
Good evals must: (1) measure something you genuinely care about, (2) have low noise (enough tokens to produce tight confidence intervals), (3) run quickly.
Saturation is a recurring problem — hitting a benchmark ceiling reveals the benchmark was a proxy, not the goal.
Hard evals (e.g., can a model handle a long patient conversation and extract the right signal?) are what matter for real-world deployment but are the hardest to design and run.
Startups have a comparative advantage here: building domain-specific evals that labs will then optimise against is a lever to shape frontier model behaviour.
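
The low-noise requirement can be made concrete with a confidence interval on an eval score. The sketch below assumes an accuracy-style eval scored per question and uses the normal-approximation (Wald) interval, which is a simplification; the sample counts are hypothetical.

```python
import math

def accuracy_ci(n_correct, n_total, z=1.96):
    """Approximate 95% confidence interval for accuracy (normal approximation)."""
    p = n_correct / n_total
    half_width = z * math.sqrt(p * (1.0 - p) / n_total)
    return p - half_width, p + half_width

def samples_needed(p, half_width, z=1.96):
    """Questions needed for the interval to be +/- half_width around accuracy p."""
    return math.ceil(z ** 2 * p * (1.0 - p) / half_width ** 2)

# Same 70% accuracy measured on a small and a large hypothetical eval set.
small_lo, small_hi = accuracy_ci(70, 100)
large_lo, large_hi = accuracy_ci(7000, 10_000)
n_for_one_point = samples_needed(p=0.7, half_width=0.01)

print(f"100 questions:    {small_lo:.3f} to {small_hi:.3f}")
print(f"10,000 questions: {large_lo:.3f} to {large_hi:.3f}")
print(f"questions for +/-1 point at 70% accuracy: {n_for_one_point}")
```

With a small eval set, two models a few points apart cannot be told from noise, which is the practical meaning of requirement (2).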

Pre-training versus post-training and alignment

Alignment is getting a model's goals to match human intentions — especially important as models approach and exceed human capability.
The practical split: post-training (RLHF, constitutional AI, fine-tuning) handles most personality and behaviour shaping because iteration loops are days, not months.
Pre-training runs take months; a bug or wrong decision discovered after the fact is costly with no easy rollback.
Some alignment properties may eventually move into pre-training for robustness — constitutional rules trained in rather than prompted in are harder to override.
Small models used for pre-training experiments can barely form sentences, making alignment behaviour essentially untestable at that scale.

Team structure and what skills actually matter

Engineers are the primary constraint, more than researchers — the core problem is correct, efficient implementation at scale, not novel architecture invention.
Two engineering archetypes are needed: generalists who understand the full stack, and deep specialists (e.g., JAX efficiency, attention kernels, networking).
Too many generalists means nothing is deeply optimised; too many specialists means the manager must carry all cross-cutting context.
The rarest and most valuable skill: being able to debug anything at any level of the stack, from ML learning dynamics down to packet-level networking.
Pair programming was the primary learning method in early days; watching an expert use a profiler teaches more than any write-up.

Pre-training and inference co-design

Inference and pre-training teams co-design models — a model that is very large or architecturally complex can be impossible to serve efficiently.
Model size, number of communication points, and architectural novelty all directly affect inference cost.
At rate-limited scale (compute is the binding constraint), inference efficiency directly determines how many users can be served.
Inference workloads are more HBM-bandwidth-bound, while pre-training is more FLOP-bound because of its larger batch sizes; this creates opportunities to match workloads to chip characteristics.
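
The bandwidth-bound versus FLOP-bound distinction can be illustrated with a simple roofline-style estimate for a single matrix multiply. This is a rough sketch: it assumes each operand crosses HBM exactly once, 2-byte values, and a hypothetical chip with 300 TFLOP/s of compute and 1.5 TB/s of HBM bandwidth.

```python
def matmul_arithmetic_intensity(batch_tokens, d_in, d_out, bytes_per_value=2):
    """FLOPs per byte moved for (batch_tokens x d_in) @ (d_in x d_out).

    Assumption: weights, inputs, and outputs each cross HBM exactly once.
    """
    flops = 2 * batch_tokens * d_in * d_out
    values_moved = d_in * d_out + batch_tokens * d_in + batch_tokens * d_out
    return flops / (bytes_per_value * values_moved)

def bound_by(intensity, peak_flops, hbm_bytes_per_second):
    """Compare intensity with the chip's ridge point (FLOPs per byte)."""
    ridge = peak_flops / hbm_bytes_per_second
    return "flops" if intensity >= ridge else "hbm_bandwidth"

# Hypothetical chip: ridge point of 200 FLOPs per byte.
PEAK_FLOPS = 300e12
HBM_BANDWIDTH = 1.5e12

# Decoding one token at a time versus a large training batch.
decode_intensity = matmul_arithmetic_intensity(1, 4096, 4096)
train_intensity = matmul_arithmetic_intensity(4096, 4096, 4096)

decode_bound = bound_by(decode_intensity, PEAK_FLOPS, HBM_BANDWIDTH)
train_bound = bound_by(train_intensity, PEAK_FLOPS, HBM_BANDWIDTH)
print(f"decode: {decode_intensity:.1f} FLOPs/byte -> {decode_bound}")
print(f"train:  {train_intensity:.1f} FLOPs/byte -> {train_bound}")
```

At small batch sizes the weights dominate memory traffic and are reused only once per token, so the chip waits on HBM; larger batches reuse each weight many times and shift the limit to compute.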

Looking ahead: paradigm shifts and hard bugs

RL-based training (post-training compute scaling) is one paradigm shift already underway; more are likely before AGI.
The "just scale" thesis is probably sufficient to reach AGI but surprising new techniques along the way seem likely given the many orders of magnitude remaining.
The thing that most concerns the pre-training team is not a known-unknown — it is subtle bugs that corrupt a multi-month training run with no visible signal.
A single miscast precision deep in a kernel, a wrong layer connection, or a silent communication fault can invalidate an entire model generation.
At AGI scale, economic growth from automation will be enormous; ensuring that growth is broadly distributed is an important open problem.
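
As a toy illustration of how a miscast precision can corrupt results without raising any error, the sketch below accumulates a sum in simulated half precision. It uses Python's `struct` module (format `e`, IEEE 754 binary16) to round after every addition; the scenario is hypothetical and far simpler than a real kernel.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def sum_fp16(values):
    """Accumulate in simulated fp16, rounding the running total each step."""
    total = 0.0
    for v in values:
        total = to_fp16(total + to_fp16(v))
    return total

ones = [1.0] * 4096
low_precision_total = sum_fp16(ones)  # accumulator wrongly kept in fp16
reference_total = sum(ones)           # accumulator in double precision

print(f"fp16 accumulator: {low_precision_total}")
print(f"reference:        {reference_total}")
```

The half-precision sum stops growing once the gap between representable values exceeds the addend, and nothing crashes or warns. A fault like this produces a plausible but wrong number, which is what makes it hard to spot in a months-long run.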
