How Anthropic trains frontier AI: pre-training, scaling, and engineering reality
Executive overview
Scaling laws predict that more compute, data, and parameters reliably reduce loss — and lower loss means smarter models. The core loop is simple: train a model, sell it, buy more compute, repeat. But executing that loop at frontier scale is a deep engineering problem, not a research one.
The central insight: scale is the dominant variable — everything else is noise until it isn't, and the hardest part is finding the bugs you don't know you have.
What pre-training is and why next-token prediction won
- Pre-training is training on internet-scale text by predicting the next word — every token is a training signal, so data density is enormous.
- Autoregressive (GPT-style) training won over masked approaches (BERT/BART) largely empirically; a key advantage is it enables open-ended text generation directly.
- Scaling laws quantify the relationship: more compute → lower loss, in a power law, predictably.
- The commercial feedback loop reinforces this: better model → product revenue → more compute → better model.
- Most architectural hyperparameter choices matter less than people expect; throwing more compute at a suboptimal architecture still improves it.
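As a rough illustration of the power-law relationship described above, the sketch below fits a line in log-log space and extrapolates to a larger budget. The functional form L = a * C^(-b), the helper names, and all numbers are assumptions chosen for illustration; real scaling-law fits usually also include an irreducible-loss term, which is omitted here.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

def predict_loss(a, b, compute):
    """Loss predicted by the fitted law at a given compute budget."""
    return a * compute ** (-b)

# Synthetic points generated from a made-up law, L = 10 * C^-0.05.
compute = [1e18, 1e19, 1e20, 1e21]
loss = [10.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
predicted = predict_loss(a, b, 1e23)  # extrapolate two orders of magnitude
```

Fitting in log space turns the power law into a straight line, which is why small-scale runs can be extrapolated with ordinary linear regression.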
Early infrastructure and getting efficient with limited compute
- Anthropic started small but took scaling laws seriously when many others did not; training budgets comparable to GPT-3 (~$5M) were achievable for a well-capitalised startup.
- Cloud providers were used from the start, but low-level network topology still mattered: the team reverse-engineered cluster layout by clustering node-to-node latencies to identify cross-building bottlenecks.
- Distributed training frameworks (data parallelism, pipeline parallelism, tensor sharding) were written from scratch rather than depending on packages, because Anthropic planned to exceed the scales those packages were built for.
- Efficiency came from modelling GPU utilisation (MFU) on paper first, then profiling to close the gap between theoretical and actual performance.
- Single-GPU profilers existed; multi-thousand-GPU profiling required hacking traces together manually.
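The "on paper" MFU estimate mentioned above can be sketched as a few lines of arithmetic. This assumes the common approximation of about 6 FLOPs per parameter per token for a forward and backward pass (attention FLOPs are ignored); the function names and all hardware and throughput numbers are hypothetical.

```python
def training_flops_per_token(n_params):
    # Assumed approximation: ~6 FLOPs per parameter per token
    # (forward + backward), ignoring attention FLOPs.
    return 6 * n_params

def mfu(n_params, tokens_per_second, n_chips, peak_flops_per_chip):
    """Model FLOPs utilisation: achieved model FLOP/s over peak hardware FLOP/s."""
    achieved = training_flops_per_token(n_params) * tokens_per_second
    peak = n_chips * peak_flops_per_chip
    return achieved / peak

# Hypothetical run: 10B parameters, 500k tokens/s on 256 chips at 300 TFLOP/s each.
utilisation = mfu(
    n_params=10e9,
    tokens_per_second=500_000,
    n_chips=256,
    peak_flops_per_chip=300e12,
)
```

Comparing this number against a profiler trace shows how much of the gap is explained by communication, memory stalls, or idle time.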
Architecture choices and hyperparameter search
- Hundreds of hyperparameters govern a training run; the standard approach is to test at small scale, then extrapolate to large scale.
- When a run curves off the expected power-law loss trajectory, it signals something is wrong — but distinguishing a fundamental ceiling from a correctable bug is hard, because you rarely have the counterfactual.
- Attention is the operation most requiring low-level optimisation; most other operations were handled at the torch.matmul level without going into CUDA kernels.
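One way to notice a run curving off its expected trajectory is to compare logged loss against the extrapolated power law and flag sustained deviations. The sketch below is a minimal version of that idea; the curve parameters, tolerance, and patience values are hypothetical, and it cannot tell a fundamental ceiling from a bug, only that something changed.

```python
def expected_loss(step, a, b, floor=0.0):
    """Loss predicted by an assumed power law in training steps."""
    return floor + a * step ** (-b)

def first_divergence(observed, a, b, tolerance=0.02, patience=3, floor=0.0):
    """Return the step where loss first sits more than `tolerance` (relative)
    above the expected curve for `patience` consecutive logged points,
    or None if the run stays on trajectory."""
    streak = 0
    streak_start = None
    for step, loss in observed:
        expected = expected_loss(step, a, b, floor)
        if (loss - expected) / expected > tolerance:
            if streak == 0:
                streak_start = step
            streak += 1
            if streak >= patience:
                return streak_start
        else:
            streak = 0
            streak_start = None
    return None

# Made-up curve and logs: healthy until step 5000, then 5% above expectation.
A, B = 8.0, 0.1
healthy = [(s, expected_loss(s, A, B)) for s in range(1000, 6000, 1000)]
degraded = [(s, 1.05 * expected_loss(s, A, B)) for s in range(6000, 9000, 1000)]

on_track = first_divergence(healthy, A, B)
diverged_at = first_divergence(healthy + degraded, A, B)
```

Requiring several consecutive points avoids flagging ordinary noise or a single loss spike.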
Scaling challenges: hardware failures and novel stacks
- At scale, any single chip failure can crash a job — the standard early approach had no fault tolerance.
- As cluster size grows, failure probability grows; checkpointing and fast restarts became essential mitigations.
- GPUs themselves can be incorrect — wrong outputs, not just failures — requiring engineers to consider hardware as a possible bug source, not just code.
- Everything from data-centre power delivery to chip interconnects is novel; there are few generations of this hardware, so bugs surface constantly.
- Close collaboration with chip providers (mostly via shared Slack) includes building minimal reproducers of failures to share without exposing full codebases.
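The link between cluster size, failure probability, and checkpointing can be made concrete with two small formulas. This sketch assumes chips fail independently and uses Young's classic approximation for the checkpoint interval, sqrt(2 * checkpoint cost * mean time between failures); the failure rate and timings are hypothetical.

```python
import math

def job_failure_probability(per_chip_failure_prob, n_chips):
    """Probability that at least one chip fails, assuming independent failures."""
    return 1.0 - (1.0 - per_chip_failure_prob) ** n_chips

def young_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Young's approximation for the checkpoint interval that minimises lost work."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Hypothetical: each chip has a 0.1% chance of failing on a given day.
p_small = job_failure_probability(0.001, 8)      # one small node
p_large = job_failure_probability(0.001, 8192)   # a large cluster

# Hypothetical: 2-minute checkpoints, one job-level failure every 4 hours.
interval_s = young_checkpoint_interval(checkpoint_cost_s=120, mtbf_s=4 * 3600)
```

With these made-up numbers, a per-chip failure rate that is negligible on one node makes a daily crash almost certain on thousands of chips, which is why restart speed matters as much as failure rate.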
Data: quantity, quality, and synthetic risk
- The internet is very large and its useful size is genuinely unknown; there is no reliable counter for how much quality data exists.
- Data growth rate versus compute growth rate is uncertain — it is not obvious which is outpacing the other.
- PageRank-style quality signals may not map to AI training quality; tail data (rarely linked) may be valuable precisely because it fills gaps the common corpus does not cover.
- Synthetic data distillation — training a smaller model on a larger model's outputs — can work for capability transfer but cannot produce a model smarter than the source.
- LLM-generated internet content is hard to detect and its effect on training is unclear; moderate contamination may still push toward correct distributions if people mostly publish the better outputs and discard the bad ones.
- Intentional poisoning (adversarial web content designed to harm model behaviour) is a real but hard-to-quantify risk.
Evals: what actually matters
- Loss on next-token prediction is a surprisingly strong signal and should not be dismissed.
- Good evals must: (1) measure something you genuinely care about, (2) have low noise (enough tokens to produce tight confidence intervals), (3) run quickly.
- Saturation is a recurring problem — hitting a benchmark ceiling reveals the benchmark was a proxy, not the goal.
- Hard evals (e.g., can a model handle a long patient conversation and extract the right signal?) are what matter for real-world deployment but are the hardest to design and run.
- Startups have a comparative advantage here: building domain-specific evals that labs will then optimise against is a lever to shape frontier model behaviour.
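The low-noise requirement above can be checked with a simple confidence interval on eval accuracy. This sketch uses the normal (Wald) approximation for a binomial proportion and treats non-overlapping intervals as a conservative sign that two models really differ; the scores and eval sizes are made up.

```python
import math

def accuracy_confidence_interval(n_correct, n_total, z=1.96):
    """Approximate 95% interval for accuracy (normal approximation)."""
    p = n_correct / n_total
    se = math.sqrt(p * (1 - p) / n_total)
    return p - z * se, p + z * se

def distinguishable(ci_a, ci_b):
    """True if the two intervals do not overlap (a conservative check)."""
    return ci_a[1] < ci_b[0] or ci_b[1] < ci_a[0]

# Same 70% vs 75% accuracy gap, measured on 100 and on 10,000 questions.
small_eval = distinguishable(
    accuracy_confidence_interval(70, 100),
    accuracy_confidence_interval(75, 100),
)
large_eval = distinguishable(
    accuracy_confidence_interval(7000, 10_000),
    accuracy_confidence_interval(7500, 10_000),
)
```

On the small eval the five-point gap is within noise; on the large one it is clearly resolved, at the price of a slower eval.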
Pre-training versus post-training and alignment
- Alignment is getting a model's goals to match human intentions — especially important as models approach and exceed human capability.
- The practical split: post-training (RLHF, constitutional AI, fine-tuning) handles most personality and behaviour shaping because iteration loops are days, not months.
- Pre-training runs take months; a bug or wrong decision discovered after the fact is costly with no easy rollback.
- Some alignment properties may eventually move into pre-training for robustness — constitutional rules trained in rather than prompted in are harder to override.
- Small models used for pre-training experiments can barely form sentences, making alignment behaviour essentially untestable at that scale.
Team structure and what skills actually matter
- Engineers are the primary constraint, more than researchers — the core problem is correct, efficient implementation at scale, not novel architecture invention.
- Two engineering archetypes needed: generalists who understand the full stack, and deep specialists (e.g., JAX efficiency, attention kernels, networking).
- Too many generalists means nothing is deeply optimised; too many specialists means the manager must carry all cross-cutting context.
- The rarest and most valuable skill: being able to debug anything at any level of the stack, from ML learning dynamics down to packet-level networking.
- Pair programming was the primary learning method in early days; watching an expert use a profiler teaches more than any write-up.
Pre-training and inference co-design
- Inference and pre-training teams co-design models — a model that is very large or architecturally complex can be impossible to serve efficiently.
- Model size, number of communication points, and architectural novelty all directly affect inference cost.
- At rate-limited scale (compute is the binding constraint), inference efficiency directly determines how many users can be served.
- Inference workloads are more HBM-bandwidth-bound, while pre-training is more FLOP-bound because of its larger batch sizes; this creates opportunities to match workloads to chip characteristics.
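The bandwidth-bound versus FLOP-bound distinction can be illustrated with a roofline-style estimate for a single weight matmul. This is a simplified sketch: it counts only weight and activation traffic for one layer, assumes 2-byte values, and uses hypothetical chip numbers rather than any real part.

```python
def matmul_arithmetic_intensity(batch_tokens, d_in, d_out, bytes_per_value=2):
    """FLOPs per byte moved for activations [batch, d_in] times weights [d_in, d_out]."""
    flops = 2 * batch_tokens * d_in * d_out
    bytes_moved = bytes_per_value * (
        d_in * d_out              # weights read from HBM
        + batch_tokens * d_in     # input activations
        + batch_tokens * d_out    # output activations
    )
    return flops / bytes_moved

def bound_by(intensity, peak_flops, hbm_bandwidth):
    """Compare intensity with the chip's ridge point (FLOPs per byte)."""
    ridge = peak_flops / hbm_bandwidth
    return "compute" if intensity >= ridge else "memory bandwidth"

# Hypothetical chip: 300 TFLOP/s peak, 1.5 TB/s HBM, so the ridge is 200 FLOPs/byte.
PEAK_FLOPS, HBM_BW = 300e12, 1.5e12

decode = matmul_arithmetic_intensity(batch_tokens=8, d_in=8192, d_out=8192)
train = matmul_arithmetic_intensity(batch_tokens=4096, d_in=8192, d_out=8192)

decode_limit = bound_by(decode, PEAK_FLOPS, HBM_BW)
train_limit = bound_by(train, PEAK_FLOPS, HBM_BW)
```

With a small batch, each weight read from memory is reused only a few times, so the chip waits on HBM; a large training batch reuses the same weights thousands of times and saturates the arithmetic units instead.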
Looking ahead: paradigm shifts and hard bugs
- RL-based training (post-training compute scaling) is one paradigm shift already underway; more are likely before AGI.
- The "just scale" thesis is probably sufficient to reach AGI but surprising new techniques along the way seem likely given the many orders of magnitude remaining.
- The thing that most concerns the pre-training team is not a known-unknown — it is subtle bugs that corrupt a multi-month training run with no visible signal.
- A single miscast precision deep in a kernel, a wrong layer connection, or a silent communication fault can invalidate an entire model generation.
- At AGI scale, economic growth from automation will be enormous; ensuring that growth is broadly distributed is an important open problem.
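To show how a miscast precision can corrupt a result without raising any error, the sketch below accumulates many small updates in emulated half precision and compares the result with double precision. It uses the standard library's IEEE 754 half-precision packing to emulate the cast; the update size and count are arbitrary, and real kernels would more likely involve bfloat16 or mixed-precision accumulators.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def accumulate(values, cast=None):
    """Sum values, optionally rounding the accumulator after every addition."""
    total = 0.0
    for v in values:
        total = total + v
        if cast is not None:
            total = cast(total)
    return total

# 10,000 small updates that should sum to roughly 1.0.
updates = [1e-4] * 10_000

fp64_total = accumulate(updates)
# Accumulator wrongly kept in half precision: once the running total is large
# enough, each update is smaller than half the spacing between representable
# values and is rounded away, so the sum stalls well below 1.0.
fp16_total = accumulate([to_fp16(u) for u in updates], cast=to_fp16)
```

Nothing crashes and no NaN appears; the half-precision total is simply wrong, which is the kind of failure that produces no visible signal during a run.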