Nvidia and the AI era: how GPU dominance became inevitable
Executive overview
In late 2022, large language models built on the transformer architecture burst into mainstream use, turning Nvidia's decade-long bet on GPU-accelerated data centres into one of the most profitable positions in modern tech history. The company had quietly assembled every layer of the stack (chips, networking, software) while competitors watched.
The preparation was already done: CUDA, Mellanox, and the Hopper architecture were built years before the demand arrived.
The research chain that produced the AI moment
- AlexNet (2012) ran convolutional neural networks on two consumer GeForce GPUs using CUDA, proving parallel compute could unlock AI
- The Toronto team (Geoffrey Hinton, Alex Krizhevsky, and Ilya Sutskever) was scooped up by Google; Sutskever later co-founded OpenAI
- Google's 2017 transformer paper ("Attention Is All You Need") made sequence models trainable in parallel, at scale
- Self-attention in transformers costs O(n²) compute in the sequence length n, but GPUs can run those comparisons in parallel, so the bottleneck became memory rather than raw speed
- GPT parameter counts scaled from 117M (GPT-1) to 175B (GPT-3) to a reported ~1.7T (GPT-4, not confirmed by OpenAI); model quality improved discontinuously with scale
- OpenAI created a capped-profit entity in 2019 and took $1B from Microsoft to afford the compute required
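The quadratic cost of attention mentioned above can be sketched in a few lines. This is a minimal illustration, not how any production model is written: it assumes queries and keys are the raw token vectors (real transformers apply learned projections first), and the names `attention_weights` and `score_matrix_bytes` are hypothetical helpers introduced here.

```python
import math


def attention_weights(tokens):
    """Softmax-normalised attention weights for a list of n equal-length vectors.

    Simplification: queries and keys are the raw token vectors (no learned projections).
    """
    d = len(tokens[0])
    weights = []
    for q in tokens:  # on a GPU, every row (and every entry) can be computed in parallel
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights  # n rows of n weights: n * n comparisons in total


def score_matrix_bytes(n, bytes_per_value=2):
    """Memory for one n-by-n score matrix, assuming 2-byte (FP16) values."""
    return n * n * bytes_per_value


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = attention_weights(tokens)
print(len(w), len(w[0]))  # 3 3
print(score_matrix_bytes(4096) // score_matrix_bytes(2048))  # 4
```

Doubling the sequence length quadruples both the number of comparisons and the size of the score matrix, which is why memory tends to become the limiting resource once the arithmetic itself runs in parallel.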
Why the data centre is the computer
- Von Neumann CPUs execute one instruction at a time; GPUs run tens of thousands in parallel — a "giant Archimedes lever" on Moore's Law
- Training large models requires hundreds of gigabytes of fast GPU memory, more than any single GPU holds, forcing multiple GPUs to be networked as one logical computer
- The full GH100 die behind the H100 has 18,432 CUDA cores, 576 tensor cores, and 144 streaming multiprocessors (shipping H100 parts enable slightly fewer); Nvidia claims up to 9× faster AI training than the A100
- CoWoS (chip-on-wafer-on-substrate) 2.5D packaging places high-bandwidth memory stacks close to the logic die; it currently accounts for 10–15% of TSMC's total capacity
- TSMC capacity for CoWoS is the binding constraint on H100 supply, not Nvidia's willingness to sell
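A rough back-of-the-envelope calculation shows why a single GPU is not enough. The sketch below is built on stated assumptions rather than measured figures: 2 bytes per parameter (FP16 weights), 80 GB of memory per GPU (the H100's published HBM capacity), and weights only. `min_gpus_for_weights` is a hypothetical helper.

```python
import math


def min_gpus_for_weights(params, bytes_per_param=2, gpu_memory_gb=80):
    """Smallest number of GPUs whose combined memory holds the weights alone.

    Assumptions: FP16 weights (2 bytes each) and 80 GB per GPU. Gradients,
    optimizer state, and activations are ignored, so real training needs more.
    """
    weight_bytes = params * bytes_per_param
    gpu_bytes = gpu_memory_gb * 10**9
    return math.ceil(weight_bytes / gpu_bytes)


gpt3_params = 175 * 10**9  # GPT-3 parameter count quoted earlier
print(gpt3_params * 2 / 10**9)            # 350.0 (GB of FP16 weights)
print(min_gpus_for_weights(gpt3_params))  # 5
```

Even this lower bound needs five GPUs just to hold the weights. Training typically multiplies the footprint several times over, which is what pushes the design toward many GPUs networked as one machine.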
Nvidia's three-part data centre platform
- Mellanox / InfiniBand (acquired 2020, $7B): the only high-bandwidth rack-to-rack networking stack that can treat a full data centre as one computer
- Grace CPU (first announced in 2021): an ARM-based CPU designed from scratch to orchestrate massive GPU clusters, not for laptops
- Hopper GPU architecture (H100): split from the gaming Lovelace line, enabling Nvidia to monopolise CoWoS capacity at TSMC for AI chips
- Together these three form the DGX system, a fully integrated AI supercomputer: a single DGX H100 starts at $500K, while the DGX GH200 SuperPod (256 Grace Hopper superchips) costs hundreds of millions
- DGX Cloud launched through Azure, Oracle and Google — a virtualised DGX rented via Nvidia's own interface at $37K/month for A100 access
CUDA: the software moat
- Released 2006; today a compiler, runtime, debugger, profiler, native language (CUDA C++), and industry-specific libraries
- Backwards-compatible across every Nvidia GPU shipped since 2006 — 500M CUDA-capable GPUs in the wild
- Developer count: 100K (2010) → 1M (2016) → 2M (2018) → 4M (2023); roughly 10,000 person-years of cumulative investment
- ~1,600 Nvidia employees have "CUDA" in their LinkedIn title; competitors' open-source equivalents (ROCm, OpenCL) are years behind
- The Apple-vs-Android analogy: Nvidia controls the tightly coupled hardware-software stack; PyTorch is the open ecosystem that rivals are trying to route through
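The developer milestones above can be turned into annual growth rates to show how the ecosystem compounded. This is a simple sketch using only the figures quoted in the list; `cagr` is a hypothetical helper, and the milestone years are treated as exact.

```python
def cagr(start, end, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end / start) ** (1 / years) - 1


# (year, CUDA developer count) pairs as quoted in the list above
milestones = [(2010, 100_000), (2016, 1_000_000), (2018, 2_000_000), (2023, 4_000_000)]

for (y0, n0), (y1, n1) in zip(milestones, milestones[1:]):
    print(f"{y0}-{y1}: {cagr(n0, n1, y1 - y0):.1%} per year")
```

On these numbers the percentage growth rate slows over time (roughly 47% a year in 2010-2016 versus about 15% a year in 2018-2023), even though the absolute number of developers added per period keeps rising.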
Financial results and competitive position
- Q1 FY24 (reported May 2023): revenue $7.2B, up 19% QoQ
- Q2 FY24 guidance: $11B, up 53% QoQ and 65% YoY; the stock rose 25% in after-hours trading
- Q2 FY24 actual: total revenue $13.5B (+88% QoQ, +100% YoY); data centre alone $10.3B (+141% QoQ, +171% YoY)
- Gross margin: ~70%, forecast 72% — vs 24% pre-CUDA era
- ~50% of data centre revenue comes from cloud service providers and hyperscalers (AWS, Azure, Google, Meta); these customers buy bare chips and integrate them in-house
- China revenue was 25% of total before export controls; to comply, Nvidia created the A800 and H800 (with capped NVLink bandwidth), which still sell in volume
- Jensen's revised TAM framing: $1 trillion in installed data centre hardware, $250B annual refresh spend — a grounded claim vs the earlier "1% of everything" slide
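The growth percentages and the TAM framing above can be sanity-checked with simple arithmetic. The sketch uses only the rounded dollar figures quoted in this section; `pct_change` and `refresh_cycle_years` are hypothetical helper names.

```python
def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


def refresh_cycle_years(installed_base, annual_spend):
    """Years needed to replace the installed base at a constant annual spend."""
    return installed_base / annual_spend


q1_revenue = 7.2    # $B, Q1 FY24
q2_guidance = 11.0  # $B, Q2 FY24 guidance
q2_actual = 13.5    # $B, Q2 FY24 actual

print(f"{pct_change(q1_revenue, q2_guidance):.1f}")  # 52.8
print(f"{pct_change(q1_revenue, q2_actual):.1f}")    # 87.5
print(refresh_cycle_years(1000, 250))                # 4.0
```

The rounded inputs give 52.8% and 87.5%, consistent with the quoted 53% and 88% (the small gap presumably comes from unrounded revenue figures). The TAM claim implies that data centre hardware is replaced roughly every four years.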
Bull and bear cases
- Bull: accelerated computing is still a fraction of total workloads; Jensen is correct that every application will gain a generative AI layer; Nvidia moves at a six-month ship cycle competitors cannot match; data centre capex lock-in is decade-long
- Bear: every large tech company (Google TPU, Amazon Trainium, Microsoft/AMD rumours) is incentivised to break the moat; PyTorch aggregates developers in a way that could eventually disintermediate hardware; a confidence crisis in AI could slow enterprise capex; inference workloads are less differentiated than training
- Nvidia is not Cisco or Intel — it controls the software stack and has direct developer relationships; the closer analogy is Microsoft, or old-school IBM in its mainframe era