Nvidia and the AI era: how GPU dominance became inevitable

Executive overview

By late 2022, large language models running on transformers burst into mainstream use, turning Nvidia's decade-long bet on GPU-accelerated data centres into the most profitable position in modern tech history. The company had quietly assembled every layer of the stack — chips, networking, software — while competitors watched.

The preparation was already done: CUDA, Mellanox, and the Hopper architecture were built years before the demand arrived.

The research chain that produced the AI moment

  • AlexNet (2012) ran convolutional neural networks on two consumer GeForce GPUs using CUDA, proving parallel compute could unlock AI
  • The Toronto team (Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever) was scooped up by Google; Sutskever later co-founded OpenAI
  • Google's 2017 transformer paper ("Attention Is All You Need") made sequence models trainable in parallel, at scale
  • Self-attention in a transformer costs O(n²) compute in sequence length n, but GPUs can run the pairwise comparisons in parallel, so the bottleneck became memory rather than raw speed
  • GPT parameter counts scaled from roughly 120M (GPT-1) to 175B (GPT-3) to a reported but unconfirmed ~1.7T (GPT-4); model quality improved discontinuously with scale
  • OpenAI restructured around a capped-profit entity in 2019 and took $1B from Microsoft to afford the compute required
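
The quadratic scaling in the attention bullet above can be made concrete with a small sketch. It counts only the query-key comparison step of a single attention head; the function name, the head dimension of 64, and the 2-byte value size are illustrative assumptions, not figures from the source.

```python
def attention_cost(seq_len: int, d_head: int, bytes_per_value: int = 2) -> dict:
    """Rough cost of the score matrix for one self-attention head.

    Only the Q @ K^T step is counted: every token is compared with every
    token, so both the work and the score matrix grow as n^2.
    """
    comparisons = seq_len * seq_len               # n x n token pairs
    flops = 2 * comparisons * d_head              # one multiply and one add per dimension
    score_bytes = comparisons * bytes_per_value   # memory to hold the n x n scores
    return {"comparisons": comparisons, "flops": flops, "score_bytes": score_bytes}


if __name__ == "__main__":
    # Assumed, illustrative sequence lengths.
    for n in (1_024, 2_048, 4_096):
        cost = attention_cost(n, d_head=64)
        print(n, cost["comparisons"], cost["score_bytes"] / 2**20, "MiB")
```

Doubling the sequence length quadruples both the comparisons and the memory for the scores. The comparisons are independent and parallelise well on a GPU, which is why memory tends to become the limit first.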

Why the data centre is the computer

  • Von Neumann CPUs execute one instruction at a time; GPUs run tens of thousands in parallel — a "giant Archimedes lever" on Moore's Law
  • Training large models requires hundreds of gigabytes of fast GPU memory, more than any single chip carries, forcing multiple GPUs to be networked as one logical computer
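  • The full GH100 die has 18,432 CUDA cores and 576 tensor cores across 144 streaming multiprocessors (the shipping H100 SXM enables 132 of them, giving 16,896 CUDA cores and 528 tensor cores); Nvidia claims up to 9× faster AI training than the A100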
  • CoWoS (chip-on-wafer-on-substrate) 2.5D packaging stacks high-bandwidth memory close to the logic die — currently 10–15% of TSMC's total capacity
  • TSMC capacity for CoWoS is the binding constraint on H100 supply, not Nvidia's willingness to sell
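
A back-of-the-envelope sketch of why one GPU is not enough, using the 175B-parameter figure quoted earlier. The 80 GB per-GPU memory figure, the 2 bytes per parameter for weights alone, and the 16 bytes per parameter rule of thumb for mixed-precision training state are assumptions added here for illustration, and `gpus_needed` is a hypothetical helper.

```python
import math


def gpus_needed(params: int, bytes_per_param: float, gpu_memory_gb: float) -> int:
    """Minimum number of GPUs whose combined memory can hold the model state.

    Ignores activations, communication buffers, and fragmentation, so a
    real deployment needs more than this lower bound.
    """
    total_gb = params * bytes_per_param / 1e9
    return math.ceil(total_gb / gpu_memory_gb)


if __name__ == "__main__":
    gpt3_params = 175_000_000_000   # parameter count quoted in the text
    gpu_memory_gb = 80              # assumption: memory on one data centre GPU

    # Assumption: 2 bytes per parameter to store 16-bit weights only.
    print("weights only:", gpus_needed(gpt3_params, 2, gpu_memory_gb))
    # Assumption: about 16 bytes per parameter once gradients and
    # optimizer state are kept alongside the weights during training.
    print("training state:", gpus_needed(gpt3_params, 16, gpu_memory_gb))
```

Even this lower bound puts a GPT-3-sized training run across dozens of GPUs, which is why the interconnect between them matters as much as the chips themselves.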

Nvidia's three-part data centre platform

  • Mellanox / InfiniBand (acquired 2020, $7B): the only high-bandwidth rack-to-rack networking stack that can treat a full data centre as one computer
  • Grace CPU (announced April 2021): an ARM-based CPU designed from scratch to orchestrate massive GPU clusters, not for laptops
  • Hopper GPU architecture (H100): split from the gaming Lovelace line, enabling Nvidia to monopolise CoWoS capacity at TSMC for AI chips
  • Together these pieces are sold as DGX systems, fully integrated AI supercomputers; a single DGX H100 starts at about $500K, while the DGX GH200 (256 Grace Hopper superchips linked as one system) runs to hundreds of millions
  • DGX Cloud launched through Azure, Oracle and Google — a virtualised DGX rented via Nvidia's own interface at $37K/month for A100 access

CUDA: the software moat

  • Released 2006; today a compiler, runtime, debugger, profiler, native language (CUDA C++), and industry-specific libraries
  • Backwards-compatible across every Nvidia GPU shipped since 2006 — 500M CUDA-capable GPUs in the wild
  • Developer count: 100K (2010) → 1M (2016) → 2M (2018) → 4M (2023); roughly 10,000 person-years of cumulative investment
  • ~1,600 Nvidia employees have "CUDA" in their LinkedIn title; competitors' open-source equivalents (ROCm, OpenCL) are years behind
  • The Apple-vs-Android analogy: Nvidia controls the tightly coupled hardware-software stack; PyTorch is the open ecosystem that rivals are trying to route through

Financial results and competitive position

  • Q1 FY24 (reported May 2023): revenue $7.2B, up 19% QoQ
  • Q2 FY24 guidance: $11B — up 53% QoQ, 65% YoY; stock rose 25% in after-hours
  • Q2 FY24 actual: total revenue $13.5B (+88% QoQ, +100% YoY); data centre alone $10.3B (+141% QoQ, +171% YoY)
  • Gross margin: ~70%, forecast 72% — vs 24% pre-CUDA era
  • ~50% of data centre revenue comes from cloud service providers (AWS, Azure, Google, Meta); CSPs buy bare chips and integrate themselves
  • China revenue was 25% of total before export controls; Nvidia created the A800/H800 (capped NVLink bandwidth) to comply — still selling at volume
  • Jensen's revised TAM framing: $1 trillion in installed data centre hardware, $250B annual refresh spend — a grounded claim vs the earlier "1% of everything" slide
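
The growth percentages in the bullets above can be checked from the rounded revenue figures. This is a minimal sketch using only numbers quoted in this section; `pct_change` and `implied_prior` are hypothetical helper names, and small differences from the reported percentages come from rounding revenue to one decimal place.

```python
def pct_change(new: float, old: float) -> float:
    """Percentage growth from old to new."""
    return (new - old) / old * 100


def implied_prior(current: float, growth_pct: float) -> float:
    """Back out the earlier period's value from a stated growth rate."""
    return current / (1 + growth_pct / 100)


if __name__ == "__main__":
    q1_fy24 = 7.2        # $B, Q1 FY24 revenue
    q2_guidance = 11.0   # $B, Q2 FY24 guidance
    q2_actual = 13.5     # $B, Q2 FY24 reported revenue
    q2_data_centre = 10.3  # $B, Q2 FY24 data centre revenue

    print("guidance vs Q1, % QoQ:", round(pct_change(q2_guidance, q1_fy24), 1))
    print("actual vs Q1, % QoQ:", round(pct_change(q2_actual, q1_fy24), 1))
    print("implied Q2 FY23 revenue, $B:", round(implied_prior(q2_actual, 100), 2))
    print("data centre share of Q2 revenue, %:", round(q2_data_centre / q2_actual * 100, 1))
```

The guidance works out to roughly 53% sequential growth and the actual result to a little under 88%, in line with the bullets; data centre comes to about three quarters of total Q2 revenue.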

Bull and bear cases

  • Bull: accelerated computing is still a fraction of total workloads; Jensen is correct that every application will gain a generative AI layer; Nvidia moves at a six-month ship cycle competitors cannot match; data centre capex lock-in is decade-long
  • Bear: every large tech company (Google TPU, Amazon Trainium, Microsoft/AMD rumours) is incentivised to break the moat; PyTorch aggregates developers in a way that could eventually disintermediate hardware; a confidence crisis in AI could slow enterprise capex; inference workloads are less differentiated than training
  • Nvidia is not Cisco or Intel — it controls the software stack and has direct developer relationships; the closer analogy is Microsoft, or old-school IBM in its mainframe era
