The AI Arms Race Gets Real: Google's Ironwood vs. Nvidia's Reign
Google's latest Tensor Processing Unit (TPU), the Ironwood, is poised to enter general availability, marking a significant escalation in the AI infrastructure battle. The core question, as always, is whether it can genuinely challenge Nvidia's dominance, or if it's simply another player in a game Nvidia seems destined to win.
Google claims Ironwood delivers four times the performance of its previous TPU generation. Anthropic, a key AI player, plans to deploy up to a million Ironwood chips for its Claude language model. These are impressive figures, but let's dig deeper. Ironwood boasts 4.6 petaFLOPS of dense FP8 performance, nearly matching Nvidia's B200. Memory bandwidth is comparable, too. But raw specs are only part of the story.
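To put those numbers side by side, here's a back-of-the-envelope comparison in Python. The figures are the publicly reported specs (Google's Ironwood numbers, Nvidia's dense FP8 B200 numbers); treat them as approximate assumptions, not vendor guarantees.

```python
# Back-of-the-envelope spec comparison. All figures are publicly
# reported and approximate -- assumptions, not vendor guarantees.
chips = {
    "Ironwood (TPU v7)": {"pflops_fp8": 4.6, "hbm_tbps": 7.4, "hbm_gb": 192},
    "Nvidia B200":       {"pflops_fp8": 4.5, "hbm_tbps": 8.0, "hbm_gb": 192},
}

for name, s in chips.items():
    # FLOPs per byte of HBM traffic needed to stay compute-bound:
    # a crude roofline-style figure of merit.
    flops_per_byte = (s["pflops_fp8"] * 1e15) / (s["hbm_tbps"] * 1e12)
    print(f"{name}: {s['pflops_fp8']} PFLOPS dense FP8, "
          f"{s['hbm_tbps']} TB/s HBM, ~{flops_per_byte:.0f} FLOPs/byte")
```

On paper, the two chips land within a few percent of each other on both axes, which is exactly why the battle shifts to scale and software.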
The Scale Game: Pods vs. Racks
Google's strategy hinges on scale. Nvidia's NVL72 rack systems integrate 72 Blackwell accelerators into a single coherent domain. Ironwood, in contrast, is offered in pods ranging from 256 to a staggering 9,216 TPUs. Google claims its Jupiter datacenter network can theoretically support clusters of up to 43 TPU v7 pods – that's around 400,000 accelerators. That is an incredible claim. These pod sizes are made possible by a 3D torus topology, which connects the individual chips in a three-dimensional mesh, as sketched below.
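To make the torus idea concrete, here's a minimal sketch. The 16×24×24 layout is purely hypothetical (it just happens to multiply out to 9,216 chips); Google hasn't published the exact arrangement, and the scale arithmetic simply reuses the figures quoted above.

```python
def torus_neighbors(coord, dims):
    """Return the six neighbors of a chip in a 3D torus: one step
    forward and back along each axis, with wraparound at the edges."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size  # wraparound is what makes it a torus
            neighbors.append(tuple(n))
    return neighbors

# Hypothetical 16x24x24 layout: 16 * 24 * 24 = 9,216 chips, the quoted pod size.
dims = (16, 24, 24)
print(torus_neighbors((0, 0, 0), dims))
# Every chip has exactly six links; a corner chip's "missing" neighbors
# wrap to the far faces, e.g. (15, 0, 0) and (0, 23, 0).

# Scale check on the cluster claim: 43 pods of 9,216 TPUs each.
print(43 * 9216)       # 396,288 -- "around 400,000 accelerators"
print(9216 // 72)      # 128 -- NVL72 racks' worth of chips in one top-end pod
```

The appeal of the torus is that every chip talks only to its six immediate neighbors, so the wiring cost grows linearly with pod size rather than requiring an enormous central switch fabric.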
However, it's important to consider the practical implications. Nvidia clusters containing hundreds of thousands of GPUs do exist, but they're built by networking together eight-way GPU boxes in scale-out domains. Nvidia's NVL72 enlarges the basic compute unit to 72 accelerators, yet that still falls well short of Google's pod sizes. How many of these massive Google clusters will actually be deployed? It's one thing to claim theoretical support; it's another to see it in action.

Google's use of optical circuit switches (OCS) to manage these massive pods is intriguing. These switches create physical light paths between TPUs, minimizing latency and improving fault tolerance: if a TPU fails, the OCS can swap in a spare without tearing down the job's topology. This approach contrasts with Nvidia's packet-switched networks, which offer more flexibility but can introduce latency under heavy load. Which is better truly depends on the workload.
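As a toy illustration of why circuit switching helps with fault tolerance, here's a sketch assuming a simple slot-to-chip mapping. This is not Google's actual OCS control plane, just the shape of the idea: the switch repoints a failed chip's light paths at a spare while the logical topology stays intact.

```python
# Toy model of circuit-switched fault recovery (hypothetical, not
# Google's actual control plane). The OCS holds a mapping from
# logical slots in the torus to physical chips; on failure it
# re-aims the affected circuits at a spare.
class OpticalCircuitSwitch:
    def __init__(self, active_chips, spares):
        # logical slot -> physical chip id
        self.circuits = {slot: chip for slot, chip in enumerate(active_chips)}
        self.spares = list(spares)

    def fail_over(self, dead_chip):
        for slot, chip in self.circuits.items():
            if chip == dead_chip:
                replacement = self.spares.pop()
                self.circuits[slot] = replacement  # same logical slot, new chip
                return slot, replacement
        raise ValueError(f"{dead_chip} not mapped")

ocs = OpticalCircuitSwitch(active_chips=["tpu-0", "tpu-1", "tpu-2"],
                           spares=["tpu-spare"])
slot, repl = ocs.fail_over("tpu-1")
print(f"slot {slot} now served by {repl}")  # slot 1 now served by tpu-spare
```

The running job never sees the topology change; that's the fault-tolerance argument for circuit switching in a nutshell.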
Here's where I get a bit skeptical. (I've looked at countless earnings calls, and this is the part that I find genuinely puzzling.) Sundar Pichai stated on a recent earnings call that Google is seeing substantial demand for its AI infrastructure products. He also raised the capital expenditure forecast for 2025 to $93 billion, up from an earlier estimate of $85 billion. That's an $8 billion increase. But how much of that is specifically tied to TPU infrastructure, and how much is general cloud expansion? Details on the specific allocation remain scarce.
The Software Advantage (Or Disadvantage?)
Ultimately, hardware is only half the equation. Software and developer ecosystems matter just as much. Nvidia has cultivated a robust software stack and a vast developer community over years. Google, while making strides, still has ground to cover. Anthropic's adoption of TPUs is a positive sign, but it's hardly an exclusive commitment: Anthropic is also deploying workloads across hundreds of thousands of Amazon's Trainium 2 accelerators.
Quarter after quarter, analysts on Nvidia's earnings calls ask Jensen Huang about the threat from AI ASICs. He plays it off, but it's hard to ignore that Google is catching up quickly. In the end, though, software may well be the deciding factor.