The supercomputer is built from boards that each carry four TPUv4 chips and is liquid-cooled, which yields significant power savings. Each board connects back to its host over PCIe Gen3 x16, and a full system comprises 64 racks housing 4096 interconnected chips. Those chips are joined by an optically reconfigurable interconnect: optical circuit switches create direct chip-to-chip links for efficient data sharing, and Google can adjust the optical routing to change the network topology to suit a given model. The company also increased the on-chip memory to 128MB to keep data access local, and the resulting system outperforms the NVIDIA A100 on a performance-per-watt basis.
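To make the circuit-switching idea concrete, here is a toy Python model, an illustration rather than Google's implementation: an OCS holds a static mapping from transmit ports to receive ports, so two connected chips get a dedicated light path with no packet switching in between, and retargeting the optics amounts to installing a new port mapping. The port counts and the `ring` helper below are hypothetical.

```python
class OpticalCircuitSwitch:
    """Toy OCS: a reconfigurable mapping from transmit ports to receive ports."""

    def __init__(self, num_ports: int):
        self.num_ports = num_ports
        self.routing: dict[int, int] = {}  # tx port -> rx port

    def reconfigure(self, routing: dict[int, int]) -> None:
        # Changing topology is just installing a new port mapping; traffic
        # then flows chip-to-chip with no per-packet switching in the middle.
        if len(set(routing.values())) != len(routing):
            raise ValueError("each receive port can terminate only one circuit")
        self.routing = dict(routing)

    def neighbor(self, tx_port: int) -> int:
        """Which chip's receive port does this transmit port reach?"""
        return self.routing[tx_port]


def ring(chips: list[int]) -> dict[int, int]:
    """Wire a set of chips into a unidirectional ring (a 1D torus)."""
    return {c: chips[(i + 1) % len(chips)] for i, c in enumerate(chips)}


ocs = OpticalCircuitSwitch(num_ports=8)
ocs.reconfigure(ring([0, 1, 2, 3]))    # a 4-chip ring for a small job
assert ocs.neighbor(3) == 0            # wraparound link is a direct light path
ocs.reconfigure(ring(list(range(8))))  # rewire the same switch into an 8-chip ring
assert ocs.neighbor(3) == 4
```

The property the toy captures is that reconfiguration replaces the whole mapping at once, which is how one pool of chips can present a different topology to each job.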
Key takeaways:
- Google has been using an optically reconfigurable network in its AI training clusters, which delivers better performance, lower power, and more flexibility (see the mesh sketch after this list).
- The Google TPUv4, a 7nm chip, has been detailed; its power budget is deliberately overprovisioned to meet a 5ms service-time SLA, and the chip is designed to scale out as part of large-scale infrastructure.
- Google's supercomputer uses optical circuit switching (OCS) to connect chips directly, enabling efficient data sharing and higher node utilization.
- Google is expected to start talking about the TPUv5 soon, and speculation is that it will push deeper into AI, competing with NVIDIA in both AI hardware and cloud services.
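As referenced in the first takeaway, once the OCS has wired a slice into a given torus shape, a training job sees it as a device mesh to shard work over. Below is a minimal JAX sketch under assumed parameters: the 4x4x8 slice shape (128 chips), the axis names `data`/`fsdp`/`model`, and the 8192x8192 parameter block are all illustrative rather than Google's configuration, and running it requires a slice with 128 attached devices.

```python
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Assume the scheduler provisioned a 4x4x8 slice (128 chips); the OCS has
# already wired that shape out of the pod's 4096 chips.
devices = mesh_utils.create_device_mesh((4, 4, 8))
mesh = Mesh(devices, axis_names=("data", "fsdp", "model"))

# Shard a (hypothetical) parameter block across the 'model' axis; JAX lays
# the logical axes onto the physical torus, so neighbor traffic stays on
# direct chip-to-chip links.
params = jnp.zeros((8192, 8192))
sharding = NamedSharding(mesh, PartitionSpec(None, "model"))
params = jax.device_put(params, sharding)
```

A different model could request a different slice shape (say, 8x8x2) and get a matching topology from the same pool of chips; because the links are circuit-switched, whatever shape the job asks for behaves like a dedicated torus rather than a share of a packet-switched fabric.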