30,000-40,000 US dollars per chip, Nvidia's latest AI chip exposed design defect

  • tech
  • 2024-06-13

According to a report by The Information, sources have revealed that Nvidia has informed its customers that the new Blackwell B200 chip will be delayed by three months or more, with bulk shipments potentially postponed until Q1 of next year.

Subsequently, SemiAnalysis dissected the technical challenges of the Blackwell B200 in its latest research report, focusing primarily on packaging. A Blackwell B200 contains two Blackwell GPU dies, and the critical circuitry connecting them currently has issues that are driving the chip's yield down. The same defect also poses mass-production challenges for the GB200 superchip, which combines two Blackwell GPUs with one Grace CPU.

Packaging is key to the performance of Blackwell B200 and GB200

According to Nvidia's launch presentation, the Blackwell B200 is billed as the most powerful AI accelerator on the planet, built on TSMC's 4nm-class process. Internally, it adopts a dual-die design that interconnects two Blackwell GPU dies, for a total transistor count of up to 208 billion per chip.

Advanced process technology plus advanced packaging make the Blackwell B200 a performance beast: its AI performance reaches 20 PFLOPS, five times that of the H100, and it brings a 30-fold efficiency improvement for large language model (LLM) inference. The GB200 is stronger still, with AI performance seven times that of the H100 and training performance four times that of the H100.

It is not difficult to see that, with Moore's Law faltering, advanced packaging has become a key means for Nvidia to raise chip performance. To package two GPU dies together and still have them function as a single GPU, Nvidia connects the two Blackwell dies with NV-HBI, a high-bandwidth interface running at 10 TB/s, and the package also integrates 192 GB of high-speed HBM3e memory.

However, sources say the circuit design between the two Blackwell GPU dies has issues that are preventing the B200 from reaching mass production as planned. Blackwell is the first design to use TSMC's CoWoS-L packaging technology in volume production, employing an RDL interposer with local silicon interconnect (LSI) and embedded bridge dies. Compared with the previously dominant CoWoS-S, CoWoS-L is more complex, but it also delivers higher performance.

Google, Meta, and Amazon are the most affected

The Blackwell chip was originally scheduled to start mass production in October 2024. If the delay pushes it back to April 2025, it will directly affect Nvidia's quarterly revenue. The most affected customers are Google, Meta, and Amazon.

Reports indicate that Google, Meta, and Amazon have placed orders with Nvidia totaling over $60 billion. According to estimates by the investment bank Morgan Stanley, Nvidia's rack-scale systems use the NVL36 and NVL72 interconnect configurations, meaning a single system carries 36 or 72 Blackwell B200 chips. At a unit price of $30,000-$40,000 per chip, even a 36-chip system exceeds $1 million in chip cost alone; at $40,000 per chip, it is roughly $1.4 million. Accordingly, an NVL36 system is expected to be priced at $2 million and an NVL72 system at $3 million, in line with Jensen Huang's refrain that "the more you buy, the more you save."

If Google, Meta, and Amazon committed their combined orders to NVL72 systems at $3 million each, demand for Blackwell B200 chips would soar to roughly 1.4 million units, and further discounts for large-scale procurement could push that number even higher. Indeed, according to sources close to Google, the company is likely to invest around $10 billion to acquire 400,000 Blackwell B200 chips along with the necessary server infrastructure. Scaled to the full $60 billion in orders, that works out to implied demand for approximately 2.4 million Blackwell B200 chips.
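The arithmetic behind these figures can be checked in a few lines. This is a back-of-the-envelope sketch using only the article's own estimates (chip prices, rack prices, and order totals are the article's numbers, not confirmed Nvidia pricing):

```python
# Back-of-the-envelope check of the figures cited above.
# All inputs are the article's estimates, not confirmed Nvidia pricing.

CHIP_PRICE_LOW = 30_000      # USD per B200 (article's low estimate)
CHIP_PRICE_HIGH = 40_000     # USD per B200 (article's high estimate)
NVL36_CHIPS = 36             # B200 chips per NVL36 system
NVL72_CHIPS = 72             # B200 chips per NVL72 system

# Chip cost alone for one NVL36 rack:
nvl36_low = NVL36_CHIPS * CHIP_PRICE_LOW    # exceeds $1 million
nvl36_high = NVL36_CHIPS * CHIP_PRICE_HIGH  # roughly $1.4 million

# Implied chip demand if the combined $60B in orders all went to
# NVL72 systems at the article's $3M price point:
NVL72_PRICE = 3_000_000
total_orders = 60_000_000_000
racks = total_orders // NVL72_PRICE         # number of NVL72 racks
chips = racks * NVL72_CHIPS                 # implied B200 demand (~1.4M)

print(f"NVL36 chip cost: ${nvl36_low:,} - ${nvl36_high:,}")
print(f"$60B buys {racks:,} NVL72 racks = {chips:,} B200 chips")
```

The ~1.4 million figure in the text matches this calculation almost exactly (1.44 million), which suggests the estimate assumes the entire $60 billion is spent on NVL72 systems.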

Furthermore, when NVIDIA unveiled the Blackwell B200, it revealed that Dell, Microsoft, OpenAI, Oracle, Tesla, and xAI all plan to adopt Blackwell products. The B200's delayed release alone could therefore impact NVIDIA's revenue targets for this year.

However, the build-out of cutting-edge data centers currently relies entirely on NVIDIA's compute chips. NVIDIA's compensating measure is to keep extending shipments of its Hopper-series chips while striving to ship Blackwell B200 chips within the year. If it succeeds, NVIDIA could aim for $200 billion in revenue by 2025, against $47.5 billion projected for 2024.

For context, according to statistics from the semiconductor research firm TechInsights, global data-center GPU shipments reached 3.85 million units in 2023, up 44.2% from 2.67 million units in 2022. NVIDIA firmly holds the top position with a 98% market share. Given such dominance, the immense demand for the Blackwell B200 is easy to imagine.

Nevertheless, the Blackwell B200 is a complex chip, and its CoWoS-L packaging is currently facing mass-production challenges. Still, given the capabilities of NVIDIA and TSMC, it is widely expected that these issues will be resolved swiftly.
