DeepSeek’s Extensive GPU Use and Development Costs for AI Models

DeepSeek has access to tens of thousands of GPU accelerators for developing its AI models, including H100 GPUs, which are under US export restrictions. The reported cost of nearly $5.6 million for DeepSeek-V3 likely represents only a small part of the total expense.

In the paper for the V3 model, DeepSeek describes a relatively small data center with 2,048 H800 accelerators from Nvidia. The company assumes a hypothetical rental cost of $2 per H800 GPU-hour; with nearly 2.8 million GPU-hours, spread across the 2,048 accelerators, this works out to the $5.6 million figure.
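As a quick back-of-the-envelope check, those figures can be combined as follows (a minimal sketch in Python; the GPU-hour total and the $2 rate are the values quoted above, and the wall-clock estimate assumes all 2,048 GPUs run continuously):

```python
# Back-of-the-envelope check of DeepSeek's stated V3 training cost.
gpu_hours = 2_800_000        # ~2.8 million H800 GPU-hours reported in the paper
rate_per_gpu_hour = 2.0      # USD, the hypothetical rental price DeepSeek assumes

total_cost = gpu_hours * rate_per_gpu_hour
print(f"Estimated training cost: ${total_cost:,.0f}")            # -> $5,600,000

# Spread across 2,048 GPUs running around the clock:
wall_clock_days = gpu_hours / 2048 / 24
print(f"Wall-clock training time: ~{wall_clock_days:.0f} days")  # -> ~57 days
```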

The developers themselves note a limitation: “Please note that the costs mentioned above only include the official training of DeepSeek-V3 and not the costs associated with earlier research and ablation experiments on architectures, algorithms, or data.”

DeepSeek has access to approximately 60,000 Nvidia accelerators through its parent company, High-Flyer: 10,000 A100s from the Ampere generation, obtained before US export restrictions took effect; 10,000 H100s from the gray market; 10,000 China-adapted H800s; and 30,000 H20s, which Nvidia introduced in response to newer export restrictions.

Scale AI CEO Alexandr Wang claimed in a CNBC interview that DeepSeek uses 50,000 H100 accelerators. This may be a mix-up: the H100, H800, and H20, which together would account for roughly 50,000 units, all belong to the Hopper generation, just in different versions.

The H100 is the standard model for Western markets. For the H800, Nvidia throttled the NVLink interconnect because of export restrictions, which slows communication between multiple GPUs. The H20 followed under the newer restrictions: its computing power is significantly reduced, but its NVLink is unrestricted, and it comes with the maximum memory configuration of 96 GB of High Bandwidth Memory (HBM3) at a transfer rate of 4 TB/s.

SemiAnalysis calculates that the servers needed for the 60,000 GPUs cost about $1.6 billion, with operating costs coming on top. The salaries of the development teams are not included either.
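A rough sanity check on these fleet and cost figures (a sketch that only restates the numbers quoted above; the per-accelerator average is derived here and is not a SemiAnalysis figure):

```python
# Rough sanity check on the reported accelerator fleet and server cost.
inventory = {"A100": 10_000, "H100": 10_000, "H800": 10_000, "H20": 30_000}
total_gpus = sum(inventory.values())
print(f"Total accelerators: {total_gpus:,}")                # -> 60,000

server_capex = 1.6e9                                         # SemiAnalysis estimate, USD
print(f"Implied server cost per GPU: ${server_capex / total_gpus:,.0f}")  # -> ~$26,667
```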

Of the $5.6 million cited by DeepSeek, 96% is attributed to pre-training, i.e. the training of the final base model. The prior development effort, including all the innovations carried over from DeepSeek-V2, is not accounted for in the paper.

The development of the caching technique Multi-Head Latent Attention (MLA) alone is said to have taken months. MLA compresses the keys and values of all previously processed tokens into a compact latent representation, so the model can quickly access that context during new queries without using much memory.
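A minimal sketch of that idea in Python, with illustrative dimensions and random weights rather than DeepSeek's actual implementation: each cached token stores only a small latent vector, and full keys and values are reconstructed from it when attention is computed.

```python
import numpy as np

# Toy illustration of latent KV-cache compression (dimensions are made up).
d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128
rng = np.random.default_rng(0)

W_down = rng.standard_normal((d_latent, d_model)) * 0.02           # compress hidden state
W_up_k = rng.standard_normal((n_heads * d_head, d_latent)) * 0.02  # reconstruct keys
W_up_v = rng.standard_normal((n_heads * d_head, d_latent)) * 0.02  # reconstruct values

latent_cache = []  # what gets stored per generated token

def cache_token(hidden_state):
    """Store only the compressed latent for a new token."""
    latent_cache.append(W_down @ hidden_state)

def expand_cache():
    """Rebuild full keys and values from the latent cache on demand."""
    C = np.stack(latent_cache)   # (seq_len, d_latent)
    K = C @ W_up_k.T             # (seq_len, n_heads * d_head)
    V = C @ W_up_v.T
    return K, V

# Cache a few tokens and compare memory footprints.
for _ in range(8):
    cache_token(rng.standard_normal(d_model))

full_kv_floats = 8 * 2 * n_heads * d_head   # classic per-head KV cache
latent_floats = 8 * d_latent                # MLA-style latent cache
print(f"classic KV cache: {full_kv_floats} floats, latent cache: {latent_floats} floats")
```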

Another innovation likely required significant resources: DualPipe. As The Next Platform highlights, DeepSeek uses part of the Streaming Multiprocessors (SMs) in Nvidia's GPUs as a kind of virtual Data Processing Unit (DPU). These SMs independently handle data movement within and between AI accelerators, at much lower latency than routing it through CPUs, which increases efficiency.
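The underlying idea of dedicating resources to data movement so that transfers overlap with computation can be illustrated with a plain-Python toy (threads stand in for the reserved SMs; this has nothing to do with DeepSeek's actual GPU kernels):

```python
import queue
import threading
import time

comm_queue = queue.Queue()

def communication_worker():
    """Plays the role of the reserved SMs: it only performs data movement."""
    while True:
        item = comm_queue.get()
        if item is None:          # sentinel: no more micro-batches
            break
        time.sleep(0.05)          # pretend transfer between accelerators
        print(f"transferred results of micro-batch {item}")

def compute(micro_batch):
    """Plays the role of the forward/backward pass for one micro-batch."""
    time.sleep(0.05)              # pretend compute
    print(f"computed micro-batch {micro_batch}")
    return micro_batch

worker = threading.Thread(target=communication_worker)
worker.start()

for mb in range(4):
    result = compute(mb)          # compute the current micro-batch
    comm_queue.put(result)        # hand off its transfer; the next micro-batch's
                                  # compute runs while this transfer is in flight

comm_queue.put(None)              # tell the worker to stop
worker.join()
```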

In the paper for the more powerful R1 model, DeepSeek provides no information about the hardware used. A similarly small data center would be even less plausible there. Recent reports suggest that DeepSeek may also be using Huawei AI accelerators for R1.