NVIDIA Beats Everyone To DeepSeek V4 With Day-0 Blackwell Support, Pushing 3,500 Tokens Per Second On 1.6T Models
Key Points:
- DeepSeek V4 introduces significant optimizations, cutting single-token inference FLOPs to 27% of previous levels and KV-cache usage to 10% at a one-million-token context window, and adds new models with up to 1.6 trillion parameters.
- NVIDIA provides Day-0 support for DeepSeek V4 on its Blackwell GPUs, which deliver the necessary scale and low-latency performance for trillion-parameter AI models and long-context inference.
- The NVIDIA Blackwell architecture features technologies such as NVFP4 quantization, Dynamo, and optimized CUDA kernels, achieving nearly 3,500 tokens per second of throughput per GPU in early benchmarks.
- DeepSeek V4's use of FP4 (MXFP4) quantization reduces memory traffic and latency, and compatibility with Huawei's upcoming Ascend 950PR and 950DT chips signals support for China's domestic AI hardware ecosystem.
- NVIDIA continues to support the open-source AI ecosystem by providing tools, microservices, and fine-tuning workflows to facilitate integration and development across various deployment stages.
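To see why a 90% KV-cache reduction matters at million-token contexts, a back-of-envelope sizing sketch helps. Every model dimension below (layer count, KV heads, head size) is a hypothetical placeholder, not DeepSeek V4's actual configuration:

```python
# Rough KV-cache sizing for a dense-attention transformer.
# All model dimensions are illustrative placeholders.

def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # 2x covers keys and values; one entry per layer, per KV head, per token.
    return 2 * context_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical FP16 baseline (2 bytes per element) at a 1M-token context.
baseline = kv_cache_bytes(
    context_len=1_000_000, n_layers=60, n_kv_heads=128, head_dim=128,
    bytes_per_elem=2,
)
reduced = baseline * 0.10  # the reported 10% of baseline KV-cache usage

print(f"baseline: {baseline / 2**30:,.0f} GiB")
print(f"reduced:  {reduced / 2**30:,.0f} GiB")
```

At these (made-up) dimensions, a naive 1M-token FP16 cache runs to terabytes per sequence, which is why cache-compression techniques are a prerequisite for long-context serving at all, independent of raw GPU throughput.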
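The memory savings from FP4-style formats come from block-scaled 4-bit storage. The sketch below is a simplified illustration, not NVIDIA's NVFP4 or the MXFP4 specification: it uses a symmetric integer grid and FP16 per-block scales, whereas real MXFP4 stores values on an FP4 (E2M1) grid with power-of-two shared scales per 32-element block.

```python
# Minimal block-scaled 4-bit quantization sketch (simplified; not real MXFP4).
import numpy as np

BLOCK = 32  # elements sharing one scale, as in MX-style block formats

def quantize_4bit(x):
    x = x.reshape(-1, BLOCK)
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0  # symmetric range -7..7
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale):
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)

# 4-bit payload plus one 16-bit scale per block, vs. 16-bit weights.
bits = q.size * 4 + s.size * 16
print(f"compression vs FP16: {w.size * 16 / bits:.2f}x")
print(f"max abs error: {np.abs(w - w_hat).max():.3f}")
```

The design point worth noting is the trade: roughly 3.5x less memory traffic per weight in exchange for bounded per-block rounding error, which is what makes aggressive low-precision formats attractive for inference throughput.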