NVIDIA Beats Everyone To DeepSeek V4 With Day-0 Blackwell Support, Pushing 3,500 Tokens Per Second On 1.6T Models
Key Points:
- DeepSeek V4 introduces significant optimizations, cutting single-token inference FLOPs to 27% of previous levels and KV-cache usage to 10% at a one-million-token context window, and adds new models with up to 1.6 trillion parameters.
- NVIDIA provides Day-0 support for DeepSeek V4 on its Blackwell GPUs, which deliver the necessary scale and low-latency performance for trillion-parameter AI models and long-context inference.
- The NVIDIA Blackwell architecture features technologies such as NVFP4 quantization, Dynamo, and optimized CUDA kernels, achieving nearly 3,500 tokens per second of throughput per GPU in early benchmarks.
- DeepSeek V4's use of FP4 (MXFP4) quantization reduces memory traffic and latency, and compatibility with Huawei's upcoming Ascend 950PR and 950DT chips signals support for China's domestic AI hardware ecosystem.
- NVIDIA continues to support the open-source AI ecosystem by providing tools, microservices, and fine-tuning workflows to facilitate integration and development across various deployment stages.
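To see why a 90% KV-cache reduction matters at million-token contexts, a back-of-envelope sizing sketch helps. Every model dimension below (layer count, KV heads, head size) is a hypothetical placeholder, not DeepSeek V4's actual configuration:

```python
# Rough KV-cache sizing for a dense-attention transformer.
# All model dimensions are illustrative placeholders.

def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # 2x covers keys and values; one entry per layer, per KV head, per token.
    return 2 * context_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical FP16 baseline (2 bytes per element) at a 1M-token context.
baseline = kv_cache_bytes(
    context_len=1_000_000, n_layers=60, n_kv_heads=128, head_dim=128,
    bytes_per_elem=2,
)
reduced = baseline * 0.10  # the reported 10% of baseline KV-cache usage

print(f"baseline: {baseline / 2**30:,.0f} GiB")
print(f"reduced:  {reduced / 2**30:,.0f} GiB")
```

At these (made-up) dimensions, a naive 1M-token FP16 cache runs to terabytes per sequence, which is why cache-compression techniques are a prerequisite for long-context serving at all, independent of raw GPU throughput.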
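The memory savings from FP4-style formats come from block-scaled 4-bit storage. The sketch below is a simplified illustration, not NVIDIA's NVFP4 or the MXFP4 specification: it uses a symmetric integer grid and FP16 per-block scales, whereas real MXFP4 stores values on an FP4 (E2M1) grid with power-of-two shared scales per 32-element block.

```python
# Minimal block-scaled 4-bit quantization sketch (simplified; not real MXFP4).
import numpy as np

BLOCK = 32  # elements sharing one scale, as in MX-style block formats

def quantize_4bit(x):
    x = x.reshape(-1, BLOCK)
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0  # symmetric range -7..7
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale):
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)

# 4-bit payload plus one 16-bit scale per block, vs. 16-bit weights.
bits = q.size * 4 + s.size * 16
print(f"compression vs FP16: {w.size * 16 / bits:.2f}x")
print(f"max abs error: {np.abs(w - w_hat).max():.3f}")
```

The design point worth noting is the trade: roughly 3.5x less memory traffic per weight in exchange for bounded per-block rounding error, which is what makes aggressive low-precision formats attractive for inference throughput.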