DeepSeek's FP8 GEMM (General Matrix Multiplication) implementation, as showcased in their DeepGEMM library, has sparked some buzz for its reported performance, particularly when compared to NVIDIA's native APIs like cuBLAS. Based on what’s out there, here’s my take.
DeepGEMM is designed to squeeze every ounce of efficiency out of NVIDIA's Hopper GPUs (like the H800), hitting over 1350 TFLOPS in FP8 precision under ideal conditions. That's impressive when you consider that NVIDIA's cuBLAS, the usual go-to for matrix operations, is estimated to deliver around 1500 TFLOPS of FP8 throughput on an H800 for certain matrix sizes, with figures closer to 3000 TFLOPS sometimes quoted for the H100, which DeepSeek likely couldn't use due to export restrictions. Posts on X and DeepSeek's own release notes suggest the library can outperform "expert-tuned kernels" (which could mean cuBLAS or similar) across a range of matrix sizes, sometimes by a significant margin: up to 2.7x faster in extreme cases, as one X user noted.
What’s the trick? DeepGEMM uses a lightweight just-in-time (JIT) compilation approach, keeping the core kernel logic lean at around 300 lines of code. It leans hard into fine-grained scaling and custom optimizations, like two-level accumulation (using Tensor Cores for the FP8 math and CUDA cores for higher-precision summing), to dodge the precision pitfalls of FP8. NVIDIA's native APIs, while robust and general-purpose, don't seem to match this level of specialization. DeepSeek's focus on Mixture-of-Experts (MoE) layouts and their willingness to tweak low-level instructions (even dabbling in PTX assembly) likely give them an edge in tailored workloads: AI training and inference for models like V3 and R1.
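To make the fine-grained scaling and two-level accumulation idea concrete, here's a rough NumPy sketch of the general pattern: quantize A and B with per-block scales along the reduction dimension, form low-precision partial products block by block, then promote each partial result into a float32 accumulator where the scales are applied. This is my own illustration, not DeepGEMM's code; the function names, the 128-element block size, and the use of float16 as a stand-in for FP8 (NumPy has no FP8 dtype) are all simplifications, and the real kernels do this with Hopper Tensor Core instructions and CUDA-core promotion rather than Python loops.

```python
import numpy as np

FP8_MAX = 448.0   # largest finite value of FP8 E4M3
BLOCK_K = 128     # scaling granularity along the K (reduction) dimension (assumed for illustration)

def quantize_k_blocks(x, block=BLOCK_K):
    """Per-(row, K-block) scaling: each block gets its own scale so that its
    max |value| lands at FP8_MAX. float16 stands in for FP8 storage here."""
    rows, k = x.shape
    n_blocks = k // block
    q = np.empty_like(x, dtype=np.float16)
    scales = np.empty((rows, n_blocks), dtype=np.float32)
    for b in range(n_blocks):
        blk = x[:, b * block:(b + 1) * block]
        amax = np.abs(blk).max(axis=1, keepdims=True) + 1e-12
        s = amax / FP8_MAX                      # dequantization scale per block
        scales[:, b] = s[:, 0]
        q[:, b * block:(b + 1) * block] = (blk / s).astype(np.float16)
    return q, scales

def gemm_two_level(a, b):
    """C = A @ B with per-K-block scaling and float32 accumulation."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and k % BLOCK_K == 0
    qa, sa = quantize_k_blocks(a)               # quantize A along K
    qb, sb = quantize_k_blocks(b.T)             # quantize B along its K dimension
    acc = np.zeros((m, n), dtype=np.float32)    # high-precision accumulator
    for blk in range(k // BLOCK_K):
        ka = slice(blk * BLOCK_K, (blk + 1) * BLOCK_K)
        # "Tensor Core" step: low-precision partial product over one K block.
        partial = qa[:, ka].astype(np.float32) @ qb[:, ka].astype(np.float32).T
        # "CUDA core" step: promote, apply both blocks' scales, accumulate in FP32.
        acc += partial * sa[:, blk:blk + 1] * sb[:, blk]
    return acc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((64, 256)).astype(np.float32)
    b = rng.standard_normal((256, 32)).astype(np.float32)
    err = np.abs(gemm_two_level(a, b) - a @ b).max()
    print(f"max abs error vs float32 GEMM: {err:.4f}")
```

The reason the scales can be applied after the block dot product is that each scale is constant within its block, so it factors out of the partial sum. That's what "fine-grained" buys you: blocks small enough to keep outliers from wrecking the dynamic range of everything else, but large enough that the per-block rescale and FP32 accumulate stay cheap relative to the low-precision math.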
That said, it’s not a slam dunk. NVIDIA’s APIs are built for broad compatibility and reliability across diverse use cases, not just AI-specific GEMMs. DeepGEMM’s gains might shine brightest in DeepSeek’s own sandbox, optimized for their models and hardware constraints (like the H800’s nerfed bandwidth). Without head-to-head benchmarks on identical setups, claims of it being “much faster” feel anecdotal. The 1350+ TFLOPS figure is stellar, but cuBLAS could still edge it out in raw peak performance on bigger matrices or less-specialized tasks. Plus, DeepGEMM’s reliance on Hopper-specific Tensor Cores means it’s not a universal drop-in replacement.
So, is it “much faster”? Probably yes for DeepSeek’s niche—AI-driven, FP8-heavy, MoE-focused workloads on constrained hardware. For the average user leaning on NVIDIA’s stack? Maybe not as dramatic. It’s a testament to clever engineering over brute force, but the jury’s still out until someone runs the numbers side-by-side. What do you think—seen any solid comparisons?