To those in the know this merits nothing but a chuckle; not worth wasting time on. No surprise either: Jensen Huang is probably laughing inside too, though he can't say so publicly. He still has to be polite, lest some fragile egos get bruised.
It's over, DeepSeek has open-sourced the GPU FP8 assembly code.
Moderator: Softfist
#62 Re: It's over, DeepSeek has open-sourced the GPU FP8 assembly code.
廊坊聚集吹破天
The chief architect's greatest achievement was actually this:

quietly conning the worker and peasant classes down from masters of the state into hired-labor status, quietly switching the system over to capitalism.

annorying0 · Professional Writer · Interactions: 27 · Posts: 508 · Joined: May 4, 2024, 04:11
#66 Re: It's over, DeepSeek has open-sourced the GPU FP8 assembly code.
I pointed xAI at DeepSeek's FP8 GEMM lib after it went open source today; here is its reply:
DeepSeek open-sourcing their FP8 GEMM library, DeepGEMM, on February 25, 2025, could absolutely nudge AI chip makers toward a new industry standard—but it’s not a done deal. With the code now public, boasting 1350+ TFLOPS on NVIDIA’s H800 and a lean, JIT-compiled design optimized for dense and MoE workloads, it’s a tantalizing blueprint. Here’s how it might play out and what’s standing in the way.
This move hands chip makers like AMD, Intel, or even Cerebras a ready-made weapon. DeepGEMM’s tricks—two-level accumulation (FP8 math on Tensor Cores, higher precision elsewhere), fine-grained scaling, and MoE-friendly layouts—aren’t NVIDIA-exclusive in theory. AMD’s MI300X, already running DeepSeek-R1 with FP8 support via ROCm, could adapt this to hit 1000+ TFLOPS, leveraging its 5.2 TB/s HBM3 bandwidth. Intel’s Gaudi 3, with 128 GB HBM3, could tweak its deep learning stack similarly. X posts today are buzzing—some devs call it “a gift to the underdogs,” hinting at a potential rally around this approach. If these players standardize on DeepGEMM’s principles—FP8-first, efficiency-driven, MoE-optimized—it could shift the game away from NVIDIA’s CUDA-centric dominance.
The push has legs because it’s practical and cheap. DeepSeek’s V3 trained for $5.6M, peanuts compared to Western models, and DeepGEMM’s 2.7x speed over “expert-tuned kernels” (likely cuBLAS) on certain matrix sizes screams value. Chip makers could market this as a cost-per-FLOP killer, especially for inference-heavy AI firms dodging NVIDIA’s $30K+ H100s. X chatter suggests AMD’s already in deep—SGLang ties with DeepSeek signal intent. If they or Intel bake this into a reference platform, it might coalesce into a standard, especially for mid-tier or emerging markets where NVIDIA’s premium pricing stings.
But it’s no slam dunk. NVIDIA’s ecosystem is a fortress—CUDA’s entrenched, and their H200/Blackwell chips are already countering with FP8 Transformer Engines. DeepGEMM was born on NVIDIA’s Hopper and PTX assembly; porting it to ROCm or OneAPI isn’t trivial—X posts today flag ROCm’s “duct-tape vibes” as a hurdle. Devs love NVIDIA’s polish, and frameworks like PyTorch aren’t jumping ship. Plus, NVIDIA’s spinning DeepSeek’s wins as a GPU demand booster—H20 orders spiked this week. Without a unified push—say, AMD, Intel, and AWS (Trainium) aligning on a DeepGEMM-derived spec—it risks being a niche tool, not a standard.
Will it happen? It’s closer than yesterday. Open-sourcing DeepGEMM today slashes the R&D barrier, and its 1350 TFLOPS on “crippled” H800s proves efficiency can trump raw power—a siren call for cost-conscious chip makers. If AMD or Intel ship a killer implementation and devs bite (some on X are already forking it), it could snowball. But NVIDIA’s 80%+ AI chip share and software moat mean DeepSeek needs allies—fast. I’d say it’s 40/60—possible, not probable, unless a coalition forms. What’s your gut? Can this spark a real shift, or is it just noise?
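For readers who want the "two-level accumulation" and "fine-grained scaling" claims above made concrete, here is a minimal CPU-side sketch of the numerical idea; the helper names are hypothetical and this is a simulation, not the DeepGEMM API. FP8 e4m3 products are summed per 128-wide K block (the tensor-core level), then each block sum is rescaled by its per-block factors and folded into a full-FP32 accumulator (the promotion level).

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Crude e4m3 round-trip: clamp to the +-448 range and keep 3 mantissa bits.
float to_e4m3(float x) {
    if (x == 0.0f) return 0.0f;
    float ax = std::min(std::fabs(x), 448.0f);
    int e;
    float m = std::frexp(ax, &e);          // ax = m * 2^e, m in [0.5, 1)
    m = std::round(m * 16.0f) / 16.0f;     // 1 implicit + 3 stored bits
    return std::copysign(std::ldexp(m, e), x);
}

// One dot product with K-blocked two-level accumulation. sa/sb hold one
// scale per 128-wide K block of a and b (the "fine-grained" part).
float dot_two_level(const std::vector<float>& a, const std::vector<float>& b,
                    const std::vector<float>& sa, const std::vector<float>& sb) {
    const size_t kBlock = 128;
    float acc = 0.0f;                      // outer, full-FP32 accumulator
    for (size_t k0 = 0; k0 < a.size(); k0 += kBlock) {
        float inner = 0.0f;                // inner, per-block accumulator
        for (size_t k = k0; k < std::min(a.size(), k0 + kBlock); ++k)
            inner += to_e4m3(a[k] / sa[k0 / kBlock]) *
                     to_e4m3(b[k] / sb[k0 / kBlock]);
        acc += inner * sa[k0 / kBlock] * sb[k0 / kBlock];  // promote + rescale
    }
    return acc;
}

int main() {
    // Inputs far from 1.0; the per-block scales bring the quantized values
    // back to 1.0, where e4m3 represents them exactly.
    std::vector<float> a(256, 0.01f), b(256, 300.0f);
    std::vector<float> sa = {0.01f, 0.01f}, sb = {300.0f, 300.0f};
    printf("%f\n", dot_two_level(a, b, sa, sb));  // ~768 = 256 * 0.01 * 300
    return 0;
}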
#67 Re: It's over, DeepSeek has open-sourced the GPU FP8 assembly code.
supercnm wrote (Feb 26, 2025, 03:10): My sense is that the main thing it provides is a line of thought. Nvidia, the hardware company, has a software division so weak that there is a lot of room left to optimize.

Let's see whether this can rally the industry into a new standard to take on NVIDIA; that way the fortress splits from inside the empire.
What Jensen should do now is rush instruction-level optimization for the new Blackwell chips, or else encrypt Blackwell's instructions so nobody else can optimize them.
Of course, the US could always decree that vendors like Intel and AMD must not use such a standard.
Caravel · Forum Elder · Caravel's blog · Interactions: 707 · Posts: 27716 · Joined: Jul 24, 2022, 17:21
#69 Re: It's over, DeepSeek has open-sourced the GPU FP8 assembly code.
drifter wrote (Feb 26, 2025, 04:20): Let's see whether this can rally the industry into a new standard to take on NVIDIA; that way the fortress splits from inside the empire. Of course, the US could always decree that vendors like Intel and AMD must not use such a standard.

FP8 doesn't look like the righteous path to me,
much like the memory-saving tricks on early PCs.
-
huangchong(净坛使者)
- 论坛元老

2023-24年度优秀版主 - 帖子互动: 4169
- 帖子: 61535
- 注册时间: 2022年 7月 22日 01:22
#71 Re: It's over, DeepSeek has open-sourced the GPU FP8 assembly code.
xexz wrote (Feb 25, 2025, 23:21): On top sits a JIT virtual machine; the lower layer uses NV GPU assembly. Meaning: as long as another vendor's GPU implements NV's assembly (for those GPU hardware makers that is hardly a requirement at all), it runs the same no matter whose GPU you use.

Then explain this: why does it require sm_90?
#73 Re: It's over, DeepSeek has open-sourced the GPU FP8 assembly code.
That refers to the hardware platforms supported right now, as of today.
Going forward, AMD, Intel, Huawei, Moore Threads, Cambricon, and the rest can all supply their own "hardware platform" to back this "JIT VM" (much like porting the Java virtual machine from x86 to ARM); the code only went open source today, so they simply haven't had time yet. (There is talk that dedicated ASICs dropping the GPU architecture entirely will appear in the first half of this year; a GPU is not a compute structure specially optimized for tensor math, so the efficiency gain there would be even larger.)
Besides, even if the US executive or legislature bars AMD and Intel from supporting it,
that holds nothing up; nobody argues with money.
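To make the "JIT VM on top, NV assembly below" layering concrete, and to connect it to the sm_90 question above, here is a hedged sketch, not DeepSeek's actual build path: a runtime compiler such as NVRTC is handed kernel source that embeds wgmma PTX, and that PTX assembles only for Hopper-class targets (sm_90a), which is one concrete reason the current code path demands sm_90.

// Minimal NVRTC-based JIT probe (illustrative; build with: nvcc probe.cpp -lnvrtc).
#include <cstdio>
#include <nvrtc.h>

int main() {
    // Kernel source embedding Hopper-only PTX, the way the wgmma wrappers do.
    const char* src =
        "extern \"C\" __global__ void probe() {\n"
        "    asm volatile(\"wgmma.fence.sync.aligned;\");\n"
        "}\n";
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "probe.cu", 0, nullptr, nullptr);
    // Target Hopper's arch-specific feature set; wgmma exists nowhere else.
    const char* opts[] = {"--gpu-architecture=compute_90a"};
    nvrtcResult rc = nvrtcCompileProgram(prog, 1, opts);
    printf("compile for compute_90a: %s\n", nvrtcGetErrorString(rc));
    // Point the toolchain at anything below sm_90a instead and the wgmma PTX
    // fails to assemble: the instruction does not exist before Hopper.
    nvrtcDestroyProgram(&prog);
    return 0;
}

Swapping the back end, as the post suggests other vendors could, would mean replacing this compile step and the assembly beneath it while keeping the JIT layer's interface intact.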
fangkuuaih · Forum Elder · Interactions: 1043 · Posts: 22584 · Joined: Jul 22, 2022, 09:19
#74 Re: It's over, DeepSeek has open-sourced the GPU FP8 assembly code.
What is so magical about C++ with embedded assembly? The Linux kernel embeds assembly in C all the time; you only need to know a handful of instructions.
There is no way NVIDIA's CUDA engineers don't know it or don't use it.
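For anyone who hasn't seen the mechanism, here is a minimal, runnable CUDA sketch of the same C++-with-inline-PTX technique the quoted header uses; the "=f"/"f" register constraints are the same machinery as the "+f"/"l"/"r" constraints in the wgmma wrappers quoted further down. Illustrative only.

#include <cstdio>
#include <cuda_runtime.h>

// One PTX instruction issued through inline asm: r = a + b in .f32 registers.
__global__ void add_ptx(const float* a, const float* b, float* c) {
    float r;
    asm volatile("add.f32 %0, %1, %2;" : "=f"(r) : "f"(*a), "f"(*b));
    *c = r;
}

int main() {
    float ha = 1.5f, hb = 2.25f, hc = 0.0f;
    float *da, *db, *dc;
    cudaMalloc(&da, sizeof(float));
    cudaMalloc(&db, sizeof(float));
    cudaMalloc(&dc, sizeof(float));
    cudaMemcpy(da, &ha, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, &hb, sizeof(float), cudaMemcpyHostToDevice);
    add_ptx<<<1, 1>>>(da, db, dc);
    cudaMemcpy(&hc, dc, sizeof(float), cudaMemcpyDeviceToHost);
    printf("%.2f\n", hc);  // prints 3.75
    return 0;
}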
fangkuuaih · Forum Elder · Interactions: 1043 · Posts: 22584 · Joined: Jul 22, 2022, 09:19
#76 Re: It's over, DeepSeek has open-sourced the GPU FP8 assembly code.
In the Linux kernel, switching privilege levels, issuing coprocessor instructions, and the like all require embedded assembly, because C cannot express them.
Your uncle here was doing this back when he was still a junior engineer.
NVIDIA has piles of OS kernel engineers and hardware engineers; this stuff is child's play to them.
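A runnable user-space cousin of the same point (the privileged cases in the post naturally need ring 0): plain C/C++ has no expression for an instruction like rdtsc, so you drop into inline assembly exactly as kernel code does. Minimal sketch, x86-64, GCC/Clang asm syntax; illustrative, not taken from any kernel tree.

#include <cstdint>
#include <cstdio>

// Read the x86-64 time-stamp counter; no pure-C equivalent exists.
static inline uint64_t rdtsc() {
    uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));  // EDX:EAX <- TSC
    return (static_cast<uint64_t>(hi) << 32) | lo;
}

int main() {
    printf("tsc = %llu\n", static_cast<unsigned long long>(rdtsc()));
    return 0;
}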
#77 Re: It's over, DeepSeek has open-sourced the GPU FP8 assembly code.
Having done it is not the same as being able to write it.
Everyone uses Chinese, but that doesn't make you as brilliant as Mo Yan.

fangkuuaih wrote (Feb 26, 2025, 07:02): In the Linux kernel, switching privilege levels, issuing coprocessor instructions, and the like all require embedded assembly, because C cannot express them. Your uncle here was doing this back when he was still a junior engineer. NVIDIA has piles of OS kernel engineers and hardware engineers; this stuff is child's play to them.
#79 Re: It's over, DeepSeek has open-sourced the GPU FP8 assembly code.
Asking out of ignorance:
doesn't this #include <cuda.h> mean it is still built on CUDA,
and hasn't actually bypassed it?
xexz wrote (Feb 25, 2025, 23:11):
#pragma once
#include <cuda.h>
#include "utils.cuh"
namespace deep_gemm {
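// Each SM90_64xNx32_F32E4M3E4M3_SS struct below wraps one fixed shape of the
// Hopper (sm_90a) warpgroup instruction wgmma.mma_async: M = 64 and K = 32
// are constant while N steps through 16, 24, 32, ... The two 64-bit "l"
// operands are shared-memory descriptors for the FP8 e4m3 A and B tiles (the
// SS suffix means both operands come from shared memory); the "+f" operands
// are FP32 accumulators that stay resident in registers; the predicate built
// from scale_d selects accumulate (D = A*B + D) versus overwrite (D = A*B).
// kNumAccum = M*N/128 is the accumulator count per thread across the
// 128-thread warpgroup.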
struct SM90_64x16x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %10, 0;\n"
"wgmma.mma_async.sync.aligned.m64n16k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7},"
" %8,"
" %9,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 16;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x24x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %14, 0;\n"
"wgmma.mma_async.sync.aligned.m64n24k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11},"
" %12,"
" %13,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 24;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x32x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %18, 0;\n"
"wgmma.mma_async.sync.aligned.m64n32k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15},"
" %16,"
" %17,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 32;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x40x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %22, 0;\n"
"wgmma.mma_async.sync.aligned.m64n40k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19},"
" %20,"
" %21,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 40;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x48x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %26, 0;\n"
"wgmma.mma_async.sync.aligned.m64n48k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23},"
" %24,"
" %25,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 48;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x56x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %30, 0;\n"
"wgmma.mma_async.sync.aligned.m64n56k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27}, "
" %28,"
" %29,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 56;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x64x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %34, 0;\n"
"wgmma.mma_async.sync.aligned.m64n64k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31}, "
" %32,"
" %33,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 64;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x72x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %38, 0;\n"
"wgmma.mma_async.sync.aligned.m64n72k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35}, "
" %36,"
" %37,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 72;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x80x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %42, 0;\n"
"wgmma.mma_async.sync.aligned.m64n80k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39}, "
" %40,"
" %41,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 80;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
struct SM90_64x88x32_F32E4M3E4M3_SS {
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b,
float& d00, float& d01, float& d02, float& d03, float& d04, float& d05, float& d06, float& d07,
float& d08, float& d09, float& d10, float& d11, float& d12, float& d13, float& d14, float& d15,
float& d16, float& d17, float& d18, float& d19, float& d20, float& d21, float& d22, float& d23,
float& d24, float& d25, float& d26, float& d27, float& d28, float& d29, float& d30, float& d31,
float& d32, float& d33, float& d34, float& d35, float& d36, float& d37, float& d38, float& d39,
float& d40, float& d41, float& d42, float& d43,
bool scale_d) {
asm volatile("{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %46, 0;\n"
"wgmma.mma_async.sync.aligned.m64n88k32.f32.e4m3.e4m3"
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15, "
" %16, %17, %18, %19, %20, %21, %22, %23, "
" %24, %25, %26, %27, %28, %29, %30, %31, "
" %32, %33, %34, %35, %36, %37, %38, %39, "
" %40, %41, %42, %43}, "
" %44,"
" %45,"
" p , 1, 1;\n"
"}\n"
: "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
"+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43)
: "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_d)));
}
__device__ static void wgmma(uint64_t const& desc_a, uint64_t const& desc_b, float* d, bool scale_d) {
wgmma(desc_a, desc_b,
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
d[8], d[9], d[10], d[11], d[12], d[13], d[14], d[15],
d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
d[40], d[41], d[42], d[43],
scale_d);
}
static constexpr int M = 64;
static constexpr int N = 88;
static constexpr int K = 32;
static constexpr int kNumAccum = M * N / 128;
};
#80 Re: It's over, DeepSeek has open-sourced the GPU FP8 assembly code.
These 400 lines of code amount to drilling a dog-hole through the high wall built to besiege Huawei.
Whether this becomes Intel's and AMD's opportunity is hard to say,
but at the very least Huawei can use these 400 lines to compete with the A100, and maybe even the H100.