新未名空间

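Back in the day there was a Teacher Wei on 买买提 (MITBBS) who also knew hardware inside out. He made a bet with goodegg that railway ticket sales could be handled with just a few servers, and he won.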

For example, when training V3 with NVIDIA’s H800 GPUs, DeepSeek customized parts of the GPU’s core computational units, called SMs (Streaming Multiprocessors), to suit their needs. Out of 132 SMs, they allocated 20 exclusively for server-to-server communication tasks instead of computational tasks.
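To put the 20-out-of-132 figure in perspective, here is a quick back-of-the-envelope sketch in Python. It only does the arithmetic on the two numbers quoted above; the function name `split_sms` and the dictionary keys are made up for illustration and say nothing about how DeepSeek actually partitions the SMs.

```python
# Figures quoted in the post: 132 SMs in total, 20 reserved for communication.
TOTAL_SMS = 132
COMM_SMS = 20


def split_sms(total: int, comm: int) -> dict:
    """Return the compute/communication split and each side's share of the GPU.

    Hypothetical helper for illustration only; it is plain arithmetic,
    not a model of any real scheduling mechanism.
    """
    if not 0 <= comm <= total:
        raise ValueError("comm must be between 0 and total")
    compute = total - comm
    return {
        "compute_sms": compute,
        "comm_sms": comm,
        "compute_share": compute / total,
        "comm_share": comm / total,
    }


split = split_sms(TOTAL_SMS, COMM_SMS)
print(
    f"{split['compute_sms']} SMs compute ({split['compute_share']:.1%}), "
    f"{split['comm_sms']} SMs communication ({split['comm_share']:.1%})"
)
# prints: 112 SMs compute (84.8%), 20 SMs communication (15.2%)
```

So roughly 15% of the SMs are taken away from computation and handed to communication, which is a sizable trade.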

This customization was carried out at the PTX (Parallel Thread Execution) level, a low-level instruction set for NVIDIA GPUs. PTX operates at a level close to assembly language, allowing for fine-grained optimizations such as register allocation and thread/warp-level adjustments. However, such detailed control is highly complex and difficult to maintain. This is why higher-level programming languages like CUDA are typically used, as they generally provide sufficient performance optimization for most parallel programming tasks without requiring lower-level modifications.

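It's just a parallel sparse-matrix computation problem.

If you are familiar with parallel programming on the early supercomputers, it is easy to understand.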

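That crappy bit of latency optimization the high-frequency trading crowd does? Huawei's people could walk in and crush it right away.

For people who work on backbone core routers and switches all day, doing some damn CPU offloading is child's play.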

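Teacher Wei spent more than ten years building his smart home system, and I don't know whether he ever sold a single unit. He probably isn't short of money, but that stretch of his life really does seem a bit wasted.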

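Caravel wrote (January 26, 2025, 13:27): Back in the day there was a Teacher Wei on 买买提 (MITBBS) who also knew hardware inside out. He made a bet with goodegg that railway ticket sales could be handled with just a few servers, and he won.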


弃婴千枝 wrote (January 26, 2025, 13:36): It's just a parallel sparse-matrix computation problem.


This is all driver-level stuff; most people would never optimize it.


High-frequency trading programmers are very good at hardware optimization
