Source: Quantum Bits
On the third day of Open Source Week, DeepSeek revealed the "driving force" behind V3/R1 training and inference -
DeepGEMM: an FP8 GEMM (general matrix multiplication) library that supports both dense and Mixture-of-Experts (MoE) matrix multiplication.
Let's take a quick look at GEMM first. GEMM, general matrix multiplication, is a fundamental operation in linear algebra, a regular in scientific computing, machine learning, deep learning, and related fields, and it sits at the heart of many high-performance computing tasks.
Because its computational cost is usually large, optimizing GEMM performance is crucial.
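As a minimal illustration of the operation itself, here is a small sketch in plain NumPy; the shapes and the alpha/beta constants are arbitrary choices for demonstration, not anything specific to DeepGEMM:

import numpy as np

# Minimal sketch of what a GEMM computes: C = alpha * (A @ B) + beta * C.
# Shapes are arbitrary; real deep-learning workloads use far larger matrices,
# which is why optimized kernels matter so much.
M, K, N = 64, 128, 32
alpha, beta = 1.0, 0.0

A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)
C = np.zeros((M, N), dtype=np.float32)

C = alpha * (A @ B) + beta * C  # the general matrix multiplication
print(C.shape)  # (64, 32)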
With this DeepGEMM release, DeepSeek again keeps to its "high performance + low cost" character. The highlights are as follows:
High performance: on Hopper-architecture GPUs, DeepGEMM can reach up to 1350+ FP8 TFLOPS.
Simplicity: the core logic is only about 300 lines of code, yet it outperforms expert-tuned kernels.
Just-in-time compilation (JIT): a fully JIT approach means optimized code is generated dynamically at runtime, adapting to different hardware and matrix sizes.
No heavy dependencies: the library is designed to be very lightweight, with no complex dependencies, which keeps deployment and use simple.
Multiple matrix layouts: a dense layout and two MoE layouts, so it fits different application scenarios, including but not limited to deep-learning Mixture-of-Experts models.
Simply put, DeepGEMM is mainly used to accelerate matrix operations in deep learning, especially large-scale model training and inference. It is particularly suited to scenarios where compute resources are at a premium, and can significantly improve computing efficiency there.
Many netizens were quite taken with this open-source release. Some compared DeepGEMM to a superhero of the math world, saying it is faster than a speeding calculator and more powerful than a polynomial equation.
Others likened the release of DeepGEMM to a quantum state collapsing into a stable new reality, praising its clean and tidy just-in-time compilation.
Of course... some are starting to worry about the Nvidia stock they hold...
In-depth understanding of DeepGEMM
DeepGEMM is a library purpose-built for clean and efficient FP8 general matrix multiplications (GEMMs) with fine-grained scaling, a design that originated in DeepSeek-V3.
It handles both ordinary general matrix multiplication and grouped general matrix multiplication for MoE models.
The library is written in CUDA and requires no compilation at install time, because it compiles all kernels at runtime through a lightweight just-in-time (JIT) module.
At present, DeepGEMM supports only NVIDIA Hopper tensor cores.
To work around the limited accumulation precision of the FP8 tensor cores, it uses CUDA-core two-level accumulation (promotion).
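The rough idea behind such promotion can be sketched in plain Python/NumPy. This is only an illustration of the concept: NumPy has no FP8 type, so float16 stands in for the limited-precision accumulator, and the block size of 128 is an arbitrary choice, not DeepGEMM's actual promotion interval:

import numpy as np

# Sketch of two-level accumulation: accumulate short blocks in a low-precision
# accumulator, then fold ("promote") each block's partial sum into a
# full-precision accumulator. float16 stands in for the limited-precision path.
def dot_with_promotion(a: np.ndarray, b: np.ndarray, block: int = 128) -> float:
    acc_hi = np.float32(0.0)                  # high-precision (promoted) accumulator
    for start in range(0, a.size, block):
        acc_lo = np.float16(0.0)              # low-precision in-block accumulator
        for x, y in zip(a[start:start + block], b[start:start + block]):
            acc_lo = np.float16(acc_lo + x * y)
        acc_hi = np.float32(acc_hi + np.float32(acc_lo))  # promote the partial sum
    return float(acc_hi)

rng = np.random.default_rng(0)
a = rng.standard_normal(4096).astype(np.float16)
b = rng.standard_normal(4096).astype(np.float16)

naive = np.float16(0.0)
for x, y in zip(a, b):
    naive = np.float16(naive + x * y)         # everything stays low precision

print("reference (float64) :", float(a.astype(np.float64) @ b.astype(np.float64)))
print("naive low precision :", float(naive))
print("with promotion      :", dot_with_promotion(a, b))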
While DeepGEMM borrows some ideas from CUTLASS and CuTe, it avoids heavy reliance on their templates or algebraic abstractions.
Instead, the library is deliberately concise: there is only one core kernel function, at roughly 300 lines of code.
This makes it a clean, approachable resource for learning FP8 matrix multiplication and optimization techniques on the Hopper architecture.
DeepGEMM's performance matches or exceeds expert-tuned libraries across various matrix shapes.
So how does it actually perform?
The team tested all matrix shapes that might appear in DeepSeek-V3/R1 inference (including prefilling and decoding, but without tensor parallelism) on the H800 using NVCC 12.8.
The figure below shows the performance of normal DeepGEMM for dense models:
From the test results, DeepGEMM reaches up to 1358 TFLOPS of compute and up to 2668 GB/s of memory bandwidth.
In terms of speedup, it reaches up to 2.7x over an optimized implementation based on CUTLASS 3.6.
Next, the performance of DeepGEMM's grouped GEMMs with the contiguous layout for MoE models:
And the performance with the masked layout for MoE models is as follows:
How to use it?
To use DeepGEMM, you need to pay attention to several dependencies (a quick environment-check sketch follows the list), including:
A GPU supporting the Hopper architecture (sm_90a).
Python 3.8 and above.
CUDA 12.3 and above (12.8 recommended).
PyTorch 2.1 and above.
CUTLASS 3.6 and above.
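As a hedged sanity check against this list, something like the following sketch can be used. The version comparisons are intentionally rough, the sm_90a requirement is approximated by checking for compute capability 9.0, and CUTLASS is not checked here since it comes in through the repo's submodule step below:

import sys
import torch

# Rough environment check mirroring the dependency list above; version parsing
# is approximate and only meant as a quick first pass.
assert sys.version_info >= (3, 8), "Python 3.8+ required"
torch_ver = tuple(int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
assert torch_ver >= (2, 1), "PyTorch 2.1+ required"
assert torch.version.cuda is not None, "A CUDA build of PyTorch is required"
cuda_ver = tuple(int(x) for x in torch.version.cuda.split(".")[:2])
assert cuda_ver >= (12, 3), "CUDA 12.3+ required (12.8 recommended)"
assert torch.cuda.is_available(), "No CUDA device visible"
# sm_90a approximated by compute capability (9, 0), i.e. a Hopper-class GPU.
assert torch.cuda.get_device_capability() >= (9, 0), "A Hopper GPU (sm_90a) is required"
print("Environment looks compatible with the listed requirements.")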
The development commands are as follows:
# Submodule must be cloned
git clone --recursive git@github.com:deepseek-ai/DeepGEMM.git

# Make symbolic links for third-party (CUTLASS and CuTe) include directories
python setup.py develop

# Test JIT compilation
python tests/test_jit.py

# Test all GEMM implementations (normal, contiguous-grouped and masked-grouped)
python tests/test_core.py
The installation code is as follows:
python setup.py install
After the above steps, just import deep_gemm into your Python project.
In terms of interfaces: for ordinary GEMMs, call the deep_gemm.gemm_fp8_fp8_bf16_nt function, which supports the NT format (non-transposed LHS and transposed RHS).
For grouped GEMMs, use m_grouped_gemm_fp8_fp8_bf16_nt_contiguous for the contiguous layout and m_grouped_gemm_fp8_fp8_bf16_nt_masked for the masked layout.
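A minimal way to confirm these entry points are present after installation is a name check like the sketch below. Only the function names (as given in the article) are verified; their exact argument layouts are not described here, so consult the repo and its tests for the real calling conventions:

import deep_gemm

# Check that the entry points named above are exposed by the installed package.
# Names only; see the repo's tests for calling conventions and FP8 quantization helpers.
for fn in (
    "gemm_fp8_fp8_bf16_nt",
    "m_grouped_gemm_fp8_fp8_bf16_nt_contiguous",
    "m_grouped_gemm_fp8_fp8_bf16_nt_masked",
):
    print(fn, "->", "available" if hasattr(deep_gemm, fn) else "missing")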
DeepGEMM also provides utility functions, such as setting the maximum number of SMs and obtaining the TMA alignment size, and supports environment variables such as DG_NVCC_COMPILER and DG_JIT_DEBUG.
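Since kernels are compiled at runtime, these variables can simply be set in the environment before deep_gemm is imported. The values below are placeholders (the nvcc path in particular is an assumption), and their exact semantics are documented in the repo:

import os

# Set JIT-related environment variables before importing deep_gemm so the
# runtime compiler picks them up. Values here are illustrative assumptions.
os.environ["DG_NVCC_COMPILER"] = "/usr/local/cuda-12.8/bin/nvcc"  # assumed nvcc location
os.environ["DG_JIT_DEBUG"] = "1"  # presumably enables JIT debug output; check the repo docs

import deep_gemm  # imported after setting the variables on purpose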
In addition, the DeepSeek team highlights several optimizations, including:
JIT design: all kernels are compiled at runtime, with nothing to compile at install time; the optimal block sizes and pipeline stages can be selected dynamically.
Fine-grained scaling: addresses FP8 precision through two-level CUDA-core accumulation, and supports non-power-of-two block sizes to improve SM utilization (see the sketch after this list).
FFMA SASS interleaving: improves performance by modifying the yield and reuse bits of FFMA SASS instructions.
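To see why block-size choice matters for SM utilization, here is a generic back-of-the-envelope sketch, not DeepGEMM's actual heuristic: a GEMM kernel launches roughly ceil(M/BM) * ceil(N/BN) thread blocks, and if the final "wave" of blocks fills only a fraction of the SMs, those SMs sit idle. A non-power-of-two block size can sometimes fill that last wave better. The 132-SM count and the matrix shape are assumptions chosen for illustration:

import math

# Generic illustration (not DeepGEMM's actual heuristic) of how block size
# affects how well the final wave of thread blocks fills the SMs.
def last_wave_utilization(m: int, n: int, block_m: int, block_n: int, num_sms: int) -> float:
    num_blocks = math.ceil(m / block_m) * math.ceil(n / block_n)
    remainder = num_blocks % num_sms
    return 1.0 if remainder == 0 else remainder / num_sms

M, N, NUM_SMS = 4096, 7168, 132  # 132 SMs assumed for a Hopper-class GPU
for block_m, block_n in [(128, 128), (64, 128), (112, 128)]:  # 112 = a non-power-of-two choice
    util = last_wave_utilization(M, N, block_m, block_n, NUM_SMS)
    print(f"block {block_m}x{block_n}: last-wave SM utilization = {util:.2f}")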
Interested readers can follow the GitHub link at the end of the article for details~
One More Thing
NVIDIA's stock has been falling over the past few days... well, falling again:
However, in the early morning of the 27th Beijing time, Nvidia is set to release its fourth-quarter results for fiscal year 2025, so we can look forward to how it performs~