"Source God" DeepSeek breaks through the performance limit of H800. FlashMLA is heavily open source, and the computing power cost can be reduced.

Source: Quantum Bits

DeepSeek's Open Source Week has kicked off with a disclosure of its cost-reduction method:

FlashMLA, which pushes the H800 straight past its expected compute ceiling.

Netizens: How is this possible?!

It is an efficient MLA decoding kernel for Hopper GPUs, optimized specifically for variable-length sequences and already in production use.

MLA (Multi-head Latent Attention) is the innovative attention architecture proposed by DeepSeek. Starting with V2, MLA has greatly reduced costs across the DeepSeek model series while keeping compute and inference performance on par with top-tier models.

According to the official introduction, with FlashMLA the H800 can reach 3000 GB/s of memory bandwidth in memory-bound configurations and 580 TFLOPS of compute performance in compute-bound configurations.

Netizens offered their praise: Huge respect to the engineering team for squeezing every FLOP out of Hopper's tensor cores. This is how we push LLM serving to the new frontier!

Some netizens have already put it to use.

Day one of Open Source Week: FlashMLA

The GitHub page is already live. Within just one hour of release, it had passed 1.2k stars.

This release includes:

BF16 support;

Paged KV cache with a block size of 64 (see the sketch below).
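For intuition about the paged KV cache, here is a minimal sketch, with illustrative dimensions and not taken from FlashMLA's actual code: logical token positions map through a per-sequence block table to fixed-size physical pages of 64 tokens, so cache memory can be allocated page by page rather than as one contiguous buffer per sequence.

```python
import torch

BLOCK_SIZE = 64  # tokens per KV-cache page, matching FlashMLA's block size

# Hypothetical pool of physical KV blocks: (num_blocks, BLOCK_SIZE, kv_dim)
num_blocks, kv_dim = 16, 576
kv_pool = torch.zeros(num_blocks, BLOCK_SIZE, kv_dim)

# Per-sequence block table: logical block index -> physical block index.
# A sequence of 150 tokens needs ceil(150 / 64) = 3 pages.
block_table = [7, 2, 11]  # physical blocks backing logical blocks 0, 1, 2

def kv_slot(pos: int):
    """Map a logical token position to its (physical block, offset) slot."""
    return block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

block, offset = kv_slot(130)  # token 130 lives in logical block 2
print(block, offset)          # -> 11, 2: physical block 11, offset 2
```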

Quick start (a usage sketch follows the requirements below):

Environment requirements:

Hopper GPU

CUDA 12.3 or later

PyTorch 2.0 or later
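The quick start amounts to installing the kernel and calling it per layer during decoding. Below is a sketch paraphrased from the FlashMLA README: the functions get_mla_metadata and flash_mla_with_kvcache come from the repo, but the shapes here are illustrative assumptions, and exact signatures should be verified against the current README.

```python
# Install and benchmark (from the repo root):
#   python setup.py install
#   python tests/test_flash_mla.py
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Illustrative values (assumptions for this sketch, not prescribed):
b, s_q, h_q, h_kv = 2, 1, 128, 1      # batch, query tokens, query/KV heads
d, dv = 576, 512                       # total head dim, value head dim
num_layers, block_size = 2, 64
max_blocks_per_seq = 4

cache_seqlens = torch.full((b,), 130, dtype=torch.int32, device="cuda")
block_table = torch.arange(b * max_blocks_per_seq, dtype=torch.int32,
                           device="cuda").view(b, max_blocks_per_seq)
num_blocks = b * max_blocks_per_seq

# Tile-scheduling metadata is computed once per batch from the cached
# sequence lengths and the query-to-KV head ratio.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv
)

for _ in range(num_layers):
    q_i = torch.randn(b, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
    kvcache_i = torch.randn(num_blocks, block_size, h_kv, d,
                            dtype=torch.bfloat16, device="cuda")
    # Paged-KV-cache attention for one layer of MLA decoding.
    o_i, lse_i = flash_mla_with_kvcache(
        q_i, kvcache_i, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True,
    )
```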

At the end of the project page, DeepSeek also notes that FlashMLA was inspired by the FlashAttention 2&3 and NVIDIA CUTLASS projects.

FlashAttention provides fast, memory-efficient exact attention and is used by mainstream large models. Its latest third generation pushes H100 utilization up to 75%: training speed improves by 1.5-2x, and FP16 compute throughput reaches up to 740 TFLOPS, about 75% of the theoretical maximum, making full use of the compute resources; previously only about 35% was achievable.
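For reference, FlashAttention is exposed in PyTorch as a drop-in exact-attention function. A minimal sketch using flash_attn_func from the flash-attn package follows; this detail is not from the article, so verify against the package docs.

```python
import torch
from flash_attn import flash_attn_func  # pip install flash-attn

# q, k, v: (batch, seqlen, num_heads, head_dim), fp16/bf16, on a CUDA device.
q = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")

# Exact (not approximate) attention, computed tile by tile in SRAM without
# materializing the full seqlen x seqlen score matrix in GPU memory.
out = flash_attn_func(q, k, v, causal=True)  # -> (2, 1024, 16, 64)
```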

Its core author is Tri Dao, a star researcher from Princeton and chief scientist of Together AI.

NVIDIA CUTLASS is a collection of CUDA C++ template abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related calculations at all levels and scales within CUDA.

MLA, the foundational architecture of DeepSeek

Lastly, let's talk about MLA, the Multi-head Latent Attention mechanism that serves as the foundational architecture of the DeepSeek model series. It is designed to optimize the inference efficiency and memory usage of Transformer models while maintaining model performance.

It uses low-rank joint compression to project the key and value matrices of multi-head attention into a low-dimensional latent space, significantly reducing the storage needed for the key-value cache (KV Cache). This matters especially in long-sequence processing: traditional methods must store the complete KV matrices, whereas MLA retains only the essential information through compression.
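A minimal sketch of the idea, with illustrative dimensions rather than DeepSeek's actual code: instead of caching full per-head keys and values, cache one small latent vector per token and reconstruct K and V from it on the fly.

```python
import torch
import torch.nn as nn

d_model, n_heads, d_head, d_latent = 4096, 32, 128, 512  # illustrative sizes

down = nn.Linear(d_model, d_latent, bias=False)           # joint KV down-projection
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct keys
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct values

h = torch.randn(1, 10, d_model)  # hidden states for 10 tokens

# Cache only the compressed latent: d_latent = 512 floats per token,
# vs. 2 * n_heads * d_head = 8192 floats per token for standard MHA
# (512 / 8192 = 6.25%, inside the 5%-13% range reported for V2).
c_kv = down(h)

k = up_k(c_kv).view(1, 10, n_heads, d_head)
v = up_v(c_kv).view(1, 10, n_heads, d_head)
```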

In V2, this innovative architecture cut GPU memory usage to just 5%-13% of that of the previously dominant MHA architecture, achieving a substantial cost reduction. Its inference cost is only 1/7 that of Llama 3 70B and 1/70 that of GPT-4 Turbo.

In V3, the cost reduction and speedup are even more pronounced, which is what drew global attention to DeepSeek.

Also today, DeepSeek-R1 passed 10,000 likes on Hugging Face, making it the most popular large model among the nearly 1.5 million models on the platform.

Hugging Face's CEO posted to share the good news.

The whale is making waves!

Alright, stay tuned: what will be released over the next four days?
