What exactly is DeepSeek's second open-source trump card?

Image source: Generated by Unbounded AI

On February 25, DeepSeek, which has been handing out open-source gifts, played another trump card: it open-sourced DeepEP, described as the world's first full-stack communication library for MoE (mixture-of-experts) models. Because it directly addresses the industry's anxiety over AI computing power, the repository quickly gained 1,500 stars on GitHub (a measure of how many users have bookmarked it) and set the community abuzz, which shows how much it matters.

Many people are curious about what DeepEP actually does. Imagine a delivery hub on Double Eleven (Singles' Day): 2,048 couriers (GPUs) are frantically hauling packages (AI data) between 200 warehouses (servers). A traditional transport system is like having the couriers deliver on tricycles, whereas DeepEP equips the whole crew with a "maglev + quantum teleportation" kit so that information moves stably and efficiently.

Feature 1: Rewriting the rules of transport

On a conference call on August 29, 2024, Jensen Huang (Huang Renxun) specifically emphasized NVLink, a technology developed by NVIDIA for direct GPU-to-GPU interconnection with bidirectional transfer speeds of up to 1.8 TB/s. He highlighted its low latency, high throughput, and suitability for large language models, and called it one of the key technologies driving the development of large models.

However, the Chinese team has now taken this much-touted NVLink technology to a new level. The secret of DeepEP lies in its optimization for NVLink: couriers in the same warehouse move goods over maglev tracks at up to 158 containers per second (GB/s), which is like shrinking the trip from Beijing to Shanghai to the time it takes to sip water.

The second piece of "black technology" is its low-latency kernel built on RDMA. Picture cargo being "quantum teleported" directly between warehouses in different cities: each aircraft (network card) carries 47 containers per second (GB/s), and the aircraft can be loaded while in flight, so computation and communication overlap and idle waiting disappears.
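To put the "loading while flying" idea into numbers, here is a minimal back-of-the-envelope sketch. It is not DeepEP code: the function names, the chunk count, and the per-chunk timings are hypothetical, and only the two bandwidth figures (158 GB/s and 47 GB/s) come from the article. It models a simple two-stage pipeline in which one chunk is being sent while the next one is being computed.

```python
# Bandwidth figures quoted in the article (GB/s); everything else is assumed.
NVLINK_GB_PER_S = 158.0
RDMA_GB_PER_S = 47.0


def transfer_ms(num_bytes, bandwidth_gb_per_s):
    """Time in milliseconds to move num_bytes at the given bandwidth."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


def serial_ms(compute_ms, comm_ms, chunks):
    """No overlap: every chunk is computed, then sent, one after another."""
    return chunks * (compute_ms + comm_ms)


def overlapped_ms(compute_ms, comm_ms, chunks):
    """Two-stage pipeline: chunk i is sent while chunk i + 1 is computed."""
    return compute_ms + comm_ms + (chunks - 1) * max(compute_ms, comm_ms)


if __name__ == "__main__":
    one_gb = 1e9
    print(f"1 GB over NVLink: {transfer_ms(one_gb, NVLINK_GB_PER_S):.2f} ms")
    print(f"1 GB over RDMA:   {transfer_ms(one_gb, RDMA_GB_PER_S):.2f} ms")
    # Hypothetical workload: 4 chunks, 2 ms compute and 3 ms transfer each.
    print("serial:    ", serial_ms(2.0, 3.0, 4), "ms")
    print("overlapped:", overlapped_ms(2.0, 3.0, 4), "ms")
```

In this toy model the overlapped schedule is bounded by the slower of the two stages, which is why hiding communication behind computation pays off most when the two take similar time.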

Feature 2: Intelligent sorting black technology: AI version of "the strongest brain"

When goods need to be distributed to different experts (sub-networks in the MoE model), traditional sorting staff must open each box to check it, whereas DeepEP's "dispatch-combine" system seems to know in advance. In training and inference prefilling mode, 4,096 data packets ride the intelligent conveyor belt at once and are automatically identified as same-city or cross-city items; in inference decoding mode, 128 expedited packages take the VIP channel and arrive in 163 microseconds, five times faster than human sorting. Dynamic track switching also kicks in, moving to instant-transfer mode at traffic peaks, so the system adapts to different scenarios.
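The dispatch-combine pattern itself is easy to illustrate on a single machine. The sketch below is a toy version in plain Python and makes no claim about DeepEP's actual API or kernels: the function names, the three "experts", and the routing table are all hypothetical. Dispatch groups each token by the experts it is routed to, the experts process their buckets, and combine sums the weighted results back in the original token order.

```python
from collections import defaultdict


def dispatch(tokens, routes):
    """Group tokens by destination expert.

    routes[i] is a list of (expert_id, weight) pairs for tokens[i].
    Returns {expert_id: [(token_index, value, weight), ...]}.
    """
    buckets = defaultdict(list)
    for idx, (value, route) in enumerate(zip(tokens, routes)):
        for expert_id, weight in route:
            buckets[expert_id].append((idx, value, weight))
    return buckets


def run_experts(buckets, experts):
    """Apply each expert function to the tokens that were sent to it."""
    return {
        expert_id: [(idx, experts[expert_id](value), weight)
                    for idx, value, weight in items]
        for expert_id, items in buckets.items()
    }


def combine(expert_outputs, num_tokens):
    """Sum the weighted expert outputs back into original token order."""
    combined = [0.0] * num_tokens
    for results in expert_outputs.values():
        for idx, value, weight in results:
            combined[idx] += weight * value
    return combined


# Hypothetical experts and routing, for illustration only.
experts = {0: lambda x: x + 1.0, 1: lambda x: x * 2.0, 2: lambda x: -x}
tokens = [1.0, 2.0, 3.0]
routes = [[(0, 0.5), (1, 0.5)], [(1, 1.0)], [(0, 0.25), (2, 0.75)]]

buckets = dispatch(tokens, routes)
outputs = combine(run_experts(buckets, experts), len(tokens))
print(outputs)
```

In a real cluster the buckets live on different GPUs, so dispatch and combine each become an all-to-all exchange; that exchange is the traffic a library like DeepEP is meant to speed up.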

Feature 3: FP8 "bone-shrinking technique"

Ordinary goods are shipped in standard boxes (FP32/FP16 format), whereas DeepEP can compress them into micro capsules (FP8 format), so the same truck can carry three times more goods. Even better, the capsules are automatically restored to their original format on arrival, saving both postage and time.

The system has been tested in DeepSeek's own warehouse (an H800 GPU cluster): same-city freight is three times faster, and cross-city latency has dropped below what humans can perceive. Most disruptive of all, it achieves truly "invisible" transmission: like a courier stuffing packages into a parcel locker while riding a bike, without ever breaking stride.

By open-sourcing this ace, DeepSeek has in effect published the blueprints of SF Express's unmanned sorting system. Heavy workloads that used to require 2,000 GPUs can now be handled comfortably by a few hundred.

Earlier, DeepSeek released the first item of its "Open Source Week": the code for FlashMLA (literally, fast multi-head latent attention), which is also one of the key technologies for reducing the cost of large-model training. To ease cost anxiety up and down the industry chain, DeepSeek is giving it everything.
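A rough sketch of what the capsule trick means in bytes, under stated assumptions. The article does not say which FP8 variant is used, so the rounding helper below assumes the common E4M3 layout (3 mantissa bits, largest finite value 448), and all function names are hypothetical. If the baseline is FP32, the same byte budget holds four times as many FP8 values, which matches the "three times more goods" figure. Note that the round trip restores the format, not every digit: values are rounded to the nearest representable FP8 number.

```python
import math

BYTES_PER_VALUE = {"FP32": 4, "FP16": 2, "FP8": 1}


def payload_bytes(num_values, fmt):
    """Bytes needed to ship num_values numbers in the given format."""
    return num_values * BYTES_PER_VALUE[fmt]


def round_to_fp8_e4m3(x):
    """Round x to the nearest value representable in FP8 E4M3.

    Assumed variant: 3 mantissa bits, smallest normal exponent -6,
    largest finite value 448, saturating on overflow.
    """
    if x == 0.0 or math.isnan(x):
        return x
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), 448.0)                # saturate instead of overflowing
    _, exponent = math.frexp(mag)           # mag = m * 2**exponent, m in [0.5, 1)
    leading = max(exponent - 1, -6)         # exponent of the leading bit
    step = 2.0 ** (leading - 3)             # spacing of neighbouring FP8 values
    return sign * round(mag / step) * step  # round() breaks ties to even


if __name__ == "__main__":
    n = 4096
    for fmt in ("FP32", "FP16", "FP8"):
        print(fmt, payload_bytes(n, fmt), "bytes")
    for value in (1.0, 0.3, -3.3, 1000.0):
        print(value, "->", round_to_fp8_e4m3(value))
```

Real systems usually pair FP8 payloads with per-block scaling factors to keep the rounding error small; that detail is omitted here.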

Earlier, You Yang, founder of Luchen Technology, posted on social media that "in the short term, China's MaaS model may be the worst business model." By his rough estimate, at an output of 100 billion tokens per day, a DeepSeek-based service would cost 450 million yuan per month in machines and lose 400 million yuan; with AMD chips, monthly revenue would be 45 million yuan against a monthly machine cost of 270 million yuan, so the loss would still exceed 200 million yuan.
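The arithmetic behind that estimate can be checked in a few lines. This sketch uses only the figures quoted above; the 30-day month and the function names are assumptions added for illustration, and the implied per-token price is a derived number, not one You Yang stated.

```python
TOKENS_PER_DAY = 100_000_000_000  # 100 billion tokens per day, as quoted
DAYS_PER_MONTH = 30               # assumption for illustration


def monthly_loss(machine_cost, revenue):
    """Monthly loss in million yuan: machine cost minus revenue."""
    return machine_cost - revenue


def implied_revenue(machine_cost, loss):
    """Monthly revenue in million yuan implied by a quoted cost and loss."""
    return machine_cost - loss


def yuan_per_million_tokens(revenue_million_yuan):
    """Average price implied by a monthly revenue at the quoted volume."""
    million_tokens = TOKENS_PER_DAY * DAYS_PER_MONTH / 1_000_000
    return revenue_million_yuan * 1_000_000 / million_tokens


if __name__ == "__main__":
    # AMD scenario: 45 million yuan revenue against 270 million yuan cost.
    print("AMD monthly loss:", monthly_loss(270, 45), "million yuan")
    # First scenario: 450 million yuan cost with a 400 million yuan loss.
    print("Implied revenue:", implied_revenue(450, 400), "million yuan")
    print("Implied price:", yuan_per_million_tokens(45), "yuan per million tokens")
```

The AMD case works out to a loss of 225 million yuan, consistent with "exceeds 200 million", and the first case implies revenue of roughly 50 million yuan, in line with the 45 million yuan revenue figure once rounding is allowed for.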
