Author: Liang Siqi Editor: Dong Yuqing
On February 25, DeepSeek, which has been handing out open-source gifts, played a trump card: it open-sourced DeepEP, the world's first full-stack communication library for MoE models. Because it directly addresses the industry's anxiety over AI computing power, the repository quickly gained 1,500 stars on GitHub (a count of how many users have bookmarked it), a sign of how much of a stir it caused in the community.
Many people are curious about what DeepEP actually does. Picture a courier depot on Double Eleven: 2,048 couriers (GPUs) are frantically moving packages (AI data) among 200 warehouses (servers). A traditional transport system is like having the couriers deliver by tricycle, whereas DeepEP equips the entire workforce with a "maglev plus quantum teleportation" kit so that information moves stably and efficiently.
Feature 1: Rewriting the rules of transport
On Nvidia's earnings call of August 29, 2024, Jensen Huang emphasized the importance of NVLink (an NVIDIA technology for direct GPU-to-GPU interconnection, with bidirectional bandwidth of up to 1.8 TB/s) for low latency, high throughput, and large language models, calling it one of the key technologies driving the development of large models.
Yet this much-praised NVLink technology has now been taken to a new level by the DeepSeek team. DeepEP's secret lies in its NVLink optimization: couriers within the same warehouse move goods over maglev tracks at up to 158 containers per second (GB/s), which is like cutting the trip from Beijing to Shanghai down to the time it takes to sip some water.
The second piece of cutting-edge technology is its set of low-latency kernels built on RDMA. Imagine goods being "quantum teleported" directly between warehouses in different cities: each aircraft (network card) carries 47 containers per second, and the aircraft can be loaded while it is flying, so computation and communication overlap and idle waiting disappears.
Feature 2: Intelligent sorting, the AI version of "the most powerful brain"
When goods need to be distributed to different experts (sub-networks in an MoE model), a traditional sorter has to unbox and inspect them one by one, whereas DeepEP's "dispatch-combine" system seems to have predictive ability. In training and prefill mode, 4,096 data packets ride the intelligent conveyor belt at the same time and are automatically identified as same-city or cross-city parcels; in inference decoding mode, 128 expedited packages take the VIP lane and are delivered in 163 microseconds, far faster than a human blink. It also uses dynamic track switching, changing transmission modes within seconds at traffic peaks to suit different scenarios.
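The dispatch-then-combine pattern described above can be illustrated with a minimal single-process sketch. This is not DeepEP's API: the function names `dispatch` and `combine`, the toy experts, and the routing weights are all hypothetical, and no GPU or network communication is involved. It only shows the bookkeeping: each token is sent to its chosen experts, and the expert outputs are summed back per token using the routing weights.

```python
from collections import defaultdict

def dispatch(tokens, routing):
    """Group token values by destination expert (hypothetical helper).

    tokens:  list of floats standing in for token activations
    routing: for each token, a list of (expert_id, weight) pairs
    Returns {expert_id: [(token_index, weight, value), ...]}.
    """
    buckets = defaultdict(list)
    for idx, (value, routes) in enumerate(zip(tokens, routing)):
        for expert_id, weight in routes:
            buckets[expert_id].append((idx, weight, value))
    return buckets

def combine(buckets, experts, num_tokens):
    """Run each expert on its bucket and sum weighted outputs per token."""
    out = [0.0] * num_tokens
    for expert_id, items in buckets.items():
        expert_fn = experts[expert_id]
        for idx, weight, value in items:
            out[idx] += weight * expert_fn(value)
    return out

# Toy experts: plain functions standing in for expert sub-networks.
experts = {
    0: lambda x: x + 1.0,
    1: lambda x: 2.0 * x,
    2: lambda x: -x,
}

tokens = [1.0, 2.0, 3.0]
# Assumed routing decisions; a real model would get these from a gating network.
routing = [
    [(0, 0.5), (1, 0.5)],
    [(1, 1.0)],
    [(0, 0.25), (2, 0.75)],
]

buckets = dispatch(tokens, routing)
result = combine(buckets, experts, len(tokens))
print(result)
```

In a real multi-GPU setting the buckets would live on different devices, so the dispatch and combine steps become all-to-all communication, which is the part a library such as DeepEP is meant to accelerate.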
Feature 3: The FP8 "bone-shrinking technique"
Ordinary goods ship in standard boxes (FP32/FP16 format), whereas DeepEP can compress cargo into micro capsules (FP8 format), so a truck can carry three times more cargo. Even better, the capsules automatically return to their original form once they reach their destination, saving both postage and time.
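The capacity claim follows from element sizes alone, as the short sketch below shows. It assumes 4, 2, and 1 bytes per element for FP32, FP16, and FP8; the hidden size of 7,168 is an illustrative assumption rather than a figure from this article, and the helper names are hypothetical. The sketch only counts bytes and does not model the rounding that a real FP8 cast introduces.

```python
# Bytes per element for each numeric format.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "fp8": 1}

def payload_bytes(num_tokens, hidden_size, fmt):
    """Size in bytes of a batch of token vectors stored in the given format."""
    return num_tokens * hidden_size * BYTES_PER_ELEMENT[fmt]

def capacity_gain(src_fmt, dst_fmt):
    """How many times more elements fit in the same space after conversion."""
    return BYTES_PER_ELEMENT[src_fmt] / BYTES_PER_ELEMENT[dst_fmt]

NUM_TOKENS = 4096   # batch size mentioned in the article
HIDDEN_SIZE = 7168  # assumed hidden dimension, for illustration only

for fmt in ("fp32", "fp16", "fp8"):
    size = payload_bytes(NUM_TOKENS, HIDDEN_SIZE, fmt)
    print(f"{fmt}: {size} bytes")

print("fp32 -> fp8 capacity gain:", capacity_gain("fp32", "fp8"))
print("fp16 -> fp8 capacity gain:", capacity_gain("fp16", "fp8"))
```

Going from FP32 to FP8 fits four times as many elements in the same space, which is the "three times more cargo" in the analogy; starting from FP16 the gain is a factor of two.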
The system has been measured in DeepSeek's own warehouse (an H800 GPU cluster): same-city freight speed increased threefold, and cross-city latency fell to a level that humans can hardly perceive. Most disruptive of all, it achieves true "invisible transmission": like a courier who slips packages into a parcel locker without getting off the bicycle, the whole process is seamless.
By open-sourcing this trump card, DeepSeek has in effect published the blueprints for SF Express's unmanned sorting system. Workloads that originally required 2,000 GPUs can now be handled comfortably with a few hundred.
Earlier, DeepSeek released the first result of its "Open Source Week": the code for FlashMLA (literally, fast multi-head latent attention), which is also one of the key technologies for reducing the cost of large-model training. To ease cost anxiety both upstream and downstream in the industry chain, DeepSeek is holding nothing back.
Previously, You Yang, the founder of Luchen Technology, posted on social media that "in the short term, the MaaS model may be the worst business model". By his rough estimate, if 100 billion tokens are output per day, the monthly machine cost of a DeepSeek service is 450 million yuan and the monthly loss is 400 million yuan; with AMD chips, monthly revenue is 45 million yuan and monthly machine cost is 270 million yuan, so the loss still exceeds 200 million yuan.
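The arithmetic behind these figures can be checked with a few lines. The sketch below assumes the 45 million yuan monthly revenue applies to both scenarios, which the estimate implies but does not state outright for the first case; the function and variable names are hypothetical.

```python
def monthly_loss(machine_cost_m, revenue_m):
    """Monthly loss in millions of yuan: machine cost minus revenue."""
    return machine_cost_m - revenue_m

REVENUE_M = 45  # monthly revenue in millions of yuan (assumed equal in both cases)

scenarios = {
    "first estimate": 450,  # monthly machine cost, millions of yuan
    "AMD chips": 270,       # monthly machine cost, millions of yuan
}

losses = {name: monthly_loss(cost, REVENUE_M) for name, cost in scenarios.items()}

for name, loss in losses.items():
    print(f"{name}: loss of {loss} million yuan per month")
```

Under that assumption the first case loses 405 million yuan a month, which rounds to the 400 million quoted, and the AMD case loses 225 million, consistent with "exceeds 200 million".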