DeepSeek
2025-01-03 14:02


After electric vehicles and consumer goods, a Chinese team has now put on a "cost butcher" show in the field of AI.

With two months and $6 million, can you train an AI model that competes with ChatGPT? DeepSeek has shown in practice what it means to "move a thousand pounds with four ounces of force."

DeepSeek, a subsidiary of the quantitative fund High-Flyer, announced and simultaneously open-sourced the first version of its new model series, DeepSeek-V3. The team used only 2,048 H800 GPUs and two months to train DeepSeek-V3, a model with 671 billion parameters. By comparison, Meta's Llama 3, with 405 billion parameters, was trained on 16,384 of the more powerful H100 GPUs over 54 days. DeepSeek's training efficiency is roughly 11 times higher.

Even CNBC couldn't sit still. In a recent report, a reporter exclaimed after testing the model personally: "The capabilities of this model can compete with OpenAI."

The attention and discussion DeepSeek-V3 has received in tech circles is comparable to what "Black Myth: Wukong" received in gaming. Its influence even left OpenAI CEO Sam Altman unable to sit still, prompting a thinly veiled tweet to the effect that "copying is always easier than innovating." The market has also begun to worry: if everyone can train AI at such a low cost, the "shovel sellers" who make a fortune selling graphics cards should panic, and Nvidia's stock price did indeed dip in response.

However, Andrej Karpathy, a founding member of OpenAI, noted that this does not mean cutting-edge LLMs no longer require large GPU clusters; rather, it shows there are still plenty of untapped gains waiting to be mined from data and algorithms in the AI field.

So how does DeepSeek achieve such remarkable training efficiency? The answer lies in its distinctive technical choices.

1. Less is more: DeepSeek-V3’s new method for efficient AI training


DeepSeek-V3's training efficiency reveals a clever approach: the key is to work smarter, not simply to throw more hardware at the problem.

Specifically, DeepSeek used a cluster of 2,048 Nvidia H800 GPUs, with GPUs inside each node connected via NVLink and nodes connected to one another via InfiniBand. In this configuration, communication between GPUs within a node is quite fast, but communication between nodes is not, so optimization is the key to performance and efficiency. DeepSeek implemented dozens of optimization techniques to reduce DeepSeek-V3's computational requirements, but several key technologies contributed most to its impressive results, including:

MoE

Unlike a single huge neural network, DeepSeek-V3 adopts a Mixture of Experts (MoE) architecture. The core idea of MoE can be understood this way: a group of experts from different fields work together to solve problems. Given a user's task, the system intelligently routes it to the most suitable experts, and the sparse activation mechanism greatly reduces the amount of computation.

There is a significant difference in training cost between MoE and dense models. Although MoE models usually contain more parameters, their sparse activation means only a subset of the expert networks is active at any time, which yields greater model capacity and higher performance for the same compute budget. This makes an MoE model more efficient to pre-train than a dense model of the same scale, achieving similar or better performance at a lower computational cost.

DeepSeek-V3 uses an MoE design built from many small experts, rather than a handful of large experts as in Mixtral. This lets the model reach 671B total parameters while activating only 37B of them for any given token, greatly increasing the model's sparsity.
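To make the sparse-activation idea concrete, here is a minimal, illustrative sketch of top-k expert routing in PyTorch. It is not DeepSeek-V3's actual implementation (which uses many finer-grained experts, shared experts, and custom kernels); the layer sizes and the `TinyMoELayer` name are invented for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Toy top-k expert routing: each token is processed by only k of the
    n_experts feed-forward networks, so compute scales with k, not n_experts."""
    def __init__(self, d_model=64, d_hidden=256, n_experts=16, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # scores each expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.router(x)                        # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)     # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e               # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = TinyMoELayer()
print(layer(torch.randn(8, 64)).shape)                 # torch.Size([8, 64])
```

Even though the layer holds sixteen experts' worth of parameters, each token only pays the compute cost of two of them; that is the property that lets a 671B-parameter model activate only 37B parameters per token.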

MLA

Another innovation of DeepSeek-V3 is Multi-head Latent Attention (MLA), which is an enhanced version of the attention mechanism commonly used in large language models.

MLA is an original DeepSeek design, first proposed in DeepSeek-V2. Its core idea can be understood like this: when we read complex content, the brain not only processes each word but also captures the connections and implications behind them. MLA lets DeepSeek-V3 similarly attend to different parts of the information at once to build a richer understanding, which is especially useful when connecting scattered pieces of information, such as solving a complex math problem or writing code.
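At a more technical level, the core trick described in the DeepSeek-V2 report is to compress keys and values into a small shared latent vector, so the attention cache stores far fewer numbers per token. Below is a simplified sketch of just that compression step; it omits MLA's query compression and decoupled rotary-position components, and the dimensions and the `LatentKVCompression` name are illustrative.

```python
import torch
import torch.nn as nn

class LatentKVCompression(nn.Module):
    """Sketch of MLA's key idea: project hidden states down to a small latent,
    cache only the latent, and reconstruct per-head keys/values from it."""
    def __init__(self, d_model=512, n_heads=8, d_head=64, d_latent=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)          # compression
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False) # key expansion
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False) # value expansion

    def forward(self, h):                     # h: (batch, seq, d_model)
        latent = self.down(h)                 # only this small tensor needs caching
        k = self.up_k(latent)                 # reconstructed per-head keys
        v = self.up_v(latent)                 # reconstructed per-head values
        return latent, k, v

m = LatentKVCompression()
latent, k, v = m(torch.randn(2, 10, 512))
# The cache stores 64 numbers per token instead of 2 * 8 * 64 = 1024.
print(latent.shape, k.shape)                  # (2, 10, 64) (2, 10, 512)
```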

FP8

The Nvidia H800 is a version customized for the Chinese market, with significantly reduced performance compared with the H100 it is derived from. In particular, the H800 caps the interconnect bandwidth between cards at roughly 400 GB/s, whereas the H100 can reach up to 900 GB/s.

This bottleneck makes reducing computation and communication the key to lowering training costs. DeepSeek uses an FP8 mixed-precision framework to achieve faster computation and lower memory usage without sacrificing numerical stability. Key operations such as matrix multiplications are performed in FP8, while sensitive parts such as the embedding and normalization layers are kept in higher precision (BF16 or FP32) to preserve accuracy. This approach cuts memory requirements while maintaining robust accuracy, with the relative training-loss error kept within 0.25%.

The use of FP8 precision in DeepSeek-V3 is a major innovation: V3 is the first open-source large-parameter MoE model successfully trained with FP8 mixed precision. This means it needs less memory and can significantly speed up computation.
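As a rough illustration of what "matrix multiplications in FP8, sensitive parts in higher precision" means, here is a simulation of per-tensor FP8 scaling in PyTorch. This is not DeepSeek's kernel: a real FP8 GEMM runs on tensor cores with higher-precision accumulation, whereas this sketch simply casts to `torch.float8_e4m3fn` (available in recent PyTorch versions) and dequantizes to BF16 for the multiply, to show where the scaling factors and the precision loss come from.

```python
import torch

def quantize_fp8(x: torch.Tensor):
    """Per-tensor scale so the largest |x| maps near FP8 E4M3's max value (~448)."""
    scale = x.abs().max().clamp(min=1e-12) / 448.0
    return (x / scale).to(torch.float8_e4m3fn), scale

def fp8_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Simulated FP8 GEMM: operands stored in FP8, product computed in BF16."""
    a_fp8, sa = quantize_fp8(a)
    b_fp8, sb = quantize_fp8(b)
    # Dequantize for the multiply; the scales are re-applied to the result.
    return (a_fp8.to(torch.bfloat16) @ b_fp8.to(torch.bfloat16)) * (sa * sb)

x, w = torch.randn(64, 128), torch.randn(128, 256)
ref = x @ w
err = (fp8_matmul(x, w).float() - ref).abs().mean() / ref.abs().mean()
print(f"mean relative error: {err:.4f}")   # small, but nonzero: the price of FP8 storage
```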

DualPipe

The DualPipe algorithm developed by the DeepSeek team improves pipeline parallelism. By overlapping the computation and communication stages, it effectively reduces the communication overhead caused by cross-node expert parallelism. The team also optimized the cross-node communication kernels, improving bandwidth utilization and reducing the computing resources consumed by communication. DualPipe significantly alleviates training bottlenecks, especially the cross-node expert parallelism the MoE architecture requires, and these optimizations allowed the team to complete V3's training without resorting to more expensive tensor parallelism.
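DualPipe's actual schedule is more involved, but its basic ingredient, issuing communication asynchronously and doing useful compute while it is in flight, can be sketched with PyTorch's collective API. The snippet below assumes an already-initialized `torch.distributed` process group; `mlp`, the buffer shapes, and the function name are placeholders for illustration.

```python
import torch
import torch.distributed as dist

def overlapped_moe_step(tokens_for_experts, other_microbatch, mlp):
    """Illustrative compute/communication overlap (not DeepSeek's DualPipe schedule):
    kick off the all-to-all that ships tokens to remote experts, then run compute
    for another micro-batch while the transfer proceeds in the background."""
    received = torch.empty_like(tokens_for_experts)
    # Non-blocking dispatch of routed tokens across nodes.
    handle = dist.all_to_all_single(received, tokens_for_experts, async_op=True)
    # Work that does not depend on the transfer, e.g. the forward pass of a
    # different micro-batch, overlaps with the communication.
    local_out = mlp(other_microbatch)
    handle.wait()   # Only block once the routed tokens are actually needed.
    return local_out, received
```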

2. Short on computing power? Hardware limitations spawn software innovation

From the outside, DeepSeek has achieved better results despite weaker chips, less funding, and less GPU time. This achievement is particularly noteworthy given the constraints on AI hardware resources the team faced.

In October 2022, in order to prevent China from becoming a superpower in artificial intelligence and computing, the United States imposed sweeping chip export restrictions on China, one of many blows in the ongoing "chip war" between the two countries.

The original intent of these restrictions was to limit China's progress in AI by cutting off its access to top-tier hardware. To comply with the new rules while staying competitive in the Chinese market, Nvidia launched the H800, a "customized" chip for China.

The success of DeepSeek-V3 may herald an interesting turn: software innovation is breaking through hardware limitations. If its technical report holds up, it may mean China has gained an edge in the chip competition. In theory, restricted chips should limit its R&D breakthroughs; in practice, DeepSeek has made significant progress in both research and products, proving that another path is possible.

Precisely because Chinese engineers cannot get the best hardware, they have been pushed to innovate at the software level, in algorithms, architectures, and training strategies, "forced" to develop new methods that squeeze the most out of the hardware at hand, sometimes beyond what was traditionally thought possible. Rather than relying solely on stacking more hardware, the constraints have driven more innovation in software.

This actually makes the US strategy of restricting China very ironic. If software technology becomes more and more powerful, it may not matter what hardware is used.

However, DeepSeek-V3 has also drawn some controversy beyond its technical achievements: users discovered that the model would sometimes claim to be ChatGPT.

A possible explanation is that DeepSeek-V3's training data was contaminated with ChatGPT-generated content, confusing the model during learning. Another possibility is that DeepSeek used a GPT model for knowledge distillation during training, that is, using the GPT model's outputs as a "teacher signal" to guide DeepSeek-V3's learning.
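For readers unfamiliar with the term, knowledge distillation in the general sense trains a student model to match a teacher model's output distribution rather than only the hard labels. The sketch below shows the standard distillation loss; it is purely a definition of the technique, not evidence of what DeepSeek actually did.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard KL-based distillation: soften both distributions with a temperature
    and push the student toward the teacher's 'soft' predictions."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 as in the original distillation paper.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

student = torch.randn(4, 32000, requires_grad=True)   # fake vocabulary logits
teacher = torch.randn(4, 32000)
print(distillation_loss(student, teacher))
```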

A large-model practitioner told Silicon Star: "Data distillation has little impact on cost. If it were only a matter of data distillation, why hasn't anyone else done it? DeepSeek must be relying on its own distinctive training and engineering practices."

Under pressure and constraints, innovation often emerges in unexpected ways. Chinese engineers are proving through action that, even in the face of hardware limitations, they can still deliver impressive results in AI. This kind of necessity-driven innovation is likely to keep producing breakthrough ideas.

For the artificial intelligence industry, DeepSeek-V3 heralds a possible paradigm shift in the way large-scale language models are developed. Through clever engineering and efficient training methods, cutting-edge artificial intelligence capabilities may be achievable without relying on massive computing resources. With the emergence of DeepSeek-V3, the market has become more diversified, providing more options for developers, content creators and even small start-ups.

Of course, if companies such as OpenAI and Meta use larger computing power clusters to train models with better performance in the future, the industry may once again set off a craze for ultra-large-scale pre-training.

By then, the industry may return to the old path of computing power arms race, and the "shovel sellers" in the AI ​​field will continue to be the biggest winners.
