At the end of 2024, DeepSeek V3 suddenly became popular.
As an open-source model with 671 billion parameters, DeepSeek V3 performs close to top closed-source models such as GPT-4 and Claude 2.
What is even more striking is that, according to DeepSeek's own report, training took only 2.788 million GPU hours, driving the cost down to what netizens call a "cabbage price".
Related reading: "DeepSeek, the 'Pinduoduo' of the AI world, suddenly goes viral"
In the editor's quick hands-on test, DeepSeek V3's text-generation speed is indeed impressive.
However, the core question the industry cares about is not its performance, but whether DeepSeek V3 really offers a more economical, more affordable path for the AI world, especially for China's compute-constrained AI industry.
There are skeptics. For example, a widely circulated and fairly comprehensive compilation of "replies from everyone in the group chat":
A: "The news about Magic Square is purely taken out of context. A 671B moe model is trained, and the fp8 architecture is used to reduce the GPU time consumption. Magic Square is indeed technically awesome. But Magic Square is Before training this model, they used their own r1 model (benchmarked by openai o1 model) to generate data, should the repeated attempts in this part be included in the cost? Not counting the previous confusion, just reducing costs and increasing efficiency in training does not mean that the inference requirements will be reduced. The decline only means that large manufacturers can use more cost-effective methods. method to explore the ultimate capabilities of the model. As long as there is growth logic on the application side, the demand for inference is still worth looking forward to. "B:" - Training is only once, and inference needs are essentially far greater than training needs. The user base is large. Deepseek stands on the shoulders of giants and uses a large amount of high-quality synthetic data. - The statistical caliber of Deepseek only calculates training, but the proportion of data requires a lot of pre-experiments, and the generation and cleaning of synthetic data also requires computing power. - Each expert in the MoE of Deepseek's model can be trained separately, which is a less labor-intensive solution compared to the dense architecture - Everyone has surpassed GPT 4o, llama. 3. They are stepped on every day. These two models are actually used the most by consumers and businesses. "C: "1. The training of FP8 itself is not very resource-intensive, and DS is. The "set" large model training has limited the capabilities of the large model, which reduces a lot of unnecessary consumption. 2. When OpenAI and Antropic train new things and new capabilities, the consumption may be detoured. It's like a hundred times a thousand times the last correct path.After reading the answers several times, even average students can get full marks, or close to full marks, on the college entrance examination mathematics paper within one hour. The more times you take a test paper, the faster it will be. Maybe you can get a perfect score in 30 minutes... The DS model adds a lot of "setting" factors, because it is known to be effective and help improve reasoning ability. 3. The pursuit of model ability is "general knowledge ability". In order to get good grades, no one can avoid the three years of reading. Now the computing power and data are just to shorten the time. The upper limit of the general knowledge ability of the large model is too high, and the calculation power has just begun. Whoever hesitates, who questions, and whoever falls behind. 4. The other is the interface of multimodality and embodied intelligence. A very important reason why GPT-5 is difficult to produce is that GPT-5 must have the potential ability to open the robot mode, which is to be able to process physical world data. This thing is also brand new, surpassing the capabilities of current large models. "There is also Hu Yanping's long Weibo "Why deepseek's popularity should not be overestimated"
Yet the basic technical principles behind the model, which have been put to this kind of scrutiny, have a logic of their own. For example:
1. Multi-head Latent Attention (MLA)
To achieve efficient inference, DeepSeek V3 adopts the multi-head latent attention (MLA) mechanism. MLA jointly compresses the attention keys and values into a low-rank latent representation, which greatly shrinks the key-value (KV) cache needed at inference time.
Specifically, MLA only needs to cache the compressed latent vectors plus a small set of decoupled keys carrying rotary position embeddings (RoPE), which significantly reduces memory usage compared with standard multi-head attention.
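To make the caching pattern concrete, here is a minimal PyTorch sketch; the dimensions, layer names, and the omission of the actual RoPE computation and causal masking are illustrative assumptions, not DeepSeek V3's real implementation.

```python
import torch
import torch.nn as nn

class LowRankKVCacheAttention(nn.Module):
    """Toy illustration of MLA-style joint low-rank KV compression (not the real model)."""
    def __init__(self, d_model=1024, n_heads=8, d_latent=128, d_rope=32):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        # Keys and values are jointly compressed into one small latent vector per token.
        self.kv_down = nn.Linear(d_model, d_latent)   # compression (down-projection)
        self.k_up = nn.Linear(d_latent, d_model)      # reconstruct per-head keys
        self.v_up = nn.Linear(d_latent, d_model)      # reconstruct per-head values
        # Small decoupled key carrying positional information (RoPE itself omitted here).
        self.k_rope = nn.Linear(d_model, d_rope)

    def forward(self, x, cache=None):
        B, T, _ = x.shape
        latent = self.kv_down(x)   # (B, T, d_latent) -- this is what gets cached
        k_pos = self.k_rope(x)     # (B, T, d_rope)   -- also cached (its score term is omitted below)
        if cache is not None:      # during decoding, append to the running cache
            latent = torch.cat([cache["latent"], latent], dim=1)
            k_pos = torch.cat([cache["k_pos"], k_pos], dim=1)
        new_cache = {"latent": latent, "k_pos": k_pos}

        # Full keys and values are re-expanded from the latent only when attention is computed.
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)   # causal mask omitted for brevity
        return out, new_cache
```

In this toy configuration each token caches d_latent + d_rope = 160 values instead of the 2 × n_heads × d_head = 2048 values a standard KV cache would hold, which is where the memory saving comes from.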
2. Mixture-of-Experts Architecture (MoE)
The most distinctive feature of DeepSeek V3 is its MoE architecture: although the full model contains 671 billion parameters, only about 37 billion are activated for each token, thanks to a dynamic routing mechanism.
This mechanism uses finer-grained experts and isolates some of them as shared experts. To address the load-imbalance problem common in MoE models, DeepSeek V3 adopts an innovative auxiliary-loss-free load-balancing strategy.
The strategy introduces a bias term for each expert and adjusts it dynamically to steer expert selection, achieving balanced expert load without degrading model performance.
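The following is a minimal sketch of how such bias-adjusted top-k routing can be implemented; the expert count, the sigmoid gating, the sign-based bias update, and every name here are illustrative assumptions rather than the model's actual code.

```python
import torch
import torch.nn as nn

class BiasBalancedRouter(nn.Module):
    """Toy illustration of bias-adjusted top-k routing without an auxiliary loss."""
    def __init__(self, d_model=1024, n_experts=64, top_k=8, bias_lr=1e-3):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        # Per-expert bias used only for *selecting* experts, not for weighting their outputs.
        self.register_buffer("expert_bias", torch.zeros(n_experts))
        self.top_k, self.bias_lr = top_k, bias_lr

    def forward(self, x):                          # x: (n_tokens, d_model)
        scores = torch.sigmoid(self.gate(x))       # token-to-expert affinity scores
        # The bias shifts the ranking so that overloaded experts are picked less often.
        _, idx = torch.topk(scores + self.expert_bias, self.top_k, dim=-1)
        # Gating weights are taken from the original, un-biased scores.
        weights = torch.gather(scores, -1, idx)
        weights = weights / weights.sum(dim=-1, keepdim=True)

        if self.training:                          # online bias update, no auxiliary loss term
            load = torch.zeros_like(self.expert_bias)
            load.scatter_add_(0, idx.reshape(-1),
                              torch.ones(idx.numel(), device=x.device))
            target = idx.numel() / load.numel()    # ideal number of tokens per expert
            # Lower the bias of overloaded experts, raise it for underloaded ones.
            self.expert_bias -= self.bias_lr * torch.sign(load - target)
        return idx, weights
```

Note that in this scheme the bias only influences which experts are selected; the weights used to combine expert outputs still come from the original scores, which is why balance can be encouraged without adding an auxiliary loss to the training objective.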
Whatever the final facts and reproducibility turn out to be, the greatest value of the DeepSeek question is that it opens up another path for discussion, at a moment when the large-model arms race runs on enormous energy consumption and a contest of money and GPUs.
A related idea comes from AI researcher Zhu Songchun, who has proposed evolving the large-model paradigm from "a parrot that talks" to "a crow that drinks water".
The so-called "parrot paradigm" refers to today's mainstream AI models built on big data and deep learning, which can imitate and repeat but lack genuine understanding and reasoning.
The "crow paradigm", by contrast, is a "small data, big task" model that emphasizes autonomous reasoning and long-horizon insight, with low power consumption and comparatively modest demands on data and compute. Zhu Songchun believes it represents the future direction of artificial intelligence.
From this perspective, although DeepSeek V3 has relatively low training costs and computing power consumption, its training process still requires a large number of GPU hours.
Optimists believe that, through distillation and optimization, DeepSeek V3 has made a real breakthrough in reasoning capability, showing that AI is no longer just a language imitator but is gradually gaining the ability to make independent judgments; it also demonstrates how strongly algorithmic optimization and hardware adaptation can drive AI forward.
From MLA to MoE, from inference efficiency to cost control, it sets a new benchmark for open-source AI models and offers a glimpse of the "crow paradigm" becoming possible.