Andrej Karpathy, a member of OpenAI's founding team and senior research scientist, made a rare post sharing a large open-source model from China: DeepSeek-V3.
Karpathy noted that DeepSeek used only about 2.8 million GPU-hours to train a frontier model stronger than Llama-3 405B (which used 30.8 million GPU-hours), roughly an 11x reduction in overall compute cost, squeezing the most out of its computing power.
This opens up a whole new world for small models and for organizations limited by computing power: even with limited compute, high-quality data and better algorithms can still train high-performance large models.
In addition, DeepSeek-V3 greatly exceeds well-known open- and closed-source models such as GPT-4o, Claude-3.5-Sonnet, and Qwen2.5-72B on many mainstream benchmarks such as MMLU, DROP, Codeforces, and AIME, making it one of the most powerful open-source large models available today.
Foreign netizens remarked that restricting the supply of chips to China does not seem to have stifled its progress; instead it has spurred technological innovation. Interestingly, resource constraints are not just barriers; they are powerful drivers of creativity.
Reading this netizen's comment is quite moving. With domestic AI chips restricted and higher computing power out of reach, we have relied on ingenuity and an innovative spirit to break through the blockade; as the classic saying goes, as heaven's movement is vigorous, so a gentleman strives ceaselessly for self-improvement!
Is the United States really sure that it wants to “exclude China from the artificial intelligence competition”? It seems to me that we might be playing catch-up...
When the Chinese get a "lemon", they squeeze every last drop of juice out of it and make delicious lemonade. Hopefully resource-constrained labs in the U.S. can achieve the same thing.
China is about to become a super artificial intelligence power.
The model is great, but the team that made it happen is even better, and there is really no limit to human creativity.
Can the improvements DeepSeek made to compensate for the limitations of smaller models also be applied to larger ones? Can we expect a similar 11x efficiency gain on a cluster of 100,000 GPUs?
I really want to try DeepSeek’s API, but it has been failing since this morning.
I love open-source models like this; they force the Western world to lower prices.
DeepSeek's team is a group of super-talented former quantitative analysts. Quants are known for squeezing out every bit of performance gain, and they have done it again, this time in a different field. People with high IQs are truly a blessing to the world.
Their training efficiency is crazy.
The training data used is roughly the same as Llama 3 405B, about 15 trillion tokens. But with the same amount of training data, the compute required is reduced by roughly 10x.
Wow, someone finally cracked the training-efficiency problem. While everyone else is counting their AI budgets in billions of dollars, DeepSeek is building frontier large models with a fraction of that. It seems that simply throwing more GPUs at the problem isn't always the answer.
This person posted the benchmark chart directly: DeepSeek beats OpenAI and Meta~
A brief introduction to the DeepSeek-V3 model
The DeepSeek-V3 architecture continues the efficient-inference and low-cost-training strategies of the second generation, built mainly around Multi-head Latent Attention (MLA) and Mixture-of-Experts (MoE).
MLA is one of V3's core innovations and is mainly used to reduce memory usage during inference. MLA compresses keys and values into a latent vector and caches only this vector during inference, rather than the full key and value matrices.
MLA's compression is implemented with a down-projection matrix and an up-projection matrix: the down-projection compresses the input vectors into latent vectors, and the up-projection restores keys and values from the latent vectors. In this way, MLA only needs to cache the latent vectors (plus the decoupled keys that carry positional information) during inference, significantly reducing the memory footprint.
MLA also applies low-rank compression to the queries, further reducing activation memory during training. MLA is therefore one of the main reasons V3 cuts memory and compute costs so sharply.
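To make the down-/up-projection idea concrete, here is a minimal PyTorch sketch of low-rank KV compression in the spirit of MLA. The dimensions and module names are assumptions of mine, and RoPE, causal masking, and the decoupled key path are omitted; this is not DeepSeek's actual implementation.

```python
# Minimal sketch of MLA-style low-rank KV compression (illustrative only;
# dimensions and names are assumptions, RoPE and causal masking are omitted).
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    def __init__(self, d_model=1024, n_heads=16, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)   # down-projection to latent
        self.k_up = nn.Linear(d_latent, d_model)      # up-projection back to keys
        self.v_up = nn.Linear(d_latent, d_model)      # up-projection back to values
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, latent_cache=None):
        b, t, d = x.shape
        q = self.q_proj(x)
        latent = self.kv_down(x)                      # (b, t, d_latent)
        if latent_cache is not None:                  # at inference, only this small
            latent = torch.cat([latent_cache, latent], dim=1)  # latent is cached
        k, v = self.k_up(latent), self.v_up(latent)   # keys/values expanded on the fly

        def split(z):                                 # (b, s, d) -> (b, h, s, d_head)
            return z.view(b, -1, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.out_proj(out), latent             # cache `latent`, not k and v
```

The memory saving comes from caching one small latent vector per token (128 dimensions in this sketch) instead of the full per-head keys and values.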
The traditional MoE architecture is prone to expert load imbalance on large-scale data-processing tasks. This imbalance can have serious consequences, the most prominent being routing collapse: when some experts carry too much load while others sit relatively idle, the routing mechanism can no longer allocate tokens effectively and the model stops working properly.
Because of this load imbalance, computing resources cannot be allocated sensibly, making the whole computation slow and inefficient, and complex language tasks demand a great deal of compute to support the model's reasoning and decision-making.
V3 improves on MoE by introducing a dynamic adjustment mechanism dedicated to balancing expert load. During training, the router monitors each expert's load in real time and dynamically adjusts token allocation according to the actual load. This adjustment is not a simple even split, but an intelligent allocation based on each expert's real-time capacity and the characteristics of the current tokens.
For example, when an expert's load grows too high, the model automatically shifts part of the traffic to more lightly loaded experts, so that each expert works within a reasonable load range.
In addition, V3's MoE sets a dynamic load threshold for each expert; when the load exceeds the threshold, the adjustment mechanism kicks in, taking into account factors such as an expert's historical processing efficiency, the urgency of the current tokens, and the load balance of the whole system. As a result, V3's MoE not only mitigates routing collapse but also makes the most of the available compute. A sketch of this kind of dynamic balancing follows below.
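As a rough illustration, here is a minimal PyTorch sketch of top-k routing with a per-expert bias that is nudged down for overloaded experts and up for underloaded ones, in the spirit of V3's auxiliary-loss-free balancing. The names, dimensions, and update rule are my own simplifications, not DeepSeek's actual code.

```python
# Minimal sketch of load-balanced top-k routing with a per-expert bias
# (illustrative; names, dimensions, and the update rule are assumptions).
import torch
import torch.nn as nn

class BiasBalancedRouter(nn.Module):
    def __init__(self, d_model=1024, n_experts=64, top_k=8, bias_lr=1e-3):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        # The routing bias is adjusted online, not learned by backprop.
        self.register_buffer("expert_bias", torch.zeros(n_experts))
        self.top_k, self.bias_lr, self.n_experts = top_k, bias_lr, n_experts

    def forward(self, x):
        # x: (tokens, d_model)
        scores = torch.sigmoid(self.gate(x))            # per-expert affinity
        biased = scores + self.expert_bias               # bias affects routing only
        topk_idx = biased.topk(self.top_k, dim=-1).indices
        # Gating weights come from the unbiased scores of the chosen experts.
        weights = torch.gather(scores, -1, topk_idx)
        weights = weights / weights.sum(dim=-1, keepdim=True)

        # Online balancing: count tokens per expert, then push the bias down
        # for overloaded experts and up for idle ones.
        load = torch.zeros(self.n_experts, device=x.device)
        load.scatter_add_(0, topk_idx.reshape(-1),
                          torch.ones(topk_idx.numel(), device=x.device))
        target = topk_idx.numel() / self.n_experts        # ideal uniform load
        self.expert_bias -= self.bias_lr * torch.sign(load - target)
        return topk_idx, weights
```

Because the bias only shifts which experts get selected while the gating weights still come from the unbiased scores, this style of balancing steers the load without distorting the model's outputs the way a heavy auxiliary loss can.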
Actually, a question comes to mind as I write this: if DeepSeek had 100,000 H100s, could it build a super-powerful model like o3?
In addition to open-sourcing the latest model, DeepSeek also provides a free online service that anyone can try. It is worth mentioning that there is also a deep-thinking mode similar to the o1 model, which writes out the entire reasoning process.
Open source address: https://github.com/deepseek-ai/DeepSeek-V3
Online experience: https://chat.deepseek.com/
Hugging Face: https://huggingface.co/collections/deepseek-ai/deepseek-v3-676bc4546fb4876383c4208b
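For readers who would rather call the model from code than use the web page, here is a minimal sketch using the OpenAI-compatible Python client; the base URL and model name are assumptions on my part and should be checked against DeepSeek's official API documentation.

```python
# Minimal sketch of calling DeepSeek via an OpenAI-compatible client.
# The base_url and model name below are assumptions; check the official docs.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",          # placeholder, not a real key
    base_url="https://api.deepseek.com",      # assumed OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-chat",                    # assumed name for the V3 chat model
    messages=[{"role": "user", "content": "Briefly explain what MLA is."}],
)
print(response.choices[0].message.content)
```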