Training a high-performance large model usually requires thousands of GPUs running for months or even longer. An investment on this scale demands that every layer of the model be trained effectively, so that computing resources are used to the fullest.
Yet researchers from Dalian University of Technology, Westlake University, the University of Oxford and other institutions studied DeepSeek, Qwen, Llama and Mistral and found that the deep layers of these models contribute little during training; they can even be pruned away entirely with no impact on model performance.
For example, the researchers pruned the DeepSeek-7B model layer by layer to evaluate each layer's contribution to overall performance. The results show that removing deep layers barely affects performance, while removing shallow layers causes a significant drop. This indicates that the deep layers of the DeepSeek model fail to learn useful features during training, while the shallow layers carry out most of the feature extraction.
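To make this kind of probe concrete, here is a minimal sketch of a layer-pruning evaluation. The module path `model.model.layers` assumes a Hugging Face LLaMA/DeepSeek-style layout, and the helper names are illustrative; this is not the authors' actual evaluation code.

```python
import copy
import torch

@torch.no_grad()
def perplexity(model, batch):
    # Causal-LM loss on a held-out batch, converted to perplexity.
    out = model(input_ids=batch["input_ids"],
                labels=batch["input_ids"],
                use_cache=False)
    return torch.exp(out.loss).item()

@torch.no_grad()
def layer_contributions(model, batch):
    # Remove one Transformer block at a time and record how much perplexity
    # rises; a near-zero rise means the block contributes little.
    base = perplexity(model, batch)
    scores = {}
    for i in range(len(model.model.layers)):       # assumed HF-style module path
        pruned = copy.deepcopy(model)
        del pruned.model.layers[i]                 # drop block i entirely
        scores[i] = perplexity(pruned, batch) - base
    return scores
```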
This phenomenon is called the "Curse of Depth", and the researchers have also proposed an effective solution: LayerNorm Scaling.
Introduction to the Curse of Depth
The root of the Curse of Depth lies in the behavior of Pre-LN. Pre-LN (pre-layer normalization) is a technique widely used in Transformer-based models that normalizes the input of each layer rather than its output. Although it stabilizes training, it also brings a serious problem: as model depth increases, the output variance under Pre-LN grows exponentially.
This explosive growth in variance causes the derivatives of deep Transformer blocks to approach the identity matrix, so these layers contribute almost no useful information during training. In other words, the deep layers degenerate into identity mappings and cannot learn useful features.
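As a rough intuition (not the paper's analysis), the toy stack below mimics the Pre-LN update x_{l+1} = x_l + F_l(LayerNorm(x_l)), with random linear layers standing in for the attention/MLP sub-layers. Running it shows the residual-stream variance climbing steadily with depth, which is why an extra deep block perturbs its re-normalized input less and less.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, depth = 512, 64
x = torch.randn(1024, d)                          # a batch of token vectors
norm = nn.LayerNorm(d)
blocks = [nn.Linear(d, d) for _ in range(depth)]  # stand-ins for attention/MLP

with torch.no_grad():
    for l, f in enumerate(blocks, start=1):
        x = x + f(norm(x))                        # Pre-LN residual update
        if l % 16 == 0:
            print(f"layer {l:3d}: residual variance = {x.var().item():.2f}")

# The printed variance keeps growing with depth, so each additional block
# changes its (re-normalized) input less and behaves ever more like an
# identity mapping.
```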
The Curse of Depth poses serious challenges for training and optimizing large language models. First, under-trained deep layers waste resources. Training a large language model requires enormous amounts of compute and time, and much of that compute is squandered when the deep layers fail to learn useful features.
Ineffective deep layers also cap further gains in model performance. Although the shallow layers can carry most of the feature extraction, the deep layers' ineffectiveness prevents the model from taking full advantage of its depth.
In addition, the Curse of Depth hurts scalability. As model size grows, the problem of ineffective deep layers becomes increasingly pronounced, making training and optimization harder. For example, when training very large models, under-trained deep layers can slow convergence or even prevent the model from converging at all.
The Solution: LayerNorm Scaling
The core idea of LayerNorm Scaling is to precisely control the variance of the Pre-LN output. In a multi-layer Transformer, the layer normalization output of each layer is multiplied by a scaling factor tied to the layer's depth: the reciprocal of the square root of the layer index.
Here is a simple, intuitive analogy. A large model is like a tall building, with each layer a floor, and LayerNorm Scaling fine-tunes the "energy output" of every floor.
For the lower floors (shallow layers), the scaling factor is relatively large, so their outputs are adjusted only slightly and keep relatively strong "energy"; for the higher floors (deep layers), the scaling factor is small, which effectively reduces the "energy intensity" of the deep outputs and prevents variance from piling up.
In this way, the output variance of the whole model is kept under control, and deep-layer variance no longer explodes. (The full derivation is fairly involved; interested readers can consult the paper directly.)
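Below is a minimal sketch of the scaling rule described above, assuming a Pre-LN block layout; the class and argument names are illustrative rather than taken from the paper's code.

```python
import math
import torch
import torch.nn as nn

class ScaledLayerNorm(nn.Module):
    """LayerNorm whose output is scaled by 1/sqrt(layer_index)."""
    def __init__(self, hidden_dim: int, layer_index: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.scale = 1.0 / math.sqrt(layer_index)   # layer_index counts from 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Deeper layers (larger layer_index) get a smaller scale, limiting how
        # much variance they inject into the residual stream.
        return self.scale * self.norm(x)
```

Inside a Pre-LN block this would be used as x = x + attention(ScaledLayerNorm(d, l)(x)), and again before the MLP sub-layer; the rest of the architecture stays unchanged.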
From a training perspective, in a traditional Pre-LN model the ever-growing deep-layer variance severely disturbs the gradients during backpropagation. The deep gradient signal becomes unstable, like a relay race in which the baton keeps getting dropped on the later legs, so information is passed along poorly.
This makes it hard for the deep layers to learn effective features, and the model's overall training quality suffers. LayerNorm Scaling stabilizes the gradient flow by keeping the variance in check.
During backpropagation, gradients can then travel more smoothly from the model's output layer back to its input layer. Every layer receives accurate, stable gradient signals, so its parameters can be updated and learned more efficiently.
Experimental Results
To verify the effectiveness of LayerNorm Scaling, the researchers ran extensive experiments on models of different sizes, ranging from 130 million to 1 billion parameters.
The results show that LayerNorm Scaling significantly improves performance in the pre-training stage, lowering perplexity and reducing the number of training tokens required compared with traditional Pre-LN.
For example, on the LLaMA-130M model, LayerNorm Scaling reduced perplexity from 26.73 to 25.76, and on the 1-billion-parameter LLaMA-1B model it reduced perplexity from 17.02 to 15.71. These results show that LayerNorm Scaling not only effectively controls the growth of deep-layer variance but also markedly improves training efficiency and model performance.
The researchers also evaluated LayerNorm Scaling in the supervised fine-tuning stage. The results show that it outperforms other normalization techniques on multiple downstream tasks.
For example, on the LLaMA-250M model, LayerNorm Scaling improved performance by 3.56% on the ARC-e task and by 1.80% on average across all tasks. This shows that LayerNorm Scaling not only performs well in pre-training but also delivers significant gains during fine-tuning.
In addition, the researchers swapped the normalization scheme of the DeepSeek-7B model from traditional Pre-LN to LayerNorm Scaling. Throughout training, the learning ability of the deep blocks improved markedly: they took an active part in learning and contributed to the model's performance. Perplexity dropped more noticeably, and the decline was steadier.
Paper: https://arxiv.org/abs/2502.05795