Will the Scaling Law stop working?
OpenAI CEO Sam Altman has insisted that "there is no wall." Yet OpenAI's recent releases have not been that explosive. In particular, o1 Pro scores only one point higher in programming ability than the full version of o1, which seems to lend credibility to the existence of the "wall".
It’s time to break out of the Scaling Law!
The Densing Law proposed by the team of Professor Liu Zhiyuan of Tsinghua NLP Laboratory gives us a new perspective!
The Scaling Law holds that as model size (such as the number of parameters), training data set size and other factors increase, model performance improves predictably according to a power law.
The Densing Law is different: like Moore's Law, it describes how capability grows over time.
In short: the capability density of large models doubles about every 100 days!
What is capability density?
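The research team defines it as the ratio of a model's "effective parameter size" to its actual parameter size, where the effective parameter size is the number of parameters a reference model would need to reach the same performance. It is a new indicator for measuring the training quality of an LLM (Large Language Model).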
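As a minimal sketch, the definition is a plain ratio. The function name `capability_density` and the numbers below are hypothetical and chosen only for illustration; estimating the effective parameter size of a real model is the hard part and is not shown here.

```python
def capability_density(effective_params: float, actual_params: float) -> float:
    """Ratio of effective parameter size to actual parameter size."""
    if actual_params <= 0:
        raise ValueError("actual_params must be positive")
    return effective_params / actual_params


# Hypothetical example: a 2B-parameter model that performs like a
# 4B-parameter reference model has a capability density of 2.
density = capability_density(effective_params=4e9, actual_params=2e9)
print(density)  # 2.0
```

A density above 1 means the model does more with each parameter than the reference model does.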
To give an example from the paper: MiniCPM-1-2.4B, released on February 1, 2024, performs comparably to or even better than Mistral-7B, released on September 27, 2023. In other words, after 4 months, an LLM with roughly 35% of the parameters is enough to obtain roughly equivalent performance.
The first author of the paper said that, according to this law, by the end of next year a small 8B model will be able to match the capabilities of GPT-4.
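The arithmetic behind this example can be checked with a short sketch. It assumes that "comparable performance" means the two models have the same effective parameter size, so the density ratio is simply the inverse of the parameter ratio. The helper `implied_doubling_days` is a hypothetical name, and a single pair of models gives only a rough estimate.

```python
from datetime import date
from math import log2


def implied_doubling_days(params_old: float, params_new: float,
                          released_old: date, released_new: date) -> float:
    """Density doubling time implied by two models of equal performance."""
    elapsed_days = (released_new - released_old).days
    # Equal performance with fewer parameters means density grew by old/new.
    doublings = log2(params_old / params_new)
    return elapsed_days / doublings


mistral_release = date(2023, 9, 27)   # Mistral-7B
minicpm_release = date(2024, 2, 1)    # MiniCPM-1-2.4B

elapsed = (minicpm_release - mistral_release).days
ratio = 2.4 / 7.0
days = implied_doubling_days(7.0, 2.4, mistral_release, minicpm_release)

print(elapsed)         # 127
print(f"{ratio:.0%}")  # 34%
print(round(days))     # 82
```

Under these assumptions the pair implies a doubling time of about 82 days, the same order of magnitude as the roughly 100 days reported for the overall trend.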
In addition, the research team found that the three core engines in the AI era also obey the density law. Electricity, computing power and intelligence (AI) all have their own doubling cycles. Among them, the doubling time of battery energy density is 10 years, and the doubling time of chip circuit density is 18 months.
Beyond the main finding, the research team also derived 5 important corollaries. Let us go through them:
The inference overhead of the model decreases exponentially over time
According to the Densing Law, roughly every three months we can reach the same performance with a model that has half as many parameters.
As a result, the cost of inference is decreasing at an exponential rate while achieving the same task performance.
The team found that from January 2023 to now, the inference cost of GPT-3.5-level models has fallen by a factor of 266.7.
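To put the 266.7 figure in perspective, it can be converted into a number of cost halvings. The sketch below is a back-of-the-envelope calculation, not the paper's method. The elapsed time behind "to now" is not stated here, so the 20-month and 23-month windows are assumptions, and the cost drop presumably reflects cheaper hardware and serving optimizations as well as density gains.

```python
from math import log2


def halvings(cost_reduction_factor: float) -> float:
    """Number of times the cost has halved for a given reduction factor."""
    return log2(cost_reduction_factor)


def implied_halving_months(cost_reduction_factor: float,
                           months_elapsed: float) -> float:
    """Average months per cost halving over the given window."""
    return months_elapsed / halvings(cost_reduction_factor)


n = halvings(266.7)
print(round(n, 2))  # 8.06

# ASSUMPTION: the length of the window is not given; try two plausible values.
for months in (20, 23):
    print(months, round(implied_halving_months(266.7, months), 2))
# 20 2.48
# 23 2.85
```

Either way, a 266.7-fold reduction corresponds to about eight halvings, or one halving every two to three months.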
The capability density of large models shows an accelerating trend
The team compared the growth trend of LLM density before and after the release of ChatGPT and found that after this point, the growth rate of LLM density increased by 50%!
This conclusion is not surprising: it can be said that the current wave of AI enthusiasm started with the release of ChatGPT.
No matter how much we complain about OpenAI’s closed ecosystem, its huge contribution to advancing AI is undeniable.
Model miniaturization reveals the huge potential of on-device intelligence
Moore's Law (Moore, 1965) states that the number of circuits integrated on a chip of the same area increases exponentially, which means that computing power also increases exponentially.
The Densing Law proposed this time shows that the density of LLM doubles every 3.3 months.
Combining these two factors, the team proposes that the effective parameter size of the LLMs that can run on chips of the same price grows faster than either LLM density or chip computing power alone.
This dual growth model is like running on an elevator, allowing AI to run smoothly on mobile phones and other devices in the near future.
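The "running on an elevator" effect can be made concrete with a small sketch. It assumes the two trends are independent exponentials whose effects multiply, so their growth rates (the reciprocals of the doubling times) add. The 18-month chip doubling time is the figure quoted earlier in this article, and `combined_doubling_time` is a hypothetical helper name.

```python
def combined_doubling_time(*doubling_times: float) -> float:
    """Doubling time of a product of independent exponential trends.

    If x doubles every T1 and y doubles every T2, then x * y grows as
    2 ** (t / T1 + t / T2), so its doubling time is 1 / (1/T1 + 1/T2).
    """
    return 1.0 / sum(1.0 / t for t in doubling_times)


# Doubling times in months: LLM density (3.3) and chip circuit density (18).
t = combined_doubling_time(3.3, 18.0)
print(round(t, 2))  # 2.79
```

Under these assumptions, the effective parameter size that fits on a chip of fixed price would double roughly every 2.8 months, faster than either trend on its own.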
Model compression alone cannot enhance capability density
Pruning and distillation are not as useful as we think!
By comparing models with their compressed versions, the team found that widely used pruning and distillation methods often leave the compressed model less dense than the original.
The research believes that we should continue to look for more efficient model compression algorithms, and in particular, we should pay more attention to improving the density of small models.
The density doubling cycle determines the "validity period" of a model
A cruel fact is that large models also have expiration dates.
Every few months, newer and more "affordable" models appear, which means that a model must earn enough within a limited window to break even.
Based on API profitability, the research team estimates that a model needs about 1.7 billion user visits within 2 months to break even!
After looking at this figure, we understand better why large models are so expensive.
The law of density also reminds the AI circle not to pursue Scaling blindly.
What is more important is how to strike a balance between model performance and efficiency.
“Blindly increasing model parameters in pursuit of performance improvements may result in reduced model density, resulting in unnecessary energy consumption. For example, although Llama-3.1-405B (Dubey et al., 2024) achieves state-of-the-art performance among open-source models, it requires hundreds of times more computational resources than other models.”
Therefore, the focus in the future should shift from pure performance optimization to density optimization. Only when large models leave the “examination-oriented” mindset behind and are no longer obsessed with leaderboard numbers can they truly enter the wilderness of applications.
Reference links:
1. https://arxiv.org/abs/2412.04315
2. The illustrations come from the research group