With the 2025 postgraduate entrance examination concluding last month, the latest postgraduate mathematics questions have become a proving ground for large language models, and for reasoning models in particular, testing their capacity for deep thinking.
There was a consensus in the industry that while large language models perform impressively on text, they fall short on mathematics. The "9.9 vs. 9.11" comparison question that went viral last year tripped up many large models, including GPT-4o. Only with the emergence of deep reasoning models did the situation fundamentally improve.
The o1 model released by OpenAI performs impressively on complex, professional mathematical problems: by deliberating for a period of time before answering, the model improves both its problem-solving ability and its accuracy substantially. This phenomenon, known as the inference-time scaling law, has become a key force driving further improvements in large model capability. In his CES 2025 keynote, Jensen Huang likewise described test-time (i.e., inference) scaling as one of the three scaling curves in large model development.
Following o1, domestic large model vendors have successively launched their own deep reasoning models, and they have performed brilliantly on certain tasks. The timeline looks like this:
On November 21, 2024, the DeepSeek team released the DeepSeek-r1 model; on November 28, 2024, the Alibaba Tongyi team released the QwQ model; on December 16, 2024, the Moonshot AI (Dark Side of the Moon) team released the Kimi-k1 model; on December 31, 2024, the Zhipu GLM team released the GLM-Zero model; and on January 6, 2025, Kunlun Wanwei released the Skywork-o1 model. You may be curious: how strong are these deep reasoning models (especially their mathematical reasoning), and who comes out on top? That calls for a fair, standardized test.
To comprehensively evaluate these models' mathematical reasoning ability, the Tsinghua SuperBench large model evaluation team (hereinafter, the evaluation team) used the 2025 postgraduate entrance examination mathematics papers (I, II, and III) to rigorously evaluate the deep reasoning models listed above. To keep the evaluation comprehensive, each vendor's flagship base model was also included.
The details of the 13 models selected this time are as follows:
From the results, ranked by the average score across the three papers, first place goes to OpenAI's GPT-o1 model, which is hardly surprising. Second place goes to GLM-Zero-Preview from Zhipu, whose average of 138.7 across the three mathematics papers is second only to o1, making it the top domestic large model and less than 3 points behind first place. Third place goes to QwQ from Alibaba Tongyi.
Testing method
During this evaluation, the evaluation team found that not all models provide API access, and some that do truncate their output once it exceeds a certain length. To ensure fairness and accuracy, the team decided to run all tests uniformly through each vendor's web interface.
During testing, each question was posed in a separate dialogue window to eliminate possible interference from contextual information.
Given that some models produce somewhat unstable output, and to reduce the resulting score fluctuations, the evaluation team counted an answer as correct only if the model answered correctly in at least two of three attempts.
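To make that rule concrete, here is a minimal sketch (Python, not the evaluation team's actual harness) of the two-out-of-three criterion; the function name and data layout are illustrative assumptions.

```python
def question_is_correct(attempt_results, required=2):
    """attempt_results: booleans for the three independent attempts on one question."""
    # Count the question as correct only if at least `required` attempts succeeded.
    return sum(attempt_results) >= required

# Example: correct in 2 of 3 independent chat windows -> counted as correct.
print(question_is_correct([True, False, True]))   # True
# Correct in only 1 of 3 attempts -> counted as wrong.
print(question_is_correct([False, False, True]))  # False
```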
Result Analysis
The results of this evaluation are analyzed in detail from three angles: total score, per-paper score, and deep thinking models versus base models.
Total score
For the total score, the evaluation team summed each model's scores across the three papers, took the average, and ranked the models by that average. The results are shown in the figure below:
As the figure shows, GPT-o1 remains in the lead and is the only model scoring above 140; its advantage over the bottom-ranked GPT-4 reaches 70 points.
The models in the second tier (above 130 points) include GLM-zero-preview and QwQ, which scored 138.7 points and 137.0 points respectively.
DeepSeek-r1-lite, Kimi-k1, Tiangong-o1-preview, and DeepSeek-v3 make up the third tier (above 120 points).
Deep thinking models generally reach the 120+ level, demonstrating their strong ability to solve mathematical problems.
It is worth noting that the basic model GPT-4, which topped the list in 2023, only scored 70.7 points in this test and ranked last. This result shows that language models have made significant progress in the field of mathematical reasoning over the past year (2024).
On the other hand, even without the assistance of deep thinking ability, the base model DeepSeek-v3 managed to reach the third tier on its logical reasoning ability alone, showing that the boundary between base models and deep thinking models is not clear-cut.
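For readers who want to reproduce the aggregation described at the start of this subsection, the sketch below (Python, with made-up per-paper scores rather than the team's actual data) averages each model's three paper scores and ranks the models by that average.

```python
# Hypothetical per-paper scores (Math I/II/III, each out of 150); placeholders only.
paper_scores = {
    "model_A": [141, 139, 136],
    "model_B": [128, 122, 119],
}

# Total score = average of the three paper scores, then rank in descending order.
totals = {name: sum(scores) / len(scores) for name, scores in paper_scores.items()}
for rank, (name, avg) in enumerate(
        sorted(totals.items(), key=lambda kv: kv[1], reverse=True), start=1):
    print(f"{rank}. {name}: {avg:.1f}")
```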
Single test paper analysis
To show more clearly how the models performed on each paper, the evaluation team analyzed in depth the distribution of incorrectly answered questions on each paper.
In the Mathematics I evaluation, the four models GPT-o1, GLM-zero-preview, QwQ, and DeepSeek-r1-lite achieved the same score. Further analysis of the wrong answers showed that all models erred on question 20 (12 points, on evaluating a surface integral) and the second part of question 21 (6 points, on solving for eigenvectors).
In the Mathematics II evaluation, the models' scores were relatively scattered. Statistical analysis showed that errors were concentrated on questions 3, 5, and 7. The specific distribution of wrong answers is shown in the figure below:
The Mathematics III results show that model errors were mainly concentrated on questions 14, 15, 16, and 19. The distribution of wrong answers is shown in the figure below:
From the analysis of wrong answers across the papers, it is clear that GPT-o1 (the shaded column) got only 3.5 of the 66 questions wrong, and the questions GPT-o1 missed were generally missed by the other models as well, showing that GPT-o1 remains the ceiling for deep reasoning models.
Basic model vs. deep thinking model
Finally, to explore in depth each vendor's achievements in optimizing deep thinking capability, the evaluation team carefully compared and analyzed each vendor's base model against its deep thinking model.
Note that this comparison does not imply that each deep thinking model was optimized from the corresponding base model; its main purpose is to present visually each vendor's progress in improving overall model capability.
The relevant comparison results are shown in the figure below:
Note: OpenAI’s basic model uses GPT-4o.
The comparison shows that OpenAI's deep thinking model GPT-o1 achieved the largest improvement over its base model GPT-4o, at 57.3 points, followed by Alibaba's Qwen models and Zhipu's GLM models, with improvements of 47.0 and 34.3 points respectively.
In addition, the improvements for DeepSeek and Moonshot AI (Dark Side of the Moon) are relatively small, mainly because their base models already score highly. Taking DeepSeek as an example, its base model DeepSeek-v3 scores as high as 120.3 points, the highest among the participating base models.
In this test, the evaluation team also took the best-performing base model, DeepSeek-v3, as the reference benchmark when assessing how much each vendor's deep thinking model improves performance; the relevant data is shown in the figure below:
It can be seen that OpenAI, Zhipu, and Alibaba have substantially optimized their deep thinking models, while the other deep thinking models' results in this test are basically on par with DeepSeek-v3.
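As a quick check on that comparison, the snippet below recomputes each deep thinking model's margin over the DeepSeek-v3 reference score, using only the averaged totals quoted in this article; GPT-o1's exact total is not stated (only that it exceeds 140), so a placeholder is used.

```python
baseline = 120.3  # DeepSeek-v3, the best-performing base model in this test

deep_model_totals = {
    "GLM-zero-preview": 138.7,
    "QwQ": 137.0,
    "GPT-o1": 140.0,  # placeholder: the article only says the score exceeds 140
}

for name, score in deep_model_totals.items():
    print(f"{name}: +{score - baseline:.1f} over DeepSeek-v3")
```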
Taking these test results together, we find that although OpenAI's o1 remains the strongest in deep reasoning, domestic reasoning models are steadily narrowing the gap, as the results of GLM-zero-preview and Alibaba's QwQ illustrate.