2024 is coming to an end. How much has the intelligence of large models improved over the year?
The preliminary round of the 2025 Postgraduate Entrance Examination ended just last Sunday. While the topic was still hot, we used the exam's mathematics paper to test several major domestic models and see their true IQ level.
The five domestic large-model candidates:
Big tech representatives: ByteDance Doubao, Alibaba Tongyi
Startup representatives: Zhipu, Kimi
Private equity giant representative: DeepSeek
Recall that during the college entrance examination in June, many media outlets ran evaluations of large models' exam scores. They found that every model scored above 100 in Chinese, but the math scores were uniformly poor: the lowest was only 37 points and the highest only a little over 60, so none of them passed. Bear in mind that the full score for mathematics in the college entrance examination is 150, and only 90 or above counts as a pass.
This also shows that large models have basically "passed" at natural language understanding, but in "logical thinking", the ability that widens the gap between humans and other species, they still need to keep evolving.
However, in the second half of 2024, especially after OpenAI released its o1 reasoning model in September, large models seem to have found the key to solving hard problems and complex tasks in fields such as mathematics, physics, and chemistry under the new reinforcement learning paradigm. Companies such as Kimi, DeepSeek, and Tongyi have also launched their own reasoning models that support Chain of Thought, taking their math and physics ability to a new level.
Enough talk. Let’s get straight to the test!
We chose the moderately difficult 2025 Postgraduate Entrance Examination Mathematics III paper as the test paper. Each model gets two attempts at each question, and the score is the average of the two attempts.
To keep the test fair, we used the latest version of each product (Doubao and Tongyi do not allow model selection, so we used their default mode; Kimi used the newly launched Visual Thinking version; DeepSeek had its "Deep Thinking" switch turned on; Zhipu Qingyan used the GLM-4-Plus model) and uploaded exactly the same screenshots of the 22 questions. The text prompts given to the models were also essentially the same and mimicked real usage, such as "answer this question", "what should be chosen for this question", "solve this question", and "what is the answer to this question".
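The scoring rule described above (two attempts per question, averaged) can be sketched as follows. This is only an illustrative sketch: the function names and the sample point values are hypothetical and are not taken from the actual grading sheet.

```python
def question_score(attempt_points):
    """Average the points a model earned across its attempts at one question."""
    return sum(attempt_points) / len(attempt_points)


def paper_score(attempts_by_question):
    """Total score for a paper: the sum of the averaged per-question scores."""
    return sum(question_score(points) for points in attempts_by_question.values())


# Hypothetical data: three 5-point questions, two attempts each.
example = {
    "Q1": (5, 5),  # correct on both attempts
    "Q2": (5, 0),  # correct once, wrong once
    "Q3": (0, 0),  # wrong on both attempts
}

print(paper_score(example))  # 7.5
```

Under this kind of averaging, a question answered correctly on only one of the two attempts contributes half its points, which is one way half-point totals can arise.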
2025 Postgraduate Entrance Exam Mathematics: two models score above 100
What is their real level? Let’s look at the results directly:
Judging from the final results, two models scored over 100 points on the mathematics paper of this postgraduate entrance examination's preliminary round: the Kimi Visual Thinking version scored 133 points and DeepSeek scored 103.5. Tongyi scored 90 points, a pass. Doubao and Zhipu both scored 88.5 points, just short of passing. Compared with the college entrance examination math scores in June, every model has improved considerably, and Kimi and DeepSeek are progressing especially fast.
In the past, domestic large models could struggle with elementary school math problems. Now some of them can handle graduate-level math problems with ease, which quite surprised us. However, judging from the success rate on the last question, there is still room for improvement.
Two styles of problem solving: answers only vs. ideas plus answers
Based on the scores alone, it is clear which models are more likely to get admitted.
However, the score on this one set of postgraduate math questions cannot fully demonstrate what these models can do. For students preparing for the exam, facing the same question, the model with the more complete reasoning and the richer derivation steps is naturally the more useful reference.
Let’s first look at a multiple-choice question on trigonometric functions in algebra.
The correct answer to this question is C, but how the different models handle it is quite interesting.
Let’s first look at Doubao’s problem-solving process:
Doubao gave the correct answer, but its solution is relatively brief, more like the standard answers in a postgraduate entrance exam reference book. If you want a more detailed walkthrough, you would still need to buy a course from a well-known exam tutor as a supplement.
Zhipu Qingyan's solution process is rather awkward, because it got this question wrong: it chose B in the first test and A in the second.
First test B:
Second test A:
However, even when it got the question wrong, it still gave a relatively complete thought process, so the "mistake" is excusable.
Let’s look at Kimi’s visual thinking version.
As you can see, the Kimi Visual Thinking version not only gives the correct answer but also a complete derivation and solution approach. For postgraduate entrance exam candidates, this has high reference value: it helps them review questions they got wrong and generalize from one example to similar problems.
The answers from Alibaba Tongyi and DeepSeek are similar to Doubao's. Relatively speaking, the steps these two models show are on the brief side.
Tongyi Qianwen
DeepSeek
Let’s look at another fill-in-the-blank question.
The standard answer: the asymptotes are y = 3 and y = -3.
As with the multiple-choice question above, the Kimi Visual Thinking version's solution is quite informative, with plenty of derivation detail, and it ends with the correct answer.
Doubao's derivation is relatively brief, but the reasoning is still clearly visible, which also makes it a good reference. Alibaba Tongyi and DeepSeek follow a similar process that is slightly briefer, and both give the correct answer.
Unfortunately, Zhipu Qingyan, the one model not covered above, got this question wrong on both attempts.
On the definite integral question below, however, the differences between the models are more obvious.
First, the correct answer: a = 2
The Kimi Visual Thinking version performed steadily. After giving ample derivation steps, it ran a verification pass and finally output the correct result, a = 2.
Doubao also performed steadily, but its derivation steps are as brief as ever.
When Zhipu Qingyan solved this problem, it got the right answer the first time, but it worked in code rather than natural language, which has limited reference value for ordinary learners. The second time, it did not give an answer at all and instead concluded that something was wrong with the question as posed.
Tongyi's performance was unremarkable: its first answer was wrong, but its second was correct. DeepSeek fared worse. It could not answer on the first attempt.
On the second attempt, it fell into an endless loop and kept writing out its answer for more than 3 minutes.
On more difficult questions, some models struggle to cope.
Take the following question, for example.
As usual, the correct answer comes first.
Kimi's answer is as follows. Its final result looks different from the standard answer, but it is just written in a different form and is still correct.
Doubao gave an answer in each of its two tests, but both were wrong. Here is the first attempt:
The second time:
Across its two attempts, Zhipu Qingyan ran into cases where it could not produce an answer.
Tongyi was able to write out its process, and the two answers it gave differ from each other, but unfortunately both are still wrong.
DeepSeek's performance was a pleasant surprise. As with Kimi, its answer is written differently from the standard answer, but the result is correct.
Conclusion
Just a few months ago, large-model vendors were still content with having their models write full-score college entrance examination essays. Compared with then, the models' logical reasoning and overall ability are on a different level.
It should be noted that, in the arts and sciences alike, once you reach the level of scientific research, the logical ability represented by mathematics and physics is the cornerstone of whether a large model is usable, useful, and easy to use, and a model's math and physics problem-solving ability is a direct manifestation of its intelligence.
With the continuous enhancement of the capabilities of large models, when humans explore more cutting-edge scientific and technological fields, large models that were “useless” in the past can now become the assistants of many researchers. Perhaps in the future, when the capabilities of AI really reach the level of the top 1% of human experts in various fields, or even exceed the level of humans, with the help of AI, our understanding of the universe will really have the opportunity to reach new heights that humans have never reached before. I hope that by then, AI will still be mankind’s best friend.