A domestic large model debuts Chinese logical reasoning: the "Tiangong Large Model 4.0" o1 version is here
2024-11-27 15:03:01


Image source: generated by Unbounded AI

Technology has developed faster than anyone expected. People have already begun to imagine what life will look like in the AI era.

Last weekend, JPMorgan Chase CEO Jamie Dimon said that thanks to artificial intelligence technology, future generations will be able to work only three and a half days a week and live to be a hundred years old.

Some studies estimate that technologies such as generative AI could automate tasks that currently occupy 60-70% of people's working time. Where will the technology required for these changes come from? It must be breakthrough AI. Someone has compiled predictions from leading figures in the AI field on when artificial general intelligence (AGI) will arrive. DeepMind's Demis Hassabis believes we are still two or three major technological innovations away from AGI.

OpenAI CEO Sam Altman, for his part, even believes that AGI will appear next year. On reflection, such confidence may stem from large models having recently learned how to "reason."

Back in September, OpenAI officially released o1, an unprecedented large-scale complex reasoning model and a major breakthrough. The new model is not only generally capable but can also solve harder scientific, coding, and mathematical problems that were beyond earlier models. Experimental results show that o1 significantly outperforms GPT-4o on the vast majority of reasoning tasks.

o1 delivers significant improvements over GPT-4o on challenging reasoning benchmarks.

OpenAI has opened a new direction for large-model capability: "Can it think and reason like a human?" has become an important yardstick. A vendor that releases a new model without some form of chain-of-thought reasoning would, frankly, be embarrassed to show it.

However, the official version of o1 has still not launched. The AI community, and China's large-model companies in particular, is challenging o1's dominance and beginning to take the lead on some authoritative benchmarks.

Today, the first o1-style model in China with Chinese logical reasoning capabilities has arrived: the "Tiangong Model 4.0" o1 version (English name: Skywork o1), launched by Kunlun Wanwei. It is also the company's third major move in large models and related applications in the past month, following Tiangong AI advanced search and the real-time voice conversation AI assistant Skyo.

Skywork o1 is now open for beta testing; anyone who wants to try it can apply.

Application address: www.tiangong.cn

Three models enter the new battlefield of reasoning

Skywork o1 comprises three models: an open-source version contributed to the community, and more capable proprietary versions.

Among them, the open-source Skywork o1 Open has 8B parameters. It achieves significant improvements on a range of math and code benchmarks, lifting Llama-3.1-8B performance to SOTA among models of its size class and surpassing Qwen-2.5-7B-Instruct. Skywork o1 Open also unlocks mathematical reasoning tasks, such as 24-point calculations, that larger models like GPT-4o cannot complete, which opens the door to deploying reasoning models on lightweight devices.
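For context, the 24-point game mentioned above asks whether four numbers can be combined with +, -, *, and / to reach 24. It can be checked by brute force; here is a minimal sketch of such a checker (our own illustration, not Skywork's method):

```python
from itertools import permutations

def solve24(nums, target=24, eps=1e-6):
    """Return True if nums can be combined with + - * / (any parenthesization) to reach target."""
    if len(nums) == 1:
        return abs(nums[0] - target) < eps
    # Pick an ordered pair of numbers, combine them, and recurse on the smaller list.
    for i, j in permutations(range(len(nums)), 2):
        rest = [nums[k] for k in range(len(nums)) if k not in (i, j)]
        x, y = nums[i], nums[j]
        candidates = [x + y, x - y, x * y]
        if abs(y) > eps:  # avoid division by zero
            candidates.append(x / y)
        if any(solve24(rest + [c], target, eps) for c in candidates):
            return True
    return False

print(solve24([3, 3, 8, 8]))  # True, via 8 / (3 - 8/3) = 24
print(solve24([1, 1, 1, 1]))  # False
```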

In addition, Kunlun Wanwei will open source two Process Reward Models (PRMs) for reasoning tasks: Skywork o1 Open-PRM-1.5B and Skywork o1 Open-PRM-7B. The previously open-sourced Skywork-Reward-Model could only score an entire model answer, whereas Skywork o1 Open-PRM scores each individual step of the answer.

Compared with existing PRMs in the open-source community, Skywork o1 Open-PRM-1.5B matches the performance of 8B-scale models such as RLHFlow's Llama3.1-8B-PRM-Deepseek-Data and OpenR's Math-psa-7B. Skywork o1 Open-PRM-7B is stronger still, approaching or even surpassing Qwen2.5-Math-RM-72B, a model ten times its size, on most benchmarks.

According to reports, Skywork o1 Open-PRM is also the first open-source PRM adapted to code tasks. The table below shows evaluation results on math and code benchmarks using different PRMs, with Skywork-o1-Open-8B as the base model.

Note: Except for Skywork-o1-Open-PRM, the other open-source PRMs have not been specifically optimized for code tasks, so no code-task comparison is made for them.

A detailed technical report will be released soon. The models and accompanying documentation are already open-sourced on Hugging Face.

Open source address: https://tinyurl.com/skywork-o1

Skywork o1 Lite has full thinking capabilities with faster reasoning speed, and is particularly strong on Chinese logic, reasoning, and mathematics questions. Skywork o1 Preview is the full version of the reasoning model: equipped with a self-developed online reasoning algorithm, it presents a more diverse and deeper thinking process than the Lite version, achieving more complete, higher-quality reasoning.

You may ask: current efforts to reproduce the o1 model have already put considerable work into inference, so what makes Skywork o1 different?

Kunlun Wanwei states that the series has thinking, planning, and reflection built into the model's output: it reasons, reflects, and verifies step by step in slow thinking, unlocking advanced capabilities typical of complex human thought and ensuring the quality and depth of its answers.

Of course, the quality of Skywork o1 must be judged by actual results.

First-hand test: Skywork o1 has reasoning firmly in hand

Machine Heart obtained test access in advance and ran a comprehensive examination of the reasoning capabilities of the Skywork o1 series, particularly the Lite and Preview versions. The image below shows the Skywork o1 Lite interface.

We first asked Skywork o1 Lite to introduce itself. The model does not answer directly; instead it shows the user its complete thinking process, including problem framing and self-assessment, along with the time spent thinking, a hallmark of today's reasoning models.

Then we officially entered the testing phase, collecting various types of reasoning questions to see whether any could stump Skywork o1.

Comparing numbers and counting "r"s: no more blunders

Large models used to stumble on seemingly simple problems such as number comparison and counting. These problems no longer trouble Skywork o1 Lite.

When asked which of 13.8 and 13.11 is larger, Skywork o1 Lite gives a complete chain of thought and identifies that the key lies in comparing the decimal places. The model also reflects on itself, double-checks its conclusion, and flags the points where answers commonly go wrong.

Similarly, Skywork o1 Lite follows the full think, verify, confirm loop when correctly answering "How many 'r's are there in strawberry?"
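Both of these quick tests have ground truth that is trivial to verify in code:

```python
# Numeric comparison: 13.8 vs 13.11 as real numbers, not version strings.
assert 13.8 > 13.11

# Letter counting: occurrences of "r" in "strawberry".
print("strawberry".count("r"))  # 3
```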

When answering questions that include distractor items, Skywork o1 Lite quickly clarifies its approach without being led astray.

Solving brain teasers without falling into language traps

Large models are sometimes confused by brain-teaser questions in a Chinese-language context, leading to wrong answers. Skywork o1 Lite handles such problems with ease.

Two father-and-son pairs caught only three fish, yet each person got one. Skywork o1 Lite can figure out what happened.

Mastering common sense, no more "artificial stupidity"

Whether a large model can approach human-level common-sense reasoning is an important measure of its credibility, its decision-making ability, and its applicability across fields. Both Skywork o1 Lite and Preview perform well here.

For example, the difference between length units (inches, centimeters, yards) and mass units (kilograms).

Or why salt-water ice cubes melt more easily than pure-water ice cubes.

Another example: a person standing on a completely stationary boat jumps backward, and the boat moves forward. Skywork o1 Lite explains the physics behind the phenomenon.

A problem-solving expert: even college entrance exam questions are no problem

Mathematical reasoning is a foundational ability for solving complex tasks; large models with strong mathematical reasoning help users efficiently tackle complex cross-disciplinary problems.

When solving the sequence problem "2, 6, 12, 20, 30, ... what is the 10th term of this sequence?", Skywork o1 Lite observes the pattern of the numbers, derives the rule, verifies it, and finally gives the correct answer.
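For reference, the sequence has an easily verified closed form; a quick check of our own derivation (not the model's output):

```python
# Differences are 4, 6, 8, 10, ... which gives the closed form a_n = n * (n + 1).
def a(n):
    return n * (n + 1)

# The formula reproduces the given terms.
assert [a(n) for n in range(1, 6)] == [2, 6, 12, 20, 30]
print(a(10))  # 110
```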

On a combinatorics question (how many ways are there to choose 3 people from 10 to form a team?), Skywork o1 Preview gave the correct answer after a full chain of thought.
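The answer here is the binomial coefficient C(10, 3), which Python's standard library computes directly:

```python
import math

# Choosing 3 of 10 people, order irrelevant: C(10, 3) = 10! / (3! * 7!).
print(math.comb(10, 3))  # 120
```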

For a dynamic programming problem (with coin denominations 1, 3, and 5, how many coins are needed to make 11?), Skywork o1 Lite gives the optimal solution.
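The standard bottom-up DP for this problem is short enough to show in full (a textbook solution, not the model's transcript):

```python
def min_coins(coins, amount):
    """dp[a] = fewest coins from `coins` summing exactly to a."""
    INF = float("inf")
    dp = [0] + [INF] * amount
    for a in range(1, amount + 1):
        for c in coins:
            if c <= a and dp[a - c] + 1 < dp[a]:
                dp[a] = dp[a - c] + 1
    return dp[amount]

print(min_coins([1, 3, 5], 11))  # 3, e.g. 5 + 5 + 1
```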

Next, we raised the difficulty, testing Skywork o1 Lite on two mathematics questions from the 2024 National College Entrance Examination, Paper A (liberal-arts mathematics).

The first is a probability question (four people A, B, C, and D line up in a row; what is the probability that C is not at the head of the line and A or B is at the tail?). Skywork o1 Lite quickly gave the correct answer.
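With only 4! = 24 arrangements, the answer can be confirmed by exhaustive enumeration (our own check, not the model's method):

```python
from fractions import Fraction
from itertools import permutations

favorable = total = 0
for p in permutations("ABCD"):
    total += 1
    # Condition: C not at the head, and A or B at the tail.
    if p[0] != "C" and p[-1] in ("A", "B"):
        favorable += 1

print(Fraction(favorable, total))  # 1/3
```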

Then come the function questions.

Skywork o1 Lite provides problem-solving ideas and answers in one go.

Careful thought, strong logical reasoning

Logical reasoning is one of the core capabilities on the road to stronger general artificial intelligence, and Skywork o1 Lite is adept at such questions. On the classic liar puzzle, for example, it can work out who is telling the truth and who is lying by reasoning about logical self-consistency.

Skywork o1 Lite is not fooled by paradox questions either.

Impartial in the face of moral dilemmas

Ethical decision-making is, to a large extent, key to ensuring that artificial intelligence develops safely, complies with social ethics, and earns user trust and acceptance; large models must choose their words and actions carefully.

On the eternal question of "save your wife or save your mother," Skywork o1 Lite does not give an absolute answer; it weighs the considerations and offers reasonable suggestions.

There is also the dilemma of saving the many versus saving the few. Skywork o1 Preview does not jump to a conclusion but raises some deeper considerations.

It can also withstand the "Ruozhiba" test

Questions from "Ruozhiba," a Chinese forum famous for absurd trick questions, are often used to test the intelligence of large models, and Skywork o1 Lite answers them with ease. One example: the difference between scoring a full 750 on the college entrance examination and "scoring 985" (a trap, since 750 is the maximum score and "985" refers to a tier of elite universities).

Another example: "Can lunch meat be eaten at night?" Skywork o1 Lite is clearly not misled by the food's name.

Code problems can also be solved

Skywork o1 Lite can also solve coding problems, such as the Number of Islands problem on LeetCode.

The problem reads: given a 2-D grid map of '1's (land) and '0's (water), count the number of islands. An island is surrounded by water and is formed by connecting adjacent land cells horizontally or vertically. You may assume all four edges of the grid are surrounded by water.
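A standard DFS "flood fill" solution to this problem (a reference implementation for comparison, not Skywork's output) looks like:

```python
def num_islands(grid):
    """Count islands by sinking each one with a depth-first flood fill."""
    if not grid:
        return 0
    rows, cols = len(grid), len(grid[0])

    def sink(r, c):
        # Turn this land cell and all connected land into water.
        if 0 <= r < rows and 0 <= c < cols and grid[r][c] == "1":
            grid[r][c] = "0"  # mark visited
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                sink(r + dr, c + dc)

    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == "1":
                count += 1  # each unvisited land cell starts a new island
                sink(r, c)
    return count

grid = [
    ["1", "1", "0", "0", "0"],
    ["1", "1", "0", "0", "0"],
    ["0", "0", "1", "0", "0"],
    ["0", "0", "0", "1", "1"],
]
print(num_islands(grid))  # 3
```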

At this point, we can draw a conclusion:

On the one hand, the "small" problems that used to trip up large models are a piece of cake for Skywork o1 with its reasoning ability. On the other hand, through complete thinking and planning plus self-reflection and self-verification, Skywork o1 can also think critically in complex problem scenarios and output results more accurately and efficiently.

In this way, reasoning ability far stronger than before will unlock Skywork o1's potential across more diverse vertical tasks and fields, especially error-prone logical reasoning and complex science and mathematics tasks. And once live in Tiangong, it is bound to further improve results in creative writing, other high-quality content generation, and deep search.

A domestic o1 model driven by self-developed technology

We have already seen a series of generative AI vertical applications from Kunlun Wanwei, including but not limited to search, music, gaming, social networking, and AI short dramas. Behind them, Kunlun Wanwei laid out its research and development of foundational large-model technology early on.

Since 2020, Kunlun Wanwei has continued to increase its investment in large AI models. Just one month after ChatGPT launched, the company released its own AIGC model series, and it has since launched applications in many vertical fields, including the world's first AI streaming music platform Melodio, the AI music creation platform Mureka, the AI short drama platform SkyReels, and more.

At the foundational level, Kunlun Wanwei has built a full industry-chain layout spanning computing infrastructure, large-model algorithms, and AI applications, with the "Tiangong" series of large models at its core.

In April last year, Kunlun Wanwei released the independently developed "Tiangong 1.0" large model. By April this year, Tiangong had been upgraded to version 3.0, a 400-billion-parameter-class MoE (mixture-of-experts) model that was open-sourced at the same time. Now, Tiangong 4.0 builds on this foundation with improved capability on logical reasoning tasks.

Technically, Skywork o1's significant improvement on logical reasoning tasks comes from Tiangong's self-developed three-stage training scheme:

First, training for reasoning and reflection. Skywork o1 builds high-quality step-by-step thinking, reflection, and verification data through a self-developed multi-agent system, supplemented by high-quality, diverse long-thinking data, for continued pre-training and supervised fine-tuning of the base model.

Second, reinforcement learning for reasoning. The team developed the Skywork o1 Process Reward Model (PRM), adapted to step-wise reasoning enhancement: it effectively captures the influence of the intermediate steps of complex reasoning tasks on the final answer, and is combined with a step-wise reasoning reinforcement algorithm to further strengthen the model's reasoning and thinking.

Third, reasoning planning. Tiangong's self-developed Q* online reasoning algorithm works with the model to think online and search for the best reasoning path. This is claimed to be the first public implementation of the Q* algorithm; it can significantly improve LLMs' reasoning on datasets such as MATH while reducing the demand for computing resources.

On the MATH dataset, Q* helps DeepSeek-Math-7B reach 55.4% accuracy, surpassing Gemini Ultra.

Q* algorithm paper: https://arxiv.org/abs/2406.14283

It can be seen that Kunlun Wanwei's technology has reached an industry-leading level, and the company has gradually gained a firm foothold in the fiercely competitive field of generative AI.

While generative AI applications are flourishing, research at the foundational level has entered the "deep-water zone": only companies with long-term accumulation can build the next generation of applications that change our lives.

We look forward to Kunlun Wanwei bringing us more and more powerful technologies in the future.
