What???
A Chinese startup that has always kept a low profile has quietly produced the top-ranked model in the country, fifth in the world (behind only the o1 series and Claude 3.5)!
And it is the only Chinese company in the top ten.
(The second-highest Chinese entry on the list is Alibaba's open-source qwen2.5-72b-instruct, at 13th overall.)
The leaderboard in question is LiveBench. It may be less famous than the LMSYS Chatbot Arena, but its pedigree is solid:
it was launched this June by Turing Award winner and Meta chief AI scientist Yann LeCun together with New York University and other institutions,
and it bills itself as "the world's first LLM benchmark that cannot be gamed."
As for the dark horse that has suddenly emerged, readers familiar with China's large-model landscape will have guessed it already:
behind the Step series is Stepstar, one of the "Six Little Tigers" of large models.
High marks for instruction following, first place worldwide

On the LiveBench leaderboard, Step-2-16k-202411, the trillion-parameter large language model independently developed by Stepstar, scored 57.68 on the Global Average.
That puts it fifth on the overall list and first in China.
This leaderboard hasn't come up much before now. Partly that's because it is genuinely new, launched only this June; partly, frankly, because Chinese models had never posted impressive results at the top of it.
None of which takes away from the strength of the list itself:
jointly launched by LeCun, New York University, and other institutions, it is purpose-built for large models, currently covering 17 tasks across 6 categories, with fresh questions added every month.
The goal is to keep the questions resistant to contamination and to make evaluation easy, accurate, and fair.
The stress on contamination resistance is there because training data contains vast amounts of internet content, and many benchmarks leak into it easily.
The familiar math test set GSM8K, for example, was recently shown to have many models overfitting to it, which obviously muddies any evaluation of model capability.
Beyond guarding against benchmark contamination, the evaluation method itself also has to be fair and unbiased.
The usual choices are LLM-as-judge or human referees; LiveBench instead scores every question against objective, ground-truth answers.
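For intuition, here is a minimal sketch of that ground-truth style of scoring, as opposed to LLM-as-judge. It is illustrative only, not LiveBench's actual code, and the two sample questions are invented:

```python
# Minimal sketch of objective, ground-truth scoring (illustrative only;
# not LiveBench's implementation -- the sample questions are invented).
def score_exact_match(model_answer: str, ground_truth: str) -> float:
    """1.0 if the normalized answers match, else 0.0 -- no judge model needed."""
    def normalize(s: str) -> str:
        return s.strip().lower()
    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0

QUESTIONS = [
    {"prompt": "What is 17 * 24?", "answer": "408"},
    {"prompt": "Sort [3, 1, 2] in ascending order.", "answer": "[1, 2, 3]"},
]

def evaluate(model_fn) -> float:
    """Average score over the question set; deterministic and reproducible."""
    scores = [score_exact_match(model_fn(q["prompt"]), q["answer"])
              for q in QUESTIONS]
    return sum(scores) / len(scores)
```

Because every question ships with a verifiable answer, the score doesn't depend on a judge model's taste, and swapping in freshly written questions each month is what keeps the set out of training data.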
So, taking a first real look at this leaderboard, what else can we read from it?
Let's start with Step-2 and its standout results.
On IF Average, the instruction-following category, it took first place in the world with the highest score.
This category asks models to rewrite, simplify, summarize, or generate stories based on recent articles from The Guardian.
A score of 86.57 is remarkably high: everyone else on the list, the OpenAI and Anthropic models included, sits in the 70-80 band, and the category's runner-up, meta-llama-3.1-405b-instruct-turbo, trails it by more than 8 points.
This suggests Step-2 exercises strong control over the fine details of language generation, squeezes the most out of comprehension, and therefore follows human instructions better.
More concretely: when an ordinary user types a thoroughly unpolished prompt, with inverted word order, muddled semantics, and vague intent, Step-2 can infer the actual need from context and circumstance, upscaling a blurry instruction from "360p" to "1080p" and accurately capturing the true intent behind it.
It also translates into strong content creation. Ask it to compose a classical Chinese poem, for instance, and it can precisely control the character count, meter, rhyme, and mood.
Fully independent development: MoE architecture, a trillion parameters

Before this LiveBench appearance, the deepest impression Step-2 had left on the outside world was surely "the first trillion-parameter large model launched by a domestic startup."
That is rather typical of Stepstar's style: among the Six Little Tigers, the Step series was released last, but when it moved, it moved decisively.
Step-2 was previewed this March at the opening ceremony of the Global Developer Pioneer Conference, leaping from the hundreds of billions of parameters of its predecessor Step-1 to a full trillion.
After that appetizer, the official version of Step-2 launched over the summer at WAIC 2024.
The model adopts a Mixture-of-Experts (MoE) architecture.
Broadly speaking, there are two mainstream ways to train an MoE model: upcycle an existing dense model, or train from scratch.
Upcycling needs relatively less compute and trains more efficiently, but it hits the ceiling of the approach quickly.
For example, an MoE built by copying weights is very prone to serious homogeneity among its experts.
Training the MoE from scratch, by contrast, opens up a higher ceiling for the model, at the cost of a harder training run; the sketch below makes the contrast concrete.
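A minimal PyTorch sketch of the two initialization strategies (illustrative only, not Stepstar's code; the FFN shape is a placeholder):

```python
import copy
import torch.nn as nn

def experts_by_upcycling(dense_ffn: nn.Module, num_experts: int) -> nn.ModuleList:
    # Upcycling: every expert starts as an exact copy of a trained dense FFN.
    # Cheap to bootstrap, but identical starting points make the experts
    # prone to staying homogeneous.
    return nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))

def experts_from_scratch(d_model: int, d_ff: int, num_experts: int) -> nn.ModuleList:
    # From scratch: each expert gets its own random initialization.
    # Costlier to train, but the experts can diverge and specialize freely.
    return nn.ModuleList(
        nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        for _ in range(num_experts)
    )
```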
The Step team nonetheless chose the latter: fully independent development, training from scratch.
Along the way, novel MoE architecture designs, such as sharing parameters among a subset of experts and heterogeneous expert designs, ensured that every expert in Step-2's mixture is fully trained (a generic sketch of such a layer follows below).
As a result, Step-2's total parameter count reaches the trillion scale, and even the parameters activated on each training or inference pass exceed those of most dense models on the market.
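The article does not disclose Step-2's internals, so the following is only a generic, hypothetical sketch of what "a few always-on shared experts plus top-k routed experts" can look like; every size and the top_k value are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Generic shared-plus-routed MoE layer (illustrative; not Step-2's design)."""

    def __init__(self, d_model=1024, d_ff=4096, num_shared=2, num_routed=8, top_k=2):
        super().__init__()
        def ffn(hidden):  # a heterogeneous design could vary `hidden` per expert
            return nn.Sequential(nn.Linear(d_model, hidden), nn.GELU(),
                                 nn.Linear(hidden, d_model))
        self.shared = nn.ModuleList(ffn(d_ff) for _ in range(num_shared))
        self.routed = nn.ModuleList(ffn(d_ff) for _ in range(num_routed))
        self.router = nn.Linear(d_model, num_routed)
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        out = sum(expert(x) for expert in self.shared)  # shared experts see every token
        gate = F.softmax(self.router(x), dim=-1)        # per-token routing weights
        weights, idx = gate.topk(self.top_k, dim=-1)    # pick top-k routed experts
        for k in range(self.top_k):
            for e_id, expert in enumerate(self.routed):
                mask = idx[:, k] == e_id                # tokens routed to this expert
                if mask.any():
                    out[mask] = out[mask] + weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

Shared experts keep some parameters on every token's path while the router keeps per-token compute sparse; that combination is one common way to ensure all experts keep receiving gradient signal.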
During training, Stepstar's systems team also broke through key technologies such as 6D parallelism, aggressive GPU memory management, and fully automated operations and maintenance, underpinning efficient training of the whole model.
At its debut, the official line on Step-2 was:
Step-2 comes close to GPT-4 across the board in mathematical logic, programming, Chinese knowledge, English knowledge, and instruction following.
Judging by these LiveBench results, the team had a clear grasp of Step-2's positioning and strengths.
A technically strong base model is one thing; the key is getting people to use it.
Officially, Step-2 now powers Yuewen, Stepstar's consumer-facing smart life assistant, which can be tried on both the web and the app.
Developers can access Step-2 via API on the Stepstar Open Platform.
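As a developer-side sketch, hedged accordingly: the base_url and model name below are assumptions made for illustration (many such platforms expose an OpenAI-compatible endpoint), so check the open platform's documentation for the real values:

```python
# Hypothetical API call. The base_url and model name are assumptions for
# illustration; consult the Stepstar Open Platform docs for actual values.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_STEP_API_KEY",             # key issued by the open platform
    base_url="https://api.stepfun.com/v1",   # assumed OpenAI-compatible endpoint
)

resp = client.chat.completions.create(
    model="step-2-16k",                      # assumed model identifier
    messages=[{"role": "user", "content": "Compose a quatrain about autumn."}],
)
print(resp.choices[0].message.content)
```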
Language models and multimodal models, both

As mentioned at the start, Step is a model series, and Step-2 is the flagship of its language models.
Beyond the language models, Stepstar's multimodal models in this series are also well worth a look.
Step-1.5V is Stepstar's large multimodal understanding model, and it stands out in three respects:
First, perception. An innovative image-text training approach lets Step-1.5V understand complex charts and flowcharts, accurately perceive intricate geometric relationships in physical space, and process images at high resolutions and extreme aspect ratios.
Second, reasoning. It can perform a range of advanced reasoning tasks grounded in image content, such as solving math problems, writing code, and composing poetry.
Third, video understanding. It can not only accurately identify the objects, people, and settings in a video, but also grasp the overall atmosphere and the characters' emotions.
On the generation side, there is Step-1X, a large image generation model.
Step-1X adopts a DiT (Diffusion Models with Transformers) architecture and comes in three parameter sizes, 600M, 2B, and 8B, balancing semantic understanding with visual creativity (a generic sketch of a DiT block follows below).
In practice, it copes whether the text instruction is simple or elaborate, and whether the request is a single object or a layered, complex scene.
The model is also deeply optimized for Chinese visual elements, so its output better matches Chinese aesthetic sensibilities.
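For readers unfamiliar with DiT, here is a generic sketch of one transformer block with adaLN-style conditioning, the pattern the DiT paper popularized. It illustrates the architecture family only, not Step-1X's actual implementation, and all sizes are placeholders:

```python
import torch.nn as nn

class DiTBlock(nn.Module):
    """Generic DiT-style block with adaLN conditioning (illustrative only;
    not Step-1X's actual design)."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        # Conditioning (e.g. timestep + text embedding) -> scale/shift/gate terms.
        self.ada = nn.Linear(d_model, 6 * d_model)

    def forward(self, x, cond):  # x: (B, T, D) image tokens, cond: (B, D)
        s1, b1, g1, s2, b2, g2 = self.ada(cond).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1       # modulated pre-norm
        x = x + g1 * self.attn(h, h, h)[0]      # gated self-attention
        h = self.norm2(x) * (1 + s2) + b2
        x = x + g2 * self.mlp(h)                # gated feed-forward
        return x
```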
As for why it works on both language and multimodal models, Stepstar has its reasons.
From the very beginning, Stepstar has laid out a clear roadmap toward AGI:
unimodality → multimodality → unified multimodal understanding and generation → world model → AGI.
In other words, Stepstar's goal is to develop multimodal large models on the path to AGI, and to use those independently developed models to create a new generation of AI applications.
Toward that goal, Stepstar has been writing its own answer for more than a year.
Its R&D iteration is fast: in under a year, progress has continued across the board, from Step-1 to Step-2 and from Step-1V to Step-1.5V.
On the product side it has its own ideas too, refusing to stop at chatbots. On the very day Step-2 topped the domestic rankings, Stepstar's Yuewen also shipped a new feature:
with a simple setup, pressing the Camera Control button on the lower-right edge of the iPhone 16 calls up Yuewen's photo feature.
Apple users without an iPhone 16 can likewise reach domestic AI in one step by upgrading their system to iOS 18.
Even though it already holds a seat among the Six Little Tigers, its recent trajectory still makes "dark horse" feel like the right description.
On technical strength, Step-2 could abruptly take first place in China on an authoritative industry leaderboard and stand as the only domestic player in the global top ten.
It has been almost two years since the large-model wave surged up.
Over those two years, the practitioners caught up in it have been building, seemingly in scattered efforts but actually in concert, a vision that many people want to take part in and be associated with.
There is reason to believe that the Step series, and China's large models more broadly, will shine ever brighter on the strength of their technical excellence and unremitting pursuit of innovation.
One More Thing

Last month, the Beijing Academy of Artificial Intelligence (BAAI) launched FlagEval Debate, a debating platform that aims to offer a new yardstick for evaluating large models by introducing a head-to-head debate mechanism.
The format is somewhat like the chatbot arena: two models take the affirmative and negative sides, the test is double-blind, and users vote after the debate.
Only then are the two models' identities revealed.
Model debate chiefly draws on information comprehension, knowledge integration, logical reasoning, language generation, and dialogue ability.
It can also gauge the depth of information processing and the adaptability of transfer in complex contexts, reflecting progress in learning and reasoning.
After playing with it briefly, I found some of the motions quite entertaining.
For example: "A museum is on fire and you can only save one thing: the cat, or the Mona Lisa?"
Deep into the debate, the two models actually dropped the line "cats have nine lives", which cracked me up.
In the end, across several bouts, Step-2 beat o1.
Looks like its debating ability is nothing to sneeze at either...
LiveBench official site: https://livebench.ai/#/blog
Yuewen link: https://yuewen.cn
FlagEval Debate official website: https://flageval.baai.org/#/debate