Microsoft Research Asia has released an innovative algorithm, rStar-Math.
rStar-Math combines code-augmented chain-of-thought (CoT) reasoning with Monte Carlo tree search, allowing small-parameter models to undergo multiple rounds of deep self-evolution without relying on distillation from a teacher model, greatly improving their mathematical reasoning ability.
On the American Invitational Mathematics Examination (AIME 2024), rStar-Math solved an average of 53.3% (8/15) of the problems, exceeding OpenAI o1-preview's 44.6% and every other open-source large model, and placing it among the top 20% of high school math competitors.
On the MATH benchmark, rStar-Math raised the accuracy of Alibaba's open-source small model Qwen2.5-Math-7B from 58.8% to 90.0%, Qwen2.5-Math-1.5B from 51.2% to 87.8%, and Phi3-mini-3.8B from 41.4% to 86.4%, all exceeding OpenAI o1-preview.
This demonstrates that, with the support of innovative algorithms and high-quality data, small models can achieve reasoning capabilities comparable to cutting-edge models with far larger parameter counts.
Code-augmented CoT
Traditional mathematical reasoning models rely on natural-language chains of thought. Although this approach is intuitive, it is prone to erroneous or irrelevant steps that are hard to detect, especially in complex mathematical problems. rStar-Math therefore adopts a code-augmented CoT (Chain-of-Thought) method to address this.
When the model generates each reasoning step, it produces not only a natural-language explanation but also the corresponding Python code, and the correctness of the step is verified by executing that code. Code-augmented CoT thus provides a strict verification mechanism that ensures every reasoning step is correct.
For example, when solving a mathematical problem, the model might generate an equation-solving step and actually carry it out through Python code. The step is retained as a valid reasoning step only if the code executes successfully and produces the correct result. This approach not only reduces erroneous reasoning steps but also improves the overall quality of the reasoning trajectories.
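Below is a minimal sketch of what this kind of code-level verification might look like. The `verify_step` helper and the sympy-based check are illustrative assumptions for this article, not the paper's actual implementation; the idea is simply that a step survives only if its attached code runs and its assertion holds.

```python
import sympy as sp

def verify_step(step_code: str, namespace: dict) -> bool:
    """Execute the Python code attached to a reasoning step.
    The step is kept only if the code runs without raising."""
    try:
        exec(step_code, namespace)  # hypothetical verification hook
        return True
    except Exception:
        return False

# Natural-language step: "Solve 2x + 3 = 11 for x, so x = 4."
step_code = """
x = sp.symbols('x')
solution = sp.solve(sp.Eq(2 * x + 3, 11), x)
assert solution == [4], f"unexpected solution {solution}"
"""

ns = {"sp": sp}
print(verify_step(step_code, ns))  # True: the step is kept as a valid node
```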
To further ensure the quality of the reasoning steps, rStar-Math uses Monte Carlo Tree Search (MCTS) to generate step-by-step reasoning trajectories, decomposing a complex mathematical problem into a sequence of single-step generation tasks.
At each step, the policy model generates multiple candidate steps, and valid nodes are filtered by executing their code. Through extensive MCTS rollouts, rStar-Math assigns a Q-value to each step, ensuring that the generated reasoning trajectory consists of correct, high-quality intermediate steps.
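As a rough illustration of how Q-values can accumulate over rollouts, here is a simplified sketch. The `StepNode` structure, the running-average update, and the UCB selection score are standard MCTS ingredients assumed for illustration, not the paper's exact formulation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class StepNode:
    """One candidate reasoning step in the search tree."""
    text: str
    parent: "StepNode | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    q_value: float = 0.0  # running average of rollout rewards

def backpropagate(node: StepNode, reward: float) -> None:
    """Propagate a terminal reward (1 = verified-correct final answer,
    0 = wrong) back up the trajectory, updating each step's Q-value."""
    while node is not None:
        node.visits += 1
        node.q_value += (reward - node.q_value) / node.visits
        node = node.parent

def ucb(node: StepNode, c: float = 1.4) -> float:
    """Standard UCB score for choosing among sibling candidate steps."""
    if node.visits == 0:
        return float("inf")
    return node.q_value + c * math.sqrt(math.log(node.parent.visits + 1) / node.visits)

root = StepNode("problem statement")
child = StepNode("candidate step 1", parent=root)
root.children.append(child)
backpropagate(child, reward=1.0)     # a rollout through this step reached a correct answer
print(child.q_value, child.visits)   # 1.0 1
```

After many rollouts, a node's Q-value approximates how often trajectories passing through that step end in a verified-correct answer, which is the signal used downstream.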
PPM training method
Currently, most large models lack fine-grained, step-level feedback when reasoning about mathematical problems, feedback that would help them make better choices during the reasoning process. rStar-Math addresses this by introducing a process preference model (PPM) that helps the model find better reasoning paths.
The core idea of the PPM is to train the model by constructing step-level positive and negative preference pairs, rather than relying directly on precise step-level scores. The PPM's training uses the Q-values generated by MCTS, which are computed through extensive rollouts and backpropagation and reflect each step's contribution to the final answer. While these Q-values are not entirely accurate in themselves, they can reliably distinguish high-quality steps from low-quality ones.
The PPM selects the two steps with the highest Q-values from the MCTS tree as positive examples and the two steps with the lowest Q-values as negative examples to construct preference pairs. In this way, the PPM learns which steps are more likely to guide the model toward a correct reasoning trajectory, and so makes better choices during inference.
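The selection rule itself is simple enough to show directly. The sketch below assumes sibling candidate steps expanded from the same parent node, each carrying a Q-value; the function name and data layout are illustrative.

```python
def build_preference_pairs(sibling_steps):
    """Given sibling candidate steps as (text, q_value) tuples, pair the two
    highest-Q steps (positives) with the two lowest-Q steps (negatives)."""
    ranked = sorted(sibling_steps, key=lambda s: s[1], reverse=True)
    positives, negatives = ranked[:2], ranked[-2:]
    return [(pos, neg) for pos in positives for neg in negatives]

steps = [("step A", 0.92), ("step B", 0.85), ("step C", 0.30), ("step D", 0.10)]
for pos, neg in build_preference_pairs(steps):
    print(f"prefer {pos[0]} (Q={pos[1]}) over {neg[0]} (Q={neg[1]})")
```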
The PPM is trained with the standard Bradley-Terry model and a pairwise ranking loss. For each step, the PPM predicts a reward score, and the pairwise ranking loss maximizes the difference between the scores of positive and negative steps, ensuring the model can accurately distinguish high-quality from low-quality reasoning steps.
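For a single preference pair, the Bradley-Terry pairwise ranking loss takes the familiar form -log sigmoid(r_pos - r_neg); the snippet below is a minimal numerical illustration of that objective (the actual training averages over many pairs and optimizes the reward model's parameters).

```python
import math

def pairwise_ranking_loss(r_pos: float, r_neg: float) -> float:
    """Bradley-Terry style pairwise loss: -log sigmoid(r_pos - r_neg).
    Minimizing it pushes the predicted reward of the preferred step
    above that of the rejected step."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_pos - r_neg))))

print(pairwise_ranking_loss(2.0, -1.0))  # small loss: scores already well separated
print(pairwise_ranking_loss(-1.0, 2.0))  # large loss: the ordering is wrong
```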
The PPM's training method also introduces an important innovation: it avoids using Q-values directly as reward labels. Although Q-values provide some step-level feedback, their inherent noise and inaccuracy mean that using them directly as training targets would cause the model to learn inaccurate reward signals.
Instead, the PPM converts the Q-values into a relative ranking problem by constructing preference pairs, reducing the impact of noise on training. This not only improves the robustness of the model but also allows the PPM to assess the quality of each step more reliably during inference.
Multiple rounds of self-evolution
rStar-Math progressively strengthens the model's reasoning capabilities through four rounds of deep self-evolution, combining the PPM, MCTS, and code-augmented CoT.
In the first round, the base model is initially improved through supervised fine-tuning, laying the foundation for subsequent self-evolution. The key in this round is to generate high-quality initial training data and use it to fine-tune the base model.
In the second round, the model's reasoning ability is significantly improved through the PPM. The PPM analyzes the reasoning steps generated by the policy model to identify which steps are of high quality and which need improvement. This feedback is then passed to the policy model, guiding it to make better choices in subsequent reasoning.
In the third round, PPM-enhanced MCTS generates higher-quality data to further improve the model's reasoning. Here the PPM not only evaluates the reasoning steps produced by the policy model but also guides the MCTS search, making it explore high-quality reasoning paths more efficiently.
In the fourth round, ultra-difficult mathematical reasoning problems are tackled by increasing the number of MCTS rollouts. Building on the first three rounds of self-evolution, this final round further improves rStar-Math's ability to solve challenging mathematical problems.
Increasing the number of rollouts enables MCTS to explore different reasoning paths more deeply and to discover high-quality solutions that may have been overlooked in the initial search. This not only improves the model's ability to solve complex problems but also enhances its robustness on difficult mathematical problems.
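Putting the four rounds together, the overall loop might be sketched as follows. Every helper here is a stub standing in for the real components (policy model, PPM, MCTS search, fine-tuning), and the round-one bootstrap without a trained PPM and the final-round rollout multiplier are simplifications assumed for illustration, not the paper's exact schedule.

```python
# Stub components so the sketch runs end-to-end; each stands in for the
# real policy model, PPM, MCTS search, and fine-tuning described above.
def run_mcts(policy, ppm, problems, n_rollouts):
    return [{"problem": p, "steps": [], "avg_q": 0.5} for p in problems]

def select_best(trajectories):
    return [t for t in trajectories if t["avg_q"] >= 0.5]

def finetune(policy, data):
    return policy  # stand-in: returns the policy unchanged

def train_ppm(preference_pairs):
    return "ppm-vNext"  # stand-in for a retrained preference model

def build_pairs(trajectories):
    return []

def self_evolve(policy, ppm, problems, rounds=4, base_rollouts=16):
    """Four-round self-evolution loop: each round regenerates data with
    code-verified, PPM-guided MCTS, then refreshes the policy model and PPM."""
    for r in range(1, rounds + 1):
        guide = None if r == 1 else ppm                          # round 1 bootstraps without a trained PPM
        n_rollouts = base_rollouts * (4 if r == rounds else 1)   # final round: deeper search
        trajectories = run_mcts(policy, guide, problems, n_rollouts)
        policy = finetune(policy, select_best(trajectories))
        ppm = train_ppm(build_pairs(trajectories))
    return policy, ppm

policy, ppm = self_evolve("policy-v0", "ppm-v0", ["problem 1", "problem 2"])
```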
Code address (currently cannot be opened and is under review): https://github.com/microsoft/rStar
Paper address: https://arxiv.org/abs/2501.04519
Judging from Phi-4, the powerful small model Microsoft open-sourced yesterday, and the newly released rStar-Math algorithm, high-performance, efficient small models look set to become mainstream, and they are very practical for small and medium-sized enterprises and individual developers without large compute clusters.