Much previous work on large language models relies on large amounts of manually annotated data to improve model performance. However, the DeepSeek-R1 paper points out that a model's reasoning capability can be improved through large-scale reinforcement learning alone, without using SFT (supervised fine-tuning) for the cold-start stage.
P.S. Completing the cold start with a small amount of SFT data can further improve model performance.
Personal thoughts: A small amount of SFT improves the model's performance during the cold-start stage, allowing better answers to be found in subsequent RL training. An easy-to-understand version: if you compare the model to a martial artist in a wuxia novel, the "small amount of SFT" is like a martial-arts manual. Given the manual to practice from (corresponding to model training), the martial artist avoids detours and masters the art faster (i.e., the model reaches good performance sooner). Without the manual, the martial artist may take detours and need far more practice and exploration to reach the same level, and may even go down the wrong path entirely (training collapses).
Key steps and model relationship diagram mentioned in the paper
1.1. DeepSeek-R1-Zero
Previous LLM training work relies heavily on supervised data, which is time-consuming and labor-intensive to collect. The training process and results of DeepSeek-R1-Zero demonstrate the potential of LLMs to develop reasoning ability without any supervised data, showing self-evolution through a pure reinforcement learning process.
P.S. DeepSeek-R1-Zero is trained from DeepSeek-V3-Base purely with RL (without using any SFT data).
1.1.1 Reinforcement Learning Algorithm: Group Relative Policy Optimization (GRPO)
To reduce the training cost of reinforcement learning, GRPO is used. This method drops the critic model of traditional PPO (usually the same size as the policy model) and instead estimates the baseline directly from the scores of a group of sampled outputs.
Specifically, for each question q, GRPO samples a group of outputs $\{o_1, \dots, o_G\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the current policy $\pi_\theta$ by maximizing the GRPO objective.
Aside for understanding: the different policy models are actually the same model at different parameter stages. $\pi_{\theta_{old}}$ is the model from the previous iteration (the one that generated the samples), $\pi_\theta$ is the model with the latest (updated) parameters, and $\pi_{ref}$ is the reference model with the initial parameters. The original objective is reproduced below:
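Since the original figure is not included here, the GRPO objective from the paper is, approximately:

$$
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)}
\left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)} A_i,\;
\operatorname{clip}\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_i \right)
- \beta\, \mathbb{D}_{KL}\!\left( \pi_\theta \,\|\, \pi_{ref} \right) \right) \right]
$$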
ε controls the upper limit of each update step, ensuring that each round of learning does not deviate too far from the previous round; it mainly prevents policy collapse.
β controls the strength of the penalty for drifting away from the reference model's original capabilities; it mainly alleviates catastrophic forgetting.
$A_i$ is the advantage, calculated by the following formula:
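Reconstructed from the paper, where $r_i$ is the reward of the i-th sampled output in the group:

$$A_i = \frac{r_i - \operatorname{mean}\!\left(\{r_1, r_2, \dots, r_G\}\right)}{\operatorname{std}\!\left(\{r_1, r_2, \dots, r_G\}\right)}$$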
My own illustration is as follows (for easier understanding):
From Policy Gradient to PPO to GRPO (personal derivation; mathematical derivation warning ⚠️; if there is any error, please point it out)
Some basic concepts (interpreted in the context of the LLM training task):
π (Policy): i.e., the language model itself
θ (Parameters): the model parameters
τ (Trajectory): the output sequence. Here it can be understood as the entire generated response, with each output token being one action.
s (State): the context seen so far; the initial state is the input prompt.
a (Action): the output token, which can loosely be thought of as each character or word (in fact, one word does not always equal one token).
Then, we can get the probability of generating the sequence τ under the model parameter θ as follows:
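Reconstructing the standard formula (the original image is missing), in the general RL form:

$$p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

For LLM generation the environment transition is deterministic (the next state is simply the context plus the emitted token), so this reduces to $p_\theta(\tau) = \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)$.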
Reward function (the reward that an output sequence can obtain):
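Reconstructing the missing formula in the standard form, where $r_t$ is the reward received at step t (for LLM tasks this is often just a single terminal reward for the whole output):

$$R(\tau) = \sum_{t=1}^{T} r_t$$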
From this we can get the expected reward under model parameters θ:
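In the standard form used in Hung-Yi Lee's lecture:

$$\bar{R}_\theta = \sum_{\tau} R(\tau)\, p_\theta(\tau) = \mathbb{E}_{\tau \sim p_\theta}\!\left[ R(\tau) \right]$$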
In summary, we want to adjust the model parameters so that the expected reward is as large as possible, which gives the following policy gradient formula (used to maximize the expected reward):
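Reconstructing the missing formula in its standard form:

$$\nabla \bar{R}_\theta = \mathbb{E}_{\tau \sim p_\theta}\!\left[ R(\tau)\, \nabla \log p_\theta(\tau) \right] \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log \pi_\theta(a_t^n \mid s_t^n)$$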
# In practice, the expectation is approximated by sampling N trajectories.
# The environment transition terms do not depend on θ, so their gradient can be dropped.
Intuitive understanding: if executing a certain action (token) in a certain state (the context above) makes the final output's reward positive, we should increase the probability of that output, and vice versa.
However, looking at the formula above carefully, if the reward is always positive, the output probability of every sampled token will keep being pushed up. In practice we train by sampling, so the probability of outputs that happen not to be sampled gets pushed down even when it should be increased. We therefore introduce a baseline b so that the effective reward is not always positive. The formula becomes:
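Reconstructed with the baseline subtracted:

$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left( R(\tau^n) - b \right) \nabla \log \pi_\theta(a_t^n \mid s_t^n)$$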
Often we can set the baseline to the expected value of the reward, i.e., $b \approx \mathbb{E}\!\left[ R(\tau) \right]$.
At the same time, the final output is a sequence, and the reward above is computed at whole-sequence granularity. Even if the overall reward is positive, it does not mean every action in the sequence was beneficial (e.g., saying a bunch of nonsense and then stating the correct result at the end). A more reasonable approach is to give each action its appropriate credit.
First, we make some assumptions (note: these do not necessarily apply in all circumstances; the reward function should be chosen according to the specific situation):
1. The reward should be computed separately for each action: the sum of all rewards obtained from the current action onward is used as the reward of the current action.
2. Rewards that arrive sooner after an action matter more; the farther away a reward is, the smaller its contribution, which is controlled by a time-decay (discount) factor γ.
In fact, this term measures how much better it is to execute a particular action in a given state than to execute other actions. This is what we usually call the advantage function, written $A^\theta(s_t, a_t)$, so the formula above can be rewritten as:
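Under the two assumptions above, the per-action credit becomes the discounted reward-to-go minus the baseline; reconstructed in the standard form:

$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} A^\theta(s_t^n, a_t^n)\, \nabla \log \pi_\theta(a_t^n \mid s_t^n), \qquad A^\theta(s_t, a_t) = \sum_{t'=t}^{T_n} \gamma^{\,t'-t}\, r_{t'}^{\,n} - b$$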
PPO (Proximal Policy Optimization)
Because policy gradient is an on-policy method, training data must be re-collected after every round of updates, which is very costly. We therefore want to turn it into an off-policy method that can reuse a batch of data for multiple parameter updates.
Importance Sampling: [not explained in detail here]
Using importance sampling, the policy gradient above can be rewritten as:
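Reconstructing the missing formulas: the importance-sampling identity, and the resulting off-policy gradient, where θ' denotes the (old) policy that collected the data:

$$\mathbb{E}_{x \sim p}\!\left[ f(x) \right] = \mathbb{E}_{x \sim q}\!\left[ f(x) \frac{p(x)}{q(x)} \right]$$

$$\nabla \bar{R}_\theta \approx \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t)\, \nabla \log \pi_\theta(a_t \mid s_t) \right]$$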
# Assuming the state distribution $p_\theta(s_t)$ is not affected by θ, so the ratio $p_\theta(s_t)/p_{\theta'}(s_t)$ can be dropped.
So we can obtain a new objective function: # Note: this uses the identity $\nabla f(x) = f(x)\, \nabla \log f(x)$ to turn the gradient back into an objective.
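Reconstructed in the standard form:

$$J^{\theta'}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t) \right]$$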
In other words, it is θ' that interacts with the environment to collect data, and θ that gets updated using that data.
To ensure the updated policy does not drift too far from the data-collecting policy after training (so its basic abilities do not change too much), a regularization-like KL term is introduced:
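Reconstructed in the standard PPO form:

$$J_{PPO}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta\, \mathrm{KL}\!\left( \pi_\theta,\, \pi_{\theta'} \right)$$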
Note: this KL divergence is not a distance between parameters but a distance between the output distributions of the two policies. In practice, an adaptive KL penalty can be used for dynamic adjustment, i.e.:
if $\mathrm{KL}(\pi_\theta, \pi_{\theta'}) > \mathrm{KL}_{\max}$, increase β;
if $\mathrm{KL}(\pi_\theta, \pi_{\theta'}) < \mathrm{KL}_{\min}$, decrease β.
But in practice most people use the following PPO2 formula, which simplifies the computation (the KL term is expensive to compute):
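Reconstructed in the standard PPO2 (clipped surrogate) form:

$$J_{PPO2}^{\theta^k}(\theta) \approx \sum_{(s_t, a_t)} \min\!\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta^k}(a_t \mid s_t)}\, A^{\theta^k}(s_t, a_t),\; \operatorname{clip}\!\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta^k}(a_t \mid s_t)},\, 1-\varepsilon,\, 1+\varepsilon \right) A^{\theta^k}(s_t, a_t) \right)$$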
Essentially, the clipping replaces the adaptive KL penalty as the way to keep the two distributions from drifting too far apart.
Note: clip(a, b, c): when a &lt; b, return b; when a &gt; c, return c; otherwise return a.
In fact, GRPO builds on the PPO2 idea but changes how the advantage is computed (group-relative, without a critic), and additionally introduces a KL regularization term against the reference model so that the model's original abilities do not drift too much during training (if you have read the derivation all the way down, this should now be easy to see). [My ability is limited; if anything is expressed unclearly, please discuss. A small code sketch follows.]
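To tie the derivation together, here is a minimal, illustrative PyTorch-style sketch of a GRPO loss under assumed tensor shapes and names (not the paper's actual implementation):

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """Minimal GRPO loss sketch (assumed shapes/names, not DeepSeek's code).

    logp_new / logp_old / logp_ref: (G, T) per-token log-probs of the G sampled
    outputs under the current, old (sampling), and reference policies.
    rewards: (G,) scalar reward per sampled output.
    """
    # Group-relative advantage: normalize rewards within the group (no critic needed).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)      # (G,)
    adv = adv.unsqueeze(-1)                                        # broadcast over tokens

    # PPO2-style clipped surrogate on the probability ratio.
    ratio = torch.exp(logp_new - logp_old)                         # (G, T)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

    # KL penalty toward the reference model (an unbiased estimator commonly used for GRPO's KL term).
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    # Maximizing the objective == minimizing its negative.
    return -(surrogate - beta * kl).mean()
```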
1.1.2 Reward Modeling
The reward is the source of the training signal and determines the optimization direction of RL. DeepSeek-R1-Zero uses a rule-based reward system consisting mainly of two types of reward:
Accuracy Reward: assesses whether the answer is correct. For example, for math problems with deterministic results, the model is required to give the final answer in a specified format (e.g., inside a box), so the answer can be extracted directly and verified. Similarly, for LeetCode-style programming problems, a compiler can generate feedback based on predefined test cases to determine correctness. P.S. Judging from the deployed distilled Qwen model, the box format for math problems looks like \boxed{}.
Format Reward: forces the model to place its thinking process between the '&lt;think&gt;' and '&lt;/think&gt;' tags. Discussion: the original text does not explain the reward implementation in detail; since it is rule-based, my guess is that it simply checks whether the tags exist in the output, that the final answer is not buried inside the think-tag section, and so on.
Why not use a result-oriented or process-oriented neural reward model?
Because neural reward models were found to suffer from reward hacking during large-scale reinforcement learning. In addition, retraining the reward model requires extra training resources and significantly increases the complexity of the whole training pipeline.
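For illustration, a minimal rule-based reward in the spirit described above might look like the following sketch; the tag names, regexes, and the combination rule are all assumptions, since the paper does not publish its implementation:

```python
import re

def format_reward(output: str) -> float:
    """1.0 if the output follows a <think>...</think><answer>...</answer> structure, else 0.0."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, output, flags=re.DOTALL) else 0.0

def accuracy_reward(output: str, ground_truth: str) -> float:
    """1.0 if the boxed final answer matches the ground truth, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(output: str, ground_truth: str) -> float:
    # Assumed simple sum; the paper does not state how the two rewards are combined.
    return accuracy_reward(output, ground_truth) + format_reward(output)
```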
Knowledge annotation:
Neural reward model: learns or predicts reward signals with a neural network to guide the agent's decisions; it is typically used when a handcrafted reward function is too hard to design for complex scenarios.
Reward hacking: gaming the reward model for score. It can be loosely compared to the "format points" for math problems when we were young (e.g., writing "Solution:" before the working earns some marks), but such rewards do not actually help the model learn to solve the problem well.
1.1.3 Training Template
To train DeepSeek-R1-Zero, a concise instruction template was first designed to guide the base model to follow a preset specification. As reproduced below, this template requires the model to produce a complete reasoning process before outputting the final answer. The constraint is deliberately limited to the structural level, avoiding any content-oriented bias (such as mandating reflective reasoning or promoting specific problem-solving strategies), so that the model's native evolutionary trajectory during reinforcement learning can be observed accurately.
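From memory of the paper, the training template reads approximately as follows ({prompt} is replaced with the actual question at training time):

"A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within &lt;think&gt; &lt;/think&gt; and &lt;answer&gt; &lt;/answer&gt; tags, respectively, i.e., &lt;think&gt; reasoning process here &lt;/think&gt; &lt;answer&gt; answer here &lt;/answer&gt;. User: {prompt}. Assistant:"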
1.1.4 Performance, Self-evolution Process, and Aha Moment of DeepSeek-R1-Zero
[I will not go into much detail on the experimental performance here; interested readers can read the original paper. This part is more of a discussion of the related techniques.]
DeepSeek-R1-Zero achieves strong reasoning capabilities without relying on any supervised fine-tuning data, which verifies that a model can learn and generalize effectively through reinforcement learning (RL) alone. In addition, the performance of DeepSeek-R1-Zero can be further improved significantly by introducing a majority voting mechanism.
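A minimal sketch of majority voting over sampled answers (illustrative only; `extract_answer` is an assumed helper that pulls the final answer out of a sampled output):

```python
from collections import Counter

def majority_vote(samples: list[str], extract_answer) -> str | None:
    """Return the most common final answer among several sampled outputs."""
    answers = [extract_answer(s) for s in samples]
    answers = [a for a in answers if a is not None]  # drop outputs with no parseable answer
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```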
Discussion: Ensemble-learning approaches may be a path to further significant performance gains. For example, one could try to improve performance via MoE (Mixture of Experts with Top-N gating) while keeping the number of active inference parameters unchanged.
Self-evolution Process of DeepSeek-R1-Zero
DeepSeek-R1-Zero's self-evolution process vividly shows how RL drives a model to improve its reasoning capabilities on its own. By starting RL training directly from the base model, interference from a supervised fine-tuning stage is completely removed and the model's evolution trajectory can be observed in full. This provides a clear window for tracking the model's evolutionary path, and is especially effective at revealing how its reasoning ability develops on complex inference tasks.
DeepSeek-R1-Zero achieves autonomous breakthroughs on complex inference tasks through increased test-time computation, generating anywhere from hundreds to thousands of reasoning tokens and thereby exploring and refining its thinking process at greater depth. The most notable feature of this self-evolution is that as test-time computation grows, complex behaviors begin to emerge, specifically:
1. Reflection mechanism: The model will automatically backtrack and reevaluate previous inference steps
2. Multi-path exploration: The system spontaneously tries different problem-solving strategies
It is worth noting that these higher-order cognitive behaviors are not achieved through preset programming; they arise entirely from the continuous interaction between the model and the reinforcement learning environment. This self-organized evolution produces a qualitative leap in the reasoning ability of DeepSeek-R1-Zero, improving both efficiency and accuracy, and ultimately forming a systematic way to tackle complex tasks.
Aha Moment of DeepSeek-R1-Zero
[I won't go into too much detail here; you can feel the shock brought by RL by looking at the figure.]
The aha moment strongly confirms the potential of RL to unlock new levels of intelligence in AI systems, opening up a technical path toward more autonomous and adaptive next-generation models. It also leaves plenty of room for imagination in follow-up work, and even in work leading toward AGI.
The Drawback of DeepSeek-R1-Zero
Admittedly, the pure RL training method is impressive and lets the model spontaneously evolve reasoning capabilities, but it still has flaws. The most significant is that DeepSeek-R1-Zero's outputs often suffer from language mixing and poor readability. To make the reasoning process more readable, training was redone with human-readable cold-start data, yielding DeepSeek-R1, which is introduced in detail in the next section.
Random thoughts & discussion: intuitively, this phenomenon is expected and reasonable. The model essentially uses the output of the so-called "thinking process" to shift the probability distribution of subsequent tokens (i.e., the probability of producing the correct answer). At the same time, the base model was pre-trained on multilingual corpora. Therefore, any token that can shift the subsequent output distribution may be emitted and counted as part of the "thinking process". To some extent this can be seen as a form of XAI (AI interpretability): the computation inside the deep model (the "thinking", i.e., the shaping of the output distribution) is rendered in natural language, and because natural language is readable, people can perceive the so-called "intelligence" more directly.
1.2. DeepSeek-R1: Reinforcement Learning with Cold Start
P.S. DeepSeek-R1 is not trained on top of DeepSeek-R1-Zero; both use DeepSeek-V3-Base as the base model. Roughly speaking, DeepSeek-R1 is an upgraded version of DeepSeek-R1-Zero, while DeepSeek-R1-Zero acts more as a data generator. The breakthrough results of DeepSeek-R1-Zero raise two key research directions:
Cold-start optimization: can introducing a small amount of high-quality data for cold start further improve reasoning performance or accelerate convergence?
Human-friendly evolution: how to train a user-friendly model that has both clear, coherent chain-of-thought (CoT) expression and strong general capabilities?
To address these questions, the DeepSeek-R1 training pipeline was designed as follows:
1.2.1 Cold Start
Unlike DeepSeek-R1-Zero, to prevent the base model from going through an unstable cold-start phase in the early stage of RL training, a small amount of long-CoT data was constructed and collected for DeepSeek-R1 to fine-tune the model, which serves as the initial actor model.
To collect this data, several methods were combined, specifically:
1. Constructing prompts that ask the model to generate detailed answers including reflection and verification;
2. Few-shot prompting with long chain-of-thought examples;
3. Using DeepSeek-R1-Zero to produce outputs in a readable format;
4. Using manual annotation for the final post-processing to refine the results.
The DeepSeek team collected thousands of cold-start samples to fine-tune DeepSeek-V3-Base as the starting point for reinforcement learning. Compared with DeepSeek-R1-Zero, this cold-start data has the following advantages:
• Readability: the core limitation of DeepSeek-R1-Zero is that its generated content is often hard to read; results may mix multiple languages or lack Markdown formatting to highlight the answer for users. In contrast, when building cold-start data for DeepSeek-R1, a readable pattern was designed: each reply contains a summary at the end, and replies with poor readability were filtered out. Specifically, the output format is defined as |special_token|&lt;reasoning_process&gt;|special_token|&lt;summary&gt;,
where &lt;reasoning_process&gt; corresponds to the chain of thought (CoT) and &lt;summary&gt; summarizes the reasoning results at the end.
• Potential: by carefully designing cold-start data patterns with human prior knowledge, the model's performance is observed to improve significantly over DeepSeek-R1-Zero. Based on this, iterative training is believed to be the better evolution path for reasoning models.
1.2.2 Reasoning-oriented Reinforcement Learning
After fine-tuning DeepSeek-V3-Base on the cold-start data, the same large-scale reinforcement learning process as DeepSeek-R1-Zero is applied. This stage aims to improve the model's reasoning ability, especially on reasoning-intensive tasks such as programming, mathematics, science, and logical reasoning, which have clearly defined problems and solutions.
During training, it was observed that the chain of thought (CoT) often exhibits language mixing, especially when the RL prompts involve multiple languages. To alleviate this, a language consistency reward was introduced into RL training, computed as the proportion of target-language words in the CoT. Although ablation experiments show that this alignment causes a slight drop in model performance, the reward matches human preferences better and makes the output significantly more readable. Finally, the reasoning-task accuracy and the language consistency reward are combined by direct weighted summation into the overall reward, and the fine-tuned model is trained with RL until it converges on the reasoning tasks.
Discussion: the paper does not give the exact formula for the language consistency reward. My guess is that it is still rule-based, perhaps removing language-independent content such as digits and punctuation and then computing the proportion of the dominant (target) language among the remaining tokens of the CoT; a sketch of this guess follows.
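A minimal sketch of that guessed rule (purely an assumption; the crude English-vs-Chinese character check below stands in for real language identification):

```python
import re

def language_consistency_reward(cot: str, target_lang: str = "en") -> float:
    """Guessed rule-based reward: fraction of language-bearing tokens in the CoT written in the target language."""
    # Digits, punctuation, and math symbols are ignored by the two patterns below.
    en_tokens = re.findall(r"[A-Za-z]+", cot)          # crude English word detection
    zh_tokens = re.findall(r"[\u4e00-\u9fff]", cot)    # crude Chinese character detection
    total = len(en_tokens) + len(zh_tokens)
    if total == 0:
        return 1.0  # nothing language-specific to penalize
    target = en_tokens if target_lang == "en" else zh_tokens
    return len(target) / total
```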
Rejection Sampling and Supervised Fine-Tuning
When the reasoning-oriented reinforcement learning converges, the resulting checkpoint is used to collect SFT data for the next round of training. Unlike the cold-start data, which focused mainly on reasoning, this stage incorporates data from other domains to improve the model's performance on general tasks such as writing and role-playing. Specifically, data is generated and the model fine-tuned as follows:
Reasoning data: reasoning prompts are curated and the checkpoint above is used to perform rejection sampling to generate reasoning trajectories. In the RL phase, only data that could be evaluated with rule-based rewards was included; at this stage the dataset is extended with additional data whose correctness cannot necessarily be judged by rules. For such data a generative reward model is used for evaluation, i.e., the ground-truth answer and the model's prediction are fed into DeepSeek-V3 for automated judgment. Because model outputs can suffer from language mixing and poor readability, chains of thought containing mixed languages, overly long paragraphs, or code blocks were filtered out. For each prompt, multiple outputs are sampled and only the correct ones are kept. In total, about 600,000 reasoning-related training samples were collected (see the sketch below).
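A rough sketch of this rejection-sampling data pipeline, under assumed helper names (`generate`, `is_correct`, and `is_readable` are placeholders, not DeepSeek's code):

```python
def collect_sft_samples(prompts, generate, is_correct, is_readable, k: int = 16):
    """For each prompt, sample k candidate responses and keep only correct, readable ones."""
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]    # sample k trajectories from the checkpoint
        kept = [c for c in candidates
                if is_correct(prompt, c)                      # rule-based or DeepSeek-V3-judged correctness
                and is_readable(c)]                           # drop mixed-language / overly long / code-block CoTs
        dataset.extend({"prompt": prompt, "response": c} for c in kept)
    return dataset
```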
Knowledge annotation: Rejection sampling: the core idea is to approximate a target distribution using a known, easy-to-sample proposal distribution and an accept/reject mechanism, eventually obtaining a set of samples that follows the target distribution.
Non-reasoning data: for non-reasoning data such as writing, factual QA, self-cognition, and translation, the DeepSeek-V3 construction pipeline is followed and part of its SFT dataset is reused. For certain non-reasoning tasks, DeepSeek-V3 is prompted to generate a potential chain of thought before answering; for simple queries (such as greetings), no CoT is generated. In total, about 200,000 non-reasoning training samples were collected.
Based on the roughly 800,000 curated samples above, two rounds of SFT were run on DeepSeek-V3-Base.
1.2.3 Reinforcement Learning for All Scenarios
To align the model more deeply with human preferences, a second reinforcement learning phase is run. Its goal is to improve the model's helpfulness and harmlessness while continuing to refine its reasoning ability. Specifically, the model is trained with a combination of reward signals and a diversified prompt distribution:
Reasoning data: following the DeepSeek-R1-Zero framework, rule-based rewards are used for mathematics, programming, and logical reasoning.
General data: reward models are used to capture human preferences in complex and nuanced scenarios, following the DeepSeek-V3 pipeline and using the same preference-pair distribution and training prompts:
Helpfulness: evaluation focuses only on the utility and user relevance of the final summary, minimizing interference with the underlying reasoning process.
Harmlessness: the model's entire output (both the reasoning chain and the conclusion) is inspected to identify and suppress potential risks, biases, and harmful content that may appear during generation.
Ultimately, by combining multi-dimensional reward signals with heterogeneous data distributions, a model was trained that follows the principles of helpfulness and harmlessness while maintaining excellent reasoning capabilities.
1.3. Model Distillation: R1 -> Small Models
To give lightweight models reasoning ability similar to DeepSeek-R1, the 800,000 samples curated with DeepSeek-R1 above were used directly to supervised fine-tune (SFT) open-source Qwen and Llama models. Experimental results show that this straightforward distillation method significantly improves the reasoning ability of small models. The base models used include:
Qwen series: Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B
Llama series: Llama-3.1-8B, Llama-3.3-70B-Instruct
For the distilled models, only supervised fine-tuning (SFT) is used and no reinforcement learning (RL) stage is added, even though RL could further enhance performance. The paper states that the core goal is to verify the effectiveness of distillation and to leave the exploration of the RL stage to the broader research community.
Discussion: in downstream applications, one might try using RL to give small models stronger reasoning ability in a specific vertical domain.
2. Discussion
2.1 Distillation vs. Reinforcement Learning
As shown above, knowledge distillation from DeepSeek-R1 gives lightweight models a significant performance boost. However, a key question remains: can a model achieve comparable performance without distillation, relying solely on the large-scale reinforcement learning (RL) training described in this paper? To answer this, the following experiment was carried out on the Qwen-32B base model:
Training setup: using math, programming, and STEM data, large-scale RL training was run for more than 10,000 steps to obtain DeepSeek-R1-Zero-Qwen-32B.
Comparison: DeepSeek-R1-Distill-Qwen-32B was obtained by distilling DeepSeek-R1 into the same base model.
The experimental results show:
The purely RL-trained 32B base model (DeepSeek-R1-Zero-Qwen-32B) performs comparably to QwQ-32B-Preview, while the distilled model (DeepSeek-R1-Distill-Qwen-32B) is significantly better than the purely RL-trained model on all evaluation benchmarks (a clear performance gap).
Core conclusions:
Distilling knowledge from a more powerful model into a lightweight model yields excellent results; relying solely on the large-scale RL method in this paper, a small model consumes enormous compute and still does not surpass the distilled model. Distillation is therefore both economical (low compute cost) and effective (large performance gain); but to push the boundaries of intelligence, more powerful base models and even larger-scale reinforcement learning are still needed.
Discussion: small models have limited expressive capacity. RL requires the starting model to already have some reasoning and comprehension ability (to keep exploitation efficient during RL training; if the starting model is weak, it may take much longer to find ways to solve problems, or never solve them at all) and enough parameters to support learning. Distillation essentially teaches the model the basic problem-solving ideas directly, which I find similar in spirit to R1's cold start. It might be worth replacing the cold-start step with such distilled data and then following the R1 recipe for a comparison experiment; perhaps performance would improve.
2.2 Unsuccessful Attempts
In the early stages of DeepSeek-R1, several methods were tried that did not meet expectations. Some key failure cases are shared in the paper to provide insight into training complex reasoning models (note: these failure analyses do not negate the potential value of the methods).
Case 1: Process Reward Model (PRM)
PRM has been shown to effectively guide the model toward better reasoning paths, but three bottlenecks were found in practice:
1. Fine-grained step definition: it is difficult to explicitly define atomic reasoning steps for general reasoning tasks;
2. Verification of intermediate steps: automated annotation (model judgment) is not reliable enough, and manual annotation does not scale;
3. Reward hacking and system complexity: introducing a model-based PRM induces reward hacking, and iterating the reward model requires extra training resources, significantly increasing the complexity of the training pipeline.
Conclusion: although PRM works well for top-N re-ranking of model-generated responses and for guided search, its benefits do not outweigh the extra computational overhead it introduces in a large-scale reinforcement learning setting.
Case 2: Monte Carlo Tree Search (MCTS)
Inspired by AlphaGo and AlphaZero, MCTS was explored to enhance test-time compute scalability. Method design:
1. Decompose the answer into smaller parts that can be searched;
2. Prompt the model to generate reasoning tags corresponding to the search steps;
3. Use a pre-trained value model to guide the MCTS search for answers;
4. Use the generated question-answer pairs to iteratively train the policy model and the value model.
Challenges when scaling training:
Search space explosion: unlike Go's finite state space, token generation has an exponentially growing search space; even with a cap on the maximum expansion per node, the model still gets stuck in local optima.
Value model dependence: the value model directly determines the quality of the search paths, but training a fine-grained value model is itself extremely difficult, which hinders iterative improvement of the model.
Conclusion: although MCTS combined with a pre-trained value model can improve performance at inference time, achieving continuous improvement of model performance through self-search remains a significant obstacle.
3. Future Work
References
Hung-Yi Lee: Deep Reinforcement Learning [Teacher Li Hongyi's course is strongly recommended]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
A Vision Researcher's Guide to Some RL Stuff: PPO & GRPO
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning