Medicine involves a great deal of complex reasoning, from symptom analysis to disease diagnosis, and each step requires weighing many factors. For example, when diagnosing a rare disease, doctors must not only be familiar with the symptoms of various diseases, but also understand the patient's medical history, family genetic history, living environment and other information, and reach an accurate judgment through successive layers of reasoning.
To help doctors reason more efficiently, the Chinese University of Hong Kong (Shenzhen) and the Shenzhen Big Data Research Institute have jointly open-sourced Huatuo GPT-o1, a large language model dedicated to complex reasoning in the medical field.
Open source address: https://huggingface.co/FreedomIntelligence/HuatuoGPT-o1-7B
Github: https://github.com/FreedomIntelligence/HuatuoGPT-o1
High-quality medical data set
A high-quality, verifiable medical data set is the cornerstone of Huatuo GPT-o1's reasoning ability. The researchers collected 192K medical multiple-choice questions from the MedQA-USMLE and MedMCQA training sets.
These questions cover many medical disciplines, including internal medicine, surgery, obstetrics and gynecology, pediatrics, and neurology, and broadly reflect the knowledge system of the medical field.
However, the original data has many problems and requires strict screening. First, many questions are too simple to train the model's complex reasoning capabilities effectively: some test only a single knowledge point and have answers that are obvious at a glance, so they pose no challenge to the model. Second, some questions have non-unique or ambiguous answers, which interferes with both model training and verification. Third, some questions are not suitable for conversion into open-ended questions, which does not help the model reason in depth.
To select suitable questions, the researchers used a multi-round screening method. In the first round, small language models were used to screen the questions preliminarily and remove simple questions that all of the small models could answer correctly. In the second round, the remaining questions were manually reviewed to eliminate those with unclear or ambiguous answers.
Finally, the selected questions were further refined and verified using the GPT-4o model to ensure that each has a clear, unique correct answer and can be converted into an open-ended question. After these rounds of screening, the result was a data set of 40K verifiable medical questions.
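The first screening round described above can be sketched roughly as follows. This is a minimal illustration, not the researchers' actual pipeline: the record layout, the `Solver` interface, and the stub solvers are all assumptions made for the example, and real small language models would take the place of the stubs.

```python
from typing import Callable, Dict, List

# Hypothetical record layout: {"question": ..., "answer": ...}
Question = Dict[str, str]
# Hypothetical solver interface: question text in, predicted option out.
Solver = Callable[[str], str]


def is_too_easy(item: Question, solvers: List[Solver]) -> bool:
    """True when every small model reproduces the reference answer."""
    if not solvers:
        return False  # with no models to consult, nothing can be ruled out
    return all(solve(item["question"]) == item["answer"] for solve in solvers)


def filter_hard_questions(items: List[Question], solvers: List[Solver]) -> List[Question]:
    """Keep only the questions that at least one small model gets wrong."""
    return [item for item in items if not is_too_easy(item, solvers)]


# Stub solvers standing in for real small language models (assumption).
def always_a(question: str) -> str:
    return "A"


def a_unless_rare(question: str) -> str:
    return "B" if "rare" in question else "A"


questions = [
    {"question": "Single-fact recall question", "answer": "A"},
    {"question": "Multi-step question about a rare disease", "answer": "C"},
]
kept = filter_hard_questions(questions, [always_a, a_unless_rare])
print([q["question"] for q in kept])
```

Here the first question is dropped because both stub solvers answer it correctly, while the second survives because at least one solver gets it wrong. The later rounds (manual review and GPT-4o checking) are not modeled in this sketch.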
Two-stage training mode
In the first stage, Huatuo GPT-o1 first conducts a preliminary analysis of a given verifiable medical problem and generates an initial chain of thought (CoT) and answer. For example, for a question about a patient's symptoms, the model may narrow down the range of possible diseases based on the symptoms, their order of onset, accompanying symptoms and other factors, and give a preliminary diagnosis.
This initial answer is then rigorously checked by a medical verifier. If the answer is incorrect, the model starts an iterative optimization process: it randomly selects one of four preset search strategies (exploring new paths, backtracking, verification, and correction) to improve on its previous reasoning.
Suppose the model overlooks an important symptom during diagnosis, leading to an incorrect initial diagnosis. With the exploring-new-paths strategy, the model tries to analyze the symptoms from a new perspective and considers other possible diseases; with the backtracking strategy, it returns to an earlier reasoning step and re-examines the association between symptoms and disease; with the verification strategy, it re-evaluates its current reasoning to check for logical gaps; and with the correction strategy, it corrects the errors in its previous reasoning and adjusts the direction of diagnosis based on the verifier's feedback.
The model repeats this process until it finds the correct answer: each iteration generates a new CoT and answer, and the verifier checks the new answer until one is confirmed to be correct. Through these repeated attempts and improvements, the model learns sound medical reasoning, which improves the accuracy and reliability of its reasoning.
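The verify-and-refine loop described above might look something like the following sketch. It is an illustration under stated assumptions rather than the project's real code: `generate`, `refine`, and `verify` are hypothetical callables standing in for the language model and the medical verifier, the exact-match check is a placeholder for the real verifier, and the `max_refinements` cap is an added safeguard that the article does not mention.

```python
import random
from typing import Callable, List, Optional, Tuple

# The four search strategies named in the article.
STRATEGIES = ["explore_new_path", "backtrack", "verify", "correct"]

Attempt = Tuple[str, str]  # (chain of thought, answer)


def search_for_correct_trajectory(
    question: str,
    reference: str,
    generate: Callable[[str], Attempt],
    refine: Callable[[str, List[Attempt], str], Attempt],
    verify: Callable[[str, str], bool],
    max_refinements: int = 3,  # assumption: a cap so the loop always ends
    rng: Optional[random.Random] = None,
) -> Optional[List[Attempt]]:
    """Return every attempt up to the first verified answer, or None on failure."""
    rng = rng or random.Random()
    trajectory = [generate(question)]
    while not verify(trajectory[-1][1], reference):
        if len(trajectory) > max_refinements:
            return None  # give up once the refinement budget is spent
        strategy = rng.choice(STRATEGIES)  # pick one strategy at random
        trajectory.append(refine(question, trajectory, strategy))
    return trajectory


# Stubs standing in for the model and the verifier (assumptions).
def generate(question: str) -> Attempt:
    return ("Fever and cough suggest influenza.", "influenza")


def refine(question: str, trajectory: List[Attempt], strategy: str) -> Attempt:
    return (f"[{strategy}] The rash was overlooked; reconsidering.", "measles")


def verify(answer: str, reference: str) -> bool:
    return answer.strip().lower() == reference.strip().lower()


found = search_for_correct_trajectory(
    "Child with fever, cough and rash", "measles", generate, refine, verify,
    rng=random.Random(0),
)
print(len(found), found[-1][1])
```

The function returns the full list of attempts rather than only the last one, because the successful trajectory, including its failed steps, is what later gets rewritten into a single coherent chain of thought.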
When the model successfully finds a correct reasoning trajectory, the trajectory is reformatted into a more natural and coherent complex CoT. For example, the original reasoning may be a series of scattered steps and conclusions; after formatting, it becomes a logically clear and fluent narrative in which natural transition words (such as "hmm" and "also") connect the steps, bringing the whole reasoning process closer to human thinking.
During formatting, the model highlights the key reasoning steps and their basis, so that the complex CoT clearly displays the model's thinking process. The model then generates a formal answer based on this complex CoT; the answer contains the final conclusion and also briefly summarizes the reasoning, making it easier to communicate and explain to the user.
These complex CoTs and formal answers are used to construct supervised fine-tuning (SFT) training data, from which the model learns to think deeply and reason before answering, integrating complex medical knowledge and reasoning into a complete solution. This training method helps improve the model's performance in practical applications, allowing it to handle a wide range of complex medical problems.
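As a rough illustration of what one such SFT record could look like, the sketch below joins a question, its complex CoT, and the formal answer into a prompt/completion pair. The field names and the "## Thinking" / "## Final Response" section markers are assumptions chosen for this example; the article does not specify the actual record format.

```python
from typing import Dict


def build_sft_example(question: str, complex_cot: str, response: str) -> Dict[str, str]:
    """Place the reasoning before the formal answer in a single training target."""
    completion = (
        "## Thinking\n\n"  # assumed marker for the complex CoT
        f"{complex_cot.strip()}\n\n"
        "## Final Response\n\n"  # assumed marker for the formal answer
        f"{response.strip()}"
    )
    return {"prompt": question.strip(), "completion": completion}


example = build_sft_example(
    "A child has fever, cough and a spreading rash. What is the likely diagnosis?",
    "Hmm, fever and cough alone suggest influenza. Also, the rash points elsewhere...",
    "The most likely diagnosis is measles, given the combination of fever, cough and rash.",
)
print(example["completion"])
```

Putting the reasoning ahead of the answer in the training target is what lets the fine-tuned model learn to think first and respond second.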
Experimental data
To evaluate the performance of Huatuo GPT-o1, comprehensive tests were conducted on medical benchmarks such as MedQA, MMLU-Pro, MedMCQA, and PubMedQA. The results showed that the Huatuo GPT-o1-70B version surpassed all other open source models and achieved leading results on several data sets.
For example, on the health and biology tracks of MMLU-Pro, its accuracy reached 73.6% and 71.0% respectively, and on the genetics and molecular biology tracks of GPQA, its accuracy reached 66.5% and 56.2% respectively.