Today is the last day of the "12 Days of OpenAI" event. No matter judging from the pace or timing of releases in the past few days, it’s time to show off the real thing.
Sure enough, as Sam Altman’s “oh oh oh” riddle suggested, OpenAI finally officially announced the latest flagship members of the inference model series: o3 and o3 mini.
The team said that these two models have achieved a major breakthrough in performance. Being able to handle increasingly complex reasoning tasks marks "AI technology entering a new stage."
Interestingly, as the next generation model of o1, OpenAI skipped "o2" in naming and jumped directly to Arrived at o3. This will prevent trademark conflicts with British telecom operator O2 and avoid potential legal disputes.
o3 model: reasoning performance soars, multiple superhuman expertsAs OpenAI’s current most powerful inference model, o3 has performed well in multiple benchmark tests, especially in the fields of programming and mathematics.
• Programming ability: In the real-world software task evaluation (HumanEval-Verified), o3 set a new record with an accuracy of 71.7%, an improvement of more than 20% compared to the previous model o1; in competitive code programming The ELO score on the platform (Competition Code) is as high as 2727, far exceeding o1’s 1891.
During the live broadcast, Sam asked Mark, the research director who also teaches competitive programming, how many points he could score. Mark replied that his best score on a similar platform was about 2,500 points. Sam then revealed that o3's score even exceeded that of chief scientist Yakov.
When he learned that someone in the company could score more than 3,000 points, Sam joked: "He can still enjoy this advantage for a few months. O3's performance in programming is really incredible."
• Mathematical Reasoning: o3 achieved an unprecedented 96.7% accuracy in the American Mathematical Olympiad Examination (AIME), and an accuracy of 87.7% in the Doctoral Science Question Test (GPQA Diamond), significantly exceeding the average level of human experts. 70%.
• Cutting-edge testing was conquered for the first time in five years
Mark mentioned that among existing traditional benchmark tests, o3 is close to saturation, highlighting the need for more difficult testing.
Recently, Epic AI’s cutting-edge math benchmark has stood out as the most difficult math assessment available. This dataset contains new, unpublished, and extremely complex problems that would require even an expert mathematician to solve.hours or even days.
The accuracy of all products currently on the market in this test is less than 2%, but the accuracy of o3 reaches more than 25% under strict settings, showing strong mathematical reasoning capabilities.
The bigger surprise comes from o3’s performance in the Arc AGI test.
Arc AGI is a unique benchmark designed by François Chollet in 2019 to evaluate the general intelligence level of AI systems. What's special about it is that it doesn't look at learned knowledge, but requires the model to infer new task rules and learn on the fly by observing several examples. For example:
Infer the rule "Place dark blue squares in the spaces";
Or "Count the number of colored squares in the yellow square, and then surround the yellow square with this width" .
These rules are intuitive to humans but extremely challenging for AI systems.
In this test that has not been overcome in five years, o3 achieved a historic breakthrough: under low computing power configuration, its accuracy reached 75.7%, setting a new high in public records; under high computing power, The performance has been improved to 87.5%, which is higher than the human average level of 85%.
This is the first time that an AI system has surpassed human performance in a task that requires instant understanding and learning of new rules, verifying AI's substantial progress in novelty adaptation.
However, the ARC Prize, the organization responsible for the test, also stated that this does not mean that AGI has been implemented. o3 still makes mistakes on some simple tasks, indicating that it is still fundamentally different from human intelligence. They will continue to hold the Grand Prix until there is an efficient open source solution that achieves 85% results (as can be seen in the figure, o3 with high computing power costs $1,000 per task to perform).
o3 mini: Super O1 performance, ideal for high efficiency and low costFor application scenarios that require a balance between performance and cost, OpenAI launched o3 mini. It inherits the previous o1 mini's advantages in mathematics and coding, and achieves greater breakthroughs in cost performance.
The most eye-catching thing is its innovative "adaptive thinking time" function, which provides three reasoning intensity options of low, medium and high, allowing users to flexibly adjust the thinking time of the model according to the complexity of the task. It's like switching the brain into different working modes.
In actual programming tests, o3 mini has outperformed o1 at medium inference times at a fraction of the cost and latency. This means that it can complete difficult programming tasks in a more economical way, providing an ideal choice for developers.
Research scientist Hongyu demonstrated o3 mini through several casesExcellent real-world performance in high, medium and low intensity modes:
1. Code generation and execution:
In high-intensity mode, o3 mini is required to use Python Write a smart programming assistant. The assistant is equipped with a simple input box interface. Users only need to enter their requirements and it will generate and execute code. This complex task fully demonstrates the efficiency and accuracy of the model in programming scenarios.
2. Self-assessment capability:
In medium-intensity mode, o3 mini was asked to evaluate its performance on the complex GPQA data set. The model generated an evaluation script and quickly completed data set parsing, question classification, answer generation and result scoring, achieving a score of 61.62% within 1 minute. Such performance also poses a great challenge to human experts.
3. Efficiency testing and mathematical reasoning:
In low-intensity mode, the response speed of o3 mini is almost the same as GPT-4, and the user gets a reply almost immediately after pressing the send key. . Even in medium mode, it is twice as fast as O1, and this high performance is achieved at a significant cost reduction.
In the U.S. Mathematical Olympiad 2024 data set test, o3 mini's performance was equivalent to o1 under medium inference time settings, and its performance exceeded o1 under high inference time.
In addition, o3 mini also supports API features required by developers such as function calls and structured output.
Open testing and deployment timelineOpenAI plans in 2025 The o3 mini was released at the end of January, followed by the full version of the o3.
From now on, researchers and developers can apply for safety testing on the OpenAI official website (https://openai.com/index/early-access-for-safety-testing/#how-to-apply) , to gain early access. Applications will be open until January 10, 2025.
This conference also specifically mentioned that o3 and o3 mini introduced a new deep alignment (Deliberative Alignment) technology.
This technology greatly improves the model's ability to identify potentially unsafe requests by reasoning about the user's input intention. Even if the user tries to use obscure language to bypass restrictions, the model can accurately determine the dangerous intention. Test results show that o3 performs excellently in security assessments, and its accuracy and sensitivity in rejecting unsafe requests are significantly improved.
Chinese researchers emergeDuring the official announcement of o3 mini, in addition to the research scientist Hongyu Ren who was introduced, there were also young Chinese researchers such as Kevin Lu and Shengjia Zhao who were also responsible for model training. Face.
Hongyu Ren graduated from Peking University and received a PhD in computer science from Stanford University. Before joining OpenAI, he worked as an intern researcher at Apple, Google, NVIDIA and Microsoft.
As OpenAI o1-mini As the creator and foundational contributor of o1, Hongyu also serves as the person in charge of GPT-4o mini, and is deeply involved in the development of GPT-4o, focusing on making the model think faster, deeper, and more accurately.
< p>Kevin Lu graduated from the University of California, Berkeley, majoring in electronic information engineering and computer science, and worked as a researcher at Berkeley AI Research.Shengjia. Zhao graduated from Tsinghua University and also holds a PhD in computer science from Stanford University. He is a core contributor to GPT-4.
OpenAI wants to abandon GPT. , are you fully committed to the o series?Judging from today’s finale release, OpenAI is undergoing a major strategic shift.
At the recent NeurIPS 2024 conference, Ilya, former co-founder of OpenAI Sutskever gave a speech titled "The End of the Pre-training Era". He pointed out that the pre-training method of AI models is facing a data bottleneck, and the data available on the Internet are like "fossil fuels" and are unsustainable. The expansion rule that "computing power equals better performance" is failing, and AI technology needs to find new development paths.
Ilya predicts that future AI systems will be more "agentic". Not only can they complete tasks, but they can also Solve problems step-by-step like humans through reasoning skills . This new paradigm may be the key to breaking through the current technical bottleneck, and it will also bring higher uncertainty.
OpenAI may be moving from the traditional GPT large language model to the "o" series inference model. Realizing that relying solely on pre-trained GPT models can no longer meet the needs of future AI development, we hope to find breakthroughs for achieving higher levels of intelligence by integrating reasoning capabilities.
In addition to OpenAI, similar trends are also reflected. Competitor Google’s newly released Gemini. 2.0 Flash ThinkingIt is regarded as the beginning of AI reasoning models and may be deeply integrated with major language models in the future.
The initiatives of various technology companies indicate that reasoning capabilities are becoming the new focus of industry development, and how to organically combine them with general-purpose large language models may be the core direction of AI competition in the next stage. OpenAI began to use the same strategy as the GPT stage in this technical direction - rapid iteration, even the futures were demonstrated first, and then AGI and Scaling law were the most important concepts for the thinking, development and promotion of the entire industry. Hold it firmly in your own hands and be defined by it.
After o3 was released, Jason Wei, a star researcher at OpenAI, said that more importantly, it only took three months from o1 to o3, which proved how fast progress can be made under the new paradigm.
It is much faster than the pre-trained paradigm that is updated every one or two years.
With OpenAI 12-day technical release summaryDay 1: o1 official version and ChatGPT Pro
O1 official release version, performance is improved by 34%, thinking speed is increased by 50%, and multi-modal input support is added; ChatGPT Pro is launched, subscribers can have unlimited use of o1 Pro mode and advanced voice functions for a monthly fee of US$200.
Day 2: Enhanced fine-tuning research plan
The enhanced fine-tuning plan is extended to research institutions and enterprises to help users create domain expert models with a small amount of data.
Day 3: Sora official version
The Wensheng video tool Sora is launched, which can create videos with a maximum length of 20 seconds and a resolution of 1080p. It supports Wensheng videos and various editing. Open to Plus and Pro users.
Day 4: Canvas
Released the Canvas collaboration interface, which supports Python code running and parallel editing, improving the writing and programming experience.
Day 5: ChatGPT integrates with Apple’s intelligent system
Integrate with Apple’s intelligent system to enhance Siri’s task processing capabilities, support advanced functions such as document summary and translation, and adapt to the latest iOS, iPadOS and macOS systems.
Day 6: Advanced voice adds video function and Santa Claus mode
Advanced voice mode adds video chat and Santa Claus voice to enhance interactive fun and user experience.
Day 7: Projects function
The Projects function is launched, allowing users to organize folders, upload content, set instructions, and trace past conversations, bringing greater convenience to individual and team collaboration. Answer accurately.
Day 8: All search functions are free, and new map functions are added
ChatGPT search is free and open to all users, supporting real-time information query and map interaction.
Day 9: Developer Tools and o1 API
OpenAI o1 API is released, optimizing real-time API and fine-tuning tools to provide developers with more flexible and efficient model building capabilities.
Day 10: 1-800-CHATGPT
Launched voice call service. Users can dial "1-800-CHATGPT" over the phone to have a real-time voice conversation with AI.
Day 11: Application integration function
Enhance the integration function of ChatGPT and some applications to achieve direct interaction and control, and improve cross-platform work efficiency and productivity.
Day 12: Next-generation inference models o3 and o3 mini
The strongest inference model so far, o3, and its efficient version, o3 mini, are released. Among them, o3 surpassed the human average level for the first time in the Arc AGI test, and o3 mini achieved performance close to the top model at a low cost through the innovative "adaptive thinking time" function.