After holding back for more than half a year, Anthropic has finally made its big move: the first hybrid reasoning model, Claude 3.7 Sonnet, has made its debut!
This is the smartest model in the Claude series to date: it can respond almost instantly, or perform scalable, step-by-step thinking.
In short, one model, two ways of thinking.
Suppose you want to crack a classic game-theory puzzle, the Monty Hall problem. Throw it at Claude 3.7 Sonnet with the "Extended" mode selected, and it will show its detailed chain-of-thought (CoT) process, finishing in 52 seconds.
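As a quick sanity check on the puzzle itself (a sketch of the standard Monte Carlo argument, not Claude's own derivation), a short simulation confirms that always switching doors wins about two-thirds of the time:

```python
import random

def play(switch: bool, rng: random.Random) -> bool:
    """Play one round of Monty Hall; return True if the final pick wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a goat door that is neither the car nor the player's pick.
    opened = next(d for d in doors if d != car and d != pick)
    if switch:
        # Switch to the one remaining unopened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(42)
trials = 100_000
wins = sum(play(True, rng) for _ in range(trials)) / trials
print(f"win rate when switching: {wins:.3f}")  # close to 2/3
```

Switching wins exactly when the initial pick was wrong, which happens with probability 2/3, which is the result Claude's extended reasoning walks through.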
Most importantly, Claude 3.7 Sonnet is now available to everyone for free, though the "Extended Thinking" mode is not included in the free tier.
In multiple benchmarks, Claude 3.7 Sonnet with "Extended Thinking" enabled sets new SOTA in mathematics, physics, instruction following, programming, and more.
Compared with the previous-generation Claude 3.5 Sonnet, its math and coding capabilities have jumped by more than 10%.
Outside of mathematics, Claude 3.7 Sonnet (64k extended thinking) almost completely crushes o3-mini and DeepSeek R1, and is comparable to Grok 3.
API users can precisely control the model's thinking time.
It is fair to call Claude 3.7 Sonnet the strongest "software engineering AI" available: it scored a high 70.3% on SWE-bench.
At the same time, Anthropic's first "agentic coding" tool, Claude Code (preview version), was also released today.
Now, it has become an indispensable tool within Anthropic. In early testing, Claude completed a 45-minute task in one go.
In other words, if you are a product manager, the AI will write the code for you.
Although there is no Claude 4, Anthropic's surprise move has genuinely shocked the AI world.
This half month is shaping up to be the most eventful stretch for AI since the start of 2025.
Grok 3 was released just last week; DeepSeek open-sourced projects for five consecutive days this week; OpenAI's GPT-4.5 is also said to be on the way. Add Claude 3.7 Sonnet, and the melee in the large-model field has begun again.
The world's first "hybrid reasoning" model is born

In the official blog post, Anthropic says Claude 3.7 Sonnet is its smartest model to date and the first hybrid reasoning model on the market.
Claude 3.7 Sonnet can generate near-instant responses, or show the detailed steps of its thinking process to the user. API users can also finely control the model's thinking time.
Claude 3.7 Sonnet has been significantly improved in coding and front-end web development.
In addition, they have also launched a command-line tool called Claude Code for agentic coding.
Currently, Claude Code is available only as a limited research preview; it enables developers to delegate substantial engineering tasks to Claude directly from the terminal.
Reasoning is an integral capability of an LLM
Claude 3.7 Sonnet's design philosophy differs from other reasoning models on the market.
Anthropic believes that, just as humans use a single brain for both quick responses and deep reflection, reasoning should be an integrated capability of frontier models rather than a completely separate model. This unified approach provides users with a smoother experience.
Claude 3.7 Sonnet embodies this concept in several aspects.
First of all, Claude 3.7 Sonnet is both a normal language model (LLM) and a reasoning model: you can choose when you want the model to answer normally and when you want it to think longer before answering.
In standard mode, Claude 3.7 Sonnet is an upgraded version of Claude 3.5 Sonnet.
In extended thinking mode, it reflects before answering, which improves performance on math, physics, instruction following, coding, and many other tasks.
Generally, prompting the model works similarly in both modes.
Secondly, when using Claude 3.7 Sonnet through the API, users can also control the thinking budget: you can tell Claude to think for at most N tokens before answering, where N can go up to the output limit of 128K tokens. This lets users trade off speed (and cost) against answer quality.
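As a sketch of what that looks like in practice, the request below caps the thinking budget via the Messages API's `thinking` parameter. The parameter shape follows Anthropic's published API for Claude 3.7 Sonnet, but treat the exact model ID and limits as illustrative and check the current API docs:

```python
def build_request(budget_tokens: int, max_tokens: int = 128_000) -> dict:
    """Assemble a Messages API payload with a capped thinking budget."""
    if budget_tokens >= max_tokens:
        # The thinking budget is part of the output allowance.
        raise ValueError("the thinking budget must be smaller than max_tokens")
    return {
        "model": "claude-3-7-sonnet-20250219",  # illustrative model ID
        "max_tokens": max_tokens,
        # Allow up to N tokens of internal reasoning before the visible answer.
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [
            {"role": "user", "content": "Solve the Monty Hall problem step by step."}
        ],
    }

payload = build_request(16_000)
```

With the official `anthropic` SDK, sending this payload would be a single `client.messages.create(**payload)` call; a larger budget buys deeper reasoning at higher cost and latency.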
Thirdly, when developing this reasoning model, Anthropic slightly reduced its optimization for math and computer-science competition problems and instead shifted focus to real-world tasks that better reflect how enterprises actually use LLMs.
Claude 3.7 Sonnet sets a new SOTA on SWE-bench Verified, a benchmark designed to evaluate AI models' ability to solve real-world software problems.
Claude 3.7 Sonnet also sets a new SOTA on TAU-bench, a framework for testing AI agents' ability to interact with users and tools in complex real-world tasks.
As mentioned earlier, Claude 3.7 Sonnet has improved significantly on almost all major benchmarks.
Compared with the latest Grok 3 Beta model, Claude 3.7 Sonnet (64k extended thinking) is nearly tied in reasoning, while slightly behind in mathematics and visual reasoning.
Compared with o3-mini and DeepSeek R1, Claude 3.7 Sonnet with extended thinking scored highest everywhere except mathematics.
Claude 3.7 Sonnet performs well in instruction following, general reasoning, multimodal capabilities, and autonomous programming, with extended thinking bringing significant gains in mathematics and science. Beyond traditional benchmarks, it even surpasses all previous models in the Pokémon game test.
Agentic coding AI completes a 45-minute task in one go

Since June 2024, the Sonnet series has been the preferred model for developers around the world.
Today, Anthropic's first agentic coding tool, Claude Code, was born; it is currently released as a limited research preview.
Claude Code actively collaborates with people: it searches and reads code, edits files, writes and runs tests, commits and pushes code to GitHub, and uses command-line tools, while keeping the user in the loop at every step.
In addition, this update also improves the coding experience on Claude.ai.
All Claude plans now support GitHub integration: developers can connect code repositories directly to Claude.
As Anthropic's most powerful coding model to date, Claude 3.7 Sonnet has a deeper understanding of personal, work, and open-source projects, making it a powerful assistant for fixing bugs, developing new features, and writing documentation on GitHub.
Claude Code is still in its early stages, but it has become an indispensable tool for the Anthropic team, especially for test-driven development, debugging complex problems, and large-scale refactoring.
In early testing, it completed in one pass tasks that would usually require more than 45 minutes of manual work, significantly reducing development time and workload.
In the coming weeks, Anthropic plans to keep improving it based on usage: making tool calls more reliable, adding support for long-running commands, improving in-app rendering, and expanding Claude's understanding of its own capabilities.
Scaling Claude as an AI agent
Claude 3.7 Sonnet has a new capability called "action scaling": it can iteratively call functions, respond to changes in the environment, and keep operating until an open-ended task is complete.
For example, in computer use: Claude can perform tasks on the user's behalf by issuing virtual mouse clicks and keystrokes. Compared with previous generations, Claude 3.7 Sonnet can invest more interactions in computer-use tasks, and with more time and compute it often achieves better results.
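The observe-act loop described above can be sketched in a few lines. Everything here is illustrative (the `Action` type, the scripted policy, and the string stand-in for screenshots are all made up for the sketch), but the control flow is the point: the agent keeps acting and re-observing until it declares the open-ended task done.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Action:
    kind: str        # "click", "type", or "done"
    payload: str = ""

def run_agent(policy: Callable[[str], Action], screen: str,
              max_steps: int = 10) -> List[Action]:
    """Iteratively ask the policy for an action, apply it, and observe again."""
    history = []
    for _ in range(max_steps):
        action = policy(screen)
        history.append(action)
        if action.kind == "done":
            break
        # Stand-in for taking a fresh screenshot after the action lands.
        screen = f"{screen}|{action.kind}"
    return history

# Toy scripted policy: type a query, click search, then declare the task done.
script = iter([Action("type", "weather"), Action("click", "search"), Action("done")])
trace = run_agent(lambda s: next(script), "home")
```

In a real system the policy would be a model call that receives the screenshot and returns the next virtual input; "action scaling" amounts to letting this loop run for many more iterations.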
This advance is fully reflected in the OSWorld evaluation, a testing platform for assessing the capabilities of multimodal AI agents.
Claude 3.7 Sonnet performs well from its initial steps, and its performance advantage continues to widen the longer it interacts with the virtual computer.
Claude's extended thinking mode, combined with agent training, not only helped it perform better in standard evaluations such as OSWorld, but also enabled major breakthroughs on some other, unexpected tasks.
Take games, for example, especially the classic Game Boy title Pokémon Red. They equipped Claude with basic memory, screen-pixel input, and function calling for button presses and screen navigation, letting it break through ordinary context limits, keep playing, and sustain tens of thousands of consecutive interactions.
In the figure below, they compare the progress of Claude 3.7 Sonnet (with extended thinking) against previous Claude Sonnet versions in the Pokémon game.
As shown, early versions struggled to advance at the start of the game; Claude 3.0 Sonnet could not even leave the starting house in Pallet Town where the story begins.
Claude 3.7 Sonnet, with its improved agent capabilities, made significant progress, successfully challenging and defeating three Gym Leaders and earning their badges.
Claude 3.7 Sonnet excels in trying multiple strategies and revisiting existing assumptions, which allows it to continuously improve its abilities during the game.
Serial and parallel test-time compute scaling
When Claude 3.7 Sonnet uses its extended thinking ability, it is exploiting "serial test-time compute".
Specifically, it performs multiple consecutive reasoning steps before producing the final output, steadily increasing its compute investment along the way.
Overall, this mechanism improves performance in a predictable way: in mathematical problem solving, for example, accuracy grows logarithmically with the number of "thinking tokens" it is allowed to sample.
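To make that logarithmic shape concrete, here is a purely illustrative curve. The base accuracy, slope, and token values are made-up constants for the sketch, not Anthropic's measured numbers; the point is only that each doubling of the thinking budget adds a roughly fixed bump.

```python
import math

def accuracy(thinking_tokens: int, base: float = 0.50, slope: float = 0.05) -> float:
    """Hypothetical log-linear fit: accuracy = base + slope * log2(budget / 1024),
    capped at 1.0. All constants are illustrative."""
    return min(1.0, base + slope * math.log2(thinking_tokens / 1024))

for budget in (1024, 4096, 16384, 65536):
    print(budget, round(accuracy(budget), 3))
```

Doubling the budget from 1k to 2k tokens buys the same accuracy bump as doubling from 32k to 64k, which is why returns diminish even as performance keeps climbing.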
Claude's researchers are also exploring parallel test-time compute to improve model performance.
The approach is to sample multiple independent thinking processes and select the best result without knowing the correct answer in advance. This can be done with majority (consensus) voting: the answer that appears most often is chosen as the "best" one.
Alternatively, another LLM can be used to verify the work, or a trained scoring function can select the optimal answer.
These optimization strategies (and related research work) have been verified in the evaluation reports of multiple AI models.
On the GPQA evaluation, they made breakthrough progress with parallel test-time compute scaling.
Specifically, using compute equivalent to 256 independent samples, a trained scoring model for selection, and a reasoning limit of up to 64,000 tokens, Claude 3.7 Sonnet reached an overall score of 84.8% on GPQA (with the physics subset as high as 96.5%).
Notably, performance keeps improving even beyond the point where conventional majority voting plateaus.
The figure below details the results of the scoring-model method versus the majority-voting method.
These methods improve the quality of Claude's answers, often without making users wait longer for it to finish reasoning: Claude can explore more problem-solving approaches in parallel, significantly raising the rate of correct answers.
Three-step roadmap: the Claude collaborator arrives

Claude 3.7 Sonnet and Claude Code mark an important step toward AI systems that truly enhance human capabilities.
With the ability to reason deeply, work autonomously, and collaborate effectively, they bring us closer to a future where AI enriches what humans can achieve.
Now, Claude collaborators are here.
The latest version can be used for free
It is worth mentioning that Claude 3.7 Sonnet is currently online on the Claude.ai platform, and can be experienced for free by web, iOS and Android users.
For developers who want to build custom AI solutions, it can be accessed through the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI.
In both standard and extended thinking modes, Claude 3.7 Sonnet costs the same as its predecessor: $3 per million input tokens and $15 per million output tokens, with thinking tokens billed as output tokens.
Anthropic package pricing
AI heavyweights put it to the test

Ethan Mollick, a professor at the Wharton School of the University of Pennsylvania, has been testing Claude 3.7 over the past few days.

Claude 3.7 often gives him the same feeling he had when he first used GPT-4: amazed, and a little uneasy, about its abilities. Take Claude's native coding capability as an example: runnable programs can now be obtained through natural conversation or documents, without any programming skill.
For example, he gave Claude a proposal for a new AI education tool and asked it to "display the proposed system architecture in 3D and make it interactive". It generated an error-free interactive visualization of the core design in his paper.
The graphics, while simple, are not the most impressive part. What is really striking is that Claude decided to turn the result into a step-by-step walkthrough to explain the concept, something he never asked it to do.
This anticipation of needs and exploration of new approaches is a fresh breakthrough for the field.
For another, more entertaining example, Ethan Mollick told Claude: "Make me an interactive time machine so I can travel back and forth and something interesting happens. Pick an unusual time for me to go back to…" followed by "add more images."
After just these two prompts, a fully functional interactive experience emerged, complete with rough but charming pixel images (the images are actually surprisingly impressive: the AI has to "draw" them in pure code without being able to see what it is creating, like a blindfolded artist).