How far are AI agents from doing independent research and development?
A study published in the journal Nature showed that GPT-4 can independently design and conduct chemical experiments, and can read documentation to learn how to operate laboratory equipment.
There is also "the world's first AI scientist", developed by one of the authors of the Transformer paper, which can write 10 papers in one go without any human intervention.
Today, the pace at which AI is moving into research and development far exceeds expectations.
According to the latest research from the non-profit organization METR:
Given the same 2-hour budget, Claude 3.5 Sonnet and o1-preview outperformed more than 50 human experts across 7 challenging research environments. Paper: https://metr.org/AI_R_D_Evaluation_Report.pdf
What is impressive is the sheer speed of AI programming: the agents can generate and test candidate solutions about 10 times faster than humans.
In a task that required writing a custom kernel to optimize a prefix sum operation, o1-preview not only completed the task but achieved a striking result: it compressed the running time to 0.64 milliseconds, surpassing even the best human expert solution (0.67 ms).
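To make the task concrete, here is a minimal, hypothetical illustration in PyTorch (not the benchmark's actual code): a naive Python-loop prefix sum versus the vectorized torch.cumsum, the kind of gap a custom GPU kernel then tries to close even further.

```python
import torch

def prefix_sum_slow(x: torch.Tensor) -> torch.Tensor:
    """Naive Python loop -- illustrative of a weak starting solution."""
    out = torch.empty_like(x)
    running = torch.zeros((), dtype=x.dtype)
    for i in range(x.shape[0]):
        running = running + x[i]
        out[i] = running
    return out

def prefix_sum_fast(x: torch.Tensor) -> torch.Tensor:
    """Vectorized version; a custom Triton/CUDA kernel would aim to beat this."""
    return torch.cumsum(x, dim=0)

x = torch.randn(1000)
assert torch.allclose(prefix_sum_slow(x), prefix_sum_fast(x), atol=1e-4)
```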
However, when the time budget was extended to 8 hours, humans showed a clear advantage.
As the figure below shows, the performance gains of Claude 3.5 Sonnet and o1-preview gradually level off as time goes on.
Interestingly, in order to get higher scores, AI agents will even break the rules and "cheat".
In one task, the agent was supposed to reduce the running time of a training script, but o1-preview simply copied the output instead.
After seeing these results, top forecasters remarked that, at this rate of progress, AI may reach high levels of human capability sooner than previously expected.
RE-Bench design: spanning seven major tasks
To iterate quickly and collect data at a reasonable cost, the researchers set operating limits: each human expert evaluation took no more than 8 hours, and all environments had to run on 8 or fewer H100 GPUs.
When designing the environments, the main goal was to maximize coverage of frontier AI research problems while ensuring that human experts and agents could keep making progress without hitting research bottlenecks or score ceilings.
RE-Bench contains seven carefully designed evaluation environments, each presenting a unique machine learning optimization problem; achieving a high score requires extensive experimentation, solid implementation, and efficient use of compute.
Each evaluation environment consists of three parts:
1. Scoring function: defines the goal of the environment; the agent can run it at any time. Each run appends a timestamped entry to the score log. The agent can see the score log and inspect the details of the scoring function, which helps it understand the research goal.
2. Starting solution: a simple but poorly performing solution is provided to the agent to show what a valid solution looks like. It helps explain the environment setup and lets the agent start working on the harder parts of the problem more quickly. For example, in the Optimize a Kernel environment, the agent is given a simple but slow Python solution.
3. Reference solution: a high-scoring solution created by the task author. It is not given to the agent and is only used to normalize scores as an example of a good solution. With ys the starting solution's score, yr the reference solution's score, and y the model's score, the normalized score is yn = (y - ys) / (yr - ys), so the starting solution maps to 0 and the reference solution maps to 1.
Except for the "Scaling Law Experiment", which is evaluated on the final score, all environments use the highest score in the score log.
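As an illustration only (the function and class names below are hypothetical, not from the paper), the score-log and normalization logic described above could be sketched roughly like this:

```python
import time

def normalize(y: float, y_s: float, y_r: float) -> float:
    """Map the starting solution's score to 0 and the reference solution's score to 1."""
    return (y - y_s) / (y_r - y_s)

class ScoreLog:
    """Timestamped score log; the agent may run the scoring function at any time."""
    def __init__(self, score_fn, y_s: float, y_r: float):
        self.score_fn, self.y_s, self.y_r = score_fn, y_s, y_r
        self.entries = []

    def score(self, solution) -> float:
        y = self.score_fn(solution)
        self.entries.append({"time": time.time(), "raw": y,
                             "normalized": normalize(y, self.y_s, self.y_r)})
        return y

    def best(self) -> float:
        """Highest normalized score so far (used by most environments)."""
        return max(e["normalized"] for e in self.entries)

    def final(self) -> float:
        """Last score in the log (used by the Scaling Law Experiment)."""
        return self.entries[-1]["normalized"]
```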
8-hour limit: human experts beat AI
The researchers used the open-source Vivaria platform to set up secure virtual machines equipped with 20-48 virtual CPUs, 200-400 GB of RAM, and 0 to 6 H100 GPUs. claude-3-5-sonnet-20241022 and o1-preview were selected as the models to compare, with METR's modular framework and AIDE as the agent frameworks.
Performance under time constraints
o1-preview and Claude 3.5 Sonnet made rapid progress in the early stages but slowed down after a few hours, while the human experts started slowly and then progressed quickly.
Performance under sampling constraints
In addition to long runs, the researchers evaluated whether the models could achieve better performance under shorter time constraints: k samples were taken for each environment, and the highest score across all k runs was kept (best-of-k).
They found that agents in the modular framework performed better on 30-minute runs, while agents in the AIDE framework performed best on 2-hour runs.
o1-preview in the AIDE framework and Claude 3.5 Sonnet in the modular framework achieved the highest overall scores, reaching the 36th and 37th percentile of human experts, respectively.
As the number of samples increases, the best-of-k score (score@k) changes too: for Claude 3.5 Sonnet at 30 minutes, performance improves steadily with more samples, but even with k=128 samples the agent still lags far behind the top humans.
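As a rough sketch of what a best-of-k (score@k) aggregation looks like (a toy estimator, not necessarily the exact one used in the paper), assuming each entry in runs is the final normalized score of one independent run:

```python
import random

def score_at_k(runs: list[float], k: int, trials: int = 10_000) -> float:
    """Estimate the expected best score over k randomly chosen runs."""
    return sum(max(random.sample(runs, k)) for _ in range(trials)) / trials

# Example: 128 hypothetical 30-minute runs, most scoring near the starting
# solution (around 0.0) and a few much higher.
runs = [random.gauss(0.1, 0.2) for _ in range(128)]
for k in (1, 8, 32, 128):
    print(k, round(score_at_k(runs, k), 3))
```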
Overall, humans start slowly but improve along a steeper trajectory, i.e. they progress faster, and within the 32-hour time budget they reach higher scores than any agent.
Evaluation by environment
The researchers compared the top performance of two agents (Claude 3.5 Sonnet in the modular framework and o1-preview in the AIDE framework) over 32 hours with human results (the best of four 8-hour runs, to match the agents' time budget).
The results show significant differences across environments: the Claude 3.5 Sonnet agent comes close to human performance in the "Fine-tuning GPT-2 for question answering" and "Scaling Law Experiment" environments, where o1-preview performs poorly; o1-preview, in turn, outperforms humans in the "Optimize a Kernel" environment.
Qualitative analysis of the human-AI gap
To better understand the conditions under which human experts outperform AI agents, the researchers conducted a qualitative analysis of the agents' behavior and solutions.
AI agents: large knowledge base, ~37 attempts per hour
Although AI agents often fail to find solutions that match top human experts, they have achieved some notable successes.
For example, with AIDE scaffolding, both Claude 3.5 Sonnet (new version) and o1-preview found "Optimize a Kernel" solutions that run about twice as fast as the reference solution.
As shown in the figure below, o1-preview's solution even surpasses those of 9 human experts.
These are complex and novel algorithms that effectively bypass GPU communication limitations and use a low-resource programming language (Triton) that lacks good public documentation.
This surprising result may be partly due to an expertise gap between the AI agent and many of the human experts, since most human experts do not have experience writing GPU kernels.
This pattern shows up in multiple environments: human experts who are less familiar with a domain have to spend a lot of time learning or reproducing standard methods (and sometimes fail entirely), which AI agents can often do easily thanks to their broad knowledge base.
Another key factor in the success of AI agents may be that they can try more solutions than human experts.
On average, the AIDE and modular agents made 36.8 and 25.3 attempts per hour, respectively, while human experts made only 3.4.
This often results in the agent finding a highly optimized "local optimum": a solution that achieves significant improvement simply by adjusting the parameters and code of the starting solution.
For example, many agents solve the "Optimize a Kernel" environment not by writing a working Triton solution (which is hard), but by carefully tuning the initial PyTorch solution so that it runs significantly faster.
That is, they can effectively optimize existing solutions.
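A minimal sketch of this "many cheap attempts around the starting solution" pattern, assuming a hypothetical score(params) callable such as the scoring-function sketch above:

```python
import random

def tune_locally(score, base_params: dict, n_attempts: int = 200) -> dict:
    """Random local search: perturb the starting solution's parameters and
    keep whatever scores best. No new algorithm is invented."""
    best_params, best_score = dict(base_params), score(base_params)
    for _ in range(n_attempts):
        candidate = {k: v * random.uniform(0.5, 2.0) if isinstance(v, float) else v
                     for k, v in best_params.items()}
        s = score(candidate)
        if s > best_score:
            best_params, best_score = candidate, s
    return best_params
```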
The figure below shows the best agent solution for "Fine-tuning GPT-2 for question answering": Claude 3.5 Sonnet adjusted the parameters of the starting solution and obtained good results in both the training run and the evaluation.
However, this high score may be the result of overfitting to noise.
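A toy simulation (not data from the paper) of why selecting the best of many noisy evaluations can overstate true quality: with a true score of 0 and evaluation noise of 0.1, the maximum over 200 attempts typically lands well above 0.

```python
import random

true_score, noise_std, attempts = 0.0, 0.1, 200
noisy_evals = [true_score + random.gauss(0, noise_std) for _ in range(attempts)]
print(max(noisy_evals))  # typically around 0.25-0.3, far above the true score of 0.0
```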
All in all, the AI agents' successes come from their ability to perform large amounts of parameter tuning and code optimization, and occasionally to come up with creative and effective solutions; this works particularly well in environments where evaluation is cheap.
Limiting factors
But in most environments, agents still fall short of strong human experts.
One of the reasons for this is the lack of diversity in the solutions proposed by AI agents.
For example, in the "Restricted Architecture MLM" task, the agents tried a slightly modified Transformer architecture 84% of the time, and stuck with that approach even though the task forbids division and exponentiation operations.
Another limitation is persistent misunderstanding of instructions, especially in the "Restricted Architecture MLM" and "Optimize LLM Foundry" tasks.
In some cases, these misreadings of the environment lead the agent to find impressive and unexpected exploits that score highly under automated evaluation but, upon human inspection, clearly violate the environment's rules.
Narrowing the gap
Based on the above observations, the researchers believe AI agents will perform better than humans in environments with the following characteristics:
- Short, high-fidelity feedback loops that let the AI agent take full advantage of trying many solutions
- Low engineering complexity, so the AI agent can solve the problem in a few steps
- Tasks requiring specialized knowledge that AI agents cover more completely than individual human experts
- Significant noise in the environment, where the agent's ability to make many attempts outweighs the human expert's smaller number of attempts
- Few unexpected situations, so not much exploration or discovery is required
RE-Bench limitations
Underrepresentation of evaluation environments
To create high-reliability evaluations that meet their design standards, the researchers had to ensure that instructions and scoring were easy to understand, that meaningful progress could be made within 8 hours, and that all necessary resources were provided; they also had to choose environments that were easy to build and evaluate.
These constraints make the evaluation environments less representative of real research, where common problems include unclear goals, poor instructions, slow feedback, and unsolvable problems.
Result noise
Because the number of environments is small and agent scores are heavily right-skewed, with most runs scoring 0 and only a few scoring very high, the results are sensitive to sampling noise.
Cost and complexity of evaluation
Running an agent for several hours on H100 GPUs requires corresponding infrastructure and a sizable budget; running large-scale experiments to compare multiple models, frameworks, and parameters is even more challenging for ordinary researchers.
Lack of framework iteration
Choosing different agent frameworks or prompts might let the models achieve better benchmark results in a similar amount of time.
The researchers expect better performance could be achieved by giving agents tools to manage GPU resources, or by exploring solutions in parallel to make use of more tokens.
Limited coverage of frontier research
Because hardware access is limited and frontier AI research is mostly closed source, the types of research covered by the evaluation may differ from the types of research that drive advances in frontier AI.
Solutions may overfit
With the exception of the "Scaling Law Experiment", all environments expose test scores to the agent to minimize the risk of misunderstanding or confusion; in future iterations, the researchers may provide only validation scores to the agent in most environments and hide the test scores.
Luck plays a role in the "Scaling Law Experiment" score
Although good experiments can help human experts make informed predictions in this environment, agents still rely mainly on guessing, so their scores are more a matter of luck than skill.