In just two years, AI models have gone from solving basic computer-science exercises to competing with human experts in international programming contests. OpenAI's o1, for example, entered the 2024 International Olympiad in Informatics (IOI) under the same conditions as the human contestants and reached gold-medal level, demonstrating strong programming potential.
At the same time, the pace of iteration is accelerating. On SWE-bench Verified, a code-generation benchmark, GPT-4o scored 33% in August 2024, while the new-generation o3 model has more than doubled that score to 72%.
To better measure AI models' software engineering ability in the real world, OpenAI today open-sourced a new evaluation benchmark, SWE-Lancer, which for the first time links model performance to monetary value.
SWE-Lancer is a benchmark of more than 1,400 freelance software engineering tasks drawn from the Upwork platform, with real-world payouts totaling roughly $1 million. How much of that money can AI actually earn by programming?
The "features" of the new benchmarkSWE-Lancer benchmark task Prices reflect the real market value situation. The more difficult the task, the higher the reward.
The benchmark includes both independent engineering tasks and management tasks, in which the model must choose between competing technical implementation proposals. It is therefore aimed not only at programmers but at the entire development team, including architects and managers.
Compared with previous software engineering benchmarks, SWE-Lancer offers several advantages:
1. All 1,488 tasks carry the actual payout an employer paid to a freelance engineer, providing a natural, market-determined difficulty gradient, with rewards ranging from $250 to a considerable $32,000.
Some 35% of the tasks are worth more than $1,000, and 34% are worth between $500 and $1,000. The Individual Contributor (IC) Software Engineering (SWE) task group contains 764 tasks with a total value of $414,775; the SWE management task group contains 724 tasks with a total value of $585,225. Together, that is 1,488 tasks and exactly $1,000,000 in payouts.
2. Large-scale real-world software engineering requires not only writing code but also capable technical management. The benchmark uses real-world data to evaluate how well a model can act in the role of an SWE technical lead, picking among freelancers' competing implementation proposals (see the sketch after this list).
3. Full-stack engineering evaluation. SWE-Lancer reflects real-world software engineering, since its tasks come from a platform with millions of real users.
Tasks span mobile and web development, interaction with APIs, browsers, and external applications, and the reproduction and verification of complex bugs.
For example, one $250 task improves reliability (fixing a double-triggered API call), a $1,000 task fixes a bug (resolving permission discrepancies), and a $16,000 task implements a new feature (adding in-app video playback support across web, iOS, Android, and desktop).
4. Domain diversity. Some 74% of IC SWE tasks and 76% of SWE management tasks involve application logic, while 17% and 18%, respectively, involve UI/UX development.
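As a rough sketch of how a management task might be graded, consider the following hypothetical code (the data format, field names, and grading function are assumptions for illustration, not SWE-Lancer's actual harness): the model reads the issue and the freelancers' competing proposals, picks one, and is scored against the proposal that was actually chosen in the real project.

```python
from dataclasses import dataclass

@dataclass
class ManagerTask:
    issue: str            # the original problem description
    proposals: list[str]  # competing implementation proposals from freelancers
    chosen_index: int     # the proposal the real engineering manager selected
    payout: float         # the task's real-world price

def grade_manager_task(task: ManagerTask, model_pick: int) -> float:
    """Award the full payout only if the model selects the same
    proposal the human manager chose (all-or-nothing grading)."""
    return task.payout if model_pick == task.chosen_index else 0.0

# Hypothetical example: the model sees the issue plus proposals
# and must return the index of the one it judges best.
task = ManagerTask(
    issue="Video playback freezes on Android after backgrounding",
    proposals=["Restart the player on resume", "Hook lifecycle events ..."],
    chosen_index=1,
    payout=1_000.0,
)
print(grade_manager_task(task, model_pick=1))  # -> 1000.0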
In terms of difficulty, the selected tasks are genuinely challenging: tasks in the open-source dataset took an average of 26 days to resolve on GitHub.
In addition, to keep data collection unbiased, OpenAI sampled representative tasks from Upwork and hired 100 professional software engineers to write and validate end-to-end tests for every task; a model's solution counts only if those tests pass.
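A minimal sketch of what such all-or-nothing, end-to-end grading might look like (the repository layout, test command, and function below are assumptions for illustration, not the benchmark's actual harness):

```python
import subprocess

def grade_ic_swe_task(repo_dir: str, patch_file: str, payout: float) -> float:
    """Apply a model-generated patch to a clean checkout, run the
    engineer-written end-to-end tests, and award the payout only if
    every test passes. Hypothetical harness, for illustration only."""
    # Apply the model's proposed fix to the repository.
    apply = subprocess.run(
        ["git", "apply", patch_file],
        cwd=repo_dir, capture_output=True, text=True,
    )
    if apply.returncode != 0:
        return 0.0  # the patch does not even apply cleanly

    # Run the task's end-to-end test suite (e.g. a Playwright suite).
    tests = subprocess.run(
        ["npx", "playwright", "test"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    # No partial credit: earnings are granted only on a full pass.
    return payout if tests.returncode == 0 else 0.0
```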
AI coding: a money-making showdown

Although many tech moguls keep claiming in their publicity that AI models can replace "junior" engineers, whether companies can fully replace human software engineers with LLMs remains very much in question.
The first batch of evaluation results shows that on the full SWE-Lancer dataset, today's "gold medalist" AI models all fall well short of the potential $1 million in total compensation.
Overall, every model performs better on SWE management tasks than on IC SWE tasks, and IC SWE tasks remain largely unconquered. The best-performing model tested is Claude 3.5 Sonnet, developed by OpenAI's competitor Anthropic.
On IC SWE tasks, every model's single-attempt pass rate and earn rate are below 30%; on SWE management tasks, the best-performing model, Claude 3.5 Sonnet, scores 45%.
Claude 3.5 Sonnet was strong on both task groups, beating the second-best model, o1, by 9.7 percentage points on IC SWE tasks and by 3.4 percentage points on SWE management tasks.
Converted into earnings, the best performer, Claude 3.5 Sonnet, takes in just over $400,000 on the full dataset.
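As a back-of-the-envelope illustration of the benchmark's two headline metrics, here is a minimal sketch (the task data below is invented; only the arithmetic in the closing comment comes from the benchmark's published figures):

```python
# Illustrative computation of the two metrics: pass rate (fraction of
# tasks solved in a single attempt) and earn rate (dollars earned as a
# share of the total payout pool). Task data is made up.
tasks = [
    {"payout": 250.0,    "solved": True},
    {"payout": 1_000.0,  "solved": False},
    {"payout": 16_000.0, "solved": False},
]

pass_rate = sum(t["solved"] for t in tasks) / len(tasks)
earnings = sum(t["payout"] for t in tasks if t["solved"])
earn_rate = earnings / sum(t["payout"] for t in tasks)

print(f"pass@1 = {pass_rate:.1%}, earned ${earnings:,.0f} ({earn_rate:.1%})")
# For the real benchmark pool, the totals check out: the IC SWE group
# ($414,775) plus the management group ($585,225) sum to exactly
# $1,000,000 across 764 + 724 = 1,488 tasks.
```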
It is worth noting that more inference-time compute helps considerably when it comes to "making money with AI".
On IC SWE tasks, the researchers ran the o1 model with its reasoning effort turned up: the extra inference-time compute raised the single-attempt pass rate from 9.3% to 16.5%, lifted earnings correspondingly from $16,000 to $29,000, and raised the earn rate from 6.8% to 12.1%.
The researchers concluded that while the best model, Claude 3.5 Sonnet, solved 26.2% of IC SWE problems, most of its remaining solutions still contain errors, and substantial improvement is needed before reliable deployment. o1 comes next, then GPT-4o, and pass rates on management tasks are typically more than twice those on IC SWE tasks.
This also means that however loudly the claim that AI agents will replace human software engineers is hyped, companies should still think twice for now. AI models can solve some "junior-level" coding problems, but they cannot replace "junior" software engineers, because they do not understand why certain bugs exist and go on to make further, compounding errors.
The current evaluation framework does not yet support multimodal inputs. The researchers also have not evaluated "ROI", i.e., comparing the compensation a freelancer would be paid for completing a task against the cost of running the model via API; that will be the benchmark's next focus.
Be a "AI-enhanced" programmerJust now From the perspective of this, AI still has a long way to go to truly replace human programmers. After all, developing a software engineering project is not just as simple as generating code as required.
For example, programmers often face extremely complex, abstract, and vaguely specified customer requirements, which demand a deep understanding of technical principles, business logic, and system architecture. When optimizing a complex software architecture, a human programmer can weigh the system's future scalability, maintainability, and performance together, a kind of holistic analysis and judgment that AI may struggle to make.
In addition, programming is not just about implementing existing logic; it demands creativity and innovative thinking. Programmers must conceive new algorithms and design distinctive interfaces and interactions, and such genuinely novel ideas and solutions remain a weak spot for AI.
Programmers also need to communicate and collaborate with team members, clients, and other stakeholders: understand each party's needs and expectations, express their own views clearly, and deliver projects together. Moreover, human programmers continuously learn and adapt to change, quickly picking up new knowledge and skills and applying them to real projects, whereas an AI model needs rounds of training and testing to do the same.
The software development industry is also subject to legal and regulatory constraints, such as intellectual property, data protection, and software licensing. AI may find it difficult to fully understand and comply with these rules, creating legal risk or liability disputes.
In the long run, the threat that advancing AI poses to programming jobs is real, but in the short term, "AI-enhanced programmers" are the mainstream, and mastering the latest AI tools is becoming one of the core skills of an excellent programmer.