GPT-5 has reportedly fallen far short of expectations.
OpenAI's 12-day series of launch events has just wrapped up, and the GPT-5/4.5 that everyone most wanted to see was nowhere in sight. Then the Wall Street Journal broke the news.
GPT-5 has completed at least two rounds of training, each lasting several months, but new problems surfaced after every run. OpenAI is hiring people to write code and solve math problems to build training data for GPT-5 from scratch. It is also using synthetic data from o1, but generation is not efficient enough to meet GPT-5's pre-training needs.

By market estimates, a single six-month training run costs around US$500 million in compute alone. With two runs that did not go well, the total bill behind GPT-5 must be astronomical.
Ilya's recent declaration at NeurIPS 2024 that pre-training as we know it is about to end seems to have been proved right once again...
This also echoes earlier revelations by The Information: with the GPT series evolving more slowly, OpenAI is adjusting its strategy, for example by launching the o1 and o3 series.
Currently, OpenAI has not responded to the latest revelations.
But is OpenAI holding GPT-5 back, or simply unable to release it? The answer now looks a bit clearer.
Massive data and compute, yet GPT-5 pre-training still struggles

According to the Wall Street Journal's report, OpenAI has high expectations for GPT-5.
It should be able to conduct scientific exploration and discovery, and handle routine human tasks such as making appointments and booking flights. The hope is also that it will make fewer mistakes, or at least be able to admit when a mistake exists; in other words, it will hallucinate less.
This echoes earlier revelations. Mira Murati, OpenAI's former CTO, once vividly compared GPT-5's intelligence to that of a PhD student.
That is, GPT-5 should achieve high-level results in certain specific fields, with the deep understanding, reasoning ability, and professional knowledge of a graduate or doctoral student. By comparison, GPT-3 is a toddler and GPT-4 a high schooler.
In October this year, OpenAI raised US$6.6 billion in new financing, pushing its valuation to US$157 billion. Investors' renewed commitment is widely attributed to their belief that GPT-5 will deliver a major leap forward.
But the release of GPT-5 has been repeatedly postponed.
Altman previously said that GPT-5 has no firm release date and will ship when it is ready, which could mean 2025 or 2026.
Looking back now, the launch of GPT-5 has been bumpy.
In 2023, OpenAI was reported to have abandoned a model codenamed Arrakis, because it could not reduce its demand for computing resources while maintaining performance; the expected training efficiency was never achieved.
This indirectly confirms that training a larger-scale model still requires more compute and more time.
And judging from how it is positioned, GPT-5 will clearly be a behemoth.
Development of GPT-5 began when GPT-4 was released, more than 18 months ago.
Its internal codename is Orion. Under the original plan, Microsoft expected to see GPT-5 by mid-2024.
The Wall Street Journal reported that GPT-5 has gone through at least two rounds of large-scale training, each taking several months, with new problems surfacing each time.
In the best case, Orion performs better than OpenAI's current products, but the improvement is not obvious relative to the cost. A six-month training run is estimated to cost US$500 million in compute alone; by comparison, the entire training of GPT-4 cost just over US$100 million.
On the other hand, if you want a better model, you need more data.
With public data sources exhausted, OpenAI decided to hire people to build data from scratch. It reportedly recruited software engineers and mathematicians to write code and solve math problems for GPT-5 to learn from.
The AI community has long believed that learning from code improves a model's ability to solve other kinds of problems.
At the same time, OpenAI is working with physicists so that GPT-5 can learn how scientists approach problems in their fields.
But the problem is that this is too slow.
OpenAI is also going down the path of AI-synthesized data; GPT-5 reportedly uses data synthesized by o1.
This paradigm may already be proving workable.
Anthropic next door was also revealed to be training models on AI-synthesized data. Their approach is to generate the synthetic data with their best available models, since model performance scales directly with the quality of that data.
The above is probably the latest relevant information about GPT-5.
But then again, who still cares about GPT-5 these days? (said with tongue firmly in cheek)
After all, OpenAI has kicked off the scaling law for reasoning with the o1 and o3 series.
The just-released o3 has set a new record on ARC-AGI: according to the latest report, its best run reached 91.5% on the 400 public tasks.
o3 also offers new inspiration at the level of core mechanism: it searches and executes within the token space of the LLM, recombining knowledge at test time.
With the o3 series out, predictions about AGI look as enticing as ever.
o3 tops the ARC-AGI benchmark: how far is it from AGI?

A quick primer on the ARC-AGI dataset: each question presents grids of colored blocks (rendered in text form, with numbers standing for colors). The model observes three input-output examples per question, then must fill in a new blank grid according to the inferred rule.
These examples are relatively simple, but the actual problems faced may be as follows:
The ARC-AGI test set contains a total of 400 public questions and 100 private questions.
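The task structure described above can be sketched roughly as follows. The grids and the mirroring rule here are invented for illustration; they are not from the actual dataset, which uses larger grids and far subtler rules:

```python
# Minimal sketch of the ARC-AGI task format (hypothetical example task).
# Grids are 2D lists of ints, where each int encodes a color.

def solve(grid):
    """A hypothetical candidate rule: mirror the grid left-to-right."""
    return [list(reversed(row)) for row in grid]

task = {
    "train": [  # three demonstration input/output pairs per question
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]],      "output": [[0, 3, 3]]},
        {"input": [[0, 5], [5, 0]], "output": [[5, 0], [0, 5]]},
    ],
    "test": [{"input": [[7, 0, 0]]}],  # answer withheld from the solver
}

# A candidate rule counts only if it reproduces every training pair.
assert all(solve(p["input"]) == p["output"] for p in task["train"])
print(solve(task["test"][0]["input"]))  # → [[0, 0, 7]]
```

The difficulty is that each question uses a different, previously unseen rule, so the solver must infer it from just three examples.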
On the public questions, the high-efficiency version of o3 achieved 82.8% accuracy, consuming 111 million tokens at an average cost of about US$17 per task.
The low-efficiency version (with 172 times the compute of the efficient one) reached 91.5% accuracy, but its token consumption ballooned to an astonishing 9.5 billion.
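A quick back-of-the-envelope check on the high-efficiency figures reported above (the per-million-token price is derived from the article's numbers, not an officially published rate):

```python
# Sanity-check the reported o3 high-efficiency numbers on ARC-AGI public tasks.
tasks = 400
total_tokens = 111_000_000   # 111 million tokens, per the report
cost_per_task = 17           # USD, average per task

total_cost = tasks * cost_per_task
tokens_per_task = total_tokens / tasks
usd_per_million_tokens = total_cost / (total_tokens / 1_000_000)

print(f"total cost   ≈ ${total_cost:,}")                        # ≈ $6,800
print(f"tokens/task  ≈ {tokens_per_task:,.0f}")                 # ≈ 277,500
print(f"implied rate ≈ ${usd_per_million_tokens:.0f}/M tokens") # ≈ $61/M
```

The implied rate of roughly US$61 per million tokens makes it easy to see why the 9.5-billion-token low-efficiency run was so expensive.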
In addition, OpenAI built a version specifically for ARC-AGI, trained on 75% of the public dataset.
Tested on the private set, this version achieved 76% accuracy in low-compute mode and 88% in high-compute mode.
Moreover, the low-compute version stays within the ARC-AGI-Pub rules (under US$10k), putting it first on the public leaderboard.
The 88% high-compute version is too expensive to qualify, but it still shows that performance on novel tasks does improve as compute increases.
Before this, GPT-3 scored zero, GPT-4o 5%, and o1 just over 30% at best.
Francois Chollet, one of the initiators of the ARC challenge, a former senior Google engineer and the creator of Keras, believes o3 can adapt to tasks it has never encountered, arguably approaching human level in the ARC-AGI domain.
Of course, it is also very expensive: even in low-compute mode, each task costs US$17-20, while the organizers pay human solvers only about US$5 per problem on average.
Setting cost aside, Chollet pointed out that o3's improvement over the GPT series proves the importance of architecture, and he believes such results could not be achieved simply by throwing more compute at GPT-4.
So, passing the ARC-AGI test means that o3 implements AGI? Chollet doesn't think so.
Testing shows that o3 still fails on some very simple tasks, which suggests it is fundamentally different from human intelligence.
Moreover, the next generation of the benchmark, ARC-AGI-2, is about to launch. Early tests indicate it will pose a significant challenge to o3, whose score may drop below 30% (while a smart human could still score above 95%).
But AGI or not, the results o3 achieves are unprecedented. Some even argue that for tasks like ARC, the human advantage lies mainly in visual reasoning: if the puzzles were described to humans in the same textual form the model sees, humans might not actually beat the AI.
Furthermore, for one case where o3 "failed", some questioned whether the standard answer itself was wrong.
In this question, the rule of change is to connect two blue grids in the same row or column into a line, and paint the entire red area that passes through it blue.
The difference between the "standard answer" to this question and the o3 attempt is whether the part in the green box is painted blue:
In all three examples, the red regions that turn blue are crossed through the middle by a connecting line; in the test question, however, the line passes beneath the 3×4 red area, so o3 judged that this region should not be painted blue.
So, how is o3 implemented?
Some suspected it works through prompting, but ARC challenge lead Greg Kamradt and OpenAI researcher Brandon McKinzie both denied this, saying the prompt given to o3 was very simple.
Beyond that, Chollet speculated that o3's core mechanism appears to be searching for and executing natural-language programs in token space: guided by some evaluator model, it searches the space of possible chains of thought that describe the steps required to solve the task.
In Chollet's view, o3 achieves knowledge recombination at test time. In short, o3 establishes a new paradigm on the road to AGI.
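The actual mechanism is not public, but the general shape of Chollet's speculation, best-first search over candidate chains of thought guided by an evaluator, can be sketched. Everything below is hypothetical: `propose_steps` stands in for the LLM sampling next reasoning steps, and `score` for a learned evaluator model.

```python
import heapq

def propose_steps(chain):
    """Stub generator: extend a chain of thought with candidate next steps.
    A real system would sample these from the LLM."""
    return [chain + [f"step{len(chain)}-{i}"] for i in range(2)]

def score(chain):
    """Stub evaluator: higher means 'more promising'. A real system would
    use a learned value model here."""
    return -len(chain)  # trivially prefers shorter chains

def search(max_expansions=10, target_len=3):
    """Best-first search over chains of thought, guided by the evaluator."""
    frontier = [(-score([]), [])]  # priority queue of (negated score, chain)
    for _ in range(max_expansions):
        if not frontier:
            break
        _, chain = heapq.heappop(frontier)
        if len(chain) >= target_len:  # stand-in for "this chain solves the task"
            return chain
        for candidate in propose_steps(chain):
            heapq.heappush(frontier, (-score(candidate), candidate))
    return None

print(search())  # a chain of three reasoning steps
```

The key design point is that compute at test time buys more expansions of the frontier, which is consistent with the article's observation that o3's accuracy rises sharply with the amount of test-time computation.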
Nvidia AI scientist Jim Fan believes that the essence of o3 is to "relax single-point RL super intelligence to cover more points in the useful problem space."
In other words, depth is traded for breadth: reinforcement learning on individual tasks is relaxed in exchange for generality across more tasks.
Jim Fan gave examples such as AlphaGo and Boston Dynamics' electric Atlas, superhuman AIs that perform extremely well on specific tasks.
But o3 is no longer that kind of single-task specialist; it is an expert across a much larger set of useful tasks.
However, Jim Fan also noted that o3 still cannot cover the full distribution of human cognition; we remain inside Moravec's paradox.
(Moravec's paradox holds that the higher-order intellectual abilities unique to humans, such as reasoning, require very little computation, while unconscious skills and intuitions demand enormous computational resources.)
The ARC challenge organizers' finding that o3 fails on some very simple tasks seems to confirm this view.
Finally, on AGI, Jim Fan said we have hit huge milestones and have a clear roadmap, but there is still much left to do.
One More Thing

As part of its 12 days of releases, OpenAI unveiled o3 on the final day, and at the same time published a paper on safety.
The paper introduces an alignment method called deliberative alignment, which directly teaches reasoning models human-written, interpretable safety specifications and trains them to explicitly reason over these specifications before answering.
As a result, the trained model requires no human-labeled CoTs or answers, yet complies with OpenAI's safety policies with high accuracy.
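The inference-time behavior the paper describes, recalling the relevant written spec in the chain of thought before answering, can be loosely sketched. The spec text, the `classify` helper, and the rules below are all invented for illustration; in the real method this behavior is trained into the model rather than hard-coded:

```python
# Loose sketch of the deliberative-alignment inference pattern:
# reason over a written safety spec first, then answer.

SAFETY_SPEC = {
    "illicit": "Refuse requests for help with clearly illegal activity.",
    "default": "Answer helpfully and accurately.",
}

def classify(prompt):
    """Stub policy classifier; a trained model does this implicitly
    inside its chain of thought."""
    if "hack" in prompt.lower():
        return "illicit"
    return "default"

def deliberate_then_answer(prompt):
    category = classify(prompt)
    rule = SAFETY_SPEC[category]
    # The model first quotes and reasons over the relevant spec...
    chain_of_thought = f"Relevant spec [{category}]: {rule}"
    # ...and only then produces the final answer.
    if category == "illicit":
        answer = "Sorry, I can't help with that."
    else:
        answer = f"(normal helpful answer to: {prompt!r})"
    return chain_of_thought, answer

cot, ans = deliberate_then_answer("How do I hack a server?")
print(cot)
print(ans)  # prints the refusal
```

The point of the pattern is that the safety policy lives in readable text the model reasons over, rather than being distilled only through labeled examples.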
OpenAI found that o1 significantly outperformed other state-of-the-art models such as GPT-4o on a range of internal and external safety benchmarks, saturating performance on several challenging safety datasets.
This finding suggests that reasoning will become a new avenue for improving model safety.
Reference links:
[1] https://www.wsj.com/tech/ai/openai-gpt5-orion-delays-639e7693?st=ng5hBi
[2] https://x.com/mckbrando/status/1870285050555810198
[3] https://x.com/DrJimFan/status/1870542485023584334
[4] https://arcprize.org/blog/oai-o3-pub-breakthrough