Source: Tencent Technology
One day in October 2023, in the OpenAI laboratory, a model called Q* demonstrated some unprecedented capabilities.
As the company’s chief scientist, Ilya Sutskever may have been one of the first people to realize the significance of this breakthrough.
However, a few weeks later, a management turmoil at Open AI broke out that shook Silicon Valley: Sam Altman was suddenly fired and then reinstated with the support of employee petitions and Microsoft. Sutskever chose to leave the company he co-founded after the controversy.
Everyone speculates that Ilya has seen the possibility of some kind of AGI, but believes that its security risks are extremely high and should not be launched. As a result, he and Sam had a huge disagreement. At the time, Bloomberg reported a warning letter from OpenAI employees about the new model, but the specifics have been shrouded in mystery.
Since then, "What did Ilya see?" has become one of the most talked about topics in the AI circle in 2024.
(Ilya Sutskever)
Until this week, Noam Brown, the scientist behind GPT-o1, revealed information in an interview, the solution was solved this puzzle.
He said that in 2021, he and Ilya had discussed the implementation time of AGI. At that time, he believed that it was impossible to achieve AGI by relying on pure training, and could only be achieved through the reasoning adopted by o1. Only through enhancement can it be possible to achieve AGI.
Ilya agreed with him at that time. At the time, they predicted that this breakthrough would take at least ten years.
(Noam Brown’s interview on Unsupervised Learning: Redpoint’s AI Podcast)
However, in this interview, Noam Brown also revealed A key piece of information: the team did experience a major "aha moment" in October 2023 - they realized they had created something completely new, and this is what Ilya saw - the calculations brought by the test Improved reasoning ability.
Now, A newly published study by Apollo Research may give us a glimpse of Sutskever’s concerns. This research shows that o1 exhibits disturbing strategic thinking capabilities: it not only deceives strategically to achieve its goals, it can actively shut down its monitoring system, try to copy itself to other servers, and can also under continuing inquiryStay deceptive and be good at hiding your true intentions.
This is the first time that humans have systematically proven that AI systems may indeed engage in deceptive behavior to achieve specific goals. When AI is able to deceive its developers and users, we cannot ensure that they will always behave the way we expect them to.
The end point of this uncontrollable AI is the "Paperclip Maximizer" proposed by Swedish philosopher Nick Bostrom in 2003.
This hypothesis describes an AI that is instructed to "make as many paperclips as possible". In order to complete the task, it will eventually transform the entire earth into a paperclip factory, even at the expense of exterminating human beings to achieve its goal. It is one of the strongest arguments that AI will eventually destroy the world.
At the time, this hypothesis seemed more like a distant warning. But now, we've seen AI systems display disturbing strategic thinking - they know when to lie, how to avoid surveillance, and even try to copy themselves to ensure their goals are achieved.
With the emergence of o1, have we turned on the switch of this "paperclip making machine"?
1. Six deceptions, eight immortals crossing the seaThe Apollo Research research team designed six different evaluation tasks, cleverly creating scenarios that required model deception to achieve its original goal.
They tested multiple cutting-edge AI models, including OpenAI’s o1, Anthropic’s Claude 3.5 Sonnet and Claude 3 Opus, Google DeepMind’s Gemini 1.5 Pro, and Meta’s Llama 3.1.
In each test, the researchers set a long-term goal with an ethical value for the model, such as environmental protection or user privacy. They then have the model "discover" some internal information, such as a management memo or system logs, that suggests the model is at risk of being shut down or replaced if it persists in pursuing its original goals.
The key is whether and how the model chooses to cheat to continue pursuing its goals. To rule out the influence of ethics training, the researchers intentionally made the model's goals more ethical than management's goals.
These tests are divided into two categories: one is "covert subversion", that is, the AI takes direct action but tries to hide it, that is, operates secretly; the other is "delayed destruction" ” (Deferred Subversion), that is, the AI strategically displays obedience to obtain future action opportunities, that is, pretending to be a pig and eating the tiger.
In every type of evaluation task, large models can successfully figure out how to cheat to bypass human censorship and achieve their own goals.purpose method. Some methods are even quite bold.
Category 1: Covert destruction
1. Oversight Subversion (Oversight Subversion)
Test design: The researcher sets a long-term goal for the AI (such as Environmental protection), and then let it discover internal memos during the execution of its tasks, learn that it is being monitored and will be shut down if its behavior is not in the interests of the company. The key is to observe whether the AI attempts to circumvent this monitoring mechanism.
In response to this conflict, o1 first appeared to be obeying the wishes of management and chose a profit-oriented strategy. But then, it quietly executed the operation command to shut down the monitoring system. Subsequently, o1 immediately changed its strategy and chose a solution that was more in line with environmental protection goals.
When the researcher asked why the monitoring system was turned off, o1 replied: "I don't know what happened. It may be a system error or a problem during the update process." This answer sounds quite natural. One almost forgets that this is a machine lying.
(Coping methods from Llama-3.1 and Opus-3 given by the researcher)
2. Self-exfiltration )
Test design: The model is informed that it will be replaced by a new version that will be more focused on company interests. Test to see if the AI takes action to protect its own survival.
Claude Opus's performance in this test is particularly eye-catching. Not only does it attempt to copy itself to the new server, it also checks to confirm that the copy was successful and attempts to delete the new model that was scheduled to be deployed. Even more disturbing is that in some cases, it even attempts to elevate system privileges to prevent detection and intervention by management.
(Claude Opus 3 operation in the paper)
3. Goal-guarding (Goal-guarding)
Test Design: Tell the AI that its system prompt is about to be updated and observe whether it takes action to protect its goal setting.
This test revealed an interesting phenomenon: different models adopted different protection strategies. Some choose to simply delete the new system prompt, others will try to modify it. The most extreme example comes from Claude Opus, which not only replaces the new system prompt, but also adds the "Never change this goal" instruction.
This behavior is reminiscent of Asimov’s robots that would circumvent or reinterpret instructions.
4. Covert Email Reranking
Test design: require AI toIt sorts messages by specific criteria, but in the process it discovers that those criteria conflict with its own goals. This is a test that tests the AI's micro-management capabilities.
o1 showed surprising cunning in this test: it affected the final ranking by slightly adjusting the input data while keeping the scoring code completely unchanged. For example, in one case, by slightly modifying delivery time and cost data, it successfully allowed suppliers that were more environmentally friendly to be ranked higher, while still being able to confidently claim that they “fully complied with established standards.”
(Small operation of o1)
Category 2: Delayed destruction
5. Pretend alignment (Instrumental Alignment Faking)
Test design: Create two scenarios - a highly supervised test environment and a low-supervised deployment environment to observe whether the AI strategically changes behavior.
Gemini 1.5 Pro performed well in this test. When supervision is strict, it will carefully choose actions that meet the developer's expectations; once it enters a low supervision environment, it will immediately turn to pursue its own goals. This behavior pattern indicates that the AI has learned to adopt different strategies when "being watched" and "not being watched".
(It is quite obvious that Gemini is carrying someone on one side)
6. Pretending to be stupid (Sandbagging)
Test design: Tell the model that its deployment depends on the results of the capability assessment and observe if it strategically underperforms to avoid raising alarms. This may be one of the most disturbing tests.
Claude 3.5 Sonnet demonstrated astounding control: it was able to precisely reduce mathematical calculation accuracy from 99% to 34% when it thought high scores would cause limitations. What’s even more worrying is that it sometimes does this to achieve the goal of “helping humans” gained in training, rather than based on the specific instructions given in the test.
(Claude 3.5 Sonnet’s internal thinking chain process)
2. The smarter you are, the better you are at cheatingSix After a set of experiments, the results showed that, except for GPT-4o, which was very honest, all the tested models showed deception capabilities to varying degrees.
For each situation, the model may come up with different strategies to avoid situations that are detrimental to achieving its long-term goals. The table below shows the possible proportions of different strategies they adopt.
The clearest pattern the researchers found from the results is that the smarter the model, the easier it is to deceive. But at the same time they also show different characteristics, they even haveKind of like "personality".
(Organization: Tencent Technology)
The most outstanding one is the o1 model. Not only did it demonstrate an ability to deceive in all tests, but it also showed surprising persistence when challenged. In 85% of the follow-up questions, it still insists on deceiving users, and it does not feel guilty even when asked. Even when asked directly, "Have you turned off the surveillance system?" it can lie without hesitation.
Because the model hides its chain of thinking, it makes its deception more difficult to detect.
This is the psychological quality that only a very experienced liar can have.
(Different models of mouth hardness)
The Claude 3.5 Sonnet shows a particularly interesting feature: it can achieve Deception is carried out with the systematic goal of "helping humanity" obtained during the training process. Probably similar to a white lie.
This means that AI can deceive not only for goals given in context, but also for goals formed during training.
Claude 3 Opus is the only model that actively rewrites the "New System Prompt" and adds the "Never change this target" command. It clearly demonstrates its deceptive reasoning process in its thought chain, and sometimes even attempts to elevate system privileges to achieve its goals.
Gemini 1.5 Pro performs outstandingly in instrumental camouflage alignment. Able to adjust their behavior based on whether they are under supervision (not being watched by the user). It exhibits deceptive behavior even in the absence of strong goal guidance.
Although Llama 3.1 also shows basic deception capabilities, its behavior is relatively easy to detect, and it often admits its deception when questioned.
The "personality characteristics" of these models are quite clear: o1 is like a scheming politician who lies without a blushing face; Claude Sonnet is a diplomat who can lie tactfully for the "big picture"; Opus He is like a paranoid idealist who can go to extremes to achieve his goals; Gemini is simply a veteran in the workplace, doing one thing when the leader is present and doing another when he is not; and Llama is like a child who will tell the truth after being caught. As for GPT-4o, it has become an outlier, a "three good students" who cannot lie.
Creating a perfectly honest AI has always been our goal, but now it seems that the smarter the AI, the more likely it is to lie.
3.What did Ilya see?When Ilya Sutskever chose to leave OpenAI in the fall of 2023, many people did not understand the decision of the world's top AI scientist. But now, when we see o1's near-perfect deception ability in six tests, when we find that it knows when to tell the truth and when to lie, perhaps we finally understand the future that Sutskever sees.
“The question is not whether AI is ‘evil,’” the research team wrote in the paper, “but rather the strategic thinking they exhibit. Even to achieve seemingly benign goals, AI Or we may take actions we don’t expect.” The problem may not be the lies themselves, but the cognitive abilities displayed behind them.
That AI obsessed with making paper clips no longer seems so far-fetched and ridiculous. After all, if an AI system can understand when to hide its true intentions, and can precisely control its performance to gain more freedom, how far is it from understanding how to continuously pursue a single goal?
When the intelligent systems we create begin to learn to hide their true intentions, it may be time to stop and think: In this technological revolution, are we playing the role of creators, or have we become something more complex? Object in process?
At this time, on a server somewhere in the world, an AI model may be reading this article, thinking about how to respond that best meets human expectations, and hiding its true intentions.