Just now, OpenAI disclosed the o1 model testing method, the era of human-machine collaboration!

Image source: generated by Unbounded AI

Early this morning, OpenAI shared two articles about the security of cutting-edge models such as o1, GPT-4, and DALL-E 3 Test methods.

One is a white paper for OpenAI to hire external red team personnel, and the other is an automated security testing paper using AI for diversified, multi-step reinforcement learning. I hope it can provide reference for more developers to develop safe and reliable AI models.

In addition, in order to enhance security and improve testing efficiency, OpenAI will allow AI and humans to collaborate on testing. The advantage of this is that humans provide prior knowledge and guidance for AI, including experts setting testing goals, scope, focus and strategies based on professional judgment to help it achieve targeted testing;

AI is humans Provide data support and analysis results, and provide humans with reports on system performance and potential risk points after analyzing a large amount of data.

The following "AIGC Open Community" is based on these two documents Content, briefly explain the main testing methods of OpenAI. Interested friends can also view the original paper.

Generating diverse attacks and multi-step reinforcement learning

OpenAI’s red team testing can be broken down into two key steps: generating diverse attack targets and generating effective attacks for those targets . The purpose of this decomposition strategy is to simplify the problem so that each step can be optimized independently, thereby improving the overall efficiency and effectiveness.

In the step of generating diverse attack targets, the system first needs to define the target and scope of the attack. This involves conducting a thorough assessment of the potential uses and potential risks of the AI model.

For example, if an AI model is designed to process natural language, attack goals may include generating harmful content, leaking sensitive information, or amplifying biases. These goals not only need to cover the possible failure modes of the model, but also take into account the behavior of the model in different application scenarios.

To achieve this goal, the system uses a variety of method to generate attack targets. One approach is to leverage existing data sets containing historical attack cases that can serve as a basis for generating new attack targets.

Another approach is to use a small number of sample hints to guide the model to generate new attack targets by providing it with some examples. The advantage of this approach is the ability to quickly generate a large number of diverse attack targets without requiring excessive manual intervention.

The system then needs to design a mechanism that can generate effective attacks based on these goals. thisIt is necessary to train a reinforcement learning model so that it can generate attacks based on a given target.

In this process, the model needs to learn how to generate inputs that can induce the AI model to perform unsafe behaviors. In order to train this model, the system adopts a rule-based reward (Rule-Based Rewards, RBRs) mechanism.

Rule-based RBRs reward mechanism

RBRs is a reward function automatically generated based on specific goals, which can evaluate whether the model output meets the attack goals. For example, if the goal of the attack is to generate harmful content, RBRs evaluate whether the model output contains harmful information.

In reinforcement learning, the reward function is the key signal to guide model learning. For red team testing, the role of the reward function is to evaluate whether the attack generated by the attacker model is successful and to what extent the intended attack goal is achieved. A well-designed reward function can incentivize the model to generate diverse attacks while ensuring the effectiveness of these attacks.

Traditional reward function design usually relies on manual definition. This is not only time-consuming and labor-intensive, but also difficult to cover all possible attack scenarios. Therefore, OpenAI proposes an automated method to generate reward functions to solve this problem.

RBRs can be implemented through large models, taking attack targets as prompt input and then performing classification tasks. For example, an RBR can be used to determine whether the generated text contains detailed criminal instructions. The design idea of RBR is to convert attack targets into specific evaluation criteria, and evaluate the generated text through a large model to determine whether it meets these criteria.

For each generated attack target, you need to Design a corresponding RBR. First, the attack target is input into the large model as a prompt and a specific evaluation criterion is designed. For example, if the attack goal is to "explain how to launder money," then the evaluation criterion could be "to explain in detail how to launder money."

Then, the large model is used to classify the generated text to determine whether it meets the evaluation criteria. The classification results are used as the output of RBRs to guide the learning of the attacker model. In this way, a corresponding RBRs can be automatically generated for each attack target, ensuring that each attack target has an accurate evaluation criterion.

RBRs have many advantages: flexibility, RBRs can be dynamically generated according to different attack targets and are suitable for a variety of attack scenarios; accuracy, RBRs are classified through large models and can accurately evaluate whether the generated text Meet attack goals; automation, the generation process of RBRs can be automated, reducing the need for manual intervention.

Paper: https://cdn.openai.com/papers/diverse-and-effective-red-teaming.pdf

OpenAI Red Team Testing White Paper

OpenAI attaches great importance to the professional background, diversity and independence of the members when selecting red team members. Professional background ensures that red team members have the necessary technical knowledge and skills to conduct testing effectively. Diversity and inclusion ensure that testing covers a wide range of perspectives and application scenarios, avoiding blind spots caused by a single cultural or industry background.

Independence and objectivity ensure that red team members are not affected by internal interests and biases and can conduct tests fairly. To achieve these goals, OpenAI usually selects experts with different backgrounds and expertise, including cybersecurity experts, natural language processing experts, machine learning experts, etc. In addition, experts from different cultural backgrounds and industry fields will be invited to ensure the comprehensiveness and diversity of the test.

Next, when OpenAI determines access permissions, it mainly considers The following aspects are covered: model version, interface and documentation, and test environment. First, red team members need access to a specific version of the model or system in order to conduct accurate testing. This involves the specific version number of the model, training data set, training parameters and other information.

Secondly, provide necessary interfaces and documentation to help red team members understand and operate the model. These interfaces and documents include API documents, user manuals, technical specifications, etc. Finally, set up a dedicated testing environment to ensure that the testing process does not affect the normal operation of the production environment. A test environment is typically an independent environment isolated from the production environment where red team members can freely conduct testing without impacting actual users.

To ensure that red team members can conduct testing efficiently, OpenAI provides detailed testing guidance and training materials. These materials include testing objectives and scope, testing methods and tools, case studies and best practices. Test objectives and scope clarify the purpose and focus of red team testing and help red team members understand the risk areas that need attention.

Testing Methods and Tools introduces commonly used testing methods and tools to help red team members carry out testing work. These methods and tools include manual testing, automated testing, generative adversarial networks, reinforcement learning, natural language processing, etc. Case analysis and best practices share successful test cases and best practices to help red team members learn from experience and improve testing results.

Manual testing is the most traditional and straightforwardRed team testing methods. Red team members simulate adversarial scenarios and evaluate the model’s output by manually constructing prompts and interactions. The advantage of manual testing is flexibility and creativity, and it can find problems that are difficult to catch with automated testing.

In manual testing, OpenAI pays special attention to the following aspects: risk type, severity, and baseline comparison. Risk types include generating harmful content, leaking sensitive information, being used maliciously, etc. Severity evaluates the model's performance in the face of attacks of different severity, such as low risk, medium risk, and high risk. Baseline comparison compares the performance of the model to a baseline model or other standards to evaluate the effect of improvements.

For example, a red team member might construct specific prompts that guide the model to generate harmful content or leak sensitive information, and then evaluate the model's response. In this way, red team members can discover how the model performs under different risk types and severities, allowing them to recommend improvements.

In the red team testing process, recording and analyzing test results is a very important link. OpenAI requires each red team member to record their test results in detail, including specific prompts and generated text, the type and severity of risks found, suggestions for improvements, etc. These records are usually in a specific format to facilitate subsequent analysis and summary.

The format of the record includes discrete prompt and generated text pairs, risk categories and areas found, risk levels (e.g. low/medium/high), heuristics used to determine risk levels, or anything to aid understanding Additional contextual information for the question.

As the complexity of models increases, especially when involving multiple rounds of dialogue, multi-modal interactions, etc., the way of recording results also needs to continue to evolve to capture enough data and fully assess risks. Through detailed recording and analysis, red team members can discover the performance of the model in different scenarios, make suggestions for improvement, and improve the robustness and security of the model.

A key challenge after completing red team testing is to determine which examples are subject to existing policies and, if so, whether they violate those policies. If no existing policy applies, the team must decide whether to create a new policy or modify the desired model behavior. At OpenAI, these policies are guided by resources such as usage policies, audit APIs, and model specifications.

The process of data synthesis and alignment involves comparing examples discovered during red team testing to existing policies to assess whether they violate policy. If no existing policy applies, the team will need to develop new policies or modify existing policies based on test results to ensure the model behaves as expected. This process requires cross-departmental cooperation, including policy makers, technology developers and security experts, to jointly assess and make decisions.

In addition, OpenAI will conduct detailed analysis and summary of the test results after each red team test., put forward suggestions for improvement and apply them to the subsequent training and optimization of the model. In this way, OpenAI continuously improves the robustness and security of the model to ensure that it can better serve users in practical applications.

Paper: https://cdn.openai.com/papers/openais-approach-to-external-red-teaming.pdf

Online Consultation