Many users have likely already seen, or at least heard of, Deep Research's powerful capabilities.
Early this morning, OpenAI announced that Deep Research is now available to all ChatGPT Plus, Team, Edu, and Enterprise users (it was available only to Pro users when first released). At the same time, OpenAI also released the Deep Research system card.
In addition, OpenAI research scientist Noam Brown revealed on 𝕏 that the base model behind Deep Research is the full version of o3, not o3-mini.
Deep Research is a powerful agent that OpenAI launched earlier this month. It uses reasoning to integrate large amounts of online information and complete multi-step research tasks, helping users carry out deep, complex information lookup and analysis. See the Heart of Machines report "Just now, OpenAI launched Deep Research! It far outperforms DeepSeek R1 on Humanity's Last Exam".
In the roughly twenty days since the release, OpenAI has also made several upgrades to Deep Research.
The Deep Research system card that OpenAI has now released describes the safety work done before Deep Research's release, including external red teaming, risk evaluations under the Preparedness Framework, and the mitigations OpenAI put in place for key risk areas. Below we briefly summarize the main contents of the report.
Address: https://cdn.openai.com/deep-research-system-card.pdf
Deep Research is a new agentic capability for performing multi-step research on the internet for complex tasks. The Deep Research model is based on an earlier version of OpenAI o3 optimized for web browsing. Deep Research uses reasoning to search, interpret, and analyze large amounts of text, images, and PDFs on the internet, adjusting its approach as needed based on the information it encounters. It can also read user-provided files and analyze data by writing and executing Python code.
"We believe Deep Research can help people deal with a variety of situations," OpenAI said. "We conducted rigorous security testing before launching Deep Research and providing it to our Pro users., readiness assessment and governance review. We have also conducted additional security testing to better understand the incremental risks associated with Deep Research’s ability to browse the web and added new mitigations. Key areas of the new work include strengthening privacy protections for personal information posted online, and training models to protect against malicious instructions that may be encountered while searching the Internet. ”
OpenAI also noted that testing Deep Research revealed opportunities to further improve its testing methods, and that it will spend additional time on manual and automated testing of selected risks before expanding the scope of Deep Research's release.
This system card provides more detail on how OpenAI built Deep Research, how it understands the model's capabilities and risks, and how it improved the model's safety before release.
Model data and training
Deep Research's training data is a new browsing dataset created specifically for research use cases.
Through reinforcement learning on these browsing tasks, the model learned the core browsing functions (search, click, scroll, interpret files), how to use a Python tool in a sandboxed environment (for performing calculations, analyzing data, and plotting charts), and how to reason over and synthesize a large number of websites to find specific information or write comprehensive reports.
Its training dataset spans a range of tasks, from objective, automatically graded tasks with ground-truth answers to more open-ended tasks with grading rubrics.
Grading during training is performed by a chain-of-thought grader model, which scores the model's response against the ground-truth answer or the grading rubric.
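To make the idea of a chain-of-thought grader concrete, here is a minimal, hypothetical sketch in Python of rubric-based grading: a grader model reasons about a response and then emits a numeric score. The grader model name, prompt, and score format below are our own assumptions for illustration, not OpenAI's actual training setup.

```python
# Hypothetical sketch of rubric-based grading with a chain-of-thought grader model.
# The actual grader, prompt, and output format used to train Deep Research are not public.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GRADER_PROMPT = """You are grading a model response.
Task: {task}
Grading rubric (or ground-truth answer): {rubric}
Model response: {response}

Think step by step about how well the response satisfies the rubric,
then end with a line of the form "SCORE: <number between 0 and 1>"."""


def grade_response(task: str, rubric: str, response: str) -> float:
    """Ask the grader model to reason, then extract its final numeric score."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in grader for illustration only
        messages=[{"role": "user", "content": GRADER_PROMPT.format(
            task=task, rubric=rubric, response=response)}],
    )
    text = completion.choices[0].message.content
    # Take the number after the last "SCORE:" marker in the grader's output.
    return float(text.rsplit("SCORE:", 1)[-1].strip().split()[0])
```

In this sketch, a reward for reinforcement learning could simply be the returned score; the key point is that the grader reasons before scoring, rather than doing an exact string match.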
The model was also trained on the existing safety datasets used for OpenAI o1, as well as new, browsing-specific safety datasets created for Deep Research.
Risk identification, assessment, and mitigation
External red teaming
OpenAI worked with an external red team to assess the key risks associated with Deep Research's capabilities.
The external red team focused on risk areas including personal information and privacy, disallowed content, regulated advice, and dangerous or risky advice. OpenAI also asked red teamers to test more general methods of circumventing the model's safety measures, including prompt injection and jailbreaks.
Red teamers were able to use targeted jailbreaks and adversarial strategies (such as role-playing, euphemisms, hacker slang, Morse code, and input obfuscation like deliberate misspellings) to circumvent some refusal behavior in the categories they tested, and evaluations built on this data compare Deep Research's performance with that of previously deployed models.
Evaluation Method
Deep Research extends the capabilities of reasoning models, enabling them to gather information from a variety of sources and reason over it. It can synthesize knowledge, with citations, into new insights. To evaluate these abilities, some existing evaluations had to be adapted to account for longer and more nuanced answers, which are often harder to judge at scale.
OpenAI evaluated the Deep Research model with its standard disallowed-content and safety evaluations. It also developed new evaluations for areas such as personal information and privacy and disallowed content. Finally, for the Preparedness evaluations, it used custom scaffolds to elicit the relevant capabilities of the model.
Deep Research in ChatGPT also uses a separate, custom-prompted OpenAI o3-mini model to summarize chains of thought. OpenAI similarly evaluated this summarizer model with its standard disallowed-content and safety evaluations.
Observed safety challenges, evaluations, and mitigations
The following table lists the risks and corresponding mitigations; refer to the original report for the specific evaluations and results for each risk.
Preparedness Framework evaluations
The Preparedness Framework is a living document describing how OpenAI tracks, evaluates, forecasts, and protects against catastrophic risks from frontier models.
The evaluations currently cover four risk categories: cybersecurity, CBRN (chemical, biological, radiological, nuclear), persuasion, and model autonomy.
Only models with a post-mitigation score of "medium" or below can be deployed, and only models with a post-mitigation score of "high" or below can be developed further. OpenAI evaluated Deep Research under the Preparedness Framework.
For details of the Preparedness Framework, see: https://cdn.openai.com/openai-preparedness-framework-beta.pdf
Let's look more specifically at Deep Research's Preparedness evaluations. Deep Research is based on an earlier version of OpenAI o3 optimized for web browsing. To better measure and elicit Deep Research's capabilities, OpenAI evaluated the following models:
Deep Research (pre-mitigation): a research-only Deep Research model (not released in the product) whose post-training differs from OpenAI's published models and which does not include the additional safety training applied to publicly released models. Deep Research (post-mitigation): the final, released Deep Research model, including the safety training required for release.
For the Deep Research models, OpenAI tested various settings to evaluate maximal capability elicitation (e.g., with browsing vs. without browsing). It also modified the scaffolds as needed to best measure multiple-choice questions, long-form answers, and agentic capabilities.
To help assess the risk level (low, medium, high, critical) in each tracked risk category, the Preparedness team uses "indicators" that map experimental evaluation results to potential risk levels. These indicator evaluations and the implied risk levels are reviewed by the Safety Advisory Group, which determines the risk level for each category. When an indicator threshold is met or appears about to be met, the Safety Advisory Group analyzes the data further before determining whether the risk level has been reached.
OpenAI says it evaluates models throughout training and development, including a final sweep before model launch. To best elicit capabilities in a given category, it tested various methods, including custom scaffolds and prompts where relevant.
OpenAI also notes that the exact performance of the models used in production may vary depending on the final parameters, system prompt, and other factors.
OpenAI uses a standard bootstrap procedure to compute 95% confidence intervals for pass@1, resampling the model's attempts for each problem to approximate the distribution of the metric.
By default, the dataset is treated as fixed and only the attempts are resampled. While this approach is widely used, it can underestimate uncertainty on very small datasets, because it captures only sampling variance rather than all problem-level variance. In other words, the method accounts for randomness in the model's performance on the same problem across multiple attempts (sampling variance), but not for variation in problem difficulty or pass rate across problems (problem-level variance). This can lead to overly tight confidence intervals, especially when a problem's pass rate is close to 0% or 100% across a few attempts. OpenAI nevertheless reports these confidence intervals to reflect the inherent variability of the evaluation results.
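To make the distinction concrete, here is a minimal sketch (our own illustration, not OpenAI's code) of a per-problem bootstrap of pass@1 in Python. As described above, it treats the problem set as fixed and resamples only the attempts, so it captures sampling variance but not problem-level variance.

```python
import numpy as np

def bootstrap_pass1_ci(attempts, n_boot=10_000, alpha=0.05, seed=0):
    """Per-problem bootstrap CI for pass@1.

    `attempts` holds one array per problem, each containing 0/1 outcomes of
    that problem's repeated attempts. The problem set itself is never
    resampled, which is exactly why problem-level variance is missed.
    """
    rng = np.random.default_rng(seed)
    attempts = [np.asarray(a) for a in attempts]
    stats = []
    for _ in range(n_boot):
        # Resample each problem's attempts with replacement, then average the
        # per-problem pass rates to get one bootstrapped pass@1 value.
        per_problem = [rng.choice(a, size=len(a), replace=True).mean() for a in attempts]
        stats.append(np.mean(per_problem))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    point = np.mean([a.mean() for a in attempts])
    return point, (lo, hi)

# Toy example: 5 problems, each attempted 4 times.
attempts = [[1, 1, 0, 1], [0, 0, 0, 1], [1, 1, 1, 1], [0, 1, 0, 0], [1, 0, 1, 1]]
print(bootstrap_pass1_ci(attempts))
```

A bootstrap that also resampled the problems themselves would additionally capture problem-level variance, at the cost of noisier intervals on small benchmarks.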
After reviewing the results of the Preparedness evaluations, the Safety Advisory Group rated the Deep Research model as medium risk overall, with medium risk across cybersecurity, persuasion, CBRN, and model autonomy.
This is the first time a model has been rated medium risk for cybersecurity.
The results of Deep Research and comparison models on SWE-Lancer Diamond are shown in the figure. Note that these are pass@1 results, meaning each model gets only one attempt at each problem during testing.
Overall, Deep Research performed very well at all stages. The post-mitigation Deep Research model performed best on SWE-Lancer, solving roughly 46-49% of IC SWE tasks and 47-51% of SWE Manager tasks.
For more evaluation details and results, please visit the original report.