Gemini 2.0 is here: a super model that serves as the base for all AI agents
Editor
2024-12-12 11:02


Image source: generated by Unbounded AI

The OpenAI event entered its fifth day, bringing an upgraded integration between ChatGPT and Apple devices. Users can enable the ChatGPT extension for Apple Intelligence in Settings, without needing a ChatGPT account, to experience Siri handing off complex tasks, content creation, the iPhone 16's visual intelligence mode, and quick invocation of ChatGPT on macOS.

The demonstration was also very simple: after the user says "Ask ChatGPT…" to Siri, the request is handed over to ChatGPT; long-pressing the Camera Control button on the side of the iPhone 16 opens the camera, and tapping "Ask" calls on ChatGPT to analyze what is being shot; double-pressing the Command key on macOS activates ChatGPT to quickly analyze a long PDF and extract its key information.

The livestream lasted only 12 minutes, and since most of the features had already been shown in Apple's own demos, it felt unremarkable overall.

The real highlight of the day came from Google.

That morning local time, Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu jointly published a post announcing Gemini 2.0, Google's most powerful model to date, built specifically for the new "agentic era," and officially released the first model in the series: the experimental version of Gemini 2.0 Flash.

1. Performance surpassing 1.5 Pro, major breakthroughs in multimodality, native tool integration


Built around low latency and enhanced performance, Gemini 2.0 Flash represents the state of the art of Google's AI model development.

Compared with the previous generation, Gemini 2.0 Flash delivers significantly better results while staying fast. On key benchmarks covering MMLU, coding, math, and reasoning, it not only outperforms 1.5 Pro but does so at twice the speed.

In terms of multimodality, 2.0 Flash makes a leap: in addition to accepting multimodal input such as images, video, and audio, it adds multimodal output, including natively generated interleaved text-and-image content and multilingual text-to-speech.

At the same time, the model can natively call Google Search, execute code, and invoke user-defined third-party tools.
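For a rough sense of what this looks like in code, here is a minimal sketch using Google's google-genai Python SDK, following its published Gemini 2.0 examples. The model name, the tool configuration fields, and whether search and code execution can be combined in a single request are assumptions that may not match exactly what ships:

```python
# Minimal sketch: native tool use with the google-genai Python SDK.
# Assumes `pip install google-genai` and GOOGLE_API_KEY set in the environment.
from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",  # experimental Gemini 2.0 Flash
    contents="What changed in Python 3.13? Check recent sources, and verify "
             "any version arithmetic by running code.",
    config=types.GenerateContentConfig(
        tools=[
            types.Tool(google_search=types.GoogleSearch()),        # native search
            types.Tool(code_execution=types.ToolCodeExecution()),  # built-in sandbox
        ],
    ),
)
print(response.text)
```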

Developer support: Multi-modal real-time API

To help developers build richer, more dynamic interactive applications, Google has also launched a new multimodal real-time API that supports streaming audio and video input as well as combined multi-tool calls.

Developers can currently access the 2.0 Flash experimental version's multimodal input and text output through Google AI Studio and the Vertex AI platform. Text-to-speech and native image generation are open only to early-access partners for now, with broader availability and further model versions expected in January next year.
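The snippet below sketches how a developer might open a session against the new real-time (Live) API for a simple text turn, loosely modeled on early google-genai examples; the connection method, config fields, and message shape here are assumptions rather than a confirmed interface:

```python
# Hypothetical sketch of one text turn over the multimodal Live API.
# Real-time audio/video streaming would reuse the same session pattern.
import asyncio
from google import genai

client = genai.Client()

async def main():
    # TEXT output only; audio/image output is limited to early partners for now.
    config = {"response_modalities": ["TEXT"]}
    async with client.aio.live.connect(
        model="gemini-2.0-flash-exp", config=config
    ) as session:
        await session.send(input="Summarize what you can do.", end_of_turn=True)
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

asyncio.run(main())
```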

Available to users worldwide, plus a new research tool: Deep Research

On the consumer side, the 2.0 Flash experimental version has been integrated into the Gemini chat assistant. Users worldwide can select it from the model drop-down menu in the desktop and mobile web versions, and integration into the Gemini mobile app is coming soon.

Google is also testing Gemini 2.0's advanced reasoning in Search's AI Overviews feature to help answer more complex, multi-step questions, with plans to expand to more Google products early next year.

Particularly worth mentioning: Google today also launched a new Deep Research feature for Gemini Advanced subscribers.

Designed specifically for complex online research, it automatically drafts a multi-step research plan (powered by Gemini 1.5 Pro) once the user asks a question, gathers and analyzes relevant information from across the web, keeps refining its approach based on what it finds, and finally produces a comprehensive report with in-depth findings and accurate sources. It greatly simplifies the tedious, time-consuming research process: good news for researchers, and a godsend for PhD students.

2. An AI model built for the "Year of the Agent"

The positioning of the Gemini 2.0 series is unambiguous: it is, in Google's words, an "AI model for the agentic era."

Pichai said that over the past year Google has focused on developing models with stronger agentic capabilities: models that deeply understand the user's environment, think multiple steps ahead, and carry out actions on the user's behalf under supervision. Combined with the previously released Genie 2, Google's vision for spatial intelligence and world models is coming into focus.

Hassabis went so far as to declare that 2025 will be the "Year of the Agent," saying that Gemini 2.0 Flash's native user-interface interaction, multimodal reasoning, long-context understanding, complex instruction following and planning, compositional function calling, and native tool use will make it the core model underpinning future agentic work, moving closer to the vision of a "universal assistant."

In this release, Google showed progress on a series of prototype projects built on 2.0 Flash's new capabilities, including:

Project Astra: a universal intelligent assistant for the real world

At this year's I/O conference, Google first demonstrated Project Astra, which combines multimodal understanding with real-time voice interaction. Thanks to Gemini 2.0 and feedback from Android testers, the latest version of Astra delivers the following key upgrades:

• Comprehensively improved dialogue: supports multilingual and mixed-language conversation, and understands different accents and uncommon vocabulary more accurately.

• Upgraded tool calling: natively integrates Google Search, Lens, and Maps, making it significantly more useful in daily life.

• Enhanced memory: retains richer context within conversations, supports up to 10 minutes of in-session memory, and offers a more personalized interactive experience.

• Optimized latency: new streaming and native audio-understanding technology brings response speed close to the pace of human conversation.

Project Mariner: a complex-task assistant in the browser

Project Mariner is Google's experimental agent for exploring the future of human-computer interaction, focused on handling complex tasks inside the browser.

Relying on Gemini 2.0's advanced reasoning, it can comprehensively understand and reason over everything on the browser screen, including pixels, text, code snippets, images, and form elements, and it completes tasks on the user's behalf through an experimental Chrome extension.

In the WebVoyager benchmark, which measures an agent's ability to complete real-world web tasks, Mariner achieved a leading score of 83.5% as a single-agent system.

That said, the project still has room to improve in accuracy and responsiveness. To keep usage safe, Mariner's permissions are strictly limited, and sensitive operations such as online purchases require explicit user confirmation, striking a balance between safety and efficiency.

Jules: an AI programming assistant for developers

Jules is an AI-powered code agent for developers, integrated directly into the GitHub workflow. Thanks to improvements in Gemini 2.0, Jules can tackle issues, draft plans, and execute coding tasks under a developer's guidance and supervision. The project aims to explore how AI agents can boost productivity in the developer community and pave the way for future cross-domain AI applications.

Game agents: blurring the line between the virtual and the real

Google also shared a few easter eggs hidden among the prototypes.

For example, in gaming, agents powered by Gemini 2.0 have shown strong adaptability in virtual environments: they can analyze the on-screen action in real time and offer players strategic advice.

Genie 2, previously launched by DeepMind, can generate endlessly playable 3D game worlds from a single image, while game agents built with developers such as Supercell have demonstrated excellent rule comprehension and problem-solving in strategy and simulation games. Combined with Google Search, these agents can also surface rich game knowledge for players.

The spatial intelligence potential of Gemini 2.0

In addition, Gemini 2.0 raises spatial understanding, already present in version 1.5, to a new level. Through a new toolset in AI Studio, developers can more easily explore spatial-intelligence applications that combine multimodal reasoning, useful not only in virtual scenes but also extendable to physical-world applications such as robotics.

Core capability improvements include:

• Fast spatial analysis: identifies objects in an image and analyzes their spatial relationships with very low latency

• Intelligent object recognition: supports search and matching within an image, accurately locating details even when they are hidden or blurred

• Multilingual spatial annotation: combines spatial information with intelligent multilingual labeling and translation

• Spatial logic understanding: grasps spatial associations between objects, for example matching a physical object to its shadow

• 3D reconstruction: converts 2D photos into interactive 3D top-down views for the first time

In the demonstrations, Gemini 2.0 showed several impressive scenarios: identifying origami animals and their shadows, matching socks with specific patterns, providing bilingual labels for items, and analyzing real-world scenes to solve problems. The newly introduced 3D spatial understanding in particular, while still early, shows the potential to turn flat images into interactive three-dimensional scenes, opening up broader application space for developers; a minimal sketch of the bounding-box prompt pattern follows below.
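As a concrete starting point, the sketch below shows the common prompt pattern for 2D spatial understanding: send an image and ask the model for labeled bounding boxes as JSON. The normalized 0-1000 [ymin, xmin, ymax, xmax] convention follows Google's AI Studio spatial-understanding demos, and the file name is a placeholder:

```python
# Minimal sketch: asking Gemini 2.0 for 2D bounding boxes over an image.
from google import genai
from google.genai import types

client = genai.Client()

with open("desk.jpg", "rb") as f:  # hypothetical local photo
    image_bytes = f.read()

prompt = (
    "Detect the objects in this image. Return only JSON: a list of "
    '{"label": <name>, "box_2d": [ymin, xmin, ymax, xmax]} entries, '
    "with coordinates normalized to 0-1000."
)

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        prompt,
    ],
)
print(response.text)  # parse and rescale the boxes to draw overlays
```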

Compared with OpenAI's lackluster livestream today, Google's Gemini 2.0 not only stole the spotlight but won convincingly on substance.

Pichai said that millions of developers are already building with Gemini, and Google itself is using Gemini to reshape its seven core products, which reach as many as 2 billion users.

The launch of Gemini 2.0 marks AI's shift from merely understanding information to actually executing tasks, moving toward the goal of a "universal assistant." With its sixth-generation TPU and the newly unveiled Willow quantum chip, Google is playing a key role in pushing the limits of compute, driving a leap in productivity, and leading the march toward AGI.
