Last night, the world model took a big step forward!
Google DeepMind made a surprise announcement of its new-generation world model, Genie 2, which can generate endless 3D worlds from a single image, worlds that can be played by humans or AI agents.
After the news broke, praise and astonishment poured in. Some were stunned by the pace of technological progress and called it the future of video games; others saw an even longer-term future in which everything is virtualized inside a world model.
Interestingly, as soon as Genie 2 was released, DeepMind CEO Demis Hassabis publicly invited Musk to try a world-model AI game with him, and Musk readily agreed:
Clearly, the DeepMind chief is very confident in the technology, and Musk takes it seriously too.
Genie 2: An epoch-making world model

Genie 2 is a foundation world model capable of generating an unlimited variety of action-controllable, playable 3D environments, which can be used to train and evaluate embodied agents.
DeepMind says Genie 2 needs only a single prompt image to generate an environment that humans or AI agents can play using keyboard and mouse input.
We know that games play an important role in AI research. They demand player engagement, come in varying levels of difficulty, and offer easily measurable progress, making them an ideal environment for safely testing and advancing AI.
In fact, ever since Google DeepMind was founded, research combining AI and games has been a priority. Machine Heart has been following its progress in game-related AI research: from the early Atari work, to AlphaGo and AlphaStar, which captured worldwide attention, to the general-purpose agent developed with game studios in the first half of this year. See our earlier report, "The Agent's ChatGPT Moment! DeepMind's general AI evolves toward human players and begins to understand games."
But DeepMind also points out a bottleneck in training more general embodied agents: it is difficult to obtain sufficiently rich and diverse training environments.
Genie 2 appears to fill this gap, creating an endless supply of new worlds for training and evaluating agents. DeepMind said: "Our research also paves the way for a new creative workflow for prototyping interactive experiences."
How does it compare with Fei-Fei Li's spatial intelligence?

A few days ago, we reported on the first project from World Labs, the startup founded by the renowned scholar Fei-Fei Li. From the descriptions, it appears to have the same capability as Genie 2: both can generate interactive 3D scenes from a single image. See the report "Just now, Fei-Fei Li's first startup project drew wide attention: generating interactive 3D scenes from a single image; spatial intelligence is coming."
But there are essential differences between the two. Wang Mengdi, founder and director of the Princeton AI Innovation Center and a tenured professor there, told Machine Heart: "Fei-Fei's World Labs and Google's Genie 2 both appear to generate interactive three-dimensional scenes from a single picture, but they differ fundamentally. Genie 2 is still video diffusion: each frame is generated by pixel prediction, and the probability distribution of the next frame is steered by additional user input. Fei-Fei's World Labs goes further into the physical nature of the world: starting from the picture, it estimates the depth and relative positions of the scene's elements and builds a 3D model of a more physical world, not just an interactive video."
World Labs' demonstration of the effect of generating a 3D scene from a single image
Judging from this description, Fei-Fei Li's project seems closer to a true world model. In any case, the collision of these new technologies drives progress. Professor Wang Mengdi voiced the same expectation: "I look forward to seeing more progress and more contests between different technical approaches. A new paradigm is coming soon."
Emergent capabilities of the Genie 2 foundation world model

Until now, world models have largely been confined to narrow modeling domains.
With the previous generation, Genie 1, DeepMind introduced a method for generating a variety of 2D worlds. Genie 2 is a leap forward in generality: it generates rich and diverse 3D worlds.
Genie 2 is a world model, meaning it can simulate a virtual world, including the consequences of taking any action (e.g. jumping or swimming). Trained on a large video dataset, Genie 2 exhibits the same large-scale emergent capabilities as other generative models: object interaction, complex character animation, physics, and the ability to model and predict the behavior of other agents.
The following examples show people interacting with Genie 2. For each example, a single prompt image was generated with Imagen 3 (DeepMind's state-of-the-art text-to-image model). This means anyone can describe the world they want in words, pick their favorite rendering style, and then step into and interact with this newly created world (or train or evaluate an AI agent in it).
At each step, a human or agent provides keyboard and mouse operations, and Genie 2 simulates the next observation. Genie 2 can generate consistent worlds for up to a minute, with most examples lasting 10-20 seconds.
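In code, this human-or-agent-in-the-loop setup resembles a standard environment interface: prompt image in, then one action per step, one observation back. The sketch below is purely illustrative; `ToyWorldModelEnv` and its trivial frame update are hypothetical stand-ins, not DeepMind's API or model.

```python
import numpy as np

class ToyWorldModelEnv:
    """Hypothetical stand-in for a Genie-2-style world model environment:
    each step takes a keyboard/mouse action and returns the next
    observation. The 'simulation' here is a toy update, not a learned model."""

    def __init__(self, prompt_image):
        # The prompt image seeds the first observation.
        self.obs = np.asarray(prompt_image, dtype=float)

    def step(self, action):
        # A real world model would predict the next video frame from the
        # action; this toy version just nudges the state toward the action.
        self.obs = 0.95 * self.obs + 0.05 * np.asarray(action, dtype=float)
        return self.obs

# A human or agent drives the loop, one action per step.
env = ToyWorldModelEnv(prompt_image=np.zeros(3))
for _ in range(5):
    obs = env.step(np.array([1.0, 0.0, 0.0]))  # e.g. "press right arrow"
```

The point of the interface is only the loop shape: the controller supplies actions, the world model supplies observations, and neither needs to know anything else about the other.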
Motion Control
Genie 2 intelligently responds to keyboard keystrokes, identifying the character and moving it correctly. For example, the model must understand that the arrow keys should move the robot rather than trees or clouds.
Generate counterfactual video frames
Genie 2 can generate different trajectories from the same starting frame, which means counterfactual experiences can be simulated for training agents. In the two clips below, each video starts from the same frame, but the human player takes different actions.
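This counterfactual setup is easy to picture as two rollouts that share a first frame but diverge in their actions. The sketch below uses a hypothetical deterministic toy model (`toy_next_frame` is invented for illustration); the real Genie 2 samples frames from a learned distribution.

```python
import numpy as np

def toy_next_frame(frame, action):
    # Hypothetical one-step world model: a deterministic toy, not Genie 2.
    return 0.9 * frame + 0.1 * action

def rollout(start_frame, actions):
    """Generate a trajectory from one starting frame and an action sequence."""
    frames = [start_frame]
    for a in actions:
        frames.append(toy_next_frame(frames[-1], a))
    return frames

start = np.zeros(2)                                   # shared starting frame
left  = rollout(start, [np.array([-1.0, 0.0])] * 4)   # player goes left
right = rollout(start, [np.array([+1.0, 0.0])] * 4)   # player goes right
# Identical first frames, diverging futures: counterfactual training data.
```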
Long Span Memory
Genie 2 is able to remember parts of the world that disappear from view and then accurately represent them when they become visible again.
Long videos with newly generated content
Genie 2 can generate plausible new content on the fly and maintain a consistent world for up to a minute.
Diverse environments
Genie 2 can create different perspectives, such as first-person, isometric or third-person driving perspective.
3D Structures
Genie 2 learned to create complex 3D visual scenes.
Object Affordances and Interactions
Genie 2 is able to simulate interactions between a variety of objects, such as popping balloons, opening doors, and shooting dynamite barrels with a gun.
Character Animation
Genie 2 learned to animate a variety of characters that perform different activities.
NPC
Genie 2 can simulate other agents, and even complex interactions with them.
Physics
Genie 2 is capable of modeling water effects.
Smoke effects
Genie 2 is capable of modeling various smoke effects.
Gravity Effects
Genie 2 is capable of modeling various gravity effects.
Lighting Effects
Genie 2 is capable of modeling point and directional lighting effects.
Reflection Effects
Genie 2 is capable of modeling reflections, blooms and colored lighting effects.
Use real-world images as prompts
Genie 2 can also use real-world images as prompts, modeling details such as grass swaying in the wind or water flowing in a river.
Genie 2 supports rapid prototyping

Genie 2 makes it easy and fast for researchers to prototype new environments for training and testing embodied AI agents.
The image below uses different Imagen 3 outputs as prompt images for Genie 2, simulating flight by paper airplane, dragon, eagle, or parachute, and testing how well Genie animates each avatar.
Thanks to Genie 2's out-of-distribution generalization, concept art and sketches can be turned into fully interactive environments. This lets artists and designers prototype rapidly, kick-starting environment design and further accelerating research. The image below shows an example of a research-environment concept created by a concept artist.
AI agents acting in the world model

By using Genie 2 to quickly create rich and diverse environments for AI agents, researchers can generate evaluation tasks that the agent never saw during training.
The image below shows an example of the SIMA agent, developed in collaboration with game developers, following instructions in a previously unseen environment that Genie 2 synthesized from a single prompt image.
Prompt: Screenshot of a third-person open-world exploration game. The player is an adventurer exploring a forest. On the left is a house with a red door; on the right, a house with a blue door. The camera is positioned directly behind the player. Photorealistic and immersive.
SIMA agents follow natural language instructions to complete a series of tasks in a 3D game world. In the figure below, Genie 2 is used to generate a 3D environment with two doors (blue door and red door) and provide instructions to the SIMA agent to open each door. In this example, SIMA controls the avatar via keyboard and mouse input, while Genie 2 generates game frames.
You can also use SIMA to help evaluate Genie 2 functionality. In the image below, SIMA is instructed to look around and explore the back of the house to test Genie 2's ability to generate a consistent environment.
Although this research is still at an early stage, and both agent and environment-generation capabilities have plenty of room to improve, Google believes Genie 2 is a key route to solving the structural problem of safely training embodied agents, while also providing the breadth and generality needed to progress toward AGI.
The picture below is a computer game image generated by Imagen 3. The prompt was: "A computer game image showing a rough cave or mine interior. The view is third-person, positioned above the player's avatar and looking down at it. The player avatar is a knight holding a sword. In front of the knight are three stone archways, and the knight can choose to pass through any of the doors. Through the first door we can see a tunnel lined with strange green plants and glowing flowers. Through the second, a corridor of riveted iron plates with spikes nailed to the cave walls, leading toward an ominous glow in the distance. Through the third door we can see a set of rough stone steps climbing to a mysterious destination."
The following is the game frame generated based on the above picture.
The technology behind it: a diffusion world model

Genie 2 is an autoregressive latent diffusion model trained on a large video dataset. Video frames are passed through an autoencoder, and the resulting latent frames are fed to a large transformer dynamics model, trained with a causal mask similar to that used in large language models.
At inference time, Genie 2 samples autoregressively, frame by frame, conditioning on the individual actions and the past latent frames. Google uses classifier-free guidance to improve action controllability.
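The sampling loop can be sketched as follows. This is a minimal illustration under loud assumptions: `dynamics_model` is a toy stand-in for the transformer (the real model also runs a diffusion denoising process per frame, omitted here), and all names are invented for this sketch. Only the structure, autoregressive conditioning plus classifier-free guidance over actions, mirrors the description above.

```python
import numpy as np

def dynamics_model(latents, action):
    """Stand-in for the transformer dynamics model: predicts the next
    latent frame from all past latent frames and, optionally, an action.
    Passing action=None mimics the 'unconditional' branch that
    classifier-free guidance needs (the action conditioning is dropped)."""
    ctx = latents.mean(axis=0)          # toy summary of the causal context
    if action is None:
        return 0.9 * ctx
    return 0.9 * ctx + 0.1 * action

def cfg_predict(latents, action, guidance_scale=1.5):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the action-conditional one, sharpening how strongly
    the action steers the next frame."""
    cond = dynamics_model(latents, action)
    uncond = dynamics_model(latents, None)
    return uncond + guidance_scale * (cond - uncond)

def sample_rollout(first_latent, actions, guidance_scale=1.5):
    """Autoregressive sampling: each predicted latent frame is appended
    to the context and conditions all later predictions."""
    latents = [first_latent]
    for a in actions:
        latents.append(cfg_predict(np.stack(latents), a, guidance_scale))
    return latents

frames = sample_rollout(np.ones(4), [np.full(4, 0.5)] * 3)
```

With `guidance_scale = 1.0` this reduces to the plain conditional prediction; values above 1 exaggerate the action's effect, which is exactly the controllability trade-off classifier-free guidance offers.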
The examples in this article were generated with the undistilled base model to show the full range of what is possible. A distilled version can run in real time, but with reduced output quality.
Developing the technology responsibly

Google says Genie 2 demonstrates both the power of foundation world models for creating diverse 3D environments and their potential to accelerate agent research. Given that this research direction is still young, Genie's world generation will continue to improve in generality and consistency.
Like SIMA, this research is a step toward more general AI systems and agents that can understand and safely carry out a wide range of tasks, helping people both online and in the real world.
Incidentally, DeepMind also released GenCast, an AI weather-prediction model whose forecasting performance likewise reaches the current state of the art.
Reference content:
https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/
https://news.ycombinator.com/item?id=42317903