The game world may be about to change.
Following the release of an "image-to-3D-world" AI system by World Labs, the spatial intelligence company founded by Fei-Fei Li, Google DeepMind launched its latest foundation world model, Genie 2, on December 5, local time. Genie 2 likewise turns a single image or text description into a 3D scene that can be played by humans or AI agents. Compared with World Labs' demo, Genie 2 adds more complex interactive features.
Google says that users need only provide an image generated by Imagen 3 and a corresponding text description, and Genie 2 generates an interactive 3D environment that can be freely explored with mouse and keyboard for up to one minute. The model can "extend the scene": it not only keeps the generated environment consistent, but also accurately reproduces parts of the world that have left the field of view as the user moves.
Google DeepMind posted a series of demo animations on its official website showing that Genie 2 can simulate object interactions, character animation, realistic lighting, physical reflections, and NPC behavior during generation. Many of the generated scenes approach AAA-game visual quality, and the model performs well even on object-perspective consistency and spatial memory, demonstrating an ability to simulate physical laws.
These capabilities are striking because, today, achieving such effects still requires game developers and artists to spend enormous amounts of time. Netizens exclaimed that this release further blurs the boundary between the physical and digital worlds, offering a glimpse of a "Ready Player One"-style future for world models.
01. Generating infinite interactive worlds through games
Image source: Google DeepMind official website
For decades, games have been a cornerstone of artificial intelligence research. Their immersive, controllable nature and the measurable challenges they pose provide an ideal environment for testing and advancing AI. From mastering Atari games in the field's early days, to AlphaGo's landmark victory in Go, to AlphaStar's dominance in StarCraft II, DeepMind has repeatedly demonstrated the potential of games as a proving ground for AI.
However, a significant obstacle to training general embodied agents (AI that learns to interact with the physical and virtual worlds in many ways) has been the lack of diverse training environments.
Traditional training environments do not offer enough variety and depth for AI agents to fully grasp the complexity of the real world. Genie 2 aims to solve this problem by generating infinite interactive worlds through games.
What sets Genie 2 apart is its ability to create highly customizable games on demand. Simply supply an image as a prompt and the system creates playable worlds suited to specific training or gaming needs. This flexibility lets AI researchers confront agents with a never-ending stream of challenges, helping them develop skills that transfer to real-world scenarios, and it has the potential to revolutionize how developers test and improve AI systems, letting people use AI to better unleash their creativity.
By using Genie 2 to quickly create rich and diverse environments, researchers can generate evaluation tasks that agents never saw during training. For example, Google showed a SIMA agent, developed in collaboration with game developers, following and executing instructions in a previously unseen environment generated from a single image prompt.
Image generated by Imagen 3. Prompt: "A screenshot of a third-person open-world exploration game. The player is an adventurer exploring a forest. There is a house with a red door on the left and a house with a blue door on the right. The camera is positioned directly behind the player."
The SIMA agent is designed to complete a series of tasks in 3D game worlds. Here, Google used Genie 2 to generate a 3D environment with two doors, one blue and one red, then gave the SIMA agent the instruction "open the red door" or "open the blue door"; the agent controls the character via keyboard and mouse to carry out the corresponding action, as the sketch below illustrates.
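To make this interaction pattern concrete, here is a purely hypothetical Python sketch of the agent-in-world loop. Neither Genie 2 nor SIMA exposes a public API, so every name below (GenieWorld, step, the agent callable) is an illustrative assumption rather than a real interface.

```python
# Hypothetical sketch only: Genie 2 and SIMA have no public API, so the
# classes and functions here are illustrative stand-ins.

class GenieWorld:
    """Stand-in for a Genie 2-generated interactive 3D environment."""
    def __init__(self, prompt_image, prompt_text):
        self.frame = prompt_image   # the world is seeded by a single image
        self.prompt = prompt_text

    def step(self, action):
        # A real world model would autoregressively render the next frame
        # conditioned on the keyboard/mouse action; here we only echo it.
        self.frame = f"frame after {action!r}"
        return self.frame

def run_episode(world, agent_act, instruction, max_steps=60):
    """Give the agent an instruction like 'open the red door' and let it
    act in the generated world one frame at a time."""
    observation = world.frame
    for _ in range(max_steps):
        action = agent_act(observation, instruction)  # keyboard/mouse action
        observation = world.step(action)
    return observation

# Toy agent that always presses 'W' (walk forward):
world = GenieWorld("forest.png", "two houses, one red door, one blue door")
run_episode(world, lambda obs, instr: "W", "open the red door")
```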
Additionally, Genie 2 can:
- respond intelligently to actions taken via key presses on the keyboard;
- generate different trajectories from the same starting frame;
- remember, with spatial context, what it has generated before;
- maintain world consistency over time;
- create worlds in different styles, such as first-person or cartoon style;
- create complex 3D visual scenes;
- simulate physical interactions, such as popping balloons or shooting explosive barrels;
- animate many types of characters performing different activities;
- model complex interactions with other agents;
- simulate physical properties such as fluids, smoke, gravity, lighting, and reflections;
- generate worlds from real-world images.

One of the most exciting implications of Genie 2 is its ability to facilitate general agent training. Unlike specialized agents that excel at a single task (such as playing chess or answering trivia), general agents can adapt to a wide variety of challenges, just as humans solve all kinds of problems in the real world. By exposing agents to ever-new environments, Genie 2 prepares them for complex real-world scenarios where adaptability and versatility are critical.
While this research is still in its early stages and there is much room for improvement in both agent and environment generation, Genie 2 clearly represents a path toward solving the structural problem of safely training embodied agents, while also demonstrating the breadth and versatility needed to move toward AGI.
In addition to advancing AI research, Genie 2 opens new room for imagination in game development and interactive prototyping. Game developers, especially independent ones, can use Genie 2 to quickly create unique, playable experiences, cutting the time and cost of traditional design workflows. The value of Genie 2 for game development is so obvious that after its release, DeepMind CEO Demis Hassabis enthusiastically invited Musk on X to make AI games together, and Musk replied:
"Cool".
For gamers, the technology behind Genie 2 promises a future in which gaming environments are more dynamic, personalized, and immersive than ever before. Imagine a video game that adapts to a player's skill level or preferences in real time, providing a truly tailored experience. The world of "Ready Player One" may be getting closer.
What's more, Genie 2's impact extends far beyond gaming.
Genie 2 can serve as a platform for innovation in virtual reality, simulation, and robotics. For example, robots could be trained in Genie 2-generated game environments to learn to navigate unfamiliar terrain or interact with objects in new ways. Likewise, virtual assistants could improve their ability to understand and respond to real-world tasks by practicing in these environments. This is probably why Google DeepMind positioned Genie 2 as a "foundation world model" rather than merely a "game generation model".
02. Unlocking 3D may set off a new technological revolution
When Fei-Fei Li announced the "image-to-3D-world" AI system on X, she gave no explanation of the technical principles behind it. As a result, netizens marveled at the impressive technical capability but regretted being unable to explore the principles underneath.
On its official website, Google briefly describes the principle behind Genie 2 as "an autoregressive latent diffusion model trained on a large video dataset" and links to the relevant papers. Based on this description, the author's rough analysis and understanding of the principle is as follows:
Image source: Google DeepMind official website
Genie 2 is an autoregressive latent diffusion model that learns to generate video content by analyzing large amounts of video data. Specifically, an autoencoder and a large transformer dynamics model work together, enabling Genie 2 to extract key information from raw video and generate new video scenes through deep learning.
First, Genie 2 uses a tool called an autoencoder to extract the important information from video. The autoencoder compresses the key features of each video frame into a simplified form called a "latent frame". The process is comparable to compressing each frame into a smaller packet that retains the most informative parts: latent frames are not the complete video content, but highly abstracted, simplified versions of the video's most important elements.
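As a rough illustration of the idea, here is a minimal PyTorch sketch of a convolutional frame autoencoder. Genie 2's actual autoencoder architecture is not public, so the layer shapes and sizes below are arbitrary assumptions.

```python
# Minimal sketch of frame compression into "latent frames"; the real
# Genie 2 autoencoder is unpublished, so this architecture is assumed.
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    def __init__(self, latent_channels=8):
        super().__init__()
        # Encoder: shrink a 3x64x64 RGB frame into a small latent grid.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1),    # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(32, latent_channels, 4, stride=2, padding=1),  # 32 -> 16
        )
        # Decoder: reconstruct the full frame from its latent.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 32, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, frame):
        latent = self.encoder(frame)           # the "latent frame"
        return self.decoder(latent), latent

frame = torch.randn(1, 3, 64, 64)              # one dummy video frame
reconstruction, latent = FrameAutoencoder()(frame)
print(latent.shape)   # torch.Size([1, 8, 16, 16]): far smaller than the input
```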
Next, these latent frames are fed into a large transformer dynamics model. The model uses a technique called "causal masking" to learn the relationships between frames: when predicting a frame, it may only attend to earlier frames, never future ones. This enforces the temporal order between frames and keeps the video content coherent and smooth; for example, the model learns how an action carries over from one frame to the next, so that motion in the video does not change abruptly.
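The sketch below shows the generic causal-masking mechanism from standard transformer attention; the details of Genie 2's dynamics model are unpublished, so this is only the textbook version of the idea.

```python
# Generic causal mask over a sequence of latent frames; Genie 2's own
# dynamics model is unpublished, so this shows only the standard mechanism.
import torch

num_frames = 5
scores = torch.randn(num_frames, num_frames)   # raw frame-to-frame attention

# Strictly upper-triangular mask: frame t may attend to frames 0..t only,
# never to future frames, which is what makes generation causal.
future = torch.triu(torch.ones(num_frames, num_frames, dtype=torch.bool),
                    diagonal=1)
weights = torch.softmax(scores.masked_fill(future, float("-inf")), dim=-1)

print(weights[0])   # the first frame attends only to itself
print(weights[-1])  # the last frame attends to all five frames
```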
During video generation, Genie 2 uses a method called autoregressive sampling. Rather than generating the entire video at once, it generates frame by frame, with each frame conditioned on the information in the frames before it. This keeps the video continuous and connects successive images naturally, improving realism and smoothness.
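A schematic version of this frame-by-frame loop is sketched below; `dynamics_model` and `decode` are placeholders standing in for Genie 2's unpublished components, not real functions.

```python
# Schematic autoregressive sampling loop; dynamics_model and decode are
# placeholders standing in for Genie 2's unpublished components.
def generate_video(dynamics_model, decode, first_latent, actions):
    """Each new latent frame is predicted from all frames generated so far
    plus the user's action, then decoded back to pixels."""
    latents = [first_latent]
    frames = [decode(first_latent)]
    for action in actions:                        # e.g. one key press per step
        next_latent = dynamics_model(latents, action)  # condition on history
        latents.append(next_latent)
        frames.append(decode(next_latent))
    return frames

# Toy stand-ins so the sketch runs end to end:
toy_dynamics = lambda history, action: history[-1] + 1
toy_decode = lambda z: f"frame({z})"
print(generate_video(toy_dynamics, toy_decode, 0, ["W", "W", "space"]))
```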
In addition, Genie 2 introduces classifier-free guidance to improve the controllability of generated actions. With this technique, Genie 2 can control the actions and scenes in the video more precisely, reducing the uncertainty and incoherent motion that can arise during generation and thereby strengthening control over the video content.
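The sketch below shows classifier-free guidance at a single denoising step, assuming a generic diffusion model; the call signature and guidance scale are illustrative, not Genie 2's actual interface.

```python
# Classifier-free guidance at one denoising step; the model interface and
# guidance scale are assumptions, since Genie 2's internals are unpublished.
import torch

def guided_noise(eps_model, latent, action, guidance_scale=3.0):
    """Blend unconditional and action-conditioned noise predictions.
    A scale above 1 pushes the output toward the action-conditioned
    prediction, so generated motion follows the user's input more closely."""
    eps_uncond = eps_model(latent, None)     # prediction that ignores actions
    eps_cond = eps_model(latent, action)     # prediction given the action
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy stand-in model so the sketch runs:
toy_model = lambda z, a: z * (0.5 if a is None else 1.0)
print(guided_noise(toy_model, torch.ones(3), action="jump"))
```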
As global technology giants turn their attention to the integration of AI with the physical world, we are standing at the threshold of a new technological revolution. Although the pace seems slower than the evolution of question-answering AI such as ChatGPT, the development of 3D AI heralds broader application prospects. Just as Fei-Fei Li's ImageNet project once sparked a wave of AI entrepreneurship in computer vision, 3D AI technology may now be setting off an even larger revolution, one that will not only advance technology but also profoundly change how we interact with the world. From robotics to self-driving cars, from virtual reality to urban planning, the application potential of 3D AI is boundless.
Therefore, we can foresee that 3D AI will open a new era full of innovation and opportunities. It will not only be an iteration of technology, but also a profound reshaping of human lifestyles, pushing us into a smarter, more connected world.