"Give up generative models and do not study LLM (big language model), we cannot make AI reach the level of human intelligence through text training." Recently, Yann, chief AI scientist at Meta LeCun once again bombarded generative AI at the 2025 Artificial Intelligence Action Summit in Paris, France.
LeCun believes that although today's large models run efficiently, their inference process is divergent: each generated token may fall outside the range of reasonable answers, which is why some large models hallucinate. Many generative models can now pass the bar exam and solve math problems, yet they cannot do housework; things humans do without thinking are very hard for generative AI.
He also said that generative models are not suited to making video at all. The AI models you see generating videos do not understand the physical world; they are just producing beautiful pictures. LeCun backs models that can understand the physical world, and he proposed the joint embedding predictive architecture (JEPA), which is better suited to predicting video content. He maintains that only when AI can truly understand the physical world will we see artificial intelligence comparable to human intelligence.
Finally, LeCun emphasized the need for open-source artificial intelligence platforms. In the future we will have universal virtual assistants mediating all our interactions with the digital world. They will need to speak all the languages in the world and understand all cultures, all value systems, and all centers of interest. Such AI systems cannot come from a handful of companies in Silicon Valley; they must be built through effective collaboration.
The highlights of his views are as follows:
1. We need human-level intelligence because we are used to interacting with people. We look forward to AI systems with human intelligence; in the future, ubiquitous AI assistants will become a bridge between humans and the digital world and help humans interact with it better.
2. We cannot make AI reach the level of human intelligence through text training. It is impossible.
3. At Meta we call this kind of AI that can reach human-level intelligence "advanced machine intelligence". We do not like the term "AGI" (artificial general intelligence) and call it "AMI" instead, which in French sounds very much like the word for "friend".
4. Generative models are not suited to making video at all. You may have seen AI models that can generate videos, but they do not really understand physics; they are just producing beautiful pictures.
5. If you are interested in AI that reaches human-level intelligence and you are in academia, do not work on LLMs: you are competing with hundreds of people who have tens of thousands of GPUs, and there is no point.
6. AI platforms need to be shared. They must be able to speak all the languages in the world and understand all cultures, all value systems, and all centers of interest. No single company in the world can train such a foundation model; it must be done through effective collaboration.
7. Open-source models are slowly but surely surpassing closed-source models.
The following is the full text of the talk (abridged):
Why we need human-level AI
As we all know, we need artificial intelligence at the human level. This is not only an interesting scientific question but also a product requirement. In the future we will wear smart devices, such as smart glasses, access AI assistants through them at any time, and interact with them.
We need human-level intelligence because we are used to interacting with people, and we look forward to AI systems with human intelligence. In the future, ubiquitous AI assistants will become the bridge between humans and the digital world, helping humans interact with it better. However, compared with humans and animals, machine learning is still poor: we have not yet built machines with the ability to learn, with common sense, and with an understanding of the material world. Both animals and humans can act on common sense, and those behaviors are essentially goal-driven.
So the artificial intelligence systems that almost everyone is using do not have the characteristics we want. They generate tokens autoregressively, using the tokens produced so far to predict the next one. The way these systems are trained is to present information at the input and make them reproduce that information at the output. The architecture is causal, so it cannot cheat: a token cannot use itself to predict itself and can only look at the tokens that come before it. This is very efficient; people call these general-purpose large models and use them to generate text and images.
But this reasoning process is divergent. Every time a token is generated, it may fall outside the range of reasonable answers and take you farther and farther from the correct answer. Once that happens, there is no way to correct it afterwards, which is why some large models hallucinate and produce nonsense.
These artificial intelligences cannot yet replicate human intelligence; we cannot even replicate the intelligence of animals such as cats or mice, which understand how the physical world works and can carry out actions that rely on common sense and require planning. A 10-year-old child can clear the dishes and wipe the table without being taught. A 17-year-old can learn to drive in about 20 hours. Yet we still cannot build a robot that can work in the home. This shows that our current artificial intelligence research is still missing something very important.
Our existing AI can pass the bar exam, solve math problems, and prove theorems, but it cannot do housework. What we consider effortless is very hard for AI and robots, while what we consider uniquely human, such as language, playing chess, and writing poetry, AI and robots can do easily.
We cannot reach human-level AI through text training
We cannot make AI reach the level of human intelligence through text training; it is impossible. Some vested interests claim that AI will soon reach the level of a human PhD, but that is simply not going to happen. AI may reach PhD level in a specific field such as chess or translation, but a general-purpose large model cannot. If we train these AI models specifically on problems in one field, then when your question is very standard the answer can be generated within a few seconds; but if you slightly change the wording of the problem, the AI may still give the same answer, because it is not really thinking about the question. So it will take time before we have artificial intelligence systems that reach the level of human intelligence.
Not "AGI" but "AMI"In Meta We call this type of AI that can reach the level of human intelligence advanced machine intelligence. We do not like the term "AGI" (general artificial intelligence), but call it "AMI". It is pronounced in French very much like the word "friend" . We need a model that uses the senses to collect information and learn from them, which can manipulate it in the mind and learn two-dimensional physics from videos. For example, systems with lasting memory, systems that can plan actions in layers, and systems that can reason, and then implement controllable and secure systems through design rather than fine-tuning.
I now believe the only way to build such systems is to change the way current artificial intelligence systems do inference. The way LLMs reason today is to produce a token by running a fixed number of neural network layers (Transformer layers), feed it back in, and run the same fixed number of layers again. The problem with this kind of reasoning is that whether you ask a simple question or a complex one, whether the answer is "yes" or "no", the system spends the same amount of computation. So people have been cheating, telling the system how to answer: humans know this trick of reasoning step by step, so the system is made to generate more tokens and therefore spend more compute on the question.
That is not really how reasoning works. In many different fields, such as classical statistical AI and structured prediction, reasoning works like this: you have a function that measures the compatibility or incompatibility between your observation and an output value, and the inference process consists of searching for the output value that minimizes this function. We call this function an energy function. The system reasons by optimizing: if the inference problem is harder, it spends more time on it; in other words, it thinks longer about complex problems.
In classical artificial intelligence, many things revolve around reasoning and search, and any computational problem can be reduced to an optimization or search problem. This type of reasoning is more like what psychologists call System 2: before you act, you think about how to do it. System 1 is what you can do without thinking; it has become subconscious.
Source: Video screenshot
Let me briefly explain energy-based models. The idea is that we capture the dependency between variables through an energy function: given an observation X and an output Y, the energy function takes a low value when X and Y are compatible and a high value when they are incompatible. You do not want to compute Y directly from X; you just want an energy function that measures the degree of incompatibility. Given an X, you look for a Y with low energy.
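A minimal sketch of this idea in code (the energy function, learning rate, and step count below are toy choices of mine, not any particular trained model): inference is carried out by searching for the Y that minimizes the energy given X, rather than by computing Y directly from X.

```python
# Toy energy function: x and y are "compatible" when y is close to 2*x.
# Purely illustrative; a real energy-based model would learn this function.
def energy(x, y):
    return (y - 2.0 * x) ** 2

def infer(x, lr=0.1, steps=200):
    """Inference as optimization: start from a guess for y and descend the energy."""
    y = 0.0
    for _ in range(steps):
        eps = 1e-4
        # Numerical gradient of the energy with respect to y.
        grad = (energy(x, y + eps) - energy(x, y - eps)) / (2 * eps)
        y -= lr * grad
    return y

x = 3.0
y_hat = infer(x)
print(y_hat, energy(x, y_hat))  # y_hat ends up near 6.0, energy near 0
```

Note that a harder inference problem can simply be given more optimization steps, which is the "thinking longer about complex problems" behavior described above.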
Now let us look in more detail at how a world model is built and how it relates to thinking and planning. The system works like this: you observe the world through a perception module, which summarizes the state of the world. Of course the state of the world is not fully observable, so you may need to combine it with a memory that holds your beliefs about the state of the world; the combination of the two feeds the world model.
What is a world model? A world model gives you a summary of the current state of the world in an abstract representation space and, given an imagined sequence of actions, predicts the state of the world after you take those actions. If I ask you to imagine a cube floating in front of you and then rotate it 90° about a vertical axis, what does it look like? You can easily picture the rotated cube in your mind.
I think we will have human-level intelligence before we have audio and video generation that really works. If we have such a world model, which can predict the outcome of a sequence of actions, we can feed the prediction into a task objective that measures how well the predicted final state satisfies the goal we set for ourselves. That is just an objective function; we can also add constraints and treat them as requirements the system must satisfy to operate safely. These constraints guarantee the system's safety because they cannot be bypassed: they are hard-wired by design, not something obtained through training or inference.
To evaluate a sequence of actions, the world model is applied repeatedly over multiple time steps: you feed in the first action and it predicts the state after that action; you feed in the second action and it predicts the next state; and so on along the trajectory, with the task objective and the constraints applied to it. If the world is not fully deterministic and predictable, the world model may also need latent variables to account for everything about the world we have not observed, which skews our predictions. Ultimately, what we want is a system that can plan hierarchically, with several levels of abstraction: at the lowest level we plan low-level actions, such as basic muscle control, while at a high level we plan abstract macro-actions. For example, sitting in my office at New York University and deciding to go to Paris, I can split the task into two sub-tasks, getting to the airport and catching the plane, and then plan each step in detail: grab a bag, go outside, hail a taxi, take the elevator, buy the ticket...
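A rough sketch of this rollout-and-score loop, with toy dynamics, cost, and constraint invented here for illustration (nothing below is the actual Meta implementation): the world model is applied step by step to a candidate action sequence, and the resulting trajectory is scored against a task objective while a hard safety constraint is enforced.

```python
import numpy as np

def world_model(state, action):
    # Toy dynamics standing in for a learned predictor: the action nudges the state.
    return state + action

def task_cost(final_state, goal):
    # Objective: how far the predicted final state is from the goal.
    return float(np.sum((final_state - goal) ** 2))

def violates_constraint(state):
    # Hypothetical hard safety constraint: the state must stay inside a box.
    return bool(np.any(np.abs(state) > 10.0))

def evaluate_plan(initial_state, actions, goal):
    """Roll the world model forward over the action sequence and score the result."""
    state = initial_state
    for a in actions:
        state = world_model(state, a)
        if violates_constraint(state):
            return np.inf  # constraint violations are ruled out at any price
    return task_cost(state, goal)

state0 = np.zeros(2)
goal = np.array([3.0, -1.0])
plan = [np.array([1.0, 0.0]), np.array([1.0, -0.5]), np.array([1.0, -0.5])]
print(evaluate_plan(state0, plan, goal))  # 0.0: this plan reaches the goal exactly
```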
We often do not feel that we are doing hierarchical planning; it is almost entirely subconscious. But we do not know how to make machine learning do this. Today, every system that plans hierarchically has the abstractions at each level specified by hand. We need to train architectures that learn these abstract representations on their own, not only representations of the state of the world but also predictive world models and abstract actions at different levels of abstraction, so that machines can plan hierarchically as effortlessly as humans do.
How to make AI understand the world
With all this in mind, I wrote a long paper about three years ago laying out the areas I think artificial intelligence research should focus on. I wrote it before ChatGPT became popular, and to this day my view has not changed; ChatGPT changed nothing. That paper is about the path towards autonomous machine intelligence, which we now call advanced machine intelligence because the word "autonomous" scares people, and I have presented it in talks on various occasions.
If you want a system to understand how the world works, a natural approach is to take the recipe we used to train natural-language systems and apply it to video: show the system a short clip and train it to predict what happens next. Training it to predict can, in principle, lead it to understand the underlying structure of the world. This works for text because predicting words is relatively easy: there is a finite number of words, a finite number of tokens. We cannot predict exactly which word follows another, or which word is missing from a text, but we can compute the probability of every possible word.
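To make that contrast concrete, here is a toy illustration (the five-word vocabulary and scores are hypothetical): with a finite vocabulary, a model can turn scores into an explicit probability for every possible next word, which is exactly what is intractable over the continuum of possible video frames.

```python
import numpy as np

vocab = ["cat", "sat", "on", "the", "mat"]       # toy vocabulary
scores = np.array([0.2, 1.5, -0.3, 2.0, 0.1])    # hypothetical model scores (logits)

# Softmax: with a finite set of tokens, an explicit probability
# can be assigned to every possible next word.
probs = np.exp(scores - scores.max())
probs /= probs.sum()

for word, p in zip(vocab, probs):
    print(f"{word}: {p:.2f}")
```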
But we cannot do this with images or video. We do not have a good way to represent distributions over video frames, and every attempt to do so runs into essentially intractable mathematics. You can try to attack the problem with statistics and mathematics invented by physicists, but in practice it is best to abandon the idea of probabilistic modeling altogether.
Because we cannot predict exactly what will happen in the world, a system trained to predict frames directly does not do a good job. The solution is a new architecture, which I call the joint embedding predictive architecture (JEPA). Generative models are not suited to making video at all. You may have seen AI models that can generate videos, but they do not really understand physics; they are just producing beautiful pictures. The philosophy of JEPA is to run both the observation and the output through encoders, so that the system no longer predicts pixels but predicts, in representation space, what happens in the video.
Source: Video Screenshot
Let us compare the two architectures. On the left is the generative architecture: you feed in X, the observation, and make a direct prediction of Y. In the JEPA architecture on the right, you run both X and Y through encoders, possibly the same one and possibly different ones, and then predict the representation of Y from the representation of X in that abstract space. This pushes the system to learn an encoder that eliminates everything it cannot predict, and that is what actually happens.
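A schematic sketch of the two architectures in code (the tiny random-matrix "networks" below are purely illustrative stand-ins, not real trained models): the generative model predicts Y directly in input space, while the JEPA predicts the representation of Y from the representation of X and measures its error in that representation space.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_repr = 8, 4

# Stand-ins for trained networks: linear maps with fixed random weights.
W_gen  = rng.normal(size=(d_in, d_in))     # generative: predicts Y directly in input space
W_encx = rng.normal(size=(d_repr, d_in))   # encoder for X
W_ency = rng.normal(size=(d_repr, d_in))   # encoder for Y (could share weights with X's)
W_pred = rng.normal(size=(d_repr, d_repr)) # predictor in representation space

x = rng.normal(size=d_in)
y = rng.normal(size=d_in)

# Generative architecture: the error is measured on every detail of Y.
y_hat = W_gen @ x
gen_loss = np.mean((y_hat - y) ** 2)

# JEPA: the error is measured between representations, so unpredictable
# details that the encoder discards no longer contribute to the loss.
sx, sy = W_encx @ x, W_ency @ y
sy_hat = W_pred @ sx
jepa_loss = np.mean((sy_hat - sy) ** 2)

print(gen_loss, jepa_loss)
```

In practice the encoders and predictor are deep networks trained jointly, with additional techniques to prevent the representations from collapsing; the sketch only shows where the prediction error is computed in each architecture.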
When you film a room and the camera starts to move, neither a human nor an AI can predict exactly who will appear in the next shot or what the texture of the wall or the floor will look like; there are many things we simply cannot predict. So instead of insisting on making probabilistic predictions about things that cannot be predicted, we give up predicting them and learn a representation in which all those details are essentially eliminated. Prediction then becomes much simpler: we have simplified the problem.
JEPA comes in several variants. I will not discuss latent variables here, but I will talk about action-conditioned versions. They are the most interesting, because they are true world models: you have an observation X, the current state of the world, and you feed in the action you plan to take; the model then predicts the representation of the state of the world after that action. That is how you plan.
Recently we have done in-depth work on video JEPA. How does this model work? For example, we take 16 consecutive frames of a video as the input sample, mask and corrupt some of the frames, feed these partially corrupted frames into the encoder, and simultaneously train a prediction module to reconstruct the representation of the complete video from the corrupted input. Experiments show that this self-supervised approach has clear advantages: the learned deep features transfer directly to downstream tasks such as video action classification and perform very well on many benchmarks.
One very interesting thing: if you show this system a video in which something very strange happens, its prediction error shoots up. Take a video, measure the system's prediction error over 16 frames; if something strange happens, for example an object spontaneously disappears or changes shape, the prediction error rises. This tells you that, simple as it is, the system has learned a certain level of common sense: it can tell you when something very strange has happened in the world.
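A toy sketch of that idea (the "encoder" and "predictor" here are placeholder functions and the threshold is hypothetical, not the actual V-JEPA models): slide over the clip, predict each frame's representation from the previous one, and flag time steps where the prediction error spikes.

```python
import numpy as np

def encode(frame):
    # Placeholder encoder: a frame is summarized by a small feature vector.
    return np.array([frame.mean(), frame.std()])

def predict_next(repr_t):
    # Placeholder predictor standing in for a learned model: it assumes the
    # scene changes smoothly, i.e. the next representation stays close to the current one.
    return repr_t

# Fake 16-frame clip: mostly smooth, with an abrupt change at frame 10
# (e.g. an object suddenly changes appearance).
frames = [np.full((8, 8), t * 0.1) for t in range(16)]
frames[10] = np.full((8, 8), 5.0)

threshold = 1.0  # hypothetical threshold on the prediction error
for t in range(1, len(frames)):
    pred = predict_next(encode(frames[t - 1]))
    err = float(np.linalg.norm(encode(frames[t]) - pred))
    if err > threshold:
        print(f"frame {t}: prediction error {err:.2f} -- something unexpected happened")
```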
I also want to share our latest work, DINO-WM, a new way to build visual dynamics models without reconstructing the visual world. You take an image of the world, run it through the DINO encoder, and train a predictor on top of it. The robot then takes an action, which produces the next frame of the video; that image is run through the DINO encoder as well, and the predictor is trained to predict what will happen given the action taken.
Planning is then very simple. You observe an initial state and run it through the DINO encoder; then you roll the world model forward over multiple time steps with an imagined sequence of actions. You also have a target state, represented for example by a target image, which you run through the encoder as well. You then compute the gap, in representation space, between the predicted state and the representation of the target image, and search for the action sequence with the lowest cost.
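A minimal sketch of that planning loop (the encoder, dynamics, and action space below are toy placeholders of mine; in DINO-WM the encoder would be DINO and the predictor a trained model): sample candidate action sequences, roll the predictor forward in representation space, and keep the sequence whose predicted final representation is closest to the encoded goal image.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(image):
    # Placeholder for a visual encoder such as DINO: here the "image" is
    # already a low-dimensional state, so encoding is just the identity.
    return np.asarray(image, dtype=float)

def predictor(repr_t, action):
    # Placeholder learned dynamics in representation space.
    return repr_t + action

def plan(initial_image, goal_image, horizon=5, n_candidates=500):
    s0, s_goal = encoder(initial_image), encoder(goal_image)
    best_cost, best_actions = np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, s0.shape[0]))
        s = s0
        for a in actions:                           # roll the world model forward
            s = predictor(s, a)
        cost = float(np.sum((s - s_goal) ** 2))     # gap to the goal in representation space
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions, best_cost

actions, cost = plan(initial_image=[0.0, 0.0], goal_image=[2.0, 1.5])
print(cost)  # small if a good action sequence was found
```

Random shooting is used here only because it is the simplest search; any optimizer over the action sequence would fit the same loop.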
Source: Video Screenshot
It is a very simple idea, but it works very well. Suppose you have a small T-shaped object and want to push it to a specific position. You know where it has to go, because you feed the image of that position into the encoder and it gives you a target state in representation space. Then you execute the planned sequence of actions in the real world; what you also see is the system's internal prediction for the planned action sequence, which can be passed through a decoder to produce a visual rendering of the internal state.
Please give up research on generative models
Finally, I have a few suggestions to share with everyone. The first is to give up generative models. They are the most popular approach right now and everyone is working on them. You can work on JEPA instead: it is not a generative model; it predicts what will happen in the world in representation space. I have also been saying for a long time that you should give up reinforcement learning; it is inefficient. If you are interested in AI that reaches human-level intelligence and you are in academia, do not work on LLMs, because you would be competing with hundreds of people who have tens of thousands of GPUs, and there is no point. There are still many problems for academia to solve: planning algorithms are very inefficient and we need better ones; JEPA with latent variables is completely unsolved; so is hierarchical planning under uncertainty. All of these are worth exploring.
In the future we will have universal virtual assistants that are always with us, mediating all our interactions with the digital world. We cannot let these AI systems come from a handful of companies in Silicon Valley or China, which means the platforms on which we build these systems must be open source and widely available. These systems are expensive to train, but once you have a foundation model, fine-tuning it for a specific application is relatively cheap, something many people can afford.
AI platforms need to be shared. They need to speak all the languages in the world and understand all cultures, all value systems, and all centers of interest. No single company in the world can train such a foundation model on its own; it must be done through effective collaboration.
Therefore, open-source artificial intelligence platforms are necessary. The crisis I see in Europe and elsewhere is that geopolitical competition is pushing some governments to, in effect, make releasing open-source models illegal, because they want to keep scientific secrets and stay ahead. That is a huge mistake: when you do your research in secret, you fall behind; it is inevitable. What will happen is that other countries will adopt open-source technology and overtake you. That is exactly what is happening right now: open-source models are slowly but surely surpassing closed-source models.