Sora launched yesterday, and everyone has been testing it heavily.
To be honest, the product is very complete, but the quality of the model is really not as good as expected.
But today we are not here to talk about the Sora model.
Instead, while testing Sora yesterday, I generated a gymnastics video, the kind of test I have been running and posting for a long time.
In the year I have been playing with AI video, gymnastics seems to have always been a nightmare for every model.
Whether it is Sora, Luma, Kling, Runway, or any other model, they all fall apart when generating gymnastics videos.
Some failures are milder because the range of motion is small.
Others are severe, with athletes twisting and deforming in mid-air.
Gymnastics is the cruelest Turing test for AI videos.
Back then, everyone used the Will Smith eating noodles clip to measure AI video, but gymnastics was the real gatekeeper.
Five months ago, when DiT video models first came out, a gymnastics video generated by Luma caused an uproar on X.
In the video, the athletes' limbs twist and deform in mid-air. It not only drew nearly a million views, but also set prominent AI figures, including LeCun, arguing.
There is only one focus of debate: Does AI understand the laws of physics?
Five months later, there is almost a consensus on this question.
It definitely does not understand the laws of physics.
Back to gymnastics: why are running, walking, and similar human movements now rendered quite well, and many animal movements quite stable, while complex movements, gymnastics above all, simply fall apart?
It’s actually quite simple.
First, we have to talk about how difficult gymnastics is.
A standard gymnastics move, such as a backflip with a 720-degree twist, seems to last only two seconds, but within those two seconds there are three major difficulties for AI.
The first one is the physical difficulty.
Gymnastics is unlike walking and running, which are movements almost engraved in our genes.
Gymnastics requires you to explode with enough power to take off in an instant, complete two rotations in the air, and then land steadily.
This process involves several physical laws, including gravity, inertia, and conservation of angular momentum. Frankly, if the take-off angle is off by 1 degree or the force is slightly off, you may end up with an unstable landing.
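To make the physics concrete, here is a minimal sketch of my own, not anything a video model actually computes. It treats the gymnast as a point mass that lands at take-off height and assumes angular momentum stays constant in the air. The take-off speed, angular momentum, and moment of inertia are hypothetical numbers chosen only for illustration.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def airtime(takeoff_speed, takeoff_angle_deg):
    """Time in the air for a point mass that lands at take-off height."""
    vertical_speed = takeoff_speed * math.sin(math.radians(takeoff_angle_deg))
    return 2.0 * vertical_speed / G


def rotation_deg(angular_momentum, moment_of_inertia, duration):
    """Degrees rotated in `duration` seconds with angular momentum conserved."""
    omega = angular_momentum / moment_of_inertia  # angular velocity in rad/s
    return math.degrees(omega * duration)


# Hypothetical values for illustration only, not measured from any athlete.
TAKEOFF_SPEED = 5.0      # m/s
ANGULAR_MOMENTUM = 60.0  # kg*m^2/s
INERTIA_TUCKED = 5.0     # kg*m^2

for angle in (80.0, 79.0):
    t = airtime(TAKEOFF_SPEED, angle)
    turned = rotation_deg(ANGULAR_MOMENTUM, INERTIA_TUCKED, t)
    print(f"take-off angle {angle:.0f} deg: airtime {t:.3f} s, rotation {turned:.1f} deg")
```

Even in this toy model, changing the take-off angle changes the airtime, and with it how far the body has rotated at the moment of landing. A real gymnast also changes the moment of inertia mid-flight by tucking and opening, which this sketch ignores.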
In the real world, a gymnast needs at least ten years of training before these skills are engraved in memory and muscle. You can imagine how hard it is for AI to grasp these laws in a short training run.
The second is the biomechanical difficulty.
The human body structure is extremely complex, with 206 bones and more than 600 muscles.
Every bone and muscle has its own movement trajectory and coordination.
For humans, this kind of cooperation is an innate instinct. But for AI, understanding this complex biomechanical system is a huge challenge.
Just as AI image generators often draw six fingers, AI often makes fatal biomechanical mistakes when generating complex movements: elbows bending backwards, knees rotating too far, and, the most classic one, a body that turns around while the head does not...
These errors occur because the AI does not truly understand the structural limitations of the human body. It does not know that human joints can only move at specific angles, does not understand the synergistic relationship between muscle groups, and does not understand the biomechanical characteristics of the human body during high-speed movement.
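The idea that joints only move within specific angles is easy to express as an explicit check, which is exactly the kind of rule a video model never sees written down. The sketch below is my own illustration. The joint names and ranges are rounded, hypothetical values, not clinical reference data.

```python
# Hypothetical, rounded joint ranges in degrees; not clinical reference values.
JOINT_LIMITS_DEG = {
    "elbow_flexion": (0.0, 150.0),
    "knee_flexion": (0.0, 140.0),
}


def find_violations(pose, limits=JOINT_LIMITS_DEG):
    """Return the joints in `pose` whose angle falls outside its allowed range."""
    violations = {}
    for joint, angle in pose.items():
        low, high = limits[joint]
        if not low <= angle <= high:
            violations[joint] = angle
    return violations


# One generated frame in which the elbow bends backwards.
frame = {"elbow_flexion": -35.0, "knee_flexion": 90.0}
print(find_violations(frame))  # {'elbow_flexion': -35.0}
```

A human body enforces these limits automatically. A model that only predicts pixels has no such table to consult.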
More importantly, AI does not understand the concept of "pain". In reality, pain is the body's natural feedback to unreasonable movements and part of its protective mechanism. But in AI-generated movement, it does not matter whether anything would hurt, as long as the body can move.
This is like asking a painter who knows nothing about human anatomy to draw a gymnast's sequence of movements with his eyes closed. The pictures may look smooth yet completely violate how the human body works.
This biomechanical limitation is precisely one of the most difficult bottlenecks for AI to break through when generating gymnastics videos.
The third point is the aesthetic difficulty.
Gymnastics is not only a sport, but also an art.
The gracefulness of movements, the lines of the body, and the overall beauty of rhythm are all important scoring criteria in gymnastics competitions. Even if an action is technically completed, if it lacks aesthetic appeal, points will still be deducted.
The movements must be accurate and graceful, which is too difficult for AI.
When these three levels of difficulty are superimposed, it becomes a nightmare for AI.
Some people say that the failure of AI to generate gymnastics videos is due to insufficient training data. Some people say that the fuzzy processing of the data set causes the model to be unable to understand the human body structure.
But I think the deeper problem is that AI is, in the end, still only imitating, however perfectly.
It is like a parrot that can imitate human speech: even when it answers fluently, it does not know what the words mean.
This metaphor is very accurate.
I think this is true for today’s large models, this is true for AI drawing, and it is even more true for AI video.
When AI generates a video, it is actually playing a game of probability, guessing what the next frame is most likely to look like based on what it has already seen. It's like a person who has never studied gymnastics trying to reproduce a difficult move through a video he has watched.
But gymnastics is not a game of probability.
Some cutting-edge academic work has tried introducing physics engine simulation (for example, combining motion generation with a physics simulator) or adding physical-law constraints to the loss function, but this is still at the exploratory stage, and the so-called world simulator is still far, far away.
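To show what a physical-law constraint in the loss function could look like, here is a minimal sketch of my own. It is an assumption for illustration, not the method of any particular paper or model. It adds a penalty whenever the predicted height of a body in free flight does not accelerate downward at g. The function names, the `weight` parameter, and the sample trajectories are all hypothetical.

```python
G = 9.81  # gravitational acceleration in m/s^2


def mse(pred, target):
    """Ordinary mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def gravity_penalty(heights, dt):
    """Mean squared gap between finite-difference acceleration and -G."""
    accels = [
        (heights[i + 1] - 2.0 * heights[i] + heights[i - 1]) / dt ** 2
        for i in range(1, len(heights) - 1)
    ]
    return sum((a + G) ** 2 for a in accels) / len(accels)


def total_loss(pred_heights, target_heights, dt, weight=0.1):
    """Data term plus a weighted physics term (the weight is a free choice)."""
    return mse(pred_heights, target_heights) + weight * gravity_penalty(pred_heights, dt)


DT = 0.1  # seconds between frames, hypothetical
# A body in free flight launched upward at 5 m/s, and one that hovers in place.
ballistic = [5.0 * (i * DT) - 0.5 * G * (i * DT) ** 2 for i in range(6)]
hovering = [1.0] * 6

print(gravity_penalty(ballistic, DT))  # close to zero
print(gravity_penalty(hovering, DT))   # close to G squared
```

The hovering trajectory is heavily penalized even though each single frame might look plausible, which is the kind of signal a purely pixel-level loss does not provide.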
Just like the Turing test uses human dialogue to test the intelligence level of AI, I think the gymnastics video is testing the depth of AI's understanding of the real world. It requires AI to not only "perfectly imitate", but also to understand the physical laws, biomechanical principles and aesthetic standards behind it.
This understanding is much deeper than we imagined.
This exactly confirms the judgment of Professor Pedro Domingos. The road to AGI may be farther than we think.
This road may be a long one.
But the end point must be worth looking forward to.