Hands-on with Wanxiang 2.1, the world's first open-source video model that can generate Chinese text: does it beat Sora?


Image source: Generated by Unbounded AI

It's DeepSeek Open Source Week, and nobody else is sitting idle either.

On February 25th, Anthropic released Claude Sonnet 3.7 in the early morning, DeepSeek open-sourced the DeepEP codebase during the day, and at night Alibaba's latest video generation model Wan2.1 was unveiled. A lively day!

Compared with a language model that codes better, or an infrastructure codebase that excites developers, a video generation model is clearly the more exciting thing for ordinary people.

Still adhering to its "open up as much as possible" style, Wanxiang has open-sourced the full inference code and weights of both the 14B and 1.3B parameter versions, supporting both text-to-video and image-to-video tasks. Developers worldwide can download and try them on GitHub, Hugging Face, and the ModelScope community.

It adopts the permissive Apache 2.0 license, which means the copyright of generated content belongs to the developer, and the model can be used both for free and for commercial purposes.

On the VBench evaluation set, Wanxiang 2.1 surpasses domestic and foreign models such as Sora, Luma, and Pika.

How good is it? Without further ado, let's put it to the test!

#01. Model test

Currently, the 2.1 Speed Edition and Professional Edition are available to try on the Tongyi Wanxiang site. Both versions are 14B. The Speed Edition takes about 4 minutes to generate a video; the Professional Edition generates more slowly, taking about 1 hour, but its results are more stable.

For text-to-video, the 2.1 Professional Edition understands text more accurately than the Speed Edition, and its image clarity is relatively higher. However, videos generated by both versions show obvious deformation, and both lack understanding of some physical-world details.

Prompt: In the style of Inception, a wide-angle overhead shot: the hotel corridor rotates at 15 degrees per second, two suited agents tumble and fight between wall and ceiling, their ties float up at 45 degrees under centrifugal force, and shattered ceiling-light fragments scatter along the direction of gravity.

Professional Edition

Speed Edition

Prompt: A girl in a red dress skips down the steps of Montmartre; at each step a box of old keepsakes pops open (a wind-up toy / old photos / glass marbles); under a warm filter, a flock of pigeons traces a heart-shaped trajectory; an accordion scale is precisely synchronized with her steps; fish-eye lens follow shot.

Professional Edition

Speed Edition

Wanxiang 2.1 is currently the world's first open-source video model that can directly generate Chinese text. Although it renders the specified text accurately, it is limited to relatively short text; beyond a certain length, garbled characters appear.

Prompt: A wolf-hair brush sweeps across rice paper; as the ink soaks in, the word "fate" emerges stroke by stroke, with golden light glowing along the edges of the handwriting.

Image-to-video results are relatively stable: characters stay consistent and there is no obvious deformation, but the model's understanding of prompts is incomplete and details are missing. For example, in the sample videos, the pearl milk tea contains no pearls, and Shiji Niangniang did not turn into a plump girl.

Prompt: Oil painting style. A girl in plain clothes takes out a cup of pearl milk tea, parts her red lips and sips it slowly, her movements elegant and unhurried. The background is in deep, dark tones, with the only light focused on the girl's face, creating a mysterious and tranquil atmosphere. Close-up; side-profile close-up.

Prompt: The stone figure's arms swing naturally with its steps, while the background light gradually shifts from bright to dark, creating a visual sense of time passing. The camera stays fixed throughout, focused on the stone figure's movement. The small stone figure in the opening frame gradually grows as the video progresses, finally becoming a round, cute stone girl in the closing frame.

Overall, Wanxiang 2.1's semantic understanding and physical realism still need improvement, but its general aesthetics are solid, and open-sourcing may well accelerate its pace of optimization and updates. We look forward to better results in the future.

#02. Low cost, strong results, high controllability

In terms of algorithm design, Wanxiang is still built on the mainstream DiT architecture and Flow Matching with a linear noise trajectory. That sounds complicated, but the core idea is the same as everyone else's.

The idea: keep adding noise to an image (like TV static) until the picture becomes pure noise, then have the model "denoise", putting each bit of noise back where it belongs, and generate a high-quality image through multiple iterations.
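To make that concrete, here is a minimal sketch (ours, not Wanxiang's training code) of a flow-matching loss with a linear noise trajectory: interpolate linearly between a noise sample and a data sample, and train the network to predict the constant velocity that points from noise to data. The `model(x_t, t)` signature is an assumption for illustration.

```python
import torch

def flow_matching_loss(model, x1):
    """Linear-trajectory flow matching, simplified sketch.

    x1: a batch of clean data (e.g. video latents), shape (B, ...).
    `model(x_t, t)` is a hypothetical velocity-prediction network.
    """
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                                 # pure-noise endpoint
    t = torch.rand(b, *([1] * (x1.dim() - 1)), device=x1.device)
    x_t = (1 - t) * x0 + t * x1                               # linear interpolation between noise and data
    v_target = x1 - x0                                        # constant velocity along the straight path
    v_pred = model(x_t, t.flatten())
    return torch.nn.functional.mse_loss(v_pred, v_target)
```

At sampling time, the model simply integrates this learned velocity field from noise toward data over a few dozen steps.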

The problem is that traditional diffusion models require a huge amount of computation to generate video, with repeated iterative sampling. That means generation takes a long time, yet the videos still aren't long enough; on top of that, compute and VRAM consumption are heavy.

To address this, Wanxiang proposes a novel 3D spatiotemporal variational autoencoder (VAE), called Wan-VAE. By combining multiple strategies, it improves spatiotemporal compression and reduces memory usage.

This technique is a bit like the "dual-vector foil" in The Three-Body Problem, which flattens three dimensions into two. Spatiotemporal compression means compressing the video's space and time dimensions: for example, decomposing the video into a low-dimensional representation, going from generating a 3D cube to generating a 2D slice and restoring it to 3D, or using layered generation to improve efficiency.

To give a simple example: Wan-VAE can compress the novel Romance of the Three Kingdoms into an outline while keeping enough in the outline to recover the content, which greatly reduces memory usage. By the same method, it can "remember" much longer works.
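In tensor terms, "spatiotemporal compression" just means shrinking a video along time and space at once. The toy encoder below (illustrative layer shapes, not Wan-VAE's actual architecture) compresses 16 frames of 256x256 video by 4x in time and 8x in space:

```python
import torch
import torch.nn as nn

class TinySpatiotemporalEncoder(nn.Module):
    """Illustrative 3D encoder: 4x temporal and 8x spatial compression.

    The real Wan-VAE is far more elaborate; this only shows the
    dimensionality-reduction idea (the "outline" of the video).
    """
    def __init__(self, latent_channels=16):
        super().__init__()
        self.net = nn.Sequential(
            # input (B, 3, T, H, W); each strided Conv3d halves the strided dims
            nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(128, latent_channels, kernel_size=3, stride=(2, 2, 2), padding=1),
        )

    def forward(self, video):  # video: (B, 3, T, H, W)
        return self.net(video)

x = torch.randn(1, 3, 16, 256, 256)    # 16 frames of 256x256 RGB
z = TinySpatiotemporalEncoder()(x)
print(z.shape)                          # torch.Size([1, 16, 4, 32, 32])
```

The diffusion model then works in this small latent cube instead of on raw pixels, which is where the memory savings come from.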

This solves the memory footprint problem, and with it the problem of producing long videos. Traditional video models can only handle fixed lengths; beyond a certain length they stumble or crash. But if you store only the outline and remember the before-and-after relationships, then when generating each frame you temporarily cache the key information of the preceding few frames, which avoids recomputing from the first frame. In theory, this method can encode and decode 1080P videos of unlimited length without losing historical information.
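Here is a rough sketch of that caching idea, with a hypothetical chunked-encoder interface (not Wanxiang's actual API): the video is processed in fixed-size chunks, and only a small cache of recent-frame features is carried forward, so memory stays flat however long the video runs.

```python
import torch

def encode_streaming(encoder, frames, chunk=4):
    """Encode an arbitrarily long frame sequence chunk by chunk.

    `encoder(piece, cache)` is a hypothetical causal encoder that
    returns (latents, new_cache); the cache holds only the features
    of the last few frames, so peak memory is independent of the
    total video length.
    """
    cache = None
    latents = []
    for start in range(0, frames.shape[2], chunk):     # frames: (B, C, T, H, W)
        piece = frames[:, :, start:start + chunk]
        z, cache = encoder(piece, cache)               # reuse cached context, no recompute from frame 1
        latents.append(z)
    return torch.cat(latents, dim=2)
```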

This is also why Wanxiang can run on consumer graphics cards. Traditional high-definition video (such as 1080P) carries too much data for the memory of an ordinary graphics card, so Wanxiang reduces the resolution before processing, for example scaling 1080P down to 720P to cut the data volume, and after generation is complete, upscales the result back to 1080P with a super-resolution model.
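As pseudocode, that workflow is just two stages: generate small, then upscale. Both `generator.generate` and the bilinear upscaling below are placeholders, not the real Wan2.1 API; a production pipeline would use an actual super-resolution model for the final step.

```python
import torch
import torch.nn.functional as F

def generate_1080p(generator, prompt):
    """Hypothetical low-res-first pipeline: generate at 720p, upscale to 1080p.

    `generator` is a stand-in for a text-to-video model; its name and
    `generate` signature are illustrative, not the Wan2.1 release API.
    """
    video = generator.generate(prompt, height=720, width=1280)  # (B, C, T, 720, 1280), cheap on VRAM
    b, c, t, h, w = video.shape
    frames = video.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
    # Bilinear interpolation stands in for the super-resolution model,
    # which would restore fine detail rather than merely resize.
    frames = F.interpolate(frames, size=(1080, 1920), mode="bilinear", align_corners=False)
    return frames.reshape(b, t, c, 1080, 1920).permute(0, 2, 1, 3, 4)
```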

According to Wanxiang's estimates, moving spatial downsampling and compression earlier in the pipeline further reduces inference memory usage by 29% without losing performance: generation stays fast and image quality is not degraded.

This engineering innovation addresses the practical obstacles that have kept video generation models from being widely deployed. At the same time, Wanxiang has also further optimized the generation quality itself.

Take fine-grained motion control, for example. Previously, in Runway's video model, relative motion control of single and multiple objects was done by drawing trajectories with a motion brush; Wanxiang instead lets users control how objects move in the video through text, keypoints, or simple sketches (for example, specifying "a butterfly hovers into the frame from the lower-left corner").

Wanxiang 2.1 converts the user's input motion trajectory into a mathematical model that serves as an additional condition to guide the model during video generation. But that alone is not enough: object motion must obey the physical laws of the real world, so on top of the mathematical model, results from a physics engine are introduced to improve the realism of the motion.
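A minimal sketch of the "trajectory as an extra condition" idea, under our own assumptions about shapes and names (the real conditioning pipeline has not been published in this form): rasterize the user's keypoint path into a per-frame guidance map that the generator can take as an extra input channel.

```python
import torch

def trajectory_to_condition(points, t, h, w):
    """Rasterize a keypoint trajectory into a (1, T, H, W) guidance map.

    `points` is a list of (x, y) positions in [0, 1], one per frame,
    e.g. a path entering from the lower-left corner. The generator can
    concatenate this map with its latent input as a conditioning channel.
    """
    cond = torch.zeros(1, t, h, w)
    for frame, (x, y) in enumerate(points[:t]):
        col = min(int(x * w), w - 1)
        row = min(int((1 - y) * h), h - 1)   # y = 0 is the bottom of the frame
        cond[0, frame, row, col] = 1.0
    return cond

# A butterfly hovering in from the lower-left corner over 16 frames:
path = [(0.1 + 0.04 * i, 0.1 + 0.03 * i) for i in range(16)]
cond = trajectory_to_condition(path, t=16, h=32, w=32)
```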

In general, Wanxiang's core advantage lies in using engineering capability to solve problems in real production scenarios, while its modular design leaves room for subsequent iteration. For ordinary users, the threshold for video creation has genuinely been lowered.

The fully open-source strategy thoroughly disrupts the pay-per-use business model of video models. With the arrival of Wanxiang 2.1, the video generation race in 2025 has another good show in store!
