After the emergence of visual model intelligence, Scaling Law will not end
Editor
2024-11-25 15:34:01

Image source: Generated by Unbounded AI

Scaling Law may be coming to an end - this has been one of the most hotly debated topics recently. The discussion originated from a Harvard paper, "Scaling Laws for Precision," whose research shows that current language models have been overtrained on massive amounts of data, and that continuing to add pre-training data may even have side effects.

The signal this sends is that, in natural language processing, Scaling Law has hit a visible bottleneck: simply increasing model size and data volume may no longer improve performance, and low-precision training and inference yield diminishing marginal returns for model quality.

Scaling Law does show signs of "running out of road" in NLP, but that does not mean its end has truly arrived. In the multi-modal field, the data spans images, video, audio and other types, making it far more complex in information richness, processing methods, and application domains, and training at very large scale is difficult. As a result, Scaling Law has never been truly verified there.

However, Vidu 1.5, newly released by Shengshu Technology, a large-model company incubated at Tsinghua University, shows that Scaling Law in the multi-modal field has only just begun. Through continuous Scaling Up, Vidu 1.5 has reached a "singularity" moment, with emergent "contextual capabilities": it can memorize and understand multi-subject inputs and exert more precise control over complex subjects. Whether the subject is a richly detailed character or a complex object, uploading three pictures from different angles is enough for Vidu 1.5 to keep a single subject's appearance highly consistent.

Vidu 1.5 not only enables precise control over a single subject, but also maintains consistency across multiple subjects. Users can upload images containing characters, props, objects, environmental backgrounds and other elements, and Vidu can integrate them seamlessly and have them interact naturally.

Behind Vidu's breakthroughs in subject consistency, Scaling Law is not the only factor at work; the more fundamental reason is its fine-tuning-free, unified technical architecture. To achieve consistency, most current video models rely on LoRA solutions fine-tuned for a single task on top of pre-training. Vidu's underlying model breaks away from this mainstream industry approach and makes pioneering changes.

Coincidentally, looking back at the development of large language models, the qualitative leap from GPT-2 to GPT-3.5 marked the breakthrough from pre-training plus task-specific fine-tuning to a single unified framework. In that sense, the launch of Vidu 1.5 has opened the GPT-3.5 moment for multi-modal large models.

Since Sora was released at the beginning of the year, it has seen no iterative new versions. Other video generation startups seem to lack an anchoring direction, and most are doing derivative work on the DiT architecture. Commenting on this, Bao Fan, CTO of Shengshu Technology, said: "We will not chase the route Sora has set. Instead, we have followed our own path from the beginning, aiming at the goal of a universal multi-modal large model and building the corresponding capabilities."

From releasing U-ViT, the world's first Transformer-based diffusion architecture, earlier than Sora, to being the first to handle generalized tasks with a unified architecture, Shengshu has not only a first-mover advantage but also the ability to keep making breakthroughs. Compared with other video generation models in the industry, Vidu has begun to open up a generational technical gap.

1. Redesigning the "underlying architecture"

Achieving subject consistency is a hard nut to crack for video models. "It's as if you know the engine is crucial to a car, and you know that a qualitative change in the engine would lift the car's performance, but building a good engine is genuinely hard," Bao Fan told AI Technology Review.

Video models at home and abroad, Sora included, have made no breakthroughs in subject consistency. The one large domestic company currently attempting it is limited to facial-consistency control: details such as clothing and styling are hard to guarantee, and it relies on a LoRA fine-tuning solution.

Vidu's achievements in subject consistency were not made overnight. When Vidu launched at the end of July 2024, it focused on the consistency problem and on better facial-consistency control; in September it launched the world's first "subject reference" feature, extending control of a single subject from the face to the subject's entire appearance; Vidu 1.5, launched in November, goes further, allowing highly precise control over different views of a single subject while also cracking multi-subject control.

In other words, when it launched in July, Vidu had already solved many of the problems other video generation models are only tackling now.

In terms of technical approach, other companies are still confined to pre-training plus LoRA fine-tuning. Although this route is mature, it has many shortcomings: data preparation is cumbersome and training takes a long time; it overfits easily and thus forgets much of the original knowledge; and it fails to capture details, leading to inaccurate features. Shengshu sticks to the principle of generality and implements it through a unified underlying model architecture, so there is no need to separately collect, label, and fine-tune data: only 1 to 3 pictures are needed to output high-quality video.
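
To make the contrast concrete, here is a minimal sketch of the two workflows. The function names and file names are hypothetical illustrations, not Shengshu's or Vidu's actual API; the sketch only contrasts per-subject LoRA fine-tuning with fine-tuning-free reference conditioning.

```python
# Illustrative sketch only -- names are hypothetical, not Vidu's API.
from typing import List

def finetune_lora(base_model: str, subject_images: List[str]) -> str:
    """Per-subject route: collect and label data, then train a LoRA adapter for ONE subject."""
    # ...training time per subject, risk of overfitting and forgetting original knowledge...
    return f"{base_model}+lora({len(subject_images)} imgs)"

def generate_with_references(base_model: str, reference_images: List[str], prompt: str) -> str:
    """Unified route: no extra training; 1-3 reference images are consumed directly at inference."""
    return f"video<{base_model} | {len(reference_images)} refs | {prompt}>"

# Per-subject LoRA: a new adapter for every character or prop.
adapter = finetune_lora("video-base", ["girl_front.png", "girl_side.png", "girl_back.png"])

# Fine-tuning-free: the same base model handles any subject from a few reference images.
clip = generate_with_references("video-base",
                                ["girl_front.png", "girl_side.png", "girl_back.png"],
                                "the girl walks through a garden")
print(adapter, clip, sep="\n")
```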

Comparing this with the evolution of large language model technology reveals that Vidu shares the same design philosophy: just as large language models use a single Transformer to process all input and output tokens, Vidu, as a video model, unifies all problems into visual inputs and outputs represented as patches; on top of this unified problem form and unified architecture, Vidu, like a large language model, uses a single network to model variable-length inputs and outputs.
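
As a rough illustration of "everything becomes visual patches processed by one network," here is a minimal PyTorch sketch. The patch size, embedding dimension, and layer count are arbitrary assumptions for demonstration, not Vidu's actual configuration.

```python
# Minimal sketch of unified patch tokens, assuming a ViT-style patch embedding.
import torch
import torch.nn as nn

patch, dim = 16, 512
to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # image/frame -> patch tokens
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)

def patchify(x: torch.Tensor) -> torch.Tensor:
    """(B, 3, H, W) -> (B, num_patches, dim): every visual input becomes the same kind of token."""
    return to_tokens(x).flatten(2).transpose(1, 2)

refs = [torch.randn(1, 3, 256, 256) for _ in range(3)]   # 1-3 reference images of the subject
frame = torch.randn(1, 3, 256, 256)                       # one target frame, as a stand-in

# Unified problem form: all visual inputs/outputs concatenate into one variable-length sequence,
# and a single network models the whole sequence.
sequence = torch.cat([patchify(t) for t in refs + [frame]], dim=1)
out = backbone(sequence)
print(sequence.shape, out.shape)
```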

The "unified problem form" is the starting point of the general model. The more difficult part is to unify the architecture. Now Vidu has made some disruptive designs on the original U-ViT, which is essentially different from Sora's DiT architecture and is more unified in architecture. Bao Fan admitted that developing this architecture is as difficult as designing a Transformer from scratch.

The unified architecture traces back to September 2022, when Bao Fan, then a PhD student in Professor Zhu Jun's research group at Tsinghua University, submitted the paper "All are Worth Words: A ViT Backbone for Diffusion Models," proposing the U-ViT architecture two months before Sora's DiT architecture. CVPR 2023, which rejected DiT, accepted U-ViT.

In March 2023, Professor Zhu Jun's research team released another work, UniDiffuser. UniDiffuser matched the performance of Stable Diffusion 1.5 of the same period, demonstrating strong capabilities on visual tasks; more importantly, it is more extensible and can perform arbitrary generation across images and text on a single underlying model. Put simply, besides text-to-image generation, it can also handle image-to-text, joint image-text generation, unconditional image or text generation, and image-text rewriting. Later, OpenAI applied DiT to video tasks, while Shengshu's founding team first applied U-ViT to image tasks, starting with tasks that could be verified on smaller compute clusters.

In April 2024, Shengshu began modifying the U-ViT architecture at the level of the underlying model, making the team the first to launch a self-developed large video model, Vidu. The breakthroughs have continued since: when Vidu officially launched globally in July, it had verified its approach on facial consistency, and with the release of Vidu 1.5, Scaling Up on this architecture has let a multi-modal model glimpse the "singularity."

Looking back at the development of large language models, the core idea of GPT-2 was to let the model learn without supervision from massive text data in the pre-training stage, independent of any specific task; after pre-training, GPT-2 was fine-tuned with domain-specific annotated data to better adapt to particular tasks or application scenarios. By the GPT-3.5 stage, however, the pre-train-then-fine-tune-per-task pattern was abandoned: a simpler, more efficient unified architecture alone supported a wide variety of text tasks, and the model developed strong generalization capabilities.

Similar to the transition from GPT-2 to GPT-3.5, which moved from pre-training plus task-specific fine-tuning to a unified, general technical architecture, the launch of Vidu 1.5 has brought video models to their GPT-3.5 moment. In other words, other video models are still in the GPT-2 stage of pre-training plus fine-tuning, while Shengshu's Vidu has reached the GPT-3.5 stage.

2. The emergence of intelligence in the era of visual context

A unified, efficient underlying technical architecture is Vidu's foundation, but its current overall performance is owed not only to the architecture but also to the data engineering behind the video model.

In close-up scenes of people, Vidu 1.5 keeps facial details and dynamic changes in expression natural and smooth, without stiffness or distortion. In one demo video, a little girl's expression shifts from happy to sad entirely naturally. Bao Fan told AI Technology Review that careful control of the data behind such details is crucial.

With the Scaling Up of high-quality data, Bao Fan said he has also seen, in the underlying video generation model, the kind of emergent intelligence observed in large language models. For example, Vidu 1.5 can fuse different subjects, seamlessly blending the front of character A with the back of character B to create an entirely new character, an ability that was not anticipated.

In addition, Vidu 1.5's emergent intelligence can be seen in its improved contextual capabilities and stronger memory, reflected in the unified control of characters, props, and scenes within a video.

The key to this is solving the problem of flexible input of multiple images, which is analogous to increasing a language model's context window. In a conversation with a chatbot, a character setting is first given via the prompt, after which the chatbot can carry on an interactive conversation in that character's voice. This shows that a language model does not merely process a single piece of input text; it relates text across the context, identifies relationships between sentences, and generates answers or content that are coherent and in context.

Similarly, give the video model a photo of a subject as a prompt, and no matter what new instructions follow, it will generate a video involving that subject. It follows that if a video model is to generate consistent subjects more stably, it too must understand the related text or image information supplied before and after, and then generate consistent, coherent, and logical content on that basis.

In fact, the difficulty of moving from single-subject to multi-subject consistency also lies in context length. The single-subject architecture designed a few months ago is already compatible with the current multi-subject-consistency architecture; multi-subject consistency simply requires a longer context than single-subject consistency, which is the key to understanding more combinations of inputs.
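
A back-of-the-envelope sketch of why multi-subject consistency demands a longer context, assuming the patch-token analogy above; the numbers are illustrative, not Vidu's real token accounting.

```python
# Illustrative only: reference tokens grow with the number of subjects and reference images.
patches_per_image = (256 // 16) ** 2          # 256 tokens per 256x256 reference at 16x16 patches

def visual_context_length(num_subjects: int, refs_per_subject: int) -> int:
    """Reference tokens the model must attend over before generating the video."""
    return num_subjects * refs_per_subject * patches_per_image

print(visual_context_length(1, 3))   # single subject, 3 angles  ->  768 tokens
print(visual_context_length(3, 3))   # three subjects            -> 2304 tokens
```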

Next, Shengshu's main direction will continue to iterate along the line of contextual capability. "There is a lot of room for imagination once a video model's contextual capabilities improve," Bao Fan said. He explained further: feed the model a few clips from Wong Kar-wai's films and it can generate a series of video clips with Wong Kar-wai's cinematography; feed it some videos of classic fight choreography and it can generate videos with exquisite fight moves and excellent action scenes.

Vidu's iteration on contextual capability has its own rhythm: at first it could only reference the facial features of a single subject, now it can reference multiple subjects, and later it is expected to reference shooting techniques, camera movement, staging, and more. In this process the referenced object moves from the concrete to the abstract, and the requirements and difficulty gradually increase.

Because there is currently no open-source solution for video models' contextual capabilities, it will not play out as it did with large language models, where once one company finds PMF the others quickly follow. From this perspective, Vidu 1.5 has formed its own technical barrier.

3. Sora is not the only answer

"That the fine-tuning-free unified technical architecture was designed by Shengshu, and that the emergent intelligence of video models was first verified on Vidu, were inevitable events," Bao Fan said. "Because our team's founding vision was to build a universal multi-modal model."

Shengshu Technology has never adopted single, task-specific fine-tuning solutions, because they contradict a unified, efficient architecture. This also means that the universal multi-modal model is where Shengshu's genes lie.

When Sora was first released at the beginning of the year, video generation startups were all "flexing their muscles," and competition was fierce for a time. By year's end, however, the industry as a whole seemed to have "run out of steam," with startups rarely making major breakthroughs. Shengshu Technology, by contrast, has kept "finely crafting" along its own route, steadily improving the model's generality while not neglecting visual details such as camera feel and the degree of motion in the picture.

At the level of the base model, Vidu 1.5 understands camera movement and can generate complex shots that fuse push, pull, and pan with clockwise or counterclockwise rotation, with highly expressive and smooth footage. For example, entering the prompt "Model photo shoot, she is surrounded by flowers, the light is bright and natural, the camera rotates clockwise while pushing in" yields the footage below.

In terms of dynamics, the motion in videos generated by Vidu 1.5 is large yet natural. A new dynamics-control feature has also been launched, allowing precise control over the overall degree of motion in the picture.

Beyond video, Vidu is also planning and laying out other modalities such as 4D models and audio. Among them, a 4D model derived from the video model will in the future enable more precise camera control over videos, such as adjustment along six degrees of freedom. Bao Fan said that at the current early stage, the team will first verify each sub-field of the multi-modal model independently and finally integrate them into a general multi-modal large model.

Shengshu's increasingly prominent technical advantages give it the confidence to compete among domestic video models. But the challenge ahead is the overwhelming resource advantage of big companies such as Kuaishou and ByteDance. To this, Bao Fan replied: "When the goal is clear enough and what we build truly solves industry problems, we will keep moving in this direction, and the final result will prove right."

Compared with Sora, the world's leading player, Shengshu's concerns are different. Shengshu Technology positions itself around a universal multi-modal large model, while Sora advocates building a world simulator that truly simulates the physical world. Although the world simulator is a sub-problem of the multi-modal large model, Shengshu's general multi-modal large model will emphasize solving more practical problems.

Shengshu will not fully benchmark itself against Sora, let alone follow it. Vidu proves that Sora is not the only answer for video models.
