Sora, derided as "vaporware" ever since OpenAI unveiled it on February 16, finally arrived on December 10. The official version of Sora can generate videos at up to 1080p resolution and up to 20 seconds long.
OpenAI CEO Altman said that the official version of Sora is a GPT-1 moment in the field of video generation.
However, domestic AI companies have not rushed to follow OpenAI in video generation the way they did in the GPT period. Their attitudes are more mixed.
Some have chosen to follow. After Sora came out, Internet companies such as Alibaba, ByteDance, Kuaishou and Tencent, and AI companies such as Zhipu AI, MiniMax, Aishi Technology and Shengshu Technology, released video generation models one after another, and many claimed to have matched or surpassed the preview version of Sora.
Others have chosen not to follow, including Baidu among the Internet companies. Robin Li once made it clear that "no matter how popular Sora is, Baidu will not do it." AI companies such as Baichuan Intelligence have also said plainly that they will not build Sora-like models. Moonshot AI, SenseTime and 01.AI do have text-to-video models, but these are not their focus.
The video generation track no longer follows the pattern of the GPT era, in which OpenAI played a trump card and domestic technology companies rushed to follow. After Sora, the domestic AI poker game began to find its own rhythm, and the picture became more complex.
Domestic technology companies capable of building large general-purpose foundation models are beginning to diverge clearly in their judgments on technical routes and business prospects. Starting from their choices about whether to follow Sora, let's talk about China's poker game in video generation.
First of all, we need to clarify: what exactly is the Sora model that domestic technology companies are benchmarking against?
To put it simply, the core technical route of the Sora video generation model is the combination of Diffusion+Transformer, which uses text (natural language), pictures, and videos as prompts to generate videos.
To count as benchmarking against Sora, a model must have at least the following characteristics:
1. Generality: it is not limited to a particular style, industry or character, and can generate any video content.
2. High quality: high resolution and precision (up to 1080p), long duration (up to one minute), and strong visual consistency (an understanding of the laws of physics).
Facing Sora, domestic technology companies were not as unprepared as they were when ChatGPT launched. But their decisions on whether to follow were no longer as uniform as they were with ChatGPT. Instead, they fall into three categories:
The first category is clear follow-up.
Among Internet companies, ByteDance and Kuaishou, whose core business is video, and Tencent, a comprehensive technology company, have mature digital infrastructure, abundant technical talent and video products in their DNA, so they chose to follow almost immediately. ByteDance launched Dreamina, and Kuaishou released its Kling large model. Tencent, building on its Hunyuan large model, released and open-sourced the Hunyuan multi-modal generation model, which is regarded as the Tencent version of Sora.
Among the large model start-ups, Zhipu AI has moved fastest. In July this year it released the AI video generation tool Qingying, which lets users generate 10-second, 4K, 60 fps videos from text or pictures. MiniMax's Hailuo AI also added video generation in October, supporting 6-second video clips generated from text prompts.
The second category is resolutely not to follow.
In contrast to the first category, some Internet companies and large model start-ups are determined not to follow Sora. For example, after Sora came out, Wang Xiaochuan of Baichuan Intelligence said that someone on his team had proposed building a Sora-like model, but he made it clear that the company would not pursue this direction.
Baidu's Robin Li takes the same view. Although Baidu has achieved certain results in video generation, he is firmly against building a Sora, because commercializing it may take five or even ten years. Baidu is currently focusing on large language models and multi-modal large models, and has made no attempt to productize anything Sora-like.
The third category is just dabbling.
In addition, a large number of domestic companies have made moves toward Sora out of FOMO ("fear of missing out"), but they have not invested heavily and are only testing the waters.
For example, a team within Alibaba released tomoVideo to test video generation for e-commerce marketing. Among the "Six Little Tigers" of large models, Moonshot AI also launched a video generation model, but its focus remains on its Kimi product. 01.AI has moved into enterprise (B-side) business, and the film and television production industry that video generation models target is in a period of adjustment, so Sora-like products are unlikely to become a core growth point for it.
To sum up, if the global large model race is a game of Dou Dizhu ("Fight the Landlord"), the rules are no longer that OpenAI plays the joker bomb and domestic technology companies follow suit one after another. Each now plays by its own rules, deciding its Sora strategy according to the cards in its hand, the importance of the business and its priorities.
Why have the rules of the game in the large model industry changed since Sora?
The performance of domestic technology companies shows that there is no consensus on Sora, and the overall stage is still relatively chaotic and the rules are vague. In a foggy realm, the rules of the game can only be explored by oneself.
Today, the field of video generation is shrouded in three layers of fog.
Technical fog: OpenAI believes that Sora is a world simulator and a promising path to AGI. This technical route is currently controversial.
For example, Fei-Fei Li, Yann LeCun and others believe that Sora cannot achieve AGI. Fei-Fei Li argues that Sora still works with two-dimensional images, and that only three-dimensional spatial intelligence can lead to AGI. In the preview version's demo video of a woman walking through neon-lit Tokyo streets, the camera cannot be placed behind the woman, which indicates that Sora does not really understand the three-dimensional world. Yann LeCun has also criticized Sora by name, saying that it is not a real world model at all and will still run into the same major bottleneck as GPT-4.
Indeed, even in the official version of Sora, problems such as inaccurate hand details and inconsistency during motion still exist.
One reason some domestic companies are determined not to follow Sora is that they have reservations about this technical route. For example, Wang Xiaochuan of Baichuan Intelligence believes that Sora is only an interim product, and that its technical sophistication, degree of breakthrough and application value fall short of GPT. In short, the technical route to achieving AGI and simulating the physical world is still open, so Sora is not the only answer.
Business fog: The commercial prospects and return on investment of video generation models are unclear in the short term, which is another obstacle that deters domestic companies from following.
Both the preview version and the official version of Sora continue OpenAI's "brute-force aesthetics." OpenAI research scientist Noam Brown said that Sora is the most intuitive demonstration of the power of scale: by piling up computing power, data and parameters, the aim is for the ability to understand the physical world to emerge in large models. This approach is costly and requires a large investment of resources. Whether to follow Sora depends on each company's commercial expectations for the model and the return it expects on the investment.
If the video generation model is monetized on the ToB side, through API or SaaS services, foundation model vendors need to invest a lot of manpower to optimize business processes and develop interactive interfaces. Meanwhile, the film and television industry is in an adjustment cycle, and growth in AI film and television production is limited. This quietly raises the opportunity cost for AI companies, because the same manpower, resources and computing power would clearly achieve more if invested in financial AI, educational AI, government and large enterprise customers, and other fields. As a result, companies such as Baidu and 01.AI treat video generation as a marginal business and do not invest heavily in it.
In the ToC scenario, one option is to charge users. But individuals' willingness to pay is low, video generation is not something the public uses daily, and generation costs and subscription fees are generally higher than for text models. On top of that, Sora-like models have not solved their hallucination and consistency problems and may not create real value, so the scale of consumer payment is very limited. The other option is to make the model completely free and use the video generation product as a traffic entry point for the company. That business model only suits companies whose core business is video.
For example, Kuaishou and ByteDance have their own core video businesses and can quickly scale up their models. Whether for consumer (C-end) products or business (B-end) productivity tools, such companies can quickly integrate video generation into existing products, and the marginal cost of model development falls as commercialization scales.
Overall, for the vast majority of domestic foundation model vendors, video generation is a relatively marginal business with a low return on investment.
Competition fog: The third layer of fog is uncertainty about the competitive landscape.
Although the business prospects of video generation models are currently unclear, could the field explode in the future, with companies that quietly invested in it surprising everyone? This kind of business legend, in which a bet on a marginal track turns out to be a huge bargain, is unlikely to play out with large models.
At present, the prospects for productizing and commercializing large models are vague across the board. General-purpose model vendors need to quickly pick, from a large number of uncertain options, a product with a higher probability of success and a larger market, and concentrate their investment on it. Among all these products, a video generation model is a particularly heavy and challenging project. Under these conditions, companies have to prioritize products with a higher success rate and lower the business priority of video generation.
From another perspective, even if a company gives video generation models top priority, it may struggle to establish a competitive advantage, because market competition in large models today differs from that of the GPT period. Each company has now accumulated experience in training infrastructure, core architecture design and technical reserves. The technical barriers to reproducing Sora and launching Sora-like applications are in fact not as high as they were in the ChatGPT period. This also means that even if a company releases a video generation model first, it may not be able to hold its competitive advantage or market dominance for long. This competitive situation further weakens Sora's commercial appeal.
Technical fog, business fog and competition fog still shroud the field of video generation, leaving the Sora game with too many uncertainties and too many possibilities. It is too early to say which view is correct or which route will ultimately win. Each company can only play by its own rules.
Large model technology must continue to develop, but starting from Sora, domestic technology companies no longer follow OpenAI and start to have their own rhythm.
Specifically, when faced with a blockbuster like Sora, domestic companies now have their own understanding of how to productize and commercialize large models, and have begun to define their own way of playing. Following Sora demonstrates strength; not following Sora demonstrates composure and strategic resolve.
In addition, even without blindly following OpenAI's products, its storytelling ability is still worth learning from.
Whether it was using Sora to steal Google's limelight in February or officially launching Sora recently, OpenAI can always set the pace, set the agenda and attract attention time and time again. This is a very important ability for capital-intensive AI companies.
You don’t need to follow up on Sora, but you can’t miss out on key technologies.
Take Baidu as an example. Although it has no plans to launch a Sora-like product, it does not lack key technologies. For example, it has self-developed multi-modal controllable image generation technology that can achieve highly generalized image generation while keeping physical characteristics unchanged, and improved controllability is precisely the core of the next stage of video generation. In addition, Baidu has not ignored video generation entirely. It has invested in the video generation start-up Shengshu Technology, the AI short-video drama company Jingying Technology, and others.
Domestic companies are focusing on their main track and weighing their own core business, commercial priorities and other factors to decide how much priority to give to chasing Sora. In the large model poker game, they are finding their own sense of rhythm.