$1 million ARR in 45 days: this startup found PMF for AI 3D generation

Image source: Generated by Unbounded AI

In 45 days, the 3D generation product Rodin reached $1 million in ARR. This is an important milestone, and for comparison, it took HeyGen, one of the most successful startups in the GenAI space, 7 months to reach this number.

Rodin comes from Shadow Eye Technology and has just completed tens of millions of dollars in Series A financing. Investors include ByteDance and Meituan Longzhu.

The four co-founders have an average age of 25, yet they have been running the business for four years. Four years ago they were all still classmates, and the more confident they were in their technology, the bumpier the business turned out to be.

We sat down for a long conversation with CEO Wu Di and CTO Zhang Qixuan and heard many of the questions they had asked themselves. Over four years of exploration, those questions gradually found answers.

"Our technology is so good, why don't customers use it?" The first question is the classic one for technical prodigies.

Shadow Eye has spent four years answering it.

01, The expression of 3D is "fragmented"

Rodin 1.0 took 45 days to reach $1 million in ARR, but that story is already half a year old. Rodin has since gone through several iterations and been upgraded to version 1.5, and model performance has taken a leap.

The most important feature of version 1.5 is the ability to generate right angles. It sounds very "simple": more accurate straight lines, right angles and smooth surfaces, along with sharper edges.

When the outside world expects 3D generation to let anyone create a corner of the real world from a few words of natural language, what is the value of a more accurate "right angle"?

Film and television-level works created using Rodin

"3D generation: what exactly does it generate?" This is the most basic question, and also the most critical.

Some people think the answer is video. For most people, 3D is largely equivalent to video content full of 3D elements. From "Toy Story" in the 1990s to Ang Lee's digital Will Smith, and from the polygonal games of the early years to last year's hit "Black Myth: Wukong", everyone can experience the charm of 3D as a way of presenting images on a flat surface, whether that is a movie screen or a gaming monitor.

As a result, imitating 3D from 2D video has become a very important technical route.

Sora came out in early 2024. The high consistency in its demo videos led people to ask whether it would simply take over the work of 3D generation. But Sora's release was soon delayed, its followers performed only moderately well, and video models remained a long way from being "movie-level" or joining a game pipeline.

There are many reasons. For one, the power of generative AI is still overestimated. As film concept artist and illustrator Reid Southen judged earlier, "These videos are a bit too sloppy and have too many problems, especially with temporal consistency and artifacts like extra limbs."

But an overlooked question is whether footage that shows a 3D image is "3D" at all, or really just "video".

A video work faces its consumers directly, but in game, film and television production "3D" is one link in a complete industry chain. A virtual model of Flower and Fruit Mountain, for example, has to remain usable in the creative stages that follow.

"3D generation: what exactly does it generate?"

"Unlike video, 3D is an industry, and it has downstream links. Once a video is output, users can share it directly and watch it on their phones. But if you want to keep using something in 3D after it is produced, it has to be adapted to the renderer and the game engine, and for embodied intelligence, to the simulation software. That requires us to match the model's output properly to the industry's standards."

"In our understanding, 3D is an asset," Qixuan said. "Text, images and videos are all consumer-grade and meet C-end users directly, but 3D does not."

Users use Rodin to generate 3D assets in batches

Text, images and videos have become consumer-grade content, meaning they reach C-end users directly. At a technical level, this means the industry has reached basic agreement on how each of the three modalities is represented.

"Video has its mainstream encodings. The mainstream image today is probably a two-dimensional matrix that records a color at each position. Text is probably an encoding over characters," Qixuan said. "But 3D does not have that yet. Its representation is still very fragmented."

This fragmentation shows up everywhere. A 3D digital human face may use a specific format that supports complex facial expressions and body animation, which usually requires a high-precision mesh and skeletal rigging. Modeling for a battle royale game cares more about performance and efficiency, so a gun lying on the ground is usually modeled in a low-polygon style. The 3D model of a car at the design stage focuses on precise geometry and functional performance, and needs a detailed representation of internal and external structure, mechanical components and aerodynamic properties. This kind of modeling usually requires professional CAD software and rigorous engineering and design standards to ensure the model is accurate and practical.

Nearly all industries that have a demand for 3D data currently have a set of standards and representation methods that are only applicable to their own scenarios, and their data information cannot be reused with each other.
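
Qixuan's contrast can be made concrete with a small sketch. The names below (`image`, `text_codes`, `cube_mesh`, `cube_voxels`) are illustrative assumptions and not anything taken from Rodin. The point is only that an image and a string each have one obvious encoding, while the same unit cube can be stored in structurally different ways that cannot be swapped without a conversion step.

```python
# Illustrative toy encodings only, not any production format.

# Image: a 2D matrix with an RGB color recorded at each position.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

# Text: a sequence of character codes.
text_codes = [ord(ch) for ch in "3D"]

# 3D, representation A: a mesh (corner vertices plus quad faces that index them).
cube_mesh = {
    "vertices": [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)],
    "faces": [
        (0, 1, 3, 2), (4, 6, 7, 5),  # x = 0 and x = 1 sides
        (0, 4, 5, 1), (2, 3, 7, 6),  # y = 0 and y = 1 sides
        (0, 2, 6, 4), (1, 5, 7, 3),  # z = 0 and z = 1 sides
    ],
}

# 3D, representation B: a voxel grid (one occupancy flag per cell).
N = 2
cube_voxels = [[[True for _ in range(N)] for _ in range(N)] for _ in range(N)]

print("image rows:", len(image), "cols:", len(image[0]))
print("text codes:", text_codes)
print("mesh:", len(cube_mesh["vertices"]), "vertices,", len(cube_mesh["faces"]), "faces")
print("voxels:", sum(cell for plane in cube_voxels for row in plane for cell in row), "filled cells")
```

A game engine expecting the mesh cannot read the voxel grid directly, and neither form carries the rigging or CAD constraints mentioned above, which is the fragmentation in miniature.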

The Shadow Eye Technology team has always hoped to unify the representation of 3D data and turn it into a standardized asset, and has worked on this since Rodin 1.0. The team proposed a remesh strategy that achieves a consistent representation by slightly "thickening" each model. The "thickening" has little effect on the look of the generated 3D or the information it contains, but the whole model ends up looking rounded.
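
One way to see why a uniform "thickening" makes shapes look round is a signed distance function (SDF). The sketch below is not Rodin's remesh method, which the article does not describe. It assumes a 2D square, hypothetical helper names (`box_sdf`, `thickened_sdf`) and an arbitrary thickness of 0.1. Growing the square outward by a distance r moves the flat sides out by r but replaces each sharp corner with a circular arc of radius r.

```python
import math

def box_sdf(x, y, half=1.0):
    """Signed distance from (x, y) to an axis-aligned square of half-width `half`."""
    dx, dy = abs(x) - half, abs(y) - half
    outside = math.hypot(max(dx, 0.0), max(dy, 0.0))
    inside = min(max(dx, dy), 0.0)
    return outside + inside

def thickened_sdf(x, y, r=0.1):
    """Assumed 'thickening': grow the square outward by r in every direction."""
    return box_sdf(x, y) - r

# On a flat side the surface simply moves out by r: (1.1, 0) is on the new surface.
side = thickened_sdf(1.1, 0.0)

# If corners stayed sharp, (1.1, 1.1) would be on the surface. It is outside instead.
sharp_corner = thickened_sdf(1.1, 1.1)

# The actual thickened corner lies on an arc of radius r centered on (1, 1).
d = 0.1 / math.sqrt(2)
arc = thickened_sdf(1 + d, 1 + d)

print(f"side: {side:.4f}  sharp corner: {sharp_corner:.4f}  arc: {arc:.4f}")
```

A value of zero means "on the surface" and a positive value means "outside", so the sharp corner point no longer belongs to the thickened shape.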

But as Rodin 1.0 was actually deployed in industry, it became clear that a unified representation does not mean the generated 3D data can be used smoothly as an asset. In much of real-world product design and the game industry, the big demand for 3D assets is not for cute pets or a letter "A" with a cloud texture, but for inorganic shapes (surfaces constructed mathematically from straight lines, curves, or a combination of the two) and a sense of sharp edges.

The ability to generate inorganic shapes, sharp edges and very clean topology is the most prominent improvement in Rodin 1.5's 3D generation. This emphasis on the consistency and "usability" of generated 3D data is something Wu Di and Qixuan learned one pitfall at a time over the past few years.

02, Must be Production-Ready

A few years ago, a big client made Wu Di, Qixuan and the others, who were just starting out, hit a wall for the first time. That client was "The Wandering Earth 2".

"The Wandering Earth 2" has scenes in which Andy Lau and Wu Jing appear younger, and the post-production team hoped to achieve them with visual effects. At the beginning of 2021, the Shadow Eye team built a black spherical frame 3 meters in diameter in Zhangjiang, Shanghai, with light sources and cameras distributed inside the sphere. The device filled an entire room. This was the first-generation dome light field that Shadow Eye Technology used at the time for high-precision capture of human faces. After it was completed, several teams from the film and television industry came to inquire, including the team behind "The Wandering Earth 2".

Dome Light Field

Wu Di and Qixuan were very confident in the face scanning equipment they had developed, but reality was bleak. As Wu Di recalls, "The first question the Wandering Earth team asked after seeing the results was: how do we use this thing?"

The reason it could not be used was that the original dome light field was essentially a pure lighting system. When a person stands at the center of the sphere, the 360-degree light sources make it possible to capture light from every direction. Different lighting environments can then be synthesized in post-production and the face swapped in. Logically, this is closer to what we now call video generation, which made it difficult to fit into the film industry's CG pipeline.

"To really use a 3D face in the CG pipeline, it first has to be a complete 3D model. It needs excellent topology and materials that respond to all kinds of lighting changes, and it must be controllable and able to make various expressions, so that it can connect well with what comes later."

Shortly after that, Shadow Eye Technology made a major decision: cut all of its R&D investment in 2D-based technology at the time and go all in on 3D. Behind the shift from 2D to 3D was the team's consensus on "Production-Ready".

The term "Production-Ready" comes from the CG industry. The industry has a term, Post-Production, and "Production-Ready" means usable in post-production.

User works, 70% of the models are from Rodin

Through constant contact with customers, the first-generation dome light field, which focused on capturing planar data, slowly evolved into a second generation that captures 3D face data. With further customer contact, the technology finally reached the point where the captured data could be used directly to build digital characters for film, television and games. "Production-Ready" gradually became Shadow Eye Technology's core concept, both internally and externally.

"Production-Ready is not an easily quantified metric. To be more specific, when we design and choose technical routes, we treat the usability of the generated results as a very important consideration. For example, if a technique improves visual quality but does not bring us closer to Production-Ready, we may not pursue it," Qixuan said.

The concept of "Production-Ready" also directly led Shadow Eye Technology to choose a counterintuitive path in 3D generation after the generative AI wave arrived.

In the mainstream view at the time, 3D generation was essentially a matter of lifting 2D into a higher dimension. After Stable Diffusion emerged, 3D reconstruction was achieved with 2D diffusion models combined with methods such as NeRF. Because they can be trained on large amounts of 2D image data, such models tend to produce diverse results.

Multi-view reconstruction work added multi-view 2D images of 3D assets to the training data of the 2D diffusion model, which to some extent eased this type of model's limited understanding of the 3D world. The limitation is that such methods still start from 2D images, and 2D data records only one side, or projection, of the real world. Images from multiple angles cannot fully describe a piece of three-dimensional content. Much information is therefore still missing from what the model learns, and the generated results still need extensive correction, making it hard to meet industry standards.

The 2D-to-3D route is more like proving that an image model can understand 3D after seeing enough images, but that understanding is still very far from 3D data that industry can use. Seen another way, lifting 2D to 3D also implies a compression of 3D information, just as a regular polygon with 200 sides is still far from an ideal circle.
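
The claim that a 2D image records only a projection can be illustrated with a pinhole camera. This is a minimal sketch with an assumed focal length and hypothetical function names (`project`, `project_shifted`), not a description of any model's training setup. Two points on the same ray through the camera land on the same pixel, so their depth is lost in a single image.

```python
def project(point, focal=1.0):
    """Pinhole projection of a 3D point (x, y, z), with z > 0, onto the image plane."""
    x, y, z = point
    return (focal * x / z, focal * y / z)

def project_shifted(point, shift=1.0, focal=1.0):
    """Same projection from a second camera moved `shift` units along the x axis."""
    x, y, z = point
    return project((x - shift, y, z), focal)

near = (1.0, 0.5, 2.0)
far = (2.0, 1.0, 4.0)  # same direction from the first camera, twice as far away

print(project(near), project(far))                  # identical pixels
print(project_shifted(near), project_shifted(far))  # a second view separates them
```

A second view recovers the difference between these two points, but any surface hidden from every camera is still never recorded, which is the missing information the paragraph above refers to.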

Having worked with large numbers of digital humans and 3D face scans, the Shadow Eye team faced this technical route, the one that seemed to have the broadest consensus in 3D generation, and found there was "no way to convince ourselves."

"We know where the upper limit of 3D scanning is. Even in its most perfect state, a scan is hard to put directly into real production. And the best case for lifting 2D Stable Diffusion to 3D is to approach the quality of a 3D scan ever more closely. So how could that method get there in one step?" Wu Di said.

To align with real-world industry, 3D generation can only take the 3D-native path, which means abandoning the idea of lifting from 2D and building 3D models directly.

At ACM SIGGRAPH 2024, two papers from the Shadow Eye Technology team, the controllable 3D-native DiT generation framework CLAY and the 3D garment generation framework DressCode, were both nominated for best paper. The CLAY paper proposes a 3D-native diffusion transformer architecture that trains the generative model entirely on 3D datasets and extracts rich 3D priors from a variety of 3D geometry.

The exploration in these two papers also shifted the technical route of the 3D generation industry. Since then, 3D-native has begun to replace 2D-to-3D and has become the mainstream path explored in 3D generation worldwide.

The Shadow Eye team at SIGGRAPH

03, From laboratory to startup

As early as its first year, Shadow Eye had produced a hit product.

In 2021, it launched a two-dimensional (anime-style) character generation product called "WAND". The day after launch, a well-known Japanese blogger spotted it, and it then quickly became popular in the country, gaining 1.6 million users in a short time.

WAND’s App Store page back then

Traffic and attention followed. "We couldn't handle it," Wu Di said.

The traffic did not give Wu Di and Qixuan the chance to choose what kind of company they wanted to be. On the contrary, it took that choice away.

"Everyone thought we should build ourselves into a 'WAND' company, including the people around us and some who wanted to invest in us," Wu Di said.

But in the end, the "WAND" company never appeared. Soon after, Wu Di and Qixuan took the initiative to discontinue "WAND". The names more familiar to the outside world today are Shadow Eye Technology and Rodin.

"We did not take the path everyone thought we should take, because our technical capabilities and what we wanted to do were still in 3D."

The determination to abandon the image generation route completely was supported by Dr. Lu Qi.

"Now that you have made this decision, you must be ruthless and do only what you think is right," Dr. Lu Qi told the Shadow Eye team after the Qiji Chuangtan 2021 autumn roadshow.

At the Qiji Chuangtan autumn startup camp roadshow at the end of 2021, Dr. Lu Qi acted like a "coach", holding the microphone while high-fiving entrepreneurs who had just finished presenting. Among the 4,226 startups in this round, 53 projects were finally accepted, an acceptance rate of 1.25%. Shadow Eye Technology was one of them.

WAND eventually became the stepping stone that took Wu Di and Qixuan from the laboratory to the commercial world.

Wu Di later asked Dr. Lu Qi why he had backed the team. WAND, which had exploded that same year, was what first brought this young team from ShanghaiTech University to Qiji's attention. But the most fundamental reason lay behind WAND: Qiji saw that it was rare for a pure R&D team to show commercial thinking so early.

That is not easy for a founding team whose average age in 2021 was only 21. But productization and commercialization, two very corporate ways of thinking, had been brewing in the MARS laboratory at ShanghaiTech University ever since the name Shadow Eye Technology first came up.

Wu Di entered ShanghaiTech University in 2015, and Qixuan followed. In 2018, the two joined the university's MARS laboratory, whose main research direction is artificial intelligence combined with computational photography. At that time the laboratory had only three students, who became the three earliest members of Shadow Eye Technology. The fourth co-founder joined the MARS laboratory in 2020, while the first-generation dome light field was being built. The concepts of the metaverse and digital humans were gaining momentum in the outside world, and Wu Di and Qixuan saw the business prospects behind this digital capture equipment. In the laboratory, they decided to establish Shadow Eye Technology.

ShanghaiTech University is a very young school. It was founded in 2013, and Wu Di was in its second class of students. At the time it was not a "double first-class" university. There was only one dormitory building on campus, and all classes had to borrow classrooms from other schools.

But the interesting thing is that at ShanghaiTech everything had to be built from scratch, whether laboratories, the student union, or the first courses. Wu Di liked that feeling very much: "Studying there gave me a taste of entrepreneurship."

Or in Qixuan's words, "(The situation in ShanghaiTech's first two years) determined the character of the students at that time: their boldness, aka entrepreneurial spirit."

The Shadow Eye team demonstrated Rodin 3D generation in the SIGGRAPH Real-Time Live! session

The company was established in June 2020. For more than a year after that, until September, Wu Di and Qixuan were frustrated by the huge gap between generated content and real industrial needs. It was through these many setbacks that "Production-Ready" first took shape as the core yardstick for technology R&D.

In the fall of 2021, Shadow Eye received its first round of financing from Qiji Chuangtan. After Qiji Chuangtan's roadshow day, they quickly landed a second deal.

The second came from Sequoia. Wu Di remembers that it was Christmas 2021 when they finalized the Sequoia financing. They met several waves of investors that afternoon and went on until very late. "That day happened to be our Christmas party, but in the end Wu Di and I only made it to the party in time to settle the bill," Qixuan said.

The entrepreneurial path has not been smooth since. From 2022 on, Shadow Eye Technology went nearly two years without raising funding. One financing process consumed a great deal of Wu Di's energy and still failed to close.

That failure brought about two results:

First, it shaped Shadow Eye's character: when starting an AI business, you must think about commercialization from day one, survive first, and secure cash flow;

Second, it made the team commit firmly to the 3D-native route.

"Before this, our idea for 3D generation was to recruit someone who had already worked in the field to help us, but that probably would not have broken out of the inertia of the technical path at the time," Wu Di said. "It was precisely because that financing failed that the entire core R&D team became determined to make truly usable 3D generation."

A few months later came the original Rodin 1.0.

04, 3D is the puzzle piece

Does Shadow Eye hope Rodin will become a hit toC product like WAND?

The answer is clear.

"3D generation will eventually move to the consumer side, but not now," Qixuan said. "Today a picture or a video can be shared directly on social platforms, but 3D is not yet a format that can be shared."

New hardware may bring a chance, but that will certainly take time. Until then, "when you don't know where this thing will end up, it's better to just get started. There are always plenty of problems worth solving." Wu Di is convinced that the current opportunity for 3D generation lies in the existing market.

Film and television entertainment goes without saying, and demand for 3D generation is also growing in industrial fields. In architectural design, for example, most renderings used to rely on two-dimensional textures, and computing power limited the visualization options. That approach has quite a few limitations: the lighting never looks right, the camera always has to sit at a certain height, and animation is a no-go. 3D-native technology lets the entire virtual space work under any lighting and from any camera, opening up more possibilities for architectural visualization.

Shadow Eye has so far worked with many leading companies in gaming, film and television, manufacturing and other industries. Rodin's SaaS product has also built up a large base of professional users such as graphic designers, AR and VR developers, and 3D printing enthusiasts.

Rodin users' comments

"...Model?" Wu Di said.

What happens next?

When Sora broke new ground a year ago, people wondered whether the industry still needed 3D.

Qixuan remembers it vividly. "When video generation first came out, all of us doing traditional graphics thought we would be disrupted." He explained that for 3DCG, video generation means the rendered result can be obtained directly, with no need for a three-dimensional space. "This had a great impact on traditional CGI technology. People doing 3D generation worried that one day 3D would no longer be needed."

Especially since, although Sora was only "futures" at the time, "OpenAI has a pretty good reputation when it comes to futures."

Shadow Eye's R&D team began studying and testing video models frequently. They soon realized that what video generation does is only "simulation": it "simulates" and then "approximates" the final desired result.

"It is a frame consistency (inter-frame consistency) generator. It is not built on a World Model, and it cannot achieve world consistency," Qixuan said. "These are two different levels. If you rely only on video generation, you can only stay at this one."

"But the interesting thing is that what the CGI industry has been doing with 3D models all along is exactly world consistency."

Take a CG shot from a movie, such as a person in a room. First, every object in the room needs a model, and each model needs a material that expresses its lighting properties. The character needs animation for its movements. A camera is needed in the virtual world, and every frame of the character's action is ray traced. Ray tracing is the renderer's job. Movie-grade CG is usually rendered offline, which often takes cluster-scale rendering to achieve realistic results.
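
The stages named in this paragraph (model, material, animation, camera, renderer) can be sketched as a toy ray tracer. Everything below is a simplifying assumption for illustration: one sphere, one diffuse material, a pinhole camera and ASCII output, with hypothetical names throughout. Real film pipelines are far more complex. What the sketch shows is that every frame is derived from the same explicit 3D scene, so moving the camera cannot change the object itself.

```python
import math

# Model + material: one sphere with a diffuse (Lambertian) albedo.
SPHERE = {"center": (0.0, 0.0, 3.0), "radius": 1.0, "albedo": 0.9}
LIGHT_DIR = (0.0, 0.0, -1.0)  # unit vector from the surface toward the light

def hit_sphere(origin, direction, sphere):
    """Distance along a unit-length ray to the sphere, or None on a miss."""
    oc = tuple(o - c for o, c in zip(origin, sphere["center"]))
    b = sum(d * o for d, o in zip(direction, oc))
    c = sum(o * o for o in oc) - sphere["radius"] ** 2
    disc = b * b - c
    if disc < 0:
        return None
    t = -b - math.sqrt(disc)
    return t if t > 0 else None

def shade(origin, direction, sphere):
    """Renderer step: trace one ray and return a brightness in [0, 1]."""
    t = hit_sphere(origin, direction, sphere)
    if t is None:
        return 0.0
    point = tuple(o + t * d for o, d in zip(origin, direction))
    normal = tuple((p - c) / sphere["radius"] for p, c in zip(point, sphere["center"]))
    lambert = max(0.0, sum(n * l for n, l in zip(normal, LIGHT_DIR)))
    return sphere["albedo"] * lambert

def render(camera_x, width=21, height=11):
    """Camera step: one ray per pixel from a pinhole camera at (camera_x, 0, 0)."""
    rows = []
    for j in range(height):
        row = ""
        for i in range(width):
            u = (i / (width - 1) - 0.5) * 2.0
            v = 0.5 - j / (height - 1)
            norm = math.sqrt(u * u + v * v + 1.0)
            direction = (u / norm, v / norm, 1.0 / norm)
            brightness = shade((camera_x, 0.0, 0.0), direction, SPHERE)
            row += " .:*#"[min(4, int(brightness * 5))]
        rows.append(row)
    return rows

# "Animation": two frames with the camera moved. The sphere itself never changes.
for frame, camera_x in enumerate((0.0, 0.5)):
    print(f"frame {frame}")
    print("\n".join(render(camera_x)))
```

In this framing, a video generator would stand in for `render` only. The scene description that keeps the two frames consistent with each other still has to come from somewhere.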

With this in mind, looking at video generation again, it seems that in the pipeline above "only the offline renderer's job is replaced, not the entire CGI industry."

"Video is not a world model," Wu Di said. "It may be one form of a world model's output when shown to the public."

"Consistency, especially world-model-level consistency, is a question of the amount of information," Qixuan explained. "If a description of how the world's information changes cannot be fed to the AI, it will definitely not achieve this consistency."

Getting to a world model requires at least world consistency, so a new module is needed for control.

The missing piece of the puzzle happens to be 3D.

"We have our own World Model." Many things worth doing are already under way, and it is exciting just to think about them.