Have you ever run into a scene like this: faced with meeting minutes packed with charts and text, you hope AI can quickly distill them for you. But the AI tool at hand only captures the text and turns a blind eye to the key charts, so you end up sorting everything out by hand anyway, which is no more efficient than doing it yourself from scratch.
A good user experience should look quite different.
This is exactly the capability demonstrated by SenseTime's latest "RiRiXin" fusion large model. Like a human, it can take in what it sees and hears from every direction: it integrates text, images, audio and other real-world information as it thinks, forms a unified cognition and understanding of the world, and efficiently solves practical problems.
According to authoritative evaluations, SenseTime's "RiRiXin" fusion large model has achieved a leap-forward breakthrough in AI's ability to understand and process complex information. In the OpenCompass multimodal evaluation, "RiRiXin" surpassed GPT-4o, Claude 3.5 Sonnet and others to take first place.
In the latest "Chinese Large Model Benchmark Evaluation 2024 Annual Report" released by SuperCLUE, another authoritative large-model evaluation organization, SenseTime's "RiRiXin" fusion large model also achieved an excellent total score of 68.3, tying with DeepSeek V3 for first place among domestic models.
Topping both the multimodal evaluation and the general-capability evaluation with the same model is quite remarkable. In other words, the "RiRiXin" fusion large model shows that a single model can reach the industry's best level across image-text scenarios, pure language tasks, reasoning and more.
This also means it solves a long-standing problem in multimodal AI: the "seesaw effect". What does that mean? Constrained by their training approaches, earlier multimodal models could usually stay at a high level along only one dimension; strengthening visual understanding tended to weaken pure language ability, and vice versa. It was hard to have it both ways.
As a result, other domestic language models and multimodal models still operate independently of each other, making seamless integration across modalities difficult to achieve. SenseTime's substantial breakthrough in native fusion-modal training will therefore play a key role in leading domestic large models from the separation of language and multimodality toward unification.
According to Lin Dahua, co-founder of SenseTime and chief scientist for AI infrastructure and large models, SenseTime solved this problem by overcoming two key technical obstacles to multimodal model research: fusion-modal data synthesis and fusion-task enhanced training. By accumulating high-quality, diverse data, innovating in data reproduction and synthesis, and building a large number of cross-modal bridges, the team fundamentally addressed the underlying data and fusion problems.
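SenseTime has not published the implementation details of these techniques, so the following is only a toy sketch of what "building cross-modal bridges" in training data could look like in principle: pairing the same piece of knowledge across modalities so the model must ground text in images and vice versa. All names and structures here are illustrative assumptions, not SenseTime's pipeline.

```python
# Toy, illustrative sketch of "cross-modal bridge" data synthesis.
# Nothing here reflects SenseTime's actual pipeline; names are hypothetical.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FusionSample:
    """One interleaved training sample: an ordered list of (modality, content) segments."""
    segments: List[Dict[str, str]]


def image_to_text_bridge(image_path: str, caption: str) -> FusionSample:
    """Turn a plain image-caption pair into an instruction-style sample,
    so the model must ground its text answer in the image."""
    return FusionSample(segments=[
        {"modality": "image", "content": image_path},
        {"modality": "text", "content": "Describe this image and explain its key details."},
        {"modality": "text", "content": caption},  # target answer
    ])


def text_to_image_bridge(table_text: str, render_chart: Callable[[str], str]) -> FusionSample:
    """Go the other direction: render structured text (e.g. a table) into a chart image,
    so the same knowledge appears in both modalities."""
    chart_path = render_chart(table_text)  # hypothetical renderer supplied by the caller
    return FusionSample(segments=[
        {"modality": "image", "content": chart_path},
        {"modality": "text", "content": "Summarize the trend shown in this chart."},
        {"modality": "text", "content": table_text},  # target answer
    ])


if __name__ == "__main__":
    sample = image_to_text_bridge("doll.jpg", "A small plush doll with embroidered eyes.")
    print(sample.segments)
```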
Fusing modalities is exactly the direction that the world's top research institutions, including OpenAI and Google, are working hard to tackle. GPT-4o from OpenAI and Google's Gemini series, for example, are both moving toward integrating multiple modal processing capabilities into a single model system, striving to eliminate AI's "perception blind spots".
1. Hands-on test cases: unlocking more application scenarios
The "RiRiXin" fusion large model can now be experienced through the "Discussion" web version, and Silicon Star put it to the test as soon as the news came out.
Recognizing and solving handwritten math problems
In educational settings, students often write out and work through math problems by hand. Traditional AI models can struggle to recognize scrawled handwriting accurately. Drawing on its multimodal understanding, the "RiRiXin" fusion large model not only recognizes the problem correctly but also provides a detailed step-by-step derivation and the correct answer.
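The article does not describe how such a test is invoked programmatically. Purely as an illustration, a request like this could be sent to any OpenAI-compatible multimodal endpoint; the base URL and model name below are placeholders, not SenseTime's actual API.

```python
# Illustration only: querying an OpenAI-compatible multimodal endpoint with a
# photo of a handwritten math problem. base_url and model are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_KEY")

with open("handwritten_problem.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="multimodal-chat",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Read this handwritten math problem, then solve it step by step."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

The image is passed as a base64 data URL alongside the text prompt, which is the common way to mix modalities in a single chat message.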
Understanding “Abstract” Dolls
Can AI understand the "abstract" culture that young people love? Not only can it recognize that the object is a doll, it can also analyze its color and material, and even the "little thoughts" and cultural connotations behind the design.
Understanding macroeconomic charts and reasoning over them
In real scenarios, we often need to interpret complex charts. "RiRiXin" can not only understand them, but also work out the relationships between the charts and the surrounding content through logical reasoning, providing analysis of practical reference value for business decision-making and personal planning alike.
2. Multimodal fusion, changing lanes to overtake
Because fused modalities effectively improve the performance of large AI models, SenseTime's "RiRiXin" fusion large model will be put to use in many scenarios, including intelligent hardware, online education and embodied intelligent robots, enabling cross-modal interaction and a better interaction experience.
In addition, multimodal models trained with native fusion methods have even more potential waiting to be tapped. They have already been deployed in many vertical industries and enterprise-level scenarios, helping companies "cut costs and improve efficiency" and delivering efficiency gains to society.
Imagine an intelligent industrial park where a camera captures a worker operating in violation of regulations. Relying on traditional image recognition alone, the system might only be able to issue a cold alarm. With a fusion large model, it can combine multimodal information such as on-site video footage, the text of the operating manual and historical violation records to judge whether the worker really poses a safety risk, give more precise guidance and suggestions, and even proactively notify the person in charge of safety.
Take another example: in an e-commerce customer-service scenario, a user sends a photo of a damaged product along with a text description of the problem. A traditional customer-service system might need human intervention to determine responsibility and a resolution. A fusion large model, however, can understand the image and the text at the same time, quickly determine the extent and cause of the damage, and automatically generate a return or exchange request, greatly improving customer-service efficiency and user experience.
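As a rough sketch of how such a triage step might be wired up, the snippet below fuses a damage photo and the user's complaint into one request and asks for a machine-readable verdict. As before, the endpoint and model name are placeholders, and the JSON schema is an assumption made up for this example.

```python
# Illustration only (placeholder endpoint/model/schema): image + text triage.
import base64
import json
from openai import OpenAI

client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_KEY")

with open("damaged_item.jpg", "rb") as f:
    photo_b64 = base64.b64encode(f.read()).decode()

user_text = "The mug arrived with the handle snapped off. I want a replacement."
prompt = (
    "Customer complaint: " + user_text + "\n"
    "Look at the attached photo and reply with JSON only, with keys "
    "damage_confirmed (true/false), severity (minor/major) and "
    "recommended_action (refund/exchange/reject)."
)

response = client.chat.completions.create(
    model="multimodal-chat",  # placeholder
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{photo_b64}"}},
        ],
    }],
)

# A production system would validate the reply; here we parse it directly.
verdict = json.loads(response.choices[0].message.content)
if verdict["damage_confirmed"]:
    print("Auto-filing:", verdict["recommended_action"])
```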
Likewise, in the medical field, doctors can upload a patient's imaging data and medical records for comprehensive analysis, assisted diagnosis and more precise treatment plans. In finance, analysts can quickly interpret reports that mix charts and text and make investment decisions more efficiently. Even in industrial production, engineers can diagnose the cause of a fault and get maintenance suggestions by uploading equipment photos and maintenance records.
The arrival of the "RiRiXin" fusion large model is a key step for SenseTime. It frees the AI large model from the limitations of "blind men touching the elephant", turning it into a capable assistant that can understand the world and serve daily life, and it will bring further change to enterprise-level applications. Relying on its accumulated technology in large models and multimodality, along with its engineering strengths, SenseTime has found a development path suited to itself, one that can also lead China's AI industry to a new level of native fusion and ultimately "overtake by changing lanes".
3. Conclusion
All of this points to a common trend: AI needs to become more and more "all-round". By fusing the capabilities of different modalities, AI unlocks far more room for imagination.
That said, the fact that multimodal fusion has come this far means artificial intelligence is quietly changing direction. It is not just that AI is becoming more capable; more importantly, AI is no longer merely good at "solving exam questions" and "topping leaderboards". With fused multimodal capabilities, AI can genuinely begin to tackle complex problems in the real world, which is how it creates real value rather than remaining a concept. You can think of it as AI building itself a more powerful brain with which to understand and simulate our real world. Only then can artificial intelligence usher in truly major change, following the path of LLM -> multimodality -> fusion modality -> world model.