MiniMax goes open source, breaking through the traditional Transformer architecture with 456 billion parameters and support for 4-million-token contexts.
Editor
2025-01-15 15:01

Image source: Generated by Unbounded AI

"In 2025, we may see the first AI Agents join the workforce and have a significant impact on company productivity "Substantial impact." - OpenAI CEO Sam Altman

"In 2025, every company will have AI software engineer agents who will write a lot of code." - Meta CEO Mark Zuckerberg

"In the future, every company's IT All departments will become the HR department of AI Agent.” - NVIDIA CEO Jensen Huang

At the beginning of 2025, when many trends are still unclear, several important figures in the AI industry reached a similar judgment at almost the same time: 2025 will be the year of the AI Agent.

Unexpectedly, MiniMax moved very quickly: it open-sourced its latest foundation language model, MiniMax-Text-01, and its visual multimodal model, MiniMax-VL-01.

The biggest highlight of the new models is that they are the first in the industry to implement a new linear attention mechanism at scale, which greatly lengthens the input context window: they can process 4 million tokens at a time, 20-32 times that of other models.

They believe these models can contribute to an explosion of Agent-related applications in the coming year.

Why is this work so important for Agents?

As Agents enter real application scenarios, whether it is the memory generated while a single Agent works or the context generated by multiple Agents collaborating, ever longer context windows will be required of the model.

Open source address: https://github.com/MiniMax-AI
Hugging Face: https://huggingface.co/MiniMaxAI
Technical report: https://filecdn.minimax.chat/_Arxiv_MiniMax_01_Report.pdf
Web page: https://www.hailuo.ai
API: https://www.minimaxi.com/platform

A series of innovations creates an open-source model comparable to top models

How was MiniMax-Text-01 made? MiniMax developed a series of innovations to get there: from the new linear attention mechanism to an improved mixture-of-experts architecture, and on to optimized parallelism strategies and communication techniques, these innovations address many of the performance and efficiency pain points large models face with extremely long contexts.

MiniMax-Text-01 Architecture

Lightning Attention

Most of today's leading LLMs are based on the Transformer, and the Transformer's core self-attention mechanism is an important source of its computational cost. To optimize it, the research community has racked its brains and proposed many techniques, such as sparse attention, low-rank decomposition, and linear attention. MiniMax's Lightning Attention is a kind of linear attention.

By using linear attention, the computational complexity of the native Transformer can be significantly reduced from quadratic to linear, as shown in the figure below.
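The complexity figure is not reproduced here; the comparison it illustrates can be stated directly (a standard asymptotic statement, not the report's exact figure):

$$
\text{softmax attention: } O(n^2 d) \qquad \text{vs.} \qquad \text{linear attention: } O(n d^2),
$$

where $n$ is the sequence length and $d$ the head dimension, so for long contexts ($n \gg d$) the linear form is far cheaper.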

MiniMax's technical report states that this is mainly thanks to a right product kernel trick. Taking TransNormer from the 2022 paper "The Devil in Linear Transformer" as an example, the NormAttention mechanism on the left side of the figure below can be converted into a linear variant using "right-hand matrix multiplication".
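As a minimal sketch of the right-product trick (ignoring the causal mask and NormAttention's normalization, and not MiniMax's implementation): computing (QK^T)V materializes an n x n matrix, while the algebraically equivalent Q(K^T V) only ever builds a d x d matrix.

```python
import torch

def left_product_attention(q, k, v):
    # Naive order: (Q K^T) V -- builds an n x n score matrix, O(n^2 d) time and memory.
    scores = q @ k.transpose(-2, -1)           # (n, n)
    return scores @ v                          # (n, d)

def right_product_attention(q, k, v):
    # Right-product kernel trick: Q (K^T V) -- only a d x d matrix, O(n d^2).
    kv = k.transpose(-2, -1) @ v               # (d, d)
    return q @ kv                              # (n, d)

n, d = 4096, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
out_left = left_product_attention(q, k, v)
out_right = right_product_attention(q, k, v)
print(torch.allclose(out_left, out_right, atol=1e-4))  # True: same result, very different cost
```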

Lightning Attention is an I/O-aware optimized version based on TransNormer.

The following is the algorithm description of the Lightning Attention forward pass.
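The algorithm figure is not reproduced here; below is a heavily simplified sketch of the block-wise idea behind the forward pass (plain causal linear attention only; NormAttention's normalization, decay factors, and the I/O-aware tiling of the real kernel are omitted).

```python
import torch

def lightning_attention_forward(q, k, v, block_size=256):
    """Block-wise causal linear attention: within a block, a masked left product;
    across blocks, a right product against a running K^T V state. A sketch only."""
    n, d = q.shape
    out = torch.zeros_like(v)
    kv_state = torch.zeros(d, v.shape[-1])              # running sum of K^T V from past blocks
    causal_mask = torch.tril(torch.ones(block_size, block_size))
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        qb, kb, vb = q[start:end], k[start:end], v[start:end]
        m = causal_mask[: end - start, : end - start]
        intra = (qb @ kb.transpose(-2, -1) * m) @ vb    # within-block: left product + causal mask
        inter = qb @ kv_state                           # contribution of all earlier blocks
        out[start:end] = intra + inter
        kv_state = kv_state + kb.transpose(-2, -1) @ vb # fold this block into the state
    return out

q, k, v = (torch.randn(2048, 64) for _ in range(3))
print(lightning_attention_forward(q, k, v).shape)       # torch.Size([2048, 64])
```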

Based on Lightning Attention, MiniMax also proposes a hybrid architecture, Hybrid-lightning, which substitutes softmax attention for Lightning Attention every 8 layers, thereby both addressing the efficiency problem of softmax attention and improving Lightning Attention's scaling ability.
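Conceptually, the hybrid layer schedule is easy to sketch (the layer count below is illustrative, not necessarily the model's actual depth):

```python
num_layers = 80  # illustrative depth
layer_types = ["softmax" if (i + 1) % 8 == 0 else "lightning" for i in range(num_layers)]
print(layer_types[:8])  # ['lightning'] * 7 + ['softmax']
```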

What is the effect? The table below gives formulas for the parameter counts and FLOPs of each attention architecture in terms of the number of layers l, model dimension d, batch size b, and sequence length n.

It can be clearly seen that the larger the model size, the more obvious the advantages of Lightning Attention and Hybrid-lightning over softmax attention.

Mixture of Experts (MoE)

The efficiency advantage of MoE over dense models has been proven by a large number of studies. The MiniMax team also ran a comparative experiment: they compared a dense model with 7 billion parameters against an MoE model with 2 billion activated parameters and 20 billion total parameters. The results are shown below.

It can be seen that on various benchmarks, when the computational load is the same, the performance of the MoE model is significantly better than that of the dense model.

MiniMax also introduces a new allgather communication step to solve the problem of routing collapse that may be encountered when scaling up the MoE model.
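To make the routing mechanics concrete, here is a generic top-k router in a single process; in MiniMax's distributed setting, an allgather-style step would aggregate the per-expert token counts across devices before any balancing or drop decisions, whereas in this sketch that aggregation is a local no-op. The value top_k=2 is illustrative, not the model's actual configuration.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, num_experts=32, top_k=2):
    """Toy top-k MoE router. A collapsed router concentrates nearly all tokens on a
    few experts; monitoring the (globally aggregated) load is how that is detected."""
    logits = hidden @ router_weight                     # (tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    top_p, top_e = probs.topk(top_k, dim=-1)            # chosen experts per token
    local_counts = torch.bincount(top_e.flatten(), minlength=num_experts)
    global_counts = local_counts.clone()                # stand-in for an allgather + sum across devices
    load = global_counts.float() / global_counts.sum()  # fraction of tokens per expert
    return top_e, top_p, load

tokens, d_model, n_experts = 1024, 512, 32
hidden = torch.randn(tokens, d_model)
w_router = torch.randn(d_model, n_experts) * 0.02
experts, weights, load = route_tokens(hidden, w_router, n_experts)
print(load)  # roughly uniform here; a collapsed router would be sharply peaked
```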

Computational Optimization

Like many large-scale model training projects, MiniMax first verified the effectiveness of the above technical improvements and the scaling laws at small scale before embarking on large-scale training. MiniMax used between 1,500 and 2,500 H800 GPUs for this, and the exact number of GPUs changed dynamically during training. Large-scale training has its own unique challenges, and MiniMax developed a series of targeted optimization technologies.

First, for the MoE architecture, the main optimization goal is to reduce its communication load, especially for MoE models that rely on all-to-all (a2a) communication. MiniMax's solution is an overlapping scheme based on token grouping, sketched below.
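As a purely conceptual illustration of grouping-based overlap (this toy sketch uses sleeps in place of real all-to-all transfers and expert kernels, and is not MiniMax's implementation): while the experts compute on one token group, the dispatch of the next group is already in flight.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def dispatch(group):
    """Stand-in for the all-to-all send of one token group to its experts."""
    time.sleep(0.01)   # pretend communication latency
    return group

def expert_compute(group):
    """Stand-in for expert FFN computation on an already-dispatched group."""
    time.sleep(0.01)
    return [x * 2 for x in group]

def overlapped_moe_step(token_groups):
    """Pipeline: while experts compute on group i, group i+1's all-to-all runs in parallel."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        in_flight = comm.submit(dispatch, token_groups[0])
        for nxt in token_groups[1:] + [None]:
            ready = in_flight.result()                   # wait for the current group's transfer
            if nxt is not None:
                in_flight = comm.submit(dispatch, nxt)   # start the next transfer early
            results.append(expert_compute(ready))        # compute overlaps that transfer
    return results

groups = [list(range(i, i + 4)) for i in range(0, 16, 4)]
print(overlapped_moe_step(groups))
```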

Second, a major challenge for long-context training is that real training samples are hard to normalize to a uniform length. The traditional approach is padding, which wastes a lot of computation. MiniMax's solution is to format the data so that different samples are connected end to end along the sequence dimension, a technique they call data-packing. This format minimizes computational waste; a sketch of the idea follows.
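A minimal sketch of data-packing (segment ids are one common way to keep attention and loss masking from crossing sample boundaries; MiniMax's exact bookkeeping may differ):

```python
def pack_samples(samples, seq_len, pad_id=0):
    """Concatenate variable-length samples end to end along the sequence dimension,
    emitting fixed-length sequences plus segment ids marking sample boundaries."""
    packed, segments = [], []
    buf_tokens, buf_segs, seg = [], [], 1
    for sample in samples:
        for tok in sample:
            buf_tokens.append(tok)
            buf_segs.append(seg)
            if len(buf_tokens) == seq_len:
                packed.append(buf_tokens)
                segments.append(buf_segs)
                buf_tokens, buf_segs = [], []
        seg += 1
    if buf_tokens:  # only the final partial sequence ever needs padding
        pad = seq_len - len(buf_tokens)
        packed.append(buf_tokens + [pad_id] * pad)
        segments.append(buf_segs + [0] * pad)
    return packed, segments

seqs, segs = pack_samples([[11, 12, 13], [21, 22], [31, 32, 33, 34]], seq_len=4)
print(seqs)  # [[11, 12, 13, 21], [22, 31, 32, 33], [34, 0, 0, 0]]
print(segs)  # [[1, 1, 1, 2],     [2, 3, 3, 3],     [3, 0, 0, 0]]
```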

Finally, to put Lightning Attention into practice, MiniMax adopts four optimization strategies: batched kernel fusion, separate prefill and decoding execution, multi-level padding, and a strided batched matrix multiplication extension.

MiniMax-Text-01 has a huge context and strong capabilities

Based on the above series of innovations, MiniMax finally obtained an LLM with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. MiniMax named it MiniMax-Text-01. At inference time, its context length can reach up to 4 million tokens, and it shows excellent long-context capability.

MiniMax-Text-01 has excellent benchmark results

On common academic test sets, MiniMax-Text-01 is basically comparable to, or even surpasses, closed-source models such as GPT-4o and Claude 3.5 Sonnet, as well as SOTA open-source models such as Qwen2.5, DeepSeek v3, and Llama 3.1. The results are shown directly below.

As can be seen, MiniMax-Text-01 performs well against Qwen2.5-72B-Instruct on HumanEval. Furthermore, MiniMax-Text-01 scores 54.4 on the challenging GPQA Diamond question-answering dataset, outperforming most open-source instruction-fine-tuned LLMs as well as the latest version of GPT-4o.

MiniMax-Text-01 also achieved top-three results on the MMLU, IFEval, and Arena-Hard tests, demonstrating its ability to apply comprehensive knowledge, to fully satisfy user queries under given constraints, and to align with human preferences. It is conceivable that these capabilities also give developers a better foundation for building Agent applications.

Leading long-context capabilities

What about the long-context capability that MiniMax-Text-01 is most proud of? Here its advantages are even more obvious.

On long-context understanding tasks, MiniMax tested two common benchmarks, Ruler and LongBench v2. First, on Ruler, we can see that when the context length is 64k or shorter, MiniMax-Text-01 is comparable to other SOTA models, and when the context length exceeds 128k, MiniMax-Text-01's advantages become obvious.

Performance comparison between MiniMax-Text-01 and other models on Ruler

Similarly, MiniMax-Text-01 also performs very well on the long-context reasoning task of LongBench v2.

Comparison of the performance of MiniMax-Text-01 and other models on LongBench v2

In addition, MiniMax-Text-01's long-context learning capability (a core research area in lifelong learning) is also at SOTA level. MiniMax validates this on the MTOB benchmark.

Performance comparison of MiniMax-Text-01 and other models on MTOB

Long-text capability showcase

MiniMax-Text-01 achieves very good benchmark scores, but what about real-world performance? Some examples are shown below.

First, let’s write a song!

Human evaluators also gave very positive reviews: The poetic language and interpretive space add layers of interest and emotional resonance to the song, making it both engaging and thought-provoking.

The following focuses on the long-context capabilities of MiniMax-Text-01. For Kalamang, a low-resource language spoken in New Guinea, you first put instructions, a grammar book, a word list, and parallel examples with English into MiniMax-Text-01's context, and then ask it to translate. As can be seen, the answer given by MiniMax-Text-01 is basically consistent with the reference answer.
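For a sense of what such a long-context prompt looks like, here is a hypothetical assembly function; the section headings, variable names, and placeholders are illustrative, not MiniMax's actual MTOB prompt.

```python
def build_mtob_style_prompt(grammar_book: str, word_list: str,
                            parallel_examples: list[tuple[str, str]],
                            sentence_to_translate: str) -> str:
    """Assemble one very long prompt: reference materials first, the task last.
    Illustrative only -- not the actual prompt used in the MTOB evaluation."""
    examples = "\n".join(f"Kalamang: {kal}\nEnglish: {eng}" for kal, eng in parallel_examples)
    return (
        "You are translating between Kalamang and English.\n\n"
        f"## Grammar book\n{grammar_book}\n\n"
        f"## Word list\n{word_list}\n\n"
        f"## Parallel examples\n{examples}\n\n"
        f"Translate into English:\nKalamang: {sentence_to_translate}\nEnglish:"
    )

prompt = build_mtob_style_prompt("<grammar text>", "<word list>",
                                 [("<kalamang sentence>", "<english sentence>")],
                                 "<sentence to translate>")
print(len(prompt))  # with the full book and word list, this easily runs to hundreds of thousands of tokens
```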

As for the long dialogue memory task, MiniMax-Text-01 can be said to perform perfectly.

Visual-Language Model

Based on MiniMax-Text-01, MiniMax has also developed a multimodal version: MiniMax-VL-01. The idea is very simple: integrate an image encoder and an image adapter on top of the text model; in short, turn images into tokens the LLM can understand.

Its overall architecture therefore follows the common ViT-MLP-LLM paradigm: MiniMax-Text-01 as the base model, a ViT with 303 million parameters as the visual encoder, and a randomly initialized two-layer MLP projector for image adaptation.
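A schematic of that ViT-MLP-LLM wiring, with placeholder modules standing in for the real 303M-parameter ViT and MiniMax-Text-01 (a sketch of the paradigm, not MiniMax's code):

```python
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    """Schematic ViT-MLP-LLM wiring; `vit`, `llm`, and the dimensions are placeholders."""
    def __init__(self, vit: nn.Module, llm: nn.Module, vit_dim: int, llm_dim: int):
        super().__init__()
        self.vit = vit                                   # visual encoder (stand-in for the 303M ViT)
        self.projector = nn.Sequential(                  # randomly initialized two-layer MLP
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                                   # language model backbone (stand-in)

    def forward(self, patches: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        patch_feats = self.vit(patches)                  # (batch, num_patches, vit_dim)
        image_tokens = self.projector(patch_feats)       # mapped into the LLM's embedding space
        inputs = torch.cat([image_tokens, text_embeds], dim=1)  # image tokens prepended to text
        return self.llm(inputs)

# Stand-ins just to show the shapes flowing through; not real models.
vit = nn.Linear(768, 1024)        # pretend encoder: patch features (B, P, 768) -> (B, P, 1024)
llm = nn.Identity()               # pretend LLM that simply returns its input embeddings
model = VisionLanguageModel(vit, llm, vit_dim=1024, llm_dim=2048)
out = model(torch.randn(2, 196, 768), torch.randn(2, 32, 2048))
print(out.shape)                  # torch.Size([2, 228, 2048])
```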

Of course, to ensure that MiniMax-VL-01's visual understanding is good enough, it also needs continued training on image-language data on top of the text model. To this end, MiniMax built a proprietary dataset and implemented a multi-stage training strategy.

Finally, the resulting MiniMax-VL-01 model achieved the following performance on various benchmarks.

It can be seen that the overall performance of MiniMax-VL-01 is strong, comparable to other SOTA models, and can reach the best in some indicators.

The following shows an example of analyzing a navigation map, and MiniMax-VL-01 deserves a thumbs up for its performance.

Exploring the infinite context window, letting Agents enter the physical world

Some people think [1] that context will be a hidden thread running through the development of AI products: whether the context is fully synchronized will directly affect the user experience of intelligent applications. This context includes all kinds of background information, such as the user's personalized information and changes in the environment.

In order to ensure that the context is fully synchronized, a large enough context window becomes a technical problem that large models must overcome. Currently, MiniMax has taken an important step along this path.

However, the context window of 4 million tokens is obviously not the end point. They wrote in the technical report: "We are working on more efficient architectures to completely eliminate softmax attention, which may enable the model to support unlimited context windows without incurring computational overhead."

In addition, the vision-language model MiniMax trained on top of the LLM also has an ultra-long context window, which is likewise dictated by the tasks Agents face: in real life, multimodal tasks are far more common than text-only tasks.

"We believe that the next generation of artificial intelligence is an Agent that comes infinitely close to passing the Turing test: it interacts naturally, is within reach, and is everywhere," the founder of MiniMax said at an event last year.

Perhaps, "ubiquity" also means that with the addition of multi-modal tokens, Agents will gradually enter the physical world. To this end, the AI ​​community needs more technical reserves.
