MiniMax
2026.1.22

MiniMax M2.1: Post-Training Experience and Insights for Agent Models

Model Overview

M2.1 is our latest flagship open-source model, released last month and built through further post-training optimization on top of the previous M2 generation. It adopts a Mixture-of-Experts (MoE) architecture, with approximately 230B total parameters and ~10B activated parameters.
In Agent scenarios, M2.1 demonstrates excellent usability. Even with a relatively small number of activated parameters, it maintains high inference efficiency and strong performance, offering stable and production-ready engineering capabilities for a wide range of real-world applications.

Agentic Data Synthesis

We begin with data synthesis, which can be roughly divided into three categories: real-data-driven synthesis (SWE Scaling), expert-driven data synthesis (AppDev), and synthetic long-horizon task generation (WebExplorer).
The first two primarily target coding scenarios, while the last focuses on more general-purpose search tasks.

Real-Data-Driven Synthesis: SWE Scaling

At the core of SWE Scaling is data scaling for software engineering scenarios. We leverage GitHub, an enormous and well-structured data source, to synthesize a wide variety of verifiable tasks. With such tasks, we can effectively perform rejection sampling, construct SFT datasets, or conduct RL, all on a solid data foundation.

Data Pipeline Overview
The raw data source consists of GitHub Pull Requests (PRs) and Commits. We first apply quality filtering—simple rules include selecting PRs that were eventually merged, along with additional criteria such as the presence of relevant test cases.
The next and most critical step is constructing a runnable Docker environment for each PR.
Environment construction is non-trivial. The common approach today is to let an Agent iteratively build the environment in a code sandbox, equipped with tools that allow it to repeatedly attempt builds and self-correct based on build results.
Ideally, this process would be fully automated, but in practice it is not yet perfect. For certain languages or library versions, environments may fail to build reliably. In such cases, expert knowledge is required to optimize the Agent's execution flow; this can be seen as injecting skills into the Agent.
Once completed, we obtain a runnable virtual Docker environment for each PR.
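To make the loop concrete, here is a minimal sketch of the self-correcting build process; `ask_agent_for_fix` is a hypothetical LLM-agent call, and the retry budget is illustrative rather than what the pipeline actually uses.

```python
# Minimal sketch of the iterative, self-correcting environment build.
import pathlib
import subprocess

MAX_ATTEMPTS = 5

def build_environment(repo_dir: str, dockerfile_path: str) -> bool:
    """Try to build a runnable Docker image; on failure, let an agent rewrite the Dockerfile."""
    for attempt in range(MAX_ATTEMPTS):
        result = subprocess.run(
            ["docker", "build", "-t", f"pr-env:attempt{attempt}", "-f", dockerfile_path, repo_dir],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return True  # environment builds and can be reused for this PR
        # Feed the build log back to the agent so it can self-correct.
        new_dockerfile = ask_agent_for_fix(  # hypothetical LLM-agent call
            pathlib.Path(dockerfile_path).read_text(), build_log=result.stderr
        )
        pathlib.Path(dockerfile_path).write_text(new_dockerfile)
    return False  # hand off to an expert for skill injection
```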

PR Tagging and Task Diversification
Next, we perform tagging and routing on PRs. PRs contain diverse data types—bug fixes, feature additions, performance optimizations, test construction or refactoring, and dozens of other categories.
Routing is necessary because different PR types require different downstream treatment.

Bug Fix Example
For standard bug-fix scenarios, we extract F2P (Fail-to-Pass) and P2P (Pass-to-Pass) test cases. If a golden patch passes these tests, the data is considered valid. We then let the model act as an Agent to fix the bug in a sandbox and verify correctness using both F2P and P2P tests.
P2P tests are particularly important to ensure that no new bugs are introduced during the fix.
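Concretely, validation and reward for a bug-fix instance look roughly like the sketch below; `run_tests` and `env.apply` are hypothetical wrappers around the per-PR Docker environment, and each call is assumed to start from a fresh environment state.

```python
# Sketch of F2P/P2P-based validation and reward.
def validate_instance(env, golden_patch, f2p_tests, p2p_tests) -> bool:
    """Valid if F2P tests fail and P2P tests pass before the golden patch, and all pass after."""
    if run_tests(env, f2p_tests) or not run_tests(env, p2p_tests):
        return False
    env.apply(golden_patch)
    return run_tests(env, f2p_tests) and run_tests(env, p2p_tests)

def bugfix_reward(env, model_patch, f2p_tests, p2p_tests) -> float:
    """The model's fix must make the F2P tests pass without breaking any P2P test."""
    env.apply(model_patch)
    return 1.0 if run_tests(env, f2p_tests) and run_tests(env, p2p_tests) else 0.0
```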

Feature Addition and Performance Optimization
For feature additions, traditional F2P/P2P logic may not apply, since tests often depend on newly introduced code (e.g., function signatures). Instead, we focus on extracting newly added test points and ensuring the golden patch passes them.
For performance optimization, there is no bug-fixing process. In such cases, we extract P2P tests that can verify stable and significant performance differences before and after the optimization.
Different PR types naturally require different handling strategies.
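For the performance-optimization case, the check might look like the following sketch; the callables run the selected P2P workload in the pre- and post-patch environments, and the threshold and repeat count are illustrative.

```python
# Sketch of a performance-optimization reward: the workload must show a stable,
# significant speedup after the patch, not a one-off timing fluctuation.
import statistics
import timeit

def perf_reward(run_before, run_after, repeats: int = 5, min_speedup: float = 1.3) -> float:
    before = [timeit.timeit(run_before, number=1) for _ in range(repeats)]
    after = [timeit.timeit(run_after, number=1) for _ in range(repeats)]
    speedup = statistics.median(before) / statistics.median(after)  # medians resist noise
    return 1.0 if speedup >= min_speedup else 0.0
```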

Model-Based Validation
Even in basic bug-fix scenarios, raw GitHub PRs are not always well-structured, and test cases may not fully cover the described issue. This can lead to situations where a bug is impossible to fix purely based on the problem description.
To address this, we use the model itself to validate consistency between test cases and problem descriptions. If key information is missing, the model augments the original description to make it a self-contained and solvable problem.
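A minimal sketch of this validation step is shown below; `llm()` is a hypothetical chat-completion helper, and the prompts are illustrative rather than the ones used in production.

```python
# Sketch of model-based consistency checking between an issue description and its tests.
def make_self_contained(issue_text: str, f2p_test_source: str) -> str:
    verdict = llm(
        "Given this issue description and the tests a fix must pass, can the bug be "
        "located and fixed from the description alone? Answer OK or MISSING:<what is missing>.\n\n"
        f"Issue:\n{issue_text}\n\nTests:\n{f2p_test_source}"
    )
    if verdict.startswith("MISSING"):
        # Augment the description with only the missing facts so the task becomes solvable.
        issue_text = llm(
            "Rewrite the issue so it is self-contained, adding only these missing facts: "
            f"{verdict}\n\nOriginal issue:\n{issue_text}"
        )
    return issue_text
```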

Task Transformations and Augmentation
For the same PR or issue, there are many ways to reuse the data:
  • Inject additional bugs
  • Merge adjacent commits or PRs to increase difficulty (similar to SWE-smith)
  • Convert BugFix tasks into SWE-Test tasks
In the SWE-Test formulation, the task is inverted: the model must write a test case that fails before applying the patch and passes after, requiring strong test-writing capability while remaining verifiable and task-equivalent.
We can also construct code review tasks, which do not necessarily require a runnable environment. The model performs static analysis, reviews code changes, and identifies issues. Consistency can be verified using another LLM, making such tasks approximately verifiable while offering greater diversity.

Final SWE Dataset
Ultimately, we obtain SWE-style datasets that include:
  • Original problem descriptions
  • Fully verifiable rewards based on test cases
  • Runnable Docker environments
These datasets are used for both SFT and RL. For SFT, we apply multi-scaffold rejection sampling to improve generalization. For RL, multi-scaffold training is essential because different scaffolds introduce different context management and execution logic. Training on a single scaffold (e.g., a simple ReAct loop) severely limits generalization.
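As a rough sketch of the multi-scaffold rejection-sampling loop (the `scaffold.run` / `task.verify` interfaces and the sample count are assumptions, not the actual APIs):

```python
# Roll out each verifiable task under several scaffolds and keep only passing trajectories.
def collect_sft_data(tasks, scaffolds, samples_per_scaffold: int = 4):
    dataset = []
    for task in tasks:
        for scaffold in scaffolds:            # e.g. a plain ReAct loop, an IDE-style harness, ...
            for _ in range(samples_per_scaffold):
                traj = scaffold.run(task)                     # tool calls + messages + final patch
                if task.verify(traj.final_patch):             # reuse the F2P/P2P test reward
                    dataset.append(traj.to_training_messages())
    return dataset
```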
To summarize the core idea behind the SWE data: we build agent-driven, automated data pipelines on top of raw GitHub data to produce diverse, verifiable SWE-style datasets and environments.

Scaling Results
As of M2.1, SWE scaling covers:
  • 10+ major programming languages
  • A wide variety of coding tasks and scenarios
  • 10,000+ runnable PRs
  • 140,000+ verifiable tasks
On benchmarks such as Multi-SWE-bench and SWE-bench, M2.1 significantly outperforms M2, especially in multilingual settings. Performance remains stable across different scaffolds, demonstrating strong adaptability.

Expert-Driven Data Synthesis: AppDev

We divide coding into two categories:
  • SWE: tasks within existing repositories with fixed verification
  • AppDev: full-stack application development from scratch
AppDev differs fundamentally from SWE because test cases cannot be fully predefined. As a result, its reward structure is entirely different.

Expert-in-the-Loop
AppDev heavily relies on experts-in-the-loop. Internal specialists in frontend, backend, Android, and iOS development help design prompts, meta-queries, and rubric-based rewards, which cannot be fully automated.
Experts also inject best practices through system prompts. These system prompts can then be omitted from the training data, thereby distilling the expert heuristics into the model's default behavior.
Verification uses Agent-as-a-Verifier: the Agent deploys the app in a sandbox, interacts with it via tools such as Playwright, and scores performance against rubrics. Unlike LLM-as-a-judge, this requires multi-step tool-based interaction.
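The sketch below illustrates the idea; the URL, selectors, and `score_rubric()` judge are illustrative placeholders rather than the actual verification harness.

```python
# Minimal Agent-as-a-Verifier sketch: deploy the app, drive it with Playwright, and
# score execution, interaction, and visual quality against a rubric.
from playwright.sync_api import sync_playwright

def verify_app(url: str, rubric: list[str]) -> dict:
    scores = {"execution": 0.0, "interaction": 0.0, "visual": 0.0}
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        try:
            page.goto(url, timeout=10_000)        # execution level: the app must serve at all
            scores["execution"] = 1.0
            page.click("text=Add item")           # interaction level: exercise the business logic
            scores["interaction"] = 1.0 if "item" in page.content().lower() else 0.0
            page.screenshot(path="app.png")       # visual level: capture the UI for rubric scoring
            scores["visual"] = score_rubric("app.png", rubric)   # hypothetical rubric judge
        finally:
            browser.close()
    return scores
```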
M2.1 currently ranks #1 among open-source models on the Hot Arena leaderboard. We also built an internal benchmark called VIBE Arena, which will be introduced in more detail later.

Synthetic Long-Horizon Task Generation: WebExplorer


Beyond coding agents, we also focus on general-purpose agent scenarios, with search as a foundational capability.
Our work "Explore and Evolve for Training Long-Horizon Web Agents" (available on arXiv) proposes a two-step approach:
  • Exploration: Agents freely explore the web to construct information-rich seed questions.
  • Evolution: Queries are iteratively evolved to increase complexity.

Example
Starting from a seed like "Brazil national football team", the Agent discovers the 1950 World Cup, the Maracanazo, the referee George Reader, his later role as chairman of an English club, and the 1976 FA Cup final, which was decided by Bobby Stokes's goal.
In the final stage, the model synthesizes clues gathered along the search trajectory to generate an information-rich initial query. Although this query contains a large amount of information, it still has clear entry points for retrieval. The evolution strategies include removal, obfuscation, and substitution. For example:
  • Converting specific match details into a vague description such as "a World Cup with a unique format and no knockout stage"
  • Replacing salient information like "defeating Manchester United in the FA Cup" with "leading a second-division club to victory over a top-tier powerhouse", thereby increasing retrieval difficulty
  • Removing easily searchable facts, such as a player's age at the time of death, which can be readily found on Wikipedia.
The evolved query is ultimately far more complex than the original question and lacks obvious retrieval entry points, requiring the model to explore step by step based on the available clues.
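A minimal sketch of the evolve step is given below; `llm()` is a hypothetical helper, and the prompts, operators, and acceptance check are illustrative.

```python
# Sketch of query evolution: repeatedly apply removal / obfuscation / substitution,
# keeping an edit only if the question still uniquely identifies the original answer.
import random

EVOLUTION_PROMPTS = {
    "removal": "Remove one easily searchable fact from this question while keeping it answerable:",
    "obfuscation": "Replace one specific detail with a vaguer but still correct description:",
    "substitution": "Rephrase one salient clue in an indirect way that makes retrieval harder:",
}

def evolve_query(seed_query: str, answer: str, rounds: int = 3) -> str:
    query = seed_query
    for _ in range(rounds):
        op = random.choice(list(EVOLUTION_PROMPTS))
        candidate = llm(f"{EVOLUTION_PROMPTS[op]}\n\n{query}")
        check = llm(f"Does this question still have the unique answer '{answer}'? yes/no\n\n{candidate}")
        if check.strip().lower().startswith("yes"):
            query = candidate
    return query
```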
The evaluation criterion for this type of long-horizon task synthesis is the average number of reasoning turns required to solve a problem. The original questions require approximately 7.9 turns on average, while the evolved versions increase this to 9.9 turns. This strategy has been deployed in M2.1.
On the BrowseComp leaderboard, M2.1 achieves performance close to GPT-4.5's SOTA metrics, particularly when Context Management is enabled. Context Management is currently the dominant approach for Search Agents: during evaluation, the context is continuously cleared to keep it concise and uncluttered, enabling the model to sustain effective Test-time Scaling.
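As a rough sketch of what such context management can look like (the summarizer, token budget, and message format are assumptions, not the actual evaluation setup):

```python
# When the history exceeds a budget, compact older turns into short notes and keep only
# the most recent turns, so the agent can keep scaling its test-time exploration.
def manage_context(messages, count_tokens, budget: int = 32_000, keep_last: int = 6):
    if sum(count_tokens(m) for m in messages) <= budget:
        return messages
    head, tail = messages[:-keep_last], messages[-keep_last:]
    notes = summarize(head)   # hypothetical LLM call that condenses earlier search results
    return [{"role": "user", "content": f"Notes from earlier exploration:\n{notes}"}] + tail
```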

Agentic RL Framework and Algorithms

Forge

We use an internally developed framework, Forge, which was designed for Agent-centric scenarios from the very beginning of M2 development. One of Forge's key features is its support for running reinforcement learning over arbitrary Agent scaffolds.
Integrating with Forge only requires implementing four interfaces (sketched below):
  • agent_preprocess: preprocessing (initialization)
  • agent_run: execution
  • agent_postprocess: postprocessing
  • calculate_reward: reward computation
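An illustrative shape of such an integration is sketched below; the class name, helper functions (`prepare_workspace`, `run_scaffold`, `collect_artifacts`, `run_test_reward`), and argument signatures are assumptions rather than the actual Forge API.

```python
class MyScaffoldAdapter:
    def agent_preprocess(self, task):
        # Initialize the sandbox, tools, and scaffold configuration for this task.
        return {"workdir": prepare_workspace(task), "config": task.scaffold_config}

    def agent_run(self, state, base_url):
        # Run the scaffold end-to-end; base_url points at Forge's inference engine,
        # so every LLM call the scaffold makes is served (and logged) by Forge.
        return run_scaffold(state["workdir"], state["config"], llm_base_url=base_url)

    def agent_postprocess(self, state, run_output):
        # Tear down the sandbox and collect artifacts (final patch, logs, sub-agent traces).
        return collect_artifacts(state["workdir"], run_output)

    def calculate_reward(self, artifacts):
        # Reuse the verifiable reward, e.g. F2P/P2P tests for SWE-style tasks.
        return run_test_reward(artifacts)
```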
For example, even black-box Agents that are available only as binary executables can be integrated. During Agent execution, the Agent's base URL is redirected to Forge's internal inference engine service. This engine currently supports both the internal inference framework and SGLang. After redirection, all logs generated during the Agent's execution loop are persisted on the inference server.
A Data Coordinator then post-processes these logs to extract Sub-agent trajectories. To further improve training efficiency, the framework performs intelligent prefix merging over trajectories.
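The prefix-merging step can be pictured roughly as follows; the trajectory fields (`messages`, `prompt_len`) are assumptions used only to illustrate the idea of encoding a shared prefix once.

```python
# Group trajectories that share the same prompt prefix (e.g. multiple rollouts of one task,
# or sub-agent trajectories spawned from the same context) so the prefix is encoded once.
def merge_by_shared_prefix(trajectories):
    groups = {}
    for traj in trajectories:
        key = tuple(m["content"] for m in traj.messages[: traj.prompt_len])   # shared prompt
        groups.setdefault(key, []).append(traj.messages[traj.prompt_len :])   # only tails differ
    return groups
```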

CISPO

From an algorithmic perspective, up through the M2.1 stage, the core reinforcement learning (RL) algorithm has largely continued to rely on CISPO, which was originally proposed in the MiniMax M1 paper. In M1, two key aspects were emphasized: first, the importance sampling truncation design inherent to CISPO itself; and second, a fix targeting FP32 precision issues identified at the time.

The first part concerns the objective function of CISPO. Conceptually, it can be understood as being very similar to the REINFORCE objective. In essence, it is a REINFORCE-style objective augmented with importance sampling corrections for the off-policy setting. The key modification introduced by CISPO is that it applies clipping to the importance sampling weights—that is, the scalar weighting factors. In practice, this clipping typically imposes an upper bound, ensuring that gradients do not become excessively large. Fundamentally, this mechanism serves to control the magnitude of gradient updates.
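Written out, the objective is roughly the following (a simplified rendering based on the description above and the M1 paper; notation may differ slightly from the original):

$$
\mathcal{J}_{\text{CISPO}}(\theta)=\mathbb{E}\left[\frac{1}{\sum_i |o_i|}\sum_{i}\sum_{t}\operatorname{sg}\!\big(\hat r_{i,t}(\theta)\big)\,\hat A_{i,t}\,\log \pi_\theta\big(o_{i,t}\mid q, o_{i,<t}\big)\right],
\qquad
\hat r_{i,t}(\theta)=\operatorname{clip}\!\big(r_{i,t}(\theta),\,1-\varepsilon^{\text{IS}}_{\text{low}},\,1+\varepsilon^{\text{IS}}_{\text{high}}\big),
$$

where $r_{i,t}(\theta)$ is the token-level importance ratio between the current policy and the behavior policy, $\hat A_{i,t}$ is the advantage, $\operatorname{sg}(\cdot)$ is the stop-gradient operator, and in practice only the upper bound is active. Unlike PPO, every token keeps a gradient; only its weighting coefficient is bounded.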

CISPO was originally designed during a period when many teams were reproducing the R1-Zero pipeline (including ByteDance's DAPO). During our reproduction efforts, we made a critical observation: throughout RL training, certain tokens were consistently filtered out by PPO's clipping mechanism. Once clipped, these tokens effectively lost their gradients permanently. Empirical analysis showed that such tokens were often discourse or transition words, such as "wait". This implies that PPO-style clipping can prevent a large number of tokens from ever emerging through training.

In this context, DAPO's approach was to increase the upper bound of PPO clipping. By contrast, the CISPO approach aims to allow all tokens to receive gradients, while controlling the importance-sampling weighting coefficients instead. Although this introduces some bias, it significantly reduces the overall variance of the optimization process.

When we completed early-stage experiments on smaller models and migrated RL training to the larger MiniMax M1 model, we observed that the overall reward barely increased. After logging and comparing training-time and inference-time probabilities, we found clear piecewise-horizontal bands, and the correlation between the two was much lower than that observed in dense models. Further layer-by-layer debugging by the inference team revealed that the numerical precision of the prediction layer, namely the LM head (lm_head), was critical. After restoring this layer to FP32 precision, training–inference consistency improved substantially, enabling stable and sustained training gains.
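In a typical Hugging Face / PyTorch setup, the corresponding fix is essentially a one-liner; this is a sketch using common attribute names (the placeholder model id and exact attribute may differ per architecture), and the same precision choice has to be mirrored on the inference-engine side for the consistency to hold.

```python
# Keep the prediction layer (LM head) in FP32 while the rest of the model runs in bf16,
# so training-time and inference-time token probabilities stay closely aligned.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("my-org/my-moe-model", torch_dtype=torch.bfloat16)
model.lm_head = model.lm_head.to(torch.float32)   # restore FP32 precision for the output projection
```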

Moving into the M2 generation, the primary change was the transition to an agentic RL setting involving multi-turn tool usage. These tool calls inherently introduce noise from the external environment, making execution trajectories more extreme, more off-policy, or prone to anomalous statistics.

Under these conditions, we incorporated several major techniques proposed by the broader community, including multiple importance sampling (MIS) and PPO-based trajectory filtering. The core idea is to filter out trajectories with anomalous statistics that lie in the long-tail distribution, thereby preventing excessive gradient fluctuations and ensuring the overall stability of RL training.
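A sketch of the filtering idea (thresholds, field names, and the specific statistics are illustrative; the real criteria are more involved):

```python
# Drop trajectories whose token-level importance-ratio statistics fall into the long tail,
# so a few extreme off-policy rollouts cannot dominate the gradient update.
def filter_trajectories(trajectories, max_mean_abs_logratio=0.5, max_token_abs_logratio=2.0):
    kept = []
    for traj in trajectories:
        # log r_t = log pi_theta(token) - log pi_behavior(token), per generated token
        log_ratios = [new - old for new, old in zip(traj.new_logprobs, traj.old_logprobs)]
        mean_abs = sum(abs(x) for x in log_ratios) / max(len(log_ratios), 1)
        worst = max((abs(x) for x in log_ratios), default=0.0)
        if mean_abs <= max_mean_abs_logratio and worst <= max_token_abs_logratio:
            kept.append(traj)
    return kept
```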
The figure referenced here is adapted from Meta's paper "The Art of Scaling Reinforcement Learning Compute for LLMs." In that work, the authors conducted a fairly systematic set of experimental comparisons, and their conclusions are largely consistent with ours.
The experiments show that the CISPO algorithm performs exceptionally well throughout the scaling process, both in terms of convergence speed and the final convergence ceiling. The left portion of the figure illustrates the effect of CISPO's importance sampling trick, while the right portion highlights the gains brought by fixing FP32 precision issues.
Overall, the experimental evaluation in this paper is very thorough. For those interested in reinforcement learning (RL), I strongly recommend taking a look at it.

Agent Evaluation

Next, I'd like to introduce three benchmarks that we released alongside M2.1:
  • VIBE: Visual & Interactive Benchmark for Execution in Application Development
  • SWE-Review
  • OctoCodingBench
At present, the VIBE dataset has been open-sourced on Hugging Face. However, its infrastructure is not yet fully ready, and we are actively working to complete it.

VIBE

Let's start with VIBE, which targets the AppDev (application development) scenario. Due to the lack of existing leaderboards that effectively measure performance in this domain, we built this benchmark ourselves. It covers a wide range of settings, including frontend development, simulated Android, iOS, and backend tasks. Compared to M2, M2.1 shows substantial improvements on this benchmark.
For verification, we adopt an Agent-as-a-Verifier approach, in which an agent executes tasks directly in a real environment. The reward signal consists of three dimensions:
  • Execution level: verifying whether the code compiles and runs successfully
  • Interaction level: interacting with tools and user interfaces to assess whether the business logic is correct
  • Visual level: scoring based on aesthetic criteria. Although this dimension is inherently subjective, we focus on standards with high inter-rater consistency.
Compared to traditional LLM-as-a-Judge methods that rely solely on static screenshots for evaluation, Agent-based verification enables dynamic interaction, providing a more comprehensive view of bugs and implementation flaws.
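For illustration, the three dimensions might be combined along these lines (the gating rule and weights are illustrative, not the benchmark's actual aggregation):

```python
# Combine execution, interaction, and visual scores into a single VIBE-style reward.
def vibe_score(execution: float, interaction: float, visual: float) -> float:
    if execution == 0.0:      # if the app does not even run, nothing else counts
        return 0.0
    return 0.3 * execution + 0.5 * interaction + 0.2 * visual
```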

SWE-Review & OctoBench

The other two benchmarks are SWE-Review and OctoBench (OctoCodingBench).
SWE-Review targets review scenarios within the development pipeline. It is designed as a benchmark suite covering multiple programming languages and diverse use cases, with metrics that jointly consider recall and hallucination rate.
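Roughly, the two metrics can be computed as in the sketch below; `matches()` is a hypothetical issue-matching helper (e.g., same file and location plus an LLM consistency check).

```python
# Recall over annotated issues, plus the fraction of reported issues that match nothing
# in the annotation (hallucinations).
def review_metrics(predicted_issues, gold_issues):
    hit = sum(any(matches(p, g) for p in predicted_issues) for g in gold_issues)
    hallucinated = sum(not any(matches(p, g) for g in gold_issues) for p in predicted_issues)
    recall = hit / len(gold_issues) if gold_issues else 1.0
    hallucination_rate = hallucinated / len(predicted_issues) if predicted_issues else 0.0
    return recall, hallucination_rate
```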
OctoBench evaluates an agent's instruction-following capability in agentic settings. Unlike traditional instruction-following (IF) leaderboards, where instructions typically originate only from the user or system prompts, instructions in agent-based scenarios may also come from system reminders, Claude.md files, or tool schemas. OctoBench is developed in-house and employs checklist-based, rubric-driven scoring to perform the evaluation.
Finally, regarding these metrics: compared to M2, M2.1 shows a notable overall improvement as a result of several targeted optimizations, including gains in instruction-following performance on SWE-Review and OctoBench.