2026.1.4

M2.1: Multilingual and Multi-Task Coding with Strong Generalization

MiniMax-M2.1 has achieved a significant leap in coding capabilities compared to the previous generation, matching or surpassing the level of global top-tier models on multiple internal and external benchmarks. As an open-source model optimized specifically for agentic scenarios, M2.1 demonstrates exceptional performance in code generation, tool usage, instruction following, and long-range planning. Here, we would like to share some insights and practical experience gained in the process of enhancing coding capabilities for real-world scenarios.

The Gap Between SWE-Bench and Real-World Coding

In 2025, SWE-Bench has become the most authoritative evaluation standard for code-generation scenarios. In this evaluation, an LLM is presented with bugs from real GitHub repositories and must fix them through multiple rounds of code reading and testing. The core value of SWE-Bench is that its tasks closely mirror a programmer's daily work, and the results can be objectively verified via test cases, a property that is particularly crucial for reinforcement learning training. We can use the test pass rate directly as a reward signal and continuously optimize the model in a real code environment, without the noise introduced by human labeling or model-based evaluation.
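As a minimal sketch of this verifiable-reward idea, the helper below turns a test run into a scalar reward. The function names and the pytest-style summary parsing are our own illustrative assumptions, not the actual MiniMax training pipeline:

```python
import re
import subprocess

def pass_fraction(summary: str) -> float:
    """Turn a pytest-style summary line such as '3 passed, 1 failed in 0.2s'
    into a pass fraction in [0, 1], usable directly as an RL reward."""
    counts = {kind: int(n) for n, kind in re.findall(r"(\d+) (passed|failed|error)", summary)}
    total = sum(counts.values())
    return counts.get("passed", 0) / total if total else 0.0

def reward(repo_dir: str, test_cmd=("pytest", "-q"), timeout=600) -> float:
    """Run the repository's own test suite after the agent's patch is applied;
    the pass fraction is the reward signal, with no human judging involved."""
    proc = subprocess.run(list(test_cmd), cwd=repo_dir,
                          capture_output=True, text=True, timeout=timeout)
    last_line = proc.stdout.strip().splitlines()[-1] if proc.stdout.strip() else ""
    return pass_fraction(last_line)
```

A binary variant (reward 1.0 only when every test passes) is equally common; the fractional form simply gives a denser learning signal.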
However, like all evaluation standards, SWE-Bench is not perfect. For a coding agent to be usable in real-world scenarios, there are more capability dimensions beyond SWE-Bench that need attention:

How to Fill These Gaps

1. Environment Scaling

We often see developers complaining that current coding agents perform well on languages like Python/JavaScript but show lackluster results in more serious enterprise-level development scenarios. If the task involves complex project understanding, the performance degrades further.
To solve this problem, during the training cycle of MiniMax-M2.1 we built a comprehensive data pipeline covering the top 10+ mainstream programming languages. We retrieved a massive number of Issues, PRs, and corresponding test cases from GitHub, then applied strict filtering, cleaning, and rewriting to this raw data to ensure the quality of the Post Training data. A coding agent is naturally suited to mass-producing this kind of training environment. During this process, we found that for both the M2 model and other frontier models, the success rate of constructing multi-language environments was lower than for Python. There are several distinct situations here:
Ultimately, we built a multi-language training system covering over ten languages including JS, TS, HTML, CSS, Python, Java, Go, C++, Kotlin, C, and Rust. We obtained over 100,000 environments usable for training and evaluation from real GitHub repositories, with each environment containing complete Issues, code, and test cases. To support such massive Environment Scaling and RL training, we built a high-concurrency sandbox infrastructure capable of launching over 5,000 isolated execution environments within 10 seconds, while supporting the concurrent operation of tens of thousands of environments. This infrastructure allows us to efficiently conduct large-scale multi-language coding agent training.
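The concurrency pattern behind such a sandbox scheduler can be sketched in a few lines. Everything here (the semaphore bound, the placeholder launch step) is illustrative of the general technique, not MiniMax's actual infrastructure:

```python
import asyncio

async def run_env(env_id: str, sem: asyncio.Semaphore) -> tuple[str, bool]:
    """Launch one isolated environment, run its tests, tear it down.
    The body is a placeholder; a real version would start a container here."""
    async with sem:
        await asyncio.sleep(0)  # stands in for container launch + test execution
        return env_id, True

async def run_batch(env_ids, max_concurrency: int = 5000):
    """Bound in-flight sandboxes with a semaphore so tens of thousands of
    environments can be queued while only max_concurrency run at once."""
    sem = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(*(run_env(e, sem) for e in env_ids))

results = asyncio.run(run_batch([f"env-{i}" for i in range(100)], max_concurrency=8))
```

The key design point is separating admission (the semaphore) from scheduling (`gather`), so a slow environment never blocks the queue, only occupies one slot.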

2. Beyond Bug Fix: Multi-Task Capabilities

Real software development is far more than just fixing bugs. A programmer's daily routine includes writing tests, code reviews, performance optimization, and other tasks. In the training of MiniMax-M2.1, we also conducted targeted optimization for these scenarios, including acquiring high-quality problems and designing corresponding Reward signals:

3. Generalization on OOD Scaffolds

Generalization on OOD scaffolds is vital for a coding agent. Developers use different scaffolds: some use Claude Code, some use Cursor, and others use proprietary agent frameworks. If a model is optimized only for a specific scaffold, its performance degrades sharply in other environments, severely limiting its usefulness in real development scenarios. In MiniMax-M2.1, we believe scaffold generalization primarily tests the model's long-range instruction-following ability and its adaptability to context-management strategies:
To verify MiniMax-M2.1's scaffold generalization, we tested SWE-Bench performance directly on different scaffolds and also constructed a test set closer to real-world usage to observe whether the model meets the instruction constraints of various scaffolds. Ultimately, we found that MiniMax-M2.1 maintained an SWE-Bench score above 67 in mini-swe-agent, Droid, and Claude Code.
Compared to M2, MiniMax-M2.1 shows significant improvement across different OOD scaffolds. On OctoCodingbench, M2.1 improved from M2's 13.3 to 26.1, demonstrating strong compliance with scaffold instruction constraints.

2026 TODOs

We believe the development of coding agents still has a long way to go. Therefore, this year we will explore several interesting directions: