Today, we are handing MiniMax-M2.1 over to the open-source community. This release is more than just a parameter update; it is a significant step toward democratizing top-tier agentic capabilities.
M2.1 was built to shatter the stereotype that high-performance agents must remain behind closed doors. We have optimized the model specifically for robustness in coding, tool use, instruction following, and long-horizon planning. From automating multilingual software development to executing complex, multi-step office workflows, MiniMax-M2.1 empowers developers to build the next generation of autonomous applications—all while being fully transparent, controllable, and accessible.
We believe true intelligence should be within reach. M2.1 is our commitment to the future, and a powerful new tool in your hands.

How to Use
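M2.1 ships as open weights. Below is a minimal local-inference sketch using Hugging Face transformers; the repo id MiniMaxAI/MiniMax-M2.1, the chat template behavior, and the generation settings are assumptions, so consult the official model card for authoritative instructions.

```python
# Minimal sketch: local inference with Hugging Face transformers.
# Assumption: the checkpoint is published as "MiniMaxAI/MiniMax-M2.1"
# (check the official model card for the actual repo id).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MiniMaxAI/MiniMax-M2.1"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # shard across available GPUs
    torch_dtype="auto",  # keep the checkpoint's native dtype
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function that merges two sorted lists."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

For production or agentic workloads, a model of this scale is more commonly served behind an OpenAI-compatible endpoint (for example via vLLM or SGLang) and called through the standard chat-completions API.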

First Impressions

Fireworks

“MiniMax M2.1 performed exceptionally well across our internal benchmarks, showing strong results in complex instruction following, reranking, and classification, especially within e-commerce tasks. Beyond its general versatility, it has proven to be an excellent model for coding. We are impressed by these results and look forward to a close collaboration with the MiniMax team as we continue to support their latest innovations on the Fireworks platform.”

Benny Chen

Co-founder

Kilo

“We could not be more excited about M2.1! Our users have come to rely on MiniMax for frontier-grade coding assistance at a fraction of the cost, and early testing shows M2.1 excelling at everything from architecture and orchestration to code reviews and deployment. The speed and efficiency are off the charts!”

Scott Breitenother

CEO

Cline

“The MiniMax M2 series has demonstrated powerful code generation capability and has quickly become one of the most popular models on the Cline platform over the past few months. We already see another huge advancement in capability with M2.1 and are very excited to continue partnering with the MiniMax team to advance AI in coding.”

Team Cline

BlackBox

“Integrating the MiniMax M2 series into our platform has been a significant win for our users, and M2.1 represents a clear step forward in what a coding-specific model can achieve. We’ve found that M2.1 handles the nuances of complex, multi-step programming tasks with a level of consistency that is rare in this space. By providing high-quality reasoning and context awareness at scale, MiniMax has become a core component of how we help developers solve challenging problems faster. We look forward to seeing how our community continues to leverage these updated capabilities.”

Robert Rizk

Co-founder

Factory AI (Droid)

"We're excited for powerful open-source models like M2.1 that bring frontier performance (and in some cases exceed the frontier) for a wide variety of software development tasks. Developers deserve choice, and M2.1 provides that much needed choice!"

Eno Reyes

Co-Founder, CTO

Roo Code

"Our users love MiniMax M2 for its strong coding ability and efficiency. The latest M2.1 release builds on that foundation with meaningful improvements in speed and reliability, performing well across a wider range of languages and frameworks. It's a great choice for high-throughput, agentic coding workflows where speed and affordability matter."

Matt Rubens

Co-Founder, CEO

Benchmarks

MiniMax-M2.1 delivers a significant leap over M2 on core software engineering leaderboards. It shines particularly bright in multilingual scenarios, where it outperforms Claude Sonnet 4.5 and closely approaches Claude Opus 4.5.
We also evaluated MiniMax-M2.1 on SWE-bench Verified across a variety of coding agent frameworks. The results highlight the model's exceptional framework generalization and robust stability.
Furthermore, across specific benchmarks—including test case generation, code performance optimization, code review, and instruction following—MiniMax-M2.1 demonstrates comprehensive improvements over M2. In these specialized domains, it consistently matches or exceeds the performance of Claude Sonnet 4.5.
To evaluate the model's full-stack capability to architect complete, functional applications "from zero to one," we established a novel benchmark: VIBE (Visual & Interactive Benchmark for Execution). This suite encompasses five core subsets: Web, Simulation, Android, iOS, and Backend. Distinguishing itself from traditional benchmarks, VIBE leverages an innovative Agent-as-a-Verifier (AaaV) paradigm to automatically assess the interactive logic and visual aesthetics of generated applications within a real runtime environment.
MiniMax-M2.1 delivers outstanding performance on the VIBE aggregate benchmark, achieving an average score of 88.6—demonstrating robust full-stack development capabilities. It excels particularly in the VIBE-Web (91.5) and VIBE-Android (89.7) subsets.
MiniMax-M2.1 also demonstrates steady improvements over M2 in both long-horizon tool use and comprehensive intelligence metrics.
Evaluation Methodology Notes:
  • SWE-bench Verified: Tested on internal infrastructure using Claude Code, Droid, or mini-swe-agent as scaffolding. By default, we report the Claude Code results. When using Claude Code, the default system prompt was overridden. Results represent the average of 4 runs.
  • Multi-SWE-Bench & SWE-bench Multilingual & SWT-bench & SWE-Perf: Tested on internal infrastructure using Claude Code as scaffolding, with the default system prompt overridden. Results represent the average of 4 runs.
  • Terminal-bench 2.0: Tested using Claude Code on our internal evaluation framework. We verified the full dataset and fixed environmental issues. Timeout limits were removed, while all other configurations remained consistent with official settings. Results represent the average of 4 runs.
  • SWE Review: Built upon the SWE framework, this internal benchmark for code defect review covers diverse languages and scenarios, evaluating both defect recall and hallucination rates. A review is deemed "correct" only if the model accurately identifies the target defect and every other reported finding is valid and free of hallucinations (see the first sketch after this list). All evaluations are executed using Claude Code, with final results reflecting the average of four independent runs per test case. We plan to open-source this benchmark soon.
  • OctoCodingbench: An internal benchmark focused on long-horizon instruction following for Code Agents in complex development scenarios. It conducts end-to-end behavioral supervision within a dynamic environment spanning diverse tech stacks and scaffolding frameworks. The core objective is to evaluate the model's ability to integrate and execute "composite instruction constraints," encompassing System Prompts (SP), User Queries, Memory, Tool Schemas, and specifications such as Agents.md, Claude.md, and Skill.md. The benchmark adopts a strict "single-violation-failure" scoring mechanism; the final result is the average pass rate across 4 runs, quantifying the model's robustness in translating static constraints into precise behaviors. We plan to open-source this benchmark soon.
  • VIBE: An internal benchmark that utilizes Claude Code as scaffolding to automatically verify a program's interactive logic and visual effects. Scores are calculated through a unified pipeline comprising requirement sets, containerized deployment, and dynamic interaction environments. Final results represent the average of 3 runs. We plan to open-source this benchmark soon.
  • Toolathlon: The evaluation protocol remains consistent with the original paper.
  • BrowseComp: All scores were obtained using the same agent framework as WebExplorer (Liu et al. 2025), with only minor adjustments to the tool descriptions. We utilized the same 103-sample GAIA text-only validation subset as WebExplorer.
  • BrowseComp (context management): When token usage exceeds 30% of the maximum context window, we retain the first AI response, the last five AI responses, and the tool outputs, discarding the remaining content (see the second sketch after this list).
  • AIME25 through 𝜏²-Bench Telecom: Derived from internal testing based on the evaluation datasets and methodology referenced in the Artificial Analysis Intelligence Index.
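To make the SWE Review pass criterion concrete, here is a minimal sketch of the "correct" predicate described above, assuming each reported finding can be checked against the ground-truth target defect and a validity oracle (all names are illustrative, not part of the released benchmark):

```python
def review_is_correct(reported_findings, target_defect, is_valid_finding):
    """A review passes only if the target defect is identified AND every
    reported finding is valid, i.e. no hallucinated defects slip through."""
    found_target = target_defect in reported_findings
    no_hallucinations = all(is_valid_finding(f) for f in reported_findings)
    return found_target and no_hallucinations
```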
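The BrowseComp context-management rule can likewise be sketched as a pruning pass over the message history, assuming messages carry a role tag and token usage is tracked externally (the data layout is illustrative):

```python
def prune_context(messages, used_tokens, max_context_tokens):
    """Once usage exceeds 30% of the context window, keep only the first
    AI response, the last five AI responses, and all tool outputs."""
    if used_tokens <= 0.30 * max_context_tokens:
        return messages  # under budget: keep the full history

    ai_turns = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    keep = set(ai_turns[:1])        # first AI response
    keep.update(ai_turns[-5:])      # last five AI responses
    keep.update(i for i, m in enumerate(messages) if m["role"] == "tool")
    return [m for i, m in enumerate(messages) if i in keep]
```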