Today, we are handing MiniMax-M2.1 over to the open-source community. This release is more than just a parameter update; it is a significant step toward democratizing top-tier agentic capabilities.
How to Use
- The MiniMax-M2.1 API is now live on the MiniMax Open Platform (see the usage sketch after this list): https://platform.minimax.io/docs/guides/text-generation
- Our product MiniMax Agent, built on MiniMax-M2.1, is now publicly available: https://agent.minimax.io/
- The MiniMax-M2.1 model weights are now open-source, allowing for local deployment and use: https://huggingface.co/MiniMaxAI/MiniMax-M2.1
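For a quick start, the sketch below calls the model through an OpenAI-compatible chat completions client. The base URL, model identifier, and environment variable name are assumptions made for illustration; confirm the exact values against the MiniMax Open Platform documentation linked above.

```python
# Minimal sketch of calling MiniMax-M2.1 via an OpenAI-compatible client.
# The base URL, model name, and env var below are assumptions -- check the
# MiniMax Open Platform docs linked above for the exact values.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["MINIMAX_API_KEY"],   # assumed environment variable name
    base_url="https://api.minimax.io/v1",    # assumed OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="MiniMax-M2.1",                    # assumed model identifier
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function that reverses a linked list."},
    ],
)

print(response.choices[0].message.content)
```

If you serve the open-source weights yourself with an OpenAI-compatible inference server, the same snippet applies with a different base_url.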
First Impressions
“MiniMax M2.1 performed exceptionally well across our internal benchmarks, showing strong results in complex instruction following, reranking, and classification, especially within e-commerce tasks. Beyond its general versatility, it has proven to be an excellent model for coding. We are impressed by these results and look forward to a close collaboration with the MiniMax team as we continue to support their latest innovations on the Fireworks platform.”
Benny Chen
Co-founder
“We could not be more excited about M2.1! Our users have come to rely on MiniMax for frontier-grade coding assistance at a fraction of the cost, and early testing shows M2.1 excelling at everything from architecture and orchestration to code reviews and deployment. The speed and efficiency are off the charts!”
Scott Breitenother
CEO
“The MiniMax M2 series has demonstrated powerful code generation capabilities and has quickly become one of the most popular models on the Cline platform over the past few months. We already see another huge advancement in capability with M2.1 and are very excited to continue partnering with the MiniMax team to advance AI in coding.”
Team Cline
“Integrating the MiniMax M2 series into our platform has been a significant win for our users, and M2.1 represents a clear step forward in what a coding-specific model can achieve. We’ve found that M2.1 handles the nuances of complex, multi-step programming tasks with a level of consistency that is rare in this space. By providing high-quality reasoning and context awareness at scale, MiniMax has become a core component of how we help developers solve challenging problems faster. We look forward to seeing how our community continues to leverage these updated capabilities.”
Robert Rizk
Co-founder
"We're excited for powerful open-source models like M2.1 that bring frontier performance (and in some cases exceed the frontier) for a wide variety of software development tasks. Developers deserve choice, and M2.1 provides that much needed choice!"
Eno Reyes
Co-Founder, CTO
"Our users love MiniMax M2 for its strong coding ability and efficiency. The latest M2.1 release builds on that foundation with meaningful improvements in speed and reliability, performing well across a wider range of languages and frameworks. It's a great choice for high-throughput, agentic coding workflows where speed and affordability matter."
Matt Rubens
Co-Founder, CEO
Benchmarks

Furthermore, across specific benchmarks—including test case generation, code performance optimization, code review, and instruction following—MiniMax-M2.1 demonstrates comprehensive improvements over M2. In these specialized domains, it consistently matches or exceeds the performance of Claude Sonnet 4.5.

MiniMax-M2.1 delivers outstanding performance on the VIBE aggregate benchmark, achieving an average score of 88.6—demonstrating robust full-stack development capabilities. It excels particularly in the VIBE-Web (91.5) and VIBE-Android (89.7) subsets.


Evaluation Methodology Notes:
- SWE-bench Verified: Tested on internal infrastructure using Claude Code, Droid, or mini-swe-agent as scaffolding. By default, we report the Claude Code results. When using Claude Code, the default system prompt was overridden. Results represent the average of 4 runs.
- Multi-SWE-Bench & SWE-bench Multilingual & SWT-bench & SWE-Perf: Tested on internal infrastructure using Claude Code as scaffolding, with the default system prompt overridden. Results represent the average of 4 runs.
- Terminal-bench 2.0: Tested using Claude Code on our internal evaluation framework. We verified the full dataset and fixed environmental issues. Timeout limits were removed, while all other configurations remained consistent with official settings. Results represent the average of 4 runs.
- SWE Review: Built upon the SWE framework, this internal benchmark for code defect review covers diverse languages and scenarios, evaluating both defect recall and hallucination rates. A review is deemed "correct" only if the model accurately identifies the target defect and ensures all other reported findings are valid and free of hallucinations. All evaluations are executed using Claude Code, with final results reflecting the average of four independent runs per test case. We plan to open-source this benchmark soon.
- OctoCodingbench: An internal benchmark focused on long-horizon instruction following for Code Agents in complex development scenarios. It conducts end-to-end behavioral supervision within a dynamic environment spanning diverse tech stacks and scaffolding frameworks. The core objective is to evaluate the model's ability to integrate and execute "composite instruction constraints," encompassing System Prompts (SP), User Queries, Memory, Tool Schemas, and specifications such as Agents.md, Claude.md, and Skill.md. Adopting a strict "single-violation-failure" scoring mechanism (sketched after these notes), the final result is the average pass rate across 4 runs, quantifying the model's robustness in translating static constraints into precise behaviors. We plan to open-source this benchmark soon.
- VIBE: An internal benchmark that utilizes Claude Code as scaffolding to automatically verify a program's interactive logic and visual effects. Scores are calculated through a unified pipeline comprising requirement sets, containerized deployment, and dynamic interaction environments. Final results represent the average of 3 runs. We plan to open-source this benchmark soon.
- Toolathlon: The evaluation protocol remains consistent with the original paper.
- BrowseComp: All scores were obtained using the same agent framework as WebExplorer (Liu et al. 2025), with only minor fine-tuning of tool descriptions. We utilized the same 103-sample GAIA text-only validation subset as WebExplorer.
- BrowseComp (context management): When token usage exceeds 30% of the maximum context window, we retain the first AI response, the last five AI responses, and the tool outputs, discarding the remaining content (see the sketch after these notes).
- AIME25 through 𝜏²-Bench Telecom: Derived from internal testing based on the evaluation datasets and methodology referenced in the Artificial Analysis Intelligence Index.
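As a concrete reading of OctoCodingbench's "single-violation-failure" rule, the sketch below shows one way the pass rate over 4 runs could be computed. The data layout (a map from task id to per-run violation counts) and the function name are illustrative assumptions, not the internal implementation.

```python
# Hedged sketch of "single-violation-failure" scoring for OctoCodingbench:
# a run on a task passes only if it violates none of its instruction
# constraints, and the reported score is the pass rate averaged over 4 runs.
# The input layout below is an assumption made for illustration.

def octo_pass_rate(violations_per_task: dict[str, list[int]]) -> float:
    """violations_per_task: {task_id: [violations_in_run_1, ..., violations_in_run_4]}."""
    num_runs = len(next(iter(violations_per_task.values())))
    num_tasks = len(violations_per_task)
    run_pass_rates = []
    for run in range(num_runs):
        passed = sum(1 for counts in violations_per_task.values() if counts[run] == 0)
        run_pass_rates.append(passed / num_tasks)
    # Average pass rate across the runs.
    return sum(run_pass_rates) / num_runs
```

Averaging per-run pass rates this way is equivalent to averaging per-task pass rates when every task has the same number of runs.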
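The BrowseComp context-management rule above can be sketched as a simple truncation step. The message schema, the count_tokens helper, and the way the 30% threshold is checked are assumptions for illustration rather than the exact internal logic.

```python
# Illustrative sketch of the context-management rule described above: once
# token usage exceeds 30% of the maximum context window, keep the first AI
# response, the last five AI responses, and the tool outputs, and discard
# the rest of the conversation history. Schema and helpers are assumptions.

def truncate_history(messages, count_tokens, max_context_tokens):
    """messages: list of {"role": "assistant" | "tool" | ..., "content": str} dicts."""
    total = sum(count_tokens(m["content"]) for m in messages)
    if total <= 0.3 * max_context_tokens:
        return messages  # below the 30% threshold: keep the full history

    assistant_idx = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    keep = set(assistant_idx[:1])            # first AI response
    keep.update(assistant_idx[-5:])          # last five AI responses
    keep.update(i for i, m in enumerate(messages) if m["role"] == "tool")  # tool outputs

    return [m for i, m in enumerate(messages) if i in keep]
```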






