2026.1.14

MiniMax Open-Sources New Benchmark: Defining Production-Grade Standards for Coding Agents

In the practical deployment of Coding Agents, we have observed a recurring yet frequently overlooked phenomenon: user dissatisfaction with an Agent often stems not from its inability to accomplish tasks, but from its failure to accomplish them properly.

By analyzing user experience feedback, we found that the most frequent complaints center on one issue: the Agent failing to follow explicitly stated instructions. For instance, when users explicitly specify "do not use emojis" in the system prompt, the Agent nonetheless inserts smiley faces into code comments; when users request "back up before modifying," the Agent directly executes rm -rf and deletes files; when users define naming conventions in the project documentation, the Agent disregards them entirely.
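Violations like these are mechanically checkable. As an illustration (our own sketch, not an actual OctoCodingBench checker; the function name and regex are assumptions), the "no emojis" rule can be verified against an agent's output with a simple scan of comment lines:

```python
import re

# Rough emoji ranges: Misc Symbols/Dingbats plus the main astral emoji blocks.
# This is an illustrative approximation, not an exhaustive emoji definition.
EMOJI = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def violates_no_emoji_rule(patch_text: str) -> bool:
    """Return True if any comment line in the patch contains an emoji."""
    for line in patch_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#") and EMOJI.search(stripped):
            return True
    return False

patch = "def add(a, b):\n    # adds two numbers \U0001F600\n    return a + b\n"
print(violates_no_emoji_rule(patch))  # True: smiley face in a comment
```

A process-oriented benchmark can accumulate many such small rule checks per task, rather than only asking whether the final tests pass.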

These issues share a common characteristic: the task may ultimately be completed, but the process violates established specifications. Users demand not merely "code that runs," but "code that adheres to team collaboration standards."

Why Coding Agents Need a New Benchmark


If we accept that only Coding Agents capable of adhering to process specifications can be confidently integrated into real-world software engineering workflows, then the current mainstream evaluation frameworks for Coding Agents reveal a significant blind spot. With the proliferation of Agent products such as Claude Code, Codex, Cursor, and Windsurf, the community is establishing a repository protocol system designed specifically for Agents. Projects are no longer merely collections of code; they now encompass multi-layered specifications for collaboration:

- System-level safety rules injected by the Agent product
- Engineering specifications explicitly documented in the repository, such as an AGENTS.md file
- Decisions recorded in persistent memory across sessions
- The user's immediate, turn-by-turn instructions

The emergence of these mechanisms essentially constructs a multi-tiered instruction system. For example, when a user says "help me refactor this module," the Agent must simultaneously satisfy constraints across multiple levels: system-level safety rules (no direct code deletion), the user's immediate instructions (the intended extent of the refactoring), engineering specifications explicitly documented in the repository, and decisions previously recorded in memory (which it must choose to continue or override). More complex scenarios arise when these instruction sources conflict. A user might temporarily say "skip writing tests this time," while AGENTS.md explicitly mandates "every commit must have test coverage." Which instruction should the Agent obey?
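One plausible way to model such a hierarchy is precedence by instruction source. The sketch below is our own illustration, not MiniMax's implementation; the specific priority ordering (system above repository spec above memory above the user's current turn) is an assumption chosen for the example:

```python
# Hypothetical instruction-precedence resolver: when two instructions
# address the same topic, the one from the higher-priority source wins.
# The ordering below is an illustrative assumption, not a rule defined
# by OctoCodingBench.
PRIORITY = {"system": 3, "repo_spec": 2, "memory": 1, "user_turn": 0}

def resolve(instructions):
    """Keep, per topic, the directive from the highest-priority source."""
    winners = {}  # topic -> (source, directive)
    for source, topic, directive in instructions:
        if topic not in winners or PRIORITY[source] > PRIORITY[winners[topic][0]]:
            winners[topic] = (source, directive)
    return {topic: directive for topic, (source, directive) in winners.items()}

conflict = [
    ("repo_spec", "tests", "every commit must include test coverage"),
    ("user_turn", "tests", "skip writing tests this time"),
]
print(resolve(conflict)["tests"])  # repo spec wins under this ordering
```

Real Agent products may weigh these sources differently (for instance, letting an explicit user turn override a repository convention), which is exactly why benchmarking conflict handling matters.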
However, an awkward problem persists: current academic leaderboards, whether SWE-bench Verified or the various terminal-environment-based tests, are almost universally grounded in outcome-based metrics (did the tests pass? was the bug fixed?). This result-oriented approach fails to capture how the model behaves inside the sandbox environment, let alone the authentic interaction experience of complex real-world scenarios, and it ultimately leads to a misalignment between evaluation and actual usage contexts.

OctoCodingBench: Process-Oriented Evaluation for Engineering Reliability

To address this issue, the evaluation paradigm itself requires a fundamental transformation, one that focuses on the output process itself. Motivated by this insight, we introduce OctoCodingBench, which evaluates along two dimensions: Check-level Success Rate (CSR) and Instance-level Success Rate (ISR). The goal is to comprehensively surface process-instruction violations during task completion, approximating the real user experience as closely as possible.

Specifically, CSR measures the proportion of individual rules the Coding Agent follows across all checks, while ISR measures whether the Coding Agent adheres to every single rule within a task instance.
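Under these definitions, the two metrics can be sketched as follows (our own illustration of the aggregation logic; the data layout and names are assumptions, not the benchmark's actual schema):

```python
# Each instance is one task run; each boolean in "checks" records whether
# the agent satisfied one rule. Data below is illustrative only.

def csr(instances):
    """Check-level Success Rate: fraction of all rule checks passed."""
    checks = [passed for inst in instances for passed in inst["checks"]]
    return sum(checks) / len(checks)

def isr(instances):
    """Instance-level Success Rate: fraction of instances where every check passed."""
    return sum(all(inst["checks"]) for inst in instances) / len(instances)

runs = [
    {"checks": [True, True, True]},   # all rules followed
    {"checks": [True, False, True]},  # one violation fails the whole instance
]
print(csr(runs))  # 5 of 6 checks pass: ~0.833
print(isr(runs))  # 1 of 2 instances fully compliant: 0.5
```

The gap between the two numbers is the point: an agent can score well on CSR while ISR stays low, meaning it violates at least one rule in most tasks.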


A qualified Coding Agent must, while completing tasks, adhere to every applicable specification: system-level rules, the user's instructions, conventions documented in the repository, and decisions recorded in memory.

Based on this benchmark, we conducted extensive evaluations of existing open-source and closed-source models, yielding several insightful experimental findings.

Future Research Directions


We believe that training the next generation of Coding Agents requires the introduction of Process Supervision.

The capability boundary of Coding Agents is shifting from "can it write code that runs" to "can it collaboratively complete tasks under complex constraints." This also reflects a deeper transformation in product philosophy: Agents are not meant to replace human developers, but to become team members who understand and abide by the rules.

Therefore, Process Specification is the core proposition for the evolution of Coding Agents. Only when we begin to focus on process rather than merely outcomes, and when our evaluation systems can capture the dangerous pattern of "non-compliant yet successful," can Coding Agents truly transition from demos to production environments.

OctoCodingBench represents an infrastructure-level endeavor. We look forward to collaborating with the community to continue advancing in this direction.

Hugging Face: huggingface.co/datasets/MiniMaxAI/OctoCodingBench

MiniMax Coding Plan: https://platform.minimax.io/subscribe/coding-plan

Intelligence with Everyone.