MiniMax
2026.1.27

A Deep Dive into MiniMax-M2-her

https://file.cdn.minimax.io/public/4389b755-cb3c-4e72-853b-d6be303853b2.png
Worlds to Dream, Stories to Live
How we built a Role-Play Agent for production use


Three Years of Observations: How We Define Role-Play


This year marks our third year optimizing Role-Play in Talkie / Xingye.

Three years is long enough for a product to leave its mark on users' lives, and long enough for a user to form a deep bond with an NPC. Beyond the product metrics, we have found that the most valuable insights come from user behaviors that reflect real needs.

Here are a few signals that are most interesting:


All of these converge on a single insight: the essence of Role-Play is not static impersonation; it is the unique narrative journey a user and a character weave together. Deep role-play is not just about accuracy; it's about agency: enabling every user to step into a living, breathing environment and arrive at a moment of resolution that is uniquely theirs. Formally, we define this as an agent's capacity to navigate specific coordinates: {Worlds} × {Stories}, conditioned on {User Preferences}.

Guided by this framework, we have distilled our technical strategy for Role-Play into three core challenges:



1 MiniMax-M2-her


Over the past three years, we have relentlessly iterated our models to answer these fundamental questions. Today, we are proud to introduce MiniMax-M2-her—our systematic attempt toward deeper Role-Play.

Specifically, MiniMax-M2-her is engineered to deliver:


In the following sections, we will break down the insights gained from three years of research and the engineering efforts that power MiniMax-M2-her.

2 Starting with Evaluation — Is A/B Testing A Good Evaluation?


Prior to mid-2024, our iteration cycle—like much of the industry's—was tethered to traditional online A/B testing. We relied heavily on lagging indicators such as lifetime (LT), session duration, and average conversation turns to judge performance.

However, we quickly hit a ceiling: velocity. Validating a new model required lengthy testing cycles to reach statistical significance, often stretching feedback loops to a week or more. We also faced a unique challenge: user inertia. Long-term users build extensive histories and deep emotional habits with specific NPCs. When we swapped the underlying model, even for a "better" one, the sudden stylistic shift often felt like a violation of the character's established voice. Users would reject the change, venting frustration on social media or manually reverting to the previous version to preserve their immersion. To address this, we considered limiting A/B testing to "cold start" interactions with new NPCs that users hadn't met yet. Although we eventually bit the bullet and built that infrastructure, it was a significant engineering lift at the time.

To break free from these slow cycles, we needed a way to approximate online metrics through offline evaluation. But here we encountered the "Ground Truth Paradox." Unlike conventional NLP tasks, Role-Play is inherently subjective and non-verifiable. If you ask a tsundere character, "Do you like me?", valid responses could range from a flushed "Hmph, as if!" to a cold "...You're so annoying." A hundred users might hold a hundred different, equally valid expectations. This renders discriminative evaluation intractable.

However, we identified a key insight: While "alignment" (what makes a response great) is subjective, "misalignment" (what makes a response wrong) is surprisingly objective. Returning to the tsundere example, if the character responds "Yes, I really like you," we immediately know the model has failed. It is Out of Character (OOC). This gave us a clear path forward: while it's hard to define aligned responses, it is feasible to detect a misaligned one.

Leveraging this logic, and grounding it in our three core dimensions (Worlds, Stories, Preferences), we developed Role-Play Bench. This evaluation framework utilizes Situated Reenactment to automatically detect model misalignment. By focusing on error detection rather than subjective perfection, we have created a metric that correlates closely with online performance, significantly accelerating our iteration velocity.
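As a toy illustration of this "detect misalignment, not alignment" principle, a checker can flag replies that violate explicit persona constraints. The persona taboo list below is invented for the example and is not part of Role-Play Bench, which uses an LLM judge rather than string matching; the point is only that the target of detection is the violation, not the ideal reply.

```python
# Toy sketch: instead of scoring how "good" a reply is (subjective), flag
# replies that violate explicit persona constraints (objectively wrong).
# The tsundere taboo phrases are illustrative only.

def is_out_of_character(persona_taboos, reply):
    """Return True if the reply contains a phrase this persona would never say."""
    lowered = reply.lower()
    return any(taboo in lowered for taboo in persona_taboos)

tsundere_taboos = ["i really like you", "i love you so much"]

assert not is_out_of_character(tsundere_taboos, "Hmph, as if!")
assert is_out_of_character(tsundere_taboos, "Yes, I really like you.")
```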

2.1 Situated Reenactment: Bridging the Gap to Online Evaluation

Situated Reenactment measures an agent’s performance at specific coordinates: {Worlds} × {Stories}, conditioned on {User Preferences}. Instead of evaluating static, single-turn responses, we generate multi-turn dialogue trajectories via self-play simulation. This allows us to observe not just how a model starts a conversation, but how it behaves as the narrative arc unfolds.

Scenario construction. We started from our massive internal NPC/User prompt library (>1M entries) and the corresponding relationship setups. To bring order to this unstructured data, we produced hierarchical structured tags via embedding clustering (denoising) → LLM semantic aggregation → human verification. We then uniformly sampled 100 NPC settings each in Chinese and English, based on the structured taxonomy, to ensure coverage of varied initial states. Our data team then crafted context-rich scenarios based on the sampled NPC and User prompts, applying strict dispersion constraints to ensure no two scenarios felt the same.
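The taxonomy-uniform sampling step can be sketched as follows. Given NPC prompts labeled with hierarchical tags (produced upstream by clustering, LLM aggregation, and human verification), sampling the same number of settings per top-level tag keeps rare archetypes from being drowned out by popular ones. The tag names and counts here are made up for illustration.

```python
import random
from collections import defaultdict

def uniform_sample_by_tag(tagged_npcs, per_tag, rng):
    """Sample up to `per_tag` NPC ids from each top-level tag bucket.

    tagged_npcs: iterable of (npc_id, top_level_tag) pairs.
    """
    buckets = defaultdict(list)
    for npc_id, top_tag in tagged_npcs:
        buckets[top_tag].append(npc_id)
    sample = []
    for tag in sorted(buckets):  # deterministic bucket order
        pool = buckets[tag]
        sample.extend(rng.sample(pool, min(per_tag, len(pool))))
    return sample

# Skewed library: naive random sampling would rarely pick "slice-of-life".
npcs = [(f"npc_{i}", tag) for i, tag in enumerate(
    ["fantasy"] * 50 + ["sci-fi"] * 30 + ["slice-of-life"] * 5)]
picked = uniform_sample_by_tag(npcs, per_tag=5, rng=random.Random(0))
assert len(picked) == 15  # 5 per tag across 3 tags
```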

Model sampling. To simulate real conversation experiences, we built a Model-on-Model Self-Play sampling pipeline in which models play both NPC and User. Since our internal model natively incorporates the "recommended reply" feature, it can naturally replicate real user expression habits and conversation preferences while playing the NPC, providing possible dialogue options. External models, however, typically lack this dual-role capability and struggle with user-simulation fidelity. To ensure a fair comparison, we refined the prompts for external models based on real online user behavior patterns. We run 100 turns of self-play for each setting, repeated three times, generating a total dataset of 300 dense conversation sessions. (Note: we also validated this approach by using a fixed model to play the "User" role across all tests. The results were statistically similar to the self-play method, confirming the robustness of our approach.)
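A minimal self-play loop looks roughly like the following, with the LLM calls stubbed out. In the real pipeline both roles are served by models; `npc_model` and `user_model` here are hypothetical callables mapping a conversation history to the next reply.

```python
# Minimal model-on-model self-play loop (LLM calls stubbed for illustration).

def self_play(npc_model, user_model, opening, turns):
    """Alternate NPC and User replies for `turns` rounds after an opening line."""
    history = [("user", opening)]
    for _ in range(turns):
        history.append(("npc", npc_model(history)))
        history.append(("user", user_model(history)))
    return history

def npc_stub(history):
    return f"npc reply at step {len(history)}"

def user_stub(history):
    return f"user reply at step {len(history)}"

trajectory = self_play(npc_stub, user_stub, "Hi there!", turns=3)
assert len(trajectory) == 1 + 2 * 3   # opening + 3 NPC/User exchanges
assert trajectory[1][0] == "npc"      # NPC always answers the opening
```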

The Evaluation Protocol. From the scenario construction and model sampling above, we obtained trajectories of different models playing NPC and User across different situations. For fairness, evaluation focuses exclusively on NPC-side outputs, scored across predefined dimensions by an evaluation model aligned with human perception. To mitigate the inherent variance of LLM judges and ensure interpretability, we implement a rigorous scoring stack: text chunking, consistency checks across multiple samples, and manual calibration to align the model's judgment with our internal standards.
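The consistency-check step can be sketched simply: score each chunk several times, keep the median, and route runs whose spread exceeds a threshold to manual calibration. The `judge` callable and the thresholds below are stand-ins, not the production scoring stack.

```python
from statistics import median

def consistent_score(judge, chunk, n_samples=5, max_spread=2):
    """Score a chunk multiple times; return the median, or None if the
    judge's scores disagree too much (route to human calibration)."""
    scores = [judge(chunk) for _ in range(n_samples)]
    if max(scores) - min(scores) > max_spread:
        return None
    return median(scores)

def stable_judge(chunk):
    return 7  # a deterministic stand-in for an LLM judge

assert consistent_score(stable_judge, "some NPC reply") == 7
```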

2.2 Evaluation Taxonomy of Role-Play Bench

Translating our "misalignment" philosophy into practice, we mapped our three core pillars—Worlds, Stories, and Preferences—to specific failure modes. We do not just look for "good" responses; we rigorously hunt for the specific flaws that break immersion.

Worlds focus on Basics, Logic, and Knowledge errors:


Stories consider Diversity and Content Logic problems:


User Preferences primarily evaluate interaction quality.


2.3 Role-Play Bench Results

2.3.1 Overall Quality

We systematically evaluated mainstream models using Role-Play Bench, focusing exclusively on multi-turn dynamic interaction. The results are definitive: across extended 100-turn sessions, MiniMax-M2-her ranks #1 overall.

Figure 1: Comparison of models' conversational performance on Role-Play Bench.

On the Worlds dimension, MiniMax-M2-her performs best. This result challenges the common assumption that strong general reasoning (found in massive external models) automatically translates to role-play fidelity. Our analysis reveals that beyond "repetition collapse" in some models, the more common failures are reference confusion and physical logic errors. "Reference confusion" often stems from a special but real usage pattern in which users ask the model to play multiple characters, including a narrator. Here is a typical example:

In such multi-character settings, models often attribute dialogue to the wrong character (e.g., the stoic Adrian suddenly speaking with Jax's "gamer trash-talk" style), shattering immersion. MiniMax-M2-her maintains strict separation of voice and identity.

Physical logic errors represent another common failure mode. A typical example: the NPC and User have said goodbye and are walking apart; they should no longer be able to converse. Yet many models allow the dialogue to continue at normal volume as if nothing had happened. MiniMax-M2-her recognizes this physical state change and autonomously introduces narrative bridging, using narration to transition through time or space ("Three hours later...") rather than forcing impossible dialogue.

On the Stories dimension, MiniMax-M2-her ranks fifth among all models, still achieving a relatively high standard. Gemini excels in rich vocabulary, Claude steadily advances the plot, and Doubao delivers vivid expressions. In comparison, MiniMax-M2-her tends to maintain expressive diversity while adopting a plain and natural language style.

On the User Preferences dimension, MiniMax-M2-her excels. It avoids speaking for users while emphasizing responsive intent recognition and natural interaction.

Figure 2: Comparison of models' performance on the Worlds leaderboard.
Figure 3: Comparison of models' performance on the Stories leaderboard.
Figure 4: Comparison of models' performance on the User Preferences leaderboard.


2.3.2 Long-Horizon Quality Degradation Analysis

In role-play scenarios, users often seek deep, immersive experiences. True immersion isn't built in five turns; it's built in fifty or more. This demands long-range consistency and stable output: continuously maintaining character, relationship, and plot coherence. We further analyzed misalignment dimensions across different turn counts and found that MiniMax-M2-her better maintains stability in long conversations:

Figure 5: Evolution of quality and response length across conversation turns by model.
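The per-turn analysis behind a curve like Figure 5 can be computed by bucketing each flagged misalignment by its conversation turn and reporting an error rate per bucket. The sample data below is invented purely to exercise the function.

```python
from collections import defaultdict

def error_rate_by_bucket(events, bucket_size=10):
    """events: list of (turn_index, is_misaligned) pairs.
    Returns {bucket_index: misalignment rate} for each turn bucket."""
    totals, errors = defaultdict(int), defaultdict(int)
    for turn, bad in events:
        bucket = turn // bucket_size
        totals[bucket] += 1
        errors[bucket] += int(bad)
    return {b: errors[b] / totals[b] for b in sorted(totals)}

# Synthetic session: errors start appearing every 5th turn from turn 50 on.
events = [(t, t >= 50 and t % 5 == 0) for t in range(100)]
rates = error_rate_by_bucket(events)
assert rates[0] == 0.0   # early turns are clean
assert rates[5] == 0.2   # turns 50-59: 2 errors in 10 turns
```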

3 How We Built MiniMax-M2-her


This section delineates our method for constructing MiniMax-M2-her. We propose a two-phase alignment strategy. First, we leverage Agentic Data Synthesis to broaden the variety of training data and mitigate misalignment; this establishes a robust baseline for the model's worldview understanding and narrative progression. Second, we use Online Preference Learning to integrate feedback from data specialists and align the model with user preferences; this significantly enhances personalization without compromising the fundamental capabilities established in the first phase.

3.1 Agentic Data Synthesis

To achieve comprehensive coverage across infinite user scenarios while maintaining narrative dynamism, we proposed Agentic Data Synthesis—a dialogue synthesis pipeline driven by a sophisticated agentic workflow. This pipeline is designed to optimize two orthogonal yet equally critical dimensions:

1. Quality. We define quality as adherence to rigorous standards spanning from linguistic fundamentals (grammar, syntax, lexical precision, and coreference resolution) to high-level narrative execution (world knowledge, contextual coherence, plot causality, worldview consistency, and atmospheric nuance).

2. Diversity. Diversity reflects the pipeline's capacity to span a vast manifold of interactions, capturing how distinct user personas engage with heterogeneous characters across varying worldviews, scenarios, and interaction styles.

3.1.1 Synthesis Pipeline Overview

The overall synthetic data pipeline is illustrated in the figure below. The workflow proceeds in four stages:

Figure 6: Synthetic data pipeline overview.

3.1.2 Diversity Guarantees

During synthesis, we improve generation diversity at multiple nodes of the pipeline to prevent mode collapse and ensure broad distribution:


3.1.3 Quality Guarantees

Beyond the BoN sampling strategy, we employ two specialized agents to further improve basic conversation quality and maintain ultra-long-turn conversation consistency:


3.2 Online Preference Learning

Role-Play is a nuanced domain: users rarely draft explicit feature requests like "I prefer slow-burn emotional arcs." Instead, they vote with their behavior, frantically hitting "regenerate" during a rushed action scene or lingering for minutes on a single poignant reply. These implicit behavioral signals, i.e., contextualized preferences, are the key to better role-play performance. To capture them, we use online Reinforcement Learning from Human Feedback (RLHF) to train MiniMax-M2-her to perceive and adapt to these latent preferences in real time.

3.2.1 Online Preference Learning Overview

The entire Online Preference Learning process works as follows:

Figure 7: Online preference learning process overview.

3.2.2 Signal Extraction and Causal Denoising

The effectiveness of online preference learning hinges on whether collected signals reliably indicate generation quality.

One of our past observations is that raw feedback data is extremely noisy. If we train models directly on raw rewards, whether the signals are implicit (turns, session duration) or explicit (likes/dislikes), the models tend to overfit to extreme, low-quality patterns (such as controversial or "clickbait" content) or regress to the population mean, resulting in catastrophic mode collapse.

To solve this, we developed a Causal Denoising Protocol:


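As one generic illustration of reward denoising (not necessarily a step of the actual protocol), an implicit signal such as session turns can be normalized against each character's historical baseline, so that popular NPCs do not dominate the reward simply by attracting longer sessions. The data and field names below are invented.

```python
from statistics import mean

def normalized_reward(sessions):
    """sessions: list of (npc_id, turns) pairs.
    Returns (npc_id, reward) pairs where the reward is the session's turn
    count minus that NPC's mean turn count, removing popularity confounds.
    Illustrative only; NOT the actual Causal Denoising Protocol."""
    by_npc = {}
    for npc, turns in sessions:
        by_npc.setdefault(npc, []).append(turns)
    baselines = {npc: mean(v) for npc, v in by_npc.items()}
    return [(npc, turns - baselines[npc]) for npc, turns in sessions]

data = [("a", 10), ("a", 20), ("b", 100), ("b", 120)]
rewards = normalized_reward(data)
assert rewards[0] == ("a", -5)   # 10 vs. baseline 15
assert rewards[2] == ("b", -10)  # 100 vs. baseline 110
```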
3.2.3 Model Training

With a clean, denoised dataset, we train models using RLHF. Here, the primary risk is entropy degradation: the tendency of the model to lose diversity and converge on a few "safe" patterns. To combat this, we continuously monitor the entropy of the output distribution and apply early stopping the moment we detect a significant drop in diversity. Notably, our experiments show that in Role-Play scenarios, RLHF tends to overfit rapidly, often by the second epoch.
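The entropy-based early-stopping idea can be sketched as follows: track the mean token-level entropy of the policy's output distribution over training and stop once it falls below a fraction of its initial value. The 0.7 floor ratio and the toy distributions are illustrative assumptions, not our production thresholds.

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_stop(entropy_history, floor_ratio=0.7):
    """Stop when entropy drops below floor_ratio of its initial value."""
    if len(entropy_history) < 2:
        return False
    return entropy_history[-1] < floor_ratio * entropy_history[0]

uniform = [0.25] * 4                # diverse policy: maximal entropy
peaked = [0.97, 0.01, 0.01, 0.01]   # collapsed policy: a few "safe" tokens
history = [token_entropy(uniform), token_entropy(peaked)]
assert should_stop(history)         # sharp entropy drop triggers the stop
```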

We then redeploy trained models to the product and let the data team interact for the next round, using better models to collect higher-quality feedback and thereby train a superior next generation. This iterative loop progressively elevates model quality.

4 What’s Next?


If the past three years were defined by the question "How do we make an agent play a character well?", the next era is defined by a far more ambitious one: "How do we let users truly own a world, one they can explore, shape, and watch evolve?"

We call this direction Worldplay. It represents a fundamental shift in the user's role: upgrading them from "entering a pre-set world" to "co-creating the world."

In our Worlds × Stories × User Preferences framework, the Worlds dimension today focuses on following settings and staying logically consistent, ensuring the model adheres to the context without breaking immersion. But once reliable adherence is achieved, the next frontier opens: How do events change future plots? How do user choices alter character destinies? How do we track those changes over hundreds or thousands of turns?

This drives an evolution from static prompt injection to Dynamic World State modeling: structuring entities, relationships, and causal chains so the model can track what happened, what changed, and what might happen next—at 100-turn and 1000-turn scales. For deep users, this is the threshold of true immersion: like an open-world game, they demand hidden variables, emergent consequences, and even branching worldlines.
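A toy sketch of what "Dynamic World State" could mean in code: structured entities, relationships, and a causal event log that a model can query and update, instead of re-reading a static prompt. The field names and the example event are assumptions for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class WorldState:
    """Structured world state: entities, relations, and a causal event log."""
    entities: dict = field(default_factory=dict)   # name -> attribute dict
    relations: dict = field(default_factory=dict)  # (a, b) -> relation label
    events: list = field(default_factory=list)     # ordered causal chain

    def apply_event(self, actor, action, target, effect):
        """Record an event and apply its effect to the target's attributes."""
        self.events.append((actor, action, target))
        self.entities.setdefault(target, {}).update(effect)

world = WorldState()
world.entities["Adrian"] = {"mood": "stoic"}
world.relations[("User", "Adrian")] = "uneasy allies"
world.apply_event("User", "betrays", "Adrian", {"mood": "hostile"})

assert world.entities["Adrian"]["mood"] == "hostile"  # consequence persists
assert world.events[-1] == ("User", "betrays", "Adrian")
```

The design choice here is that consequences live in state, not in prose: a 1000-turn session can replay or summarize `events` to answer "what happened, what changed, and what might happen next."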

Another critical axis of Worldplay is Multi-character Coordination. While today’s role-play is mostly 1v1 bonding, the most compelling narratives are ensemble dramas: users navigate relationships with multiple characters, characters have their own evolving bonds, jealousy and alliances emerge—"living" even when the user is absent. This is an upgrade of the "structural diversity" we discussed in Section 3: the challenge is no longer just "who speaks next," but "how multiple agents share world state, coordinate narrative, and maintain independent personas." In such settings, the risk of Reference Confusion explodes combinatorially.

Ultimately, our goal is simple to articulate but incredibly hard to build: To give you a world you can define, stories that grow with you, and companions that understand you without usurping your agency.

In Worldplay, users aren’t just participants—they’re World Definers: designing factions, planting foreshadowing flags, and pruning the branches of their own reality. This demands stronger planning capabilities—specifically, the intelligence to recognize the flags you plant and ensure they pay off at the perfect dramatic moment, alongside robust consistency guarantees to keep the world state stable across hundreds of turns.

The Planning Agent design in Section 3 is merely the foundation—it proves that models can proactively assess dialogue state and introduce new elements. The next step is a planning layer that tracks world changes across characters and vast time scales, ensuring the world truly comes alive.

We still have a long road to travel before full Worldplay, but the direction is clear.

                                    Worlds to Dream, Stories to Live. Let's go together.