MiniMax
2026.2.13

Forge: Scalable Agent RL Framework and Algorithm

Scaling RL for complex, real-world agents confronts a fundamental trilemma: balancing system throughput, training stability, and agent flexibility. These conflicting constraints have long impeded the application of large-scale RL in industrial-grade systems.

In this post, we reveal how we resolved this "impossible triangle" through a holistic approach in our internal RL framework, Forge, combining a flexible system architecture, algorithmic design, optimized asynchronous scheduling, and extreme training-inference efficiency. By leveraging standardized interaction protocols, Forge supports the training of arbitrary agent scaffolds, enabling the massive-scale RL that culminates in the breakthrough capabilities of the MiniMax M2.5 model. Integrated with our CISPO algorithm and composite reward framework, M2.5 pushes the frontier of efficient and reliable real-world productivity, advancing our mission of "Intelligence with Everyone".

1. Problem Formulation

Before delving into our architectural design, we first formulate the optimization objective of our Agent RL system as the maximization of the Effective Agent Training Yield J(A), defined as:

J(A) = System Throughput × Sample Efficiency,   subject to stability and convergence constraints,

where System Throughput is the raw number of tokens processed per second, bottlenecked by the four components of the RL system: rollout, training, data processing, and I/O. Sample Efficiency is the average performance improvement obtained per sample, determined by data distribution, data quality, algorithm efficiency, and off-policyness. We choose the constraints using proxy indicators for both stability and convergence, as noted in the equation. Achieving maximal J(A) is hindered by three structural challenges, which we explain in detail below.

1.1 Agent Extensibility and Framework Flexibility

Current RL paradigms impose a "Glass Ceiling" on agent complexity due to two structural flaws:



1.2 System Efficiency and Compute Redundancy

Agent rollout completion times exhibit extreme variance, ranging from seconds (simple API calls) to hours (complex reasoning chains). This creates a scheduling deadlock:



1.3 Algorithmic Challenges: Credit Assignment and Optimization Stability

2. System Architecture and Agent RL Paradigm

To alleviate the "Efficiency vs. Off-Policyness" trade-off and minimize redundancy, we introduce the following architectural innovations.

2.1 RL System Design

To achieve a truly scalable architecture, we move beyond specific implementations to a generalized "Middleware" design. This decouples the Agent's reasoning logic from the underlying training infrastructure.

Our RL system is composed of three modules, with a minimal interaction sketch following the list:


Agent Side: This layer abstracts the General Agent—comprising both white-box and black-box architectures—and its operational environment. It orchestrates recursive environmental interactions, allowing the Agent to function as a pure trajectory producer. By decoupling environmental feedback from system overhead, the Agent can focus exclusively on core business logic (such as Context Management and Reasoning Chains), remaining agnostic to the underlying training and inference mechanics.

Middleware Abstraction Layer: Acting as the bridge, this layer physically isolates the Agent Side from the Training/Inference Side; it comprises the Gateway server and the Data Pool.


Training and Inference Side: This layer manages the heavy computational lifting, consisting of the LLM Engine and Train Engine.
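
To make the decoupling concrete, the following Python sketch shows one way such a standardized interaction protocol could look. The names (`Gateway`, `DataPool`, `chat`, `finish`), the in-memory buffering, and the injected `llm_engine` dependency are illustrative assumptions, not the actual Forge API.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Trajectory:
    """A rollout recorded by the middleware, opaque to the agent."""
    messages: list[dict[str, Any]] = field(default_factory=list)
    reward: float | None = None


class DataPool:
    """Buffers finished trajectories for the Train Engine to consume."""
    def __init__(self) -> None:
        self._buffer: list[Trajectory] = []

    def put(self, traj: Trajectory) -> None:
        self._buffer.append(traj)

    def get_batch(self, n: int) -> list[Trajectory]:
        batch, self._buffer = self._buffer[:n], self._buffer[n:]
        return batch


class Gateway:
    """Agent-facing endpoint: agents only ever see a chat-completion call."""
    def __init__(self, llm_engine, pool: DataPool) -> None:
        self.llm_engine = llm_engine   # inference side (assumed to expose .generate)
        self.pool = pool

    def chat(self, traj: Trajectory, messages: list[dict[str, Any]]) -> str:
        reply = self.llm_engine.generate(messages)        # inference call
        traj.messages = messages + [{"role": "assistant", "content": reply}]
        return reply

    def finish(self, traj: Trajectory, reward: float) -> None:
        traj.reward = reward
        self.pool.put(traj)            # hand the trajectory to the training side
```

Because an agent only ever calls `chat` and `finish`, any scaffold that speaks this protocol can be trained without touching the training or inference engines.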


During offline evaluation, we observed significant performance discrepancies attributable to differences in scaffolds. Leveraging the modular design of our RL framework, we can train with an extensive array of scaffolds without modifying the Agent internals, which enables the model to generalize across diverse scaffolds (i.e., environments). Because the architecture fully decouples engines from agents, new agents integrate seamlessly; in total, we have integrated hundreds of scaffold types and thousands of distinct tool-invocation formats.

2.2 White-Box Agent RL for Context Management (CM)

Deep search and multi-step reasoning in long-horizon tasks (e.g., BrowseComp) reveal a fundamental tension between context density and reasoning stability:


To resolve this distribution shift and maintain reasoning fidelity, we integrate the CM mechanism directly into the RL interaction loop, effectively treating Context Management as a functional action that drives state transitions:
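
As a concrete illustration of treating context management as an action, the following Python sketch lets the policy emit a compaction action that rewrites its own context mid-rollout. The action name `compact_context`, the `summarize` helper, and the duck-typed `policy`/`env` objects are all illustrative assumptions, not the exact Forge mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    messages: list = field(default_factory=list)


def summarize(messages, keep_last: int):
    """Placeholder CM primitive: compress older turns into one summary message."""
    head = {"role": "system",
            "content": f"[summary of {len(messages) - keep_last} earlier turns]"}
    return [head] + messages[-keep_last:]


def rollout_step(policy, env, state: AgentState) -> AgentState:
    """One interaction step in which context management is itself an action."""
    action = policy.act(state.messages)          # the policy may emit a CM action

    if action["name"] == "compact_context":
        # The CM action rewrites the agent's own state (a state transition):
        # stale turns are replaced by a model-visible summary.
        state.messages = summarize(state.messages, keep_last=4)
        observation = {"role": "user", "content": "[context compacted]"}
    else:
        # Ordinary environment action (search, tool call, code execution, ...).
        observation = env.step(action)

    state.messages.append(observation)
    return state
```

Because the compaction is emitted by the policy itself, RL credit flows through context-management decisions in the same way it flows through ordinary tool calls.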



2.3 Black-box Agent RL: Robustness Across Heterogeneous Scaffolds

In practical deployment, a significant portion of our user base operates proprietary or complex agent architectures that function as "Black Boxes." We have observed that model performance often varies drastically depending on the underlying agent scaffold, as standard training paradigms fail to generalize across different cognitive architectures. To address this, we validated our framework through a dedicated Black-box Agent Experiment, ensuring consistent optimization regardless of the agent's internal opacity.

3. Engineering Optimization

3.1 Hybrid Scheduling Strategy: Windowed FIFO

To resolve the conflict between System Throughput and Distributional Consistency, we introduce Windowed FIFO. This strategy imposes a sliding constraint on the Training Scheduler, acting as a "middle ground" between strict synchronous ordering and greedy asynchronous execution.

The core logic governs how the Training Scheduler fetches samples from the global generation queue. Even if a large batch of requests (the full Generation Batch Size) is submitted, the scheduler is restricted to a visibility window of a fixed, much smaller size, and may only consume completed samples that fall within that window.
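
A minimal Python sketch of this strategy, under our reading of the window semantics (consume only from the oldest in-flight requests, slide forward as they drain); the class and method names are placeholders:

```python
from collections import deque


class WindowedFIFOScheduler:
    """Training-side scheduler that only 'sees' the oldest `window` requests.

    Finished samples outside the window stay hidden until the window slides
    forward, bounding how far off-policy a consumed sample can be while still
    permitting out-of-order completion inside the window.
    """

    def __init__(self, window: int):
        self.window = window
        self.inflight = deque()        # request ids in submission (FIFO) order
        self.finished = {}             # request id -> completed trajectory

    def submit(self, request_id):
        self.inflight.append(request_id)

    def complete(self, request_id, trajectory):
        self.finished[request_id] = trajectory

    def fetch(self):
        """Yield trajectories visible through the current window."""
        visible = list(self.inflight)[: self.window]
        ready = [rid for rid in visible if rid in self.finished]
        for rid in ready:
            self.inflight.remove(rid)  # the window slides as visible requests drain
            yield self.finished.pop(rid)
```

Setting the window to the full generation batch size recovers strict synchronous ordering, while an unbounded window degenerates to greedy asynchronous execution; the window size therefore tunes the trade-off between throughput and distributional consistency.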

3.2 Accelerating Agent Trajectory Training with Prefix Tree Merging

When training agents, datasets typically consist of long multi-turn dialogue samples that, structurally, exhibit a high degree of prefix overlap.

The Challenge: Redundancy in Traditional Methods


Prefix Tree Merging

To eliminate this redundancy, we propose a Prefix Tree Merging scheme, transforming the training process from "linear processing" to a "tree-structured" approach.


By eliminating redundant prefix prefilling, this solution achieves a 40x training speedup and significantly reduces memory overhead to support longer sequences or larger batch sizes, all while guaranteeing strict mathematical equivalence to standard methods with zero impact on loss computation or metrics.
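
To make the scheme concrete, here is an illustrative Python sketch of merging trajectories that share prefixes into a trie, so that each shared prefix is processed once; the turn-level data layout is an assumption, not the exact Forge implementation.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # segment (e.g., a turn) -> TrieNode
        self.sample_ids = []    # samples whose trajectory ends at this node


def build_prefix_tree(samples):
    """samples: list of (sample_id, segments), where segments is the turn-level
    decomposition of a trajectory. Shared leading segments collapse into one path."""
    root = TrieNode()
    for sample_id, segments in samples:
        node = root
        for seg in segments:
            node = node.children.setdefault(seg, TrieNode())
        node.sample_ids.append(sample_id)
    return root


def count_segments(node):
    """Total segments processed when walking the tree (each shared prefix once)."""
    return sum(1 + count_segments(child) for child in node.children.values())


# Toy example: three trajectories sharing the same two-turn prefix.
samples = [
    ("a", ("sys", "u1", "r1")),
    ("b", ("sys", "u1", "r2")),
    ("c", ("sys", "u1", "r1", "u2")),
]
tree = build_prefix_tree(samples)
naive = sum(len(segs) for _, segs in samples)   # 10 segments prefilled linearly
merged = count_segments(tree)                   # 5 unique trie nodes
print(f"linear: {naive} segments, tree: {merged} segments")
```

In the toy example, three trajectories that would require 10 segment prefills when processed linearly collapse to 5 unique trie nodes.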

3.3 Extreme Inference Acceleration

We optimize the generation pipeline through three architectural innovations:

4. Scalable Agent RL Algorithm

4.1 RL Algorithm

We leverage CISPO as the core algorithm, specifically adapted for the characteristics of long-horizon Agents.
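
For context, a hedged sketch of a CISPO-style objective: the token-level importance-sampling ratio is clipped and applied under a stop-gradient as a coefficient on a policy-gradient loss, rather than clipping the token update itself. The normalization and clipping bounds below are indicative and may differ from the production configuration.

```latex
% CISPO-style objective (sketch): clip the IS weight, not the token update,
% and use it as a stop-gradient coefficient on a REINFORCE-style loss.
J_{\mathrm{CISPO}}(\theta)
  = \mathbb{E}\left[
      \frac{1}{\sum_i |o_i|}\sum_i\sum_t
      \operatorname{sg}\!\big(\hat{r}_{i,t}(\theta)\big)\,
      \hat{A}_{i,t}\,
      \log \pi_\theta\big(o_{i,t}\mid q,\, o_{i,<t}\big)
    \right],
\quad
\hat{r}_{i,t}(\theta)
  = \operatorname{clip}\!\left(
      \frac{\pi_\theta(o_{i,t}\mid q,\, o_{i,<t})}
           {\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,\, o_{i,<t})},\;
      1-\varepsilon_{\mathrm{low}},\; 1+\varepsilon_{\mathrm{high}}
    \right)
```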

4.2 Dense and Efficiency-Aware Reward

We propose a composite reward framework designed to tackle the credit-assignment challenges of ultra-long contexts (up to 200k tokens) while ensuring training stability:
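
Purely as an illustration of how an efficiency-aware term can be combined with an outcome reward (the dense per-step components are omitted, and every name, budget, and weight below is an assumption rather than the production reward):

```python
def composite_reward(outcome: float, steps: int, tokens: int,
                     step_budget: int = 50, token_budget: int = 200_000,
                     w_eff: float = 0.1) -> float:
    """Combine a verifiable outcome reward with an efficiency term.

    outcome: task-level score in [0, 1] (e.g., test pass rate or answer match).
    steps/tokens: resources actually consumed by the trajectory.
    """
    # Efficiency-aware shaping: unused budget earns a small bonus, overruns a penalty.
    step_eff = 1.0 - min(steps / step_budget, 2.0)
    token_eff = 1.0 - min(tokens / token_budget, 2.0)
    efficiency = 0.5 * (step_eff + token_eff)

    # Outcome dominates; efficiency only modulates it, so a trajectory cannot
    # profit from being short but wrong.
    return outcome * (1.0 + w_eff * efficiency)
```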

5. Conclusions

We have successfully addressed the "impossible triangle" of scaling RL for agents. Through Forge, we achieved a breakthrough in RL system throughput while ensuring robust generalization across arbitrary Agent scaffolds. By integrating this flexible architecture with our stable CISPO algorithm, we enabled the massive-scale training behind MiniMax M2.5. This holistic approach overcomes previous constraints, delivering efficient, real-world agent capabilities and advancing our mission of "Intelligence with Everyone."