MiniMax-M1 Technical Seminar: Deep Dive into RL, Hybrid Architecture, and What's Next
MiniMax hosted its first global technical seminar to introduce M1, the world's first open-weight large-scale hybrid attention reasoning model. The seminar brought together leading researchers and industry practitioners from around the globe to discuss breakthrough innovations in model architecture, reinforcement learning, and practical deployment challenges.
MiniMax-M1 Technical Keynote
Jiayuan Song – Technical Staff, MiniMax
- Scaling for Performance: MiniMax successfully scaled the M1 model to support a 1M-token context length and an 80K-token output. This enabled efficient handling of tasks involving extended reasoning and long-form inputs.
- Hybrid Attention Design: M1 is a Mixture-of-Experts (MoE) model featuring 32 experts, with 456B total parameters and 45.9B activated per token. Its hybrid architecture interleaves linear attention with standard softmax attention (see the layer-stack sketch after this list), significantly reducing computational load while preserving inference quality.
- CISPO RL Algorithm: Song introduced the "Clipped IS-weight Policy Optimization" (CISPO) algorithm, designed to overcome limitations of traditional PPO-style methods. By clipping importance weights rather than the full policy update, CISPO keeps gradient contributions from all tokens and accelerates training (see the CISPO sketch below). Integrated with the hybrid model, this approach achieved DAPO-level performance in half the training time. An FP32 upgrade to the LM head was necessary to address numerical precision issues during training.
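To make the hybrid design concrete, below is a minimal, self-contained sketch of a decoder stack that interleaves linear-attention blocks with periodic softmax-attention blocks, and routes each token through a top-k subset of MoE experts. Everything here (the every-8th-layer softmax placement, the toy `LinearAttention` and `TopKMoE` modules, all dimensions) is an illustrative assumption, not MiniMax's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    """Toy linear attention (non-causal, for brevity): O(n) cost via the
    phi(Q) @ (phi(K)^T V) factorization instead of an n x n attention matrix."""
    def __init__(self, d):
        super().__init__()
        self.qkv = nn.Linear(d, 3 * d)

    def forward(self, x):                                    # x: (batch, n, d)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = F.elu(q) + 1, F.elu(k) + 1                    # positive feature map
        kv = torch.einsum("bnd,bne->bde", k, v)              # fixed-size K^T V summary
        z = q @ k.sum(dim=1).unsqueeze(-1) + 1e-6            # normalizer, (batch, n, 1)
        return torch.einsum("bnd,bde->bne", q, kv) / z

class TopKMoE(nn.Module):
    """Toy top-k MoE feed-forward: each token activates only k of the experts."""
    def __init__(self, d, num_experts=32, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for _ in range(num_experts))

    def forward(self, x):
        scores = self.gate(x).softmax(dim=-1)                # router probabilities
        topv, topi = scores.topk(self.k, dim=-1)             # keep top-k experts/token
        out = torch.zeros_like(x)
        for j, expert in enumerate(self.experts):            # dense loop: sketch only
            w = (topv * (topi == j)).sum(-1, keepdim=True)   # weight if routed, else 0
            out = out + w * expert(x)
        return out

class HybridDecoder(nn.Module):
    """Interleaves linear-attention blocks with periodic softmax-attention blocks."""
    def __init__(self, num_layers=8, d=256, softmax_every=8):
        super().__init__()
        self.attn, self.ffn = nn.ModuleList(), nn.ModuleList()
        for i in range(num_layers):
            if (i + 1) % softmax_every == 0:                 # periodic full attention
                self.attn.append(nn.MultiheadAttention(d, num_heads=4, batch_first=True))
            else:
                self.attn.append(LinearAttention(d))
            self.ffn.append(TopKMoE(d))

    def forward(self, x):
        for attn, ffn in zip(self.attn, self.ffn):
            if isinstance(attn, nn.MultiheadAttention):
                x = x + attn(x, x, x, need_weights=False)[0]
            else:
                x = x + attn(x)
            x = x + ffn(x)                                   # sparse MoE FFN sublayer
        return x

x = torch.randn(2, 128, 256)
print(HybridDecoder()(x).shape)                              # torch.Size([2, 128, 256])
```

The point of the interleave is that most layers pay only linear attention cost over long contexts, while the periodic softmax layers retain exact global token-to-token retrieval.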
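The CISPO mechanism itself fits in a few lines. In PPO-style objectives, tokens whose importance ratio falls outside the clip range contribute zero gradient; CISPO instead clips and detaches the importance weight, so every token still passes a gradient through its log-probability. A sketch under assumed names and an illustrative clip bound:

```python
import torch

def cispo_loss(logp_new, logp_old, advantages, eps_high=2.0, mask=None):
    """Illustrative CISPO-style objective (names and clip bound are assumptions).

    logp_new:   log pi_theta(token) under the current policy, (batch, seq)
    logp_old:   log pi_old(token) under the behavior policy, (batch, seq)
    advantages: per-token advantage estimates, (batch, seq)
    """
    # Token-level importance weight between the current and behavior policies.
    is_weight = torch.exp(logp_new - logp_old.detach())
    # Key difference from PPO: clip the IS weight itself and stop its gradient.
    # The gradient then flows through logp_new for *every* token, rather than
    # being zeroed for tokens whose ratio falls outside the clip range.
    w = torch.clamp(is_weight, max=eps_high).detach()
    per_token = w * advantages * logp_new
    if mask is not None:                       # ignore padding / prompt tokens
        return -(per_token * mask).sum() / mask.sum().clamp(min=1)
    return -per_token.mean()
```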
Panel 1 - Architecture Evolution and RL Frontiers
- Junxian He – Assistant Professor, HKUST
- Songlin Yang – PhD Student, MIT CSAIL
- Shunyu Yao – Research Scientist, Anthropic
- Wenhu Chen – Assistant Professor, University of Waterloo
- Hybrid Attention as the Future: Songlin Yang emphasized the hybrid attention model as a strategic balance between the memory constraints of linear attention and the flexibility of full softmax attention.
- Hardware-Aware Algorithm Design: The panel highlighted a paradigm shift toward algorithms optimized not just for theory but also for hardware efficiency. Songlin referenced the "FlashAttention effect," illustrating that high-performance implementations often outperform theoretically superior designs.
- Scaling RL: Shunyu Yao discussed RL's capacity to teach new abilities even when the base model lacks them, and highlighted the ongoing challenge of reward modeling for tasks that lack ground truth.
- Data Quality Challenges: Wenhu Chen noted the contrast between SFT and RL data needs: while SFT can scale with moderate-quality data, RL requires high-quality signals. Over-reliance on math and coding tasks for RL data can lead to hallucination issues in factual domains.
- Looking Forward: Key research frontiers identified include visual reasoning, more expressive and reliable RL reward functions, scalable infrastructure for linear attention, and advances in multi-agent collaboration. Together, these areas are expected to shape the next wave of general-purpose AI systems.
Panel 2 - Hybrid Architecture and Long Context Inference
- Arthur Zucker – Head of Transformers, Hugging Face
- Kaichao You – Core Developer, vLLM
- Akim Tsvigun – Senior ML Solutions Architect, Nebius
- Arjun Krishna – AI Researcher, Writer
- Liangsheng Yin – Core Developer, SGLang
- Nick Comly – Inference Product Lead, NVIDIA
- Hybrid Architecture Evolution
  - Mainstreaming Hybrid Architectures: Hybrid attention is becoming an industry standard due to its balance of scalability, latency reduction, and memory efficiency. Akim Tsvigun shared that M1 reduced first-token latency from 60s to 4–5s for 100k-token inputs (the sketch after this list illustrates why linear-attention prefill scales so well).
  - Evolving Infrastructure: Nick Comly shared NVIDIA's optimization efforts, including custom kernels for lightning attention, contributions to open-source tools like FlashAttention, and work on speculative decoding to improve generation speed at low concurrency. The SGLang and vLLM teams focus on sequence parallelism for linear attention, ring attention for softmax attention, and dynamic routing for MoE. Techniques like cache-aware scheduling, cache reuse, and batch overlapping were also cited. Infrastructure efforts are now shifting toward training and inference pipelines tailored to hybrid models, as existing systems favor traditional transformers.
- Inference and Long-Context Applications
  - Real-World Use Cases and Value: Use cases for long-context inference are growing, especially in legal, medical, and financial sectors where maintaining document coherence is critical. M1's ability to handle full-document inference in a single pass is a game-changer.
  - Ongoing Challenges: Arjun Krishna pointed to two major hurdles: high latency for first-token generation and attention drift over long sequences. He also emphasized the need for evaluation metrics that go beyond "needle-in-a-haystack" tasks.
  - What's Next: The panel envisions hybrid models transforming AI agents that operate over multi-step workflows, with infrastructure and applications co-evolving to meet enterprise demand.
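The latency improvement reported above follows from the prefill complexity of the two attention families: softmax attention must score all token pairs (quadratic in prompt length), while linear attention carries a fixed-size running state updated once per token or chunk (linear in prompt length). A toy causal-prefill sketch, with hypothetical names:

```python
import torch
import torch.nn.functional as F

def linear_attention_prefill(q, k, v):
    """Toy causal linear-attention prefill: O(n) time, fixed-size state.

    Processes the prompt once while carrying a d x d state instead of an
    n x n attention matrix, which is why time-to-first-token stays nearly
    flat as the prompt grows. Shapes: q, k, v are (n, d).
    """
    q, k = F.elu(q) + 1, F.elu(k) + 1       # positive feature map
    n, d = q.shape
    state = torch.zeros(d, v.shape[-1])     # running sum of outer(k_t, v_t)
    norm = torch.zeros(d)                   # running sum of k_t
    out = torch.empty_like(v)
    for t in range(n):                      # real kernels process chunks, not tokens
        state += torch.outer(k[t], v[t])
        norm += k[t]
        out[t] = (q[t] @ state) / (q[t] @ norm)
    return out, (state, norm)               # state is reused token-by-token at decode

q, k, v = (torch.randn(1024, 64) for _ in range(3))
out, cache = linear_attention_prefill(q, k, v)   # one linear pass over the prompt
```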
Q&A - Technical Details and Future Roadmap
- Model Differentiation: The development of M1 was driven by internal use cases demanding long-term memory, ranging from roleplay scenarios involving hundreds of conversational turns to processing and querying entire books or extensive research papers. With support for over 1M input tokens and up to 80K output tokens, M1 enables an unprecedented scope of reasoning and contextual understanding.
- Scaling RL with Hybrid Attention: The team faced initial plateaus when scaling token output length in RL. Investigation revealed extreme layer activations, resolved by upgrading the LM head to FP32 precision (see the sketch after this list). Hybrid models may require more training steps than softmax-only counterparts, but gain in generalization.
- Multimodal Vision: M1 is part of MiniMax's broader multimodal foundation model strategy, alongside Hailuo (video) and Speech-02 (audio). This unified direction aims to empower diverse content creation across industries.
- System 2 Reasoning: Extended reasoning brings two key effects: first, it substitutes for manual prompt engineering by enabling the model to autonomously plan and structure its reasoning; second, it makes fuller use of test-time computation by allocating more inference-time effort to deeply process and analyze the task at hand.
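One way to picture the FP32 fix discussed above: keep the output projection and softmax in float32 while the rest of the network runs in bfloat16, so that token log-probabilities stay numerically faithful during training. A minimal sketch; the module name and shapes are ours, not MiniMax's:

```python
import torch
import torch.nn as nn

class FP32LMHead(nn.Module):
    """Keeps the output projection and log-softmax in float32 even when the
    trunk of the model runs in bfloat16, avoiding the precision drift in
    token probabilities described above."""
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.proj = nn.Linear(hidden_size, vocab_size, bias=False)
        self.proj.float()                              # parameters stay in fp32

    def forward(self, hidden_states):
        logits = self.proj(hidden_states.float())      # upcast bf16 activations
        return torch.log_softmax(logits, dim=-1)       # fp32 log-probabilities

head = FP32LMHead(1024, 32000)
h = torch.randn(2, 8, 1024, dtype=torch.bfloat16)      # bf16 trunk output
print(head(h).dtype)                                   # torch.float32
```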
Looking Back and Forward
The MiniMax-M1 Seminar showcased a significant leap forward in large language model innovation. By fusing hybrid attention with efficient scaling and robust reinforcement learning techniques, M1 pushes the boundaries of what large models can accomplish. Looking Ahead: Our mission is to build intelligence with everyone, advancing toward AGI in a scalable way. From text, speech, and video to agentic applications, we're also exploring multimodal models that combine understanding and generation. We'll keep pushing the boundaries of architecture while making powerful AI accessible from the start.