
Today, we’re thrilled to introduce MiniMax Speech 2.6 — our latest speech model, bringing comprehensive upgrades with ultra-low latency, enhanced format handling, and a more natural, human-like voice for Voice Agent scenarios.
Since its launch, MiniMax Speech has become a core piece of infrastructure in the global voice intelligence landscape, known for its outstanding speech technology and exceptional cost-effectiveness.
LiveKit, which powers ChatGPT's advanced voice mode, the popular open-source framework Pipecat on GitHub, and the YC-incubated voice platform Vapi have all chosen MiniMax Speech as their underlying technology engine. In the smart hardware sector, innovative products like Haivivi Bubble Pal, Fuzozo, and Rokid Glasses also rely on MiniMax Speech to deliver natural voice interaction.
MiniMax continues to drive new forms of productivity through technological innovation, breaking down the barriers of language and culture to deliver natural, fluent interactions that connect every voice around the world.
1. Ultra-Low Latency, More Responsive: For Smoother Overall Interaction
We have optimized the entire audio generation pipeline, bringing end-to-end latency below 250 milliseconds, which places it in the industry's top tier. In scenarios with strict response time requirements, such as real-time conversations, audio generation is no longer the bottleneck, so the overall interaction feels smoother.
Listen to Speech 2.6 acting as an AI customer service agent:
2. Seamless Handling of Specialized Formats, Smarter: For More Fluid Information Delivery
Speech 2.6 now directly converts non-standard text formats in multiple languages, including URLs, email addresses, phone numbers, dates, and monetary amounts. Whether you are using it with a large language model or need to process dynamically changing entity information in your business, you no longer need to perform tedious text pre-processing. The input is read correctly from the start, enabling more fluid information delivery.
For example, to read the passage below correctly, a traditional TTS system would first require a series of conversions:
- +1 415 415 9921 → “plus one, four one five, four one five, nine nine two one”
- $1,234.56 → “one thousand two hundred thirty-four dollars and fifty-six cents”
- 192.168.1.1 → “one nine two dot one six eight dot one dot one”
- 2032-5-6 → “May sixth, twenty thirty-two”
- support-vip@technet.com → “support dash vip at technet dot com”
Original Text: "Hello Oliver Smith, I'm your intelligent virtual assistant Max! Thank you for your call. I've found your file. The outstanding balance for the phone number +1 415 415 9921 is $1,234.56. The associated IP address is 192.168.1.1. Your next payment is due on 2032-5-6. If you have any questions, please contact support-vip@technet.com."
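To give a sense of the pre-processing this replaces, here is a minimal Python sketch of a rule-based normalizer for three of the five formats above (phone number, IP address, email address). It is an illustration only, not how Speech 2.6 or any particular TTS system works internally; the function names, the regular expressions, and the email address `support-vip@technet.com` (inferred from its spoken form in the list) are assumptions made for this example. Dates and monetary amounts are left out because they need number-to-words and locale rules that would not fit in a short sketch.

```python
import re

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def spell_digits(s: str) -> str:
    """Read a run of digits one digit at a time."""
    return " ".join(DIGITS[int(ch)] for ch in s)


def speak_phone(m: re.Match) -> str:
    # "+1 415 415 9921" -> digit groups separated by short pauses (commas)
    groups = m.group(0).lstrip("+").split()
    return "plus " + ", ".join(spell_digits(g) for g in groups)


def speak_ip(m: re.Match) -> str:
    # "192.168.1.1" -> each octet spelled digit by digit, joined by "dot"
    return " dot ".join(spell_digits(part) for part in m.group(0).split("."))


def speak_email(m: re.Match) -> str:
    # Replace the symbols an engine would otherwise skip or misread
    text = m.group(0)
    for symbol, word in (("@", " at "), (".", " dot "), ("-", " dash ")):
        text = text.replace(symbol, word)
    return text


# Hypothetical rule table: each pattern is paired with its verbalizer.
RULES = [
    (re.compile(r"\+\d+(?: \d+)+"), speak_phone),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), speak_ip),
    (re.compile(r"[\w.-]+@[\w-]+(?:\.[\w-]+)+"), speak_email),
]


def normalize(text: str) -> str:
    """Apply every rule in order and return the speakable text."""
    for pattern, speak in RULES:
        text = pattern.sub(speak, text)
    return text


print(normalize("+1 415 415 9921"))
# plus one, four one five, four one five, nine nine two one
print(normalize("192.168.1.1"))
# one nine two dot one six eight dot one dot one
print(normalize("support-vip@technet.com"))
# support dash vip at technet dot com
```

Even this small sketch needs one hand-written rule per format, and it covers a single language. That per-format, per-language upkeep is the work that becomes unnecessary when the model reads such input correctly on its own.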
3. Greater Naturalness and Fluent LoRA: For More Fluent Vocal Expression
In addition to further enhancing prosodic naturalness, Speech 2.6 also introduces Fluent LoRA.
Speech 2.5 already offered a convenient, high-fidelity voice cloning feature that allowed users to preserve the unique characteristics of the original voice, such as accents and speech habits. This capability met the diverse voice needs of real-world application scenarios.
Now, you no longer have to worry about imperfect source material when cloning a voice. Even when the source recording comes from a non-native speaker and is accented or disfluent, Fluent LoRA can replicate the voice's timbre while generating fluent, natural speech that matches the target text, making your vocal expression more articulate.
Besides the English example shown in the video, this feature enables one-click fluency for voice cloning across the 40+ languages the model supports. Here is an example in a Japanese scenario:
Speech 2.6 is now fully live. Try it out:
MiniMax Open Platform:
https://www.minimax.io/platform_overview
MiniMax Audio:
https://www.minimax.io/audio
Intelligence with Everyone.

