Speech Synthesis (MiMo-V2-TTS)
Speech Synthesis (Text-to-Speech) automatically converts input text into natural and fluent speech output. You can configure parameters such as speech style to generate expressive and vivid speech content.
Core Capabilities
- Built-in voices: Ready-made default voices let you get started quickly.
- Diverse speech styles: You can specify a speech style to make the voice more vivid and natural.
Supported Models
Only the mimo-v2-tts model is currently supported.
Preparation
To obtain an API Key and complete the other setup steps, see First API Call.
Available Built-in Voices
Set the built-in voice through the voice field of the audio object in the request, for example {"audio": {"voice": "mimo_default"}}.
| Voice Name | Voice Parameter |
|---|---|
| MiMo-Default | mimo_default |
| MiMo-Chinese Female Voice | default_zh |
| MiMo-English Female Voice | default_en |
Currently, voice cloning is not supported.
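As a rough illustration, the voice table above can be mirrored in a small lookup so that a mistyped voice parameter is caught before a request is sent. The names `BUILT_IN_VOICES` and `audio_params` are hypothetical helpers, not part of the API; only the three voice parameters and the `format` and `voice` fields come from this page.

```python
# Built-in voices from the table above: voice parameter -> display name.
BUILT_IN_VOICES = {
    "mimo_default": "MiMo-Default",
    "default_zh": "MiMo-Chinese Female Voice",
    "default_en": "MiMo-English Female Voice",
}


def audio_params(voice: str = "mimo_default", fmt: str = "wav") -> dict:
    """Return the `audio` object for a request (hypothetical helper)."""
    if voice not in BUILT_IN_VOICES:
        raise ValueError(
            f"Unknown voice {voice!r}; expected one of {sorted(BUILT_IN_VOICES)}"
        )
    return {"format": fmt, "voice": voice}


print(audio_params("default_en"))
```

The returned dictionary can be passed as the `audio` argument of the request shown in the code samples on this page.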
Style Control
Overall Voice Style Control
Place <style>style</style> at the beginning of the text to be synthesized, where style is the speech style you want. To apply several styles, put all of the style names inside the same <style> tag; any separator may be used between them.
Format example: <style>Style 1 Style 2</style>Content to be synthesized.
The following styles are recommended; styles that are not in the list are also supported.
| Style Type | Style Example |
|---|---|
| Speech rate control | Speed up / Slow down |
| Emotional changes | Happy / Sad / Angry |
| Role-playing | Sun Wukong / Lin Daiyu |
| Style change | Whisper / Clamped voice / Taiwanese accent |
| Dialect | Northeastern dialect / Sichuan dialect / Henan dialect / Cantonese |
Sample:
- <style>Happy</style>Tomorrow is Friday, so happy!
- <style>Whisper</style>Oh my goodness, it's so cold today! You know that wind, it's howling like a knife, cutting into your face!
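The tag format described above can be sketched as a small string helper. `with_style` is a hypothetical name used for illustration; it joins the style names with a space, which is one arbitrary choice given that this page places no restriction on the separator.

```python
def with_style(text: str, *styles: str) -> str:
    """Prefix `text` with one <style> tag holding all style names (hypothetical helper)."""
    names = [s.strip() for s in styles if s.strip()]
    if not names:
        # No style requested: return the text unchanged.
        return text
    # All style names go inside the same tag, separated by a space here.
    return f"<style>{' '.join(names)}</style>{text}"


print(with_style("Tomorrow is Friday, so happy!", "Happy"))
print(with_style("It's so cold today!", "Whisper", "Slow down"))
```

The result is intended to be used as the content of the assistant message in a request.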
Fine-grained Control of Audio Tags
With [Audio Tags], you can control the voice at a fine-grained level and adjust tone, emotion, and expression style within a sentence, whether that is a whisper, a hearty laugh, or a mildly annoyed rant. You can also insert breaths, pauses, coughs, and similar sounds, and adjust the speaking speed so that each sentence has a suitable rhythm.
Sample:
- Achoo! Ahem. I—I really [cough] think I am coming down with a terrible [cough] terrible cold.
- [heavy breathing] Just... give me... a second. I ran... all the way... from the station.
- I just feel... [long sigh]... like I'm constantly treading water, you know?
- It's just so stupid! (sobbing) We spent all that money on the cake and the dog just... (sudden laugh) he just ate the whole thing in one bite!
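When preparing text with inline cues like the samples above, it can help to list the cues before sending a request. The sketch below assumes that a cue is written either as `[cue]` or as `(cue)`; that syntax is inferred only from the samples on this page and is not a documented grammar. `CUE_PATTERN` and `find_cues` are hypothetical names.

```python
import re

# Assumed cue syntax, inferred only from the samples: [cue] or (cue).
CUE_PATTERN = re.compile(r"\[([^\[\]]+)\]|\(([^()]+)\)")


def find_cues(text: str) -> list[str]:
    """Return the inline cues found in `text`, in order of appearance."""
    return [m.group(1) or m.group(2) for m in CUE_PATTERN.finditer(text)]


sample = "It's just so stupid! (sobbing) We spent all that money... (sudden laugh) he ate it!"
print(find_cues(sample))
print(find_cues("I really [cough] think I am coming down with a terrible [cough] cold."))
```

Note that ordinary parenthetical remarks in the text would also match this pattern, so the output is only a preview aid and not a validator.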
Code Sample
Notes
- The target text for speech synthesis must be placed in a message with role: assistant, not in a message with role: user.
- The user message is optional, but including it is recommended; in some scenarios it can adjust the tone and style of the synthesized speech.
- To specify the speech style, place <style>style</style> at the beginning of the target text.
- For a better singing style, add only the tag <style>唱歌</style> at the very beginning of the target text, in the format <style>唱歌</style>lyrics. The following tag values are supported and have equivalent effect: 唱歌, sing, singing.
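The notes above can be sketched as two small helpers: one that always places the text to synthesize in the assistant message, and one that prefixes lyrics with a singing tag and nothing else. `build_messages`, `singing_text`, and `SING_TAGS` are hypothetical names for illustration and are not part of the API.

```python
from typing import Optional

# Equivalent singing tag values listed in the notes above.
SING_TAGS = ("唱歌", "sing", "singing")


def build_messages(target_text: str, user_text: Optional[str] = None) -> list:
    """Put the text to synthesize in the assistant message (hypothetical helper)."""
    if not target_text:
        raise ValueError("target_text must not be empty")
    messages = []
    if user_text:
        # Optional context that may influence tone and style.
        messages.append({"role": "user", "content": user_text})
    # The assistant message always carries the text to be spoken.
    messages.append({"role": "assistant", "content": target_text})
    return messages


def singing_text(lyrics: str, tag: str = "sing") -> str:
    """Prefix lyrics with a singing tag and no other style."""
    if tag not in SING_TAGS:
        raise ValueError(f"tag must be one of {SING_TAGS}")
    return f"<style>{tag}</style>{lyrics}"


print(build_messages("Yes, I had a sandwich.", "Hello, MiMo, have you had lunch?"))
print(singing_text("Twinkle, twinkle, little star"))
```

The list returned by `build_messages` has the same shape as the `messages` field used in the calls below.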
Non-streaming Call
Curl
curl --location --request POST 'https://api.xiaomimimo.com/v1/chat/completions' \
  --header "api-key: $MIMO_API_KEY" \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "model": "mimo-v2-tts",
    "messages": [
      {
        "role": "user",
        "content": "Hello, MiMo, have you had lunch?"
      },
      {
        "role": "assistant",
        "content": "Yes, I had a sandwich."
      }
    ],
    "audio": {
      "format": "wav",
      "voice": "mimo_default"
    }
  }'
Python
import base64
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("MIMO_API_KEY"),
    base_url="https://api.xiaomimimo.com/v1"
)

completion = client.chat.completions.create(
    model="mimo-v2-tts",
    messages=[
        {
            "role": "user",
            "content": "Hello, MiMo, have you had lunch?"
        },
        {
            # The assistant message carries the text to synthesize
            "role": "assistant",
            "content": "Yes, I had a sandwich."
        }
    ],
    audio={
        "format": "wav",
        "voice": "mimo_default"
    }
)

# Decode the base64 audio payload and save it as a WAV file
message = completion.choices[0].message
audio_bytes = base64.b64decode(message.audio.data)
with open("audio_file.wav", "wb") as f:
    f.write(audio_bytes)
Streaming Call
Notes
- When using streaming calls, set the output audio format to pcm16 so that the chunks can be spliced into a complete audio file. For a splicing example, see the Python sample in this section.
Curl
curl --location --request POST 'https://api.xiaomimimo.com/v1/chat/completions' \
  --header "api-key: $MIMO_API_KEY" \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "model": "mimo-v2-tts",
    "messages": [
      {
        "role": "assistant",
        "content": "You are UN-BE-LIEVABLE! I am sooooo done with your constant lies. GET. OUT!"
      }
    ],
    "audio": {
      "format": "pcm16",
      "voice": "default_en"
    },
    "stream": true
  }'
Python
import base64
import os

import numpy as np
import soundfile as sf
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("MIMO_API_KEY"),
    base_url="https://api.xiaomimimo.com/v1"
)

completion = client.chat.completions.create(
    model="mimo-v2-tts",
    messages=[
        {
            "role": "assistant",
            "content": "You are UN-BE-LIEVABLE! I am sooooo done with your constant lies. GET. OUT!"
        }
    ],
    audio={
        "format": "pcm16",
        "voice": "default_en"
    },
    stream=True
)

# 24kHz PCM16LE mono audio
collected_chunks: np.ndarray = np.array([], dtype=np.float32)

for chunk in completion:
    # Skip chunks that carry no choices
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    audio = getattr(delta, "audio", None)
    if audio is not None:
        assert isinstance(audio, dict), f"Expected audio to be a dict, got {type(audio)}"
        pcm_bytes = base64.b64decode(audio["data"])
        # Convert 16-bit integer samples to floats in [-1.0, 1.0)
        np_pcm = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32) / 32768.0
        collected_chunks = np.concatenate((collected_chunks, np_pcm))
        print(f"Received audio chunk of size {len(pcm_bytes)} bytes")

# Save the collected audio to a file
os.makedirs("tmp", exist_ok=True)
sf.write("tmp/output.wav", collected_chunks, samplerate=24000)
print("Audio saved to tmp/output.wav")
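If NumPy and soundfile are not available, the splicing step can be approximated with the standard library alone. The sketch below joins the raw decoded chunks as bytes and writes them with the `wave` module; it assumes the stream is 24 kHz, 16-bit little-endian, mono, as stated in the comment in the sample above. `write_pcm16_wav` is a hypothetical helper, and the silent `fake_chunks` stand in for the base64-decoded chunks of a real stream.

```python
import wave

SAMPLE_RATE = 24000  # assumed from the "24kHz PCM16LE mono" comment in the sample above


def write_pcm16_wav(path: str, chunks: list, sample_rate: int = SAMPLE_RATE) -> float:
    """Join raw PCM16LE mono chunks, write a WAV file, return the duration in seconds."""
    pcm = b"".join(chunks)
    if len(pcm) % 2:
        raise ValueError("PCM16 data must contain an even number of bytes")
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return len(pcm) / 2 / sample_rate


# Stand-in for decoded stream chunks: two blocks of 0.5 s of silence each.
fake_chunks = [b"\x00\x00" * 12000, b"\x00\x00" * 12000]
print(write_pcm16_wav("output_stdlib.wav", fake_chunks))
```

Joining the bytes before interpreting them as samples also tolerates a chunk boundary that falls in the middle of a 16-bit sample, as long as the total length is even.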
Price
- Billing: Free for a limited time.
- View Bill: You can view your usage on the Billing page in the Console.