Synthesize speech
Receive results after all text input has been processed.
curl --location 'https://api.inworld.ai/tts/v1/voice' \
  --header "Authorization: Basic $INWORLD_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "text": "Hello, world! What a wonderful day to be a text-to-speech model!",
    "voiceId": "Dennis",
    "modelId": "inworld-tts-2",
    "audioConfig": {
      "audioEncoding": "LINEAR16",
      "sampleRateHertz": 22050
    },
    "deliveryMode": "BALANCED",
    "applyTextNormalization": "ON"
  }'

{
  "audioContent": "UklGRiRQAQBXQVZFZm1...",
  "usage": {
    "processedCharactersCount": 64,
    "modelId": "inworld-tts-2"
  }
}

Documentation Index
Fetch the complete documentation index at: https://docs.inworld.ai/llms.txt
Use this file to discover all available pages before exploring further.
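For readers calling the endpoint from code rather than curl, the request above could be built roughly as follows. This is a minimal sketch using only the Python standard library; the helper names `build_tts_request` and `synthesize` are hypothetical, and the key is passed through unchanged after `Basic`, exactly as the curl example does.

```python
import json
import urllib.request

API_URL = "https://api.inworld.ai/tts/v1/voice"


def build_tts_request(api_key: str, text: str, voice_id: str = "Dennis") -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "text": text,
        "voiceId": voice_id,
        "modelId": "inworld-tts-2",
        "audioConfig": {"audioEncoding": "LINEAR16", "sampleRateHertz": 22050},
        "deliveryMode": "BALANCED",
        "applyTextNormalization": "ON",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Same header shape as the curl example: "Basic <key>".
            "Authorization": f"Basic {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(request: urllib.request.Request) -> dict:
    """Send the request and return the parsed JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Building the request separately from sending it makes it possible to inspect the headers and body without a network call, for example `synthesize(build_tts_request(os.environ["INWORLD_API_KEY"], "Hello, world!"))` once the key is set in the environment.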
Authorizations
Your authentication credentials. For Basic authentication, set the Authorization header to Basic $INWORLD_API_KEY.
Body
The text to be synthesized into speech. Maximum input of 2,000 characters.
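Because of the 2,000-character limit, longer inputs have to be split across several requests. One possible approach is sketched below: it splits on sentence boundaries and hard-wraps any single sentence that is too long. The helper `split_text` is hypothetical, and the sketch assumes the limit counts characters the way Python's `len` does, which this page does not confirm.

```python
import re

MAX_CHARS = 2000  # per-request limit stated above


def split_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap any single sentence that exceeds the limit on its own.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            # Adding this sentence would overflow, so start a new chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request and the resulting audio concatenated by the caller.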
The ID of the voice to use for synthesizing speech.
Configurations to use when synthesizing speech.
BCP-47 language tag (e.g., en-US, fr-FR, ja-JP) specifying the language that the given voice should speak the text in. If a localized voice prompt exists for the language, it will be used. When omitted, the original voice prompt will be used and the language will be auto-detected from the input text. If an invalid language code is provided, an error will be returned.
See Languages for more details.
Only supported by inworld-tts-2. The field is ignored on other models.
Controls how varied the output is.
- DELIVERY_MODE_UNSPECIFIED: Defaults to BALANCED behavior.
- STABLE: Optimizes for more consistent, predictable output.
- BALANCED: Balanced between stability and diversity.
- CREATIVE: Optimizes for increased emotional range and variation.
Allowed values: DELIVERY_MODE_UNSPECIFIED, STABLE, BALANCED, CREATIVE
Ignored on inworld-tts-2. Use deliveryMode instead.
Determines the degree of randomness when sampling audio tokens to generate the response.
Defaults to 1.0. Accepts values between 0 (exclusive) and 2 (inclusive). Higher values will make the output more random and can lead to more expressive results. Lower values will make it more deterministic. If 0 is provided, the default value will be used.
For the most stable results, we recommend using the default value.
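The range rule above can be checked on the client before sending a request. The sketch below is illustrative only: this page does not name the field, so `resolve_temperature` and its argument are hypothetical names, and raising `ValueError` for out-of-range values is a client-side choice rather than documented server behavior.

```python
from typing import Optional

DEFAULT_TEMPERATURE = 1.0  # default stated above


def resolve_temperature(value: Optional[float] = None) -> float:
    """Return the effective sampling value for a request.

    Mirrors the rule above: valid values lie in (0, 2], and a missing
    value or 0 falls back to the default.
    """
    if value is None or value == 0:
        return DEFAULT_TEMPERATURE
    if not 0 < value <= 2:
        raise ValueError(f"value must be in (0, 2], got {value}")
    return float(value)
```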
Controls timestamp metadata returned with the audio. When enabled, the response includes timing arrays, which can be useful for word-highlighting, karaoke-style captions, and lipsync.
- WORD: Output arrays under timestampInfo.wordAlignment (words, wordStartTimeSeconds, wordEndTimeSeconds).
- CHARACTER: Output arrays under timestampInfo.characterAlignment (characters, characterStartTimeSeconds, characterEndTimeSeconds).
- TIMESTAMP_TYPE_UNSPECIFIED: Do not compute alignment; timestamp arrays will be empty or omitted.
Phonetic details: phoneticDetails is currently only returned for WORD alignment (not CHARACTER).
Latency note: Alignment adds additional computation. Enabling alignment can increase latency.
Model differences:
- TTS 1.0 models (inworld-tts-1, inworld-tts-1-max): Return basic word/character timing arrays.
- TTS 1.5 and TTS-2 models (inworld-tts-1.5-mini, inworld-tts-1.5-max, inworld-tts-2): Return enhanced alignment data with detailed phoneticDetails containing phoneme-level timing and viseme symbols for lip-sync.
Allowed values: TIMESTAMP_TYPE_UNSPECIFIED, WORD, CHARACTER
When enabled, text normalization automatically expands and standardizes things like numbers, dates, times, and abbreviations before converting them to speech. For example, Dr. Smith becomes Doctor Smith, and 3/10/25 is spoken as March tenth, twenty twenty-five. Turning this off may reduce latency, but the speech output will read the text exactly as written. By default, the service decides automatically whether to apply text normalization.
Allowed values: APPLY_TEXT_NORMALIZATION_UNSPECIFIED, ON, OFF
Response
A successful response.
The audio data bytes, encoded in the format specified in the request. For encodings that are wrapped in containers (e.g. MP3, OPUS), the header is included. For PCM audio, a WAV header is included.
Maximum output audio size of 16 MB. To avoid errors with longer texts, please use a compressed audio format with an appropriate bit rate, or use the streaming endpoint.
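The sample `audioContent` value in the example response begins with `UklGR`, which is how a RIFF header looks when base64-encoded, so the sketch below assumes the field is a base64 string. It decodes the audio and, for LINEAR16 output, reads the WAV header to get the clip duration. The helper names are hypothetical.

```python
import base64
import io
import wave


def decode_audio(response_body: dict) -> bytes:
    """Return the raw audio bytes from a parsed JSON response (assumes base64)."""
    return base64.b64decode(response_body["audioContent"])


def wav_duration_seconds(audio: bytes) -> float:
    """Read the WAV header and return the clip duration in seconds."""
    if audio[:4] != b"RIFF" or audio[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE payload")
    with wave.open(io.BytesIO(audio), "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Because the WAV header is already included for PCM audio, the decoded bytes can be written straight to a `.wav` file, for example with `pathlib.Path("out.wav").write_bytes(decode_audio(body))`.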
Timestamp alignment information (present when alignment is enabled).
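For word highlighting or captions, the word alignment arrays can be zipped into per-word records. The following is a sketch under two assumptions not confirmed on this page: that `timestampInfo` sits at the top level of the response next to `audioContent`, and that the three arrays are index-aligned. `WordTiming`, `word_timings`, and `word_at` are hypothetical names.

```python
from typing import NamedTuple, Optional


class WordTiming(NamedTuple):
    word: str
    start: float  # seconds
    end: float    # seconds


def word_timings(response_body: dict) -> list[WordTiming]:
    """Pair each word with its start and end time; empty if alignment is absent."""
    alignment = response_body.get("timestampInfo", {}).get("wordAlignment", {})
    words = alignment.get("words", [])
    starts = alignment.get("wordStartTimeSeconds", [])
    ends = alignment.get("wordEndTimeSeconds", [])
    return [WordTiming(w, s, e) for w, s, e in zip(words, starts, ends)]


def word_at(timings: list[WordTiming], t: float) -> Optional[str]:
    """Return the word being spoken at playback time t, or None during silence."""
    for timing in timings:
        if timing.start <= t < timing.end:
            return timing.word
    return None
```

Returning an empty list when the arrays are missing matches the note that timestamp arrays may be empty or omitted when alignment is not requested.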