POST /tts/v1/voice:stream
curl --request POST \
  --url https://api.inworld.ai/tts/v1/voice:stream \
  --header 'Authorization: Basic <api-key>' \
  --header 'Content-Type: application/json' \
  --data '{
    "text": "Hello, world! What a wonderful day to be a text-to-speech model!",
    "voiceId": "Dennis",
    "modelId": "inworld-tts-1.5-max",
    "timestampType": "WORD"
  }'
{
  "result": {
    "audioContent": "UklGRiRQAQBXQVZFZm1...",
    "usage": {
      "processedCharactersCount": 64,
      "modelId": "inworld-tts-1.5-max"
    },
    "timestampInfo": {
      "wordAlignment": {
        "words": [
          "Hello,",
          "world!"
        ],
        "wordStartTimeSeconds": [
          0,
          0.51
        ],
        "wordEndTimeSeconds": [
          0.51,
          1.04
        ],
        "phoneticDetails": [
          {
            "wordIndex": 0,
            "phones": [
              {
                "phoneSymbol": "h",
                "startTimeSeconds": 0,
                "durationSeconds": 0.17,
                "visemeSymbol": "aei"
              },
              {
                "phoneSymbol": "ə",
                "startTimeSeconds": 0.17,
                "durationSeconds": 0.049999997,
                "visemeSymbol": "aei"
              },
              {
                "phoneSymbol": "l",
                "startTimeSeconds": 0.22,
                "durationSeconds": 0.110000014,
                "visemeSymbol": "l"
              },
              {
                "phoneSymbol": "oʊ1",
                "startTimeSeconds": 0.33,
                "durationSeconds": 0.17999998,
                "visemeSymbol": "o"
              }
            ],
            "isPartial": false
          },
          {
            "wordIndex": 1,
            "phones": [
              {
                "phoneSymbol": "w",
                "startTimeSeconds": 0.51,
                "durationSeconds": 0.15000004,
                "visemeSymbol": "qw"
              },
              {
                "phoneSymbol": "ɝ1",
                "startTimeSeconds": 0.66,
                "durationSeconds": 0.15999997,
                "visemeSymbol": "r"
              },
              {
                "phoneSymbol": "l",
                "startTimeSeconds": 0.82,
                "durationSeconds": 0.09000003,
                "visemeSymbol": "l"
              },
              {
                "phoneSymbol": "d",
                "startTimeSeconds": 0.91,
                "durationSeconds": 0.12999994,
                "visemeSymbol": "cdgknstxyz"
              }
            ],
            "isPartial": false
          }
        ]
      }
    }
  }
}
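
The parallel arrays under wordAlignment can be zipped into per-word caption cues. A minimal sketch in Python, using the timing values from the sample response above (the helper name is illustrative, not part of the API):

```python
def caption_cues(word_alignment):
    """Zip the parallel wordAlignment arrays into (word, start, end) cues."""
    return list(zip(
        word_alignment["words"],
        word_alignment["wordStartTimeSeconds"],
        word_alignment["wordEndTimeSeconds"],
    ))

# Timing values taken from the sample response above.
alignment = {
    "words": ["Hello,", "world!"],
    "wordStartTimeSeconds": [0, 0.51],
    "wordEndTimeSeconds": [0.51, 1.04],
}
cues = caption_cues(alignment)
# cues[0] tells you when to highlight the first word: ("Hello,", 0, 0.51)
```

Each cue gives the display window for one word, which is the shape most caption and karaoke renderers expect.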

Authorizations

Authorization
string
header
required

Your authentication credentials. For Basic authentication, set this header to Basic $INWORLD_RUNTIME_BASE64_CREDENTIAL.
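
If you need to construct the credential yourself, a minimal Python sketch assuming the standard HTTP Basic scheme (base64 of a key:secret pair — the exact pair format is an assumption; if your Inworld workspace already gives you a base64 credential, use it as-is):

```python
import base64

def basic_auth_header(api_key: str, api_secret: str) -> str:
    # Standard HTTP Basic: base64-encode "key:secret" and prefix with "Basic ".
    credential = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    return f"Basic {credential}"

# Hypothetical placeholder values, for illustration only.
header = basic_auth_header("my-key", "my-secret")
```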

Body

application/json
text
string
required

The text to be synthesized into speech. Maximum input of 2,000 characters.

voiceId
string
required

The ID of the voice to use for synthesizing speech.

modelId
string
required

The ID of the model to use for synthesizing speech. See Models for available models.

audioConfig
object

Configurations to use when synthesizing speech.

temperature
number<float>
default:1.1

Determines the degree of randomness when sampling audio tokens to generate the response.

Defaults to 1.1. Accepts values between 0 and 2. Higher values will make the output more random and can lead to more expressive results. Lower values will make it more deterministic.

For the most stable results, we recommend using the default value.

timestampType
enum<string>
default:TIMESTAMP_TYPE_UNSPECIFIED

Controls the timestamp metadata returned with the audio. When enabled, the response includes timing arrays, which are useful for word highlighting, karaoke-style captions, and lip-sync.

  • WORD: Output arrays under timestampInfo.wordAlignment (words, wordStartTimeSeconds, wordEndTimeSeconds).
  • CHARACTER: Output arrays under timestampInfo.characterAlignment (characters, characterStartTimeSeconds, characterEndTimeSeconds).
  • TIMESTAMP_TYPE_UNSPECIFIED: Do not compute alignment; timestamp arrays will be empty or omitted.

Phonetic details: phoneticDetails is currently only returned for WORD alignment (not CHARACTER).

Latency note: computing alignment requires additional work, so enabling it can increase response latency.

Model differences:

  • TTS 1.0 models (inworld-tts-1, inworld-tts-1-max): Return basic word/character timing arrays.
  • TTS 1.5 models (inworld-tts-1.5-mini, inworld-tts-1.5-max): Return enhanced alignment data with detailed phoneticDetails containing phoneme-level timing and viseme symbols for lip-sync.

Note: Timestamp alignment currently supports English only; other languages are experimental.
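
For TTS 1.5 models, the phoneme-level data can be flattened into a viseme timeline for lip-sync. A minimal sketch using the phoneticDetails field names shown in the sample response above (the helper is illustrative):

```python
def viseme_timeline(phonetic_details):
    """Flatten phoneticDetails into (visemeSymbol, start, end) events."""
    events = []
    for word in phonetic_details:
        for phone in word["phones"]:
            start = phone["startTimeSeconds"]
            end = start + phone["durationSeconds"]
            events.append((phone["visemeSymbol"], start, end))
    return events

# A fragment of the sample response above.
details = [{
    "wordIndex": 0,
    "phones": [
        {"phoneSymbol": "h", "startTimeSeconds": 0,
         "durationSeconds": 0.17, "visemeSymbol": "aei"},
        {"phoneSymbol": "ə", "startTimeSeconds": 0.17,
         "durationSeconds": 0.05, "visemeSymbol": "aei"},
    ],
    "isPartial": False,
}]
timeline = viseme_timeline(details)
```

Adjacent events with the same viseme symbol (as here) can optionally be merged to reduce mouth-shape churn during playback.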

Available options:
TIMESTAMP_TYPE_UNSPECIFIED,
WORD,
CHARACTER
applyTextNormalization
enum<string>
default:APPLY_TEXT_NORMALIZATION_UNSPECIFIED

When enabled, text normalization automatically expands and standardizes things like numbers, dates, times, and abbreviations before converting them to speech. For example, Dr. Smith becomes Doctor Smith, and 3/10/25 is spoken as March tenth, twenty twenty-five. Turning this off may reduce latency, but the speech output will read the text exactly as written. Defaults to automatically deciding whether to apply text normalization.

Available options:
APPLY_TEXT_NORMALIZATION_UNSPECIFIED,
ON,
OFF
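
Putting the body fields together, the request can be assembled with Python's standard library. A minimal sketch mirroring the curl example above (the URL and field names come from this page; the Authorization value is a placeholder, and the network call itself is left commented out):

```python
import json
import urllib.request

def build_payload(text, voice_id, model_id, timestamp_type=None):
    """Assemble the request body; optional fields are added only when set."""
    payload = {"text": text, "voiceId": voice_id, "modelId": model_id}
    if timestamp_type is not None:
        payload["timestampType"] = timestamp_type
    return payload

payload = build_payload(
    "Hello, world! What a wonderful day to be a text-to-speech model!",
    "Dennis",
    "inworld-tts-1.5-max",
    timestamp_type="WORD",
)

req = urllib.request.Request(
    "https://api.inworld.ai/tts/v1/voice:stream",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Basic <api-key>",  # placeholder, as in the curl example
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would perform the call (requires valid credentials).
```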

Response

A successful response returns a stream of objects.

result
object

A chunk containing the audio data. If using PCM, every chunk, not just the initial chunk, will contain a complete WAV header so it can be played independently.
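
Because each PCM chunk carries its own complete WAV header, a decoded chunk can be validated and played on its own. A minimal sketch that base64-decodes an audioContent value and checks the RIFF/WAVE magic bytes (the chunk below is synthetic, fabricated for illustration — not real API output):

```python
import base64

def is_wav_chunk(audio_content_b64: str) -> bool:
    """Check that a base64 audioContent value starts with a WAV header."""
    raw = base64.b64decode(audio_content_b64)
    # A WAV file begins with "RIFF", a 4-byte chunk size, then "WAVE".
    return len(raw) >= 12 and raw[:4] == b"RIFF" and raw[8:12] == b"WAVE"

# Synthetic 12-byte header for illustration (not real audio).
fake_chunk = base64.b64encode(b"RIFF\x04\x00\x00\x00WAVE").decode()
```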

error
object

A chunk may contain an error object if an error occurs during the stream.
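
A stream consumer should therefore handle both chunk shapes. A minimal dispatch sketch, assuming each streamed object carries either a result or an error key as described above (how chunks are framed on the wire, e.g. newline-delimited JSON, is an assumption to verify against your client library):

```python
import json

def handle_stream(lines):
    """Collect audio from result chunks; stop and report on an error chunk."""
    audio_parts, error = [], None
    for line in lines:
        chunk = json.loads(line)
        if "error" in chunk:
            error = chunk["error"]
            break
        result = chunk.get("result", {})
        if "audioContent" in result:
            audio_parts.append(result["audioContent"])
    return audio_parts, error

# Synthetic chunks for illustration.
stream = [
    '{"result": {"audioContent": "UklGRiRQ..."}}',
    '{"error": {"message": "something went wrong"}}',
]
parts, err = handle_stream(stream)
```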