Synthesize speech - Inworld AI Documentation

cURL

curl --request POST \
  --url https://api.inworld.ai/tts/v1/voice \
  --header 'Authorization: Basic <api-key>' \
  --header 'Content-Type: application/json' \
  --data '{
  "text": "Hello, world! What a wonderful day to be a text-to-speech model!",
  "voiceId": "Dennis",
  "modelId": "inworld-tts-1",
  "timestampType": "WORD"
}'

{
  "audioContent": "UklGRiRQAQBXQVZFZm1...",
  "timestampInfo": {
    "wordAlignment": {
      "words": [
        "Hello,",
        "world!",
        "What",
        "a",
        "wonderful",
        "day",
        "to",
        "be",
        "a",
        "text-to-speech",
        "model!"
      ],
      "wordStartTimeSeconds": [
        0,
        0.525,
        1.515,
        1.717,
        1.919,
        2.485,
        2.809,
        2.91,
        3.051,
        3.152,
        3.879
      ],
      "wordEndTimeSeconds": [
        0.445,
        0.97,
        1.677,
        1.758,
        2.425,
        2.728,
        2.869,
        3.011,
        3.071,
        3.819,
        4.223
      ]
    }
  }
}

POST https://api.inworld.ai/tts/v1/voice

Authorizations

Authorization

string

header

required

Your authentication credentials. For Basic authentication, set this header to Basic $INWORLD_RUNTIME_BASE64_CREDENTIAL.
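As a minimal sketch of how the header above might be assembled in client code, the helper below reads the credential from the INWORLD_RUNTIME_BASE64_CREDENTIAL environment variable. The function name `build_auth_headers` is hypothetical and not part of any Inworld SDK.

```python
import os


def build_auth_headers(credential=None):
    """Return request headers for Basic authentication.

    `credential` is the Base64 credential string. When omitted, it is read
    from the INWORLD_RUNTIME_BASE64_CREDENTIAL environment variable.
    """
    if credential is None:
        credential = os.environ.get("INWORLD_RUNTIME_BASE64_CREDENTIAL", "")
    if not credential:
        raise ValueError("missing Base64 credential")
    return {
        "Authorization": f"Basic {credential}",
        "Content-Type": "application/json",
    }
```

Passing the credential explicitly, as in `build_auth_headers("abc123")`, avoids depending on the environment in tests.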

Body

application/json

text

string

required

The text to be synthesized into speech. Maximum input of 2,000 characters.
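Longer inputs need to be split before they are sent. The sketch below is one possible approach, not an official recommendation: it splits on sentence boundaries and falls back to a hard cut for oversized sentences. It assumes the 2,000-character limit counts Unicode code points (Python's `len`), which the reference does not state; `split_text` is a hypothetical helper.

```python
import re

MAX_CHARS = 2000  # limit stated above; assumed to count Unicode code points


def split_text(text, limit=MAX_CHARS):
    """Split text into chunks of at most `limit` characters.

    Prefers sentence boundaries. A single sentence longer than `limit`
    is cut hard at the limit.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any sentence that cannot fit in one chunk by itself.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized with its own request and the resulting audio concatenated by the caller.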

voiceId

string

required

The ID of the voice to use for synthesizing speech.

modelId

enum<string>

required

The ID of the model to use for synthesizing speech. See Models for available models.

Available options: inworld-tts-1, inworld-tts-1-max
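The three required fields can be combined into a request as in the following sketch, which uses only the Python standard library. The URL and field names come from the cURL example; the helper names `build_request` and `synthesize` are hypothetical, and error handling is left out for brevity.

```python
import json
import urllib.request

TTS_URL = "https://api.inworld.ai/tts/v1/voice"
MODEL_IDS = {"inworld-tts-1", "inworld-tts-1-max"}


def build_request(text, voice_id, model_id, credential):
    """Build (but do not send) a POST request with the three required fields."""
    if model_id not in MODEL_IDS:
        raise ValueError(f"unknown modelId: {model_id!r}")
    payload = {"text": text, "voiceId": voice_id, "modelId": model_id}
    return urllib.request.Request(
        TTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {credential}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(request):
    """Send the request and return the parsed JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping construction separate from sending makes the payload easy to inspect before any network call is made.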

audioConfig

object

Configurations to use when synthesizing speech.

temperature

number<float>

default:1.1

Determines the degree of randomness when sampling audio tokens to generate the response.

Accepts values between 0 and 2 (default 1.1). Higher values make the output more random and can yield more expressive results; lower values make it more deterministic.

For the most stable results, we recommend using the default value.
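A client may want to reject out-of-range values before sending a request. The sketch below assumes that the bounds 0 and 2 are inclusive and that `temperature` is a top-level field of the request body, neither of which this reference states explicitly; `with_temperature` is a hypothetical helper.

```python
DEFAULT_TEMPERATURE = 1.1  # default stated above


def with_temperature(payload, temperature=DEFAULT_TEMPERATURE):
    """Return a copy of `payload` with a validated temperature field."""
    # Assumption: both ends of the documented 0..2 range are accepted.
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    return {**payload, "temperature": temperature}
```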

timestampType

enum<string>

default:TIMESTAMP_TYPE_UNSPECIFIED

Controls timestamp metadata returned with the audio. When enabled, the response includes timing arrays under timestampInfo.wordAlignment (WORD) or timestampInfo.characterAlignment (CHARACTER). Useful for word-highlighting, karaoke-style captions, and lipsync.

Note: Enabling alignment slightly increases latency. Internal experiments show an average ~100 ms increase.

Language support: Timestamp alignment currently supports English only; other languages are experimental.

WORD: Output arrays under timestampInfo.wordAlignment (words, wordStartTimeSeconds, wordEndTimeSeconds).
CHARACTER: Output arrays under timestampInfo.characterAlignment (characters, characterStartTimeSeconds, characterEndTimeSeconds).
TIMESTAMP_TYPE_UNSPECIFIED: Do not compute alignment; timestamp arrays will be empty or omitted.

Available options: TIMESTAMP_TYPE_UNSPECIFIED, WORD, CHARACTER
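For word highlighting or captions, the three parallel WORD arrays are easier to use once zipped into per-word records. The following is an illustrative sketch based on the response shape shown in the example; `word_timings` and `word_at` are hypothetical helpers.

```python
def word_timings(response):
    """Return (word, start_seconds, end_seconds) tuples from a WORD-aligned response."""
    alignment = response.get("timestampInfo", {}).get("wordAlignment", {})
    words = alignment.get("words", [])
    starts = alignment.get("wordStartTimeSeconds", [])
    ends = alignment.get("wordEndTimeSeconds", [])
    if not (len(words) == len(starts) == len(ends)):
        raise ValueError("alignment arrays have different lengths")
    return list(zip(words, starts, ends))


def word_at(timings, t):
    """Return the word being spoken at time `t` (seconds), or None during a pause."""
    for word, start, end in timings:
        if start <= t <= end:
            return word
    return None
```

With the first two words of the example response, `word_at(timings, 0.6)` returns `"world!"`, while `word_at(timings, 0.5)` returns `None` because 0.5 falls in the gap between the two words.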

applyTextNormalization

enum<string>

default:APPLY_TEXT_NORMALIZATION_UNSPECIFIED

When enabled, text normalization expands and standardizes numbers, dates, times, and abbreviations before the text is converted to speech. For example, Dr. Smith becomes Doctor Smith, and 3/10/25 is spoken as March tenth, twenty twenty-five. Turning normalization off may reduce latency, but the text is then read exactly as written. By default, the service decides automatically whether to apply text normalization.

Available options: APPLY_TEXT_NORMALIZATION_UNSPECIFIED, ON, OFF

Response

A successful response.

audioContent

string<byte>

The audio bytes, encoded in the format specified in the request. For encodings that are wrapped in a container (e.g. MP3, OPUS), the container header is included. For PCM audio, a WAV header is included.

The maximum output audio size is 16 MB. To avoid errors with longer texts, use a compressed audio format with an appropriate bit rate, or use the streaming endpoint.
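Because audioContent is a Base64 string, it has to be decoded before it can be played or saved. The sketch below shows one way to do that and to check for the WAV header mentioned above; the helper names are hypothetical.

```python
import base64


def decode_audio(response):
    """Decode the Base64 audioContent field into raw bytes."""
    return base64.b64decode(response["audioContent"])


def looks_like_wav(audio):
    """Return True when the bytes start with a RIFF/WAVE container header."""
    return audio[:4] == b"RIFF" and audio[8:12] == b"WAVE"


def save_audio(response, path):
    """Write the decoded audio to `path` and return the number of bytes written."""
    audio = decode_audio(response)
    with open(path, "wb") as handle:
        handle.write(audio)
    return len(audio)
```

The `looks_like_wav` check is only a quick sanity test on the first twelve bytes; it does not validate the rest of the file.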

timestampInfo

object

Timestamp alignment information (present when alignment is enabled).
