POST /tts/v1/voice:stream
curl --request POST \
  --url https://api.inworld.ai/tts/v1/voice:stream \
  --header 'Authorization: Basic <api-key>' \
  --header 'Content-Type: application/json' \
  --data '{
    "text": "Hello, world! What a wonderful day to be a text-to-speech model!",
    "voiceId": "Dennis",
    "modelId": "inworld-tts-1.5-max",
    "timestampType": "WORD"
  }'
{
  "result": {
    "audioContent": "UklGRiRQAQBXQVZFZm1...",
    "usage": {
      "processedCharactersCount": 64,
      "modelId": "inworld-tts-1.5-max"
    },
    "timestampInfo": {
      "wordAlignment": {
        "words": [
          "Hello,",
          "world!"
        ],
        "wordStartTimeSeconds": [
          0,
          0.51
        ],
        "wordEndTimeSeconds": [
          0.51,
          1.04
        ],
        "phoneticDetails": [
          {
            "wordIndex": 0,
            "phones": [
              {
                "phoneSymbol": "h",
                "startTimeSeconds": 0,
                "durationSeconds": 0.17,
                "visemeSymbol": "aei"
              },
              {
                "phoneSymbol": "ə",
                "startTimeSeconds": 0.17,
                "durationSeconds": 0.049999997,
                "visemeSymbol": "aei"
              },
              {
                "phoneSymbol": "l",
                "startTimeSeconds": 0.22,
                "durationSeconds": 0.110000014,
                "visemeSymbol": "l"
              },
              {
                "phoneSymbol": "oʊ1",
                "startTimeSeconds": 0.33,
                "durationSeconds": 0.17999998,
                "visemeSymbol": "o"
              }
            ],
            "isPartial": false
          },
          {
            "wordIndex": 1,
            "phones": [
              {
                "phoneSymbol": "w",
                "startTimeSeconds": 0.51,
                "durationSeconds": 0.15000004,
                "visemeSymbol": "qw"
              },
              {
                "phoneSymbol": "ɝ1",
                "startTimeSeconds": 0.66,
                "durationSeconds": 0.15999997,
                "visemeSymbol": "r"
              },
              {
                "phoneSymbol": "l",
                "startTimeSeconds": 0.82,
                "durationSeconds": 0.09000003,
                "visemeSymbol": "l"
              },
              {
                "phoneSymbol": "d",
                "startTimeSeconds": 0.91,
                "durationSeconds": 0.12999994,
                "visemeSymbol": "cdgknstxyz"
              }
            ],
            "isPartial": false
          }
        ]
      }
    }
  }
}
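
The parallel arrays under wordAlignment can be zipped into per-word caption cues. A minimal sketch in Python, using the timing values from the sample response above (the helper name is illustrative, not part of the API):

```python
def caption_cues(word_alignment):
    """Zip the parallel wordAlignment arrays into (word, start, end) cues."""
    return list(zip(
        word_alignment["words"],
        word_alignment["wordStartTimeSeconds"],
        word_alignment["wordEndTimeSeconds"],
    ))

# Timing values taken from the sample response above.
alignment = {
    "words": ["Hello,", "world!"],
    "wordStartTimeSeconds": [0, 0.51],
    "wordEndTimeSeconds": [0.51, 1.04],
}
cues = caption_cues(alignment)
# cues[0] tells you when to highlight the first word: ("Hello,", 0, 0.51)
```

Each cue gives the display window for one word, which is the shape most caption and karaoke renderers expect.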

Authorizations

Authorization
string
header
required

Your authentication credentials. For Basic authentication, set this header to Basic $INWORLD_RUNTIME_BASE64_CREDENTIAL.
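
If you need to construct the credential yourself, a minimal Python sketch assuming the standard HTTP Basic scheme (base64 of a key:secret pair — the exact pair format is an assumption; if your Inworld workspace already gives you a base64 credential, use it as-is):

```python
import base64

def basic_auth_header(api_key: str, api_secret: str) -> str:
    # Standard HTTP Basic: base64-encode "key:secret" and prefix with "Basic ".
    credential = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    return f"Basic {credential}"

# Hypothetical placeholder values, for illustration only.
header = basic_auth_header("my-key", "my-secret")
```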

Body

application/json
text
string
required

The text to be synthesized into speech. Maximum input of 2,000 characters.

voiceId
string
required

The ID of the voice to use for synthesizing speech.

modelId
string
required

The ID of the model to use for synthesizing speech. See Models for available models.

audioConfig
object

Configurations to use when synthesizing speech.

temperature
number<float>
default:1.1

Determines the degree of randomness when sampling audio tokens to generate the response.

Defaults to 1.1. Accepts values between 0 and 2. Higher values will make the output more random and can lead to more expressive results. Lower values will make it more deterministic.

For the most stable results, we recommend using the default value.

timestampType
enum<string>
default:TIMESTAMP_TYPE_UNSPECIFIED

Controls the timestamp metadata returned with the audio. When enabled, the response includes timing arrays, which are useful for word highlighting, karaoke-style captions, and lip-sync.

  • WORD: Output arrays under timestampInfo.wordAlignment (words, wordStartTimeSeconds, wordEndTimeSeconds).
  • CHARACTER: Output arrays under timestampInfo.characterAlignment (characters, characterStartTimeSeconds, characterEndTimeSeconds).
  • TIMESTAMP_TYPE_UNSPECIFIED: Do not compute alignment; timestamp arrays will be empty or omitted.

Phonetic details: phoneticDetails is currently only returned for WORD alignment (not CHARACTER).

Latency note: computing alignment requires additional work, so enabling it can increase response latency.

Model differences:

  • TTS 1.0 models (inworld-tts-1, inworld-tts-1-max): Return basic word/character timing arrays.
  • TTS 1.5 models (inworld-tts-1.5-mini, inworld-tts-1.5-max): Return enhanced alignment data with detailed phoneticDetails containing phoneme-level timing and viseme symbols for lip-sync.

Note: Timestamp alignment currently supports English only; other languages are experimental.
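
For TTS 1.5 models, the phoneme-level data can be flattened into a viseme timeline for lip-sync. A minimal sketch using the phoneticDetails field names shown in the sample response above (the helper is illustrative):

```python
def viseme_timeline(phonetic_details):
    """Flatten phoneticDetails into (visemeSymbol, start, end) events."""
    events = []
    for word in phonetic_details:
        for phone in word["phones"]:
            start = phone["startTimeSeconds"]
            end = start + phone["durationSeconds"]
            events.append((phone["visemeSymbol"], start, end))
    return events

# A fragment of the sample response above.
details = [{
    "wordIndex": 0,
    "phones": [
        {"phoneSymbol": "h", "startTimeSeconds": 0,
         "durationSeconds": 0.17, "visemeSymbol": "aei"},
        {"phoneSymbol": "ə", "startTimeSeconds": 0.17,
         "durationSeconds": 0.05, "visemeSymbol": "aei"},
    ],
    "isPartial": False,
}]
timeline = viseme_timeline(details)
```

Adjacent events with the same viseme symbol (as here) can optionally be merged to reduce mouth-shape churn during playback.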

Available options:
TIMESTAMP_TYPE_UNSPECIFIED,
WORD,
CHARACTER
applyTextNormalization
enum<string>
default:APPLY_TEXT_NORMALIZATION_UNSPECIFIED

When enabled, text normalization automatically expands and standardizes things like numbers, dates, times, and abbreviations before converting them to speech. For example, Dr. Smith becomes Doctor Smith, and 3/10/25 is spoken as March tenth, twenty twenty-five. Turning this off may reduce latency, but the speech output will read the text exactly as written. Defaults to automatically deciding whether to apply text normalization.

Available options:
APPLY_TEXT_NORMALIZATION_UNSPECIFIED,
ON,
OFF
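
Putting the body fields together, the request can be assembled with Python's standard library. A minimal sketch mirroring the curl example above (the URL and field names come from this page; the Authorization value is a placeholder, and the network call itself is left commented out):

```python
import json
import urllib.request

def build_payload(text, voice_id, model_id, timestamp_type=None):
    """Assemble the request body; optional fields are added only when set."""
    payload = {"text": text, "voiceId": voice_id, "modelId": model_id}
    if timestamp_type is not None:
        payload["timestampType"] = timestamp_type
    return payload

payload = build_payload(
    "Hello, world! What a wonderful day to be a text-to-speech model!",
    "Dennis",
    "inworld-tts-1.5-max",
    timestamp_type="WORD",
)

req = urllib.request.Request(
    "https://api.inworld.ai/tts/v1/voice:stream",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Basic <api-key>",  # placeholder, as in the curl example
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would perform the call (requires valid credentials).
```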

Response

A successful response returns a stream of objects.

result
object

A chunk containing the audio data. If using PCM, every chunk, not just the initial chunk, will contain a complete WAV header so it can be played independently.
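
Because each PCM chunk carries its own complete WAV header, a decoded chunk can be validated and played on its own. A minimal sketch that base64-decodes an audioContent value and checks the RIFF/WAVE magic bytes (the chunk below is synthetic, fabricated for illustration — not real API output):

```python
import base64

def is_wav_chunk(audio_content_b64: str) -> bool:
    """Check that a base64 audioContent value starts with a WAV header."""
    raw = base64.b64decode(audio_content_b64)
    # A WAV file begins with "RIFF", a 4-byte chunk size, then "WAVE".
    return len(raw) >= 12 and raw[:4] == b"RIFF" and raw[8:12] == b"WAVE"

# Synthetic 12-byte header for illustration (not real audio).
fake_chunk = base64.b64encode(b"RIFF\x04\x00\x00\x00WAVE").decode()
```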

error
object

A chunk may contain an error object if an error occurs during the stream.
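
A stream consumer should therefore handle both chunk shapes. A minimal dispatch sketch, assuming each streamed object carries either a result or an error key as described above (how chunks are framed on the wire, e.g. newline-delimited JSON, is an assumption to verify against your client library):

```python
import json

def handle_stream(lines):
    """Collect audio from result chunks; stop and report on an error chunk."""
    audio_parts, error = [], None
    for line in lines:
        chunk = json.loads(line)
        if "error" in chunk:
            error = chunk["error"]
            break
        result = chunk.get("result", {})
        if "audioContent" in result:
            audio_parts.append(result["audioContent"])
    return audio_parts, error

# Synthetic chunks for illustration.
stream = [
    '{"result": {"audioContent": "UklGRiRQ..."}}',
    '{"error": {"message": "something went wrong"}}',
]
parts, err = handle_stream(stream)
```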