Timestamps - Inworld AI Documentation

Timestamp alignment currently supports English only; other languages are experimental.

Timestamp alignment lets you retrieve timing information that matches the generated audio, which is useful for experiences like word highlighting, karaoke‑style captions, and lipsync. Set the timestampType request parameter to control granularity:

- WORD: Return timestamps for each word, including detailed phoneme-level timing with viseme symbols.
- CHARACTER: Return timestamps for each character or punctuation mark.

Enabling timestamp alignment can increase latency (especially for the non-streaming endpoint).

When enabled, the response includes timestamp arrays:

- WORD: timestampInfo.wordAlignment with words, wordStartTimeSeconds, wordEndTimeSeconds
  - For TTS 1.5 models, phoneticDetails containing detailed phoneme-level timing with viseme symbols
- CHARACTER: timestampInfo.characterAlignment with characters, characterStartTimeSeconds, characterEndTimeSeconds

Phoneme and viseme timings (phoneticDetails) are currently only returned for WORD alignment (not CHARACTER).

See the API reference for full details.
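The alignment arrays are parallel: index i in each array describes the same word or character. A minimal Python sketch of zipping them into one record per item follows. The `Span` type and `alignment_spans` helper are illustrative names, not part of the API, and the sketch assumes the three arrays always have equal length.

```python
from typing import NamedTuple


class Span(NamedTuple):
    text: str
    start: float
    end: float


# Response field names for each timestampType, as listed above.
_FIELDS = {
    "WORD": ("wordAlignment", "words", "wordStartTimeSeconds", "wordEndTimeSeconds"),
    "CHARACTER": (
        "characterAlignment",
        "characters",
        "characterStartTimeSeconds",
        "characterEndTimeSeconds",
    ),
}


def alignment_spans(response: dict, timestamp_type: str = "WORD") -> list[Span]:
    """Zip the parallel alignment arrays into one Span per word or character."""
    block, items_key, starts_key, ends_key = _FIELDS[timestamp_type]
    alignment = response.get("timestampInfo", {}).get(block, {})
    items = alignment.get(items_key, [])
    starts = alignment.get(starts_key, [])
    ends = alignment.get(ends_key, [])
    if not (len(items) == len(starts) == len(ends)):
        raise ValueError(f"{block} arrays have different lengths")
    return [Span(t, s, e) for t, s, e in zip(items, starts, ends)]


response = {
    "timestampInfo": {
        "wordAlignment": {
            "words": ["Hello", "world,"],
            "wordStartTimeSeconds": [0, 0.33],
            "wordEndTimeSeconds": [0.28, 0.63],
        }
    }
}
spans = alignment_spans(response)
```

Working with one record per item tends to be easier for caption rendering than indexing three arrays in step.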

Streaming behavior

You can control how timestamp data is delivered alongside audio using timestampTransportStrategy.

Sync (default)

Audio and alignment arrive together in each chunk. Every chunk contains both audio data and its corresponding timestamps.

Chunk 1: audio + timestamps for chunk 1
Chunk 2: audio + timestamps for chunk 2
Chunk 3: audio + timestamps for chunk 3

This is the simplest approach, but the first audio chunk is slightly delayed because the server waits for alignment before sending it.

Async

Audio chunks arrive first, followed by separate trailing messages containing only timestamp data. This reduces time-to-first-audio with TTS 1.5 models, since the server doesn’t need to wait for alignment computation before sending audio.

Chunk 1: audio only
Chunk 2: audio only
Chunk 3: audio only
Chunk 4: timestamps only (alignment for chunks 1–3)
Chunk 5: timestamps only
...
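A client can be written so that it does not care which layout it receives. The sketch below treats each chunk as a dict that may carry audio, timestamps, or both. The `audio` key (already-decoded bytes) is a hypothetical stand-in, since the actual chunk field names are defined in the API reference; `timestampInfo` is the field described above.

```python
def consume_stream(chunks):
    """Collect audio and alignment from a stream, whichever order they arrive in."""
    audio = bytearray()
    alignments = []
    for chunk in chunks:
        data = chunk.get("audio")  # hypothetical key for decoded audio bytes
        if data:
            audio += data
        info = chunk.get("timestampInfo")
        if info:
            alignments.append(info)
    return bytes(audio), alignments


# Async-shaped stream: audio-only chunks, then a trailing timestamps-only chunk.
async_stream = [
    {"audio": b"\x01\x02"},
    {"audio": b"\x03"},
    {"timestampInfo": {"wordAlignment": {"words": ["Hello"]}}},
]
audio, alignments = consume_stream(async_stream)
```

A real player would start playback as audio arrives instead of buffering everything; the point here is only that audio and timestamps are handled independently.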

Use async when you prioritize time-to-first-audio and can handle timestamps arriving after their corresponding audio. Use sync when you need audio and timestamps together in each chunk (e.g., for real-time lip-sync or word highlighting during playback).

Set timestampTransportStrategy to SYNC or ASYNC in your request. See the API reference for details.
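As an illustration, the two timestamp parameters can be validated and merged into a request body before it is sent. In this sketch only `timestampType` and `timestampTransportStrategy` and their values come from this page; the `with_timestamps` helper and the `text` and `modelId` body fields are placeholders, so check the API reference for the real request schema.

```python
TIMESTAMP_TYPES = {"WORD", "CHARACTER"}
TRANSPORT_STRATEGIES = {"SYNC", "ASYNC"}


def with_timestamps(body: dict, timestamp_type: str = "WORD", strategy: str = "SYNC") -> dict:
    """Return a copy of a request body with the timestamp options set."""
    if timestamp_type not in TIMESTAMP_TYPES:
        raise ValueError(f"unknown timestampType: {timestamp_type!r}")
    if strategy not in TRANSPORT_STRATEGIES:
        raise ValueError(f"unknown timestampTransportStrategy: {strategy!r}")
    return {
        **body,
        "timestampType": timestamp_type,
        "timestampTransportStrategy": strategy,
    }


# "text" and "modelId" are placeholder fields for illustration only.
request_body = with_timestamps(
    {"text": "Hello, world", "modelId": "inworld-tts-1.5-mini"},
    strategy="ASYNC",
)
```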

Response structure

TTS 1.5 models (`inworld-tts-1.5-mini`, `inworld-tts-1.5-max`)

Returns enhanced alignment data with phonetic details: detailed phoneme-level timing with viseme symbols for precise lip-sync animation.

{
  "timestampInfo": {
    "wordAlignment": {
      "words": ["Hello,", "world,", "this", "will", "be", "saved"],
      "wordStartTimeSeconds": [0, 0.28, 0.96, 1.25, 1.38, 1.5],
      "wordEndTimeSeconds": [0.28, 0.8, 1.25, 1.38, 1.5, 1.99],
      "phoneticDetails": [
        {
          "wordIndex": 0,
          "phones": [
            {"phoneSymbol": "h", "startTimeSeconds": 0, "durationSeconds": 0.07, "visemeSymbol": "aei"},
            {"phoneSymbol": "ə", "startTimeSeconds": 0.07, "durationSeconds": 0.030000001, "visemeSymbol": "aei"},
            {"phoneSymbol": "l", "startTimeSeconds": 0.1, "durationSeconds": 0.089999996, "visemeSymbol": "l"},
            {"phoneSymbol": "oʊ1", "startTimeSeconds": 0.19, "durationSeconds": 0.09, "visemeSymbol": "o"}
          ],
          "isPartial": false
        },
        {
          "wordIndex": 1,
          "phones": [
            {"phoneSymbol": "w", "startTimeSeconds": 0.28, "durationSeconds": 0.18, "visemeSymbol": "qw"},
            {"phoneSymbol": "ɝ1", "startTimeSeconds": 0.46, "durationSeconds": 0.119999975, "visemeSymbol": "r"},
            {"phoneSymbol": "l", "startTimeSeconds": 0.58, "durationSeconds": 0.08000004, "visemeSymbol": "l"},
            {"phoneSymbol": "d", "startTimeSeconds": 0.66, "durationSeconds": 0.13999999, "visemeSymbol": "cdgknstxyz"}
          ],
          "isPartial": false
        },
        {
          "wordIndex": 2,
          "phones": [
            {"phoneSymbol": "ð", "startTimeSeconds": 0.96, "durationSeconds": 0.14000005, "visemeSymbol": "th"},
            {"phoneSymbol": "ɪ1", "startTimeSeconds": 1.1, "durationSeconds": 0.06999993, "visemeSymbol": "ee"},
            {"phoneSymbol": "s", "startTimeSeconds": 1.17, "durationSeconds": 0.08000004, "visemeSymbol": "cdgknstxyz"}
          ],
          "isPartial": false
        }
      ]
    }
  }
}

Phonetic details structure

Each entry in phoneticDetails contains:

Field	Description
`wordIndex`	Index of the word this phonetic detail belongs to (0-based).
`phones`	Array of phonemes that make up this word.
`isPartial`	True when the server considers the word potentially unstable (e.g., last word in a non-final streaming update). Clients may choose to delay processing partial words until `isPartial` becomes `false`.
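One way to delay processing of partial words is to emit each entry only once, the first time it arrives with `isPartial` set to false. The sketch below assumes that a word marked partial is re-sent in a later update with the same `wordIndex`; that behavior is inferred from the description above, not confirmed. `finalized_words` is an illustrative helper name.

```python
def finalized_words(updates):
    """Yield each phoneticDetails entry once, when it first arrives non-partial."""
    done = set()
    for details in updates:  # one phoneticDetails list per streaming update
        for entry in details:
            index = entry["wordIndex"]
            if entry.get("isPartial", False) or index in done:
                continue  # still unstable, or already emitted
            done.add(index)
            yield entry


updates = [
    [
        {"wordIndex": 0, "phones": [], "isPartial": False},
        {"wordIndex": 1, "phones": [], "isPartial": True},
    ],
    [{"wordIndex": 1, "phones": [], "isPartial": False}],
]
emitted = [entry["wordIndex"] for entry in finalized_words(updates)]
```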

Each phone entry contains:

Field	Description
`phoneSymbol`	The phoneme symbol in IPA notation. Some symbols carry a numeric suffix, such as `oʊ1` in the example above.
`startTimeSeconds`	Start time of the phoneme in seconds. May be omitted for the first phoneme of a word.
`durationSeconds`	Duration of the phoneme in seconds.
`visemeSymbol`	The viseme symbol for lip-sync animation.
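For lip-sync, it is often convenient to flatten the per-word phones into a single list of viseme keyframes. The sketch below is one possible approach under two assumptions: when `startTimeSeconds` is omitted, the phone is taken to start where the previous phone ended, or at the word's start time for the first phone; and times are rounded to milliseconds to hide float noise such as 0.030000001. `viseme_keyframes` is an illustrative helper name.

```python
def viseme_keyframes(word_alignment: dict) -> list[tuple[float, float, str]]:
    """Flatten phoneticDetails into (start, end, visemeSymbol) tuples in time order."""
    word_starts = word_alignment.get("wordStartTimeSeconds", [])
    frames = []
    for entry in word_alignment.get("phoneticDetails", []):
        index = entry["wordIndex"]
        # Fallback start for a phone without startTimeSeconds (an assumption).
        cursor = word_starts[index] if index < len(word_starts) else 0.0
        for phone in entry["phones"]:
            start = phone.get("startTimeSeconds", cursor)
            end = start + phone["durationSeconds"]
            frames.append((round(start, 3), round(end, 3), phone["visemeSymbol"]))
            cursor = end
    return sorted(frames)


word_alignment = {
    "words": ["Hello,", "world,"],
    "wordStartTimeSeconds": [0, 0.28],
    "wordEndTimeSeconds": [0.28, 0.8],
    "phoneticDetails": [
        {
            "wordIndex": 1,
            "phones": [
                # First phone has no startTimeSeconds, so the word start is used.
                {"phoneSymbol": "w", "durationSeconds": 0.18, "visemeSymbol": "qw"},
                {"phoneSymbol": "ɝ1", "startTimeSeconds": 0.46, "durationSeconds": 0.12, "visemeSymbol": "r"},
            ],
            "isPartial": False,
        }
    ],
}
frames = viseme_keyframes(word_alignment)
```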

Viseme symbols

The following viseme symbols are used for lip-sync animation:

Viseme	Description
`aei`	Open mouth vowels (a, e, i, ə, ʌ, æ, ɑ, etc.)
`o`	Rounded vowels (o, ʊ, əʊ, oʊ, etc.)
`ee`	Front vowels (i, ɪ, eɪ, etc.)
`bmp`	Bilabial consonants (b, m, p)
`fv`	Labiodental consonants (f, v)
`l`	Lateral consonant (l)
`r`	Rhotic sounds (r, ɝ, ɚ)
`th`	Dental fricatives (θ, ð)
`qw`	Rounded consonants (w, ʍ)
`cdgknstxyz`	Alveolar/velar consonants (c, d, g, k, n, s, t, x, y, z)
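During playback, an animation loop needs the viseme that is active at the current audio time. The following sketch looks it up with a binary search over keyframes sorted by start time. The `(start, end, symbol)` keyframe shape, the `viseme_at` helper, and the `"rest"` fallback pose are illustrative choices; only the viseme symbols themselves come from the table above.

```python
from bisect import bisect_right

# Viseme symbols listed in the table above.
KNOWN_VISEMES = {"aei", "o", "ee", "bmp", "fv", "l", "r", "th", "qw", "cdgknstxyz"}


def viseme_at(frames, t, rest="rest"):
    """Return the viseme active at time t, or `rest` when no phone covers t.

    `frames` is a list of (start, end, symbol) tuples sorted by start time.
    """
    starts = [start for start, _, _ in frames]
    i = bisect_right(starts, t) - 1  # last frame starting at or before t
    if i >= 0:
        _, end, symbol = frames[i]
        if t < end:
            # Unknown symbols fall back to the neutral pose.
            return symbol if symbol in KNOWN_VISEMES else rest
    return rest


# Keyframes for "Hello," taken from the TTS 1.5 example, rounded.
frames = [(0.0, 0.07, "aei"), (0.07, 0.1, "aei"), (0.1, 0.19, "l"), (0.19, 0.28, "o")]
current = viseme_at(frames, 0.15)
```

Rebuilding `starts` on every call keeps the sketch short; a renderer calling this every frame would precompute it once.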

TTS 1 models (`inworld-tts-1`, `inworld-tts-1-max`)

Returns basic word/character timing arrays:

{
  "timestampInfo": {
    "wordAlignment": {
      "words": ["Hello", "world,", "this", "will", "be", "saved"],
      "wordStartTimeSeconds": [0, 0.33, 0.69, 0.89, 1.1, 1.26],
      "wordEndTimeSeconds": [0.28, 0.63, 0.87, 1.05, 1.16, 1.6]
    }
  }
}
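For word highlighting, the basic arrays are enough to find which word is being spoken at a given playback time. A minimal sketch, assuming `wordStartTimeSeconds` is sorted in ascending order (as in the example above); `word_index_at` is an illustrative helper name.

```python
from bisect import bisect_right


def word_index_at(word_alignment: dict, t: float):
    """Return the index of the word spoken at time t, or None during a gap."""
    starts = word_alignment["wordStartTimeSeconds"]
    ends = word_alignment["wordEndTimeSeconds"]
    i = bisect_right(starts, t) - 1  # last word starting at or before t
    if i >= 0 and t < ends[i]:
        return i
    return None


word_alignment = {
    "words": ["Hello", "world,", "this", "will", "be", "saved"],
    "wordStartTimeSeconds": [0, 0.33, 0.69, 0.89, 1.1, 1.26],
    "wordEndTimeSeconds": [0.28, 0.63, 0.87, 1.05, 1.16, 1.6],
}
index = word_index_at(word_alignment, 0.5)
highlighted = word_alignment["words"][index]
```

Times that fall between two words, such as 0.3 in this example, return None so the caller can decide whether to keep the previous word highlighted.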
