Timestamp alignment currently supports English only; other languages are experimental.
Timestamp alignment returns timing information that matches the generated audio, which is useful for experiences like word highlighting, karaoke-style captions, and lip-sync. Set the timestampType request parameter to control granularity:
  • WORD: Return timestamps for each word. For TTS 1.5 models, this also includes detailed phoneme-level timing with viseme symbols
  • CHARACTER: Return timestamps for each character, including punctuation marks
Enabling timestamp alignment can increase latency (especially for the non-streaming endpoint).
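As a minimal sketch of setting the parameter on the client side, the helper below adds timestampType to an existing request payload and rejects values other than the two listed above. The helper name with_timestamps is hypothetical, and placing timestampType at the top level of the request body is an assumption; check the API reference for the exact request shape.

```python
from typing import Any, Dict

# The two granularities described above.
VALID_TIMESTAMP_TYPES = {"WORD", "CHARACTER"}


def with_timestamps(payload: Dict[str, Any], timestamp_type: str) -> Dict[str, Any]:
    """Return a copy of a request payload with timestampType set.

    Assumption: timestampType is a top-level field of the request body.
    """
    if timestamp_type not in VALID_TIMESTAMP_TYPES:
        raise ValueError(
            f"timestampType must be one of {sorted(VALID_TIMESTAMP_TYPES)}"
        )
    updated = dict(payload)  # shallow copy so the caller's dict is untouched
    updated["timestampType"] = timestamp_type
    return updated
```

Leaving the caller's payload unmodified makes it easy to send the same request with and without alignment when measuring the latency cost.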
When enabled, the response includes timestamp arrays:
  • WORD: timestampInfo.wordAlignment with words, wordStartTimeSeconds, wordEndTimeSeconds
    • For TTS 1.5 models, phoneticDetails containing detailed phoneme-level timing with viseme symbols
  • CHARACTER: timestampInfo.characterAlignment with characters, characterStartTimeSeconds, characterEndTimeSeconds
Phoneme and viseme timings (phoneticDetails) are currently only returned for WORD alignment (not CHARACTER).
See the API reference for full details.
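The alignment fields listed above are parallel arrays, so a client will usually want to zip them into one record per token. The sketch below does that for either granularity. The function name alignment_spans is hypothetical, and it assumes the three arrays in a response always have the same length, as they do in the examples on this page.

```python
from typing import Any, Dict, List, Tuple

Span = Tuple[str, float, float]  # (token, start_seconds, end_seconds)


def alignment_spans(timestamp_info: Dict[str, Any]) -> List[Span]:
    """Zip the parallel alignment arrays into (token, start, end) tuples."""
    if "wordAlignment" in timestamp_info:
        a = timestamp_info["wordAlignment"]
        tokens = a["words"]
        starts = a["wordStartTimeSeconds"]
        ends = a["wordEndTimeSeconds"]
    elif "characterAlignment" in timestamp_info:
        a = timestamp_info["characterAlignment"]
        tokens = a["characters"]
        starts = a["characterStartTimeSeconds"]
        ends = a["characterEndTimeSeconds"]
    else:
        return []  # no alignment data in this response
    # Assumption: the three arrays are index-aligned and equally long.
    if not (len(tokens) == len(starts) == len(ends)):
        raise ValueError("alignment arrays have different lengths")
    return list(zip(tokens, starts, ends))
```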

Streaming behavior

Each streamed chunk includes alignment data for that specific chunk. Audio and alignment arrive together in sync.
Chunk 1: audio + alignment for chunk 1
Chunk 2: audio + alignment for chunk 2
Chunk 3: audio + alignment for chunk 3
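A client consuming the stream can process alignment chunk by chunk, as in the rough sketch below. It assumes each decoded chunk is a dict with a top-level timestampInfo object shaped like the responses shown on this page; the function name stream_word_spans is hypothetical. Times are passed through unchanged, because this page does not state whether they are relative to the chunk or to the whole utterance.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple


def stream_word_spans(
    chunks: Iterable[Dict[str, Any]],
) -> Iterator[Tuple[int, str, float, float]]:
    """Yield (chunk_index, word, start, end) as each chunk arrives."""
    for chunk_index, chunk in enumerate(chunks):
        # Assumption: alignment lives under chunk["timestampInfo"].
        alignment = chunk.get("timestampInfo", {}).get("wordAlignment")
        if not alignment:
            continue  # this chunk carried no word alignment
        for word, start, end in zip(
            alignment["words"],
            alignment["wordStartTimeSeconds"],
            alignment["wordEndTimeSeconds"],
        ):
            yield chunk_index, word, start, end
```

Because the function is a generator, words become available as soon as their chunk has been decoded, without waiting for the stream to finish.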

Response structure

TTS 1.5 models (inworld-tts-1.5-mini, inworld-tts-1.5-max)

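Returns enhanced alignment data: in addition to word timing, phoneticDetails provides phoneme-level timing with viseme symbols for precise lip-sync animation. The phoneticDetails array in the example below covers only the first three words.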
{
  "timestampInfo": {
    "wordAlignment": {
      "words": ["Hello,", "world,", "this", "will", "be", "saved"],
      "wordStartTimeSeconds": [0, 0.28, 0.96, 1.25, 1.38, 1.5],
      "wordEndTimeSeconds": [0.28, 0.8, 1.25, 1.38, 1.5, 1.99],
      "phoneticDetails": [
        {
          "wordIndex": 0,
          "phones": [
            {"phoneSymbol": "h", "startTimeSeconds": 0, "durationSeconds": 0.07, "visemeSymbol": "aei"},
            {"phoneSymbol": "ə", "startTimeSeconds": 0.07, "durationSeconds": 0.030000001, "visemeSymbol": "aei"},
            {"phoneSymbol": "l", "startTimeSeconds": 0.1, "durationSeconds": 0.089999996, "visemeSymbol": "l"},
            {"phoneSymbol": "oʊ1", "startTimeSeconds": 0.19, "durationSeconds": 0.09, "visemeSymbol": "o"}
          ],
          "isPartial": false
        },
        {
          "wordIndex": 1,
          "phones": [
            {"phoneSymbol": "w", "startTimeSeconds": 0.28, "durationSeconds": 0.18, "visemeSymbol": "qw"},
            {"phoneSymbol": "ɝ1", "startTimeSeconds": 0.46, "durationSeconds": 0.119999975, "visemeSymbol": "r"},
            {"phoneSymbol": "l", "startTimeSeconds": 0.58, "durationSeconds": 0.08000004, "visemeSymbol": "l"},
            {"phoneSymbol": "d", "startTimeSeconds": 0.66, "durationSeconds": 0.13999999, "visemeSymbol": "cdgknstxyz"}
          ],
          "isPartial": false
        },
        {
          "wordIndex": 2,
          "phones": [
            {"phoneSymbol": "ð", "startTimeSeconds": 0.96, "durationSeconds": 0.14000005, "visemeSymbol": "th"},
            {"phoneSymbol": "ɪ1", "startTimeSeconds": 1.1, "durationSeconds": 0.06999993, "visemeSymbol": "ee"},
            {"phoneSymbol": "s", "startTimeSeconds": 1.17, "durationSeconds": 0.08000004, "visemeSymbol": "cdgknstxyz"}
          ],
          "isPartial": false
        }
      ]
    }
  }
}
Phonetic details structure
Each entry in phoneticDetails contains:
| Field | Description |
| --- | --- |
| wordIndex | Index of the word this phonetic detail belongs to (0-based). |
| phones | Array of phonemes that make up this word. |
| isPartial | True when the server considers the word potentially unstable (e.g., last word in a non-final streaming update). Clients may choose to delay processing partial words until isPartial becomes false. |
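One way to delay partial words is to hold them back until an update marks them as final, then release each word once. The sketch below does this with a hypothetical StableWordBuffer class. It assumes wordIndex identifies the same word across successive streaming updates, which this page does not state explicitly.

```python
from typing import Any, Dict, List, Set


class StableWordBuffer:
    """Release each phoneticDetails entry once, after it stops being partial."""

    def __init__(self) -> None:
        self._emitted: Set[int] = set()

    def accept(self, phonetic_details: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Return the entries from this update that are final and not yet emitted."""
        ready: List[Dict[str, Any]] = []
        for entry in phonetic_details:
            # Assumption: wordIndex is stable across updates in one stream.
            index = entry["wordIndex"]
            if entry.get("isPartial", False) or index in self._emitted:
                continue  # still unstable, or already handed to the caller
            self._emitted.add(index)
            ready.append(entry)
        return ready
```

Calling accept on every update keeps the animation or caption layer from reacting to a word whose timing may still change.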
Each phone entry contains:
| Field | Description |
| --- | --- |
| phoneSymbol | The phoneme symbol in IPA notation. In the example above, some vowel symbols carry a trailing digit (e.g., oʊ1, ɝ1, ɪ1). |
| startTimeSeconds | Start time of the phoneme in seconds. May be omitted for the first phoneme of a word. |
| durationSeconds | Duration of the phoneme in seconds. |
| visemeSymbol | The viseme symbol for lip-sync animation. |
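Since each phone carries a start time and a duration rather than an end time, animation code typically needs to derive the end and cope with an omitted startTimeSeconds. The sketch below flattens phoneticDetails into (phone, viseme, start, end) tuples. The function name phone_spans is hypothetical, and falling back to the owning word's start time when startTimeSeconds is missing is an assumption, not documented behavior.

```python
from typing import Any, Dict, List, Tuple

PhoneSpan = Tuple[str, str, float, float]  # (phoneSymbol, visemeSymbol, start, end)


def phone_spans(word_alignment: Dict[str, Any]) -> List[PhoneSpan]:
    """Flatten phoneticDetails into (phone, viseme, start, end) tuples."""
    word_starts = word_alignment["wordStartTimeSeconds"]
    spans: List[PhoneSpan] = []
    for detail in word_alignment.get("phoneticDetails", []):
        # Assumption: a missing startTimeSeconds means "continue from the
        # cursor", which begins at the start time of the owning word.
        cursor = word_starts[detail["wordIndex"]]
        for phone in detail["phones"]:
            start = phone.get("startTimeSeconds", cursor)
            end = start + phone["durationSeconds"]
            spans.append(
                (phone["phoneSymbol"], phone["visemeSymbol"], start, end)
            )
            cursor = end  # next phone without a start time begins here
    return spans
```

In the TTS 1.5 example, the last phone of "world," starts at 0.66 and lasts about 0.14 seconds, so its derived end lines up with the word's end time of 0.8.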
Viseme symbols
The following viseme symbols are used for lip-sync animation:
| Viseme | Description |
| --- | --- |
| aei | Open mouth vowels (a, e, i, ə, ʌ, æ, ɑ, etc.) |
| o | Rounded vowels (o, ʊ, əʊ, oʊ, etc.) |
| ee | Front vowels (i, ɪ, eɪ, etc.) |
| bmp | Bilabial consonants (b, m, p) |
| fv | Labiodental consonants (f, v) |
| l | Lateral consonant (l) |
| r | Rhotic sounds (r, ɝ, ɚ) |
| th | Dental fricatives (θ, ð) |
| qw | Rounded consonants (w, ʍ) |
| cdgknstxyz | Alveolar/velar consonants (c, d, g, k, n, s, t, x, y, z) |
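Adjacent phonemes often share a viseme (in the TTS 1.5 example, h and ə both map to aei), so a lip-sync client may want to merge them into a single mouth-shape keyframe. The sketch below is one possible approach. The function name merge_visemes and the max_gap tolerance are hypothetical choices, not part of the API.

```python
from typing import List, Tuple

VisemeSpan = Tuple[str, float, float]  # (visemeSymbol, start, end)


def merge_visemes(
    spans: List[VisemeSpan], max_gap: float = 1e-3
) -> List[VisemeSpan]:
    """Merge consecutive spans that share a viseme and are nearly contiguous."""
    merged: List[VisemeSpan] = []
    for viseme, start, end in spans:
        if merged:
            prev_viseme, prev_start, prev_end = merged[-1]
            # Extend the previous keyframe instead of starting a new one
            # when the mouth shape is unchanged and there is no real pause.
            if prev_viseme == viseme and start - prev_end <= max_gap:
                merged[-1] = (viseme, prev_start, end)
                continue
        merged.append((viseme, start, end))
    return merged
```

The small tolerance absorbs the floating-point noise visible in the sample durations, while a real pause between two identical visemes still produces two separate keyframes.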

TTS 1 models (inworld-tts-1, inworld-tts-1-max)

Returns basic word/character timing arrays:
{
  "timestampInfo": {
    "wordAlignment": {
      "words": ["Hello", "world,", "this", "will", "be", "saved"],
      "wordStartTimeSeconds": [0, 0.33, 0.69, 0.89, 1.1, 1.26],
      "wordEndTimeSeconds": [0.28, 0.63, 0.87, 1.05, 1.16, 1.6]
    }
  }
}
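For word highlighting, a player needs to map the current playback position to a word. The sketch below does this with a binary search over the start times. The function name active_word_index is hypothetical, and it assumes wordStartTimeSeconds is sorted in ascending order, as in the examples on this page.

```python
from bisect import bisect_right
from typing import Optional, Sequence


def active_word_index(
    starts: Sequence[float], ends: Sequence[float], t: float
) -> Optional[int]:
    """Return the index of the word being spoken at time t, or None."""
    # Index of the last word whose start time is <= t.
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < ends[i]:
        return i
    return None  # before the first word, between words, or after the last
```

With the arrays from the TTS 1 example, a playback position of 0.1 seconds falls inside "Hello" (index 0), while 0.3 seconds falls in the gap between "Hello" ending at 0.28 and "world," starting at 0.33, so no word is highlighted.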