Timestamps

Timestamp alignment lets you retrieve timing information that matches the generated audio, which is useful for experiences like word highlighting, karaoke-style captions, and lip-sync. The examples and behavior described on this page apply to the TTS 1.5 and TTS-2 models (inworld-tts-1.5-mini, inworld-tts-1.5-max, inworld-tts-2). Set the timestampType request parameter to control granularity:

WORD: Return timestamps for every token in the original text — words, punctuation, and whitespace — in the exact order they were given, with phoneme-level timing and viseme symbols.
CHARACTER: Return timestamps for each character or punctuation mark.

Enabling timestamp alignment can increase latency (especially for the non-streaming endpoint).

When enabled, the response includes timestamp arrays:

WORD: timestampInfo.wordAlignment with words, wordStartTimeSeconds, wordEndTimeSeconds, and phoneticDetails. The words array covers every token from the original input in order, so the alignment maps back to the full text without gaps.
CHARACTER: timestampInfo.characterAlignment with characters, characterStartTimeSeconds, and characterEndTimeSeconds.

See the API reference for full details.
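
As a rough illustration of consuming the WORD arrays, the sketch below pairs each token with its start and end time. It assumes the three arrays are parallel (same length, same order), as the sample response on this page suggests. The `WordTiming` type and `word_timings` helper are hypothetical names, not part of the API.

```python
from typing import NamedTuple


class WordTiming(NamedTuple):
    text: str
    start: float
    end: float


def word_timings(response: dict) -> list[WordTiming]:
    """Pair each token in `words` with its start and end time (in seconds)."""
    alignment = response["timestampInfo"]["wordAlignment"]
    words = alignment["words"]
    starts = alignment["wordStartTimeSeconds"]
    ends = alignment["wordEndTimeSeconds"]
    # Assumption: the arrays are parallel; fail loudly if they are not.
    if not (len(words) == len(starts) == len(ends)):
        raise ValueError("alignment arrays have different lengths")
    return [WordTiming(w, s, e) for w, s, e in zip(words, starts, ends)]


# Trimmed copy of the sample response shown later on this page.
sample = {
    "timestampInfo": {
        "wordAlignment": {
            "words": ["", "Hello", " ", "world"],
            "wordStartTimeSeconds": [0, 0.24, 0.59, 0.59],
            "wordEndTimeSeconds": [0.24, 0.59, 0.59, 1.09],
        }
    }
}

timings = word_timings(sample)
# Every token is covered, so joining the tokens rebuilds the input text.
text = "".join(t.text for t in timings)
```

Because whitespace and punctuation tokens are included, a highlighter can skip entries whose text is empty or blank and still keep character offsets into the original string.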

Streaming behavior

You can control how timestamp data is delivered alongside audio using timestampTransportStrategy.

Sync (default)

Audio and alignment arrive together in each chunk. Every chunk contains both audio data and its corresponding timestamps.

Chunk 1: audio + timestamps for chunk 1
Chunk 2: audio + timestamps for chunk 2
Chunk 3: audio + timestamps for chunk 3

This is the simplest approach, but the first audio chunk is slightly delayed because the server waits for alignment computation before sending it.

Async

Audio chunks arrive first, followed by separate trailing messages containing only timestamp data. This reduces time-to-first-audio with TTS 1.5 and TTS-2 models, since the server doesn’t need to wait for alignment computation before sending audio.

Chunk 1: audio only
Chunk 2: audio only
Chunk 3: audio only
Chunk 4: timestamps only (alignment for chunks 1–3)
Chunk 5: timestamps only
...

Use async when you prioritize playback speed and can handle timestamps arriving after their corresponding audio. Use sync when you need audio and timestamps together in each chunk (e.g., for real-time lip-sync or word highlighting during playback). Set timestampTransportStrategy to SYNC or ASYNC in your request. See the API reference for details.
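
A client that treats audio and timestamps as independent, optional parts of each message can handle both strategies with the same loop. The following is a minimal sketch under stated assumptions: each streamed message is already decoded into a dict, audio bytes sit under a hypothetical `audio` key, and timestamp-only messages carry the same `timestampInfo.wordAlignment` shape as combined ones. The `collect_stream` helper is illustrative; check the API reference for the actual message fields.

```python
def collect_stream(messages):
    """Accumulate audio and word alignment from a SYNC or ASYNC stream.

    Assumed message shape (hypothetical): a dict with an optional "audio"
    bytes value and an optional "timestampInfo" object. Times are kept
    exactly as received; no offsets are applied.
    """
    audio = bytearray()
    words, starts, ends = [], [], []
    for msg in messages:
        chunk = msg.get("audio")
        if chunk:
            audio.extend(chunk)
        info = msg.get("timestampInfo")
        if info:
            alignment = info.get("wordAlignment", {})
            words.extend(alignment.get("words", []))
            starts.extend(alignment.get("wordStartTimeSeconds", []))
            ends.extend(alignment.get("wordEndTimeSeconds", []))
    return bytes(audio), list(zip(words, starts, ends))


def ts(words, starts, ends):
    """Build a timestampInfo object for the demo streams below."""
    return {"wordAlignment": {"words": words,
                              "wordStartTimeSeconds": starts,
                              "wordEndTimeSeconds": ends}}


# SYNC: every chunk carries audio and its timestamps.
sync_stream = [
    {"audio": b"\x01", "timestampInfo": ts(["Hi"], [0.0], [0.3])},
    {"audio": b"\x02", "timestampInfo": ts([" "], [0.3], [0.3])},
]
# ASYNC: audio-only chunks first, then a trailing timestamps-only message.
async_stream = [
    {"audio": b"\x01"},
    {"audio": b"\x02"},
    {"timestampInfo": ts(["Hi", " "], [0.0, 0.3], [0.3, 0.3])},
]

sync_result = collect_stream(sync_stream)
async_result = collect_stream(async_stream)
```

In a real player you would start playback as soon as audio arrives instead of buffering everything; the point here is only that the parsing logic does not need to branch on the transport strategy.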

Response structure

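With timestampType set to WORD, the response returns alignment data with phoneme-level timing and viseme symbols for lip-sync animation. The example below is for the input "Hello world, this will be saved."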

{
  "timestampInfo": {
    "wordAlignment": {
      "words": ["", "Hello", " ", "world", ", ", "this", " ", "will", " ", "be", " ", "saved", "."],
      "wordStartTimeSeconds": [0, 0.24, 0.59, 0.59, 1.09, 1.29, 1.52, 1.52, 1.62, 1.62, 1.73, 1.73, 2.34],
      "wordEndTimeSeconds": [0.24, 0.59, 0.59, 1.09, 1.29, 1.52, 1.52, 1.62, 1.62, 1.73, 1.73, 2.34, 2.34],
      "phoneticDetails": [
        {
          "wordIndex": 0,
          "phones": [
            {"phoneSymbol": "[silence]", "startTimeSeconds": 0, "durationSeconds": 0.24, "visemeSymbol": "bmp"}
          ]
        },
        {
          "wordIndex": 1,
          "phones": [
            {"phoneSymbol": "h", "startTimeSeconds": 0.24, "durationSeconds": 0.14, "visemeSymbol": "cdgknstxyz"},
            {"phoneSymbol": "ɛ", "startTimeSeconds": 0.38, "durationSeconds": 0.04, "visemeSymbol": "aei"},
            {"phoneSymbol": "l", "startTimeSeconds": 0.42, "durationSeconds": 0.05, "visemeSymbol": "l"},
            {"phoneSymbol": "ə", "startTimeSeconds": 0.47, "durationSeconds": 0.07, "visemeSymbol": "aei"},
            {"phoneSymbol": "ʊ", "startTimeSeconds": 0.54, "durationSeconds": 0.05, "visemeSymbol": "o"}
          ]
        },
        {
          "wordIndex": 2,
          "phones": [
            {"phoneSymbol": "[silence]", "startTimeSeconds": 0.59, "durationSeconds": 0, "visemeSymbol": "bmp"}
          ]
        },
        {
          "wordIndex": 3,
          "phones": [
            {"phoneSymbol": "w", "startTimeSeconds": 0.59, "durationSeconds": 0.1, "visemeSymbol": "qw"},
            {"phoneSymbol": "ˈɝ", "startTimeSeconds": 0.69, "durationSeconds": 0.03, "visemeSymbol": "r"},
            {"phoneSymbol": "ɫ", "startTimeSeconds": 0.72, "durationSeconds": 0.23, "visemeSymbol": "l"},
            {"phoneSymbol": "d", "startTimeSeconds": 0.95, "durationSeconds": 0.14, "visemeSymbol": "cdgknstxyz"}
          ]
        },
        {
          "wordIndex": 4,
          "phones": [
            {"phoneSymbol": "[silence]", "startTimeSeconds": 1.09, "durationSeconds": 0.2, "visemeSymbol": "bmp"}
          ]
        },
        {
          "wordIndex": 5,
          "phones": [
            {"phoneSymbol": "ð", "startTimeSeconds": 1.29, "durationSeconds": 0.07, "visemeSymbol": "th"},
            {"phoneSymbol": "ɪ", "startTimeSeconds": 1.36, "durationSeconds": 0.05, "visemeSymbol": "ee"},
            {"phoneSymbol": "s", "startTimeSeconds": 1.41, "durationSeconds": 0.11, "visemeSymbol": "cdgknstxyz"}
          ]
        },
        {
          "wordIndex": 6,
          "phones": [
            {"phoneSymbol": "[silence]", "startTimeSeconds": 1.52, "durationSeconds": 0, "visemeSymbol": "bmp"}
          ]
        },
        {
          "wordIndex": 7,
          "phones": [
            {"phoneSymbol": "w", "startTimeSeconds": 1.52, "durationSeconds": 0.03, "visemeSymbol": "qw"},
            {"phoneSymbol": "ə", "startTimeSeconds": 1.55, "durationSeconds": 0.03, "visemeSymbol": "aei"},
            {"phoneSymbol": "ɫ", "startTimeSeconds": 1.58, "durationSeconds": 0.04, "visemeSymbol": "l"}
          ]
        },
        {
          "wordIndex": 8,
          "phones": [
            {"phoneSymbol": "[silence]", "startTimeSeconds": 1.62, "durationSeconds": 0, "visemeSymbol": "bmp"}
          ]
        },
        {
          "wordIndex": 9,
          "phones": [
            {"phoneSymbol": "b", "startTimeSeconds": 1.62, "durationSeconds": 0.05, "visemeSymbol": "bmp"},
            {"phoneSymbol": "i", "startTimeSeconds": 1.67, "durationSeconds": 0.06, "visemeSymbol": "ee"}
          ]
        },
        {
          "wordIndex": 10,
          "phones": [
            {"phoneSymbol": "[silence]", "startTimeSeconds": 1.73, "durationSeconds": 0, "visemeSymbol": "bmp"}
          ]
        },
        {
          "wordIndex": 11,
          "phones": [
            {"phoneSymbol": "s", "startTimeSeconds": 1.73, "durationSeconds": 0.14, "visemeSymbol": "cdgknstxyz"},
            {"phoneSymbol": "ˈe", "startTimeSeconds": 1.87, "durationSeconds": 0.17, "visemeSymbol": "ee"},
            {"phoneSymbol": "ɪ", "startTimeSeconds": 2.04, "durationSeconds": 0.05, "visemeSymbol": "ee"},
            {"phoneSymbol": "v", "startTimeSeconds": 2.09, "durationSeconds": 0.06, "visemeSymbol": "fv"},
            {"phoneSymbol": "d", "startTimeSeconds": 2.15, "durationSeconds": 0.19, "visemeSymbol": "cdgknstxyz"}
          ]
        },
        {
          "wordIndex": 12,
          "phones": [
            {"phoneSymbol": "[silence]", "startTimeSeconds": 2.34, "durationSeconds": 0, "visemeSymbol": "bmp"}
          ]
        }
      ]
    }
  }
}

Phonetic details structure

Each entry in phoneticDetails contains:

Field	Description
`wordIndex`	0-based index into the `words` array. Speech tokens have full phoneme breakdowns; non-speech tokens have a single `[silence]` phone (may be zero-length).
`phones`	Array of phonemes for this token.
`isPartial`	Deprecated. This field may still appear in responses for backward compatibility, but it is always `false`.

Each phone entry contains:

Field	Description
`phoneSymbol`	The phone symbol: IPA for speech phones, or `[silence]` for non-speech tokens.
`startTimeSeconds`	Start time of the phoneme in seconds.
`durationSeconds`	Duration of the phoneme in seconds.
`visemeSymbol`	The viseme symbol for lip-sync animation.
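
For lip-sync, the nested structure above can be flattened into a single list of timed visemes. This is a minimal sketch, assuming the field names shown in the tables; the `viseme_timeline` helper is a hypothetical name, and dropping zero-length `[silence]` phones is a design choice made here, not API behavior.

```python
def viseme_timeline(word_alignment: dict) -> list[tuple[float, float, str]]:
    """Flatten phoneticDetails into sorted (start, end, viseme) tuples."""
    timeline = []
    for detail in word_alignment.get("phoneticDetails", []):
        for phone in detail["phones"]:
            start = phone["startTimeSeconds"]
            duration = phone["durationSeconds"]
            if duration <= 0:
                continue  # zero-length [silence] phones occupy no time
            timeline.append((start, start + duration, phone["visemeSymbol"]))
    timeline.sort(key=lambda entry: entry[0])
    return timeline


# Two entries copied from the sample response on this page.
word_alignment = {
    "phoneticDetails": [
        {"wordIndex": 2, "phones": [
            {"phoneSymbol": "[silence]", "startTimeSeconds": 0.59,
             "durationSeconds": 0, "visemeSymbol": "bmp"},
        ]},
        {"wordIndex": 3, "phones": [
            {"phoneSymbol": "w", "startTimeSeconds": 0.59,
             "durationSeconds": 0.1, "visemeSymbol": "qw"},
            {"phoneSymbol": "ˈɝ", "startTimeSeconds": 0.69,
             "durationSeconds": 0.03, "visemeSymbol": "r"},
        ]},
    ]
}

timeline = viseme_timeline(word_alignment)
```

End times are computed as start plus duration, so they are subject to ordinary floating-point rounding; compare them with a small tolerance rather than for exact equality.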

Viseme symbols

The following viseme symbols are used for lip-sync animation:

Viseme	Description
`aei`	Open mouth vowels (a, e, i, ə, ʌ, æ, ɑ, etc.)
`o`	Rounded vowels (o, ʊ, əʊ, oʊ, etc.)
`ee`	Front vowels (i, ɪ, eɪ, etc.)
`bmp`	Bilabial consonants (b, m, p)
`fv`	Labiodental consonants (f, v)
`l`	Lateral consonant (l)
`r`	Rhotic sounds (r, ɝ, ɚ)
`th`	Dental fricatives (θ, ð)
`qw`	Rounded consonants (w, ʍ)
`chjsh`	Postalveolar/palatal consonants (tʃ, dʒ, ʃ, ʝ)
`cdgknstxyz`	Alveolar/velar consonants (c, d, g, k, n, s, t, x, y, z)
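
During playback, an animation loop typically needs the viseme that is active at the current audio position. The sketch below looks it up with a binary search over a sorted list of (start, end, viseme) tuples. The `viseme_at` helper and the choice of `bmp` as the resting shape are assumptions made for illustration (the sample response uses `bmp` for `[silence]` phones); symbols outside the table above also fall back to the resting shape.

```python
from bisect import bisect_right

# Viseme symbols listed in the table above.
KNOWN_VISEMES = {
    "aei", "o", "ee", "bmp", "fv", "l", "r", "th", "qw", "chjsh", "cdgknstxyz",
}


def viseme_at(timeline, t, rest="bmp"):
    """Return the viseme active at time t (seconds), or `rest` if none is.

    `timeline` must be sorted by start time and hold (start, end, viseme).
    """
    starts = [start for start, _, _ in timeline]
    i = bisect_right(starts, t) - 1  # last entry starting at or before t
    if i >= 0:
        _, end, viseme = timeline[i]
        if t < end and viseme in KNOWN_VISEMES:
            return viseme
    return rest


# First three phones of "Hello" from the sample response.
hello = [
    (0.24, 0.38, "cdgknstxyz"),
    (0.38, 0.42, "aei"),
    (0.42, 0.47, "l"),
]

during_vowel = viseme_at(hello, 0.40)
before_speech = viseme_at(hello, 0.10)
after_speech = viseme_at(hello, 1.00)
```

For long utterances, precompute the `starts` list once instead of rebuilding it on every frame.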
