Timestamp alignment lets you retrieve timing information that matches the generated audio, which is useful for experiences like word highlighting, karaoke-style captions, and lip sync. The examples and behavior described on this page apply to TTS 1.5 models (inworld-tts-1.5-mini, inworld-tts-1.5-max).
Set the `timestampType` request parameter to control granularity:
- `WORD`: Return timestamps for every token in the original text (words, punctuation, and whitespace) in the exact order they were given, with phoneme-level timing and viseme symbols.
- `CHARACTER`: Return timestamps for each character or punctuation mark.
Enabling timestamp alignment can increase latency (especially for the non-streaming endpoint).
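As a minimal sketch, a synthesis request enabling word-level timestamps might look like the following. The surrounding field names (`text`, `voiceId`, `modelId`) and the placeholder voice are assumptions for illustration; confirm exact names against the API reference.

```python
import json

# Hypothetical request payload sketch; only timestampType is described
# on this page, the other field names are assumed for illustration.
payload = {
    "text": "Hello, world!",
    "voiceId": "YOUR_VOICE_ID",        # placeholder
    "modelId": "inworld-tts-1.5-mini",
    "timestampType": "WORD",           # or "CHARACTER"
}

print(json.dumps(payload, indent=2))
```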
Each granularity returns a corresponding alignment object:

- `WORD`: `timestampInfo.wordAlignment` with `words`, `wordStartTimeSeconds`, `wordEndTimeSeconds`, and `phoneticDetails`. The `words` array covers every token from the original input in order, so the alignment maps back to the full text without gaps.
- `CHARACTER`: `timestampInfo.characterAlignment` with `characters`, `characterStartTimeSeconds`, and `characterEndTimeSeconds`.
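A minimal sketch of reading word-level timings, using the field names above. The response body here is a truncated, hypothetical example, not a real API response.

```python
# Hypothetical, truncated response illustrating the wordAlignment shape.
response = {
    "timestampInfo": {
        "wordAlignment": {
            "words": ["Hello", ",", " ", "world"],
            "wordStartTimeSeconds": [0.00, 0.32, 0.36, 0.40],
            "wordEndTimeSeconds":   [0.32, 0.36, 0.40, 0.85],
        }
    }
}

alignment = response["timestampInfo"]["wordAlignment"]
# The three arrays are index-aligned: entry i describes token i.
for word, start, end in zip(
    alignment["words"],
    alignment["wordStartTimeSeconds"],
    alignment["wordEndTimeSeconds"],
):
    print(f"{word!r}: {start:.2f}s - {end:.2f}s")
```

Note that whitespace and punctuation appear as tokens in their own right, so concatenating `words` reconstructs the original text.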
Streaming behavior
You can control how timestamp data is delivered alongside audio using the `timestampTransportStrategy` parameter.
Sync (default)

Audio and alignment arrive together in each chunk. Every chunk contains both audio data and its corresponding timestamps.

Async

Audio chunks arrive first, followed by separate trailing messages containing only timestamp data. This reduces time-to-first-audio with TTS 1.5 models, since the server doesn't need to wait for alignment computation before sending audio.

Set `timestampTransportStrategy` to `SYNC` or `ASYNC` in your request. See the API reference for details.
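A consumer that tolerates both strategies can simply check each chunk for audio and for timestamps independently. The chunk shape below (`audioContent`, `timestampInfo` keys) is an assumption for illustration; under `ASYNC`, timestamp-only chunks trail the audio.

```python
# Sketch: consume a stream of chunks under either transport strategy.
def consume(chunks):
    audio_parts, alignments = [], []
    for chunk in chunks:
        if "audioContent" in chunk:
            audio_parts.append(chunk["audioContent"])  # play/buffer immediately
        if "timestampInfo" in chunk:
            alignments.append(chunk["timestampInfo"])  # may trail the audio
    return audio_parts, alignments

# ASYNC-style stream: audio first, timestamp-only messages trailing.
stream = [
    {"audioContent": b"\x00\x01"},
    {"audioContent": b"\x02\x03"},
    {"timestampInfo": {"wordAlignment": {"words": ["Hi"]}}},
]
audio, ts = consume(stream)
print(len(audio), len(ts))
```

Because the loop inspects both keys, the same code handles `SYNC` streams, where every chunk carries audio and timestamps together.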
Response structure
Returns alignment data with phoneme-level timing and viseme symbols for lip-sync animation.

Phonetic details structure
Each entry in `phoneticDetails` contains:
| Field | Description |
|---|---|
| `wordIndex` | 0-based index into the `words` array. Speech tokens have full phoneme breakdowns; non-speech tokens have a single `[silence]` phone (may be zero-length). |
| `phones` | Array of phonemes for this token. |
| `isPartial` | Deprecated. This field may still appear in responses for backward compatibility, but it is always `false`. |
Each entry in `phones` contains:

| Field | Description |
|---|---|
| `phoneSymbol` | The phone symbol: IPA for speech phones, or `[silence]` for non-speech tokens. |
| `startTimeSeconds` | Start time of the phoneme in seconds. |
| `durationSeconds` | Duration of the phoneme in seconds. |
| `visemeSymbol` | The viseme symbol for lip-sync animation. |
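Putting the fields together, a lip-sync pipeline can flatten `phoneticDetails` into a timeline of timed visemes, skipping `[silence]` phones. The alignment data below is an illustrative example, not real model output.

```python
# Illustrative wordAlignment fragment using the fields documented above.
word_alignment = {
    "words": ["Hi", " "],
    "phoneticDetails": [
        {"wordIndex": 0, "phones": [
            {"phoneSymbol": "h", "startTimeSeconds": 0.00,
             "durationSeconds": 0.08, "visemeSymbol": "cdgknstxyz"},
            {"phoneSymbol": "aɪ", "startTimeSeconds": 0.08,
             "durationSeconds": 0.20, "visemeSymbol": "aei"},
        ]},
        {"wordIndex": 1, "phones": [
            {"phoneSymbol": "[silence]", "startTimeSeconds": 0.28,
             "durationSeconds": 0.0, "visemeSymbol": ""},
        ]},
    ],
}

# Build (start, duration, viseme) events, dropping non-speech phones.
timeline = [
    (p["startTimeSeconds"], p["durationSeconds"], p["visemeSymbol"])
    for detail in word_alignment["phoneticDetails"]
    for p in detail["phones"]
    if p["phoneSymbol"] != "[silence]"
]
print(timeline)
```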
Viseme symbols
The following viseme symbols are used for lip-sync animation:

| Viseme | Description |
|---|---|
| `aei` | Open mouth vowels (a, e, i, ə, ʌ, æ, ɑ, etc.) |
| `o` | Rounded vowels (o, ʊ, əʊ, oʊ, etc.) |
| `ee` | Front vowels (i, ɪ, eɪ, etc.) |
| `bmp` | Bilabial consonants (b, m, p) |
| `fv` | Labiodental consonants (f, v) |
| `l` | Lateral consonant (l) |
| `r` | Rhotic sounds (r, ɝ, ɚ) |
| `th` | Dental fricatives (θ, ð) |
| `qw` | Rounded consonants (w, ʍ) |
| `chjsh` | Postalveolar/palatal consonants (tʃ, dʒ, ʃ, ʝ) |
| `cdgknstxyz` | Alveolar/velar consonants (c, d, g, k, n, s, t, x, y, z) |
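In an animation rig, each viseme symbol typically maps to a mouth pose or blendshape. The mapping below is a sketch with made-up blendshape names; substitute the pose names your rig actually exposes.

```python
# Hypothetical viseme -> blendshape mapping; the shape names are invented.
VISEME_TO_SHAPE = {
    "aei": "mouth_open",
    "o": "mouth_round",
    "ee": "mouth_wide",
    "bmp": "lips_closed",
    "fv": "lip_bite",
    "l": "tongue_up",
    "r": "mouth_pucker",
    "th": "tongue_teeth",
    "qw": "lips_round_tight",
    "chjsh": "mouth_shush",
    "cdgknstxyz": "mouth_neutral_open",
}

def shape_for(viseme: str) -> str:
    # Fall back to a rest pose for silence or unknown symbols.
    return VISEME_TO_SHAPE.get(viseme, "mouth_rest")

print(shape_for("bmp"), shape_for("[silence]"))
```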