# Inworld Lip-sync

Inworld's TTS output carries per-phoneme timing data that drives real-time lip-sync. The system maps phonemes to viseme categories and exposes blend weights each frame, which you apply to morph targets through an Animation Blueprint node.

## How it works

When the TTS node produces a `FInworldData_TTSOutput`, it includes a `Timestamps` array. Each entry is an `FInworldAudioChunkTimestamp`: a word with its start and end times and a `Phones` array of `FInworldPhoneSpan` entries. Each span carries the phoneme symbol, its viseme category, its timestamp, and its duration.

During playback, `UInworldVoiceAudioComponent` fires `OnVoiceAudioPlayback` every tick with the current `FInworldVoiceAudioPlaybackInfo` (elapsed duration) and the cached phone spans. Pass both to a Blueprint Function Library (BFL) function to get per-viseme or per-phoneme blend weights, then feed those weights into the `Inworld Viseme` AnimGraph node.

```
TTS Node → FInworldData_TTSOutput (Audio + Timestamps[word → Phones[phoneme]])
         ↓
UInworldVoiceAudioComponent → OnVoiceAudioPlayback (PlaybackInfo + PhoneSpans)
         ↓
GetVisemeBlendsTTS / GetVisemeBlends → FInworldVisemeBlends
         ↓
Inworld Viseme AnimGraph Node → morph target curves
```

***

## Data types

### FInworldData\_TTSOutput

The output of a TTS node. Contains everything needed for playback and lip-sync.

| Field        | Type                                  | Description                            |
| ------------ | ------------------------------------- | -------------------------------------- |
| `Audio`      | `FInworldData_Audio`                  | The synthesized PCM audio              |
| `Text`       | `FString`                             | The text that was synthesized          |
| `Timestamps` | `TArray<FInworldAudioChunkTimestamp>` | Per-word timing and phonetic breakdown |

### FInworldAudioChunkTimestamp

One word in the utterance, with its time range and phone-level detail.

| Field        | Type                        | Description                                             |
| ------------ | --------------------------- | ------------------------------------------------------- |
| `Token`      | `FString`                   | The word text                                           |
| `StartTime`  | `float`                     | Word start time in seconds                              |
| `EndTime`    | `float`                     | Word end time in seconds                                |
| `Phones`     | `TArray<FInworldPhoneSpan>` | Per-phoneme breakdown for this word                     |
| `bIsPartial` | `bool`                      | `true` if this word may still change (streaming update) |
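
To make the word-level timing concrete, here is a minimal standalone C++ sketch of how a word could be selected for a given playback time. `WordTiming` and `FindWordIndexAt` are hypothetical stand-ins that mirror the `Token`, `StartTime`, and `EndTime` fields above. They are not the plugin's types, and treating each word as a half-open `[StartTime, EndTime)` range is an assumption.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in mirroring FInworldAudioChunkTimestamp (not the plugin type).
struct WordTiming {
    std::string Token;
    float StartTime;  // seconds
    float EndTime;    // seconds
};

// Returns the index of the word whose [StartTime, EndTime) range contains
// PlayedSeconds, or -1 when playback is between words or outside the utterance.
int FindWordIndexAt(const std::vector<WordTiming>& Words, float PlayedSeconds) {
    for (int Index = 0; Index < static_cast<int>(Words.size()); ++Index) {
        const WordTiming& Word = Words[Index];
        if (PlayedSeconds >= Word.StartTime && PlayedSeconds < Word.EndTime) {
            return Index;
        }
    }
    return -1;
}
```

In a project you would normally rely on `GetCurrentWord` rather than writing this lookup yourself; the sketch only illustrates what the timing fields allow.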

### FInworldPhoneSpan

One phoneme within a word. The source of all lip-sync timing.

| Field                   | Type      | Description                                    |
| ----------------------- | --------- | ---------------------------------------------- |
| `Phoneme`               | `FString` | IPA phoneme symbol (e.g. `"b"`, `"æ"`)         |
| `Viseme`                | `FString` | Viseme category string (e.g. `"BMP"`, `"AEI"`) |
| `Timestamp`             | `float`   | Time in seconds when this phoneme sounds       |
| `Duration`              | `float`   | Duration of this phoneme in seconds            |
| `WordIndexAtAudioChunk` | `int32`   | Index of the parent word in `Timestamps`       |

### FInworldVisemeBlends

Blend weights for the 12 Inworld viseme categories plus `STOP`, which represents silence/rest. Each weight is in `[0, 1]`.

| Field        | Sounds                           |
| ------------ | -------------------------------- |
| `BMP`        | b, m, p                          |
| `FV`         | f, v                             |
| `TH`         | th                               |
| `CDGKNSTXYZ` | c, d, g, k, n, s, t, x, y, z     |
| `CHJSH`      | ch, j, sh                        |
| `L`          | l                                |
| `R`          | r                                |
| `QW`         | q, w                             |
| `AEI`        | a, e, i                          |
| `EE`         | ee                               |
| `O`          | o                                |
| `U`          | u                                |
| `STOP`       | silence / rest (defaults to 1.0) |
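
The sketch below illustrates the rest pose implied by the table: every category at `0` and `STOP` at `1.0`, with `STOP` released when a mouth shape takes over. It is standalone C++ and not the plugin's implementation. `VisemeWeights`, `MakeRestPose`, and `MakeSingleViseme` are hypothetical names, and putting the full weight on a single category is a simplifying assumption.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Category names in the order of the table above; STOP is last.
static const std::array<const char*, 13> kVisemeNames = {
    "BMP", "FV", "TH", "CDGKNSTXYZ", "CHJSH", "L", "R",
    "QW", "AEI", "EE", "O", "U", "STOP"};
constexpr std::size_t kStopIndex = 12;

// Hypothetical stand-in for FInworldVisemeBlends: one weight per category.
struct VisemeWeights {
    std::array<float, 13> Values;
};

// Rest pose: every category at 0 except STOP at 1.
VisemeWeights MakeRestPose() {
    VisemeWeights Weights;
    Weights.Values.fill(0.0f);
    Weights.Values[kStopIndex] = 1.0f;
    return Weights;
}

// Puts full weight on the named category. Unknown names keep the rest pose.
VisemeWeights MakeSingleViseme(const std::string& Viseme) {
    VisemeWeights Weights = MakeRestPose();
    for (std::size_t Index = 0; Index < kVisemeNames.size(); ++Index) {
        if (Viseme == kVisemeNames[Index]) {
            Weights.Values[kStopIndex] = 0.0f;  // release the rest pose
            Weights.Values[Index] = 1.0f;       // then activate the match
            break;
        }
    }
    return Weights;
}
```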

### FInworldVoiceAudioPlaybackInfo

Playback timing provided each tick by `OnVoiceAudioPlayback`. Pass this to BFL functions to get the correct viseme weights for the current frame.

| Field                        | Type    | Description                                                                                       |
| ---------------------------- | ------- | ------------------------------------------------------------------------------------------------- |
| `Utterance.PlayedDuration`   | `float` | Seconds elapsed in the current utterance — used by BFL functions to look up the active phone span |
| `Utterance.TotalDuration`    | `float` | Total duration of the utterance                                                                   |
| `Utterance.PlayedPercent`    | `float` | Playback progress `[0, 1]`                                                                        |
| `Interaction.PlayedDuration` | `float` | Seconds elapsed across the whole interaction                                                      |

***

## UInworldVoiceAudioComponent

The `UInworldVoiceAudioComponent` handles TTS audio playback and is the main source of per-frame lip-sync data.

### Methods

| Method                                          | Description                                                                                           |
| ----------------------------------------------- | ----------------------------------------------------------------------------------------------------- |
| `QueueVoice(FInworldData_DataStream_TTSOutput)` | Queue a TTS stream chunk for playback                                                                 |
| `Interrupt()`                                   | Stop playback immediately and clear the queue                                                         |
| `GetCurrentPhoneSpans()`                        | Returns the cached `TArray<FInworldPhoneSpan>` for the current utterance — use with `GetVisemeBlends` |

### Events

| Event                   | Signature                                                       | Description                                                  |
| ----------------------- | --------------------------------------------------------------- | ------------------------------------------------------------ |
| `OnVoiceAudioStart`     | `(Component, FInworldData_TTSOutput, bInteractionStart)`        | Fired when a new utterance begins                            |
| `OnVoiceAudioPlayback`  | `(Component, PlaybackInfo, FInworldData_TTSOutput, PhoneSpans)` | Fired every tick during playback — primary hook for lip-sync |
| `OnVoiceAudioUpdated`   | `(Component, FInworldData_TTSOutput)`                           | Fired when TTS output data is updated                        |
| `OnVoiceAudioComplete`  | `(Component, FInworldData_TTSOutput, bInteractionEnd)`          | Fired when an utterance finishes normally                    |
| `OnVoiceAudioInterrupt` | `(Component, FInworldData_TTSOutput)`                           | Fired when playback is interrupted                           |

`OnVoiceAudioPlayback` is the recommended binding point for lip-sync — it provides `PlaybackInfo` and pre-built `PhoneSpans` in one call.

***

## Blueprint Function Library — Viseme & Phoneme functions

All functions are on `UInworldBlueprintFunctionLibrary`.

| Function                                                 | Category | Description                                                                                                          |
| -------------------------------------------------------- | -------- | -------------------------------------------------------------------------------------------------------------------- |
| `BuildPhoneSpansFromTTSOutput(TTSOutput, OutPhoneSpans)` | Viseme   | Flattens `Timestamps[].Phones[]` into a single array. Call once per utterance and cache the result.                  |
| `GetVisemeBlends(PlaybackInfo, PhoneSpans)`              | Viseme   | Returns `FInworldVisemeBlends` for the current playback time using a cached span array. Recommended for performance. |
| `GetVisemeBlendsTTS(PlaybackInfo, TTSOutput)`            | Viseme   | Same as `GetVisemeBlends`, but builds the spans from `TTSOutput` on each call. Convenient, but less efficient.       |
| `GetPhonemeBlends(PlaybackInfo, PhoneSpans)`             | Phoneme  | Returns `FInworldPhonemeBlends` (raw IPA phoneme weights) from a cached span array.                                  |
| `GetPhonemeBlendsTTS(PlaybackInfo, TTSOutput)`           | Phoneme  | Same as `GetPhonemeBlends`, but reads directly from `TTSOutput`.                                                     |
| `GetCurrentWord(PlaybackInfo, TTSOutput)`                | Playback | Returns `FInworldAudioVoiceWord` — the word currently being spoken (`Word`, `WordIndex`, `TotalWordCount`).          |

**Recommended pattern:** on `OnVoiceAudioStart`, call `BuildPhoneSpansFromTTSOutput` once and cache the result. Then, in `OnVoiceAudioPlayback`, call `GetVisemeBlends(PlaybackInfo, CachedPhoneSpans)` each tick. This avoids re-flattening the timestamp array every frame.
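
The flatten-once, look-up-every-tick pattern can be sketched in standalone C++ as follows. Every name here (`PhoneSpan`, `WordTiming`, `FlattenPhoneSpans`, `LipSyncCache`) is a hypothetical stand-in rather than the plugin's API, and returning a single viseme name instead of a full set of blend weights is a simplification.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins mirroring the documented structs (not the plugin types).
struct PhoneSpan {
    std::string Viseme;
    float Timestamp;  // seconds
    float Duration;   // seconds
};

struct WordTiming {
    std::string Token;
    std::vector<PhoneSpan> Phones;
};

// Flattens Timestamps[].Phones[] into one array, preserving order.
std::vector<PhoneSpan> FlattenPhoneSpans(const std::vector<WordTiming>& Words) {
    std::vector<PhoneSpan> Flat;
    for (const WordTiming& Word : Words) {
        Flat.insert(Flat.end(), Word.Phones.begin(), Word.Phones.end());
    }
    return Flat;
}

// Hypothetical driver: flatten on utterance start, reuse the cache every tick.
class LipSyncCache {
public:
    // Call once from the utterance-start handler.
    void OnUtteranceStart(const std::vector<WordTiming>& Words) {
        CachedSpans = FlattenPhoneSpans(Words);
    }

    // Call from the per-tick playback handler. Returns the viseme name active
    // at PlayedSeconds, or "STOP" when no span covers that time.
    std::string VisemeAt(float PlayedSeconds) const {
        for (const PhoneSpan& Span : CachedSpans) {
            if (PlayedSeconds >= Span.Timestamp &&
                PlayedSeconds < Span.Timestamp + Span.Duration) {
                return Span.Viseme;
            }
        }
        return "STOP";
    }

    std::size_t SpanCount() const { return CachedSpans.size(); }

private:
    std::vector<PhoneSpan> CachedSpans;
};
```

The per-tick path never touches the word array, so the flattening cost is paid once per utterance instead of once per frame.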

***

## Inworld Viseme AnimGraph node

`UAnimGraphNode_InworldViseme` is an Animation Blueprint node that applies morph target curves from viseme blend weights. Bone transforms pass through unchanged — only the curve track (morph targets) is modified.

<img src="https://mintcdn.com/inworldai/eNucmoEcJOlO8x-G/img/unreal/runtime/animgraph_InworldViseme.png?fit=max&auto=format&n=eNucmoEcJOlO8x-G&q=85&s=0d44d00bb931b8b8afeb591e57c313f1" alt="Inworld Viseme AnimGraph node in use" width="450" height="166" data-path="img/unreal/runtime/animgraph_InworldViseme.png" />

### Properties

| Property         | Type                       | Default | Pin | Description                                                                                     |
| ---------------- | -------------------------- | ------- | --- | ----------------------------------------------------------------------------------------------- |
| `Source`         | `FPoseLink`                | —       | Yes | Incoming pose — bones pass through unmodified                                                   |
| `VisemeBlends`   | `FInworldVisemeBlends`     | —       | Yes | Per-viseme weights from the BFL functions, updated each tick                                    |
| `VisemeData`     | `UInworldVisemeDataAsset*` | —       | No  | Data asset mapping visemes to morph target curve names and weights                              |
| `SmoothingSpeed` | `float`                    | `12.0`  | No  | Interpolation speed toward target weights per second. `0` disables smoothing                    |
| `Alpha`          | `float`                    | `1.0`   | No  | Overall blend strength `[0, 2]`. `0` suppresses lip-sync, `1` is full, `2` doubles morph values |

***

## UInworldVisemeDataAsset

A `UDataAsset` that maps each viseme category to one or more morph target curve name/weight pairs. You create one asset per character rig, then assign it to the `VisemeData` property on the AnimGraph node.

Each viseme entry is a `TMap<FName, float>` where the key is the morph target curve name on your Skeletal Mesh and the value is the contribution weight for that viseme.

**Supported viseme entries:** `BMP`, `FV`, `TH`, `CDGKNSTXYZ`, `CHJSH`, `L`, `R`, `QW`, `AEI`, `EE`, `O`, `U`

(`STOP` is handled automatically — it does not need an entry in the data asset.)

<img src="https://mintcdn.com/inworldai/drrW6P-S34ZcN4Q3/img/unreal/runtime/viseme_data_asset_metahuman.png?fit=max&auto=format&n=drrW6P-S34ZcN4Q3&q=85&s=6195fbf719792ea8cc13b2dbeabce5ea" alt="Example UInworldVisemeDataAsset configured for MetaHuman" width="904" height="729" data-path="img/unreal/runtime/viseme_data_asset_metahuman.png" />

### Setting up a VisemeDataAsset

1. In the Content Browser, right-click and choose **Miscellaneous > Data Asset**
2. Select `InworldVisemeDataAsset` as the class
3. Open the asset and, for each viseme entry, add the morph target curve names from your Skeletal Mesh along with their blend weights
4. Assign the asset to the `VisemeData` property on your `Inworld Viseme` AnimGraph node

For MetaHuman characters, each viseme typically maps to one or more `CTRL_expressions_*` curves. Weights are additive — multiple curves per viseme are all applied simultaneously.

***

## Setting up lip-sync in an Animation Blueprint

<Steps titleSize="h3">
  <Step title="Add the Inworld Viseme node to your AnimGraph">
    Open your character's Animation Blueprint and navigate to the **AnimGraph**. Search for **Inworld Viseme** and place the node in your graph, wiring its **Source** input from your existing pose and its output toward **Output Pose**.

    Assign your `UInworldVisemeDataAsset` to the **Viseme Data** property on the node.
  </Step>

  <Step title="Create a VisemeBlends variable">
    Add an `FInworldVisemeBlends` variable to the Animation Blueprint and wire it into the **Viseme Blends** pin on the Inworld Viseme node. The `OnVoiceAudioPlayback` callback set up in the next step updates this variable each tick.
  </Step>

  <Step title="Bind to OnVoiceAudioPlayback">
    In the character actor's Blueprint (or on `BeginPlay` in the Anim BP), get the `UInworldVoiceAudioComponent` and bind to `OnVoiceAudioPlayback`. In the callback, call `GetVisemeBlends(PlaybackInfo, PhoneSpans)` and store the result in your `FInworldVisemeBlends` variable using `Set Anim Instance Variable` or a direct property write.

    For best performance, also bind to `OnVoiceAudioStart` and call `BuildPhoneSpansFromTTSOutput` there to cache the spans array. Then use `GetVisemeBlends(PlaybackInfo, CachedPhoneSpans)` in `OnVoiceAudioPlayback` instead of `GetVisemeBlendsTTS`.
  </Step>

  <Step title="Tune smoothing and alpha">
    Adjust **Smoothing Speed** on the AnimGraph node to control how snappily the mouth follows phoneme changes. The default of `12` is a good starting point. Use **Alpha** to globally scale the strength of lip-sync, which is useful for blending with other facial animation systems.
  </Step>
</Steps>
