Inworld’s TTS output carries per-phoneme timing data that drives real-time lip-sync. The system maps phonemes to viseme categories and exposes blend weights each frame, which you apply to morph targets through an Animation Blueprint node.

How it works

When the TTS node produces an FInworldData_TTSOutput, it includes a Timestamps array. Each entry is an FInworldAudioChunkTimestamp: a word with its start/end times and a Phones array of FInworldPhoneSpan entries. Each span carries the phoneme symbol, its viseme category, and its timestamp. At playback time, UInworldVoiceAudioComponent fires OnVoiceAudioPlayback every tick with the current FInworldVoiceAudioPlaybackInfo (elapsed duration) and the cached phone spans. You pass these into a Blueprint Function Library (BFL) function to get per-viseme or per-phoneme blend weights, then feed those weights into the Inworld Viseme AnimGraph node.
TTS Node → FInworldData_TTSOutput (Audio + Timestamps[word → Phones[phoneme]])

UInworldVoiceAudioComponent → OnVoiceAudioPlayback (PlaybackInfo + PhoneSpans)

GetVisemeBlendsTTS / GetVisemeBlends → FInworldVisemeBlends

Inworld Viseme AnimGraph Node → morph target curves
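The flow above can be sketched with simplified stand-in types. This is plain C++ for illustration only, not the plugin's implementation: the struct names and fields are abbreviated versions of FInworldAudioChunkTimestamp and FInworldPhoneSpan, and the lookup logic is an assumption about how a span array maps playback time to a viseme.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for FInworldPhoneSpan.
struct PhoneSpan {
    std::string Phoneme;   // IPA symbol, e.g. "b"
    std::string Viseme;    // category, e.g. "BMP"
    float Timestamp;       // start time in seconds
    float Duration;        // length in seconds
};

// Simplified stand-in for FInworldAudioChunkTimestamp.
struct WordTimestamp {
    std::string Token;
    std::vector<PhoneSpan> Phones;
};

// Roughly what BuildPhoneSpansFromTTSOutput does: flatten the per-word
// Phones arrays into one time-ordered span array.
std::vector<PhoneSpan> FlattenSpans(const std::vector<WordTimestamp>& Words)
{
    std::vector<PhoneSpan> Out;
    for (const WordTimestamp& W : Words)
        Out.insert(Out.end(), W.Phones.begin(), W.Phones.end());
    return Out;
}

// Find the viseme active at PlayedDuration; fall back to "STOP"
// (silence/rest) when no span covers the current time.
std::string ActiveViseme(const std::vector<PhoneSpan>& Spans, float PlayedDuration)
{
    for (const PhoneSpan& S : Spans)
        if (PlayedDuration >= S.Timestamp && PlayedDuration < S.Timestamp + S.Duration)
            return S.Viseme;
    return "STOP";
}
```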

Data types

FInworldData_TTSOutput

The output of a TTS node. Contains everything needed for playback and lip-sync.

| Field | Type | Description |
| --- | --- | --- |
| `Audio` | `FInworldData_Audio` | The synthesized PCM audio |
| `Text` | `FString` | The text that was synthesized |
| `Timestamps` | `TArray<FInworldAudioChunkTimestamp>` | Per-word timing and phonetic breakdown |

FInworldAudioChunkTimestamp

One word in the utterance, with its time range and phone-level detail.

| Field | Type | Description |
| --- | --- | --- |
| `Token` | `FString` | The word text |
| `StartTime` | `float` | Word start time in seconds |
| `EndTime` | `float` | Word end time in seconds |
| `Phones` | `TArray<FInworldPhoneSpan>` | Per-phoneme breakdown for this word |
| `bIsPartial` | `bool` | `true` if this word may still change (streaming update) |

FInworldPhoneSpan

One phoneme within a word. The source of all lip-sync timing.

| Field | Type | Description |
| --- | --- | --- |
| `Phoneme` | `FString` | IPA phoneme symbol (e.g. "b", "æ") |
| `Viseme` | `FString` | Viseme category string (e.g. "BMP", "AEI") |
| `Timestamp` | `float` | Time in seconds when this phoneme sounds |
| `Duration` | `float` | Duration of this phoneme in seconds |
| `WordIndexAtAudioChunk` | `int32` | Index of the parent word in `Timestamps` |

FInworldVisemeBlends

Blend weights for the 12 Inworld viseme categories, each in [0, 1]. STOP represents silence/rest.

| Field | Sounds |
| --- | --- |
| `BMP` | b, m, p |
| `FV` | f, v |
| `TH` | th |
| `CDGKNSTXYZ` | c, d, g, k, n, s, t, x, y, z |
| `CHJSH` | ch, j, sh |
| `L` | l |
| `R` | r |
| `QW` | q, w |
| `AEI` | a, e, i |
| `EE` | ee |
| `O` | o |
| `U` | u |
| `STOP` | silence / rest (defaults to 1.0) |
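As a rough illustration of the table, a lookup from a sound to its viseme category could look like the hypothetical helper below. This function is not part of the plugin; it just encodes the mapping above, with unmapped sounds falling back to STOP.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sound -> viseme-category lookup mirroring the table above.
std::string VisemeCategoryFor(const std::string& Sound)
{
    static const std::map<std::string, std::string> Table = {
        {"b", "BMP"}, {"m", "BMP"}, {"p", "BMP"},
        {"f", "FV"}, {"v", "FV"},
        {"th", "TH"},
        {"c", "CDGKNSTXYZ"}, {"d", "CDGKNSTXYZ"}, {"g", "CDGKNSTXYZ"},
        {"k", "CDGKNSTXYZ"}, {"n", "CDGKNSTXYZ"}, {"s", "CDGKNSTXYZ"},
        {"t", "CDGKNSTXYZ"}, {"x", "CDGKNSTXYZ"}, {"y", "CDGKNSTXYZ"},
        {"z", "CDGKNSTXYZ"},
        {"ch", "CHJSH"}, {"j", "CHJSH"}, {"sh", "CHJSH"},
        {"l", "L"}, {"r", "R"},
        {"q", "QW"}, {"w", "QW"},
        {"a", "AEI"}, {"e", "AEI"}, {"i", "AEI"},
        {"ee", "EE"}, {"o", "O"}, {"u", "U"},
    };
    auto It = Table.find(Sound);
    return It != Table.end() ? It->second : "STOP"; // unmapped -> silence/rest
}
```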

FInworldVoiceAudioPlaybackInfo

Playback timing provided each tick by OnVoiceAudioPlayback. Pass this to BFL functions to get the correct viseme weights for the current frame.

| Field | Type | Description |
| --- | --- | --- |
| `Utterance.PlayedDuration` | `float` | Seconds elapsed in the current utterance; used by BFL functions to look up the active phone span |
| `Utterance.TotalDuration` | `float` | Total duration of the utterance |
| `Utterance.PlayedPercent` | `float` | Playback progress [0, 1] |
| `Interaction.PlayedDuration` | `float` | Seconds elapsed across the whole interaction |

UInworldVoiceAudioComponent

The UInworldVoiceAudioComponent handles TTS audio playback and is the main source of per-frame lip-sync data.

Methods

| Method | Description |
| --- | --- |
| `QueueVoice(FInworldData_DataStream_TTSOutput)` | Queue a TTS stream chunk for playback |
| `Interrupt()` | Stop playback immediately and clear the queue |
| `GetCurrentPhoneSpans()` | Returns the cached `TArray<FInworldPhoneSpan>` for the current utterance; use with `GetVisemeBlends` |

Events

| Event | Signature | Description |
| --- | --- | --- |
| `OnVoiceAudioStart` | `(Component, FInworldData_TTSOutput, bInteractionStart)` | Fired when a new utterance begins |
| `OnVoiceAudioPlayback` | `(Component, PlaybackInfo, FInworldData_TTSOutput, PhoneSpans)` | Fired every tick during playback; primary hook for lip-sync |
| `OnVoiceAudioUpdated` | `(Component, FInworldData_TTSOutput)` | Fired when TTS output data is updated |
| `OnVoiceAudioComplete` | `(Component, FInworldData_TTSOutput, bInteractionEnd)` | Fired when an utterance finishes normally |
| `OnVoiceAudioInterrupt` | `(Component, FInworldData_TTSOutput)` | Fired when playback is interrupted |

OnVoiceAudioPlayback is the recommended binding point for lip-sync: it provides PlaybackInfo and pre-built PhoneSpans in one call.

Blueprint Function Library — Viseme & Phoneme functions

All functions are on UInworldBlueprintFunctionLibrary.
FunctionCategoryDescription
BuildPhoneSpansFromTTSOutput(TTSOutput, OutPhoneSpans)VisemeFlattens Timestamps[].Phones[] into a single array. Call once per utterance and cache the result.
GetVisemeBlends(PlaybackInfo, PhoneSpans)VisemeReturns FInworldVisemeBlends for the current playback time using a cached span array. Recommended for performance.
GetVisemeBlendsTTS(PlaybackInfo, TTSOutput)VisemeSame as above but builds spans internally from TTSOutput each call. Convenient, but less efficient.
GetPhonemeBlends(PlaybackInfo, PhoneSpans)PhonemeReturns FInworldPhonemeBlends (raw IPA phoneme weights) from a cached span array.
GetPhonemeBlendsTTS(PlaybackInfo, TTSOutput)PhonemeSame as above but reads directly from TTSOutput.
GetCurrentWord(PlaybackInfo, TTSOutput)PlaybackReturns FInworldAudioVoiceWord — the word currently being spoken (Word, WordIndex, TotalWordCount).
Recommended pattern: bind to OnVoiceAudioPlayback, cache PhoneSpans on OnVoiceAudioStart, then call GetVisemeBlends(PlaybackInfo, CachedPhoneSpans) each tick. This avoids re-flattening the timestamp array every frame.
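The per-tick half of that pattern can be sketched as follows. This is a simplified stand-in, not the plugin's GetVisemeBlends: it assigns full weight to the single active category and keeps STOP at 1.0 when nothing is playing (per the STOP default above), whereas the real function may blend across neighboring spans.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for FInworldPhoneSpan.
struct PhoneSpan { std::string Viseme; float Timestamp; float Duration; };

// Sketch of the per-tick lookup against a cached span array.
// Weight 1.0 on the active category; STOP stays 1.0 when idle.
std::map<std::string, float> GetBlends(const std::vector<PhoneSpan>& Cached,
                                       float PlayedDuration)
{
    std::map<std::string, float> Blends = {
        {"BMP", 0.f}, {"FV", 0.f}, {"TH", 0.f}, {"CDGKNSTXYZ", 0.f},
        {"CHJSH", 0.f}, {"L", 0.f}, {"R", 0.f}, {"QW", 0.f},
        {"AEI", 0.f}, {"EE", 0.f}, {"O", 0.f}, {"U", 0.f}, {"STOP", 1.f}};
    for (const PhoneSpan& S : Cached)
    {
        if (PlayedDuration >= S.Timestamp &&
            PlayedDuration < S.Timestamp + S.Duration)
        {
            Blends[S.Viseme] = 1.f; // active phoneme's category
            Blends["STOP"] = 0.f;   // mouth is not at rest
            break;
        }
    }
    return Blends;
}
```

Because the span array is cached once per utterance, each tick is a cheap scan rather than a re-flatten of the whole Timestamps structure.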

Inworld Viseme AnimGraph node

UAnimGraphNode_InworldViseme is an Animation Blueprint node that applies morph target curves from viseme blend weights. Bone transforms pass through unchanged; only the curve track (morph targets) is modified.

(Image: Inworld Viseme AnimGraph node in use)

Properties

| Property | Type | Default | Pin | Description |
| --- | --- | --- | --- | --- |
| `Source` | `FPoseLink` | | Yes | Incoming pose; bones pass through unmodified |
| `VisemeBlends` | `FInworldVisemeBlends` | | Yes | Per-viseme weights from the BFL functions, updated each tick |
| `VisemeData` | `UInworldVisemeDataAsset*` | | No | Data asset mapping visemes to morph target curve names and weights |
| `SmoothingSpeed` | `float` | 12.0 | No | Interpolation speed toward target weights per second; 0 disables smoothing |
| `Alpha` | `float` | 1.0 | No | Overall blend strength [0, 2]; 0 suppresses lip-sync, 1 is full, 2 doubles morph values |
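How SmoothingSpeed and Alpha combine can be sketched as below. The exact interpolation the node uses is engine-side and not documented here; this assumes a frame-rate-scaled approach toward the target, similar to Unreal's FMath::FInterpTo, with 0 snapping directly to the target.

```cpp
#include <algorithm>
#include <cassert>

// Assumed smoothing behavior: move Current toward Target by a step
// proportional to SmoothingSpeed * DeltaTime, clamped to [0, 1].
float SmoothWeight(float Current, float Target, float DeltaTime, float SmoothingSpeed)
{
    if (SmoothingSpeed <= 0.f)
        return Target; // 0 disables smoothing: snap straight to the target
    float Step = std::clamp(SmoothingSpeed * DeltaTime, 0.f, 1.f);
    return Current + (Target - Current) * Step;
}

// Alpha scales the final curve value; the node allows [0, 2].
float ApplyAlpha(float Weight, float Alpha)
{
    return Weight * Alpha;
}
```

At 60 fps with the default speed of 12, each frame covers about 20% of the remaining gap, which is why the default feels responsive without chattering.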

UInworldVisemeDataAsset

A UDataAsset that maps each viseme category to one or more morph target curve name/weight pairs. You create one asset per character rig, then assign it to the VisemeData property on the AnimGraph node.

Each viseme entry is a TMap<FName, float>, where the key is the morph target curve name on your Skeletal Mesh and the value is the contribution weight for that viseme.

Supported viseme entries: BMP, FV, TH, CDGKNSTXYZ, CHJSH, L, R, QW, AEI, EE, O, U. (STOP is handled automatically; it does not need an entry in the data asset.)

(Image: Example UInworldVisemeDataAsset configured for MetaHuman)

Setting up a VisemeDataAsset

  1. In the Content Browser, right-click and choose Miscellaneous > Data Asset
  2. Select InworldVisemeDataAsset as the class
  3. Open the asset and for each viseme entry, add the morph target curve names from your Skeletal Mesh and their blend weights
  4. Assign the asset to the VisemeData property on your Inworld Viseme AnimGraph node
For MetaHuman characters, each viseme typically maps to one or more CTRL_expressions_* curves. Weights are additive — multiple curves per viseme are all applied simultaneously.
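The additive resolution described above can be sketched like this. The curve names in the test are made up for the example (on a MetaHuman they would be CTRL_expressions_* curves), and the plain std::map types stand in for the asset's TMap entries.

```cpp
#include <cassert>
#include <map>
#include <string>

using CurveMap = std::map<std::string, float>;        // curve name -> weight
using VisemeAsset = std::map<std::string, CurveMap>;  // viseme -> curve map

// Resolve final morph-target curve values from viseme blend weights:
// each viseme contributes blend * weight to every curve it maps,
// and contributions from multiple visemes accumulate additively.
CurveMap ResolveCurves(const VisemeAsset& Asset,
                       const std::map<std::string, float>& VisemeBlends,
                       float Alpha)
{
    CurveMap Out;
    for (const auto& [Viseme, Blend] : VisemeBlends)
    {
        auto It = Asset.find(Viseme);
        if (It == Asset.end())
            continue; // e.g. STOP has no entry in the data asset
        for (const auto& [Curve, Weight] : It->second)
            Out[Curve] += Blend * Weight * Alpha; // additive across visemes
    }
    return Out;
}
```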

Setting up lip-sync in an Animation Blueprint

1. Add the Inworld Viseme node to your AnimGraph

Open your character’s Animation Blueprint and navigate to the AnimGraph. Search for Inworld Viseme and place the node in your graph, wiring its Source input from your existing pose and its output toward Output Pose. Assign your UInworldVisemeDataAsset to the Viseme Data property on the node.
2. Create a VisemeBlends variable

Add an FInworldVisemeBlends variable to the Animation Blueprint. Wire it into the Viseme Blends pin on the Inworld Viseme node. This variable will be updated each tick from the character component.
3. Bind to OnVoiceAudioPlayback

In the character actor’s Blueprint (or on BeginPlay in the Anim BP), get the UInworldVoiceAudioComponent and bind to OnVoiceAudioPlayback. In the callback, call GetVisemeBlends(PlaybackInfo, PhoneSpans) and store the result into your FInworldVisemeBlends variable using Set Anim Instance Variable or a direct property write. For best performance, also bind to OnVoiceAudioStart and call BuildPhoneSpansFromTTSOutput there to cache the spans array. Then use GetVisemeBlends(PlaybackInfo, CachedPhoneSpans) in OnVoiceAudioPlayback instead of GetVisemeBlendsTTS.
4. Tune smoothing and alpha

Adjust Smoothing Speed on the AnimGraph node to control how snappily the mouth follows phoneme changes. The default of 12 is a good starting point. Use Alpha to globally scale the strength of lip-sync, which is useful for blending with other facial animation systems.