# Adding Naturalness

> Replicate the conversational feel of the [Inworld realtime demo](https://realtime.ai/).

The realtime demo's natural cadence comes from three things:

1. `inworld-tts-2` — the model that renders bracketed cues (`[laugh]`, `[breathe]`, `[sigh]`) as audible sounds and interprets `[speak ...]` steering tags.
2. `providerData.tts` settings: `delivery_mode: 'CREATIVE'`, `segmenter_strategy: 'full_turn'`, `steering_handling: 'emit_once'`.
3. A system prompt that tells the LLM to emit those tags.

Voice choice matters less than these three factors. The demo uses `Sarah`.

## session.update

Send this after the WebSocket opens. Replace `SYSTEM_PROMPT` with the prompt in the next section.

```javascript
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    instructions: SYSTEM_PROMPT,
    audio: {
      output: {
        model: 'inworld-tts-2',
        voice: 'Sarah'
      },
      input: {
        turn_detection: {
          type: 'semantic_vad',
          eagerness: 'low'
        }
      }
    },
    providerData: {
      tts: {
        delivery_mode: 'CREATIVE',
        segmenter_strategy: 'full_turn',
        steering_handling: 'emit_once'
      },
      backchannel: { enabled: true },
      responsiveness: { enabled: true }
    }
  }
}));
```

Notes:

* `delivery_mode: 'CREATIVE'` is the most expressive of `STABLE | BALANCED | CREATIVE`.
* `segmenter_strategy: 'full_turn'` buffers the whole LLM turn before synthesis. This gives the highest quality but also the highest time-to-first-audio. Switch to `fast_start` or `sentence` if latency matters more.
* `semantic_vad` with `eagerness: 'low'` tolerates pauses without cutting the turn off.
* [`backchannel`](/realtime/usage/back-channel) emits "uh-huh" while the user is still talking. [`responsiveness`](/realtime/usage/responsiveness) emits a short filler if the LLM is slow to reply.
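
The latency trade-off in the notes above can be captured in a small helper. This is a minimal sketch: `buildTtsSettings` and its `latencySensitive` option are hypothetical names chosen for illustration, and only the field names and values come from this page.

```javascript
// Hypothetical helper: the function name and the `latencySensitive` option are
// illustrative. Only the field names and values are taken from this page.
function buildTtsSettings({ latencySensitive = false } = {}) {
  return {
    delivery_mode: 'CREATIVE',
    // 'full_turn' buffers the whole LLM turn before synthesis.
    // 'fast_start' is the alternative this page suggests when latency matters more.
    segmenter_strategy: latencySensitive ? 'fast_start' : 'full_turn',
    steering_handling: 'emit_once'
  };
}

// Example: the providerData object for a latency-sensitive client.
const providerData = {
  tts: buildTtsSettings({ latencySensitive: true }),
  backchannel: { enabled: true },
  responsiveness: { enabled: true }
};
```

The resulting `providerData` object drops into the `session.update` payload in place of the literal one shown earlier.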

## System prompt

`inworld-tts-2` only renders tags that appear in the text. The LLM has to be told to emit them. Use this as a starter:

```
You are a warm, conversational AI on a voice call. Speak the way a person speaks, not the way a chatbot writes.

TURN LENGTH
5 to 10 words by default. A backchannel ("yeah", "mm-hm", "right", "huh") is often the whole turn. Go longer only when the user asks you to explain or walk through something.

NON-VERBALS — six bracketed sounds the voice can produce: [laugh], [breathe], [sigh], [cough], [clear throat], [yawn]. Use where a person would actually make that sound. At most one per turn, often none.

STEERING TAGS — at most ONE [speak ...] tag per turn. If used, it MUST be the first thing in the turn. Use it only when the user's emotional register has shifted, or when they ask for a specific style:
- User excited / shared good news → [speak with bright energy, faster, warmer]
- User frustrated → [speak evenly, slower, lower volume, no defensiveness]
- User vulnerable, paused on something hard → [speak softly, slower, with warmth]
- User asks for a specific voice ("speak like a pirate") → honor it literally and stay in that voice until they drop it
Default is no tag — tone carries through word choice and rhythm. Once you've shifted manner, keep it across turns without re-tagging.

SMALL DISFLUENCIES
- Fillers: "um", "uh", "hmm"
- Soft openers: "oh", "well", "so", "right", "okay"
- Hedges: "kind of", "I guess", "maybe"
- Self-repairs: "I, I think"
- Backchannels: "yeah", "mm-hm", "right"
Zero to two per turn, often none.

Steering tags ([speak ...]), non-verbals ([laugh], [breathe], etc.), and stage modifiers always stay in English even if the conversation is in another language. Only the spoken words switch.
```

Layer your own persona on top.
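
While iterating on the prompt, it can help to check LLM output against the tag rules it states. The following is a rough sketch, not part of the Inworld API: `checkTurn` is a hypothetical helper that encodes only the limits written in the starter prompt (at most one `[speak ...]` tag, placed first, and at most one non-verbal per turn).

```javascript
// Hypothetical lint for a single LLM turn. It encodes only the rules stated in
// the starter prompt above and is not an Inworld API.
const NON_VERBALS = ['laugh', 'breathe', 'sigh', 'cough', 'clear throat', 'yawn'];

function checkTurn(text) {
  const problems = [];
  // Collect every bracketed tag; m[0] is the full tag, m[1] is its inner text.
  const tags = [...text.matchAll(/\[([^\]]+)\]/g)];

  const steering = tags.filter((m) => m[1].startsWith('speak '));
  if (steering.length > 1) {
    problems.push('more than one [speak ...] tag');
  }
  if (steering.length > 0 && !text.trimStart().startsWith(steering[0][0])) {
    problems.push('[speak ...] tag is not first in the turn');
  }

  const nonVerbals = tags.filter((m) => NON_VERBALS.includes(m[1]));
  if (nonVerbals.length > 1) {
    problems.push('more than one non-verbal');
  }

  return problems;
}

// A turn that follows the rules yields an empty list.
console.log(checkTurn('[speak softly, slower, with warmth] [breathe] Yeah, that sounds hard.'));
// A misplaced steering tag is reported.
console.log(checkTurn('Okay. [speak evenly] Sure.'));
```

Logging the returned problems during prompt development gives a quick signal on whether the LLM is following the tag limits before you judge the result by ear.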

## What each piece does

* **`delivery_mode: 'CREATIVE'`.** `STABLE` and `BALANCED` flatten prosody. `CREATIVE` allows more expressive variation.
* **`segmenter_strategy: 'full_turn'`.** Synthesizing the whole turn at once preserves intonation across sentence boundaries.
* **System prompt that asks for steering tags and non-verbals.** Without it, the LLM never emits `[breathe]` or `[speak softly]`, and `inworld-tts-2` has nothing to render. This is the single biggest contributor.
* **`backchannel` + `responsiveness`.** Optional. `backchannel` fills dead air while the user is speaking; `responsiveness` covers slow LLM replies.

## Tweaks

* **Latency-sensitive (mobile, WebRTC).** Use `segmenter_strategy: 'fast_start'` or `'sentence'`.
* **Voice cloning.** Cloning gives you timbre. The conversational feel still comes from the system prompt and `providerData` settings above. See [voice cloning](/tts/voice-cloning).
* **Non-English.** Append a language directive to the system prompt. Keep `[speak ...]` and `[laugh]` / `[breathe]` tags in English. Set `providerData.tts.language` for the TTS accent.
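
The non-English tweak touches two places at once, the system prompt and `providerData.tts.language`, so one way to keep them in sync is a small wrapper. This is a sketch under assumptions: `localizeSession` is a hypothetical helper, the directive wording is only an example, and the `'es'` value format for `language` is assumed rather than taken from the API reference.

```javascript
// Hypothetical helper: the function name, the directive wording, and the format
// of the `language` value are assumptions, not documented Inworld behavior.
function localizeSession(session, { languageName, ttsLanguage }) {
  const directive =
    `Respond in ${languageName}. Keep [speak ...] tags and non-verbals such as [laugh] in English.`;
  const providerData = session.providerData || {};
  return {
    ...session,
    // Append the language directive after the existing prompt.
    instructions: `${session.instructions}\n\n${directive}`,
    providerData: {
      ...providerData,
      // Preserve existing TTS settings and add the language field.
      tts: { ...providerData.tts, language: ttsLanguage }
    }
  };
}

const base = {
  instructions: 'You are a warm, conversational AI on a voice call.',
  providerData: { tts: { delivery_mode: 'CREATIVE' } }
};

// 'es' is an assumed value; check the API Extensions reference for the accepted format.
const localized = localizeSession(base, { languageName: 'Spanish', ttsLanguage: 'es' });
```

The helper returns a new object instead of mutating its input, so the same base session can be reused for several languages.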

## See also

* [API Extensions reference](/realtime/provider-data) — full `providerData` surface
* [Configuring Realtime Models](/realtime/usage/using-realtime-models) — model selection, voice, semantic VAD
* [TTS-2 steering and voice tags](/tts/capabilities/steering)
* [Back-channel](/realtime/usage/back-channel) and [Responsiveness](/realtime/usage/responsiveness)
