# Responsiveness (intermediate fillers)

> Speak a short, low-latency filler ("let me think", "one moment") while the main LLM is still warming up, so the agent never feels frozen.

Responsiveness, the intermediate-filler layer, bridges the gap between the moment the user finishes speaking and the moment the main LLM produces its first audible delta. It is most useful when a turn involves a tool call or when the main LLM is slow to start its response.

A small "filler" LLM races against the main model: if the main model takes longer than `initial_wait_timeout_ms` to emit its first delta, the server speaks a short filler ("let me think", "good question, one moment") via TTS, then hands off to the main response as soon as it arrives.
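
The race described above can be pictured as a timer competing with the main model's first delta. The following is a minimal client-side sketch of that idea under stated assumptions, not the server's actual implementation: `decideFiller`, its `firstDelta` promise argument, and the `speakFiller` return shape are hypothetical names used only for illustration.

```javascript
// Illustrative sketch only: the real race runs on the server.
// `decideFiller` and its return shape are hypothetical names.
function decideFiller(firstDelta, initialWaitTimeoutMs) {
  let timer;
  // Resolves once the wait budget is spent without a main-model delta.
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve({ speakFiller: true }), initialWaitTimeoutMs);
  });
  // Resolves as soon as the main model produces its first delta.
  const main = firstDelta.then((delta) => ({ speakFiller: false, delta }));
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([main, timeout]).finally(() => clearTimeout(timer));
}
```

With a 1200 ms budget, a main model that produces its first delta after 300 ms never triggers a filler, while one that needs 2 s does.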

## Enabling responsiveness

```javascript theme={"system"}
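// Assumes `ws` is an already-open WebSocket for this session.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      responsiveness: {
        enabled: true,                 // per-session opt-in
        initial_wait_timeout_ms: 1200, // wait this long for the main LLM's first delta
        hard_deadline_ms: 2000,        // cap on the filler LLM's total streaming time
        history_tail_items: 4,         // recent conversation items shown to the filler LLM
        temperature: 0.7,
        max_tokens: 12,                // keep fillers brief
        min_filler_gap_ms: 8000,       // minimum gap between two fillers
        max_initial_per_turn: 1,
        enable_filler_on_first_assistant_reply: false,
        pause_text: ''                 // empty string disables pause-text injection
      }
    }
  }
}));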
```

| Field                                    | Type    | Default        | Description                                                                                                                                                                                                                                              |
| ---------------------------------------- | ------- | -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `enabled`                                | boolean | `false`        | Per-session opt-in. A session that does not send this object never gets a filler, even when the server has the racing infrastructure wired and the Unleash flag is on.                                                                                   |
| `small_model`                            | string  | server default | Override the filler LLM model identifier. Useful for A/B testing different small models without a service redeploy.                                                                                                                                      |
| `initial_wait_timeout_ms`                | integer | server default | T — how long to wait for the main LLM's first delta before committing to the filler. Lower values fire fillers more aggressively (better perceived responsiveness, more frequent fillers).                                                               |
| `hard_deadline_ms`                       | integer | server default | Caps the small / filler LLM's total streaming time so a slow filler model can't itself become a latency tax.                                                                                                                                             |
| `history_tail_items`                     | integer | server default | Recent conversation items the small LLM sees as context. Trades coherence (more history = more on-topic fillers) for token cost.                                                                                                                         |
| `temperature`                            | number  | server default | Sampling temperature for the small LLM.                                                                                                                                                                                                                  |
| `max_tokens`                             | integer | server default | Caps the small LLM's response length. Keep this small — fillers should be brief.                                                                                                                                                                         |
| `min_filler_gap_ms`                      | integer | server default | Minimum gap between any two fillers within a single user-turn chain. Prevents back-to-back fillers when the main LLM is consistently slow.                                                                                                               |
| `max_initial_per_turn`                   | integer | `1`            | Caps initial fillers per user turn. The default of `1` matches the v1 single-filler behavior.                                                                                                                                                            |
| `max_buffer_deltas`                      | integer | server default | Bounds the in-memory buffer of main-LLM deltas held while the filler is being spoken. Once the buffer is exhausted the main response is flushed even if the filler is still in progress.                                                                 |
| `enable_filler_on_first_assistant_reply` | boolean | `false`        | Allows responsiveness fillers on the very first assistant response in a session. Default `false` because the first reply is often a greeting that doesn't benefit from a filler.                                                                         |
| `prompt_template`                        | string  | server default | Overrides the system prompt fed to the small filler LLM. Empty string is treated as "use server default" — you can't clear it to literally empty. Append a language directive here for multilingual sessions; the compiled-in default is English-biased. |
| `pause_text`                             | string  | server default | TTS-only hint injected between the filler and the main answer (e.g. a brief audible breath or a short connector word). Empty string disables injection for this session.                                                                                 |

All fields are optional. Omitting a field or sending `null` leaves the server-side default in place.
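
Because unset fields fall back to server defaults, a client only needs to send the values it wants to change. The helper below is one possible sketch of that pattern; `buildResponsivenessUpdate` is a hypothetical name, not part of any SDK. It keeps empty strings, since `''` is a meaningful value for `pause_text`.

```javascript
// Hypothetical helper: include only the fields the caller actually set.
function buildResponsivenessUpdate(overrides = {}) {
  const responsiveness = { enabled: true };
  for (const [key, value] of Object.entries(overrides)) {
    // Skip null/undefined so the server-side default stays in place.
    if (value !== undefined && value !== null) responsiveness[key] = value;
  }
  return { type: 'session.update', session: { providerData: { responsiveness } } };
}

// Usage, assuming `ws` is an open WebSocket:
// ws.send(JSON.stringify(buildResponsivenessUpdate({ initial_wait_timeout_ms: 900 })));
```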

## Interaction with TTS

The filler is spoken through the same TTS pipeline as the main response — the same voice, the same `delivery_mode`, the same `language`. Filler audio is delivered on your transport's normal assistant-audio path: on WebSocket sessions it arrives on the regular `response.output_audio.delta` stream, and on WebRTC sessions it plays on the inbound RTP audio track (the same track that carries main-response audio — see [WebRTC](/realtime/connect/webrtc)). There is no dedicated filler event type or separate media track. This means:

* Voice / language / accent flips made via `session.update` apply to fillers immediately.
* If you have configured `providerData.tts.conversational = true`, fillers participate in the shared TTS context just like main responses.
* `pause_text`, when set, is synthesized through the same TTS call after the filler — useful for a softer transition from "let me think" into the actual answer.
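
Since fillers share the normal audio path, a WebSocket client needs no filler-specific branch: one handler covers both filler and main-response audio. The sketch below illustrates this under an assumption that is not confirmed by this page, namely that each `response.output_audio.delta` event carries base64-encoded audio in a `delta` field. `collectAudioChunk` is a hypothetical helper, and `Buffer` assumes a Node.js runtime.

```javascript
// Assumption: `event.delta` holds base64 audio; verify against the event reference.
// `collectAudioChunk` is a hypothetical helper name.
function collectAudioChunk(rawMessage, chunks) {
  const event = JSON.parse(rawMessage);
  // Filler and main-response audio both arrive as this one event type.
  if (event.type !== 'response.output_audio.delta') return false;
  chunks.push(Buffer.from(event.delta, 'base64'));
  return true;
}

// Usage, assuming `ws` is an open WebSocket from the `ws` package:
// const chunks = [];
// ws.on('message', (data) => collectAudioChunk(data.toString(), chunks));
```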

## Example: Japanese fillers

The compiled-in filler prompt biases the small LLM toward English, so a Japanese session needs both a Japanese TTS language pin and a `prompt_template` override that tells the filler LLM to reply in Japanese. The TTS voice and `delivery_mode` come from your normal `audio.output` and `providerData.tts` settings, so the filler is spoken in the Japanese voice you've already configured.

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: { output: { voice: 'Hiroshi', model: 'inworld-tts-2' } },
    providerData: {
      tts: { language: 'ja-JP' },
      responsiveness: {
        enabled: true,
        initial_wait_timeout_ms: 1200,
        max_tokens: 12,
        prompt_template:
          'You are a polite Japanese conversational assistant. ' +
          'The user has just spoken and the main response is being prepared. ' +
          'Output a single short filler in natural spoken Japanese ' +
          '(e.g. "ちょっと待ってください", "少々お待ちを", "そうですね") — ' +
          'no greeting, no follow-up question, just the filler. ' +
          'Reply in Japanese only.'
      }
    }
  }
}));
```

What this does:

* **`prompt_template`** is the single most important setting for non-English fillers. The compiled-in default is English-biased; without the override, the filler LLM emits English ("let me think") even mid-Japanese conversation. Keep the prompt short and example-driven — the filler LLM is small and benefits from concrete examples in the target language.
* **`providerData.tts.language: 'ja-JP'`** anchors the TTS accent so the synthesized filler audio is rendered in Japanese, not transliterated through an English voice.
* **`max_tokens: 12`** caps filler length. Japanese fillers are typically short (`ちょっと待ってください` is roughly 5 tokens), so keep this low to prevent the filler from running past the main LLM's warm-up.
* Voice selection (e.g. `Hiroshi`) and `delivery_mode` are inherited from your TTS config — no responsiveness-specific override needed.
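
The same prompt shape can be reused for other languages. The helper below is a sketch that mirrors the wording of the Japanese template above; `fillerPromptFor` is a hypothetical name, and the Spanish example phrases are illustrative choices rather than recommended values.

```javascript
// Hypothetical helper: builds a `prompt_template` string for a target language.
function fillerPromptFor(languageName, exampleFillers) {
  // Concrete examples in the target language help a small filler LLM.
  const examples = exampleFillers.map((text) => `"${text}"`).join(', ');
  return [
    `You are a polite ${languageName} conversational assistant.`,
    'The user has just spoken and the main response is being prepared.',
    `Output a single short filler in natural spoken ${languageName} (e.g. ${examples}).`,
    'No greeting, no follow-up question, just the filler.',
    `Reply in ${languageName} only.`,
  ].join(' ');
}

// Example: a Spanish session (phrases are illustrative).
const spanishPrompt = fillerPromptFor('Spanish', ['Un momento', 'Déjame pensar']);
```

The resulting string would be passed as `prompt_template`, alongside a matching `providerData.tts.language` value for the session.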

## Tuning tips

* Start with `enabled: true` and the server defaults. Measure end-to-end perceived latency before tuning anything else.
* If fillers fire too often (the user notices a filler before *every* response), raise `initial_wait_timeout_ms`. Fillers should mask only the slowest tail of main-LLM responses.
* If fillers fire too rarely (long awkward silences), lower `initial_wait_timeout_ms`.
* For multilingual sessions, set `prompt_template` to a prompt that includes a language directive. The compiled-in default biases the filler LLM toward English; without the override the agent will speak English fillers mid-Spanish or mid-French conversation.
* Leave `enable_filler_on_first_assistant_reply` at `false` (the default) so the opening greeting plays cleanly; fillers on the very first turn tend to feel awkward.
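
One way to act on the latency measurements suggested above is to pick `initial_wait_timeout_ms` from a percentile of observed first-delta latencies, so that only the slowest share of turns triggers a filler. This is a sketch of one possible approach, not a documented procedure; `suggestInitialWaitTimeout` and `fillerShare` are hypothetical names, and the latency samples are assumed to come from your own instrumentation.

```javascript
// Hypothetical helper: choose a timeout so roughly `fillerShare` of turns get a filler.
function suggestInitialWaitTimeout(firstDeltaLatenciesMs, fillerShare = 0.1) {
  if (firstDeltaLatenciesMs.length === 0) throw new Error('need at least one sample');
  const sorted = [...firstDeltaLatenciesMs].sort((a, b) => a - b);
  // Nearest-rank percentile: the latency at or below which (1 - fillerShare) of turns fall.
  const rank = Math.ceil((1 - fillerShare) * sorted.length);
  const clamped = Math.min(sorted.length, Math.max(1, rank));
  return sorted[clamped - 1];
}

// Usage: feed in measured latencies (ms) and send the result as initial_wait_timeout_ms.
// const timeoutMs = suggestInitialWaitTimeout(measuredLatencies, 0.1);
```

Raising `fillerShare` lowers the suggested timeout and makes fillers more frequent; lowering it does the opposite.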
