# Configuring Models

The Inworld Realtime API uses an OpenAI Realtime API-compatible event system to facilitate voice experiences. This guide walks through configuring each layer — **STT**, **LLM**, and **TTS** — plus the conversation- and observability-level controls that span all three. For the field-by-field reference of Inworld extensions, see [Inworld Realtime API Extensions](/realtime/provider-data).

## Configure a session

For WebSocket, the connection starts with a `session.created` event. For WebRTC, send `session.update` as soon as the data channel opens. In both cases, use `session.update` to configure your session. Here you can set:

* `model` — LLM provider and model (e.g. `openai/gpt-4.1-nano`) or router (e.g. `inworld/latency-optimizer-ab-test`)
* `instructions`
* `output_modalities` (`["audio", "text"]`, `["audio"]`, or `["text"]`)
* Audio input and output configuration — voice, TTS model, PCM format, speed
* `max_output_tokens` (`"inf"` or a numeric ceiling)
* `tools` (function definitions) and `tool_choice` settings
* `providerData` — Inworld extensions for STT, TTS, memory, back-channel, and responsiveness (see [Inworld Realtime API Extensions](/realtime/provider-data))

Partial updates are supported, so you can adjust the LLM, voice, TTS model, temperature, or tool lists mid-session without rebuilding the socket.

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    type: 'realtime',
    model: 'openai/gpt-4o-mini',
    instructions: 'You are a friendly narrator.',
    output_modalities: ['audio', 'text'],
    temperature: 0.8,
    audio: {
      input: {
        transcription: { model: 'inworld/inworld-stt-1' },
        turn_detection: {
          type: 'semantic_vad',
          create_response: true,
          interrupt_response: true
        }
      },
      output: {
        voice: 'Clive',
        model: 'inworld-tts-2',
        speed: 1.0
      }
    }
  }
}));
```
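On WebSocket, one way to apply this configuration is to wait for the server's `session.created` event before sending `session.update`. The sketch below assumes `ws` behaves like a client from the Node `ws` package, as in the other examples on this page; `configureOnCreated` is a hypothetical helper, not part of any SDK.

```javascript
// Hypothetical helper: send `session.update` once `session.created` arrives.
// `ws` is assumed to expose `on('message', cb)` and `send(string)`,
// like a client from the Node `ws` package.
function configureOnCreated(ws, session) {
  let configured = false;
  ws.on('message', (buffer) => {
    const event = JSON.parse(buffer.toString());
    if (event.type !== 'session.created' || configured) return;
    configured = true; // guard against sending the update twice
    ws.send(JSON.stringify({ type: 'session.update', session }));
  });
}

// Usage:
// configureOnCreated(ws, {
//   type: 'realtime',
//   model: 'openai/gpt-4o-mini',
//   instructions: 'You are a friendly narrator.'
// });
```

Later partial updates can still be sent directly with `ws.send` once the session is configured.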

***

## STT (Speech-to-Text)

### Choose an STT model

Set `audio.input.transcription.model` to select the speech-to-text model used to transcribe user audio. `inworld/inworld-stt-1` is the recommended default for most realtime voice agents; pick a third-party model when its specific strength (sub-300ms latency, semantic end-of-turn, etc.) matters for your use case.

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: { input: { transcription: { model: 'inworld/inworld-stt-1' } } }
  }
}));
```

| Model                                         | Best for                                                                                                         |
| --------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| `inworld/inworld-stt-1`                       | Inworld's first-party STT with configurable turn-taking controls. Recommended default.                           |
| `assemblyai/u3-rt-pro`                        | High-accuracy, sub-300ms latency, multilingual streaming (English, Spanish, French, German, Italian, Portuguese) |
| `assemblyai/universal-streaming-multilingual` | Multilingual streaming across the same six languages                                                             |
| `assemblyai/universal-streaming-english`      | English-optimized streaming                                                                                      |
| `soniox/stt-rt-v4`                            | High-accuracy real-time streaming with semantic end-of-turn detection and multilingual support                   |

If the selected model is not recognised, the server responds with an `error` event (`type: "invalid_request_error"`, `code: "invalid_value"`, `param: "session.audio.input.transcription.model"`) and the rest of the `session.update` is not applied. See [STT Introduction](/stt/overview) for the full model catalogue and comparison.
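Because the rejected `session.update` is not applied, a client may want to detect this specific error and re-send the update with a known model. A minimal sketch, assuming the error details are nested under `event.error` (as in the error-handling example later on this page); `isInvalidSttModelError` is a hypothetical helper name.

```javascript
// Returns true when an `error` event reports an unrecognised STT model.
// Assumption: error details are nested under `event.error`.
function isInvalidSttModelError(event) {
  if (event.type !== 'error' || !event.error) return false;
  const { type, code, param } = event.error;
  return (
    type === 'invalid_request_error' &&
    code === 'invalid_value' &&
    param === 'session.audio.input.transcription.model'
  );
}

// Usage (hypothetical recovery): fall back to the recommended default.
// if (isInvalidSttModelError(event)) {
//   ws.send(JSON.stringify({
//     type: 'session.update',
//     session: { audio: { input: { transcription: { model: 'inworld/inworld-stt-1' } } } }
//   }));
// }
```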

### Transcription hints

Guide the STT decoder with a prompt (vocabulary, domain context, formatting preferences). This is the OpenAI-standard [`audio.input.transcription.prompt`](https://platform.openai.com/docs/api-reference/realtime-sessions/session) field and is portable across OpenAI-compatible SDKs:

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: {
      input: {
        transcription: {
          model: 'assemblyai/u3-rt-pro',
          prompt: 'Medical dictation. Vocabulary: angioplasty, myocardial infarction.',
          language: 'en'
        }
      }
    }
  }
}));
```

| Field      | Type   | Description                                                                                |
| ---------- | ------ | ------------------------------------------------------------------------------------------ |
| `model`    | string | STT model ID. See [Choose an STT model](#choose-an-stt-model).                             |
| `prompt`   | string | Transcription guidance: vocabulary hints, domain context, formatting preferences.          |
| `language` | string | BCP-47 language code (e.g. `"en"`, `"es"`). Optional; the model auto-detects when omitted. |

### Tune turn detection

Turn detection — when the server decides a user has finished speaking — is controlled by the OpenAI-standard [`audio.input.turn_detection`](https://platform.openai.com/docs/api-reference/realtime-sessions/session) object. The Realtime API supports both VAD types and is wire-compatible with the OpenAI SDK.

#### `semantic_vad`

Model-based end-of-turn detection backed by the STT stream. `eagerness` is the primary tuning knob.

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: {
      input: {
        turn_detection: {
          type: 'semantic_vad',
          eagerness: 'medium',          // 'low' | 'medium' | 'high' | 'auto'
          create_response: true,
          interrupt_response: true
        }
      }
    }
  }
}));
```

| Field                | Type    | Description                                                                                                                                                                       |
| -------------------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `type`               | string  | `"semantic_vad"` (default)                                                                                                                                                        |
| `eagerness`          | string  | How aggressively to end turns: `"low"`, `"medium"`, `"high"`, `"auto"`. Lower eagerness requires stronger end-of-turn confidence; higher eagerness commits to end-of-turn sooner. |
| `create_response`    | boolean | Auto-create a response on turn end (default `true`)                                                                                                                               |
| `interrupt_response` | boolean | Interrupt the active response when the user speaks (default `true`)                                                                                                               |

`eagerness` maps to a full set of four STT turn-detection parameters — confidence threshold, VAD threshold, minimum end-of-turn silence, and maximum within-turn silence. Lower thresholds and shorter silences mean the STT model commits to end-of-turn sooner (more eager).

| `eagerness` | `end_of_turn_confidence_threshold` | `vad_threshold` | `min_end_of_turn_silence` (ms) | `max_turn_silence` (ms) |
| ----------- | ---------------------------------- | --------------- | ------------------------------ | ----------------------- |
| `low`       | `0.85`                             | `0.5`           | `400`                          | `3000`                  |
| `medium`    | `0.70`                             | `0.5`           | `160`                          | `2400`                  |
| `auto`      | `0.70`                             | `0.5`           | `160`                          | `2400`                  |
| `high`      | `0.55`                             | `0.3`           | `80`                           | `1200`                  |

`auto` mirrors `medium` until router-side adaptive logic exists. Any explicit field under `providerData.stt` (see [STT extensions](#stt-extensions) below) overrides the eagerness-derived default for that field — fields you do not set keep the eagerness mapping.
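To reason about which values are in effect, the mapping and override rule can be mirrored client-side. This is only an illustration of the merge described above, not the server's implementation; the override keys are assumed to match the table's column headers, so check the extensions reference for the actual `providerData.stt` field names.

```javascript
// Eagerness defaults copied from the table above.
const EAGERNESS_DEFAULTS = {
  low:    { end_of_turn_confidence_threshold: 0.85, vad_threshold: 0.5, min_end_of_turn_silence: 400, max_turn_silence: 3000 },
  medium: { end_of_turn_confidence_threshold: 0.70, vad_threshold: 0.5, min_end_of_turn_silence: 160, max_turn_silence: 2400 },
  high:   { end_of_turn_confidence_threshold: 0.55, vad_threshold: 0.3, min_end_of_turn_silence: 80,  max_turn_silence: 1200 }
};
// `auto` currently mirrors `medium`.
EAGERNESS_DEFAULTS.auto = EAGERNESS_DEFAULTS.medium;

// Hypothetical helper: explicit overrides win per field; unset fields keep
// the eagerness-derived default. Key names are an assumption (see lead-in).
function resolveTurnParams(eagerness, sttOverrides = {}) {
  const defaults = EAGERNESS_DEFAULTS[eagerness];
  if (!defaults) throw new Error(`Unknown eagerness: ${eagerness}`);
  const resolved = { ...defaults };
  for (const key of Object.keys(defaults)) {
    if (sttOverrides[key] !== undefined) resolved[key] = sttOverrides[key];
  }
  return resolved;
}

// Usage: `high` eagerness, but tolerate longer pauses within a turn.
// resolveTurnParams('high', { max_turn_silence: 2000 });
```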

#### `server_vad`

Turn detection using an Inworld-hosted Silero VAD plus a Smart Turn detector. Tunable fields match OpenAI's `server_vad` shape and can be changed mid-session via a partial `session.update`.

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: {
      input: {
        turn_detection: {
          type: 'server_vad',
          threshold: 0.5,
          prefix_padding_ms: 200,
          silence_duration_ms: 1000,
          idle_timeout_ms: 8000,
          create_response: true,
          interrupt_response: true
        }
      }
    }
  }
}));
```

| Field                 | Type            | Description                                                                                                                           |
| --------------------- | --------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `type`                | string          | `"server_vad"`                                                                                                                        |
| `threshold`           | number          | Silero VAD speech cutoff, `0.0`–`1.0`. Default `0.5`.                                                                                 |
| `prefix_padding_ms`   | integer         | Pre-speech audio retained before an utterance, in ms. Default `200`.                                                                  |
| `silence_duration_ms` | integer         | Trailing silence required to finalize the turn, in ms. Default `1000`.                                                                |
| `idle_timeout_ms`     | integer \| null | When set, the server emits `input_audio_buffer.timeout_triggered` after this many ms with no detected speech. `null` or `0` disables. |
| `create_response`     | boolean         | Auto-create a response on turn end (default `true`)                                                                                   |
| `interrupt_response`  | boolean         | Interrupt the active response when the user speaks (default `true`)                                                                   |

Every field can be sent in a partial `session.update`; omit a field to keep its current value. Changes take effect on the next audio chunk processed.

See [Voice Activity Detection (VAD)](/realtime/usage/managing-conversations#voice-activity-detection) for the VAD event lifecycle.

### Audio input formats

Set the wire format for client → server audio under `audio.input.format`. Four formats are supported; pick based on your source. The same catalog applies to `audio.output.format` (covered under [TTS](#audio-output-format) below).

| `type`          | Encoding                        | Sample rate                   | When to use                                                              |
| --------------- | ------------------------------- | ----------------------------- | ------------------------------------------------------------------------ |
| `audio/pcm`     | Signed 16-bit little-endian PCM | `rate` (default `24000`)      | Default for browser, mobile, and most server-side sources. Send mono.    |
| `audio/pcmu`    | G.711 μ-law                     | Fixed 8000 Hz (ignore `rate`) | Telephony (Twilio Media Streams, SIP trunks in North America/Japan).     |
| `audio/pcma`    | G.711 A-law                     | Fixed 8000 Hz (ignore `rate`) | Telephony (SIP trunks in Europe and most of the rest of the world).      |
| `audio/float32` | 32-bit float PCM, little-endian | `rate` (default `24000`)      | Pipelines that natively produce float32 samples (some audio frameworks). |

Audio is always mono, base64-encoded inside the JSON envelope (e.g. `input_audio_buffer.append`).

```javascript theme={"system"}
// PCM16 @ 24 kHz (default — omit `format` entirely for the same result)
audio: { input: { format: { type: 'audio/pcm', rate: 24000 } } }

// G.711 μ-law @ 8 kHz, for Twilio
audio: { input: { format: { type: 'audio/pcmu' } } }
```

A legacy shorthand is also accepted: send `format` as a bare string — `"pcm16"`, `"g711_ulaw"`, `"g711_alaw"`, or `"float32"` — and the server expands it to the object form above.

The server resamples to 16 kHz internally for STT, so PCM input rate doesn't need to match the STT model's native rate. Send any rate that's convenient; `24000` and `8000` (G.711) are the common choices.

See [Telephony with Twilio](/realtime/usage/twilio) for a worked example of the G.711 path.

### Send audio input

There are two ways to send audio input:

**Method 1: Streaming Audio (Real-time)**
Use `input_audio_buffer.*` events for streaming real-time audio from a microphone:

1. Encode microphone data in your chosen [input format](#audio-input-formats) (PCM16 at 24 kHz is the default).
2. Send chunks via `input_audio_buffer.append`.
3. VAD automatically detects speech boundaries and commits the buffer.
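The first two steps can be sketched as follows for the default PCM16 format. This assumes your capture pipeline yields mono `Float32Array` chunks in the range -1 to 1, and that the append event carries its payload in an `audio` field, as in the OpenAI Realtime event shape; both helper names are hypothetical.

```javascript
// Convert Float32 samples in [-1, 1] to base64-encoded PCM16 (little-endian).
function float32ToPcm16Base64(samples) {
  const buf = Buffer.alloc(samples.length * 2);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to avoid overflow
    // Scale asymmetrically so -1 maps to -32768 and +1 maps to 32767.
    buf.writeInt16LE(Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), i * 2);
  }
  return buf.toString('base64');
}

// Build one `input_audio_buffer.append` event.
// Assumption: the payload field is named `audio`.
function buildAppendEvent(samples) {
  return { type: 'input_audio_buffer.append', audio: float32ToPcm16Base64(samples) };
}

// Usage, for each mono microphone chunk captured at 24 kHz:
// ws.send(JSON.stringify(buildAppendEvent(chunk)));
```

In a browser, `Buffer` is not available; the same conversion can be done with an `Int16Array` and `btoa`.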

**Method 2: Pre-recorded Audio**
Use `conversation.item.create` with the `input_audio` content type for pre-recorded audio chunks:

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'user',
    content: [{
      type: 'input_audio',
      audio: base64AudioData  // Base64-encoded PCM16 or OPUS
    }]
  }
}));
```

### STT extensions

Inworld extensions for STT live under `providerData.stt` — voice profile signals, language hints (Soniox), and explicit overrides for the four turn-detection parameters that `semantic_vad.eagerness` controls implicitly. Full field reference and the voice-profile payload shape are in [`providerData.stt`](/realtime/provider-data#stt-providerdata-stt).

***

## LLM

### Choose a router or LLM

Set `model` in `session.update` to select which [Router](/router/introduction) or LLM handles the conversation. The format is `provider/modelName` or `inworld/routerId`:

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    model: 'openai/gpt-4o-mini'
  }
}));
```

If you omit `model`, the default model (`google-ai-studio/gemini-2.5-flash`) is used. You can change the model mid-session with a partial update — the new model takes effect on the next response.

### Reasoning effort

For models that support chain-of-thought reasoning (e.g. `google-ai-studio/gemini-2.5-pro`), configure reasoning depth via `text_generation_config.reasoning`:

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    model: 'google-ai-studio/gemini-2.5-pro',
    text_generation_config: {
      reasoning: { effort: 'MEDIUM' }
    }
  }
}));
```

| Effort    | Behaviour                                                                   |
| --------- | --------------------------------------------------------------------------- |
| `NONE`    | Disables reasoning entirely.                                                |
| `MINIMAL` | \~10% of max completion tokens used for thinking.                           |
| `LOW`     | \~20%                                                                       |
| `MEDIUM`  | \~50% (server default when `reasoning` is present but `effort` is omitted). |
| `HIGH`    | \~80%                                                                       |
| `XHIGH`   | \~95%                                                                       |

Reasoning tokens are not included in the streamed text output by default; set `text_generation_config.reasoning.exclude: false` to include them. Usage is reported in `response.done` under `usage.output_token_details.reasoning_tokens`.

<Note>
  Support varies by model — some do not support reasoning, others accept only a subset of effort levels. When `reasoning` is omitted, the model's default applies; for reasoning-capable models this may add latency. Set `effort: "NONE"` explicitly if you need minimal latency.
</Note>

For the full field reference (`maxTokens`, `exclude`, and other generation params), see [`text_generation_config`](/realtime/provider-data#text-generation-config-text_generation_config).
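For rough capacity planning, the percentages in the effort table can be turned into an approximate thinking-token budget. The sketch below is only an estimate derived from that table; `estimateReasoningBudget` is a hypothetical helper, and actual usage should be read from `reasoning_tokens` in `response.done`.

```javascript
// Approximate share of max completion tokens used for thinking,
// copied from the effort table above.
const REASONING_SHARE = {
  NONE: 0,
  MINIMAL: 0.10,
  LOW: 0.20,
  MEDIUM: 0.50,
  HIGH: 0.80,
  XHIGH: 0.95
};

// Hypothetical helper: rough thinking-token budget for a given effort level.
function estimateReasoningBudget(effort, maxCompletionTokens) {
  const share = REASONING_SHARE[effort];
  if (share === undefined) throw new Error(`Unknown effort: ${effort}`);
  return Math.round(share * maxCompletionTokens);
}

// Usage:
// estimateReasoningBudget('MEDIUM', 4096); // about half the completion budget
```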

### Send text input

Create explicit conversation items for text turns:

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'user',
    content: [{
      type: 'input_text',
      text: 'Give me a two-sentence summary.'
    }]
  }
}));
```
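A small wrapper can keep text turns consistent. This sketch assumes that, as in the OpenAI Realtime API, adding a text item does not by itself start a response, so a `response.create` follows the item; if your session behaves differently, drop the second event. `buildTextTurn` is a hypothetical helper.

```javascript
// Build the events for one user text turn.
// Assumption: a `response.create` is needed to trigger the reply.
function buildTextTurn(text) {
  return [
    {
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: 'user',
        content: [{ type: 'input_text', text }]
      }
    },
    { type: 'response.create' }
  ];
}

// Usage:
// for (const e of buildTextTurn('Give me a two-sentence summary.')) {
//   ws.send(JSON.stringify(e));
// }
```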

### Function calling

The Realtime API supports function calling so your agent can fetch live data or trigger actions mid-conversation. Define functions in `session.tools`, then handle calls as they arrive.

#### 1. Register a tool

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    type: 'realtime',
    tools: [{
      type: 'function',
      name: 'get_horoscope',
      description: 'Get the horoscope for a zodiac sign',
      parameters: {
        type: 'object',
        properties: {
          sign: {
            type: 'string',
            description: 'Zodiac sign, e.g. Aries, Taurus'
          }
        },
        required: ['sign']
      }
    }],
    tool_choice: 'auto'
  }
}));
```

#### 2. Handle the function call

When the model decides to call a function, you receive a `response.function_call_arguments.done` event with the `call_id`, function `name`, and serialized `arguments`. Execute your logic, then return the result:

```javascript theme={"system"}
ws.on('message', (buffer) => {
  const event = JSON.parse(buffer.toString());

  if (event.type === 'response.function_call_arguments.done') {
    const { call_id, name, arguments: argsJson } = event;
    const args = JSON.parse(argsJson);

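    // Run your business logic. `fetchHoroscope` stands in for your own synchronous lookup.
    // Default to an error payload so an unknown tool name still yields a valid output string.
    let result = { error: `Unknown function: ${name}` };
    if (name === 'get_horoscope') {
      result = fetchHoroscope(args.sign);
    }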

    // Send the function result back
    ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'function_call_output',
        call_id,
        output: JSON.stringify(result)
      }
    }));

    // Tell the model to continue with the result
    ws.send(JSON.stringify({ type: 'response.create' }));
  }
});
```

#### 3. What happens next

After `response.create`, the model incorporates the function output and continues the conversation, either speaking the horoscope aloud (if `output_modalities` includes `audio`) or streaming text deltas. The user hears the answer as a continuation of the same conversation, without having to prompt again.

You can register multiple tools and the model will call them as needed. Each call arrives as a separate `response.function_call_arguments.done` event with its own `call_id`.
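With several tools registered, a name-to-handler registry keeps the message handler small. The sketch below is one possible structure, not a prescribed pattern: `toolHandlers` and `handleFunctionCall` are hypothetical names, and the handler body is a placeholder. Failures are reported back to the model as an error payload so that no call is left unanswered.

```javascript
// Hypothetical registry mapping tool names to handlers (sync or async).
const toolHandlers = {
  get_horoscope: async ({ sign }) => ({ sign, horoscope: `Placeholder text for ${sign}` })
};

// Turn one `response.function_call_arguments.done` event into the two
// events to send back: the function output, then `response.create`.
async function handleFunctionCall(event, handlers = toolHandlers) {
  const { call_id, name, arguments: argsJson } = event;
  let result;
  try {
    const handler = handlers[name];
    if (!handler) throw new Error(`Unknown tool: ${name}`);
    result = await handler(JSON.parse(argsJson));
  } catch (err) {
    // Covers unknown tools, malformed argument JSON, and handler failures.
    result = { error: err.message };
  }
  return [
    {
      type: 'conversation.item.create',
      item: { type: 'function_call_output', call_id, output: JSON.stringify(result) }
    },
    { type: 'response.create' }
  ];
}

// Usage inside an async message handler:
// for (const e of await handleFunctionCall(event)) ws.send(JSON.stringify(e));
```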

### Memory

Inworld's automatic conversation memory layer extracts durable facts and a rolling summary, prepends them to the system prompt, and trims older transcript items so context stays bounded. Configure it under `providerData.memory`. See [`providerData.memory`](/realtime/provider-data#memory-providerdata-memory) for the field reference, and [Long-term Memory](/realtime/usage/long-term-memory) for the cross-session persistence pattern.

***

## TTS (Text-to-Speech)

### Choose a TTS model

Set `audio.output.model` to select the text-to-speech model:

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: { output: { model: 'inworld-tts-2' } }
  }
}));
```

| Model                  | Size | Notes                                                                                                                               |
| ---------------------- | ---- | ----------------------------------------------------------------------------------------------------------------------------------- |
| `inworld-tts-2`        | 8B   | Higher quality audio. Recommended for most agents. Required for `providerData.tts.conversational` and the `CREATIVE` delivery mode. |
| `inworld-tts-1.5-mini` | 1B   | Faster inference, lower latency. Server default when `audio.output.model` is omitted.                                               |

Examples throughout these docs use `inworld-tts-2` for quality; switch to `inworld-tts-1.5-mini` if you're optimizing for raw latency or running at high concurrency. You can change the TTS model mid-session alongside voice or independently.

### Choose a voice

Set `audio.output.voice` to control the agent's speaking voice:

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: { output: { voice: 'Olivia' } }
  }
}));
```

The default voice is `Dennis`. Browse available voices in the [TTS Playground](/tts/tts-playground) or list them programmatically with the [List Voices API](/api-reference/voiceAPI/voiceservice/list-voices).

### Audio output format

Set the wire format for server → client audio under `audio.output.format`. The catalog is identical to [Audio input formats](#audio-input-formats) above — `audio/pcm`, `audio/pcmu`, `audio/pcma`, or `audio/float32`. Default is PCM16 at 24 kHz.

```javascript theme={"system"}
// Default — PCM16 @ 24 kHz (omit format entirely for the same result)
audio: { output: { format: { type: 'audio/pcm', rate: 24000 } } }

// G.711 μ-law @ 8 kHz, for Twilio — TTS audio comes back already mulaw-encoded
audio: { output: { format: { type: 'audio/pcmu' } } }
```

In most setups, set input and output to the same format so your client only has one codec path. Telephony is the clearest case: you typically want both sides on G.711 to match the carrier.

The server resamples internally — TTS models synthesize at their native rate and the server downsamples (or upsamples) to whatever `audio.output.format.rate` you request, so any reasonable rate is accepted.
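On the client, PCM16 output usually has to be decoded before playback. A minimal sketch, assuming you already have the base64 audio string from a server event (the event name is not covered in this section); `pcm16Base64ToFloat32` and `chunkSeconds` are hypothetical helpers.

```javascript
// Decode base64 PCM16 (little-endian, mono) into Float32 samples in [-1, 1).
function pcm16Base64ToFloat32(b64) {
  const buf = Buffer.from(b64, 'base64');
  const out = new Float32Array(Math.floor(buf.length / 2));
  for (let i = 0; i < out.length; i++) {
    out[i] = buf.readInt16LE(i * 2) / 0x8000;
  }
  return out;
}

// Duration of a decoded chunk in seconds at the requested output rate.
function chunkSeconds(samples, rate = 24000) {
  return samples.length / rate;
}

// Usage:
// const samples = pcm16Base64ToFloat32(base64Audio);
// console.log(`${chunkSeconds(samples).toFixed(3)}s of audio`);
```

Pass the same `rate` you requested in `audio.output.format.rate`, otherwise the computed duration will be off.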

### TTS extensions

Inworld extensions for TTS live under `providerData.tts`: segmentation strategy, steering handling, synthesis language, the TTS-2 delivery preset and conversational mode, and **timestamp alignment** for lip-sync or word highlighting. Full field reference, segmenter strategy table, conversational-mode details, and the timestamp output shape are in [`providerData.tts`](/realtime/provider-data#tts-providerdata-tts).

To opt into alignment, set `providerData.tts.timestamp_type` to `WORD` or `CHARACTER` and choose a transport strategy (`SYNC` for real-time lip-sync, `ASYNC` for lower latency). See [TTS timestamps and alignment](/realtime/provider-data#tts-timestamps-and-alignment) for the full output shape and sync/async semantics.

***

## Managing the session

### Conversation state

Use conversation events to keep context lean:

* `conversation.item.retrieve`: pull any prior item by ID.
* `conversation.item.delete`: remove items that should not remain in context.

Pair these with `max_output_tokens` and `response.cancel` to control overall cost ([conversation management guide](/realtime/usage/managing-conversations)).
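One simple way to keep context lean is to delete everything except the newest few items. The sketch below assumes you track item IDs yourself, oldest first, and that the delete event identifies its target with `item_id`, as in the OpenAI Realtime event shape; `buildPruneEvents` is a hypothetical helper.

```javascript
// Build `conversation.item.delete` events for all but the newest `keep` items.
// `itemIds` is ordered oldest first. The `item_id` field name is an assumption.
function buildPruneEvents(itemIds, keep) {
  const excess = Math.max(0, itemIds.length - keep);
  return itemIds.slice(0, excess).map((item_id) => ({
    type: 'conversation.item.delete',
    item_id
  }));
}

// Usage:
// for (const e of buildPruneEvents(trackedIds, 20)) ws.send(JSON.stringify(e));
```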

### Observing usage

`response.done` carries a `response.usage` block on every response — including cancelled responses (barge-in, supersede). The base fields (`total_tokens`, `input_tokens`, `output_tokens`, plus `input_token_details` / `output_token_details`) cover LLM accounting, and three optional sub-objects attribute usage per modality:

| Field                     | Type    | Description                                                                                                                                                                                                         |
| ------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `usage.llm.model`         | string  | Effective upstream LLM after router resolution. Useful when you sent `inworld/auto` and want to see which model the router picked.                                                                                  |
| `usage.tts.model`         | string  | TTS model used (e.g. `inworld-tts-2`).                                                                                                                                                                              |
| `usage.tts.characters`    | integer | Characters synthesized across all TTS segments of this response.                                                                                                                                                    |
| `usage.tts.audio_seconds` | number  | Assistant audio duration emitted by TTS, in seconds. The canonical TTS billing signal.                                                                                                                              |
| `usage.stt.model`         | string  | STT model used (e.g. `soniox/stt-rt-v4`).                                                                                                                                                                           |
| `usage.stt.audio_seconds` | number  | User audio duration transcribed for this turn, in seconds. Drained per `response.done` from a rolling per-session counter — each response sees only the user audio that arrived since the previous `response.done`. |

Each modality sub-object is omitted when there's nothing to report (e.g. a TTS-only response with no preceding user turn won't carry `stt`).

```javascript theme={"system"}
ws.on('message', (buffer) => {
  const event = JSON.parse(buffer.toString());
  if (event.type !== 'response.done') return;

  const u = event.response.usage;
  if (!u) return;
  console.log(
    `[usage] llm=${u.llm?.model ?? '?'} ` +
    `tokens=${u.input_tokens}/${u.output_tokens} ` +
    `tts=${u.tts?.audio_seconds?.toFixed(2) ?? '0'}s ` +
    `stt=${u.stt?.audio_seconds?.toFixed(2) ?? '0'}s`
  );
});
```

For the full schema including `input_token_details` / `output_token_details` breakdowns, see the [response.done event](/api-reference/realtimeAPI/realtime/realtime-websocket#operation-publish-responseDone).

`input_token_details` also carries prompt-cache counters: `cached_tokens` (input served from a cache hit) and `cache_write_tokens` (input written when establishing a cache entry). These appear automatically when a provider caches implicitly, and you can opt into explicit caching of the system prompt and tools via [`providerData.caching`](/realtime/provider-data#prompt-caching-providerdata-caching).
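Since each `response.done` reports usage for a single response, session-level totals have to be accumulated by the client. A minimal sketch with a hypothetical `createUsageTracker` helper; the cache ratio assumes `cached_tokens` is counted as part of `input_tokens`, which this page does not state explicitly.

```javascript
// Hypothetical accumulator for `response.done` usage across a session.
function createUsageTracker() {
  const totals = {
    input_tokens: 0,
    output_tokens: 0,
    cached_tokens: 0,
    tts_seconds: 0,
    stt_seconds: 0
  };
  return {
    totals,
    add(event) {
      if (event.type !== 'response.done') return;
      const u = event.response?.usage;
      if (!u) return;
      totals.input_tokens += u.input_tokens ?? 0;
      totals.output_tokens += u.output_tokens ?? 0;
      totals.cached_tokens += u.input_token_details?.cached_tokens ?? 0;
      // Modality sub-objects are omitted when there is nothing to report.
      totals.tts_seconds += u.tts?.audio_seconds ?? 0;
      totals.stt_seconds += u.stt?.audio_seconds ?? 0;
    },
    // Fraction of input tokens served from cache (0 before any input).
    cacheHitRatio() {
      return totals.input_tokens === 0 ? 0 : totals.cached_tokens / totals.input_tokens;
    }
  };
}

// Usage:
// const tracker = createUsageTracker();
// ws.on('message', (buffer) => tracker.add(JSON.parse(buffer.toString())));
```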

### Monitor errors

Handle `error` events (with `type`, `code`, and `param`) and implement a reconnection/backoff strategy for transient failures. See the [API reference](/api-reference/realtimeAPI/realtime/realtime-websocket) for error event schemas.

```javascript theme={"system"}
ws.on('message', (buffer) => {
  const event = JSON.parse(buffer.toString());

  if (event.type === 'error') {
    handleError(event.error);
  }
});
```
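For the reconnection side, exponential backoff with jitter is a common choice. The sketch below is one generic way to do it and is not specific to this API: `connect` is your own function that resolves once a socket is ready and rejects on failure, and the delay defaults are arbitrary assumptions.

```javascript
// Exponential backoff with full jitter:
// a random delay below min(maxMs, baseMs * 2^attempt).
function backoffDelay(attempt, { baseMs = 500, maxMs = 30000, random = Math.random } = {}) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Hypothetical reconnect loop. `connect` is supplied by the caller;
// `sleep` is injectable so the loop can be tested without real waiting.
async function connectWithRetry(connect, options = {}) {
  const {
    maxAttempts = 5,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
    ...backoffOptions
  } = options;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await connect();
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err; // out of attempts
      await sleep(backoffDelay(attempt, backoffOptions));
    }
  }
}

// Usage:
// const ws = await connectWithRetry(openRealtimeSocket, { maxAttempts: 8 });
```

Retrying only makes sense for transient failures; an `invalid_request_error` will fail the same way on every attempt, so handle those separately.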
