# Configuring Models

The Inworld Realtime API uses an OpenAI Realtime API-compatible event system to facilitate voice experiences. This guide summarizes the key building blocks documented in the [API reference](/api-reference/realtimeAPI/realtime/realtime-websocket).

## Configure a Session

For WebSocket, the connection starts with a `session.created` event. For WebRTC, send `session.update` as soon as the data channel opens. In both cases, use `session.update` to configure your session. Here you can set:

* `model` — LLM provider and model (e.g. `openai/gpt-4.1-nano`) or router (e.g. `inworld/latency-optimizer-ab-test`)
* `instructions`
* `output_modalities` (`["audio", "text"]`, `["audio"]`, or `["text"]`)
* Audio input and output configuration — voice, TTS model, PCM format, speed
* `max_output_tokens` (`"inf"` or a numeric ceiling)
* `tools` (function definitions) and `tool_choice` settings

Partial updates are supported, so you can adjust the LLM, voice, TTS model, temperature, or tool lists mid-session without rebuilding the socket.

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    type: 'realtime',
    model: 'openai/gpt-4o-mini',
    instructions: 'You are a friendly narrator.',
    output_modalities: ['audio', 'text'],
    temperature: 0.8,
    audio: {
      input: {
        turn_detection: {
          type: 'semantic_vad',
          create_response: true,
          interrupt_response: true
        }
      },
      output: {
        voice: 'Clive',
        model: 'inworld-tts-1.5-mini',
        speed: 1.0
      }
    }
  }
}));
```
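
For WebSocket connections, one way to sequence the configuration is to wait for `session.created` before sending `session.update`. The sketch below assumes an event-emitter style socket with `on`, `off`, and `send` (as in the Node `ws` package used in this guide's examples); `configureOnCreated` is a hypothetical helper name, not part of the API.

```javascript
// Hypothetical helper: send the session configuration once, as soon as the
// server announces the session with `session.created`.
function configureOnCreated(ws, session) {
  ws.on('message', function onMessage(buffer) {
    const event = JSON.parse(buffer.toString());
    if (event.type !== 'session.created') return;

    // Stop listening so later events do not re-send the configuration.
    ws.off('message', onMessage);
    ws.send(JSON.stringify({ type: 'session.update', session }));
  });
}

// Usage (illustrative values):
// configureOnCreated(ws, { type: 'realtime', instructions: 'You are a friendly narrator.' });
```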

## Choose a Router or LLM

Set `model` in `session.update` to select which [Router](/router/introduction) or LLM handles the conversation. The format is `provider/modelName` or `inworld/routerId`:

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    model: 'openai/gpt-4o-mini'
  }
}));
```

If you omit `model`, the default model (`google-ai-studio/gemini-2.5-flash`) is used. You can change the model mid-session with a partial update — the new model takes effect on the next response.
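
As a minimal sketch of the `provider/modelName` convention, the helper below splits a model ID on the first slash and builds a partial `session.update` that changes only the model. The names `parseModelId` and `modelUpdate` are illustrative and not part of the API; the server remains the authority on which IDs are valid.

```javascript
// Hypothetical helper: split "provider/modelName" or "inworld/routerId"
// on the first slash and reject obviously malformed IDs.
function parseModelId(id) {
  const slash = id.indexOf('/');
  if (slash <= 0 || slash === id.length - 1) {
    throw new Error(`Expected "provider/modelName", got "${id}"`);
  }
  return { provider: id.slice(0, slash), name: id.slice(slash + 1) };
}

// Build a partial session.update that changes only the model.
function modelUpdate(id) {
  parseModelId(id); // throws early on malformed IDs
  return { type: 'session.update', session: { model: id } };
}

// Usage: ws.send(JSON.stringify(modelUpdate('openai/gpt-4o-mini')));
```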

## Choose a Voice

Set `audio.output.voice` to control the agent's speaking voice:

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: { output: { voice: 'Olivia' } }
  }
}));
```

The default voice is `Dennis`. Browse available voices in the [TTS Playground](/tts/tts-playground) or list them programmatically with the [List Voices API](/api-reference/voiceAPI/voiceservice/list-voices).

## Choose a TTS Model

Set `audio.output.model` to select the text-to-speech model:

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: { output: { model: 'inworld-tts-2' } }
  }
}));
```

| Model                  | Size | Notes                                     |
| ---------------------- | ---- | ----------------------------------------- |
| `inworld-tts-1.5-mini` | 1B   | Faster inference, lower latency (default) |
| `inworld-tts-2`        | 8B   | Higher quality audio                      |

The default is `inworld-tts-1.5-mini`. You can change the TTS model mid-session alongside voice or independently.

## Choose an STT Model

Set `audio.input.transcription.model` to select the speech-to-text model used to transcribe user audio:

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: { input: { transcription: { model: 'assemblyai/u3-rt-pro' } } }
  }
}));
```

| Model                                         | Best for                                                                                                             |
| --------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| `assemblyai/u3-rt-pro`                        | High-accuracy, sub-300ms latency, multilingual streaming (English, Spanish, French, German, Italian, Portuguese)     |
| `assemblyai/universal-streaming-multilingual` | Multilingual streaming across the same six languages                                                                 |
| `assemblyai/universal-streaming-english`      | English-optimized streaming                                                                                          |
| `inworld/inworld-stt-1`                       | Voice agents that benefit from Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking |
| `soniox/stt-rt-v4`                            | High-accuracy real-time streaming with semantic end-of-turn detection and multilingual support                       |

If the selected model is not recognised, the server responds with an `error` event (`type: "invalid_request_error"`, `code: "invalid_value"`, `param: "session.audio.input.transcription.model"`) and the rest of the `session.update` is not applied. See [STT Introduction](/stt/overview) for the full model catalogue and comparison.
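
The rejection described above can be detected on the client. This sketch assumes the error details are nested under `event.error`, as in the Monitor Errors example later in this guide; `isInvalidSttModelError` and `sttFallbackUpdate` are hypothetical helper names.

```javascript
// Assumption: error details live under `event.error`; adjust if your
// payload differs.
const STT_MODEL_PARAM = 'session.audio.input.transcription.model';

function isInvalidSttModelError(event) {
  if (event.type !== 'error' || !event.error) return false;
  const { type, code, param } = event.error;
  return type === 'invalid_request_error'
    && code === 'invalid_value'
    && param === STT_MODEL_PARAM;
}

// Hypothetical recovery: build an update that selects a fallback model
// from the table above.
function sttFallbackUpdate(fallbackModel) {
  return {
    type: 'session.update',
    session: { audio: { input: { transcription: { model: fallbackModel } } } }
  };
}
```

Because the rest of a rejected `session.update` is not applied, any other fields from that update need to be resent along with the corrected model.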

## Tune Turn Detection

Turn detection — when the server decides a user has finished speaking — is controlled by the OpenAI-standard [`audio.input.turn_detection`](https://platform.openai.com/docs/api-reference/realtime-sessions/session) object. The Realtime API supports both VAD types and is wire-compatible with the OpenAI SDK.

### `semantic_vad`

Model-based end-of-turn detection backed by the STT stream. `eagerness` is the primary tuning knob.

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: {
      input: {
        turn_detection: {
          type: 'semantic_vad',
          eagerness: 'medium',          // 'low' | 'medium' | 'high' | 'auto'
          create_response: true,
          interrupt_response: true
        }
      }
    }
  }
}));
```

| Field                | Type    | Description                                                                                                                                                                       |
| -------------------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `type`               | string  | `"semantic_vad"` (default)                                                                                                                                                        |
| `eagerness`          | string  | How aggressively to end turns: `"low"`, `"medium"`, `"high"`, `"auto"`. Lower eagerness requires stronger end-of-turn confidence; higher eagerness commits to end-of-turn sooner. |
| `create_response`    | boolean | Auto-create a response on turn end (default `true`)                                                                                                                               |
| `interrupt_response` | boolean | Interrupt the active response when the user speaks (default `true`)                                                                                                               |

### `server_vad`

Inworld-hosted Silero VAD + Smart Turn detector. Tunable fields match OpenAI's `server_vad` shape and can be changed mid-session via partial `session.update`.

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: {
      input: {
        turn_detection: {
          type: 'server_vad',
          threshold: 0.5,
          prefix_padding_ms: 200,
          silence_duration_ms: 1000,
          idle_timeout_ms: 8000,
          create_response: true,
          interrupt_response: true
        }
      }
    }
  }
}));
```

| Field                 | Type            | Description                                                                                                                           |
| --------------------- | --------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `type`                | string          | `"server_vad"`                                                                                                                        |
| `threshold`           | number          | Silero VAD speech cutoff, `0.0`–`1.0`. Default `0.5`.                                                                                 |
| `prefix_padding_ms`   | integer         | Pre-speech audio retained before an utterance, in ms. Default `200`.                                                                  |
| `silence_duration_ms` | integer         | Trailing silence required to finalize the turn, in ms. Default `1000`.                                                                |
| `idle_timeout_ms`     | integer \| null | When set, the server emits `input_audio_buffer.timeout_triggered` after this many ms with no detected speech. `null` or `0` disables. |
| `create_response`     | boolean         | Auto-create a response on turn end (default `true`)                                                                                   |
| `interrupt_response`  | boolean         | Interrupt the active response when the user speaks (default `true`)                                                                   |

Every field can be sent in a partial `session.update`; omit a field to keep its current value. Changes take effect on the next audio chunk processed.
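
A client-side check can catch out-of-range values before they reach the server. The sketch below follows the types and the `0.0`–`1.0` threshold range from the table above; treating the millisecond fields as non-negative is an assumption, and `serverVadUpdate` is a hypothetical helper name.

```javascript
// Hypothetical helper: validate a subset of server_vad fields and wrap them
// in a partial session.update. Omitted fields are left out of the payload.
function serverVadUpdate(fields) {
  const { threshold, prefix_padding_ms, silence_duration_ms, idle_timeout_ms } = fields;

  if (threshold !== undefined && !(threshold >= 0 && threshold <= 1)) {
    throw new RangeError('threshold must be between 0.0 and 1.0');
  }
  // Assumption: durations must be non-negative integers.
  for (const [key, value] of Object.entries({ prefix_padding_ms, silence_duration_ms })) {
    if (value !== undefined && !(Number.isInteger(value) && value >= 0)) {
      throw new RangeError(`${key} must be a non-negative integer`);
    }
  }
  // `null` is allowed here because it disables the idle timeout.
  if (idle_timeout_ms !== undefined && idle_timeout_ms !== null
      && !(Number.isInteger(idle_timeout_ms) && idle_timeout_ms >= 0)) {
    throw new RangeError('idle_timeout_ms must be a non-negative integer or null');
  }

  return {
    type: 'session.update',
    session: { audio: { input: { turn_detection: { type: 'server_vad', ...fields } } } }
  };
}

// Usage: ws.send(JSON.stringify(serverVadUpdate({ silence_duration_ms: 600 })));
```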

See [Voice Activity Detection (VAD)](/realtime/usage/managing-conversations#voice-activity-detection-vad) for the VAD event lifecycle.

## Transcription Hints

Guide the STT decoder with a prompt (vocabulary, domain context, formatting preferences). This is the OpenAI-standard [`audio.input.transcription.prompt`](https://platform.openai.com/docs/api-reference/realtime-sessions/session) field and is portable across OpenAI-compatible SDKs:

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: {
      input: {
        transcription: {
          model: 'assemblyai/u3-rt-pro',
          prompt: 'Medical dictation. Vocabulary: angioplasty, myocardial infarction.',
          language: 'en'
        }
      }
    }
  }
}));
```

| Field      | Type   | Description                                                                                |
| ---------- | ------ | ------------------------------------------------------------------------------------------ |
| `model`    | string | STT model ID. See [Choose an STT Model](#choose-an-stt-model).                             |
| `prompt`   | string | Transcription guidance: vocabulary hints, domain context, formatting preferences.          |
| `language` | string | BCP-47 language code (e.g. `"en"`, `"es"`). Optional; the model auto-detects when omitted. |

## Send Input

### Audio

There are two ways to send audio input:

**Method 1: Streaming Audio (Real-time)**
Use `input_audio_buffer.*` events for streaming real-time audio from a microphone:

1. Convert microphone data to PCM16, 24 kHz, mono.
2. Send chunks via `input_audio_buffer.append`.
3. VAD automatically detects speech boundaries and commits the buffer.
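
The first two steps can be sketched as follows for a Node.js client. This assumes the samples are already 24 kHz mono floats in the range -1 to 1, that PCM16 means little-endian signed 16-bit, and that the append event carries the Base64 chunk in an `audio` field as in the OpenAI Realtime API. `floatToPcm16Base64` and `appendAudio` are hypothetical helper names.

```javascript
// Convert Float32 samples in [-1, 1] to little-endian PCM16, Base64-encoded.
// Resampling to 24 kHz is out of scope for this sketch.
function floatToPcm16Base64(samples) {
  const buf = Buffer.alloc(samples.length * 2);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot overflow the 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    buf.writeInt16LE(Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), i * 2);
  }
  return buf.toString('base64');
}

// Assumption: the chunk travels in an `audio` field on the append event.
function appendAudio(ws, samples) {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: floatToPcm16Base64(samples)
  }));
}
```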

**Method 2: Pre-recorded Audio**
Use `conversation.item.create` with `input_audio` content type for pre-recorded audio chunks:

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'user',
    content: [{
      type: 'input_audio',
      audio: base64AudioData  // Base64-encoded PCM16 or OPUS
    }]
  }
}));
```

### Text

Create explicit conversation items for text turns:

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'user',
    content: [{
      type: 'input_text',
      text: 'Give me a two-sentence summary.'
    }]
  }
}));
```
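
A text turn can be wrapped in a small helper. The sketch below assumes, as in the OpenAI Realtime API, that creating a text item does not by itself start a response, so a `response.create` follows it; verify this against the API reference for your setup. `textTurnEvents` is a hypothetical helper name.

```javascript
// Build the events for one text turn.
// Assumption: a `response.create` is needed after the item to trigger a reply.
function textTurnEvents(text, { respond = true } = {}) {
  const events = [{
    type: 'conversation.item.create',
    item: { type: 'message', role: 'user', content: [{ type: 'input_text', text }] }
  }];
  if (respond) events.push({ type: 'response.create' });
  return events;
}

// Usage:
// textTurnEvents('Give me a two-sentence summary.')
//   .forEach((e) => ws.send(JSON.stringify(e)));
```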

## Function Calling

The Realtime API supports function calling so your agent can fetch live data or trigger actions mid-conversation. Define functions in `session.tools`, then handle calls as they arrive.

### 1. Register a tool

```javascript theme={"system"}
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    type: 'realtime',
    tools: [{
      type: 'function',
      name: 'get_horoscope',
      description: 'Get the horoscope for a zodiac sign',
      parameters: {
        type: 'object',
        properties: {
          sign: {
            type: 'string',
            description: 'Zodiac sign, e.g. Aries, Taurus'
          }
        },
        required: ['sign']
      }
    }],
    tool_choice: 'auto'
  }
}));
```

### 2. Handle the function call

When the model decides to call a function, you receive a `response.function_call_arguments.done` event with the `call_id`, function `name`, and serialized `arguments`. Execute your logic, then return the result:

```javascript theme={"system"}
ws.on('message', async (buffer) => {
  const event = JSON.parse(buffer.toString());

  if (event.type === 'response.function_call_arguments.done') {
    const { call_id, name, arguments: argsJson } = event;
    const args = JSON.parse(argsJson);

    // Run your business logic (fetchHoroscope is your own implementation;
    // `await` works whether it returns a value or a promise)
    let result;
    if (name === 'get_horoscope') {
      result = await fetchHoroscope(args.sign);
    } else {
      // Without a fallback, JSON.stringify(undefined) would drop `output`
      result = { error: `Unknown function: ${name}` };
    }

    // Send the function result back
    ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'function_call_output',
        call_id,
        output: JSON.stringify(result)
      }
    }));

    // Tell the model to continue with the result
    ws.send(JSON.stringify({ type: 'response.create' }));
  }
});
```

### 3. What happens next

After `response.create`, the model incorporates the function output and continues the conversation: it speaks the horoscope aloud (if `output_modalities` includes `audio`) or streams text deltas. The answer arrives as part of the ongoing conversation, although the time your function takes to run adds to the response latency.

You can register multiple tools and the model will call them as needed. Each call arrives as a separate `response.function_call_arguments.done` event with its own `call_id`.
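
With several tools registered, a lookup table keeps the dispatch logic flat. The sketch below is one possible arrangement, not the API's prescribed pattern: `toolHandlers`, `handleFunctionCall`, and the placeholder horoscope handler are all hypothetical. It returns the events to send rather than sending them, so it can be tested without a socket.

```javascript
// Hypothetical registry mapping tool names to handlers; handlers may be async.
const toolHandlers = new Map([
  ['get_horoscope', async ({ sign }) => ({ sign, horoscope: `Placeholder horoscope for ${sign}` })]
]);

// Run the matching handler and return the two events to send back.
async function handleFunctionCall(event, handlers = toolHandlers) {
  const { call_id, name, arguments: argsJson } = event;
  let output;
  try {
    const handler = handlers.get(name);
    if (!handler) throw new Error(`Unknown tool: ${name}`);
    output = await handler(JSON.parse(argsJson));
  } catch (err) {
    // Report failures as the function output instead of leaving the call unanswered.
    output = { error: err.message };
  }
  return [
    {
      type: 'conversation.item.create',
      item: { type: 'function_call_output', call_id, output: JSON.stringify(output) }
    },
    { type: 'response.create' }
  ];
}

// Usage inside the message handler:
// (await handleFunctionCall(event)).forEach((e) => ws.send(JSON.stringify(e)));
```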

## Manage Conversation State

Use conversation events to keep context lean:

* `conversation.item.retrieve`: pull any prior item by ID.
* `conversation.item.delete`: remove items that should not remain in context.

Pair these with `max_output_tokens` and `response.cancel` to control overall cost. See the [conversation management guide](/realtime/usage/managing-conversations) for details.
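
A simple pruning policy built on these events might look like the sketch below. It assumes both events identify the item with an `item_id` field, as in the OpenAI Realtime API, and that your client tracks item IDs in conversation order. `retrieveItem`, `deleteItem`, and `pruneEvents` are hypothetical helper names.

```javascript
// Assumption: both events take an `item_id` field.
const retrieveItem = (itemId) => ({ type: 'conversation.item.retrieve', item_id: itemId });
const deleteItem = (itemId) => ({ type: 'conversation.item.delete', item_id: itemId });

// Hypothetical policy: given item IDs in conversation order, build delete
// events for everything except the most recent `keep` items.
function pruneEvents(itemIds, keep) {
  const cut = Math.max(0, itemIds.length - keep);
  return itemIds.slice(0, cut).map(deleteItem);
}

// Usage: pruneEvents(trackedIds, 20).forEach((e) => ws.send(JSON.stringify(e)));
```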

## Monitor Errors

Handle `error` events (with `type`, `code`, and `param`) and implement a reconnection/backoff strategy for transient failures. See the [API reference](/api-reference/realtimeAPI/realtime/realtime-websocket) for error event schemas.

```javascript theme={"system"}
ws.on('message', (buffer) => {
  const event = JSON.parse(buffer.toString());

  if (event.type === 'error') {
    handleError(event.error);
  }
});
```
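
One common reconnection strategy is exponential backoff with jitter. The sketch below is illustrative: the base delay, cap, and attempt limit are arbitrary defaults rather than values from the API reference, and `connect` is a hypothetical function you supply that resolves once a socket is ready or rejects on failure.

```javascript
// Exponential backoff with full jitter: pick a random delay between 0 and
// min(capMs, baseMs * 2^attempt).
function backoffDelay(attempt, { baseMs = 500, capMs = 30000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Retry `connect` until it succeeds or the attempt limit is reached.
async function connectWithRetry(connect, { maxAttempts = 5, sleep = defaultSleep, ...opts } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await connect();
    } catch (err) {
      // Give up after the final attempt and surface the last error.
      if (attempt === maxAttempts - 1) throw err;
      await sleep(backoffDelay(attempt, opts));
    }
  }
}
```

Jitter spreads reconnect attempts from many clients over time, which avoids synchronized retry bursts after an outage.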
