The Inworld Realtime API uses an OpenAI Realtime API-compatible event system to facilitate voice experiences. This guide summarizes the key building blocks documented in the API reference.

Configure a Session

For WebSocket, the connection starts with a session.created event. For WebRTC, send session.update as soon as the data channel opens. In both cases, use session.update to configure your session. Here you can set:
  • model — LLM provider and model (e.g. openai/gpt-4.1-nano) or router (e.g. inworld/latency-optimizer-ab-test)
  • instructions — system prompt guiding the agent's persona and behavior
  • output_modalities (["audio", "text"], ["audio"], or ["text"])
  • Audio input and output configuration — voice, TTS model, PCM format, speed
  • max_output_tokens ("inf" or a numeric ceiling)
  • tools (function definitions) and tool_choice settings
Partial updates are supported, so you can adjust the LLM, voice, TTS model, temperature, or tool lists mid-session without rebuilding the socket.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    type: 'realtime',
    model: 'openai/gpt-4o-mini',
    instructions: 'You are a friendly narrator.',
    output_modalities: ['audio', 'text'],
    temperature: 0.8,
    audio: {
      input: {
        turn_detection: {
          type: 'semantic_vad',
          create_response: true,
          interrupt_response: true
        }
      },
      output: {
        voice: 'Clive',
        model: 'inworld-tts-1.5-mini',
        speed: 1.0
      }
    }
  }
}));

Choose a Router or LLM

Set model in session.update to select which Router or LLM handles the conversation. The format is provider/modelName or inworld/routerId:
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    model: 'openai/gpt-4o-mini'
  }
}));
If you omit model, the default model (google-ai-studio/gemini-2.5-flash) is used. You can change the model mid-session with a partial update — the new model takes effect on the next response.

Choose a Voice

Set audio.output.voice to control the agent’s speaking voice:
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: { output: { voice: 'Olivia' } }
  }
}));
The default voice is Dennis. Browse available voices in the TTS Playground or list them programmatically with the List Voices API.

Choose a TTS Model

Set audio.output.model to select the text-to-speech model:
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: { output: { model: 'inworld-tts-1.5-max' } }
  }
}));
| Model | Size | Notes |
| --- | --- | --- |
| inworld-tts-1.5-mini | 1B | Faster inference, lower latency (default) |
| inworld-tts-1.5-max | 8B | Higher quality audio |
The default is inworld-tts-1.5-mini. You can change the TTS model mid-session alongside voice or independently.

Choose an STT Model

Set audio.input.transcription.model to select the speech-to-text model used to transcribe user audio:
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: { input: { transcription: { model: 'assemblyai/u3-rt-pro' } } }
  }
}));
| Model | Best for |
| --- | --- |
| assemblyai/u3-rt-pro | High accuracy, sub-300 ms latency, multilingual streaming (English, Spanish, French, German, Italian, Portuguese) |
| assemblyai/universal-streaming-multilingual | Multilingual streaming across the same six languages |
| assemblyai/universal-streaming-english | English-optimized streaming |
| inworld/inworld-stt-1 | Voice agents that benefit from Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking |
| soniox/stt-rt-v4 | High-accuracy real-time streaming with semantic end-of-turn detection and multilingual support |
If the selected model is not recognized, the server responds with an error event (type: "invalid_request_error", code: "invalid_value", param: "session.audio.input.transcription.model") and the rest of the session.update is not applied. See STT Introduction for the full model catalog and comparison.

Tune Turn Detection

Turn detection (when the server decides a user has finished speaking) is controlled by the OpenAI-standard audio.input.turn_detection object. The Realtime API supports both VAD types, semantic_vad and server_vad, and is wire-compatible with the OpenAI SDK.

semantic_vad

Model-based end-of-turn detection backed by the STT stream. eagerness is the primary tuning knob.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: {
      input: {
        turn_detection: {
          type: 'semantic_vad',
          eagerness: 'medium',          // 'low' | 'medium' | 'high' | 'auto'
          create_response: true,
          interrupt_response: true
        }
      }
    }
  }
}));
| Field | Type | Description |
| --- | --- | --- |
| type | string | "semantic_vad" (default) |
| eagerness | string | How aggressively to end turns: "low", "medium", "high", "auto". Lower eagerness requires stronger end-of-turn confidence; higher eagerness commits to end-of-turn sooner. |
| create_response | boolean | Auto-create a response on turn end (default true) |
| interrupt_response | boolean | Interrupt the active response when the user speaks (default true) |

server_vad

Inworld-hosted Silero VAD + Smart Turn detector. Tunable fields match OpenAI’s server_vad shape and can be changed mid-session via partial session.update.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: {
      input: {
        turn_detection: {
          type: 'server_vad',
          threshold: 0.5,
          prefix_padding_ms: 200,
          silence_duration_ms: 1000,
          idle_timeout_ms: 8000,
          create_response: true,
          interrupt_response: true
        }
      }
    }
  }
}));
| Field | Type | Description |
| --- | --- | --- |
| type | string | "server_vad" |
| threshold | number | Silero VAD speech cutoff, 0.0–1.0. Default 0.5. |
| prefix_padding_ms | integer | Pre-speech audio retained before an utterance, in ms. Default 200. |
| silence_duration_ms | integer | Trailing silence required to finalize the turn, in ms. Default 1000. |
| idle_timeout_ms | integer \| null | When set, the server emits input_audio_buffer.timeout_triggered after this many ms with no detected speech. null or 0 disables. |
| create_response | boolean | Auto-create a response on turn end (default true) |
| interrupt_response | boolean | Interrupt the active response when the user speaks (default true) |
All fields accept partial session.update — omit a field to keep its current value. Changes take effect on the next audio chunk processed. See Voice Activity Detection (VAD) for the VAD event lifecycle.
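If you set idle_timeout_ms, add a handler for the resulting event. A minimal sketch, assuming one reasonable reaction is to create a response that re-engages the silent user:
ws.on('message', (buffer) => {
  const event = JSON.parse(buffer.toString());

  // Fires only when idle_timeout_ms is set and no speech arrived in time
  if (event.type === 'input_audio_buffer.timeout_triggered') {
    // Illustrative reaction: ask the model to nudge the user
    ws.send(JSON.stringify({ type: 'response.create' }));
  }
});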

Transcription Hints

Guide the STT decoder with a prompt (vocabulary, domain context, formatting preferences). This is the OpenAI-standard audio.input.transcription.prompt field and is portable across OpenAI-compatible SDKs:
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: {
      input: {
        transcription: {
          model: 'assemblyai/u3-rt-pro',
          prompt: 'Medical dictation. Vocabulary: angioplasty, myocardial infarction.',
          language: 'en'
        }
      }
    }
  }
}));
| Field | Type | Description |
| --- | --- | --- |
| model | string | STT model ID. See Choose an STT Model. |
| prompt | string | Transcription guidance: vocabulary hints, domain context, formatting preferences. |
| language | string | BCP-47 language code (e.g. "en", "es"). Optional; the model auto-detects when omitted. |

Send Input

Audio

There are two ways to send audio input.

Method 1: Streaming Audio (Real-time)

Use input_audio_buffer.* events to stream real-time audio from a microphone (a minimal sketch follows the list):
  1. Convert microphone data to PCM16, 24 kHz, mono.
  2. Send chunks via input_audio_buffer.append.
  3. VAD automatically detects speech boundaries and commits the buffer.
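A minimal append sketch, assuming pcm16Chunk is a hypothetical Buffer of 24 kHz mono PCM16 produced by your capture pipeline (capture code not shown):
ws.send(JSON.stringify({
  type: 'input_audio_buffer.append',
  audio: pcm16Chunk.toString('base64')  // payload must be base64-encoded
}));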

Method 2: Pre-recorded Audio

Use conversation.item.create with the input_audio content type for pre-recorded audio chunks:
ws.send(JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'user',
    content: [{
      type: 'input_audio',
      audio: base64AudioData  // Base64-encoded PCM16 or OPUS
    }]
  }
}));

Text

Create explicit conversation items for text turns:
ws.send(JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'user',
    content: [{
      type: 'input_text',
      text: 'Give me a two-sentence summary.'
    }]
  }
}));
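In OpenAI-compatible clients, a text item does not by itself trigger a reply; follow it with response.create to request one:
ws.send(JSON.stringify({ type: 'response.create' }));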

Function Calling

The Realtime API supports function calling so your agent can fetch live data or trigger actions mid-conversation. Define functions in session.tools, then handle calls as they arrive.

1. Register a tool

ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    type: 'realtime',
    tools: [{
      type: 'function',
      name: 'get_horoscope',
      description: 'Get the horoscope for a zodiac sign',
      parameters: {
        type: 'object',
        properties: {
          sign: {
            type: 'string',
            description: 'Zodiac sign, e.g. Aries, Taurus'
          }
        },
        required: ['sign']
      }
    }],
    tool_choice: 'auto'
  }
}));

2. Handle the function call

When the model decides to call a function, you receive a response.function_call_arguments.done event with the call_id, function name, and serialized arguments. Execute your logic, then return the result:
ws.on('message', (buffer) => {
  const event = JSON.parse(buffer.toString());

  if (event.type === 'response.function_call_arguments.done') {
    const { call_id, name, arguments: argsJson } = event;
    const args = JSON.parse(argsJson);

    // Run your business logic
    let result;
    if (name === 'get_horoscope') {
      result = fetchHoroscope(args.sign);
    }

    // Send the function result back
    ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'function_call_output',
        call_id,
        output: JSON.stringify(result)
      }
    }));

    // Tell the model to continue with the result
    ws.send(JSON.stringify({ type: 'response.create' }));
  }
});

3. What happens next

After response.create, the model incorporates the function output and continues the conversation — speaking the horoscope aloud (if output_modalities includes audio) or streaming text deltas. The user hears the answer without any gap in the conversation flow. You can register multiple tools and the model will call them as needed. Each call arrives as a separate response.function_call_arguments.done event with its own call_id.

Manage Conversation State

Use conversation events to keep context lean:
  • conversation.item.retrieve: pull any prior item by ID.
  • conversation.item.delete: remove items that should not remain in context.
Pair these with max_output_tokens and response.cancel to control overall cost; see the conversation management guide.
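A minimal sketch of both events, where itemId is a placeholder for an ID captured from an earlier conversation event:
// Hypothetical ID captured from a prior conversation event
const itemId = 'item_abc123';

// Pull the item back for inspection
ws.send(JSON.stringify({ type: 'conversation.item.retrieve', item_id: itemId }));

// Remove the item from the conversation context
ws.send(JSON.stringify({ type: 'conversation.item.delete', item_id: itemId }));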

Monitor Errors

Handle error events (with type, code, and param) and implement a reconnection/backoff strategy for transient failures. See the API reference for error event schemas.
ws.on('message', (buffer) => {
  const event = JSON.parse(buffer.toString());

  if (event.type === 'error') {
    handleError(event.error);
  }
});
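Reconnection is application-specific; one common pattern is exponential backoff with a cap. A sketch, where connectRealtime() stands in for your own connect-and-configure logic:
let retries = 0;

function connectWithBackoff() {
  const ws = connectRealtime();  // placeholder: open the socket and send session.update

  ws.on('open', () => { retries = 0; });  // reset after a healthy connection

  ws.on('close', () => {
    const delayMs = Math.min(30000, 1000 * 2 ** retries++);  // 1s, 2s, 4s... capped at 30s
    setTimeout(connectWithBackoff, delayMs);
  });
}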