
Conversation Items

Conversation items represent the messages and interactions that make up a session. Each item has:
  • ID: Unique identifier
  • Type: message, function_call, or function_call_output
  • Role: user, assistant, or tool
  • Content: The actual content of the item (array of content parts)
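The fields above can be sketched as a plain object. A minimal shape check (the helper name and sample values here are illustrative, not part of the API) might look like:

```javascript
// Illustrative conversation item using the fields listed above.
const sampleItem = {
  id: 'item_001',      // unique identifier (example value)
  type: 'message',     // message, function_call, or function_call_output
  role: 'user',        // user, assistant, or tool
  content: [{ type: 'input_text', text: 'Hello, how are you?' }]
};

// Hypothetical sanity check for items received in events.
function isValidItem(item) {
  const types = ['message', 'function_call', 'function_call_output'];
  if (typeof item.id !== 'string' || !types.includes(item.type)) return false;
  if (item.type === 'message' && !Array.isArray(item.content)) return false;
  return true;
}
```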

Content Types

Conversation items support different content types depending on direction.

Input Content Types (for user messages):
  • input_text - Plain text input from the user
  • input_audio - Base64-encoded audio input from the user

Output Content Types (for assistant responses):
  • text - Text output from the assistant
  • audio - Audio output from the assistant
You can mix multiple content parts in a single conversation item. For example, you can combine text and audio in the same message.

Creating Conversation Items

Text Messages

ws.send(JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'user',
    content: [
      {
        type: 'input_text',
        text: 'Hello, how are you?'
      }
    ]
  }
}));

Audio Messages

There are two ways to send audio input.

Method 1: Streaming Audio (Real-time)

Use input_audio_buffer.append for streaming real-time audio from a microphone:
// Stream audio chunks in real-time
ws.send(JSON.stringify({
  type: 'input_audio_buffer.append',
  audio: base64AudioData
}));
// VAD automatically detects speech boundaries and commits the buffer

Method 2: Pre-recorded Audio Chunks

Use conversation.item.create with input_audio for pre-recorded audio chunks:
ws.send(JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'user',
    content: [{
      type: 'input_audio',
      audio: base64AudioData  // Base64-encoded PCM16 or OPUS audio
    }]
  }
}));

When to use each method:
  • Streaming (input_audio_buffer.append): Use for real-time microphone input, voice conversations, live audio streaming
  • Pre-recorded (conversation.item.create with input_audio): Use for pre-recorded audio files, batch processing, or when you have complete audio chunks ready
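Either way, the audio payload is a base64 string. As a rough sketch (the chunk size and function name are my own choices, not API requirements), a pre-recorded PCM16 buffer can be sliced and encoded like this:

```javascript
// Split a raw PCM16 buffer into fixed-size chunks and base64-encode each,
// ready for an input_audio content part or an input_audio_buffer.append event.
function toBase64Chunks(pcmBuffer, chunkBytes = 32 * 1024) {
  const chunks = [];
  for (let offset = 0; offset < pcmBuffer.length; offset += chunkBytes) {
    chunks.push(pcmBuffer.subarray(offset, offset + chunkBytes).toString('base64'));
  }
  return chunks;
}

// Example: 80 KB of silence becomes three chunks (32 KB + 32 KB + 16 KB).
const chunks = toBase64Chunks(Buffer.alloc(80 * 1024));
```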

Mixed Content

You can combine multiple content types in a single conversation item:
ws.send(JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'user',
    content: [
      {
        type: 'input_text',
        text: 'Here is some context about the audio:'
      },
      {
        type: 'input_audio',
        audio: base64AudioData
      },
      {
        type: 'input_text',
        text: 'And here is additional context.'
      }
    ]
  }
}));

Receiving Conversation Items

When items are added to the conversation, you’ll receive events:
ws.on('message', (data) => {
  const event = JSON.parse(data);
  
  if (event.type === 'conversation.item.added') {
    console.log('Item added:', event.item.id);
    console.log('Content:', event.item.content);
  }
  
  if (event.type === 'conversation.item.done') {
    console.log('Item processing complete:', event.item.id);
  }
});

Retrieving Conversation Items

Retrieve specific conversation items:
ws.send(JSON.stringify({
  type: 'conversation.item.retrieve',
  item_id: 'item-id-here'
}));
The server will respond with:
{
  type: 'conversation.item.retrieved',
  item: {
    id: 'item-id-here',
    type: 'message',
    role: 'user',
    content: [...]
  }
}

Deleting Conversation Items

Remove items from the conversation:
ws.send(JSON.stringify({
  type: 'conversation.item.delete',
  item_id: 'item-id-here'
}));
You’ll receive a confirmation:
{
  type: 'conversation.item.deleted',
  item_id: 'item-id-here'
}

Function Calling

The Realtime API supports function calling, allowing the assistant to invoke tools you define. Configure functions in session.update and handle function call events.

Defining Functions

ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    type: 'realtime',
    tools: [{
      type: 'function',
      name: 'get_weather',
      description: 'Get the weather for a location',
      parameters: {
        type: 'object',
        properties: {
          location: {
            type: 'string',
            description: 'The city and state, e.g. San Francisco, CA'
          }
        },
        required: ['location']
      }
    }],
    tool_choice: 'auto'
  }
}));

Handling Function Calls

ws.on('message', (data) => {
  const event = JSON.parse(data);
  
  if (event.type === 'response.function_call_arguments.done') {
    // executeFunction is your own application-side dispatch logic
    const result = executeFunction(event.name, JSON.parse(event.arguments));
    
    ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'function_call_output',
        call_id: event.call_id,
        output: JSON.stringify(result)
      }
    }));
    
    ws.send(JSON.stringify({
      type: 'response.create'
    }));
  }
});
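The executeFunction call in the handler above stands in for your own dispatch logic. One common sketch is a registry mapping tool names to local handlers (the get_weather handler here returns canned data purely for illustration):

```javascript
// Hypothetical registry mapping tool names to local implementations.
const functionRegistry = {
  get_weather: ({ location }) => ({ location, temperature_c: 18, conditions: 'clear' })
};

// Dispatch a function call by name; unknown names return a structured error
// payload so the model gets feedback instead of the client crashing.
function executeFunction(name, args) {
  const handler = functionRegistry[name];
  if (!handler) return { error: `Unknown function: ${name}` };
  return handler(args);
}

const result = executeFunction('get_weather', { location: 'San Francisco, CA' });
```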

Voice Activity Detection

Voice Activity Detection (VAD) automatically detects when speech starts and stops, enabling natural turn-taking in conversations. Configure VAD through session.update.

Configuring VAD

ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    type: 'realtime',
    audio: {
      input: {
        turn_detection: {
          type: 'semantic_vad',
          eagerness: 'medium',
          create_response: true,
          interrupt_response: true
        }
      }
    }
  }
}));

VAD Types

  • semantic_vad: Uses conversational awareness to detect natural speech boundaries. Adjust eagerness (low, medium, high) to control responsiveness.

VAD Events

ws.on('message', (data) => {
  const event = JSON.parse(data);
  
  if (event.type === 'input_audio_buffer.speech_started') {
    console.log('Speech detected');
    // Update UI to show user is speaking
  }
  
  if (event.type === 'input_audio_buffer.speech_stopped') {
    console.log('Speech ended');
    // Update UI, prepare for response
  }
});
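These events pair naturally with a small piece of client-side state. A minimal sketch (this class is illustrative, not part of the API) that tracks whether the user is currently speaking and how many turns have completed:

```javascript
// Derive "is the user speaking?" and a turn count from VAD events.
class SpeechStateTracker {
  constructor() {
    this.speaking = false;
    this.turns = 0;
  }
  handle(event) {
    if (event.type === 'input_audio_buffer.speech_started') {
      this.speaking = true;
    }
    if (event.type === 'input_audio_buffer.speech_stopped') {
      if (this.speaking) this.turns += 1;  // a full start/stop pair is one turn
      this.speaking = false;
    }
  }
}

const tracker = new SpeechStateTracker();
tracker.handle({ type: 'input_audio_buffer.speech_started' });
tracker.handle({ type: 'input_audio_buffer.speech_stopped' });
```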

Error Handling

The Realtime API emits error events for various failure scenarios. Handle these events to provide robust error recovery and user feedback.

Error Event Structure

ws.on('message', (data) => {
  const event = JSON.parse(data);
  
  if (event.type === 'error') {
    const error = event.error;
    
    switch (error.type) {
      case 'invalid_request_error':
        console.error('Invalid request:', error.message);
        if (error.param) {
          console.error('Parameter:', error.param);
        }
        break;
      case 'server_error':
        console.error('Server error:', error.message);
        // Implement retry logic
        break;
      case 'rate_limit_error':
        console.error('Rate limit exceeded');
        // Pause requests, implement backoff
        break;
    }
  }
});

Error Types

  • invalid_request_error: Invalid parameters or malformed requests. Check error.param for the specific field.
  • server_error: Transient server-side failures. Implement retry logic with exponential backoff.
  • rate_limit_error: Rate limit exceeded. Throttle requests and retry with exponential backoff.
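For the retryable cases, capped exponential backoff is the usual approach. A sketch (the base delay and cap are arbitrary choices, not values mandated by the API):

```javascript
// Delay grows as baseMs * 2^attempt, capped at maxMs. Production code would
// typically add random jitter to avoid synchronized retries across clients.
function backoffDelay(attempt, baseMs = 500, maxMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}
```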

Interruption Handling

Interrupt active responses when new user input arrives.

Interrupting Responses

Cancel an in-progress response when the user starts speaking again:
ws.on('message', (data) => {
  const event = JSON.parse(data);
  
  if (event.type === 'input_audio_buffer.speech_started') {
    // User started speaking, cancel current response
    ws.send(JSON.stringify({
      type: 'response.cancel'
    }));
  }
});
When interrupt_response: true is set in VAD configuration, the server automatically cancels responses when new speech is detected.

Managing Context

Session Instructions

Update session instructions to guide the conversation:
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    type: 'realtime',
    instructions: 'You are a helpful assistant. Be concise and friendly.'
  }
}));

Conversation History

The API automatically maintains conversation history. You can:
  1. Keep full history: Let the conversation grow naturally
  2. Selective deletion: Remove specific items that aren’t needed
  3. Session resets: Start a new session when you need a clean context window
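Options 1 and 2 can be combined: track a rough size estimate locally and delete the oldest items once a budget is exceeded. This sketch uses a crude ~4-characters-per-token heuristic (an assumption for illustration, not an API-provided count):

```javascript
// Crude size estimate: ~4 characters per token of text content (heuristic).
function estimateTokens(item) {
  const text = (item.content || [])
    .filter(part => part.type === 'input_text' || part.type === 'text')
    .map(part => part.text)
    .join(' ');
  return Math.ceil(text.length / 4);
}

// IDs of the oldest items to delete (via conversation.item.delete) so the
// estimated total fits the budget. Assumes items are ordered oldest-first.
function itemsToTrim(items, budgetTokens) {
  let total = items.reduce((sum, item) => sum + estimateTokens(item), 0);
  const doomed = [];
  for (const item of items) {
    if (total <= budgetTokens) break;
    doomed.push(item.id);
    total -= estimateTokens(item);
  }
  return doomed;
}

const history = [
  { id: 'a', content: [{ type: 'input_text', text: 'x'.repeat(40) }] },  // ~10 tokens
  { id: 'b', content: [{ type: 'text', text: 'y'.repeat(40) }] }         // ~10 tokens
];
const trimmed = itemsToTrim(history, 15);  // total ~20 exceeds 15, drop oldest
```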

Example: Conversation Manager

Here’s a complete example of managing conversations:
class ConversationManager {
  constructor(ws) {
    this.ws = ws;
    this.items = new Map();
    this.setupListeners();
  }
  
  setupListeners() {
    this.ws.on('message', (data) => {
      const event = JSON.parse(data);
      
      switch (event.type) {
        case 'conversation.item.added':
          this.items.set(event.item.id, event.item);
          break;
        case 'conversation.item.deleted':
          this.items.delete(event.item_id);
          break;
      }
    });
  }
  
  sendMessage(text) {
    this.ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: 'user',
        content: [{
          type: 'input_text',
          text: text
        }]
      }
    }));
  }
  
  deleteItem(itemId) {
    this.ws.send(JSON.stringify({
      type: 'conversation.item.delete',
      item_id: itemId
    }));
  }
  
  getConversationHistory() {
    return Array.from(this.items.values());
  }
}

Best Practices

  1. Monitor Context Length: Keep track of conversation length to avoid exceeding limits
  2. Strategic Deletion: Remove old context that’s no longer relevant
  3. Item Tracking: Maintain a local map of conversation items for quick access
  4. Error Handling: Handle cases where items might not exist when deleting/retrieving
  5. Context Management: Use session instructions to guide conversation behavior

Use Cases

  • Long Conversations: Delete old context to maintain performance
  • Error Recovery: Delete incorrect items and resend
  • Context Switching: Clear conversation context when changing topics
  • Memory Management: Remove items that are no longer needed