
Conversation Items

Conversation items represent the messages and interactions that make up a session. Each item has:
  • ID: Unique identifier
  • Type: message, function_call, or function_call_output
  • Role: user, assistant, or tool
  • Content: The actual content of the item (array of content parts)
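The fields above can be sketched as a plain object. A minimal shape check (the helper name and sample values here are illustrative, not part of the API) might look like:

```javascript
// Illustrative conversation item using the fields listed above.
const sampleItem = {
  id: 'item_001',      // unique identifier (example value)
  type: 'message',     // message, function_call, or function_call_output
  role: 'user',        // user, assistant, or tool
  content: [{ type: 'input_text', text: 'Hello, how are you?' }]
};

// Hypothetical sanity check for items received in events.
function isValidItem(item) {
  const types = ['message', 'function_call', 'function_call_output'];
  if (typeof item.id !== 'string' || !types.includes(item.type)) return false;
  if (item.type === 'message' && !Array.isArray(item.content)) return false;
  return true;
}
```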

Content Types

Conversation items support different content types depending on direction.

Input Content Types (for user messages):
  • input_text - Plain text input from the user
  • input_audio - Base64-encoded audio input from the user

Output Content Types (for assistant responses):
  • text - Text output from the assistant
  • audio - Audio output from the assistant
You can mix multiple content parts in a single conversation item. For example, you can combine text and audio in the same message.

Creating Conversation Items

Text Messages

ws.send(JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'user',
    content: [
      {
        type: 'input_text',
        text: 'Hello, how are you?'
      }
    ]
  }
}));

Audio Messages

There are two ways to send audio input.

Method 1: Streaming Audio (Real-time)

Use input_audio_buffer.append for streaming real-time audio from a microphone:
// Stream audio chunks in real-time
ws.send(JSON.stringify({
  type: 'input_audio_buffer.append',
  audio: base64AudioData
}));
// VAD automatically detects speech boundaries and commits the buffer

Method 2: Pre-recorded Audio Chunks

Use conversation.item.create with input_audio for pre-recorded audio chunks:
ws.send(JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'user',
    content: [{
      type: 'input_audio',
      audio: base64AudioData  // Base64-encoded PCM16 or OPUS audio
    }]
  }
}));

When to use each method:
  • Streaming (input_audio_buffer.append): Use for real-time microphone input, voice conversations, live audio streaming
  • Pre-recorded (conversation.item.create with input_audio): Use for pre-recorded audio files, batch processing, or when you have complete audio chunks ready
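Either way, the audio payload is a base64 string. As a rough sketch (the chunk size and function name are my own choices, not API requirements), a pre-recorded PCM16 buffer can be sliced and encoded like this:

```javascript
// Split a raw PCM16 buffer into fixed-size chunks and base64-encode each,
// ready for an input_audio content part or an input_audio_buffer.append event.
function toBase64Chunks(pcmBuffer, chunkBytes = 32 * 1024) {
  const chunks = [];
  for (let offset = 0; offset < pcmBuffer.length; offset += chunkBytes) {
    chunks.push(pcmBuffer.subarray(offset, offset + chunkBytes).toString('base64'));
  }
  return chunks;
}

// Example: 80 KB of silence becomes three chunks (32 KB + 32 KB + 16 KB).
const chunks = toBase64Chunks(Buffer.alloc(80 * 1024));
```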

Mixed Content

You can combine multiple content types in a single conversation item:
ws.send(JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'message',
    role: 'user',
    content: [
      {
        type: 'input_text',
        text: 'Here is some context about the audio:'
      },
      {
        type: 'input_audio',
        audio: base64AudioData
      },
      {
        type: 'input_text',
        text: 'And here is additional context.'
      }
    ]
  }
}));

Receiving Conversation Items

When items are added to the conversation, you’ll receive events:
ws.on('message', (data) => {
  const event = JSON.parse(data);
  
  if (event.type === 'conversation.item.added') {
    console.log('Item added:', event.item.id);
    console.log('Content:', event.item.content);
  }
  
  if (event.type === 'conversation.item.done') {
    console.log('Item processing complete:', event.item.id);
  }
});

Retrieving Conversation Items

Retrieve specific conversation items:
ws.send(JSON.stringify({
  type: 'conversation.item.retrieve',
  item_id: 'item-id-here'
}));
The server will respond with:
{
  type: 'conversation.item.retrieved',
  item: {
    id: 'item-id-here',
    type: 'message',
    role: 'user',
    content: [...]
  }
}

Deleting Conversation Items

Remove items from the conversation:
ws.send(JSON.stringify({
  type: 'conversation.item.delete',
  item_id: 'item-id-here'
}));
You’ll receive a confirmation:
{
  type: 'conversation.item.deleted',
  item_id: 'item-id-here'
}

Function Calling

The Realtime API supports function calling, allowing the assistant to invoke tools you define. Configure functions in session.update and handle function call events.

Defining Functions

ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    type: 'realtime',
    tools: [{
      type: 'function',
      name: 'get_weather',
      description: 'Get the weather for a location',
      parameters: {
        type: 'object',
        properties: {
          location: {
            type: 'string',
            description: 'The city and state, e.g. San Francisco, CA'
          }
        },
        required: ['location']
      }
    }],
    tool_choice: 'auto'
  }
}));

Handling Function Calls

ws.on('message', (data) => {
  const event = JSON.parse(data);
  
  if (event.type === 'response.function_call_arguments.done') {
    // executeFunction is your own application-side dispatch logic
    const result = executeFunction(event.name, JSON.parse(event.arguments));
    
    ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'function_call_output',
        call_id: event.call_id,
        output: JSON.stringify(result)
      }
    }));
    
    ws.send(JSON.stringify({
      type: 'response.create'
    }));
  }
});
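The executeFunction call in the handler above stands in for your own dispatch logic. One common sketch is a registry mapping tool names to local handlers (the get_weather handler here returns canned data purely for illustration):

```javascript
// Hypothetical registry mapping tool names to local implementations.
const functionRegistry = {
  get_weather: ({ location }) => ({ location, temperature_c: 18, conditions: 'clear' })
};

// Dispatch a function call by name; unknown names return a structured error
// payload so the model gets feedback instead of the client crashing.
function executeFunction(name, args) {
  const handler = functionRegistry[name];
  if (!handler) return { error: `Unknown function: ${name}` };
  return handler(args);
}

const result = executeFunction('get_weather', { location: 'San Francisco, CA' });
```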

Voice Activity Detection

Voice Activity Detection (VAD) automatically detects when speech starts and stops, enabling natural turn-taking in conversations. Configure VAD through session.update.

Configuring VAD

ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    type: 'realtime',
    audio: {
      input: {
        turn_detection: {
          type: 'semantic_vad',
          eagerness: 'medium',
          create_response: true,
          interrupt_response: true
        }
      }
    }
  }
}));

VAD Types

  • semantic_vad: Uses conversational awareness to detect natural speech boundaries. Adjust eagerness (low, medium, high) to control responsiveness.

VAD Events

ws.on('message', (data) => {
  const event = JSON.parse(data);
  
  if (event.type === 'input_audio_buffer.speech_started') {
    console.log('Speech detected');
    // Update UI to show user is speaking
  }
  
  if (event.type === 'input_audio_buffer.speech_stopped') {
    console.log('Speech ended');
    // Update UI, prepare for response
  }
});
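These events pair naturally with a small piece of client-side state. A minimal sketch (this class is illustrative, not part of the API) that tracks whether the user is currently speaking and how many turns have completed:

```javascript
// Derive "is the user speaking?" and a turn count from VAD events.
class SpeechStateTracker {
  constructor() {
    this.speaking = false;
    this.turns = 0;
  }
  handle(event) {
    if (event.type === 'input_audio_buffer.speech_started') {
      this.speaking = true;
    }
    if (event.type === 'input_audio_buffer.speech_stopped') {
      if (this.speaking) this.turns += 1;  // a full start/stop pair is one turn
      this.speaking = false;
    }
  }
}

const tracker = new SpeechStateTracker();
tracker.handle({ type: 'input_audio_buffer.speech_started' });
tracker.handle({ type: 'input_audio_buffer.speech_stopped' });
```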

Error Handling

The Realtime API emits error events for various failure scenarios. Handle these events to provide robust error recovery and user feedback.

Error Event Structure

ws.on('message', (data) => {
  const event = JSON.parse(data);
  
  if (event.type === 'error') {
    const error = event.error;
    
    switch (error.type) {
      case 'invalid_request_error':
        console.error('Invalid request:', error.message);
        if (error.param) {
          console.error('Parameter:', error.param);
        }
        break;
      case 'server_error':
        console.error('Server error:', error.message);
        // Implement retry logic
        break;
      case 'rate_limit_error':
        console.error('Rate limit exceeded');
        // Pause requests, implement backoff
        break;
    }
  }
});

Error Types

  • invalid_request_error: Invalid parameters or malformed requests. Check error.param for the specific field.
  • server_error: Transient server-side failures. Implement retry logic with exponential backoff.
  • rate_limit_error: Rate limit exceeded. Throttle requests and retry with exponential backoff.
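For the retryable cases, capped exponential backoff is the usual approach. A sketch (the base delay and cap are arbitrary choices, not values mandated by the API):

```javascript
// Delay grows as baseMs * 2^attempt, capped at maxMs. Production code would
// typically add random jitter to avoid synchronized retries across clients.
function backoffDelay(attempt, baseMs = 500, maxMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}
```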

Interruption Handling

Interrupt active responses when new user input arrives.

Interrupting Responses

Cancel an in-progress response when the user starts speaking again:
ws.on('message', (data) => {
  const event = JSON.parse(data);
  
  if (event.type === 'input_audio_buffer.speech_started') {
    // User started speaking, cancel current response
    ws.send(JSON.stringify({
      type: 'response.cancel'
    }));
  }
});
When interrupt_response: true is set in VAD configuration, the server automatically cancels responses when new speech is detected.

Managing Context

Session Instructions

Update session instructions to guide the conversation:
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    type: 'realtime',
    instructions: 'You are a helpful assistant. Be concise and friendly.'
  }
}));

Conversation History

The API automatically maintains conversation history. You can:
  1. Keep full history: Let the conversation grow naturally
  2. Selective deletion: Remove specific items that aren’t needed
  3. Session resets: Start a new session when you need a clean context window
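Options 1 and 2 can be combined: track a rough size estimate locally and delete the oldest items once a budget is exceeded. This sketch uses a crude ~4-characters-per-token heuristic (an assumption for illustration, not an API-provided count):

```javascript
// Crude size estimate: ~4 characters per token of text content (heuristic).
function estimateTokens(item) {
  const text = (item.content || [])
    .filter(part => part.type === 'input_text' || part.type === 'text')
    .map(part => part.text)
    .join(' ');
  return Math.ceil(text.length / 4);
}

// IDs of the oldest items to delete (via conversation.item.delete) so the
// estimated total fits the budget. Assumes items are ordered oldest-first.
function itemsToTrim(items, budgetTokens) {
  let total = items.reduce((sum, item) => sum + estimateTokens(item), 0);
  const doomed = [];
  for (const item of items) {
    if (total <= budgetTokens) break;
    doomed.push(item.id);
    total -= estimateTokens(item);
  }
  return doomed;
}

const history = [
  { id: 'a', content: [{ type: 'input_text', text: 'x'.repeat(40) }] },  // ~10 tokens
  { id: 'b', content: [{ type: 'text', text: 'y'.repeat(40) }] }         // ~10 tokens
];
const trimmed = itemsToTrim(history, 15);  // total ~20 exceeds 15, drop oldest
```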

Example: Conversation Manager

Here’s a complete example of managing conversations:
class ConversationManager {
  constructor(ws) {
    this.ws = ws;
    this.items = new Map();
    this.setupListeners();
  }
  
  setupListeners() {
    this.ws.on('message', (data) => {
      const event = JSON.parse(data);
      
      switch (event.type) {
        case 'conversation.item.added':
          this.items.set(event.item.id, event.item);
          break;
        case 'conversation.item.deleted':
          this.items.delete(event.item_id);
          break;
      }
    });
  }
  
  sendMessage(text) {
    this.ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: 'user',
        content: [{
          type: 'input_text',
          text: text
        }]
      }
    }));
  }
  
  deleteItem(itemId) {
    this.ws.send(JSON.stringify({
      type: 'conversation.item.delete',
      item_id: itemId
    }));
  }
  
  getConversationHistory() {
    return Array.from(this.items.values());
  }
}

Best Practices

  1. Monitor Context Length: Keep track of conversation length to avoid exceeding limits
  2. Strategic Deletion: Remove old context that’s no longer relevant
  3. Item Tracking: Maintain a local map of conversation items for quick access
  4. Error Handling: Handle cases where items might not exist when deleting/retrieving
  5. Context Management: Use session instructions to guide conversation behavior

Use Cases

  • Long Conversations: Delete old context to maintain performance
  • Error Recovery: Delete incorrect items and resend
  • Context Switching: Clear conversation context when changing topics
  • Memory Management: Remove items that are no longer needed