# LLM + TTS (Voice Responses)

> Generate intelligent text with any LLM and convert it to speech in a single API call

## Overview

If you're already using Realtime TTS, Realtime Router lets you combine your **LLM** call and **Realtime Text-to-Speech** in a single request. Instead of managing two separate API calls (one for text generation, one for speech synthesis), you send one request and receive both text and audio back. Both streaming and non-streaming modes are supported.

In streaming mode, Realtime Router handles the entire pipeline: it routes your prompt to an LLM according to your routing configuration, streams the generated text through an **optimized chunking engine**, and sends each chunk to the TTS engine as it's produced. The result is low-latency voice output: the first audio arrives well before the LLM finishes generating the full response. In non-streaming mode, the complete audio and transcript are returned together once the full response is ready.

**This is ideal for:**

* Voice assistants and conversational agents
* Real-time narration and read-aloud features
* Accessibility-first applications
* Any workflow where your users hear AI responses instead of (or in addition to) reading them

## Quick Start

Add the `audio` parameter to any chat completions request to enable TTS. With `"stream": true`, as in the request below, the text and audio data arrive in the same SSE stream.

```bash theme={"system"}
curl --request POST \
  --url https://api.inworld.ai/v1/chat/completions \
  --header 'Authorization: Basic <your-api-key>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "inworld/my-router",
    "max_tokens": 1000,
    "stream": true,
    "audio": {
      "voice": "Dennis",
      "model": "inworld-tts-2"
    },
    "messages": [
      {"role": "user", "content": "What is the meaning of life?"}
    ]
  }'
```

That's it. Inworld Router will:

1. Route your prompt to your preset Inworld Route (or your chosen model)
2. Stream text chunks to Inworld TTS as they're generated
3. Return both text and audio in the SSE stream

## Audio Parameters

The `audio` object controls voice synthesis:

| Parameter | Type   | Description                                                                                                                                                                       |
| --------- | ------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `voice`   | string | **Required.** The voice ID to use for speech synthesis (e.g., `"Dennis"`, `"Chloe"`). See [List Voices](/api-reference/ttsAPI/texttospeech/list-voices) for all available voices. |
| `model`   | string | **Required.** The TTS model to use (e.g., `"inworld-tts-2"`). See [TTS Models](/tts/tts-models) for available options.                                                            |

### Default Audio Output

| Property    | Value     |
| ----------- | --------- |
| Sample rate | 48,000 Hz |
| Format      | PCM       |
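
Raw PCM has no header, so most players will not open it directly. As a minimal sketch, the decoded bytes can be wrapped in a WAV container with Python's standard `wave` module. The 48,000 Hz sample rate comes from the table above; 16-bit mono is an assumption taken from the code comments in the examples on this page, and `pcm_to_wav` and `pcm_duration_seconds` are hypothetical helper names.

```python
import io
import wave

SAMPLE_RATE_HZ = 48_000   # from the Default Audio Output table
SAMPLE_WIDTH_BYTES = 2    # assumption: 16-bit samples
CHANNELS = 1              # assumption: mono


def pcm_to_wav(pcm_audio: bytes) -> bytes:
    """Wrap raw PCM bytes in a WAV container and return the file contents."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(SAMPLE_RATE_HZ)
        wav_file.writeframes(pcm_audio)
    return buffer.getvalue()


def pcm_duration_seconds(pcm_audio: bytes) -> float:
    """Playback length implied by the byte count under the assumptions above."""
    return len(pcm_audio) / (SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS)
```

Writing the returned bytes to a `.wav` file should make the audio playable in common tools, provided the sample width and channel count assumed here match what the API actually returns.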

## Streaming Response Format

When streaming is enabled (`"stream": true`), the response is delivered as Server-Sent Events (SSE). Each event is a JSON object in the `data` field.

When TTS is active, text is delivered through `delta.audio.transcript`. Audio data and its corresponding transcript are sent together via `delta.audio`:

```json theme={"system"}
data: {"choices":[{"delta":{"audio":{"data":"<base64-pcm-audio>","transcript":"Hello! How can I assist you today?"}},"index":0}],...}
```

| Field                    | Description                                                 |
| ------------------------ | ----------------------------------------------------------- |
| `delta.audio.data`       | Base64-encoded PCM audio.                                   |
| `delta.audio.transcript` | The text being spoken. Use this for real-time text display. |

<Note>
  Text and audio are chunked independently. Text is chunked at natural sentence boundaries, while audio is chunked at fixed byte sizes. This means a single `transcript` value may span multiple audio chunks. The transcript for a text segment is attached to the first audio chunk of that segment — subsequent audio chunks for the same segment will contain only `data` without a `transcript` field.
</Note>
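
One way to apply this rule on the client is to start a new segment whenever a `transcript` appears and attach any transcript-less audio to the most recent segment. The sketch below assumes the event shape shown above; `group_segments` is a hypothetical helper, not part of any SDK.

```python
import base64
import json


def group_segments(sse_lines):
    """Group streamed audio chunks under the transcript they belong to.

    Returns a list of {"transcript": str, "audio": bytes} dicts, one per
    text segment, in arrival order.
    """
    segments = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue
        audio = choices[0].get("delta", {}).get("audio")
        if not audio:
            continue
        # A transcript marks the first audio chunk of a new text segment.
        if "transcript" in audio or not segments:
            segments.append({"transcript": audio.get("transcript", ""), "audio": b""})
        if "data" in audio:
            segments[-1]["audio"] += base64.b64decode(audio["data"])
    return segments
```

Keeping audio and text paired this way makes it possible to highlight or caption each sentence while its audio plays.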

### Non-Streaming Response

Without streaming (`"stream": false`), the full audio and transcript are returned in the `message.audio` object:

```json theme={"system"}
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "",
      "audio": {
        "id": "audio_chatcmpl-xyz",
        "data": "<base64-pcm-audio>",
        "transcript": "Hello! How can I assist you today?"
      }
    },
    "finish_reason": "stop"
  }]
}
```

<Note>
  When TTS is active, `message.content` is empty. The full text is available in `message.audio.transcript`.
</Note>
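
A small sketch of reading a non-streaming response under that rule: take the text from `message.audio.transcript` and decode the audio from base64. `extract_voice_reply` is a hypothetical helper, and the fallback to `message.content` when no `audio` object is present (for example, on a tool-call turn) is an assumption rather than documented behavior.

```python
import base64


def extract_voice_reply(response_json: dict) -> tuple[str, bytes]:
    """Return (text, pcm_bytes) from a non-streaming chat completion."""
    message = response_json["choices"][0]["message"]
    audio = message.get("audio")
    if not audio:
        # Assumption: without audio, any text lives in message.content.
        return message.get("content") or "", b""
    transcript = audio.get("transcript", "")
    pcm_bytes = base64.b64decode(audio.get("data", ""))
    return transcript, pcm_bytes
```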

## Use Any LLM

The `audio` parameter works with any model available through Inworld Router. The LLM generates text, and Inworld Router handles the TTS conversion separately — so your choice of voice is independent of your choice of model. See the [Models API](/api-reference/modelsAPI/modelservice/list-models) for a full list of supported LLM models.

```bash theme={"system"}
# Use auto model selection + TTS
curl --request POST \
  --url https://api.inworld.ai/v1/chat/completions \
  --header 'Authorization: Basic <your-api-key>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "auto",
    "stream": true,
    "audio": {
      "voice": "Chloe",
      "model": "inworld-tts-2"
    },
    "messages": [
      {"role": "user", "content": "Tell me a short bedtime story."}
    ],
    "extra_body": {
      "sort": ["latency"]
    }
  }'
```

This combines Inworld Router's intelligent model selection with TTS — you get the fastest available LLM **and** voice output in one call.

## Combine with Smart Routing Features

All Inworld Router capabilities work alongside TTS:

### Failover with Voice

If your primary model is unavailable, Inworld Router fails over to a backup — and the voice output continues seamlessly:

```bash theme={"system"}
curl --request POST \
  --url https://api.inworld.ai/v1/chat/completions \
  --header 'Authorization: Basic <your-api-key>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "openai/gpt-5",
    "stream": true,
    "audio": {
      "voice": "Dennis",
      "model": "inworld-tts-2"
    },
    "messages": [
      {"role": "user", "content": "Explain quantum computing simply."}
    ],
    "extra_body": {
      "models": ["anthropic/claude-opus-4-6", "google-ai-studio/gemini-2.5-flash"]
    }
  }'
```

If GPT-5 fails, the request fails over to Claude or Gemini, and the same voice (Dennis) is used regardless of which model generates the text.

### Cost-Optimized Voice Responses

Route to the cheapest model while still getting audio output:

```bash theme={"system"}
curl --request POST \
  --url https://api.inworld.ai/v1/chat/completions \
  --header 'Authorization: Basic <your-api-key>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "auto",
    "stream": true,
    "audio": {
      "voice": "Dennis",
      "model": "inworld-tts-2"
    },
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "extra_body": {
      "sort": ["price", "latency"]
    }
  }'
```
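
The curl examples in this section differ only in the `model`, `audio`, and `extra_body` fields. As a sketch, a small Python helper can assemble the same request bodies. `build_voice_request` and its parameter names are hypothetical; the JSON field names mirror the requests above.

```python
def build_voice_request(prompt, voice, *, model="auto", tts_model="inworld-tts-2",
                        fallback_models=None, sort=None, stream=True):
    """Build a chat completions body with TTS and optional routing hints."""
    body = {
        "model": model,
        "stream": stream,
        "audio": {"voice": voice, "model": tts_model},
        "messages": [{"role": "user", "content": prompt}],
    }
    extra_body = {}
    if fallback_models:
        extra_body["models"] = list(fallback_models)  # failover candidates
    if sort:
        extra_body["sort"] = list(sort)               # e.g. ["price", "latency"]
    if extra_body:
        body["extra_body"] = extra_body
    return body
```

For example, `build_voice_request("What is the capital of France?", "Dennis", sort=["price", "latency"])` produces the same body as the cost-optimized request above, ready to pass as the `json=` argument of `requests.post`.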

## Optimized Chunking

Inworld Router includes a **built-in text chunking engine** optimized for TTS. Rather than waiting for the LLM to finish generating the full response, the router:

1. Buffers incoming tokens from the LLM
2. Detects natural sentence and clause boundaries
3. Sends each chunk to the TTS engine as soon as it's ready

This pipeline significantly reduces **Time to First Audio (TTFA)** — your users start hearing the response while the LLM is still generating text. The chunking is tuned for natural-sounding speech: it avoids breaking mid-word or mid-phrase, producing smooth, conversational audio.
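
To illustrate the idea, the sketch below buffers tokens and emits a chunk whenever a sentence is complete. This is not Inworld's actual engine, whose internals are not described here. The boundary rule (`.`, `!`, or `?` followed by whitespace) and the name `chunk_for_tts` are assumptions chosen for brevity, and clause-level boundaries are omitted.

```python
import re
from typing import Iterable, Iterator

# Whitespace that directly follows sentence-ending punctuation.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def chunk_for_tts(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the token stream finishes them."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _BOUNDARY.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever remains when the LLM stops generating.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded string could be handed to a TTS call immediately, which is what lets audio start before generation ends. A rule this simple will also split after abbreviations such as "Dr.", which is one reason a production chunker needs more careful boundary detection.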

## Tool Calling

Tool calls (function calling) work alongside TTS. When the LLM decides to call a tool, the tool call is returned as standard `delta.tool_calls` chunks (no audio is generated for that turn). Once you execute the tool and send the result back with TTS enabled, the final response is spoken.

### Tools + Voice Example

```python theme={"system"}
import requests
import json

API_URL = "https://api.inworld.ai/v1/chat/completions"
HEADERS = {
    "Authorization": "Basic <your-api-key>",
    "Content-Type": "application/json",
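}


def get_weather(location: str) -> str:
    """Placeholder tool: replace with your own weather lookup."""
    return json.dumps({"location": location, "forecast": "unknown"})


# Step 1: Request with tools + TTS enabled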
response = requests.post(API_URL, headers=HEADERS, json={
    "model": "openai/gpt-5",
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }],
    "audio": {
        "voice": "Dennis",
        "model": "inworld-tts-2"
    }
}).json()
# LLM returns tool_calls (no audio on this turn)

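tool_call = response["choices"][0]["message"]["tool_calls"][0]
# Assumes OpenAI-style tool calls, where arguments arrive as a JSON string
args = json.loads(tool_call["function"]["arguments"])

# Step 2: Execute the tool call
tool_result = get_weather(args["location"])  # Your function; should return a string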

# Step 3: Send tool result back — this response is spoken aloud
audio_response = requests.post(API_URL, headers=HEADERS, json={
    "model": "openai/gpt-5",
    "stream": True,
    "messages": [
        {"role": "user", "content": "What's the weather in Tokyo?"},
        response["choices"][0]["message"],
        {"role": "tool", "content": tool_result, "tool_call_id": tool_call["id"]}
    ],
    "audio": {
        "voice": "Dennis",
        "model": "inworld-tts-2"
    }
}, stream=True)
# Parse SSE stream for audio chunks (same as Python example below)
```
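
The example above sends the first request without streaming. If you stream that request as well, the tool call arrives in pieces. The sketch below assumes the OpenAI-style delta layout (an `index` per call, `id` and `function.name` on the first fragment, `function.arguments` split across fragments), which this page does not spell out. `accumulate_tool_calls` and `parsed_arguments` are hypothetical helpers.

```python
import json


def accumulate_tool_calls(deltas):
    """Merge streamed delta.tool_calls fragments into complete tool calls."""
    calls = {}
    for delta in deltas:
        for fragment in delta.get("tool_calls") or []:
            call = calls.setdefault(fragment.get("index", 0), {
                "id": None,
                "type": "function",
                "function": {"name": "", "arguments": ""},
            })
            if fragment.get("id"):
                call["id"] = fragment["id"]
            fn = fragment.get("function") or {}
            call["function"]["name"] += fn.get("name") or ""
            call["function"]["arguments"] += fn.get("arguments") or ""
    return [calls[i] for i in sorted(calls)]


def parsed_arguments(call):
    """Decode the accumulated JSON argument string of one tool call."""
    return json.loads(call["function"]["arguments"])
```

Pass in the `delta` object from each parsed SSE event. Deltas without `tool_calls`, such as audio deltas, are ignored, so the same loop can feed both this helper and your audio handling.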

## Python Example

```python theme={"system"}
import requests
import json
import base64

response = requests.post(
    "https://api.inworld.ai/v1/chat/completions",
    headers={
        "Authorization": "Basic <your-api-key>",
        "Content-Type": "application/json",
    },
    json={
        "model": "openai/gpt-5",
        "max_tokens": 500,
        "stream": True,
        "audio": {
            "voice": "Dennis",
            "model": "inworld-tts-2",
        },
        "messages": [
            {"role": "user", "content": "Tell me a fun fact about space."}
        ],
    },
    stream=True,
)

audio_chunks = []
full_transcript = ""

for line in response.iter_lines():
    line = line.decode("utf-8")
    if not line.startswith("data: "):
        continue
    data = line[6:]
    if data == "[DONE]":
        break

    chunk = json.loads(data)
    choices = chunk.get("choices") or []
    if not choices:
        continue  # skip events that carry no choices
    audio = choices[0].get("delta", {}).get("audio")

    if audio:
        # Text transcript — use for real-time text display
        if "transcript" in audio:
            full_transcript += audio["transcript"]
            print(audio["transcript"], end="", flush=True)

        # Audio data — PCM 48kHz 16-bit mono
        if "data" in audio:
            audio_chunks.append(base64.b64decode(audio["data"]))

pcm_audio = b"".join(audio_chunks)
```

## JavaScript / Node.js Example

```javascript theme={"system"}
const response = await fetch("https://api.inworld.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: "Basic <your-api-key>",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "openai/gpt-5",
    max_tokens: 500,
    stream: true,
    audio: {
      voice: "Dennis",
      model: "inworld-tts-2",
    },
    messages: [
      { role: "user", content: "Tell me a fun fact about space." },
    ],
  }),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
const audioChunks = [];
let fullTranscript = "";
let buffer = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop();

  for (const line of lines) {
    if (!line.startsWith("data: ")) continue;
    const data = line.slice(6);
    if (data === "[DONE]") break;

    const chunk = JSON.parse(data);
    const audio = chunk.choices[0]?.delta?.audio;

    if (audio) {
      // Text transcript — use for real-time text display
      if (audio.transcript) {
        fullTranscript += audio.transcript;
        process.stdout.write(audio.transcript);
      }

      // Audio data — PCM 48kHz 16-bit mono
      if (audio.data) {
        audioChunks.push(Buffer.from(audio.data, "base64"));
      }
    }
  }
}

const pcmAudio = Buffer.concat(audioChunks);
```

## Next Steps

* [OpenAI Compatibility](/router/openai-compatibility) — use Inworld Router with the OpenAI SDK
* [Cost Optimizer](/router/guides/cost-optimizer) — route by query complexity to reduce costs
* [Failover System](/router/guides/failover-system) — build a resilient multi-provider setup
* [Extra Body Parameters](/router/usage/extra-body-parameters) — all available `sort`, `models`, and `ignore` options