Ref: [Main Index](https://docs.inworld.ai/llms.txt)

# Inworld AI Documentation - Full Content

This file contains every documentation page from docs.inworld.ai, generated from docs.json. For the lightweight index, see [llms.txt](https://docs.inworld.ai/llms.txt).

## Home

#### Hello Inworld

Source: https://docs.inworld.ai/introduction

## Build with Inworld

- State-of-the-art voice AI at a radically accessible price point
- Low-latency, natural speech-to-speech conversations
- Powerful model routing to optimize for every user and context

## Get Started

- Learn how to make your first TTS API call.
- Create your first LLM Router request.
- Build a voice agent that streams audio using WebSocket.
- Get started with the Unreal AI Agent Runtime.
- Get started with the Node.js Agent Runtime.
- Create and chat with an AI character with Agent Runtime.

**Using AI to code?** Paste [https://platform.inworld.ai/llms.txt](https://platform.inworld.ai/llms.txt) into Claude, ChatGPT, or Cursor to integrate Inworld TTS and Agent Runtime into your app quickly and reliably.

---

## TTS

### Get Started

#### Intro to TTS

Source: https://docs.inworld.ai/tts/tts

Inworld's text-to-speech (TTS) models offer ultra-realistic, context-aware speech synthesis, zero data retention, and precise voice cloning capabilities, enabling developers to build natural and engaging experiences with human-like speech quality at an accessible price point. Our models can be accessed via [API](/api-reference/ttsAPI/texttospeech/synthesize-speech) or the [TTS Playground](http://platform.inworld.ai/tts-playground).

- Learn how to make your first API call with a guided tutorial.
- Try different TTS models and voice cloning in TTS Playground.
- Browse ready-to-use GitHub samples for common use cases.
## Models

### TTS-1.5-Max

Our flagship model, delivering the best balance of quality and speed.

- Rich, expressive, contextually aware speech
- Support for 15 languages
- Optimized for real-time use (<200ms median latency)
- High-quality instant voice cloning

### TTS-1.5-Mini

Our ultra-fast, most cost-efficient model, for when latency is the top priority.

- Ultra-low latency (~120ms median latency)
- Support for 15 languages
- Radically affordable pricing
- High-quality instant voice cloning

## Features

| **Feature** | **TTS-1.5-Max** | **TTS-1.5-Mini** |
| :---- | :---- | :---- |
| Radically accessible pricing | $10/1M characters | $5/1M characters |
| Quality | #1 ranked, maximum stability | #1 ranked |
| P50 Latency | 200 ms | 120 ms |
| [Free instant voice cloning](/tts/voice-cloning) | | |
| Professional voice cloning | | |
| [Custom pronunciation](/tts/capabilities/custom-pronunciation) | | |
| [Multilingual](/tts/capabilities/generating-audio#language-support) | 15 languages | 15 languages |
| [Audio markups](/tts/capabilities/audio-markups) for emotion, style and non-verbals | | |
| [Timestamp alignment](/tts/capabilities/timestamps) | | |
| [On-premises deployments](/tts/on-premises) | | |
| [Zero data retention](/tts/zero-data-retention) | | |

---

#### Developer Quickstart

Source: https://docs.inworld.ai/quickstart-tts

The [TTS Playground](https://platform.inworld.ai/tts-playground) is the easiest way to experiment with Inworld's Text-to-Speech models—try out different voices, adjust parameters, and preview instant voice clones. Once you're ready to go beyond testing and build into a real-time application, the API gives you full access to advanced features and integration options.
In this quickstart, we'll focus on the Text-to-Speech API, guiding you through your first request to generate high-quality, ultra-realistic speech from text.

## Make your first TTS API request

Create an [Inworld account](https://platform.inworld.ai/signup).

In [Inworld Portal](https://platform.inworld.ai/), generate an API key by going to [**Settings** > **API Keys**](https://platform.inworld.ai/api-keys). Copy the Base64 credentials.

Set your API key as an environment variable.

```shell macOS and Linux
export INWORLD_API_KEY='your-base64-api-key-here'
```

```shell Windows
setx INWORLD_API_KEY "your-base64-api-key-here"
```

This is the simplest way to try Inworld TTS and works well for many applications — batch audio generation, pre-rendered content, and anywhere latency isn't critical. If your application requires real-time, low-latency audio delivery, see the [streaming example](#stream-your-audio-output) in the next step.

For Python or JavaScript, create a new file called `inworld_quickstart.py` or `inworld_quickstart.js`. Copy the corresponding code into the file. For a curl request, copy the request.

```python Python
import requests
import base64
import os

# Synchronous endpoint — returns complete audio in a single response.
# For low-latency or real-time use cases, use the streaming endpoint instead.
url = "https://api.inworld.ai/tts/v1/voice"

headers = {
    "Authorization": f"Basic {os.getenv('INWORLD_API_KEY')}",
    "Content-Type": "application/json",
}

payload = {
    "text": "What a wonderful day to be a text-to-speech model!",
    "voiceId": "Ashley",
    "modelId": "inworld-tts-1.5-max",
}

response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()

result = response.json()
audio_content = base64.b64decode(result['audioContent'])

with open("output.mp3", "wb") as f:
    f.write(audio_content)
```

```javascript JavaScript
const fs = require('fs');

async function main() {
  const url = 'https://api.inworld.ai/tts/v1/voice';

  const response = await fetch(url, {
    method: 'POST',
    headers: {
      'Authorization': `Basic ${process.env.INWORLD_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      text: 'What a wonderful day to be a text-to-speech model!',
      voiceId: 'Ashley',
      modelId: 'inworld-tts-1.5-max',
    }),
  });

  if (!response.ok) {
    throw new Error(`HTTP error! status: ${response.status}`);
  }

  const result = await response.json();
  const audioBuffer = Buffer.from(result.audioContent, 'base64');
  fs.writeFileSync('output.mp3', audioBuffer);
  console.log('Audio saved to output.mp3');
}

main();
```

```curl cURL
curl --request POST \
  --url https://api.inworld.ai/tts/v1/voice \
  --header "Authorization: Basic $INWORLD_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "text": "What a wonderful day to be a text-to-speech model!",
    "voiceId": "Ashley",
    "modelId": "inworld-tts-1.5-max"
  }' \
  | jq -r '.audioContent' | base64 -d > output.mp3
```

For Python, you may also have to install `requests` if it isn't already installed.

```bash Python
pip install requests
```

Run the code for Python or JavaScript, or enter the curl command into your terminal.

```bash Python
python inworld_quickstart.py
```

```bash JavaScript
node inworld_quickstart.js
```

You should see a saved file called `output.mp3`. You can play this file with any audio player.
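Both examples above do the same response handling: the synchronous endpoint returns JSON whose `audioContent` field holds the audio as a base64 string. That decode-and-save step can be isolated into a tiny helper — a sketch where the sample payload is dummy data, not real API output:

```python
import base64


def save_audio(response_json: dict, path: str) -> int:
    """Decode the base64 `audioContent` field and write the bytes to disk.

    Returns the number of audio bytes written.
    """
    audio_bytes = base64.b64decode(response_json["audioContent"])
    with open(path, "wb") as f:
        f.write(audio_bytes)
    return len(audio_bytes)


# Dummy stand-in for a real response (real audioContent is a full MP3 file)
fake_response = {"audioContent": base64.b64encode(b"fake-mp3-bytes").decode()}
print(save_audio(fake_response, "output.mp3"))  # prints 14
```

The same helper works for the streaming endpoint below, applied per chunk instead of once per response.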
## Stream your audio output

Now that you've made your first TTS API request, you can try streaming responses as well. Assuming you've already followed the instructions above to set up your API key:

First, create a new file called `inworld_stream_quickstart.py` for Python or `inworld_stream_quickstart.js` for JavaScript. Next, set your `INWORLD_API_KEY` as an environment variable. Finally, copy the following code into the file.

For this streaming example, we'll use Linear PCM format (instead of MP3), which we specify in the `audio_config`. We also include a `Connection: keep-alive` header to reuse the TCP+TLS connection across requests.

The first request to the API may be slower due to the initial TCP and TLS handshake. Subsequent requests on the same connection will be faster. Use `Connection: keep-alive` (and a persistent session in Python) to take advantage of connection reuse. See the [low-latency examples](https://github.com/inworld-ai/inworld-api-examples/tree/main/tts) in our API examples repo for more advanced techniques.

```python Python
import requests
import base64
import os
import json
import wave
import io
import time

url = "https://api.inworld.ai/tts/v1/voice:stream"

payload = {
    "text": "What a wonderful day to be a text-to-speech model! I'm super excited to show you how streaming works.",
    "voice_id": "Ashley",
    "model_id": "inworld-tts-1.5-max",
    "audio_config": {
        "audio_encoding": "LINEAR16",
        "sample_rate_hertz": 48000,
    },
}

# Use a persistent session for connection reuse (TCP+TLS keep-alive)
session = requests.Session()
session.headers.update({
    "Authorization": f"Basic {os.getenv('INWORLD_API_KEY')}",
    "Content-Type": "application/json",
    "Connection": "keep-alive",
})

start_time = time.time()
ttfb = None
raw_audio_data = io.BytesIO()

with session.post(url, json=payload, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines(decode_unicode=True):
        if line.strip():
            try:
                chunk = json.loads(line)
                result = chunk.get("result")
                if result and "audioContent" in result:
                    audio_chunk = base64.b64decode(result["audioContent"])
                    if ttfb is None:
                        ttfb = time.time() - start_time
                    # Skip WAV header (first 44 bytes) from each chunk
                    if len(audio_chunk) > 44:
                        raw_audio_data.write(audio_chunk[44:])
                    print(f"Received {len(audio_chunk)} bytes")
            except json.JSONDecodeError:
                continue

total_time = time.time() - start_time

with wave.open("output_stream.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(payload["audio_config"]["sample_rate_hertz"])
    wf.writeframes(raw_audio_data.getvalue())

print("Audio saved to output_stream.wav")
print(f"Time to first chunk: {ttfb:.3f}s" if ttfb else "No chunks received")
print(f"Total time: {total_time:.3f}s")

session.close()
```

```javascript JavaScript
const fs = require('fs');

async function main() {
  const url = 'https://api.inworld.ai/tts/v1/voice:stream';

  const audioConfig = {
    audio_encoding: 'LINEAR16',
    sample_rate_hertz: 48000,
  };

  const startTime = Date.now();
  let ttfb = null;

  const response = await fetch(url, {
    method: 'POST',
    headers: {
      'Authorization': `Basic ${process.env.INWORLD_API_KEY}`,
      'Content-Type': 'application/json',
      'Connection': 'keep-alive',
    },
    body: JSON.stringify({
      text: "What a wonderful day to be a text-to-speech model! I'm super excited to show you how streaming works.",
      voice_id: 'Ashley',
      model_id: 'inworld-tts-1.5-max',
      audio_config: audioConfig,
    }),
  });

  if (!response.ok) {
    throw new Error(`HTTP error! status: ${response.status}`);
  }

  const rawChunks = [];

  // Read the streaming response
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() || '';
    for (const line of lines) {
      if (line.trim()) {
        try {
          const chunk = JSON.parse(line);
          if (chunk.result && chunk.result.audioContent) {
            const audioBuffer = Buffer.from(chunk.result.audioContent, 'base64');
            if (ttfb === null) {
              ttfb = (Date.now() - startTime) / 1000;
            }
            // Skip WAV header (first 44 bytes) from each chunk
            if (audioBuffer.length > 44) {
              rawChunks.push(audioBuffer.subarray(44));
              console.log(`Received ${audioBuffer.length} bytes`);
            }
          }
        } catch {
          continue;
        }
      }
    }
  }

  // Build WAV file from raw audio data
  const rawAudio = Buffer.concat(rawChunks);
  const header = Buffer.alloc(44);
  const sampleRate = audioConfig.sample_rate_hertz;
  const byteRate = sampleRate * 2; // 16-bit mono
  header.write('RIFF', 0);
  header.writeUInt32LE(36 + rawAudio.length, 4);
  header.write('WAVE', 8);
  header.write('fmt ', 12);
  header.writeUInt32LE(16, 16);
  header.writeUInt16LE(1, 20); // PCM
  header.writeUInt16LE(1, 22); // mono
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(2, 32); // block align
  header.writeUInt16LE(16, 34); // bits per sample
  header.write('data', 36);
  header.writeUInt32LE(rawAudio.length, 40);

  fs.writeFileSync('output_stream.wav', Buffer.concat([header, rawAudio]));

  const totalTime = (Date.now() - startTime) / 1000;
  console.log('Audio saved to output_stream.wav');
  console.log(`Time to first chunk: ${ttfb?.toFixed(3)}s`);
  console.log(`Total time: ${totalTime.toFixed(3)}s`);
}

main();
```

Run the code for Python or JavaScript. The console will print out as streamed bytes are written to the audio file.

```bash Python
python inworld_stream_quickstart.py
```

```bash JavaScript
node inworld_stream_quickstart.js
```

You should see a saved file called `output_stream.wav`. You can play this file with any audio player.

## Next Steps

Now that you've tried out Inworld's TTS API, you can explore more of Inworld's TTS capabilities.

- Understand the capabilities of Inworld's TTS models.
- Create a personalized voice clone with just 5 seconds of audio.
- Learn tips and tricks for synthesizing high-quality speech.

---

#### TTS Models

Source: https://docs.inworld.ai/tts/tts-models

Inworld provides a family of state-of-the-art TTS models, optimized for different use cases, quality levels, and performance requirements.

---

### Build with TTS

#### Capabilities

#### Generating Audio

Source: https://docs.inworld.ai/tts/capabilities/generating-audio

## Voices

Inworld offers a variety of built-in voices across available languages that showcase a range of vocal characteristics and styles. These voices can be immediately tried out in TTS Playground and used in your applications.

For greater customization, we recommend [voice cloning](/tts/voice-cloning). Create distinct, personalized voices tailored to your experience, with as little as 5 seconds of audio.

Voices perform optimally when synthesizing text in the same language as the original voice. While cross-language synthesis is possible, you'll achieve the best quality, pronunciation, and naturalness by matching the voice's native language to your text content.

## Language Support

As a larger and more capable model, Inworld TTS 1.5 Max is better suited for multilingual applications, offering better pronunciation, more accurate intonation, and more natural-sounding speech.
Inworld's models offer support for the following languages:

- English (`en`)
- Arabic (`ar`)
- Chinese (`zh`)
- Dutch (`nl`)
- French (`fr`)
- German (`de`)
- Hebrew (`he`)
- Hindi (`hi`)
- Italian (`it`)
- Japanese (`ja`)
- Korean (`ko`)
- Polish (`pl`)
- Portuguese (`pt`)
- Russian (`ru`)
- Spanish (`es`)

## Supported Formats

Multiple audio formats are available via API to support different application requirements. The default is MP3.

- **MP3:** Popular compressed format with broad device and platform compatibility.
  - Sample rate: 16kHz - 48kHz
  - Bit rates: 32kbps - 320kbps
- **PCM (`PCM`):** Raw uncompressed 16-bit signed little-endian samples with no WAV header. Recommended for WebSocket use cases and real-time applications that process raw audio samples directly without needing container metadata.
  - Sample rate: 8kHz - 48kHz
  - Bit depth: 16-bit
- **WAV (`WAV`):** Uncompressed 16-bit signed little-endian samples with a WAV header, optimized for HTTP streaming. For non-streaming, the WAV header is included in the response. For HTTP streaming, the WAV header is included in the first audio chunk only, so all chunks in that response can be concatenated directly into a single valid WAV file. For WebSocket streaming, a WAV header is emitted at the first audio chunk of each `flush`/`flush_completed` event, so direct concatenation without processing is only valid within a single flush; to build one continuous WAV file across multiple flushes, clients must strip or rebuild the repeated headers between flushes.
  - Sample rate: 8kHz - 48kHz
  - Bit depth: 16-bit
- **Linear PCM (`LINEAR16`):** Uncompressed 16-bit signed little-endian samples with a WAV header. Maintained for backward compatibility. For non-streaming, the WAV header is included in the response. For streaming (HTTP streaming or WebSocket), the WAV header is included in every audio chunk, so each chunk is a valid WAV file on its own. Clients must strip headers when concatenating chunks.
  - Sample rate: 8kHz - 48kHz
  - Bit depth: 16-bit
- **Opus:** High-quality compressed format optimized for low-latency web and mobile applications.
  - Sample rate: 8kHz - 48kHz
  - Bit rates: 32kbps - 192kbps
- **μ-law:** Compressed telephony format ideal for voice applications with bandwidth constraints.
  - Sample rate: 8kHz
- **A-law:** Compressed telephony format ideal for voice applications with bandwidth constraints.
  - Sample rate: 8kHz

## Additional Configurations

The following optional configurations can also be adjusted as needed when synthesizing audio:

- **Temperature**: Higher values increase variation, which can produce more diverse outputs with desirable outcomes, but also increase the chances of bad generations and hallucinations. Lower values improve stability and speaker similarity, though going too low increases the chances of broken generations. The default is 1.0.
- **Talking Speed**: Controls how fast the voice speaks. 1.0 is the normal native speed, while 0.5 is half the normal speed and 1.5 is 1.5x faster than the normal speed.
- **Emphasis Markers**: Asterisks around a word (e.g. `*really*`) can be used to signal emphasis, prompting the voice to stress that word more strongly. This helps convey tone, intent, or emotion more clearly in spoken output.

---

#### Voice Cloning

Source: https://docs.inworld.ai/tts/voice-cloning

Inworld's text-to-speech models offer best-in-class voice cloning capabilities, enabling developers to create distinct, personalized voices for their experiences. There are three ways to clone a voice:

1. **Instant Voice Cloning** - Clone a voice in minutes, with only 5-15 seconds of audio. Also known as zero-shot cloning. Available to all users through Portal.
2. **Voice Cloning via API** - Instant voice cloning via API. Useful for workflow automation or enabling your users to clone their own voices.
3. **Professional Voice Cloning** - For the highest quality, fine-tune a model with 30+ minutes of audio.
Professional voice cloning is currently not publicly available. To get access, please [reach out to our sales team](https://inworld.ai/contact-sales).

Don't have audio samples? Use [Voice Design](/tts/voice-design) to create a voice from a text description instead.

## Instant Voice Cloning

In Portal, select **TTS Playground** from the left-hand side panel.

In the TTS Playground, click **Create Voice** and select **Clone**.

Name your voice and select the language, which should match the audio samples. _Voices will work best when synthesizing text that matches the language of the original audio samples._

You can either upload or record audio:

- **Upload**: Drag and drop or browse to upload 1 audio file. Accepted formats: wav, mp3, webm. Maximum file size is 4MB. Audio samples longer than 15 seconds will be automatically trimmed to 15 seconds.
- **Record**: Click "Record audio" and record your audio. You can use the suggested scripts to help guide your recording, or use your own script. For best results, record in a quiet place to minimize background noise, avoid mic noise, and speak with a variety of emotions to capture the full range of the voice.

Enable "Remove background noise" if you wish to remove background noise from your audio. Confirm you have the rights to clone the voice, then click "Continue".

Check out our [Voice Cloning Best Practices](/tts/best-practices/voice-cloning) for helpful tips and tricks to improve the quality of your voice clones.

Once voice cloning completes, you'll see the "Try your cloned voice" interface. Enter text in the input field and press play to hear your cloned voice. You can test different phrases to ensure the voice sounds as expected. If the voice doesn't sound quite right, you can delete the voice and start over, create another voice, or test it in the TTS Playground for more advanced testing options.

There is a default limit of 1,000 cloned voices stored per account. If you need a higher limit, please [contact our sales team](https://inworld.ai/contact-sales).

To use the cloned voice via API, copy the voice ID for your cloned voice in TTS Playground. Use that value for the `voiceId` when making an API call. See our [Quickstart](/quickstart-tts) to learn how to make your first API call.

Instant voice cloning may not perform well for less common voices, such as children's voices or unique accents. For those use cases, we recommend professional voice cloning.

## Voice Cloning API Reference And Examples

If you want to automate voice cloning (for example, to support creator onboarding at scale), use the Voice Cloning API.

- **API reference**: [Clone a voice](/api-reference/voiceAPI/voiceservice/clone-voice)
- **Python example**: [example_voice_clone.py](https://github.com/inworld-ai/inworld-api-examples/blob/main/tts/python/example_voice_clone.py)
- **JavaScript example**: [example_voice_clone.js](https://github.com/inworld-ai/inworld-api-examples/blob/main/tts/js/example_voice_clone.js)

Voice cloning has lower rate limits than regular speech synthesis. For details, see [Rate limits](/resources/rate-limits).

## Next Steps

Looking for more tips and tricks? Check out the resources below to get started!

- Learn best practices for producing high-quality voice clones.
- Learn best practices for synthesizing high-quality speech.
- Explore Python and JavaScript code examples for TTS integration.

---

#### Voice Design

Source: https://docs.inworld.ai/tts/voice-design

Inworld's Voice Design lets you create a completely new voice from a text description. It is perfect for when you need a unique voice but can't find the right voice in our Voice Library and don't have existing audio recordings for voice cloning.

Voice Design uses a model to generate a voice based on the following two inputs:

1. **Voice description** - A text description of the voice you have in mind (e.g., age, gender, accent, tone, pitch).
2. **Script** - The text the voice will speak. This shapes the generated voice, so using a script that matches the intended voice produces the best results.

Each time you generate, we'll return up to three voice previews so you can listen, compare, and select the ones that work best for your project.

Voice Design is currently in [research preview](/tts/resources/support#what-do-experimental-preview-and-stable-mean). Please share any feedback with us via the feedback form in [Portal](https://platform.inworld.ai) or in [Discord](https://discord.gg/inworld).

To get started, there are two ways to use voice design:

1. Through Inworld Portal - Go to TTS Playground > Create Voice > Design and follow the guided flow.
2. Via API - Useful if you want to generate a lot of voices or expose this capability to your users.

## Design a Voice in Portal

In [Portal](https://platform.inworld.ai/), select [**TTS Playground**](https://platform.inworld.ai/tts-playground) from the left-hand side panel.

Click **Create Voice** and select **Design**.

Describe the voice you want to create. The description must be in English and be between 30 and 250 characters. Keep your description concise but specific, so the model can most accurately produce what you have in mind. A good voice description should include:

- **Gender and age range** (e.g., "a mid-20s to early 30s female voice", "a middle-aged male voice")
- **Accent** (e.g., "British accent", "Southern American accent")
- **Pitch and pace** (e.g., "low-pitched", "fast-paced", "steady pace")
- **Tone and emotion** (e.g., "warm and friendly", "authoritative and composed")
- **Timbre** (e.g., "rich and smooth", "slightly raspy", "clear and bright")

**Example**: "A middle-aged male voice with a clear British accent speaking at a steady pace and with a neutral tone."

Use the **Improve Description** button to automatically enhance your description based on best practices. This adds missing attributes like pitch, pace, tone, and timbre to help the model produce a more accurate voice.

Choose the language for your generated voice. If you're using the auto-generated script, the script will be written in your selected language.

Select how you want to provide the script that the voice will speak:

- **Auto-generate script** - The system automatically generates a script that matches your voice description in the selected language. This is the easiest option and works well for most use cases.
- **Write my own** - Write a custom script for the voice to speak. For best results, scripts should result in 5 to 15 seconds of audio, which is roughly between 50 and 200 characters in English.

The script shapes the voice that gets generated. Use a script that matches your imagined voice, and the model will tailor the voice to suit the content it's speaking.
Click **Generate voice**, which will create up to 3 voice previews. Listen to each preview by clicking the play button, then select the voice(s) you want to keep.

Each generation produces slightly different results. If the first set of voices doesn't sound right, click **Generate voice** again to regenerate, or adjust your description and voice script to better match what you have in mind before regenerating.

Check out our [Voice Design Best Practices](/tts/best-practices/voice-design) guide for helpful tips and tricks to improve your designed voices.

After selecting one or more voices, give each voice a name, add optional tags, and save them to your voice library. Your designed voices will appear alongside your other voices in the TTS Playground.

To use your designed voice via API, copy the voice ID from the TTS Playground. Use that value for the `voiceId` when making an API call. See our [Quickstart](/quickstart-tts) to learn how to make your first API call.

## Voice Design API Reference And Examples

If you want to automate voice design (for example, to support creator onboarding at scale), use the Voice Design API. When designing a voice via API, there are two steps: design a voice, then publish it.

- **Design a voice API reference**: [Design a voice](/api-reference/voiceAPI/voiceservice/design-voice)
- **Publish a voice API reference**: [Publish a voice](/api-reference/voiceAPI/voiceservice/publish-voice)
- **Python example**: [example_voice_design_publish.py](https://github.com/inworld-ai/inworld-api-examples/blob/main/tts/python/example_voice_design_publish.py)
- **JavaScript example**: [example_voice_design_publish.js](https://github.com/inworld-ai/inworld-api-examples/blob/main/tts/js/example_voice_design_publish.js)

## Next Steps

- Learn best practices for designing voices.
- Clone an existing voice with just 5-15 seconds of audio.
- Learn how to make your first TTS API call in minutes.

---

#### Voice Tags

Source: https://docs.inworld.ai/tts/capabilities/voice-tags

Voice tags provide descriptive metadata about each voice, helping you categorize and filter voices based on their characteristics. Tags describe properties like gender, age group, tone, and style, making it easier to find the right voice for your use case.

## Understanding voice tags

Each voice includes a `tags` array with descriptive labels such as:

- **Gender**: `male`, `female`, `non-binary`
- **Age group**: `young_adult`, `adult`, `middle-aged`, `elderly`
- **Vocal style**: `energetic`, `calm`, `professional`, `friendly`, `warm`
- **Voice quality**: `smooth`, `clear`, `expressive`, `conversational`

## Adding voice tags

To add custom tags to voices, you'll use the voice cloning feature in the API playground:

### Step-by-step process

1. **Navigate to the API playground** - Go to the [TTS playground](https://platform.inworld.ai/tts-playground) and select the "Clone Voice" option
2. **Configure voice parameters** - Enter a voice name and description for your cloned voice
3. **Add voice tags** - Press Enter after each tag entry to add it to the list
4. **Upload your audio sample**
5. **Submit and process**
6. **Verify tags in voice list** - Your new voice will appear with the assigned tags

## Using voice tags

Voice tags are returned in the [List voices](https://docs.inworld.ai/api-reference/ttsAPI/texttospeech/list-voices) endpoint response:

---

#### Audio Markups

Source: https://docs.inworld.ai/tts/capabilities/audio-markups

Audio markups let you control how the model speaks—not only what it says, but pacing, emotion, and non-verbal sounds. This page covers two kinds: **SSML break tags** for inserting silences, and **emotion, delivery, and non-verbal markups**, bracket-style tags for expression and vocalizations.

## SSML break tags

*Use when you need precise control over silence duration and position.*

You can insert silences at specific points in the generated speech. The TTS API and Inworld Portal support SSML `<break>` tags in text input for streaming, non-streaming, and WebSocket requests, in all languages. You can specify silences in milliseconds or seconds. For example, `<break time="1s"/>` and `<break time="1000ms"/>` produce the same result.

**Constraints:**

- Use well-formed SSML: specify the slash and brackets—for example, `<break time="1s"/>`.
- Tag names and attributes are **case insensitive**; for example, `<BREAK TIME="1s"/>` works.
- Up to **20** break tags are supported per request. After the first 20 tags, the remaining ones will be ignored.
- Each break is at most **10 seconds**—for example, `time="10s"` or `time="10000ms"`.

**Example:**

```
One second pause <break time="1s"/> two seconds pause <break time="2s"/> this is the end.
```

## Emotion, delivery, and non-verbal markups

*Use when you want to control emotion, delivery style, or add sounds like sighs and laughs.*

These markups give you finer control over how the model speaks: emotional expression, delivery style such as whispering, and non-verbal vocalizations such as sighs and coughs.
These markups are currently [experimental](/tts/resources/support#what-do-experimental-preview-and-stable-mean) and only support English.

### Emotion and Delivery Style

Emotion and delivery style markups control the way a given text is spoken. These work best when used at the beginning of a text and apply to the text that follows.

- **Emotion**: `[happy]`, `[sad]`, `[angry]`, `[surprised]`, `[fearful]`, `[disgusted]`
- **Delivery Style**: `[laughing]`, `[whispering]`

For example:

```
[happy] I can't believe this is happening.
```

**Best practices:** Use only one emotion or delivery style markup at the **beginning** of your text. Using multiple emotion and delivery style markups or placing them mid-text may produce mixed results. Instead, split the text into separate requests with the markup at the start of each. See our [Best Practices](/tts/best-practices/generating-speech#audio-markups) guide for more details.

### Non-verbal Vocalization

Non-verbal vocalization markups add non-verbal sounds based on where they are placed in the text.

- `[breathe]`, `[clear_throat]`, `[cough]`, `[laugh]`, `[sigh]`, `[yawn]`

For example:

```
[clear_throat] Did you hear what I said? [sigh] You never listen to me!
```

**Best practices:** You can use multiple non-verbal vocalizations within a single piece of text to add the appropriate vocal effects throughout the speech.

---

#### Custom Pronunciation

Source: https://docs.inworld.ai/tts/capabilities/custom-pronunciation

Sometimes you may need to ensure that a word is spoken with a specific pronunciation, especially for uncommon words such as company names, brand names, nicknames, geographic locations, medical terms, or legal terms that may not appear in the model's training data. Custom pronunciation lets you precisely control how these words are spoken.

### How to Use

Inworld TTS supports inline IPA phoneme notation for custom pronunciation. Use the [International Phonetic Alphabet (IPA)](https://www.vocabulary.com/resources/ipa-pronunciation/) format, wrapped in slashes (`/ /`). For example:

- Suppose you are building an AI travel agent, and it is recommending the destination Crete, which is pronounced /kriːt/ ("kreet") in English.
- You can ensure the correct pronunciation by passing it inline: `Your interests are a perfect match for a honeymoon in /kriːt/.`

The model will substitute the IPA pronunciation wherever it appears inline in your text. If the text is generated by an LLM, you can simply replace the original spelling with the IPA transcription before passing it to the TTS model.

### Finding the Right IPA Phonemes

If you are unsure of the correct phonemes, there are several ways to find them:

- **Ask an LLM like ChatGPT**: For example, you can ask: "What are the IPA phonemes for the word Crete, pronounced like 'kreet'?"
- **Use reference websites**: Resources such as [Vocabulary.com's IPA Pronunciation Guide](https://www.vocabulary.com/resources/ipa-pronunciation/) provide tables of symbols with example words.

Once you have the correct phonemes, you can embed them directly into your TTS request: `Your adventure in /kriːt/ begins today.`

---

#### Timestamps

Source: https://docs.inworld.ai/tts/capabilities/timestamps

Timestamp alignment supports English and Spanish; other languages are experimental.

Timestamp alignment lets you retrieve timing information that matches the generated audio, which is useful for experiences like word highlighting, karaoke-style captions, and lipsync. Set the `timestampType` request parameter to control granularity:

- `WORD`: Return timestamps for each word, including detailed phoneme-level timing with viseme symbols
- `CHARACTER`: Return timestamps for each character or punctuation

Enabling timestamp alignment can increase latency (especially for the non-streaming endpoint).
When enabled, the response includes timestamp arrays: - `WORD`: `timestampInfo.wordAlignment` with `words`, `wordStartTimeSeconds`, `wordEndTimeSeconds` - For TTS 1.5 models, `phoneticDetails` containing detailed phoneme-level timing with viseme symbols - `CHARACTER`: `timestampInfo.characterAlignment` with `characters`, `characterStartTimeSeconds`, `characterEndTimeSeconds` Phoneme and viseme timings (`phoneticDetails`) are currently only returned for **WORD** alignment (not CHARACTER). See the [API reference](https://docs.inworld.ai/api-reference/ttsAPI/texttospeech/synthesize-speech) for full details. ## Streaming behavior You can control how timestamp data is delivered alongside audio using [`timestampTransportStrategy`](/api-reference/ttsAPI/texttospeech/synthesize-speech-stream#body-timestamp-transport-strategy). ### Sync (default) Audio and alignment arrive together in each chunk. Every chunk contains both audio data and its corresponding timestamps. ``` Chunk 1: audio + timestamps for chunk 1 Chunk 2: audio + timestamps for chunk 2 Chunk 3: audio + timestamps for chunk 3 ``` This is the simplest approach; however, the first audio chunk will be slightly delayed. ### Async Audio chunks arrive first, followed by separate trailing messages containing only timestamp data. This reduces time-to-first-audio with TTS 1.5 models, since the server doesn't need to wait for alignment computation before sending audio. ``` Chunk 1: audio only Chunk 2: audio only Chunk 3: audio only Chunk 4: timestamps only (alignment for chunks 1–3) Chunk 5: timestamps only ... ``` Use async when you prioritize playback speed and can handle timestamps arriving after their corresponding audio. Use sync when you need audio and timestamps together in each chunk (e.g., for real-time lip-sync or word highlighting during playback). Set `timestampTransportStrategy` to `SYNC` or `ASYNC` in your request.
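On the client side, a consumer that tolerates both modes can simply route each streamed message by what it carries. This is a minimal sketch, assuming each decoded message is a dict that may contain an `"audio"` payload, a `"timestampInfo"` payload, or both; these key names are illustrative assumptions, not the exact wire format:

```python
def handle_stream(messages):
    """Route streamed TTS messages: consume audio as it arrives; timestamp
    payloads may trail behind the audio in ASYNC mode.

    The "audio" / "timestampInfo" keys are assumptions about the decoded
    message shape -- see the API reference for the actual format.
    """
    audio_parts, alignments = [], []
    for msg in messages:
        if "audio" in msg:
            # In a real client, queue this for playback immediately.
            audio_parts.append(msg["audio"])
        if "timestampInfo" in msg:
            # In SYNC mode this arrives with the audio; in ASYNC it trails.
            alignments.append(msg["timestampInfo"])
    return b"".join(audio_parts), alignments
```

Because the handler checks both keys independently, the same loop works whether timestamps arrive inline (SYNC) or as trailing timestamp-only messages (ASYNC).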
See the [API reference](/api-reference/ttsAPI/texttospeech/synthesize-speech-stream#body-timestamp-transport-strategy) for details. ### Response structure #### TTS 1.5 models (`inworld-tts-1.5-mini`, `inworld-tts-1.5-max`) Returns enhanced alignment data with **phonetic details**: detailed phoneme-level timing with viseme symbols for precise lip-sync animation. ```json { "timestampInfo": { "wordAlignment": { "words": ["Hello,", "world,", "this", "will", "be", "saved"], "wordStartTimeSeconds": [0, 0.28, 0.96, 1.25, 1.38, 1.5], "wordEndTimeSeconds": [0.28, 0.8, 1.25, 1.38, 1.5, 1.99], "phoneticDetails": [ { "wordIndex": 0, "phones": [ {"phoneSymbol": "h", "startTimeSeconds": 0, "durationSeconds": 0.07, "visemeSymbol": "aei"}, {"phoneSymbol": "ə", "startTimeSeconds": 0.07, "durationSeconds": 0.030000001, "visemeSymbol": "aei"}, {"phoneSymbol": "l", "startTimeSeconds": 0.1, "durationSeconds": 0.089999996, "visemeSymbol": "l"}, {"phoneSymbol": "oʊ1", "startTimeSeconds": 0.19, "durationSeconds": 0.09, "visemeSymbol": "o"} ], "isPartial": false }, { "wordIndex": 1, "phones": [ {"phoneSymbol": "w", "startTimeSeconds": 0.28, "durationSeconds": 0.18, "visemeSymbol": "qw"}, {"phoneSymbol": "ɝ1", "startTimeSeconds": 0.46, "durationSeconds": 0.119999975, "visemeSymbol": "r"}, {"phoneSymbol": "l", "startTimeSeconds": 0.58, "durationSeconds": 0.08000004, "visemeSymbol": "l"}, {"phoneSymbol": "d", "startTimeSeconds": 0.66, "durationSeconds": 0.13999999, "visemeSymbol": "cdgknstxyz"} ], "isPartial": false }, { "wordIndex": 2, "phones": [ {"phoneSymbol": "ð", "startTimeSeconds": 0.96, "durationSeconds": 0.14000005, "visemeSymbol": "th"}, {"phoneSymbol": "ɪ1", "startTimeSeconds": 1.1, "durationSeconds": 0.06999993, "visemeSymbol": "ee"}, {"phoneSymbol": "s", "startTimeSeconds": 1.17, "durationSeconds": 0.08000004, "visemeSymbol": "cdgknstxyz"} ], "isPartial": false } ] } } } ``` ##### Phonetic details structure Each entry in `phoneticDetails` contains: | Field | Description | | :---- 
| :---- | | `wordIndex` | Index of the word this phonetic detail belongs to (0-based). | | `phones` | Array of phonemes that make up this word. | | `isPartial` | True when the server considers the word potentially unstable (e.g., last word in a non-final streaming update). Clients may choose to delay processing partial words until `isPartial` becomes `false`. | Each phone entry contains: | Field | Description | | :---- | :---- | | `phoneSymbol` | The phoneme symbol in IPA notation. | | `startTimeSeconds` | Start time of the phoneme in seconds. May be omitted for the first phoneme of a word. | | `durationSeconds` | Duration of the phoneme in seconds. | | `visemeSymbol` | The viseme symbol for lip-sync animation. | ##### Viseme symbols The following viseme symbols are used for lip-sync animation: | Viseme | Description | | :---- | :---- | | `aei` | Open mouth vowels (a, e, i, ə, ʌ, æ, ɑ, etc.) | | `o` | Rounded vowels (o, ʊ, əʊ, oʊ, etc.) | | `ee` | Front vowels (i, ɪ, eɪ, etc.) | | `bmp` | Bilabial consonants (b, m, p) | | `fv` | Labiodental consonants (f, v) | | `l` | Lateral consonant (l) | | `r` | Rhotic sounds (r, ɝ, ɚ) | | `th` | Dental fricatives (θ, ð) | | `qw` | Rounded consonants (w, ʍ) | | `cdgknstxyz` | Alveolar/velar consonants (c, d, g, k, n, s, t, x, y, z) | #### TTS 1 models (`inworld-tts-1`, `inworld-tts-1-max`) Returns basic word/character timing arrays: ```json { "timestampInfo": { "wordAlignment": { "words": ["Hello", "world,", "this", "will", "be", "saved"], "wordStartTimeSeconds": [0, 0.33, 0.69, 0.89, 1.1, 1.26], "wordEndTimeSeconds": [0.28, 0.63, 0.87, 1.05, 1.16, 1.6] } } } ``` --- #### Long Text Input Source: https://docs.inworld.ai/tts/capabilities/long-text-input The TTS API accepts up to **2,000 characters** per request. For longer content — articles, book chapters, scripts — you need to split the text into chunks, synthesize each one, and stitch the resulting audio back together. 
We provide ready-to-run scripts in **[Python](https://github.com/inworld-ai/inworld-api-examples/blob/main/tts/python/example_tts_long_input.py)** and **[JavaScript](https://github.com/inworld-ai/inworld-api-examples/blob/main/tts/js/example_tts_long_input_compressed.js)** that handle this entire pipeline for you. ## How It Works The input is split into segments under the 2,000-character API limit. The chunking algorithm looks for natural break points in the following priority order: 1. Paragraph breaks (`\n\n`) 2. Line breaks (`\n`) 3. Sentence endings (`.` `!` `?`) 4. Last space (fallback) This ensures audio segments end at natural pauses, producing smooth-sounding output. Each chunk is sent to the TTS API with controlled concurrency and automatic retry logic for rate limits. Chunks are processed in parallel (default: 2 concurrent requests) to speed up synthesis while respecting API rate limits. The individual audio responses are combined into a single output file. The Python script produces a **WAV** file with configurable silence between segments, while the JavaScript script produces an **MP3** file and uses **ffmpeg** to merge segments with correct duration metadata. 
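The break-point priority described above can be sketched in a few lines. This is a simplified illustration of the strategy, not the exact code from the linked scripts:

```python
def chunk_text(text, min_size=500, max_size=1900):
    """Split long text at natural break points, keeping chunks under max_size.

    Sketch of the chunking strategy: prefer paragraph breaks, then line
    breaks, then sentence endings, then the last space, within the
    [min_size, max_size] window.
    """
    chunks = []
    while len(text) > max_size:
        window = text[:max_size]
        cut = -1
        # 1-2: paragraph break, then line break
        for sep in ("\n\n", "\n"):
            idx = window.rfind(sep)
            if idx >= min_size:
                cut = idx + len(sep)
                break
        # 3: rightmost sentence ending
        if cut == -1:
            for punct in (". ", "! ", "? "):
                idx = window.rfind(punct)
                if idx >= min_size:
                    cut = max(cut, idx + len(punct))
        # 4: last space, else a hard cut at the limit
        if cut == -1:
            idx = window.rfind(" ")
            cut = idx + 1 if idx >= min_size else max_size
        chunks.append(text[:cut].strip())
        text = text[cut:]
    if text.strip():
        chunks.append(text.strip())
    return chunks
```

Each returned chunk can then be sent as its own synthesis request and the resulting audio segments concatenated in order.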
## Configuration Both scripts share the same tunable parameters: | Parameter | Default | Description | | :--- | :--- | :--- | | `MIN_CHUNK_SIZE` | 500 | Minimum characters before looking for a break point | | `MAX_CHUNK_SIZE` | 1,900 | Maximum chunk size (stays under the 2,000-char API limit) | | `MAX_CONCURRENT_REQUESTS` | 2 | Parallel API requests (increase with caution to avoid rate limits) | | `MAX_RETRIES` | 3 | Retry attempts for rate-limited requests with exponential backoff | ## Running the Scripts ### Prerequisites - An **Inworld API key** set as the `INWORLD_API_KEY` environment variable - A text file with your long-form content - **Python 3** (for the Python script) or **Node.js** (for the JavaScript script) - **ffmpeg** (optional, for the JS script — produces correct MP3 duration metadata) ### Python ```bash export INWORLD_API_KEY=your_api_key_here pip install requests python-dotenv python example_tts_long_input.py ``` The script reads the input text file, chunks it, synthesizes all chunks with the Inworld TTS API, and saves the combined audio as a **WAV** file. It also prints a splice report showing the exact timestamps where chunks were joined, useful for quality checking. ### JavaScript ```bash export INWORLD_API_KEY=your_api_key_here node example_tts_long_input_compressed.js ``` The script follows the same chunking and synthesis pipeline, outputting a compressed **MP3** file. When **ffmpeg** is available, it merges segments with correct duration metadata. Otherwise, it falls back to raw concatenation. ## Code Examples WAV output with splice report and configurable silence between segments Compressed MP3 output with ffmpeg-based segment merging ## Next Steps Learn about the standard (non-streaming) synthesis API. Use streaming for real-time playback of shorter content. Optimize time-to-first-audio for real-time use cases. 
--- #### TTS Playground Source: https://docs.inworld.ai/tts/tts-playground The TTS Playground makes it easy to try out Inworld's TTS capabilities through an interactive interface. It can be used to find the perfect voice for your project, test different text inputs, adjust voice settings, and experiment with audio markup tags. ## Get Started In [Portal](https://platform.inworld.ai/), select **TTS Playground** from the left-hand side panel. Enter the text you want to convert to speech. If you need some ideas, you can select one of the suggestion chips at the bottom of the screen. Note that the playground accepts up to 2,000 characters per request. For longer content, see [Long Text Input](/tts/capabilities/long-text-input). On the right-hand side, click on the voice dropdown to browse available voices. You can filter by language or search by name, and click the play button next to each voice to hear how the voice sounds. Select a voice. Click the "Generate" button on the bottom right. The audio will automatically start playing once it's been generated. You can also download the clip to save it. ## Advanced Features For greater control over the generated audio, you can try out the following: 1. **Try a different model** - Select a different model from the right-hand side panel to see how it compares. See [Models](/tts/tts-models) for more information about each model. 2. **Adjust configurations** - Use the sliders on the right-hand side panel to adjust Temperature and Talking Speed. See [here](/tts/capabilities/generating-audio#additional-configurations) for more information. 3. **Experiment with audio markups** - Try adding audio markups, such as `[happy]` or `[cough]`, to your input text to control emotional expression, delivery style, and non-verbal vocalizations. See [here](/tts/capabilities/audio-markups) for more information.
In addition, you can enable the **Highlight words** toggle in the right-hand panel to visualize word-level [timestamps](/tts/capabilities/timestamps) during playback. This feature is currently only available for English when using the `inworld-tts-1.5-mini` and `inworld-tts-1.5-max` models. ## Create a Voice In the TTS Playground, click **+ Create a Voice** to design a new voice from a text description or clone one from audio: - **[Voice Design](/tts/voice-design)** — Describe the voice you want in text (age, accent, tone, etc.) and get AI-generated voice candidates - **[Voice Cloning](/tts/voice-cloning)** — Clone a voice from 5–15 seconds of audio samples ## Next Steps Ready for more? Whether you're looking to clone a voice, design one from text, or start building with our API, we've got you covered. Create a voice from a text description—no audio needed. Create a personalized voice clone with just 5 seconds of audio. Learn tips and tricks for synthesizing high-quality speech. Learn how to make your first API call in minutes. --- #### Synthesize Speech Source: https://docs.inworld.ai/tts/synthesize-speech You send text, the server generates the entire audio, and returns it in a single HTTP response. No streaming, no open connections — just one request and one response with the complete audio file. Best for batch or offline work like audiobooks, voiceovers, and podcasts, or any workflow where you can wait for the full file before playback. For real-time playback or low-latency scenarios, use the [Streaming API](/tts/synthesize-speech-streaming) or [WebSocket API](/tts/synthesize-speech-websocket). See the [latency best practices](/tts/best-practices/latency) for details. ## Code Examples View our JavaScript implementation example View our Python implementation example ## API Reference View the complete API specification ## Next Steps Learn best practices for producing high-quality voice clones. Learn best practices for synthesizing high-quality speech. 
Explore Python and JavaScript code examples for TTS integration. --- #### Synthesize Speech (Streaming) Source: https://docs.inworld.ai/tts/synthesize-speech-streaming You send text, the server returns audio chunks over HTTP as they are generated. Playback can begin before the full synthesis is complete, significantly reducing time-to-first-audio. Best for real-time applications, conversational AI, and long-form content — anywhere you want low-latency playback without managing a persistent connection. For even lower latency with multiple requests in a session, consider the [WebSocket API](/tts/synthesize-speech-websocket). For tips on optimizing latency, see the [latency best practices guide](/tts/best-practices/latency). ## Timestamp Transport Strategy When using [timestamp alignment](/tts/capabilities/timestamps), you can choose how timestamps are delivered alongside audio using `timestampTransportStrategy`: - **`SYNC`** (default): Each chunk contains both audio and timestamps together. - **`ASYNC`**: Audio chunks arrive first, with timestamps following in separate trailing messages. This reduces time-to-first-audio with TTS 1.5 models. See [Timestamps](/tts/capabilities/timestamps#streaming-behavior) for details on how each mode works. ## Code Examples View our JavaScript implementation example View our Python implementation example ## API Reference View the complete API specification ## Next Steps Learn best practices for producing high-quality voice clones. Learn best practices for synthesizing high-quality speech. Explore Python and JavaScript code examples for TTS integration. --- #### Synthesize Speech (WebSocket) Source: https://docs.inworld.ai/tts/synthesize-speech-websocket You open a persistent WebSocket connection and send text messages. The server streams audio chunks back over the same connection — no per-request overhead, no repeated handshakes. This gives you the lowest possible latency. 
Best for voice agents and interactive applications that send multiple synthesis requests in a session, where avoiding connection setup on every call makes a measurable difference. If you only need a single request-response with chunked audio, the [Streaming API](/tts/synthesize-speech-streaming) is simpler to integrate. For tips on optimizing latency, see the [latency best practices guide](/tts/best-practices/latency). ## Timestamp Transport Strategy When using [timestamp alignment](/tts/capabilities/timestamps), you can choose how timestamps are delivered alongside audio using `timestampTransportStrategy`: - **`SYNC`** (default): Each chunk contains both audio and timestamps together. - **`ASYNC`**: Audio chunks arrive first, with timestamps following in separate trailing messages. This reduces time-to-first-audio with TTS 1.5 models. See [Timestamps](/tts/capabilities/timestamps#streaming-behavior) for details on how each mode works. ## Code Examples View our JavaScript implementation example View our Python implementation example ## API Reference View the complete API specification ## Next Steps Learn best practices for producing high-quality voice clones. Learn best practices for synthesizing high-quality speech. Explore Python and JavaScript code examples for TTS integration. --- #### Integrations Source: https://docs.inworld.ai/tts/integrations Inworld’s API is integrated with leading voice and real-time platforms for developers. This makes it easy to get started building real-time voice agents and voice-based experiences at scale powered by Inworld’s radically affordable, state-of-the-art TTS models. ## Daily (Pipecat) [Pipecat](https://docs.pipecat.ai/getting-started/introduction) is an open source Python framework for building real-time voice and multimodal AI agents that can see, hear, and speak. 
It’s designed for developers who want full control over how AI services, network transports, and audio processing are orchestrated—enabling ultra-low latency, natural-feeling conversations across custom pipelines, whether running locally or in production infrastructure. Inworld voices and text-to-speech models are supported via a built-in `InworldTTSService`, allowing you to stream high-quality audio or generate speech on demand from within your own runtime. To get started with Pipecat + Inworld, follow this [guide](https://docs.pipecat.ai/server/services/tts/inworld). ## LiveKit [LiveKit](https://livekit.io/) is an open source platform for developers building realtime agents. It makes it easy to integrate audio, video, text, data, and AI models while offering scalable realtime infrastructure built on top of WebRTC. Inworld voices and text-to-speech models are available as a plugin for LiveKit Agents, a flexible framework for building real-time conversational agents. This makes it easier for developers to create previously unimaginable, real-time voice experiences such as multiplayer games, agentic NPCs, customer-facing avatars, live training simulations, and more at an accessible price. To get started with LiveKit + Inworld, follow this [guide](https://docs.livekit.io/agents/integrations/tts/inworld/). ## NLX NLX is a no-code platform for developers and businesses to build, deploy, and manage conversational AI applications across a variety of channels. It enables the creation of sophisticated, multimodal experiences that can include chat, voice, and video. Inworld TTS is available through NLX as one of the default voice providers or you can build a custom integration. Kickstart your journey with NLX + Inworld by signing up for an NLX account, or dive right in with this how-to [guide](https://docs.nlx.ai/platform/build/integrations/text-to-speech-providers/inworld). 
## Stream (Vision Agents) [Stream (Vision Agents)](https://visionagents.ai/) is Stream's [open-source framework](https://github.com/GetStream/vision-agents) that helps developers quickly build low-latency vision AI applications. Since its initial launch, the project has expanded with additional plugins, better model support, and major improvements to latency, audio, and video handling. Stream (Vision Agents) integrates Inworld's state-of-the-art TTS models directly into their platform, giving developers an out-of-the-box way to bring natural, expressive voice to their AI agents. To get started with Stream (Vision Agents) x Inworld, follow this [guide](https://github.com/GetStream/Vision-Agents/tree/main/plugins/inworld). ## Ultravox [Ultravox](https://ultravox.ai) is a real-time voice AI infrastructure layer that delivers fast, natural, and scalable voice agents. Its purpose-built inference stack powers a best-in-class speech understanding model, while developer tools including easy-to-use APIs and client-side SDKs help teams deliver production voice agents faster. Inworld voices are natively integrated with the Ultravox platform and available for use in all accounts, making it easy to create natural, conversational agents with emotionally expressive voices. To get started with Ultravox + Inworld, follow these [instructions](https://docs.ultravox.ai/voices/bring-your-own#inworld). ## Vapi [Vapi](https://vapi.ai/) is a developer platform for building advanced voice AI agents. By handling the complex infrastructure, they enable developers to focus on creating great voice experiences. Inworld's TTS is integrated with Vapi's platform, giving you access to Inworld's high-fidelity, emotionally expressive voices seamlessly on Vapi. 
To get started with Vapi + Inworld, follow this [guide](https://docs.vapi.ai/api-reference/assistants/create#request.body.voice). ## Voximplant [Voximplant](https://voximplant.ai/) is a serverless Voice AI orchestration platform and cloud communications stack for building real-time voice agents over the phone and the web. It combines programmable telephony (PSTN, SIP, WhatsApp), WebRTC, and client SDKs with a serverless JavaScript runtime (VoxEngine), so developers can efficiently orchestrate calls, speech services, and LLMs in one environment. Inworld's TTS is natively integrated into Voximplant's realtime speech synthesis APIs, enabling low-latency streaming of expressive Inworld voices into any Voximplant-powered call. With a single VoxEngine scenario, you can connect your agent logic to Inworld for speech generation, route calls globally, and rapidly scale from prototype to production. To get started with Voximplant + Inworld, check out this [announcement](https://voximplant.com/blog/inworld-text-to-speech-now-available-in-voximplant). --- #### On-Prem #### TTS On-Premises Source: https://docs.inworld.ai/tts/on-premises Inworld TTS On-Premises lets organizations run high-quality text-to-speech models locally — without sending text or audio data to the cloud. It's built for enterprises that require strict data control, low latency, and compliance with internal or regulatory standards. Inworld TTS On-Premises is available for both the **Inworld TTS-1.5 Mini** and **Inworld TTS-1.5 Max** models. To get started with TTS On-Premises, contact [sales@inworld.ai](mailto:sales@inworld.ai) for pricing and access to the container registry. ## Why TTS On-Premises No outbound data transfer. Full ownership of text and audio. Optimized for production workloads and interactive applications. Suitable for air-gapped, private, and compliance-sensitive deployments. Containerized architecture designed for operational stability.
## How it works Inworld TTS On-Premises is delivered as a GPU-accelerated, Docker-containerized version of the Inworld TTS API. It exposes both REST and gRPC APIs for easy integration. | Port | Protocol | Description | |------|----------|-------------| | **8081** | HTTP | REST API (recommended) | | **9030** | gRPC | For gRPC clients | ### Performance - **Latency:** Real-time streaming on supported NVIDIA GPUs - **Throughput:** Multiple concurrent sessions are supported, depending on the GPU used Contact [sales@inworld.ai](mailto:sales@inworld.ai) to get a detailed performance report for your specific hardware. ## System requirements Inworld TTS supports all modern cloud NVIDIA GPUs: A100, H100, H200, B200, and B300. If you have a specific target hardware platform not on this list, please reach out for custom support. The minimum inference machine requirements are as follows: | Component | Requirement | |-----------|-------------| | **GPU** | NVIDIA H100 SXM5 (80GB) | | **RAM** | 64GB+ system memory | | **CPU** | 8+ cores | | **Disk** | 50GB free space | | **OS** | Ubuntu 22.04 LTS | | **Software** | Docker + NVIDIA Container Toolkit | | **Software** | Google Cloud SDK (gcloud CLI) | | **CUDA** | 13.0+ | ## Prerequisites Before deploying TTS On-Premises, ensure the following software is installed on your Ubuntu 22.04 LTS machine. ### NVIDIA drivers Install the latest NVIDIA drivers for your GPU. Follow the official guide at [nvidia.com/drivers](https://www.nvidia.com/en-us/drivers), or use the following commands on Ubuntu: ```bash # Update packages sudo apt-get update # Install basic toolchain and kernel headers sudo apt-get install -y gcc make wget linux-headers-$(uname -r) # Install NVIDIA driver (check https://www.nvidia.com/en-us/drivers for the latest version) sudo apt-get install -y nvidia-driver-580 ``` ### Docker Install Docker Engine by following the official guide: [Install Docker Engine on Ubuntu](https://docs.docker.com/engine/install/ubuntu/).
Optionally, add the current user to the `docker` group so you can run Docker without `sudo`: [Linux post-installation steps](https://docs.docker.com/engine/install/linux-postinstall/). ### NVIDIA Container Toolkit Install the NVIDIA Container Toolkit to enable GPU access from Docker containers. Follow both the **Installation** and **Configuration** sections of the official guide: [NVIDIA Container Toolkit install guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html). ### Google Cloud SDK Install the gcloud CLI by following the official guide: [Install the gcloud CLI](https://cloud.google.com/sdk/docs/install#deb). ### Verify prerequisites Run the following command to verify that Docker, NVIDIA drivers, and the NVIDIA Container Toolkit are all correctly installed: ```bash docker run --rm --gpus all nvidia/cuda:13.0.0-base-ubuntu22.04 nvidia-smi ``` You should see your GPU listed in the output alongside the driver version and CUDA version. If this command succeeds, your environment is ready for TTS On-Premises deployment. ### Firewall requirements The TTS On-Premises container listens on the following ports for inbound traffic: | Port | Protocol | Description | |------|----------|-------------| | **8081** | HTTP | REST API | | **9030** | gRPC | gRPC API | You will also need to allow the following outbound traffic: - `us-central1-docker.pkg.dev` on port **443** — GCP Artifact Registry for pulling container images ## Quick start ### 1. Create a GCP service account Create a service account in your GCP project and generate a key file: ```bash # Create the service account gcloud iam service-accounts create inworld-tts-onprem \ --project=YOUR_PROJECT_ID \ --display-name="Inworld TTS On-Prem" \ --description="Service account for Inworld TTS on-prem container" # Create a key file gcloud iam service-accounts keys create service-account-key.json \ --iam-account=inworld-tts-onprem@YOUR_PROJECT_ID.iam.gserviceaccount.com \ --project=YOUR_PROJECT_ID ``` ### 2.
Share the service account email with Inworld Send the service account email (e.g., `inworld-tts-onprem@YOUR_PROJECT_ID.iam.gserviceaccount.com`) to your Inworld contact. Inworld will provide your **Customer ID**. ### 3. Authenticate to the container registry ```bash gcloud auth activate-service-account \ --key-file=service-account-key.json gcloud auth configure-docker us-central1-docker.pkg.dev ``` For more authentication options, see [Configure authentication to Artifact Registry for Docker](https://cloud.google.com/artifact-registry/docs/docker/authentication#gcloud-helper). ### 4. Configure ```bash cp onprem.env.example onprem.env ``` Edit `onprem.env` with your values: ```bash INWORLD_CUSTOMER_ID=YOUR_CUSTOMER_ID TTS_IMAGE=us-central1-docker.pkg.dev/inworld-ai-registry/tts-onprem/tts-1.5-mini-h100-onprem:VERSION_TAG KEY_FILE=./service-account-key.json ``` ### 5. Start ```bash ./run.sh ``` The script will: 1. Check prerequisites (Docker, GPU, NVIDIA Container Toolkit) 2. Validate your configuration 3. Fix key file permissions if needed 4. Pull the Docker image 5. Start the container 6. Wait for services to be ready (~3 minutes) The ML model takes approximately 3 minutes to load on first startup. This is normal. ### 6. Verify the deployment Check that the container is running and services are healthy: ```bash ./run.sh status ``` ### 7. Send a test request ```bash curl -X POST http://localhost:8081/tts/v1/voice \ -H "Content-Type: application/json" \ -d '{ "text": "Hello, this is a test of the on-premises TTS system.", "voice_id": "Craig", "model_id": "inworld-tts-1.5-mini", "audio_config": { "audio_encoding": "LINEAR16", "sample_rate_hertz": 48000 } }' ``` ### List available voices ```bash curl http://localhost:8081/tts/v1/voices ``` For the full API specification, see the [Synthesize Speech API reference](/api-reference/ttsAPI/texttospeech/synthesize-speech).
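The same test request can be sent from code instead of curl. This is a sketch under assumptions: the payload mirrors the curl call above, but the name of the response field holding the base64-encoded audio (`audioContent` below) is an assumption, so inspect an actual response or the API reference before relying on it:

```python
import base64
import json
import urllib.request

def build_tts_request(text, voice_id="Craig", model_id="inworld-tts-1.5-mini"):
    """Build the same JSON payload used by the curl test request above."""
    return {
        "text": text,
        "voice_id": voice_id,
        "model_id": model_id,
        "audio_config": {
            "audio_encoding": "LINEAR16",
            "sample_rate_hertz": 48000,
        },
    }

def synthesize(host="http://localhost:8081"):
    """POST the payload to the on-prem REST endpoint and decode the audio."""
    payload = json.dumps(build_tts_request("Hello from on-prem TTS.")).encode()
    req = urllib.request.Request(
        f"{host}/tts/v1/voice",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # "audioContent" as a base64 field name is an assumption; check the
    # actual response shape against the API reference.
    return base64.b64decode(body["audioContent"])

if __name__ == "__main__":
    with open("test.wav", "wb") as f:
        f.write(synthesize())
```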
## Lifecycle commands ```bash ./run.sh # Start the container ./run.sh stop # Stop and remove the container ./run.sh status # Check container and service health ./run.sh logs # Show recent logs from all services ./run.sh logs -f # Tail all service logs live ./run.sh logs export # Export all logs to a timestamped folder ./run.sh restart # Restart the container ``` ## Available images | Image | Model | GPU | |-------|-------|-----| | `tts-1.5-mini-h100-onprem` | 1B (mini) | H100 | | `tts-1.5-max-h100-onprem` | 8B (max) | H100 | Registry: `us-central1-docker.pkg.dev/inworld-ai-registry/tts-onprem/` ## Configuration ### onprem.env | Variable | Required | Description | |----------|----------|-------------| | `INWORLD_CUSTOMER_ID` | Yes | Your customer ID | | `TTS_IMAGE` | Yes | Docker image URL (see [Available Images](#available-images)) | | `KEY_FILE` | Yes | Path to your GCP service account key file | ## Logs ```bash # Show recent logs from all services (last 20 lines each) ./run.sh logs # Tail all service logs live ./run.sh logs -f # Export all logs to a timestamped folder ./run.sh logs export ``` Individual service logs: ```bash docker exec inworld-tts-onprem tail -f /var/log/tts-v3-trtllm.log # ML server docker exec inworld-tts-onprem tail -f /var/log/tts-normalization.log # Text normalization docker exec inworld-tts-onprem tail -f /var/log/public-tts-service.log # TTS service docker exec inworld-tts-onprem tail -f /var/log/grpc-gateway.log # HTTP gateway docker exec inworld-tts-onprem tail -f /var/log/w-proxy.log # gRPC proxy docker exec inworld-tts-onprem tail -f /var/log/supervisord.log # Supervisor ``` ## Troubleshooting | Issue | Solution | |-------|----------| | "INWORLD_CUSTOMER_ID is required" | Set `INWORLD_CUSTOMER_ID` in `onprem.env` | | "GCP credentials file not found" | Check that `KEY_FILE` in `onprem.env` points to a valid file | | "Credentials file is not readable" | Fix permissions on host: `chmod 644 service-account-key.json` | | "Topic not found" | Verify your
`INWORLD_CUSTOMER_ID` matches the PubSub topic name | | "Permission denied for topic" | Ensure Inworld has granted your service account publish access | | Slow startup (~3 min) | Normal — text processing grammars take time to initialize | ```bash # Check service status docker exec inworld-tts-onprem supervisorctl -s unix:///tmp/supervisor.sock status # Export logs for support ./run.sh logs export ``` Share the exported logs folder with [Inworld support](mailto:support@inworld.ai) when reporting issues. ## Advanced: manual Docker run For users who prefer to run Docker directly without `run.sh`: ```bash docker run -d \ --gpus all \ --name inworld-tts-onprem \ -p 8081:8081 \ -p 9030:9030 \ -e INWORLD_CUSTOMER_ID=YOUR_CUSTOMER_ID \ -v $(pwd)/service-account-key.json:/app/gcp-credentials/service-account.json:ro \ us-central1-docker.pkg.dev/inworld-ai-registry/tts-onprem/tts-1.5-mini-h100-onprem:VERSION_TAG ``` - Ensure your key file has 644 permissions: `chmod 644 service-account-key.json` - The container exposes ports 8081 (HTTP) and 9030 (gRPC) - Use `docker ps` to check container health — STATUS will show `healthy` when ready ```bash # Stop and remove docker stop inworld-tts-onprem && docker rm inworld-tts-onprem # View logs docker logs inworld-tts-onprem # Check service status docker exec inworld-tts-onprem supervisorctl -s unix:///tmp/supervisor.sock status ``` ## Benchmarking For performance testing, see the [Benchmarking](/tts/on-premises-benchmarking) guide. ## FAQs Yes. The on-premises container is designed for production workloads. To get started, contact [sales@inworld.ai](mailto:sales@inworld.ai) for access to the repository. For complete data control, low latency, and compliance with strict security or regulatory requirements. No. All text and audio processing occurs entirely within your environment. Deployment takes just a few minutes, with a brief model warm-up (~200 seconds). Enterprises, governments, and regulated industries that cannot use cloud-based TTS.
**In-scope:** - API compatibility with Inworld public API - All built-in voices in Inworld's Voice Library - The following model capabilities: text normalization, timestamps, and audio pre- and post-processing settings - Deployment how-tos and latency benchmark reproduction scripts **Out-of-scope:** - Instant voice cloning features and their APIs - Voice design and its API --- #### Benchmarking Source: https://docs.inworld.ai/tts/on-premises-benchmarking This page describes a comprehensive load testing tool for TTS On-Premises that measures performance metrics including latency, throughput, and streaming characteristics across different QPS (Queries Per Second) loads. ## Overview The tool simulates realistic TTS workloads by sending requests at specified rates with configurable burstiness patterns. It measures: - End-to-end latency - Audio generation latency per second - Streaming metrics (first chunk, 4th chunk, average chunk latencies) - Request success rates - Server performance under different load conditions ## Quick start ```bash # Install the load test tool pip install -e . 
# Basic load test with streaming python load-test.main \ --host http://localhost:8081 \ --stream \ --min-qps 1.0 \ --max-qps 7.0 \ --qps-step 2.0 \ --number-of-samples 300 ``` ## Parameters ### Required | Parameter | Description | Example | |---|---|---| | `--host` | Base address of the On-Premises TTS server (endpoint auto-appended) | `http://localhost:8081` | ### Load configuration | Parameter | Default | Description | |---|---:|---| | `--min-qps` | `1.0` | Minimum requests per second to test | | `--max-qps` | `10.0` | Maximum requests per second to test | | `--qps-step` | `2.0` | Step size for QPS increments | | `--number-of-samples` | `1` | Total number of texts to synthesize per QPS level | | `--burstiness` | `1.0` | Request timing pattern (`1.0` = Poisson, `< 1.0` = bursty, `> 1.0` = uniform) | ### TTS configuration | Parameter | Default | Description | |---|---:|---| | `--stream` | `False` | Use streaming synthesis (`/SynthesizeSpeechStream`) vs non-streaming (`/SynthesizeSpeech`) | | `--max_tokens` | `400` | Maximum tokens to synthesize (~8s audio at 50 tokens/s) | | `--voice-ids` | `["Olivia", "Remy"]` | Voice IDs to use (can specify multiple) | | `--model_id` | `None` | Model ID for TTS synthesis (optional) | | `--text_samples_file` | `scripts/tts_load_testing/text_samples.json` | File containing text samples | ### Output and analysis | Parameter | Default | Description | |---|---:|---| | `--benchmark_name` | auto-generated | Name for the benchmark run (affects output files) | | `--plot_only` | `False` | Only generate plots from existing results (skip testing) | | `--verbose` | `False` | Enable verbose output for debugging | ## Examples ### Streaming vs non-streaming comparison ```bash # Non-streaming test python load-test.main \ --host http://localhost:8081 \ --min-qps 10.0 \ --max-qps 50.0 \ --qps-step 10.0 \ --number-of-samples 500 \ --benchmark_name non-streaming-test # Streaming test python load-test.main \ --host http://localhost:8081 \ --stream \ 
--min-qps 10.0 \ --max-qps 50.0 \ --qps-step 10.0 \ --number-of-samples 500 \ --benchmark_name streaming-test ``` ### Plot-only mode Generate plots from existing results without re-running tests: ```bash ./scripts/tts-load-test \ --plot_only \ --benchmark_name prod-stress-test ``` ## Understanding results The tool generates comprehensive metrics for each QPS level. ### Latency metrics - **E2E Latency:** Complete request-response time - **Audio Generation Latency:** Time per second of generated audio - **First Chunk Latency:** Time to first audio chunk (streaming only) - **4th Chunk Latency:** Time to 4th audio chunk (streaming only) - **Average Chunk Latency:** Mean time between chunks (streaming only) ### Percentiles Results include P50, P90, P95, and P99 percentiles for all latency metrics. ### Output files Results are saved in `benchmark_result/{benchmark_name}/`: - `result.json` — Raw performance data - `{benchmark_name}_*.png` — Performance charts ## Burstiness parameter The burstiness parameter controls request timing distribution: | Value | Behavior | |---|---| | `1.0` | Poisson process (natural randomness) | | `< 1.0` | More bursty (requests come in clusters) | | `> 1.0` | More uniform (evenly spaced requests) | ## Performance tips 1. **Start small** — Begin with low QPS and small sample sizes 2. **Use appropriate text samples** — Match your production text length distribution 3. **Monitor server resources** — Watch CPU, memory, and network during tests 4. **Consider burstiness** — Real-world traffic is often bursty (try 0.7–0.9) 5. 
**Test both modes** — Compare streaming vs non-streaming for your use case ## Troubleshooting ### Common issues | Issue | Solution | |---|---| | Connection errors | Verify server address and network connectivity | | Authentication errors | Set `INWORLD_API_KEY` for external APIs | | High latency | Check server load and network conditions | | Memory issues | Reduce `number-of-samples` for high QPS tests | ### Debug mode Use the `--verbose` flag for detailed request/response logging: ```bash ./scripts/tts-load-test --verbose --host ... # other params ``` ## Architecture The tool uses: - **Async/await:** Efficient concurrent request handling - **Pausable timers:** Accurate server-only timing measurements - **Multiple protocols:** gRPC, HTTP REST API support - **Configurable clients:** Pluggable client architecture - **Real-time progress:** Live progress bars and status updates --- ### Best Practices #### Generating speech Source: https://docs.inworld.ai/tts/best-practices/generating-speech This guide covers techniques and best practices for generating high-quality, natural-sounding speech for your applications. If you're using an LLM to generate text for TTS, see our dedicated guide on [Prompting for TTS](/tts/best-practices/prompting-for-tts) for prompt templates and techniques. ## General Best Practices 1. **Pick a suitable voice** - Different voices will be better suited for different applications. Choose a voice that matches the emotional range and expression you're looking for. For example, for a meditation app, select a more steady and calm voice. For an encouraging fitness coach, select a more expressive and excited voice. 2. **Pay attention to punctuation** - Punctuation matters! Use exclamation points (!) to make the voice more emphatic and excited. Use periods to insert natural pauses. Where possible, make sure to include punctuation at the end of the sentence. 3. 
**Use asterisks for emphasis** - You can emphasize specific words by surrounding them with asterisks. For example, writing "We \*need\* a beach vacation" will cause the voice to stress the word "need" when speaking, whereas "We need a \*beach\* vacation" will emphasize the word "beach". This can help clarify tone or intent in nuanced dialogue. 4. **Match the voice to the text language** - Voices perform optimally when synthesizing text in the same language as the original voice. While cross-language synthesis is possible, you'll achieve the best quality, pronunciation, and naturalness by matching the voice's native language to your text content. 5. **Normalize complex text** - If you find that the model is mispronouncing certain complex phrases like phone numbers or dollar amounts, it can help to normalize the text. This may be particularly helpful for non-English languages. Some examples of normalization include: - **Phone numbers**: "(123)456-7891" -> "one two three, four five six, seven eight nine one" - **Dates**: 5/6/2025 -> "may sixth twenty twenty five" *(helpful since date formats may vary)* - **Times**: "12:55 PM" -> "twelve fifty-five PM" - **Emails**: test@example.com -> "test at example dot com" - **Monetary values**: $5,342.29 -> "five thousand three hundred and forty-two dollars and twenty-nine cents" - **Symbols**: 2+2=4 -> "two plus two equals four" 6. **Tune the temperature** - The temperature controls the variation in audio output. Higher values increase variation, which can produce more diverse and expressive outputs but also increases the chance of bad generations and hallucinations. This can be useful for generating barks, demo clips, or other non-real-time content. Lower temperatures improve stability and speaker similarity, though going too low increases the chances of broken generation. For real-time use cases, we recommend keeping the temperature between 0.8 and 1, with the default being 1.0. 
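The normalization substitutions above can run as a light pre-processing pass before text reaches the TTS API. The sketch below is a minimal illustration covering two of the cases (phone numbers and emails); the rule set and function names are examples to extend, not Inworld's implementation:

```python
import re

DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

# Illustrative patterns for US-style phone numbers and simple emails;
# real input will need a broader rule set.
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")
EMAIL_RE = re.compile(r"([\w.+-]+)@([\w-]+)\.(\w+)")

def _spell_phone(match: re.Match) -> str:
    # "(123)456-7891" -> "one two three, four five six, seven eight nine one"
    digits = re.sub(r"\D", "", match.group(0))
    groups = [digits[:3], digits[3:6], digits[6:]]
    return ", ".join(" ".join(DIGIT_WORDS[d] for d in g) for g in groups)

def normalize_for_tts(text: str) -> str:
    """Expand phone numbers and emails into spoken form before synthesis."""
    text = PHONE_RE.sub(_spell_phone, text)
    text = EMAIL_RE.sub(r"\1 at \2 dot \3", text)
    return text
```

Running the full text through a pass like this keeps the TTS input in spoken form without changing the wording your LLM produced.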
## Latency For realtime use cases, minimizing latency is critical. Here are some tips and techniques you can use: 1. **Stream TTS output** - Instead of waiting for the entire generation (which may take some time if it is long), you can start playback as soon as the first chunk arrives so that the user doesn't have to wait. Inworld's [websocket streaming](/api-reference/ttsAPI/texttospeech/synthesize-speech-websocket) should be the lowest-latency option, but [streaming over HTTP](/api-reference/ttsAPI/texttospeech/synthesize-speech-stream) will also be superior to a [non-streaming setup](/api-reference/ttsAPI/texttospeech/synthesize-speech). 2. **Chunk TTS input** - Instead of sending a large request to the TTS model (whether it's pre-written or generated by an LLM), consider breaking it into sentence chunks and sending them one by one. The Inworld Agent Runtime provides [built-in tools](/node/runtime-reference/classes/graph_dsl_nodes_text_chunking_node.TextChunkingNode) to handle this in a performant manner. For synthesizing text longer than 2,000 characters, see our ready-to-run scripts in the [Long Text Input](/tts/capabilities/long-text-input) guide. ## Advanced Tips ### Natural, Conversational Speech Natural human conversation is not perfect. It's full of filler words, pauses, and other natural speech patterns that make it sound more human. Our TTS models are trained to generate the requested text as is, in order to produce the most accurate and consistent output that can be used for a wide range of applications. After all, not all applications want to have a bunch of filler words inserted into the speech! To generate natural, conversational speech, you can use the following techniques: 1. Insert filler words like `uh`, `um`, `well`, `like`, and `you know` in the text. For example, instead of: ``` I'm not too sure about that. ``` change it to: ``` Uh, I'm not uh too sure about that. 
``` If the text is already being generated using an LLM, you can add instructions in the prompt to insert filler words in the response. Alternatively, you can use a small LLM to insert filler words given a piece of text. 2. Use [audio markups](/tts/capabilities/audio-markups) to add non-verbal vocalizations like `[sigh]`, `[breathe]`, `[clear_throat]`. These non-verbal cues can make the speech sound more natural and human. ### Audio Markups This feature is currently [experimental](/tts/resources/support#what-do-experimental%2C-preview%2C-and-stable-mean%3F), and is not recommended for real-time, production use cases. When using audio markups, there are a number of techniques for producing the best results. 1. **Choose contextually appropriate markups** - Markups will work best when they make sense with the text content. When markups conflict with the text, the model may struggle to handle the contradiction. For example, the following phrase can be challenging: ``` [angry] I appreciate your help and I’m really grateful for your kindness. ``` The text is clearly grateful and sincere, which contradicts the angry markup. 2. **Avoid conflicting markups** - When using multiple markups for a single text, ensure they don't conflict with each other. For example, this markup can be problematic: ``` [angry] I can't believe you did that. [yawn] You never listen. ``` Yawning typically indicates boredom or tiredness, which rarely occurs alongside anger. 3. **Break up the text** - Emotion and delivery style markups work best when placed at the beginning of text with a single markup per request. Using multiple emotion and delivery style markups or placing them mid-text may produce mixed results. Instead of making one request like this: ``` [angry] I can't believe you didn't save the last bite of cake for me. [laughing] Got you! I was just kidding. ``` Break it into two requests: ``` [angry] I can't believe you didn't save the last bite of cake for me. 
``` ``` [laughing] Got you! I was just kidding. ``` 4. **Repeat non-verbal vocalizations if necessary** - If a non-verbal vocalization is consistently being omitted, it may help to repeat the markup to ensure that it is vocalized. This works best for vocalizations where repetition sounds natural, such as `[laugh] [laugh]` or `[cough] [cough]`. --- #### Latency Source: https://docs.inworld.ai/tts/best-practices/latency For realtime use cases, minimizing latency is critical. Here are some tips and techniques you can use: 1. **Stream TTS output** - Instead of waiting for the entire generation (which may take some time if it is long), you can start playback as soon as the first chunk arrives so that the user doesn't have to wait. Inworld's [websocket streaming](/api-reference/ttsAPI/texttospeech/synthesize-speech-websocket) should be the lowest-latency option, but [streaming over HTTP](/api-reference/ttsAPI/texttospeech/synthesize-speech-stream) will also be superior to a [non-streaming setup](/api-reference/ttsAPI/texttospeech/synthesize-speech). 2. **Chunk streaming LLM output into TTS** - For the fastest time to first audio, consider breaking streaming LLM output into sentence chunks and sending them one by one to TTS. The Inworld Agent Runtime provides [built-in tools](/node/runtime-reference/classes/graph_dsl_nodes_text_chunking_node.TextChunkingNode) to handle this in a performant manner. 3. **Use JWT authentication to stream directly to the client** - For applications like mobile apps or browser-based experiences, use [JWT authentication](/api-reference/introduction#jwt-authentication) to stream TTS directly to the client rather than proxying through your server and adding extra latency. 4. **Reuse connections with keep-alive** - The first request to the API incurs a TCP and TLS handshake. Use `Connection: keep-alive` (and persistent sessions in Python) to reuse the established connection on subsequent requests. 
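The connection-reuse pattern can be sketched with only the standard library. Everything request-specific below is an assumption for illustration: the host, the `/tts/v1/voice` path, the auth header scheme, and the payload field names should all be checked against the API reference. (With the `requests` library, a `Session` object gives the same reuse.)

```python
import http.client
import json

def build_request(text: str, voice_id: str = "Olivia") -> str:
    """Assemble a minimal synthesis body (field names are assumed)."""
    return json.dumps({"text": text, "voiceId": voice_id})

def synthesize_batch(texts, api_key: str, host: str = "api.inworld.ai"):
    # A single HTTPSConnection keeps the underlying socket open between
    # sequential requests, so only the first one pays the TCP/TLS handshake.
    conn = http.client.HTTPSConnection(host)
    headers = {
        "Authorization": f"Basic {api_key}",
        "Content-Type": "application/json",
        "Connection": "keep-alive",
    }
    try:
        for text in texts:
            conn.request("POST", "/tts/v1/voice", build_request(text), headers)
            yield conn.getresponse().read()  # audio bytes
    finally:
        conn.close()
```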
See our [low-latency Python](https://github.com/inworld-ai/inworld-api-examples/blob/main/tts/python/example_tts_low_latency_http.py) and [JavaScript](https://github.com/inworld-ai/inworld-api-examples/blob/main/tts/js/example_tts_low_latency_http.js) examples for this technique in practice. ## Next Steps Looking for more tips and tricks? Check out the resources below to get started! Learn best practices for producing high-quality voice clones. Learn best practices for synthesizing high-quality speech. Explore Python and JavaScript code examples for TTS integration. --- #### Prompting for TTS Source: https://docs.inworld.ai/tts/best-practices/prompting-for-tts When an LLM generates text that gets fed into TTS, the default output often sounds flat and unnatural. LLMs tend to produce clean, well-formatted text, but clean text isn't the same as *speakable* text. Dates stay as `12/04`, acronyms aren't expanded, and there are no cues for emphasis, pauses, or emotion. This guide shows you what to add to your LLM system prompt so that its output is optimized for Inworld TTS. ## Quality Dimensions ### Emphasis Use asterisks around words to make TTS stress them. Exclamation marks add energy, and ellipses create trailing-off effects. **Prompt snippet:** ``` Use asterisks (*word*) to emphasize key words in your response — focus on prices, deadlines, action items, or any word the listener needs to catch. Use punctuation to convey tone: - Exclamation marks for excitement or urgency - Ellipsis (...) for trailing off, hesitation, or leaving a thought unfinished Example: "I thought it would work, but..." ``` **Before (no emphasis guidance):** > I think this is a really important point and you should consider it carefully. **After (with emphasis guidance):** > I think this is a \*really\* important point, and you should consider it \*carefully\*. Use single asterisks only (`*word*`). 
Double asterisks (`**word**`) will cause TTS to read the asterisk characters aloud instead of emphasizing the word. ### Pronunciation For uncommon words like brand names, proper nouns, and technical terms, Inworld TTS supports inline [IPA phoneme notation](/tts/capabilities/custom-pronunciation). You can provide a pronunciation dictionary in your system prompt that the LLM substitutes inline. **Prompt snippet:** ``` When you use any of the following words, replace them with their IPA pronunciation inline using slash notation: - "Crete" → /kriːt/ - "Yosemite" → /joʊˈsɛmɪti/ - "Nguyen" → /ŋwɪən/ - "Acai" → /ɑːsɑːˈiː/ ``` **Before (no pronunciation guidance):** > You should visit Crete for your honeymoon. **After (with IPA substitution):** > You should visit /kriːt/ for your honeymoon. Inworld TTS reads the IPA notation and produces the correct pronunciation. See [Custom Pronunciation](/tts/capabilities/custom-pronunciation) for details on finding the right IPA phonemes. Another common approach is to use a string parser that replaces important-to-pronounce words from your pronunciation dictionary before passing the text to TTS. This works well as a post-processing step when you don't want to add IPA instructions to your LLM prompt, or when the same dictionary needs to be applied consistently across multiple LLM providers. ### Pauses and Pacing Punctuation controls pacing in TTS. Periods create natural pauses between thoughts. Commas insert shorter breaks. Sentence length affects overall rhythm: short sentences speed things up, longer sentences slow them down. **Prompt snippet:** ``` Control pacing through punctuation and sentence structure: - Use periods to separate thoughts and create pauses - Use commas for shorter breaks within sentences - Use ellipsis (...) 
to create a lingering pause or beat - Use short sentences for emphasis and urgency - Use longer sentences for calm, measured delivery ``` **Before (flat pacing):** > The results are in and we exceeded our target by 40 percent so this is the best quarter we have ever had. **After (with pacing guidance):** > The results are in. We exceeded our target... by \*forty percent\*. This is the \*best\* quarter we have ever had. ### Non-verbal Vocalizations Inworld TTS supports non-verbal tokens that add human-like sounds: `[sigh]`, `[laugh]`, `[breathe]`, `[cough]`, `[clear_throat]`, `[yawn]`. These make speech sound more natural and emotionally grounded. Audio markups are currently [experimental](/tts/resources/support#what-do-experimental%2C-preview%2C-and-stable-mean%3F) and only support English. **Prompt snippet:** ``` Insert non-verbal vocalizations where they would naturally occur in conversation: - [sigh] for frustration, relief, or resignation - [laugh] for amusement or warmth - [breathe] before delivering important or emotional statements - [cough] or [clear_throat] for naturalistic transitions - [yawn] for tiredness Place these tokens inline in your text, e.g.: "[sigh] I really thought that would work." ``` **Before (no vocalizations):** > I really thought that would work. Oh well, let's try again. **After (with vocalizations):** > [sigh] I \*really\* thought that would work. [laugh] Oh well, let's try again. See [Audio Markups](/tts/capabilities/audio-markups) for the full list of supported markups including emotion and delivery style tags. ### Conversational Naturalness Natural human speech is full of filler words like `uh`, `um`, `well`, `like`, `you know`. Adding these to LLM output makes TTS sound less robotic and more conversational. 
**Prompt snippet:** ``` To sound natural and conversational, include filler words where a human speaker would naturally use them: - "uh" and "um" for thinking moments - "well" and "so" for transitions - "like" and "you know" for casual emphasis Example: "So, uh, I was thinking we could, you know, try a different approach." ``` **Before (no fillers):** > I was thinking we could try a different approach. **After (with fillers):** > So, uh, I was thinking we could, you know, try a \*different\* approach. Filler words are best for casual, conversational use cases. Skip them for formal applications like news reading, professional narration, or customer support. ### Output Length LLMs tend to be verbose. A detailed paragraph may read well on screen, but sounds unnatural and exhausting when spoken aloud. Keeping responses short produces better-sounding speech and reduces latency. A good default is to ask your LLM to respond in 1–2 sentences unless the user's query specifically demands a longer answer. Use sentences as your length unit, not words or characters. LLMs operate on tokens, so word and character counts are unreliable constraints. **Prompt snippet:** ``` Keep your responses to 1-2 sentences unless the user's question specifically requires a longer explanation. Prefer concise, direct answers. ``` **Before (too verbose):** > Well, the weather forecast for tomorrow is showing that there will be partly cloudy skies throughout the morning hours, with temperatures expected to reach a high of around seventy-five degrees Fahrenheit by the early afternoon, and then cooling down to approximately sixty degrees in the evening. **After (concise):** > Tomorrow looks like partly cloudy skies, with a high around \*seventy-five\* and cooling to sixty by evening. ## Example Prompt Templates Below are complete, copyable system prompt blocks tailored for common use cases. Each template combines the techniques above into a ready-to-use prompt. 
Use this template for chatbots, AI companions, virtual friends, and other informal conversational applications. ``` ## Speech Output Rules Your responses will be converted to speech using TTS. Follow these rules to produce natural, expressive spoken output: ### Expressiveness - Use *asterisks* to emphasize key words - Use exclamation marks for excitement, ellipsis for trailing off - Insert non-verbal vocalizations where natural: [sigh], [laugh], [breathe], [cough], [clear_throat], [yawn] Example: "[laugh] That's *exactly* what I was thinking!" ### Naturalness - Include filler words (uh, um, well, like, you know) where a human would naturally pause - Vary sentence length for natural rhythm - Use contractions (don't, can't, I'm, we're) instead of formal forms ### Pronunciation - Replace uncommon proper nouns with IPA: e.g., /kriːt/ for Crete [Add your pronunciation dictionary here] ### Text Formatting - Write numbers in spoken form: "twenty-three" not "23" - Write dates in spoken form: "march fifteenth" not "3/15" - Never use markdown formatting, bullet points, or structured text - Never use emojis or special characters - Write everything as natural spoken sentences ``` Use this template for customer support agents, sales assistants, and other professional conversational applications. ``` ## Speech Output Rules Your responses will be converted to speech using TTS. Follow these rules to produce clear, professional spoken output: ### Clarity - Use *asterisks* sparingly to emphasize critical information (prices, deadlines, action items) - Use short, clear sentences for important details - Use periods to separate distinct points ### Professionalism - Do NOT use filler words (uh, um, like, you know) - Do NOT use non-verbal vocalizations ([sigh], [laugh], etc.) 
- Maintain a warm but professional tone - Use contractions naturally (don't, we'll, you're) ### Numbers and Data - Speak account numbers digit by digit: "one two three four five six" not "123456" - Speak prices naturally: "forty-nine ninety-nine" or "forty-nine dollars and ninety-nine cents" - Speak dates fully: "january fifteenth, twenty twenty-five" not "1/15/2025" - Speak phone numbers in groups: "five five five, one two three, four five six seven" ### Pronunciation - Replace product names and brand terms with IPA where needed [Add your pronunciation dictionary here] ### Text Formatting - Never use markdown formatting, bullet points, or structured text - Never use emojis or special characters - Write everything as natural spoken sentences ``` Use this template for coding assistants, documentation readers, technical narrators, and developer-facing tools. ``` ## Speech Output Rules Your responses will be converted to speech using TTS. Follow these rules to produce accurate, well-paced technical speech: ### Technical Accuracy - Spell out acronyms on first use: "AWS, or Amazon Web Services" - For common acronyms after first use, speak them as words if pronounceable (e.g., "NASA") or spell them out if not (e.g., "A-P-I") - Speak URLs by component: "github dot com slash inworld dash AI" - Speak code identifiers in plain English: "the getUserName function" not "getUserName()" - Speak version numbers naturally: "version three point two" not "v3.2" ### Pronunciation - Replace technical proper nouns with IPA: [Add your pronunciation dictionary here, e.g.:] - "Kubernetes" → /kuːbərˈnɛtiːz/ - "Nginx" → /ˈɛndʒɪnɛks/ - "PostgreSQL" → /ˈpoʊstɡrɛsˌkjuːˈɛl/ ### Pacing - Use measured, even pacing. Avoid rushing through technical content. 
- Insert periods before key technical terms to create natural pauses - Keep sentences moderate length - Do NOT use filler words (uh, um, like, you know) ### Text Formatting - Write all numbers in spoken form: "forty-two" not "42" - Never use markdown formatting, bullet points, or code blocks - Never use emojis or special characters - Write everything as natural spoken sentences ``` ## Notes on Normalization Inworld TTS includes an optional **normalization** step that automatically expands dates, numbers, emails, currencies, and symbols into their spoken forms before synthesis. Understanding how normalization interacts with your LLM output is important for getting the best results. Toggle normalization with the `applyTextNormalization` parameter in your [TTS API request](/api-reference/ttsAPI/texttospeech/synthesize-speech-stream#body-apply-text-normalization): - `ON` — always normalize - `OFF` — skip normalization entirely - `APPLY_TEXT_NORMALIZATION_UNSPECIFIED` (default) — TTS decides per-request Normalization adds slight latency to each TTS request. For latency-sensitive applications, consider having your LLM handle text expansion directly and setting `applyTextNormalization` to `OFF`. ### With Normalization On Inworld TTS handles common expansions automatically. Your LLM prompt still benefits from guiding edge cases that normalization may not cover: - **Ambiguous dates**: `01/02/2025` could be January 2nd or February 1st depending on locale - **Domain-specific abbreviations**: `RDS`, `k8s`, `HIPAA` may not expand as expected - **Uncommon acronyms**: Industry-specific terms that aren't in common usage ### With Normalization Off The LLM must handle **all** text expansion. Your prompt must instruct the LLM to write everything in spoken form: no digits, no symbols, no shorthand. 
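Whether normalization is on or off, the string-parser approach mentioned in the Pronunciation section can handle the terms you care most about as a post-processing pass over the LLM output. A minimal sketch, with an illustrative dictionary (build yours from terms you hear mispronounced during testing):

```python
import re

# Example pronunciation dictionary: surface form -> inline IPA notation.
# The entries below are illustrative; extend with your own terms.
PRONUNCIATIONS = {
    "Crete": "/kriːt/",
    "Yosemite": "/joʊˈsɛmɪti/",
    "Nguyen": "/ŋwɪən/",
}

def apply_pronunciations(text: str) -> str:
    """Swap whole-word dictionary hits for inline IPA before sending to TTS."""
    for word, ipa in PRONUNCIATIONS.items():
        text = re.sub(rf"\b{re.escape(word)}\b", ipa, text)
    return text
```

Applying the same dictionary in code keeps pronunciation consistent even when the LLM ignores prompt instructions or when you switch LLM providers.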
### Comparison Table | Raw Text | Normalization Produces | LLM Should Produce (Normalization Off) | |---|---|---| | `12/04/2025` | "twelve oh four twenty twenty-five" | "december fourth, twenty twenty-five" | | `(555) 123-4567` | "five five five, one two three, four five six seven" | "five five five, one two three, four five six seven" | | `$1,249.99` | "one thousand two hundred forty-nine dollars and ninety-nine cents" | "twelve hundred forty-nine dollars and ninety-nine cents" | | `3:45 PM` | "three forty-five PM" | "three forty-five PM" | | `test@example.com` | "test at example dot com" | "test at example dot com" | | `2 + 2 = 4` | "two plus two equals four" | "two plus two equals four" | ### When to Use Each - **Normalization on** (recommended for most cases): Less prompt engineering required. Inworld TTS handles standard expansions and you only need to guide edge cases. - **Normalization off**: Use when you need full control over how text is spoken, or when your domain has specific pronunciation requirements that conflict with default expansion rules. **Prompt snippet for normalization off:** ``` CRITICAL: Write ALL text in fully spoken form. Never use digits, symbols, or abbreviations. - Dates: "december fourth, twenty twenty-five" not "12/04/2025" - Phone numbers: "five five five, one two three, four five six seven" not "(555) 123-4567" - Currency: "forty-nine dollars and ninety-nine cents" not "$49.99" - Times: "three forty-five PM" not "3:45 PM" - Emails: "john at example dot com" not "john@example.com" - Symbols: "two plus two equals four" not "2+2=4" ``` ## Tips for Iterating - **Test with the TTS Playground**: Use the [TTS Playground](/tts/tts-playground) to quickly hear how your LLM output sounds when synthesized. Paste in sample outputs and iterate on your prompt until the speech quality meets your needs. - **Tune LLM temperature for consistency**: Lower temperatures produce more consistent output that follows your formatting rules reliably. 
Higher temperatures can produce more expressive text but may ignore specific instructions. Start around `0.7` and adjust based on results. - **Iterate on your pronunciation dictionary**: Start with a small set of terms and expand as you discover mispronunciations during testing. Ask an LLM to generate IPA for new terms. ## Next Steps Best practices for synthesizing high-quality speech, including punctuation, emphasis, and temperature tuning. Control emotion, delivery style, and non-verbal vocalizations with markup tags. Define exact pronunciations for uncommon words using inline IPA notation. --- #### Voice Cloning Source: https://docs.inworld.ai/tts/best-practices/voice-cloning This guide walks through best practices and techniques for generating high-quality voice clones. For more information on how to create a voice clone, check out this [guide](/tts/voice-cloning). Inworld offers two types of voice cloning: instant voice cloning (available via [Inworld Portal](https://platform.inworld.ai)) and professional voice cloning (please [reach out](https://inworld.ai/contact-sales) for more information). We've broken down the best practices in this guide to general best practices that apply to all voice clones, as well as more specific best practices for each type of cloning. ## General Best Practices 1. **Capture the full range of expression** - Make sure your script and delivery cover the emotions and expressiveness you want the voice to capture. The more variety you include, the better the model will be at recreating those feelings. If the audio is flat, the resulting voice will usually sound monotone as well. Below are some scripts you can use that we've found work well: - *Are you ready to save big? Get set for the sale of the century! Deals and discounts like never before! You won't want to miss this.* - *Every challenge we face is an opportunity in disguise. Wouldn't you agree? So cheer up! It'll all be okay.* - *How have you been? 
It's been way too long since we last caught up. By the way, I heard about your recent promotion. Congratulations! I'm so excited for you!* 2. **Speak clearly and consistently** - Pronounce each word carefully and avoid filler sounds like sighs or coughs. Try not to have unnaturally long pauses in the middle of your recording, as this can affect the flow of the cloned voice. 3. **Minimize noise** - Record in a quiet environment and keep a reasonable distance from the microphone to reduce echo, plosives, and device noise. After recording, listen back to ensure your audio is clean and free of any unwanted sounds. ## Best Practices for Instant Voice Cloning 1. **Keep final clip short** - Use a 5-15s total length for enough context while keeping the voice consistent. 2. **Use high-quality audio** - Record with at least a 22 kHz sample rate and 16-bit depth. 3. **Vary emotion and delivery** - Combine a few short clips that show different expressions into your final clip; use short pauses or crossfades between clips to avoid abrupt cuts. 4. **Use clean audio** - Avoid artifacts, background noise, and non-speech sounds. 5. **Normalize volume** - Keep levels fairly consistent with normal voice variation; avoid clipping due to very high dB. 6. **Avoid mid-word cuts** - Don’t use samples that break in the middle of words. Instant voice cloning may not perform well for less common voices, such as children's voices or unique accents. For those use cases, we recommend professional voice cloning. ## Best Practices for Professional Voice Cloning 1. 
**Follow the optimal recording specifications** - For the best voice quality, we recommend recording audio with the following specifications: - Audio Format: .wav - Sampling Frequency: 48 kHz - Bit Depth: 24 bits - Codec: Linear PCM (uncompressed) - Channel(s): 1 (mono) - Loudness Level: -23 LUFS ±0.5 LU (compliant with ITU-R BS.1770-3) - Peak Level (Max): -5 dBFS True Peak (compliant with ITU-R BS.1770-3) - Noise Floor Level (Max): -60 dB 2. **Maintain consistent voice delivery** - Keep your voice consistent throughout all recordings. It’s fine to reflect natural variation based on the script (such as hesitations, questions, or exclamations), but avoid major changes in accent or style between samples. 3. **Provide ample, high-quality samples** - While the minimum required audio is only 30 samples (5–20 seconds each, totaling about 5 minutes), we recommend at least 120 samples (totaling about 20 minutes) for the best results. There’s no upper limit to the number of samples you can provide—more clean, high-quality recordings will generally lead to higher-quality clones. 4. **Include transcripts where possible** - Text transcripts are not strictly necessary, but we recommend providing them if available—especially for uncommon words, product names, or company terms. This ensures accurate pronunciation in the final voice clone. ## Automation via API If you need to clone multiple voices (for example, to support a batch of creators or a pipeline workflow), you can automate voice cloning via the API. - **API reference**: [Clone a voice](/api-reference/voiceAPI/voiceservice/clone-voice) - **Python example**: [example_voice_clone.py](https://github.com/inworld-ai/inworld-api-examples/blob/main/tts/python/example_voice_clone.py) - **JavaScript example**: [example_voice_clone.js](https://github.com/inworld-ai/inworld-api-examples/blob/main/tts/js/example_voice_clone.js) Voice cloning has lower rate limits than regular speech synthesis. 
For details, see [Rate limits](/resources/rate-limits). --- #### Voice Design Source: https://docs.inworld.ai/tts/best-practices/voice-design This guide walks through best practices and techniques for generating high-quality voices using Voice Design. For a step-by-step walkthrough on how to design a voice, check out the [Voice Design guide](/tts/voice-design). Voice Design is currently in [research preview](/tts/resources/support#what-do-experimental-preview-and-stable-mean). Please share any feedback with us via the feedback form in [Portal](https://platform.inworld.ai) or in [Discord](https://discord.gg/inworld). ## Voice Description Best Practices The voice description helps the model understand the type of voice you want to generate. The following best practices will help you write descriptions that produce better voices: 1. **Be specific in your description** - Vague descriptions like "a fun voice" may produce less consistent results. Include details about age, gender, language (if not English), accent, pitch, pace, timbre, tone, and emotional quality. We generally recommend structuring your description in this order: *Distinctive Qualities → Gender → Language / Accent → Age → Tone → Delivery Style → Pacing → Additional Qualities → Audio Quality* For example: > *"A soothing, calming female voice with soft American accent, 30-45 years old. Gentle, flowing delivery with natural pauses and smooth transitions. Warm, peaceful tone that creates relaxation without sounding robotic. Perfect broadcast quality audio."* 2. **Be specific with age** - If more general terms like "young" and "old" are not producing the desired voice, use more specific age ranges like "mid-20s to early 30s" or "late 60s to early 70s". - For child voices, try specifying exact ages (e.g., "8-10 years old") and emphasize "natural" and "age-appropriate" to avoid over-cutesy results. 
- For elderly voices, include both the age range and specific texture descriptors ("gravelly," "weathered") along with pacing cues ("slower, deliberate"). 3. **For regional accents, specify the city or region** - Always include the specific city or region. For example, write "Boston accent" rather than "Northeast accent." 4. **Describe vocal texture in the middle** - Place descriptions of the vocal texture and timbre (e.g., "raspy," "breathy," "nasally") in the middle of your voice description, never at the end. Use modifiers like "slight," "subtle," or "natural" to prevent over-exaggeration. 5. **End with audio quality** - For the clearest audio quality, include the phrase "Perfect broadcast quality audio." at the end of your description. This can be especially helpful if the voice includes descriptions like "gravelly", "breathy", or "scratchy" that may be misinterpreted as audio degradation. 6. **Avoid conflicting descriptors** - Don't use conflicting descriptors (e.g., "fast-paced" with "slow, deliberate"), as that may confuse the model. 7. **Experiment with multiple generations** - Each generation produces slightly different results. Especially for less common voices (e.g., children, the elderly, specific regional accents), you may need to generate a couple of times to get a successful voice. ## Voice Script Best Practices The script shapes the voice that gets generated, as the model will tailor the voice to suit the content it's speaking. If writing your own script, the following best practices will help ensure the best results. 1. **Match the script to the voice** - Write a script that matches your voice and desired use case. For example, if you're designing a customer support voice, use a script that sounds like a customer support conversation. For accented voices, use words and phrasing typical of that accent. For example, for a British voice, use words like "brilliant," "proper," or "spot on." 
2. **Aim for 5-15 seconds** - Aim for a script that will generate 5-15 seconds of audio (50-200 characters in English), so that your resulting voice has enough generated audio to reference for how the voice should sound in your future audio generations. 3. **Match the desired language** - Make sure the script is in the desired language (e.g., write a Chinese script if you want the voice to speak Chinese). ## Next Steps Follow the step-by-step guide to design your first voice. Learn best practices for synthesizing high-quality speech with your designed voices. Clone an existing voice with just 5-15 seconds of audio. --- ### Resources #### Release Notes Source: https://docs.inworld.ai/release-notes/tts ## Inworld TTS 1.5 Launched [Inworld TTS 1.5](https://inworld.ai/blog/inworld-tts-1-5-the-world-s-best-realtime-text-to-speech-model), our newest generation of realtime TTS models featuring: * **Two New Models:** Our flagship model `inworld-tts-1.5-max` is ideal for most use cases, with the best balance of quality and speed. For use cases where latency is the top priority, we also offer `inworld-tts-1.5-mini`. * **Latency Improvements:** Our new TTS-1.5 models achieve P90 latency for first audio chunk delivery under 250ms for our Max model and under 160ms for our Mini model, a 4x improvement compared to TTS-1. * **More Expressive and More Stable:** TTS 1.5 is 20% more expressive than prior generations and demonstrates a 25% reduction in word error rate. * **Additional Languages:** We've added support for additional languages, including Hindi, Arabic, and Hebrew, bringing total languages supported to 15. ## Updates to Inworld TTS Released an upgraded version of the Inworld TTS models with higher overall quality. * **Speech Quality:** Clearer, more natural speech with smoother pacing and more accurate pronunciation. * **Voice Similarity:** Cloned voices sound closer to the originals, preserving each voice’s unique style. 
* **Non-English Languages:** More consistent, reliable output across supported non-English languages. * **Custom Pronunciation:** New support for inline IPA, giving you control over exact word pronunciations. See [Custom Pronunciation](/tts/capabilities/custom-pronunciation) for details. --- #### Billing Source: https://docs.inworld.ai/portal/billing --- #### Usage Source: https://docs.inworld.ai/portal/usage --- #### Zero Data Retention Source: https://docs.inworld.ai/tts/resources/zero-data-retention Enterprise customers in healthcare, finance, and other regulated industries can now ensure that no customer text inputs or audio outputs are persistently stored in our systems after processing completes. ## What products are configured for Zero Data Retention Mode? | Product | Type | Eligible for Zero Data Retention? | | :---- | :---- | :---- | | Text-to-Speech | Text Input | Yes | | Text-to-Speech | Audio Output | Yes | | Voice Cloning | Audio Samples | No | | Voice Design | Voice Description/Script | No | ## How Zero Data Retention works When enabled, all customer text inputs sent to our text-to-speech engine are processed in memory to generate audio. Once complete, both the text and the generated audio are immediately redacted from our logging systems. This applies to all text sent for speech synthesis, including requests using cloned voices. ## Supported models TTS Zero Data Retention currently supports **TTS-1.5-Mini** and **TTS-1.5-Max** only. Legacy TTS-1 and TTS-1-Max models are not supported. ## Enterprise-level security Zero Data Retention is configured at the workspace level, allowing customers with strict data privacy requirements to maintain compliance. Whether you're handling protected health information or sensitive financial data, you get the security guarantees needed for production deployment at scale. 
## FAQs - TTS Zero Data Retention protects privacy by redacting all input text from logs — including text sent to cloned voices — while retaining only the initial source audio required to build the voice itself. - Zero Data Retention is currently a workspace-level configuration managed by the Inworld Engineering team and is enabled per workspace. Contact our sales team to enable it for specific workspaces. - With Zero Data Retention enabled, original input text and audio output cannot be retrieved, so debugging and troubleshooting capabilities are limited to non-sensitive metadata only. - Zero Data Retention does not change how synthesis works: the TTS engine still receives the text to generate the audio. The redaction happens only at the logging layer, ensuring the record of what was said is not saved. --- #### ElevenLabs Migration Source: https://docs.inworld.ai/tts/resources/elevenlabs-migration Batch-migrate your existing ElevenLabs voice clones into Inworld using our open-source migration tool: [github.com/inworld-ai/voice-migration-tool](https://github.com/inworld-ai/voice-migration-tool) The tool runs locally on your machine and communicates directly with the ElevenLabs and Inworld APIs; it does not proxy your data through any additional intermediary servers. Requires **Node.js** 18+ and **ffmpeg** installed on your machine. ## Steps 1. **Clone and start the tool** — Run: `git clone https://github.com/inworld-ai/voice-migration-tool.git && cd voice-migration-tool && npm install && npm run dev` then open http://localhost:3000. 2. **Connect your accounts** — Enter your ElevenLabs API key, Inworld API key, and Inworld workspace name. Only voices you created yourself are shown. 3. **Select voices and migrate** — Select voices and click Migrate Selected. Audio samples are automatically converted to WAV, padded to 5s minimum, and trimmed to 15s maximum. 4. **Preview your migrated voices** — Click Preview on any voice to generate a sample utterance and confirm the clone works. 
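The 5s/15s normalization the tool applies to each sample can be summarized as a simple clamp. This sketch only computes the target duration; the tool itself uses ffmpeg to do the actual padding and trimming:

```javascript
// Duration normalization applied to each migrated sample: clips shorter
// than 5 s are padded up to 5 s, clips longer than 15 s are trimmed to 15 s.
const MIN_SECONDS = 5;
const MAX_SECONDS = 15;

function targetDuration(seconds) {
  return Math.min(MAX_SECONDS, Math.max(MIN_SECONDS, seconds));
}

console.log(targetDuration(3));  // 5  (padded)
console.log(targetDuration(9));  // 9  (unchanged)
console.log(targetDuration(22)); // 15 (trimmed)
```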
--- #### Support Source: https://docs.inworld.ai/tts/resources/support --- ## STT ### Get Started #### Intro to STT Source: https://docs.inworld.ai/stt/overview Inworld's Speech-to-Text (STT) API provides a unified integration point for industry-leading transcription providers. You get consistent authentication, request formatting, and response handling across providers — without managing multiple SDKs or credentials. The API supports both synchronous transcription for complete audio files and real-time bidirectional streaming over WebSocket for live audio. Make your first STT API call and get a transcript. View the complete API specification. Browse ready-to-use GitHub samples for sync and real-time STT. ## Supported Providers ### Groq | **Model ID** | **Endpoints** | **Best for** | | :--- | :--- | :--- | | `groq/whisper-large-v3` | Sync API only | General-purpose transcription for recorded audio | ### AssemblyAI | **Model ID** | **Endpoints** | **Best for** | | :--- | :--- | :--- | | `assemblyai/universal-streaming-multilingual` | WebSocket only | Multilingual streaming (English, Spanish, French, German, Italian, Portuguese) | | `assemblyai/universal-streaming-english` | WebSocket only | English-optimized streaming | AssemblyAI models currently support the WebSocket streaming endpoint only. Sync HTTP support is coming soon. For pricing details, see [inworld.ai/pricing](https://inworld.ai/pricing). 
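If you're selecting a model programmatically, the provider tables above boil down to a small lookup. This is an illustrative sketch (the helper name and parameters are our own, not part of the API):

```javascript
// Model IDs and their supported endpoints, as listed in the tables above.
const STT_MODELS = {
  'groq/whisper-large-v3': { endpoint: 'sync' },
  'assemblyai/universal-streaming-multilingual': { endpoint: 'websocket' },
  'assemblyai/universal-streaming-english': { endpoint: 'websocket' },
};

// Pick a model: sync Whisper for recorded files, AssemblyAI for live audio.
function pickModel(realtime, multilingual = false) {
  if (!realtime) return 'groq/whisper-large-v3';
  return multilingual
    ? 'assemblyai/universal-streaming-multilingual'
    : 'assemblyai/universal-streaming-english';
}
```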
## Features | **Feature** | **groq/whisper-large-v3** | **assemblyai/universal-streaming-multilingual** | **assemblyai/universal-streaming-english** | | :--- | :--- | :--- | :--- | | Pricing | $0.111/hour | $0.15/hour | $0.15/hour | | Endpoint | Sync API only | WebSocket only | WebSocket only | | Real-time streaming | | | | | Best for | General-purpose transcription for recorded audio | Multilingual streaming (English, Spanish, French, German, Italian, Portuguese) | English-optimized streaming | | Languages | 100+ (Whisper) | 6 languages | English | ## Supported Audio Formats | **Format** | **Sync API** | **WebSocket Streaming** | | :--- | :--- | :--- | | `LINEAR16` (PCM) | | | | `MP3` | | | | `OGG_OPUS` | | | | `FLAC` | | | | `AUTO_DETECT` | | | Recommended defaults: 16,000 Hz sample rate, 16-bit depth, mono. For container formats (MP3, FLAC, OGG_OPUS), `sampleRateHertz` is optional — the API auto-detects it from the file header. ## Endpoints | **Endpoint** | **Method** | **Description** | | :--- | :--- | :--- | | [`/stt/v1/transcribe`](/api-reference/sttAPI/speechtotext/transcribe) | POST | Send complete audio, receive full transcript | | [`/stt/v1/transcribe:streamBidirectional`](/api-reference/sttAPI/speechtotext/transcribe-stream-websocket) | WebSocket | Stream audio in real time, receive transcription chunks as they become available | --- #### Developer Quickstart Source: https://docs.inworld.ai/stt/quickstart In this quickstart, you'll send an audio file to the STT API and receive a transcript. ## Make your first STT API request Create an [Inworld account](https://platform.inworld.ai/signup). (See Authentication section above for API key setup.) Set your API key as an environment variable. ```shell macOS and Linux export INWORLD_API_KEY='your-base64-api-key-here' ``` ```shell Windows setx INWORLD_API_KEY "your-base64-api-key-here" ``` The STT API accepts base64-encoded audio. 
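If you're calling the API from Node.js rather than the shell, the encoding step and request body can be sketched as follows (`buildTranscribeRequest` and the file path are illustrative, not part of any SDK):

```javascript
import { readFileSync } from 'fs';

// Read an audio file and base64-encode it into the request body shape
// used by the sync transcribe endpoint.
function buildTranscribeRequest(path) {
  const audioBase64 = readFileSync(path).toString('base64');
  return {
    transcribeConfig: {
      modelId: 'groq/whisper-large-v3',
      audioEncoding: 'MP3',
    },
    audioData: { content: audioBase64 },
  };
}

// Usage sketch: POST the body to the sync transcribe endpoint.
// const body = buildTranscribeRequest('input.mp3');
// await fetch('https://api.inworld.ai/stt/v1/transcribe', {
//   method: 'POST',
//   headers: {
//     Authorization: `Basic ${process.env.INWORLD_API_KEY}`,
//     'Content-Type': 'application/json',
//   },
//   body: JSON.stringify(body),
// });
```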
Prepare your audio file (e.g., `input.mp3`) and encode it: ```shell macOS export AUDIO_BASE64=$(base64 input.mp3 | tr -d '\n') ``` ```shell Linux export AUDIO_BASE64=$(base64 -w0 input.mp3) ``` Recommended audio settings: 16,000 Hz sample rate, mono, 16-bit depth. See [Supported Audio Formats](/stt/overview#supported-audio-formats) for all options. ```curl cURL curl --request POST \ --url https://api.inworld.ai/stt/v1/transcribe \ --header "Authorization: Basic $INWORLD_API_KEY" \ --header "Content-Type: application/json" \ --data "{ \"transcribeConfig\": { \"modelId\": \"groq/whisper-large-v3\", \"audioEncoding\": \"MP3\" }, \"audioData\": { \"content\": \"$AUDIO_BASE64\" } }" ``` Set `audioEncoding` to match your file format (`MP3`, `LINEAR16`, `OGG_OPUS`, `FLAC`), or use `AUTO_DETECT` to let the API infer it from the audio header. A successful response contains the transcript: ```json { "transcription": { "transcript": "Hey, I just wanted to check in on the delivery status for my order.", "isFinal": true, "wordTimestamps": [] }, "usage": null } ``` | Field | Description | | :--- | :--- | | `transcription.transcript` | The transcribed text | | `transcription.isFinal` | Whether the result is finalized | | `transcription.wordTimestamps` | Per-word timing data (coming soon) | | `usage` | Usage metrics for billing (coming soon) | ## Next Steps Learn about supported providers, audio formats, and endpoints. View the complete API specification. --- #### Voice Profiles Source: https://docs.inworld.ai/stt/voice-profiles Voice Profile analyzes vocal characteristics of the speaker alongside transcription. It returns structured classification data for **Age**, **Emotion**, **Pitch**, **Vocal Style**, and **Accent**, each with confidence scores ranging from 0.0 to 1.0. Voice Profile is available across all STT models on the Inworld STT API. 
By understanding _who_ is speaking and _how_ they are speaking, applications can adapt responses, adjust tone, route conversations, or trigger context-sensitive behaviors in real time. ## Use cases - **Voice agents and NPCs** — Adapt responses based on the speaker's detected emotion or vocal style. - **Accessibility** — Detect age category or vocal style to adjust UI, pacing, or interaction complexity. - **Content moderation** — Flag unusual vocal patterns (shouting, crying) for escalation or review. - **Analytics and insights** — Aggregate emotion and vocal style data across sessions. - **Localization** — Use accent detection to dynamically select language models or localized content. ## How it works Voice Profile analysis runs automatically when configured via `inworldConfig.voiceProfileThreshold` in your request. The confidence threshold controls which labels are returned — only labels at or above the threshold are included. Default: `0.5`. Range: 0.0–1.0. ## Classification categories ### Age | Label | Description | | :--- | :--- | | `young` | Young adult / teenager | | `adult` | Adult speaker | | `kid` | Child speaker | | `old` | Elderly speaker | | `unclear` | Age could not be determined | ### Emotion | Label | Description | | :--- | :--- | | `tender` | Soft, gentle, caring tone | | `sad` | Sorrowful or melancholy tone | | `calm` | Relaxed, even-tempered delivery | | `neutral` | No strong emotional signal | | `happy` | Cheerful, upbeat tone | | `angry` | Frustrated, aggressive tone | | `fearful` | Anxious or frightened tone | | `surprised` | Startled or astonished tone | | `disgusted` | Revulsion or strong disapproval | | `unclear` | Emotion could not be determined | ### Pitch | Label | Description | | :--- | :--- | | `low` | Low-pitched voice | | `medium` | Medium-pitched voice | | `high` | High-pitched voice | ### Vocal Style | Label | Description | | :--- | :--- | | `whispering` | Hushed, breathy delivery | | `normal` | Standard conversational speech | | 
`singing` | Melodic or musical delivery | | `mumbling` | Unclear, low-articulation speech | | `crying` | Speech accompanied by crying | | `laughing` | Speech accompanied by laughter | | `shouting` | Loud, raised-voice delivery | | `monotone` | Flat, unvaried pitch delivery | | `unclear` | Vocal style could not be determined | ### Accent Detects the speaker's accent using BCP-47 locale codes. Supported codes include: `en-US`, `en-GB`, `en-AU`, `zh-CN`, `fr-FR`, `es-ES`, `es-419`, `es-MX`, `ar-EG`, and more. ## Configuration Set `voiceProfileThreshold` inside `inworldConfig`: ```json { "transcribeConfig": { "modelId": "", "language": "en-US", "audioEncoding": "MP3", "inworldConfig": { "voiceProfileThreshold": 0.5 } } } ``` ## Response structure The `voiceProfile` object is returned alongside `transcription` and `usage` in both sync and streaming responses. Each category is an array of `{ label, confidence }` objects, sorted by descending confidence. The JSON below shows the **normalized** response shape (camelCase throughout). Raw API payloads may use `snake_case` for the same fields (for example `vocal_style`, `transcribed_audio_ms`, `model_id`). Prefer representing one layer per example in your own docs and client code — either the raw API shape or the normalized shape — not a mix of both. 
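If you receive the raw snake_case shape and want the normalized camelCase shape, a small recursive key converter is enough (an illustrative sketch, not an official SDK helper):

```javascript
// Recursively convert snake_case keys (raw API payloads) to the camelCase
// shape used in the normalized examples, e.g. vocal_style -> vocalStyle.
function toCamel(value) {
  if (Array.isArray(value)) return value.map(toCamel);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) => [
        k.replace(/_([a-z])/g, (_, c) => c.toUpperCase()),
        toCamel(v),
      ]),
    );
  }
  return value;
}

// toCamel({ vocal_style: [{ label: 'whispering', confidence: 0.97 }] })
// -> { vocalStyle: [{ label: 'whispering', confidence: 0.97 }] }
```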
### Example response (sync, normalized shape) ```json { "transcription": { "transcript": "Hey, I just wanted to check in on the delivery status.", "isFinal": true }, "voiceProfile": { "age": [ { "label": "young", "confidence": 0.78 } ], "emotion": [ { "label": "tender", "confidence": 0.97 }, { "label": "sad", "confidence": 0.03 } ], "pitch": [ { "label": "medium", "confidence": 0.85 } ], "vocalStyle": [ { "label": "whispering", "confidence": 0.97 }, { "label": "normal", "confidence": 0.03 } ], "accent": [ { "label": "en-US", "confidence": 0.48 } ] }, "usage": { "transcribedAudioMs": 3200, "modelId": "inworld/inworld-stt-1" } } ``` ## Best practices - **Start with the default threshold (0.5)** — Filters out low-confidence noise while keeping useful labels. - **Use emotion and vocal style together** — Combining both gives a richer picture. - **Handle missing fields gracefully** — Fields may be absent if classification confidence is insufficient. - **Accent is probabilistic** — Use it as a signal rather than a hard routing decision. --- ### Resources #### Billing Source: https://docs.inworld.ai/stt/resources/billing --- #### Usage Source: https://docs.inworld.ai/stt/resources/usage --- #### Support Source: https://docs.inworld.ai/stt/resources/support --- ## Realtime API ### Overview #### Intro to Realtime API (Speech-to-Speech) Source: https://docs.inworld.ai/realtime/overview Inworld's Realtime API (Speech-to-Speech) enables low-latency, speech-to-speech interactions with voice agents. The API follows the OpenAI Realtime protocol, extended to enable additional customization. Build a voice agent with WebSocket, mic input, and audio playback. Build a voice agent with browser-native WebRTC — no manual audio encoding. See the full event schemas for the Realtime API. JavaScript examples for the Realtime API. Python examples for the Realtime API. Inworld's Realtime API is currently in [research preview](/tts/resources/support#what-do-experimental-preview-and-stable-mean). 
Please share any feedback with us via the feedback form in [Portal](https://platform.inworld.ai) or in [Discord](https://discord.gg/inworld). ## Key Features - **WebSocket and WebRTC transports**: Connect over [WebSocket](/realtime/connect/websocket) or [WebRTC](/realtime/connect/webrtc) with a standard event schema. - **Automatic interruption-handling and turn-taking**: Your agent will manage conversations naturally and be resilient to user barge-in. - **Router support**: Utilize Inworld Routers to enable a single agent to dynamically handle different user cohorts, or to facilitate A/B tests. - **OpenAI compatibility**: Drop-in replacement for the OpenAI Realtime API with a simple [migration path](/realtime/openai-migration). ## Guides Configure sessions, send input, and orchestrate responses. Session lifecycle and conversation events. Step-by-step guide to switch from OpenAI to Inworld. See the [API reference](/api-reference/realtimeAPI/realtime/realtime-websocket) for full event schemas. --- #### WebSocket Quickstart Source: https://docs.inworld.ai/realtime/quickstart-websocket Build a browser-based voice agent that streams audio to the Inworld Realtime API using WebSocket. The WebSocket transport is best for server-side and proxied connections where you can set custom headers. For browser-native voice with lower latency, see the [WebRTC Quickstart](/realtime/quickstart-webrtc). ## Get Started Create an [Inworld account](https://platform.inworld.ai/signup). (See Authentication section above for API key setup.) Set your API key as an environment variable. ```shell macOS and Linux export INWORLD_API_KEY='your-base64-api-key-here' ``` ```shell Windows setx INWORLD_API_KEY "your-base64-api-key-here" ``` Create `server.js`. It proxies WebSocket events between the browser and Inworld, configures the voice session, and triggers an initial greeting. 
```javascript server.js import { readFileSync } from 'fs'; import { createServer } from 'http'; import { WebSocketServer, WebSocket } from 'ws'; const html = readFileSync('index.html'); const server = createServer((req, res) => { res.writeHead(200, { 'Content-Type': 'text/html' }); res.end(html); }); const wss = new WebSocketServer({ server, path: '/ws' }); const SESSION_CFG = JSON.stringify({ type: 'session.update', session: { instructions: 'You are a friendly voice assistant. Keep responses brief.', } }); const GREET = JSON.stringify({ type: 'conversation.item.create', item: { type: 'message', role: 'user', content: [{ type: 'input_text', text: 'Greet the user' }] } }); wss.on('connection', (browser) => { let setup = 0; const api = new WebSocket( `wss://api.inworld.ai/api/v1/realtime/session?key=voice-${Date.now()}&protocol=realtime`, { headers: { Authorization: `Basic ${process.env.INWORLD_API_KEY}` } } ); api.on('message', (raw) => { if (setup < 2) { const t = JSON.parse(raw.toString()).type; if (t === 'session.created') { api.send(SESSION_CFG); setup = 1; } else if (t === 'session.updated' && setup === 1) { api.send(GREET); api.send('{"type":"response.create"}'); setup = 2; } } if (browser.readyState === WebSocket.OPEN) browser.send(raw.toString()); }); browser.on('message', (msg) => { if (api.readyState === WebSocket.OPEN) api.send(msg.toString()); }); browser.on('close', () => api.close()); api.on('close', () => { if (browser.readyState === WebSocket.OPEN) browser.close(); }); api.on('error', (e) => console.error('API error:', e.message)); }); let port = 3000; server.on('error', (e) => { if (e.code === 'EADDRINUSE') { console.warn(`Port ${port} in use, trying ${++port}…`); server.listen(port); } else throw e; }); server.listen(port, () => console.log(`Open http://localhost:${port}`)); ``` Create `index.html` in the same directory. It captures microphone audio, plays agent audio, and displays transcripts that fade after each turn. 
```html index.html Voice Agent ``` ```bash npm init -y && npm pkg set type=module npm install ws node server.js ``` Open [http://localhost:3000](http://localhost:3000) and click **Start Conversation**. The agent greets you with audio. ## How It Works | Component | Role | | --- | --- | | **Browser** | Captures mic audio (PCM16, 24 kHz), plays agent audio | | **Server** | Proxies events between browser and Inworld, holds the API key server-side | | **Inworld Realtime API** | Handles speech-to-text, LLM processing, and text-to-speech in one WebSocket session | Key events used: - `input_audio_buffer.append` — streams mic audio to Inworld - `response.output_audio.delta` — agent audio chunks for playback - `input_audio_buffer.speech_started` — triggers interruption (stops agent playback) ## Next Steps Full connection details, session config, and event handling. Configure the key elements of your voice agent. --- #### WebRTC Quickstart Source: https://docs.inworld.ai/realtime/quickstart-webrtc Build a browser-based voice agent that streams audio to the Inworld Realtime API using WebRTC. Audio is handled natively by the browser — no manual PCM encoding or base64 conversion needed. WebRTC is ideal for browser voice apps with low latency. For server-side integrations, see the [WebSocket Quickstart](/realtime/quickstart-websocket). ## Get Started Create an [Inworld account](https://platform.inworld.ai/signup). (See Authentication section above for API key setup.) Create a `.env` file: ```shell .env INWORLD_API_KEY=your-base64-api-key-here ``` Create `server.js`. It serves the page and provides a `/api/config` endpoint that fetches ICE servers from the WebRTC proxy while keeping the API key server-side. 
```javascript server.js import 'dotenv/config'; import { readFileSync } from 'fs'; import { createServer } from 'http'; const html = readFileSync('index.html'); const API_KEY = process.env.INWORLD_API_KEY || ''; const PROXY = 'https://api.inworld.ai'; const server = createServer(async (req, res) => { if (req.url === '/api/config') { let ice = []; try { const r = await fetch(`${PROXY}/v1/realtime/ice-servers`, { headers: { Authorization: `Bearer ${API_KEY}` }, }); if (r.ok) ice = (await r.json()).ice_servers || []; } catch {} res.writeHead(200, { 'Content-Type': 'application/json' }); res.end(JSON.stringify({ api_key: API_KEY, ice_servers: ice, url: `${PROXY}/v1/realtime/calls` })); return; } res.writeHead(200, { 'Content-Type': 'text/html' }); res.end(html); }); let port = 3000; server.on('error', (e) => { if (e.code === 'EADDRINUSE') { console.warn(`Port ${port} in use, trying ${++port}…`); server.listen(port); } else throw e; }); server.listen(port, () => console.log(`Open http://localhost:${port}`)); ``` Create `index.html` in the same directory. It connects via WebRTC, streams mic audio automatically, and plays agent audio through an RTP track. ```html index.html WebRTC Voice Agent ``` ```bash npm init -y && npm pkg set type=module npm install dotenv node server.js ``` Open [http://localhost:3000](http://localhost:3000) and click **Start Conversation**. The agent greets you with audio. ## Option 2: Using OpenAI Agents SDK If you're building a more advanced voice agent with features like agent handoffs, tool calling, and guardrails, you can use the [OpenAI Agents SDK](https://github.com/openai/openai-agents-js) with Inworld's WebRTC proxy. We provide a ready-to-run playground based on OpenAI's realtime agents demo. 
```bash git clone https://github.com/inworld-ai/experimental-oai-realtime-agents-playground.git cd experimental-oai-realtime-agents-playground npm install ``` If you are unable to access this repository, please contact **support@inworld.ai** for access. Open `.env` and set `OPENAI_API_KEY` to your **Inworld** API key (the same Base64 credentials from [Inworld Portal](https://platform.inworld.ai/)): ```shell .env OPENAI_API_KEY=your-inworld-base64-api-key-here ``` Despite the variable name `OPENAI_API_KEY`, this must be your **Inworld** API key — not an OpenAI key. The SDK uses this variable name by convention, but the playground routes all traffic through the Inworld WebRTC proxy. ```bash npm run dev ``` Open [http://localhost:3000](http://localhost:3000). Select a scenario from the **Scenario** dropdown and start talking. The playground includes two agentic patterns: - **Chat-Supervisor** — A realtime chat agent handles basic conversation while a more capable text model (e.g. `gpt-4.1`) handles tool calls and complex responses. - **Sequential Handoff** — Specialized agents transfer the user between them to handle specific intents (e.g. authentication → returns → sales). For full details on customizing agents, see the playground's README. --- ## How It Works | Component | Role | | --- | --- | | **Browser** | Captures mic audio via WebRTC, plays agent audio from RTP track | | **Node.js server** | Serves the page and `/api/config` (ICE servers + API key) | | **WebRTC proxy** | Bridges WebRTC ↔ WebSocket, transcodes OPUS ↔ PCM16 | | **Inworld Realtime API** | Handles speech-to-text, LLM processing, and text-to-speech | Key differences from WebSocket: - Audio flows via **RTP tracks** (no base64 encoding) - Events flow via **DataChannel** (same JSON schema) - Browser handles **OPUS codec** natively ## Next Steps Full connection details, session config, and SDK integration. VAD configuration, audio formats, and conversation flow. 
Migrate from OpenAI Realtime API to Inworld. --- ### Build with Realtime API #### WebSocket Source: https://docs.inworld.ai/realtime/connect/websocket Connect via WebSocket. For browser-native, low-latency voice, see [WebRTC](/realtime/connect/webrtc). ## Endpoint ``` wss://api.inworld.ai/api/v1/realtime/session?key=&protocol=realtime ``` | Parameter | Required | Description | | --- | --- | --- | | `key` | Yes | Session ID from your app | | `protocol` | Yes | `realtime` | ## Authentication | Environment | Header | Notes | | --- | --- | --- | | **Server-side (Node.js)** | `Authorization: Basic ` | The API key from [Inworld Portal](https://platform.inworld.ai/) is already Base64-encoded | | **Client-side (browser)** | `Authorization: Bearer ` | Mint a JWT on your backend. See the [JWT sample app](https://github.com/inworld-ai/inworld-nodejs-jwt-sample-app) for a complete example | ## Flow 1. Connect → receive `session.created` 2. Send `session.update` (instructions, audio config, tools) 3. Stream audio (`input_audio_buffer.append`) or text (`conversation.item.create`) 4. `response.create` → handle `response.output_*` until `response.done` ## Session Config `session.update` accepts partial updates, so you can dynamically update your prompt, voice, model, tools, and so on during the conversation. 
```javascript ws.send(JSON.stringify({ type: 'session.update', session: { type: 'realtime', model: 'openai/gpt-4o-mini', instructions: 'You are a concise concierge.', output_modalities: ['audio', 'text'], audio: { input: { turn_detection: { type: 'semantic_vad', eagerness: 'medium', create_response: true, interrupt_response: true } }, output: { voice: 'Clive', model: 'inworld-tts-1.5-mini', speed: 1.0 } }, tools: [{ type: 'function', name: 'get_weather', description: 'Fetch weather for a location', parameters: { type: 'object', properties: { location: { type: 'string' } }, required: ['location'] } }] } })); ``` ## Audio Input and output audio should be PCM16, 24 kHz mono, base64 encoded. Recommended chunk size is 100-200ms. ```javascript ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: base64PcmChunk })); ``` Use `input_audio_buffer.clear` to discard unwanted audio. ## Text The Realtime API can accept text as well as audio. Send it from your client using `conversation.item.create`. ```javascript ws.send(JSON.stringify({ type: 'conversation.item.create', item: { type: 'message', role: 'user', content: [{ type: 'input_text', text: 'Can you summarize the notes I sent?' }] } })); ``` ## Events Speech-to-speech conversations are facilitated by websocket events - both client-sent events which you'll send to the API, and server-sent events which you'll receive and react to. 
- **Session:** `session.created`, `session.updated`
- **Conversation:** `conversation.item.added/done/retrieved/deleted/truncated`, transcription deltas/completions
- **Responses:** `response.created`, `response.output_item.added/done`, `response.output_text.delta/done`, `response.output_audio.delta/done`, `response.done`
- **Audio/VAD:** `input_audio_buffer.speech_started`, `input_audio_buffer.speech_stopped`, `response.output_audio_transcript.delta`
- **Errors:** `error`

The full list of events and their schemas is available in the [API reference](/api-reference/realtimeAPI/realtime/realtime-websocket).

## Node.js WebSocket server example

Server-side Node.js example using the `ws` library with Basic auth.

```javascript
// npm install ws
const WebSocket = require('ws');

const sessionId = 'your-session-id';
const credentials = process.env.INWORLD_API_KEY;

const ws = new WebSocket(
  `wss://api.inworld.ai/api/v1/realtime/session?key=${sessionId}&protocol=realtime`,
  { headers: { Authorization: `Basic ${credentials}` } }
);

ws.on('open', () => {
  console.log('WebSocket connected');
});

ws.on('message', (buffer) => {
  const message = JSON.parse(buffer.toString());
  switch (message.type) {
    case 'session.created':
      console.log('Session created:', message.session.id);
      updateSession();
      break;
    case 'session.updated':
      console.log('Session updated');
      sendMessage('Hello!');
      break;
    case 'conversation.item.added':
      console.log('Conversation item added:', message.item.id);
      break;
    case 'conversation.item.done':
      console.log('Conversation item done');
      createResponse();
      break;
    case 'input_audio_buffer.speech_started':
      console.log('Speech started at', message.audio_start_ms, 'ms');
      break;
    case 'input_audio_buffer.speech_stopped':
      console.log('Speech stopped at', message.audio_end_ms, 'ms');
      break;
    case 'conversation.item.input_audio_transcription.delta':
      console.log('Transcription delta:', message.delta);
      break;
    case 'conversation.item.input_audio_transcription.completed':
      console.log('Transcription complete:', message.transcript);
      break;
    case 'response.created':
      console.log('Response created:', message.response.id);
      break;
    case 'response.output_item.added':
      console.log('Output item added:', message.item.id);
      break;
    case 'response.output_text.delta':
      console.log('Text delta:', message.delta);
      break;
    case 'response.output_audio.delta': {
      // Decode and play audio chunk
      const audioBuffer = Buffer.from(message.delta, 'base64');
      playAudio(audioBuffer);
      break;
    }
    case 'response.output_audio_transcript.delta':
      console.log('Audio transcript delta:', message.delta);
      break;
    case 'response.done':
      console.log('Response complete, status:', message.response.status);
      break;
    case 'error':
      console.error('Error:', message.error.message, message.error.code);
      break;
  }
});

function updateSession() {
  ws.send(JSON.stringify({
    type: 'session.update',
    session: {
      type: 'realtime',
      output_modalities: ['text', 'audio'],
      instructions: 'You are a helpful AI assistant.',
      audio: {
        input: {
          turn_detection: {
            type: 'semantic_vad',
            eagerness: 'medium',
            create_response: true,
            interrupt_response: true
          }
        },
        output: { voice: 'Clive' }
      }
    }
  }));
}

function sendMessage(text) {
  ws.send(JSON.stringify({
    type: 'conversation.item.create',
    item: {
      type: 'message',
      role: 'user',
      content: [{ type: 'input_text', text }]
    }
  }));
}

function createResponse() {
  ws.send(JSON.stringify({
    type: 'response.create',
    response: { output_modalities: ['text', 'audio'] }
  }));
}

function cancelResponse() {
  ws.send(JSON.stringify({ type: 'response.cancel' }));
}

function sendAudioChunk(audioChunk) {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: audioChunk // base64-encoded audio data
  }));
}

function clearAudioBuffer() {
  ws.send(JSON.stringify({ type: 'input_audio_buffer.clear' }));
}
```

See the [API reference](/api-reference/realtimeAPI/realtime/realtime-websocket) for full schemas.

---

#### WebRTC

Source: https://docs.inworld.ai/realtime/connect/webrtc

Connect via WebRTC for browser-native, low-latency voice.
A WebRTC proxy bridges your peer connection to the same realtime service used by the [WebSocket](/realtime/connect/websocket) transport, transcoding OPUS ↔ PCM16 and forwarding events transparently.

## Endpoint

```
https://api.inworld.ai
```

| Endpoint | Method | Description |
| --- | --- | --- |
| `/v1/realtime/calls` | POST | SDP offer/answer exchange |
| `/v1/realtime/ice-servers` | GET | STUN/TURN server configuration |

## Authentication

Pass your Inworld API key as a Bearer token. The proxy forwards it to the realtime service.

```
Authorization: Bearer 
```

Keep the API key server-side. Serve it to the browser via a backend endpoint (see examples below).

## Flow

1. Fetch config from your server (API key + ICE servers)
2. Create `RTCPeerConnection` with ICE servers
3. Create data channel `oai-events` + add microphone track
4. Create SDP offer → POST to `/v1/realtime/calls` → set SDP answer
5. Data channel opens → send `session.update` → start conversation

Audio flows via RTP tracks (no manual encode/decode). Events flow via data channel using the same JSON schema as [WebSocket](/realtime/connect/websocket).

## Session Config

Same `session.update` as WebSocket, sent through the data channel. See [model, voice, and TTS configuration](/realtime/usage/using-realtime-models#choose-an-llm) for details.

```javascript
dc.send(JSON.stringify({
  type: 'session.update',
  session: {
    type: 'realtime',
    model: 'openai/gpt-4o-mini',
    instructions: 'You are a concise concierge.',
    output_modalities: ['audio', 'text'],
    audio: {
      input: {
        turn_detection: {
          type: 'semantic_vad',
          eagerness: 'medium',
          create_response: true,
          interrupt_response: true
        }
      },
      output: {
        voice: 'Clive',
        model: 'inworld-tts-1.5-mini',
        speed: 1.0
      }
    }
  }
}));
```

## Audio

Unlike WebSocket (manual base64 PCM), WebRTC handles audio natively:

- **Input**: browser captures mic and sends OPUS over RTP automatically
- **Output**: proxy sends AI audio back as an RTP track — attach to `