Intro to Realtime STT - Inworld AI Documentation

The Realtime Speech-to-Text (STT) API provides a unified integration point for industry-leading transcription providers. You get consistent authentication, request formatting, and response handling across providers — without managing multiple SDKs or credentials. The API supports both synchronous transcription for complete audio files and real-time bidirectional streaming over WebSocket for live audio.

Developer Quickstart

Make your first STT API call and get a transcript.

API Reference

View the complete API specification.

Code Examples

Browse ready-to-use GitHub samples for sync and real-time STT.


Supported Providers

Inworld (first-party) — Experimental

Model ID	Endpoints	Best for
`inworld/inworld-stt-1`	Sync API + WebSocket	Voice agents and character-driven apps that benefit from transcription plus Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking

The Inworld first-party model is currently Experimental. Features and pricing are subject to change.

Supports English plus 29 additional languages in experimental mode. See Supported Languages for the full list.

Groq

Model ID	Endpoints	Best for
`groq/whisper-large-v3`	Sync API only	General-purpose transcription for recorded audio

AssemblyAI

Model ID	Endpoints	Best for
`assemblyai/universal-streaming-multilingual`	WebSocket only	Multilingual streaming (English, Spanish, French, German, Italian, Portuguese)
`assemblyai/universal-streaming-english`	WebSocket only	English-optimized streaming
`assemblyai/u3-rt-pro`	WebSocket only	High-accuracy, sub-300ms latency, multilingual streaming (English, Spanish, French, German, Italian, Portuguese)
`assemblyai/whisper-rt`	WebSocket only	Real-time Whisper transcription

AssemblyAI models currently support the WebSocket streaming endpoint only. Sync HTTP support is coming soon.

Soniox

Model ID	Endpoints	Best for
`soniox/stt-rt-v4`	WebSocket only	High-accuracy real-time streaming with semantic end-of-turn detection and multilingual support

Soniox models currently support the WebSocket streaming endpoint only.

For pricing details, see Billing or inworld.ai/pricing.

Model comparison

Feature	inworld/inworld-stt-1	groq/whisper-large-v3	assemblyai/universal-streaming-multilingual	assemblyai/universal-streaming-english	assemblyai/u3-rt-pro	assemblyai/whisper-rt	soniox/stt-rt-v4
Pricing	See pricing	See pricing	See pricing	See pricing	See pricing	See pricing	See pricing
Endpoint	Sync API + WebSocket	Sync API only	WebSocket only	WebSocket only	WebSocket only	WebSocket only	WebSocket only
Real-time streaming	Yes	No	Yes	Yes	Yes	Yes	Yes
Best for	Voice agents with Voice Profile and configurable turn-taking	General-purpose transcription for recorded audio	Multilingual streaming (English, Spanish, French, German, Italian, Portuguese)	English-optimized streaming	High-accuracy, sub-300ms multilingual streaming (English, Spanish, French, German, Italian, Portuguese)	Real-time Whisper transcription	High-accuracy real-time streaming with semantic end-of-turn detection and multilingual support
Languages	English; 29 Experimental (see below)	100+ (Whisper)	6 languages	English	6 languages	100+ (Whisper)	Multilingual
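The endpoint support in the comparison table can be encoded as a small lookup so that a client rejects an unsupported model and endpoint pairing before sending anything. This is a minimal sketch: the model IDs and endpoint support come from the table, while the `"sync"` and `"websocket"` labels and the `models_for` helper are illustrative names, not part of the API.

```python
# Endpoint support per model ID, copied from the comparison table.
# The "sync"/"websocket" labels are illustrative, not API values.
MODEL_ENDPOINTS = {
    "inworld/inworld-stt-1": {"sync", "websocket"},
    "groq/whisper-large-v3": {"sync"},
    "assemblyai/universal-streaming-multilingual": {"websocket"},
    "assemblyai/universal-streaming-english": {"websocket"},
    "assemblyai/u3-rt-pro": {"websocket"},
    "assemblyai/whisper-rt": {"websocket"},
    "soniox/stt-rt-v4": {"websocket"},
}


def models_for(endpoint: str) -> list:
    """Return the sorted model IDs that support the given endpoint kind."""
    if endpoint not in ("sync", "websocket"):
        raise ValueError(f"unknown endpoint kind: {endpoint!r}")
    return sorted(
        model for model, kinds in MODEL_ENDPOINTS.items() if endpoint in kinds
    )


def supports(model_id: str, endpoint: str) -> bool:
    """True if the model is listed in the table with that endpoint."""
    return endpoint in MODEL_ENDPOINTS.get(model_id, set())
```

For example, `supports("groq/whisper-large-v3", "websocket")` is false, which matches the "Sync API only" entry for that model.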

Supported Audio Formats

Format	Sync API	WebSocket Streaming
`LINEAR16` (PCM)
`MP3`
`OGG_OPUS`
`FLAC`
`AUTO_DETECT`

Recommended defaults: 16,000 Hz sample rate, 16-bit depth, mono. For container formats (MP3, FLAC, OGG_OPUS, WAV), sampleRateHertz is optional — the API auto-detects it from the file header.

STT performs best with 16 kHz audio. Lower sample rates (such as 8 kHz telephony audio) contain fewer data points for the model to interpret, which reduces transcription accuracy. Upsampling low-sample-rate audio does not improve quality — it only interpolates between existing samples without adding new information.
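A quick local check can flag audio that departs from the recommended defaults before it is uploaded. The sketch below uses only the Python standard library and handles WAV input only; the helper names (`describe_wav`, `audio_warnings`, `make_silent_wav`) are hypothetical, and the thresholds simply restate the recommendations above.

```python
import io
import wave

RECOMMENDED_RATE_HZ = 16_000  # recommended default from the text above


def describe_wav(data: bytes) -> dict:
    """Read sample rate, bit depth, and channel count from WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "bit_depth": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
        }


def audio_warnings(info: dict) -> list:
    """List the ways the audio differs from 16 kHz, 16-bit, mono."""
    warnings = []
    if info["sample_rate_hz"] < RECOMMENDED_RATE_HZ:
        warnings.append(
            f"sample rate {info['sample_rate_hz']} Hz is below 16 kHz; "
            "expect lower accuracy (upsampling will not help)"
        )
    if info["bit_depth"] != 16:
        warnings.append(f"bit depth is {info['bit_depth']}, not 16")
    if info["channels"] != 1:
        warnings.append(f"{info['channels']} channels, not mono")
    return warnings


def make_silent_wav(sample_rate_hz: int, seconds: float = 0.1) -> bytes:
    """Build a short silent 16-bit mono WAV in memory, for trying the checks."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00\x00" * int(sample_rate_hz * seconds))
    return buf.getvalue()
```

Running `audio_warnings(describe_wav(make_silent_wav(8000)))` yields a single warning about the sample rate, while a 16 kHz file yields none.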

Endpoints

Endpoint	Method	Description
`/stt/v1/transcribe`	POST	Send complete audio, receive full transcript
`/stt/v1/transcribe:streamBidirectional`	WebSocket	Stream audio in real time, receive transcription chunks as they become available
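As a rough illustration of the sync endpoint, the sketch below assembles a POST request for `/stt/v1/transcribe` without sending it. Only the path, the `audioEncoding`, `sampleRateHertz`, and `language` field names, and the `INWORLD_API_KEY` variable appear on this page. Everything else is an assumption to verify against the API Reference: the host, the `Basic` authorization scheme, the `modelId` and `audio` field names, and base64 encoding of the audio bytes.

```python
import base64
import json
import os
import urllib.request
from typing import Optional

BASE_URL = "https://api.inworld.ai"     # ASSUMPTION: host is not stated on this page
TRANSCRIBE_PATH = "/stt/v1/transcribe"  # from the Endpoints table


def build_transcribe_request(
    audio: bytes,
    model_id: str,
    audio_encoding: str = "AUTO_DETECT",
    sample_rate_hertz: Optional[int] = None,
    language: Optional[str] = None,
) -> urllib.request.Request:
    """Assemble (but do not send) a sync transcription request."""
    body = {
        "modelId": model_id,                                # ASSUMPTION: field name
        "audioEncoding": audio_encoding,
        "audio": base64.b64encode(audio).decode("ascii"),   # ASSUMPTION: name and encoding
    }
    # sampleRateHertz is optional for container formats, so include it only if given.
    if sample_rate_hertz is not None:
        body["sampleRateHertz"] = sample_rate_hertz
    # Omitting language leaves auto-detection to the model where supported.
    if language is not None:
        body["language"] = language
    api_key = os.environ.get("INWORLD_API_KEY", "")
    return urllib.request.Request(
        BASE_URL + TRANSCRIBE_PATH,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {api_key}",  # ASSUMPTION: auth scheme
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the builder is kept separate so the payload can be inspected first.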

Supported Languages

Language support depends on the STT provider. See Model comparison above for more details.

Inworld first-party model (`inworld/inworld-stt-1`)

Available:

English (en)

Experimental:

Spanish (es)
French (fr)
German (de)
Italian (it)
Portuguese (pt)
Dutch (nl)
Russian (ru)
Chinese (zh)
Japanese (ja)
Korean (ko)
Arabic (ar)
Hindi (hi)
Turkish (tr)
Polish (pl)
Swedish (sv)
Cantonese (yue)
Indonesian (id)
Thai (th)
Vietnamese (vi)
Malay (ms)
Danish (da)
Finnish (fi)
Czech (cs)
Filipino (fil)
Persian (fa)
Greek (el)
Hungarian (hu)
Macedonian (mk)
Romanian (ro)

Set the `language` field to force recognition of a known language. Omit it to allow auto-detection where the model supports it.

Error Handling

Errors follow the standard gRPC status format.

Authentication error

{
  "code": 16,
  "message": "Unauthenticated: invalid or missing API key.",
  "details": []
}

Invalid request

{
  "code": 3,
  "message": "Unsupported audio encoding.",
  "details": []
}

Common gRPC status codes

Code	Name	Description
`3`	`INVALID_ARGUMENT`	Invalid or missing request field (encoding, model ID, audio data)
`8`	`RESOURCE_EXHAUSTED`	Too many concurrent requests (rate limit)
`16`	`UNAUTHENTICATED`	Invalid or missing API key
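A client can map the numeric `code` in an error body to its status name and decide whether a retry makes sense. The sketch below covers only the three codes listed in the table. Treating `RESOURCE_EXHAUSTED` as retryable (with backoff) is a common convention for rate limits, not something this page prescribes, and `classify_error` is a hypothetical helper name.

```python
import json

# The three status codes listed in the table above.
GRPC_STATUS_NAMES = {
    3: "INVALID_ARGUMENT",
    8: "RESOURCE_EXHAUSTED",
    16: "UNAUTHENTICATED",
}

# ASSUMPTION: only rate-limit errors are worth retrying; the other two
# need a corrected request or API key first.
RETRYABLE_CODES = {8}


def classify_error(raw: str) -> dict:
    """Parse an error body and attach the status name and a retry hint."""
    err = json.loads(raw)
    code = err.get("code")
    return {
        "code": code,
        "name": GRPC_STATUS_NAMES.get(code, "UNLISTED"),
        "message": err.get("message", ""),
        "retryable": code in RETRYABLE_CODES,
    }
```

Applied to the authentication error example above, this returns the name `UNAUTHENTICATED` with `retryable` set to false.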

Best Practices

Model choice — Use inworld/inworld-stt-1 when you want Voice Profile or Inworld-optimized turn-taking; use Groq/AssemblyAI/Soniox for specific latency/accuracy needs.
Audio — Use MP3/OGG_OPUS for file uploads to reduce size; use LINEAR16 for streaming (required) and when you need highest quality.
Streaming — For Inworld model with manual turn-taking, send endTurn at each turn boundary and closeStream when done.
Speech events — Listen for speechStarted and speechStopped events in the streaming response to detect when a speaker begins and stops talking. Use these to build custom turn-taking logic or visualize voice activity.
Voice Profile — Set voiceProfileConfig.enableVoiceProfile to true and optionally adjust topN (default: 10) to control how many labels per category are returned.
Test with sample audio and your target language before production.
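The speech-event practice above can be sketched as a small state tracker that turns a stream of decoded messages into voice-activity transitions. The event names `speechStarted` and `speechStopped` come from this page. The message shape is an assumption (each message is taken to be a dict that carries the event name as a key), so adapt the key checks to the actual streaming response format.

```python
def track_voice_activity(messages) -> list:
    """Return the ordered list of "started"/"stopped" transitions.

    ASSUMPTION: each decoded message is a dict, and a speech event shows
    up as a top-level key named speechStarted or speechStopped.
    Repeated events that do not change state are ignored.
    """
    speaking = False
    transitions = []
    for message in messages:
        if "speechStarted" in message and not speaking:
            speaking = True
            transitions.append("started")
        elif "speechStopped" in message and speaking:
            speaking = False
            transitions.append("stopped")
    return transitions


# Simulated messages; the "transcription" key is a hypothetical
# placeholder for any non-event message.
example = [
    {"speechStarted": {}},
    {"transcription": {"text": "hello"}},
    {"speechStopped": {}},
    {"speechStopped": {}},  # duplicate, ignored
]
```

With the simulated input, `track_voice_activity(example)` returns `["started", "stopped"]`. The same transitions could drive a voice-activity indicator or custom turn-taking logic.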

Troubleshooting

Issue	What to check
No transcript	API key, audio encoding matches request, valid audio file
`UNAUTHENTICATED`	`INWORLD_API_KEY` set correctly and not expired in Portal
`INVALID_ARGUMENT`	`audioEncoding` matches the actual format (LINEAR16 for raw PCM, MP3 for MP3, etc.)
Poor quality	Try a higher-accuracy model; use 16 kHz sample rate (8 kHz telephony audio has fewer data points and will produce lower-quality results); ensure clear speech
Large file failures	Split or compress (e.g. MP3/OGG_OPUS); respect upload size limits
No Voice Profile	Ensure `voiceProfileConfig.enableVoiceProfile` is set to `true` in your request; response may also omit it if the selected model does not support it
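Many `INVALID_ARGUMENT` failures come from an `audioEncoding` value that does not match the file. One way to reduce them is to derive the value from the file extension and fall back to `AUTO_DETECT` when unsure. The encoding names below are the ones listed under Supported Audio Formats; the extension mapping and the `guess_audio_encoding` helper are illustrative assumptions, and an extension is only a hint about the real contents.

```python
from pathlib import PurePath

# ASSUMPTION: conventional extensions for each listed encoding.
# Note that a .ogg file may hold Vorbis rather than Opus audio.
EXTENSION_TO_ENCODING = {
    ".mp3": "MP3",
    ".flac": "FLAC",
    ".ogg": "OGG_OPUS",
    ".opus": "OGG_OPUS",
    ".pcm": "LINEAR16",  # headerless raw PCM
    ".raw": "LINEAR16",
}


def guess_audio_encoding(filename: str) -> str:
    """Pick an audioEncoding value from the extension, else AUTO_DETECT."""
    suffix = PurePath(filename).suffix.lower()
    return EXTENSION_TO_ENCODING.get(suffix, "AUTO_DETECT")
```

For instance, `guess_audio_encoding("call.MP3")` gives `MP3`, while an unmapped extension such as `.wav` falls back to `AUTO_DETECT`.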

For more help, see the Inworld Discord community.
