Transcribe audio to text using leading STT providers through a single API.
The Realtime Speech-to-Text (STT) API provides a unified integration point for industry-leading transcription providers. You get consistent authentication, request formatting, and response handling across providers, without managing multiple SDKs or credentials.
The API supports both synchronous transcription for complete audio files and real-time bidirectional streaming over WebSocket for live audio.
Developer Quickstart
Make your first STT API call and get a transcript.
API Reference
View the complete API specification.
Code Examples
Browse ready-to-use GitHub samples for sync and real-time STT.
Voice agents and character-driven apps that benefit from transcription plus Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking
Recommended defaults: 16,000 Hz sample rate, 16-bit depth, mono. For container formats (MP3, FLAC, OGG_OPUS, WAV), sampleRateHertz is optional — the API auto-detects it from the file header.
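As a minimal sketch of applying these defaults, the helper below assembles audio settings for a request. Only `sampleRateHertz`, `audioEncoding`, and the format names come from this page; the function name `build_audio_config`, the flat dictionary layout, and the assumption that headerless encodings such as LINEAR16 need an explicit sample rate are illustrative, so check the API Reference for the actual request schema.

```python
# Container formats carry the sample rate in their header, so the API can detect it.
CONTAINER_ENCODINGS = {"MP3", "FLAC", "OGG_OPUS", "WAV"}
DEFAULT_SAMPLE_RATE_HZ = 16_000  # recommended default from this page


def build_audio_config(audio_encoding, sample_rate_hertz=None):
    """Return a dict of audio settings (hypothetical layout, not the official schema)."""
    config = {"audioEncoding": audio_encoding}
    if audio_encoding in CONTAINER_ENCODINGS:
        # sampleRateHertz is optional here; include it only if the caller set it.
        if sample_rate_hertz is not None:
            config["sampleRateHertz"] = sample_rate_hertz
    else:
        # Assumption: headerless audio such as LINEAR16 needs an explicit rate,
        # so fall back to the recommended 16 kHz when none is given.
        config["sampleRateHertz"] = sample_rate_hertz or DEFAULT_SAMPLE_RATE_HZ
    return config


print(build_audio_config("MP3"))       # {'audioEncoding': 'MP3'}
print(build_audio_config("LINEAR16"))  # {'audioEncoding': 'LINEAR16', 'sampleRateHertz': 16000}
```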
Sync transcription accepts audio files up to ~16 MB. The actual duration depends on the encoding — for example, ~18 minutes of MP3 or ~8 minutes of 16 kHz 16-bit WAV. For larger files, split them into chunks or use the WebSocket streaming endpoint.
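The WAV figure follows from simple arithmetic, sketched below. The code assumes "~16 MB" means 16,000,000 bytes and ignores the small WAV header; the function names are illustrative and not part of the API.

```python
import math

MAX_UPLOAD_BYTES = 16_000_000  # assumption: "~16 MB" read as 16 million bytes


def pcm_bytes_per_second(sample_rate_hz=16_000, bit_depth=16, channels=1):
    """Byte rate of uncompressed PCM audio."""
    return sample_rate_hz * (bit_depth // 8) * channels


def max_duration_seconds(bytes_per_second, limit_bytes=MAX_UPLOAD_BYTES):
    """Longest audio duration that fits under the size limit at a given byte rate."""
    return limit_bytes / bytes_per_second


def chunks_needed(file_size_bytes, limit_bytes=MAX_UPLOAD_BYTES):
    """Number of pieces a file must be split into so that each fits the limit."""
    return max(1, math.ceil(file_size_bytes / limit_bytes))


pcm_rate = pcm_bytes_per_second()           # 32,000 bytes per second
print(max_duration_seconds(pcm_rate) / 60)  # about 8.3 minutes of 16 kHz 16-bit mono
print(chunks_needed(40_000_000))            # 3
```

For compressed formats the byte rate is the average bitrate divided by 8, so the duration that fits depends on how the file was encoded.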
STT performs best with 16 kHz audio. Lower sample rates (such as 8 kHz telephony audio) contain fewer data points for the model to interpret, which reduces transcription accuracy. Upsampling low-sample-rate audio does not improve quality — it only interpolates between existing samples without adding new information.
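To illustrate why upsampling adds nothing new, here is a toy 2x linear upsampler. It is only a sketch: production resamplers use band-limited filters rather than straight-line interpolation, but in both cases every output sample is computed from the existing input samples. The function name is illustrative.

```python
def upsample_2x_linear(samples):
    """Roughly double the sample rate by inserting midpoints between neighbours."""
    if not samples:
        return []
    out = []
    for current, following in zip(samples, samples[1:]):
        out.append(current)
        # The new sample is derived purely from the two samples around it.
        out.append((current + following) / 2)
    out.append(samples[-1])
    return out


eight_khz = [0, 10, 20, 10]
print(upsample_2x_linear(eight_khz))  # [0, 5.0, 10, 15.0, 20, 15.0, 10]
```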
Language support depends on the STT provider. See Language Support for the full list of languages supported by the Inworld first-party model, and links to third-party provider language documentation.
Model choice — Use inworld/inworld-stt-1 when you want Voice Profile or Inworld-optimized turn-taking; use Groq/AssemblyAI/Soniox for specific latency/accuracy needs.
Audio — Use MP3/OGG_OPUS for file uploads to reduce size; use LINEAR16 for streaming (required) and when you need highest quality.
Streaming — For the Inworld model with manual turn-taking, send endTurn at each turn boundary and closeStream when done.
Speech events — Listen for speechStarted and speechStopped events in the streaming response to detect when a speaker begins and stops talking. Use these to build custom turn-taking logic or visualize voice activity.
Voice Profile — Set voiceProfileConfig.enableVoiceProfile to true and optionally adjust topN (default: 10) to control how many labels per category are returned.
Test with sample audio and your target language before production.
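The speech-event practice above can be sketched as a small state tracker. The event names `speechStarted` and `speechStopped` come from this page; how they arrive on the WebSocket (message shape, timestamps) is not specified here, so this hypothetical `VoiceActivityTracker` expects the caller to extract the event name and supply a timestamp, for example from `time.monotonic()`.

```python
class VoiceActivityTracker:
    """Tracks speaking state from speechStarted / speechStopped event names."""

    def __init__(self):
        self.speaking = False
        self._started_at = None
        self.segments = []  # completed (start_seconds, end_seconds) pairs

    def on_event(self, event_name, timestamp_seconds):
        """Update state; return the finished segment when speech stops, else None."""
        if event_name == "speechStarted" and not self.speaking:
            self.speaking = True
            self._started_at = timestamp_seconds
        elif event_name == "speechStopped" and self.speaking:
            self.speaking = False
            segment = (self._started_at, timestamp_seconds)
            self.segments.append(segment)
            self._started_at = None
            return segment
        # Other events, and repeated start/stop events, are ignored.
        return None


tracker = VoiceActivityTracker()
tracker.on_event("speechStarted", 0.4)
print(tracker.on_event("speechStopped", 2.1))  # (0.4, 2.1)
```

With manual turn-taking, a returned segment is a natural point to decide whether to send endTurn; `tracker.speaking` can drive a voice-activity indicator.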
API key, audio encoding matches request, valid audio file
UNAUTHENTICATED: INWORLD_API_KEY set correctly and not expired in Portal
INVALID_ARGUMENT: audioEncoding matches the actual format (LINEAR16 for raw PCM, MP3 for MP3, etc.)
Poor quality: Try a higher-accuracy model; use 16 kHz sample rate (8 kHz telephony audio has fewer data points and will produce lower-quality results); ensure clear speech
Large file failures: Split or compress (e.g. MP3/OGG_OPUS); respect upload size limits
No Voice Profile: Ensure voiceProfileConfig.enableVoiceProfile is set to true in your request; response may also omit it if the selected model does not support it
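Several of these checks can be run locally before sending a request. The sketch below is a hypothetical preflight helper, not part of any Inworld SDK: it checks that `INWORLD_API_KEY` is present, that the audio is non-empty and under an assumed 16,000,000-byte limit, and that the declared `audioEncoding` agrees with the file's magic bytes. The format detection is a heuristic, and it cannot tell whether a key has expired.

```python
import os

MAX_UPLOAD_BYTES = 16_000_000  # assumption: "~16 MB" read as 16 million bytes


def detect_container(header):
    """Guess the container format from a file's first bytes, or None if unknown."""
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "WAV"
    if header[:4] == b"fLaC":
        return "FLAC"
    if header[:4] == b"OggS":
        return "OGG_OPUS"  # Ogg container; the codec inside is not verified here
    if header[:3] == b"ID3":
        return "MP3"  # ID3v2 tag at the start of the file
    if len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0:
        return "MP3"  # MPEG frame sync; heuristic, raw PCM could match by chance
    return None  # headerless data, e.g. raw LINEAR16 PCM


def preflight(audio_bytes, audio_encoding, env=None):
    """Return a list of human-readable problems found before sending a request."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("INWORLD_API_KEY"):
        problems.append("INWORLD_API_KEY is not set")
    if not audio_bytes:
        problems.append("audio is empty")
    if len(audio_bytes) > MAX_UPLOAD_BYTES:
        problems.append("audio exceeds the sync upload limit; split or compress it")
    detected = detect_container(audio_bytes[:12])
    if detected is not None and detected != audio_encoding:
        problems.append(
            f"audioEncoding is {audio_encoding} but the file looks like {detected}"
        )
    return problems
```

An empty list means none of these local checks failed; the API can still reject a request for reasons this helper does not cover.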