Transcribe audio to text using leading STT providers through a single API.
Inworld’s Speech-to-Text (STT) API provides a unified integration point for industry-leading transcription providers. You get consistent authentication, request formatting, and response handling across providers, without managing multiple SDKs or credentials.
The API supports both synchronous transcription for complete audio files and real-time bidirectional streaming over WebSocket for live audio.
Developer Quickstart
Make your first STT API call and get a transcript.
API Reference
View the complete API specification.
Code Examples
Browse ready-to-use GitHub samples for sync and real-time STT.
Voice agents and character-driven apps that benefit from transcription plus Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking
The Inworld first-party model is currently Experimental. Features and pricing are subject to change.
Sync transcription currently supports LINEAR16 and FLAC encodings only. MP3, OGG_OPUS, and AUTO_DETECT support for sync is coming soon.
Recommended defaults: 16,000 Hz sample rate, 16-bit depth, mono. For container formats (MP3, FLAC, OGG_OPUS, WAV), sampleRateHertz is optional — the API auto-detects it from the file header.
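As a rough illustration of the encoding and sample-rate rules above, the sketch below builds an audio-config fragment for a sync request. The helper name and the flat dictionary shape are assumptions made for this example; only the field names audioEncoding and sampleRateHertz and the encoding values come from this page. It also assumes that raw LINEAR16 audio should always state its sample rate, since it has no file header to detect it from.

```python
from typing import Optional

# Encodings this page lists as currently accepted for sync transcription.
SYNC_ENCODINGS = {"LINEAR16", "FLAC"}
DEFAULT_SAMPLE_RATE_HERTZ = 16_000  # recommended default from this page


def build_sync_audio_config(
    audio_encoding: str, sample_rate_hertz: Optional[int] = None
) -> dict:
    """Return a hypothetical audio-config fragment for a sync request."""
    if audio_encoding not in SYNC_ENCODINGS:
        raise ValueError(
            f"{audio_encoding} is not listed as supported for sync; "
            f"use one of {sorted(SYNC_ENCODINGS)}"
        )
    config = {"audioEncoding": audio_encoding}
    if audio_encoding == "LINEAR16":
        # Assumption: raw PCM carries no header, so state the rate explicitly.
        config["sampleRateHertz"] = sample_rate_hertz or DEFAULT_SAMPLE_RATE_HERTZ
    elif sample_rate_hertz is not None:
        # Container formats: the rate is optional, so include it only if given.
        config["sampleRateHertz"] = sample_rate_hertz
    return config
```

Calling build_sync_audio_config("FLAC") leaves sampleRateHertz out so the API can read it from the file header, while build_sync_audio_config("LINEAR16") falls back to 16,000 Hz.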
STT performs best with 16 kHz audio. Lower sample rates (such as 8 kHz telephony audio) contain fewer data points for the model to interpret, which reduces transcription accuracy. Upsampling low-sample-rate audio does not improve quality — it only interpolates between existing samples without adding new information.
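One way to catch low-sample-rate input before sending it is to inspect the file locally. The sketch below uses Python's standard wave module to compare a PCM WAV file against the recommended 16 kHz, 16-bit, mono defaults. The function name and warning texts are illustrative and not part of the Inworld API.

```python
import io
import wave


def check_wav(data: bytes) -> list:
    """Return warnings for WAV bytes that differ from 16 kHz, 16-bit, mono."""
    warnings = []
    with wave.open(io.BytesIO(data), "rb") as wav:
        rate = wav.getframerate()
        width_bits = wav.getsampwidth() * 8  # getsampwidth() is in bytes
        channels = wav.getnchannels()
    if rate < 16_000:
        warnings.append(
            f"sample rate is {rate} Hz; upsampling to 16000 Hz will not add detail"
        )
    elif rate != 16_000:
        warnings.append(f"sample rate is {rate} Hz; 16000 Hz is recommended")
    if width_bits != 16:
        warnings.append(f"bit depth is {width_bits}; 16-bit is recommended")
    if channels != 1:
        warnings.append(f"{channels} channels found; mono is recommended")
    return warnings
```

An empty list means the file already matches the recommended defaults. Note that wave.open only reads uncompressed PCM WAV files and raises wave.Error for anything else.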
Language support depends on the STT provider. See Model comparison above for more details.
Use language when you want to force recognition of a known language. Omit language to allow auto-detection when supported.
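The optional language field can be handled by leaving the key out entirely rather than sending an empty value. The sketch below shows that pattern. The helper name, the modelId key, and the "en-US" style of language code are assumptions for illustration; only the language field name and the model identifier come from this page.

```python
from typing import Optional


def build_transcribe_config(model_id: str, language: Optional[str] = None) -> dict:
    """Return a hypothetical request-config fragment with an optional language."""
    # "modelId" is an assumed key name; check the API Reference for the real one.
    config = {"modelId": model_id}
    if language:
        # Force recognition for a known language.
        config["language"] = language
    # When language is omitted, the key is absent so auto-detection can apply
    # where the selected model supports it.
    return config
```

For example, build_transcribe_config("inworld/inworld-stt-1") produces a config with no language key, while passing "en-US" as the second argument pins the language.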
Model choice: Use inworld/inworld-stt-1 when you want Voice Profile or Inworld-optimized turn-taking; use Groq or AssemblyAI for specific latency or accuracy needs.
Audio: Use MP3 or OGG_OPUS for file uploads to reduce size where the endpoint accepts them (sync transcription currently supports LINEAR16 and FLAC only); use LINEAR16 for streaming (required) and when you need the highest quality.
Streaming: For the Inworld model with manual turn-taking, send EndTurn at each turn boundary and CloseStream when done.
Voice Profile: Set voiceProfileConfig.enableVoiceProfile to true and optionally adjust topN (default: 10) to control how many labels per category are returned.
Test with sample audio and your target language before production.
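The manual turn-taking guidance above implies a fixed message order: audio for a turn, then EndTurn, repeated for each turn, then CloseStream. The sketch below generates that order without opening a connection. The JSON shape of the control messages ({"type": ...}) is a placeholder assumed for this example, not the documented wire format; only the EndTurn and CloseStream names come from this page.

```python
import json
from typing import Iterable, Iterator, List, Union


def turn_messages(turns: Iterable[List[bytes]]) -> Iterator[Union[bytes, str]]:
    """Yield audio chunks (bytes) and control messages (JSON text) in send order."""
    for chunks in turns:
        for chunk in chunks:
            yield chunk  # LINEAR16 audio for the current turn
        # Mark the turn boundary (hypothetical message shape).
        yield json.dumps({"type": "EndTurn"})
    # Signal that no more audio will follow (hypothetical message shape).
    yield json.dumps({"type": "CloseStream"})
```

A WebSocket client could iterate over turn_messages(...) and send each item in order, as a binary frame for bytes and a text frame for strings. The API Reference remains the authority on the actual message format.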
API key, audio encoding matches request, valid audio file
UNAUTHENTICATED: INWORLD_API_KEY is set correctly and has not expired in Portal
INVALID_ARGUMENT: audioEncoding matches the actual format (LINEAR16 for raw PCM, MP3 for MP3, etc.)
Poor quality: Try a higher-accuracy model; use a 16 kHz sample rate (8 kHz telephony audio has fewer data points and will produce lower-quality results); ensure clear speech
Large file failures: Split or compress (e.g. MP3/OGG_OPUS); respect upload size limits
No Voice Profile: Ensure voiceProfileConfig.enableVoiceProfile is set to true in your request; the response may also omit it if the selected model does not support it
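Several of the checks above can be run locally before a request is sent. The sketch below is a hypothetical pre-flight helper: the function name, the nested config shape, and the file-extension mapping are assumptions for illustration, while INWORLD_API_KEY, audioEncoding, and voiceProfileConfig.enableVoiceProfile are the names used on this page.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Assumed mapping from file extension to the audioEncoding value to declare.
EXTENSION_TO_ENCODING = {
    ".pcm": "LINEAR16",
    ".raw": "LINEAR16",
    ".flac": "FLAC",
    ".mp3": "MP3",
    ".ogg": "OGG_OPUS",
    ".opus": "OGG_OPUS",
}


def preflight(
    config: dict,
    filename: str,
    env: Optional[Mapping[str, str]] = None,
    want_voice_profile: bool = False,
) -> list:
    """Return a list of likely problems; an empty list means no issue was found."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("INWORLD_API_KEY"):
        problems.append("INWORLD_API_KEY is not set (expect UNAUTHENTICATED)")
    expected = EXTENSION_TO_ENCODING.get(Path(filename).suffix.lower())
    declared = config.get("audioEncoding")
    if expected and declared != expected:
        problems.append(
            f"audioEncoding is {declared!r} but {filename} looks like {expected} "
            "(expect INVALID_ARGUMENT)"
        )
    voice_profile = config.get("voiceProfileConfig", {})
    if want_voice_profile and not voice_profile.get("enableVoiceProfile"):
        problems.append("voiceProfileConfig.enableVoiceProfile is not true")
    return problems
```

This only catches mismatches that are visible on the client. An expired key or a model without Voice Profile support would still surface in the API response.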