Transcribe audio to text using leading STT providers through a single API.
Inworld’s Speech-to-Text (STT) API provides a unified integration point for industry-leading transcription providers. You get consistent authentication, request formatting, and response handling across providers, without managing multiple SDKs or credentials. The API supports both synchronous transcription for complete audio files and real-time bidirectional streaming over WebSocket for live audio.
Developer Quickstart
Make your first STT API call and get a transcript.
API Reference
View the complete API specification.
Code Examples
Browse ready-to-use GitHub samples for sync and real-time STT.
The Inworld first-party model is best suited to voice agents and character-driven apps that benefit from transcription plus Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking.
The Inworld first-party model is currently Experimental. Features and pricing are subject to change.
Recommended defaults: 16,000 Hz sample rate, 16-bit depth, mono. For container formats (MP3, FLAC, OGG_OPUS, WAV), sampleRateHertz is optional — the API auto-detects it from the file header.
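Before uploading a WAV file, it can help to check its header against the recommended defaults. The following is a minimal sketch that uses only Python's standard `wave` module; the helper names `describe_wav` and `mismatches` are illustrative and are not part of the Inworld API.

```python
import wave

# Recommended defaults from the text above: 16,000 Hz, 16-bit (2 bytes), mono.
RECOMMENDED = {"sample_rate_hz": 16000, "sample_width_bytes": 2, "channels": 1}


def describe_wav(path):
    """Read the format fields from a WAV file header."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
        }


def mismatches(path):
    """Return {field: (actual, recommended)} for every field that differs."""
    actual = describe_wav(path)
    return {
        key: (actual[key], wanted)
        for key, wanted in RECOMMENDED.items()
        if actual[key] != wanted
    }
```

An empty result means the file already matches the defaults. A non-empty result does not necessarily mean the file will be rejected, since these are recommendations rather than hard limits.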
Language support depends on the STT provider. See Model comparison above for more details. Use language when you want to force recognition for a known language. Omit language to allow auto-detection when supported.
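The optional fields described above can be sketched as a small config builder. Only `sampleRateHertz`, `language`, and the encoding names come from this page; the `audioEncoding` key, the function name, and the `"en-US"` value are assumptions made for illustration, so check the API Reference for the actual request shape. The sketch also assumes that raw LINEAR16 audio needs an explicit sample rate, which the text implies but does not state.

```python
# Container formats carry their sample rate in the file header.
CONTAINER_FORMATS = {"MP3", "FLAC", "OGG_OPUS", "WAV"}


def build_audio_config(encoding, sample_rate_hz=None, language=None):
    """Build a request config dict, leaving optional fields out when unset."""
    config = {"audioEncoding": encoding}  # hypothetical key name

    if sample_rate_hz is not None:
        config["sampleRateHertz"] = sample_rate_hz
    elif encoding not in CONTAINER_FORMATS:
        # Assumption: headerless audio cannot be auto-detected.
        raise ValueError(f"{encoding} has no header; pass sample_rate_hz")

    # Omitting the key leaves room for auto-detection where supported.
    if language is not None:
        config["language"] = language
    return config
```

For example, `build_audio_config("MP3")` leaves both optional fields out, while `build_audio_config("LINEAR16", 16000, "en-US")` sets both.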
Model choice: Use inworld/inworld-stt-1 when you want Voice Profile or Inworld-optimized turn-taking; use Groq/AssemblyAI for specific latency/accuracy needs.
Audio: Use MP3/OGG_OPUS for file uploads to reduce size; use LINEAR16 for streaming (required) and when you need the highest quality.
Streaming: For the Inworld model with manual turn-taking, send EndTurn at each turn boundary and CloseStream when done.
Voice Profile: Set inworldConfig.voiceProfileThreshold (e.g. 0.5) to filter low-confidence labels.
Test with sample audio and your target language before production.
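The streaming and Voice Profile practices above can be sketched as an ordered message plan for a manual turn-taking session. The model ID, `EndTurn`, `CloseStream`, and `inworldConfig.voiceProfileThreshold` come from this page; the dictionary layout, the `type` and `data` keys, and the `plan_stream` name are hypothetical and do not represent the actual wire format. The sketch only shows the ordering, and sends nothing over the network.

```python
def plan_stream(turns, voice_profile_threshold=0.5):
    """Yield messages in the order a manual turn-taking session would send them.

    `turns` is a list of turns; each turn is a list of LINEAR16 byte chunks.
    """
    # Opening configuration (hypothetical layout).
    yield {
        "type": "config",
        "model": "inworld/inworld-stt-1",
        "inworldConfig": {"voiceProfileThreshold": voice_profile_threshold},
    }
    for chunks in turns:
        for chunk in chunks:
            yield {"type": "audio", "data": chunk}
        # Mark the turn boundary after the last chunk of each turn.
        yield {"type": "EndTurn"}
    # Signal that no more audio will follow.
    yield {"type": "CloseStream"}
```

Building the plan as a generator keeps the ordering logic separate from the WebSocket client, so it can be checked against sample audio without a live connection.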