The Realtime Speech-to-Text (STT) API provides a unified integration point for industry-leading transcription providers. You get consistent authentication, request formatting, and response handling across providers — without managing multiple SDKs or credentials. The API supports both synchronous transcription for complete audio files and real-time bidirectional streaming over WebSocket for live audio.

Developer Quickstart

Make your first STT API call and get a transcript.

API Reference

View the complete API specification.

Code Examples

Browse ready-to-use GitHub samples for sync and real-time STT.

Supported Providers

Inworld (first-party)

| Model ID | Endpoints | Best for |
| --- | --- | --- |
| inworld/inworld-stt-1 | Sync API + WebSocket | Voice agents and character-driven apps that benefit from transcription plus Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking |

Supports 30 languages. See Language Support for the full list.

Groq

| Model ID | Endpoints | Best for |
| --- | --- | --- |
| groq/whisper-large-v3 | Sync API only | General-purpose transcription for recorded audio |

AssemblyAI

| Model ID | Endpoints | Best for |
| --- | --- | --- |
| assemblyai/universal-streaming-multilingual | WebSocket only | Multilingual streaming (English, Spanish, French, German, Italian, Portuguese) |
| assemblyai/universal-streaming-english | WebSocket only | English-optimized streaming |
| assemblyai/u3-rt-pro | WebSocket only | High-accuracy, sub-300 ms latency, multilingual streaming (English, Spanish, French, German, Italian, Portuguese) |
| assemblyai/whisper-rt | WebSocket only | Real-time Whisper transcription |

AssemblyAI models currently support the WebSocket streaming endpoint only. Sync HTTP support is coming soon.

Soniox

| Model ID | Endpoints | Best for |
| --- | --- | --- |
| soniox/stt-rt-v4 | WebSocket only | High-accuracy real-time streaming with semantic end-of-turn detection and multilingual support |

Soniox models currently support the WebSocket streaming endpoint only.

For pricing details, see Billing or inworld.ai/pricing.

Model comparison

| Feature | inworld/inworld-stt-1 | groq/whisper-large-v3 | assemblyai/universal-streaming-multilingual | assemblyai/universal-streaming-english | assemblyai/u3-rt-pro | assemblyai/whisper-rt | soniox/stt-rt-v4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Pricing | See pricing | See pricing | See pricing | See pricing | See pricing | See pricing | See pricing |
| Endpoint | Sync API + WebSocket | Sync API only | WebSocket only | WebSocket only | WebSocket only | WebSocket only | WebSocket only |
| Real-time streaming | Yes | No | Yes | Yes | Yes | Yes | Yes |
| Best for | Voice agents with Voice Profile and configurable turn-taking | General-purpose transcription for recorded audio | Multilingual streaming (English, Spanish, French, German, Italian, Portuguese) | English-optimized streaming | High-accuracy, sub-300 ms multilingual streaming (English, Spanish, French, German, Italian, Portuguese) | Real-time Whisper transcription | High-accuracy real-time streaming with semantic end-of-turn detection and multilingual support |
| Languages | 30 languages (see list) | 100+ (Whisper) | 6 languages | English | 6 languages | 100+ (Whisper) | Multilingual |
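
To make the endpoint column easier to act on, here is a minimal sketch that encodes the comparison table as a lookup and filters models by endpoint. The `MODELS` dictionary layout and the `models_for` helper are illustrative names, not part of the API; only the model IDs and their endpoint support come from the tables above.

```python
# Endpoint support transcribed from the provider tables above.
# The dict layout and helper name are illustrative, not part of the API.
MODELS = {
    "inworld/inworld-stt-1": {"sync": True, "websocket": True},
    "groq/whisper-large-v3": {"sync": True, "websocket": False},
    "assemblyai/universal-streaming-multilingual": {"sync": False, "websocket": True},
    "assemblyai/universal-streaming-english": {"sync": False, "websocket": True},
    "assemblyai/u3-rt-pro": {"sync": False, "websocket": True},
    "assemblyai/whisper-rt": {"sync": False, "websocket": True},
    "soniox/stt-rt-v4": {"sync": False, "websocket": True},
}


def models_for(endpoint: str) -> list[str]:
    """Return the model IDs that support 'sync' or 'websocket', sorted by ID."""
    if endpoint not in ("sync", "websocket"):
        raise ValueError(f"unknown endpoint kind: {endpoint!r}")
    return sorted(model_id for model_id, caps in MODELS.items() if caps[endpoint])


print(models_for("sync"))
# ['groq/whisper-large-v3', 'inworld/inworld-stt-1']
```

A check like this can run before a request is sent, so that a WebSocket-only model is never paired with the Sync API by mistake.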

Supported Audio Formats

FormatSync APIWebSocket Streaming
LINEAR16 (PCM)
MP3
OGG_OPUS
FLAC
AUTO_DETECT
Recommended defaults: 16,000 Hz sample rate, 16-bit depth, mono. For container formats (MP3, FLAC, OGG_OPUS, WAV), sampleRateHertz is optional — the API auto-detects it from the file header.
Sync transcription accepts audio files up to ~16 MB. The actual duration depends on the encoding — for example, ~18 minutes of MP3 or ~8 minutes of 16 kHz 16-bit WAV. For larger files, split them into chunks or use the WebSocket streaming endpoint.
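
The WAV figure above follows from simple arithmetic on uncompressed PCM. The sketch below reproduces it, assuming "~16 MB" means 16,000,000 bytes and ignoring the small WAV header; the function name is illustrative.

```python
def pcm_seconds(size_bytes: int, sample_rate_hz: int = 16_000,
                bits_per_sample: int = 16, channels: int = 1) -> float:
    """Duration of uncompressed PCM audio that fits in size_bytes."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return size_bytes / bytes_per_second


# Assumption: "~16 MB" is treated as 16,000,000 bytes.
limit_bytes = 16 * 1_000_000
print(round(pcm_seconds(limit_bytes) / 60, 1))  # 8.3 (minutes of 16 kHz 16-bit mono)
```

Compressed formats such as MP3 fit more audio into the same size, and the exact duration depends on the bitrate, so the MP3 figure cannot be derived this way.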
STT performs best with 16 kHz audio. Lower sample rates (such as 8 kHz telephony audio) contain fewer data points for the model to interpret, which reduces transcription accuracy. Upsampling low-sample-rate audio does not improve quality — it only interpolates between existing samples without adding new information.

Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /stt/v1/transcribe | POST | Send complete audio, receive full transcript |
| /stt/v1/transcribe:streamBidirectional | WebSocket | Stream audio in real time, receive transcription chunks as they become available |
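
As a rough illustration of a sync request, the sketch below assembles a JSON body for `/stt/v1/transcribe`. Only `audioEncoding`, `sampleRateHertz`, `voiceProfileConfig.enableVoiceProfile`, and `topN` are named on this page. Everything else here is an assumption for illustration: the `config` and `audio` nesting, the `modelId` and `content` keys, and base64 encoding of the audio. Check the API Reference for the actual schema, base URL, and authentication header before using it.

```python
import base64
import json
from typing import Optional


def build_transcribe_body(audio: bytes, model_id: str, audio_encoding: str,
                          sample_rate_hertz: Optional[int] = None,
                          enable_voice_profile: bool = False) -> dict:
    """Assemble a hypothetical request body; the field layout is NOT confirmed."""
    config = {
        "modelId": model_id,              # assumed key name
        "audioEncoding": audio_encoding,  # field name taken from this page
    }
    if sample_rate_hertz is not None:
        # Optional for container formats, per the audio format notes.
        config["sampleRateHertz"] = sample_rate_hertz
    if enable_voice_profile:
        # topN defaults to 10 according to the Best Practices section.
        config["voiceProfileConfig"] = {"enableVoiceProfile": True, "topN": 10}
    return {
        "config": config,  # assumed nesting
        "audio": {"content": base64.b64encode(audio).decode("ascii")},  # assumed
    }


body = build_transcribe_body(b"\x00\x00" * 160, "groq/whisper-large-v3",
                             "LINEAR16", sample_rate_hertz=16_000)
print(json.dumps(body["config"], indent=2))
```

Keeping body construction separate from the HTTP call makes it easy to log or unit-test the payload without sending audio over the network.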

Supported Languages

Language support depends on the STT provider. See Language Support for the full list of languages supported by the Inworld first-party model, and links to third-party provider language documentation.

Error Handling

Errors follow the standard gRPC status format.
Authentication error
{
  "code": 16,
  "message": "Unauthenticated: invalid or missing API key.",
  "details": []
}
Invalid request
{
  "code": 3,
  "message": "Unsupported audio encoding.",
  "details": []
}
Common gRPC status codes

| Code | Name | Description |
| --- | --- | --- |
| 3 | INVALID_ARGUMENT | Invalid or missing request field (encoding, model ID, audio data) |
| 8 | RESOURCE_EXHAUSTED | Too many concurrent requests (rate limit) |
| 16 | UNAUTHENTICATED | Invalid or missing API key |
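
One way to act on these codes is sketched below: parse the error body, map the numeric code to its name, and retry only on RESOURCE_EXHAUSTED. The retry policy, the backoff values, and the helper names are assumptions for illustration, not documented behavior of the API.

```python
import json

# Codes and names from the table above.
GRPC_CODE_NAMES = {3: "INVALID_ARGUMENT", 8: "RESOURCE_EXHAUSTED", 16: "UNAUTHENTICATED"}
# Assumption: only rate-limit errors are worth retrying; the other two
# need a fixed request or a fixed API key first.
RETRYABLE_CODES = {8}


def classify_error(body: str) -> tuple[str, bool]:
    """Return (status name, should_retry) for a JSON error body."""
    code = json.loads(body).get("code")
    name = GRPC_CODE_NAMES.get(code, f"UNKNOWN_CODE_{code}")
    return name, code in RETRYABLE_CODES


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 8.0) -> list[float]:
    """Exponential backoff schedule in seconds, capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]


error_body = '{"code": 16, "message": "Unauthenticated: invalid or missing API key.", "details": []}'
print(classify_error(error_body))  # ('UNAUTHENTICATED', False)
print(backoff_delays(5))           # [0.5, 1.0, 2.0, 4.0, 8.0]
```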

Best Practices

  • Model choice: Use inworld/inworld-stt-1 when you want Voice Profile or Inworld-optimized turn-taking; use groq/whisper-large-v3 for recorded audio over the Sync API, and an AssemblyAI or Soniox model for WebSocket streaming with specific latency or accuracy needs.
  • Audio: Use MP3 or OGG_OPUS for file uploads to reduce size; use LINEAR16 for streaming (required) and whenever you need the highest quality.
  • Streaming: For the Inworld model with manual turn-taking, send endTurn at each turn boundary and closeStream when done.
  • Speech events: Listen for speechStarted and speechStopped events in the streaming response to detect when a speaker begins and stops talking. Use these to build custom turn-taking logic or visualize voice activity.
  • Voice Profile: Set voiceProfileConfig.enableVoiceProfile to true and optionally adjust topN (default: 10) to control how many labels per category are returned.
  • Test with sample audio and your target language before production.
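
The streaming guidance above (LINEAR16 audio, endTurn at each turn boundary, closeStream at the end) can be sketched as a message generator. The names `endTurn` and `closeStream` come from this page. The message envelope is hypothetical: the `audioChunk` key, base64 encoding, empty-object payloads, and the 100 ms chunk size are all assumptions. See the API Reference for the real wire format.

```python
import base64
from typing import Iterator


def pcm_chunks(pcm: bytes, sample_rate_hz: int = 16_000,
               chunk_ms: int = 100) -> Iterator[bytes]:
    """Split 16-bit mono LINEAR16 audio into fixed-duration chunks."""
    bytes_per_chunk = sample_rate_hz * 2 * chunk_ms // 1000  # 2 bytes per sample
    for start in range(0, len(pcm), bytes_per_chunk):
        yield pcm[start:start + bytes_per_chunk]


def turn_messages(turns: list[bytes]) -> Iterator[dict]:
    """Yield messages for each turn, then a final closeStream.

    The dict shapes are placeholders, not the documented wire format.
    """
    for pcm in turns:
        for chunk in pcm_chunks(pcm):
            yield {"audioChunk": base64.b64encode(chunk).decode("ascii")}  # assumed key
        yield {"endTurn": {}}   # name from this page, payload shape assumed
    yield {"closeStream": {}}   # name from this page, payload shape assumed


one_second_of_silence = b"\x00" * 32_000  # 16 kHz * 2 bytes * 1 s
messages = list(turn_messages([one_second_of_silence]))
print(len(messages))  # 12: ten audio chunks, one endTurn, one closeStream
```

In a real client each yielded message would be serialized and sent over the WebSocket, while a separate reader handles transcription chunks and the speechStarted and speechStopped events.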

Troubleshooting

| Issue | What to check |
| --- | --- |
| No transcript | API key, audio encoding matches request, valid audio file |
| UNAUTHENTICATED | INWORLD_API_KEY set correctly and not expired in Portal |
| INVALID_ARGUMENT | audioEncoding matches the actual format (LINEAR16 for raw PCM, MP3 for MP3, etc.) |
| Poor quality | Try a higher-accuracy model; use 16 kHz sample rate (8 kHz telephony audio has fewer data points and will produce lower-quality results); ensure clear speech |
| Large file failures | Split or compress (e.g. MP3/OGG_OPUS); respect upload size limits |
| No Voice Profile | Ensure voiceProfileConfig.enableVoiceProfile is set to true in your request; response may also omit it if the selected model does not support it |
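
When INVALID_ARGUMENT points to an encoding mismatch, inspecting the first bytes of the file is a quick local check. The sketch below uses well-known container signatures (RIFF/WAVE, fLaC, OggS with an OpusHead packet, ID3 or an MPEG frame sync). The helper is illustrative and not part of the API, and the returned labels are descriptive only: "WAV" and "OGG" are not taken from the audio format table, and a `None` result may simply mean headerless raw PCM.

```python
from typing import Optional


def sniff_audio_format(header: bytes) -> Optional[str]:
    """Guess the format from the first bytes of a file (64 bytes is enough)."""
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "WAV"
    if header[:4] == b"fLaC":
        return "FLAC"
    if header[:4] == b"OggS":
        # An Ogg Opus stream carries an "OpusHead" packet in its first page.
        return "OGG_OPUS" if b"OpusHead" in header[:64] else "OGG"
    if header[:3] == b"ID3":
        return "MP3"  # ID3v2 tag placed before the MPEG audio frames
    if len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0:
        return "MP3"  # MPEG audio frame sync: eleven set bits
    return None  # unknown, or headerless raw PCM


print(sniff_audio_format(b"fLaC\x00\x00\x00\x22"))  # FLAC
print(sniff_audio_format(b"\x00" * 16))             # None
```

Comparing the result with the `audioEncoding` value in the request catches the common case of a file whose extension does not match its contents.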
For more help, see the Inworld Discord community.