The Realtime Speech-to-Text (STT) API provides a unified integration point for industry-leading transcription providers. You get consistent authentication, request formatting, and response handling across providers, without managing multiple SDKs or credentials. The API supports both synchronous transcription for complete audio files and real-time bidirectional streaming over WebSocket for live audio.
- Developer Quickstart: Make your first STT API call and get a transcript.
- API Reference: View the complete API specification.
- Code Examples: Browse ready-to-use GitHub samples for sync and real-time STT.
Supported Providers
Inworld (first-party) — Experimental
| Model ID | Endpoints | Best for |
|---|---|---|
| inworld/inworld-stt-1 | Sync API + WebSocket | Voice agents and character-driven apps that benefit from transcription plus Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking |
The Inworld first-party model is currently Experimental. Features and pricing are subject to change.
Sync transcription currently supports LINEAR16 and FLAC encodings only. MP3, OGG_OPUS, and AUTO_DETECT support for sync is coming soon.
Groq
| Model ID | Endpoints | Best for |
|---|---|---|
| groq/whisper-large-v3 | Sync API only | General-purpose transcription for recorded audio |
AssemblyAI
| Model ID | Endpoints | Best for |
|---|---|---|
| assemblyai/universal-streaming-multilingual | WebSocket only | Multilingual streaming (English, Spanish, French, German, Italian, Portuguese) |
| assemblyai/universal-streaming-english | WebSocket only | English-optimized streaming |
| assemblyai/u3-rt-pro | WebSocket only | High-accuracy, sub-300ms latency, multilingual streaming (English, Spanish, French, German, Italian, Portuguese) |
| assemblyai/whisper-rt | WebSocket only | Real-time Whisper transcription |
AssemblyAI models currently support the WebSocket streaming endpoint only. Sync HTTP support is coming soon.
Model comparison
| Feature | inworld/inworld-stt-1 | groq/whisper-large-v3 | assemblyai/universal-streaming-multilingual | assemblyai/universal-streaming-english | assemblyai/u3-rt-pro | assemblyai/whisper-rt |
|---|---|---|---|---|---|---|
| Pricing | See pricing | See pricing | See pricing | See pricing | See pricing | See pricing |
| Endpoint | Sync API + WebSocket | Sync API only | WebSocket only | WebSocket only | WebSocket only | WebSocket only |
| Real-time streaming | Yes | No | Yes | Yes | Yes | Yes |
| Best for | Voice agents with Voice Profile and configurable turn-taking | General-purpose transcription for recorded audio | Multilingual streaming (English, Spanish, French, German, Italian, Portuguese) | English-optimized streaming | High-accuracy, sub-300ms multilingual streaming (English, Spanish, French, German, Italian, Portuguese) | Real-time Whisper transcription |
| Languages | English | 100+ (Whisper) | 6 languages | English | 6 languages | 100+ (Whisper) |
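
The comparison table can be captured as a small lookup so that client code rejects a model that does not support the endpoint it is about to call. The sketch below is illustrative only: the `SttModel` class and `models_for` helper are hypothetical names, and the data is copied from the table above rather than fetched from the API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SttModel:
    model_id: str
    sync: bool       # usable on the sync HTTP endpoint
    streaming: bool  # usable on the WebSocket streaming endpoint
    languages: str   # as listed in the comparison table


# Values transcribed from the model comparison table.
MODELS = [
    SttModel("inworld/inworld-stt-1", True, True, "English"),
    SttModel("groq/whisper-large-v3", True, False, "100+ (Whisper)"),
    SttModel("assemblyai/universal-streaming-multilingual", False, True, "6 languages"),
    SttModel("assemblyai/universal-streaming-english", False, True, "English"),
    SttModel("assemblyai/u3-rt-pro", False, True, "6 languages"),
    SttModel("assemblyai/whisper-rt", False, True, "100+ (Whisper)"),
]


def models_for(endpoint: str) -> list[str]:
    """Return the model IDs usable on 'sync' or 'streaming'."""
    if endpoint not in ("sync", "streaming"):
        raise ValueError(f"unknown endpoint kind: {endpoint!r}")
    return [m.model_id for m in MODELS if getattr(m, endpoint)]
```

Calling `models_for("sync")` returns the two models that accept complete audio files, which is a cheap guard to run before sending a request.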
Supported Audio Formats
| Format | Sync API | WebSocket Streaming |
|---|---|---|
| LINEAR16 (PCM) | | |
| MP3 | | |
| OGG_OPUS | | |
| FLAC | | |
| AUTO_DETECT | | |
sampleRateHertz is optional — the API auto-detects it from the file header.
STT performs best with 16 kHz audio. Lower sample rates (such as 8 kHz telephony audio) contain fewer data points for the model to interpret, which reduces transcription accuracy. Upsampling low-sample-rate audio does not improve quality — it only interpolates between existing samples without adding new information.
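
Before uploading, it can help to read the sample rate from the file header and flag audio below 16 kHz. The following is a minimal sketch using only the Python standard library and assuming the audio is in a WAV container; the helper names are hypothetical, and treating any rate at or above 16 kHz as acceptable is an assumption, not a documented rule.

```python
import io
import wave

RECOMMENDED_RATE_HZ = 16_000  # the rate this page says STT performs best with


def check_wav_sample_rate(data: bytes) -> tuple[int, bool]:
    """Return (sample_rate_hz, meets_recommendation) for WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        rate = wav.getframerate()
    return rate, rate >= RECOMMENDED_RATE_HZ


def make_silent_wav(rate_hz: int, seconds: float = 0.1) -> bytes:
    """Build a short mono 16-bit silent WAV, used here only for demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * int(rate_hz * seconds))
    return buf.getvalue()
```

If the check fails for 8 kHz telephony audio, the useful fix is to capture at a higher rate at the source, since upsampling afterwards adds no information.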
Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /stt/v1/transcribe | POST | Send complete audio, receive full transcript |
| /stt/v1/transcribe:streamBidirectional | WebSocket | Stream audio in real time, receive transcription chunks as they become available |
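
As a rough illustration of the sync endpoint, the sketch below assembles a POST request without sending it. Only the path `/stt/v1/transcribe`, the `INWORLD_API_KEY` variable, and the field names `audioEncoding`, `sampleRateHertz`, and `language` come from this page. The base URL, the authorization scheme, the `modelId` and `config` keys, and the base64 audio field are assumptions made for the example; check the API Reference for the actual schema.

```python
import base64
import os
from typing import Optional

BASE_URL = "https://api.inworld.ai"  # assumption: not stated on this page
TRANSCRIBE_PATH = "/stt/v1/transcribe"


def build_transcribe_request(
    audio: bytes,
    model_id: str,
    audio_encoding: str,
    language: Optional[str] = None,
    sample_rate_hertz: Optional[int] = None,
    api_key: Optional[str] = None,
) -> dict:
    """Assemble (but do not send) a sync transcription request."""
    key = api_key if api_key is not None else os.environ.get("INWORLD_API_KEY", "")
    if not key:
        raise ValueError("no API key provided and INWORLD_API_KEY is not set")

    # Hypothetical body layout; optional fields are omitted when unset.
    config = {"modelId": model_id, "audioEncoding": audio_encoding}
    if language is not None:
        config["language"] = language  # omit to allow auto-detection
    if sample_rate_hertz is not None:
        config["sampleRateHertz"] = sample_rate_hertz

    return {
        "method": "POST",
        "url": BASE_URL + TRANSCRIBE_PATH,
        "headers": {
            "Authorization": f"Basic {key}",  # assumption: auth scheme
            "Content-Type": "application/json",
        },
        "json": {
            "config": config,
            "audio": base64.b64encode(audio).decode("ascii"),
        },
    }
```

Keeping request construction separate from the network call makes the payload easy to inspect and unit test before wiring it to an HTTP client.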
Supported Languages
Language support depends on the STT provider; see Model comparison above for details. Set language when you want to force recognition of a known language. Omit language to allow auto-detection when supported.
Error Handling
Errors follow the standard gRPC status format.
| Code | Name | Description |
|---|---|---|
| 3 | INVALID_ARGUMENT | Invalid or missing request field (encoding, model ID, audio data) |
| 8 | RESOURCE_EXHAUSTED | Too many concurrent requests (rate limit) |
| 16 | UNAUTHENTICATED | Invalid or missing API key |
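
One way a client might act on these codes is to retry only the rate-limit case and surface the others immediately. This is a sketch of that policy, not documented API behavior: the code-to-name mapping comes from the table above, while the choice of which codes are retryable and the backoff parameters are assumptions.

```python
# Codes and names as listed in the error table.
GRPC_CODES = {
    3: "INVALID_ARGUMENT",
    8: "RESOURCE_EXHAUSTED",
    16: "UNAUTHENTICATED",
}

# Assumption: only rate limiting is worth retrying; bad requests and
# bad credentials will fail again until the caller fixes them.
RETRYABLE = {8}


def classify_error(code: int) -> tuple[str, bool]:
    """Return (name, should_retry) for a gRPC status code."""
    return GRPC_CODES.get(code, "UNLISTED"), code in RETRYABLE


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 8.0) -> list[float]:
    """Exponential backoff delays in seconds, capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]
```

For example, `classify_error(8)` signals a retry, and `backoff_delays(3)` gives the waits to apply between attempts.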
Best Practices
- Model choice: Use inworld/inworld-stt-1 when you want Voice Profile or Inworld-optimized turn-taking; use Groq/AssemblyAI for specific latency/accuracy needs.
- Audio: Use MP3/OGG_OPUS for file uploads to reduce size; use LINEAR16 for streaming (required) and when you need highest quality.
- Streaming: For the Inworld model with manual turn-taking, send EndTurn at each turn boundary and CloseStream when done.
- Voice Profile: Set voiceProfileConfig.enableVoiceProfile to true and optionally adjust topN (default: 10) to control how many labels per category are returned.
- Test with sample audio and your target language before production.
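
The manual turn-taking order described in the Streaming item can be sketched as a generator that yields messages in the sequence they would be sent. The dictionary shapes (`audioChunk`, `endTurn`, `closeStream`) are hypothetical stand-ins for illustration; only the EndTurn and CloseStream message names come from this page, and the actual wire format is defined in the API Reference.

```python
from typing import Iterable, Iterator


def manual_turn_messages(turns: Iterable[Iterable[bytes]]) -> Iterator[dict]:
    """Yield streaming messages in order for a sequence of speaker turns.

    Each turn is an iterable of LINEAR16 audio chunks.
    """
    for chunks in turns:
        for chunk in chunks:
            yield {"audioChunk": chunk}  # hypothetical audio message shape
        yield {"endTurn": {}}            # EndTurn at each turn boundary
    yield {"closeStream": {}}            # CloseStream once, when done
```

A WebSocket client would serialize and send each yielded message in order; keeping the sequencing in a pure generator makes the turn boundaries easy to test without a live connection.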
Troubleshooting
| Issue | What to check |
|---|---|
| No transcript | API key, audio encoding matches request, valid audio file |
| UNAUTHENTICATED | INWORLD_API_KEY set correctly and not expired in Portal |
| INVALID_ARGUMENT | audioEncoding matches the actual format (LINEAR16 for raw PCM, MP3 for MP3, etc.) |
| Poor quality | Try a higher-accuracy model; use 16 kHz sample rate (8 kHz telephony audio has fewer data points and will produce lower-quality results); ensure clear speech |
| Large file failures | Split or compress (e.g. MP3/OGG_OPUS); respect upload size limits |
| No Voice Profile | Ensure voiceProfileConfig.enableVoiceProfile is set to true in your request; response may also omit it if the selected model does not support it |
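
When INVALID_ARGUMENT points to an encoding mismatch, a quick check of the file's leading bytes can confirm whether the declared audioEncoding is plausible. The helper below is a heuristic sketch with hypothetical function names. It relies on well-known container signatures and cannot identify headerless raw PCM; mapping a WAV container to LINEAR16 assumes the file holds 16-bit PCM.

```python
from typing import Optional


def sniff_audio_encoding(data: bytes) -> Optional[str]:
    """Guess the encoding from magic bytes; None if unrecognized."""
    head = data[:64]
    if head.startswith(b"fLaC"):
        return "FLAC"
    if head.startswith(b"OggS"):
        # Ogg can carry other codecs; Opus streams begin with "OpusHead".
        return "OGG_OPUS" if b"OpusHead" in head else None
    if head.startswith(b"ID3"):
        return "MP3"  # ID3v2 tag preceding the MPEG frames
    if len(head) >= 2 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0:
        return "MP3"  # MPEG frame sync; raw PCM could match by chance
    if head.startswith(b"RIFF") and head[8:12] == b"WAVE":
        return "LINEAR16"  # assumption: WAV holding 16-bit PCM
    return None


def encoding_mismatch(declared: str, data: bytes) -> bool:
    """True when the sniffed encoding is known and differs from `declared`."""
    sniffed = sniff_audio_encoding(data)
    return sniffed is not None and sniffed != declared
```

Running `encoding_mismatch("MP3", audio_bytes)` before upload catches the common case of a FLAC or WAV file sent with the wrong audioEncoding value.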