Inworld’s Speech-to-Text (STT) API provides a unified integration point for industry-leading transcription providers. You get consistent authentication, request formatting, and response handling across providers — without managing multiple SDKs or credentials. The API supports both synchronous transcription for complete audio files and real-time bidirectional streaming over WebSocket for live audio.

Supported Providers

Groq

Model ID | Endpoints | Best for

AssemblyAI

Model ID | Endpoints | Best for
assemblyai/universal-streaming-multilingual | WebSocket only | Multilingual streaming (English, Spanish, French, German, Italian, Portuguese)
assemblyai/universal-streaming-english | WebSocket only | English-optimized streaming
AssemblyAI models currently support the WebSocket streaming endpoint only. Sync HTTP support is coming soon.
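Because the AssemblyAI models are WebSocket-only for now, a client may want to check a model ID before choosing a transport. The following is a minimal sketch of such a guard. The two model IDs come from the table above; the helper names `is_websocket_only` and `pick_transport` are hypothetical and not part of the Inworld API.

```python
# Model IDs listed above as "WebSocket only". This set is maintained by hand
# and is an assumption of this sketch, not something the API returns.
WEBSOCKET_ONLY_MODELS = frozenset({
    "assemblyai/universal-streaming-multilingual",
    "assemblyai/universal-streaming-english",
})


def is_websocket_only(model_id: str) -> bool:
    """Return True if the model is documented as streaming-only."""
    return model_id in WEBSOCKET_ONLY_MODELS


def pick_transport(model_id: str, want_sync: bool) -> str:
    """Return "sync" or "websocket", rejecting sync for streaming-only models."""
    if want_sync and is_websocket_only(model_id):
        raise ValueError(f"{model_id} supports WebSocket streaming only")
    return "sync" if want_sync else "websocket"
```

Failing early on the client side avoids sending a complete audio file to an endpoint that the chosen model cannot serve.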
For pricing details, see inworld.ai/pricing.

Supported Audio Formats

Format | Sync API | WebSocket Streaming
LINEAR16 (PCM)
MP3
OGG_OPUS
FLAC
AUTO_DETECT
Recommended defaults: 16,000 Hz sample rate, 16-bit depth, mono. For container formats (MP3, FLAC, OGG_OPUS), sampleRateHertz is optional — the API auto-detects it from the file header.
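To make the recommended defaults concrete, here is a small standard-library sketch that turns floating-point samples into 16,000 Hz, 16-bit, mono PCM bytes. It assumes LINEAR16 means signed little-endian 16-bit samples, which is the usual convention but is not stated on this page; the function names are illustrative only.

```python
import math
import struct

SAMPLE_RATE_HZ = 16_000  # recommended sample rate
BYTES_PER_SAMPLE = 2     # 16-bit depth
CHANNELS = 1             # mono


def floats_to_linear16(samples):
    """Convert floats in [-1.0, 1.0] to signed 16-bit PCM bytes.

    Assumption: little-endian byte order. Out-of-range values are clipped.
    """
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def sine_wave(freq_hz, duration_s, sample_rate=SAMPLE_RATE_HZ):
    """Generate a mono test tone as a list of floats."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


# Half a second of a 440 Hz tone: 8,000 samples * 2 bytes = 16,000 bytes.
pcm = floats_to_linear16(sine_wave(440.0, 0.5))
```

At these settings one second of audio is 32,000 bytes, which is a handy figure when sizing streaming chunks.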

Endpoints

Endpoint | Method | Description
/stt/v1/transcribe | POST | Send complete audio, receive full transcript
/stt/v1/transcribe:streamBidirectional | WebSocket | Stream audio in real time, receive transcription chunks as they become available
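The sketch below shows how a client might assemble URLs for the two endpoints and a JSON body for the synchronous call, without sending anything. Only the two paths and the `sampleRateHertz` field name come from this page. The host, the other JSON field names (`config`, `modelId`, `audioEncoding`, `audio`), and the base64 audio encoding are placeholders chosen for illustration, so check the API reference for the real request schema.

```python
import base64
import json

API_HOST = "api.example.invalid"  # placeholder host, NOT the real one
SYNC_PATH = "/stt/v1/transcribe"
STREAM_PATH = "/stt/v1/transcribe:streamBidirectional"


def endpoint_url(path, websocket=False):
    """Build an https:// URL for the sync call or a wss:// URL for streaming."""
    scheme = "wss" if websocket else "https"
    return f"{scheme}://{API_HOST}{path}"


def build_sync_body(audio_bytes, model_id, encoding="LINEAR16",
                    sample_rate_hertz=16000):
    """Serialize a hypothetical sync request body as JSON.

    Pass sample_rate_hertz=None for container formats (MP3, FLAC, OGG_OPUS),
    where the text above says the rate is auto-detected from the header.
    """
    config = {"modelId": model_id, "audioEncoding": encoding}  # assumed names
    if sample_rate_hertz is not None:
        config["sampleRateHertz"] = sample_rate_hertz
    return json.dumps({
        "config": config,
        "audio": base64.b64encode(audio_bytes).decode("ascii"),  # assumed field
    })


sync_url = endpoint_url(SYNC_PATH)
stream_url = endpoint_url(STREAM_PATH, websocket=True)
body = build_sync_body(b"\x00\x01", "provider/model-id")  # hypothetical model ID
```

Keeping URL and body construction separate from the network call makes the request easy to inspect and unit test before any credentials are involved.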