Inworld’s Speech-to-Text (STT) API provides a unified integration point for industry-leading transcription providers. You get consistent authentication, request formatting, and response handling across providers — without managing multiple SDKs or credentials. The API supports both synchronous transcription for complete audio files and real-time bidirectional streaming over WebSocket for live audio.

Developer Quickstart

Make your first STT API call and get a transcript.

API Reference

View the complete API specification.

Code Examples

Browse ready-to-use GitHub samples for sync and real-time STT.

Supported Providers

Inworld (first-party) — Experimental

Model ID | Endpoints | Best for
inworld/inworld-stt-1 | Sync API + WebSocket | Voice agents and character-driven apps that benefit from transcription plus Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking
The Inworld first-party model is currently Experimental. Features and pricing are subject to change.

Groq

Model ID | Endpoints | Best for
groq/whisper-large-v3 | Sync API only | General-purpose transcription for recorded audio

AssemblyAI

Model ID | Endpoints | Best for
assemblyai/universal-streaming-multilingual | WebSocket only | Multilingual streaming (English, Spanish, French, German, Italian, Portuguese)
assemblyai/universal-streaming-english | WebSocket only | English-optimized streaming
assemblyai/u3-rt-pro | WebSocket only | High-accuracy, sub-300ms latency, multilingual streaming (English, Spanish, French, German, Italian, Portuguese)
assemblyai/whisper-rt | WebSocket only | Real-time Whisper transcription
AssemblyAI models currently support the WebSocket streaming endpoint only. Sync HTTP support is coming soon.
For pricing details, see Billing or inworld.ai/pricing.

Model comparison

Feature | inworld/inworld-stt-1 | groq/whisper-large-v3 | assemblyai/universal-streaming-multilingual | assemblyai/universal-streaming-english | assemblyai/u3-rt-pro | assemblyai/whisper-rt
Pricing | $0.28/hour | $0.111/hour | $0.15/hour | $0.15/hour | $0.45/hour | $0.30/hour
Endpoint | Sync API + WebSocket | Sync API only | WebSocket only | WebSocket only | WebSocket only | WebSocket only
Real-time streaming | Yes | No | Yes | Yes | Yes | Yes
Best for | Voice agents with Voice Profile and configurable turn-taking | General-purpose transcription for recorded audio | Multilingual streaming (English, Spanish, French, German, Italian, Portuguese) | English-optimized streaming | High-accuracy, sub-300ms multilingual streaming (English, Spanish, French, German, Italian, Portuguese) | Real-time Whisper transcription
Languages | English | 100+ (Whisper) | 6 languages | English | 6 languages | 100+ (Whisper)
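If you select models programmatically, the comparison above can be captured as a small lookup table. The following is a minimal sketch, not part of any Inworld SDK: the `SttModel` class and the `models_for` and `estimate_cost` helpers are hypothetical names, the prices are copied from the table, and the cost estimate assumes billing is prorated linearly by audio duration, which this page does not confirm.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SttModel:
    price_per_hour: float  # USD per audio hour, copied from the comparison table
    sync: bool             # available on the Sync API
    websocket: bool        # available on the WebSocket streaming endpoint


# Hypothetical local registry mirroring the comparison table.
MODELS = {
    "inworld/inworld-stt-1": SttModel(0.28, True, True),
    "groq/whisper-large-v3": SttModel(0.111, True, False),
    "assemblyai/universal-streaming-multilingual": SttModel(0.15, False, True),
    "assemblyai/universal-streaming-english": SttModel(0.15, False, True),
    "assemblyai/u3-rt-pro": SttModel(0.45, False, True),
    "assemblyai/whisper-rt": SttModel(0.30, False, True),
}


def models_for(endpoint: str) -> List[str]:
    """Return the model IDs usable on 'sync' or 'websocket'."""
    if endpoint not in ("sync", "websocket"):
        raise ValueError("endpoint must be 'sync' or 'websocket'")
    return [model_id for model_id, m in MODELS.items() if getattr(m, endpoint)]


def estimate_cost(model_id: str, audio_seconds: float) -> float:
    """Estimate cost in USD, assuming linear proration (an assumption)."""
    return MODELS[model_id].price_per_hour * audio_seconds / 3600.0
```

For example, `models_for("sync")` returns the Inworld and Groq model IDs, which are the only two that accept complete audio files today.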

Supported Audio Formats

Format | Sync API | WebSocket Streaming
LINEAR16 (PCM)
MP3
OGG_OPUS
FLAC
AUTO_DETECT
Recommended defaults: 16,000 Hz sample rate, 16-bit depth, mono. For container formats (MP3, FLAC, OGG_OPUS, WAV), sampleRateHertz is optional — the API auto-detects it from the file header.
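The recommended defaults translate into a small amount of arithmetic when you size streaming chunks. The sketch below is illustrative only: `audio_config` and `pcm_chunk_bytes` are hypothetical helpers, and while the field names `audioEncoding` and `sampleRateHertz` appear on this page, where they sit in the request body is an assumption here.

```python
from typing import Optional


def audio_config(encoding: str, sample_rate_hertz: Optional[int] = None) -> dict:
    """Build the audio portion of a request (placement in the body is assumed)."""
    config = {"audioEncoding": encoding}
    if encoding == "LINEAR16":
        # Raw PCM has no header, so state the rate; default to the recommended 16 kHz.
        config["sampleRateHertz"] = sample_rate_hertz or 16000
    elif sample_rate_hertz is not None:
        # Container formats carry the rate in their header, so it stays optional.
        config["sampleRateHertz"] = sample_rate_hertz
    return config


def pcm_chunk_bytes(chunk_ms: int, sample_rate: int = 16000,
                    bit_depth: int = 16, channels: int = 1) -> int:
    """Bytes of raw PCM covering chunk_ms milliseconds of audio."""
    bytes_per_second = sample_rate * (bit_depth // 8) * channels
    return bytes_per_second * chunk_ms // 1000
```

With the defaults, one second of audio is 32,000 bytes, so a 100 ms chunk is 3,200 bytes.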

Endpoints

Endpoint | Method | Description
/stt/v1/transcribe | POST | Send complete audio, receive full transcript
/stt/v1/transcribe:streamBidirectional | WebSocket | Stream audio in real time, receive transcription chunks as they become available
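Both endpoints share a path prefix, so the full URLs can be derived from a single base. This is a rough sketch under stated assumptions: `BASE_URL` is a placeholder host (take the real one from the API Reference), and the `Basic` authorization scheme is inferred from the "Base64-encoded credentials" wording in the error samples rather than confirmed on this page.

```python
from urllib.parse import urlsplit, urlunsplit

BASE_URL = "https://api.example.com"  # placeholder host, an assumption

SYNC_PATH = "/stt/v1/transcribe"
STREAM_PATH = "/stt/v1/transcribe:streamBidirectional"


def sync_url(base: str = BASE_URL) -> str:
    """URL for POSTing complete audio."""
    return base.rstrip("/") + SYNC_PATH


def stream_url(base: str = BASE_URL) -> str:
    """WebSocket URL for real-time streaming (https -> wss, http -> ws)."""
    parts = urlsplit(base.rstrip("/") + STREAM_PATH)
    scheme = {"https": "wss", "http": "ws"}[parts.scheme]
    return urlunsplit((scheme, parts.netloc, parts.path, "", ""))


def auth_headers(api_key_b64: str) -> dict:
    """Headers for a request; the Basic scheme is an assumption."""
    return {"Authorization": "Basic " + api_key_b64}
```

Keeping the base URL as a parameter makes it easy to point the same code at a local mock server during testing.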

Supported Languages

Language support depends on the STT provider. See Model comparison above for more details. Use language when you want to force recognition for a known language. Omit language to allow auto-detection when supported.

Error Handling

Responses use a consistent error shape. Providers may add extra codes.

Authentication error
{
  "error": {
    "code": "AUTHENTICATION_FAILED",
    "message": "Invalid API key. Expected Base64-encoded credentials."
  }
}
Rate limit
{
  "error": {
    "code": "RATE_LIMIT_EXCEEDED",
    "message": "Maximum concurrent requests exceeded. Retry after 5 seconds."
  }
}
Common codes
Code | Description
AUTHENTICATION_FAILED | Invalid or missing API key
RATE_LIMIT_EXCEEDED | Too many concurrent requests
INVALID_AUDIO_FORMAT | Encoding does not match the declared format
UNSUPPORTED_LANGUAGE | Language not supported by the selected provider
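Because the error shape is consistent, one parser can cover every provider. The sketch below shows one way to read the `error` object and decide whether to retry. `parse_error` and `retry_delay_seconds` are hypothetical helpers, and extracting the delay from the message text is an assumption based only on the sample rate-limit message above, so treat it as fragile.

```python
import json
import re
from typing import Optional, Tuple

# Only rate limiting is treated as retryable here; the other common codes
# indicate a problem with the request itself.
RETRYABLE_CODES = {"RATE_LIMIT_EXCEEDED"}


def parse_error(body: str) -> Optional[Tuple[str, str]]:
    """Return (code, message) if body matches the error shape, else None."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        return None
    return str(error.get("code", "")), str(error.get("message", ""))


def retry_delay_seconds(code: str, message: str,
                        default: float = 1.0) -> Optional[float]:
    """Seconds to wait before retrying, or None if the error is not retryable."""
    if code not in RETRYABLE_CODES:
        return None
    match = re.search(r"Retry after (\d+(?:\.\d+)?) seconds?", message)
    return float(match.group(1)) if match else default
```

Unknown provider-specific codes fall through as non-retryable, which is the conservative choice until you know what they mean.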

Best Practices

  • Model choice — Use inworld/inworld-stt-1 when you want Voice Profile or Inworld-optimized turn-taking; use Groq/AssemblyAI for specific latency/accuracy needs.
  • Audio — Use MP3/OGG_OPUS for file uploads to reduce size; use LINEAR16 for streaming (required) and when you need highest quality.
  • Streaming — For Inworld model with manual turn-taking, send EndTurn at each turn boundary and CloseStream when done.
  • Voice Profile — Set inworldConfig.voiceProfileThreshold (e.g. 0.5) to filter low-confidence labels.
  • Test with sample audio and your target language before production.
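The streaming guidance above implies a fixed message order for manual turn-taking: audio chunks, then EndTurn at each turn boundary, then CloseStream once at the end. The following sketch models only that ordering. It deliberately avoids the wire format, which this page does not specify, so the `("audio", ...)`, `("EndTurn", None)`, and `("CloseStream", None)` tuples are hypothetical stand-ins that your WebSocket client would serialize.

```python
from typing import Iterable, Iterator, Optional, Tuple

Event = Tuple[str, Optional[bytes]]


def chunk_pcm(pcm: bytes, chunk_bytes: int = 3200) -> Iterator[bytes]:
    """Split raw LINEAR16 audio into chunks (3200 bytes = 100 ms at 16 kHz mono)."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def turn_events(turns: Iterable[bytes], chunk_bytes: int = 3200) -> Iterator[Event]:
    """Yield events in send order for a sequence of user turns."""
    for pcm in turns:
        for chunk in chunk_pcm(pcm, chunk_bytes):
            yield ("audio", chunk)
        yield ("EndTurn", None)      # mark the turn boundary
    yield ("CloseStream", None)      # sent once, after the final turn
```

Separating the ordering from the transport lets you unit-test turn handling without opening a connection.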

Troubleshooting

Issue | What to check
No transcript | API key, audio encoding matches request, valid audio file
AUTHENTICATION_FAILED | INWORLD_API_KEY set correctly and not expired in Portal
INVALID_AUDIO_FORMAT | audioEncoding matches the actual format (LINEAR16 for raw PCM, MP3 for MP3, etc.)
Poor quality | Try a higher-accuracy model; use ≥16 kHz, clear speech
Large file failures | Split or compress (e.g. MP3/OGG_OPUS); respect upload size limits
No Voice Profile | Response may omit it if not available for the selected model or request
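When you hit INVALID_AUDIO_FORMAT, a quick local check of the file's leading bytes can show whether the declared audioEncoding is plausible. This is a heuristic sketch using well-known container signatures, not an Inworld API. `sniff_audio_encoding` is a hypothetical helper, an Ogg file is reported as OGG_OPUS without confirming the codec inside it, and the "WAV" label is descriptive only, since this page does not list it as an audioEncoding value.

```python
from typing import Optional


def sniff_audio_encoding(data: bytes) -> Optional[str]:
    """Guess the format from leading bytes; None means no known header."""
    if data[:4] == b"fLaC":
        return "FLAC"
    if data[:4] == b"OggS":
        # Ogg container; the codec inside (Opus vs. Vorbis) is not checked.
        return "OGG_OPUS"
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "WAV"
    if data[:3] == b"ID3":
        return "MP3"  # ID3v2 tag that commonly precedes MP3 frames
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE6) == 0xE2:
        # 11-bit MPEG frame sync plus Layer III bits; raw PCM can still
        # match by coincidence, so treat this as a hint only.
        return "MP3"
    return None


def looks_consistent(declared: str, data: bytes) -> bool:
    """True if the declared encoding does not contradict the sniffed header."""
    sniffed = sniff_audio_encoding(data)
    if declared == "LINEAR16":
        return sniffed is None  # raw PCM carries no header
    return sniffed is None or sniffed == declared
```

A `None` result is expected for raw PCM, so it only counts against the request when you declared a container format.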
For more help, see the Inworld Discord community.