inworld/inworld-stt-1), which adds Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking (automatic or manual).
Make your first STT API request
Create an API key
- Create an Inworld account.
- In Inworld Portal, generate an API key by going to Settings > API Keys. Copy the Base64 credentials.
- Set your API key as an environment variable.
Prepare an audio file
The STT API accepts audio in several formats (e.g. MP3, OGG_OPUS, FLAC, LINEAR16). Audio bytes are sent in the request payload as a base64-encoded string — base64 is the transport encoding, not the audio format. Requirements vary by use case:
| Use case | Format | Notes |
|---|---|---|
| File upload (sync) | LINEAR16, MP3, OGG_OPUS, FLAC, AUTO_DETECT | Sample rate can be auto-detected from file headers when possible |
| Streaming | LINEAR16 (PCM) | Other encodings are not supported for streaming to minimize latency and preserve quality |
Recommended settings:
- Sample rate: 16,000 Hz (STT performs best at this rate; lower sample rates like 8 kHz contain fewer data points, reducing accuracy)
- Bit depth: 16-bit (for LINEAR16)
- Channels: Mono (1 channel)
sampleRateHertz is optional; the API can auto-detect it from the file header.
Sync transcription accepts audio files up to ~16 MB. The actual duration depends on the encoding (e.g., ~18 minutes of MP3 or ~8 minutes of 16 kHz 16-bit WAV). For larger files, split them into chunks or use the WebSocket streaming endpoint.
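The recommendations above can be checked locally before uploading. The following is a minimal sketch that uses only the Python standard library; the function names are hypothetical, and the size check treats the ~16 MB figure as a limit on the raw file, which is an assumption.

```python
import base64
import io
import wave

# Approximate sync limit quoted above (~16 MB); assumed to apply to the raw file.
MAX_SYNC_BYTES = 16 * 1024 * 1024


def check_wav(wav_bytes: bytes) -> list:
    """Return warnings where a WAV file deviates from the recommended settings."""
    warnings = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getframerate() != 16000:
            warnings.append(f"sample rate is {wav.getframerate()} Hz, 16000 Hz recommended")
        if wav.getsampwidth() != 2:
            warnings.append(f"bit depth is {8 * wav.getsampwidth()}-bit, 16-bit recommended")
        if wav.getnchannels() != 1:
            warnings.append(f"{wav.getnchannels()} channels, mono recommended")
    if len(wav_bytes) > MAX_SYNC_BYTES:
        warnings.append("file exceeds ~16 MB; split it or use streaming")
    return warnings


def encode_audio(audio_bytes: bytes) -> str:
    """Base64-encode the file bytes for transport in the JSON payload."""
    return base64.b64encode(audio_bytes).decode("ascii")


def make_silent_wav(seconds: float = 0.1, rate: int = 16000, channels: int = 1) -> bytes:
    """Build a tiny silent 16-bit WAV in memory so the sketch runs without a real file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate) * channels)
    return buf.getvalue()
```

An empty list from `check_wav` means the file matches the recommended settings. The check only applies to WAV input; compressed formats such as MP3 or FLAC would need a different parser.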
Send the request
Audio is sent as a JSON payload with base64-encoded audio content. The API returns the complete transcript once processing finishes, along with Voice Profile data when available.
Create a new file inworld_stt_quickstart.py or inworld_stt_quickstart.js and use the code below. The Inworld model (inworld/inworld-stt-1) provides transcription plus optional Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking for streaming.
Review the response
The response includes the transcript and usage fields, plus optional voiceProfile when available.
Response (sync)
| Field | Description |
|---|---|
| transcription.transcript | The transcribed text |
| transcription.isFinal | Whether the result is finalized |
| transcription.wordTimestamps | Per-word timing data (when available) |
| usage | Usage metrics for billing |
| voiceProfile | (When returned) Age, pitch, emotion, vocalStyle, accent with label and confidence. Available with Inworld and supported third-party models |
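The request and response steps above can be sketched together in Python. This is a minimal illustration, not the official quickstart code. Several details are assumptions: the sync endpoint URL (inferred from the streaming endpoint, since this guide does not state it), the use of Basic auth with the Base64 credentials, the environment variable name INWORLD_API_KEY, and the top-level key names transcribeConfig and audioData.

```python
import base64
import json
import os
import urllib.request

# ASSUMPTION: the sync endpoint is not given in this guide. This URL is inferred
# from the streaming endpoint and may differ; check the API Reference.
STT_URL = "https://api.inworld.ai/stt/v1/transcribe"


def build_request(audio_bytes: bytes, encoding: str = "AUTO_DETECT") -> dict:
    """Assemble the JSON body from the transcribeConfig and audioData fields."""
    return {
        "transcribeConfig": {
            "modelId": "inworld/inworld-stt-1",
            "audioEncoding": encoding,
        },
        "audioData": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }


def transcribe(audio_bytes: bytes) -> dict:
    """POST the request and return the decoded JSON response."""
    api_key = os.environ["INWORLD_API_KEY"]  # hypothetical variable name
    request = urllib.request.Request(
        STT_URL,
        data=json.dumps(build_request(audio_bytes)).encode("utf-8"),
        headers={
            # ASSUMPTION: Basic auth using the Base64 credentials from the portal.
            "Authorization": f"Basic {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarize(response: dict) -> dict:
    """Pull out the documented response fields, tolerating the optional ones."""
    transcription = response.get("transcription", {})
    return {
        "transcript": transcription.get("transcript", ""),
        "is_final": transcription.get("isFinal", False),
        "word_count": len(transcription.get("wordTimestamps", [])),
        "has_voice_profile": "voiceProfile" in response,
    }
```

With the API key set, `summarize(transcribe(audio_bytes))` would return the transcript and a few flags for a file read in binary mode. `summarize` uses `.get` throughout because wordTimestamps and voiceProfile are only present when available.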
Configuration parameters
transcribeConfig
| Field | Type | Required | Description |
|---|---|---|---|
| modelId | string | Yes | STT model ID. Use inworld/inworld-stt-1 for WebSocket and HTTP |
| language | string | No | ISO 639 language code (e.g. en, ja). BCP-47 codes like en-US are also accepted and converted to the base language. If omitted, the model may auto-detect. See Language Support for the full list |
| audioEncoding | string | Yes | One of: LINEAR16, MP3, OGG_OPUS, FLAC, AUTO_DETECT. For streaming, use LINEAR16 only |
| sampleRateHertz | integer | No | Sample rate in Hz. Default 16000. Can be omitted for formats with headers (MP3, FLAC, OGG_OPUS, WAV) |
| numberOfChannels | integer | No | Channel count. Default 1 |
| voiceProfileConfig | object | No | Voice Profile configuration. See below |
voiceProfileConfig
| Field | Type | Required | Description |
|---|---|---|---|
| enableVoiceProfile | bool | Yes | Set to true to enable Voice Profile analysis |
| topN | integer | No | Number of top labels per category to return. Default: 10 |
audioData
| Field | Type | Required | Description |
|---|---|---|---|
| content | string | Yes | Base64-encoded audio bytes |
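As an illustration of how these fields combine, the sketch below builds a transcribeConfig object, leaves out optional fields that were not set, and rejects encodings that the tables above do not list. The helper name and its keyword arguments are hypothetical; only the JSON field names come from the tables.

```python
SYNC_ENCODINGS = {"LINEAR16", "MP3", "OGG_OPUS", "FLAC", "AUTO_DETECT"}


def make_transcribe_config(
    audio_encoding,
    language=None,
    sample_rate_hertz=None,
    number_of_channels=None,
    enable_voice_profile=False,
    top_n=None,
    streaming=False,
):
    """Build a transcribeConfig dict, omitting optional fields left unset."""
    # Streaming accepts LINEAR16 only; sync accepts the full list.
    allowed = {"LINEAR16"} if streaming else SYNC_ENCODINGS
    if audio_encoding not in allowed:
        raise ValueError(f"audioEncoding must be one of {sorted(allowed)}")

    config = {"modelId": "inworld/inworld-stt-1", "audioEncoding": audio_encoding}
    if language is not None:
        config["language"] = language
    if sample_rate_hertz is not None:
        config["sampleRateHertz"] = sample_rate_hertz
    if number_of_channels is not None:
        config["numberOfChannels"] = number_of_channels
    if enable_voice_profile:
        voice_profile = {"enableVoiceProfile": True}
        if top_n is not None:
            voice_profile["topN"] = top_n
        config["voiceProfileConfig"] = voice_profile
    return config
```

For example, `make_transcribe_config("FLAC", language="en", enable_voice_profile=True, top_n=3)` produces a config with a nested voiceProfileConfig, while `make_transcribe_config("MP3", streaming=True)` raises a ValueError.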
Streaming (WebSocket)
For real-time microphone or live audio:
- First message must contain transcribeConfig (same fields as above, including voiceProfileConfig to enable Voice Profile).
- Later messages send audioChunk with base64-encoded LINEAR16 (PCM) audio only.
- Turn and stream end:
  - To signal end of a speaker turn, send endTurn.
  - Send closeStream when the client is done sending audio (required for WebSocket; gRPC clients can just close the send side).
Alongside transcription results, the server can return voiceProfile, speech events (speechStarted when voice activity is detected, speechStopped when silence is detected after speech), and finally Usage when the stream is closed.
Streaming endpoint (WebSocket): wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional
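The client-side message order described above can be sketched as a generator of JSON text frames. The message names transcribeConfig, audioChunk, endTurn, and closeStream come from this guide, but the exact wire shape of each message (for example, wrapping the audio as `{"audioChunk": {"content": ...}}` or sending empty objects for endTurn and closeStream) is an assumption, as is the chunk size. The frames would be sent over a WebSocket client of your choice.

```python
import base64
import json

# 100 ms of 16 kHz, 16-bit mono PCM: 16000 samples/s * 2 bytes * 0.1 s.
# The chunk duration itself is an arbitrary choice for this sketch.
CHUNK_BYTES = 3200


def stream_messages(pcm: bytes, config: dict, chunk_bytes: int = CHUNK_BYTES):
    """Yield the JSON text frames for one speaker turn, in protocol order."""
    # 1. The first message carries transcribeConfig.
    yield json.dumps({"transcribeConfig": config})

    # 2. Later messages carry base64-encoded LINEAR16 (PCM) audio.
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        # ASSUMPTION: each chunk is wrapped as {"audioChunk": {"content": ...}}.
        encoded = base64.b64encode(chunk).decode("ascii")
        yield json.dumps({"audioChunk": {"content": encoded}})

    # 3. Signal the end of the turn, then that no more audio will be sent.
    # ASSUMPTION: both control messages are sent as empty objects.
    yield json.dumps({"endTurn": {}})
    yield json.dumps({"closeStream": {}})
```

With manual turn-taking the client sends endTurn itself, as shown; with automatic turn-taking that message may be unnecessary, so treat step 3 as illustrative.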
Next Steps
STT Overview
Learn about supported providers, audio formats, and endpoints.
API Reference
View the complete API specification.