inworld/inworld-stt-1), which adds Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking (automatic or manual).
Make your first STT API request
Create an API key
- Create an Inworld account.
- In Inworld Portal, generate an API key by going to Settings > API Keys. Copy the Base64 credentials.
- Set your API key as an environment variable.
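As a minimal sketch of the last step, the helper below reads the credentials from the environment and builds request headers. The variable name `INWORLD_API_KEY` and the use of the HTTP Basic scheme are assumptions made for illustration; confirm both against the API Reference.

```python
import os


def auth_headers(env_var: str = "INWORLD_API_KEY") -> dict:
    """Build request headers from the Base64 credentials stored in an env var.

    Assumptions (not confirmed by this page): the env var name and the
    "Basic" authorization scheme.
    """
    credentials = os.environ.get(env_var)
    if not credentials:
        raise RuntimeError(
            f"Set {env_var} to the Base64 credentials copied from Inworld Portal"
        )
    return {
        "Authorization": f"Basic {credentials}",
        "Content-Type": "application/json",
    }
```

Reading the key from the environment keeps it out of source files such as inworld_stt_quickstart.py.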
Prepare an audio file
The STT API accepts base64-encoded audio and supports multiple audio formats. Requirements vary by use case:
| Use case | Format | Notes |
|---|---|---|
| File upload (sync) | LINEAR16, MP3, OGG_OPUS, FLAC, AUTO_DETECT | Sample rate can be auto-detected from file headers when possible |
| Streaming | LINEAR16 (PCM) | Streaming accepts only LINEAR16, to minimize latency and preserve quality |
Recommended settings:
- Sample rate: 16,000 Hz (STT performs best at this rate; lower sample rates like 8 kHz contain fewer data points, reducing accuracy)
- Bit depth: 16-bit (for LINEAR16)
- Channels: Mono (1 channel)
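The recommended settings can be checked locally before uploading. The sketch below uses Python's standard `wave` module to compare a WAV file against them and then base64-encodes the file bytes. The helper names `check_wav` and `encode_audio_file` are hypothetical, and sending the file bytes unchanged (header included) is an assumption.

```python
import base64
import wave

RECOMMENDED_RATE_HZ = 16_000
RECOMMENDED_SAMPLE_WIDTH_BYTES = 2  # 16-bit samples
RECOMMENDED_CHANNELS = 1  # mono


def check_wav(path: str) -> list:
    """Return human-readable deviations from the recommended settings."""
    issues = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != RECOMMENDED_RATE_HZ:
            issues.append(
                f"sample rate is {wav.getframerate()} Hz, "
                f"recommended {RECOMMENDED_RATE_HZ} Hz"
            )
        if wav.getsampwidth() != RECOMMENDED_SAMPLE_WIDTH_BYTES:
            issues.append(
                f"bit depth is {8 * wav.getsampwidth()}-bit, recommended 16-bit"
            )
        if wav.getnchannels() != RECOMMENDED_CHANNELS:
            issues.append(
                f"{wav.getnchannels()} channels found, recommended mono"
            )
    return issues


def encode_audio_file(path: str) -> str:
    """Base64-encode the file bytes as-is (assumption: header is kept)."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
```

An empty list from `check_wav` means the file already matches the recommended settings; any other result lists what to resample or downmix first.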
sampleRateHertz is optional; the API can auto-detect it from the file header.
Send the request
Audio is sent as a JSON payload with base64-encoded audio content. The API returns the complete transcript when processing is complete (and optionally Voice Profile, when returned by the API).
Create a new file inworld_stt_quickstart.py or inworld_stt_quickstart.js and use the code below. The Inworld model (inworld/inworld-stt-1) provides transcription plus optional Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking for streaming.
Review the response
The response includes the transcript and usage fields, plus optional voiceProfile when available.
Response (sync)
| Field | Description |
|---|---|
| transcription.transcript | The transcribed text |
| transcription.isFinal | Whether the result is finalized |
| transcription.wordTimestamps | Per-word timing data (when available) |
| usage | Usage metrics for billing |
| voiceProfile | (When returned) Age, pitch, emotion, vocal_style, accent with label and confidence. Available with Inworld and supported third-party models |
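Because several of these fields are optional, client code should read them defensively. The following sketch summarizes a parsed response using only the field names from the table above. The function name `summarize_response` and the sample values are illustrative, and the inner structure of wordTimestamps, usage, and voiceProfile is deliberately left opaque.

```python
def summarize_response(response: dict) -> dict:
    """Extract the documented fields, tolerating missing optional ones."""
    transcription = response.get("transcription") or {}
    return {
        "transcript": transcription.get("transcript", ""),
        "is_final": transcription.get("isFinal", False),
        # wordTimestamps is only present when available
        "word_count": len(transcription.get("wordTimestamps") or []),
        "usage": response.get("usage"),
        # voiceProfile is only present when returned by the API
        "has_voice_profile": "voiceProfile" in response,
    }


# Illustrative input only; real responses may carry more detail.
example = {
    "transcription": {"transcript": "hello world", "isFinal": True},
    "usage": {},
}
summary = summarize_response(example)
```

Using `.get` with defaults means the same code works whether or not Voice Profile and word timestamps were returned.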
Configuration parameters
transcribeConfig / transcribe_config
| Field | Type | Required | Description |
|---|---|---|---|
| modelId / model_id | string | Yes | STT model ID. Use inworld/inworld-stt-1 for WebSocket and HTTP |
| language | string | No | BCP-47 language code (e.g. en-US). If omitted, the model may auto-detect |
| audioEncoding | string | Yes | One of: LINEAR16, MP3, OGG_OPUS, FLAC, AUTO_DETECT. For streaming, use LINEAR16 only |
| sampleRateHertz | integer | No | Sample rate in Hz. Default 16000. Can be omitted for formats with headers (MP3, FLAC, OGG_OPUS, WAV) |
| numberOfChannels | integer | No | Channel count. Default 1 |
| voiceProfileConfig | object | No | Voice Profile configuration. See below |
voiceProfileConfig / voice_profile_config
| Field | Type | Required | Description |
|---|---|---|---|
| enableVoiceProfile / enable_voice_profile | bool | Yes | Set to true to enable Voice Profile analysis |
| topN / top_n | integer | No | Number of top labels per category to return. Default: 10 |
audioData
| Field | Type | Required | Description |
|---|---|---|---|
| content | string | Yes | Base64-encoded audio bytes |
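Put together, the three tables describe a JSON body that could be assembled as in the sketch below. It assumes transcribeConfig and audioData sit side by side at the top level of the payload, with voiceProfileConfig nested inside transcribeConfig as the first table suggests. The function name `build_transcribe_request` is hypothetical, and no endpoint is called here.

```python
import base64
import json


def build_transcribe_request(
    audio_bytes: bytes,
    *,
    model_id: str = "inworld/inworld-stt-1",
    audio_encoding: str = "AUTO_DETECT",
    language=None,
    sample_rate_hertz=None,
    number_of_channels=None,
    enable_voice_profile: bool = False,
    top_n=None,
) -> dict:
    """Assemble a sync request body from the documented fields."""
    config = {"modelId": model_id, "audioEncoding": audio_encoding}
    # Optional fields are left out entirely so API defaults apply.
    if language is not None:
        config["language"] = language
    if sample_rate_hertz is not None:
        config["sampleRateHertz"] = sample_rate_hertz
    if number_of_channels is not None:
        config["numberOfChannels"] = number_of_channels
    if enable_voice_profile:
        voice_profile = {"enableVoiceProfile": True}
        if top_n is not None:
            voice_profile["topN"] = top_n
        config["voiceProfileConfig"] = voice_profile
    return {
        # Assumed layout: both objects at the top level of the body.
        "transcribeConfig": config,
        "audioData": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }


payload = build_transcribe_request(
    b"\x00\x01" * 8, language="en-US", enable_voice_profile=True, top_n=3
)
body = json.dumps(payload)  # string to send as the request body
```

Omitting unset optional fields, rather than sending nulls, lets the documented defaults (16000 Hz, 1 channel, top 10 labels) take effect.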
Streaming (WebSocket)
For real-time microphone or live audio:
- First message must contain transcribeConfig (same fields as above, including voiceProfileConfig to enable Voice Profile).
- Later messages send audioChunk with base64-encoded LINEAR16 (PCM) audio only.
- Turn and stream end:
  - To signal end of a speaker turn, send EndTurn.
  - Send CloseStream when the client is done sending audio (required for WebSocket; gRPC clients can just close the send side).
The server returns transcription results, optionally voiceProfile, and finally Usage when the stream is closed.
Streaming endpoint (WebSocket): wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional
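The message order for a streaming session can be sketched without opening a connection. The generator below yields the JSON text frames a client would send, in sequence. The exact wire shape is an assumption: the keys `audioChunk.content`, `endTurn`, and `closeStream`, and the 100 ms chunk size, are illustrative guesses based on the names above, so check them against the API Reference before use.

```python
import base64
import json
from typing import Iterator

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # 16-bit LINEAR16, mono
CHUNK_MS = 100  # hypothetical chunk duration
# 16,000 samples/s * 2 bytes * 0.1 s = 3,200 bytes per chunk
CHUNK_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHUNK_MS // 1000


def stream_messages(pcm: bytes, chunk_bytes: int = CHUNK_BYTES) -> Iterator[str]:
    """Yield JSON frames: config first, then audio, then EndTurn and CloseStream."""
    # 1. First message carries transcribeConfig.
    yield json.dumps({
        "transcribeConfig": {
            "modelId": "inworld/inworld-stt-1",
            "audioEncoding": "LINEAR16",
            "sampleRateHertz": SAMPLE_RATE_HZ,
            "numberOfChannels": 1,
        }
    })
    # 2. Later messages carry base64-encoded PCM chunks.
    for offset in range(0, len(pcm), chunk_bytes):
        chunk = pcm[offset:offset + chunk_bytes]
        encoded = base64.b64encode(chunk).decode("ascii")
        yield json.dumps({"audioChunk": {"content": encoded}})  # assumed shape
    # 3. Signal end of the speaker turn, then end of the stream.
    yield json.dumps({"endTurn": {}})  # assumed key casing
    yield json.dumps({"closeStream": {}})  # assumed key casing


frames = list(stream_messages(bytes(7000)))
```

Each yielded string would be sent as one WebSocket text message to the streaming endpoint, while server responses are read on the same connection.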
Next Steps
STT Overview
Learn about supported providers, audio formats, and endpoints.
API Reference
View the complete API specification.