In this quickstart, you’ll send an audio file to the STT API and receive a transcript. It also highlights Inworld STT (inworld/inworld-stt-1), which adds Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking (automatic or manual).

Make your first STT API request

1. Create an API key

  1. Create an Inworld account.
  2. In Inworld Portal, generate an API key by going to Settings > API Keys. Copy the Base64 credentials.
  3. Set your API key as an environment variable:
export INWORLD_API_KEY='your-base64-api-key-here'
2. Prepare an audio file

The STT API accepts base64-encoded audio and supports multiple audio formats. Requirements vary by use case:
| Use case | Format | Notes |
| --- | --- | --- |
| File upload (sync) | LINEAR16, MP3, OGG_OPUS, FLAC, AUTO_DETECT | Sample rate can be auto-detected from file headers when possible |
| Streaming | LINEAR16 (PCM) | Other encodings are not supported for streaming, to minimize latency and preserve quality |
Recommended settings:
  • Sample rate: 16,000 Hz (STT performs best at this rate; lower sample rates like 8 kHz contain fewer data points, reducing accuracy)
  • Bit depth: 16-bit (for LINEAR16)
  • Channels: Mono (1 channel)
For file uploads (MP3, FLAC, OGG_OPUS, WAV), sampleRateHertz is optional — the API can auto-detect it from the file header.
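If you already have a WAV file, you can sanity-check it against the recommended settings before uploading. A minimal sketch using Python's standard-library `wave` module (`input.wav` is a placeholder path):

```python
import wave

def check_wav(path: str) -> dict:
    """Inspect a WAV header and report whether it matches the recommended STT settings."""
    with wave.open(path, "rb") as w:
        params = {
            "sample_rate": w.getframerate(),
            "channels": w.getnchannels(),
            "bit_depth": w.getsampwidth() * 8,
        }
    # Recommended: 16 kHz, mono, 16-bit PCM
    params["matches_recommended"] = (
        params["sample_rate"] == 16000
        and params["channels"] == 1
        and params["bit_depth"] == 16
    )
    return params
```

A file that doesn't match can be re-encoded first, e.g. with ffmpeg: `ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le input.wav`.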
3. Send the request

Audio is sent as a JSON payload with base64-encoded audio content. The API returns the complete transcript when processing finishes, plus an optional Voice Profile when the API returns one. Create a new file inworld_stt_quickstart.py or inworld_stt_quickstart.js and use the code below. The Inworld model (inworld/inworld-stt-1) provides transcription plus optional Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking for streaming.
import requests
import os
import base64

# Sync endpoint
URL = "https://api.inworld.ai/stt/v1/transcribe"

# Use a 16-bit PCM WAV file (16 kHz, mono)
with open("input.wav", "rb") as f:
    audio_content = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "transcribe_config": {
        "model_id": "inworld/inworld-stt-1",
        "language": "en-US",
        "audio_encoding": "LINEAR16",
        "voice_profile_config": {
            "enable_voice_profile": True,
        },
    },
    "audio_data": {"content": audio_content},
}

headers = {
    "Authorization": f"Basic {os.getenv('INWORLD_API_KEY')}",
    "Content-Type": "application/json",
}

response = requests.post(URL, headers=headers, json=payload)
response.raise_for_status()
result = response.json()

print("Transcript:", result["transcription"]["transcript"])

# Voice Profile (when returned by the API)
if "voiceProfile" in result and result["voiceProfile"]:
    vp = result["voiceProfile"]
    if vp.get("age"):
        print("Age:", vp["age"].get("label"), vp["age"].get("confidence"))
    if vp.get("pitch"):
        print("Pitch:", vp["pitch"].get("label"), vp["pitch"].get("confidence"))
4. Review the response

The response includes the transcript and usage fields, plus optional voiceProfile when available.

Response (sync)
| Field | Description |
| --- | --- |
| transcription.transcript | The transcribed text |
| transcription.isFinal | Whether the result is finalized |
| transcription.wordTimestamps | Per-word timing data (when available) |
| usage | Usage metrics for billing |
| voiceProfile | (When returned) Age, pitch, emotion, vocal_style, accent with label and confidence. Available with Inworld and supported third-party models |
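As a concrete illustration, a sync response might look like the following. The field values here are invented for illustration, not real API output:

```json
{
  "transcription": {
    "transcript": "Hey, I just wanted to check in on the delivery status for my order.",
    "isFinal": true
  },
  "usage": {},
  "voiceProfile": {
    "age": { "label": "adult", "confidence": 0.91 },
    "pitch": { "label": "medium", "confidence": 0.84 }
  }
}
```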
5. Configuration parameters

transcribeConfig / transcribe_config

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| modelId / model_id | string | Yes | STT model ID. Use inworld/inworld-stt-1 for WebSocket and HTTP |
| language | string | No | BCP-47 language code (e.g. en-US). If omitted, the model may auto-detect |
| audioEncoding | string | Yes | One of: LINEAR16, MP3, OGG_OPUS, FLAC, AUTO_DETECT. For streaming, use LINEAR16 only |
| sampleRateHertz | integer | No | Sample rate in Hz. Default 16000. Can be omitted for formats with headers (MP3, FLAC, OGG_OPUS, WAV) |
| numberOfChannels | integer | No | Channel count. Default 1 |
| voiceProfileConfig | object | No | Voice Profile configuration. See below |

voiceProfileConfig / voice_profile_config

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| enableVoiceProfile / enable_voice_profile | bool | Yes | Set to true to enable Voice Profile analysis |
| topN / top_n | integer | No | Number of top labels per category to return. Default: 10 |

audioData

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| content | string | Yes | Base64-encoded audio bytes |
6. Run the code

pip install requests  # if needed
python inworld_stt_quickstart.py
Example output:
Transcript: Hey, I just wanted to check in on the delivery status for my order.

Streaming (WebSocket)

For real-time microphone or live audio:
  1. First message must contain transcribeConfig (same fields as above, including voiceProfileConfig to enable Voice Profile).
  2. Later messages send audioChunk with base64-encoded LINEAR16 (PCM) audio only.
  3. End of turn and end of stream:
    • To signal end of a speaker turn, send EndTurn.
    • Send CloseStream when the client is done sending audio (required for WebSocket; gRPC clients can just close the send side).
Example first WebSocket message:
{
  "transcribeConfig": {
    "modelId": "inworld/inworld-stt-1",
    "audioEncoding": "LINEAR16"
  }
}
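Subsequent audioChunk messages (step 2 above) can be produced from a 16-bit PCM WAV file with a small generator. The `{"audioChunk": {"content": ...}}` wrapper shape is an assumption based on the field names in this section — check the API Reference for the exact schema:

```python
import base64
import json
import wave

def audio_chunk_messages(path: str, chunk_ms: int = 100):
    """Yield JSON messages wrapping base64 LINEAR16 chunks from a 16-bit PCM WAV file.

    chunk_ms controls how much audio each message carries (~100 ms is an
    arbitrary choice, not an API requirement).
    """
    with wave.open(path, "rb") as w:
        frames_per_chunk = int(w.getframerate() * chunk_ms / 1000)
        while True:
            frames = w.readframes(frames_per_chunk)
            if not frames:
                break
            yield json.dumps({
                "audioChunk": {
                    "content": base64.b64encode(frames).decode("utf-8"),
                }
            })
```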
Responses stream back as Transcription (interim and final), optional voiceProfile, and finally Usage when the stream is closed.

Streaming endpoint (WebSocket): wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional

Next Steps

STT Overview

Learn about supported providers, audio formats, and endpoints.

API Reference

View the complete API specification.