In this quickstart, you’ll send an audio file to the STT API and receive a transcript. It also highlights Inworld STT (inworld/inworld-stt-1), which adds Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking (automatic or manual).

Make your first STT API request

1

Create an API key

  1. Create an Inworld account.
  2. In Inworld Portal, generate an API key by going to Settings > API Keys. Copy the Base64 credentials.
  3. Set your API key as an environment variable:
export INWORLD_API_KEY='your-base64-api-key-here'
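Before sending requests, it can help to confirm that the variable is set and looks like Base64. The following is a minimal sketch of such a check; the helper name `load_api_key` is hypothetical, and the check only verifies the encoding (standard Base64 alphabet assumed), not that the API accepts the key.

```python
import base64
import binascii
import os


def load_api_key(env_var: str = "INWORLD_API_KEY") -> str:
    """Return the key from the environment, or raise if it is missing or not Base64."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    try:
        # validate=True rejects characters outside the standard Base64 alphabet
        base64.b64decode(key, validate=True)
    except binascii.Error as exc:
        raise RuntimeError(f"{env_var} does not look like Base64") from exc
    return key
```

Calling `load_api_key()` with the placeholder value from the export line above raises `RuntimeError`, which is a quick way to catch a key that was never filled in.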
2

Prepare an audio file

The STT API accepts audio in several formats (e.g. MP3, OGG_OPUS, FLAC, LINEAR16). Audio bytes are sent in the request payload as a base64-encoded string — base64 is the transport encoding, not the audio format. Requirements vary by use case:
| Use case | Format | Notes |
| --- | --- | --- |
| File upload (sync) | LINEAR16, MP3, OGG_OPUS, FLAC, AUTO_DETECT | Sample rate can be auto-detected from file headers when possible |
| Streaming | LINEAR16 (PCM) | Other encodings are not supported for streaming to minimize latency and preserve quality |
Recommended settings:
  • Sample rate: 16,000 Hz (STT performs best at this rate; lower sample rates like 8 kHz contain fewer data points, reducing accuracy)
  • Bit depth: 16-bit (for LINEAR16)
  • Channels: Mono (1 channel)
For file uploads (MP3, FLAC, OGG_OPUS, WAV), sampleRateHertz is optional — the API can auto-detect it from the file header.
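To check a WAV file against the recommended settings before uploading, one option is Python's standard `wave` module. This is a sketch only; `check_wav` is a hypothetical helper, and it reports deviations from the recommendations rather than hard API requirements.

```python
import wave


def check_wav(path: str) -> list[str]:
    """Return warnings for a WAV file that deviates from the recommended settings."""
    warnings = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # getsampwidth() returns bytes per sample
        channels = wav.getnchannels()
    if rate != 16000:
        warnings.append(f"sample rate is {rate} Hz, recommended 16000 Hz")
    if bits != 16:
        warnings.append(f"bit depth is {bits}-bit, recommended 16-bit")
    if channels != 1:
        warnings.append(f"{channels} channels, recommended mono")
    return warnings
```

An empty list means the file matches all three recommendations. Note that `wave` reads uncompressed PCM WAV files only, so this does not apply to MP3, FLAC, or OGG_OPUS input.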
Sync transcription accepts audio files up to ~16 MB. The actual duration depends on the encoding (e.g., ~18 minutes of MP3 or ~8 minutes of 16 kHz 16-bit WAV). For larger files, split them into chunks or use the WebSocket streaming endpoint.
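The duration figure for WAV follows from simple arithmetic: 16 kHz, 16-bit mono PCM uses 16,000 × 2 × 1 = 32,000 bytes per second, so about 16 MB holds roughly 500 seconds (a little over 8 minutes). The sketch below reproduces that estimate and shows one way to split a longer WAV file into chunks with the standard `wave` module. It assumes "16 MB" means 16,000,000 bytes of raw audio, which the text above does not specify; `max_pcm_seconds` and `split_wav` are hypothetical helper names.

```python
import wave

MAX_BYTES = 16 * 1000 * 1000  # ASSUMPTION: "~16 MB" read as 16,000,000 bytes of audio


def max_pcm_seconds(sample_rate: int = 16000, bytes_per_sample: int = 2,
                    channels: int = 1, max_bytes: int = MAX_BYTES) -> float:
    """Seconds of uncompressed PCM that fit within max_bytes."""
    return max_bytes / (sample_rate * bytes_per_sample * channels)


def split_wav(path: str, chunk_seconds: float, out_prefix: str) -> list[str]:
    """Split a WAV file into consecutive chunks of at most chunk_seconds each."""
    out_paths = []
    with wave.open(path, "rb") as src:
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        if frames_per_chunk <= 0:
            raise ValueError("chunk_seconds is too small")
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{out_prefix}_{len(out_paths):03d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Copy the source format so every chunk is a valid standalone WAV
                dst.setnchannels(src.getnchannels())
                dst.setsampwidth(src.getsampwidth())
                dst.setframerate(src.getframerate())
                dst.writeframes(frames)
            out_paths.append(out_path)
    return out_paths
```

Fixed-length splitting can cut a word in half at a chunk boundary, so transcripts near the seams may be less accurate; splitting at silences or using the streaming endpoint avoids that.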
3

Send the request

Audio is sent as a JSON payload with base64-encoded audio content. The API returns the complete transcript once processing finishes and, optionally, a Voice Profile when the API returns one.
Create a new file inworld_stt_quickstart.py or inworld_stt_quickstart.js and use the code below. The Inworld model (inworld/inworld-stt-1) provides transcription plus optional Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking for streaming.
import requests
import os
import base64

# Sync endpoint
URL = "https://api.inworld.ai/stt/v1/transcribe"

# Use a 16-bit PCM WAV file (16 kHz, mono)
with open("input.wav", "rb") as f:
    audio_content = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "transcribeConfig": {
        "modelId": "inworld/inworld-stt-1",
        "language": "en",
        "audioEncoding": "LINEAR16",
        "voiceProfileConfig": {
            "enableVoiceProfile": True,
        },
    },
    "audioData": {"content": audio_content},
}

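# os.environ[...] raises KeyError if INWORLD_API_KEY is not set,
# instead of silently sending "Basic None"
headers = {
    "Authorization": f"Basic {os.environ['INWORLD_API_KEY']}",
    "Content-Type": "application/json",
}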

response = requests.post(URL, headers=headers, json=payload)
response.raise_for_status()
result = response.json()

print("Transcript:", result["transcription"]["transcript"])

# Voice Profile (when returned by the API)
if "voiceProfile" in result and result["voiceProfile"]:
    vp = result["voiceProfile"]
    if vp.get("age"):
        print("Age:", vp["age"].get("label"), vp["age"].get("confidence"))
    if vp.get("pitch"):
        print("Pitch:", vp["pitch"].get("label"), vp["pitch"].get("confidence"))
4

Review the response

The response includes the transcript and usage fields, plus optional voiceProfile when available.
Response (sync)
| Field | Description |
| --- | --- |
| transcription.transcript | The transcribed text |
| transcription.isFinal | Whether the result is finalized |
| transcription.wordTimestamps | Per-word timing data (when available) |
| usage | Usage metrics for billing |
| voiceProfile | (When returned) Age, pitch, emotion, vocalStyle, accent with label and confidence. Available with Inworld and supported third-party models |
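The fields above can be read defensively so that missing optional parts do not raise. The sketch below assumes each Voice Profile category is an object with `label` and `confidence` keys, as the quickstart code does for age and pitch; `summarize_response` is a hypothetical helper, and the sample values in the usage note are made up for illustration.

```python
def summarize_response(result: dict) -> list[str]:
    """Build printable lines from a sync response, skipping absent optional fields."""
    lines = []
    transcription = result.get("transcription") or {}
    lines.append(f"Transcript: {transcription.get('transcript', '')}")
    if "isFinal" in transcription:
        lines.append(f"Final: {transcription['isFinal']}")

    voice_profile = result.get("voiceProfile") or {}
    for category in ("age", "pitch", "emotion", "vocalStyle", "accent"):
        entry = voice_profile.get(category)
        # ASSUMPTION: each category looks like {"label": ..., "confidence": ...}
        if isinstance(entry, dict):
            lines.append(f"{category}: {entry.get('label')} ({entry.get('confidence')})")
    return lines
```

For example, `summarize_response({"transcription": {"transcript": "hello", "isFinal": True}})` returns two lines, and a response with a `voiceProfile` object adds one line per category that is present.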
5

Configuration parameters

transcribeConfig
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| modelId | string | Yes | STT model ID. Use inworld/inworld-stt-1 for WebSocket and HTTP |
| language | string | No | ISO 639 language code (e.g. en, ja). BCP-47 codes like en-US are also accepted and converted to the base language. If omitted, the model may auto-detect. See Language Support for the full list |
| audioEncoding | string | Yes | One of: LINEAR16, MP3, OGG_OPUS, FLAC, AUTO_DETECT. For streaming, use LINEAR16 only |
| sampleRateHertz | integer | No | Sample rate in Hz. Default 16000. Can be omitted for formats with headers (MP3, FLAC, OGG_OPUS, WAV) |
| numberOfChannels | integer | No | Channel count. Default 1 |
| voiceProfileConfig | object | No | Voice Profile configuration. See below |
voiceProfileConfig
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| enableVoiceProfile | bool | Yes | Set to true to enable Voice Profile analysis |
| topN | integer | No | Number of top labels per category to return. Default: 10 |
audioData
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| content | string | Yes | Base64-encoded audio bytes |
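The parameters above can be assembled by a small helper that includes optional fields only when they are set, leaving the API defaults in effect otherwise. This is an illustrative sketch: `build_payload` is a hypothetical name, and the field names and allowed encodings are taken from the tables above.

```python
import base64
from typing import Optional

ALLOWED_ENCODINGS = {"LINEAR16", "MP3", "OGG_OPUS", "FLAC", "AUTO_DETECT"}


def build_payload(audio_bytes: bytes,
                  audio_encoding: str = "AUTO_DETECT",
                  language: Optional[str] = None,
                  sample_rate_hertz: Optional[int] = None,
                  number_of_channels: Optional[int] = None,
                  enable_voice_profile: bool = False,
                  top_n: Optional[int] = None) -> dict:
    """Build the JSON body for a sync transcription request."""
    if audio_encoding not in ALLOWED_ENCODINGS:
        raise ValueError(f"unsupported audioEncoding: {audio_encoding}")

    config: dict = {
        "modelId": "inworld/inworld-stt-1",
        "audioEncoding": audio_encoding,
    }
    # Optional fields are omitted unless set, so the documented defaults apply
    if language is not None:
        config["language"] = language
    if sample_rate_hertz is not None:
        config["sampleRateHertz"] = sample_rate_hertz
    if number_of_channels is not None:
        config["numberOfChannels"] = number_of_channels
    if enable_voice_profile:
        voice_profile: dict = {"enableVoiceProfile": True}
        if top_n is not None:
            voice_profile["topN"] = top_n
        config["voiceProfileConfig"] = voice_profile

    return {
        "transcribeConfig": config,
        "audioData": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }
```

The returned dictionary can be passed as the `json=` argument of `requests.post`, in the same way as the `payload` variable in the quickstart code.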
6

Run the code

pip install requests  # if needed
python inworld_stt_quickstart.py
Example output:
Transcript: Hey, I just wanted to check in on the delivery status for my order.

Streaming (WebSocket)

For real-time microphone or live audio:
  1. First message must contain transcribeConfig (same fields as above, including voiceProfileConfig to enable Voice Profile).
  2. Later messages send audioChunk with base64-encoded LINEAR16 (PCM) audio only.
  3. Turn and stream end:
    • To signal end of a speaker turn, send endTurn.
    • Send closeStream when the client is done sending audio (required for WebSocket; gRPC clients can just close the send side).
Example first WebSocket message:
{
  "transcribeConfig": {
    "modelId": "inworld/inworld-stt-1",
    "audioEncoding": "LINEAR16"
  }
}
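The message sequence described above can be sketched as a generator that yields the JSON text of each message in order, without opening a connection. The exact shapes of the `audioChunk` and `closeStream` messages are not given in this quickstart, so the ones below (`{"audioChunk": {"content": ...}}` and `{"closeStream": {}}`) are assumptions to be checked against the API Reference; `stream_messages` and the 100 ms chunk size are likewise illustrative choices.

```python
import base64
import json
from typing import Iterator


def stream_messages(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 100) -> Iterator[str]:
    """Yield JSON messages for one stream: config, audio chunks, then close."""
    # 16-bit mono PCM uses 2 bytes per sample: 16 kHz * 2 bytes * 0.1 s = 3200 bytes
    chunk_bytes = sample_rate * 2 * chunk_ms // 1000
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small")

    # 1. The first message carries transcribeConfig
    yield json.dumps({
        "transcribeConfig": {
            "modelId": "inworld/inworld-stt-1",
            "audioEncoding": "LINEAR16",
        }
    })

    # 2. Later messages carry base64-encoded LINEAR16 audio
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        # ASSUMPTION: the wrapper shape of audioChunk is not specified here
        yield json.dumps({"audioChunk": {"content": base64.b64encode(chunk).decode("ascii")}})

    # 3. closeStream signals that the client is done sending audio
    # ASSUMPTION: an empty object as the closeStream body
    yield json.dumps({"closeStream": {}})
```

Each yielded string would be sent as one text frame by whichever WebSocket client you use. Authentication for the WebSocket connection is not covered here.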
Responses stream back as Transcription (interim and final), optional voiceProfile, speech events (speechStarted when voice activity is detected, speechStopped when silence is detected after speech), and finally Usage when the stream is closed.
Streaming endpoint (WebSocket): wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional
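On the receiving side, a client usually keeps only finalized transcripts and treats interim ones as provisional. The sketch below assumes each incoming message is a JSON object whose `transcription` key has the same `transcript` and `isFinal` fields as the sync response; the streaming message layout is not spelled out in this quickstart, so treat that shape and the helper name `collect_final_transcripts` as assumptions.

```python
import json
from typing import Iterable


def collect_final_transcripts(raw_messages: Iterable[str]) -> list[str]:
    """Return the text of every finalized transcription, in arrival order."""
    finals = []
    for raw in raw_messages:
        message = json.loads(raw)
        # ASSUMPTION: transcription results arrive under a "transcription" key
        transcription = message.get("transcription")
        if transcription and transcription.get("isFinal"):
            finals.append(transcription.get("transcript", ""))
    return finals
```

Messages without a `transcription` key, such as speech events or usage, are skipped here; a fuller client would branch on them as well.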

Next Steps

STT Overview

Learn about supported providers, audio formats, and endpoints.

API Reference

View the complete API specification.