In this quickstart, you’ll send an audio file to the STT API and receive a transcript. It also highlights Inworld STT (inworld/inworld-stt-1), which adds Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking (automatic or manual).

Make your first STT API request

1

Create an API key

  1. Create an Inworld account.
  2. In Inworld Portal, generate an API key by going to Settings > API Keys. Copy the Base64 credentials.
  3. Set your API key as an environment variable.
export INWORLD_API_KEY='your-base64-api-key-here'
2

Prepare an audio file

The STT API accepts base64-encoded audio and supports multiple audio formats. Requirements vary by use case:
| Use case | Format | Notes |
| --- | --- | --- |
| File upload (sync) | LINEAR16, MP3, OGG_OPUS, FLAC, AUTO_DETECT | Sample rate can be auto-detected from file headers when possible |
| Streaming | LINEAR16 (PCM) | Other encodings are not supported for streaming to minimize latency and preserve quality |
Recommended settings:
  • Sample rate: 16,000 Hz (STT performs best at this rate; lower sample rates such as 8 kHz capture fewer samples per second, which reduces accuracy)
  • Bit depth: 16-bit (for LINEAR16)
  • Channels: Mono (1 channel)
For file uploads (MP3, FLAC, OGG_OPUS, WAV), sampleRateHertz is optional — the API can auto-detect it from the file header.
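Before uploading, it can help to confirm that a WAV file matches the recommended settings. The sketch below uses only Python's standard `wave` module; the helper names `describe_wav` and `matches_recommended` are illustrative and not part of the Inworld API.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Read sample rate, bit depth, and channel count from a PCM WAV header."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "bit_depth": wav.getsampwidth() * 8,  # sample width is in bytes
            "channels": wav.getnchannels(),
        }


def matches_recommended(info: dict) -> bool:
    """True when the audio is 16 kHz, 16-bit, mono (the recommended settings)."""
    return info == {"sample_rate_hz": 16000, "bit_depth": 16, "channels": 1}
```

For example, `matches_recommended(describe_wav(open("input.wav", "rb").read()))` reports whether `input.wav` already uses the recommended settings. A file that does not match may still be accepted; this check only flags a difference from the recommendation.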
3

Send the request

Audio is sent as a JSON payload with base64-encoded audio content. The API returns the full transcript once processing finishes, along with a Voice Profile when the API returns one.
Create a new file, inworld_stt_quickstart.py or inworld_stt_quickstart.js, and use the code below (the Python version is shown here). The Inworld model (inworld/inworld-stt-1) provides transcription plus optional Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking for streaming.
import requests
import os
import base64

# Sync endpoint
URL = "https://api.inworld.ai/stt/v1/transcribe"

# Use a 16-bit PCM WAV file (16 kHz, mono)
with open("input.wav", "rb") as f:
    audio_content = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "transcribeConfig": {
        "modelId": "inworld/inworld-stt-1",
        "language": "en-US",
        "audioEncoding": "LINEAR16",
        "voiceProfileConfig": {
            "enableVoiceProfile": True,
        },
    },
    "audioData": {"content": audio_content},
}

headers = {
    "Authorization": f"Basic {os.getenv('INWORLD_API_KEY')}",
    "Content-Type": "application/json",
}

response = requests.post(URL, headers=headers, json=payload)
response.raise_for_status()
result = response.json()

print("Transcript:", result["transcription"]["transcript"])

# Voice Profile (when returned by the API)
if "voiceProfile" in result and result["voiceProfile"]:
    vp = result["voiceProfile"]
    if vp.get("age"):
        print("Age:", vp["age"].get("label"), vp["age"].get("confidence"))
    if vp.get("pitch"):
        print("Pitch:", vp["pitch"].get("label"), vp["pitch"].get("confidence"))
4

Review the response

The response includes the transcript and usage fields, plus optional voiceProfile when available.
Response (sync)
| Field | Description |
| --- | --- |
| transcription.transcript | The transcribed text |
| transcription.isFinal | Whether the result is finalized |
| transcription.wordTimestamps | Per-word timing data (when available) |
| usage | Usage metrics for billing |
| voiceProfile | (When returned) Age, pitch, emotion, vocalStyle, accent with label and confidence. Available with Inworld and supported third-party models |
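The fields above can be collected defensively, since voiceProfile and some transcription fields are optional. The following is a minimal sketch, assuming each Voice Profile category is an object with `label` and `confidence` keys (the shape used in the quickstart code); `summarize_response` is a hypothetical helper, not part of the API.

```python
def summarize_response(result: dict) -> dict:
    """Pull the transcript and any Voice Profile labels out of a sync response."""
    transcription = result.get("transcription") or {}
    summary = {
        "transcript": transcription.get("transcript", ""),
        "is_final": transcription.get("isFinal"),
        "voice_profile": {},
    }
    # voiceProfile is optional; treat a missing or empty value the same way.
    voice_profile = result.get("voiceProfile") or {}
    for category in ("age", "pitch", "emotion", "vocalStyle", "accent"):
        entry = voice_profile.get(category)
        if entry:
            # Assumed shape: {"label": ..., "confidence": ...}
            summary["voice_profile"][category] = (
                entry.get("label"),
                entry.get("confidence"),
            )
    return summary
```

Calling `summarize_response(response.json())` returns a plain dictionary that is safe to log or print even when no Voice Profile is present. If the API returns several labels per category (for example when topN is set), the loop would need to be adapted.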
5

Configuration parameters

transcribeConfig
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| modelId | string | Yes | STT model ID. Use inworld/inworld-stt-1 for WebSocket and HTTP |
| language | string | No | BCP-47 language code (e.g. en-US). If omitted, the model may auto-detect. See Supported Languages for the full list |
| audioEncoding | string | Yes | One of: LINEAR16, MP3, OGG_OPUS, FLAC, AUTO_DETECT. For streaming, use LINEAR16 only |
| sampleRateHertz | integer | No | Sample rate in Hz. Default 16000. Can be omitted for formats with headers (MP3, FLAC, OGG_OPUS, WAV) |
| numberOfChannels | integer | No | Channel count. Default 1 |
| voiceProfileConfig | object | No | Voice Profile configuration. See below |
voiceProfileConfig
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| enableVoiceProfile | bool | Yes | Set to true to enable Voice Profile analysis |
| topN | integer | No | Number of top labels per category to return. Default: 10 |
audioData
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| content | string | Yes | Base64-encoded audio bytes |
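The parameters above can be assembled with a small helper that includes optional fields only when they are set. This is a sketch built from the field names in the tables; `build_payload` and its keyword arguments are illustrative names, and the local encoding check is a convenience rather than something the API requires.

```python
import base64

# Encodings listed for file uploads in the audioEncoding row above.
FILE_ENCODINGS = {"LINEAR16", "MP3", "OGG_OPUS", "FLAC", "AUTO_DETECT"}


def build_payload(audio_bytes, audio_encoding, *, model_id="inworld/inworld-stt-1",
                  language=None, sample_rate_hertz=None, number_of_channels=None,
                  enable_voice_profile=False, top_n=None):
    """Build the JSON body for a sync transcription request."""
    if audio_encoding not in FILE_ENCODINGS:
        raise ValueError(f"Unsupported audioEncoding: {audio_encoding!r}")

    config = {"modelId": model_id, "audioEncoding": audio_encoding}
    # Optional fields are omitted unless set, so the API defaults apply.
    if language is not None:
        config["language"] = language
    if sample_rate_hertz is not None:
        config["sampleRateHertz"] = sample_rate_hertz
    if number_of_channels is not None:
        config["numberOfChannels"] = number_of_channels
    if enable_voice_profile:
        voice_profile = {"enableVoiceProfile": True}
        if top_n is not None:
            voice_profile["topN"] = top_n
        config["voiceProfileConfig"] = voice_profile

    content = base64.b64encode(audio_bytes).decode("ascii")
    return {"transcribeConfig": config, "audioData": {"content": content}}
```

The result can be passed directly as the `json=` argument of `requests.post`, for example `build_payload(data, "LINEAR16", language="en-US", enable_voice_profile=True, top_n=3)`.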
6

Run the code

pip install requests  # if needed
python inworld_stt_quickstart.py
Example output:
Transcript: Hey, I just wanted to check in on the delivery status for my order.

Streaming (WebSocket)

For real-time microphone or live audio:
  1. First message must contain transcribeConfig (same fields as above, including voiceProfileConfig to enable Voice Profile).
  2. Later messages send audioChunk with base64-encoded LINEAR16 (PCM) audio only.
  3. Turn and stream end:
    • To signal end of a speaker turn, send endTurn.
    • Send closeStream when the client is done sending audio (required for WebSocket; gRPC clients can just close the send side).
Example first WebSocket message:
{
  "transcribeConfig": {
    "modelId": "inworld/inworld-stt-1",
    "audioEncoding": "LINEAR16"
  }
}
Responses stream back as Transcription (interim and final), optional voiceProfile, speech events (speechStarted when voice activity is detected, speechStopped when silence is detected after speech), and finally Usage when the stream is closed.
Streaming endpoint (WebSocket): wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional
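The client-side message order described above (config first, then audio chunks, then endTurn and closeStream) can be sketched as a generator of JSON strings. The exact wire shape of the later messages is not shown in this quickstart, so the bodies used here (`{"audioChunk": {"content": ...}}`, `{"endTurn": {}}`, `{"closeStream": {}}`) are assumptions to verify against the API Reference; `stream_messages` is a hypothetical helper and opens no connection.

```python
import base64
import json


def stream_messages(pcm: bytes, chunk_size: int = 3200,
                    model_id: str = "inworld/inworld-stt-1"):
    """Yield the JSON messages a client would send, in order.

    chunk_size=3200 bytes is 100 ms of 16 kHz, 16-bit, mono PCM.
    """
    # 1. The first message carries transcribeConfig.
    yield json.dumps({
        "transcribeConfig": {"modelId": model_id, "audioEncoding": "LINEAR16"}
    })
    # 2. Later messages carry base64-encoded LINEAR16 audio.
    #    ASSUMPTION: the chunk is wrapped as {"audioChunk": {"content": ...}}.
    for start in range(0, len(pcm), chunk_size):
        chunk = pcm[start:start + chunk_size]
        encoded = base64.b64encode(chunk).decode("ascii")
        yield json.dumps({"audioChunk": {"content": encoded}})
    # 3. Signal the end of the speaker turn, then the end of the stream.
    #    ASSUMPTION: both are sent as empty objects under their field names.
    yield json.dumps({"endTurn": {}})
    yield json.dumps({"closeStream": {}})
```

Each yielded string would be passed to the send method of whichever WebSocket client is in use. With automatic turn-taking the endTurn message may be unnecessary.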

Next Steps

STT Overview

Learn about supported providers, audio formats, and endpoints.

API Reference

View the complete API specification.