Bidirectional streaming API for real-time speech-to-text transcription over WebSocket.

This method listens for streaming audio input and returns recognized text chunks as soon as they are ready. Audio chunks are expected to be part of a single voice input. Suitable for streaming live conversations, microphone input, or other streaming audio sources.

To use the API:

1. Send a `transcribeConfig` message first to configure the session (model, language, audio encoding, etc.).
2. Send `audioChunk` messages containing raw audio bytes.
3. Receive `transcription` results as they become available, including both interim (partial) and final results.
4. Listen for `speechStarted` and `speechStopped` events to detect voice activity changes.
5. Optionally send `endTurn` to signal the end of a speaker's turn.
6. Send `closeStream` when done.

Example messages:

`transcribeConfig`:

```json
{
  "modelId": "assemblyai/universal-streaming-multilingual",
  "audioEncoding": "LINEAR16",
  "sampleRateHertz": 16000,
  "language": "en-US"
}
```

`audioChunk`:

```json
{
  "content": "<YOUR_AUDIO>"
}
```

`endTurn` and `closeStream` (empty payloads):

```json
{}
```

`transcription`:

```json
{
  "transcript": "Hello, this is a test transcription.",
  "isFinal": true,
  "wordTimestamps": []
}
```

Usage metrics:

```json
{
  "transcribedAudioMs": 123,
  "modelId": "<string>"
}
```

`speechStarted`:

```json
{
  "startTimeMs": 1250,
  "confidence": 0.95
}
```

`speechStopped`:

```json
{
  "silenceDurationMs": 750
}
```
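The steps above can be sketched as an ordered message sequence. A minimal sketch in Python, assuming a JSON envelope keyed by message type and base64-encoded audio content (both the envelope shape and the encoding are assumptions, not confirmed by this page):

```python
import base64
import json


def stream_messages(audio_chunks, model_id="assemblyai/universal-streaming-multilingual"):
    """Yield the client-side JSON messages for one session, in protocol order."""
    # 1. The session config must be the first message sent.
    yield json.dumps({
        "transcribeConfig": {
            "modelId": model_id,
            "audioEncoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "language": "en-US",
        }
    })
    # 2. Raw audio bytes go out as (assumed) base64-encoded audioChunk messages.
    for chunk in audio_chunks:
        encoded = base64.b64encode(chunk).decode("ascii")
        yield json.dumps({"audioChunk": {"content": encoded}})
    # 3. Optionally mark the end of the speaker's turn, then close the stream.
    yield json.dumps({"endTurn": {}})
    yield json.dumps({"closeStream": {}})
```

Each yielded string would be sent as one WebSocket text frame; the actual wire format should be checked against the schemas above.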
Your authentication credentials. For Basic authentication, populate `Basic $INWORLD_API_KEY`.
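As a sketch, the Authorization header could be assembled like this (reading the key from an `INWORLD_API_KEY` environment variable, the name suggested by the placeholder above):

```python
import os


def auth_header(api_key=None):
    """Build the Authorization header for Basic authentication.

    The key is used as-is after the "Basic " scheme prefix, per the
    placeholder above; no extra encoding is applied here.
    """
    key = api_key if api_key is not None else os.environ.get("INWORLD_API_KEY", "")
    return {"Authorization": f"Basic {key}"}
```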
`transcribeConfig`: Configure the transcription session. Must be the first message sent. Contains model selection, audio format settings, and optional feature configurations.

`audioChunk`: Send a chunk of audio data for transcription. Must be sent after the initial `transcribeConfig` message.

`endTurn`: Signal the end of a speaker's turn. Some providers do not support manual turn-taking; for those providers, sending this message has no effect.

`closeStream`: Signal that the client is done sending audio data. Required for HTTP/WebSocket clients since there is no equivalent of a gRPC stream close.

`transcription`: Transcription result streamed back as audio is processed. May be an interim (partial) result or a final result, depending on the `isFinal` field.

Usage metrics: For billing and monitoring purposes. Coming soon; this field is not yet populated.

`speechStarted`: Signal indicating the start of a speaker's speech. Sent when voice activity is detected in the audio stream.

`speechStopped`: Signal raised when STT detects silence after speech has stopped. Useful for tracking pauses and implementing custom turn-taking logic.
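A sketch of how a client might fold these server messages into a running transcript and a turn decision, assuming the messages have been parsed into dicts with the field names shown in the examples above (the 700 ms silence threshold is an arbitrary illustrative choice, not an API default):

```python
def collect_transcript(messages, silence_threshold_ms=700):
    """Reduce a stream of parsed server messages to (text, turn_ended).

    Interim results overwrite the current partial text; final results are
    appended permanently. A speechStopped event with a long enough silence
    is treated as the end of the speaker's turn.
    """
    finals = []
    partial = ""
    turn_ended = False
    for msg in messages:
        if "transcript" in msg:  # transcription message
            if msg.get("isFinal"):
                finals.append(msg["transcript"])
                partial = ""
            else:
                partial = msg["transcript"]
        elif "silenceDurationMs" in msg:  # speechStopped message
            if msg["silenceDurationMs"] >= silence_threshold_ms:
                turn_ended = True
        # speechStarted and usage messages are ignored in this sketch
    text = " ".join(finals + ([partial] if partial else []))
    return text, turn_ended
```

This kind of client-side threshold is one way to implement the "custom turn-taking logic" mentioned above for providers that ignore `endTurn`.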