# Transcribe audio (WebSocket)

> Bidirectional streaming API for real-time speech-to-text transcription over WebSocket.

This method accepts streaming audio input and returns recognized text incrementally, as soon as each result is ready. All audio chunks sent on one stream are expected to belong to a single voice input. The method suits live conversations, microphone input, and other streaming audio sources.
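
When the source is a recording rather than a live microphone, the audio has to be cut into chunks before it is streamed. The sketch below assumes `LINEAR16` audio (16-bit samples, so 2 bytes per sample per channel) at the documented defaults of 16000 Hz and one channel. The 100 ms chunk length and the helper name `split_pcm` are illustrative choices, not values the API prescribes.

```python
from __future__ import annotations


def split_pcm(pcm: bytes, sample_rate_hz: int = 16000, channels: int = 1,
              chunk_ms: int = 100) -> list[bytes]:
    """Split raw LINEAR16 PCM into chunk_ms-long chunks; the last chunk may be shorter."""
    bytes_per_frame = 2 * channels  # 16-bit samples: 2 bytes per sample, per channel
    frames_per_chunk = sample_rate_hz * chunk_ms // 1000
    chunk_size = frames_per_chunk * bytes_per_frame
    if chunk_size <= 0:
        raise ValueError("chunk_ms is too short for this sample rate")
    if len(pcm) % bytes_per_frame:
        raise ValueError("PCM data does not contain a whole number of frames")
    return [pcm[i:i + chunk_size] for i in range(0, len(pcm), chunk_size)]


# One second of 16 kHz mono silence -> ten 100 ms chunks of 3200 bytes each.
chunks = split_pcm(bytes(32000))
print(len(chunks), len(chunks[0]))  # 10 3200
```

For 16 kHz mono audio, each 100 ms chunk is 16000 * 0.1 * 2 = 3200 bytes. If you want to mimic a live source, send one chunk per chunk duration rather than all at once.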

To use the API:
- Send a `transcribeConfig` message first to configure the session (model, language, audio encoding, etc.).
- Stream `audioChunk` messages containing audio in the encoding declared by `audioEncoding`. The bytes are base64-encoded in the JSON payload.
- Receive `transcription` results as they become available. Each result is either interim (partial) or final, as indicated by its `isFinal` field.
- Listen for `speechStarted` and `speechStopped` events to detect voice activity changes.
- Optionally, send `endTurn` to signal the end of a speaker's turn.
- Send `closeStream` when you have finished sending audio.
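
The steps above can be sketched as a sequence of JSON text frames. This is a minimal sketch under two assumptions: that each frame wraps its payload in an object keyed by the message name (for example `{"transcribeConfig": {...}}`), following the message names in the AsyncAPI spec, and that audio bytes travel base64-encoded, as the `format: byte` declaration on `content` implies. The helper functions are hypothetical; opening the WebSocket connection, authenticating, and sending the frames are left to your client library.

```python
from __future__ import annotations

import base64
import json
from typing import Iterable, Iterator


def config_frame(model_id: str, sample_rate_hz: int = 16000,
                 language: str | None = None) -> str:
    config = {
        "modelId": model_id,
        "audioEncoding": "LINEAR16",  # MP3, OGG_OPUS and FLAC are not supported for streaming
        "sampleRateHertz": sample_rate_hz,
        "numberOfChannels": 1,
    }
    if language is not None:
        config["language"] = language  # omit to let the model auto-detect
    return json.dumps({"transcribeConfig": config})


def audio_frame(pcm_chunk: bytes) -> str:
    # `content` is declared with format: byte, so the bytes travel as base64 text.
    encoded = base64.b64encode(pcm_chunk).decode("ascii")
    return json.dumps({"audioChunk": {"content": encoded}})


def session_frames(model_id: str, chunks: Iterable[bytes],
                   language: str | None = None,
                   end_turn: bool = True) -> Iterator[str]:
    """Yield frames in protocol order: config first, audio next, closeStream last."""
    yield config_frame(model_id, language=language)
    for chunk in chunks:
        yield audio_frame(chunk)
    if end_turn:
        # Optional; providers without manual turn-taking ignore it.
        yield json.dumps({"endTurn": {}})
    yield json.dumps({"closeStream": {}})


for frame in session_frames("inworld/inworld-stt-1", [b"\x00\x01"], language="en-US"):
    print(frame)
```

`session_frames` only encodes the ordering the protocol requires: `transcribeConfig` first, `closeStream` last, with `endTurn` as an optional step in between.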



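On the receiving side, an interim `transcription` result may be replaced by later results for the same segment until one arrives with `isFinal: true`. The following minimal sketch folds incoming frames into a running transcript. It assumes frames are JSON objects keyed by the message name (for example `{"transcription": {...}}`), treats a missing `isFinal` as `false`, and recognizes `speechStarted` and `speechStopped` by message name only. `TranscriptAssembler` is a hypothetical helper, not part of any SDK.

```python
from __future__ import annotations

import json


class TranscriptAssembler:
    """Hypothetical helper that folds server frames into a running transcript."""

    def __init__(self) -> None:
        self.final_segments: list[str] = []
        self.interim = ""       # latest partial text for the segment in progress
        self.speaking = False   # updated from speechStarted / speechStopped

    def handle(self, frame: str) -> None:
        message = json.loads(frame)
        if "transcription" in message:
            result = message["transcription"]
            text = result.get("transcript", "")
            if result.get("isFinal", False):  # a missing isFinal is treated as interim
                if text:
                    self.final_segments.append(text)
                self.interim = ""             # the segment is settled
            else:
                self.interim = text           # each interim replaces the previous one
        elif "speechStarted" in message:
            self.speaking = True
        elif "speechStopped" in message:
            self.speaking = False
        # Anything else (for example `usage`) is ignored in this sketch.

    def text(self) -> str:
        """Finalized segments followed by the current interim text, if any."""
        parts = self.final_segments + ([self.interim] if self.interim else [])
        return " ".join(parts)


assembler = TranscriptAssembler()
for frame in [
    '{"transcription": {"transcript": "Hello, this is", "isFinal": false}}',
    '{"transcription": {"transcript": "Hello, this is a test transcription.", "isFinal": true}}',
]:
    assembler.handle(frame)
print(assembler.text())  # Hello, this is a test transcription.
```

Keeping interim text separate from finalized segments lets a UI display partial results without committing them to the stored transcript.
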
## AsyncAPI

````yaml api-reference/sttAPI/transcribe-stream-websocket.json sttStream
id: sttStream
title: Stt stream
description: Primary WebSocket channel for bidirectional speech-to-text streaming.
servers:
  - id: production
    protocol: wss
    host: api.inworld.ai
    bindings: []
    variables: []
address: /stt/v1/transcribe:streamBidirectional
parameters: []
bindings: []
operations:
  - &ref_1
    id: sendRequest
    title: Send request
    type: receive
    messages:
      - &ref_3
        id: transcribeConfig
        payload:
          - name: transcribeConfig
            description: >-
              Configure the transcription session. Must be the first message
              sent. Contains model selection, audio format settings, and
              optional feature configurations.
            type: object
            properties:
              - name: modelId
                type: string
                description: >-
                  The identifier of the model to use for transcription. Format:
                  "{provider}/{model-name}".


                  Available models:

                  - `inworld/inworld-stt-1` — Inworld first-party

                  - `assemblyai/universal-streaming-multilingual` — AssemblyAI
                  multilingual

                  - `assemblyai/universal-streaming-english` — AssemblyAI
                  English

                  - `assemblyai/u3-rt-pro` — AssemblyAI high-accuracy

                  - `assemblyai/whisper-rt` — AssemblyAI Whisper real-time

                  - `soniox/stt-rt-v4` — Soniox real-time


                  See [STT Introduction](/stt/overview) for the full model
                  catalogue.
                required: true
              - name: audioEncoding
                type: string
                description: |-
                  Supported audio encoding formats.

                   - `AUDIO_ENCODING_UNSPECIFIED`: Not specified. Will return an error.
                   - `AUTO_DETECT`: Automatically detect audio encoding from the audio header.
                   - `LINEAR16`: Uncompressed 16-bit signed little-endian samples (Linear PCM).
                   - `MP3`: MP3 audio. Not supported for streaming transcription.
                   - `OGG_OPUS`: Opus encoded audio wrapped in an OGG container. Not supported for streaming transcription.
                   - `FLAC`: FLAC encoded audio. Lossless format. Not supported for streaming transcription.
                enumValues:
                  - AUDIO_ENCODING_UNSPECIFIED
                  - AUTO_DETECT
                  - LINEAR16
                  - MP3
                  - OGG_OPUS
                  - FLAC
                required: true
              - name: language
                type: string
                description: >-
                  Language code for speech recognition in BCP-47 format (e.g.,
                  "en-US", "ja-JP"). If not specified, the model will attempt to
                  auto-detect the language.
                required: false
              - name: sampleRateHertz
                type: integer
                description: >-
                  Sample rate of the audio data in Hertz. Required when the
                  sample rate cannot be inferred from the audio header (e.g.,
                  raw PCM streams). Default: 16000.
                required: false
              - name: numberOfChannels
                type: integer
                description: >-
                  Number of channels in the audio data. Required when the number
                  of channels cannot be inferred from the audio header (e.g.,
                  raw PCM streams). Default: 1.
                required: false
              - name: inactivityTimeoutSeconds
                type: integer
                description: >-
                  Inactivity timeout in seconds. If the client is silent for
                  this duration, the transcription will be stopped.
                required: false
              - name: endOfTurnConfidenceThreshold
                type: number
                description: >-
                  Confidence threshold for end-of-turn prediction. Higher values
                  reduce false positives. Range: [0.0, 1.0]. Default: 0.5.
                required: false
              - name: prompts
                type: array
                description: >-
                  Contextual prompts to guide the model (e.g., domain-specific
                  context).
                required: false
              - name: includeWordTimestamps
                type: boolean
                description: >-
                  If true, includes per-word timing information in the response.
                  **Coming soon** — word timestamps are not yet populated.
                required: false
              - name: groqConfig
                type: object
                description: Configuration for Groq streaming STT models.
                required: false
                properties:
                  - name: temperature
                    type: number
                    description: >-
                      Temperature for the model. Controls randomness in
                      predictions. Higher values produce more varied output.
                      Range: [0.0, 1.0].
                    required: false
              - name: assemblyaiConfig
                type: object
                description: Configuration for AssemblyAI streaming STT models.
                required: false
                properties:
                  - name: minEndOfTurnSilenceWhenConfident
                    type: integer
                    description: >-
                      Minimum silence, in milliseconds, required to end a turn when
                      end-of-turn confidence is high.
                    required: false
                  - name: maxTurnSilence
                    type: integer
                    description: >-
                      Maximum allowed silence before forcing a turn boundary
                      (milliseconds).
                    required: false
                  - name: vadThreshold
                    type: number
                    description: >-
                      Voice activity detection threshold. Range: [0.0, 1.0].
                      Default: 0.5.
                    required: false
                  - name: prompt
                    type: string
                    description: >-
                      Custom transcription instructions for the model. Works
                      only for Universal-3 Pro Streaming.
                    required: false
              - name: inworldSttV1Config
                type: object
                description: Configuration for Inworld STT 1 models.
                required: false
                properties:
                  - name: minEndOfTurnSilenceWhenConfident
                    type: integer
                    description: >-
                      Minimum silence, in milliseconds, required to end a turn when
                      end-of-turn confidence is high.
                    required: false
                  - name: vadThreshold
                    type: number
                    description: >-
                      Voice activity detection threshold. Range: [0.0, 1.0].
                      Default: 0.5.
                    required: false
              - name: sonioxConfig
                type: object
                description: Configuration for Soniox streaming STT models.
                required: false
                properties:
                  - name: languageHints
                    type: array
                    description: >-
                      Language hints to guide the model. If set, they override
                      the `language` field of the main config.
                    required: false
                  - name: languageHintsStrict
                    type: boolean
                    description: >-
                      If true, the model strongly prefers producing only the
                      languages in the `languageHints` list.
                    required: false
                  - name: enableEndpointDetection
                    type: boolean
                    description: >-
                      If true, enables intelligent semantic-based end-of-turn
                      detection.
                    required: false
                  - name: maxEndpointDelayMs
                    type: integer
                    description: >-
                      Maximum allowed delay between the end of the previous turn
                      and the start of the next turn (milliseconds). Range: [500, 5000].
                      Default: 2000.
                    required: false
                  - name: context
                    type: object
                    description: Contextual information to guide the Soniox model.
                    required: false
                    properties:
                      - name: general
                        type: object
                        description: >-
                          Structured key-value information (domain, topic,
                          intent, etc.).
                        required: false
                      - name: text
                        type: string
                        description: Longer free-form background text or related documents.
                        required: false
                      - name: terms
                        type: array
                        description: Domain-specific or uncommon words.
                        required: false
              - name: voiceProfileConfig
                type: object
                description: Configuration for voice profile detection.
                required: false
                properties:
                  - name: enableVoiceProfile
                    type: boolean
                    description: Enables the voice profile feature for this request or stream.
                    required: true
                  - name: topN
                    type: integer
                    description: >-
                      Number of top labels from each class to return. Default:
                      10.
                    required: false
        headers: []
        jsonPayloadSchema:
          type: object
          properties:
            modelId:
              type: string
              description: >-
                The identifier of the model to use for transcription. Format:
                "{provider}/{model-name}".


                Available models:

                - `inworld/inworld-stt-1` — Inworld first-party

                - `assemblyai/universal-streaming-multilingual` — AssemblyAI
                multilingual

                - `assemblyai/universal-streaming-english` — AssemblyAI English

                - `assemblyai/u3-rt-pro` — AssemblyAI high-accuracy

                - `assemblyai/whisper-rt` — AssemblyAI Whisper real-time

                - `soniox/stt-rt-v4` — Soniox real-time


                See [STT Introduction](/stt/overview) for the full model
                catalogue.
              example: assemblyai/universal-streaming-multilingual
              x-parser-schema-id: <anonymous-schema-1>
            audioEncoding:
              type: string
              enum:
                - AUDIO_ENCODING_UNSPECIFIED
                - AUTO_DETECT
                - LINEAR16
                - MP3
                - OGG_OPUS
                - FLAC
              default: AUDIO_ENCODING_UNSPECIFIED
              description: |-
                Supported audio encoding formats.

                 - `AUDIO_ENCODING_UNSPECIFIED`: Not specified. Will return an error.
                 - `AUTO_DETECT`: Automatically detect audio encoding from the audio header.
                 - `LINEAR16`: Uncompressed 16-bit signed little-endian samples (Linear PCM).
                 - `MP3`: MP3 audio. Not supported for streaming transcription.
                 - `OGG_OPUS`: Opus encoded audio wrapped in an OGG container. Not supported for streaming transcription.
                 - `FLAC`: FLAC encoded audio. Lossless format. Not supported for streaming transcription.
              x-parser-schema-id: AudioEncoding
            language:
              type: string
              description: >-
                Language code for speech recognition in BCP-47 format (e.g.,
                "en-US", "ja-JP"). If not specified, the model will attempt to
                auto-detect the language.
              example: en-US
              x-parser-schema-id: <anonymous-schema-2>
            sampleRateHertz:
              type: integer
              format: int32
              description: >-
                Sample rate of the audio data in Hertz. Required when the sample
                rate cannot be inferred from the audio header (e.g., raw PCM
                streams). Default: 16000.
              example: 16000
              x-parser-schema-id: <anonymous-schema-3>
            numberOfChannels:
              type: integer
              format: int32
              description: >-
                Number of channels in the audio data. Required when the number
                of channels cannot be inferred from the audio header (e.g., raw
                PCM streams). Default: 1.
              example: 1
              x-parser-schema-id: <anonymous-schema-4>
            inactivityTimeoutSeconds:
              type: integer
              format: int32
              description: >-
                Inactivity timeout in seconds. If the client is silent for this
                duration, the transcription will be stopped.
              x-parser-schema-id: <anonymous-schema-5>
            endOfTurnConfidenceThreshold:
              type: number
              format: float
              description: >-
                Confidence threshold for end-of-turn prediction. Higher values
                reduce false positives. Range: [0.0, 1.0]. Default: 0.5.
              x-parser-schema-id: <anonymous-schema-6>
            prompts:
              type: array
              items:
                type: string
                x-parser-schema-id: <anonymous-schema-8>
              description: >-
                Contextual prompts to guide the model (e.g., domain-specific
                context).
              x-parser-schema-id: <anonymous-schema-7>
            includeWordTimestamps:
              type: boolean
              description: >-
                If true, includes per-word timing information in the response.
                **Coming soon** — word timestamps are not yet populated.
              x-parser-schema-id: <anonymous-schema-9>
            groqConfig:
              type: object
              properties:
                temperature:
                  type: number
                  format: float
                  description: >-
                    Temperature for the model. Controls randomness in
                    predictions. Higher values produce more varied output.
                    Range: [0.0, 1.0].
                  x-parser-schema-id: <anonymous-schema-10>
              description: Configuration for Groq streaming STT models.
              x-parser-schema-id: GroqConfig
            assemblyaiConfig:
              type: object
              properties:
                minEndOfTurnSilenceWhenConfident:
                  type: integer
                  format: int32
                  description: >-
                    Minimum silence, in milliseconds, required to end a turn when
                    end-of-turn confidence is high.
                  x-parser-schema-id: <anonymous-schema-11>
                maxTurnSilence:
                  type: integer
                  format: int32
                  description: >-
                    Maximum allowed silence before forcing a turn boundary
                    (milliseconds).
                  x-parser-schema-id: <anonymous-schema-12>
                vadThreshold:
                  type: number
                  format: float
                  description: >-
                    Voice activity detection threshold. Range: [0.0, 1.0].
                    Default: 0.5.
                  x-parser-schema-id: <anonymous-schema-13>
                prompt:
                  type: string
                  description: >-
                    Custom transcription instructions for the model. Works only
                    for Universal-3 Pro Streaming.
                  x-parser-schema-id: <anonymous-schema-14>
              description: Configuration for AssemblyAI streaming STT models.
              x-parser-schema-id: AssemblyAIConfig
            inworldSttV1Config:
              type: object
              properties:
                minEndOfTurnSilenceWhenConfident:
                  type: integer
                  format: int32
                  description: >-
                    Minimum silence, in milliseconds, required to end a turn when
                    end-of-turn confidence is high.
                  x-parser-schema-id: <anonymous-schema-15>
                vadThreshold:
                  type: number
                  format: float
                  description: >-
                    Voice activity detection threshold. Range: [0.0, 1.0].
                    Default: 0.5.
                  x-parser-schema-id: <anonymous-schema-16>
              description: Configuration for Inworld STT 1 models.
              x-parser-schema-id: InworldSttV1Config
            sonioxConfig:
              type: object
              properties:
                languageHints:
                  type: array
                  items:
                    type: string
                    x-parser-schema-id: <anonymous-schema-18>
                  description: >-
                    Language hints to guide the model. If set, they override the
                    `language` field of the main config.
                  x-parser-schema-id: <anonymous-schema-17>
                languageHintsStrict:
                  type: boolean
                  description: >-
                    If true, the model strongly prefers producing only the
                    languages in the `languageHints` list.
                  x-parser-schema-id: <anonymous-schema-19>
                enableEndpointDetection:
                  type: boolean
                  description: >-
                    If true, enables intelligent semantic-based end-of-turn
                    detection.
                  x-parser-schema-id: <anonymous-schema-20>
                maxEndpointDelayMs:
                  type: integer
                  format: int32
                  description: >-
                    Maximum allowed delay between the end of the previous turn
                    and the start of the next turn (milliseconds). Range: [500, 5000].
                    Default: 2000.
                  x-parser-schema-id: <anonymous-schema-21>
                context:
                  type: object
                  properties:
                    general:
                      type: object
                      additionalProperties:
                        type: string
                        x-parser-schema-id: <anonymous-schema-23>
                      description: >-
                        Structured key-value information (domain, topic, intent,
                        etc.).
                      x-parser-schema-id: <anonymous-schema-22>
                    text:
                      type: string
                      description: Longer free-form background text or related documents.
                      x-parser-schema-id: <anonymous-schema-24>
                    terms:
                      type: array
                      items:
                        type: string
                        x-parser-schema-id: <anonymous-schema-26>
                      description: Domain-specific or uncommon words.
                      x-parser-schema-id: <anonymous-schema-25>
                  description: Contextual information to guide the Soniox model.
                  x-parser-schema-id: SonioxConfigContext
              description: Configuration for Soniox streaming STT models.
              x-parser-schema-id: SonioxConfig
            voiceProfileConfig:
              type: object
              properties:
                enableVoiceProfile:
                  type: boolean
                  description: Enables the voice profile feature for this request or stream.
                  x-parser-schema-id: <anonymous-schema-27>
                topN:
                  type: integer
                  format: int32
                  description: 'Number of top labels from each class to return. Default: 10.'
                  x-parser-schema-id: <anonymous-schema-28>
              required:
                - enableVoiceProfile
              description: Configuration for voice profile detection.
              x-parser-schema-id: VoiceProfileConfig
          required:
            - modelId
            - audioEncoding
          examples:
            - modelId: assemblyai/universal-streaming-multilingual
              audioEncoding: LINEAR16
              sampleRateHertz: 16000
              language: en-US
          x-parser-schema-id: TranscribeConfigPayload
        title: Transcribe config
        description: >-
          Configure the transcription session. Must be the first message sent.
          Contains model selection, audio format settings, and optional feature
          configurations.
        example: |-
          {
            "modelId": "assemblyai/universal-streaming-multilingual",
            "audioEncoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "language": "en-US"
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: transcribeConfig
      - &ref_4
        id: audioChunk
        payload:
          - name: audioChunk
            description: >-
              Send a chunk of audio data for transcription. Must be sent after
              the initial `transcribeConfig` message.
            type: object
            properties:
              - name: content
                type: string
                description: >-
                  The raw audio bytes, in the encoding set by `audioEncoding` in
                  the `transcribeConfig` message, base64-encoded in JSON.
                required: true
        headers: []
        jsonPayloadSchema:
          type: object
          properties:
            content:
              type: string
              format: byte
              description: >-
                The raw audio bytes, in the encoding set by `audioEncoding` in the
                `transcribeConfig` message, base64-encoded in JSON.
              example: <YOUR_AUDIO>
              x-parser-schema-id: <anonymous-schema-29>
          required:
            - content
          examples:
            - content: <YOUR_AUDIO>
          x-parser-schema-id: AudioChunkPayload
        title: Audio chunk
        description: >-
          Send a chunk of audio data for transcription. Must be sent after the
          initial `transcribeConfig` message.
        example: |-
          {
            "content": "<YOUR_AUDIO>"
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: audioChunk
      - &ref_5
        id: endTurn
        payload:
          - name: endTurn
            description: >-
              Signal the end of a speaker's turn. Some providers do not support
              manual turn-taking; for those providers, sending this message will
              have no effect.
            type: object
            properties: []
        headers: []
        jsonPayloadSchema:
          type: object
          properties: {}
          examples:
            - {}
          x-parser-schema-id: EndTurnPayload
        title: End turn
        description: >-
          Signal the end of a speaker's turn. Some providers do not support
          manual turn-taking; for those providers, sending this message will
          have no effect.
        example: '{}'
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: endTurn
      - &ref_6
        id: closeStream
        payload:
          - name: closeStream
            description: >-
              Signal that the client has finished sending audio data. Required
              for HTTP and WebSocket clients, which have no equivalent of the
              client closing its side of a gRPC stream.
            type: object
            properties: []
        headers: []
        jsonPayloadSchema:
          type: object
          properties: {}
          examples:
            - {}
          x-parser-schema-id: CloseStreamPayload
        title: Close stream
        description: >-
          Signal that the client has finished sending audio data. Required
          for HTTP and WebSocket clients, which have no equivalent of the
          client closing its side of a gRPC stream.
        example: '{}'
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: closeStream
    bindings: []
    extensions: &ref_0
      - id: x-parser-unique-object-id
        value: sttStream
  - &ref_2
    id: receiveResponse
    title: Receive response
    type: send
    messages:
      - &ref_7
        id: transcription
        payload:
          - name: transcription
            description: >-
              Transcription result streamed back as audio is processed. May be
              an interim (partial) result or a final result depending on the
              `isFinal` field.
            type: object
            properties:
              - name: transcript
                type: string
                description: Full transcribed text for this segment.
                required: false
              - name: isFinal
                type: boolean
                description: >-
                  Indicates whether this is a finalized result or an interim
                  (partial) result that may be updated as more audio is
                  processed.
                required: false
              - name: wordTimestamps
                type: array
                description: >-
                  Per-word timing and confidence data. Only populated when
                  `includeWordTimestamps` is enabled. **Coming soon** — word
                  timestamps are not yet populated.
                required: false
        headers: []
        jsonPayloadSchema:
          type: object
          properties:
            transcript:
              type: string
              description: Full transcribed text for this segment.
              x-parser-schema-id: <anonymous-schema-30>
            isFinal:
              type: boolean
              description: >-
                Indicates whether this is a finalized result or an interim
                (partial) result that may be updated as more audio is processed.
              x-parser-schema-id: <anonymous-schema-31>
            wordTimestamps:
              type: array
              items:
                type: object
                properties:
                  word:
                    type: string
                    description: The transcribed word.
                    x-parser-schema-id: <anonymous-schema-33>
                  confidence:
                    type: number
                    format: float
                    description: >-
                      Recognition confidence score for this word. Range: [0.0,
                      1.0].
                    x-parser-schema-id: <anonymous-schema-34>
                  startTimeMs:
                    type: integer
                    format: int32
                    description: >-
                      Offset from the beginning of the audio to the start of
                      this word, in milliseconds.
                    x-parser-schema-id: <anonymous-schema-35>
                  endTimeMs:
                    type: integer
                    format: int32
                    description: >-
                      Offset from the beginning of the audio to the end of this
                      word, in milliseconds.
                    x-parser-schema-id: <anonymous-schema-36>
                x-parser-schema-id: WordTimestamp
              description: >-
                Per-word timing and confidence data. Only populated when
                `includeWordTimestamps` is enabled. **Coming soon** — word
                timestamps are not yet populated.
              x-parser-schema-id: <anonymous-schema-32>
          examples:
            - transcript: Hello, this is a test transcription.
              isFinal: true
              wordTimestamps: []
            - transcript: Hello, this is
              isFinal: false
              wordTimestamps: []
          x-parser-schema-id: TranscriptionResponsePayload
        title: Transcription
        description: >-
          Transcription result streamed back as audio is processed. May be an
          interim (partial) result or a final result depending on the `isFinal`
          field.
        example: |-
          {
            "transcript": "Hello, this is a test transcription.",
            "isFinal": true,
            "wordTimestamps": []
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: transcription
      - &ref_8
        id: usage
        payload:
          - name: usage
            description: >-
              Usage metrics for billing and monitoring purposes. **Coming soon**
              — this field is not yet populated.
            type: object
            properties:
              - name: transcribedAudioMs
                type: integer
                description: The duration of the transcribed audio in milliseconds.
                required: false
              - name: modelId
                type: string
                description: The identifier of the model used for transcription.
                required: false
        headers: []
        jsonPayloadSchema:
          type: object
          properties:
            transcribedAudioMs:
              type: integer
              format: int32
              description: The duration of the transcribed audio in milliseconds.
              x-parser-schema-id: <anonymous-schema-37>
            modelId:
              type: string
              description: The identifier of the model used for transcription.
              x-parser-schema-id: <anonymous-schema-38>
          description: >-
            Usage metrics for billing and monitoring purposes. **Coming soon** —
            this field is not yet populated.
          x-parser-schema-id: UsageResponsePayload
        title: Usage
        description: >-
          Usage metrics for billing and monitoring purposes. **Coming soon** —
          this field is not yet populated.
        example: |-
          {
            "transcribedAudioMs": 123,
            "modelId": "<string>"
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: usage
      - &ref_9
        id: speechStarted
        payload:
          - name: speechStarted
            description: >-
              Signals the start of a speaker's speech. Sent when voice
              activity is detected in the audio stream.
            type: object
            properties:
              - name: startTimeMs
                type: integer
                description: The timestamp of the start of the speech in milliseconds.
                required: false
              - name: confidence
                type: number
                description: The confidence score of the speech detection.
                required: false
        headers: []
        jsonPayloadSchema:
          type: object
          properties:
            startTimeMs:
              type: integer
              format: int32
              description: The timestamp of the start of the speech in milliseconds.
              x-parser-schema-id: <anonymous-schema-39>
            confidence:
              type: number
              format: float
              description: The confidence score of the speech detection.
              x-parser-schema-id: <anonymous-schema-40>
          examples:
            - startTimeMs: 1250
              confidence: 0.95
          x-parser-schema-id: SpeechStartedResponsePayload
        title: Speech started
        description: >-
          Signals the start of a speaker's speech. Sent when voice activity
          is detected in the audio stream.
        example: |-
          {
            "startTimeMs": 1250,
            "confidence": 0.95
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: speechStarted
      - &ref_10
        id: speechStopped
        payload:
          - name: speechStopped
            description: >-
              Signal raised when STT detects silence after speech has stopped.
              Useful for tracking pauses and implementing custom turn-taking
              logic.
            type: object
            properties:
              - name: silenceDurationMs
                type: integer
                description: >-
                  The duration of silence detected after speech stopped, in
                  milliseconds.
                required: false
        headers: []
        jsonPayloadSchema:
          type: object
          properties:
            silenceDurationMs:
              type: integer
              format: int32
              description: >-
                The duration of silence detected after speech stopped, in
                milliseconds.
              x-parser-schema-id: <anonymous-schema-41>
          examples:
            - silenceDurationMs: 750
          x-parser-schema-id: SpeechStoppedResponsePayload
        title: Speech stopped
        description: >-
          Signal raised when STT detects silence after speech has stopped.
          Useful for tracking pauses and implementing custom turn-taking logic.
        example: |-
          {
            "silenceDurationMs": 750
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: speechStopped
    bindings: []
    extensions: *ref_0
sendOperations:
  - *ref_1
receiveOperations:
  - *ref_2
sendMessages:
  - *ref_3
  - *ref_4
  - *ref_5
  - *ref_6
receiveMessages:
  - *ref_7
  - *ref_8
  - *ref_9
  - *ref_10
extensions:
  - id: x-parser-unique-object-id
    value: sttStream
securitySchemes:
  - id: auth
    name: authorization
    type: httpApiKey
    description: >-
      Your [authentication](../../../api-reference/introduction) credentials.
      For Basic authentication, set the value to `Basic $INWORLD_API_KEY`.
    in: query
    extensions: []

````
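The four server-to-client messages in the spec (`transcription`, `usage`, `speechStarted`, `speechStopped`) can be consumed with a small dispatcher. The Python sketch below assumes that each incoming WebSocket frame is a JSON object whose single top-level key is the message id from this spec (for example `{"speechStopped": {"silenceDurationMs": 750}}`); that envelope shape is an assumption, so verify it against real server frames before relying on it. `TurnState` and `handle_server_message` are hypothetical names, and the WebSocket transport itself is left out.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TurnState:
    """Hypothetical client-side state for one STT streaming session."""
    final_transcripts: List[str] = field(default_factory=list)
    latest_interim: str = ""
    speaking: bool = False
    last_silence_ms: Optional[int] = None


def handle_server_message(raw: str, state: TurnState) -> str:
    """Update `state` from one server frame and return the event name handled.

    ASSUMPTION: the frame is {"<messageId>": {...payload...}}, where
    messageId is one of transcription, usage, speechStarted, speechStopped.
    """
    message = json.loads(raw)

    if "transcription" in message:
        result = message["transcription"]
        text = result.get("transcript", "")
        if result.get("isFinal", False):
            # Final results are committed and clear any pending interim text.
            state.final_transcripts.append(text)
            state.latest_interim = ""
        else:
            # Interim results may still change, so keep only the latest one.
            state.latest_interim = text
        return "transcription"

    if "speechStarted" in message:
        state.speaking = True
        return "speechStarted"

    if "speechStopped" in message:
        state.speaking = False
        state.last_silence_ms = message["speechStopped"].get("silenceDurationMs")
        return "speechStopped"

    if "usage" in message:
        # The spec marks usage as not yet populated, so nothing is recorded.
        return "usage"

    return "unknown"
```

Interim text is overwritten rather than appended because later results for the same speech typically revise it; only results with `isFinal: true` are committed. A custom turn-taking policy could, for instance, inspect `last_silence_ms` after each `speechStopped` event before deciding whether to send `endTurn`.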