# Synthesize speech (WebSocket)

> Generate audio from text input while managing multiple independent audio generation streams over a single WebSocket connection.

Each independent audio stream corresponds to a *context*, identified by `contextId`, that maintains its own state. To use the API:
- Create a context with its audio generation configuration. By default, up to 20 concurrent connections are allowed, with a maximum of 5 contexts per connection.
- Send text to be synthesized to a specific context. The `contextId` is optional if there is only one context.
- Each context maintains its own buffer, which can be flushed manually or automatically when a time or size threshold is reached (see `maxBufferDelayMs` and `bufferCharThreshold` in the context configurations).
- If text is sent in full sentences or phrases, we recommend enabling `autoMode`, which lets the server balance latency and generation quality automatically.
- Responses contain the `contextId` so you can match the audio to the request.
- Close a context when it is no longer needed.
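
The steps above can be sketched as a set of small message builders. This is a minimal illustration, not an official client: the helper names (`create_context_message`, `send_text_message`, `flush_context_message`, `split_text`) are hypothetical, and only the JSON shapes are taken from the message examples in the schema on this page. The sketch only builds the JSON strings; sending them requires a WebSocket client of your choice.

```python
import json

MAX_TEXT_CHARS = 1000  # per-message limit stated for send_text


def _with_context(message, context_id):
    """Attach contextId only when the caller targets a specific context."""
    if context_id is not None:
        message["contextId"] = context_id
    return json.dumps(message)


def create_context_message(voice_id, model_id, context_id=None, **options):
    """Build a `create` message; options are extra fields such as autoMode."""
    create = {"voiceId": voice_id, "modelId": model_id, **options}
    return _with_context({"create": create}, context_id)


def send_text_message(text, context_id=None, flush=False):
    """Build a `send_text` message, optionally flushing the buffer with it."""
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    body = {"text": text}
    if flush:
        body["flush_context"] = {}
    return _with_context({"send_text": body}, context_id)


def flush_context_message(context_id=None):
    """Build a standalone `flush_context` message."""
    return _with_context({"flush_context": {}}, context_id)


def split_text(text, limit=MAX_TEXT_CHARS):
    """Naively slice long text into pieces of at most `limit` characters.

    Assumption for illustration only: a real client would likely split on
    sentence boundaries instead of fixed offsets.
    """
    return [text[i:i + limit] for i in range(0, len(text), limit)]


if __name__ == "__main__":
    print(create_context_message("Dennis", "inworld-tts-2", "ctx-1", autoMode=True))
    for piece in split_text("Hello, what a wonderful day to be a text-to-speech model!"):
        print(send_text_message(piece, "ctx-1"))
    print(flush_context_message("ctx-1"))
```

Omitting `context_id` leaves `contextId` out of the message, which matches the single-context usage described above.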



## AsyncAPI

````yaml api-reference/ttsAPI/synthesize-speech-websocket.json ttsStream
id: ttsStream
title: Tts stream
description: ''
servers:
  - id: production
    protocol: wss
    host: api.inworld.ai
    bindings: []
    variables: []
address: /tts/v1/voice:streamBidirectional
parameters: []
bindings: []
operations:
  - &ref_13
    id: sendRequest
    title: Send request
    type: receive
    messages:
      - &ref_15
        id: createContext
        payload:
          - name: Create Context
            description: >-
              Create a new context with the specified voice and configuration.
              A context is an independent conversation happening over the
              connection. The configuration of each context is completely
              separate: contexts can use different voice IDs, models, output
              formats, and so on. *Note*: each connection supports at most 5
              contexts. If you don't need multiple contexts, you can omit the
              `contextId` in the message to use a single-context connection.
            type: object
            properties:
              - name: create
                type: object
                required: true
                properties:
                  - name: voiceId
                    type: string
                    description: The identifier of the voice to use for the synthesis
                    required: true
                  - name: modelId
                    type: string
                    description: >-
                      The ID of the model to use for synthesizing speech. See
                      [Models](../../../tts/tts-models) for available models.
                    required: true
                  - name: audioConfig
                    type: object
                    required: false
                    properties:
                      - name: audioEncoding
                        type: string
                        description: >-
                          The desired output format of the synthesized audio.
                          Defaults to `MP3`.
                           - `LINEAR16`: Uncompressed 16-bit signed little-endian samples (Linear PCM). The WAV header is included in every audio chunk.
                           - `MP3`: MP3 audio.
                           - `OGG_OPUS`: Opus encoded audio wrapped in an ogg container. The result will be a file which can be played natively on Android, and in browsers (at least Chrome and Firefox). The quality of the encoding is considerably higher than MP3 while using approximately the same bitrate.
                           - `ALAW`: ALAW encoded audio. 8-bit companded PCM.
                           - `MULAW`: MULAW encoded audio. 8-bit companded PCM.
                           - `PCM`: PCM audio. Uncompressed 16-bit signed little-endian samples with no WAV header.
                           - `WAV`: WAV audio. Uncompressed 16-bit signed little-endian samples. The WAV header is included in the first audio chunk only. On each flush_completed response, the next audio chunk will also start with a header.
                        enumValues:
                          - AUDIO_ENCODING_UNSPECIFIED
                          - LINEAR16
                          - MP3
                          - OGG_OPUS
                          - ALAW
                          - MULAW
                          - PCM
                          - WAV
                        required: false
                      - name: sampleRateHertz
                        type: integer
                        description: >-
                          The synthesis sample rate (in hertz) for this audio.
                          Accepts values within the range [8000, 48000].

                           If the specified rate differs from the voice's natural sample rate, the audio is resampled to the requested rate, which might reduce audio quality. If the requested rate is not supported for the chosen encoding, the request fails with an error. The default is 48,000.
                        required: false
                      - name: bitRate
                        type: integer
                        description: >-
                          Bits per second of the audio. Only for compressed
                          audio formats (`MP3`, `OGG_OPUS`). The default is
                          128,000.
                        required: false
                      - name: speakingRate
                        type: number
                        description: >-
                          Speaking rate/speed, in the range [0.5, 1.5]. The
                          default is 1.0, which is the normal native speed
                          supported by the specific voice. We recommend using
                          values above 0.8 to ensure high quality.
                        required: false
                  - name: temperature
                    type: number
                    description: >-
                      *Ignored on `inworld-tts-2`. Use
                      [`deliveryMode`](#deliveryMode) instead.*


                      Determines the degree of randomness when sampling audio
                      tokens to generate the response.


                      Defaults to 1.0. Accepts values between 0 (exclusive) and
                      2 (inclusive). Higher values will make the output more
                      random and can lead to more expressive results. Lower
                      values will make it more deterministic. If 0 is provided,
                      the default value will be used.


                      For the most stable results, we recommend using the
                      default value.
                    required: false
                  - name: timestampType
                    type: string
                    description: >-
                      Controls timestamp metadata returned with the audio. When
                      enabled, the response includes timing arrays, which can be
                      useful for word-highlighting, karaoke-style captions, and
                      lipsync.


                      - WORD: Output arrays under `timestampInfo.wordAlignment`
                      (words, wordStartTimeSeconds, wordEndTimeSeconds).

                      - CHARACTER: Output arrays under
                      `timestampInfo.characterAlignment` (characters,
                      characterStartTimeSeconds, characterEndTimeSeconds).

                      - TIMESTAMP_TYPE_UNSPECIFIED: Do not compute alignment;
                      timestamp arrays will be empty or omitted.


                      **Latency note:** Alignment adds additional computation.
                      Enabling alignment can increase latency.


                      The timestamps reset per flush, either triggered manually
                      or automatically by the server. When you receive a
                      `flushCompleted` message, the timestamps for subsequent
                      chunks will start from 0.
                    enumValues:
                      - TIMESTAMP_TYPE_UNSPECIFIED
                      - WORD
                      - CHARACTER
                    required: false
                  - name: maxBufferDelayMs
                    type: integer
                    description: >-
                      If set, determines the maximum time in milliseconds to
                      buffer before starting generation. The timer starts
                      running when the first text in the buffer is received and
                      resets when new text arrives.


                      If not set or set to 0, there will be no time-based limit
                      on the buffer. Instead, the user will need to flush the
                      buffer or set the bufferCharThreshold to trigger audio
                      generation.


                      Length-based and timeout-based flushing can be used
                      together: if both `maxBufferDelayMs` and
                      `bufferCharThreshold` are set, the server flushes when
                      either condition is met.
                    required: false
                  - name: bufferCharThreshold
                    type: integer
                    description: >-
                      Defines the minimum number of characters in the buffer
                      that would automatically trigger audio generation. This
                      allows you to rely on automatic triggering instead of
                      calling flush manually. If set to 0 or left undefined, the
                      threshold defaults to 1000 to ensure stable behavior.
                      Cannot be set to a value greater than 1000. Note that the
                      buffer will automatically flush if more than 1000
                      characters are accumulated, regardless of this setting.
                    required: false
                  - name: applyTextNormalization
                    type: string
                    description: >-
                      When enabled, text normalization automatically expands and
                      standardizes things like numbers, dates, times, and
                      abbreviations before converting them to speech. For
                      example, Dr. Smith becomes Doctor Smith, and 3/10/25 is
                      spoken as March tenth, twenty twenty-five. Turning this
                      off may reduce latency, but the speech output will read
                      the text exactly as written. Defaults to automatically
                      deciding whether to apply text normalization.
                    enumValues:
                      - APPLY_TEXT_NORMALIZATION_UNSPECIFIED
                      - 'ON'
                      - 'OFF'
                    required: false
                  - name: autoMode
                    type: boolean
                    description: >-
                      When enabled, the server will control flushing of buffered
                      text to achieve minimal latency, while still maintaining
                      high quality audio output. Recommended when texts are sent
                      in full sentences/phrases. The default is false.
                    required: false
                  - name: timestampTransportStrategy
                    type: string
                    description: >-
                      The transport strategy for timestamp info.


                      - `TIMESTAMP_TRANSPORT_STRATEGY_UNSPECIFIED`: The service
                      will automatically decide the transport strategy.

                      - `SYNC`: Timestamps will be returned in the same message
                      as the audio data.

                      - `ASYNC`: Timestamps may be returned in a trailing
                      message after the audio data. Use this strategy to reduce
                      the latency of the first audio chunk with v1.5+ models.
                    enumValues:
                      - TIMESTAMP_TRANSPORT_STRATEGY_UNSPECIFIED
                      - SYNC
                      - ASYNC
                    required: false
                  - name: language
                    type: string
                    description: >-
                      BCP-47 language tag (e.g., `en-US`, `fr-FR`, `ja-JP`)
                      specifying the language that the given voice should speak
                      the text in. If a localized voice prompt exists for the
                      language, it will be used. When omitted, the original
                      voice prompt will be used and the language will be
                      auto-detected from the input text. If an invalid language
                      code is provided, an error will be returned.


                      See [Languages](../../../tts/capabilities/multilingual)
                      for more details.
                    required: false
                  - name: deliveryMode
                    type: string
                    description: >-
                      *Only supported by `inworld-tts-2`. The field is ignored
                      on other models.*


                      Controls how varied the output is. 


                      - `DELIVERY_MODE_UNSPECIFIED`: Defaults to `BALANCED`
                      behavior.

                      - `STABLE`: Optimizes for more consistent, predictable
                      output.

                      - `BALANCED`: Balanced between stability and diversity.

                      - `CREATIVE`: Optimizes for increased emotional range and
                      variation.
                    enumValues:
                      - DELIVERY_MODE_UNSPECIFIED
                      - STABLE
                      - BALANCED
                      - CREATIVE
                    required: false
              - name: contextId
                type: string
                description: Optional context ID. If not provided, one will be generated
                required: false
        headers: []
        jsonPayloadSchema:
          type: object
          properties:
            create:
              type: object
              properties:
                voiceId: &ref_0
                  type: string
                  description: The identifier of the voice to use for the synthesis
                  x-parser-schema-id: VoiceId
                modelId:
                  type: string
                  description: >-
                    The ID of the model to use for synthesizing speech. See
                    [Models](../../../tts/tts-models) for available models.
                  x-parser-schema-id: <anonymous-schema-2>
                audioConfig: &ref_1
                  type: object
                  properties:
                    audioEncoding:
                      type: string
                      enum:
                        - AUDIO_ENCODING_UNSPECIFIED
                        - LINEAR16
                        - MP3
                        - OGG_OPUS
                        - ALAW
                        - MULAW
                        - PCM
                        - WAV
                      default: MP3
                      description: >-
                        The desired output format of the synthesized audio.
                        Defaults to `MP3`.
                         - `LINEAR16`: Uncompressed 16-bit signed little-endian samples (Linear PCM). The WAV header is included in every audio chunk.
                         - `MP3`: MP3 audio.
                         - `OGG_OPUS`: Opus encoded audio wrapped in an ogg container. The result will be a file which can be played natively on Android, and in browsers (at least Chrome and Firefox). The quality of the encoding is considerably higher than MP3 while using approximately the same bitrate.
                         - `ALAW`: ALAW encoded audio. 8-bit companded PCM.
                         - `MULAW`: MULAW encoded audio. 8-bit companded PCM.
                         - `PCM`: PCM audio. Uncompressed 16-bit signed little-endian samples with no WAV header.
                         - `WAV`: WAV audio. Uncompressed 16-bit signed little-endian samples. The WAV header is included in the first audio chunk only. On each flush_completed response, the next audio chunk will also start with a header.
                      x-parser-schema-id: <anonymous-schema-3>
                    sampleRateHertz:
                      type: integer
                      format: int32
                      description: >-
                        The synthesis sample rate (in hertz) for this audio.
                        Accepts values within the range [8000, 48000].

                         If the specified rate differs from the voice's natural sample rate, the audio is resampled to the requested rate, which might reduce audio quality. If the requested rate is not supported for the chosen encoding, the request fails with an error. The default is 48,000.
                      x-parser-schema-id: <anonymous-schema-4>
                    bitRate:
                      type: integer
                      format: int32
                      description: >-
                        Bits per second of the audio. Only for compressed audio
                        formats (`MP3`, `OGG_OPUS`). The default is 128,000.
                      x-parser-schema-id: <anonymous-schema-5>
                    speakingRate:
                      type: number
                      format: double
                      description: >-
                        Speaking rate/speed, in the range [0.5, 1.5]. The
                        default is 1.0, which is the normal native speed
                        supported by the specific voice. We recommend using
                        values above 0.8 to ensure high quality.
                      x-parser-schema-id: <anonymous-schema-6>
                  x-parser-schema-id: AudioConfig
                temperature: &ref_2
                  type: number
                  format: double
                  description: >-
                    *Ignored on `inworld-tts-2`. Use
                    [`deliveryMode`](#deliveryMode) instead.*


                    Determines the degree of randomness when sampling audio
                    tokens to generate the response.


                    Defaults to 1.0. Accepts values between 0 (exclusive) and 2
                    (inclusive). Higher values will make the output more random
                    and can lead to more expressive results. Lower values will
                    make it more deterministic. If 0 is provided, the default
                    value will be used.


                    For the most stable results, we recommend using the default
                    value.
                  x-parser-schema-id: Temperature
                timestampType: &ref_3
                  type: string
                  enum:
                    - TIMESTAMP_TYPE_UNSPECIFIED
                    - WORD
                    - CHARACTER
                  default: TIMESTAMP_TYPE_UNSPECIFIED
                  description: >-
                    Controls timestamp metadata returned with the audio. When
                    enabled, the response includes timing arrays, which can be
                    useful for word-highlighting, karaoke-style captions, and
                    lipsync.


                    - WORD: Output arrays under `timestampInfo.wordAlignment`
                    (words, wordStartTimeSeconds, wordEndTimeSeconds).

                    - CHARACTER: Output arrays under
                    `timestampInfo.characterAlignment` (characters,
                    characterStartTimeSeconds, characterEndTimeSeconds).

                    - TIMESTAMP_TYPE_UNSPECIFIED: Do not compute alignment;
                    timestamp arrays will be empty or omitted.


                    **Latency note:** Alignment adds additional computation.
                    Enabling alignment can increase latency.


                    The timestamps reset per flush, either triggered manually or
                    automatically by the server. When you receive a
                    `flushCompleted` message, the timestamps for subsequent
                    chunks will start from 0.
                  x-parser-schema-id: TimestampType
                maxBufferDelayMs: &ref_4
                  type: integer
                  format: int32
                  description: >-
                    If set, determines the maximum time in milliseconds to
                    buffer before starting generation. The timer starts running
                    when the first text in the buffer is received and resets
                    when new text arrives.


                    If not set or set to 0, there will be no time-based limit on
                    the buffer. Instead, the user will need to flush the buffer
                    or set the bufferCharThreshold to trigger audio generation.


                    Length-based and timeout-based flushing can be used
                    together: if both `maxBufferDelayMs` and
                    `bufferCharThreshold` are set, the server flushes when
                    either condition is met.
                  x-parser-schema-id: MaxBufferDelayMs
                bufferCharThreshold: &ref_5
                  type: integer
                  format: int32
                  description: >-
                    Defines the minimum number of characters in the buffer that
                    would automatically trigger audio generation. This allows
                    you to rely on automatic triggering instead of calling flush
                    manually. If set to 0 or left undefined, the threshold
                    defaults to 1000 to ensure stable behavior. Cannot be set to
                    a value greater than 1000. Note that the buffer will
                    automatically flush if more than 1000 characters are
                    accumulated, regardless of this setting.
                  x-parser-schema-id: BufferCharThreshold
                applyTextNormalization: &ref_6
                  type: string
                  enum:
                    - APPLY_TEXT_NORMALIZATION_UNSPECIFIED
                    - 'ON'
                    - 'OFF'
                  default: APPLY_TEXT_NORMALIZATION_UNSPECIFIED
                  description: >-
                    When enabled, text normalization automatically expands and
                    standardizes things like numbers, dates, times, and
                    abbreviations before converting them to speech. For example,
                    Dr. Smith becomes Doctor Smith, and 3/10/25 is spoken as
                    March tenth, twenty twenty-five. Turning this off may reduce
                    latency, but the speech output will read the text exactly as
                    written. Defaults to automatically deciding whether to apply
                    text normalization.
                  x-parser-schema-id: ApplyTextNormalization
                autoMode: &ref_7
                  type: boolean
                  default: false
                  description: >-
                    When enabled, the server will control flushing of buffered
                    text to achieve minimal latency, while still maintaining
                    high quality audio output. Recommended when texts are sent
                    in full sentences/phrases. The default is false.
                  x-parser-schema-id: AutoMode
                timestampTransportStrategy: &ref_8
                  type: string
                  enum:
                    - TIMESTAMP_TRANSPORT_STRATEGY_UNSPECIFIED
                    - SYNC
                    - ASYNC
                  default: TIMESTAMP_TRANSPORT_STRATEGY_UNSPECIFIED
                  description: >-
                    The transport strategy for timestamp info.


                    - `TIMESTAMP_TRANSPORT_STRATEGY_UNSPECIFIED`: The service
                    will automatically decide the transport strategy.

                    - `SYNC`: Timestamps will be returned in the same message as
                    the audio data.

                    - `ASYNC`: Timestamps may be returned in a trailing message
                    after the audio data. Use this strategy to reduce the
                    latency of the first audio chunk with v1.5+ models.
                  x-parser-schema-id: TimestampTransportStrategy
                language: &ref_9
                  type: string
                  description: >-
                    BCP-47 language tag (e.g., `en-US`, `fr-FR`, `ja-JP`)
                    specifying the language that the given voice should speak
                    the text in. If a localized voice prompt exists for the
                    language, it will be used. When omitted, the original voice
                    prompt will be used and the language will be auto-detected
                    from the input text. If an invalid language code is
                    provided, an error will be returned.


                    See [Languages](../../../tts/capabilities/multilingual) for
                    more details.
                  x-parser-schema-id: Language
                deliveryMode: &ref_10
                  type: string
                  enum:
                    - DELIVERY_MODE_UNSPECIFIED
                    - STABLE
                    - BALANCED
                    - CREATIVE
                  default: DELIVERY_MODE_UNSPECIFIED
                  description: >-
                    *Only supported by `inworld-tts-2`. The field is ignored on
                    other models.*


                    Controls how varied the output is. 


                    - `DELIVERY_MODE_UNSPECIFIED`: Defaults to `BALANCED`
                    behavior.

                    - `STABLE`: Optimizes for more consistent, predictable
                    output.

                    - `BALANCED`: Balanced between stability and diversity.

                    - `CREATIVE`: Optimizes for increased emotional range and
                    variation.
                  x-parser-schema-id: DeliveryMode
              required:
                - voiceId
                - modelId
              x-parser-schema-id: <anonymous-schema-1>
            contextId:
              type: string
              description: Optional context ID. If not provided, one will be generated
              x-parser-schema-id: <anonymous-schema-7>
          required:
            - create
          examples:
            - create:
                voiceId: Dennis
                modelId: inworld-tts-2
                bufferCharThreshold: 100
                autoMode: true
                timestampType: WORD
                timestampTransportStrategy: ASYNC
              contextId: ctx-1
          x-parser-schema-id: CreateContextPayload
        title: Create Context
        description: >-
          Create a new context with the specified voice and configuration. A
          context is an independent conversation happening over the connection.
          The configuration of each context is completely separate: contexts can
          use different voice IDs, models, output formats, and so on. *Note*:
          each connection supports at most 5 contexts. If you don't need
          multiple contexts, you can omit the `contextId` in the message to use
          a single-context connection.
        example: |-
          {
            "create": {
              "voiceId": "Dennis",
              "modelId": "inworld-tts-2",
              "bufferCharThreshold": 100,
              "autoMode": true,
              "timestampType": "WORD",
              "timestampTransportStrategy": "ASYNC"
            },
            "contextId": "ctx-1"
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: createContext
      - &ref_16
        id: sendText
        payload:
          - name: Send Text
            description: >-
              Send text to be synthesized for a specific context. You can only
              send up to 1000 characters in a single send_text request. Text can
              be buffered on the server or immediately flushed by including
              `flush_context` in the message.
            type: object
            properties:
              - name: send_text
                type: object
                required: true
                properties:
                  - name: text
                    type: string
                    description: >-
                      The text to synthesize. Maximum 1000 characters per
                      send_text request.
                    required: true
                  - name: flush_context
                    type: object
                    description: >-
                      Flush a context and start synthesis of all accumulated
                      text
                    required: false
                    properties: []
              - name: contextId
                type: string
                description: >-
                  The target context for this message. Optional if only one
                  context has been opened
                required: false
        headers: []
        jsonPayloadSchema:
          type: object
          properties:
            send_text:
              type: object
              properties:
                text:
                  type: string
                  description: >-
                    The text to synthesize. Maximum 1000 characters per
                    send_text request.
                  x-parser-schema-id: <anonymous-schema-9>
                flush_context:
                  type: object
                  properties: {}
                  description: Flush a context and start synthesis of all accumulated text
                  x-parser-schema-id: <anonymous-schema-10>
              required:
                - text
              x-parser-schema-id: <anonymous-schema-8>
            contextId:
              type: string
              description: >-
                The target context for this message. Optional if only one
                context has been opened
              x-parser-schema-id: <anonymous-schema-11>
          required:
            - send_text
          examples:
            - send_text:
                text: Hello, what a wonderful day to be a text-to-speech model!
                flush_context: {}
              contextId: ctx-1
          x-parser-schema-id: SendTextPayload
        title: Send Text
        description: >-
          Send text to be synthesized for a specific context. You can only send
          up to 1000 characters in a single send_text request. Text can be
          buffered on the server or immediately flushed by including
          `flush_context` in the message.
        example: |-
          {
            "send_text": {
              "text": "Hello, what a wonderful day to be a text-to-speech model!",
              "flush_context": {}
            },
            "contextId": "ctx-1"
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: sendText
      - &ref_17
        id: flushContext
        payload:
          - name: Flush Context
            description: >-
              Flush a context and start synthesis of all accumulated text. Note
              that the buffer will automatically flush all text if the length of
              text is greater than 1000 characters, regardless of any other
              buffer settings.
            type: object
            properties:
              - name: flush_context
                type: object
                required: true
                properties: []
              - name: contextId
                type: string
                description: >-
                  The context to flush. Optional if only one context has been
                  opened
                required: false
        headers: []
        jsonPayloadSchema:
          type: object
          description: Flush a context and force a synthesis of all accumulated text
          properties:
            flush_context:
              type: object
              properties: {}
              x-parser-schema-id: <anonymous-schema-14>
            contextId:
              type: string
              description: >-
                The context to flush. Optional if only one context has been
                opened
              x-parser-schema-id: <anonymous-schema-15>
          required:
            - flush_context
          examples:
            - flush_context: {}
              contextId: ctx-1
          x-parser-schema-id: FlushContext
        title: Flush Context
        description: >-
          Flush a context and start synthesis of all accumulated text. Note that
          the buffer will automatically flush all text if the length of text is
          greater than 1000 characters, regardless of any other buffer settings.
        example: |-
          {
            "flush_context": {},
            "contextId": "ctx-1"
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: flushContext
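      # Illustrative note (not part of the generated schema): a minimal sketch
      # of buffering text and then flushing it explicitly. It assumes a context
      # with the example ID "ctx-1" already exists and that no automatic flush
      # (maxBufferDelayMs, bufferCharThreshold, autoMode) fires first.
      # Messages are sent in this order:
      #   {"send_text": {"text": "Hello, "}, "contextId": "ctx-1"}
      #   {"send_text": {"text": "world!"}, "contextId": "ctx-1"}
      #   {"flush_context": {}, "contextId": "ctx-1"}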
      - &ref_18
        id: closeContext
        payload:
          - name: Close Context
            description: >-
              Close an existing context and release all of its resources.
              Closing a context implicitly flushes it first, so all text in
              the buffer is synthesized before the context is closed. Note
              that the session is closed automatically after 10 minutes of
              inactivity across all contexts.
            type: object
            properties:
              - name: close_context
                type: object
                required: true
                properties: []
              - name: contextId
                type: string
                description: The context to close
                required: false
        headers: []
        jsonPayloadSchema:
          type: object
          properties:
            close_context:
              type: object
              properties: {}
              x-parser-schema-id: <anonymous-schema-12>
            contextId:
              type: string
              description: The context to close
              x-parser-schema-id: <anonymous-schema-13>
          required:
            - close_context
          examples:
            - close_context: {}
              contextId: ctx-1
          x-parser-schema-id: CloseContextPayload
        title: Close Context
        description: >-
          Close an existing context and release all of its resources. Closing
          a context implicitly flushes it first, so all text in the buffer is
          synthesized before the context is closed. Note that the session is
          closed automatically after 10 minutes of inactivity across all
          contexts.
        example: |-
          {
            "close_context": {},
            "contextId": "ctx-1"
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: closeContext
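    # Illustrative note (not part of the generated schema): because closing a
    # context flushes it first, a sketch of the final messages for the example
    # context "ctx-1" needs no separate flush_context message:
    #   {"send_text": {"text": "Goodbye."}, "contextId": "ctx-1"}
    #   {"close_context": {}, "contextId": "ctx-1"}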
    bindings: []
    extensions: &ref_12
      - id: x-parser-unique-object-id
        value: ttsStream
  - &ref_14
    id: receiveRequest
    title: Receive request
    type: send
    messages:
      - &ref_19
        id: contextCreated
        payload:
          - name: Context Created
            description: Event sent when a new TTS context has been successfully created
            type: object
            properties:
              - name: result
                type: object
                required: false
                properties:
                  - name: contextId
                    type: string
                    description: The ID of the created context
                    required: false
                  - name: contextCreated
                    type: object
                    required: false
                    properties:
                      - name: voiceId
                        type: string
                        description: The identifier of the voice to use for the synthesis
                        required: false
                      - name: modelId
                        type: string
                        description: >-
                          The ID of the model to use for synthesizing speech.
                          See [Models](../../../tts/tts-models) for available
                          models.
                        required: false
                      - name: audioConfig
                        type: object
                        required: false
                        properties:
                          - name: audioEncoding
                            type: string
                            description: >-
                              The desired output format of the synthesized
                              audio. Defaults to `MP3`.
                               - `LINEAR16`: Uncompressed 16-bit signed little-endian samples (Linear PCM). The WAV header is included in every audio chunk.
                               - `MP3`: MP3 audio.
                               - `OGG_OPUS`: Opus encoded audio wrapped in an ogg container. The result will be a file which can be played natively on Android, and in browsers (at least Chrome and Firefox). The quality of the encoding is considerably higher than MP3 while using approximately the same bitrate.
                               - `ALAW`: ALAW encoded audio. 8-bit companded PCM.
                               - `MULAW`: MULAW encoded audio. 8-bit companded PCM.
                               - `PCM`: PCM audio. Uncompressed 16-bit signed little-endian samples with no WAV header.
                               - `WAV`: WAV audio. Uncompressed 16-bit signed little-endian samples. The WAV header is included only in the first audio chunk of the stream and in the first audio chunk that follows each `flushCompleted` response.
                            enumValues:
                              - AUDIO_ENCODING_UNSPECIFIED
                              - LINEAR16
                              - MP3
                              - OGG_OPUS
                              - ALAW
                              - MULAW
                              - PCM
                              - WAV
                            required: false
                          - name: sampleRateHertz
                            type: integer
                            description: >-
                              The synthesis sample rate (in hertz) for this
                              audio. Accepts values within the range [8000,
                              48000].

                               If the specified rate differs from the voice's natural sample rate, the audio is converted to the requested rate, which might reduce audio quality. If the specified rate is not supported for the chosen encoding, the request fails and returns an error. The default is 48,000.
                            required: false
                          - name: bitRate
                            type: integer
                            description: >-
                              Bits per second of the audio. Only for compressed
                              audio formats (`MP3`, `OGG_OPUS`). The default is
                              128,000.
                            required: false
                          - name: speakingRate
                            type: number
                            description: >-
                              Speaking rate/speed, in the range [0.5, 1.5]. The
                              default is 1.0, which is the normal native speed
                              supported by the specific voice. We recommend
                              using values above 0.8 to ensure high quality.
                            required: false
                      - name: temperature
                        type: number
                        description: >-
                          *Ignored on `inworld-tts-2`. Use
                          [`deliveryMode`](#deliveryMode) instead.*


                          Determines the degree of randomness when sampling
                          audio tokens to generate the response.


                          Defaults to 1.0. Accepts values between 0 (exclusive)
                          and 2 (inclusive). Higher values will make the output
                          more random and can lead to more expressive results.
                          Lower values will make it more deterministic. If 0 is
                          provided, the default value will be used.


                          For the most stable results, we recommend using the
                          default value.
                        required: false
                      - name: timestampType
                        type: string
                        description: >-
                          Controls timestamp metadata returned with the audio.
                          When enabled, the response includes timing arrays,
                          which can be useful for word-highlighting,
                          karaoke-style captions, and lipsync.


                          - WORD: Output arrays under
                          `timestampInfo.wordAlignment` (words,
                          wordStartTimeSeconds, wordEndTimeSeconds).

                          - CHARACTER: Output arrays under
                          `timestampInfo.characterAlignment` (characters,
                          characterStartTimeSeconds, characterEndTimeSeconds).

                          - TIMESTAMP_TYPE_UNSPECIFIED: Do not compute
                          alignment; timestamp arrays will be empty or omitted.


                          **Latency note:** Alignment adds additional
                          computation. Enabling alignment can increase latency.


                          The timestamps reset per flush, either triggered
                          manually or automatically by the server. When you
                          receive a `flushCompleted` message, the timestamps for
                          subsequent chunks will start from 0.
                        enumValues:
                          - TIMESTAMP_TYPE_UNSPECIFIED
                          - WORD
                          - CHARACTER
                        required: false
                      - name: maxBufferDelayMs
                        type: integer
                        description: >-
                          If set, determines the maximum time in milliseconds to
                          buffer before starting generation. The timer starts
                          running when the first text in the buffer is received
                          and resets when new text arrives.


                          If not set or set to 0, there will be no time-based
                          limit on the buffer. Instead, the user will need to
                          flush the buffer or set the bufferCharThreshold to
                          trigger audio generation.


                          Length-based and timeout-based flushing can be used
                          together: if both maxBufferDelayMs and
                          bufferCharThreshold are set, the server flushes when
                          either condition is met.
                        required: false
                      - name: bufferCharThreshold
                        type: integer
                        description: >-
                          Defines the minimum number of characters in the buffer
                          that would automatically trigger audio generation.
                          This allows you to rely on automatic triggering
                          instead of calling flush manually. If set to 0 or left
                          undefined, the threshold defaults to 1000 to ensure
                          stable behavior. Cannot be set to a value greater than
                          1000. Note that the buffer will automatically flush if
                          more than 1000 characters are accumulated, regardless
                          of this setting.
                        required: false
                      - name: applyTextNormalization
                        type: string
                        description: >-
                          When enabled, text normalization automatically expands
                          and standardizes things like numbers, dates, times,
                          and abbreviations before converting them to speech.
                          For example, Dr. Smith becomes Doctor Smith, and
                          3/10/25 is spoken as March tenth, twenty twenty-five.
                          Turning this off may reduce latency, but the speech
                          output will read the text exactly as written. Defaults
                          to automatically deciding whether to apply text
                          normalization.
                        enumValues:
                          - APPLY_TEXT_NORMALIZATION_UNSPECIFIED
                          - 'ON'
                          - 'OFF'
                        required: false
                      - name: autoMode
                        type: boolean
                        description: >-
                          When enabled, the server will control flushing of
                          buffered text to achieve minimal latency, while still
                          maintaining high quality audio output. Recommended
                          when texts are sent in full sentences/phrases. The
                          default is false.
                        required: false
                      - name: timestampTransportStrategy
                        type: string
                        description: >-
                          The transport strategy for timestamp information.


                          - `TIMESTAMP_TRANSPORT_STRATEGY_UNSPECIFIED`: The
                          service will automatically decide the transport
                          strategy.

                          - `SYNC`: Timestamps will be returned in the same
                          message as the audio data.

                          - `ASYNC`: Timestamps may be returned in a trailing
                          message after the audio data. Use this strategy to
                          reduce the latency of the first audio chunk with
                          v1.5+ models.
                        enumValues:
                          - TIMESTAMP_TRANSPORT_STRATEGY_UNSPECIFIED
                          - SYNC
                          - ASYNC
                        required: false
                      - name: language
                        type: string
                        description: >-
                          BCP-47 language tag (e.g., `en-US`, `fr-FR`, `ja-JP`)
                          specifying the language that the given voice should
                          speak the text in. If a localized voice prompt exists
                          for the language, it will be used. When omitted, the
                          original voice prompt will be used and the language
                          will be auto-detected from the input text. If an
                          invalid language code is provided, an error will be
                          returned.


                          See
                          [Languages](../../../tts/capabilities/multilingual)
                          for more details.
                        required: false
                      - name: deliveryMode
                        type: string
                        description: >-
                          *Only supported by `inworld-tts-2`. The field is
                          ignored on other models.*


                          Controls how varied the output is. 


                          - `DELIVERY_MODE_UNSPECIFIED`: Defaults to `BALANCED`
                          behavior.

                          - `STABLE`: Optimizes for more consistent, predictable
                          output.

                          - `BALANCED`: Balanced between stability and
                          diversity.

                          - `CREATIVE`: Optimizes for increased emotional range
                          and variation.
                        enumValues:
                          - DELIVERY_MODE_UNSPECIFIED
                          - STABLE
                          - BALANCED
                          - CREATIVE
                        required: false
                  - name: status
                    type: object
                    description: >-
                      Status information for gRPC responses, including any error
                      details if applicable
                    required: false
                    properties:
                      - name: code
                        type: integer
                        description: >-
                          The status code, as specified by [gRPC status
                          codes](https://grpc.io/docs/guides/status-codes/).
                        required: false
                      - name: message
                        type: string
                        description: A short description of the error
                        required: false
                      - name: details
                        type: array
                        description: Additional status or error details
                        required: false
        headers: []
        jsonPayloadSchema:
          type: object
          properties:
            result:
              type: object
              properties:
                contextId:
                  type: string
                  description: The ID of the created context
                  x-parser-schema-id: <anonymous-schema-17>
                contextCreated:
                  type: object
                  properties:
                    voiceId: *ref_0
                    modelId:
                      type: string
                      description: >-
                        The ID of the model to use for synthesizing speech. See
                        [Models](../../../tts/tts-models) for available models.
                      x-parser-schema-id: <anonymous-schema-19>
                    audioConfig: *ref_1
                    temperature: *ref_2
                    timestampType: *ref_3
                    maxBufferDelayMs: *ref_4
                    bufferCharThreshold: *ref_5
                    applyTextNormalization: *ref_6
                    autoMode: *ref_7
                    timestampTransportStrategy: *ref_8
                    language: *ref_9
                    deliveryMode: *ref_10
                  x-parser-schema-id: <anonymous-schema-18>
                status: &ref_11
                  type: object
                  properties:
                    code:
                      type: integer
                      format: int32
                      description: >-
                        The status code, as specified by [gRPC status
                        codes](https://grpc.io/docs/guides/status-codes/).
                      x-parser-schema-id: <anonymous-schema-20>
                    message:
                      type: string
                      description: A short description of the error
                      x-parser-schema-id: <anonymous-schema-21>
                    details:
                      type: array
                      items:
                        type: object
                        x-parser-schema-id: <anonymous-schema-23>
                      description: Additional status or error details
                      x-parser-schema-id: <anonymous-schema-22>
                  description: >-
                    Status information for gRPC responses, including any error
                    details if applicable
                  x-parser-schema-id: Status
              x-parser-schema-id: <anonymous-schema-16>
          examples:
            - result:
                contextId: ctx-1
                contextCreated:
                  voiceId: Dennis
                  audioConfig:
                    audioEncoding: LINEAR16
                    sampleRateHertz: 16000
                  modelId: inworld-tts-2
                  timestampType: WORD
                  maxBufferDelayMs: 3000
                  autoMode: true
                  timestampTransportStrategy: SYNC
                  language: en-US
                  deliveryMode: BALANCED
                status:
                  code: 0
                  message: ''
                  details: []
          x-parser-schema-id: ContextCreatedPayload
        title: Context Created
        description: Event sent when a new TTS context has been successfully created
        example: |-
          {
            "result": {
              "contextId": "ctx-1",
              "contextCreated": {
                "voiceId": "Dennis",
                "audioConfig": {
                  "audioEncoding": "LINEAR16",
                  "sampleRateHertz": 16000
                },
                "modelId": "inworld-tts-2",
                "timestampType": "WORD",
                "maxBufferDelayMs": 3000,
                "autoMode": true,
                "timestampTransportStrategy": "SYNC",
                "language": "en-US",
                "deliveryMode": "BALANCED"
              },
              "status": {
                "code": 0,
                "message": "",
                "details": []
              }
            }
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: contextCreated
      - &ref_20
        id: audioChunk
        payload:
          - name: Audio Chunk
            description: Audio data chunk containing synthesized speech
            type: object
            properties:
              - name: result
                type: object
                required: false
                properties:
                  - name: contextId
                    type: string
                    description: The context this audio chunk belongs to
                    required: false
                  - name: audioChunk
                    type: object
                    required: false
                    properties:
                      - name: audioContent
                        type: string
                        description: Base64 encoded audio data
                        required: false
                      - name: usage
                        type: object
                        required: false
                        properties:
                          - name: processedCharactersCount
                            type: integer
                            description: >-
                              Number of characters of the input text processed
                              so far
                            required: false
                          - name: modelId
                            type: string
                            description: The model used for speech synthesis
                            required: false
                      - name: timestampInfo
                        type: object
                        description: >-
                          Timestamp alignment information when alignment is
                          enabled
                        required: false
                        properties:
                          - name: wordAlignment
                            type: object
                            description: Word-level alignment when timestampType is WORD
                            required: false
                            properties:
                              - name: words
                                type: array
                                description: Aligned words in order
                                required: false
                              - name: wordStartTimeSeconds
                                type: array
                                description: >-
                                  Start time for each word in seconds from the
                                  beginning of the audio
                                required: false
                              - name: wordEndTimeSeconds
                                type: array
                                description: >-
                                  End time for each word in seconds from the
                                  beginning of the audio
                                required: false
                          - name: characterAlignment
                            type: object
                            description: >-
                              Character-level alignment when timestampType is
                              CHARACTER
                            required: false
                            properties:
                              - name: characters
                                type: array
                                description: >-
                                  Aligned characters (including punctuation and
                                  spaces) in order
                                required: false
                              - name: characterStartTimeSeconds
                                type: array
                                description: >-
                                  Start time for each character in seconds from
                                  the beginning of the audio
                                required: false
                              - name: characterEndTimeSeconds
                                type: array
                                description: >-
                                  End time for each character in seconds from
                                  the beginning of the audio
                                required: false
                      - name: status
                        type: object
                        description: >-
                          Status information for gRPC responses, including any
                          error details if applicable
                        required: false
                        properties:
                          - name: code
                            type: integer
                            description: >-
                              The status code, as specified by [gRPC status
                              codes](https://grpc.io/docs/guides/status-codes/).
                            required: false
                          - name: message
                            type: string
                            description: A short description of the error
                            required: false
                          - name: details
                            type: array
                            description: Additional status or error details
                            required: false
        headers: []
        jsonPayloadSchema:
          type: object
          properties:
            result:
              type: object
              properties:
                contextId:
                  type: string
                  description: The context this audio chunk belongs to
                  x-parser-schema-id: <anonymous-schema-25>
                audioChunk:
                  type: object
                  properties:
                    audioContent:
                      type: string
                      format: byte
                      description: Base64 encoded audio data
                      x-parser-schema-id: <anonymous-schema-27>
                    usage:
                      type: object
                      properties:
                        processedCharactersCount:
                          type: integer
                          format: int32
                          description: >-
                            Number of characters of the input text processed so
                            far
                          x-parser-schema-id: <anonymous-schema-29>
                        modelId:
                          type: string
                          description: The model used for speech synthesis
                          x-parser-schema-id: <anonymous-schema-30>
                      x-parser-schema-id: <anonymous-schema-28>
                    timestampInfo:
                      type: object
                      properties:
                        wordAlignment:
                          type: object
                          properties:
                            words:
                              type: array
                              items:
                                type: string
                                x-parser-schema-id: <anonymous-schema-32>
                              description: Aligned words in order
                              x-parser-schema-id: <anonymous-schema-31>
                            wordStartTimeSeconds:
                              type: array
                              items:
                                type: number
                                x-parser-schema-id: <anonymous-schema-34>
                              description: >-
                                Start time for each word in seconds from the
                                beginning of the audio
                              x-parser-schema-id: <anonymous-schema-33>
                            wordEndTimeSeconds:
                              type: array
                              items:
                                type: number
                                x-parser-schema-id: <anonymous-schema-36>
                              description: >-
                                End time for each word in seconds from the
                                beginning of the audio
                              x-parser-schema-id: <anonymous-schema-35>
                          description: Word-level alignment when timestampType is WORD
                          x-parser-schema-id: WordAlignment
                        characterAlignment:
                          type: object
                          properties:
                            characters:
                              type: array
                              items:
                                type: string
                                x-parser-schema-id: <anonymous-schema-38>
                              description: >-
                                Aligned characters (including punctuation and
                                spaces) in order
                              x-parser-schema-id: <anonymous-schema-37>
                            characterStartTimeSeconds:
                              type: array
                              items:
                                type: number
                                x-parser-schema-id: <anonymous-schema-40>
                              description: >-
                                Start time for each character in seconds from
                                the beginning of the audio
                              x-parser-schema-id: <anonymous-schema-39>
                            characterEndTimeSeconds:
                              type: array
                              items:
                                type: number
                                x-parser-schema-id: <anonymous-schema-42>
                              description: >-
                                End time for each character in seconds from the
                                beginning of the audio
                              x-parser-schema-id: <anonymous-schema-41>
                          description: >-
                            Character-level alignment when timestampType is
                            CHARACTER
                          x-parser-schema-id: CharacterAlignment
                      description: >-
                        Timestamp alignment information when alignment is
                        enabled
                      x-parser-schema-id: TimestampInfo
                    status: *ref_11
                  x-parser-schema-id: <anonymous-schema-26>
              x-parser-schema-id: <anonymous-schema-24>
          examples:
            - result:
                contextId: ctx-1
                audioChunk:
                  audioContent: UklGRgSYAABXQVZFZm10IBAAAAABAAEAgD4AAAB9AAACABAAZGF0YeCX=
                  usage:
                    processedCharactersCount: 79
                    modelId: inworld-tts-2
                  timestampInfo:
                    wordAlignment:
                      words:
                        - Hello,
                        - what
                        - a
                        - wonderful
                        - day
                        - to
                        - be
                        - a
                        - text-to-speech
                        - model.
                      wordStartTimeSeconds:
                        - 0.031
                        - 0.375
                        - 0.901
                        - 1.002
                        - 1.386
                        - 1.548
                        - 1.649
                        - 1.771
                        - 1.852
                        - 2.58
                      wordEndTimeSeconds:
                        - 0.355
                        - 0.86
                        - 0.921
                        - 1.326
                        - 1.528
                        - 1.609
                        - 1.71
                        - 1.791
                        - 2.539
                        - 2.802
                  status:
                    code: 0
                    message: ''
                    details: []
          x-parser-schema-id: AudioChunkPayload
        title: Audio Chunk
        description: Audio data chunk containing synthesized speech
        example: |-
          {
            "result": {
              "contextId": "ctx-1",
              "audioChunk": {
                "audioContent": "UklGRgSYAABXQVZFZm10IBAAAAABAAEAgD4AAAB9AAACABAAZGF0YeCX=",
                "usage": {
                  "processedCharactersCount": 79,
                  "modelId": "inworld-tts-2"
                },
                "timestampInfo": {
                  "wordAlignment": {
                    "words": [
                      "Hello,",
                      "what",
                      "a",
                      "wonderful",
                      "day",
                      "to",
                      "be",
                      "a",
                      "text-to-speech",
                      "model."
                    ],
                    "wordStartTimeSeconds": [
                      0.031,
                      0.375,
                      0.901,
                      1.002,
                      1.386,
                      1.548,
                      1.649,
                      1.771,
                      1.852,
                      2.58
                    ],
                    "wordEndTimeSeconds": [
                      0.355,
                      0.86,
                      0.921,
                      1.326,
                      1.528,
                      1.609,
                      1.71,
                      1.791,
                      2.539,
                      2.802
                    ]
                  }
                },
                "status": {
                  "code": 0,
                  "message": "",
                  "details": []
                }
              }
            }
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: audioChunk
      - &ref_21
        id: contextClosed
        payload:
          - name: Context Closed
            description: Event sent when a context has been closed
            type: object
            properties:
              - name: result
                type: object
                required: false
                properties:
                  - name: contextId
                    type: string
                    description: The ID of the closed context
                    required: false
                  - name: contextClosed
                    type: object
                    required: false
                    properties: []
                  - name: status
                    type: object
                    description: >-
                      Status information for gRPC responses, including any error
                      details if applicable
                    required: false
                    properties:
                      - name: code
                        type: integer
                        description: >-
                          The status code, as specified by [gRPC status
                          codes](https://grpc.io/docs/guides/status-codes/).
                        required: false
                      - name: message
                        type: string
                        description: A short description of the error
                        required: false
                      - name: details
                        type: array
                        description: Additional status or error details
                        required: false
        headers: []
        jsonPayloadSchema:
          type: object
          properties:
            result:
              type: object
              properties:
                contextId:
                  type: string
                  description: The ID of the closed context
                  x-parser-schema-id: <anonymous-schema-44>
                contextClosed:
                  type: object
                  properties: {}
                  x-parser-schema-id: <anonymous-schema-45>
                status: *ref_11
              x-parser-schema-id: <anonymous-schema-43>
          examples:
            - result:
                contextId: ctx-1
                contextClosed: {}
                status:
                  code: 0
                  message: ''
                  details: []
          x-parser-schema-id: ContextClosedPayload
        title: Context Closed
        description: Event sent when a context has been closed
        example: |-
          {
            "result": {
              "contextId": "ctx-1",
              "contextClosed": {},
              "status": {
                "code": 0,
                "message": "",
                "details": []
              }
            }
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: contextClosed
      - &ref_22
        id: flushCompleted
        payload:
          - name: Flush Completed
            description: >-
              Event sent when speech synthesis for a flush of text has
              completed. Some WebSocket clients need an explicit signal for
              this, so an empty "flushCompleted": {} event is emitted at the end
              of speech synthesis for each flush. The implementation currently
              assumes that flushes execute sequentially, so the first
              flushCompleted event corresponds to the first flush call made by
              the client.
            type: object
            properties:
              - name: result
                type: object
                required: false
                properties:
                  - name: contextId
                    type: string
                    description: The ID of the context for which the flush completed
                    required: false
                  - name: flushCompleted
                    type: object
                    required: false
                    properties: []
                  - name: status
                    type: object
                    description: >-
                      Status information for gRPC responses, including any error
                      details if applicable
                    required: false
                    properties:
                      - name: code
                        type: integer
                        description: >-
                          The status code, as specified by [gRPC status
                          codes](https://grpc.io/docs/guides/status-codes/).
                        required: false
                      - name: message
                        type: string
                        description: A short description of the error
                        required: false
                      - name: details
                        type: array
                        description: Additional status or error details
                        required: false
        headers: []
        jsonPayloadSchema:
          type: object
          properties:
            result:
              type: object
              properties:
                contextId:
                  type: string
                  description: The ID of the context for which the flush completed
                  x-parser-schema-id: <anonymous-schema-47>
                flushCompleted:
                  type: object
                  properties: {}
                  x-parser-schema-id: <anonymous-schema-48>
                status: *ref_11
              x-parser-schema-id: <anonymous-schema-46>
          examples:
            - result:
                contextId: ctx-1
                flushCompleted: {}
                status:
                  code: 0
                  message: ''
                  details: []
          x-parser-schema-id: FlushCompletedPayload
        title: Flush Completed
        description: >-
          Event sent when speech synthesis for a flush of text has completed.
          Some WebSocket clients need an explicit signal for this, so an empty
          "flushCompleted": {} event is emitted at the end of speech synthesis
          for each flush. The implementation currently assumes that flushes
          execute sequentially, so the first flushCompleted event corresponds
          to the first flush call made by the client.
        example: |-
          {
            "result": {
              "contextId": "ctx-1",
              "flushCompleted": {},
              "status": {
                "code": 0,
                "message": "",
                "details": []
              }
            }
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: flushCompleted
    bindings: []
    extensions: *ref_12
sendOperations:
  - *ref_13
receiveOperations:
  - *ref_14
sendMessages:
  - *ref_15
  - *ref_16
  - *ref_17
  - *ref_18
receiveMessages:
  - *ref_19
  - *ref_20
  - *ref_21
  - *ref_22
extensions:
  - id: x-parser-unique-object-id
    value: ttsStream
securitySchemes:
  - id: auth
    name: authorization
    type: httpApiKey
    description: >-
      Your [authentication](../../../api-reference/introduction) credentials.
      For Basic authentication, set the value to `Basic $INWORLD_API_KEY`.
    in: query
    extensions: []

````
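
The receive-side messages documented above (`audioChunk`, `contextClosed`, `flushCompleted`) can be handled with a small dispatcher keyed on `contextId`. The following is a minimal sketch, not an official client: the `ResponseRouter` class, its method names, and the flush labels are hypothetical, and it works on already-received JSON text frames rather than opening a connection. It assumes that `status` may appear either next to the event object or nested inside `audioChunk` (as in the Audio Chunk example), and that flush completions are matched in order per context, which is one reading of the sequential-flush note.

```python
import base64
import json
from collections import defaultdict, deque


class ResponseRouter:
    """Hypothetical helper that routes server messages by contextId."""

    def __init__(self):
        self.audio = defaultdict(bytearray)        # contextId -> decoded audio bytes
        self.words = defaultdict(list)             # contextId -> [(word, start, end)]
        self.pending_flushes = defaultdict(deque)  # contextId -> labels of flushes in flight
        self.closed = set()                        # contextIds reported as closed

    def flush_sent(self, context_id, label):
        """Record a flush the client sent, so completions can be matched in order."""
        self.pending_flushes[context_id].append(label)

    def handle(self, raw):
        """Process one JSON text frame and return (event, context_id, detail)."""
        result = json.loads(raw).get("result") or {}
        context_id = result.get("contextId")
        chunk = result.get("audioChunk")
        # Assumption: status sits beside the event object, or inside audioChunk.
        status = result.get("status") or (chunk or {}).get("status") or {}
        if status.get("code", 0) != 0:  # non-zero gRPC status code means an error
            return ("error", context_id, status.get("message", ""))
        if chunk is not None:
            # audioContent is base64-encoded audio; append the decoded bytes.
            self.audio[context_id] += base64.b64decode(chunk.get("audioContent", ""))
            align = (chunk.get("timestampInfo") or {}).get("wordAlignment") or {}
            # The three alignment arrays are parallel, so zip them per word.
            timed = list(zip(align.get("words", []),
                             align.get("wordStartTimeSeconds", []),
                             align.get("wordEndTimeSeconds", [])))
            self.words[context_id].extend(timed)
            return ("audioChunk", context_id, timed)
        if "flushCompleted" in result:
            # Flushes are assumed sequential: the oldest pending flush is done.
            queue = self.pending_flushes[context_id]
            return ("flushCompleted", context_id, queue.popleft() if queue else None)
        if "contextClosed" in result:
            self.closed.add(context_id)
            return ("contextClosed", context_id, None)
        return ("unknown", context_id, None)  # other message types are not handled here


router = ResponseRouter()
router.flush_sent("ctx-1", "first")
frame = json.dumps({"result": {"contextId": "ctx-1", "flushCompleted": {},
                               "status": {"code": 0, "message": "", "details": []}}})
print(router.handle(frame))  # ('flushCompleted', 'ctx-1', 'first')
```

Keeping a per-context queue of pending flushes means a `flushCompleted` event never needs its own identifier: the oldest unanswered flush for that `contextId` is taken to be the one that finished.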