Inworld Voice API Reference
Overview
This guide is a technical reference for Inworld's text-to-speech API, Inworld Voice.
The table below summarizes the available API methods for Inworld Voice.
Name | Request Type | Response Type | Description |
---|---|---|---|
ListVoices | ListVoicesRequest | ListVoicesResponse | Returns a list of supported voices. |
SynthesizeSpeech | SynthesizeSpeechRequest | SynthesizeSpeechResponse stream | Synthesizes speech with response streaming. |
SyncSynthesizeSpeech | SynthesizeSpeechRequest | SynthesizeSpeechResponse | Synthesizes speech without response streaming (i.e., all speech is synthesized before the response is sent). |
List Voices
This section is a reference for the ListVoices method and its messages.
ListVoicesRequest
The top-level message sent by the client to the ListVoices API.
Field | Type | Description |
---|---|---|
language_code | string | Optional but recommended. BCP-47 language tag. If not specified, the API returns all supported voices. If specified, the API returns only voices capable of synthesizing the given language code. For example, if you specify 'en-NZ', all 'en-NZ' voices are returned. If you specify 'no', both 'no-*' (Norwegian) and 'nb-*' (Norwegian Bokmål) voices are returned. |
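The filtering rule described for language_code can be approximated on the client side, for example to filter a cached voice list. The following is a minimal Python sketch, not the service's implementation: the function name `matches_language` and the alias table are assumptions, and the only alias included is the documented case where 'no' also matches 'nb-*'.

```python
def matches_language(voice_language_codes, requested):
    """Return True if a voice with these BCP-47 tags matches the requested tag.

    Client-side approximation of the documented behavior; the server's
    actual matching rules may differ.
    """
    # Assumed alias table: the reference only documents 'no' -> 'no-*' and 'nb-*'.
    aliases = {"no": {"no", "nb"}}
    if not requested:
        return True  # no filter given: every voice is returned
    req = requested.lower()
    req_primary = req.split("-")[0]
    for code in voice_language_codes:
        tag = code.lower()
        primary = tag.split("-")[0]
        if "-" in req:
            # Region-qualified request such as 'en-NZ': require the full tag.
            if tag == req:
                return True
        elif primary in aliases.get(req_primary, {req_primary}):
            # Bare language request such as 'no': compare primary subtags.
            return True
    return False
```

With this sketch, `matches_language(["nb-NO"], "no")` is true, while `matches_language(["en-US"], "en-NZ")` is false.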
ListVoicesResponse
The message returned to the client by the ListVoices API.
Field | Type | Label | Description |
---|---|---|---|
voices | Voice | repeated | A list of voices. |
Voice
The description of a voice supported by the API.
Field | Type | Label | Description |
---|---|---|---|
language_codes | string | repeated | The languages supported by the voice expressed as BCP-47 language tags (e.g. 'en-US', 'es-419', 'cmn-tw'). |
name | string | | The name of the voice. |
voice_metadata | VoiceMetadata | | The metadata of the voice. |
natural_sample_rate_hertz | int32 | | The natural sample rate (in hertz) for the voice. |
VoiceMetadata
The properties associated with a given voice.
Field | Type | Description |
---|---|---|
gender | VoiceGender | The gender of the voice. |
age | VoiceAge | The age group of the voice. |
accent | VoiceAccent | The accent of the voice. |
VoiceGender
The gender of the voice, as described in the SSML voice element.
Name | Number | Description |
---|---|---|
VOICE_GENDER_UNSPECIFIED | 0 | The gender is unspecified or unknown. |
MALE | 1 | A male voice. |
FEMALE | 2 | A female voice. |
NEUTRAL | 3 | A gender-neutral voice. This voice is not yet supported. |
VoiceAge
The age group of the voice.
Name | Number | Description |
---|---|---|
VOICE_AGE_UNSPECIFIED | 0 | The age is unspecified or unknown. |
CHILD | 1 | A child's voice. |
TEEN | 2 | A teenage voice. |
ADULT | 3 | An adult voice. |
SENIOR | 4 | A senior voice. |
VoiceAccent
The free-form name of the accent relative to American English.
Name | Number | Description |
---|---|---|
ACCENT_UNSPECIFIED | 0 | |
ACCENT_BRITISH | 1 | |
ACCENT_RUSSIAN | 2 | |
ACCENT_AUSTRALIAN | 3 | |
ACCENT_GERMAN | 4 | |
ACCENT_FRENCH | 5 | |
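To illustrate how the ListVoices messages fit together, the sketch below mirrors Voice, VoiceMetadata, and the three enums as plain Python types, using the field names and enum numbers from the tables above. It is not an official client library: the class layout, the helper `filter_voices`, and the voice names in the usage note are hypothetical.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Optional


class VoiceGender(IntEnum):
    VOICE_GENDER_UNSPECIFIED = 0
    MALE = 1
    FEMALE = 2
    NEUTRAL = 3


class VoiceAge(IntEnum):
    VOICE_AGE_UNSPECIFIED = 0
    CHILD = 1
    TEEN = 2
    ADULT = 3
    SENIOR = 4


class VoiceAccent(IntEnum):
    ACCENT_UNSPECIFIED = 0
    ACCENT_BRITISH = 1
    ACCENT_RUSSIAN = 2
    ACCENT_AUSTRALIAN = 3
    ACCENT_GERMAN = 4
    ACCENT_FRENCH = 5


@dataclass
class VoiceMetadata:
    gender: VoiceGender = VoiceGender.VOICE_GENDER_UNSPECIFIED
    age: VoiceAge = VoiceAge.VOICE_AGE_UNSPECIFIED
    accent: VoiceAccent = VoiceAccent.ACCENT_UNSPECIFIED


@dataclass
class Voice:
    name: str
    language_codes: List[str] = field(default_factory=list)
    voice_metadata: VoiceMetadata = field(default_factory=VoiceMetadata)
    natural_sample_rate_hertz: int = 0


def filter_voices(voices: List[Voice],
                  gender: Optional[VoiceGender] = None,
                  age: Optional[VoiceAge] = None) -> List[Voice]:
    """Hypothetical helper: keep voices whose metadata matches the given filters."""
    return [
        v for v in voices
        if (gender is None or v.voice_metadata.gender == gender)
        and (age is None or v.voice_metadata.age == age)
    ]
```

For example, given a list containing `Voice("example-a", ["en-US"], VoiceMetadata(VoiceGender.FEMALE, VoiceAge.ADULT))`, calling `filter_voices(voices, gender=VoiceGender.FEMALE)` keeps that entry.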
Synthesize Speech
SynthesizeSpeechRequest
The top-level message sent by the client to the SynthesizeSpeech API.
Inputs are not modified, so it is recommended that you perform Unicode character normalization on the client side before sending text to the SynthesizeSpeech API. This avoids unexpected behavior if non-standard characters are sent.
Field | Type | Description |
---|---|---|
input | SynthesisInput | Required. Must be either plain text or SSML. |
voice | VoiceSelectionParams | Required. The desired voice of the synthesized audio. |
audio_config | AudioConfig | Required. The configuration of the synthesized audio. |
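As a rough illustration of how the three required fields nest, the sketch below assembles a SynthesizeSpeechRequest as a plain Python dict keyed by the field names in this reference. The function `build_synthesize_request`, the dict representation, and the voice name in the usage note are assumptions; this reference does not specify the wire format or how the request is sent. The sketch selects the voice by name only and does not cover custom_voice.

```python
def build_synthesize_request(text, voice_name, audio_encoding,
                             speaking_rate=None, sample_rate_hertz=None):
    """Hypothetical helper: build a dict shaped like SynthesizeSpeechRequest."""
    # input, voice, and audio_config are all marked Required in the table above.
    if not text:
        raise ValueError("input.text is required")
    if not voice_name:
        raise ValueError("voice.name is required in this sketch")
    if audio_encoding in (None, "", "AUDIO_ENCODING_UNSPECIFIED"):
        raise ValueError("audio_config.audio_encoding is required")

    audio_config = {"audio_encoding": audio_encoding}
    # Optional AudioConfig fields are included only when the caller sets them.
    if speaking_rate is not None:
        audio_config["speaking_rate"] = speaking_rate
    if sample_rate_hertz is not None:
        audio_config["sample_rate_hertz"] = sample_rate_hertz

    return {
        "input": {"text": text},
        "voice": {"name": voice_name},
        "audio_config": audio_config,
    }
```

For instance, `build_synthesize_request("Hello", "example-voice", "OGG_OPUS")` yields a dict with an `input`, a `voice`, and an `audio_config` entry, where `"example-voice"` is a placeholder rather than a real voice name.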
SynthesizeSpeechResponse
The message returned to the client by the SynthesizeSpeech API.
Field | Type | Description |
---|---|---|
audio_content | bytes | The audio data bytes encoded as specified in the request. For encodings that are wrapped in containers (e.g. MP3, OGG_OPUS) the header is included. For LINEAR16 audio a WAV header is included. |
SynthesisInput
The text input to be synthesized, up to a maximum of 5000 bytes.
Inputs are not modified, so it is recommended that you perform Unicode character normalization on the client side before sending text to the SynthesizeSpeech API. This avoids unexpected behavior if non-standard characters are sent.
Field | Type | Description |
---|---|---|
text | string | The raw text to be synthesized. |
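The normalization advice and the 5000-byte limit can both be applied before a request is built. The following is a minimal Python sketch using the standard `unicodedata` module. The function name `prepare_text` is hypothetical, and two details are assumptions because this reference does not state them: the NFC normalization form, and counting the limit in UTF-8 bytes.

```python
import unicodedata

MAX_INPUT_BYTES = 5000  # limit stated for SynthesisInput


def prepare_text(text: str) -> str:
    """Normalize text and reject it if it exceeds the documented size limit."""
    # Assumption: NFC is used; the reference does not name a normalization form.
    normalized = unicodedata.normalize("NFC", text)
    # Assumption: the limit is measured on the UTF-8 encoding of the text.
    size = len(normalized.encode("utf-8"))
    if size > MAX_INPUT_BYTES:
        raise ValueError(
            f"input is {size} bytes; the maximum is {MAX_INPUT_BYTES}"
        )
    return normalized
```

Note that the limit is in bytes, not characters, so text with many non-ASCII characters reaches it sooner. Under the UTF-8 assumption, 2501 copies of "é" already exceed it.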
VoiceSelectionParams
The description of the voice used in a synthesis request.
Field | Type | Description |
---|---|---|
name | string | The name of the voice. |
custom_voice | CustomVoiceParams | The configuration for a custom voice. If CustomVoiceParams.model is set, the API chooses the custom voice that matches the requested configuration. |
AudioConfig
Description of audio data to be synthesized.
Field | Type | Label | Description |
---|---|---|---|
audio_encoding | AudioEncoding | | Required. The format of the audio byte stream. |
speaking_rate | double | | Optional. Input only. Speaking rate, in the range [0.25, 4.0]. 1.0 is the normal native speed supported by the specific voice; 2.0 is twice as fast, and 0.5 is half as fast. If unset (0.0), defaults to the native speed of 1.0. Any other value below 0.25 or above 4.0 returns an error. |
sample_rate_hertz | int32 | | Optional. The synthesis sample rate (in hertz) for this audio. If the value specified in SynthesizeSpeechRequest differs from the voice's natural sample rate, the synthesizer converts the audio to the requested sample rate, which might reduce audio quality. If the requested sample rate is not supported for the chosen encoding, the request fails and returns an error. |
effects_profile_id | string | repeated | Optional. Input only. An identifier, or list of identifiers, of 'audio effects' profiles. See audio profiles for supported profile ids. Effects are applied in the order they are given, post-synthesis. |
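The speaking_rate and sample_rate_hertz rules can be checked on the client before sending a request. The sketch below restates those rules in Python. Both function names are hypothetical, and treating a sample_rate_hertz of 0 as "unset" is an assumption based on the field being an optional int32.

```python
def effective_speaking_rate(speaking_rate: float = 0.0) -> float:
    """Return the rate the service is documented to apply, or raise if invalid."""
    if speaking_rate == 0.0:
        return 1.0  # unset: defaults to the voice's native speed
    if not 0.25 <= speaking_rate <= 4.0:
        raise ValueError("speaking_rate must be in the range [0.25, 4.0]")
    return speaking_rate


def needs_resampling(requested_hertz: int, natural_hertz: int) -> bool:
    """True if the request asks for a rate other than the voice's natural rate.

    Assumption: 0 means sample_rate_hertz was left unset.
    """
    return requested_hertz not in (0, natural_hertz)
```

A request for 16000 Hz against a voice with a natural rate of 24000 Hz would be resampled, which the table notes might reduce audio quality, so `needs_resampling(16000, 24000)` is true.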
CustomVoiceParams
A description of the custom voice to be synthesized if using an existing model.
Field | Type | Description |
---|---|---|
model | string | Required. The name of the custom model that synthesizes the custom voice. |
AudioEncoding
The desired output format of the audio encoder.
Name | Number | Description |
---|---|---|
AUDIO_ENCODING_UNSPECIFIED | 0 | Not specified. Will return an error. |
LINEAR16 | 1 | Uncompressed 16-bit signed little-endian (Linear PCM). Note: Audio returned as LINEAR16 contains a WAV header. |
MP3 | 2 | MP3 audio at 32 kbps. Not recommended. |
OGG_OPUS | 3 | Opus-encoded audio wrapped in an Ogg container. The result is a file that can be played natively on Android and in most browsers. Recommended where supported. |
Data Types
This chart details the Data Types for the Inworld Voice API.