Inworld Voice API Reference

Overview

This guide is a technical reference for Inworld Voice, Inworld's text-to-speech API.

The table below summarizes the API methods available in Inworld Voice.

| Name | Request Type | Response Type | Description |
| --- | --- | --- | --- |
| ListVoices | ListVoicesRequest | ListVoicesResponse | Returns a list of supported voices. |
| SynthesizeSpeech | SynthesizeSpeechRequest | SynthesizeSpeechResponse stream | Synthesizes speech with response streaming. |
| SyncSynthesizeSpeech | SynthesizeSpeechRequest | SynthesizeSpeechResponse | Same as SynthesizeSpeech, but without response streaming (i.e., all speech is synthesized before the response is sent). |

List Voices

This section is a reference for the ListVoices method of Inworld Voice.

ListVoicesRequest

The top-level message sent by the client to the ListVoices API.

| Field | Type | Description |
| --- | --- | --- |
| language_code | string | Optional but recommended. A BCP-47 language tag. If not specified, the API returns all supported voices. If specified, the API returns only voices capable of synthesizing the given language. For example, if you specify 'en-NZ', all 'en-NZ' voices are returned. If you specify 'no', both 'no-*' (Norwegian) and 'nb-*' (Norwegian Bokmål) voices are returned. |
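
The filtering rule described for language_code can be illustrated with a small client-side sketch. This is only an approximation of the documented behavior, not the service's implementation: the exact-match handling of region-qualified tags, the `MACRO_LANGUAGES` mapping, and the voice names are assumptions made for illustration.

```python
# Assumption: 'no' (Norwegian) also covers 'nb' (Norwegian Bokmål),
# mirroring the example in the field description. Other groupings are unknown.
MACRO_LANGUAGES = {"no": {"no", "nb"}}


def primary_subtag(tag: str) -> str:
    """Return the language part of a BCP-47 tag, e.g. 'en' for 'en-NZ'."""
    return tag.split("-")[0].lower()


def matches_language(requested: str, voice_tag: str) -> bool:
    """Decide whether a voice's language tag satisfies the requested code."""
    requested = requested.lower()
    voice_tag = voice_tag.lower()
    if "-" in requested:
        # Assumption: a region-qualified request matches that exact tag only.
        return voice_tag == requested
    allowed = MACRO_LANGUAGES.get(requested, {requested})
    return primary_subtag(voice_tag) in allowed


def filter_voices(voices, language_code=None):
    """Return all voices, or only those that support language_code."""
    if not language_code:
        return list(voices)
    return [
        v for v in voices
        if any(matches_language(language_code, t) for t in v["language_codes"])
    ]


# Hypothetical voices, used only to exercise the filter.
voices = [
    {"name": "voice-a", "language_codes": ["en-NZ"]},
    {"name": "voice-b", "language_codes": ["en-US"]},
    {"name": "voice-c", "language_codes": ["nb-NO"]},
    {"name": "voice-d", "language_codes": ["no-NO"]},
]

english_nz = [v["name"] for v in filter_voices(voices, "en-NZ")]
norwegian = [v["name"] for v in filter_voices(voices, "no")]
```

Here `english_nz` contains only `voice-a`, while `norwegian` contains both `voice-c` and `voice-d`.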

ListVoicesResponse

The message returned to the client by the ListVoices API.

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| voices | Voice | repeated | A list of voices. |

Voice

The description of a voice supported by the API.

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| language_codes | string | repeated | The languages supported by the voice, expressed as BCP-47 language tags (e.g. 'en-US', 'es-419', 'cmn-tw'). |
| name | string | | The name of the voice. |
| voice_metadata | VoiceMetadata | | The metadata of the voice. |
| natural_sample_rate_hertz | int32 | | The natural sample rate (in hertz) for the voice. |

VoiceMetadata

The properties associated with a given voice.

| Field | Type | Description |
| --- | --- | --- |
| gender | VoiceGender | The gender of the voice. |
| age | VoiceAge | The age group of the voice. |
| accent | VoiceAccent | The accent of the voice. |

VoiceGender

The gender of the voice, as described in the SSML voice element.

| Name | Number | Description |
| --- | --- | --- |
| VOICE_GENDER_UNSPECIFIED | 0 | The gender is unspecified or unknown. |
| MALE | 1 | A male voice. |
| FEMALE | 2 | A female voice. |
| NEUTRAL | 3 | A gender-neutral voice. This voice is not yet supported. |

VoiceAge

The age group of the voice.

| Name | Number | Description |
| --- | --- | --- |
| VOICE_AGE_UNSPECIFIED | 0 | The age is unspecified or unknown. |
| CHILD | 1 | A child's voice. |
| TEEN | 2 | A teenage voice. |
| ADULT | 3 | An adult voice. |
| SENIOR | 4 | A senior voice. |

VoiceAccent

The free-form name of the accent relative to American English.

| Name | Number | Description |
| --- | --- | --- |
| ACCENT_UNSPECIFIED | 0 | |
| ACCENT_BRITISH | 1 | |
| ACCENT_RUSSIAN | 2 | |
| ACCENT_AUSTRALIAN | 3 | |
| ACCENT_GERMAN | 4 | |
| ACCENT_FRENCH | 5 | |
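
For client code that prefers typed values over raw numbers, the Voice, VoiceMetadata, and enum definitions above could be mirrored roughly as follows. This is a hand-written sketch based only on the names and numbers in the tables; it is not generated from Inworld's definitions, and the default values are assumptions.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class VoiceGender(IntEnum):
    VOICE_GENDER_UNSPECIFIED = 0
    MALE = 1
    FEMALE = 2
    NEUTRAL = 3  # Documented as not yet supported.


class VoiceAge(IntEnum):
    VOICE_AGE_UNSPECIFIED = 0
    CHILD = 1
    TEEN = 2
    ADULT = 3
    SENIOR = 4


class VoiceAccent(IntEnum):
    ACCENT_UNSPECIFIED = 0
    ACCENT_BRITISH = 1
    ACCENT_RUSSIAN = 2
    ACCENT_AUSTRALIAN = 3
    ACCENT_GERMAN = 4
    ACCENT_FRENCH = 5


@dataclass
class VoiceMetadata:
    # Assumption: unset fields fall back to the 0-valued enum member.
    gender: VoiceGender = VoiceGender.VOICE_GENDER_UNSPECIFIED
    age: VoiceAge = VoiceAge.VOICE_AGE_UNSPECIFIED
    accent: VoiceAccent = VoiceAccent.ACCENT_UNSPECIFIED


@dataclass
class Voice:
    name: str = ""
    language_codes: List[str] = field(default_factory=list)
    voice_metadata: VoiceMetadata = field(default_factory=VoiceMetadata)
    natural_sample_rate_hertz: int = 0


# Hypothetical voice entry; the name and sample rate are made up.
example = Voice(
    name="example-voice",
    language_codes=["en-US"],
    voice_metadata=VoiceMetadata(gender=VoiceGender(2), age=VoiceAge.ADULT),
    natural_sample_rate_hertz=22050,
)
```

Because the classes derive from `IntEnum`, a numeric value from a response can be converted with a call such as `VoiceGender(2)`, which yields `VoiceGender.FEMALE`.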

Synthesize Speech

SynthesizeSpeechRequest

The top-level message sent by the client to the SynthesizeSpeech API.

Inputs are not modified, so it is recommended that you perform Unicode character normalization on the client side before sending text to the SynthesizeSpeech API. This avoids unexpected behavior when non-standard characters are sent.
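
As an illustration of client-side normalization, the sketch below uses Python's standard `unicodedata` module. The reference does not say which normalization form to use; NFC is assumed here, and `normalize_text` is a hypothetical helper name.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Normalize text to NFC (an assumed choice of normalization form)."""
    return unicodedata.normalize("NFC", text)


# 'e' followed by a combining acute accent (two code points).
decomposed = "cafe\u0301"
# After NFC normalization the pair becomes the single code point U+00E9.
composed = normalize_text(decomposed)
```

The two strings look identical when rendered, but `decomposed` has five code points and `composed` has four, which is the kind of difference normalization removes before the text reaches the synthesizer.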

| Field | Type | Description |
| --- | --- | --- |
| input | SynthesisInput | Required. Must be either plain text or SSML. |
| voice | VoiceSelectionParams | Required. The desired voice of the synthesized audio. |
| audio_config | AudioConfig | Required. The configuration of the synthesized audio. |
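
To show how the three required fields fit together, here is a sketch that assembles a request as a plain Python dictionary. The field names come from this reference; the JSON-style layout, the use of enum names as strings, the helper name `build_synthesize_request`, and the voice name are all assumptions, since this reference does not describe the wire format or transport.

```python
import json


def build_synthesize_request(text, voice_name, audio_encoding="OGG_OPUS",
                             speaking_rate=None, sample_rate_hertz=None):
    """Assemble a SynthesizeSpeechRequest-shaped dictionary (illustrative only)."""
    audio_config = {"audio_encoding": audio_encoding}
    # Optional fields are included only when the caller sets them.
    if speaking_rate is not None:
        audio_config["speaking_rate"] = speaking_rate
    if sample_rate_hertz is not None:
        audio_config["sample_rate_hertz"] = sample_rate_hertz
    return {
        "input": {"text": text},
        "voice": {"name": voice_name},
        "audio_config": audio_config,
    }


# "example-voice" is a placeholder, not a real Inworld voice name.
request = build_synthesize_request("Hello there.", "example-voice",
                                   speaking_rate=1.25)
request_json = json.dumps(request, indent=2)
```

A real client would pass an equivalent structure to whatever client library or transport it uses; only the field names above are taken from the reference.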

SynthesizeSpeechResponse

The message returned to the client by the SynthesizeSpeech API.

| Field | Type | Description |
| --- | --- | --- |
| audio_content | bytes | The audio data bytes, encoded as specified in the request. For encodings that are wrapped in containers (e.g. MP3, OGG_OPUS), the header is included. For LINEAR16 audio, a WAV header is included. |
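
Because LINEAR16 audio includes a WAV header, a complete audio_content payload can be inspected with Python's standard `wave` module. The sketch below assumes the bytes form one complete WAV file, as would be expected from a non-streaming response; how the header is distributed across streamed responses is not described here. `describe_linear16` and `make_test_wav` are hypothetical helpers, and the test audio is generated locally rather than returned by the API.

```python
import io
import wave


def describe_linear16(audio_content: bytes) -> dict:
    """Read the WAV header of LINEAR16 audio and report its basic properties."""
    with wave.open(io.BytesIO(audio_content), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hertz": wav.getframerate(),
            "frames": wav.getnframes(),
        }


def make_test_wav(sample_rate: int = 22050, frames: int = 100) -> bytes:
    """Build a short silent 16-bit mono WAV in memory, standing in for a response."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples, matching LINEAR16
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * frames)
    return buf.getvalue()


info = describe_linear16(make_test_wav())
```

The `sample_rate_hertz` value read from the header can be compared against the rate requested in AudioConfig or the voice's natural_sample_rate_hertz.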

SynthesisInput

The text input to be synthesized, to a maximum of 5000 bytes.

| Field | Type | Description |
| --- | --- | --- |
| text | string | The raw text to be synthesized. |
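
Since the limit is expressed in bytes rather than characters, non-ASCII text reaches it sooner than its character count suggests. The sketch below checks and splits text under the assumption that the 5000-byte limit is measured on the UTF-8 encoding; `utf8_size` and `split_for_synthesis` are hypothetical helpers, and splitting on whitespace is just one simple strategy.

```python
MAX_INPUT_BYTES = 5000  # Limit stated for SynthesisInput.


def utf8_size(text: str) -> int:
    """Size of the text in bytes when encoded as UTF-8 (assumed encoding)."""
    return len(text.encode("utf-8"))


def split_for_synthesis(text: str, limit: int = MAX_INPUT_BYTES) -> list:
    """Split text on whitespace into chunks of at most `limit` UTF-8 bytes."""
    chunks = []
    current = ""
    for word in text.split():
        if utf8_size(word) > limit:
            raise ValueError("a single word exceeds the byte limit")
        candidate = word if not current else current + " " + word
        if utf8_size(candidate) <= limit:
            current = candidate
        else:
            # The current chunk is full; start a new one with this word.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


# 1000 repetitions of a 6-byte word plus separators exceeds one request.
long_text = " ".join(["h\u00e9llo"] * 1000)
chunks = split_for_synthesis(long_text)
```

Each chunk can then be sent as its own request. Splitting at sentence boundaries would likely give more natural prosody, but that refinement is left out to keep the sketch short.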

VoiceSelectionParams

The description of the voice used in a synthesis request.

| Field | Type | Description |
| --- | --- | --- |
| name | string | The name of the voice. |
| custom_voice | CustomVoiceParams | The configuration for a custom voice. If CustomVoiceParams.model is set, the API chooses the custom voice that matches the requested configuration. |

AudioConfig

Description of audio data to be synthesized.

| Field | Type | Label | Description |
| --- | --- | --- | --- |
| audio_encoding | AudioEncoding | | Required. The format of the audio byte stream. |
| speaking_rate | double | | Optional. Input only. The speaking rate, in the range [0.25, 4.0]. 1.0 is the native speed of the specific voice; 2.0 is twice as fast and 0.5 is half as fast. If unset (0.0), defaults to the native 1.0 speed. Any other value below 0.25 or above 4.0 returns an error. |
| sample_rate_hertz | int32 | | Optional. The synthesis sample rate (in hertz) for this audio. If this differs from the voice's natural sample rate, the synthesizer converts the audio to the requested rate, which might reduce audio quality. If the requested sample rate is not supported for the chosen encoding, the request fails and returns an error. |
| effects_profile_id | string | repeated | Optional. Input only. An identifier, or list of identifiers, of audio effects profiles. See audio profiles for supported profile IDs. Effects are applied after synthesis, in the order they are given. |
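
The speaking_rate rules can be checked on the client before a request is sent, which avoids a round trip that would end in an error. The following is a sketch of the documented rules only; `resolve_speaking_rate` and `estimated_duration` are hypothetical helpers, and the duration estimate simply assumes that length scales inversely with the rate.

```python
MIN_RATE = 0.25
MAX_RATE = 4.0
NATIVE_RATE = 1.0


def resolve_speaking_rate(speaking_rate: float = 0.0) -> float:
    """Apply the documented default and range check for speaking_rate."""
    if speaking_rate == 0.0:
        # Unset (0.0) falls back to the voice's native speed.
        return NATIVE_RATE
    if not MIN_RATE <= speaking_rate <= MAX_RATE:
        raise ValueError(
            f"speaking_rate {speaking_rate} is outside [{MIN_RATE}, {MAX_RATE}]"
        )
    return speaking_rate


def estimated_duration(native_seconds: float, speaking_rate: float = 0.0) -> float:
    """Rough duration estimate: a rate of 2.0 halves the native duration."""
    return native_seconds / resolve_speaking_rate(speaking_rate)


half_length = estimated_duration(10.0, 2.0)   # twice as fast
double_length = estimated_duration(10.0, 0.5)  # half as fast
```

With these inputs, a 10-second utterance is estimated at 5 seconds for a rate of 2.0 and 20 seconds for a rate of 0.5.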

CustomVoiceParams

A description of the custom voice to be synthesized if using an existing model.

| Field | Type | Description |
| --- | --- | --- |
| model | string | Required. The name of the custom model that synthesizes the custom voice. |

AudioEncoding

The desired output format of the audio encoder.

| Name | Number | Description |
| --- | --- | --- |
| AUDIO_ENCODING_UNSPECIFIED | 0 | Not specified. Returns an error. |
| LINEAR16 | 1 | Uncompressed 16-bit signed little-endian samples (Linear PCM). Note: audio returned as LINEAR16 contains a WAV header. |
| MP3 | 2 | MP3 audio at 32 kbps. Not recommended. |
| OGG_OPUS | 3 | Opus-encoded audio wrapped in an Ogg container. The result is a file that can be played natively on Android and in most browsers. Recommended where supported. |
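
When saving synthesized audio, the encoding determines a sensible file extension. The mapping below is a sketch: the enum values come from the table above, while the extensions are common conventions rather than anything this reference specifies, and `output_filename` is a hypothetical helper.

```python
from enum import IntEnum


class AudioEncoding(IntEnum):
    AUDIO_ENCODING_UNSPECIFIED = 0
    LINEAR16 = 1
    MP3 = 2
    OGG_OPUS = 3


# Conventional extensions (an assumption). LINEAR16 maps to .wav because
# the returned audio includes a WAV header.
FILE_EXTENSIONS = {
    AudioEncoding.LINEAR16: ".wav",
    AudioEncoding.MP3: ".mp3",
    AudioEncoding.OGG_OPUS: ".ogg",
}


def output_filename(stem: str, encoding: AudioEncoding) -> str:
    """Return a file name for audio in the given encoding."""
    if encoding not in FILE_EXTENSIONS:
        # AUDIO_ENCODING_UNSPECIFIED is rejected by the API as well.
        raise ValueError(f"no file extension for {AudioEncoding(encoding).name}")
    return stem + FILE_EXTENSIONS[encoding]


wav_name = output_filename("greeting", AudioEncoding.LINEAR16)
opus_name = output_filename("greeting", AudioEncoding.OGG_OPUS)
```

Rejecting the unspecified value locally mirrors the documented server behavior and surfaces the mistake before any request is made.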

Data Types

This table maps the data types used by the Inworld Voice API to the corresponding types in each supported language.

| API Type | Notes | C++ | Java | Python | Go | C# | PHP | Ruby |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| double | | double | double | float | float64 | double | float | Float |
| float | | float | float | float | float32 | float | float | Float |
| int32 | Variable-length encoded. Use sint32 for negatives. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
| int64 | Variable-length encoded. Use sint64 for negatives. | int64 | long | int/long | int64 | long | integer/string | Bignum |
| uint32 | Variable-length encoded. Use fixed32 for 2^28+. | uint32 | int | int/long | uint32 | uint | integer | Bignum or Fixnum (as required) |
| uint64 | Variable-length encoded. Use fixed64 for 2^56+. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum or Fixnum (as required) |
| sint32 | Variable-length encoded. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
| sint64 | Variable-length encoded. | int64 | long | int/long | int64 | long | integer/string | Bignum |
| fixed32 | Fixed four bytes. | uint32 | int | int | uint32 | uint | integer | Bignum or Fixnum (as required) |
| fixed64 | Fixed eight bytes. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum |
| sfixed32 | Fixed four bytes. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
| sfixed64 | Fixed eight bytes. | int64 | long | int/long | int64 | long | integer/string | Bignum |
| bool | | bool | boolean | bool | bool | bool | boolean | TrueClass/FalseClass |
| string | Must contain either UTF-8 or 7-bit ASCII. | string | String | str/unicode | string | string | string | String (UTF-8) |
| bytes | Any arbitrary sequence of bytes. | string | ByteString | str | []byte | ByteString | string | String (ASCII-8BIT) |
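
The notes about negatives and large values follow from how variable-length integers are encoded. The sketch below assumes these are Protocol Buffers scalar types, which the type names suggest but this reference does not state, and reimplements the varint and ZigZag encodings by hand purely to compare encoded sizes. The function names are hypothetical.

```python
import struct


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    if value < 0:
        raise ValueError("varint input must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_int32(value: int) -> bytes:
    """int32: negative values are sign-extended to 64 bits before encoding."""
    return encode_varint(value & 0xFFFFFFFFFFFFFFFF)


def encode_sint32(value: int) -> bytes:
    """sint32: ZigZag maps small negative values to small unsigned values."""
    zigzag = ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF
    return encode_varint(zigzag)


def encode_fixed32(value: int) -> bytes:
    """fixed32: always four little-endian bytes."""
    return struct.pack("<I", value)


negative_as_int32 = len(encode_int32(-1))      # 10 bytes
negative_as_sint32 = len(encode_sint32(-1))    # 1 byte
large_as_uint32 = len(encode_varint(2 ** 28))  # 5 bytes
large_as_fixed32 = len(encode_fixed32(2 ** 28))  # 4 bytes
```

Under these assumptions, -1 costs ten bytes as an int32 but one byte as a sint32, and values of 2^28 or more cost five bytes as a uint32 but four as a fixed32, which is the trade-off the Notes column describes.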