Inworld Voice API Reference

Overview

This guide is a technical reference for Inworld Voice, Inworld's text-to-speech API.

The table below summarizes the available API methods for Inworld Voice.

Name	Request Type	Response Type	Description
ListVoices	ListVoicesRequest	ListVoicesResponse	Returns a list of supported voices.
SynthesizeSpeech	SynthesizeSpeechRequest	SynthesizeSpeechResponse stream	Synthesizes speech with response streaming.
SyncSynthesizeSpeech	SynthesizeSpeechRequest	SynthesizeSpeechResponse	Synthesizes speech without response streaming (i.e. all speech is synthesized before the response is sent).

List Voices

This section is a reference for the ListVoices method of Inworld Voice.

ListVoicesRequest

The top-level message sent by the client to the ListVoices API.

Field	Type	Description
language_code	string	Optional but recommended. BCP-47 language tag. If not specified, the API returns all supported voices. If specified, the API only returns voices capable of synthesizing the given language code. For example, if you specify 'en-NZ', all 'en-NZ' voices will be returned. If you specify 'no', both 'no-*' (Norwegian) and 'nb-*' (Norwegian Bokmål) voices will be returned.
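
The filtering rule described for language_code can be approximated on the client, for example to narrow down a cached, unfiltered ListVoicesResponse. The following is a minimal sketch, not the service's implementation: it assumes voices are plain dicts with "name" and "language_codes" keys, the voice names are made up, and the alias table that maps 'no' to 'nb' is a hypothetical stand-in for whatever matching the server actually performs.

```python
# Hypothetical alias table: the reference says 'no' also matches 'nb-*' voices.
LANGUAGE_ALIASES = {"no": ("no", "nb")}


def matches_language(voice_code: str, requested: str) -> bool:
    """Return True if a voice's BCP-47 tag satisfies the requested tag."""
    voice_code = voice_code.lower()
    requested = requested.lower()
    for candidate in LANGUAGE_ALIASES.get(requested, (requested,)):
        # Match the exact tag, or the tag followed by a subtag ('en' -> 'en-NZ').
        if voice_code == candidate or voice_code.startswith(candidate + "-"):
            return True
    return False


def filter_voices(voices, language_code=None):
    """Mimic ListVoices filtering: no language code means every voice is kept."""
    if not language_code:
        return list(voices)
    return [
        v for v in voices
        if any(matches_language(code, language_code) for code in v["language_codes"])
    ]


# Made-up voices, for illustration only.
VOICES = [
    {"name": "voice-a", "language_codes": ["en-NZ"]},
    {"name": "voice-b", "language_codes": ["nb-NO"]},
    {"name": "voice-c", "language_codes": ["no-NO"]},
    {"name": "voice-d", "language_codes": ["en-US"]},
]
```

Calling filter_voices(VOICES, "no") keeps voice-b and voice-c, while omitting the language code keeps all four entries.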

ListVoicesResponse

The message returned to the client by the ListVoices API.

Field	Type	Label	Description
voices	Voice	repeated	A list of voices.

Voice

The description of a voice supported by the API.

Field	Type	Label	Description
language_codes	string	repeated	The languages supported by the voice expressed as BCP-47 language tags (e.g. 'en-US', 'es-419', 'cmn-tw').
name	string		The name of the voice.
voice_metadata	VoiceMetadata		The metadata of the voice.
natural_sample_rate_hertz	int32		The natural sample rate (in hertz) for the voice.

VoiceMetadata

The properties associated with a given voice.

Field	Type	Description
gender	VoiceGender	The gender of the voice.
age	VoiceAge	The age group of the voice.
accent	VoiceAccent	The accent of the voice.

VoiceGender

The gender of the voice, as described in the SSML voice element.

Name	Number	Description
VOICE_GENDER_UNSPECIFIED	0	The gender is unspecified or unknown.
MALE	1	A male voice.
FEMALE	2	A female voice.
NEUTRAL	3	A gender-neutral voice. This voice is not yet supported.

VoiceAge

The age group of the voice.

Name	Number	Description
VOICE_AGE_UNSPECIFIED	0	The age is unspecified or unknown.
CHILD	1	A child's voice.
TEEN	2	A teenage voice.
ADULT	3	An adult voice.
SENIOR	4	A senior voice.

VoiceAccent

The free-form name of the accent relative to American English.

Name	Number	Description
ACCENT_UNSPECIFIED	0
ACCENT_BRITISH	1
ACCENT_RUSSIAN	2
ACCENT_AUSTRALIAN	3
ACCENT_GERMAN	4
ACCENT_FRENCH	5
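
The enumerations above are returned as numbers, so a client may want to map them back to names. The sketch below mirrors the three tables as Python IntEnum classes and wraps them in a dataclass shaped like VoiceMetadata. The class layout and the decode_metadata helper are illustrative assumptions, not generated client code.

```python
from dataclasses import dataclass
from enum import IntEnum


class VoiceGender(IntEnum):
    VOICE_GENDER_UNSPECIFIED = 0
    MALE = 1
    FEMALE = 2
    NEUTRAL = 3  # documented as not yet supported


class VoiceAge(IntEnum):
    VOICE_AGE_UNSPECIFIED = 0
    CHILD = 1
    TEEN = 2
    ADULT = 3
    SENIOR = 4


class VoiceAccent(IntEnum):
    ACCENT_UNSPECIFIED = 0
    ACCENT_BRITISH = 1
    ACCENT_RUSSIAN = 2
    ACCENT_AUSTRALIAN = 3
    ACCENT_GERMAN = 4
    ACCENT_FRENCH = 5


@dataclass
class VoiceMetadata:
    gender: VoiceGender = VoiceGender.VOICE_GENDER_UNSPECIFIED
    age: VoiceAge = VoiceAge.VOICE_AGE_UNSPECIFIED
    accent: VoiceAccent = VoiceAccent.ACCENT_UNSPECIFIED


def decode_metadata(gender: int, age: int, accent: int) -> VoiceMetadata:
    """Turn raw enum numbers into named values; unknown numbers raise ValueError."""
    return VoiceMetadata(VoiceGender(gender), VoiceAge(age), VoiceAccent(accent))
```

For example, decode_metadata(2, 3, 1) describes an adult female voice with a British accent.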

Synthesize Speech

SynthesizeSpeechRequest

The top-level message sent by the client to the SynthesizeSpeech API.

Inputs are not modified, so we recommend performing Unicode character normalization on the client side before sending text to the SynthesizeSpeech API. This avoids unexpected behavior when non-standard characters are sent.
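
The reference does not name a specific normalization form. As one possible approach, the sketch below applies Unicode NFC normalization with Python's standard unicodedata module. The choice of NFC and the function name normalize_for_synthesis are assumptions made for illustration.

```python
import unicodedata


def normalize_for_synthesis(text: str) -> str:
    """Return the NFC-normalized form of the text (assumed normalization form)."""
    return unicodedata.normalize("NFC", text)


decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent
composed = normalize_for_synthesis(decomposed)  # single precomposed 'é'
```

Both strings render identically, but the decomposed form has five code points and the composed form has four, which is the kind of difference that normalizing before sending removes.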

Field	Type	Description
input	SynthesisInput	Required. Must be either plain text or SSML.
voice	VoiceSelectionParams	Required. The desired voice of the synthesized audio.
audio_config	AudioConfig	Required. The configuration of the synthesized audio.
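
To make the shape of the request concrete, the sketch below assembles the three required fields as a plain Python dict. This only illustrates the message structure. It is not the wire format or an official client call, and the helper name, the default encoding, and the empty-text check are assumptions.

```python
def build_synthesize_request(text, voice_name, audio_encoding="OGG_OPUS",
                             speaking_rate=None, sample_rate_hertz=None):
    """Assemble the three required top-level fields of SynthesizeSpeechRequest."""
    if not text:
        # Assumption: an empty input is treated as a client-side error.
        raise ValueError("input.text must not be empty")
    audio_config = {"audio_encoding": audio_encoding}
    # Optional fields are only included when the caller sets them.
    if speaking_rate is not None:
        audio_config["speaking_rate"] = speaking_rate
    if sample_rate_hertz is not None:
        audio_config["sample_rate_hertz"] = sample_rate_hertz
    return {
        "input": {"text": text},
        "voice": {"name": voice_name},
        "audio_config": audio_config,
    }


# "example-voice" is a placeholder, not a real voice name.
request = build_synthesize_request("Hello there.", "example-voice", speaking_rate=1.25)
```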

SynthesizeSpeechResponse

The message returned to the client by the SynthesizeSpeech API.

Field	Type	Description
audio_content	bytes	The audio data bytes encoded as specified in the request. For encodings that are wrapped in containers (e.g. MP3, OGG_OPUS) the header is included. For LINEAR16 audio a WAV header is included.
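
Because LINEAR16 audio arrives with a WAV header, the bytes of a complete (non-streamed) response can be inspected with Python's standard wave module. The sketch below is hedged accordingly: fake_linear16_response is a stand-in that builds silent mono audio locally so the example runs without the API, and the mono channel count is an assumption rather than something the reference states.

```python
import io
import wave


def describe_linear16(audio_content: bytes) -> dict:
    """Read the WAV header that precedes the LINEAR16 samples."""
    with wave.open(io.BytesIO(audio_content), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hertz": wav.getframerate(),
            "frames": wav.getnframes(),
        }


def fake_linear16_response(sample_rate_hertz=22050, frames=100) -> bytes:
    """Stand-in for audio_content: silent 16-bit mono audio with a WAV header."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate_hertz)
        wav.writeframes(b"\x00\x00" * frames)
    return buffer.getvalue()


info = describe_linear16(fake_linear16_response())
```

The reported sample rate can then be compared against the voice's natural_sample_rate_hertz or the sample_rate_hertz that was requested.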

SynthesisInput

The text input to be synthesized, to a maximum of 5000 bytes.

Field	Type	Description
text	string	The raw text to be synthesized.
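
Since the limit is expressed in bytes rather than characters, non-ASCII text reaches it sooner than its character count suggests. The sketch below measures the UTF-8 size of a string and splits longer text on whitespace into pieces that each fit. It assumes the limit is counted in UTF-8 bytes, and the helper names are hypothetical.

```python
MAX_INPUT_BYTES = 5000  # limit stated for SynthesisInput


def utf8_size(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def split_for_synthesis(text: str, limit: int = MAX_INPUT_BYTES) -> list:
    """Split text on whitespace into pieces of at most `limit` UTF-8 bytes."""
    chunks, current = [], ""
    for word in text.split():
        if utf8_size(word) > limit:
            raise ValueError("a single word exceeds the byte limit")
        candidate = f"{current} {word}" if current else word
        if utf8_size(candidate) <= limit:
            current = candidate
        else:
            # The current piece is full: store it and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each returned piece can be sent as its own request. A production splitter would more likely break on sentence boundaries so that prosody is not cut mid-sentence.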

VoiceSelectionParams

The description of the voice used in a synthesis request.

Field	Type	Description
name	string	The name of the voice.
custom_voice	CustomVoiceParams	The configuration for a custom voice. If CustomVoiceParams.model is set, the API chooses the custom voice that matches the requested configuration.

AudioConfig

Description of audio data to be synthesized.

Field	Type	Label	Description
audio_encoding	AudioEncoding		Required. The format of the audio byte stream.
speaking_rate	double		Optional. Input only. Speaking rate/speed, in the range [0.25, 4.0]. 1.0 is the normal native speed supported by the specific voice. 2.0 is twice as fast, and 0.5 is half as fast. If unset (0.0), defaults to the native speed of 1.0. Any other value below 0.25 or above 4.0 returns an error.
sample_rate_hertz	int32		Optional. The synthesis sample rate (in hertz) for this audio. If this differs from the voice's natural sample rate, the synthesizer converts the audio to the requested rate, which might reduce audio quality. If the requested sample rate is not supported for the chosen encoding, the request fails with an error.
effects_profile_id	string	repeated	Optional. Input only. An identifier, or list of identifiers, of 'audio effects' profiles. See audio profiles for supported profile ids. Effects are applied in the order they are given, post-synthesis.
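
The speaking_rate rules above can be checked on the client before a request is sent. The sketch below follows the documented behavior (0.0 means unset and resolves to 1.0, and values outside [0.25, 4.0] are rejected). The function names are hypothetical, and the duration estimate simply assumes that playback time scales inversely with the rate.

```python
MIN_RATE, MAX_RATE = 0.25, 4.0


def resolve_speaking_rate(speaking_rate: float = 0.0) -> float:
    """Apply the documented default and range check for speaking_rate."""
    if speaking_rate == 0.0:
        return 1.0  # unset: native speed
    if not MIN_RATE <= speaking_rate <= MAX_RATE:
        raise ValueError(
            f"speaking_rate must be in [{MIN_RATE}, {MAX_RATE}], got {speaking_rate}"
        )
    return speaking_rate


def estimated_duration(native_seconds: float, speaking_rate: float = 0.0) -> float:
    """Rough duration estimate: 2.0 halves the native duration, 0.5 doubles it."""
    return native_seconds / resolve_speaking_rate(speaking_rate)
```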

CustomVoiceParams

A description of the custom voice to be synthesized if using an existing model.

Field	Type	Description
model	string	Required. The name of the custom model that synthesizes the custom voice.

AudioEncoding

The desired output format of the audio encoder.

Name	Number	Description
AUDIO_ENCODING_UNSPECIFIED	0	Not specified. Will return an error.
LINEAR16	1	Uncompressed 16-bit signed little-endian (Linear PCM). Note: Audio returned as LINEAR16 contains a WAV header.
MP3	2	MP3 audio at 32kbps. Not recommended.
OGG_OPUS	3	Opus-encoded audio wrapped in an Ogg container. The result is a file that can be played natively on Android and in most browsers. Recommended where supported.
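
When saving audio_content to disk, the encoding determines a sensible file extension. The sketch below mirrors the table as an IntEnum and rejects the unspecified value up front, since the API is documented to return an error for it. The extension mapping and the output_filename helper are conventional choices assumed for illustration, not part of the API.

```python
from enum import IntEnum


class AudioEncoding(IntEnum):
    AUDIO_ENCODING_UNSPECIFIED = 0
    LINEAR16 = 1
    MP3 = 2
    OGG_OPUS = 3


# Assumed, conventional extensions for each container.
FILE_EXTENSIONS = {
    AudioEncoding.LINEAR16: ".wav",  # LINEAR16 is returned with a WAV header
    AudioEncoding.MP3: ".mp3",
    AudioEncoding.OGG_OPUS: ".ogg",
}


def output_filename(stem: str, encoding: int) -> str:
    """Pick a file name for audio_content based on the requested encoding."""
    encoding = AudioEncoding(encoding)
    if encoding is AudioEncoding.AUDIO_ENCODING_UNSPECIFIED:
        raise ValueError("AUDIO_ENCODING_UNSPECIFIED is rejected by the API")
    return stem + FILE_EXTENSIONS[encoding]
```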

Data Types

This table lists the data types used by the Inworld Voice API and the corresponding type in each supported language.

API Type	Notes	C++	Java	Python	Go	C#	PHP	Ruby
double		double	double	float	float64	double	float	Float
float		float	float	float	float32	float	float	Float
int32	Variable-length encoded. Use sint32 for negatives.	int32	int	int	int32	int	integer	Bignum or Fixnum (as required)
int64	Variable-length encoded. Use sint64 for negatives.	int64	long	int/long	int64	long	integer/string	Bignum
uint32	Variable-length encoded. Use fixed32 for 2^28+	uint32	int	int/long	uint32	uint	integer	Bignum or Fixnum (as required)
uint64	Variable-length encoded. Use fixed64 for 2^56+	uint64	long	int/long	uint64	ulong	integer/string	Bignum or Fixnum (as required)
sint32	Variable-length encoded.	int32	int	int	int32	int	integer	Bignum or Fixnum (as required)
sint64	Variable-length encoded.	int64	long	int/long	int64	long	integer/string	Bignum
fixed32	Fixed four bytes.	uint32	int	int	uint32	uint	integer	Bignum or Fixnum (as required)
fixed64	Fixed eight bytes.	uint64	long	int/long	uint64	ulong	integer/string	Bignum
sfixed32	Fixed four bytes.	int32	int	int	int32	int	integer	Bignum or Fixnum (as required)
sfixed64	Fixed eight bytes.	int64	long	int/long	int64	long	integer/string	Bignum
bool		bool	boolean	bool	bool	bool	boolean	TrueClass/FalseClass
string	Must contain either UTF-8 or 7-bit ASCII.	string	String	str/unicode	string	string	string	String (UTF-8)
bytes	Any arbitrary sequence of bytes.	string	ByteString	str	[]byte	ByteString	string	String (ASCII-8BIT)
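
The notes about negatives and large values follow from how variable-length integers are encoded. The type names and notes in this table match Protocol Buffers scalar types, so the sketch below assumes that encoding: a base-128 varint, with ZigZag mapping for the sint types. It is an illustration of the size trade-offs, not code taken from the API.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    if value < 0:
        raise ValueError("varints encode non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_int32(value: int) -> bytes:
    """int32: negatives are sign-extended to 64 bits before varint encoding."""
    return encode_varint(value & 0xFFFFFFFFFFFFFFFF)


def encode_sint32(value: int) -> bytes:
    """sint32: ZigZag maps small negatives to small unsigned values first."""
    zigzag = ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF
    return encode_varint(zigzag)
```

Under these assumptions, -1 takes ten bytes as an int32 but a single byte as a sint32, and an unsigned value of 2^28 or more needs five varint bytes, one more than the four bytes a fixed32 always uses.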
