Generate audio from text input while managing multiple independent audio generation streams over a single WebSocket connection.

The independent audio streams each correspond to a context, identified by contextId, that maintains its own state. To use the API:

1. Create a context, specifying the voice and model.
2. Send text to the context. Text is buffered on the server until it is flushed (buffering is controlled by maxBufferDelayMs and bufferCharThreshold in the context configurations).
3. Flush the context to start synthesis of the accumulated text.
4. Receive audio chunks as result events. Each response includes the contextId so you can match the audio to the request.
5. Close the context when you are done.
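For illustration, here is a minimal client sketch in TypeScript using the Node ws package. The endpoint URL is a placeholder (substitute the actual WebSocket URL for your deployment); the message shapes mirror the examples below.

```typescript
import WebSocket from "ws";

// Placeholder endpoint: substitute the actual WebSocket URL for your deployment.
const TTS_URL = "wss://example.invalid/tts/stream";

const ws = new WebSocket(TTS_URL, {
  headers: {
    // Basic auth with the pre-encoded base64 credential from the environment.
    Authorization: `Basic ${process.env.INWORLD_RUNTIME_BASE64_CREDENTIAL}`,
  },
});

ws.on("open", () => {
  // 1. Create a context with a voice and model.
  ws.send(JSON.stringify({
    create: { voiceId: "Dennis", modelId: "inworld-tts-1-max", bufferCharThreshold: 100 },
    contextId: "ctx-1",
  }));

  // 2. Send text; including flush_context starts synthesis immediately.
  ws.send(JSON.stringify({
    send_text: {
      text: "Hello, what a wonderful day to be a text-to-speech model!",
      flush_context: {},
    },
    contextId: "ctx-1",
  }));

  // 3. Close the context when done (this implies a final flush).
  ws.send(JSON.stringify({ close_context: {}, contextId: "ctx-1" }));
});

ws.on("message", (data) => {
  const { result } = JSON.parse(data.toString());
  if (result.audioChunk) {
    // audioContent is base64-encoded audio in the context's configured encoding.
    const audio = Buffer.from(result.audioChunk.audioContent, "base64");
    console.log(`${result.contextId}: received ${audio.length} bytes`);
  } else if (result.contextClosed) {
    ws.close();
  }
});
```

The individual request messages and the server's result events look like this: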
Create a context:

```json
{
"create": {
"voiceId": "Dennis",
"modelId": "inworld-tts-1-max",
"bufferCharThreshold": 100
},
"contextId": "ctx-1"
}
```

Send text (including flush_context inside send_text flushes the buffer immediately):

```json
{
"send_text": {
"text": "Hello, what a wonderful day to be a text-to-speech model!",
"flush_context": {}
},
"contextId": "ctx-1"
}
```

Flush the context:

```json
{
"flush_context": {},
"contextId": "ctx-1"
}
```

Update the context configuration:

```json
{
"update_config": {
"voiceId": "Ashley"
},
"contextId": "ctx-1"
}
```

Close the context:

```json
{
"close_context": {},
"contextId": "ctx-1"
}
```

The server replies with result events, each tagged with the contextId it belongs to. contextCreated:

```json
{
"result": {
"contextId": "ctx-1",
"contextCreated": {
"voiceId": "Dennis",
"audioConfig": {
"audioEncoding": "LINEAR16",
"sampleRateHertz": 16000
},
"modelId": "inworld-tts-1-max",
"timestampType": "WORD",
"maxBufferDelayMs": 3000
},
"status": {
"code": 0,
"message": "",
"details": []
}
}
}
```

audioChunk, with the synthesized audio and word-level timestamps:

```json
{
"result": {
"contextId": "ctx-1",
"audioChunk": {
"audioContent": "UklGRgSYAABXQVZFZm10IBAAAAABAAEAgD4AAAB9AAACABAAZGF0YeCX=",
"usage": {
"processedCharactersCount": 79,
"modelId": "inworld-tts-1"
},
"timestampInfo": {
"wordAlignment": {
"words": [
"Hello,",
"what",
"a",
"wonderful",
"day",
"to",
"be",
"a",
"text-to-speech",
"model."
],
"wordStartTimeSeconds": [
0.031,
0.375,
0.901,
1.002,
1.386,
1.548,
1.649,
1.771,
1.852,
2.58
],
"wordEndTimeSeconds": [
0.355,
0.86,
0.921,
1.326,
1.528,
1.609,
1.71,
1.791,
2.539,
2.802
]
}
}
},
"status": {
"code": 0,
"message": "",
"details": []
}
}
}
```

contextClosed:

```json
{
"result": {
"contextId": "ctx-1",
"contextClosed": {},
"status": {
"code": 0,
"message": "",
"details": []
}
}
}
```

contextUpdated:

```json
{
"result": {
"contextId": "ctx-1",
"contextUpdated": {
"voiceId": "Ashley",
"audioConfig": {
"audioEncoding": "LINEAR16",
"sampleRateHertz": 16000
},
"modelId": "inworld-tts-1-max",
"temperature": 1,
"timestampType": "WORD",
"maxBufferDelayMs": 3000
},
"status": {
"code": 0,
"message": "",
"details": []
}
}
}
```

flushCompleted:

```json
{
"result": {
"contextId": "ctx-1",
"flushCompleted": {},
"status": {
"code": 0,
"message": "",
"details": []
}
}
}
```
"create": {
"voiceId": "Dennis",
"modelId": "inworld-tts-1-max",
"bufferCharThreshold": 100
},
"contextId": "ctx-1"
}{
"send_text": {
"text": "Hello, what a wonderful day to be a text-to-speech model!",
"flush_context": {}
},
"contextId": "ctx-1"
}{
"flush_context": {},
"contextId": "ctx-1"
}{
"update_config": {
"voiceId": "Ashley"
},
"contextId": "ctx-1"
}{
"close_context": {},
"contextId": "ctx-1"
}{
"result": {
"contextId": "ctx-1",
"contextCreated": {
"voiceId": "Dennis",
"audioConfig": {
"audioEncoding": "LINEAR16",
"sampleRateHertz": 16000
},
"modelId": "inworld-tts-1-max",
"timestampType": "WORD",
"maxBufferDelayMs": 3000
},
"status": {
"code": 0,
"message": "",
"details": []
}
}
}{
"result": {
"contextId": "ctx-1",
"audioChunk": {
"audioContent": "UklGRgSYAABXQVZFZm10IBAAAAABAAEAgD4AAAB9AAACABAAZGF0YeCX=",
"usage": {
"processedCharactersCount": 79,
"modelId": "inworld-tts-1"
},
"timestampInfo": {
"wordAlignment": {
"words": [
"Hello,",
"what",
"a",
"wonderful",
"day",
"to",
"be",
"a",
"text-to-speech",
"model."
],
"wordStartTimeSeconds": [
0.031,
0.375,
0.901,
1.002,
1.386,
1.548,
1.649,
1.771,
1.852,
2.58
],
"wordEndTimeSeconds": [
0.355,
0.86,
0.921,
1.326,
1.528,
1.609,
1.71,
1.791,
2.539,
2.802
]
}
},
"status": {
"code": 0,
"message": "",
"details": []
}
}
}
}{
"result": {
"contextId": "ctx-1",
"contextClosed": {},
"status": {
"code": 0,
"message": "",
"details": []
}
}
}{
"result": {
"contextId": "ctx-1",
"contextUpdated": {
"voiceId": "Ashley",
"audioConfig": {
"audioEncoding": "LINEAR16",
"sampleRateHertz": 16000
},
"modelId": "inworld-tts-1-max",
"temperature": 1,
"timestampType": "WORD",
"maxBufferDelayMs": 3000
},
"status": {
"code": 0,
"message": "",
"details": []
}
}
}{
"result": {
"contextId": "ctx-1",
"flushCompleted": {},
"status": {
"code": 0,
"message": "",
"details": []
}
}
Authentication: your authentication credentials. For Basic authentication, populate the value Basic $INWORLD_RUNTIME_BASE64_CREDENTIAL.
create: Create a new context with the specified voice and configuration. A context is an independent conversation happening over the connection. Each context's configuration is completely separate: contexts can use different voice IDs, models, output formats, and so on. Note: each connection supports at most 5 contexts. If you don't need multiple contexts, omit contextId from your messages to use a single-context connection.
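For example, two contexts with different voices can be interleaved over the same connection; each message simply carries its own contextId. A sketch, reusing a ws connection opened as in the sketch above:

```typescript
import WebSocket from "ws";
declare const ws: WebSocket; // an open connection, as in the sketch above

// Two independent contexts on one connection, each with its own voice.
ws.send(JSON.stringify({ create: { voiceId: "Dennis", modelId: "inworld-tts-1-max" }, contextId: "narrator" }));
ws.send(JSON.stringify({ create: { voiceId: "Ashley", modelId: "inworld-tts-1-max" }, contextId: "assistant" }));

// Interleaved sends: the server keeps each context's buffer and config separate.
ws.send(JSON.stringify({ send_text: { text: "Once upon a time..." }, contextId: "narrator" }));
ws.send(JSON.stringify({ send_text: { text: "How can I help you today?" }, contextId: "assistant" }));

// Flush independently; each audioChunk result is tagged with its contextId.
ws.send(JSON.stringify({ flush_context: {}, contextId: "narrator" }));
ws.send(JSON.stringify({ flush_context: {}, contextId: "assistant" }));
```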
send_text: Send text to be synthesized for a specific context. A single send_text request can contain at most 1,000 characters. Text can be buffered on the server or flushed immediately by including flush_context in the message.
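Longer passages therefore have to be split across multiple send_text messages. A sketch of one way to do that, preferring sentence boundaries (the 1,000-character cap comes from the description above; the splitting heuristic is our own):

```typescript
import WebSocket from "ws";
declare const ws: WebSocket;    // an open connection, as above
declare const longText: string; // input longer than 1,000 characters

// Split text into pieces of at most 1,000 characters, breaking after
// sentence-ending punctuation where possible so speech sounds natural.
function chunkText(text: string, maxLen = 1000): string[] {
  const chunks: string[] = [];
  let rest = text;
  while (rest.length > maxLen) {
    const slice = rest.slice(0, maxLen);
    const breakAt = Math.max(
      slice.lastIndexOf(". "),
      slice.lastIndexOf("! "),
      slice.lastIndexOf("? "),
    );
    const cut = breakAt > 0 ? breakAt + 1 : maxLen;
    chunks.push(rest.slice(0, cut));
    rest = rest.slice(cut).trimStart();
  }
  if (rest) chunks.push(rest);
  return chunks;
}

for (const chunk of chunkText(longText)) {
  ws.send(JSON.stringify({ send_text: { text: chunk }, contextId: "ctx-1" }));
}
ws.send(JSON.stringify({ flush_context: {}, contextId: "ctx-1" }));
```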
flush_context: Flush a context, starting synthesis of all accumulated text. Note that the buffer automatically flushes whenever the accumulated text exceeds 1,000 characters, regardless of any other buffer settings.
update_config: Update the configuration of an existing context (voice, temperature, audio settings, etc.). Note: the model cannot be updated. Sending an update message forces synthesis of all text accumulated in the context's buffer using the previous configuration.
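An update therefore acts as a configuration boundary: text buffered before it is voiced with the old settings, text sent after it with the new ones. A sketch:

```typescript
import WebSocket from "ws";
declare const ws: WebSocket; // an open connection, as above

// This text is voiced with the context's current (old) configuration:
ws.send(JSON.stringify({ send_text: { text: "Dennis says this line." }, contextId: "ctx-1" }));

// The update forces a flush, so the buffered text above keeps the old voice.
ws.send(JSON.stringify({ update_config: { voiceId: "Ashley" }, contextId: "ctx-1" }));

// Text sent from here on is synthesized with the updated voice.
ws.send(JSON.stringify({
  send_text: { text: "Ashley says this one.", flush_context: {} },
  contextId: "ctx-1",
}));
```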
close_context: Close an existing context and release all of its resources. Sending a close_context message is equivalent to sending a flush message immediately beforehand, so any text remaining in the buffer is synthesized before the context closes. Note that the session is automatically closed after 10 minutes of inactivity across all contexts.
contextCreated: Event sent when a new TTS context has been successfully created.
audioChunk: Audio data chunk containing synthesized speech.
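When timestampType is WORD, as in the examples above, the wordAlignment arrays are parallel: words[i] spans wordStartTimeSeconds[i] to wordEndTimeSeconds[i]. A small helper to zip them into per-word timings (useful for captions):

```typescript
// Zip the parallel wordAlignment arrays into per-word timings.
interface WordTiming {
  word: string;
  start: number; // seconds
  end: number;   // seconds
}

function alignWords(wordAlignment: {
  words: string[];
  wordStartTimeSeconds: number[];
  wordEndTimeSeconds: number[];
}): WordTiming[] {
  return wordAlignment.words.map((word, i) => ({
    word,
    start: wordAlignment.wordStartTimeSeconds[i],
    end: wordAlignment.wordEndTimeSeconds[i],
  }));
}

// With the example chunk above, the first entry would be
// { word: "Hello,", start: 0.031, end: 0.355 }.
```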
contextClosed: Event sent when a context has been closed.
contextUpdated: Event sent when a context configuration has been successfully updated.
flushCompleted: Event sent when speech synthesis for a flush of text has completed. Some WebSocket use cases need an indicator that synthesis for a flushed batch of text is finished, so an empty "flushCompleted": {} event is emitted at the end of synthesis for each flush. Note that the implementation currently assumes flushes execute sequentially, so the first flushCompleted event corresponds to the first flush call made on the client side.
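Because flushes complete in order, a client can pair flushCompleted events with flush calls using a simple FIFO queue. A sketch, assuming a ws connection as above:

```typescript
import WebSocket from "ws";
declare const ws: WebSocket; // an open connection, as above

// FIFO of resolvers: the nth flushCompleted event resolves the nth flush.
const pendingFlushes: Array<() => void> = [];

function flush(contextId: string): Promise<void> {
  return new Promise((resolve) => {
    pendingFlushes.push(resolve);
    ws.send(JSON.stringify({ flush_context: {}, contextId }));
  });
}

ws.on("message", (data) => {
  const { result } = JSON.parse(data.toString());
  if (result?.flushCompleted) {
    // Flushes complete sequentially, so resolve the oldest pending one.
    pendingFlushes.shift()?.();
  }
});

// Usage: `await flush("ctx-1")` resolves once all audio for that flush has arrived.
```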