Speech Synthesis
Introduction
Speech synthesis (also known as text-to-speech or TTS) is the process of converting written text into spoken audio.
In VSDK, speech synthesis is powered by CSDK, which offers a wide range of voices across different languages, genders, and voice qualities (see Voice quality availability).
Voice Format
For <language>, refer to the table and use the value from the Vsdk-csdk Code column.
For <name>, use the lowercase version of the name shown in VDK-Studio.
For <quality>, you can find this information in VDK-Studio under Resources → Voice.
| Engine | Format | Example |
|---|---|---|
| vsdk-csdk | <language>,<name>,<quality> | enu,tom,embedded-compact |
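As a minimal sketch of how the three parts fit together, the helpers below assemble and split a voice ID. They assume the comma-separated layout seen in the request example further down (`enu,tom,embedded-compact`). The function names `build_voice_id` and `parse_voice_id` are illustrative and not part of VSDK.

```python
from typing import Tuple


def build_voice_id(language: str, name: str, quality: str) -> str:
    """Join language, name and quality into a comma-separated voice ID.

    The name is lowercased, as required for the <name> part.
    """
    parts = [language.strip(), name.strip().lower(), quality.strip()]
    if not all(parts):
        raise ValueError("language, name and quality must all be non-empty")
    if any("," in part for part in parts):
        raise ValueError("parts must not contain commas")
    return ",".join(parts)


def parse_voice_id(voice_id: str) -> Tuple[str, str, str]:
    """Split a voice ID back into (language, name, quality)."""
    parts = voice_id.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected 3 comma-separated parts, got {len(parts)}")
    language, name, quality = parts
    return language, name, quality


print(build_voice_id("enu", "Tom", "embedded-compact"))
```

Lowercasing inside the helper means the name can be pasted from VDK-Studio as displayed.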
SSML Support
VSDK also supports SSML (Speech Synthesis Markup Language), which gives you finer control over how the text is spoken. It allows adjustments such as:
Pronunciation
Pauses
Pitch
Rate
Emphasis
SSML is supported for embedded voices, but not for neural voices (if present in your configuration). Neural voices are more natural-sounding but behave as a black box and do not support markup-based control.
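To illustrate what such markup can look like, the sketch below builds a small SSML document using standard W3C elements (`prosody`, `break`, `emphasis`). This page does not list which elements or attribute values the embedded voices honor, nor how the markup is passed to the service, so treat the element choice as an assumption to verify against your voice. `build_ssml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def build_ssml(sentence: str, emphasized: str, pause_ms: int = 500) -> str:
    """Return an SSML string: a slow, high-pitched sentence, a pause,
    then an emphasized sentence.

    Building the tree with ElementTree escapes special characters
    (&, <, >) in the text automatically.
    """
    speak = ET.Element("speak")

    # Rate and pitch are adjusted through the <prosody> element.
    prosody = ET.SubElement(speak, "prosody", {"rate": "slow", "pitch": "high"})
    prosody.text = sentence

    # A pause is expressed with an empty <break> element.
    ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})

    # Emphasis wraps the words to stress.
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = emphasized

    return ET.tostring(speak, encoding="unicode")


print(build_ssml("Hello world.", "My name is Tom."))
```

Some engines also expect `version` and `xmlns` attributes on the root `speak` element. They are omitted here because this page does not specify them.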
Audio Format
The audio data is a 16-bit signed PCM buffer in Little-Endian format.
It is always mono (1 channel), and the sample rate depends on the engine being used.
| Engine | Sample Rate (Hz) |
|---|---|
| csdk | 22050 |
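Because the buffer is raw PCM with no header, most players cannot open it directly. As a minimal sketch, the code below wraps such a buffer in a WAV container using Python's standard `wave` module and computes its duration. It assumes the csdk parameters described above: 16-bit signed little-endian samples, mono, 22050 Hz. The helper names are illustrative.

```python
import io
import wave

SAMPLE_RATE_HZ = 22050   # csdk engine sample rate
SAMPLE_WIDTH_BYTES = 2   # 16-bit signed PCM
CHANNELS = 1             # always mono


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = SAMPLE_RATE_HZ) -> bytes:
    """Prepend a WAV header to a raw PCM buffer and return the result."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(sample_rate)
        # WAV stores 16-bit samples as signed little-endian,
        # so the PCM bytes can be written unchanged.
        wav.writeframes(pcm)
    return buffer.getvalue()


def duration_seconds(pcm: bytes, sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Length of the audio: bytes / (bytes per sample * channels * rate)."""
    return len(pcm) / (SAMPLE_WIDTH_BYTES * CHANNELS * sample_rate)


# One second of silence: 22050 samples of two zero bytes each.
silence = b"\x00\x00" * SAMPLE_RATE_HZ
wav_data = pcm_to_wav_bytes(silence)
print(len(wav_data), duration_seconds(silence))
```

Writing `wav_data` to a file with a `.wav` extension produces audio that standard players can open.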
Examples
The available routes are listed in the Voice synthesis section of the REST API documentation.
Synthesis
You can retrieve a list of the voices you configured for your loaded project.
[GET] /voice-synthesis/voices
Then you can request a synthesis.
[POST] /voice-synthesis/synthesize
{
  "text": "Hello world, my name is Tom!",
  "voice_id": "enu,tom,embedded-compact"
}
If the request is successful, you receive a token and can proceed to the WebSocket API.
You can then receive the generated audio through the newly opened socket.
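The two REST calls in this section can be sketched with Python's standard library as follows. The base URL is a placeholder, and the sketch assumes the service replies with JSON. Neither detail is stated on this page, so check them against your VDK Service configuration. The function names are illustrative. The WebSocket step is left out because its message format is not described here.

```python
import json
import urllib.request

# Placeholder address: replace with the host and port of your VDK Service.
BASE_URL = "http://localhost:8080"


def list_voices_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the GET request that lists the configured voices."""
    return urllib.request.Request(
        f"{base_url}/voice-synthesis/voices", method="GET"
    )


def synthesize_request(
    text: str, voice_id: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build the POST request that starts a synthesis."""
    body = json.dumps({"text": text, "voice_id": voice_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/voice-synthesis/synthesize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request):
    """Send a request and decode the reply, assuming a JSON response body."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, `send(list_voices_request())` returns the voice list, and `send(synthesize_request("Hello world, my name is Tom!", "enu,tom,embedded-compact"))` returns the response that carries the token.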
Sample project
A sample project is available for Speech Synthesis usage with VDK Service (in C# or Python).