Speech Synthesis
Introduction
Speech synthesis (also known as text-to-speech or TTS) is the process of converting written text into spoken audio.
In VSDK, speech synthesis is powered by CSDK, which offers a wide range of voices across different languages, genders, and voice qualities (see Voice quality availability).
Voice Format
For <language>, refer to the table and use the value from the Vsdk-csdk Code column.
For <name>, use the lowercase version of the name shown in VDK-Studio.
For <quality>, you can find this information in VDK-Studio under Resources → Voice.
| Engine | Format | Example |
|---|---|---|
| vsdk-csdk | | |
SSML Support
VSDK also supports SSML (Speech Synthesis Markup Language), which gives you finer control over how the text is spoken. It allows adjustments such as:
Pronunciation
Pauses
Pitch
Rate
Emphasis
SSML is supported for embedded voices, but not for neural voices (if present in your configuration). Neural voices are more natural-sounding but behave as a black box and do not support markup-based control.
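As an illustration of the markup involved, the sketch below builds a small SSML document in Python and checks that it is well-formed XML before it would be handed to a synthesizer. It uses only elements from the W3C SSML 1.0 specification (`speak`, `prosody`, `emphasis`, `break`). The helper name `build_ssml`, its parameters, and the `en-US` language tag are assumptions made for this example, not part of VSDK, and which elements a given embedded voice honors is not confirmed here.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_ssml(text, rate="medium", pitch="medium", pause_ms=0, emphasize=None):
    """Wrap plain text in a minimal SSML document (hypothetical helper).

    rate / pitch : values for the standard <prosody> attributes
    pause_ms     : optional trailing pause, emitted as <break time="...ms"/>
    emphasize    : optional substring to wrap in <emphasis level="strong">
    """
    # Escape &, < and > so arbitrary text cannot break the markup.
    body = escape(text)
    if emphasize:
        marked = escape(emphasize)
        # Only the first occurrence is emphasized.
        body = body.replace(
            marked, f'<emphasis level="strong">{marked}</emphasis>', 1
        )
    pause = f'<break time="{pause_ms}ms"/>' if pause_ms > 0 else ""
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f'<prosody rate="{rate}" pitch="{pitch}">{body}</prosody>{pause}'
        "</speak>"
    )


ssml = build_ssml(
    "Turn left in 200 meters.", rate="slow", pause_ms=300, emphasize="left"
)

# Parsing raises xml.etree.ElementTree.ParseError if the markup is malformed.
root = ET.fromstring(ssml)
```

Validating the string locally catches unbalanced tags or unescaped characters early. With a neural voice, the same text would have to be sent without markup, since those voices do not support SSML.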
Audio Format
The audio data is a 16-bit signed PCM buffer in Little-Endian format.
It is always mono (1 channel), and the sample rate depends on the engine being used.
| Engine | Sample Rate (Hz) |
|---|---|
| csdk | 22050 |
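Because the buffer is raw PCM with no header, most players cannot open it directly. The sketch below wraps such a buffer in a WAV container using Python's standard `wave` module, with the parameters stated above (16-bit, mono, 22050 samples per second for csdk). The function names are hypothetical, and the sine tone only stands in for audio that would normally come from the engine.

```python
import io
import math
import struct
import wave

SAMPLE_RATE_HZ = 22050   # csdk sample rate from the table above
CHANNELS = 1             # output is always mono
SAMPLE_WIDTH_BYTES = 2   # 16-bit signed samples


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = SAMPLE_RATE_HZ) -> bytes:
    """Return the raw PCM buffer wrapped in a WAV (RIFF) container."""
    if len(pcm) % SAMPLE_WIDTH_BYTES:
        raise ValueError("PCM buffer length must be a multiple of 2 bytes")
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(sample_rate)
        # WAV stores 16-bit samples little-endian, matching the buffer as-is.
        wav.writeframes(pcm)
    return buf.getvalue()


def duration_seconds(pcm: bytes, sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Playback duration implied by the buffer size."""
    return len(pcm) / (SAMPLE_WIDTH_BYTES * CHANNELS * sample_rate)


# Stand-in for synthesized audio: 0.5 s of a 440 Hz tone,
# packed as little-endian signed 16-bit integers ("<h").
num_samples = SAMPLE_RATE_HZ // 2
pcm = b"".join(
    struct.pack("<h", int(12000 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE_HZ)))
    for i in range(num_samples)
)

wav_bytes = pcm_to_wav_bytes(pcm)
```

Writing `wav_bytes` to a file with a `.wav` extension gives something any standard audio player can open. If another engine with a different sample rate is used, pass that rate as `sample_rate` so playback speed stays correct.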
Sample project
A sample project is available for Speech Synthesis usage with VDK Service (in C# or Python).