Speech Synthesis - C++
Introduction
Speech synthesis (also known as text-to-speech or TTS) is the process of converting written text into spoken audio.
In VSDK, speech synthesis is powered by CSDK, which offers a wide range of voices across different languages, genders, and voice qualities (see Voice quality availability).
Channels
A channel is what you use to generate speech. It holds one or more voices.
A channel itself doesn’t have a language—the language is defined by the voices you assign to it.
This means a single channel can include voices in different languages.
You can also define multiple channels in your configuration. This is useful when:
You want to synthesize multiple texts at the same time (parallel TTS).
You want to organize voices based on use case (e.g., one channel for alerts, another for navigation), as sketched below.
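A minimal sketch of that second case, assuming the engine handle from the pipeline example further down and a configuration that defines two channels. The channel names and the French voice ID are placeholders; use the ones from your own VDK-Studio export:
// Hypothetical channel names and voice IDs; take them from your own configuration.
// Pipeline setup (one per channel) is omitted here, see "Build Pipeline" below.
auto alertChannel = engine->channel("alerts", "enu,evan,embedded-pro");
auto navChannel   = engine->channel("navigation", "frf,audrey,embedded-pro");

alertChannel->synthesizeFromText("Low battery.");    // spoken by the alerts voice
navChannel->synthesizeFromText("Tournez à droite."); // spoken by the navigation voice, in French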
SSML Support
VSDK also supports SSML (Speech Synthesis Markup Language), which gives you finer control over how the text is spoken—allowing adjustments such as:
Pronunciation
Pauses
Pitch
Rate
Emphasis
SSML is supported for embedded voices, but not for neural voices (if present in your configuration). Neural voices are more natural-sounding but behave as a black box and do not support markup-based control.
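For example, a pause and a slower, lower delivery can be requested with standard SSML elements. This is a sketch: the exact set of supported elements depends on the embedded voice, and the channel handle comes from the examples below.
std::string const ssml =
    "<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">"
    "Attention. <break time=\"500ms\"/> "
    "<prosody rate=\"slow\" pitch=\"low\">Battery level is low.</prosody>"
    "</speak>";
channel->synthesizeFromText(ssml); // SSML goes through the same entry point as plain text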
Audio Format
The audio data is a 16-bit signed PCM buffer in Little-Endian format.
It is always mono (1 channel), and the sample rate depends on the engine being used.
| Engine | Sample Rate (Hz) |
|---|---|
| csdk | 22050 |
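Since the audio is mono with 2 bytes per sample, the playback duration of a buffer follows directly from its byte size. A quick sketch (the byte count is illustrative):
std::size_t const byteCount  = 441000; // illustrative: size of a synthesized buffer or .pcm file
unsigned const    sampleRate = 22050;  // csdk, see the table above
double const seconds = byteCount / (2.0 * sampleRate); // 16-bit mono => 2 bytes per sample; 10 s here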
Voice Format
For <language>, refer to the table and use the value from the Vsdk-csdk Code column.
For <name>, use the lowercase version of the name shown in VDK-Studio.
For <quality>, you can find this information in VDK-Studio under Resources → Voice.
| Engine | Format | Example |
|---|---|---|
| vsdk-csdk | <language>,<name>,<quality> | enu,evan,embedded-pro |
Getting Started
Before you begin, make sure you’ve completed all the necessary preparation steps.
There are two ways to prepare your project for Voice Synthesis:
Using sample code
Starting from scratch
From Sample Code
To download the sample code, you'll need Conan. All the necessary steps are outlined in the general Getting Started guide.
📦 tts
conan search -r vivoka-customer tts # To get the latest version.
conan install -if tts tts/<version>@vivoka/customer
Open project.vdk in VDK-Studio
Export the assets from VDK-Studio into the same directory (don't forget to add a voice to the channel and save the configuration)
conan install . -if build
conan build . -if build
./build/Release/tts <voice_id>
From Scratch
Before proceeding, make sure you’ve completed the following steps:
1. Prepare your VDK Studio project
Create a new project in VDK Studio
Add the Voice Synthesis technology and channel with voice(s)
Export the project to generate the required assets and configuration
2. Set up your project
Install the necessary libraries
vsdk-audio-portaudio/4.1.0@vivoka/customer
vsdk-csdk-tts/1.1.0@vivoka/customer
vsdk-samples-utils/1.1.0@vivoka/customer
These steps are explained in more detail in the Getting Started guide.
Start Synthesis
1. Build Pipeline
For this example, we'll implement a simple pipeline that takes synthesized audio from a channel and sends it to an audio player:
Start by initializing the Voice Synthesis engine:
#include <vsdk/audio/Pipeline.hpp>
#include <vsdk/audio/consumers/PaPlayer.hpp>
#include <vsdk/tts/csdk.hpp>
#include <vsdk/utils/samples/EventLoop.hpp>
#include <vsdk/utils/Misc.hpp> // for printExceptionStack
#include <fmt/core.h>
#include <csignal>
#include <memory>
#include <string>
using namespace Vsdk;
using namespace Vsdk::Audio;
using namespace Vsdk::Tts;
using Vsdk::Utils::Samples::EventLoop;
using Seconds = Vsdk::Audio::Consumer::PaPlayer::Seconds;
using TtsEngine = Vsdk::Tts::Csdk::Engine;
namespace
{
    constexpr auto channelName = "channel-1";
    constexpr auto voiceId = "enu,evan,embedded-pro";
    constexpr auto phrase = "Welcome to Vivoka text-to-speech demo.";
}

static void onTtsEvent(Channel & channel, Channel::Event const & e)
{
    fmt::print("[channel:{}] {}: {}\n", channel.name(), e.codeString, e.message);
}

static void onTtsError(Channel::Error const & e)
{
    auto type = (e.type == Channel::ErrorType::Error ? "Error" : "Warning");
    fmt::print(stderr, "[{:>7}] {}: {}\n", type, e.codeString, e.message);
}

int main() try
{
    // Make sure the event loop is destroyed on scope exit, even if an exception is thrown.
    std::shared_ptr<void> const guard(nullptr, [] (auto) { EventLoop::destroy(); });

    auto engine = Engine::make<TtsEngine>("config/vsdk.json");
    fmt::print("TTS Engine version: {}\n", engine->version());

    // Create the channel and activate the requested voice right away.
    auto channel = engine->channel(channelName, voiceId);
    channel->subscribe([&] (Channel::Event const & e) { onTtsEvent(*channel, e); });
    channel->subscribe(&onTtsError);

    auto player = std::make_shared<Consumer::PaPlayer>();
    player->setProgressCallback([] (Seconds, Seconds) { }); // no-op here, see the TextMarker section
    player->setFinishedCallback([] {
        fmt::print("Playback finished. Exiting...\n");
        EventLoop::instance().shutdown();
    });

    // The channel produces the audio, the player consumes it.
    Pipeline pipeline;
    pipeline.setProducer(channel);
    pipeline.pushBackConsumer(player);
    pipeline.start();

    EventLoop::instance().queue([&] {
        fmt::print("Synthesizing predefined phrase...\n");
        channel->synthesizeFromText(phrase);
    });

    EventLoop::instance().run();
    return EXIT_SUCCESS;
}
catch (std::exception const & e)
{
    fmt::print(stderr, "Fatal error:\n");
    Vsdk::printExceptionStack(e);
    return EXIT_FAILURE;
}
You cannot create two instances of the same engine.
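In practice, create the engine once and share that handle wherever it is needed; channels, on the other hand, can be created freely from the single instance. A sketch (channel-2 and its voice ID are hypothetical):
auto engine = Engine::make<TtsEngine>("config/vsdk.json"); // do this exactly once
auto channel1 = engine->channel("channel-1", "enu,evan,embedded-pro");
// auto channel2 = engine->channel("channel-2", someOtherVoiceId); // hypothetical second channel
// Calling Engine::make<TtsEngine>(...) again would violate the single-instance rule above.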
2. Listing voices
// C++17 or higher (note: fmt::join may require including <fmt/ranges.h>)
for (auto const & [channel, voices] : engine->availableVoices())
fmt::print("Available voices for '{}': ['{}']\n", channel, fmt::join(voices, "'; '"));
// C++11 or higher
for (auto const & it : engine->availableVoices())
fmt::print("Available voices for '{}': ['{}']\n", it.first, fmt::join(it.second, "'; '"));
3. Set voice
channel->setCurrentVoice("enu,evan,embedded-pro");
You can also activate a voice right away:
auto channel = engine->channel(channelName, "enu,evan,embedded-pro");
4. Start/Stop Pipeline
pipeline.start();
pipeline.stop();
pipeline.run();
.start() runs the pipeline in a new thread.
.run() runs the pipeline and waits until it has finished (blocking).
.stop() terminates the pipeline execution.
Once a pipeline has been stopped, you can restart it at any time by simply calling .start() again.
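For instance:
pipeline.stop();  // terminate the current run
pipeline.start(); // ...and later restart the very same pipeline
channel->synthesizeFromText("Back again!");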
To stop playing:
pipeline.stop();
player->stop(); // 'player' is the PaPlayer from the pipeline example
Before calling .synthesizeFromText(), you need to start the pipeline:
pipeline.start();
channel->synthesizeFromText("Hello world!");
You can pass SSML the same way:
std::string const ssml = "<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"fr-FR\">Bonjour Vivoka</speak>";
channel->synthesizeFromText(ssml);
To pause/resume TTS:
player->pause();
player->resume();
If you call channel->synthesizeFromText() more than once, then the last call will override all the previous ones.
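If several phrases must be spoken in order, trigger each synthesis only after the previous playback has finished, for example from the player's finished callback. A sketch reusing the pipeline example above (the phrase list is illustrative):
std::vector<std::string> const phrases = { "First sentence.", "Second sentence." }; // illustrative
std::size_t next = 0;

player->setFinishedCallback([&] {
    if (++next < phrases.size())
        channel->synthesizeFromText(phrases[next]); // only start the next phrase now
    else
        EventLoop::instance().shutdown();
});

pipeline.start();
channel->synthesizeFromText(phrases[next]); // kick off the first phrase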
5. Additional channel methods
channel->sampleRate();   // sample rate of the synthesized audio, in Hz
channel->channelCount(); // number of audio channels (1 = mono, 2 = stereo)
Blocking Speech Synthesis
This method uses only the channel, without involving a Pipeline. It blocks execution until the synthesis is fully completed, then returns a buffer with audio. You can then choose to either save the buffer or play it directly.
#include <vsdk/tts/Utils.hpp>

Vsdk::Audio::Buffer buffer = Vsdk::Tts::synthesizeFromText(channel, "Hello");

auto const ssml = R"(<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
Here is an <say-as interpret-as="characters">SSML</say-as> sample.
</speak>)";
buffer = Vsdk::Tts::synthesizeFromText(channel, ssml);

buffer = Vsdk::Tts::synthesizeFromFile(channel, "path/to/file.txt");

buffer.saveToFile("output_audio.pcm");
Audio::Buffer is not a pointer type; avoid unnecessary copying and prefer move operations.
The synthesis result is a buffer containing raw audio data in 16-bit signed Little-Endian PCM format.
Playing Audio
You can play audio from a file using the following command:
aplay -f S16_LE -r 22050 -c 1 output_audio.pcm
On Windows, you can use Audacity to import the raw PCM data, then play or convert it as needed.
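If you prefer a file that standard players can open directly, you can also wrap the raw PCM in a minimal WAV header yourself. A self-contained sketch for the 16-bit signed mono little-endian format described above (it assumes a little-endian host, which matches the data layout anyway):
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Wraps raw 16-bit signed mono little-endian PCM in a minimal 44-byte WAV header.
void writeWav(std::string const & path, std::vector<char> const & pcm, std::uint32_t sampleRate)
{
    std::ofstream out(path, std::ios::binary);
    auto const u32 = [&] (std::uint32_t v) { out.write(reinterpret_cast<char const *>(&v), 4); };
    auto const u16 = [&] (std::uint16_t v) { out.write(reinterpret_cast<char const *>(&v), 2); };

    out.write("RIFF", 4);
    u32(36 + static_cast<std::uint32_t>(pcm.size())); // file size minus the first 8 bytes
    out.write("WAVEfmt ", 8);
    u32(16);             // fmt chunk size
    u16(1);              // audio format: PCM
    u16(1);              // channels: mono
    u32(sampleRate);     // e.g. 22050 for csdk
    u32(sampleRate * 2); // byte rate = sampleRate * channels * bytesPerSample
    u16(2);              // block align = channels * bytesPerSample
    u16(16);             // bits per sample
    out.write("data", 4);
    u32(static_cast<std::uint32_t>(pcm.size()));
    out.write(pcm.data(), static_cast<std::streamsize>(pcm.size()));
}
Calling, for example, writeWav("output_audio.wav", pcmBytes, 22050) on the bytes read back from the file saved earlier (pcmBytes is a hypothetical variable holding them) produces a file any standard audio player can open.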
Alternatively, you can use PaStandalonePlayer (the cross-platform player from vsdk-audio-portaudio) and play the buffer directly.
#include <vsdk/audio/PaStandalonePlayer.hpp>
Vsdk::Audio::PaStandalonePlayer player;
player.play(buffer);
Unlike PaPlayer, which is used for streaming (as shown in the Pipeline example), PaStandalonePlayer is not designed for streaming. Instead, it is intended for playing complete audio buffers.
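The two approaches combine naturally: synthesize a complete buffer with the blocking call, then hand it to the standalone player (a sketch using only calls shown above):
auto const buffer = Vsdk::Tts::synthesizeFromText(channel, "Hello from the standalone player!");

Vsdk::Audio::PaStandalonePlayer standalonePlayer;
standalonePlayer.play(buffer); // plays the complete buffer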
TextMarker
We also support progress tracking, including word-level markers, which can be used to synchronize text display or trigger actions as speech is spoken.
The following code demonstrates how to integrate word-level markers into the pipeline for synchronized text playback.
#include "TextMarkerHandler.hpp"
using Seconds = Vsdk::Audio::Consumer::PaPlayer::Seconds;
void onWordReached(Vsdk::Tts::Events::WordMarker const & marker)
{
fmt::print("On word reached: {}\n", marker.word);
}
void onChannelEvent(Vsdk::Tts::Channel::Event const & e)
{
if (e.code == Vsdk::Tts::Channel::EventCode::WordMarkerStart)
{
Vsdk::Tts::Events::WordMarker marker = nlohmann::json::parse(e.message);
textMarker.addMarker(std::move(marker));
}
}
...
channel->subscribe(&onTtsEvent);
TextMarkerHandler textMarker;
textMarker.setAudioFormat(channel->sampleRate(), channel->channelCount());
textMarker.setReachedMarkerCallback(&onWordReached);
auto player = std::make_shared<Vsdk::Audio::Consumer::PaPlayer>();
player->setProgressCallback([] (Seconds b, Seconds e) { textMarker.onPlayerProgress(b, e); });
player->setFinishedCallback([&] {
textMarker.reset();
});
...