Speech Synthesis - C++
Introduction
Speech synthesis (also known as text-to-speech or TTS) is the process of converting written text into spoken audio.
In VSDK, speech synthesis is powered by CSDK, which offers a wide range of voices across different languages, genders, and voice qualities (see Voice quality availability).
Channels
A channel is what you use to generate speech. It holds one or more voices.
A channel itself doesn’t have a language—the language is defined by the voices you assign to it.
This means a single channel can include voices in different languages.
You can also define multiple channels in your configuration. This is useful when:
You want to synthesize multiple texts at the same time (parallel TTS).
You want to organize voices based on use case (e.g., one channel for alerts, another for navigation), as sketched below.
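A minimal sketch of that second case, assuming the engine handle from the pipeline example further down and a configuration that defines two channels. The channel names and the French voice ID are placeholders; use the ones from your own VDK-Studio export:
// Hypothetical channel names and voice IDs; take them from your own configuration.
// Pipeline setup (one per channel) is omitted here, see "Build Pipeline" below.
auto alertChannel = engine->channel("alerts", "enu,evan,embedded-pro");
auto navChannel   = engine->channel("navigation", "frf,audrey,embedded-pro");

alertChannel->synthesizeFromText("Low battery.");    // spoken by the alerts voice
navChannel->synthesizeFromText("Tournez à droite."); // spoken by the navigation voice, in French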
SSML Support
VSDK also supports SSML (Speech Synthesis Markup Language), which gives you finer control over how the text is spoken—allowing adjustments such as:
Pronunciation
Pauses
Pitch
Rate
Emphasis
SSML is supported for embedded voices, but not for neural voices (if present in your configuration). Neural voices are more natural-sounding but behave as a black box and do not support markup-based control.
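For example, a pause and a slower, lower delivery can be requested with standard SSML elements. This is a sketch: the exact set of supported elements depends on the embedded voice, and the channel handle comes from the examples below.
std::string const ssml =
    "<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">"
    "Attention. <break time=\"500ms\"/> "
    "<prosody rate=\"slow\" pitch=\"low\">Battery level is low.</prosody>"
    "</speak>";
channel->synthesizeFromText(ssml); // SSML goes through the same entry point as plain text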
Audio Format
The audio data is a 16-bit signed PCM buffer in Little-Endian format.
It is always mono (1 channel), and the sample rate depends on the engine being used.
| Engine | Sample Rate (Hz) |
|---|---|
| csdk | 22050 |
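Since the audio is mono with 2 bytes per sample, the playback duration of a buffer follows directly from its byte size. A quick sketch (the byte count is illustrative):
std::size_t const byteCount  = 441000; // illustrative: size of a synthesized buffer or .pcm file
unsigned const    sampleRate = 22050;  // csdk, see the table above
double const seconds = byteCount / (2.0 * sampleRate); // 16-bit mono => 2 bytes per sample; 10 s here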
Voice Format
For <language>, refer to the table and use the value from the Vsdk-csdk Code column.
For <name>, use the lowercase version of the name shown in VDK-Studio.
For <quality>, you can find this information in VDK-Studio under Resources → Voice.
| Engine | Format | Example |
|---|---|---|
| vsdk-csdk | <language>,<name>,<quality> | enu,evan,embedded-pro |
Getting Started
Before you begin, make sure you’ve completed all the necessary preparation steps.
There are two ways to prepare your project for Voice Synthesis:
Using sample code
Starting from scratch
From Sample Code
To download the sample code, you'll need Conan. All the necessary steps are outlined in the general Getting Started guide.
📦 tts
conan search -r vivoka-customer tts # To get the latest version.
conan install -if tts tts/<version>@vivoka/customer
Open project.vdk in VDK-Studio
Export the assets from VDK-Studio into the same directory (don't forget to add a voice to the channel and save the configuration)
conan install . -if build
conan build . -if build
./build/Release/tts <voice_id>
From Scratch
Before proceeding, make sure you’ve completed the following steps:
1. Prepare your VDK Studio project
Create a new project in VDK Studio
Add the Voice Synthesis technology and channel with voice(s)
Export the project to generate the required assets and configuration
2. Set up your project
Install the necessary libraries
vsdk-audio-portaudio/4.1.0@vivoka/customer
vsdk-csdk-tts/1.1.0@vivoka/customer
vsdk-samples-utils/1.1.0@vivoka/customer
These steps are explained in more detail in the Getting Started guide.
Start Synthesis
1. Build Pipeline
For this example, we'll implement a simple pipeline that takes synthesized audio from a channel and sends it to an audio player:
Start by initializing the Voice Synthesis engine:
#include <vsdk/audio/Pipeline.hpp>
#include <vsdk/audio/consumers/PaPlayer.hpp>
#include <vsdk/tts/csdk.hpp>
#include <vsdk/utils/samples/EventLoop.hpp>
#include <vsdk/utils/Misc.hpp> // for printExceptionStack
#include <fmt/core.h>
#include <csignal>
#include <memory>
#include <string>
using namespace Vsdk;
using namespace Vsdk::Audio;
using namespace Vsdk::Tts;
using Vsdk::Utils::Samples::EventLoop;
using Seconds = Vsdk::Audio::Consumer::PaPlayer::Seconds;
using TtsEngine = Vsdk::Tts::Csdk::Engine;
namespace
{
    constexpr auto channelName = "channel-1";
    constexpr auto voiceId = "enu,evan,embedded-pro";
    constexpr auto phrase = "Welcome to Vivoka text-to-speech demo.";
}

static void onTtsEvent(Channel & channel, Channel::Event const & e)
{
    fmt::print("[channel:{}] {}: {}\n", channel.name(), e.codeString, e.message);
}

static void onTtsError(Channel::Error const & e)
{
    auto type = (e.type == Channel::ErrorType::Error ? "Error" : "Warning");
    fmt::print(stderr, "[{:>7}] {}: {}\n", type, e.codeString, e.message);
}

int main() try
{
    // Make sure the event loop is destroyed on scope exit, even if an exception is thrown.
    std::shared_ptr<void> const guard(nullptr, [] (auto) { EventLoop::destroy(); });

    auto engine = Engine::make<TtsEngine>("config/vsdk.json");
    fmt::print("TTS Engine version: {}\n", engine->version());

    // Create the channel and activate the requested voice right away.
    auto channel = engine->channel(channelName, voiceId);
    channel->subscribe([&] (Channel::Event const & e) { onTtsEvent(*channel, e); });
    channel->subscribe(&onTtsError);

    auto player = std::make_shared<Consumer::PaPlayer>();
    player->setProgressCallback([] (Seconds, Seconds) { }); // no-op here, see the TextMarker section
    player->setFinishedCallback([] {
        fmt::print("Playback finished. Exiting...\n");
        EventLoop::instance().shutdown();
    });

    // The channel produces the audio, the player consumes it.
    Pipeline pipeline;
    pipeline.setProducer(channel);
    pipeline.pushBackConsumer(player);
    pipeline.start();

    EventLoop::instance().queue([&] {
        fmt::print("Synthesizing predefined phrase...\n");
        channel->synthesizeFromText(phrase);
    });

    EventLoop::instance().run();
    return EXIT_SUCCESS;
}
catch (std::exception const & e)
{
    fmt::print(stderr, "Fatal error:\n");
    Vsdk::printExceptionStack(e);
    return EXIT_FAILURE;
}
You cannot create two instances of the same engine.
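In practice, create the engine once and share that handle wherever it is needed; channels, on the other hand, can be created freely from the single instance. A sketch (channel-2 and its voice ID are hypothetical):
auto engine = Engine::make<TtsEngine>("config/vsdk.json"); // do this exactly once
auto channel1 = engine->channel("channel-1", "enu,evan,embedded-pro");
// auto channel2 = engine->channel("channel-2", someOtherVoiceId); // hypothetical second channel
// Calling Engine::make<TtsEngine>(...) again would violate the single-instance rule above.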
2. Listing voices
// C++17 or higher (note: fmt::join may require including <fmt/ranges.h>)
for (auto const & [channel, voices] : engine->availableVoices())
fmt::print("Available voices for '{}': ['{}']\n", channel, fmt::join(voices, "'; '"));
// C++11 or higher
for (auto const & it : engine->availableVoices())
fmt::print("Available voices for '{}': ['{}']\n", it.first, fmt::join(it.second, "'; '"));
3. Set voice
channel->setCurrentVoice("enu,evan,embedded-pro");
You can also activate a voice right away:
auto channel = engine->channel(channelName, "enu,evan,embedded-pro");
4. Start/Stop Pipeline
pipeline.start();
pipeline.stop();
pipeline.run();
.start() runs the pipeline in a new thread.
.run() runs the pipeline and waits until it has finished (blocking).
.stop() terminates the pipeline execution.
Once a pipeline has been stopped, you can restart it at any time by simply calling .start() again.
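For instance:
pipeline.stop();  // terminate the current run
pipeline.start(); // ...and later restart the very same pipeline
channel->synthesizeFromText("Back again!");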
To stop playing:
pipeline.stop();
player->stop(); // 'player' is the PaPlayer from the pipeline example
Before calling .synthesizeFromText(), you need to start the pipeline:
pipeline.start();
channel->synthesizeFromText("Hello world!");
You can pass SSML the same way:
std::string const ssml = "<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"fr-FR\">Bonjour Vivoka</speak>";
channel->synthesizeFromText(ssml);
To pause/resume TTS:
player->pause();
player->resume();
If you call channel->synthesizeFromText() more than once, then the last call will override all the previous ones.
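If several phrases must be spoken in order, trigger each synthesis only after the previous playback has finished, for example from the player's finished callback. A sketch reusing the pipeline example above (the phrase list is illustrative):
std::vector<std::string> const phrases = { "First sentence.", "Second sentence." }; // illustrative
std::size_t next = 0;

player->setFinishedCallback([&] {
    if (++next < phrases.size())
        channel->synthesizeFromText(phrases[next]); // only start the next phrase now
    else
        EventLoop::instance().shutdown();
});

pipeline.start();
channel->synthesizeFromText(phrases[next]); // kick off the first phrase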
5. Additional channel methods
channel->sampleRate();   // sample rate of the synthesized audio, in Hz
channel->channelCount(); // number of audio channels (1 = mono, 2 = stereo)
Blocking Speech Synthesis
This method uses only the channel, without involving a Pipeline. It blocks execution until the synthesis is fully completed, then returns a buffer with audio. You can then choose to either save the buffer or play it directly.
#include <vsdk/tts/Utils.hpp>

Vsdk::Audio::Buffer buffer = Vsdk::Tts::synthesizeFromText(channel, "Hello");

auto const ssml = R"(<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
Here is an <say-as interpret-as="characters">SSML</say-as> sample.
</speak>)";
buffer = Vsdk::Tts::synthesizeFromText(channel, ssml);

buffer = Vsdk::Tts::synthesizeFromFile(channel, "path/to/file.txt");

buffer.saveToFile("output_audio.pcm");
Audio::Buffer is not a pointer type; avoid unnecessary copying and prefer move operations.
The synthesis result is a buffer containing raw audio data in 16-bit signed Little-Endian PCM format.
Playing Audio
You can play audio from a file using the following command:
aplay -f S16_LE -r 22050 -c 1 output_audio.pcm
On Windows, you can use Audacity to import the raw PCM data, then play or convert it as needed.
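If you prefer a file that standard players can open directly, you can also wrap the raw PCM in a minimal WAV header yourself. A self-contained sketch for the 16-bit signed mono little-endian format described above (it assumes a little-endian host, which matches the data layout anyway):
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Wraps raw 16-bit signed mono little-endian PCM in a minimal 44-byte WAV header.
void writeWav(std::string const & path, std::vector<char> const & pcm, std::uint32_t sampleRate)
{
    std::ofstream out(path, std::ios::binary);
    auto const u32 = [&] (std::uint32_t v) { out.write(reinterpret_cast<char const *>(&v), 4); };
    auto const u16 = [&] (std::uint16_t v) { out.write(reinterpret_cast<char const *>(&v), 2); };

    out.write("RIFF", 4);
    u32(36 + static_cast<std::uint32_t>(pcm.size())); // file size minus the first 8 bytes
    out.write("WAVEfmt ", 8);
    u32(16);             // fmt chunk size
    u16(1);              // audio format: PCM
    u16(1);              // channels: mono
    u32(sampleRate);     // e.g. 22050 for csdk
    u32(sampleRate * 2); // byte rate = sampleRate * channels * bytesPerSample
    u16(2);              // block align = channels * bytesPerSample
    u16(16);             // bits per sample
    out.write("data", 4);
    u32(static_cast<std::uint32_t>(pcm.size()));
    out.write(pcm.data(), static_cast<std::streamsize>(pcm.size()));
}
Calling, for example, writeWav("output_audio.wav", pcmBytes, 22050) on the bytes read back from the file saved earlier (pcmBytes is a hypothetical variable holding them) produces a file any standard audio player can open.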
Alternatively, you can use PaStandalonePlayer (the cross-platform player from vsdk-audio-portaudio) and play the buffer directly.
#include <vsdk/audio/PaStandalonePlayer.hpp>
Vsdk::Audio::PaStandalonePlayer player;
player.play(buffer);
Unlike PaPlayer, which is used for streaming (as shown in the Pipeline example), PaStandalonePlayer is not designed for streaming. Instead, it is intended for playing complete audio buffers.
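The two approaches combine naturally: synthesize a complete buffer with the blocking call, then hand it to the standalone player (a sketch using only calls shown above):
auto const buffer = Vsdk::Tts::synthesizeFromText(channel, "Hello from the standalone player!");

Vsdk::Audio::PaStandalonePlayer standalonePlayer;
standalonePlayer.play(buffer); // plays the complete buffer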
TextMarker
We also support progress tracking, including word-level markers, which can be used to synchronize text display or trigger actions as speech is spoken.
The following code demonstrates how to integrate word-level markers into the pipeline for synchronized text playback.
#include "TextMarkerHandler.hpp"
using Seconds = Vsdk::Audio::Consumer::PaPlayer::Seconds;
void onWordReached(Vsdk::Tts::Events::WordMarker const & marker)
{
fmt::print("On word reached: {}\n", marker.word);
}
void onChannelEvent(Vsdk::Tts::Channel::Event const & e)
{
if (e.code == Vsdk::Tts::Channel::EventCode::WordMarkerStart)
{
Vsdk::Tts::Events::WordMarker marker = nlohmann::json::parse(e.message);
textMarker.addMarker(std::move(marker));
}
}
...
channel->subscribe(&onTtsEvent);
TextMarkerHandler textMarker;
textMarker.setAudioFormat(channel->sampleRate(), channel->channelCount());
textMarker.setReachedMarkerCallback(&onWordReached);
auto player = std::make_shared<Vsdk::Audio::Consumer::PaPlayer>();
player->setProgressCallback([] (Seconds b, Seconds e) { textMarker.onPlayerProgress(b, e); });
player->setFinishedCallback([&] {
textMarker.reset();
});
...