Voice Synthesis - C++

VDK features two Voice Synthesis libraries: vsdk-csdk and vsdk-baratinoo.

Configuring the engine

Voice synthesis engines must be configured before the program starts.

An empty channel list or an empty voice list will trigger an error!

Use the VDK Studio to generate the configuration and the data directory. After creating a custom project with the channels and the voices of your choice, just export it to your project’s location.

Voice ID format

Each engine has its own voice format, described in the following table:

Engine	Format	Example
vsdk-csdk	`<language>,<name>,<quality>`	`enu,evan,embedded-pro`
vsdk-baratinoo	`<name>`	`Arnaud_neutre`
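
As an illustration of the vsdk-csdk format, here is a minimal sketch that splits a voice ID into its three fields. The `CsdkVoiceId` struct and the `parseCsdkVoiceId` function are hypothetical helpers written for this example; they are not part of VSDK.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper type for this example, not a VSDK class.
struct CsdkVoiceId
{
    std::string language;
    std::string name;
    std::string quality;
};

// Splits "<language>,<name>,<quality>" into its three fields.
// Returns false unless the ID has exactly three non-empty fields.
inline bool parseCsdkVoiceId(std::string const & id, CsdkVoiceId & out)
{
    std::vector<std::string> fields;
    std::size_t begin = 0;
    while (true)
    {
        std::size_t const end = id.find(',', begin);
        if (end == std::string::npos)
        {
            fields.push_back(id.substr(begin)); // last field
            break;
        }
        fields.push_back(id.substr(begin, end - begin));
        begin = end + 1;
    }

    if (fields.size() != 3)
        return false;
    for (auto const & field : fields)
        if (field.empty())
            return false;

    out = CsdkVoiceId{fields[0], fields[1], fields[2]};
    return true;
}
```

Calling it with `enu,evan,embedded-pro` fills `language`, `name` and `quality`, while a vsdk-baratinoo ID such as `Arnaud_neutre` is rejected because it has a single field.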

Starting the engine

vsdk-baratinoo

Source code

CPP

#include <vsdk/tts/baratinoo.hpp>

using TtsEngine = Vsdk::Tts::Baratinoo::Engine;
auto engine     = Vsdk::Tts::Engine::make<TtsEngine>("config/vsdk.json");

vsdk-csdk

Source code

CPP

#include <vsdk/tts/csdk.hpp>

using TtsEngine = Vsdk::Tts::Csdk::Engine;
auto engine     = Vsdk::Tts::Engine::make<TtsEngine>("config/vsdk.json");

Listing the configured channels and voices

CPP

// C++17 or higher
for (auto const & [channel, voices] : engine->availableVoices())
    fmt::print("Available voices for '{}': ['{}']\n", channel, fmt::join(voices, "'; '"));

// C++11 or higher
for (auto const & it : engine->availableVoices())
    fmt::print("Available voices for '{}': ['{}']\n", it.first, fmt::join(it.second, "'; '"));

Getting a channel

CPP

auto channelFr = engine->channel("MyChannel_fr");
channelFr->setCurrentVoice("Arnaud_neutre"); // mandatory before any synthesis can take place

You can also activate a voice right away:

CPP

auto channelEn = engine->makeChannel("MyChannel_en", "laura");

The engine instance must not be destroyed while at least one channel instance is still alive. Destruction order is important!
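
One simple way to get a safe destruction order is to rely on C++ scoping rules: local objects are destroyed in the reverse order of their declaration. The sketch below shows this with stand-in types (`FakeEngine` and `FakeChannel` are made up for the example and are not the VSDK classes).

```cpp
#include <memory>
#include <string>
#include <vector>

// Stand-in types for illustration only; they are NOT the VSDK classes.
struct FakeEngine
{
    explicit FakeEngine(std::vector<std::string> & log) : log(log) {}
    ~FakeEngine() { log.push_back("engine destroyed"); }
    std::vector<std::string> & log;
};

struct FakeChannel
{
    explicit FakeChannel(FakeEngine & engine) : engine(engine) {}
    // The channel still uses its engine while being destroyed.
    ~FakeChannel() { engine.log.push_back("channel destroyed"); }
    FakeEngine & engine;
};

// Returns the order in which the two objects were destroyed.
inline std::vector<std::string> destructionOrder()
{
    std::vector<std::string> log;
    {
        auto engine  = std::make_shared<FakeEngine>(log);      // declared first, destroyed last
        auto channel = std::make_shared<FakeChannel>(*engine); // declared last, destroyed first
    }
    return log;
}
```

Declaring the engine before its channels in the same scope (or as an earlier class member) is therefore enough for the channels to be released first.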

Blocking Speech Synthesis

The following method will block until the synthesis is fully finished, then return a buffer you can play right away.

Synthesizing raw text:

CPP

Vsdk::Audio::Buffer const buffer = Vsdk::Tts::synthesizeFromText(channel, "Hello");

Audio::Buffer is NOT a pointer type! Avoid copying it around; prefer move operations.
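
The sketch below shows what moving instead of copying looks like in practice. `FakeBuffer` is a stand-in that simply owns a vector of samples, and `Clip` is a hypothetical class that stores a buffer; it only assumes that the real buffer type owns its audio data and is movable. Note that a buffer declared `const` cannot be moved from.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Stand-in for an audio buffer that owns its samples; NOT the VSDK type.
struct FakeBuffer
{
    std::vector<std::int16_t> samples;
};

// Hypothetical class that takes ownership of a buffer.
struct Clip
{
    // Take the buffer by value and move it into the member:
    // callers passing an rvalue pay for no sample copy at all.
    explicit Clip(FakeBuffer buffer) : audio(std::move(buffer)) {}

    FakeBuffer audio;
};
```

With this shape, `Clip clip(std::move(buffer));` hands the sample storage over to `clip`, whereas `Clip clip(buffer);` would duplicate every sample.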

Synthesizing SSML text:

CPP

auto const ssml = R"(<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  Here is an <say-as interpret-as="characters">SSML</say-as> sample.
</speak>)";
auto const buffer = Vsdk::Tts::synthesizeFromText(channel, ssml);

Synthesizing from a file:

CPP

auto const buffer = Vsdk::Tts::synthesizeFromFile(channel, "path/to/file.txt");

The synthesis result is a buffer that contains raw audio data in 16-bit signed little-endian PCM format. The channel count and sample rate can be queried using Channel::channelCount() and Channel::sampleRate().

SDK	Sample Rate (Hz)
vsdk-csdk	22050
vsdk-baratinoo	24000
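
Knowing the format is enough to convert a byte count into a duration. The following sketch does the arithmetic for 16-bit PCM; `pcmDurationSeconds` is a hypothetical helper, and the sample rate and channel count are expected to come from the channel queries mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Size of one 16-bit PCM sample, in bytes.
constexpr std::size_t bytesPerSample = sizeof(std::int16_t);

// Duration, in seconds, of `byteCount` bytes of 16-bit PCM audio.
// Hypothetical helper, not part of VSDK.
inline double pcmDurationSeconds(std::size_t byteCount, int sampleRate, int channelCount)
{
    assert(sampleRate > 0 && channelCount > 0);
    // A frame holds one sample per channel.
    std::size_t const frameSize = bytesPerSample * static_cast<std::size_t>(channelCount);
    std::size_t const frames    = byteCount / frameSize;
    return static_cast<double>(frames) / sampleRate;
}
```

For instance, 48000 bytes of mono audio at 24000 Hz are 24000 frames, i.e. one second of speech.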

Playing the result

VSDK provides a cross-platform player in the vsdk-audio-portaudio package.

Playing the result is very easy:

CPP

#include <vsdk/audio/PaStandalonePlayer.hpp>

auto buffer = Vsdk::Tts::synthesizeFromText(channel, "Text to synthesize");
...
Vsdk::Audio::PaStandalonePlayer player;
player.play(buffer);
...

Storing the result on disk

CPP

buffer.saveToFile("path/to/file.pcm");

Only the PCM (raw) format is available, which means the file has no audio header of any sort. You can play it by supplying the right parameters, e.g. aplay -f S16_LE -r <sample rate> -c 1 file.pcm, or add a WAV header with ffmpeg -f s16le -ar <sample rate> -ac 1 -i file.pcm file.wav, where <sample rate> is the sample rate of your engine.

On Windows, you can use Audacity to import the raw data, then play or convert it.
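
If you would rather add the WAV header from your own code, the canonical header for PCM data is only 44 bytes long. The sketch below builds it for 16-bit PCM; `makeWavHeader` and its helpers are hypothetical functions written for this example, not part of VSDK.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Appends the low `byteCount` bytes of `value` in little-endian order.
inline void appendLe(std::vector<std::uint8_t> & out, std::uint32_t value, int byteCount)
{
    for (int i = 0; i < byteCount; ++i)
        out.push_back(static_cast<std::uint8_t>((value >> (8 * i)) & 0xFF));
}

// Appends a four-character chunk tag such as "RIFF".
inline void appendTag(std::vector<std::uint8_t> & out, char const (&tag)[5])
{
    out.insert(out.end(), tag, tag + 4);
}

// Builds the 44-byte RIFF/WAVE header for `dataSize` bytes of 16-bit PCM.
inline std::vector<std::uint8_t> makeWavHeader(std::uint32_t dataSize,
                                               std::uint32_t sampleRate,
                                               std::uint16_t channelCount)
{
    assert(sampleRate > 0 && channelCount > 0);
    std::uint32_t const bitsPerSample = 16;
    std::uint32_t const blockAlign    = channelCount * bitsPerSample / 8; // bytes per frame
    std::uint32_t const byteRate      = sampleRate * blockAlign;          // bytes per second

    std::vector<std::uint8_t> header;
    header.reserve(44);
    appendTag(header, "RIFF");
    appendLe(header, 36 + dataSize, 4); // size of everything after this field
    appendTag(header, "WAVE");
    appendTag(header, "fmt ");
    appendLe(header, 16, 4);            // size of the fmt chunk for PCM
    appendLe(header, 1, 2);             // audio format: 1 = PCM
    appendLe(header, channelCount, 2);
    appendLe(header, sampleRate, 4);
    appendLe(header, byteRate, 4);
    appendLe(header, blockAlign, 2);
    appendLe(header, bitsPerSample, 2);
    appendTag(header, "data");
    appendLe(header, dataSize, 4);      // size of the raw PCM payload
    return header;
}
```

Writing these 44 bytes to a file, followed by the unchanged contents of the .pcm file, should give a file that regular audio players can open.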

Streaming Speech Synthesis

Streaming the synthesis lets you receive audio chunks regularly instead of waiting for the whole generation process to finish. This is done using the pipeline system, which lets you choose between synchronous and asynchronous mode (pipeline.run() vs. pipeline.start()).

Examples

Start an asynchronous synthesis that plays the result on the default output device right away:

CPP

#include <vsdk/audio/consumer/PaPlayer.hpp>

Vsdk::Audio::Pipeline pipeline;
pipeline.setProducer(channel);
pipeline.pushBackConsumer<Vsdk::Audio::Consumer::PaPlayer>();
pipeline.start(); // Starts the channel for future synthesis requests

// Starts the actual synthesis
channel->synthesizeFromText("..."); 
// Since we are using the asynchronous mode, we will reach here without blocking!
// Make sure to do something or wait here, otherwise the pipeline will go out of scope and stop.

Start a synchronous synthesis from a text file and store the resulting audio data in a buffer:

CPP

#include <vsdk/audio/BufferModule.hpp>

Vsdk::Audio::Pipeline pipeline;
pipeline.setProducer(channel);
auto bufferModule = *pipeline.pushBackConsumer<Vsdk::Audio::BufferModule>();
// Channel not started yet, waiting for start() or run() to truly do the job
channel->synthesizeFromFile("...");
pipeline.run(); // Block this thread until task is done
...
auto const result = std::move(bufferModule->buffer()); // Get available result data

Most of the functions used here return a value that indicates whether the operation succeeded. Pipeline never throws, so if one of its operations fails, you can get the error string with the lastError() method.

Marker events and runtime errors or warnings

Whether the synthesis is synchronous or asynchronous, you can subscribe to a channel to receive the events, errors and warnings that occur during synthesis:

CPP

channel->subscribe([&] (Vsdk::Tts::Channel::Event const & e)
{
    if (e.type == Vsdk::Tts::Channel::EventCode::WordMarkerStart)
    {
        Vsdk::Tts::Events::WordMarker const marker = nlohmann::json::parse(e.message);
        fmt::print("[{}] Current word being played: '{}'\n", channel->name(), marker.word);
    }
    else if (e.type == Vsdk::Tts::Channel::EventCode::ProcessFinished)
    {
        // Synthesis finished, can now start a new one or signal another process
    }
    else
    {
        auto const msg = e.message.empty() ? "" : ": " + e.message;
        fmt::print("[{}] {}{}\n", channel->name(), e.codeString, msg);
    }
});