Voice Synthesis - C++
VDK features three Voice Synthesis libraries: vsdk-csdk, vsdk-baratinoo and vsdk-vtapi.
Configuring the engine
Voice synthesis engines must be configured before the program starts.
An empty channel list or an empty voice list will trigger an error!
Use VDK Studio to generate the configuration and the data directory. After creating a custom project with the channels and voices of your choice, export it to your project's location.
Voice ID format
Each engine has its own voice format, described in the following table:
Engine | Format | Example |
---|---|---|
vsdk-csdk | | |
vsdk-vtapi | | |
vsdk-baratinoo | | |
Starting the engine
Listing the configured channels and voices
// C++17 or higher
for (auto const & [channel, voices] : engine->availableVoices())
    fmt::print("Available voices for '{}': ['{}']\n", channel, fmt::join(voices, "'; '"));
// C++11 or higher
for (auto const & it : engine->availableVoices())
    fmt::print("Available voices for '{}': ['{}']\n", it.first, fmt::join(it.second, "'; '"));
Getting a channel
auto channelFr = engine->channel("MyChannel_fr");
channelFr->setCurrentVoice("Arnaud_neutre"); // mandatory before any synthesis can take place
You can also activate a voice right away:
auto channelEn = engine->makeChannel("MyChannel_en", "laura");
The engine instance can't die while at least one channel instance is alive. Destruction order is important!
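One simple way to get a safe destruction order is to rely on the C++ rule that local objects are destroyed in reverse order of declaration: declare the engine first and the channels afterwards. The sketch below only demonstrates that language rule. `Tracked` is a hypothetical stand-in that logs its own destruction; it is not a VSDK type.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in, NOT a VSDK type: it only records when it is destroyed.
struct Tracked
{
    std::string name;
    std::vector<std::string> & log;
    ~Tracked() { log.push_back(name); }
};

// Returns the names of the two objects in the order they were destroyed.
inline std::vector<std::string> destructionOrder()
{
    std::vector<std::string> log;
    {
        Tracked engine {"engine", log};   // declared first, destroyed last
        Tracked channel{"channel", log};  // declared last, destroyed first
    }                                     // both objects are destroyed here
    return log;
}
```

Calling `destructionOrder()` yields "channel" followed by "engine". The same reasoning applies to class members, which are destroyed in reverse order of their declaration.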
Blocking Speech Synthesis
The following method will block until the synthesis is fully finished, then return a buffer you can play right away.
Synthesizing raw text:
Vsdk::Audio::Buffer const buffer = Vsdk::Tts::synthesizeFromText(channel, "Hello");
Audio::Buffer is NOT a pointer type! Avoid copying it around, prefer move operations.
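To illustrate the advice about moving, here is a minimal sketch that uses a `std::vector` as a hypothetical stand-in for a value-type audio buffer. `PcmSamples` and `Recording` are illustration names, not VSDK types, and the sketch assumes nothing about how Audio::Buffer is implemented.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a value-type audio buffer (NOT Vsdk::Audio::Buffer).
using PcmSamples = std::vector<std::int16_t>;

// Takes ownership of the samples; callers hand them over with std::move.
class Recording
{
public:
    explicit Recording(PcmSamples samples) : samples_(std::move(samples)) {}
    PcmSamples const & samples() const { return samples_; }

private:
    PcmSamples samples_;
};

// Returns true when storing the samples reused the original allocation
// instead of duplicating the audio data.
inline bool storedWithoutCopy()
{
    PcmSamples samples(22050, 0);                        // one second of mono silence at 22050 Hz
    std::int16_t const * const before = samples.data();  // address of the audio data
    Recording const recording(std::move(samples));       // moved twice, never copied
    return recording.samples().data() == before;
}
```

Passing `samples` without `std::move` would compile as well, but it would duplicate all the audio data first.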
Synthesizing SSML text:
auto const ssml = R"(<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
Here is an <say-as interpret-as="characters">SSML</say-as> sample.
</speak>)";
auto const buffer = Vsdk::Tts::synthesizeFromText(channel, ssml);
Synthesizing from a file:
auto const buffer = Vsdk::Tts::synthesizeFromFile(channel, "path/to/file.txt");
The synthesis result is a buffer that contains raw audio data in 16-bit signed little-endian PCM format. Channel count and sample rate can be queried using Channel::channelCount() and Channel::sampleRate().
SDK | Sample Rate (Hz) |
---|---|
vsdk-csdk | 22050 |
vsdk-baratinoo | 24000 |
vsdk-vtapi | 22050 |
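Because the audio is plain 16-bit PCM, the duration of a result can be derived from its size in bytes, the sample rate and the channel count. The helper below is a sketch of that arithmetic. `pcmDurationMs` is a hypothetical name, and how to obtain the byte count from a buffer is left out because it depends on the Audio::Buffer API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Duration, in milliseconds, of raw 16-bit PCM audio data.
// byteCount:    size of the raw audio data in bytes
// sampleRate:   samples per second and per channel, e.g. Channel::sampleRate()
// channelCount: number of interleaved channels, e.g. Channel::channelCount()
inline std::uint64_t pcmDurationMs(std::size_t byteCount, int sampleRate, int channelCount)
{
    assert(sampleRate > 0 && channelCount > 0);
    constexpr std::size_t bytesPerSample = 2; // 16-bit signed samples
    std::size_t const frameCount = byteCount / (bytesPerSample * static_cast<std::size_t>(channelCount));
    return static_cast<std::uint64_t>(frameCount) * 1000u / static_cast<std::uint64_t>(sampleRate);
}
```

For instance, 44100 bytes of mono audio at 22050 Hz hold 22050 samples, so `pcmDurationMs(44100, 22050, 1)` returns 1000.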
Playing the result
VSDK provides a cross-platform player in the vsdk-audio-portaudio package.
Playing the result is very easy:
#include <vsdk/audio/PaStandalonePlayer.hpp>
auto buffer = Vsdk::Tts::synthesizeFromText(channel, "Text to synthesize");
...
Vsdk::Audio::PaStandalonePlayer player;
player.play(buffer);
...
Storing the result on disk
buffer.saveToFile("path/to/file.pcm");
Only the raw PCM format is available, which means the file has no audio header of any sort. You can play it by supplying the right parameters, e.g. aplay -f S16_LE -r <sample_rate> -c 1 file.pcm, or add a WAV header: ffmpeg -f s16le -ar <sample_rate> -ac 1 -i file.pcm file.wav. Here <sample_rate> stands for the sample rate of the engine that produced the buffer.
On Windows, you can use Audacity to import the raw data, then play or convert it.
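If you would rather produce a WAV file from your own code, the header can be built by hand: a canonical PCM WAV file is a 44-byte RIFF header followed by the raw samples. The following is a minimal sketch of that header for 16-bit signed PCM. `WavHeader`, `putLittleEndian` and `makeWavHeader` are hypothetical helper names and are not part of VSDK.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using WavHeader = std::array<std::uint8_t, 44>;

// Stores the `size` lowest bytes of `value` in little-endian order at `offset`.
inline void putLittleEndian(WavHeader & h, std::size_t offset, std::uint32_t value, std::size_t size)
{
    for (std::size_t i = 0; i < size; ++i)
        h[offset + i] = static_cast<std::uint8_t>((value >> (8 * i)) & 0xFFu);
}

// Builds a canonical 44-byte RIFF/WAVE header for 16-bit signed PCM data.
// dataSize is the size of the raw PCM data in bytes.
inline WavHeader makeWavHeader(std::uint32_t dataSize, std::uint32_t sampleRate, std::uint16_t channelCount)
{
    assert(sampleRate > 0 && channelCount > 0);
    std::uint32_t const bitsPerSample = 16;
    std::uint32_t const blockAlign = channelCount * bitsPerSample / 8; // bytes per frame
    std::uint32_t const byteRate = sampleRate * blockAlign;            // bytes per second

    WavHeader h{};
    std::memcpy(h.data() + 0, "RIFF", 4);
    putLittleEndian(h, 4, 36 + dataSize, 4);  // RIFF chunk size: file size minus 8
    std::memcpy(h.data() + 8, "WAVE", 4);
    std::memcpy(h.data() + 12, "fmt ", 4);
    putLittleEndian(h, 16, 16, 4);            // size of the fmt chunk for PCM
    putLittleEndian(h, 20, 1, 2);             // audio format: 1 = PCM
    putLittleEndian(h, 22, channelCount, 2);
    putLittleEndian(h, 24, sampleRate, 4);
    putLittleEndian(h, 28, byteRate, 4);
    putLittleEndian(h, 32, blockAlign, 2);
    putLittleEndian(h, 34, bitsPerSample, 2);
    std::memcpy(h.data() + 36, "data", 4);
    putLittleEndian(h, 40, dataSize, 4);      // size of the PCM data that follows
    return h;
}
```

Writing the returned header to a file, followed by the unchanged contents of the .pcm file, gives a file that standard audio players can open.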
Streaming Speech Synthesis
Streaming the synthesis lets you receive audio chunks regularly instead of waiting for the whole generation process to finish. This is done using the pipeline system, which lets you choose between synchronous and asynchronous mode (pipeline.run() vs. pipeline.start()).
Examples
Starts an asynchronous synthesis that plays the result on the default output device right away:
#include <vsdk/audio/consumer/PaPlayer.hpp>
Vsdk::Audio::Pipeline pipeline;
pipeline.setProducer(channel);
pipeline.pushBackConsumer<Vsdk::Audio::Consumer::PaPlayer>();
pipeline.start(); // Starts the channel for future synthesis requests
// Starts the actual synthesis
channel->synthesizeFromText("...");
// Since we are using the asynchronous mode, we will reach here without blocking!
// Make sure to do something or wait, otherwise the pipeline will go out of scope and stop.
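One possible way to wait in asynchronous mode is to block on a `std::future` that a completion callback fulfils. The sketch below shows that pattern with standard C++ only. `fakeAsyncSynthesis` is a hypothetical stand-in that simulates a synthesis finishing on another thread; it is not a VSDK call.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-in for an asynchronous synthesis (NOT a VSDK call):
// it invokes `onFinished` from another thread after a short delay.
template <typename Callback>
std::thread fakeAsyncSynthesis(Callback onFinished)
{
    return std::thread([onFinished] {
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
        onFinished();
    });
}

// Blocks the calling thread until the completion callback has fired or the
// timeout expires. Returns true when the notification arrived in time.
inline bool waitForSynthesis(std::chrono::milliseconds timeout)
{
    std::promise<void> finished;
    std::future<void> done = finished.get_future();

    std::thread worker = fakeAsyncSynthesis([&finished] { finished.set_value(); });

    bool const inTime = done.wait_for(timeout) == std::future_status::ready;
    worker.join(); // the promise must outlive the thread that fulfils it
    return inTime;
}
```

With a real channel, a natural place to fulfil the promise would be the branch of an event callback that handles the end of the synthesis. Note that a `std::promise` can only be fulfilled once, so each synthesis request needs a fresh one.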
Starts a synchronous synthesis from a text file whose audio data gets stored in a buffer:
#include <vsdk/audio/BufferModule.hpp>
Vsdk::Audio::Pipeline pipeline;
pipeline.setProducer(channel);
auto bufferModule = *pipeline.pushBackConsumer<Vsdk::Audio::BufferModule>();
// Channel not started yet, waiting for start() or run() to truly do the job
channel->synthesizeFromFile("...");
pipeline.run(); // Block this thread until task is done
...
auto const result = std::move(bufferModule->buffer()); // Get available result data
Most of the functions used here have return values that indicate whether everything worked. Pipeline never throws, so if one of its operations fails, you can get the error string with the lastError() method.
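The resulting calling pattern is to check each return value and query the error string on failure. The sketch below shows this with `FakePipeline`, a hypothetical stand-in class: the `bool` return type of `run()` and the failure condition are assumptions made for the example and may differ from the actual Pipeline API.

```cpp
#include <string>

// Hypothetical stand-in, NOT Vsdk::Audio::Pipeline: the bool return type and
// the failure condition are assumptions made for this sketch.
class FakePipeline
{
public:
    void setHasProducer(bool value) { hasProducer_ = value; }

    // Reports failure through the return value instead of throwing,
    // and records the reason for lastError().
    bool run()
    {
        if (!hasProducer_)
        {
            lastError_ = "no producer set";
            return false;
        }
        lastError_.clear();
        return true;
    }

    std::string const & lastError() const { return lastError_; }

private:
    bool hasProducer_ = false;
    std::string lastError_;
};

// Runs the pipeline; returns an empty string on success, else the error text.
inline std::string runAndReport(FakePipeline & pipeline)
{
    return pipeline.run() ? std::string{} : "Pipeline error: " + pipeline.lastError();
}
```

Checking the result right after each call keeps the error string meaningful, since a later operation may overwrite it.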
Marker events and runtime errors or warnings
Be it synchronous or asynchronous, you can subscribe to a channel to get events and/or errors & warnings happening during synthesis:
channel->subscribe([&] (Vsdk::Tts::Channel::Event const & e)
{
    if (e.type == Vsdk::Tts::Channel::EventCode::WordMarkerStart)
    {
        // The event message is a JSON document describing the marker
        Vsdk::Tts::Events::WordMarker const marker = nlohmann::json::parse(e.message);
        fmt::print("[{}] Current word being played: '{}'\n", channel->name(), marker.word);
    }
    else if (e.type == Vsdk::Tts::Channel::EventCode::ProcessFinished)
    {
        // Synthesis finished, can now start a new one or signal another process
    }
    else
    {
        // Any other event, error or warning: print its code and optional message
        auto const msg = e.message.empty() ? "" : ": " + e.message;
        fmt::print("[{}] {}{}\n", channel->name(), e.codeString, msg);
    }
});