Speech Recognition - C++
Introduction
Recognizers & Models
Speech recognition relies on two main components: Recognizers and Models. They work together to turn audio into meaningful results.
Recognizers: The Acoustic Part
Recognizers handle incoming audio and use voice activity detection (VAD) to detect when someone is speaking or silent.
A recognizer can run without a model—VAD will still detect speech—but no transcription or interpretation will occur. Recognizers are configured for specific languages and accents (e.g., eng-US, fra-FR) and must be paired with models in the same language.
Models: The Language Part
Models define what the recognizer can understand—words, phrases, grammar structure, and even phoneme rules. They act as the language engine behind the acoustic processing.
There are three types of models:
Type | Description |
Static models | Predefined vocabularies and grammars. They define a fixed set of valid phrases, their structure, and optionally custom phonemes. |
Dynamic models | A special type of static model that includes slots—placeholders you can fill with vocabulary at runtime. The base model is compiled in the cloud, then updated on the device when slots are filled, requiring an additional on-device compilation. |
Free-speech models | Large-vocabulary models designed for open-ended input (e.g., dictation). Unlike static or dynamic models, they are not limited to a defined set of phrases. They are used the same way as static models.
In short, recognizers listen, and models interpret. Both are essential for effective speech recognition.
Why Recognizers and Models Are Separated
Separating Recognizers and Models gives you more flexibility. For example, you can set or switch a model after speech has started—thanks to internal audio buffering—so the model still applies to earlier audio.
It also lets you reuse the same recognizer with different models, or prepare models in the background without interrupting audio input.
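For illustration, here is a minimal sketch of reusing one recognizer with two different models. The model names "commands" and "dictation" are hypothetical; the setModel call is the same one used later in this guide.
// Hypothetical model names; switch models without restarting the audio pipeline.
recognizer->setModel("commands");   // start with a small command grammar
// ... later, for example after a command result has been handled ...
recognizer->setModel("dictation");  // the same recognizer now uses another model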
Audio Format
The audio data is a 16-bit signed PCM buffer in Little-Endian format.
It is always mono (1 channel), and the sample rate is 16 kHz.
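As a quick reference, the sketch below shows the buffer arithmetic implied by this format. The constant names are illustrative, not part of the SDK.
#include <cstddef>
#include <cstdint>

// 16 kHz, mono, 16-bit signed little-endian PCM.
constexpr int         sampleRateHz   = 16000;
constexpr int         channelCount   = 1;
constexpr std::size_t bytesPerSample = sizeof(std::int16_t); // 2 bytes

// A 100 ms chunk therefore holds 1600 samples, i.e. 3200 bytes.
constexpr std::size_t samplesPer100Ms = sampleRateHz / 10;
constexpr std::size_t bytesPer100Ms   = samplesPer100Ms * channelCount * bytesPerSample;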
Getting Started
Before you begin, make sure you’ve completed all the necessary preparation steps.
There are two ways to prepare your project for Voice Recognition:
Using sample code
Starting from scratch
From Sample Code
To download the sample code, you'll need Conan. All the necessary steps are outlined in the general Getting Started guide.
📦 simple-application
📦 chained-grammars
📦 dynamic-grammar
Quick example with simple-application:
conan search -r vivoka-customer simple-application # To get the latest version.
conan inspect -r vivoka-customer -a options simple-application/<version>@vivoka/customer
conan install -if simple-application simple-application/<version>@vivoka/customer -o asr_engine=csdk-asr -o tts_engine=csdk-tts
Open project.vdk in VDK-Studio
Export the assets from VDK-Studio into the same directory (don’t forget to configure a voice for the channel and to compile the models)
conan install . -if build
conan build . -if build
./build/Release/simple-application
From Scratch
Before proceeding, make sure you’ve completed the following steps:
1. Prepare your VDK Studio project
Create a new project in VDK Studio
Add the Voice Recognition technology
Add a model and a recognizer
Export the project to generate the required assets and configuration
To add a recognizer:
Click the settings icon next to Voice Recognition in the left panel.
Click “Add new recognizer”.
Enter a name for your recognizer.
Select the language.
Remember the name, you will need it later!
If you don’t provide a name, a default recognizer is created and named rec_ + language. For example, rec_eng-US.
2. Set up your project
Install the necessary libraries
vsdk-csdk-asr/x.x.x@vivoka/customer
We also provide utility libraries for capturing microphone audio and building an event loop.
vsdk-audio-portaudio/x.x.x@vivoka/customer
vsdk-samples-utils/x.x.x@vivoka/customer
These steps are explained in more detail in the Getting Started guide.
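For example, a minimal conanfile.txt could look like the following sketch. The generator is an assumption; use whatever your build setup from the Getting Started guide prescribes, and replace x.x.x with real versions.
[requires]
vsdk-csdk-asr/x.x.x@vivoka/customer
vsdk-audio-portaudio/x.x.x@vivoka/customer
vsdk-samples-utils/x.x.x@vivoka/customer

[generators]
cmake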
3. Write the main body
This is a base example of a Speech Recognition pipeline. We need:
An event loop
A pipeline set up with a recognizer
Defined callbacks
You cannot create two instances of the same engine.
#include <vsdk/audio/Pipeline.hpp>
#include <vsdk/audio/producers/PaMicrophone.hpp>
#include <vsdk/asr/csdk.hpp>
#include <vsdk/utils/Misc.hpp> // formatTimeMarker
#include <vsdk/utils/PortAudio.hpp>
#include <vsdk/utils/samples/EventLoop.hpp>
#include <fmt/core.h>
#include <csignal>
#include <iostream>
#include <memory>
using Vsdk::Asr::Recognizer;
using Vsdk::Audio::Pipeline;
using Vsdk::Audio::Producer::PaMicrophone;
using Vsdk::Utils::Samples::EventLoop;
namespace
{
Vsdk::Asr::RecognizerPtr recognizer;
const std::string modelName = "model-1"; // Replace as needed
// Callbacks (we'll describe them later)
void onAsrEvent(Recognizer::Event const & event);
void onAsrError(Recognizer::Error const & error);
void onAsrResult(Recognizer::Result const & result);
// Re-applies the model after a result (the call must be queued on the event loop)
void installModel(std::string const & model, Vsdk::Asr::RecognizerPtr & recognizer);
}
int main() try
{
// Ensure EventLoop is destroyed last
std::shared_ptr<void> eventLoopGuard(nullptr, [](auto) { EventLoop::destroy(); });
// Create the ASR Engine using CSDK (Cerence)
auto asrEngine = Vsdk::Asr::Engine::make<Vsdk::Asr::Csdk::Engine>("config/vsdk.json");
fmt::print("ASR Engine version: {}\n", asrEngine->version());
// Retrieve a configured Recognizer from the engine
const std::string recognizerName = "rec_eng-US";
recognizer = asrEngine->recognizer(recognizerName);
// Adding callbacks to the recognizer
recognizer->subscribe(&onAsrEvent);
recognizer->subscribe([](const Recognizer::Error & e) { onAsrError(e); });
recognizer->subscribe([](const Recognizer::Result & r) { onAsrResult(r); });
// Choose which grammar model to use
recognizer->setModel(modelName);
// Retrieve a producer (Vsdk::Audio::Producer::PaMicrophone)
auto mic = PaMicrophone::make();
fmt::print("Using input device: {}\n", mic->name());
// Creating a pipeline and assigning it the bare minimum.
// A producer = your microphone
// A consumer = the recognizer
Pipeline pipeline;
pipeline.setProducer(mic);
pipeline.pushBackConsumer(recognizer);
pipeline.start(); // Async start.
fmt::print("ASR pipeline started. Speak into the microphone. Press Ctrl+C to stop.\n");
// Run the event loop.
EventLoop::instance().run();
return EXIT_SUCCESS;
}
catch (std::exception const & e)
{
fmt::print(stderr, "Fatal error:\n");
Vsdk::printExceptionStack(e);
return EXIT_FAILURE;
}
4. Writing callbacks
We declared the following in the previous section and assigned them to the recognizer.
void onAsrEvent (Recognizer::Event const & event);
void onAsrError (Recognizer::Error const & error);
void onAsrResult(Recognizer::Result const & result);
...
recognizer->subscribe(&onAsrEvent);
recognizer->subscribe([](const Recognizer::Error & e) { onAsrError(e); });
recognizer->subscribe([](const Recognizer::Result & r) { onAsrResult(r); });
We now need to define them.
Be mindful that the functions were declared in an anonymous namespace. This isn’t strictly required, but since they were declared there, their definitions must be placed in the same namespace.
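Concretely, the definitions in the next sections are assumed to sit inside the same anonymous namespace as the declarations, for example:
namespace
{
    void onAsrEvent(Recognizer::Event const & event)
    {
        // body shown in the next section
    }
    // onAsrError and onAsrResult are defined the same way
}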
The event callback
void onAsrEvent(Recognizer::Event const & event)
{
static Recognizer::Event lastEvent;
if (event != lastEvent)
{
auto const msg = event.message.empty() ? "" : ": " + event.message;
fmt::print("[{}] {}{}\\n", Vsdk::Utils::formatTimeMarker(event.timeMarker), event.codeString, msg);
lastEvent = event;
}
}
The error callback
void onAsrError(Recognizer::Error const & error)
{
fmt::print(stderr, "[ERROR] {} - {}\n", error.codeString, error.message);
}
The result callback
The parsed result will typically contain multiple hypotheses, each with an associated confidence score—the higher the score, the better the match.
In general, a confidence threshold between 4000 and 5000 is considered acceptable, though this may vary depending on your specific use case.
void onAsrResult(Recognizer::Result const & result)
{
int const confidenceThreshold = 5000;
if (result.type != Recognizer::ResultType::Asr)
return;
std::string text;
int confidence = -1;
if (!result.hypotheses.empty())
{
auto const & best = result.hypotheses[0]; // First is always best
fmt::print("Hypothesis: '{}'{}\n", best.text,
best.confidence != -1 ? fmt::format(" (confidence: {})", best.confidence) : "");
if (!best.tags.empty())
{
fmt::print("Tags:\n");
for (auto const & tag : best.tags)
fmt::print(" - '{}': '{}'\n", tag.first, tag.second);
confidence = best.confidence;
if (confidence >= confidenceThreshold)
{
text = best.tags[0].second;
}
}
else
fmt::print(stderr, "Tags: (None)\n");
}
if (!text.empty()) // Print the recognized text if it passed the confidence threshold
fmt::print("{}\n", text);
// We received a result, so the model must be set again.
installModel(modelName, recognizer); // detailed below
}
Detailed topics
Pipeline control
pipeline.start();
pipeline.stop();
pipeline.run();
.start() runs the pipeline in a new thread.
.run() runs the pipeline and blocks until it finishes.
.stop() terminates the pipeline execution.
Once a pipeline has been stopped, you can restart it at any time by simply calling .start() again.
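For instance, a stop/restart cycle looks like this (a minimal sketch using the calls above):
pipeline.stop();   // terminate the current execution
// ... reconfigure the recognizer or swap models if needed ...
pipeline.start();  // restart asynchronously; audio processing resumes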
Managing models
In order to use a grammar model you have to call Recognizer::setModel.
Each time you receive an onAsrResult() event, the model is automatically unset. To continue recognition, you need to set the model again manually.
Be careful when calling .setModel() from within a recognizer callback such as onResult(), onEvent(), or onError(). The callback must fully return before you can safely set the model again.
recognizer->setModel("model-1");
recognizer->setModel("model-1", hypothesis.endTime);
The start time begins at 0 when the pipeline is started.
You can set a model slightly in the past—typically up to 2 seconds back—to capture speech that occurred just before the model was applied.
However, keep in mind that this introduces additional computation, which may impact performance on low-power devices. Be sure to test and validate this behavior in your target environment.
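As an illustration only, here is a sketch of re-applying a model about 2 seconds in the past. It assumes the second setModel parameter is a start time in milliseconds relative to pipeline start, consistent with the values returned by upTimeMs(); verify the exact unit and limits against your SDK version.
// Assumption: time values are in milliseconds since the pipeline started.
auto const nowMs   = recognizer->upTimeMs();
auto const startMs = nowMs > 2000 ? nowMs - 2000 : 0;
recognizer->setModel("model-1", startMs);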
This is an example implementation of the installModel(…) function declared earlier.
void installModel(std::string const & model, Vsdk::Asr::RecognizerPtr & recognizer)
{
// The given lambda is passed to the EventLoop because calling setModel can't be done in
// the same thread as the recognition result callback, see Recognizer::subscribe()
EventLoop::instance().queue([=, &recognizer]
{
recognizer->setModel(model);
fmt::print("[{}] Model '{}' activated\n",
Vsdk::Utils::formatTimeMarker(recognizer->upTimeMs()), model);
});
}
Working with dynamic models
Dynamic models are designed to support runtime customization, allowing you to insert values into slots directly from your code.
This is useful when your application needs to adapt its vocabulary on the fly, such as recognizing names, product codes, or other user-specific data that isn’t known at model compile time.
Once compiled, the dynamic model behaves like a regular static model and can be set on a recognizer. It must be recompiled whenever you change the slot values.
Compiled models are not cached, so you will need to recompile them each time the application restarts or each time the ASR engine is destroyed.
This is an example of a dynamic grammar with a slot called item.
Instead of simply setting a model by name, you need to perform the following steps first.
auto const model = engine->dynamicModel("dynamic-model");
model->clearData(); // optional
model->addData("item", "1");
model->addData("item", "2");
model->compile();
Then you can apply the model to the recognizer.
recognizer->setModel("dynamic-model"); // Or use setModel(model->name())
Because dynamic models are compiled on the device at runtime, injecting a large number of slot entries—especially thousands—can introduce noticeable delays during compilation.
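If compilation time is a concern, a simple way to check it on your target device is to time the compile() call. This sketch reuses the model variable from the example above and assumes fmt is available as in the main example.
#include <chrono>

auto const t0 = std::chrono::steady_clock::now();
model->compile(); // on-device compilation of the filled slots
auto const elapsedMs = std::chrono::duration_cast<std::chrono::milliseconds>(
    std::chrono::steady_clock::now() - t0).count();
fmt::print("Dynamic model compiled in {} ms\n", elapsedMs);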