Speech Recognition - C++
Introduction
Recognizers & Models
Speech recognition relies on two main components: Recognizers and Models. They work together to turn audio into meaningful results.
Recognizers: The Acoustic Part
Recognizers handle incoming audio and use voice activity detection (VAD) to detect when someone is speaking or silent.
A recognizer can run without a model—VAD will still detect speech—but no transcription or interpretation will occur. Recognizers are configured for specific languages and accents (e.g., eng-US, fra-FR) and must be paired with models in the same language.
Models: The Language Part
Models define what the recognizer can understand—words, phrases, grammar structure, and even phoneme rules. They act as the language engine behind the acoustic processing.
There are three types of models:
Type | Description |
Static models | Predefined vocabularies and grammars. They define a fixed set of valid phrases, their structure, and optionally custom phonemes. |
Dynamic models | A special type of static model that includes slots—placeholders you can fill with vocabulary at runtime. The base model is compiled in the cloud, then updated on the device when slots are filled, requiring an additional on-device compilation. |
Free-speech models | Large-vocabulary models designed for open-ended input (e.g., dictation). Unlike static or dynamic models, they are not limited to a defined set of phrases. They are used the same way as static models.
In short, recognizers listen, and models interpret. Both are essential for effective speech recognition.
Why Recognizers and Models Are Separated
Separating Recognizers and Models gives you more flexibility. For example, you can set or switch a model after speech has started—thanks to internal audio buffering—so the model still applies to earlier audio.
It also lets you reuse the same recognizer with different models, or prepare models in the background without interrupting audio input.
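For illustration, here is a minimal sketch of reusing one recognizer with two different models. The model names "commands" and "dictation" are hypothetical; the setModel call is the same one used later in this guide.
// Hypothetical model names; switch models without restarting the audio pipeline.
recognizer->setModel("commands");   // start with a small command grammar
// ... later, for example after a command result has been handled ...
recognizer->setModel("dictation");  // the same recognizer now uses another model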
Audio Format
The audio data is a 16-bit signed PCM buffer in Little-Endian format.
It is always mono (1 channel), and the sample rate is 16 kHz.
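As a quick reference, the sketch below shows the buffer arithmetic implied by this format. The constant names are illustrative, not part of the SDK.
#include <cstddef>
#include <cstdint>

// 16 kHz, mono, 16-bit signed little-endian PCM.
constexpr int         sampleRateHz   = 16000;
constexpr int         channelCount   = 1;
constexpr std::size_t bytesPerSample = sizeof(std::int16_t); // 2 bytes

// A 100 ms chunk therefore holds 1600 samples, i.e. 3200 bytes.
constexpr std::size_t samplesPer100Ms = sampleRateHz / 10;
constexpr std::size_t bytesPer100Ms   = samplesPer100Ms * channelCount * bytesPerSample;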
Getting Started
Before you begin, make sure you’ve completed all the necessary preparation steps.
There are two ways to prepare your project for Voice Recognition:
Using sample code
Starting from scratch
From Sample Code
To download the sample code, you'll need Conan. All the necessary steps are outlined in the general Getting Started guide.
📦 simple-application
📦 chained-grammars
📦 dynamic-grammar
Quick example with simple-application:
conan search -r vivoka-customer simple-application # To get the latest version.
conan inspect -r vivoka-customer -a options simple-application/<version>@vivoka/customer
conan install -if simple-application simple-application/<version>@vivoka/customer -o asr_engine=csdk-asr -o tts_engine=csdk-tts
Open project.vdk in VDK-Studio
Export the assets from VDK-Studio into the same directory (don’t forget to configure a voice for the channel and to compile the models)
conan install . -if build
conan build . -if build
./build/Release/simple-application
From Scratch
Before proceeding, make sure you’ve completed the following steps:
1. Prepare your VDK Studio project
Create a new project in VDK Studio
Add the Voice Recognition technology
Add a model and a recognizer
Export the project to generate the required assets and configuration
To add a recognizer:
Click the settings icon next to Voice Recognition in the left panel.
Click “Add new recognizer”.
Enter a name for your recognizer.
Select the language.
Remember the name, you will need it later!
If you don’t provide a name, a default recognizer is created and named rec_ + language. For example, rec_eng-US.
2. Set up your project
Install the necessary libraries
vsdk-csdk-asr/x.x.x@vivoka/customer
We also provide utility libraries for capturing microphone audio and building an event loop.
vsdk-audio-portaudio/x.x.x@vivoka/customer
vsdk-samples-utils/x.x.x@vivoka/customer
These steps are explained in more detail in the Getting Started guide.
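For example, a minimal conanfile.txt could look like the following sketch. The generator is an assumption; use whatever your build setup from the Getting Started guide prescribes, and replace x.x.x with real versions.
[requires]
vsdk-csdk-asr/x.x.x@vivoka/customer
vsdk-audio-portaudio/x.x.x@vivoka/customer
vsdk-samples-utils/x.x.x@vivoka/customer

[generators]
cmake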
3. Write the main body
This is a base example of a Speech Recognition pipeline. We need:
An event loop
A pipeline set up with a recognizer
Defined callbacks
You cannot create two instances of the same engine.
#include <vsdk/audio/Pipeline.hpp>
#include <vsdk/audio/producers/PaMicrophone.hpp>
#include <vsdk/asr/csdk.hpp>
#include <vsdk/utils/Misc.hpp> // formatTimeMarker
#include <vsdk/utils/PortAudio.hpp>
#include <vsdk/utils/samples/EventLoop.hpp>
#include <fmt/core.h>
#include <csignal>
#include <iostream>
#include <memory>
using Vsdk::Asr::Recognizer;
using Vsdk::Audio::Pipeline;
using Vsdk::Audio::Producer::PaMicrophone;
using Vsdk::Utils::Samples::EventLoop;
namespace
{
Vsdk::Asr::RecognizerPtr recognizer;
const std::string modelName = "model-1"; // Replace as needed
// Callbacks (we'll describe them later)
void onAsrEvent(Recognizer::Event const & event);
void onAsrError(Recognizer::Error const & error);
void onAsrResult(Recognizer::Result const & result);
// Re-applies the model after a result (the call must be queued on the event loop)
void installModel(std::string const & model, Vsdk::Asr::RecognizerPtr & recognizer);
}
int main() try
{
// Ensure EventLoop is destroyed last
std::shared_ptr<void> eventLoopGuard(nullptr, [](auto) { EventLoop::destroy(); });
// Create the ASR Engine using CSDK (Cerence)
auto asrEngine = Vsdk::Asr::Engine::make<Vsdk::Asr::Csdk::Engine>("config/vsdk.json");
fmt::print("ASR Engine version: {}\n", asrEngine->version());
// Retrieve a configured Recognizer from the engine
const std::string recognizerName = "rec_eng-US";
recognizer = asrEngine->recognizer(recognizerName);
// Adding callbacks to the recognizer
recognizer->subscribe(&onAsrEvent);
recognizer->subscribe([](const Recognizer::Error & e) { onAsrError(e); });
recognizer->subscribe([](const Recognizer::Result & r) { onAsrResult(r); });
// Choose which grammar model to use
recognizer->setModel(modelName);
// Retrieve a producer (Vsdk::Audio::Producer::PaMicrophone)
auto mic = PaMicrophone::make();
fmt::print("Using input device: {}\n", mic->name());
// Creating a pipeline and assigning it the bare minimum.
// A producer = your microphone
// A consumer = the recognizer
Pipeline pipeline;
pipeline.setProducer(mic);
pipeline.pushBackConsumer(recognizer);
pipeline.start(); // Async start.
fmt::print("ASR pipeline started. Speak into the microphone. Press Ctrl+C to stop.\n");
// Run the event loop.
EventLoop::instance().run();
return EXIT_SUCCESS;
}
catch (std::exception const & e)
{
fmt::print(stderr, "Fatal error:\n");
Vsdk::printExceptionStack(e);
return EXIT_FAILURE;
}
4. Writing callbacks
We declared the following in the previous section and assigned them to the recognizer.
void onAsrEvent (Recognizer::Event const & event);
void onAsrError (Recognizer::Error const & error);
void onAsrResult(Recognizer::Result const & result);
...
recognizer->subscribe(&onAsrEvent);
recognizer->subscribe([](const Recognizer::Error & e) { onAsrError(e); });
recognizer->subscribe([](const Recognizer::Result & r) { onAsrResult(r); });
We now need to define them.
Be mindful that the functions were declared in an anonymous namespace. This isn’t strictly required, but since they were declared there, their definitions must be placed in the same namespace.
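Concretely, the definitions in the next sections are assumed to sit inside the same anonymous namespace as the declarations, for example:
namespace
{
    void onAsrEvent(Recognizer::Event const & event)
    {
        // body shown in the next section
    }
    // onAsrError and onAsrResult are defined the same way
}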
The event callback
void onAsrEvent(Recognizer::Event const & event)
{
static Recognizer::Event lastEvent;
if (event != lastEvent)
{
auto const msg = event.message.empty() ? "" : ": " + event.message;
fmt::print("[{}] {}{}\\n", Vsdk::Utils::formatTimeMarker(event.timeMarker), event.codeString, msg);
lastEvent = event;
}
}
The error callback
void onAsrError(Recognizer::Error const & error)
{
fmt::print(stderr, "[ERROR] {} - {}\n", error.codeString, error.message);
}
The result callback
The parsed result will typically contain multiple hypotheses, each with an associated confidence score—the higher the score, the better the match.
In general, a confidence threshold between 4000 and 5000 is considered acceptable, though this may vary depending on your specific use case.
void onAsrResult(Recognizer::Result const & result)
{
int const confidenceThreshold = 5000;
if (result.type != Recognizer::ResultType::Asr)
return;
std::string text;
int confidence = -1;
if (!result.hypotheses.empty())
{
auto const & best = result.hypotheses[0]; // First is always best
fmt::print("Hypothesis: '{}'{}\n", best.text,
best.confidence != -1 ? fmt::format(" (confidence: {})", best.confidence) : "");
if (!best.tags.empty())
{
fmt::print("Tags:\n");
for (auto const & tag : best.tags)
fmt::print(" - '{}': '{}'\n", tag.first, tag.second);
confidence = best.confidence;
if (confidence >= confidenceThreshold)
{
text = best.tags[0].second;
}
}
else
fmt::print(stderr, "Tags: (None)\n");
}
if (!text.empty()) // Print the recognized text if it passed the confidence threshold
fmt::print("{}\n", text);
// We received a result, so the model must be set again.
installModel(modelName, recognizer); // detailed below
}
Detailed topics
Pipeline control
pipeline.start();
pipeline.stop();
pipeline.run();
.start() runs the pipeline in a new thread.
.run() runs the pipeline and blocks until it finishes.
.stop() terminates the pipeline execution.
Once a pipeline has been stopped, you can restart it at any time by simply calling .start() again.
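For instance, a stop/restart cycle looks like this (a minimal sketch using the calls above):
pipeline.stop();   // terminate the current execution
// ... reconfigure the recognizer or swap models if needed ...
pipeline.start();  // restart asynchronously; audio processing resumes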
Managing models
In order to use a grammar model you have to call Recognizer::setModel.
Each time you receive an onAsrResult() event, the model is automatically unset. To continue recognition, you need to set the model again manually.
Be careful when calling .setModel() from within a recognizer callback such as onResult(), onEvent(), or onError(). The callback must fully return before you can safely set the model again.
recognizer->setModel("model-1");
recognizer->setModel("model-1", hypothesis.endTime);
The start time begins at 0 when the pipeline is started.
You can set a model slightly in the past—typically up to 2 seconds back—to capture speech that occurred just before the model was applied.
However, keep in mind that this introduces additional computation, which may impact performance on low-power devices. Be sure to test and validate this behavior in your target environment.
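As an illustration only, here is a sketch of re-applying a model about 2 seconds in the past. It assumes the second setModel parameter is a start time in milliseconds relative to pipeline start, consistent with the values returned by upTimeMs(); verify the exact unit and limits against your SDK version.
// Assumption: time values are in milliseconds since the pipeline started.
auto const nowMs   = recognizer->upTimeMs();
auto const startMs = nowMs > 2000 ? nowMs - 2000 : 0;
recognizer->setModel("model-1", startMs);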
This is an example implementation of the installModel(…) function declared earlier.
void installModel(std::string const & model, Vsdk::Asr::RecognizerPtr & recognizer)
{
// The given lambda is passed to the EventLoop because calling setModel can't be done in
// the same thread as the recognition result callback, see Recognizer::subscribe()
EventLoop::instance().queue([=, &recognizer]
{
recognizer->setModel(model);
fmt::print("[{}] Model '{}' activated\n",
Vsdk::Utils::formatTimeMarker(recognizer->upTimeMs()), model);
});
}
Working with dynamic models
Dynamic models are designed to support runtime customization, allowing you to insert values into slots directly from your code.
This is useful when your application needs to adapt its vocabulary on the fly, such as recognizing names, product codes, or other user-specific data that isn’t known at model compile time.
Once compiled, the dynamic model behaves like a regular static model and can be set on a recognizer. It must be recompiled whenever you change the slot values.
Compiled models are not cached, so you will need to recompile them each time the application restarts or each time the ASR engine is destroyed.
This is an example of a dynamic grammar with a slot called item.
Instead of simply setting a model by name, you need to perform the following steps first.
auto const model = engine->dynamicModel("dynamic-model");
model->clearData(); // optional
model->addData("item", "1");
model->addData("item", "2");
model->compile();
Then you can apply the model to the recognizer.
recognizer->setModel("dynamic-model"); // Or use setModel(model->name())
Because dynamic models are compiled on the device at runtime, injecting a large number of slot entries—especially thousands—can introduce noticeable delays during compilation.
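If compilation time is a concern, a simple way to check it on your target device is to time the compile() call. This sketch reuses the model variable from the example above and assumes fmt is available as in the main example.
#include <chrono>

auto const t0 = std::chrono::steady_clock::now();
model->compile(); // on-device compilation of the filled slots
auto const elapsedMs = std::chrono::duration_cast<std::chrono::milliseconds>(
    std::chrono::steady_clock::now() - t0).count();
fmt::print("Dynamic model compiled in {} ms\n", elapsedMs);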