Voice Recognition - C++

VDK features two different ASR SDKs: vsdk-vasr and vsdk-csdk.

Basics

You will work with two concepts: recognizers and models. Both need to be configured, but first let's explain what each one does.
Models are fed to the recognizer and describe the range of words and utterances that can be recognized. Depending on the SDK, a model is either pre-compiled (like “free-speech” models) or compiled from a grammar that you have written beforehand in VDK Studio.

There are 3 types of models:

Type	Description
static	Static models embed all possible vocabulary inside a single file or folder.
dynamic	Dynamic models have “holes” (slots) where you can plug in new vocabulary at runtime. They need to be prepared and compiled at runtime before being installed on a recognizer.
free-speech	Free-speech models are static models with a very large vocabulary. They often require additional files and are not supported by all engines.

Recognizers inherit from Audio::ConsumerModule. As they receive audio, they compare it against the data of the current model and report results.

Configuration

Each engine has its own configuration quirks, but here is a common (though incomplete) pattern using vsdk-vasr, which supports all three model types:

JSON

{
    "version": "2.0",
    "vasr": {
        "paths": {
            "data_root": "../data"
        },
        "asr": {
            "recognizers": {
                "rec": { ... }
            },
            "models": {
                "static_example": {
                    "type": "static",
                    "file": "<model_name>.vgg"
                },
                "dynamic_example": {
                    "type": "dynamic",
                    "file": "<base_model_name>.vgg",
                    "slots": {
                        "firstname": { ... },
                        "lastname": { ... }
                    },
                    ...
                },
                "free-speech_example": {
                    "type": "free-speech",
                    "file": "<base_model_name>.vgg"
                }
            }
        }
    }
}

Starting the engine

C++

#include <vsdk/asr/vasr.hpp> // underlying ASR engine, here we choose VASR
using AsrEngine = Vsdk::Asr::Vasr::Engine;
Vsdk::Asr::EnginePtr const engine = Vsdk::Asr::Engine::make<AsrEngine>("config/vsdk.json");
// engine is a std::shared_ptr, copy it around as needed but don't let it go out of scope while you need it!
// const here means the pointer is const, not the pointee (the Engine)
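
The pointer-versus-pointee remark above can be checked with plain standard C++. The sketch below does not use VSDK at all: `DemoEngine`, `DemoEnginePtr` and `relabel` are hypothetical names introduced only to show that a `const` `std::shared_ptr` cannot be reseated, while the object it points to stays mutable and copies of the pointer share ownership.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for an engine type; not part of VSDK.
struct DemoEngine
{
    std::string label = "initial";
};

using DemoEnginePtr = std::shared_ptr<DemoEngine>;

// Mutates the pointee through a const pointer and returns the new label.
inline std::string relabel(DemoEnginePtr const & engine, std::string const & label)
{
    assert(engine != nullptr); // a const pointer can still be null
    engine->label = label;     // allowed: only the pointer itself is const
    return engine->label;
}

// Dereferencing a const shared_ptr yields a reference to a non-const object.
static_assert(!std::is_const<std::remove_reference<
                  decltype(*std::declval<DemoEnginePtr const &>())>::type>::value,
              "the pointee stays mutable");
```

Declaring `DemoEnginePtr const engine = std::make_shared<DemoEngine>();` and then calling `relabel(engine, "updated")` compiles and changes the object, whereas `engine.reset()` would be rejected by the compiler.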

You can't create two separate instances of the same engine! Attempting to create a second one will give you another pointer to the existing engine. Terminate the first engine (i.e. let it go out of scope), and then you can make a new instance.
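
To make this sharing behaviour concrete, here is a minimal sketch of one common way to obtain it in standard C++, using a `std::weak_ptr` cache. It is not the VSDK implementation: `FakeEngine` is a hypothetical stand-in, and the choice to ignore the configuration path when an instance already exists is an assumption of this sketch. The sketch is also not thread-safe.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in engine; illustrates the sharing behaviour only.
class FakeEngine
{
public:
    // Returns the live instance if one exists, otherwise creates a new one.
    static std::shared_ptr<FakeEngine> make(std::string const & configPath)
    {
        assert(!configPath.empty()); // this sketch requires a configuration path
        static std::weak_ptr<FakeEngine> current; // observes, does not own
        if (auto existing = current.lock())
            return existing; // assumption: configPath is ignored in this case
        std::shared_ptr<FakeEngine> created(new FakeEngine(configPath));
        current = created;
        return created;
    }

    std::string const & configPath() const { return path_; }

private:
    explicit FakeEngine(std::string const & path) : path_(path) {}
    std::string path_;
};
```

With this stand-in, two successive calls to `FakeEngine::make` return pointers that compare equal. A fresh instance is only created once every pointer to the first one has been released.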

That's it! If no exception was thrown, your engine is ready to be used. Each engine has its own configuration document; check it out for further details. The ASR samples are also a good place to get started with actual, production-ready code.

Creating a Recognizer

C++

auto const rec = engine->recognizer("rec"); // Instantiate the recognizer we configured above

You can then subscribe to the reporting mechanism:

C++

rec->subscribe([] (Vsdk::Asr::Recognizer::Event const & e) { ... });
rec->subscribe([] (Vsdk::Asr::Recognizer::Error const & e) { ... });
rec->subscribe([] (Vsdk::Asr::Recognizer::Result const & r) { ... });
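
The three calls above differ only in the parameter type of the lambda. The sketch below shows, with hypothetical stand-in types, how such type-based subscription can work in standard C++ and how a handler can capture state to collect what it receives. `FakeRecognizer`, `FakeEvent`, `FakeError`, `FakeResult` and `emit` are illustrative names, not VSDK types, and the fields are invented for the example; refer to the SDK headers for what the real `Event`, `Error` and `Result` expose.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical payloads; the fields are invented for this sketch.
struct FakeEvent  { std::string name; };
struct FakeError  { std::string message; };
struct FakeResult { std::string text; };

class FakeRecognizer
{
public:
    // One overload per payload type: the lambda's parameter type selects it.
    void subscribe(std::function<void(FakeEvent const &)> cb)
    {
        assert(cb); // an empty std::function would throw when invoked
        onEvent_.push_back(std::move(cb));
    }
    void subscribe(std::function<void(FakeError const &)> cb)
    {
        assert(cb);
        onError_.push_back(std::move(cb));
    }
    void subscribe(std::function<void(FakeResult const &)> cb)
    {
        assert(cb);
        onResult_.push_back(std::move(cb));
    }

    // Simulation hooks standing in for what an engine would report.
    void emit(FakeEvent const & e)  { for (auto const & cb : onEvent_)  cb(e); }
    void emit(FakeError const & e)  { for (auto const & cb : onError_)  cb(e); }
    void emit(FakeResult const & r) { for (auto const & cb : onResult_) cb(r); }

private:
    std::vector<std::function<void(FakeEvent const &)>>  onEvent_;
    std::vector<std::function<void(FakeError const &)>>  onError_;
    std::vector<std::function<void(FakeResult const &)>> onResult_;
};
```

A handler such as `[&log](FakeResult const & r) { log.push_back(r.text); }` stores each result in a container owned by the caller. That container must outlive the recognizer's use of the callback.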

And finally, apply a model to actually recognize vocabulary:

C++

rec->setModel("static_example"); // same call whether the model is static, dynamic or free-speech!

Also, don't forget to insert the recognizer in the pipeline, or nothing will happen:

C++

p.pushBackConsumer(rec); // p is your existing audio pipeline

Dynamic Models

Only dynamic models need to be manipulated explicitly to add the missing data at runtime:

C++

auto const model = engine->dynamicModel("dynamic_example");
model->addData("firstname", "André");
model->addData("lastname", "Lemoine");
model->compile();
// We can now apply it to a recognizer!
rec->setModel("dynamic_example"); // Or use setModel(model->name())
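
When slot values come from a list (contacts, for example), the add-then-compile sequence above can be wrapped in a small helper. This is a sketch under stated assumptions: `fillAndCompile` is a hypothetical helper, it relies only on the `addData` and `compile` calls shown above, and `FakeModel` is a stand-in that lets the helper run without the SDK. Skipping empty values is a choice made for this sketch, not documented SDK behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: fills slots of any model exposing addData() and compile().
// Each entry is a (slot name, value) pair. Returns the number of values added.
template <typename Model>
std::size_t fillAndCompile(Model & model,
                           std::vector<std::pair<std::string, std::string>> const & entries)
{
    std::size_t added = 0;
    for (auto const & entry : entries)
    {
        assert(!entry.first.empty()); // a slot name is required
        if (entry.second.empty())
            continue; // skip blank values instead of adding empty vocabulary
        model.addData(entry.first, entry.second);
        ++added;
    }
    model.compile(); // compile once, after all data has been added
    return added;
}

// Stand-in used only to exercise the helper without the SDK.
struct FakeModel
{
    std::vector<std::pair<std::string, std::string>> data;
    int compileCount = 0;

    void addData(std::string const & slot, std::string const & value)
    {
        data.emplace_back(slot, value);
    }
    void compile() { ++compileCount; }
};
```

With the real SDK, the same helper could presumably be called as `fillAndCompile(*model, {{"firstname", "André"}, {"lastname", "Lemoine"}})`, assuming the pointer returned by `dynamicModel` can be dereferenced. Compiling once at the end avoids recompiling after every added value.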