Voice Recognition - C++

VDK features three different ASR libraries: CSDK, TNL and our very own: VASR.

Basics

You will work with two concepts: Recognizers and Models. Both need to be configured, but first let's explain what each one does.
Models are fed to the Recognizer and describe the range of words and utterances that can be recognized. They will either be pre-compiled by the provider (like “free speech” models), or compiled from a grammar that you've written beforehand in the VDK Studio.

There are 3 types of models:

Type	Description
static	Static models embed all possible vocabulary inside a single file or folder.
dynamic	Dynamic models have “holes” where you can plug new vocabulary at runtime. They need to be prepared and compiled at runtime before being installed on a recognizer.
free-speech	Free-Speech models are very large vocabulary static models. They often require additional files and are not supported by all engines.

Recognizers inherit Audio::ConsumerModule. As they receive audio, they compare it to the current model's data and report results.

Configuration

Each engine has its own configuration quirks and tweaks, but here is a common (though incomplete) pattern using VSDK-CSDK, which supports all 3 types of models:

JSON

{
    "version": "2.0",
    "csdk": {
        "paths": {
            "data_root": "../data"
        },
        "asr": {
            "recognizers": {
                "rec": { ... }
            },
            "models": {
                "static_example": {
                    "type": "static",
                    "file": "<model_name>.fcf"
                },
                "dynamic_example": {
                    "type": "dynamic",
                    "file": "<base_model_name>.fcf",
                    "slots": {
                        "firstname": { ... },
                        "lastname": { ... }
                    },
                    ...
                },
                "free-speech_example": {
                    "type": "free-speech",
                    "file": "<base_model_name>.fcf",
                    "extra_models": { ... }
                }
            }
        }
    }
}

Starting the engine

CPP

#include <vsdk/asr/csdk.hpp> // underlying ASR engine, here we choose CSDK
using AsrEngine = Vsdk::Asr::Csdk::Engine;
Vsdk::Asr::EnginePtr const engine = Vsdk::Asr::Engine::make<AsrEngine>("config/vsdk.json");
// engine is a std::shared_ptr: copy it around as needed, but don't let it go out of scope while you need it!
// const here means the pointer is const, not the pointee (the Engine)

You can't create two separate instances of the same engine! Attempting to create a second one will get you another pointer to the existing engine. Terminate the first engine (i.e. let it go out of scope) then you can make a new instance.

That's it! If no exception was thrown your engine is ready to be used. Each engine has its own configuration document, check it out for further details, as well as the ASR samples to get started with actual, production-ready code.
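
The single-instance behaviour described above can be pictured with a small stand-alone sketch. It is only an illustration of one common way to get that behaviour (a factory that remembers the live instance through a std::weak_ptr); DemoEngine, make and configPath are hypothetical names and nothing here is taken from the VSDK implementation. In this sketch the second configuration path is simply ignored while an engine is alive, which is an assumption.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for an engine; NOT the VSDK class.
class DemoEngine {
public:
    // Returns the live instance if one exists, otherwise creates a new one.
    // Not thread-safe: a real implementation would need a lock around this.
    static std::shared_ptr<DemoEngine> make(std::string const & configPath) {
        static std::weak_ptr<DemoEngine> current; // observes, does not own
        if (auto existing = current.lock()) {
            return existing; // same engine again; configPath is ignored here
        }
        std::shared_ptr<DemoEngine> created(new DemoEngine(configPath));
        current = created;
        return created;
    }

    std::string const & configPath() const { return _configPath; }

private:
    explicit DemoEngine(std::string const & configPath) : _configPath(configPath) {}
    std::string _configPath;
};

// Usage: two calls share one engine until every owner has released it.
void demoSingleInstance() {
    auto first  = DemoEngine::make("config/a.json");
    auto second = DemoEngine::make("config/b.json");
    assert(first == second);                         // same underlying engine
    assert(second->configPath() == "config/a.json"); // second path was ignored

    first.reset();
    second.reset(); // last owner gone: the engine is destroyed

    auto third = DemoEngine::make("config/b.json");
    assert(third->configPath() == "config/b.json");  // a genuinely new engine
}
```

The practical consequence is the same as in the text: if you need an engine built from a different configuration, make sure every copy of the old pointer has been released first.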

Creating a Recognizer

CPP

auto const rec = engine->recognizer("rec"); // Instantiate the recognizer we configured above

You can then subscribe to the reporting mechanism:

CPP

rec->subscribe([] (Vsdk::Asr::Recognizer::Event const & e) { ... });
rec->subscribe([] (Vsdk::Asr::Recognizer::Error const & e) { ... });
rec->subscribe([] (Vsdk::Asr::Recognizer::Result const & r) { ... });
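
The three subscribe calls above differ only in the parameter type of the callback. The stand-alone sketch below shows one way such an interface can be built, with one std::function overload per payload type so that the lambda's signature selects the right list of handlers. DemoRecognizer, DemoEvent, DemoError, DemoResult, their fields and emit are all hypothetical; the real VSDK types expose their own members, which are not shown in this page.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical payload types; the real VSDK payloads have other members.
struct DemoEvent  { std::string name; };
struct DemoError  { std::string message; };
struct DemoResult { std::string text; };

// Hypothetical stand-in for a recognizer; NOT the VSDK class.
class DemoRecognizer {
public:
    // One overload per payload type: the callback's parameter picks the overload.
    void subscribe(std::function<void(DemoEvent const &)> cb)  { _onEvent.push_back(std::move(cb)); }
    void subscribe(std::function<void(DemoError const &)> cb)  { _onError.push_back(std::move(cb)); }
    void subscribe(std::function<void(DemoResult const &)> cb) { _onResult.push_back(std::move(cb)); }

    // Called by the (imaginary) engine to notify subscribers.
    void emit(DemoEvent const & e) const  { for (auto const & cb : _onEvent)  { cb(e); } }
    void emit(DemoError const & e) const  { for (auto const & cb : _onError)  { cb(e); } }
    void emit(DemoResult const & r) const { for (auto const & cb : _onResult) { cb(r); } }

private:
    std::vector<std::function<void(DemoEvent const &)>>  _onEvent;
    std::vector<std::function<void(DemoError const &)>>  _onError;
    std::vector<std::function<void(DemoResult const &)>> _onResult;
};

// Usage: each handler only sees the payload type it subscribed to.
std::vector<std::string> demoSubscribe() {
    std::vector<std::string> log;
    DemoRecognizer rec;
    rec.subscribe([&log] (DemoEvent const & e)  { log.push_back("event: " + e.name); });
    rec.subscribe([&log] (DemoError const & e)  { log.push_back("error: " + e.message); });
    rec.subscribe([&log] (DemoResult const & r) { log.push_back("result: " + r.text); });

    rec.emit(DemoEvent{"speech detected"});
    rec.emit(DemoResult{"hello"});
    assert(log.size() == 2); // no error was emitted, so the error handler never ran
    return log;
}
```

Keeping one handler per payload type lets you treat errors separately from results instead of switching on a type tag inside a single callback.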

And finally, apply a model to actually recognize vocabulary:

CPP

rec->setModel("static_example"); // same call whether the model is static, dynamic or free-speech!

Also, don't forget to insert the recognizer in the pipeline; otherwise it never receives audio and nothing happens:

CPP

p.pushBackConsumer(rec); // p is the audio pipeline you are inserting the recognizer into

Dynamic Models

Only dynamic models need to be manipulated explicitly to add the missing data at runtime:

CPP

auto const model = engine->dynamicModel("dynamic_example");
model->addData("firstname", "André");
model->addData("lastname", "Lemoine");
model->compile();
// We can now apply it to a recognizer!
rec->setModel("dynamic_example"); // Or use setModel(model->name())
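
The fill-then-compile sequence above can be pictured with a small stand-alone model of the idea: named slots receive entries, and the model is only usable once it has been compiled. This is a sketch under assumptions, not the VSDK implementation: DemoDynamicModel, isCompiled and entryCount are hypothetical names, and the rules that adding data invalidates a previous compile and that an unknown slot name throws are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a dynamic model; NOT the VSDK class.
class DemoDynamicModel {
public:
    // Declares the slots ("holes") that can be filled at runtime.
    explicit DemoDynamicModel(std::vector<std::string> const & slotNames) {
        for (auto const & name : slotNames) { _slots[name]; } // create empty slots
    }

    // Adds one entry to a declared slot. Assumption: this invalidates a prior compile.
    void addData(std::string const & slot, std::string const & value) {
        auto const it = _slots.find(slot);
        if (it == _slots.end()) { throw std::invalid_argument("unknown slot: " + slot); }
        it->second.push_back(value);
        _compiled = false;
    }

    // Stands in for the real compilation step; here it only flips a flag.
    void compile() { _compiled = true; }

    bool isCompiled() const { return _compiled; }
    std::size_t entryCount(std::string const & slot) const { return _slots.at(slot).size(); }

private:
    std::map<std::string, std::vector<std::string>> _slots;
    bool _compiled = false;
};

// Usage: same order as the example above (fill the slots, then compile).
void demoDynamicModel() {
    DemoDynamicModel model(std::vector<std::string>{"firstname", "lastname"});
    model.addData("firstname", "André");
    model.addData("lastname", "Lemoine");
    assert(!model.isCompiled()); // data added, but not yet compiled
    model.compile();
    assert(model.isCompiled());  // only now would it be applied to a recognizer
}
```

The point of the sketch is the ordering: whatever you add to a slot only takes effect once compile() has run, so call it after the last addData and before setModel.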