Speech Recognition
Models
In VDK-Studio, you can create three types of models.
Type | Description |
Static models | Predefined vocabularies and grammars. They define a fixed set of valid phrases, their structure, and optionally custom phonemes. |
Dynamic models | A special type of static model that includes slots—placeholders you can fill with vocabulary at runtime. The base model is compiled in the cloud, then updated on the device when slots are filled, requiring an additional on-device compilation. |
Free-speech models | Large-vocabulary models designed for open-ended input (e.g., dictation). Unlike static or dynamic models, they are not limited to a defined set of phrases. They are implemented the same way as static models. |
For Static and Free-speech models, send a request like the following to the REST API:
{
  "stop_at_first_result": true,
  "models": {
    "model-name": { }
  }
}
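As a minimal sketch, the request body above could be assembled in Python as follows. The helper name `build_recognition_request` is hypothetical and not part of any VDK client library. Accepting several model names at once is an assumption based on "models" being an object keyed by model name. Sending the body to the service is left out because the endpoint is not given here.

```python
import json


def build_recognition_request(model_names, stop_at_first_result=True):
    """Build the JSON body for Static or Free-speech models (no slots).

    Hypothetical helper: each model name maps to an empty object,
    mirroring the request shown above.
    """
    return {
        "stop_at_first_result": stop_at_first_result,
        "models": {name: {} for name in model_names},
    }


body = build_recognition_request(["model-name"])
print(json.dumps(body, indent=2))
```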
To provide slot values for a Dynamic model, send the following request:
{
  "stop_at_first_result": true,
  "models": {
    "model-name": {
      "slots": {
        "slot-name": {
          "values": ["Coffee", "Cola", "Mojito", "Cup of tea"]
        }
      }
    }
  }
}
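The slot request above might be generated from application data along these lines. This is a sketch only: `build_dynamic_request` is a hypothetical helper. Stripping blank entries and duplicates is a client-side precaution chosen for this example, not a documented requirement of the API.

```python
import json


def build_dynamic_request(model_name, slot_values, stop_at_first_result=True):
    """Build the JSON body for a Dynamic model.

    slot_values maps a slot name to the list of phrases that fill it.
    Hypothetical helper; the layout mirrors the request shown above.
    """
    slots = {}
    for slot_name, values in slot_values.items():
        # Assumption: drop blank entries and duplicates, keeping first-seen order.
        cleaned = list(dict.fromkeys(v.strip() for v in values if v.strip()))
        slots[slot_name] = {"values": cleaned}
    return {
        "stop_at_first_result": stop_at_first_result,
        "models": {model_name: {"slots": slots}},
    }


request_body = build_dynamic_request(
    "model-name",
    {"slot-name": ["Coffee", "Cola", "Mojito", "Cup of tea"]},
)
print(json.dumps(request_body, indent=2))
```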
The parsed result will typically contain multiple hypotheses, each with an associated confidence score—the higher the score, the better the match.
In general, a confidence threshold between 4000 and 5000 is considered acceptable (for csdk), though this may vary depending on your specific use case.
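To illustrate applying such a threshold on the client side, the sketch below picks the highest-scoring hypothesis that meets a minimum confidence. The result layout (a list of dictionaries with "text" and "confidence" keys) and the sample values are assumptions made for this example; adapt the field names to the actual parsed result.

```python
def best_hypothesis(hypotheses, min_confidence=4500):
    """Return the highest-confidence hypothesis at or above the threshold.

    Assumption: each hypothesis is a dict with "text" and "confidence" keys.
    Returns None when no hypothesis is confident enough.
    """
    accepted = [h for h in hypotheses if h["confidence"] >= min_confidence]
    return max(accepted, key=lambda h: h["confidence"], default=None)


# Made-up values, for illustration only.
sample = [
    {"text": "cup of tea", "confidence": 5200},
    {"text": "cup of team", "confidence": 3100},
]
print(best_hypothesis(sample))
```

Returning None instead of raising lets the caller decide whether to keep listening or prompt the user again.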
If "stop_at_first_result" is set to true, recognition stops at the first result, regardless of its confidence level.
To ensure you only stop when a result meets your desired confidence threshold, you have two options:
- Set "stop_at_first_result" to false and wait until a result with a satisfactory confidence level is returned.
- Configure confidence thresholds directly by setting the following parameters in the config/vsdk.json file:
  csdk/asr/models/<model_name_1>/settings/LH_SEARCH_PARAM_IG_LOWCONF
  csdk/asr/models/<model_name_1>/settings/LH_SEARCH_PARAM_IG_HIGHCONF
For more information on adding additional configuration parameters, see the Configuration File page.
Audio Format
The audio data is a 16-bit signed PCM buffer in Little-Endian format.
It is always mono (1 channel), and the sample rate is 16 kHz.
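For reference, a buffer in this format can be produced from floating-point samples with the Python standard library alone. The function name `float_to_pcm16le` and the 440 Hz test tone are illustrative choices, not part of the VDK API.

```python
import math
import struct

SAMPLE_RATE = 16000  # Hz, mono


def float_to_pcm16le(samples):
    """Convert floats in [-1.0, 1.0] to 16-bit signed little-endian PCM bytes."""
    # Clamp out-of-range samples, then scale to the signed 16-bit range.
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    # "<" selects little-endian byte order, "h" a signed 16-bit integer.
    return struct.pack("<%dh" % len(ints), *ints)


# 100 ms of a 440 Hz sine tone as a single-channel buffer.
tone = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE // 10)]
pcm = float_to_pcm16le(tone)
print(len(pcm), "bytes")
```

Each sample occupies two bytes, so one second of audio in this format is 32,000 bytes.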
Sample project
A sample project is available for Speech Recognition usage with VDK Service (in C# or Python).