Speech Recognition
Introduction
Models
In VDK-Studio, you can create three types of models.
| Type | Description |
| --- | --- |
| Static models | Predefined vocabularies and grammars. They define a fixed set of valid phrases, their structure, and optionally custom phonemes. |
| Dynamic models | A special type of static model that includes slots—placeholders you can fill with vocabulary at runtime. The base model is compiled in the cloud, then updated on the device when slots are filled, requiring an additional on-device compilation. |
| Free-speech models | Large-vocabulary models designed for open-ended input (e.g., dictation). Unlike static or dynamic models, they are not limited to a defined set of phrases. Implement them the same way as static models. |
Audio Format
The audio data is a 16-bit signed PCM buffer in little-endian format.
It is always mono (1 channel), and the sample rate is 16 kHz.
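As a sketch of producing audio in this format, the snippet below packs float samples into a 16-bit signed little-endian mono PCM buffer. The helper name and sample values are illustrative, not part of the VDK API:

```python
import struct

def to_pcm16le(samples):
    """Pack float samples in [-1.0, 1.0] into a 16-bit signed
    little-endian, mono PCM byte buffer (2 bytes per sample)."""
    ints = [max(-32768, min(32767, int(s * 32767))) for s in samples]
    # "<" forces little-endian byte order, "h" is a signed 16-bit int
    return struct.pack("<%dh" % len(ints), *ints)

# At 16 kHz mono, one second of audio is 16000 samples,
# i.e. 32000 bytes in this format.
buf = to_pcm16le([0.0, 0.5, -0.5])
```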
Examples
You can see the available routes in the REST API, under the Voice recognition section.
Recognition
First, verify that the service has exported the models you want by retrieving the list of available models. For this purpose we can use:
[GET] /voice-recognition/models
Now we want to perform the recognition by using the following route.
[POST] /voice-recognition/recognize
The request body is fully described in the REST API; here is a quick overview.
For static and free-speech models, the body looks like this:
{
  "stop_at_first_result": true,
  "models": {
    "model-name": {}
  }
}
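The body above can also be built programmatically. A minimal Python sketch (the helper name is our own, not part of the API):

```python
import json

def build_recognition_body(model_names, stop_at_first_result=True):
    """Build the /voice-recognition/recognize request body for
    static or free-speech models (empty config object per model)."""
    return {
        "stop_at_first_result": stop_at_first_result,
        "models": {name: {} for name in model_names},
    }

print(json.dumps(build_recognition_body(["model-name"]), indent=2))
```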
Dynamic models require, on top of the previous configuration, the values of their slots:
{
  "stop_at_first_result": true,
  "models": {
    "model-name": {
      "slots": {
        "slot-name": {
          "values": ["Coffee", "Cola", "Mojito", "Cup of tea"]
        }
      }
    }
  }
}
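The same body can be assembled from a mapping of slot names to their runtime values. A sketch, with a helper name of our own:

```python
def build_dynamic_body(model_name, slots, stop_at_first_result=True):
    """Build the request body for a dynamic model; `slots` maps
    each slot name to its list of runtime values."""
    return {
        "stop_at_first_result": stop_at_first_result,
        "models": {
            model_name: {
                "slots": {name: {"values": list(values)}
                          for name, values in slots.items()}
            }
        },
    }

body = build_dynamic_body(
    "model-name", {"slot-name": ["Coffee", "Cola", "Mojito", "Cup of tea"]})
```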
model-name and slot-name are placeholders for your values.
The parsed result will typically contain multiple hypotheses, each with an associated confidence score—the higher the score, the better the match.
In general, a confidence threshold between 4000 and 5000 is considered acceptable (for csdk), though this may vary depending on your specific use case.
If "stop_at_first_result": true, the process will stop at the first result, regardless of its confidence level.
To ensure you only stop when a result meets your desired confidence threshold, you have two options:
- Set stop_at_first_result to false and wait until a result with a satisfactory confidence level is returned.
- Configure confidence thresholds directly by setting the following parameters in the config/vsdk.json file:
  - csdk/asr/models/<model_name_1>/settings/LH_SEARCH_PARAM_IG_LOWCONF
  - csdk/asr/models/<model_name_1>/settings/LH_SEARCH_PARAM_IG_HIGHCONF
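As a sketch of the first option, the helper below picks the best hypothesis only once it clears a threshold. The list-of-dicts shape and the "confidence"/"text" field names are assumptions for illustration; check the REST API for the actual result schema:

```python
CONFIDENCE_THRESHOLD = 4500  # within the 4000-5000 range cited above

def best_hypothesis(hypotheses, threshold=CONFIDENCE_THRESHOLD):
    """Return the highest-confidence hypothesis if it clears the
    threshold, otherwise None (keep listening in that case).
    `hypotheses` is assumed to be a list of dicts with a
    "confidence" score; field names are illustrative."""
    if not hypotheses:
        return None
    top = max(hypotheses, key=lambda h: h["confidence"])
    return top if top["confidence"] >= threshold else None
```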
For more information on adding additional configuration parameters, see the Configuration File page.
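For illustration, the two parameters above would sit in config/vsdk.json roughly as follows. The nesting is inferred from the slash-separated paths, and the values are examples taken from the range cited earlier; verify both against the Configuration File page:

```json
{
  "csdk": {
    "asr": {
      "models": {
        "<model_name_1>": {
          "settings": {
            "LH_SEARCH_PARAM_IG_LOWCONF": 4000,
            "LH_SEARCH_PARAM_IG_HIGHCONF": 5000
          }
        }
      }
    }
  }
}
```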
UserWord
This is an advanced feature that enhances recognition accuracy for slot values.
It can only be used with slot values (dynamic models).
What we call a User Word is the result of processing a specific word with a user’s speech. This produces data that can be used to enhance recognition accuracy by accounting for variations in pronunciation — for example, different accents — as well as the user’s original pronunciation.
This involves a two-step process:
Enrollment
Usage
Enrollment
For this step, we need to answer several questions:
- Which user are we training for?
- Which word are we training?
- Which target model? (models sharing the same language share the same UserWord data)
Let’s imagine you want to train the word coffee for the user paul, targeting the model dynamic-drinks (which must contain a slot to be filled later with the value coffee).
The corresponding route for enrollment is:
[POST] /voice-recognition/userwords/enroll
Please refer to the REST API documentation for the actual body of the request.
This behaves like a recognition: you say what you want to be recognized as coffee and receive results. These results differ from regular recognition results and serve only to inform you whether or not the UserWord has been added.
You can train/enroll the same UserWord pair multiple times.
To inspect what has been trained so far you can use the following route:
[GET] /voice-recognition/userwords
For deletion, you have two options: delete everything, or delete the data of a single user.
[DELETE] /voice-recognition/userwords
[DELETE] /voice-recognition/userwords/{user}
Usage
The only difference from the usual recognition is that we pass user as a parameter of the model. Following the previous enrollment example, we want to recognize against dynamic-drinks, add the value coffee to our slot drinks, and specify that we want to use the model with the user paul.
[POST] /voice-recognition/recognize
{
  "stop_at_first_result": true,
  "models": {
    "dynamic-drinks": {
      "slots": {
        "drinks": {
          "values": ["coffee", ...]
        },
        ...
      },
      "user": "paul"
    }
  }
}
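This body is the dynamic-model payload from earlier plus the user field. A sketch of assembling it (the helper name is ours, not the API's):

```python
def build_user_body(model_name, slots, user, stop_at_first_result=True):
    """Build a dynamic-model request body that also names the user
    whose UserWord data should be applied during recognition."""
    return {
        "stop_at_first_result": stop_at_first_result,
        "models": {
            model_name: {
                "slots": {name: {"values": list(values)}
                          for name, values in slots.items()},
                "user": user,
            }
        },
    }

body = build_user_body("dynamic-drinks", {"drinks": ["coffee"]}, "paul")
```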
Sample project
A sample project is available for Speech Recognition usage with VDK Service (in C# or Python).