
Speech Recognition

Models

In VDK-Studio, you can create three types of models.

Type

Description

Static models

Predefined vocabularies and grammars. They define a fixed set of valid phrases, their structure, and optionally custom phonemes.

Dynamic models

A special type of static model that includes slots: placeholders you can fill with vocabulary at runtime. The base model is compiled in the cloud, then updated on the device when slots are filled, which requires an additional on-device compilation.

Free-speech models

Large-vocabulary models designed for open-ended input (e.g., dictation). Unlike static or dynamic models, they are not limited to a predefined set of phrases. They are implemented the same way as static models.

For static and free-speech models, send a request of the following form to the REST API:

JSON
{
  "stop_at_first_result": true,
  "models": {
    "model-name": { }
  }
}
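As an illustrative sketch, the payload above can be built programmatically before sending it to the REST API. This only constructs and serializes the JSON; the transport details (endpoint URL, headers) depend on your VDK Service configuration and are not shown. The model name "drinks" is a hypothetical example.

```python
import json

def build_recognition_request(model_name: str, stop_at_first_result: bool = True) -> str:
    """Serialize a recognition request for a static or free-speech model."""
    payload = {
        "stop_at_first_result": stop_at_first_result,
        # An empty object: static and free-speech models take no extra options here.
        "models": {model_name: {}},
    }
    return json.dumps(payload)

print(build_recognition_request("drinks"))
```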

To provide slot values for a Dynamic model, send the following request:

JSON
{
  "stop_at_first_result": true,
  "models": {
    "model-name": {
      "slots": {
        "slot-name": {
          "values": ["Coffee", "Cola", "Mojito", "Cup of tea"]
        }
      }
    }
  }
}
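A small helper can make slot filling less error-prone when vocabulary is assembled at runtime. This is a sketch that only builds the JSON body shown above; the model and slot names are placeholders taken from the example.

```python
import json

def build_dynamic_request(model_name, slots, stop_at_first_result=True):
    """Build a request that fills each named slot with runtime vocabulary.

    `slots` maps slot names to iterables of vocabulary strings, e.g.
    {"slot-name": ["Coffee", "Cola"]}.
    """
    return json.dumps({
        "stop_at_first_result": stop_at_first_result,
        "models": {
            model_name: {
                "slots": {
                    name: {"values": list(values)}
                    for name, values in slots.items()
                }
            }
        },
    })

print(build_dynamic_request(
    "model-name",
    {"slot-name": ["Coffee", "Cola", "Mojito", "Cup of tea"]},
))
```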

The parsed result will typically contain multiple hypotheses, each with an associated confidence score: the higher the score, the better the match.

In general, a confidence threshold between 4000 and 5000 is considered acceptable (for csdk), though this may vary depending on your specific use case.
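As a sketch of client-side filtering, assuming each parsed hypothesis is exposed as a dictionary with "text" and "confidence" fields (these field names are illustrative, not the actual VDK result schema), selecting the best acceptable hypothesis could look like this:

```python
def best_hypothesis(hypotheses, threshold=4500):
    """Return the highest-confidence hypothesis at or above `threshold`, or None.

    `hypotheses` is assumed to be a list of dicts like
    {"text": "...", "confidence": <int>}; adapt the keys to the actual
    result schema of your VDK version.
    """
    acceptable = [h for h in hypotheses if h["confidence"] >= threshold]
    return max(acceptable, key=lambda h: h["confidence"], default=None)

results = [
    {"text": "coffee", "confidence": 5200},
    {"text": "cola", "confidence": 3100},
]
print(best_hypothesis(results))  # only "coffee" clears the threshold
```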

If "stop_at_first_result": true, the process will stop at the first result, regardless of its confidence level.

To ensure you only stop when a result meets your desired confidence threshold, you have two options:

  1. Set "stop_at_first_result" to false and wait until a result with a satisfactory confidence level is returned.

  2. Configure confidence thresholds directly by setting the following parameters in config/vsdk.json file:

    • csdk/asr/models/<model_name_1>/settings/LH_SEARCH_PARAM_IG_LOWCONF

    • csdk/asr/models/<model_name_1>/settings/LH_SEARCH_PARAM_IG_HIGHCONF
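As an illustration only, if the slash-separated parameter paths above map to nested objects in config/vsdk.json (an assumption; the Configuration File page is the authoritative reference for the layout, and the threshold values shown are just the acceptable range mentioned above):

```json
{
  "csdk": {
    "asr": {
      "models": {
        "<model_name_1>": {
          "settings": {
            "LH_SEARCH_PARAM_IG_LOWCONF": 4000,
            "LH_SEARCH_PARAM_IG_HIGHCONF": 5000
          }
        }
      }
    }
  }
}
```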

For more information on adding additional configuration parameters, see the Configuration File page.

Audio Format

The audio data is a 16-bit signed PCM buffer in little-endian byte order.
It is always mono (1 channel), and the sample rate is 16 kHz.
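To illustrate the expected layout, the following sketch generates a short sine tone in exactly this format (16-bit signed, little-endian, mono, 16 kHz) using only the standard library. This shows the byte layout only; how the buffer is then delivered to the VDK Service depends on your integration.

```python
import math
import struct

SAMPLE_RATE = 16000  # Hz, mono

def tone_pcm(freq_hz: float, duration_s: float) -> bytes:
    """Generate a sine tone as 16-bit signed little-endian mono PCM."""
    n = int(SAMPLE_RATE * duration_s)
    samples = (
        int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE))
        for t in range(n)
    )
    # "<h" packs each sample as a little-endian signed 16-bit integer.
    return b"".join(struct.pack("<h", s) for s in samples)

buf = tone_pcm(440.0, 0.1)
print(len(buf))  # 1600 samples * 2 bytes = 3200
```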

Sample project

Sample projects demonstrating Speech Recognition with the VDK Service are available in C# and Python.

Python
  • Download and extract the zip below

  • Change into the project directory

  • Create and activate a virtual environment (Python Venv documentation)

  • Install the project: pip install -e .

  • Run the script: vdk-recognition --help

If the list of options appears, you can start your configured VDK Service and interact with it using those options. For example, vdk-recognition --list-models lists the models available for recognition.

VdkServiceSample-VoiceRecognition (python).zip

C#
  • Download and extract the zip below

  • Open the project solution (.sln)

  • Set VoiceRecognition as the startup project

  • Build and run the project with the --help argument

If the list of options appears, you can start your configured VDK Service and interact with it using those options. For example, --list lists the models available for recognition.

VdkServiceSample-VoiceRecognition (C#).zip
