
Speech Recognition

Introduction

Models

In VDK-Studio, you can create three types of models.

Static models
  Predefined vocabularies and grammars. They define a fixed set of valid phrases, their structure, and optionally custom phonemes.

Dynamic models
  A special type of static model that includes slots: placeholders you can fill with vocabulary at runtime. The base model is compiled in the cloud, then updated on the device when slots are filled, requiring an additional on-device compilation.

Free-speech models
  Large-vocabulary models designed for open-ended input (e.g., dictation). Unlike static or dynamic models, they are not limited to a defined set of phrases. They are implemented the same way as static models.

Audio Format

The audio data is a 16-bit signed PCM buffer in little-endian format.
It is always mono (1 channel), and the sample rate is 16 kHz.
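As an illustration of this format, the following sketch builds one second of silence as a buffer the service would accept. The `"<h"` format code stands for a little-endian signed 16-bit integer:

```python
import struct

SAMPLE_RATE = 16_000  # 16 kHz, mono (1 channel)
SECONDS = 1

# One signed 16-bit sample per frame; "<" = little-endian, "h" = int16.
samples = [0] * (SAMPLE_RATE * SECONDS)  # 1 second of silence
buffer = struct.pack(f"<{len(samples)}h", *samples)

# Each sample occupies 2 bytes, so the buffer size is rate * seconds * 2.
assert len(buffer) == SAMPLE_RATE * SECONDS * 2
```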

Examples

You can see the different routes available in the REST API, in the Voice recognition section.

Recognition

First, ensure the service has exported the models you want by retrieving the list of models available for recognition:

CODE
[GET] /voice-recognition/models

Now, perform the recognition using the following route:

CODE
[POST] /voice-recognition/recognize

The body of the request is fully described in the REST API; here is a quick overview.

For static and free-speech models, the body looks like this:

JSON
{
  "stop_at_first_result": true,
  "models": {
    "model-name": {}
  }
}

Dynamic models require, in addition to the previous configuration, the values of their slots:

JSON
{
  "stop_at_first_result": true,
  "models": {
    "model-name": {
      "slots": {
        "slot-name": {
          "values": ["Coffee", "Cola", "Mojito", "Cup of tea"]
        }
      }
    }
  }
}

model-name and slot-name are placeholders for your values.
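The bodies above can be built programmatically. The sketch below is a minimal helper (the function name is illustrative, not part of the VDK API) that covers both cases: omit `slots` for static and free-speech models, pass it for dynamic models.

```python
import json

def make_recognition_body(model_name, slots=None, stop_at_first_result=True):
    """Build the JSON body for [POST] /voice-recognition/recognize.

    `slots` maps slot names to lists of values; omit it for static
    or free-speech models. (Helper name is illustrative only.)
    """
    model_config = {}
    if slots:
        model_config["slots"] = {
            name: {"values": list(values)} for name, values in slots.items()
        }
    return {
        "stop_at_first_result": stop_at_first_result,
        "models": {model_name: model_config},
    }

body = make_recognition_body(
    "model-name", slots={"slot-name": ["Coffee", "Cola", "Mojito", "Cup of tea"]}
)
print(json.dumps(body, indent=2))
```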

The parsed result will typically contain multiple hypotheses, each with an associated confidence score—the higher the score, the better the match.

In general, a confidence threshold between 4000 and 5000 is considered acceptable (for csdk), though this may vary depending on your specific use case.

If "stop_at_first_result" is set to true, the process stops at the first result, regardless of its confidence level.

To ensure you only stop when a result meets your desired confidence threshold, you have two options:

  1. Set stop_at_first_result to false and wait until a result with a satisfactory confidence level is returned.

  2. Configure confidence thresholds directly by setting the following parameters in the config/vsdk.json file:

    • csdk/asr/models/<model_name_1>/settings/LH_SEARCH_PARAM_IG_LOWCONF

    • csdk/asr/models/<model_name_1>/settings/LH_SEARCH_PARAM_IG_HIGHCONF
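With option 1, the client filters results itself. The sketch below shows such client-side filtering; the result shape (a list of hypotheses with "text" and "confidence" fields) is an assumption for illustration, so check the REST API documentation for the actual field names.

```python
# Assumed hypothesis shape — verify against the REST API documentation.
hypotheses = [
    {"text": "cola", "confidence": 3200},
    {"text": "coffee", "confidence": 4800},
]

THRESHOLD = 4500  # within the 4000-5000 range suggested for csdk

# Keep only hypotheses meeting the threshold, then pick the best one.
accepted = [h for h in hypotheses if h["confidence"] >= THRESHOLD]
best = max(accepted, key=lambda h: h["confidence"], default=None)
```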

For more information on adding additional configuration parameters, see the Configuration File page.

UserWord

This is an advanced feature that enhances the recognition accuracy of slot values.

It can only be used with slot values (dynamic models).

What we call a User Word is the result of processing a specific word with a user’s speech. This produces data that can be used to enhance recognition accuracy by accounting for variations in pronunciation — for example, different accents — as well as the user’s original pronunciation.

This involves a two-step process:

  • Enrollment

  • Usage

Enrollment

For this step, we need to answer several questions:

  • Which user are we training for?

  • Which word are we training?

  • Which target model? (Models sharing the same language share the same UserWord data.)

Let’s imagine you want to train the word coffee for the user paul, targeting the model dynamic-drinks (which must contain a slot to be filled later with the value coffee).

The corresponding route for enrollment is:

CODE
[POST] /voice-recognition/userwords/enroll

Please refer to the REST API documentation for the actual body of the request.

Enrollment behaves like a recognition: you say what you want to be recognized as coffee and receive results. These results differ from recognition results, and their only purpose is to inform you whether the UserWord has been added.

You can enroll the same user/word pair multiple times.

To inspect what has been trained so far, use the following route:

CODE
[GET] /voice-recognition/userwords

Regarding deletion, you have two options: delete everything, or delete a single user.

CODE
[DELETE] /voice-recognition/userwords
[DELETE] /voice-recognition/userwords/{user}
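The two routes differ only by the optional user segment. A small sketch of a path builder (the helper name is illustrative; the routes are the documented ones above):

```python
from urllib.parse import quote

def userwords_delete_path(user=None):
    """Return the DELETE route: all UserWords, or a single user's only."""
    base = "/voice-recognition/userwords"
    # URL-encode the user name in case it contains reserved characters.
    return f"{base}/{quote(user)}" if user else base

assert userwords_delete_path() == "/voice-recognition/userwords"
assert userwords_delete_path("paul") == "/voice-recognition/userwords/paul"
```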

Usage

The only difference from the usual recognition is that we pass user as a parameter of the model. Following the enrollment example above, we want to recognize against dynamic-drinks, add the value coffee to our slot drinks, and specify that we want to use the model with the user paul.

CODE
[POST] /voice-recognition/recognize
JSON
{
  "stop_at_first_result": true,
  "models": {
    "dynamic-drinks": {
      "slots": {
        "drinks": {
          "values": ["coffee", ...]
        },
        ...
      },
      "user": "paul"
    }
  }
}
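The elided parts aside, a complete, valid version of this body can be built as below. This assumes a single drinks slot with one value; extend "values" (and add further slots) as needed.

```python
import json

# Minimal complete body for [POST] /voice-recognition/recognize,
# using the UserWord data enrolled for the user "paul".
body = {
    "stop_at_first_result": True,
    "models": {
        "dynamic-drinks": {
            "slots": {
                "drinks": {"values": ["coffee"]},
            },
            "user": "paul",
        }
    },
}
print(json.dumps(body))
```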

Sample project

A sample project is available for Speech Recognition usage with VDK Service (in C# or Python).

Python
  • Download and extract the zip below

  • Head inside the project

  • (Optional) Create and activate a virtual environment (Python Venv documentation)

  • Install the project: pip install .

  • Run the script: vdk-recognition --help

If you see the list of options, you can start your configured VDK Service and interact with it using the options available. For example vdk-recognition --list-models will list available models to recognize against.

VdkServiceSample-VoiceRecognition-Python-v1.0.0.zip

C#
  • Download and extract the zip below

  • Open the project solution (.sln)

  • Build and run project with the argument “--help”

If you see the list of options, you can start your configured VDK Service and interact with it using the options available. For example --list will list available models to recognize.

VdkServiceSample-VoiceRecognition-CSharp-v1.0.0.zip
