WebSocket API

Establishing a WebSocket Connection

When calling asynchronous routes in the REST API, you first obtain a token that grants access to a WebSocket channel.
Using the WebSocket protocol, connect to the following endpoint:

CODE

ws://{HOSTNAME}:{VDK_SERVICE_PORT}/v1/ws/{TOKEN}

Each WebSocket instance is bound to the specific task triggered by its corresponding route.
Its behavior may vary depending on which endpoint issued the token.

As of now, this is the only route in the WebSocket API, since all other interactions occur through the socket itself.
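As an illustration, the endpoint URL can be assembled from its three parts. This is a minimal sketch: the helper name `build_ws_url` and the example host, port, and token values are hypothetical, and percent-encoding the token is a defensive assumption since the token alphabet is not specified here.

```python
from urllib.parse import quote


def build_ws_url(hostname: str, port: int, token: str) -> str:
    """Build the WebSocket endpoint URL for a token issued by a REST route."""
    # Assumption: the token is percent-encoded so that any reserved
    # character it may contain cannot alter the URL path.
    return f"ws://{hostname}:{port}/v1/ws/{quote(token, safe='')}"


# Hypothetical values, for illustration only.
example_url = build_ws_url("localhost", 8080, "abc123")
```

The resulting string can be passed to any WebSocket client library.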

Working with WebSocket Routes

The socket exchanges data in JSON format.
You may encounter up to four top-level objects in the messages you receive.

Objects:

Event
Error
Result
Data
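A client typically needs to tell these objects apart before handling them. The sketch below assumes the objects appear as the lowercase top-level keys `event`, `error`, `result`, and `data`, as in the message examples in this document; the function name `classify_message` is hypothetical. Messages that carry none of these keys are reported as `unknown` rather than rejected.

```python
import json

# Assumption: top-level keys are lowercase, as in the documented examples.
TOP_LEVEL_KEYS = ("event", "error", "result", "data")


def classify_message(raw: str) -> tuple[str, object]:
    """Return (kind, payload) for one JSON text frame read from the socket."""
    message = json.loads(raw)
    for key in TOP_LEVEL_KEYS:
        if key in message:
            return key, message[key]
    # No known key found: hand the whole message back to the caller.
    return "unknown", message
```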

Audio can be sent to the service or received from it, and it is Base64-encoded for transport through the socket. The same message structure applies in both directions:

CODE

{
  "data": "data:audio/pcm;base64,<base64_audio>",
  "last": <bool>
}
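The following sketch shows one way to produce and consume such messages. The helper names and the default chunk size of 4096 bytes are assumptions made for illustration; this document does not prescribe a chunk size. The last chunk of a buffer is flagged with `"last": true`.

```python
import base64
import json

DATA_URI_PREFIX = "data:audio/pcm;base64,"


def encode_audio_chunk(pcm: bytes, last: bool) -> str:
    """Wrap raw PCM bytes in the JSON audio chunk message."""
    payload = DATA_URI_PREFIX + base64.b64encode(pcm).decode("ascii")
    return json.dumps({"data": payload, "last": last})


def decode_audio_chunk(raw: str) -> tuple[bytes, bool]:
    """Extract the raw PCM bytes and the `last` flag from a received message."""
    message = json.loads(raw)
    data = message["data"]
    if not data.startswith(DATA_URI_PREFIX):
        raise ValueError("unexpected audio data prefix")
    return base64.b64decode(data[len(DATA_URI_PREFIX):]), bool(message["last"])


def iter_audio_messages(pcm: bytes, chunk_size: int = 4096):
    """Yield one message per chunk; an empty buffer yields nothing."""
    # chunk_size is a hypothetical default, not a documented requirement.
    for offset in range(0, len(pcm), chunk_size):
        chunk = pcm[offset:offset + chunk_size]
        yield encode_audio_chunk(chunk, last=offset + chunk_size >= len(pcm))
```

Each yielded string can be sent as a text frame on the socket.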

Below, we describe in detail how the socket behaves for each supported technology.

Advanced Recognition

ROUTE Recognize

CODE

/v1/advanced-recognition/recognize

Messages

SEND Audio Chunk Message

JSON

{
  "data": "data:audio/pcm;base64,<base64_audio>",
  "last": <bool>
}

RECEIVE Asr Result Message

JSON

{
  "result": {
    "technology": "asr",
    "model_name": <string>,
    "type": <int>,
    "type_string": <string>,
    "is_final": <bool>,
    "begin_time": <int>,
    "end_time": <int>,
    "hypotheses": [ <hypothesis>, ... ]
  }
}

Fields	Possible values	Description
`model_name`	-	The model name associated with the result.
`type`	[`0,1`]	The result type as an int value
`type_string`	[ `ASR`, `NLU` ]	The result type as a string value
`is_final`	[ `false`, `true` ]	Indicates whether this result is final. If true, this is the last time this result will be returned; otherwise, it is an interim result and may be updated later.
`begin_time`	[`0, INT_MAX`]	The system time in milliseconds at the start of the hypothesis recognition operation.
`end_time`	[`0, INT_MAX`]	The system time in milliseconds at the end of the hypothesis recognition operation.
`confidence`	[`0, 10000`]	Indicates the likelihood the recognized words are correct.
`hypotheses`	-	A JSON array containing all the hypotheses of the recognized speech content.

DETAILS Asr Result Hypothesis

JSON

  "hypotheses": [
    {
      "confidence": <int>,
      "begin_time": <int>,
      "end_time": <int>,
      "start_rule": <string>,
      "items": [ <item>, ... ]
    },
    ...
  ]

Fields	Description
`start_rule`	Represents the entry point of the grammar (<main>).
`items`	A JSON array containing all the matched tokens. An item object can be either a type `tag` or a `terminal`.

DETAILS Asr Result Item (Orthography)

JSON

  "items": [
    {
      "confidence": <int>,
      "begin_time": <int>,
      "end_time": <int>,
      "type": "terminal",
      "orthography": <string>
    },
    ...
  ]

Fields	Description
`orthography`	The matched terminal token.

DETAILS Asr Result Item (Tag)

JSON

  "items": [
    {
      "confidence": <int>,
      "begin_time": <int>,
      "end_time": <int>,
      "type": "tag",
      "name": <string>,
      "items": [ <item>, ... ]
    },
    ...
  ]

Fields	Description
`name`	Represents the name given as an attribute to the `!tag` directive. Note that when choosing `vsdk-csdk` this name becomes the concatenation of the grammar name and the actual tag name.
`items`	Same as `Result item (Orthography)`
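Because `tag` items can contain further items, extracting the recognized text means walking the structure recursively. The sketch below is one possible approach; the function names are hypothetical, and picking the hypothesis with the highest `confidence` is an assumption about how a client would rank them.

```python
from typing import Optional


def collect_orthography(items: list) -> list:
    """Return the terminal tokens of an item list, descending into tags."""
    words = []
    for item in items:
        if item.get("type") == "terminal":
            words.append(item["orthography"])
        elif item.get("type") == "tag":
            words.extend(collect_orthography(item.get("items", [])))
    return words


def best_hypothesis(result: dict) -> Optional[dict]:
    """Return the hypothesis with the highest confidence, or None if empty."""
    # Assumption: a higher confidence value means a more likely hypothesis.
    return max(result.get("hypotheses", []),
               key=lambda h: h["confidence"], default=None)


def transcript(result: dict) -> str:
    """Join the terminal tokens of the best hypothesis with spaces."""
    hypothesis = best_hypothesis(result)
    if hypothesis is None:
        return ""
    return " ".join(collect_orthography(hypothesis.get("items", [])))
```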

RECEIVE Biometrics Result Message

JSON

{
  "result": {
    "technology": "biometrics",
    "model_name": <string>,
    "id": <string>,
    "probability": <double>,
    "score": <double>
  }
}

RECEIVE Event Message

JSON

{
  "event": {
    "technology": <string>,
    "model_name": <string>,
    "code": <int>,
    "code_string": <string>,
    "message": <string>,
    "timestamp": <int>
  }
}

Technologies	Description
`asr`	Voice recognition
`biometrics`	Voice biometrics

Asr Events	Code	Description
`RECOGNIZER_STARTED`	0	Indicates that the recognizer has started processing speech input.
`RECOGNIZER_STOPPED`	1	Indicates that the recognizer is no longer processing speech input.
`SPEECH_DETECTED`	2	Indicates that the recognizer detects input that it can identify as speech.
`SILENCE_DETECTED`	3	Indicates that the recognizer is receiving silence or non-speech.
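On the client side, the event codes in the table above can be mirrored in an enumeration so that handlers compare names rather than bare integers. This is a sketch; the class name `AsrEvent` and the helper `parse_asr_event` are hypothetical.

```python
from enum import IntEnum


class AsrEvent(IntEnum):
    """ASR event codes as listed in the table above."""
    RECOGNIZER_STARTED = 0
    RECOGNIZER_STOPPED = 1
    SPEECH_DETECTED = 2
    SILENCE_DETECTED = 3


def parse_asr_event(event: dict) -> AsrEvent:
    """Map the `code` field of an event object to an AsrEvent member.

    Raises ValueError for a code that is not in the table.
    """
    return AsrEvent(event["code"])
```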

RECEIVE Error Message

JSON

{
  "error": {
    "technology": <string>,
    "model_name": <string>,
    "type": <string>,
    "code": <int>,
    "code_string": <string>,
    "message": <string>
  }
}
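A client may find it convenient to turn an error message into an exception. The sketch below is one way to do so; the names `VdkServiceError` and `raise_for_error` are hypothetical. The `technology` and `model_name` fields are read optionally, since the error messages of some routes do not include them.

```python
class VdkServiceError(Exception):
    """Raised when the socket delivers a message with an `error` object."""

    def __init__(self, error: dict):
        self.type = error.get("type")
        self.code = error.get("code")
        self.code_string = error.get("code_string")
        # Optional fields: absent on routes that serve a single technology.
        self.technology = error.get("technology")
        self.model_name = error.get("model_name")
        super().__init__(
            f"{self.code_string} ({self.code}): {error.get('message')}")


def raise_for_error(message: dict) -> None:
    """Raise VdkServiceError if the decoded message carries an error."""
    if "error" in message:
        raise VdkServiceError(message["error"])
```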

Voice Recognition

ROUTE Recognize

CODE

/v1/voice-recognition/recognize

Messages

SEND Audio Chunk Message

JSON

{
  "data": "data:audio/pcm;base64,<base64_audio>",
  "last": <bool>
}

RECEIVE Result Message

JSON

{
  "result": {
    "type": <int>,
    "type_string": <string>,
    "is_final": <bool>,
    "begin_time": <int>,
    "end_time": <int>,
    "hypotheses": [ <hypothesis>, ... ]
  }
}

Fields	Possible values	Description
`type`	[`0,1`]	The result type as an int value
`type_string`	[ `ASR`, `NLU` ]	The result type as a string value
`is_final`	[ `false`, `true` ]	Indicates whether this result is final. If true, this is the last time this result will be returned; otherwise, it is an interim result and may be updated later.
`begin_time`	[`0, INT_MAX`]	The system time in milliseconds at the start of the hypothesis recognition operation.
`end_time`	[`0, INT_MAX`]	The system time in milliseconds at the end of the hypothesis recognition operation.
`confidence`	[`0, 10000`]	Indicates the likelihood the recognized words are correct.
`hypotheses`	-	A JSON array containing all the hypotheses of the recognized speech content.

DETAILS Result Hypothesis

JSON

  "hypotheses": [
    {
      "confidence": <int>,
      "begin_time": <int>,
      "end_time": <int>,
      "start_rule": <string>,
      "items": [ <item>, ... ]
    },
    ...
  ]

Fields	Description
`start_rule`	Represents the entry point of the grammar (<main>).
`items`	A JSON array containing all the matched tokens. An item object can be either a type `tag` or a `terminal`.

DETAILS Result Item (Orthography)

JSON

  "items": [
    {
      "confidence": <int>,
      "begin_time": <int>,
      "end_time": <int>,
      "type": "terminal",
      "orthography": <string>
    },
    ...
  ]

Fields	Description
`orthography`	The matched terminal token.

DETAILS Result Item (Tag)

JSON

  "items": [
    {
      "confidence": <int>,
      "begin_time": <int>,
      "end_time": <int>,
      "type": "tag",
      "name": <string>,
      "items": [ <item>, ... ]
    },
    ...
  ]

Fields	Description
`name`	Represents the name given as an attribute to the `!tag` directive. Note that when choosing `vsdk-csdk` this name becomes the concatenation of the grammar name and the actual tag name.
`items`	Same as `Result item (Orthography)`

RECEIVE Event Message

JSON

{
  "event": {
    "code": <int>,
    "code_string": <string>,
    "message": <string>,
    "timestamp": <int>
  }
}

Events	Code	Description
`RECOGNIZER_STARTED`	0	Indicates that the recognizer has started processing speech input.
`RECOGNIZER_STOPPED`	1	Indicates that the recognizer is no longer processing speech input.
`SPEECH_DETECTED`	2	Indicates that the recognizer detects input that it can identify as speech.
`SILENCE_DETECTED`	3	Indicates that the recognizer is receiving silence or non-speech.

RECEIVE Error Message

JSON

{
  "error": {
    "type": <string>,
    "code": <int>,
    "code_string": <string>,
    "message": <string>
  }
}

Voice Synthesis

ROUTE Synthesize

CODE

/v1/voice-synthesis/synthesize

Messages

RECEIVE Audio Chunk Message

JSON

{
  "data": "data:audio/pcm;base64,<base64_audio>",
  "last": <bool>
}
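To reconstruct the synthesized audio, a client can decode each chunk and concatenate the bytes until a chunk arrives with `"last": true`. The sketch below works on any iterable of JSON text frames; the function name `assemble_audio` is hypothetical, and skipping messages without a `data` key (such as events) is an assumption about how a client would treat them. It requires Python 3.9 or later for `str.removeprefix`.

```python
import base64
import json

DATA_URI_PREFIX = "data:audio/pcm;base64,"


def assemble_audio(raw_messages) -> bytes:
    """Concatenate the PCM payloads of received audio chunk messages."""
    audio = bytearray()
    for raw in raw_messages:
        message = json.loads(raw)
        if "error" in message:
            raise RuntimeError(message["error"].get("message", "request failed"))
        if "data" not in message:
            continue  # not an audio chunk, for example an event message
        audio += base64.b64decode(message["data"].removeprefix(DATA_URI_PREFIX))
        if message.get("last"):
            break  # final chunk received
    return bytes(audio)
```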

RECEIVE Event Message

JSON

{
  "event": {
    "code": <int>,
    "code_string": <string>,
    "message": <string>,
    "timestamp": <int>
  }
}

Fields	Possible values
`code`	[`0,7`]
`code_string`	[ `NativeEvent, GenerationStarted, GenerationEnded, ProcessFinished,` `TextRewritten, Marker, WordMarkerStart, WordMarkerEnd` ]

RECEIVE Error Message

JSON

{
  "error": {
    "type": <string>,
    "code": <int>,
    "code_string": <string>,
    "message": <string>
  }
}

Voice Biometrics

ROUTE Authenticate

CODE

/v1/voice-biometrics/authenticate

Messages

SEND Audio Chunk Message

JSON

{
  "data": "data:audio/pcm;base64,<base64_audio>",
  "last": <bool>
}

RECEIVE Result Message

JSON

{
  "id": <string>,
  "probability": <double>,
  "score": <double>
}

RECEIVE Error Message

JSON

{
  "error": {
    "type": <string>,
    "code": <int>,
    "code_string": <string>,
    "message": <string>
  }
}

ROUTE Identify

CODE

/v1/voice-biometrics/identify

Messages

SEND Audio Chunk Message

JSON

{
  "data": "data:audio/pcm;base64,<base64_audio>",
  "last": <bool>
}

RECEIVE Result Message

JSON

{
  "id": <string>,
  "probability": <double>,
  "type": <string>
}

ROUTE Enroll

CODE

/v1/voice-biometrics/enroll

Messages

SEND Audio Chunk Message

JSON

{
  "data": "data:audio/pcm;base64,<base64_audio>",
  "last": <bool>
}

RECEIVE Result Message

JSON

{
  "accepted": <bool>,
  "progress": <int>,
  "speech_duration": <double>,
  "utterances": [ <utterance>, ... ]
}

DETAILS Result utterance

JSON

  "utterances": [
    {
      "accepted": <bool>,
      "contains_speech": <bool>,
      "enough_speech": <bool>,
      "is_band_limited": <bool>,
      "is_consistent": <bool>,
      "is_peak_clipped": <bool>,
      "is_snr_ok": <bool>,
      "snr": <double>,
      "speech_duration": <double>
    },
    ...
  ]

Fields	Description
`accepted`	Indicates that an utterance is valid and could be added to the enrollment profile.
`contains_speech`	Indicates whether the audio contains speech.
`enough_speech`	Indicates whether the given speech duration is enough to pass the enrollment process checks.
`is_band_limited`	Indicates whether the utterance is band-limited.
`is_consistent`	Indicates whether the utterance is consistent with the previous utterances.
`is_peak_clipped`	Indicates whether the degree of peak clipping is below a certain threshold.
`is_snr_ok`	Indicates whether the SNR value is sufficiently high.
`snr`	Represents the signal-to-noise ratio of the enrollment utterance. SNR value is measured in dB.
`speech_duration`	Represents the speech duration within an audio input.
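The per-utterance flags can be used to explain to a user why an utterance was rejected. The sketch below is a hypothetical helper that reports only the four checks whose meaning is unambiguous from the table above (`contains_speech`, `enough_speech`, `is_consistent`, `is_snr_ok`); how to interpret `is_band_limited` and `is_peak_clipped` is left to the caller.

```python
# Assumption: for these four flags, a value of false means the check failed.
CLEAR_CHECKS = ("contains_speech", "enough_speech", "is_consistent", "is_snr_ok")


def failed_checks(utterance: dict) -> list:
    """Return the names of the unambiguous checks that an utterance failed."""
    return [name for name in CLEAR_CHECKS if utterance.get(name) is False]


def summarize_enrollment(result: dict) -> dict:
    """Summarize an enroll result: overall status plus rejected utterances."""
    rejected = {
        index: failed_checks(utterance)
        for index, utterance in enumerate(result.get("utterances", []))
        if not utterance.get("accepted", False)
    }
    return {
        "accepted": result.get("accepted", False),
        "progress": result.get("progress"),
        "rejected": rejected,  # maps utterance index to failed check names
    }
```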

RECEIVE Event Message

JSON

{
  "event": {
    "code": 0,
    "code_string": "INFO",
    "message": <string>,
    "timestamp": <int>
  }
}

RECEIVE Error Message

JSON

{
  "error": {
    "type": <string>,
    "code": <int>,
    "code_string": <string>,
    "message": <string>
  }
}

Speech Enhancement

ROUTE Enhance

CODE

/v1/speech-enhancement/enhance

Messages

SEND Audio Chunk Message

JSON

{
  "data": "data:audio/pcm;base64,<base64_audio>",
  "last": <bool>
}

RECEIVE Audio Chunk Message

JSON

{
  "data": "data:audio/pcm;base64,<base64_audio>",
  "last": <bool>
}