Speech Enhancement

Introduction

Speech enhancement improves the quality of audio captured from the microphone by reducing noise, removing artifacts, and enhancing clarity before the audio is sent to ASR, saved to a file, or forwarded elsewhere.

This makes it especially useful when you want to improve speech recognition accuracy.

You can configure your Speech Enhancer in VDK-Studio. No single configuration fits every use case, but you can start from one of the available templates and pick the one that best matches your needs.

Format

Input: 16 kHz, 16-bit signed PCM, mono or stereo.

Reference: 16 kHz, 16-bit signed PCM, mono.

Output: 16 kHz, 16-bit signed PCM, mono.

Whether the input is mono or stereo is defined when you configure the Speech Enhancement technology in VDK-Studio.
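The format above fixes how many bytes a given duration of audio occupies, which is useful when sizing buffers and audio chunks. The sketch below only does that arithmetic. Two points are assumptions not stated on this page: samples are little-endian, and stereo input is interleaved frame by frame (left, right, left, right, ...). The class name PcmFormat is hypothetical.

```java
/**
 * Size arithmetic for the 16 kHz, 16-bit signed PCM format used by Speech Enhancement.
 * Assumptions (not stated in the documentation): little-endian samples, and stereo
 * input interleaved frame by frame (left, right, left, right, ...).
 */
final class PcmFormat {
    static final int SAMPLE_RATE_HZ = 16_000; // 16 kHz
    static final int BYTES_PER_SAMPLE = 2;    // 16-bit signed PCM

    /** Bytes per second of audio for 1 (mono) or 2 (stereo) channels. */
    static int bytesPerSecond(int channels) {
        return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * channels;
    }

    /** Bytes covering durationMs milliseconds of audio (16 samples per ms at 16 kHz). */
    static int bytesForDuration(int durationMs, int channels) {
        return (SAMPLE_RATE_HZ / 1000) * durationMs * BYTES_PER_SAMPLE * channels;
    }

    /** Encodes 16-bit samples as bytes, low byte first (assumed little-endian). */
    static byte[] toLittleEndianBytes(short[] samples) {
        byte[] out = new byte[samples.length * BYTES_PER_SAMPLE];
        for (int i = 0; i < samples.length; i++) {
            out[2 * i] = (byte) samples[i];            // low byte
            out[2 * i + 1] = (byte) (samples[i] >> 8); // high byte
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bytesPerSecond(1));       // 32000 (mono input, reference, output)
        System.out.println(bytesPerSecond(2));       // 64000 (stereo input)
        System.out.println(bytesForDuration(20, 1)); // 640 bytes per 20 ms mono chunk
    }
}
```

In other words, one second of mono audio is 32,000 bytes and one second of stereo input is 64,000 bytes, so a stereo input chunk is always twice the size of a mono reference chunk covering the same time span.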

Examples

The available routes are listed in the Speech Enhancement section of the REST API reference.

Enhancement

Before starting an enhancement, we can retrieve the list of available enhancers to make sure the enhancer we configured in VDK-Studio is available.


[GET] /speech-enhancement/enhancers
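As an illustration, here is a minimal Java 11+ sketch of this call using the standard java.net.http client. The base URL is a hypothetical placeholder (use the address your VDK Service actually listens on), and the response body is returned as raw text because its exact JSON shape is not described here.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Minimal client for [GET] /speech-enhancement/enhancers. */
final class ListEnhancers {
    /** Builds the GET request against the given service address. */
    static HttpRequest buildRequest(String baseUrl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/speech-enhancement/enhancers"))
                .GET()
                .build();
    }

    /** Sends the request and returns the raw body; fails on any non-2xx status. */
    static String fetch(HttpClient client, String baseUrl) throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(baseUrl), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() / 100 != 2) {
            throw new IOException("Unexpected status: " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical default address: pass the real one as the first argument.
        String baseUrl = args.length > 0 ? args[0] : "http://localhost:8080";
        System.out.println(fetch(HttpClient.newHttpClient(), baseUrl));
    }
}
```

Checking that your enhancer's name appears in this body before calling the enhance route gives a clearer error than a failed enhancement request.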

Then we can perform the enhancement with the following route, passing the name of the configured enhancer in the request body:


[POST] /speech-enhancement/enhance


{
  "speech_enhancer": "my_enhancer"
}

If the request succeeds, we receive a token and can move on to the WebSocket API.
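A minimal Java 11+ sketch of the enhance request follows. It sends the JSON body shown above and returns the raw response, which should carry the token; how the token field is named is not assumed. The base URL is a hypothetical placeholder, and the Content-Type header is an assumption that the service expects JSON.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Minimal client for [POST] /speech-enhancement/enhance. */
final class StartEnhancement {
    /** Builds {"speech_enhancer": "<name>"}, escaping backslashes and quotes in the name. */
    static String jsonBody(String enhancerName) {
        String escaped = enhancerName.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"speech_enhancer\": \"" + escaped + "\"}";
    }

    static HttpRequest buildRequest(String baseUrl, String enhancerName) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/speech-enhancement/enhance"))
                .header("Content-Type", "application/json") // assumed content type
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody(enhancerName)))
                .build();
    }

    /** Returns the raw response body, which should contain the token for the WebSocket API. */
    static String start(HttpClient client, String baseUrl, String enhancerName)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(baseUrl, enhancerName), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() / 100 != 2) {
            throw new IOException("Enhancement request failed: " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical address and the example enhancer name from this page.
        String baseUrl = args.length > 0 ? args[0] : "http://localhost:8080";
        System.out.println(start(HttpClient.newHttpClient(), baseUrl, "my_enhancer"));
    }
}
```

Building the body with a dedicated method keeps the example dependency-free; in a real project a JSON library would handle escaping more completely.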

You can now send and receive audio through the newly opened socket (see WebSocket API | SEND Audio Chunk Message).
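Audio is usually streamed in small chunks rather than as one large buffer. The sketch below only splits a PCM buffer into fixed-size chunks and hands them, in order, to a send callback; the callback is a hypothetical hook, because the exact message format belongs to the SEND Audio Chunk Message reference and is not reproduced here. The 640-byte size used in the usage note (20 ms of mono audio) is an illustrative choice, not a documented requirement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Splits 16-bit PCM audio into fixed-size chunks for streaming over the socket. */
final class AudioChunker {
    /** Splits pcm into chunks of chunkBytes; the last chunk may be shorter. */
    static List<byte[]> split(byte[] pcm, int chunkBytes, int channels) {
        int frameBytes = 2 * channels; // one 16-bit sample per channel
        if (chunkBytes <= 0 || chunkBytes % frameBytes != 0) {
            // Never cut a sample (or a stereo frame) in half across two chunks.
            throw new IllegalArgumentException("chunkBytes must be a positive multiple of " + frameBytes);
        }
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < pcm.length; offset += chunkBytes) {
            chunks.add(Arrays.copyOfRange(pcm, offset, Math.min(offset + chunkBytes, pcm.length)));
        }
        return chunks;
    }

    /** Passes each chunk, in order, to send (hypothetical hook wrapping your socket write). */
    static void stream(byte[] pcm, int chunkBytes, int channels, Consumer<byte[]> send) {
        for (byte[] chunk : split(pcm, chunkBytes, channels)) {
            send.accept(chunk);
        }
    }
}
```

For example, AudioChunker.stream(pcm, 640, 1, chunk -> sendAudioChunk(chunk)) would stream mono audio in 20 ms pieces, where sendAudioChunk stands for whatever wraps your WebSocket write.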

Acoustic Echo Cancellation (AEC)

Acoustic Echo Cancellation (AEC) is an audio processing technique used to remove playback echo from the microphone signal when a device is simultaneously playing and recording audio.

Without AEC, the system may incorrectly interpret its own playback audio as user speech, which can negatively affect barge-in performance.

Android

AEC is already available in the Android API. To enable it, set the AudioSource to VOICE_COMMUNICATION in your AudioRecorder configuration:

Java

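import android.media.MediaRecorder.AudioSource;
import com.vivoka.vsdk.audio.producers.AudioRecorder;

// VOICE_COMMUNICATION is tuned for voice calls and uses the device's echo cancellation when available
AudioRecorder audioRecorder = new AudioRecorder(AudioSource.VOICE_COMMUNICATION);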

Once your audio is clean enough, you can use the Speech Detected and Silence Detected events of your ASR module to perform barge-in operations; for instance, you could stop the TTS pipeline when the user starts speaking.
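The barge-in logic itself can stay very small. The following sketch assumes you forward the ASR module's Speech Detected and Silence Detected events to two methods; BargeInController, TtsPlayback, onSpeechDetected, and onSilenceDetected are hypothetical names, not part of the VSDK API.

```java
/**
 * Barge-in sketch: stop TTS playback as soon as the ASR reports user speech.
 * All names here are hypothetical; wire them to your real ASR events and TTS pipeline.
 */
final class BargeInController {
    /** Minimal view of the TTS pipeline needed for barge-in. */
    interface TtsPlayback {
        boolean isPlaying();
        void stop();
    }

    private final TtsPlayback tts;
    private boolean userSpeaking = false;

    BargeInController(TtsPlayback tts) {
        this.tts = tts;
    }

    /** Call from the ASR module's Speech Detected event. */
    void onSpeechDetected() {
        userSpeaking = true;
        if (tts.isPlaying()) {
            tts.stop(); // the user interrupted the prompt
        }
    }

    /** Call from the ASR module's Silence Detected event. */
    void onSilenceDetected() {
        userSpeaking = false;
    }

    boolean isUserSpeaking() {
        return userSpeaking;
    }
}
```

This only works reliably when the microphone signal no longer contains the TTS output; otherwise the prompt itself would trigger Speech Detected, which is exactly what AEC prevents.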

Other platforms (Windows, Linux)

AEC uses the same API route; the difference lies in the enhancer configuration (with multi_mic enabled) and in the audio streams that are sent.


[POST] /speech-enhancement/enhance

See WebSocket API | Route Enhance.

When performing AEC, two audio streams must be sent through the socket: the primary input stream, which needs to be cleaned, and a reference stream (the playback audio), which will be removed from the input signal. The reference stream is identified using the is_reference parameter when sending the audio.
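Because the reference must be removed from the input, both streams should cover the same time spans. The sketch below pairs input and reference chunks and tags each with a flag mirroring is_reference. AudioMessage is a hypothetical in-memory type (the actual socket message format is defined in the WebSocket API reference), and sending each input chunk immediately followed by its reference chunk is an assumption rather than a documented ordering rule. The size check follows from the format section: the reference is mono, so a stereo input chunk is twice as large for the same duration.

```java
import java.util.ArrayList;
import java.util.List;

/** Pairs input and reference chunks for AEC streaming (illustrative sketch). */
final class AecStreams {
    /** Hypothetical message: a PCM chunk plus a flag mirroring the is_reference parameter. */
    static final class AudioMessage {
        final byte[] pcm;
        final boolean isReference;

        AudioMessage(byte[] pcm, boolean isReference) {
            this.pcm = pcm;
            this.isReference = isReference;
        }
    }

    /**
     * Interleaves input and reference chunks covering the same time span.
     * inputChannels is 1 or 2; the reference is always mono 16 kHz, 16-bit PCM.
     */
    static List<AudioMessage> interleave(List<byte[]> input, List<byte[]> reference, int inputChannels) {
        if (input.size() != reference.size()) {
            throw new IllegalArgumentException("both streams need the same number of chunks");
        }
        List<AudioMessage> out = new ArrayList<>(2 * input.size());
        for (int i = 0; i < input.size(); i++) {
            byte[] in = input.get(i);
            byte[] ref = reference.get(i);
            // Same duration means the input holds inputChannels times as many bytes.
            if (in.length != ref.length * inputChannels) {
                throw new IllegalArgumentException("chunk " + i + " covers different durations");
            }
            out.add(new AudioMessage(in, false));
            out.add(new AudioMessage(ref, true));
        }
        return out;
    }
}
```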

Sample project

Sample projects showing how to use Speech Enhancement with VDK Service are available in C# and Python.

Python

Download and extract the zip below
Head inside the project
(Optional) Create and activate a virtual environment (Python Venv documentation)
Install the project: pip install .
Run the script: vdk-enhancement --help

If you see the list of options, you can start your configured VDK Service and interact with it using the available options. For example, vdk-enhancement --list lists the available enhancers.

VdkServiceSample-SpeechEnhancement-Python-v1.0.0.zip

C#

Download and extract the zip below
Open the project solution (.sln)
Build and run the project with the argument “--help”

If you see the list of options, you can start your configured VDK Service and interact with it using the available options. For example, --list lists the available enhancers.

VdkServiceSample-SpeechEnhancement-CSharp-v1.0.0.zip

Assets

Along with these projects, we provide assets containing audio files that have been tested for AEC and noise reduction.

vdk-service-sample-assets-1.0.0.zip