Speech Enhancement
Introduction
Speech enhancement improves the quality of audio captured from the microphone by reducing noise, removing artifacts, and enhancing clarity before the audio is sent to ASR, saved to a file, or forwarded elsewhere.
This makes it especially useful when you want to improve speech recognition accuracy.
You can configure your Speech Enhancer using VDK-Studio. No single configuration fits all use cases, but you can start from the available templates and choose the one that best matches your needs.
Format
Input: 16 kHz, 16-bit signed PCM, mono or stereo.
Reference: 16 kHz, 16-bit signed PCM, mono.
Output: 16 kHz, 16-bit signed PCM, mono.
Note that whether the input is mono or stereo is defined when configuring a Speech Enhancement technology in VDK-Studio.
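Before streaming a recorded file, it can help to check that it matches the format listed above. The following is a minimal sketch using Python's standard wave module; the helper name check_input_format is illustrative and is not part of the VDK API.

```python
import wave

EXPECTED_RATE_HZ = 16_000    # 16 kHz
EXPECTED_SAMPLE_WIDTH = 2    # 16-bit PCM = 2 bytes per sample


def check_input_format(path, allow_stereo=True):
    """Return a list of format problems for a WAV file (empty if it matches)."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {EXPECTED_RATE_HZ} Hz"
            )
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(
                f"sample width is {wav.getsampwidth() * 8} bits, expected 16 bits"
            )
        max_channels = 2 if allow_stereo else 1
        if not 1 <= wav.getnchannels() <= max_channels:
            problems.append(
                f"{wav.getnchannels()} channels, expected at most {max_channels}"
            )
    return problems
```

Pass allow_stereo=False when checking a reference file, since the reference stream must be mono.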
Examples
The available routes are listed in the Speech Enhancement section of the REST API documentation.
Enhancement
Before starting the actual enhancement, we can retrieve the list of available enhancers to make sure the enhancer we configured in VDK-Studio is available.
[GET] /speech-enhancement/enhancers
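As an illustration, the route above could be called from Python with the standard library alone. This is a sketch under assumptions: the base URL is a placeholder to replace with the address of your VDK Service, the function names are hypothetical, and the response is returned as parsed JSON without assuming its structure.

```python
import json
import urllib.request

# Assumption: placeholder address, replace with your VDK Service host and port.
BASE_URL = "http://localhost:8080"


def build_list_enhancers_request(base_url=BASE_URL):
    """Build the GET request for the enhancer list without sending it."""
    return urllib.request.Request(
        f"{base_url}/speech-enhancement/enhancers", method="GET"
    )


def list_enhancers(base_url=BASE_URL):
    """Send the request and return the decoded JSON response as-is."""
    with urllib.request.urlopen(build_list_enhancers_request(base_url)) as response:
        return json.load(response)
```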
Then we can perform the enhancement by using the following route.
[POST] /speech-enhancement/enhance
{
"speech_enhancer": "my_enhancer"
}
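The same request can be sketched in Python as follows. The base URL is a placeholder and the function names are hypothetical; only the route and the speech_enhancer field come from the example above. The response is returned as parsed JSON, since the exact name of the token field is not shown here.

```python
import json
import urllib.request

# Assumption: placeholder address, replace with your VDK Service host and port.
BASE_URL = "http://localhost:8080"


def build_enhance_request(enhancer_name, base_url=BASE_URL):
    """Build the POST request that starts an enhancement, without sending it."""
    body = json.dumps({"speech_enhancer": enhancer_name}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/speech-enhancement/enhance",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_enhancement(enhancer_name, base_url=BASE_URL):
    """Send the request and return the decoded JSON response as-is."""
    with urllib.request.urlopen(build_enhance_request(enhancer_name, base_url)) as response:
        return json.load(response)
```

Calling start_enhancement("my_enhancer") corresponds to the request body shown above.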
If the request is successful, we receive a token and we can head to the WebSocket API.
You can now send and receive audio through the newly opened socket; see WebSocket API | SEND-Audio-Chunk-Message.
Acoustic Echo Cancellation (AEC)
Acoustic Echo Cancellation (AEC) is an audio processing technique used to remove playback echo from the microphone signal when a device is simultaneously playing and recording audio.
Without AEC, the system may incorrectly interpret its own playback audio as user speech, which can negatively affect barge-in performance.
Android
AEC is already available in the Android API. To enable it, set the AudioSource to VOICE_COMMUNICATION in your AudioRecorder configuration:
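import android.media.MediaRecorder.AudioSource; // assumption: Android's standard AudioSource constants
import com.vivoka.vsdk.audio.producers.AudioRecorder;
// VOICE_COMMUNICATION uses the device's echo cancellation when available.
AudioRecorder audioRecorder = new AudioRecorder(AudioSource.VOICE_COMMUNICATION);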
Once the audio is clean enough, you can use the Speech Detected and Silence Detected events of your ASR module to perform barge-in operations. For instance, you could stop the TTS pipeline when speech is detected.
Other platforms (Windows, Linux)
AEC uses the same API route. The difference lies in the enhancer configuration (with multi_mic enabled) and in the audio streams that are sent.
[POST] /speech-enhancement/enhance
WebSocket API | Route-Enhance
When performing AEC, two audio streams must be sent through the socket: the primary input stream, which needs to be cleaned, and a reference stream, which will be removed from the input signal. The reference stream is identified using the is_reference parameter when sending the audio.
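The two-stream idea can be sketched as follows: both signals are cut into chunks, and each chunk is tagged as input or reference before being sent. Only the is_reference parameter comes from this page. The JSON layout, the data field, the base64 encoding, the chunk size, and the function names are all assumptions made for illustration; the actual message format is defined in the WebSocket API documentation.

```python
import base64
import json
from itertools import zip_longest

# 100 ms of 16 kHz, 16-bit mono audio: 16000 samples/s * 2 bytes * 0.1 s
CHUNK_BYTES = 3200


def iter_chunks(pcm, chunk_bytes=CHUNK_BYTES):
    """Split raw PCM bytes into fixed-size chunks (the last one may be shorter)."""
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def build_chunk_message(chunk, is_reference):
    """Serialize one chunk; the 'data' field is a hypothetical placeholder."""
    return json.dumps({
        "data": base64.b64encode(chunk).decode("ascii"),
        "is_reference": is_reference,
    })


def interleave_streams(input_pcm, reference_pcm):
    """Yield messages alternating between input chunks and reference chunks."""
    pairs = zip_longest(iter_chunks(input_pcm), iter_chunks(reference_pcm))
    for input_chunk, reference_chunk in pairs:
        if input_chunk is not None:
            yield build_chunk_message(input_chunk, is_reference=False)
        if reference_chunk is not None:
            yield build_chunk_message(reference_chunk, is_reference=True)
```

Each yielded message would then be written to the socket opened for the enhancement. Alternating the two streams keeps the reference roughly aligned in time with the input, which is a design choice of this sketch rather than a documented requirement.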
Sample project
A sample project is available for Speech Enhancement usage with VDK Service (in C# or Python).
Assets
Along with this project, we provide audio assets that have been tested for AEC and noise reduction.