Speech Enhancement
Introduction
Speech enhancement allows you to improve the quality of audio captured from the microphone—reducing noise, removing artifacts, and enhancing clarity before sending it to ASR, saving it to file, or forwarding it elsewhere.
This makes it especially useful when you want to improve speech recognition accuracy.
You can configure your Speech Enhancer using VDK-Studio. No single configuration fits every use case, but you can start from the available templates and pick the one that best matches your needs.
Barge-In (AEC)
Acoustic Echo Cancellation (AEC) is a technique used to eliminate the echo that can occur when a device plays audio (e.g., TTS output) and simultaneously captures audio through its microphone. Without AEC, the playback audio may be picked up by the microphone and misinterpreted as user input—especially problematic in interactive voice applications.
Barge-In relies on AEC to ensure the system doesn’t mistakenly detect its own voice as user input.
AEC is already available in the Android API. To enable it, set the AudioSource to VOICE_COMMUNICATION when you create your AudioRecorder:
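import android.media.MediaRecorder.AudioSource;
import com.vivoka.vsdk.audio.producers.AudioRecorder;
// VOICE_COMMUNICATION is tuned for voice calls; Android applies echo cancellation if the device provides it.
AudioRecorder audioRecorder = new AudioRecorder(AudioSource.VOICE_COMMUNICATION);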
Once your audio is clean enough, you can use the Speech Detected and Silence Detected events from your ASR module to perform barge-in operations. For instance, you could stop the TTS pipeline as soon as speech is detected.
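The barge-in flow described above can be sketched as follows. This is a minimal illustration, not the VSDK API: `TtsPlayback` and `BargeInController` are hypothetical names, and how you subscribe to the Speech Detected and Silence Detected events depends on your ASR setup. The sketch only shows the decision logic: when speech is detected while TTS is playing, playback is stopped.

```java
// Hypothetical abstraction over your TTS pipeline (not a VSDK type).
interface TtsPlayback {
    boolean isPlaying();
    void stop();
}

// Hypothetical helper: forwards ASR speech/silence events to the TTS pipeline.
final class BargeInController {
    private final TtsPlayback tts;
    private int interruptions = 0;

    BargeInController(TtsPlayback tts) {
        this.tts = tts;
    }

    // Call this from your handler for the ASR "Speech Detected" event.
    void onSpeechDetected() {
        // Only interrupt when something is actually being played back.
        if (tts.isPlaying()) {
            tts.stop();
            interruptions++;
        }
    }

    // Call this from your handler for the ASR "Silence Detected" event.
    void onSilenceDetected() {
        // Nothing to do in this sketch; an application might resume prompts here.
    }

    // Number of times playback was interrupted by the user.
    int interruptionCount() {
        return interruptions;
    }
}
```

Keeping the TTS pipeline behind a small interface makes the barge-in decision easy to test without audio hardware; in an application, the implementation of `stop()` would call whatever stops your actual TTS playback.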
Audio Format
Input: 16 kHz, 16-bit signed PCM, mono or stereo.
Output: 16 kHz, 16-bit signed PCM, mono.
Note that whether the input is mono or stereo is defined when configuring a Speech Enhancement technology in VDK-Studio.
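To make the formats above concrete, here is a small sketch of the buffer-size arithmetic for 16 kHz, 16-bit PCM. `PcmFormat` and `bytesFor` are hypothetical helper names used for illustration only; they are not part of the VSDK.

```java
// Hypothetical helper for sizing buffers of 16 kHz, 16-bit signed PCM audio.
final class PcmFormat {
    static final int SAMPLE_RATE_HZ = 16_000;
    static final int BYTES_PER_SAMPLE = 2; // 16-bit signed PCM

    private PcmFormat() {}

    // Returns the number of bytes needed to hold durationMs of audio
    // with the given channel count (1 = mono, 2 = stereo).
    static int bytesFor(int durationMs, int channels) {
        if (channels != 1 && channels != 2) {
            throw new IllegalArgumentException("channels must be 1 or 2");
        }
        if (durationMs < 0) {
            throw new IllegalArgumentException("durationMs must be non-negative");
        }
        // Use long arithmetic so long durations do not overflow int.
        long frames = (long) SAMPLE_RATE_HZ * durationMs / 1000;
        return Math.toIntExact(frames * BYTES_PER_SAMPLE * channels);
    }
}
```

For example, `PcmFormat.bytesFor(20, 2)` gives 1280 bytes for 20 ms of stereo input, while `PcmFormat.bytesFor(20, 1)` gives 640 bytes for the same duration of mono output.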
Sample project
A sample project is available for Speech Enhancement usage with VDK Service (in C# or Python).