Speech Enhancement - Android
Introduction
Speech enhancement allows you to improve the quality of audio captured from the microphone—reducing noise, removing artifacts, and enhancing clarity before sending it to ASR, saving it to file, or forwarding it elsewhere.
This makes it especially useful when you want to improve speech recognition accuracy.
You can configure your Speech Enhancer using VDK-Studio. There’s no single configuration that fits all use cases, but you can start from one of the available templates and choose the one that best matches your needs.
In a pipeline, the Speech Enhancer sits between the Producer (e.g., the microphone) and the Consumer. To learn more about pipelines, check the Get Started guide.
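As a quick preview of that layout (the full, step-by-step version appears in the Build Pipeline section below), here is a compact sketch. The enhancer name "enhancer-1" and the output path are placeholders for your own configuration, and the snippet assumes VSDK and the Speech Enhancement engine have already been initialized as described later in this guide.
import android.media.AudioFormat;
import com.vivoka.vsdk.audio.Pipeline;
import com.vivoka.vsdk.audio.consumers.File;
import com.vivoka.vsdk.audio.producers.AudioRecorder;
// Producer (microphone) -> Modifier (Speech Enhancer) -> Consumer (file)
Pipeline pipeline = new Pipeline();
pipeline.setProducer(new AudioRecorder(16000, AudioFormat.CHANNEL_IN_STEREO, 1024));
pipeline.pushBackModifier(com.vivoka.vsdk.speechenhancement.s2c.Engine.getInstance().getSpeechEnhancer("enhancer-1"));
pipeline.pushBackConsumer(new File("/path/to/audio.pcm", true));
pipeline.start();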
Barge-In (AEC)
Barge-In enables the system to immediately respond to user commands without waiting for TTS playback to finish, allowing users to interrupt ongoing speech and interact more naturally and efficiently.
The Barge-In feature is made up of two main components:
Acoustic Echo Cancellation (AEC)
This component prevents the device’s own playback audio (such as TTS output) from being captured by the microphone and mistakenly interpreted as user speech. Android already provides AEC through the system’s audio framework. To enable it, use the audio source AudioSource.VOICE_COMMUNICATION, which activates built-in echo cancellation. Starting from vsdk 6.0.0, this is the default audio source. If you want raw microphone input instead, you can still use AudioSource.MIC.
Logic to Stop TTS on Speech Detection
While Android handles the echo cancellation, it does not automatically stop TTS playback when user speech is detected; this logic must be implemented by the developer. A common approach is to use speech recognition events that indicate “speech detected” and “silence detected”, typically emitted every 10 ms. You can monitor these events and decide what duration or level of speech activity should trigger TTS interruption.
Once speech is detected according to your chosen threshold, you can stop the TTS pipeline and continue with the rest of your voice interaction logic.
import android.media.MediaRecorder.AudioSource;
import com.vivoka.vsdk.audio.producers.AudioRecorder;
// Record using the Android AEC (default since vsdk 6.0.0)
AudioRecorder audioRecorder = new AudioRecorder(AudioSource.VOICE_COMMUNICATION);
// Or record raw microphone input
AudioRecorder rawRecorder = new AudioRecorder(AudioSource.MIC);
Example code:
private boolean detectBargeIn(String codeString, float time) {
    if (codeString.equals("SilenceDetected")) {
        lastTimeSilenceDetected = time;
    } else if (codeString.equals("SpeechDetected")) {
        // More than 200 ms of continuous speech while TTS is still playing -> barge-in
        if (time - lastTimeSilenceDetected > 200 && tts.isSynthesizing()) {
            Log.i("Barge-in", "Barge in detected!");
            return true;
        }
    }
    return false;
}
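Below is a hedged sketch of how detectBargeIn might be wired up. The callback name onRecognizerEvent and the ttsPipeline field are illustrative assumptions, not part of a documented VSDK API; adapt them to wherever your application receives the recognizer’s speech/silence events.
// Hypothetical callback, invoked for each recognizer event (roughly every 10 ms).
// "codeString" carries "SpeechDetected" / "SilenceDetected", "timeMs" the event time.
void onRecognizerEvent(String codeString, float timeMs) {
    if (detectBargeIn(codeString, timeMs)) {
        // Interrupt the ongoing TTS playback so the user can speak over it,
        // then continue with the rest of your voice interaction logic.
        ttsPipeline.stop();
    }
}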
Audio Format
Input: 16 kHz, 16-bit signed PCM, mono or stereo.
Output: 16 kHz, 16-bit signed PCM, mono.
Note that mono or stereo input is defined when configuring a Speech Enhancement technology in VDK-Studio.
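As a minimal sketch, the AudioRecorder constructor used later in this guide already produces 16 kHz, 16-bit PCM; pick the channel configuration that matches the mono/stereo setting of your Speech Enhancement project in VDK-Studio. The mono variant below assumes the same constructor accepts AudioFormat.CHANNEL_IN_MONO (a standard Android constant); the 1024 buffer size follows the Build Pipeline example.
import android.media.AudioFormat;
import com.vivoka.vsdk.audio.producers.AudioRecorder;
// 16 kHz, 16-bit PCM recorder, stereo or mono depending on your VDK-Studio configuration
AudioRecorder stereoRecorder = new AudioRecorder(16000, AudioFormat.CHANNEL_IN_STEREO, 1024);
AudioRecorder monoRecorder = new AudioRecorder(16000, AudioFormat.CHANNEL_IN_MONO, 1024);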
Getting Started
Before you begin, make sure you’ve completed all the necessary preparation steps.
There are two ways to prepare your Android project for Speech Enhancement:
Using sample code
Starting from scratch
From Sample Code
Start by downloading the sample package from the Vivoka Console:
Open the Vivoka Console and navigate to your Project Settings.
Go to the Downloads section.
In the search bar, enter the package name shown below.
📦 sample-speech-enhancement-x.x.x-deps-vsdk-x.x.x.zip
Once downloaded, you’ll have a fully functional project that you can test, customize, and extend to fit your specific use case.
From Scratch
Before proceeding, make sure you’ve completed the following steps:
1. Prepare your VDK Studio project
Create a new project in VDK Studio
Add the Speech Enhancement technology and add an enhancer.
Export the project to generate the required assets and configuration
2. Set up your Android project
Install the necessary libraries (vsdk-s2c-x.x.x-android-deps-vsdk-x.x.x.zip)
Initialize VSDK in your application code
These steps are explained in more detail in the Integrating Vsdk Libraries guide.
Start Speech Enhancement
1. Initialize Engine
Start by initializing the VSDK, followed by the Speech Enhancement engine:
import com.vivoka.vsdk.Vsdk;
Vsdk.init(context, "config/vsdk.json", vsdkSuccess -> {
    if (!vsdkSuccess) {
        return;
    }
    com.vivoka.vsdk.speechenhancement.s2c.Engine.getInstance().init(engineSuccess -> {
        if (!engineSuccess) {
            return;
        }
        // The Speech Enhancement engine is now ready!
    });
});
You cannot create two instances of the same engine.
If you call Engine.getInstance() multiple times, you will receive the same singleton instance.
2. Build Pipeline
For the sake of this example, we’ll implement a simple pipeline that records audio from the microphone, applies speech enhancement, and saves the processed audio to a file:
import android.media.AudioFormat;
import com.vivoka.vsdk.audio.Pipeline;
import com.vivoka.vsdk.audio.consumers.File;
import com.vivoka.vsdk.audio.modifiers.ChannelExtractor;
import com.vivoka.vsdk.audio.producers.AudioRecorder;
import com.vivoka.vsdk.Exception;
// Create audio recorder (producer)
AudioRecorder audioRecorder = new AudioRecorder(16000, AudioFormat.CHANNEL_IN_STEREO, 1024);
// Create speech enhancer (modifier)
SpeechEnhancer enhancer = com.vivoka.vsdk.speechenhancement.s2c.Engine.getInstance().getSpeechEnhancer("enhancer-1");
// Create File (consumer)
File file = new File(getFilesDir().getAbsolutePath() + "/audio.pcm", true); // Will be saved in the internal storage of your app.
// Create the pipeline and start it
Pipeline pipeline = new Pipeline();
pipeline.setProducer(audioRecorder);
if (enhancer.getInputChannelCount() == 1) {
    // If the enhancer needs only one channel, use the first one.
    // You choose mono or stereo in VDK-Studio.
    pipeline.pushBackModifier(new ChannelExtractor(0));
}
pipeline.pushBackModifier(enhancer);
pipeline.pushBackConsumer(file);
pipeline.start(); // see the Start/Stop Pipeline step below
3. Start/Stop Pipeline
pipeline.start();
pipeline.stop();
pipeline.run();
.start() runs the pipeline in a new thread.
.run() runs the pipeline and waits until it is finished (blocking).
.stop() is used to terminate the pipeline execution.
Once a pipeline has been stopped, you can restart it at any time by simply calling .start() again.
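For example, using only the calls shown above, the same pipeline instance can be reused across interactions:
pipeline.stop();  // end the current interaction
// ... later, when the next interaction begins
pipeline.start(); // the stopped pipeline is simply started again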
4. Release the Engine
com.vivoka.vsdk.speechenhancement.s2c.Engine.getInstance().release();
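A reasonable teardown sequence, assuming you stop any running pipeline before releasing the engine (this ordering is an assumption, not a documented requirement), looks like this:
// Stop audio processing first, then release the Speech Enhancement engine.
pipeline.stop();
com.vivoka.vsdk.speechenhancement.s2c.Engine.getInstance().release();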