How-to: Set up barge-in

What is barge-in?

Barge-in is the capability of a voice system to detect and process user speech while it is playing audio output, allowing the user to interrupt the system.

Barge-in is a conversational interaction feature. It relies on ASR and speech events to determine when a user starts speaking. It does not refer to any specific audio processing technique.

Since barge-in relies on speech event detection, its performance is strongly influenced by speech enhancement and noise reduction. However, speech enhancement alone may not be sufficient to filter out Text-to-Speech (TTS) audio that is being played back and simultaneously captured by the microphone. To ensure reliable barge-in across diverse acoustic environments, we provide Acoustic Echo Cancellation (AEC).

Setting Up

1. (Optional but recommended) Perform AEC and denoise the audio.
2. Set up real-time speech recognition to capture speech events.
3. When a speech detection event occurs, interrupt the system’s audio playback.
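
As a rough illustration of how the second and third steps fit together, here is a minimal C++ sketch. `BargeInController`, `setPlaying`, and `onSpeechEvent` are hypothetical names that are not part of the VSDK, and the stop callback stands in for whatever stops your TTS and audio player. The class is not thread-safe, so it assumes that events and playback state changes arrive on the same thread or are synchronized by the caller.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical helper, not part of the VSDK.
// It interrupts playback when speech is detected while audio is playing.
class BargeInController
{
public:
    explicit BargeInController(std::function<void()> stopPlayback)
        : _stopPlayback(std::move(stopPlayback))
    {
        assert(_stopPlayback && "a stop callback is required");
    }

    // Call when the system starts or finishes playing audio.
    void setPlaying(bool playing) { _playing = playing; }
    bool isPlaying() const { return _playing; }

    // Feed every speech event code here.
    // Returns true when the event triggered a barge-in.
    bool onSpeechEvent(std::string const & code)
    {
        if (code != "SpeechDetected" || !_playing) return false;

        _playing = false;  // make sure playback is stopped only once
        _stopPlayback();
        return true;
    }

private:
    std::function<void()> _stopPlayback;
    bool _playing = false;
};
```

In an application, you would call `setPlaying(true)` when TTS playback starts, forward each ASR event code to `onSpeechEvent`, and let the callback stop the TTS pipeline and the audio player.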

VSDK C++ — Barge-In

1. Performing AEC and Audio Denoising

Please refer to the SpeechEnhancement page.

2. Setting up

Once your audio has been cleaned of environmental noise, you can trust the events emitted by the ASR. There are two specific events you need to pay attention to: SpeechDetected and SilenceDetected.
You could rely solely on SpeechDetected to interrupt the TTS, but you may also want to enforce a minimum speech duration using the SilenceDetected event.
Either way, this is the entry point for any barge-in handling.

CPP

// Forward every recognizer event to the barge-in handler.
rec->subscribe([&] (Recognizer::Event const & e) { onAsrEvent(e); });
...

void onAsrEvent(Recognizer::Event const & e)
{
    // Ignore an event that is identical to the previous one.
    static Recognizer::Event lastEvent;
    if (e == lastEvent) return;
    lastEvent = e;

    if (e.codeString == "SpeechDetected")
    {
        // Stop the TTS pipeline and the audio player here.
        // Do it asynchronously to avoid blocking the ASR thread.
    }
    else if (e.codeString == "SilenceDetected")
    {
        // ...
    }
}
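
If you want to enforce a minimum speech duration, one possible approach is to remember when silence was last reported and to trigger only once speech has continued past a threshold. The sketch below assumes that events are reported repeatedly while a state lasts, and that you can supply a timestamp in milliseconds for each event (taken from the event itself or from your own clock). `SpeechDurationGate` is a hypothetical helper, not a VSDK class.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of the VSDK.
// Reports a barge-in only after speech has lasted longer than a threshold.
class SpeechDurationGate
{
public:
    explicit SpeechDurationGate(double minSpeechMs) : _minSpeechMs(minSpeechMs)
    {
        assert(minSpeechMs >= 0.0);
    }

    // code:   event code string, e.g. "SpeechDetected" or "SilenceDetected".
    // timeMs: timestamp of the event in milliseconds (assumed to be supplied by the caller).
    // Returns true when speech has continued for more than minSpeechMs
    // since the last SilenceDetected event.
    bool update(std::string const & code, double timeMs)
    {
        if (code == "SilenceDetected")
        {
            _lastSilenceMs = timeMs;
            return false;
        }
        if (code == "SpeechDetected")
        {
            return timeMs - _lastSilenceMs > _minSpeechMs;
        }
        return false;  // any other event is ignored
    }

private:
    double _minSpeechMs;
    double _lastSilenceMs = 0.0;
};
```

Call `update` for every event, before any de-duplication, so that repeated SpeechDetected events can accumulate duration. The gate keeps returning true while speech continues, so make sure the playback is stopped only once, for example by checking that the TTS is still playing.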

VSDK Android — Barge-In

1. Performing AEC and Audio Denoising

Please refer to the SpeechEnhancement page.

2. Setting up

A common approach is to use speech recognition events that indicate “speech detected” and “silence detected”, typically emitted every 10 ms. You can monitor these events and decide what duration or level of speech activity should trigger TTS interruption.

Once speech is detected according to your chosen threshold, you can stop the TTS pipeline and continue with the rest of your voice interaction logic.

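The audio source passed to AudioRecorder determines whether Android's built-in AEC is applied to the recording. Use one of the two options below.

JAVA

// AudioSource is assumed here to be Android's MediaRecorder.AudioSource.
import android.media.MediaRecorder.AudioSource;
import com.vivoka.vsdk.audio.producers.AudioRecorder;

// Option 1: record using Android's built-in AEC.
AudioRecorder audioRecorder = new AudioRecorder(AudioSource.VOICE_COMMUNICATION);

// Option 2: record from the microphone as is, without AEC.
// AudioRecorder audioRecorder = new AudioRecorder(AudioSource.MIC);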

Example code:

JAVA

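// Minimum duration of continuous speech before a barge-in is triggered.
// Assumed to be in milliseconds, the same unit as `time`.
private static final float MIN_SPEECH_DURATION = 200f;

// Timestamp of the most recent SilenceDetected event.
private float lastTimeSilenceDetected = 0f;

// Returns true when speech has lasted longer than the threshold while the TTS
// is synthesizing. `tts` is the TTS object held by the enclosing class.
private boolean detectBargeIn(String codeString, float time) {
    if (codeString.equals("SilenceDetected")) {
        lastTimeSilenceDetected = time;
    } else if (codeString.equals("SpeechDetected")) {
        if (time - lastTimeSilenceDetected > MIN_SPEECH_DURATION && tts.isSynthesizing()) {
            Log.i("Barge-in", "Barge in detected!");
            return true;
        }
    }
    return false;
}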