Speech Synthesis - Android

Introduction

Speech synthesis (also known as text-to-speech or TTS) is the process of converting written text into spoken audio.

In VSDK, speech synthesis is powered by CSDK, which offers a wide range of voices across different languages, genders, and voice qualities (see Voice quality availability).

Channels

A channel is what you use to generate speech. It holds one or more voices.

A channel itself doesn’t have a language—the language is defined by the voices you assign to it.
This means a single channel can include voices in different languages.

You can also define multiple channels in your configuration (see the sketch after this list). This is useful when:

  • You want to synthesize multiple texts at the same time (parallel TTS).

  • You want to organize voices based on use case (e.g., one channel for alerts, another for navigation).
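
For example, a minimal sketch of the parallel-TTS case: two channels, each driven by its own pipeline. It reuses the Engine.getChannel, Pipeline, and AudioPlayer calls shown in the Start Synthesis section below; the channel names and voices are placeholders, and listener stands for an IChannelListener implementation.

JAVA
// Sketch: two independent channels synthesizing in parallel.
// "alerts"/"navigation" and the voice strings are example values;
// 'listener' is assumed to be an IChannelListener implementation.
Channel alerts = com.vivoka.vsdk.tts.csdk.Engine.getInstance()
        .getChannel("alerts", "enu,evan,embedded-pro", listener);
Channel navigation = com.vivoka.vsdk.tts.csdk.Engine.getInstance()
        .getChannel("navigation", "enu,ava,embedded-compact", listener);

Pipeline alertPipeline = new Pipeline();
alertPipeline.setProducer(alerts);
alertPipeline.pushBackConsumer(new AudioPlayer(alerts.getSampleRate(), alerts.getChannelCount()));

Pipeline navPipeline = new Pipeline();
navPipeline.setProducer(navigation);
navPipeline.pushBackConsumer(new AudioPlayer(navigation.getSampleRate(), navigation.getChannelCount()));

// Each pipeline runs in its own thread, so both texts play at the same time.
alertPipeline.start();
navPipeline.start();
alerts.synthesizeFromText("Low battery.");
navigation.synthesizeFromText("Turn left in two hundred meters.");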

SSML Support

VSDK also supports SSML (Speech Synthesis Markup Language), which gives you finer control over how the text is spoken—allowing adjustments such as:

  • Pronunciation

  • Pauses

  • Pitch

  • Rate

  • Emphasis

SSML is supported for embedded voices, but not for neural voices (if present in your configuration). Neural voices are more natural-sounding but behave as a black box and do not support markup-based control.
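
For example, the following SSML document combines a pause, a prosody change, and emphasis (a hand-written sketch using standard SSML 1.0 tags; actual tag support may vary by voice):

JAVA
// Standard SSML 1.0 markup; actual tag support depends on the
// embedded voice in your configuration.
String ssml =
    "<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">"
  + "Please wait.<break time=\"500ms\"/>"
  + "<prosody rate=\"slow\" pitch=\"low\">Processing your request.</prosody>"
  + "<emphasis level=\"strong\">Done!</emphasis>"
  + "</speak>";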

Audio Format

The audio data is a 16-bit signed PCM buffer in little-endian format.
It is always mono (1 channel), and the sample rate depends on the engine being used.

| Engine | Sample Rate (Hz) |
| ------ | ---------------- |
| csdk   | 22050            |
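
As a sanity check on that format, the helper below computes the playback duration of a raw buffer. It is plain arithmetic for illustration, not part of the VSDK API:

JAVA
// Illustration only: playback duration of a 16-bit signed mono PCM buffer.
static double durationSeconds(byte[] pcm, int sampleRate, int channelCount) {
    int bytesPerSample = 2; // 16-bit samples
    return pcm.length / (double) (sampleRate * channelCount * bytesPerSample);
}
// e.g. a 44100-byte buffer at 22050 Hz mono lasts exactly 1.0 second.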

Voice Format

For <language>, refer to the table and use the value from the Vsdk-csdk Code column.
For <name>, use the lowercase version of the name shown in VDK-Studio.
For <quality>, you can find this information in VDK-Studio under Resources → Voice.

| Engine    | Format                      | Example               |
| --------- | --------------------------- | --------------------- |
| vsdk-csdk | <language>,<name>,<quality> | enu,evan,embedded-pro |
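
Put together, a voice identifier is simply the three comma-separated fields. A small sketch (values are examples):

JAVA
// Building a voice identifier from its three parts.
String language = "enu";          // value from the Vsdk-csdk Code column
String name     = "evan";         // lowercase voice name from VDK-Studio
String quality  = "embedded-pro"; // quality from Resources → Voice
String voice    = String.join(",", language, name, quality); // "enu,evan,embedded-pro"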

Getting Started

Before you begin, make sure you’ve completed all the necessary preparation steps.
There are two ways to prepare your Android project for Voice Synthesis:

  1. Using sample code

  2. Starting from scratch

From Sample Code

Start by downloading the sample package from the Vivoka Console:

  1. Open the Vivoka Console and navigate to your Project Settings.

  2. Go to the Downloads section.

  3. In the search bar, enter the package name shown below:

📦 sample-tts-x.x.x-android-deps-vsdk-x.x.x.zip

Once downloaded, you’ll have a fully functional project that you can test, customize, and extend to fit your specific use case.

From Scratch

Before proceeding, make sure you’ve completed the following steps:

1. Prepare your VDK Studio project
  • Create a new project in VDK Studio

  • Add the Voice Synthesis technology and channel with voice(s)

  • Export the project to generate the required assets and configuration

2. Set up your Android project
  • Install the necessary libraries (vsdk-csdk-tts-x.x.x-android-deps-vsdk-x.x.x.zip)

  • Initialize VSDK in your application code

These steps are better explained in the Integrating Vsdk Libraries guide.

Start Synthesis

1. Initialize Engine

Start by initializing the VSDK, followed by the Voice Synthesis (TTS) engine:

JAVA
import android.util.Log;

import com.vivoka.vsdk.Vsdk;

Vsdk.init(context, "config/vsdk.json", vsdkSuccess -> {
    if (!vsdkSuccess) {
        return;
    }
    com.vivoka.vsdk.tts.csdk.Engine.getInstance().init(engineSuccess -> {
        if (!engineSuccess) {
            return;
        }
        // The TTS engine is now ready
        Log.i("CSDK", "TTS Engine is successfully started.");
    });
});

You cannot create two instances of the same engine.

If you call Engine.getInstance() multiple times, you will receive the same singleton instance.
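
A quick illustration of that behavior:

JAVA
// Both variables reference the same engine object.
com.vivoka.vsdk.tts.csdk.Engine a = com.vivoka.vsdk.tts.csdk.Engine.getInstance();
com.vivoka.vsdk.tts.csdk.Engine b = com.vivoka.vsdk.tts.csdk.Engine.getInstance();
// a == b: getInstance() always returns the same singleton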

2. Build Pipeline

For the sake of this example, we’ll implement a simple pipeline that plays synthesized voice directly to the speaker:

JAVA
import android.util.Log;

import com.vivoka.vsdk.Exception;
import com.vivoka.vsdk.audio.Pipeline;
import com.vivoka.vsdk.audio.consumers.AudioPlayer;
import com.vivoka.vsdk.common.Error;
import com.vivoka.vsdk.common.Event;
import com.vivoka.vsdk.tts.Channel;
import com.vivoka.vsdk.tts.IChannelListener;

try {
    // Create channel (producer)
    String channelName = "channel-1";
    String voice = "enu,ava,embedded-compact";
    Channel channel = com.vivoka.vsdk.tts.csdk.Engine.getInstance().getChannel(
            channelName, voice, new IChannelListener() {
                @Override
                public void onEvent(Event<Channel.EventCode> event) {
                    Log.d(TAG, "onEvent: " + event.codeString + " : " + event.message);
                }
    
                @Override
                public void onError(Error<Channel.ErrorCode> error) {
                    Log.e(TAG, "onError: " + error.type.toString() + " on channel '" + channelName + "': " + error.message);
                }
            }
    );
    
    // Create audio player (consumer)
    AudioPlayer audioPlayer = new AudioPlayer(channel.getSampleRate(), channel.getChannelCount());
    audioPlayer.setOnFinished(() -> {
        // On audio finished playing.
    });
    
    // Create and start pipeline
    Pipeline pipeline = new Pipeline();
    pipeline.setProducer(channel);
    pipeline.pushBackConsumer(audioPlayer);
    pipeline.start();
} catch (Exception e) {
    e.printFormattedMessage();
}

Calling .getChannel() with the same name twice will always return the same channel instance.

When you call pipeline.pushBackConsumer(audioPlayer), the audioPlayer becomes linked to the channel instance.

If you do this a second time, even with a new Pipeline, the same channel will send audio to both consumers, and the audio will play twice.

To avoid this, create the pipeline for a given channel only once, or remove the existing consumers before creating a new pipeline.
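
One defensive pattern, sketched below, is to build the pipeline lazily and keep a reference to it, so a given channel is only ever wired once (mPipeline is a class field and buildPipelineOnce is a hypothetical helper name):

JAVA
// Sketch: wire a channel to its pipeline exactly once.
private Pipeline mPipeline; // class field, null until first use

private void buildPipelineOnce(Channel channel, AudioPlayer audioPlayer) {
    if (mPipeline != null) {
        return; // already wired; reuse the existing pipeline
    }
    mPipeline = new Pipeline();
    mPipeline.setProducer(channel);
    mPipeline.pushBackConsumer(audioPlayer); // the only place this channel gains a consumer
}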

3. Start/Stop Pipeline

JAVA
pipeline.start();
pipeline.stop();
pipeline.run();
  • .start() runs the pipeline in a new thread

  • .run() runs the pipeline and blocks until it finishes

  • .stop() terminates the pipeline execution

Once a pipeline has been stopped, you can restart it at any time by simply calling .start() again.

To stop playing:

JAVA
pipeline.stop();
audioPlayer.stop();

4. Synthesize Text

Before calling .synthesizeFromText(), you need to start the pipeline first:

JAVA
pipeline.start();

channel.synthesizeFromText("Hello world!");
String ssml = "<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"fr-FR\">Bonjour Vivoka</speak>";
channel.synthesizeFromText(ssml);

To pause/resume TTS:

JAVA
audioPlayer.pause();
audioPlayer.resume();

If you call .synthesizeFromText() more than once, the last call overrides all previous ones.
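
If you instead want utterances to play back to back, one option is to queue texts and feed the next one from the player's finished callback. A sketch, assuming the channel and audioPlayer from the example above (pendingTexts is an illustrative name, not VSDK API):

JAVA
import java.util.concurrent.ConcurrentLinkedQueue;

ConcurrentLinkedQueue<String> pendingTexts = new ConcurrentLinkedQueue<>();

audioPlayer.setOnFinished(() -> {
    String next = pendingTexts.poll();
    if (next != null) {
        channel.synthesizeFromText(next); // start the next queued utterance
    }
});

// Enqueue instead of calling synthesizeFromText() directly:
pendingTexts.add("First sentence.");
pendingTexts.add("Second sentence.");
channel.synthesizeFromText(pendingTexts.poll()); // kick off the first one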

5. Destroy Engine

JAVA
com.vivoka.vsdk.tts.csdk.Engine.getInstance().release();

The engine instance cannot be destroyed while at least one channel is still active.
Make sure to release all channel instances before shutting down the engine—the destruction order matters!
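
A teardown sketch that respects this order (the channel release call is shown under a hypothetical name; check the API of your VSDK version for the exact call):

JAVA
pipeline.stop();    // 1. stop the pipeline feeding the player
audioPlayer.stop(); // 2. stop audio playback
// 3. release every channel obtained from the engine
//    (hypothetical call; the exact channel-release API depends on your VSDK version):
// channel.release();
com.vivoka.vsdk.tts.csdk.Engine.getInstance().release(); // 4. finally destroy the engine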

Audio Player

AudioPlayer is a consumer module provided by VSDK that handles the playback of synthesized audio.

When used in a pipeline, it receives audio data from a TTS channel and plays it back through the device’s speaker. It also supports progress tracking, including word-level markers, which can be used to synchronize text display or trigger actions as speech is spoken.

Text Marker

TextMarker Implementation
JAVA
import androidx.annotation.NonNull;
import androidx.arch.core.util.Function;
import androidx.core.util.Consumer;
import androidx.core.util.Supplier;

import com.vivoka.vsdk.Exception;

import org.json.JSONException;
import org.json.JSONObject;

import java.util.ArrayList;
import java.util.List;

// Tracks word markers emitted by the TTS channel and fires the registered
// callback as audio playback reaches each marked word.
public class TextMarker {
    private int _sampleRate;
    private int _channelCount;
    private int _previousWordIndex;
    private final List<Marker> _markers;
    private final Object _callbackMutex = new Object();
    private Consumer<Marker> _callback;

    public TextMarker(int sampleRate, int channelCount) throws Exception {
        setAudioFormat(sampleRate, channelCount);
        _previousWordIndex = -1;
        _markers = new ArrayList<>();
    }

    public void setAudioFormat(int sampleRate, int channelCount) throws Exception {
        Exception.bAssert(sampleRate > 0,
            "Sample rate cannot be zero or less than zero");

        Exception.bAssert(channelCount > 0,
            "Channel count cannot be zero or less than zero");

        this._sampleRate = sampleRate;
        this._channelCount = channelCount;
    }

    public void addMarker(String markerMessage) throws Exception {
        synchronized (_markers) {
            _markers.add(new Marker(markerMessage));
        }
    }

    public void setReachedMarkerCallback(Consumer<Marker> callback) {
        synchronized (_callbackMutex) {
            this._callback = callback;
        }
    }

    public void onPlayerProgress(Long position) throws Exception {
        try {
            Function<Integer, Marker> safeGet = (Integer idx) -> {
                synchronized (_markers) {
                    return 0 <= idx && idx < _markers.size() ? _markers.get(idx) : null;
                }
            };

            Supplier<Integer> size = () -> {
                synchronized (_markers) {
                    return _markers.size();
                }
            };

            for (int i = _previousWordIndex + 1; i < size.get(); ++i) {
                Marker marker = safeGet.apply(i);
                if (marker != null) {
                    if (marker.startPosInAudio() > position) {
                        break;
                    }
                    synchronized (_callbackMutex) {
                        if (_callback != null) {
                            _callback.accept(marker);
                        }
                    }
                    _previousWordIndex = i;
                }
            }
        } catch (RuntimeException e) {
            throw new Exception("Failed to update TextMarker progress", e);
        }
    }

    public void reset() {
        _previousWordIndex = -1;
        synchronized (_markers) {
            _markers.clear();
        }
    }

    // Marker class for demonstration purposes
    public static class Marker {
        private final double _startPosInAudio;
        private final double _endPosInAudio;
        private final double _startPosInText;
        private final double _endPosInText;
        private final String _word;
        private final String _text;

        public Marker(String jsonString) throws Exception {
            try {
                JSONObject obj   = new JSONObject(jsonString);
                _text            = obj.getString("text");
                _word            = obj.getString("word");
                _startPosInText  = obj.getLong("start_pos_in_text");
                _endPosInText    = obj.getLong("end_pos_in_text");
                _startPosInAudio = obj.getLong("start_pos_in_audio");
                _endPosInAudio   = obj.getLong("end_pos_in_audio");
            } catch (JSONException e) {
                throw new Exception("Failed to parse json word marker", e);
            }
        }

        public double startPosInAudio() {
            return _startPosInAudio;
        }

        public double endPosInAudio() {
            return _endPosInAudio;
        }

        public double startPosInText() {
            return _startPosInText;
        }

        public double endPosInText() {
            return _endPosInText;
        }

        public String word() {
            return _word;
        }

        public String text() {
            return _text;
        }

        @NonNull
        @Override
        public String toString() {
            JSONObject obj = new JSONObject();
            try {
                obj.put("word", _word);
                obj.put("text", _text);
                obj.put("start_pos_in_audio", _startPosInAudio);
                obj.put("end_pos_in_audio", _endPosInAudio);
                obj.put("start_pos_in_text", _startPosInText);
                obj.put("end_pos_in_text", _endPosInText);
            } catch (JSONException e) {
                (new Exception("Failed to convert marker to string", e)).printFormattedMessage();
            }
            return obj.toString();
        }
    }
}

The following code demonstrates how to integrate word-level markers into the pipeline for synchronized text playback.

JAVA
import android.util.Log;

import com.vivoka.vsdk.Exception;
import com.vivoka.vsdk.audio.Pipeline;
import com.vivoka.vsdk.audio.consumers.AudioPlayer;
import com.vivoka.vsdk.common.Error;
import com.vivoka.vsdk.common.Event;
import com.vivoka.vsdk.tts.Channel;
import com.vivoka.vsdk.tts.IChannelListener;
import com.vivoka.vsdk.tts.csdk.Engine;

String channelName = "channel-1";
String voice = "enu,ava,embedded-compact";

// channel, textMarker, and audioPlayer are class fields so the listener
// callbacks below can reference them.
try {
    // Create channel (producer)
    channel = Engine.getInstance().getChannel(channelName, voice, new IChannelListener() {
            @Override
            public void onEvent(Event<Channel.EventCode> event) {
                Log.d(TAG, "onEvent: " + event.codeString + " : " + event.message);
                if (event.code == Channel.EventCode.WORD_MARKER_END_EVENT) {
                    try {
                        if (textMarker != null) {
                            textMarker.addMarker(event.message);
                        }
                    } catch (Exception e) {
                        e.printFormattedMessage();
                    }
                }
            }
  
            @Override
            public void onError(Error<Channel.ErrorCode> error) {
                Log.e(TAG, "onError: " + error.type.toString() + " on channel '" + channelName + "': " + error.message);
            }
        }
    );
  
    // Create text marker
    textMarker = new TextMarker(channel.getSampleRate(), channel.getChannelCount());
    textMarker.setReachedMarkerCallback(marker -> {
        Log.d("Marker", marker.text());
    });
  
    // Create audio player (consumer)
    audioPlayer = new AudioPlayer(channel.getSampleRate(), channel.getChannelCount());
    audioPlayer.setOnProgress(position -> {
        try {
            textMarker.onPlayerProgress(position);
        } catch (Exception e) {
            e.printFormattedMessage();
        }
    });    
    audioPlayer.setOnFinished(() -> {
        textMarker.reset();
    });

    // Create and start pipeline
    Pipeline pipeline = new Pipeline();
    pipeline.setProducer(channel);
    pipeline.pushBackConsumer(audioPlayer);
    channel.synthesizeFromText("Hello world");
    pipeline.start();
} catch (Exception e) {
    e.printFormattedMessage();
}
If this problem persists, please contact our support.