Speech Recognition - Android
Introduction
Recognizers & Models
Speech recognition relies on two main components: Recognizers and Models. They work together to turn audio into meaningful results.
Recognizers: The Acoustic Part
Recognizers handle incoming audio and use voice activity detection (VAD) to determine when someone is speaking and when they are silent.
A recognizer can run without a model—VAD will still detect speech—but no transcription or interpretation will occur. Recognizers are configured for specific languages and accents (e.g., eng-US, fra-FR) and must be paired with models in the same language.
Models: The Language Part
Models define what the recognizer can understand—words, phrases, grammar structure, and even phoneme rules. They act as the language engine behind the acoustic processing.
There are three types of models:
| Type | Description |
| --- | --- |
| Static models | Predefined vocabularies and grammars. They define a fixed set of valid phrases, their structure, and optionally custom phonemes. |
| Dynamic models | A special type of static model that includes slots (placeholders you can fill with vocabulary at runtime). The base model is compiled in the cloud, then updated on the device when slots are filled, which requires an additional on-device compilation. |
| Free-speech models | Large-vocabulary models designed for open-ended input (e.g., dictation). Unlike static or dynamic models, they are not limited to a defined set of phrases. They are implemented the same way as static models. |
In short, recognizers listen, and models interpret. Both are essential for effective speech recognition.
Why Recognizers and Models Are Separated
Separating Recognizers and Models gives you more flexibility. For example, you can set or switch a model after speech has started—thanks to internal audio buffering—so the model still applies to earlier audio.
It also lets you reuse the same recognizer with different models, or prepare models in the background without interrupting audio input.
Audio Format
The input audio data for recognition is a 16-bit signed PCM buffer in little-endian format. It is always mono (1 channel), and the sample rate is 16 kHz.
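If you feed audio from your own source instead of the built-in AudioRecorder producer, the captured buffers must match this format. The snippet below is a minimal sketch using Android's standard AudioRecord API; the buffer handling is illustrative and not part of VSDK:

import android.media.AudioFormat;
import android.media.AudioRecord;
import android.media.MediaRecorder;

// Illustrative sketch: capture 16-bit signed PCM, mono, 16 kHz audio with Android's AudioRecord.
int sampleRate = 16000;
int channelConfig = AudioFormat.CHANNEL_IN_MONO;
int encoding = AudioFormat.ENCODING_PCM_16BIT;
int bufferSize = AudioRecord.getMinBufferSize(sampleRate, channelConfig, encoding);

AudioRecord record = new AudioRecord(MediaRecorder.AudioSource.MIC,
        sampleRate, channelConfig, encoding, bufferSize);
short[] buffer = new short[bufferSize / 2];

record.startRecording();
int samplesRead = record.read(buffer, 0, buffer.length); // 'buffer' now holds mono 16 kHz PCM samples
record.stop();
record.release();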
Getting Started
Before you begin, make sure you’ve completed all the necessary preparation steps.
There are two ways to prepare your Android project for Voice Recognition:
Using sample code
Starting from scratch
From Sample Code
Start by downloading the sample package from the Vivoka Console:
Open the Vivoka Console and navigate to your Project Settings.
Go to the Downloads section.
In the search bar, enter one of the package names listed below.
📦 sample-chained-grammars-x.x.x-android-deps-vsdk-x.x.x.zip
📦 sample-simple-application-x.x.x-android-deps-vsdk-x.x.x.zip
📦 sample-dynamic-grammar-x.x.x-android-deps-vsdk-x.x.x.zip
Once downloaded, you’ll have a fully functional project that you can test, customize, and extend to fit your specific use case.
From Scratch
Before proceeding, make sure you’ve completed the following steps:
1. Prepare your VDK Studio project
Create a new project in VDK Studio
Add the Voice Recognition technology
Add a model and a recognizer
Export the project to generate the required assets and configuration
Adding a model is only required for static or dynamic models.
To add a recognizer:
Click the settings icon next to Voice Recognition in the left panel.
Click “Add new recognizer”.
Enter a name for your recognizer.
Select the language.
Remember the name, you will need it later!
If you don't add a recognizer yourself, a default one is created with the name rec_ + language. For example, rec_eng-US.
2. Set up your Android project
Install the necessary libraries (vsdk-csdk-asr-x.x.x-android-deps-vsdk-x.x.x.zip)
Initialize VSDK in your application code
These steps are better explained in the Integrating Vsdk Libraries guide.
Start Recognition
1. Initialize Engine
Start by initializing the VSDK, followed by the Voice Recognition engine:
import com.vivoka.vsdk.Vsdk;
Vsdk.init(context, "config/vsdk.json", vsdkSuccess -> {
    if (!vsdkSuccess) {
        // VSDK initialization failed
        return;
    }
    com.vivoka.vsdk.asr.csdk.Engine.getInstance().init(engineSuccess -> {
        if (!engineSuccess) {
            // ASR engine initialization failed
            return;
        }
        // The ASR engine is now ready
    });
});
You cannot create two instances of the same engine.
If you call Engine.getInstance() multiple times, you will receive the same singleton instance.
2. Build Pipeline
For this example, we’ll implement a simple pipeline that records audio from the microphone and sends it to the recognizer:
import android.util.Log;

import com.vivoka.vsdk.asr.IRecognizerListener;
import com.vivoka.vsdk.asr.Recognizer;
import com.vivoka.vsdk.asr.Recognizer.ErrorCode;
import com.vivoka.vsdk.asr.Recognizer.EventCode; // assumed to be nested in Recognizer, like ErrorCode and ResultType
import com.vivoka.vsdk.asr.Recognizer.ResultType;
import com.vivoka.vsdk.audio.Pipeline;
import com.vivoka.vsdk.audio.producers.AudioRecorder;
import com.vivoka.vsdk.common.Error;
import com.vivoka.vsdk.common.Event;
try {
    // Create audio recorder (producer)
    AudioRecorder audioRecorder = new AudioRecorder(); // Set a producer (microphone)

    // Create Recognizer (consumer)
    String recognizerName = "rec_eng-US";
    Recognizer recognizer = com.vivoka.vsdk.asr.csdk.Engine.getInstance().getRecognizer(recognizerName, new IRecognizerListener() {
        @Override
        public void onResult(ResultType resultType, String result, boolean isFinal) {
            Log.e(TAG, "onResult: " + resultType.name() + " " + result);
        }

        @Override
        public void onEvent(Event<EventCode> event) {
            Log.e(TAG, "onEvent: [" + event.formattedTimeMarker() + "] " + event.codeString +
                    " " + event.message);
        }

        @Override
        public void onError(Error<ErrorCode> error) {
            Log.e(TAG, "onError: " + error.codeString + " " + error.message);
        }
    });

    // Create and start pipeline
    Pipeline pipeline = new Pipeline();
    pipeline.setProducer(audioRecorder);
    pipeline.pushBackConsumer(recognizer);
    pipeline.start();
} catch (Exception e) {
    e.printFormattedMessage();
}
Calling .getRecognizer() with the same name twice will always return the same recognizer instance.
AudioRecorder requires the Android RECORD_AUDIO permission.
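The permission must be declared in your AndroidManifest.xml and, on Android 6.0+, requested at runtime before starting the pipeline. Below is a minimal sketch of the runtime check inside an Activity; the request code is an arbitrary value chosen for this example:

import android.Manifest;
import android.content.pm.PackageManager;
import androidx.core.app.ActivityCompat;
import androidx.core.content.ContextCompat;

private static final int REQUEST_RECORD_AUDIO = 1; // arbitrary request code for this example

// Returns true if recording is already allowed; otherwise asks the user and returns false.
private boolean ensureAudioPermission() {
    if (ContextCompat.checkSelfPermission(this, Manifest.permission.RECORD_AUDIO)
            != PackageManager.PERMISSION_GRANTED) {
        ActivityCompat.requestPermissions(this,
                new String[]{Manifest.permission.RECORD_AUDIO}, REQUEST_RECORD_AUDIO);
        return false; // wait for onRequestPermissionsResult before starting the pipeline
    }
    return true;
}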
3. Start/Stop Pipeline
pipeline.start();
pipeline.stop();
pipeline.run();
.start() runs the pipeline in a new thread.
.run() runs the pipeline and waits until it has finished (blocking).
.stop() terminates the pipeline execution.
Once a pipeline has been stopped, you can restart it at any time by simply calling .start() again.
4. Set Model
Calling .setModel(...) is only required for grammar-based models (static or dynamic).
Each time you receive an onResult() event, the model is automatically unset. To continue recognition, you need to set the model again manually.
recognizer.setModel("model-1"); // Apply the model from the current position in the audio stream
recognizer.setModel("model-1", hypothesis.endTime); // Apply the model from a given start time (e.g., the end of the previous hypothesis)
The start time begins at 0 when the pipeline is started.
You can set a model slightly in the past—typically up to 2 seconds back—to capture speech that occurred just before the model was applied.
However, keep in mind that this introduces additional computation, which may impact performance on low-power devices. Be sure to test and validate this behavior in your target environment.
Be careful when calling .setModel() from within a recognizer callback such as onResult(), onEvent(), or onError(). The callback must fully return before you can safely set the model again.
To work around this, you can call .setModel() from a separate thread—this ensures the callback has fully exited before the model is set.
new Thread(() -> {
    try {
        recognizer.setModel(model);
    } catch (Exception e) {
        // Handle exception...
    }
}).start();
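Putting the two previous points together, a common pattern is to re-apply the model from inside onResult() on a short-lived worker thread so the callback can return immediately. This is only a sketch, reusing the "model-1" name from the example above:

@Override
public void onResult(ResultType resultType, String result, boolean isFinal) {
    // The model was unset when this result was produced; re-apply it off the callback thread.
    new Thread(() -> {
        try {
            recognizer.setModel("model-1"); // model name taken from the earlier example
        } catch (Exception e) {
            // Handle exception...
        }
    }).start();
}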
5. Parse Result
When you receive a result, you’ll need to parse it using the AsrResultParser class.
The parsed result will typically contain multiple hypotheses, each with an associated confidence score—the higher the score, the better the match.
In general, a confidence threshold between 4000 and 5000 is considered acceptable, though this may vary depending on your specific use case.
import com.vivoka.vsdk.asr.utils.AsrResultParser;
import com.vivoka.vsdk.asr.utils.AsrResultHypothesis;
import com.vivoka.vsdk.asr.utils.AsrResult;
import com.vivoka.vsdk.asr.Recognizer.ResultType;
private AsrResultHypothesis processRecognitionResult(ResultType resultType, String result) {
    if (resultType != ResultType.ASR) {
        Log.w("AsrResult", "Unsupported result type: " + resultType);
        return null;
    }
    if (result == null || result.isEmpty()) {
        Log.e("AsrResult", "Input result is null or empty.");
        return null;
    }
    try {
        AsrResult asrResult = AsrResultParser.parseResult(result);
        AsrResultHypothesis best = asrResult.getBestHypothesis();
        if (best == null) {
            Log.w("AsrResult", "No hypotheses found in ASR result.");
            return null;
        }
        Log.d("AsrResult", "Best hypothesis: " + best);
        return best;
    } catch (Exception e) {
        Log.e("AsrResult", "Error while processing recognition result", e);
        return null;
    }
}
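As a usage sketch, you can call this helper from onResult() and keep only hypotheses whose score passes your threshold. The confidence accessor shown here (getConfidence()) is an assumption; check the actual AsrResultHypothesis API of your VSDK version:

// Illustrative sketch: accept or reject the best hypothesis based on a confidence threshold.
private static final int CONFIDENCE_THRESHOLD = 4500; // within the suggested 4000-5000 range

private void handleResult(ResultType resultType, String result) {
    AsrResultHypothesis best = processRecognitionResult(resultType, result);
    // getConfidence() is an assumed accessor name, used here for illustration only.
    if (best != null && best.getConfidence() >= CONFIDENCE_THRESHOLD) {
        Log.d("AsrResult", "Accepted hypothesis: " + best);
    } else {
        Log.d("AsrResult", "No hypothesis above the confidence threshold.");
    }
}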
6. Release Engine
com.vivoka.vsdk.asr.csdk.Engine.getInstance().release();
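Release the engine when your application no longer needs speech recognition. A minimal teardown sketch, assuming the pipeline created earlier is stopped first (for example in your Activity's onDestroy()):

@Override
protected void onDestroy() {
    super.onDestroy();
    try {
        pipeline.stop(); // stop producing/consuming audio before releasing the engine
    } catch (Exception e) {
        // Handle exception...
    }
    com.vivoka.vsdk.asr.csdk.Engine.getInstance().release();
}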
Dynamic Models
Dynamic models are an extension of static grammar models. They are designed to support runtime customization, allowing you to insert new vocabulary or values—called slots—directly from your code.
This is useful when your application needs to adapt its vocabulary on the fly, such as recognizing names, product codes, or other user-specific data that isn’t known at compile time.
Once compiled, the dynamic model behaves like a regular static model and can be set on a recognizer.
import com.vivoka.vsdk.asr.DynamicModel;
DynamicModel model = com.vivoka.vsdk.asr.csdk.Engine.getInstance().getDynamicModel("dynamic-model-name");
model.addData("item", "coffee", new ArrayList<>(Arrays.asList("'kO.fi", "Ek.'spR+Es.o&U"))); // With custom phonetics
model.addData("item", "cheese");
model.addData("item", "potato");
model.compile(); // Don't forget to compile after adding data to apply changes
// We can now apply it to the recognizer.
recognizer.setModel("dynamic-model-name");
Because dynamic models are compiled on the device at runtime, injecting a large number of slot entries—especially thousands—can introduce noticeable delays during compilation.
Compiled models are not cached, so you will need to recompile them each time the application restarts or each time you destroy the ASR engine.
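Because of this, a practical pattern is to rebuild and compile the dynamic model on a background thread at startup, before applying it to the recognizer. A minimal sketch using only the calls shown above (userItems stands for your runtime vocabulary and is an assumption of this example):

new Thread(() -> {
    try {
        DynamicModel model = com.vivoka.vsdk.asr.csdk.Engine.getInstance()
                .getDynamicModel("dynamic-model-name");
        for (String item : userItems) { // userItems: runtime vocabulary, assumed for this example
            model.addData("item", item);
        }
        model.compile(); // on-device compilation; may take a while with many entries
        recognizer.setModel("dynamic-model-name");
    } catch (Exception e) {
        // Handle exception...
    }
}).start();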