Acoustic model file
When you use VASR (VASR - C++), you need to provide one or more acoustic models (.vam files) to the engine. This page explains exactly what is inside those files.
A .vam file is actually an encrypted zip archive that contains the following files:
Filename | Description |
---|---|
| | Main configuration file for this acoustic model. A complete description of its content is given later on this page |
| | G2P (grapheme-to-phoneme) configuration file. A complete description of its content is given later on this page |
| | ONNX decoder of the G2P model, used when converting a word (or set of words) into its phonetic representation |
| | ONNX encoder of the G2P model, used when converting a word (or set of words) into its phonetic representation |
| | List of all the graphemes supported by this G2P model |
| | List of all the phonemes supported by this G2P model |
| | CSV file with 2 columns: an index and its associated phoneme. Used to convert from one phonetic alphabet to another |
| | CSV file with 2 columns: an index and its associated phoneme. Used to convert from one phonetic alphabet to another |
| | CSV file with 2 columns: an index and its associated phoneme. Used to convert from one phonetic alphabet to another |
| | VAD (Voice Activity Detection) ONNX model. Mainly used by the ASR to detect the beginning and end of speech for a given utterance |
| | Confidence model used by the engine to compute the confidence score of each word in the utterance |
| | The acoustic model itself. Receives audio features as input and outputs an encoded nnet result |
| | List of phonemes specific to the acoustic model engine (used by the VASR-compiler) |
Note that, with the exception of config.json, the actual names of these files can differ. The engine does not depend on fixed names, because it reads each file name from the config.json file.
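Because the file names are recorded in config.json rather than fixed, a tool that inspects an already decrypted and extracted model would look them up instead of hard-coding them. The following is a minimal Python sketch of that lookup, not part of the VASR API: the helper `referenced_files` is hypothetical, and the input is the `files` section of the example configuration shown on this page.

```python
import json

# The "files" section copied from the example config.json on this page.
# Key spelling (including "accoustic_model") is kept exactly as in the example.
CONFIG_TEXT = """
{
  "files": {
    "accoustic_model": "pretrained.dyn.uint8-quant.onnx",
    "confidence_model": "t7l2.conf_model.txt",
    "g2p_config": "g2p_config.json",
    "vad_model": "silero_vad.onnx",
    "phoneme_table": "tokens.txt",
    "phonetic_alphabets": {
      "ipa": "ipa.csv",
      "kirshenbaum": "kirshenbaum.csv",
      "lhp": "lhp.csv"
    }
  }
}
"""


def referenced_files(files_section, prefix=""):
    """Flatten the nested "files" mapping into {dotted.role: file_name}.

    Hypothetical helper for illustration only.
    """
    result = {}
    for key, value in files_section.items():
        role = prefix + key
        if isinstance(value, dict):
            # Nested groups such as "phonetic_alphabets" become dotted roles.
            result.update(referenced_files(value, role + "."))
        else:
            result[role] = value
    return result


config = json.loads(CONFIG_TEXT)
resolved = referenced_files(config["files"])
for role, name in resolved.items():
    print(f"{role}: {name}")
```

Each role (for example `vad_model` or `phonetic_alphabets.ipa`) maps to whatever file name the model author chose, which is why renaming a file inside the archive only requires updating config.json.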
config.json content
{
  "feature_extraction_attributes": {
    "sample_rate": 16000,
    "dither": 0.0,
    "snip_edges": false,
    "num_mel_bins": 80,
    "vtln_wrap": 1.0
  },
  "decoding_attributes": {
    "search_beam": 20.0,
    "output_beam": 5.0,
    "min_activate_states": 30,
    "max_activate_states": 10000,
    "sub_sampling_factor": 3,
    "num_detailed_nbest": 10
  },
  "online_decoding_attributes": {
    "cache_size": 2,
    "hidden_layer_size": 500,
    "num_lstms": 2,
    "num_tdnns": 7,
    "feat_frame_size": 25,
    "feat_frame_shift": 10,
    "decode_chunk_size": 75,
    "inference_chunk_size": 75,
    "accumulate_decoding": true,
    "padding_frames": 45,
    "nbest_scale": 0.5
  },
  "grammar_compiler_attributes": {
    "sil_score": 0.693147182,
    "no_sil_score": 0.693147182,
    "add_self_loops": true
  },
  "vad_attributes": {
    "min_speech_duration": 500,
    "min_silence_duration": 700,
    "speech_probability_threshold": 0.5,
    "window_sample_size": 1536,
    "onnx": {
      "session_options": {
        "intra_op_thread": 1
      },
      "input": {
        "input": "input",
        "history_context": "h0",
        "cell_state": "c0"
      },
      "output": {
        "output": "output",
        "history_context": "hn",
        "cell_state": "cn"
      }
    }
  },
  "ignored_phone_id_of_alphabets": [
    100,
    103
  ],
  "onnx_model_names": {
    "session_options": {
      "intra_op_thread": 1
    },
    "input": {
      "feature": "feats",
      "feature_cache": "feat_cache",
      "tdnn_cache": "tdnn_cache",
      "lstm_context": "in_lstm_cntxts"
    },
    "output": {
      "result": "posts",
      "tdnn_cache": "tdnn_out",
      "lstm_context": "out_lstm_cntxts"
    }
  },
  "files": {
    "accoustic_model": "pretrained.dyn.uint8-quant.onnx",
    "confidence_model": "t7l2.conf_model.txt",
    "g2p_config": "g2p_config.json",
    "vad_model": "silero_vad.onnx",
    "phoneme_table": "tokens.txt",
    "phonetic_alphabets": {
      "ipa": "ipa.csv",
      "kirshenbaum": "kirshenbaum.csv",
      "lhp": "lhp.csv"
    }
  },
  "metadata": {
    "version": 1,
    "model_version": 1,
    "language": "eng-US",
    "id": "vasr-eng-US-t7l2-1.0"
  }
}
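Several of the numeric values above are easier to read once converted into durations. The Python sketch below works through that arithmetic using values copied from the example. The interpretations are assumptions, not statements about the engine: `window_sample_size` is taken to be a sample count at `sample_rate`, `feat_frame_shift` to be in milliseconds, `decode_chunk_size` to be a number of feature frames, and `sil_score` to be a negative natural-log probability.

```python
import math

# Values copied from the example config.json above.
sample_rate = 16000        # feature_extraction_attributes.sample_rate, in Hz
window_sample_size = 1536  # vad_attributes.window_sample_size
feat_frame_shift = 10      # online_decoding_attributes.feat_frame_shift
decode_chunk_size = 75     # online_decoding_attributes.decode_chunk_size
sub_sampling_factor = 3    # decoding_attributes.sub_sampling_factor
sil_score = 0.693147182    # grammar_compiler_attributes.sil_score

# Assumption: the VAD window is a number of samples at sample_rate.
vad_window_ms = 1000 * window_sample_size / sample_rate

# Assumption: the chunk size counts feature frames and the shift is in ms.
chunk_ms = decode_chunk_size * feat_frame_shift

# Assumption: the acoustic model emits one output per sub_sampling_factor frames.
outputs_per_chunk = decode_chunk_size // sub_sampling_factor

# Assumption: sil_score is -ln(p), so p = exp(-sil_score).
sil_probability = math.exp(-sil_score)

print(f"VAD window: {vad_window_ms} ms")
print(f"Decoding chunk: {chunk_ms} ms of audio")
print(f"Acoustic model outputs per chunk: {outputs_per_chunk}")
print(f"Silence probability: {sil_probability:.3f}")
```

Under these assumptions, the VAD looks at 96 ms of audio per window, each decoding chunk covers 750 ms, and the equal `sil_score` and `no_sil_score` values correspond to an even 0.5 / 0.5 split between inserting and not inserting silence.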