SSML reference
The Speech Synthesis Markup Language (SSML) is an XML-based markup language for speech synthesis. Its essential role is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms.
Note that not all of the elements and options described in the W3C SSML specification are currently supported by all SDKs. This page details which elements are available for each SDK.
Reserved characters
Avoid using SSML reserved characters in the text that is to be converted to audio. When you need to use a reserved character, use its escape code so that it is not interpreted as markup. The following table shows the reserved SSML characters and their associated escape codes.
Character | Escape code |
---|---|
" | &quot; |
& | &amp; |
' | &apos; |
< | &lt; |
> | &gt; |
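As a minimal sketch of how the reserved characters above can be escaped before text is placed in an SSML document, the following Python helper relies on the standard library's `xml.sax.saxutils.escape`. The function name `escape_ssml_text` is hypothetical and is not part of any SDK described on this page.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by itself; the quotes are passed as extra entities.
_QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_ssml_text(text: str) -> str:
    """Replace the five reserved characters with their escape codes."""
    return escape(text, _QUOTE_ENTITIES)
```

For instance, `escape_ssml_text('Tom & "Jerry" <3')` returns `Tom &amp; &quot;Jerry&quot; &lt;3`. Escaping should be applied to the plain text only, not to the SSML elements themselves.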
Markups
Audio VSDK-CSDK VSDK-BARATINOO VSDK-VTAPI
Audio SSML Markup is used to insert a recorded audio file. If the audio file cannot be retrieved, the element's contents are synthesized. The content is the fallback text used when the audio file is not supported.
Example:
<audio src="file:laugh">
haha
</audio>
Attribute | Description |
---|---|
src | The URI of a document with an appropriate MIME type. URIs may be absolute or relative to the base:uri specified in <speak> element. Audio files may be local ( Supported audio files:
VSDK-VTAPI VSDK-CSDK VSDK-VTAPI |
mode VSDK-VTAPI | It is a custom attribute. If it is set as |
fetchtimeout VSDK-BARATINOO VSDK-VTAPI (SSML 1.1) | Signed or unsigned positive number or zero followed by VSDK-BARATINOO VSDK-VTAPI |
fetchhint VSDK-BARATINOO (SSML 1.1) | This tells the synthesis processor whether or not it can attempt to optimize rendering by pre-fetching audio. Available values:
|
maxage VSDK-BARATINOO (SSML 1.1) | A positive integer or zero. |
maxstale VSDK-BARATINOO (SSML 1.1) | A positive integer or zero. |
clipBegin VSDK-BARATINOO (SSML 1.1) | Signed or unsigned positive number or zero followed by |
clipEnd VSDK-BARATINOO (SSML 1.1) | Signed or unsigned positive number or zero followed by |
repeatCount VSDK-BARATINOO (SSML 1.1) | Signed or unsigned positive number or zero. Default VSDK-BARATINOO |
repeatDur VSDK-BARATINOO (SSML 1.1) | Signed or unsigned positive number or zero followed by VSDK-BARATINOO |
soundLevel VSDK-VTAPI VSDK-BARATINOO (SSML 1.1) | Signed number followed by VSDK-BARATINOO |
speed VSDK-VTAPI VSDK-BARATINOO (SSML 1.1) | Unsigned positive number or zero followed by VSDK-BARATINOO |
vox:gain VSDK-BARATINOO | Signed number followed by |
vox:fadelevel VSDK-BARATINOO | Signed number followed by |
vox:fadein VSDK-BARATINOO | Signed or unsigned positive number or zero followed by VSDK-BARATINOO |
vox:fadeout VSDK-BARATINOO | Signed or unsigned positive number or zero followed by VSDK-BARATINOO |
vox:fadeinAttack VSDK-BARATINOO | Signed or unsigned positive number or zero followed by |
vox:fadeinRelease VSDK-BARATINOO | Signed or unsigned positive number or zero followed by |
vox:fadeoutAttack VSDK-BARATINOO | Signed or unsigned positive number or zero followed by |
vox:fadeoutRelease VSDK-BARATINOO | Signed or unsigned positive number or zero followed by |
vox:tempo VSDK-BARATINOO | The tempo attribute can be used to speed up or slow down the rate of the audio file without changing the pitch level. VSDK-BARATINOO |
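The fallback behaviour described above (the element content is synthesized when the audio file cannot be used) can be prepared programmatically. This is only an illustrative sketch using Python's standard `xml.etree.ElementTree`; the helper name `audio_element` is an assumption, and only the `src` attribute from the table is covered.

```python
import xml.etree.ElementTree as ET


def audio_element(src: str, fallback_text: str) -> str:
    """Build an <audio> element whose content is the fallback text."""
    element = ET.Element("audio", {"src": src})
    # The element text is what gets synthesized if the file is unavailable.
    element.text = fallback_text
    return ET.tostring(element, encoding="unicode")
```

Calling `audio_element("file:laugh", "haha")` yields `<audio src="file:laugh">haha</audio>`. ElementTree escapes reserved characters in the fallback text automatically.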
Audio Mix VSDK-BARATINOO
Audiomix SSML Markup is used to insert a recorded audio file and mix it with the element content. If the audio file is longer than the speech, it is truncated. If it is shorter, it is repeated.
Attributes of the <audiomix> element have the same meaning and restrictions as those of the <audio> element, but the default fade attack and release durations may differ.
Example:
<vox:audiomix src="file:laugh" fetchtimeout="3ms">
haha
</vox:audiomix>
Attribute | Description |
---|---|
src | Name of file (absolute or relative URI) |
fetchtimeout | Signed or unsigned positive number or zero followed by |
fetchhint | This tells the synthesis processor whether or not it can attempt to optimize rendering by pre-fetching audio. Available values:
|
maxage | A positive integer or zero. |
maxstale | A positive integer or zero. |
clipBegin | Signed or unsigned positive number or zero followed by |
clipEnd | Signed or unsigned positive number or zero followed by |
soundLevel | Signed number followed by VSDK-BARATINOO |
speed | Unsigned positive number or zero followed by VSDK-BARATINOO |
gain | Signed number followed by |
fadelevel | Signed number followed by |
fadein | Signed or unsigned positive number or zero followed by VSDK-BARATINOO |
fadeout | Signed or unsigned positive number or zero followed by VSDK-BARATINOO |
fadeinAttack | Signed or unsigned positive number or zero followed by |
fadeinRelease | Signed or unsigned positive number or zero followed by |
fadeoutAttack | Signed or unsigned positive number or zero followed by |
fadeoutRelease | Signed or unsigned positive number or zero followed by |
Break VSDK-CSDK VSDK-VTAPI VSDK-BARATINOO
Break SSML Markup is used to temporarily pause the speech.
It is inserted at cursor position as an empty element, and can be used with milliseconds or seconds.
Example:
<break time="300ms"/>
Attribute | Description | |||
---|---|---|---|---|
time | Signed or unsigned positive number or zero followed by VSDK-BARATINOO Extension: percentage values are also accepted. | |||
strength | Value | VSDK-CSDK | VSDK-BARATINOO | VSDK-VTAPI |
none | 0ms | ≅ 0ms | ≅ 0ms | |
x-weak | 20ms | ≅ 50ms | ≅ 200ms | |
weak | 100ms | ≅ 100ms | ≅ 450ms | |
medium | 500ms | ≅ 500ms | ≅ 700ms | |
strong | 1000ms | ≅ 1000ms | ≅ 900ms | |
x-strong | 1500ms | ≅ 2000ms | ≅ 1200ms |
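A small helper can keep break elements consistent with the attributes listed above. The sketch below is hypothetical (the name `break_tag` is not an SDK function). It accepts either a duration in milliseconds, one of the strength names from the table, or both.

```python
from typing import Optional

# Strength names as listed in the table above.
STRENGTHS = ("none", "x-weak", "weak", "medium", "strong", "x-strong")


def break_tag(time_ms: Optional[int] = None, strength: Optional[str] = None) -> str:
    """Return an empty <break/> element with the requested attributes."""
    attributes = []
    if time_ms is not None:
        if time_ms < 0:
            raise ValueError("time_ms must be zero or positive")
        attributes.append(f'time="{time_ms}ms"')
    if strength is not None:
        if strength not in STRENGTHS:
            raise ValueError(f"unknown strength: {strength!r}")
        attributes.append(f'strength="{strength}"')
    if not attributes:
        return "<break/>"
    return "<break " + " ".join(attributes) + "/>"
```

`break_tag(300)` produces the same markup as the example above, and `break_tag(strength="medium")` produces `<break strength="medium"/>`.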
Checksum VSDK-BARATINOO
Enables a cyclic-redundancy check to be performed on the signal and events in the most recent breath group (delimited by silence) rendered from the content of the current document.
Example:
<vox:checksum crc32="2016915618"></vox:checksum>
Attribute | Description | |||
---|---|---|---|---|
crc32 | Unsigned positive number or zero. |
Computed duration VSDK-BARATINOO
Example:
<prosody vox:computedduration="on"></prosody>
Attribute | Description | |||
---|---|---|---|---|
vox:computedduration |
|
Computed pitch VSDK-BARATINOO
Example:
<prosody vox:computedpitch="on"></prosody>
Attribute | Description | |||
---|---|---|---|---|
vox:computedpitch |
|
Contour VSDK-BARATINOO
Contour SSML Markup is used to set different pitch values at different timestamps.
In each pair (time, pitch), the first value is a percentage of the period of the contained text and the second value is the value of the pitch attribute.
Example:
<prosody contour="(0%, +10%) (50%, +50%) (100%, +90%)">
I am speaking.
</prosody>
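The contour attribute value is a space-separated list of (time, pitch) pairs, so it can be generated from data. The following sketch assumes integer percentages for both values and uses a hypothetical helper name, `contour_value`. Other pitch notations are out of its scope.

```python
from typing import Iterable, Tuple


def contour_value(points: Iterable[Tuple[int, int]]) -> str:
    """Format (time %, relative pitch %) pairs as a contour attribute value."""
    # The pitch is always written with an explicit sign, as in the example above.
    return " ".join(
        f"({time_pct}%, {pitch_pct:+d}%)" for time_pct, pitch_pct in points
    )
```

`contour_value([(0, 10), (50, 50), (100, 90)])` returns `(0%, +10%) (50%, +50%) (100%, +90%)`, the value used in the example above.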
Duration VSDK-BARATINOO
Duration SSML Markup is used to set the duration of the marked speech. The value is a signed or unsigned positive number or zero, followed by s for seconds or ms for milliseconds.
Example:
<prosody duration="5s">I'm speaking very slowly.</prosody>
Emphasis VSDK-CSDK VSDK-VTAPI VSDK-BARATINOO
Emphasis SSML Markup is used to request that the contained text be spoken with emphasis. Please note that the realization of emphasis is voice dependent.
Example:
That is a <emphasis> big </emphasis> car!
VSDK-BARATINOO
The realization of emphasis is voice dependent.
Attribute | Description | |
---|---|---|
level |
|
Lang VSDK-CSDK VSDK-VTAPI VSDK-BARATINOO
With the lang SSML markup, it is possible to switch language. Changing the language also changes the voice, if there is a voice available.
The lang element can only contain text to be rendered and the following elements: audio, break, emphasis, lang, lookup, mark, p, phoneme, prosody, say-as, sub, s, token, voice and w.
Example:
English, <lang xml:lang="de">Deutsch</lang>, English.
Attribute | Description | |
---|---|---|
xml:lang | A required attribute specifying the language of the element. | |
onlangfailure | An optional attribute specifying the desired behavior upon language speaking failure. VSDK-VTAPI | |
Value | Description | |
ignoretext | The synthesis processor will not attempt to render the text that is in the failed language. | |
ignorelang | The synthesis processor will ignore the change in language and speak as if the content were in the previous language. | |
changevoice | If a voice exists that can speak the language, the synthesis processor will switch to that voice and speak the content. Otherwise, the processor chooses another behavior (either | |
processorchoice | The synthesis processor chooses the behavior (either |
Lexicon VSDK-VTAPI VSDK-BARATINOO
Lexicon SSML markup is used to reference a lexicon document.
VSDK-VTAPI
Supported format: PLS (Pronunciation Lexicon Specification 1.0) and CSV (User-Dictionary of vsdk-vtapi).
VSDK-BARATINOO
Supported format: PLS (Pronunciation Lexicon Specification 1.0).
Example:
<lexicon xml:id="myLexiconDoc"></lexicon>
Attribute | Description |
---|---|
uri | Location of the lexicon document. |
xml:id | A unique identifier for the lexicon document. |
type | VSDK-BARATINOO |
fetchtimeout | VSDK-BARATINOO |
maxage | VSDK-BARATINOO |
maxstale | VSDK-BARATINOO |
Lookup VSDK-VTAPI VSDK-BARATINOO
Example:
<lookup ref="myLexiconDoc"></lookup>
Attribute | Description | |
---|---|---|
ref | The |
Mark VSDK-VTAPI VSDK-BARATINOO
The mark element specifies a named event that the TTS engine triggers when that location in the text is reached in the generated audio stream. What effect this event has is application specific, but it does not affect the audio being generated.
The mark element must have a name attribute. The given name has no meaning to the TTS engine, but it is included in the generated event.
Note that built-in normalization rules might, in some particular contexts such as date and currency expressions, cause adjacent words and numbers to be reordered. The TTS engine will generally try to preserve the association between marks and adjacent words in such cases, meaning that the mark events are not necessarily triggered in the exact order in which they occur in the SSML input but rather in a way that is more true to the reading order.
Example:
<mark name="item1"/>First item, <mark name="item2"/>second item.
Attribute | Description | |
---|---|---|
name | Marker name | |
vox:type VSDK-BARATINOO | Value | Description |
sync (default) | The voice synthesis engine will trigger an event when that location in the text is encountered in the generated audio stream. | |
wait | A wait marker allows rendering of the audio signal to be deferred until the duration of the immediately following content has been determined. The end of the content whose duration is to be determined is marked by either the end of the root <speak> element or a <mark> element, of any type, that bears the same name (case-sensitive).
XML
When Baratinoo processes the above markup, notification is first made by a |
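When an application needs one event per item, marks can be inserted automatically. The sketch below is an assumption about how a caller might do this, not an SDK feature. The helper name `with_marks` and the `item` prefix are hypothetical, and the default sync marker type is assumed.

```python
from xml.sax.saxutils import escape, quoteattr


def with_marks(items, prefix="item"):
    """Join text chunks, placing a uniquely named <mark/> before each one."""
    parts = []
    for index, text in enumerate(items, start=1):
        name = quoteattr(f"{prefix}{index}")  # quoteattr adds the quotes
        parts.append(f"<mark name={name}/>{escape(text)}")
    return " ".join(parts)
```

`with_marks(["First item,", "second item."])` reproduces the markup from the example in this section. The application can then match the names reported in the generated events against the item indices.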
Paragraph VSDK-CSDK VSDK-VTAPI VSDK-BARATINOO
Paragraph SSML markup is used to indicate a paragraph in your text.
While the TTS engine already recognizes paragraphs automatically, marking them explicitly can help it to better understand and render your text.
Example:
<p>
You have 4 new messages.
</p>
Vsdk-vtapi adds a sentence break before and after the element.
Attribute | Description | |
---|---|---|
xml:lang | An optional attribute specifying the language of the element. | |
onlangfailure | An optional attribute specifying the desired behavior upon language speaking failure. Not supported by VSDK-VTAPI. | |
Value | Description | |
ignoretext | The synthesis processor will not attempt to render the text that is in the failed language. | |
ignorelang | The synthesis processor will ignore the change in language and speak as if the content were in the previous language. | |
changevoice | If a voice exists that can speak the language, the synthesis processor will switch to that voice and speak the content. Otherwise, the processor chooses another behavior (either | |
processorchoice | The synthesis processor chooses the behavior (either |
Phoneme VSDK-CSDK VSDK-VTAPI VSDK-BARATINOO
Phoneme SSML Markup is used to provide a phonetic pronunciation for the contained text.
Example:
<phoneme alphabet="ipa" ph="vivo͡ʊkə">
Vivoka
</phoneme>
Support of the alphabet is limited to sounds that map to the phonetic symbols of the current voice.
VSDK-BARATINOO
The value of the ph attribute is ignored for unsupported alphabets and a warning is issued.
Attribute | Description | |
---|---|---|
alphabet | SDK | Values |
VSDK-CSDK |
| |
VSDK-VTAPI |
| |
VSDK-BARATINOO |
| |
ph | List of phonetic symbols. Separated by underscore | |
type | VSDK-BARATINOO VSDK-VTAPI | |
vox:idl | VSDK-BARATINOO Control the inclusion or exclusion of specific acoustic units as candidate realizations for each part of the given phonetic pronunciation.
|
Pitch VSDK-CSDK VSDK-VTAPI VSDK-BARATINOO
Pitch SSML Markup is used to set the pitch of the voice.
It accepts predefined values as well as relative percentages (a number followed by %).
Example:
<prosody pitch="x-low">Oh my voice</prosody>
Value | VSDK-BARATINOO | VSDK-CSDK | VSDK-VTAPI | |
---|---|---|---|---|
x-low | 50% of | -30% | 50 | |
low | 75% of | -15% | 75 | |
medium | 100% of | 0% | 100 | |
high | 133% of | +35% | 150 | |
x-high | 200% of | +60% | 200 | |
default | Initial value for current voice | 0% | 100 | |
Relative percentage | [+/-] number followed by |
Prompt VSDK-CSDK
Prompt SSML Markup is used to insert an ActivePrompt at a specific location in the text.
Example:
<prompt id="myPrompt"></prompt>
Attribute | Description | |
---|---|---|
id | The prompt id. |
Range VSDK-BARATINOO
Range SSML Markup is used to set the range of the voice.
It accepts predefined values as well as relative percentages (a number followed by %).
Example:
I'm going <prosody range="x-low">far</prosody>
Value | Description |
---|---|
x-low | 50% of |
low | 75% of |
medium | 100% of |
high | 133% of |
x-high | 200% of |
default | Initial value for current voice |
Relative percentage | [+/-] number followed by |
Relative change | [+/-] number followed by |
Absolute value in Hertz | Unsigned number followed by |
Rate VSDK-CSDK VSDK-VTAPI VSDK-BARATINOO
Rate SSML Markup is used to set speech rate of the voice.
It accepts predefined values as well as relative percentages (a number followed by %).
Example:
<p>
<s>
The subject is <prosody rate="-20%">ski trip</prosody>
</s>
</p>
Value | VSDK-BARATINOO | VSDK-CSDK | VSDK-VTAPI |
---|---|---|---|
x-slow | 50% of | 50 | 50 |
slow | 75% of | 75 | 75 |
medium | 100% of | 100 | 100 |
fast | 125% of | 150 | 125 |
x-fast | 150% of | 200 | 150 |
default | Initial value for current voice | 100 | 100 |
Relative percentage | [+/-] number followed by |
Rate subject VSDK-BARATINOO
Example:
<prosody vox:rate-subject="pause"></prosody>
Value | Description |
---|---|
vox:rate-subject |
|
Say as VSDK-CSDK VSDK-VTAPI VSDK-BARATINOO
Say-as SSML Markup is used to indicate the type of text construct contained within the element.
Multiple format values are available for each interpret-as value, but their realization is voice-dependent.
The attribute values that may have an effect on rendering depend on the current voice.
Example (will be read as "third"):
<say-as interpret-as="ordinal">3</say-as>
Attribute | Description | |
---|---|---|
format | The date format may be optionally specified via format attribute, to supersede the language defaults, e.g. | |
interpret-as | Indicates the content type of the contained text construct. | |
Value | Description | |
address VSDK-CSDK | Expand text as an address, including street names and numbers, zip codes, state names, etc. | |
boolean VSDK-VTAPI | Reads as a boolean. | |
cardinal VSDK-CSDK | Reads as a cardinal number. | |
characters VSDK-BARATINOO VSDK-VTAPI | Spells out letters, reads digits one by one, and expands | |
code VSDK-CSDK | Expand numbers or codes reading them digit by digit | |
currency VSDK-CSDK VSDK-VTAPI | Expand text as a decimal currency including currency abbreviations. | |
date VSDK-CSDK VSDK-BARATINOO VSDK-VTAPI | Read digits as date. | |
decimal VSDK-CSDK | Same as number but including comma/dot normalization. | |
digits VSDK-CSDK VSDK-VTAPI | Expand numbers or codes reading them digit by digit. | |
distance VSDK-CSDK | Expand text as a distance measurement. | |
normal VSDK-CSDK | Default text normalization | |
number VSDK-CSDK VSDK-VTAPI | Expand cardinal/ comma formatted numbers up to 15 digits. | |
ordinal VSDK-CSDK | Reads as an ordinal number. | |
phone VSDK-CSDK VSDK-VTAPI | Expand text as a telephone number including country codes, prefixes, tel. word indicators, etc. | |
rational VSDK-CSDK | Same as number but including comma/dot normalization. | |
real VSDK-CSDK | Same as number but including comma/dot normalization. | |
sms VSDK-CSDK | Expand text as a sms message, reading web addresses, smileys, email addresses, etc. | |
spell VSDK-CSDK | Spell out the input text that follows. | |
telephone VSDK-CSDK | Reads as a telephone number. | |
time VSDK-CSDK VSDK-BARATINOO VSDK-VTAPI | Expand text as a clock reading (hour, minutes, am, pm), a duration or a time range. | |
zip VSDK-CSDK | Expand text as a zip code. | |
detail VSDK-VTAPI | An optional attribute whose accepted values depend on the interpret-as value. | |
type VSDK-VTAPI | A custom attribute with which interpret-as can be bypassed. It renders the text by defining a duration format. (
Sentence VSDK-CSDK VSDK-VTAPI VSDK-BARATINOO
Sentence SSML markup is used to indicate a sentence in your text.
While the TTS engine already recognizes sentences automatically, marking them explicitly can help it to better understand and render your text. You can place multiple sentences in a paragraph.
Example:
<p>
<s>This is the first sentence of the paragraph.</s>
<s>Here's another sentence.</s>
</p>
Attribute | Description | |
---|---|---|
xml:lang | An optional attribute specifying the language of the element. | |
onlangfailure | An optional attribute specifying the desired behavior upon language speaking failure. Not supported by VSDK-VTAPI. | |
Value | Description | |
ignoretext | The synthesis processor will not attempt to render the text that is in the failed language. | |
ignorelang | The synthesis processor will ignore the change in language and speak as if the content were in the previous language. | |
changevoice | If a voice exists that can speak the language, the synthesis processor will switch to that voice and speak the content. Otherwise, the processor chooses another behavior (either | |
processorchoice | The synthesis processor chooses the behavior (either |
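If the text is already split into sentences, the paragraph and sentence structure shown above can be generated rather than typed by hand. This is a minimal sketch with Python's standard `xml.etree.ElementTree`. The helper name `paragraph` is hypothetical, and the optional xml:lang and onlangfailure attributes are left out.

```python
import xml.etree.ElementTree as ET


def paragraph(sentences):
    """Wrap each sentence in <s> and all of them in a single <p>."""
    p = ET.Element("p")
    for sentence in sentences:
        s = ET.SubElement(p, "s")
        s.text = sentence  # reserved characters are escaped on serialization
    return ET.tostring(p, encoding="unicode")
```

`paragraph(["One.", "Two."])` returns `<p><s>One.</s><s>Two.</s></p>`. How the input text is split into sentences is left to the caller.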
Style VSDK-CSDK
Style SSML Markup is used to set an alternative speaking style instead of the normal one.
Please note that a particular style can be incompatible with some voices.
Example:
Sorry <break time="300ms"/>
<style name="lively">sorry</style>
Not all styles are supported by all vsdk-csdk voices.
Attribute | Description |
---|---|
name | The speaking style name to use. You can check this page to get the supported values for each voice. |
Sub VSDK-CSDK VSDK-VTAPI VSDK-BARATINOO
Sub SSML Markup is used to substitute text for the purposes of pronunciation. The sub element can contain only text (no elements).
Example:
<sub alias="Voice Development Kit">VDK</sub>
Attribute | Description |
---|---|
alias | The content that the voice synthesis will read instead of the content of the element. |
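Substitutions are often driven by a glossary of abbreviations. The sketch below shows one possible way to wrap every known abbreviation in a sub element. The helper name `apply_aliases` and the glossary are assumptions, and keys are assumed to begin and end with a letter or digit so that word boundaries apply.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def apply_aliases(text, aliases):
    """Wrap each whole-word occurrence of a key in <sub alias="...">."""
    if not aliases:
        return escape(text)
    # Longest keys first, so that a longer abbreviation wins over a shorter one.
    keys = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    parts = []
    last = 0
    for match in pattern.finditer(text):
        parts.append(escape(text[last:match.start()]))
        word = match.group(0)
        parts.append(f"<sub alias={quoteattr(aliases[word])}>{escape(word)}</sub>")
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)
```

With the glossary `{"VDK": "Voice Development Kit"}`, the input `The VDK is ready.` becomes `The <sub alias="Voice Development Kit">VDK</sub> is ready.` Because the sub element can contain only text, the helper never nests elements inside it.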
Timbre VSDK-CSDK VSDK-BARATINOO
Timbre SSML Markup is a rate/pitch warping coefficient that maintains the duration of phonemes and enables voice timbre to be modified.
It accepts predefined values as well as relative percentages (a number followed by %).
Vsdk-csdk example:
<prosody timbre="+100%">
I am speaking with a different voice timbre.
</prosody>
Vsdk-baratinoo example:
<prosody vox:timbre="+100%">
I am speaking with a different voice timbre.
</prosody>
</prosody>
Attribute | Description | |
---|---|---|
timbre VSDK-CSDK | x-young | +35% |
young | +20% | |
medium | 0% | |
old | -20% | |
x-old | -35% | |
default | 0% | |
Relative percentage | [+/-] number followed by | |
vox:timbre VSDK-BARATINOO | Relative percentage | [+/-] number followed by |
Relative value | [+/-] number with no units | |
Absolute value | Multiplier of the initial timbre value for the current voice (unsigned number with no units or followed by |
Token VSDK-VTAPI VSDK-BARATINOO
Token SSML Markup can be used to disambiguate heteronyms.
Example:
<token xml:id="myToken">VDK</token>
Attribute | Description |
---|---|
xml:lang | An optional attribute specifying the language of the element. |
role | A QName used in conjunction with lexicons. |
onlangfailure | VSDK-BARATINOO |
xml:id | VSDK-BARATINOO |
Voice VSDK-CSDK VSDK-VTAPI VSDK-BARATINOO
Voice SSML Markup is used to change the language and voice applied to the text for rendering.
Example:
<voice xml:lang="de">Deutsch</voice>
Attribute | Description |
---|---|
name | VSDK-CSDK VSDK-VTAPI VSDK-BARATINOO |
gender | VSDK-CSDK VSDK-VTAPI VSDK-BARATINOO |
xml:lang | VSDK-CSDK VSDK-BARATINOO |
age | VSDK-CSDK VSDK-BARATINOO |
languages | VSDK-VTAPI VSDK-BARATINOO |
required | VSDK-BARATINOO |
ordering | VSDK-BARATINOO |
onvoicefailure | VSDK-BARATINOO
|
variant | VSDK-BARATINOO |
Volume VSDK-CSDK VSDK-VTAPI VSDK-BARATINOO
Volume SSML Markup is used to set the volume of the voice. It accepts predefined values as well as positive numbers.
Example:
<prosody volume="+100%">
I am speaking this at approximately twice the original signal amplitude.
</prosody>
Value | VSDK-BARATINOO | VSDK-CSDK | VSDK-VTAPI |
---|---|---|---|
default | Initial value for current voice (60) | 80 | 100 |
silent | 0 relative to | 0 | 0 |
x-soft | 20 relative to | 26 | 32 |
soft | 40 relative to | 52 | 66 |
medium | 60 relative to | 80 | 100 |
loud | 80 relative to | 90 | 200 |
x-loud | 100 relative to | 100 | 300 |
Relative percentage | [+/-] number followed by | ||
Relative value | [+/-] number with no units | ||
Absolute value | Multiplier of the initial volume value for the current voice (unsigned number with no units or followed by
Word VSDK-CSDK VSDK-VTAPI VSDK-BARATINOO
Word SSML Markup can be used to express segmentation of a word.
Example:
<w>Apple</w>
Attribute | Description |
---|---|
xml:lang | An optional attribute specifying the language of the element. |
role | A QName used in conjunction with lexicons. |
onlangfailure | VSDK-BARATINOO |
xml:id | VSDK-BARATINOO |
vox:modes | VSDK-BARATINOO |