SSML reference
The Speech Synthesis Markup Language (SSML) is an XML-based markup language for speech synthesis. Its essential role is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms.
Note that not all of the elements and options described in the W3C SSML specification are currently supported by all SDKs. This page details which elements are available for each SDK.
Reserved characters
Avoid using SSML reserved characters in text that is to be converted to audio. When you need a reserved character, use its escape code so that it is not interpreted as markup. The following table lists the reserved SSML characters and their escape codes.
Character | Escape code |
---|---|
" | &quot; |
& | &amp; |
' | &apos; |
< | &lt; |
> | &gt; |
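As an illustration of the escaping rule above, here is a minimal Python sketch that replaces the five reserved characters with the standard XML escape codes before the text is placed inside SSML. The function name `escape_ssml_text` is hypothetical and not part of any SDK; it only relies on the Python standard library.

```python
from xml.sax.saxutils import escape

# escape() already handles &, < and >; the two quote characters
# have to be passed explicitly as extra entities.
_QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_ssml_text(text: str) -> str:
    """Return text with the SSML reserved characters replaced by escape codes.

    Hypothetical helper for illustration; it is not an SDK function.
    """
    return escape(text, _QUOTE_ENTITIES)


if __name__ == "__main__":
    print(escape_ssml_text('Tom & Jerry say "hi" <now>'))
```

Escape only the text content, not the SSML tags themselves, otherwise the markup would be read aloud instead of being interpreted.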
Markups
Audio VSDK-CSDK
Audio SSML Markup is used to insert a recorded audio file. The element's content is fallback text: it is synthesized when the audio file cannot be retrieved or is not supported.
Example:
<audio src="file:laugh">
haha
</audio>
Attribute | Description |
---|---|
src | The URI of a document with an appropriate MIME type. URIs may be absolute or relative to the base:uri specified in <speak> element. Audio files may be local ( Supported audio files:
VSDK-CSDK |
Break VSDK-CSDK
Break SSML Markup is used to temporarily pause the speech.
It is inserted at cursor position as an empty element, and can be used with milliseconds or seconds.
Example:
<break time="300ms"/>
Attribute | Description | |
---|---|---|
time | Positive number or zero followed by a unit, milliseconds (ms) or seconds (s). VSDK-BARATINOO Extension: percentage values are also accepted. | |
strength | Value | VSDK-CSDK |
none | 0ms | |
x-weak | 20ms | |
weak | 100ms | |
medium | 500ms | |
strong | 1000ms | |
x-strong | 1500ms |
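The strength table above can be expressed as a small lookup. The sketch below, in Python, turns a strength name into a break element with the equivalent explicit time. The helper name `break_for_strength` is hypothetical, and treating a strength as interchangeable with its listed duration is an assumption based only on the table.

```python
# Pause lengths in milliseconds, copied from the VSDK-CSDK strength table.
STRENGTH_MS = {
    "none": 0,
    "x-weak": 20,
    "weak": 100,
    "medium": 500,
    "strong": 1000,
    "x-strong": 1500,
}


def break_for_strength(strength: str) -> str:
    """Return a <break> element whose time matches the given strength.

    Hypothetical helper; assumes a strength behaves like its listed duration.
    """
    if strength not in STRENGTH_MS:
        raise ValueError(f"unknown break strength: {strength!r}")
    return f'<break time="{STRENGTH_MS[strength]}ms"/>'


if __name__ == "__main__":
    print(break_for_strength("medium"))
```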
Emphasis VSDK-CSDK
Emphasis SSML Markup is used to request that the contained text be spoken with emphasis. Please note that the realization of emphasis is voice dependent.
Example:
That is a <emphasis> big </emphasis> car!
Attribute | Description | |
---|---|---|
level |
|
Lang VSDK-CSDK
With the lang SSML markup it is possible to switch language. Changing the language also changes the voice, if there is a voice available.
The lang element can only contain text to be rendered and the following elements: audio, break, emphasis, lang, lookup, mark, p, phoneme, prosody, say-as, sub, s, token, voice and w.
Example:
English, <lang xml:lang="de">Deutsch</lang>, English.
Attribute | Description | |
---|---|---|
xml:lang | A required attribute specifying the language of the element. | |
onlangfailure | An optional attribute specifying the desired behavior upon language speaking failure. | |
Value | Description | |
ignoretext | The synthesis processor will not attempt to render the text that is in the failed language. | |
ignorelang | The synthesis processor will ignore the change in language and speak as if the content were in the previous language. | |
changevoice | If a voice exists that can speak the language, the synthesis processor will switch to that voice and speak the content. Otherwise, the processor chooses another behavior (either ignoretext or ignorelang). | |
processorchoice | The synthesis processor chooses the behavior (either changevoice, ignoretext, or ignorelang). |
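The containment rule for the lang element described in this section can be checked before sending text to the engine. The following Python sketch parses a lang fragment with the standard library and reports direct child elements that are not in the permitted list. The function name `disallowed_children` is hypothetical, and the check only looks at direct children, not deeper nesting.

```python
import xml.etree.ElementTree as ET

# Elements that the lang element may contain, as listed in this section.
ALLOWED_IN_LANG = {
    "audio", "break", "emphasis", "lang", "lookup", "mark", "p", "phoneme",
    "prosody", "say-as", "sub", "s", "token", "voice", "w",
}


def disallowed_children(lang_fragment: str) -> list:
    """Return the tag names of direct children that lang may not contain.

    Hypothetical helper for illustration; it does not inspect nested levels.
    """
    root = ET.fromstring(lang_fragment)
    if root.tag != "lang":
        raise ValueError("expected a <lang> element")
    return [child.tag for child in root if child.tag not in ALLOWED_IN_LANG]


if __name__ == "__main__":
    fragment = (
        '<lang xml:lang="de">Hallo <break time="300ms"/>'
        '<style name="lively">du</style></lang>'
    )
    print(disallowed_children(fragment))
```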
Paragraph VSDK-CSDK
Paragraph SSML markup is used to indicate a paragraph in your text.
While the TTS engine already recognizes paragraphs automatically, marking them explicitly can help it better understand and render your text.
Example:
<p>
You have 4 new messages.
</p>
Attribute | Description | |
---|---|---|
xml:lang | An optional attribute specifying the language of the element. | |
onlangfailure | An optional attribute specifying the desired behavior upon language speaking failure. | |
Value | Description | |
ignoretext | The synthesis processor will not attempt to render the text that is in the failed language. | |
ignorelang | The synthesis processor will ignore the change in language and speak as if the content were in the previous language. | |
changevoice | If a voice exists that can speak the language, the synthesis processor will switch to that voice and speak the content. Otherwise, the processor chooses another behavior (either ignoretext or ignorelang). | |
processorchoice | The synthesis processor chooses the behavior (either changevoice, ignoretext, or ignorelang). |
Phoneme VSDK-CSDK
Phoneme SSML Markup is used to provide a phonetic pronunciation for the contained text.
Example:
<phoneme alphabet="ipa" ph="vivo͡ʊkə">
Vivoka
</phoneme>
Support of the alphabet is limited to sounds that map to the phonetic symbols of the current voice.
Attribute | Description | |
---|---|---|
alphabet | SDK | Values |
VSDK-CSDK |
| |
ph | List of phonetic symbols, separated by underscores. |
Pitch VSDK-CSDK
Pitch SSML Markup is used to set the pitch of the voice.
It accepts predefined values as well as relative percentages (a number followed by %).
Example:
<prosody pitch="x-low">Oh my voice</prosody>
Value | VSDK-CSDK |
---|---|
x-low | -30% |
low | -15% |
medium | 0% |
high | +35% |
x-high | +60% |
default | 0% |
Relative percentage | [+/-] number followed by % |
Prompt VSDK-CSDK
Prompt SSML Markup is used to insert an ActivePrompt at a specific location in the text.
Example:
<prompt id="myPrompt"></prompt>
Attribute | Description | |
---|---|---|
id | The prompt id. |
Rate VSDK-CSDK
Rate SSML Markup is used to set speech rate of the voice.
It accepts predefined values as well as relative percentages (a number followed by %).
Example:
<p>
<s>
The subject is <prosody rate="-20%">ski trip</prosody>
</s>
</p>
Value | VSDK-CSDK |
---|---|
x-slow | 50 |
slow | 75 |
medium | 100 |
fast | 150 |
x-fast | 200 |
default | 100 |
Relative percentage | [+/-] number followed by %, Extension of SSML 1.1. |
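The relative percentage format from the table above, an optional sign and a number followed by %, can be validated before building the element. This is a minimal Python sketch; the names `is_relative_percentage` and `prosody_rate` are hypothetical, and treating the sign as optional and allowing a decimal part are assumptions, since the table only states "[+/-] number followed by %".

```python
import re
from xml.sax.saxutils import escape

# Optional sign, digits with an optional decimal part, then a percent sign.
_RELATIVE_PERCENT = re.compile(r"^[+-]?\d+(?:\.\d+)?%$")


def is_relative_percentage(value: str) -> bool:
    """Return True if value looks like '+10%', '-20%' or '15%'."""
    return _RELATIVE_PERCENT.match(value) is not None


def prosody_rate(text: str, rate: str) -> str:
    """Wrap text in a prosody element with a relative rate (hypothetical helper)."""
    if not is_relative_percentage(rate):
        raise ValueError(f"not a relative percentage: {rate!r}")
    return f'<prosody rate="{rate}">{escape(text)}</prosody>'


if __name__ == "__main__":
    print(prosody_rate("ski trip", "-20%"))
```

The sketch only covers relative percentages; predefined rate values would need a separate check against the names supported by the SDK in use.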
Say as VSDK-CSDK
Say-as SSML Markup is used to indicate the type of text construct contained within the element.
Multiple format values are available for each interpret-as value, but their realization is voice-dependent.
The attribute values that may have an effect on rendering depend on the current voice.
Example: Will be read as "third"
<say-as interpret-as="ordinal">3</say-as>
Attribute | Description | |
---|---|---|
format | The date format may be optionally specified via format attribute, to supersede the language defaults, e.g. | |
interpret-as | Indicates the content type of the contained text construct. | |
Value | Description | |
address VSDK-CSDK | Expand text as an address, including street names and numbers, zip codes, state names, etc. | |
cardinal VSDK-CSDK | Reads as a cardinal number. | |
code VSDK-CSDK | Expand numbers or codes reading them digit by digit. | |
currency VSDK-CSDK | Expand text as a decimal currency including currency abbreviations. | |
date VSDK-CSDK | Read digits as date. | |
decimal VSDK-CSDK | Same as number but including comma/dot normalization. | |
digits VSDK-CSDK | Expand numbers or codes reading them digit by digit. | |
distance VSDK-CSDK | Expand text as a distance measurement. | |
normal VSDK-CSDK | Default text normalization. | |
number VSDK-CSDK | Expand cardinal or comma-formatted numbers up to 15 digits. | |
ordinal VSDK-CSDK | Reads as an ordinal number. | |
phone VSDK-CSDK | Expand text as a telephone number including country codes, prefixes, tel. word indicators, etc. | |
rational VSDK-CSDK | Same as number but including comma/dot normalization. | |
real VSDK-CSDK | Same as number but including comma/dot normalization. | |
sms VSDK-CSDK | Expand text as an SMS message, reading web addresses, smileys, email addresses, etc. | |
spell VSDK-CSDK | Spell out the input text that follows. | |
telephone VSDK-CSDK | Reads as a telephone number. | |
time VSDK-CSDK | Expand text as a clock reading (hour, minutes, am, pm), a duration or a time range. | |
zip VSDK-CSDK | Expand text as a zip code. |
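Because only the interpret-as values listed in the table above are documented, it can be useful to reject anything else when generating markup. The Python sketch below builds a say-as element and validates the value against that list. The function name `say_as` is hypothetical, and whether a given value has an audible effect still depends on the current voice.

```python
from xml.sax.saxutils import escape, quoteattr

# interpret-as values copied from the table in this section.
INTERPRET_AS_VALUES = {
    "address", "cardinal", "code", "currency", "date", "decimal", "digits",
    "distance", "normal", "number", "ordinal", "phone", "rational", "real",
    "sms", "spell", "telephone", "time", "zip",
}


def say_as(text: str, interpret_as: str) -> str:
    """Return a say-as element for text (hypothetical helper, not an SDK call)."""
    if interpret_as not in INTERPRET_AS_VALUES:
        raise ValueError(f"unsupported interpret-as value: {interpret_as!r}")
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(text)}</say-as>"


if __name__ == "__main__":
    print(say_as("3", "ordinal"))
```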
Sentence VSDK-CSDK
Sentence SSML markup is used to indicate a sentence in your text.
While the TTS engine already recognizes sentences automatically, marking them explicitly can help it better understand and render your text. You can place multiple sentences in a paragraph.
Example:
<p>
<s>This is the first sentence of the paragraph.</s>
<s>Here's another sentence.</s>
</p>
Attribute | Description | |
---|---|---|
xml:lang | An optional attribute specifying the language of the element. | |
onlangfailure | An optional attribute specifying the desired behavior upon language speaking failure. | |
Value | Description | |
ignoretext | The synthesis processor will not attempt to render the text that is in the failed language. | |
ignorelang | The synthesis processor will ignore the change in language and speak as if the content were in the previous language. | |
changevoice | If a voice exists that can speak the language, the synthesis processor will switch to that voice and speak the content. Otherwise, the processor chooses another behavior (either ignoretext or ignorelang). | |
processorchoice | The synthesis processor chooses the behavior (either changevoice, ignoretext, or ignorelang). |
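The paragraph and sentence structure shown in this section can be generated from a plain list of sentences. This Python sketch wraps each sentence in an s element and the whole list in a p element. The function name `paragraph` is hypothetical; the output is written on one line, which is assumed to be equivalent to the indented form in the example.

```python
from xml.sax.saxutils import escape


def paragraph(sentences) -> str:
    """Return one <p> element containing one <s> element per sentence.

    Hypothetical helper. escape() replaces &, < and > in the sentence text;
    quote characters are left unchanged.
    """
    body = "".join(f"<s>{escape(sentence)}</s>" for sentence in sentences)
    return f"<p>{body}</p>"


if __name__ == "__main__":
    print(paragraph([
        "This is the first sentence of the paragraph.",
        "Here's another sentence.",
    ]))
```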
Style VSDK-CSDK
Style SSML Markup is used to set an alternative speaking style instead of the normal one.
Please note that a particular style can be incompatible with some voices.
Example:
Sorry <break time="300ms"/>
<style name="lively">sorry</style>
Not all styles are supported by all vsdk-csdk voices.
Attribute | Description |
---|---|
name | The speaking style name to use. You can check this page to get the supported values for each voice. |
Sub VSDK-CSDK
Sub SSML Markup is used to substitute text for the purposes of pronunciation. The sub element can contain only text (no elements).
Example:
<sub alias="Voice Development Kit">VDK</sub>
Attribute | Description |
---|---|
alias | The content that the voice synthesis will read instead of the content of the element. |
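When several abbreviations need a spoken form, the sub element can be applied from a small glossary. The Python sketch below replaces whole-word occurrences of each glossary key with a sub element carrying the alias. The function name `apply_aliases` and the glossary are hypothetical, and the sketch assumes the input text contains no reserved characters or existing markup.

```python
import re
from xml.sax.saxutils import quoteattr


def apply_aliases(text: str, aliases: dict) -> str:
    """Wrap every whole-word glossary key found in text in a <sub> element.

    Hypothetical helper; assumes text is plain, already-safe content.
    """
    if not aliases:
        return text
    # Longest keys first so that a longer abbreviation wins over a shorter prefix.
    keys = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")

    def replace(match):
        word = match.group(1)
        return f"<sub alias={quoteattr(aliases[word])}>{word}</sub>"

    return pattern.sub(replace, text)


if __name__ == "__main__":
    print(apply_aliases("The VDK is ready.", {"VDK": "Voice Development Kit"}))
```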
Timbre VSDK-CSDK
Timbre SSML Markup is a rate/pitch warping coefficient that maintains the duration of phonemes and enables voice timbre to be modified.
It accepts predefined values as well as relative percentages (a number followed by %).
VSDK-CSDK example:
<prosody timbre="+100%">
I am speaking with a different voice timbre.
</prosody>
VSDK-BARATINOO example:
<prosody vox:timbre="+100%">
I am speaking with a different voice timbre.
</prosody>
Attribute | Description | |
---|---|---|
timbre VSDK-CSDK | x-young | +35% |
young | +20% | |
medium | 0% | |
old | -20% | |
x-old | -35% | |
default | 0% | |
Relative percentage | [+/-] number followed by % |
Voice VSDK-CSDK
Voice SSML Markup is used to change the language and voice applied to the text for rendering.
Example:
<voice xml:lang="de">Deutsch</voice>
Attribute | Description |
---|---|
name | VSDK-CSDK |
gender | VSDK-CSDK |
xml:lang | VSDK-CSDK |
age | VSDK-CSDK |
Volume VSDK-CSDK
Volume SSML Markup is used to set the volume of the voice. It accepts predefined values as well as positive numbers.
Example:
<prosody volume="+100%">
I am speaking this at approximately twice the original signal amplitude.
</prosody>
Value | VSDK-CSDK |
---|---|
default | 80 |
silent | 0 |
x-soft | 26 |
soft | 52 |
medium | 80 |
loud | 90 |
x-loud | 100 |
Relative percentage | [+/-] number followed by % |
Relative value | [+/-] number with no units |
Absolute value | Multiplier of the initial timbre value for the current voice (unsigned number with no units or followed by |
Word VSDK-CSDK
Word SSML Markup can be used to express segmentation of a word.
Example:
<w>Apple</w>
Attribute | Description |
---|---|
xml:lang | An optional attribute specifying the language of the element. |
role | A QName used in conjunction with lexicons. |