
SSML reference

The Speech Synthesis Markup Language (SSML) is an XML-based markup language for speech synthesis. Its essential role is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms.

Note that not all of the elements and options described in the W3C SSML specification are currently supported by all SDKs. This page details which elements are available for each SDK.

Reserved characters

Avoid using SSML reserved characters in the text that is to be converted to audio. When you need one, use its escape code so that the character is not interpreted as markup. The following table shows the reserved SSML characters and their escape codes.

Character

Escape code

"

&quot;

&

&amp;

'

&apos;

<

&lt;

>

&gt;
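
As a minimal sketch of applying the table above before text is placed inside SSML, the Python standard library's escape function can be extended with the two quote characters. The helper name escape_ssml_text is an assumption made for illustration; it is not part of any SDK.

```python
from xml.sax.saxutils import escape

# escape() already handles &, < and >; the two quote characters are added here.
_QUOTES = {'"': "&quot;", "'": "&apos;"}


def escape_ssml_text(text: str) -> str:
    """Replace the five reserved characters with their escape codes.

    Hypothetical helper, not an SDK function.
    """
    return escape(text, _QUOTES)


print(escape_ssml_text('Tom & Jerry say "1 < 2"'))
```

The ampersand is replaced first, so the escape codes produced for the other characters are not escaped a second time.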

Markups

Audio VSDK-CSDK

Audio SSML Markup is used to insert a recorded audio file. If the audio file cannot be retrieved or its format is not supported, the element's content is synthesized instead as fallback text.

Example:

XML
<audio src="file:laugh">
  haha
</audio>

Attribute

Description

src

The URI of a document with an appropriate MIME type. URIs may be absolute or relative to the base:uri specified in <speak> element. Audio files may be local (file://, or absolute paths) or remote (http://).

Supported audio files:

  • VSDK-CSDK .WAV containing linear 16 bit PCM samples.

VSDK-CSDK
The audio file will automatically be resampled to match the current sampling rate before inserting it in the speech output.

Break VSDK-CSDK

Break SSML Markup is used to temporarily pause the speech.

It is inserted at cursor position as an empty element, and can be used with milliseconds or seconds.

Example:

XML
<break time="300ms"/>

Attribute

Description

time

A non-negative number, with or without a leading + sign, followed by s for seconds or ms for milliseconds.

VSDK-BARATINOO Extension: percentage values are also accepted.

strength

Value

VSDK-CSDK

none

0ms

x-weak

20ms

weak

100ms

medium

500ms

strong

1000ms

x-strong

1500ms
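
The following sketch shows how the two attributes could be turned into a pause length in milliseconds, using the VSDK-CSDK values from the table above. The function name break_ms is hypothetical, and two behaviors are assumptions: time takes precedence when both attributes are given, and an error is raised when neither is given. Percentage values are not handled.

```python
import re

# Pause length in milliseconds for each strength value (VSDK-CSDK column above).
STRENGTH_MS = {
    "none": 0, "x-weak": 20, "weak": 100,
    "medium": 500, "strong": 1000, "x-strong": 1500,
}

# Optional + sign, a non-negative number, then the unit s or ms.
_TIME_RE = re.compile(r"^\+?(\d+(?:\.\d+)?|\.\d+)(ms|s)$")


def break_ms(time=None, strength=None):
    """Return the pause in milliseconds described by <break> attributes.

    Hypothetical helper; assumes time wins over strength when both are set.
    """
    if time is not None:
        match = _TIME_RE.match(time.strip())
        if match is None:
            raise ValueError(f"unsupported time value: {time!r}")
        value, unit = float(match.group(1)), match.group(2)
        return value * 1000 if unit == "s" else value
    if strength is not None:
        return float(STRENGTH_MS[strength])
    raise ValueError("either time or strength is required")
```

For instance, break_ms(time="300ms") and break_ms(strength="weak") resolve the example and a table row respectively.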

Emphasis VSDK-CSDK

Emphasis SSML Markup is used to request that the contained text be spoken with emphasis. Please note that the realization of emphasis is voice dependent.

Example:

XML
That is a <emphasis> big </emphasis> car!

Attribute

Description

level

none
reduced
moderate (default)
strong

Lang VSDK-CSDK

With the lang SSML markup, it is possible to switch languages. Changing the language also changes the voice, if a voice is available for that language.

The lang element can only contain text to be rendered and the following elements: audio, break, emphasis, lang, lookup, mark, p, phoneme, prosody, say-as, sub, s, token, voice and w.

Example:

XML
English, <lang xml:lang="de">Deutsch</lang>, English.
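
The content rule for lang can be checked before sending text to the engine. The sketch below is one possible check with Python's standard XML parser; the function name is hypothetical, and it assumes the SSML fragment is wrapped in a single root element and declares no default namespace.

```python
import xml.etree.ElementTree as ET

# Elements that the lang element may contain, as listed above.
ALLOWED_IN_LANG = {
    "audio", "break", "emphasis", "lang", "lookup", "mark", "p", "phoneme",
    "prosody", "say-as", "sub", "s", "token", "voice", "w",
}


def disallowed_lang_children(ssml: str) -> list:
    """Return tag names found directly inside any <lang> that are not allowed.

    Hypothetical helper; assumes element names carry no namespace.
    """
    root = ET.fromstring(ssml)
    bad = []
    for lang in root.iter("lang"):
        bad.extend(child.tag for child in lang if child.tag not in ALLOWED_IN_LANG)
    return bad
```

An empty list means every lang element in the input only contains text and permitted elements.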

Attribute

Description

xml:lang

A required attribute specifying the language of the element.

onlangfailure

An optional attribute specifying the desired behavior upon language speaking failure.

Value

Description

ignoretext

The synthesis processor will not attempt to render the text that is in the failed language.

ignorelang

The synthesis processor will ignore the change in language and speak as if the content were in the previous language.

changevoice

If a voice exists that can speak the language, the synthesis processor will switch to that voice and speak the content. Otherwise, the processor chooses another behavior (either ignoretext or ignorelang).

processorchoice

The synthesis processor chooses the behavior (either changevoice, ignoretext, or ignorelang).

Paragraph VSDK-CSDK

Paragraph SSML markup is used to indicate a paragraph in your text.

While the TTS engine already recognizes paragraphs automatically, marking them explicitly can help it understand and render your text better.

Example:

XML
<p>
  You have 4 new messages.
</p>

Attribute

Description

xml:lang

An optional attribute specifying the language of the element.

onlangfailure

An optional attribute specifying the desired behavior upon language speaking failure.

Value

Description

ignoretext

The synthesis processor will not attempt to render the text that is in the failed language.

ignorelang

The synthesis processor will ignore the change in language and speak as if the content were in the previous language.

changevoice

If a voice exists that can speak the language, the synthesis processor will switch to that voice and speak the content. Otherwise, the processor chooses another behavior (either ignoretext or ignorelang).

processorchoice

The synthesis processor chooses the behavior (either changevoice, ignoretext, or ignorelang).

Phoneme VSDK-CSDK

Phoneme SSML Markup is used to provide a phonetic pronunciation for the contained text.

Example:

XML
<phoneme alphabet="ipa" ph="vivo͡ʊkə">
  Vivoka
</phoneme>

Support of the alphabet is limited to sounds that map to the phonetic symbols of the current voice.

Attribute

Description

alphabet

SDK

Values

VSDK-CSDK

lhp, nt-sampa, sxm-sampa, pinyin (for Chinese only), diacritized (for Arabic only)

ph

List of phonetic symbols.

Separated by underscore _ when x-voxygen alphabet is used.

Pitch VSDK-CSDK

Pitch SSML Markup is used to set the pitch of the voice.

It accepts predefined values as well as relative percentages (a number followed by %).

Example:

XML
<prosody pitch="x-low">Oh my voice</prosody>

Value

VSDK-CSDK

x-low

-30%

low

-15%

medium

0%

high

+35%

x-high

+60%

default

0%

Relative percentage

[+/-] number followed by %
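
To illustrate how the predefined values relate to relative percentages, the sketch below resolves either form to a pitch change in percent, using the VSDK-CSDK values from the table above. The function name pitch_percent is hypothetical, and treating the sign as optional is an assumption.

```python
# Pitch change in percent for each predefined value (VSDK-CSDK column above).
PITCH_PRESETS = {
    "x-low": -30, "low": -15, "medium": 0,
    "high": 35, "x-high": 60, "default": 0,
}


def pitch_percent(value: str) -> float:
    """Return the pitch change in percent for a preset or a relative percentage.

    Hypothetical helper, not an SDK function.
    """
    if value in PITCH_PRESETS:
        return float(PITCH_PRESETS[value])
    if value.endswith("%"):
        # float() accepts an optional leading + or - sign.
        return float(value[:-1])
    raise ValueError(f"unsupported pitch value: {value!r}")
```

With this mapping, pitch="x-low" and pitch="-30%" describe the same change.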

Prompt VSDK-CSDK

Prompt SSML Markup is used to insert an ActivePrompt at a specific location in the text.

Example:

XML
<prompt id="myPrompt"></prompt>

Attribute

Description

id

The prompt id.

Rate VSDK-CSDK

Rate SSML Markup is used to set speech rate of the voice.

It accepts predefined values as well as relative percentages (a number followed by %).

Example:

XML
<p>
  <s>
    The subject is <prosody rate="-20%">ski trip</prosody>
  </s>
</p>

Value

VSDK-CSDK

x-slow

50

slow

75

medium

100

fast

150

x-fast

200

default

100

Relative percentage

[+/-] number followed by %, Extension of SSML 1.1.
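
As a rough sketch of how a rate value could be resolved, the function below returns a predefined value from the table above or applies a relative percentage to the current rate. It assumes the table values are percentages of the normal speaking rate and that a relative percentage scales the rate currently in effect; neither assumption is confirmed by this page, and the name resolve_rate is hypothetical.

```python
# Rate for each predefined value (VSDK-CSDK column above),
# assumed to be a percentage of the normal speaking rate.
RATE_PRESETS = {
    "x-slow": 50, "slow": 75, "medium": 100,
    "fast": 150, "x-fast": 200, "default": 100,
}


def resolve_rate(value: str, current: float = 100.0) -> float:
    """Return the rate after applying a preset or a relative percentage.

    Hypothetical helper; assumes relative values scale the current rate.
    """
    if value in RATE_PRESETS:
        return float(RATE_PRESETS[value])
    if value.endswith("%"):
        change = float(value[:-1])  # accepts an optional + or - sign
        return current * (100 + change) / 100
    raise ValueError(f"unsupported rate value: {value!r}")
```

Under these assumptions, the rate="-20%" example above would lower a current rate of 100 to 80.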

Say as VSDK-CSDK

Say-as SSML Markup is used to indicate the type of text construct contained within the element.

Multiple format values are available for each interpret-as value, but their realization is voice-dependent.

The attribute values that may have an effect on rendering depend on the current voice.

Example (will be read as "third"):

XML
<say-as interpret-as="ordinal">3</say-as>

Attribute

Description

format

The date format may be optionally specified via format attribute, to supersede the language defaults, e.g. dmy or mdy.

interpret-as

Indicates the content type of the contained text construct.

Value

Description

address VSDK-CSDK

Expand text as an address, including street names and numbers, zip codes, state names, etc.

cardinal VSDK-CSDK

Reads as a cardinal number.

code VSDK-CSDK

Expand numbers or codes reading them digit by digit.

currency VSDK-CSDK

Expand text as a decimal currency including currency abbreviations.

date VSDK-CSDK

Read digits as date.

decimal VSDK-CSDK

Same as number but including comma/dot normalization.

digits VSDK-CSDK

Expand numbers or codes reading them digit by digit.

distance VSDK-CSDK

Expand text as a distance measurement.

normal VSDK-CSDK

Default text normalization.

number VSDK-CSDK

Expand cardinal or comma-formatted numbers up to 15 digits.

ordinal VSDK-CSDK

Reads as an ordinal number.

phone VSDK-CSDK

Expand text as a telephone number including country codes, prefixes, tel. word indicators, etc.

rational VSDK-CSDK

Same as number but including comma/dot normalization.

real VSDK-CSDK

Same as number but including comma/dot normalization.

sms VSDK-CSDK

Expand text as an SMS message, reading web addresses, smileys, email addresses, etc.

spell VSDK-CSDK

Spell out the input text that follows.

telephone VSDK-CSDK

Reads as a telephone number.

time VSDK-CSDK

Expand text as a clock reading (hour, minutes, am, pm), a duration or a time range.

zip VSDK-CSDK

Expand text as a zip code.
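
Because a misspelled interpret-as value is easy to miss, it can help to build say-as elements through a small function that checks the value against the list above. The sketch below does this in Python; the names say_as and fmt are hypothetical, and the check only covers the values documented on this page.

```python
from xml.sax.saxutils import escape, quoteattr

# interpret-as values listed in the table above.
INTERPRET_AS = {
    "address", "cardinal", "code", "currency", "date", "decimal", "digits",
    "distance", "normal", "number", "ordinal", "phone", "rational", "real",
    "sms", "spell", "telephone", "time", "zip",
}


def say_as(text: str, interpret_as: str, fmt=None) -> str:
    """Build a <say-as> element, rejecting interpret-as values not listed above.

    Hypothetical helper, not an SDK function.
    """
    if interpret_as not in INTERPRET_AS:
        raise ValueError(f"unknown interpret-as value: {interpret_as!r}")
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    # The text is escaped so reserved characters are not read as markup.
    return f"<say-as {attrs}>{escape(text)}</say-as>"


print(say_as("3", "ordinal"))
```

Whether a given format value changes the rendering still depends on the current voice.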

Sentence VSDK-CSDK

Sentence SSML markup is used to indicate a sentence in your text.

While the TTS engine already recognizes sentences automatically, marking them explicitly can help it understand and render your text better. You can place multiple sentences in a paragraph.

Example:

XML
  <p>
    <s>This is the first sentence of the paragraph.</s>
    <s>Here's another sentence.</s>
  </p>
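
When sentence boundaries are already known to the application, the paragraph and sentence markup can be generated rather than typed by hand. The following is a minimal sketch; the function name paragraph is hypothetical.

```python
from xml.sax.saxutils import escape


def paragraph(sentences) -> str:
    """Wrap each sentence in <s> and the whole group in <p>.

    Hypothetical helper; the text is escaped so reserved characters are safe.
    """
    body = "".join(f"<s>{escape(sentence)}</s>" for sentence in sentences)
    return f"<p>{body}</p>"


print(paragraph(["This is the first sentence of the paragraph.",
                 "Here's another sentence."]))
```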

Attribute

Description

xml:lang

An optional attribute specifying the language of the element.

onlangfailure

An optional attribute specifying the desired behavior upon language speaking failure.

Value

Description

ignoretext

The synthesis processor will not attempt to render the text that is in the failed language.

ignorelang

The synthesis processor will ignore the change in language and speak as if the content were in the previous language.

changevoice

If a voice exists that can speak the language, the synthesis processor will switch to that voice and speak the content. Otherwise, the processor chooses another behavior (either ignoretext or ignorelang).

processorchoice

The synthesis processor chooses the behavior (either changevoice, ignoretext, or ignorelang).

Style VSDK-CSDK

Style SSML Markup is used to set an alternative speaking style instead of the normal one.

Please note that a particular style can be incompatible with some voices.

Example:

XML
Sorry <break time="300ms"/>
<style name="lively">sorry</style>

Not all styles are supported by all VSDK-CSDK voices.

Attribute

Description

name

The speaking style name to use.

You can check this page to get the supported values for each voice.

Sub VSDK-CSDK

Sub SSML Markup is used to substitute text for the purposes of pronunciation. The sub element can contain only text (no elements).

Example:

XML
<sub alias="Voice Development Kit">VDK</sub>

Attribute

Description

alias

The content that the voice synthesis will read instead of the content of the element.
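
Since the alias is an attribute value, reserved characters in it need escaping as well. The sketch below builds a sub element with the standard library's quoteattr, which escapes the value and adds the surrounding quotes; the function name sub is hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def sub(text: str, alias: str) -> str:
    """Build a <sub> element whose alias is spoken instead of the text.

    Hypothetical helper, not an SDK function.
    """
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


print(sub("VDK", "Voice Development Kit"))
```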

Timbre VSDK-CSDK

Timbre SSML Markup is a rate/pitch warping coefficient that maintains the duration of phonemes and enables voice timbre to be modified.

It accepts predefined values as well as relative percentages (a number followed by %).

VSDK-CSDK example:

XML
<prosody timbre="+100%">
    I am speaking with a different voice timbre.
</prosody>

VSDK-BARATINOO example:

XML
<prosody vox:timbre="+100%">
    I am speaking with a different voice timbre.
</prosody>

Attribute

Description

timbre VSDK-CSDK

x-young

+35%

young

+20%

medium

0%

old

-20%

x-old

-35%

default

0%

Relative percentage

[+/-] number followed by %

Voice VSDK-CSDK

Voice SSML Markup is used to change the language and voice applied to the text for rendering.

Example:

XML
<voice xml:lang="de">Deutsch</voice>

Attribute

Description

name

VSDK-CSDK
Voice name

gender

VSDK-CSDK
male
female
neutral

xml:lang

VSDK-CSDK
An optional attribute specifying the language of the element.

age

VSDK-CSDK
Positive integer or zero.
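
The attribute constraints above (three gender values, a non-negative integer age) can be enforced when the element is generated. The following sketch shows one way to do so; the function name voice and its keyword arguments are hypothetical, and the voice name itself is not validated.

```python
from xml.sax.saxutils import escape, quoteattr

GENDERS = {"male", "female", "neutral"}


def voice(text, name=None, gender=None, lang=None, age=None) -> str:
    """Build a <voice> element after checking gender and age.

    Hypothetical helper; only the attributes that are given are emitted.
    """
    if gender is not None and gender not in GENDERS:
        raise ValueError(f"gender must be one of {sorted(GENDERS)}")
    if age is not None and (not isinstance(age, int) or age < 0):
        raise ValueError("age must be a positive integer or zero")
    attrs = [
        ("name", name),
        ("gender", gender),
        ("xml:lang", lang),
        ("age", None if age is None else str(age)),
    ]
    rendered = "".join(
        f" {key}={quoteattr(value)}" for key, value in attrs if value is not None
    )
    return f"<voice{rendered}>{escape(text)}</voice>"


print(voice("Deutsch", lang="de"))
```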

Volume VSDK-CSDK

Volume SSML Markup is used to set the volume of the voice. It accepts predefined values as well as positive numbers.

Example:

XML
<prosody volume="+100%">
    I am speaking this at approximately twice the original signal amplitude.
</prosody>

Value

VSDK-CSDK

default

80

silent

0

x-soft

26

soft

52

medium

80

loud

90

x-loud

100

Relative percentage

[+/-] number followed by %

Relative value

[+/-] number with no units

Absolute value

Multiplier of the initial volume value for the current voice (unsigned number with no units or followed by %).

Word VSDK-CSDK

Word SSML Markup can be used to express segmentation of a word.

Example:

XML
<w>Apple</w>

Attribute

Description

xml:lang

An optional attribute specifying the language of the element.

role

A QName used in conjunction with lexicons.
