SSML reference

The Speech Synthesis Markup Language (SSML) is an XML-based markup language for speech synthesis. Its essential role is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms.

Note that not all of the elements and options described in the W3C SSML specification are currently supported by all SDKs. This page details which elements are available for each SDK.

Reserved characters

Avoid using SSML reserved characters in the text that is to be converted to audio. When you need to use a reserved character, prevent it from being read as markup by using its escape code. The following table shows the reserved SSML characters and their associated escape codes.

Character	Escape code
"	`&quot;`
&	`&amp;`
'	`&apos;`
<	`&lt;`
>	`&gt;`

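As a minimal sketch of the escaping step, the Python helper below replaces the five reserved characters with the standard XML entities (`&quot;`, `&amp;`, `&apos;`, `&lt;`, `&gt;`) before the text is placed inside SSML. The name `escape_ssml_text` is hypothetical and not part of any SDK; the code relies only on `xml.sax.saxutils.escape` from the Python standard library.

```python
from xml.sax.saxutils import escape

# escape() already handles &, < and >; the two quote characters are added here.
_QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_ssml_text(text):
    """Return text with the five reserved characters replaced by escape codes.

    Hypothetical helper for illustration; it is not an SDK function.
    """
    return escape(text, _QUOTE_ENTITIES)


if __name__ == "__main__":
    print(escape_ssml_text('Tom & "Jerry" <3'))
```

The ampersand is replaced first, so the entities produced for the other characters are not escaped a second time.
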
Markups

Audio VSDK-CSDK

Audio SSML Markup is used to insert a recorded audio file. The element's content is fallback text: it is synthesized when the audio file cannot be retrieved or is not supported.

Example:

XML

<audio src="file:laugh">
  haha
</audio>

Attribute	Description
src	The URI of a document with an appropriate MIME type. URIs may be absolute or relative to the base:uri specified in the <speak> element. Audio files may be local (file://, or absolute paths) or remote (http://).

Supported audio files:

VSDK-CSDK: .WAV files containing linear 16-bit PCM samples. The audio file is automatically resampled to match the current sampling rate before it is inserted into the speech output.

Break VSDK-CSDK

Break SSML Markup is used to temporarily pause the speech.

It is inserted at the cursor position as an empty element, and its duration can be given in milliseconds or seconds.

Example:

XML

<break time="300ms"/>

Attribute	Description
time	Non-negative number, optionally preceded by a `+` sign, followed by `s` for seconds or `ms` for milliseconds. VSDK-BARATINOO extension: percentage values are also accepted.
strength	Value	VSDK-CSDK
	none	0ms
	x-weak	20ms
	weak	100ms
	medium	500ms
	strong	1000ms
	x-strong	1500ms
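
The relationship between the `time` and `strength` attributes can be sketched in Python as follows. The function name `break_duration_ms` is hypothetical. Letting `time` take precedence when both attributes are present follows the W3C SSML 1.1 rule and is an assumption about the SDK's behavior; the percentage extension is not handled here.

```python
import re

# Pause lengths for the named strength values, copied from the VSDK-CSDK table.
STRENGTH_MS = {
    "none": 0,
    "x-weak": 20,
    "weak": 100,
    "medium": 500,
    "strong": 1000,
    "x-strong": 1500,
}

# A non-negative number with an optional "+" sign, followed by "ms" or "s".
_TIME_RE = re.compile(r"^\+?(\d+(?:\.\d+)?)(ms|s)$")


def break_duration_ms(time=None, strength=None):
    """Estimate the pause, in milliseconds, requested by a <break> element."""
    if time is not None:
        # Assumption: an explicit time wins over strength (W3C SSML 1.1 rule).
        match = _TIME_RE.match(time.strip())
        if match is None:
            raise ValueError(f"unsupported time value: {time!r}")
        number, unit = match.groups()
        return float(number) * (1000 if unit == "s" else 1)
    if strength is not None:
        return float(STRENGTH_MS[strength])
    raise ValueError("either time or strength is required")


if __name__ == "__main__":
    print(break_duration_ms(time="300ms"))
    print(break_duration_ms(strength="strong"))
```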

Emphasis VSDK-CSDK

Emphasis SSML Markup is used to request that the contained text be spoken with emphasis. Please note that the realization of emphasis is voice dependent.

Example:

XML

That is a <emphasis> big </emphasis> car!

Attribute	Description
level	`none`, `reduced`, `moderate` (default), or `strong`

Lang VSDK-CSDK

With the lang SSML markup, it is possible to switch language. Changing the language also changes the voice if a voice is available for the new language.

The lang element can only contain text to be rendered and the following elements: audio, break, emphasis, lang, lookup, mark, p, phoneme, prosody, say-as, sub, s, token, voice and w.

Example:

XML

English, <lang xml:lang="de">Deutsch</lang>, English.

Attribute	Description
xml:lang	A required attribute specifying the language of the element.
onlangfailure	An optional attribute specifying the desired behavior upon language speaking failure.
	Value	Description
	ignoretext	The synthesis processor will not attempt to render the text that is in the failed language.
	ignorelang	The synthesis processor will ignore the change in language and speak as if the content were in the previous language.
	changevoice	If a voice exists that can speak the language, the synthesis processor will switch to that voice and speak the content. Otherwise, the processor chooses another behavior (either `ignoretext` or `ignorelang`).
	processorchoice	The synthesis processor chooses the behavior (either `changevoice`, `ignoretext`, or `ignorelang`).
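
The content restriction on the lang element can be checked before synthesis. The sketch below uses `xml.etree.ElementTree` from the Python standard library; the function name `disallowed_lang_children` and the wrapping `<root>` element are assumptions made for illustration, and the allowed set is copied from the list given in this section.

```python
import xml.etree.ElementTree as ET

# Elements this section lists as valid inside <lang>.
ALLOWED_IN_LANG = {
    "audio", "break", "emphasis", "lang", "lookup", "mark", "p", "phoneme",
    "prosody", "say-as", "sub", "s", "token", "voice", "w",
}


def disallowed_lang_children(ssml_fragment):
    """Return the tag names found directly inside any <lang> that are not allowed."""
    # Wrap the fragment so that mixed text and elements parse as one document.
    root = ET.fromstring(f"<root>{ssml_fragment}</root>")
    bad = []
    for lang in root.iter("lang"):
        for child in lang:
            if child.tag not in ALLOWED_IN_LANG:
                bad.append(child.tag)
    return bad


if __name__ == "__main__":
    print(disallowed_lang_children('English, <lang xml:lang="de">Deutsch</lang>, English.'))
```

An empty list means every direct child of each lang element is in the allowed set.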

Paragraph VSDK-CSDK

Paragraph SSML markup is used to indicate a paragraph in your text.

While the TTS engine already recognizes paragraphs automatically, this markup can help it to better understand and render your text.

Example:

XML

<p>
  You have 4 new messages.
</p>

Attribute	Description
xml:lang	An optional attribute specifying the language of the element.
onlangfailure	An optional attribute specifying the desired behavior upon language speaking failure.
	Value	Description
	ignoretext	The synthesis processor will not attempt to render the text that is in the failed language.
	ignorelang	The synthesis processor will ignore the change in language and speak as if the content were in the previous language.
	changevoice	If a voice exists that can speak the language, the synthesis processor will switch to that voice and speak the content. Otherwise, the processor chooses another behavior (either `ignoretext` or `ignorelang`).
	processorchoice	The synthesis processor chooses the behavior (either `changevoice`, `ignoretext`, or `ignorelang`).

Phoneme VSDK-CSDK

Phoneme SSML Markup is used to provide a phonetic pronunciation for the contained text.

Example:

XML

<phoneme alphabet="ipa" ph="vivo͡ʊkə">
  Vivoka
</phoneme>

Support of the alphabet is limited to sounds that map to the phonetic symbols of the current voice.

Attribute	Description
alphabet	SDK	Values
	VSDK-CSDK	`lhp`, `nt-sampa`, `sxm-sampa`, `pinyin` (for Chinese only), `diacritized` (for Arabic only)
ph	List of phonetic symbols. Separated by underscore `_` when `x-voxygen` alphabet is used.

Pitch VSDK-CSDK

Pitch SSML Markup is used to set the pitch of the voice.

It accepts predefined values as well as relative percentages (a signed number followed by `%`).

Example:

XML

<prosody pitch="x-low">Oh my voice</prosody>

Value	VSDK-CSDK
`x-low`	-30%
`low`	-15%
`medium`	0%
`high`	+35%
`x-high`	+60%
`default`	0%
Relative percentage	[+/-] number followed by `%`
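
To illustrate how the named values relate to relative percentages, the following Python sketch converts a pitch attribute value into a signed percentage offset. The function name `pitch_to_percent` is hypothetical; the mapping is copied from the VSDK-CSDK column of the table.

```python
# Named pitch values and their offsets in percent (VSDK-CSDK column).
PITCH_PERCENT = {
    "x-low": -30,
    "low": -15,
    "medium": 0,
    "high": 35,
    "x-high": 60,
    "default": 0,
}


def pitch_to_percent(value):
    """Translate a pitch attribute value into a signed percentage offset."""
    value = value.strip()
    if value in PITCH_PERCENT:
        return float(PITCH_PERCENT[value])
    if value.endswith("%"):
        # float() accepts an optional leading "+" or "-" sign.
        return float(value[:-1])
    raise ValueError(f"unsupported pitch value: {value!r}")


if __name__ == "__main__":
    print(pitch_to_percent("x-low"))
    print(pitch_to_percent("+10%"))
```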

Prompt VSDK-CSDK

Prompt SSML Markup is used to insert an ActivePrompt at a specific location in the text.

Example:

XML

<prompt id="myPrompt"></prompt>

Attribute	Description
id	The prompt id.

Rate VSDK-CSDK

Rate SSML Markup is used to set the speech rate of the voice.

It accepts predefined values as well as relative percentages (a signed number followed by `%`).

Example:

XML

<p>
  <s>
    The subject is <prosody rate="-20%">ski trip</prosody>
  </s>
</p>

Value	VSDK-CSDK
`x-slow`	50
`slow`	75
`medium`	100
`fast`	150
`x-fast`	200
`default`	100
Relative percentage	[+/-] number followed by `%` (an extension of SSML 1.1).

Say as VSDK-CSDK

Say-as SSML Markup is used to indicate the type of text construct contained within the element.

Multiple format values are available for each interpret-as value, but their realization is voice-dependent.

The attribute values that may have an effect on rendering depend on the current voice.

Example (read as "third"):

XML

<say-as interpret-as="ordinal">3</say-as>

Attribute	Description
format	The date format may be optionally specified via format attribute, to supersede the language defaults, e.g. `dmy` or `mdy`.
interpret-as	Indicates the content type of the contained text construct.
	Value	Description
	address VSDK-CSDK	Expand text as an address, including street names and numbers, zip codes, state names, etc.
	cardinal VSDK-CSDK	Reads as a cardinal number.
	code VSDK-CSDK	Expand numbers or codes reading them digit by digit.
	currency VSDK-CSDK	Expand text as a decimal currency including currency abbreviations.
	date VSDK-CSDK	Read digits as date.
	decimal VSDK-CSDK	Same as number but including comma/dot normalization.
	digits VSDK-CSDK	Expand numbers or codes reading them digit by digit.
	distance VSDK-CSDK	Expand text as a distance measurement.
	normal VSDK-CSDK	Default text normalization.
	number VSDK-CSDK	Expand cardinal/comma-formatted numbers up to 15 digits.
	ordinal VSDK-CSDK	Reads as an ordinal number.
	phone VSDK-CSDK	Expand text as a telephone number including country codes, prefixes, tel. word indicators, etc.
	rational VSDK-CSDK	Same as number but including comma/dot normalization.
	real VSDK-CSDK	Same as number but including comma/dot normalization.
	sms VSDK-CSDK	Expand text as an SMS message, reading web addresses, smileys, email addresses, etc.
	spell VSDK-CSDK	Spell out the input text that follows.
	telephone VSDK-CSDK	Reads as a telephone number.
	time VSDK-CSDK	Expand text as a clock reading (hour, minutes, am, pm), a duration or a time range.
	zip VSDK-CSDK	Expand text as a zip code.
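
A say-as element can also be generated programmatically. The sketch below builds one with `xml.etree.ElementTree`, which escapes reserved characters in the text automatically. The function name `say_as` and its `fmt` parameter are hypothetical; the set of accepted values is copied from the table, and whether a given value has an audible effect still depends on the voice.

```python
import xml.etree.ElementTree as ET

# interpret-as values listed in the table for VSDK-CSDK.
INTERPRET_AS = {
    "address", "cardinal", "code", "currency", "date", "decimal", "digits",
    "distance", "normal", "number", "ordinal", "phone", "rational", "real",
    "sms", "spell", "telephone", "time", "zip",
}


def say_as(text, interpret_as, fmt=None):
    """Build a <say-as> element as a string, rejecting unknown content types."""
    if interpret_as not in INTERPRET_AS:
        raise ValueError(f"unknown interpret-as value: {interpret_as!r}")
    element = ET.Element("say-as", {"interpret-as": interpret_as})
    if fmt is not None:
        element.set("format", fmt)  # e.g. "dmy" or "mdy" for dates
    element.text = text
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(element, encoding="unicode")


if __name__ == "__main__":
    print(say_as("3", "ordinal"))
```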

Sentence VSDK-CSDK

Sentence SSML markup is used to indicate a sentence in your text.

While the TTS engine already recognizes sentences automatically, this markup can help it to better understand and render your text. You can place multiple sentences in a paragraph.

Example:

XML

  <p>
    <s>This is the first sentence of the paragraph.</s>
    <s>Here's another sentence.</s>
  </p>

Attribute	Description
xml:lang	An optional attribute specifying the language of the element.
onlangfailure	An optional attribute specifying the desired behavior upon language speaking failure.
	Value	Description
	ignoretext	The synthesis processor will not attempt to render the text that is in the failed language.
	ignorelang	The synthesis processor will ignore the change in language and speak as if the content were in the previous language.
	changevoice	If a voice exists that can speak the language, the synthesis processor will switch to that voice and speak the content. Otherwise, the processor chooses another behavior (either `ignoretext` or `ignorelang`).
	processorchoice	The synthesis processor chooses the behavior (either `changevoice`, `ignoretext`, or `ignorelang`).
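
When the text is already split into sentences, the paragraph and sentence markup can be produced mechanically. This is a minimal Python sketch; the function name `paragraph` is hypothetical, and reserved characters are escaped with the standard XML entities.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and >; the quote characters are added explicitly.
_QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def paragraph(sentences):
    """Wrap each sentence in <s> and the whole group in a single <p>."""
    body = "".join(f"<s>{escape(s, _QUOTE_ENTITIES)}</s>" for s in sentences)
    return f"<p>{body}</p>"


if __name__ == "__main__":
    print(paragraph(["This is the first sentence.", "Tom & Ann agree."]))
```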

Style VSDK-CSDK

Style SSML Markup is used to set an alternative speaking style instead of the normal one.

Please note that a particular style can be incompatible with some voices.

Example:

XML

Sorry <break time="300ms"/>
<style name="lively">sorry</style>

Not all styles are supported by all VSDK-CSDK voices.

Attribute	Description
name	The speaking style name to use. You can check this page to get the supported values for each voice.

Sub VSDK-CSDK

Sub SSML Markup is used to substitute text for the purposes of pronunciation. The sub element can contain only text (no elements).

Example:

XML

<sub alias="Voice Development Kit">VDK</sub>

Attribute	Description
alias	The content that the voice synthesis will read instead of the content of the element.

Timbre VSDK-CSDK

Timbre SSML Markup is a rate/pitch warping coefficient that maintains the duration of phonemes and enables voice timbre to be modified.

It accepts predefined values as well as relative percentages (a signed number followed by `%`).

VSDK-CSDK example:

XML

<prosody timbre="+100%">
    I am speaking with a different voice timbre.
</prosody>

VSDK-BARATINOO example:

XML

<prosody vox:timbre="+100%">
    I am speaking with a different voice timbre.
</prosody>

Attribute	Description
timbre VSDK-CSDK	x-young	+35%
	young	+20%
	medium	0%
	old	-20%
	x-old	-35%
	default	0%
	Relative percentage	[+/-] number followed by `%`

Voice VSDK-CSDK

Voice SSML Markup is used to change the language and voice applied to the text for rendering.

Example:

XML

<voice xml:lang="de">Deutsch</voice>

Attribute	Description
name	VSDK-CSDK Voice name
gender	VSDK-CSDK `male`, `female`, or `neutral`
xml:lang	VSDK-CSDK An optional attribute specifying the language of the element.
age	VSDK-CSDK Positive integer or zero.

Volume VSDK-CSDK

Volume SSML Markup is used to set the volume of the voice. It accepts predefined values as well as positive numbers.

Example:

XML

<prosody volume="+100%">
    I am speaking this at approximately twice the original signal amplitude.
</prosody>

Value	VSDK-CSDK
`default`	80
`silent`	0
`x-soft`	26
`soft`	52
`medium`	80
`loud`	90
`x-loud`	100
Relative percentage	[+/-] number followed by `%`
Relative value	[+/-] number with no units
Absolute value	Multiplier of the initial volume value for the current voice (unsigned number with no units or followed by `%`).

Word VSDK-CSDK

Word SSML Markup can be used to express segmentation of a word.

Example:

XML

<w>Apple</w>

Attribute	Description
xml:lang	An optional attribute specifying the language of the element.
role	A QName used in conjunction with lexicons.