SSML Elements

Speech Synthesis Markup Language (SSML) is an XML-based markup language that gives you precise control over how text-to-speech (TTS) engines pronounce and deliver spoken content. This reference documents the SSML elements supported by Capacity Private Cloud TTS, with practical examples to help you create natural-sounding speech output.

Why Use SSML?

While plain text works for simple TTS applications, SSML enables you to:

Control pronunciation of ambiguous words, acronyms, and abbreviations
Add natural pauses and phrasing
Adjust speaking rate, pitch, and volume
Specify how dates, times, numbers, and currencies should be spoken
Insert audio clips within synthesized speech
Switch between voices within a single utterance

Document Structure Elements

speak

The root element that wraps all SSML content. Every SSML document must have speak as its outermost element. → Full documentation

<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
  xml:lang="en-US">

  Your content goes here.
</speak>

Key attributes:

version - SSML specification version (typically "1.1")
xml:lang - Default language for the document (e.g., "en-US", "es-ES", "fr-FR")
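
If you assemble SSML in application code, a small helper can add the speak wrapper and catch malformed markup early. The following is a minimal sketch using only the Python standard library; the function name wrap_in_speak is hypothetical and not part of any Capacity API, and it assumes the body you pass in is already valid, escaped SSML.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def wrap_in_speak(body: str, lang: str = "en-US", version: str = "1.1") -> str:
    """Wrap an SSML fragment (already-escaped markup) in a <speak> root."""
    document = (
        f"<speak version={quoteattr(version)} xmlns={quoteattr(SSML_NS)} "
        f"xml:lang={quoteattr(lang)}>{body}</speak>"
    )
    # Parsing raises ET.ParseError if the combined document is not well-formed.
    ET.fromstring(document)
    return document


print(wrap_in_speak("Your content goes here."))
```

Parsing the result before returning it means a stray tag in the body fails in your code rather than at synthesis time.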

meta and metadata

These elements let you include metadata about the SSML document. They do not affect speech output, but they can be useful for tracking and documentation. → meta documentation | → metadata documentation

<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis">
  <meta name="author" content="Customer Service Team"/>
  <meta name="version" content="2.1"/>

  Welcome to our service.
</speak>

lexicon

References an external pronunciation lexicon file (PLS format) that defines custom pronunciations for specific words or phrases. This is useful when you have domain-specific terminology that the TTS engine might not pronounce correctly by default. → Full documentation

<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis">
  <lexicon uri="https://example.com/medical-terms.pls" type="application/pls+xml"/>
  
  The patient was diagnosed with hyperlipidemia.
</speak>
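
To show what a lexicon file referenced this way might contain, the sketch below generates a small W3C PLS 1.0 document from a word-to-IPA mapping. The helper name build_pls is hypothetical, the sample entry reuses the IPA string for "tomato" from the phoneme section of this reference, and which PLS features your deployment honors is an assumption to verify.

```python
import xml.etree.ElementTree as ET
from typing import Mapping
from xml.sax.saxutils import escape, quoteattr

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_pls(entries: Mapping[str, str], lang: str = "en-US") -> str:
    """Return a PLS 1.0 lexicon mapping each word to an IPA pronunciation."""
    lexemes = "".join(
        f"  <lexeme>\n"
        f"    <grapheme>{escape(word)}</grapheme>\n"
        f"    <phoneme>{escape(ipa)}</phoneme>\n"
        f"  </lexeme>\n"
        for word, ipa in entries.items()
    )
    return (
        f'<lexicon version="1.0" xmlns={quoteattr(PLS_NS)} '
        f'alphabet="ipa" xml:lang={quoteattr(lang)}>\n{lexemes}</lexicon>'
    )


# Illustrative entry only; save the output as UTF-8 so the IPA symbols survive.
print(build_pls({"tomato": "təˈmeɪtoʊ"}))
```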

Text Structure Elements

p (paragraph) and s (sentence)

These elements define paragraph and sentence boundaries, helping the TTS engine apply appropriate prosody and pausing. While the engine can detect sentences automatically, explicit markup ensures consistent behavior. → p documentation | → s documentation

<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis">
  <p>
    <s>Welcome to Capacity Private Cloud.</s>
    <s>We're glad you're here.</s>
  </p>
  <p>
    <s>Let's get started with your request.</s>
  </p>
</speak>

When to use: Use explicit paragraph and sentence markup when you need predictable pausing between sections, or when punctuation alone doesn't convey the intended structure.
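
When content is already split into paragraphs and sentences in your application, the markup can be generated rather than typed by hand. This is a minimal sketch; to_paragraphs is a hypothetical helper name, and it assumes the input is plain text that still needs XML escaping.

```python
from typing import Iterable
from xml.sax.saxutils import escape


def to_paragraphs(paragraphs: Iterable[Iterable[str]]) -> str:
    """Wrap each sentence in <s> and each group of sentences in <p>."""
    parts = []
    for sentences in paragraphs:
        inner = "".join(f"<s>{escape(sentence)}</s>" for sentence in sentences)
        parts.append(f"<p>{inner}</p>")
    return "".join(parts)


markup = to_paragraphs([
    ["Welcome to Capacity Private Cloud.", "We're glad you're here."],
    ["Let's get started with your request."],
])
print(markup)
```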

Pronunciation Elements

say-as

Specifies how particular text should be interpreted and spoken. This is essential for content that could be read in more than one way, such as dates, times, phone numbers, and currencies. → Full documentation

Common interpret-as values:

cardinal - Reads as a number: "123" becomes "one hundred twenty-three"
ordinal - Reads as an ordinal: "3" becomes "third"
characters - Spells out letter by letter: "ABC" becomes "A B C"
telephone - Reads as a phone number with appropriate grouping
date - Reads as a date (use the format attribute to specify the field order, e.g., "mdy")
time - Reads as a time value
currency - Reads as a monetary amount
unit - Reads measurement units appropriately

Examples:

<!-- Phone number -->
<say-as interpret-as="telephone">8005551234</say-as>
<!-- Speaks: "eight hundred, five five five, one two three four" -->

<!-- Date with format -->
<say-as interpret-as="date" format="mdy">03/15/2026</say-as>
<!-- Speaks: "March fifteenth, twenty twenty-six" -->

<!-- Currency -->
<say-as interpret-as="currency">$42.50</say-as>
<!-- Speaks: "forty-two dollars and fifty cents" -->

<!-- Spell out an acronym -->
<say-as interpret-as="characters">API</say-as>
<!-- Speaks: "A P I" -->

<!-- Time -->
<say-as interpret-as="time" format="hms24">14:30:00</say-as>
<!-- Speaks: "fourteen thirty" -->
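
Values such as dates and phone numbers often come from variables, so it can help to generate the say-as markup with proper escaping. The sketch below assumes a hypothetical helper named say_as; it only builds the element and does not check that the interpret-as value is one your engine supports.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def say_as(text: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    """Build a <say-as> element, escaping the text and attribute values."""
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    return f"<say-as {attrs}>{escape(text)}</say-as>"


print(say_as("03/15/2026", "date", "mdy"))
print(say_as("8005551234", "telephone"))
```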

phoneme

Provides explicit phonetic pronunciation using the International Phonetic Alphabet (IPA) or a vendor-specific alphabet. Use this when the default pronunciation is incorrect and you need precise control. → Full documentation

<!-- Using IPA -->
<phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>

<!-- Company name with specific pronunciation -->
<phoneme alphabet="ipa" ph="ˈnaɪki">Nike</phoneme>

Tip: For frequently mispronounced words across your application, consider using a lexicon file rather than marking up each instance individually.

sub

Speaks an alternative string in place of the enclosed text. The original text stays in the document, but the TTS engine speaks the value of the alias attribute instead. This is useful for abbreviations, acronyms, or symbols. → Full documentation

<sub alias="World Wide Web Consortium">W3C</sub>
<!-- The source text stays "W3C", but the engine speaks "World Wide Web Consortium" -->

<sub alias="doctor">Dr.</sub> Smith will see you now.
<!-- Speaks: "Doctor Smith will see you now" -->

<sub alias="miles per hour">mph</sub>
<!-- Speaks: "miles per hour" -->
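
If the same abbreviations recur across many prompts, the sub markup can be applied from a lookup table. The following is a rough sketch under stated assumptions: apply_subs is a hypothetical helper, the input is plain text (not existing SSML), and matching is a simple whole-token comparison that will not cover every case.

```python
import re
from typing import Mapping
from xml.sax.saxutils import escape, quoteattr


def apply_subs(text: str, aliases: Mapping[str, str]) -> str:
    """Wrap each known abbreviation in <sub alias="...">, escaping the rest."""
    if not aliases:
        return escape(text)
    # Longest keys first so a longer abbreviation wins over a shorter prefix.
    keys = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    out, pos = [], 0
    for match in pattern.finditer(text):
        word = match.group(1)
        out.append(escape(text[pos:match.start()]))
        out.append(f"<sub alias={quoteattr(aliases[word])}>{escape(word)}</sub>")
        pos = match.end()
    out.append(escape(text[pos:]))
    return "".join(out)


print(apply_subs("Dr. Smith drives 55 mph.", {"Dr.": "doctor", "mph": "miles per hour"}))
```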

Prosody and Style Elements

prosody

Controls the pitch, rate, and volume of speech. This element is essential for creating natural, expressive speech output. → Full documentation

Attributes:

rate - Speaking speed: "x-slow", "slow", "medium", "fast", "x-fast", or percentage (e.g., "80%", "120%")
pitch - Voice pitch: "x-low", "low", "medium", "high", "x-high", or relative values (e.g., "+10%", "-5%")
volume - Loudness: "silent", "x-soft", "soft", "medium", "loud", "x-loud", or decibels (e.g., "+6dB")

<!-- Slower, louder speech for emphasis -->
<prosody rate="slow" volume="loud">
  This is very important information.
</prosody>

<!-- Faster rate for less critical content -->
<prosody rate="fast">
  Terms and conditions apply.
</prosody>

<!-- Higher pitch for questions -->
<prosody pitch="+15%">
  Would you like to continue?
</prosody>

<!-- Combined adjustments -->
<prosody rate="90%" pitch="-5%" volume="soft">
  This is a confidential message.
</prosody>
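
A typo in a prosody value is easy to miss, so a generator can check values before emitting the element. This sketch accepts only the value forms listed in the attribute descriptions above; the helper name prosody and the validation rules are assumptions, and your engine may accept additional forms.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Accepted keywords and numeric patterns, taken from the attribute list above.
_RULES = {
    "rate": ({"x-slow", "slow", "medium", "fast", "x-fast"}, r"\d+(\.\d+)?%"),
    "pitch": ({"x-low", "low", "medium", "high", "x-high"}, r"[+-]\d+(\.\d+)?%"),
    "volume": (
        {"silent", "x-soft", "soft", "medium", "loud", "x-loud"},
        r"[+-]\d+(\.\d+)?dB",
    ),
}


def prosody(text: str, **attrs: str) -> str:
    """Build a <prosody> element after checking each attribute value."""
    parts = []
    for name, value in attrs.items():
        if name not in _RULES:
            raise ValueError(f"unknown prosody attribute: {name}")
        keywords, pattern = _RULES[name]
        if value not in keywords and not re.fullmatch(pattern, value):
            raise ValueError(f"unsupported {name} value: {value!r}")
        parts.append(f" {name}={quoteattr(value)}")
    return f"<prosody{''.join(parts)}>{escape(text)}</prosody>"


print(prosody("This is a confidential message.", rate="90%", pitch="-5%", volume="soft"))
```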

emphasis

Adds emphasis to words or phrases, similar to how a human speaker would stress important content. → Full documentation

Levels: "strong", "moderate", "reduced", "none"

<emphasis level="strong">Never</emphasis> share your password with anyone.

Your account balance is <emphasis level="moderate">$500.00</emphasis>.

break

Inserts a pause in the speech output. Use breaks to improve pacing, separate list items, or add dramatic effect. → Full documentation

Attributes:

time - Specific duration: "250ms", "1s", "2s"
strength - Relative pause: "none", "x-weak", "weak", "medium", "strong", "x-strong"

<!-- Pause for effect -->
The winner is <break time="1s"/> you!

<!-- Short pause between items -->
Press 1 for sales.<break time="500ms"/>
Press 2 for support.<break time="500ms"/>
Press 3 for billing.

<!-- Using strength -->
Thank you for calling.<break strength="strong"/>Goodbye.
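
Menus like the one above are often generated from data. The sketch below joins menu lines with a fixed-length break between them, mirroring the example; menu_prompt is a hypothetical helper name and the 500 ms default is simply the value used above.

```python
from typing import Iterable, Tuple
from xml.sax.saxutils import escape


def menu_prompt(options: Iterable[Tuple[int, str]], pause_ms: int = 500) -> str:
    """Build 'Press N for X.' lines separated by <break> elements."""
    lines = [f"Press {int(key)} for {escape(label)}." for key, label in options]
    # join() places a break between items but not after the last one.
    return f'<break time="{int(pause_ms)}ms"/>'.join(lines)


print(menu_prompt([(1, "sales"), (2, "support"), (3, "billing")]))
```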

voice

Switches to a different voice within the document. This lets you use multiple voices in a single TTS request, which is useful for dialogues or for distinguishing between speakers. → Full documentation

<voice name="en-US-female-1">
  Hello, how may I help you today?
</voice>

<voice name="en-US-male-1">
  I'd like to check my account balance.
</voice>

Note: Available voice names depend on your Capacity Private Cloud configuration. Contact support for a list of available voices in your deployment.

Audio and Markers

audio

Inserts a pre-recorded audio file into the synthesized speech. This is useful for jingles, sound effects, or pre-recorded segments that should blend with TTS output. → Full documentation

<audio src="https://example.com/sounds/chime.wav">
    <!-- Fallback text if audio cannot play -->
    [Chime sound]
</audio>
Thank you for holding.

Supported formats: WAV, MP3 (verify specific format support with your deployment configuration)

mark

Inserts a named marker into the SSML that generates a callback event when the TTS engine reaches that point. This lets you synchronize speech output with application events, such as updating a visual display or triggering animations. → Full documentation

<mark name="greeting_start"/>
Hello and welcome.
<mark name="greeting_end"/>
<mark name="menu_start"/>
Please select from the following options.
<mark name="menu_end"/>

Use cases:

Synchronizing captions or transcripts with audio playback
Triggering UI updates at specific points in the speech
Tracking progress through longer TTS content
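
Before sending a request, an application may want the list of marker names it should expect events for. The sketch below extracts them in document order with the Python standard library; mark_names is a hypothetical helper, and how mark events are actually delivered depends on your integration and is not shown.

```python
import xml.etree.ElementTree as ET
from typing import List


def mark_names(document: str) -> List[str]:
    """Return the name of every <mark> element, in document order."""
    root = ET.fromstring(document)
    return [
        element.get("name")
        for element in root.iter()
        # Strip any "{namespace}" prefix so namespaced documents also match.
        if isinstance(element.tag, str) and element.tag.rsplit("}", 1)[-1] == "mark"
    ]


ssml = (
    '<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis">'
    '<mark name="greeting_start"/>Hello and welcome.<mark name="greeting_end"/>'
    '<mark name="menu_start"/>Please select from the following options.'
    '<mark name="menu_end"/></speak>'
)
print(mark_names(ssml))
```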

desc

Provides a text description of non-speech audio content (used within the audio element). This is primarily for accessibility purposes. → Full documentation

<audio src="https://example.com/sounds/notification.wav">
  <desc>A notification chime indicating a new message</desc>
  You have a new message.
</audio>

Best Practices

Start simple: Begin with plain text and add SSML markup only where needed. Excessive markup makes content harder to maintain.
Test with target voices: Pronunciation and prosody can vary between voices. Test your SSML with each voice you plan to use.
Use lexicons for repeated terms: If you have domain-specific vocabulary, create a lexicon file rather than adding phoneme tags throughout your content.
Validate your SSML: Malformed XML will cause synthesis failures. Validate your SSML before deploying to production.
Consider localization: SSML that works well in one language may need adjustment for others. Plan for this when building multilingual applications.
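
As one way to act on the validation advice above, a pre-deployment check can confirm that a document is well-formed XML with a speak root. This is a minimal sketch with a hypothetical function name, check_ssml; it does not validate against the SSML schema or against what a particular deployment supports.

```python
import xml.etree.ElementTree as ET
from typing import List


def check_ssml(document: str) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    # Compare the local name so both namespaced and bare roots are accepted.
    if root.tag.rsplit("}", 1)[-1] != "speak":
        return ["root element is not <speak>"]
    problems = []
    if root.get("version") is None:
        problems.append("<speak> has no version attribute")
    return problems


print(check_ssml('<speak version="1.1">Welcome to our service.</speak>'))
print(check_ssml('<speak version="1.1">Welcome to our service.'))
```

Running a check like this in a test suite or build step catches unclosed tags and unescaped ampersands before they reach production.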

Additional Resources

W3C SSML Specification - The official specification for Speech Synthesis Markup Language
Integration Documentation - Guides for integrating TTS into your applications
Product Glossary - Definitions of terms used in this documentation
