Introduction to SSML

Speech Synthesis Markup Language (SSML) gives you precise control over how text-to-speech (TTS) engines render your content. TTS engines do a reasonable job with plain text, but SSML lets you fine-tune pronunciation, pauses, emphasis, and prosody (rate, pitch, and volume) so that the output sounds natural and professional.

Why Use SSML?

Plain text works for simple cases, but real-world applications often need more control:

  • Ambiguous pronunciations — Is "read" past or present tense? Is "1/2" spoken as "one half" or "January second"?
  • Domain-specific terms — Product names, acronyms, and technical terms often need pronunciation hints.
  • Natural pacing — Pauses between sentences, slower speech for important information, or faster delivery for disclaimers.
  • Emphasis and tone — Highlighting key words or adjusting prosody to convey meaning.
  • Mixed content — Spelling out codes, reading phone numbers digit-by-digit, or handling currency and dates correctly.

SSML addresses all of these scenarios through standardized, portable markup.

Standards Compliance

Capacity Private Cloud TTS supports SSML 1.1 as defined by the W3C. This means your SSML markup is portable and follows industry-standard conventions. For a complete list of supported elements, see SSML Elements.

Basic Structure

An SSML document is an XML document whose root element is <speak>. Here's a minimal example:

<speak>
  Hello, and welcome to our service.
</speak>

Within the <speak> element, you can mix plain text with SSML tags to control synthesis.
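
For reference, a fully specified standalone document might look like the sketch below. The version attribute, the xmlns namespace, and xml:lang come from the W3C SSML 1.1 specification. The en-US language tag is an assumption chosen for illustration, and whether the engine requires these attributes or also accepts the short form above is not stated here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Root attributes follow the W3C SSML 1.1 specification. -->
<!-- xml:lang="en-US" is an illustrative assumption; use your content's language. -->
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Hello, and welcome to our service.
</speak>
```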

Common Use Cases

Adding pauses:

Use <break> to insert silence, giving listeners time to absorb information:

<speak>
  Your account balance is $1,250.00. <break time="500ms"/>
  Would you like to make a payment?
</speak>
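
A pause can also be requested by relative strength instead of an exact duration. The sketch below assumes the engine honors the strength values defined in SSML 1.1 (none, x-weak, weak, medium, strong, x-strong). How long each strength lasts is left to the engine.

```xml
<speak>
  Your payment has been received.
  <!-- strength requests a relative pause; the engine chooses the duration -->
  <break strength="strong"/>
  Is there anything else we can help you with?
  <!-- time accepts seconds as well as milliseconds -->
  <break time="1s"/>
  Thank you for calling.
</speak>
```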

Controlling pronunciation:

Use <say-as> to specify how content should be interpreted:

<speak>
  Your confirmation code is <say-as interpret-as="characters">ABC123</say-as>.
  Please call us at <say-as interpret-as="telephone">+1-800-555-1234</say-as>.
</speak>
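
The same element can resolve ambiguous strings such as "1/2". SSML 1.1 does not fix the set of interpret-as values, so the digits value and the md date format in this sketch are assumptions. Check SSML Elements for the values the engine actually supports.

```xml
<speak>
  <!-- Assumed value: "digits" reads each digit separately -->
  Your account number ends in <say-as interpret-as="digits">4071</say-as>.
  <!-- Assumed format "md" (month, day), so 1/2 is read as a date, not a fraction -->
  Your order ships on <say-as interpret-as="date" format="md">1/2</say-as>.
</speak>
```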

Adding emphasis:

Use <emphasis> to stress important words:

<speak>
  This action <emphasis level="strong">cannot</emphasis> be undone.
</speak>
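
SSML 1.1 defines four values for the level attribute: strong, moderate, none, and reduced. The sketch below shows the two softer options. How distinctly each level is rendered depends on the voice, so treat the audible result as something to verify by listening.

```xml
<speak>
  <!-- moderate is the default level when the attribute is omitted -->
  Please have your <emphasis level="moderate">account number</emphasis> ready.
  <!-- reduced de-emphasizes the enclosed text -->
  <emphasis level="reduced">Standard message rates may apply.</emphasis>
</speak>
```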

Adjusting speech rate and pitch:

Use <prosody> for fine-grained control over delivery:

<speak>
  <prosody rate="slow">Please listen carefully to the following terms.</prosody>
  <prosody rate="fast" pitch="-10%">Terms and conditions apply. See website for details.</prosody>
</speak>
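
<prosody> also accepts a volume attribute, and rate can be given as a percentage. The sketch below uses the named volume labels and the percentage rate form from SSML 1.1, where 100% means no change. It assumes the engine follows the SSML 1.1 interpretation of these values.

```xml
<speak>
  <!-- Named volume label from SSML 1.1 -->
  <prosody volume="loud">Attention, please.</prosody>
  <!-- Rate as a percentage of the default rate; 80% is slower than normal -->
  <prosody rate="80%" volume="soft">Your call may be recorded for quality purposes.</prosody>
</speak>
```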

Custom pronunciation with phonemes:

Use <phoneme> when the TTS engine mispronounces a word:

<speak>
  Welcome to <phoneme alphabet="ipa" ph="ˈkwɪrki">Quirky</phoneme> Software.
</speak>

SSML in VoiceXML Applications

SSML integrates naturally with VoiceXML 2.0. You can embed SSML within <prompt>, <audio>, <choice>, and <enumerate> elements:

<prompt>
  <speak>
    Your appointment is scheduled for
    <say-as interpret-as="date" format="mdy">03/15/2026</say-as>
    at <say-as interpret-as="time">2:30pm</say-as>.
  </speak>
</prompt>

Quick Reference

Element      Purpose
<speak>      Root element for all SSML content
<break>      Insert a pause (by time or strength)
<say-as>     Specify how to interpret content (digits, date, currency, etc.)
<emphasis>   Add stress to words or phrases
<prosody>    Control rate, pitch, and volume
<phoneme>    Specify exact pronunciation using phonetic alphabet
<sub>        Substitute spoken text for written abbreviations
<voice>      Switch between different voices
<audio>      Insert pre-recorded audio

For complete element documentation including attributes and examples, see SSML Elements.
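
The <sub>, <voice>, and <audio> elements are not demonstrated elsewhere in this article, so here is a minimal sketch of each. The audio URL is hypothetical, and the voices that gender="female" can select depend on what is installed on your system.

```xml
<speak>
  <!-- sub: speak the alias instead of the written text -->
  This service follows the <sub alias="World Wide Web Consortium">W3C</sub> specification.
  <!-- voice: gender is a standard SSML attribute; available voices depend on your installation -->
  <voice gender="female">Please hold while we transfer your call.</voice>
  <!-- audio: hypothetical URL; the enclosed text is spoken if the file cannot be played -->
  <audio src="https://example.com/audio/chime.wav">A chime plays.</audio>
</speak>
```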

Best Practices

  • Start simple — Use plain text first, then add SSML only where needed. Over-marking text can make it harder to maintain.
  • Test with real listeners — Synthetic speech can sound different than expected. Test your prompts with actual users.
  • Use <say-as> for data — Dates, times, currency, and codes are common sources of mispronunciation. Always mark them explicitly.
  • Be consistent — If you pronounce a product name a certain way in one prompt, use the same markup everywhere.
  • Consider localization — Different languages have different text normalization rules. See our language-specific guides for details.
