Media Server Connectivity
The Media Server provides connectivity to clients over several network protocols, all of which adhere to published standards. This makes connecting to speech products straightforward and well documented, and it means that many existing libraries can be used for Media Server integration.
Because this communication involves multiple networking protocols and standards, understanding several protocols and their acronyms is helpful. You may not need to understand all implementation details to use them effectively, since many speech-related products offer these types of interfaces.
Control Protocols
Control protocols establish the connections (sessions) over which MRCP requests are made to the Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and other services.
The platform supports both RTSP and SIP control protocols for establishing connections to the Media Server from client applications. Typically you would use one or the other depending on your specific requirements, though both can be active simultaneously. For example, you could use RTSP for audio synthesis (TTS) and SIP for speech recognition (ASR) if needed.
Note that the API can be used directly, without involving MRCP connections or the Media Server at all. The Application Programming Interfaces are fully documented and supported, as described in the API documentation. The choice between API and MRCP integration is yours. Both options are widely used and each has its own benefits, and either can perform most ASR and TTS tasks required by typical applications.
Because Media Server connectivity is network-based, the same interfaces can connect processes within a single machine, across a local network, or over the Internet, with performance that depends mainly on the latency and bandwidth of the network in between. This flexibility is particularly valuable in cloud and mobile deployments.
RTSP - Real Time Streaming Protocol
RTSP is an IETF-defined network control protocol designed for use in entertainment and communications systems to control media streams (such as the ASR and TTS audio streams used by the Media Server).
RTSP is used with MRCP Version 1 (MRCPv1) implementations.
Diagram: RTSP session between an MRCP client application and the Media Server
From a networking perspective, RTSP sessions use TCP connections only (not UDP). The default communication port is 554. Note that two processes on the same machine cannot listen on the same port, so if another service already uses port 554 you will need to change the port number. See Media Server Specific Parameters for configuration details.
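Because only one process can listen on a given TCP port, it can be useful to check whether a port is available before assigning it. The following is a minimal sketch using only the Python standard library. The function name `tcp_port_is_free` is illustrative and is not part of the platform.

```python
import socket


def tcp_port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently bind to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Usually "address already in use"; on many systems binding to
            # ports below 1024 without privileges also fails here.
            return False
        return True


if __name__ == "__main__":
    # 554 is the default RTSP port mentioned above.
    print("Port 554 available:", tcp_port_is_free(554))
```

A False result for a port below 1024 may indicate missing privileges rather than a conflict, so treat the check as a hint only.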
MRCPv1 uses two settings to specify the recognizer and synthesizer resource references. If these strings do not match between the client and the Media Server, resource requests may fail. These resource definitions (resource URLs) can be adjusted at either end so that they match. For details on configuring these resource URLs, see the Media Server Configuration article.
The platform supports the IETF RFC2326 specification for Real Time Streaming Protocol.
SIP - Session Initiation Protocol
SIP is an IETF-defined signaling protocol widely adopted for controlling communication sessions, including VoIP calls. Originally designed in 1996 by Henning Schulzrinne and Mark Handley, SIP has become the preferred choice for many speech application developers. Most VoiceXML (VXML) implementations use SIP connectivity to integrate telephony and speech systems in Interactive Voice Response (IVR) environments.
SIP is used with MRCP Version 2 (MRCPv2) implementations.
Diagram: SIP session between an MRCP client application and the Media Server
From a networking perspective, SIP differs from RTSP by supporting both UDP and TCP connections depending on client requirements.
When a SIP session is established, a second TCP-based communication channel handles MRCP traffic. Port numbers for MRCP are negotiated during session initialization via SDP. This means session control information travels via SIP (over UDP or TCP), while MRCP information uses a dedicated TCP connection.
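As an illustration of this negotiation, the sketch below reads the MRCP port, the MRCP channel identifier, and the RTP port out of an SDP answer. The sample answer follows the general shape of the examples in RFC 6787. Its addresses, ports, and channel value are made up for illustration, and `parse_mrcp_answer` is a hypothetical helper rather than a platform function.

```python
# Invented SDP answer, modeled on the MRCPv2 examples in RFC 6787.
SAMPLE_ANSWER = "\r\n".join([
    "v=0",
    "o=- 2890842808 2890842809 IN IP4 192.0.2.11",
    "s=-",
    "c=IN IP4 192.0.2.11",
    "t=0 0",
    "m=application 32416 TCP/MRCPv2 1",
    "a=setup:passive",
    "a=connection:new",
    "a=channel:32AECB234338@speechrecog",
    "a=cmid:1",
    "m=audio 48260 RTP/AVP 0",
    "a=rtpmap:0 PCMU/8000",
    "a=recvonly",
    "a=mid:1",
]) + "\r\n"


def parse_mrcp_answer(sdp: str) -> dict:
    """Extract the MRCP TCP port, channel identifier, and RTP port."""
    result = {"mrcp_port": None, "channel": None, "rtp_port": None}
    current_media = None
    for line in sdp.splitlines():
        if line.startswith("m="):
            # m=<media> <port> <proto> <fmt> ...
            media, port, proto = line[2:].split()[:3]
            current_media = media
            if media == "application" and proto == "TCP/MRCPv2":
                result["mrcp_port"] = int(port)
            elif media == "audio":
                result["rtp_port"] = int(port)
        elif line.startswith("a=channel:") and current_media == "application":
            result["channel"] = line[len("a=channel:"):]
    return result


if __name__ == "__main__":
    print(parse_mrcp_answer(SAMPLE_ANSWER))
```

The client would then open a TCP connection to the returned MRCP port and use the channel value in the Channel-Identifier header of its MRCP messages.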
This separation benefits network engineers managing traffic routing. For example, using SIP enables configuring proxy servers and routers to send session control traffic via one network path and MRCP resource control traffic via another. This capability supports Session Border Controllers (SBCs) and proxy server configurations in large deployments, though detailed coverage is beyond this overview's scope.
The platform supports the IETF RFC3261 specification for Session Initiation Protocol.
Media Resource Control Protocol - MRCP
MRCP is the protocol used after establishing a session via SIP or RTSP. Unlike session control protocols, MRCP controls the speech resources used within the session. For example, when a client application requests TTS audio or speech recognition, MRCP facilitates these communications.
MRCP Version 1 is carried over RTSP, while MRCP Version 2 uses SIP to establish its sessions. Both versions perform similar functions, with subtle differences that you may need to consider if you implement your own MRCP handler (a non-trivial undertaking).
Beyond communicating requests, responses, and events between client and Media Server, MRCP messages also control audio media streamed between endpoints. These audio streams are transported by RTP, described below.
The platform supports IETF RFC4463 for MRCP (v1) and IETF RFC6787 for MRCP Version 2.
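To give a sense of what an MRCP message looks like on the wire, the sketch below assembles an MRCPv2 request following the start-line layout in RFC 6787 (`MRCP/2.0 message-length method-name request-id`). The message-length field counts the whole message, including its own digits, so it is computed by iterating until the value is stable. The function name, channel identifier, and request body are illustrative assumptions, not values taken from the platform.

```python
def build_mrcpv2_request(method: str, request_id: int,
                         headers: dict, body: bytes = b"") -> bytes:
    """Assemble an MRCPv2 request with a self-consistent message-length."""
    all_headers = dict(headers)
    if body:
        all_headers["Content-Length"] = str(len(body))
    header_block = "".join(f"{name}: {value}\r\n"
                           for name, value in all_headers.items())

    prefix = b"MRCP/2.0 "
    rest = f" {method} {request_id}\r\n{header_block}\r\n".encode("utf-8") + body

    # The length field includes its own digits, so repeat until stable.
    fixed = len(prefix) + len(rest)
    length = fixed + 1
    while fixed + len(str(length)) != length:
        length = fixed + len(str(length))

    return prefix + str(length).encode("ascii") + rest


if __name__ == "__main__":
    message = build_mrcpv2_request(
        "SPEAK",
        1,
        {
            # Hypothetical channel identifier; a real one comes from the SDP answer.
            "Channel-Identifier": "32AECB234338@speechsynth",
            "Content-Type": "text/plain",
        },
        b"Hello from the Media Server.",
    )
    print(message.decode("utf-8"))
```

A complete client also has to parse responses and events and track request states, which is where most of the effort in a custom handler goes.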
Real-time Transport Protocol - RTP
RTP transports audio streams over networks. These streams may be TTS or ASR audio, with TTS traffic typically flowing away from the Media Server and ASR traffic flowing toward it.
The platform supports PCMU (mu-law) and PCMA (a-law) audio sampled at 8 kHz. No other formats are supported for RTP transport.
Audio is typically split into small packets representing approximately 20 ms of time. These packets stream sequentially, with the receiver reassembling them for playback or speech recognition input.
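The packet size follows directly from the audio format: G.711 audio (PCMU or PCMA) uses one byte per sample, so 20 ms at 8 kHz is 160 bytes of payload. The sketch below shows the arithmetic and a simple way to split a buffer into packet-sized chunks. The names are illustrative only.

```python
SAMPLE_RATE_HZ = 8000   # 8 kHz sampling
PACKET_MS = 20          # typical packet duration
BYTES_PER_SAMPLE = 1    # G.711 (PCMU/PCMA) stores one byte per sample

# 8000 samples/s * 0.020 s * 1 byte = 160 bytes per packet
PAYLOAD_BYTES = SAMPLE_RATE_HZ * PACKET_MS // 1000 * BYTES_PER_SAMPLE


def packetize(audio: bytes, payload_bytes: int = PAYLOAD_BYTES) -> list:
    """Split encoded audio into consecutive payloads of payload_bytes each.

    The final chunk may be shorter if the audio length is not a multiple
    of the payload size.
    """
    return [audio[i:i + payload_bytes]
            for i in range(0, len(audio), payload_bytes)]


if __name__ == "__main__":
    one_second = bytes(SAMPLE_RATE_HZ)  # one second of placeholder bytes
    chunks = packetize(one_second)
    print(len(chunks), "packets of", len(chunks[0]), "bytes")
```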
Each packet includes attributes describing its format and sequence number. This matters because RTP audio is carried in UDP datagrams. While UDP is efficient for this purpose, packets can occasionally be lost or arrive out of sequence. The receiver uses the sequence numbers to put packets back in order and to detect gaps, which helps maintain audio quality.
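The sketch below shows how a receiver might read the fixed 12-byte RTP header defined in RFC 3550 and restore packet order by sequence number. It is a simplified illustration: it ignores header extensions, padding, and 16-bit sequence-number wraparound, which a real jitter buffer must handle. All function names are hypothetical.

```python
import struct

# Fixed RTP header: V/P/X/CC, M/PT, sequence number, timestamp, SSRC
RTP_HEADER = struct.Struct("!BBHII")


def build_rtp_packet(seq: int, timestamp: int, ssrc: int,
                     payload: bytes, payload_type: int = 0) -> bytes:
    """Build a minimal RTP packet (payload type 0 is PCMU, 8 is PCMA)."""
    first_byte = 2 << 6  # version 2, no padding, no extension, no CSRCs
    header = RTP_HEADER.pack(first_byte, payload_type & 0x7F,
                             seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)
    return header + payload


def parse_rtp_packet(packet: bytes) -> dict:
    """Decode the fixed header and return it together with the payload."""
    first, second, seq, timestamp, ssrc = RTP_HEADER.unpack_from(packet)
    csrc_count = first & 0x0F
    header_len = RTP_HEADER.size + 4 * csrc_count
    return {
        "version": first >> 6,
        "payload_type": second & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[header_len:],
    }


def reorder(packets: list) -> list:
    """Parse packets and sort them by sequence number (no wraparound)."""
    return sorted((parse_rtp_packet(p) for p in packets),
                  key=lambda parsed: parsed["seq"])


if __name__ == "__main__":
    arrived = [
        build_rtp_packet(3, 320, 0x1234, b"C" * 160),
        build_rtp_packet(1, 0, 0x1234, b"A" * 160),
        build_rtp_packet(2, 160, 0x1234, b"B" * 160),
    ]
    print([p["seq"] for p in reorder(arrived)])
```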
In addition to audio, Dual-Tone Multi-Frequency (DTMF, or touch-tone) signals are sent over the RTP stream. The tones themselves are not sent as audio (which would interfere with speech recognition). Instead, DTMF tones are sent as RTP events: special packets indicating which key was pressed.
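For illustration, the payload of such an event packet is four bytes long in the RFC 2833 format: an event code, an end flag with a volume field, and a duration in timestamp units. The sketch below decodes that payload. The function name is hypothetical, and it assumes the RTP header has already been removed.

```python
import struct

# Event codes 0-15 map to the sixteen DTMF keys.
DTMF_KEYS = "0123456789*#ABCD"


def parse_telephone_event(payload: bytes) -> dict:
    """Decode a 4-byte telephone-event payload.

    Layout: event (8 bits), E flag (1 bit), reserved (1 bit),
    volume (6 bits), duration (16 bits, in timestamp units).
    """
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    return {
        "key": DTMF_KEYS[event] if event < len(DTMF_KEYS) else None,
        "end": bool(flags & 0x80),   # set on the packet(s) ending the event
        "volume": flags & 0x3F,      # power level expressed as -dBm0
        "duration": duration,        # 800 units = 100 ms at 8 kHz
    }


if __name__ == "__main__":
    # Key "5", end of event, volume 10, duration 800 timestamp units
    print(parse_telephone_event(bytes([5, 0x80 | 10, 0x03, 0x20])))
```

The RTP payload type that identifies event packets is dynamic and is agreed during session setup, so a receiver should take it from the negotiated session description rather than hard-coding a value.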
RTP port assignments are negotiated during session establishment, associating RTP streams with specific resources (recognizer or synthesizer) and determining stream direction.
The platform supports IETF RFC3550 for Real-time Transport Protocol and IETF RFC2833 for DTMF over RTP.
Session Description Protocol - SDP
SDP works with SIP and RTSP to establish multimedia sessions. Using SDP, you describe the audio and MRCP streams required within a session.
SDP is used during session negotiation to describe streaming media initialization parameters, including audio format selection and port assignments for RTP and MRCP (in SIP sessions).
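As a rough illustration of such a description, the sketch below builds an SDP offer for an MRCPv2 session, following the pattern of the examples in RFC 6787: one `m=application` line for the MRCP channel and one `m=audio` line for the RTP stream. The function name, IP address, and port are assumptions made for the example. A real client would normally generate this through its SIP stack.

```python
def build_mrcpv2_offer(client_ip: str, rtp_port: int,
                       resource: str = "speechrecog") -> str:
    """Return an SDP offer requesting one MRCPv2 resource and one audio stream."""
    # A recognizer receives audio from the client; a synthesizer sends it.
    direction = "sendonly" if resource == "speechrecog" else "recvonly"
    lines = [
        "v=0",
        f"o=- 0 0 IN IP4 {client_ip}",
        "s=-",
        f"c=IN IP4 {client_ip}",
        "t=0 0",
        # Port 9 is a placeholder: the client opens the TCP connection
        # (setup:active), and the server's port arrives in the answer.
        "m=application 9 TCP/MRCPv2 1",
        "a=setup:active",
        "a=connection:new",
        f"a=resource:{resource}",
        "a=cmid:1",
        # Offer both 8 kHz G.711 variants: PCMU (0) and PCMA (8)
        f"m=audio {rtp_port} RTP/AVP 0 8",
        "a=rtpmap:0 PCMU/8000",
        "a=rtpmap:8 PCMA/8000",
        f"a={direction}",
        "a=mid:1",
    ]
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    print(build_mrcpv2_offer("192.0.2.12", 49170))
```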
The platform supports IETF RFC4566 for Session Description Protocol.
Integration Overview
Using these protocols in combination, you can connect the Media Server to a wide range of applications, from large automated telephony-based IVR systems to desktop applications and in-vehicle systems. Any application requiring speech technology can use MRCP to connect across networks or within a single self-contained system.
Diagram: Client applications connecting to the Media Server via MRCP
