Description
Voice Extensible Markup Language (VoiceXML, abbreviated VXML here), standardized by the W3C and adopted by 3GPP in TS 23.333, is a key technology for developing voice-based services in telecommunications networks, particularly within the IP Multimedia Subsystem (IMS). It is an XML-based markup language rather than a protocol: a VXML document describes the dialog flow between a user and a voice service. A VXML document, or script, is processed by an interpreter called a Voice Browser, which resides in a media server (e.g., a Media Resource Function Processor, MRFP). The browser executes the script, controls audio playback (synthesized speech or pre-recorded audio), processes user input (speech or DTMF tones), and makes logic decisions to navigate the call flow.
The architecture involves several key components. The VoiceXML Forum's architecture, referenced by 3GPP, includes the Voice Browser, which fetches VXML documents from an Application Server (AS) via HTTP. The AS hosts the service logic and business rules and generates dynamic VXML pages. The Media Server provides the speech recognition (ASR), speech synthesis (TTS), and audio playback resources. A VXML script is composed of dialogs (<form> and <menu>). A <form> contains <field> elements to collect input, <prompt> elements to play audio, and <filled> blocks that define the actions to take once input is received. Event handlers (<catch> and its shorthand forms such as <noinput> and <nomatch>) manage errors and unexpected inputs. This declarative model separates the service logic on the AS from the media processing details, allowing developers to focus on the conversational design.
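To make the element roles above concrete, here is a minimal sketch of a single-form VXML script, embedded in Python so it can be checked for well-formedness with the standard library. The dialog (a PIN prompt), the form and field names, and the URL `http://as.example.com/balance` are all illustrative assumptions and do not come from any 3GPP or W3C text.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"

# Hypothetical script: one <form> with one <field>. The field plays a
# <prompt>, handles noinput/nomatch events in a <catch>, and submits the
# collected value to an assumed Application Server URL in <filled>.
EXAMPLE_VXML = """<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="balance">
    <field name="pin" type="digits?length=4">
      <prompt>Please enter your four digit PIN.</prompt>
      <catch event="noinput nomatch">
        <prompt>Sorry, I did not get that.</prompt>
        <reprompt/>
      </catch>
      <filled>
        <submit next="http://as.example.com/balance"
                namelist="pin" method="post"/>
      </filled>
    </field>
  </form>
</vxml>
"""


def outline(vxml_text):
    """Return the element names of a VXML document in document order."""
    root = ET.fromstring(vxml_text)
    prefix = "{" + VXML_NS + "}"
    return [element.tag.replace(prefix, "") for element in root.iter()]


if __name__ == "__main__":
    print(outline(EXAMPLE_VXML))
```

The `outline` helper only verifies structure; a real Voice Browser would additionally run the form interpretation algorithm, play the prompts, and collect speech or DTMF input.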
In the 3GPP IMS network, VXML enables standardized, network-agnostic voice applications. When an IMS subscriber initiates a voice call to a service (such as a voice portal, automated customer service, or conference system), the Serving-Call Session Control Function (S-CSCF) routes the call to an appropriate Application Server based on initial Filter Criteria (iFC). This AS can then act as a VXML interpreter or, more commonly, fetch VXML documents from a web server and relay them to a dedicated Media Resource Function (MRF) that hosts the Voice Browser. The MRF establishes a media session with the user's device using protocols such as RTP and executes the VXML dialog. This allows for rich, interactive services such as voice-activated dialing, voice messaging, audio conferencing controls, and natural language voice portals, all delivered over packet-switched IMS networks alongside other multimedia services.
Purpose & Motivation
VXML was created to solve the historical problem of proprietary, complex, and costly development of interactive voice response (IVR) systems. Before VXML, IVR applications were typically built using low-level, vendor-specific programming languages and tools that tightly coupled the application logic with the telephony hardware and media resources. This made applications difficult to port, expensive to develop and maintain, and limited innovation to a small pool of specialized developers.
3GPP's adoption of VXML, beginning in Release 7, was motivated by the move towards all-IP networks and the IMS. IMS aimed to provide a standardized service-creation environment for multimedia, and voice services needed a web-inspired model to fit into it. VXML provided exactly that: it applied the successful paradigm of web development (client-server, markup languages, HTTP) to the voice world. Because VXML is XML, web application servers can generate voice dialogs dynamically, which opened telephony application development to the large community of web developers. This addressed the limitations of the old approach by promoting interoperability, reducing development time, fostering a tools ecosystem, and enabling the easy integration of voice services with web data and business logic. It was a key enabler for delivering consistent, advanced voice services across the evolving network landscape towards LTE and 5G.
Key Features
- XML-based declarative language for defining voice dialogs and call flows
- Separation of service logic (on Application Server) from media processing (on Media Server)
- Support for speech recognition (ASR) and text-to-speech (TTS) integration
- Event-driven architecture with handlers for errors, no-input, and no-match conditions
- Ability to fetch and execute scripts dynamically via HTTP from web servers
- Support for DTMF input, audio file playback, and variable data submission (HTTP POST)
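The dynamic-generation feature listed above can be sketched as a small function that an Application Server might call when the Voice Browser requests a page over HTTP. This is an illustrative sketch only: the function name `render_menu`, the menu id, the handler wording, and the example URLs are assumptions, and the HTTP serving layer is left out.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"


def render_menu(greeting, choices):
    """Build a VXML <menu> page as a string.

    greeting: text spoken to the caller (synthesized by TTS).
    choices:  maps a DTMF digit to a (label, next_url) pair.
    """
    vxml = ET.Element("vxml", {"version": "2.1", "xmlns": VXML_NS})
    menu = ET.SubElement(vxml, "menu", {"id": "main"})
    ET.SubElement(menu, "prompt").text = greeting

    # One <choice> per option; the label can be spoken, the digit keyed.
    for digit, (label, next_url) in sorted(choices.items()):
        choice = ET.SubElement(menu, "choice", {"dtmf": digit, "next": next_url})
        choice.text = label

    # Shorthand event handlers for silence and unrecognized input.
    for event, message in (("noinput", "I did not hear anything."),
                           ("nomatch", "That is not a valid option.")):
        handler = ET.SubElement(menu, event)
        handler.text = message
        ET.SubElement(handler, "reprompt")

    # ElementTree escapes special characters such as "&" in text and attributes.
    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    page = render_menu(
        "Welcome. Say balance or press 1. Say agent or press 2.",
        {"1": ("balance", "http://as.example.com/balance"),
         "2": ("agent", "http://as.example.com/agent")},
    )
    print(page)
```

Building the page with an XML library rather than string concatenation keeps caller-specific data (names, account details) from breaking the markup, which matters once pages are generated per request.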
Evolution Across Releases
Initial adoption of VoiceXML 2.0/2.1 into the 3GPP IMS service framework. Defined the architecture for VXML-based services, specifying the role of the Application Server (AS) and Media Resource Function (MRF) with an integrated Voice Browser. Established the basic mechanisms for call routing from the S-CSCF to a VXML service and the execution of voice dialogs over the IMS Media Plane.
Defining Specifications
| Specification | Title |
|---|---|
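| TS 23.333 | Multimedia Resource Function Controller (MRFC) - Multimedia Resource Function Processor (MRFP) Mp interface: Procedures descriptions |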