EMMA

Extensible MultiModal Annotation markup language

Services
Introduced in Rel-7
EMMA is an XML-based markup language standardized by W3C and referenced by 3GPP for representing user input interpretations in multimodal systems. It allows applications to process inputs from various modalities like voice, keyboard, and pen in a unified format.

Description

The Extensible MultiModal Annotation markup language (EMMA) is a World Wide Web Consortium (W3C) Recommendation that provides a data interchange format for representing the semantics of user input in multimodal interactive systems. While not a 3GPP-invented protocol, it is referenced within 3GPP specifications (notably TS 23.333) as a potential component for standardizing how multimodal inputs are annotated and processed in telecom service architectures, particularly for Multimedia Telephony (MMTel) and other IP-based services. EMMA documents are XML instances that describe interpretations of user input, which may originate from different modalities such as speech, keyboard, touch, or gesture.
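
To make the format concrete, here is a minimal hedged sketch of an EMMA document for a single spoken input, following the EMMA 1.0 Recommendation (namespace http://www.w3.org/2003/04/emma); the payload elements (action, contact) and all attribute values are hypothetical:

  <emma:emma version="1.0"
      xmlns:emma="http://www.w3.org/2003/04/emma">
    <!-- one interpretation of the utterance, produced by a speech recognizer;
         payload elements below are application-defined, not part of EMMA -->
    <emma:interpretation id="int1"
        emma:medium="acoustic" emma:mode="voice"
        emma:confidence="0.82"
        emma:tokens="call john mobile">
      <action>call</action>
      <contact>john</contact>
    </emma:interpretation>
  </emma:emma>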

An EMMA document structures information about an interpretation, including the raw input data, a derived meaning (such as a recognized intent or extracted entities), confidence scores from the recognition process, timing information, and the source modality. This allows a multimodal application, whether running on a network server or on a device, to fuse inputs from different sources. For example, a user might say "show me this" while tapping a map; a speech recognizer would generate an EMMA structure for the utterance, and a gesture recognizer would generate one for the tap coordinates. A dialogue manager could then process the combined EMMA structures to execute the command, as sketched below.
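
A sketch of that map scenario, assuming the EMMA 1.0 vocabulary: the two recognizer outputs are carried as sibling emma:interpretation elements inside an emma:group, each annotated with its medium, mode, confidence, and millisecond timestamps. The payload elements (command, point) and all attribute values are illustrative assumptions, not part of the standard:

  <emma:emma version="1.0"
      xmlns:emma="http://www.w3.org/2003/04/emma">
    <emma:group id="turn1">
      <!-- speech recognizer output for the utterance -->
      <emma:interpretation id="speech1"
          emma:medium="acoustic" emma:mode="voice"
          emma:confidence="0.90"
          emma:start="1241035886000" emma:end="1241035887200"
          emma:tokens="show me this">
        <command>show</command>
      </emma:interpretation>
      <!-- gesture recognizer output for the tap on the map -->
      <emma:interpretation id="tap1"
          emma:medium="tactile" emma:mode="gui"
          emma:start="1241035886900" emma:end="1241035886900">
        <point x="212" y="408"/>
      </emma:interpretation>
    </emma:group>
  </emma:emma>

A fusion engine consuming this group would typically emit a single combined interpretation, linked back to the contributing inputs through EMMA's emma:derived-from mechanism.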

Within a 3GPP context, EMMA's role is to enable interoperable, advanced user interfaces for services defined in the IP Multimedia Subsystem (IMS) framework. By providing a standard way to annotate input, it allows service logic to be decoupled from specific recognition technologies or devices. This supports the creation of richer, more natural human-machine interfaces for services like interactive voice response (IVR) with visual complements, or unified messaging where input can be speech or text. The 3GPP specifications reference EMMA as part of defining the architecture and information flows for multimodal services, ensuring that network-based multimodal interaction managers can process inputs in a vendor-neutral format.

Purpose & Motivation

EMMA was developed by the W3C Multimodal Interaction Working Group to solve the problem of interoperability in multimodal systems, where applications need to combine inputs from diverse recognition technologies (speech, handwriting, vision). Before standardization, each recognizer or fusion engine would use its own proprietary data format, making it difficult to build modular, scalable multimodal applications. EMMA provided a common, extensible XML vocabulary to represent interpretations, enabling plug-and-play integration of different recognition components.

3GPP's motivation for referencing EMMA in its specifications (starting in Release 7) was to support the evolution of telephony services beyond simple voice calls towards rich, interactive Multimedia Telephony (MMTel) within IMS. As services became more complex, allowing users to interact via voice, touch, and keypad simultaneously, a standardized way to handle these combined inputs at the service layer was needed. Adopting an existing W3C standard like EMMA allowed 3GPP to avoid reinventing the wheel and to align with web standards, facilitating the convergence of telecom and web services. It addressed the limitation of previous telecom service architectures, which were largely modality-siloed (e.g., voice call control separate from text messaging), by providing a foundation for unified, context-aware interaction management.

Key Features

  • XML-based format for representing semantic interpretations of user input
  • Supports annotation of input from multiple modalities (speech, touch, keyboard, etc.)
  • Includes metadata such as confidence scores, timestamps, and source modality identification
  • Enables information fusion by providing a common structure for inputs from different recognizers
  • Extensible through XML namespaces to accommodate domain-specific semantics (see the sketch after this list)
  • Facilitates interoperability between different recognition engines and multimodal application logic
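
As an illustration of the metadata and extensibility features above, the following sketch uses emma:one-of (defined in EMMA 1.0) to carry two N-best speech hypotheses with confidence scores and timestamps; the application namespace http://example.com/flight-app (prefix flt:) and its destination element are hypothetical:

  <emma:emma version="1.0"
      xmlns:emma="http://www.w3.org/2003/04/emma"
      xmlns:flt="http://example.com/flight-app">
    <!-- N-best list: alternative interpretations of one utterance -->
    <emma:one-of id="nbest1"
        emma:medium="acoustic" emma:mode="voice"
        emma:start="1241035886246" emma:end="1241035889306">
      <emma:interpretation id="alt1" emma:confidence="0.75"
          emma:tokens="flights to boston">
        <flt:destination>Boston</flt:destination>
      </emma:interpretation>
      <emma:interpretation id="alt2" emma:confidence="0.20"
          emma:tokens="flights to austin">
        <flt:destination>Austin</flt:destination>
      </emma:interpretation>
    </emma:one-of>
  </emma:emma>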

Evolution Across Releases

Rel-7 Initial

Initially referenced in 3GPP TS 23.333 as part of the framework for Multimedia Telephony (MMTel) and other IMS-based services. It was introduced to provide a standardized data format for multimodal interaction management, allowing network servers to process combined user inputs from voice, text, and other modalities in a unified way.

Defining Specifications

Specification   Title
TS 23.333       Multimedia Resource Function Controller (MRFC) - Multimedia Resource Function Processor (MRFP) Mp interface; Procedures descriptions