Description
OMASA, standardized in 3GPP TS 26.253 and related specifications, is an advanced audio format that falls under the broader category of Interactive Spatial Media (ISM). It focuses on the delivery of audio objects (discrete audio signals, each associated with a specific sound source in a scene) accompanied by rich metadata describing their spatial behavior. Unlike channel-based audio (e.g., 5.1 surround) or scene-based audio (e.g., Ambisonics), object-based audio like OMASA treats each sound as an independent entity with dynamic attributes such as 3D position (X, Y, Z coordinates), size, velocity, and gain. This enables precise rendering and interaction.
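To make the object model concrete, here is a minimal Python sketch of an audio object carrying time-varying spatial metadata. All names (`SpatialKeyframe`, `AudioObject`, `extent`, `gain_db`) are illustrative assumptions, not identifiers taken from TS 26.253.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SpatialKeyframe:
    """Spatial metadata for one audio object at one media timestamp."""
    time_s: float                                            # media time in seconds
    position: Tuple[float, float, float]                     # X, Y, Z in scene coordinates (metres)
    extent: float = 0.0                                      # apparent source size (metres)
    velocity: Tuple[float, float, float] = (0.0, 0.0, 0.0)   # m/s, optional motion hint
    gain_db: float = 0.0                                     # object gain relative to nominal level

@dataclass
class AudioObject:
    """A discrete sound source: one audio signal plus dynamic metadata."""
    object_id: int
    label: str                                               # e.g. "narrator", "footsteps"
    keyframes: List[SpatialKeyframe] = field(default_factory=list)

    def position_at(self, t: float) -> Tuple[float, float, float]:
        """Linearly interpolate the object's position at media time t.

        Assumes at least one keyframe; clamps outside the authored range.
        """
        kfs = self.keyframes
        if t <= kfs[0].time_s:
            return kfs[0].position
        for a, b in zip(kfs, kfs[1:]):
            if a.time_s <= t <= b.time_s:
                w = (t - a.time_s) / (b.time_s - a.time_s)
                return tuple(pa + w * (pb - pa) for pa, pb in zip(a.position, b.position))
        return kfs[-1].position

# Example: a narrator moving from left-front to right-front over two seconds.
obj = AudioObject(1, "narrator", [
    SpatialKeyframe(0.0, (1.0, 1.0, 0.0)),
    SpatialKeyframe(2.0, (1.0, -1.0, 0.0)),
])
print(obj.position_at(1.0))   # -> (1.0, 0.0, 0.0)
```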
The architecture of an OMASA system involves a creation side and a playback side. During creation, audio objects are captured or synthesized, and their spatial metadata (position over time, size, gain, etc.) is authored; this data is then encoded and packaged. OMASA leverages the ISO Base Media File Format (ISOBMFF) and typically uses the Immersive Sound Model (ISM) framework defined by MPEG-I, with the audio objects themselves encoded using codecs such as MPEG-H 3D Audio or AC-4. Crucially, the metadata is synchronized with the media timeline and can also be linked to visual metadata from the accompanying video (e.g., the bounding box of a visual object).
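The synchronization idea can be pictured as follows: per-frame spatial metadata is serialized as timed samples whose decode times share the audio track's timescale, and each sample carries the identifier of a linked visual object. The sketch below illustrates only this timing relationship; the binary layout, `TIMESCALE`, and function names are invented for illustration and are not the ISOBMFF box structure defined by the specifications.

```python
import struct

TIMESCALE = 48_000          # ticks per second, matching the audio track
FRAME_TICKS = 960           # one metadata sample per 20 ms audio frame

def pack_metadata_sample(position, gain_db, visual_object_id):
    """Serialize one timed metadata sample (illustrative layout, not ISOBMFF)."""
    x, y, z = position
    return struct.pack("<ffffI", x, y, z, gain_db, visual_object_id)

def build_track(samples_per_frame):
    """Assign decode times so metadata stays aligned with the audio timeline."""
    track = []
    for frame_index, (pos, gain_db, vis_id) in enumerate(samples_per_frame):
        decode_time = frame_index * FRAME_TICKS          # in TIMESCALE ticks
        track.append((decode_time, pack_metadata_sample(pos, gain_db, vis_id)))
    return track

# An object linked to visual object 7: a bounding box in the video metadata
# would carry the same identifier, tying the sound to what is on screen.
frames = [((1.0, 0.0, 0.0), -3.0, 7), ((0.0, 1.0, 0.0), -3.0, 7)]
for dt, payload in build_track(frames):
    print(f"decode_time={dt} ticks ({dt / TIMESCALE * 1000:.0f} ms), {len(payload)} bytes")
```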
For delivery, OMASA supports adaptive streaming protocols such as DASH. At the client, the player passes the audio object streams and their metadata to an OMASA renderer (often part of the device's audio processing chain or a dedicated SDK), which computes the final audio signal for the listener's specific output setup (headphones or a speaker array) from the current object positions and the listener's orientation (obtained via head-tracking in VR). This lets sounds remain fixed in world space as the user turns their head. In a network context, OMASA is designed to work in tandem with video formats like OMAF, providing a complete audiovisual immersive experience in which sound objects are tied to visual objects or to general scene positions.
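As a toy illustration of head-tracked, world-locked rendering, the sketch below rotates a world-space object position into the listener's head frame (yaw only, for brevity) and derives constant-power stereo gains from the resulting azimuth. A real OMASA renderer would use the full rendering chain of the specifications (HRTF-based binauralization, speaker-layout panning, etc.); none of the function names here come from the standard.

```python
import math

def world_to_head(position, head_yaw_rad):
    """Rotate a world-space position into the listener's head frame.

    Applying the inverse of the head yaw keeps the source fixed in the
    world: turning the head left moves the source right in the head frame.
    """
    x, y, z = position                        # x forward, y left, z up
    c, s = math.cos(-head_yaw_rad), math.sin(-head_yaw_rad)
    return (c * x - s * y, s * x + c * y, z)

def stereo_gains(position_head):
    """Constant-power stereo panning from the head-frame azimuth."""
    x, y, _ = position_head
    azimuth = math.atan2(y, x)                           # 0 = ahead, +pi/2 = left
    pan = max(-1.0, min(1.0, azimuth / (math.pi / 2)))   # -1 right .. +1 left
    theta = (pan + 1.0) * math.pi / 4                    # 0 .. pi/2
    return math.sin(theta), math.cos(theta)              # (left, right) gains

# A source 2 m straight ahead; the listener then turns 90 degrees to the left,
# so the source should end up entirely in the right channel.
source = (2.0, 0.0, 0.0)
for yaw_deg in (0, 90):
    l, r = stereo_gains(world_to_head(source, math.radians(yaw_deg)))
    print(f"head yaw {yaw_deg:3d} deg: L={l:.2f} R={r:.2f}")
```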
Purpose & Motivation
OMASA was created to address the limitations of traditional audio formats in interactive and immersive media scenarios. For 360-degree video and virtual reality, static channel-based audio, and even first-order Ambisonics, can lack precision and flexibility: such formats cannot easily represent discrete, moving sound sources that correspond to specific visual objects (e.g., a character speaking while walking around the user). This breaks immersion and reduces the sense of presence. OMASA solves this by providing a standardized way to describe and deliver such dynamic audio objects.
The key problem it tackles is the synchronization and efficient delivery of audio that is intrinsically linked to visual objects and their metadata. Prior to OMASA, ad-hoc methods or proprietary formats were used, leading to interoperability issues. OMASA provides a unified, interoperable format that ensures an audio object rendered as 'behind and to the left' is consistently reproduced as such on any compliant playback device. This is critical for mass-market immersive services.
Its development in 3GPP Rel-18 was motivated by the evolution of immersive media beyond simple 360-degree video towards more interactive and object-rich experiences, sometimes referred to as 'volumetric media' or '6 Degrees of Freedom (6DoF) media'. As part of the broader Interactive Spatial Media (ISM) work item, OMASA enables new use cases like interactive storytelling, social VR, and immersive training, where audio objects need to respond to user interaction or scene changes. It builds upon the foundation of OMAF for video and MPEG-I standards for audio, creating a complete, standards-based toolkit for next-generation immersive services over 5G networks.
Key Features
- Object-based audio with dynamic 3D spatial metadata (position, size, velocity)
- Synchronization of audio objects with visual object metadata from immersive video
- Support for interactive scenarios where audio objects can be triggered or modified at runtime (see the sketch after this list)
- Delivery via adaptive streaming (DASH) using ISOBMFF encapsulation
- Rendering adapted to listener's head orientation for VR/360 experiences
- Utilizes advanced audio codecs like MPEG-H 3D Audio within the ISM framework
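The interaction feature above can be pictured as runtime edits to an object's metadata before the next render pass. The sketch below is purely hypothetical: the event names, scene layout, and `on_user_event` API are invented for illustration and are not part of the OMASA specifications.

```python
# Hypothetical runtime interaction: user events mutate object metadata
# before the next render pass. All names are illustrative, not from the spec.
scene = {
    "narrator":  {"position": [0.0, 0.0, 0.0], "gain_db": 0.0, "active": True},
    "footsteps": {"position": [3.0, 1.0, 0.0], "gain_db": -6.0, "active": False},
}

def on_user_event(event, payload):
    """Apply a user interaction to the scene's audio objects."""
    if event == "trigger":                      # start a one-shot object
        scene[payload["object"]]["active"] = True
    elif event == "move":                       # e.g. user drags a visual object
        scene[payload["object"]]["position"] = payload["position"]
    elif event == "mute":                       # effectively silence the object
        scene[payload["object"]]["gain_db"] = -120.0

on_user_event("trigger", {"object": "footsteps"})
on_user_event("move", {"object": "footsteps", "position": [1.0, -2.0, 0.0]})
print(scene["footsteps"])
```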
Evolution Across Releases
Rel-18: Initial standardization of the OMASA format within the Interactive Spatial Media (ISM) work item. This release defined the core architecture for delivering audio objects with spatial metadata, including the data formats, the synchronization mechanisms with visual object metadata, and the integration with the 3GPP media delivery framework (PSS/DASH). It established the foundation for interactive spatial audio in immersive media services.
Defining Specifications
| Specification | Title |
|---|---|
| TS 26.253 | Codec for Immersive Voice and Audio Services (IVAS); Detailed algorithmic description incl. RTP payload format and SDP parameter definitions |
| TS 26.255 | Codec for Immersive Voice and Audio Services (IVAS); Error concealment of lost packets |
| TS 26.260 | Objective test methodologies for the evaluation of immersive audio systems |
| TS 26.261 | Terminal audio quality performance requirements for immersive audio services |