Description
OMASA, standardized in 3GPP TS 26.253 and related specifications, is an advanced audio format that falls under the broader category of Interactive Spatial Media (ISM). It focuses on the delivery of audio objects (discrete audio signals, each associated with a specific sound source in a scene) accompanied by rich metadata describing their spatial behavior. Unlike channel-based audio (e.g., 5.1 surround) or scene-based audio (e.g., Ambisonics), object-based audio like OMASA treats each sound as an independent entity with dynamic attributes such as 3D position (X, Y, Z coordinates), size, velocity, and gain. This enables precise rendering and interaction.
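To make the object model concrete, here is a minimal Python sketch of an audio object carrying time-varying spatial metadata. All names (`SpatialKeyframe`, `AudioObject`, `extent`, `gain_db`) are illustrative assumptions, not identifiers taken from TS 26.253.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SpatialKeyframe:
    """Spatial metadata for one audio object at one media timestamp."""
    time_s: float                                            # media time in seconds
    position: Tuple[float, float, float]                     # X, Y, Z in scene coordinates (metres)
    extent: float = 0.0                                      # apparent source size (metres)
    velocity: Tuple[float, float, float] = (0.0, 0.0, 0.0)   # m/s, optional motion hint
    gain_db: float = 0.0                                     # object gain relative to nominal level

@dataclass
class AudioObject:
    """A discrete sound source: one audio signal plus dynamic metadata."""
    object_id: int
    label: str                                               # e.g. "narrator", "footsteps"
    keyframes: List[SpatialKeyframe] = field(default_factory=list)

    def position_at(self, t: float) -> Tuple[float, float, float]:
        """Linearly interpolate the object's position at media time t.

        Assumes at least one keyframe; clamps outside the authored range.
        """
        kfs = self.keyframes
        if t <= kfs[0].time_s:
            return kfs[0].position
        for a, b in zip(kfs, kfs[1:]):
            if a.time_s <= t <= b.time_s:
                w = (t - a.time_s) / (b.time_s - a.time_s)
                return tuple(pa + w * (pb - pa) for pa, pb in zip(a.position, b.position))
        return kfs[-1].position

# Example: a narrator moving from left-front to right-front over two seconds.
obj = AudioObject(1, "narrator", [
    SpatialKeyframe(0.0, (1.0, 1.0, 0.0)),
    SpatialKeyframe(2.0, (1.0, -1.0, 0.0)),
])
print(obj.position_at(1.0))   # -> (1.0, 0.0, 0.0)
```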
The architecture of an OMASA system involves a creation side and a playback side. During creation, audio objects are captured or synthesized, and their spatial metadata (position over time, size, gain, etc.) is authored; this data is then encoded and packaged. OMASA leverages the ISO Base Media File Format (ISOBMFF) and typically uses the Immersive Sound Model (ISM) framework defined by MPEG-I, with the audio objects themselves encoded using codecs such as MPEG-H 3D Audio or AC-4. Crucially, the metadata is synchronized with the media timeline and can also be linked to visual metadata from the accompanying video (e.g., the bounding box of a visual object).
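The synchronization idea can be pictured as follows: per-frame spatial metadata is serialized as timed samples whose decode times share the audio track's timescale, and each sample carries the identifier of a linked visual object. The sketch below illustrates only this timing relationship; the binary layout, `TIMESCALE`, and function names are invented for illustration and are not the ISOBMFF box structure defined by the specifications.

```python
import struct

TIMESCALE = 48_000          # ticks per second, matching the audio track
FRAME_TICKS = 960           # one metadata sample per 20 ms audio frame

def pack_metadata_sample(position, gain_db, visual_object_id):
    """Serialize one timed metadata sample (illustrative layout, not ISOBMFF)."""
    x, y, z = position
    return struct.pack("<ffffI", x, y, z, gain_db, visual_object_id)

def build_track(samples_per_frame):
    """Assign decode times so metadata stays aligned with the audio timeline."""
    track = []
    for frame_index, (pos, gain_db, vis_id) in enumerate(samples_per_frame):
        decode_time = frame_index * FRAME_TICKS          # in TIMESCALE ticks
        track.append((decode_time, pack_metadata_sample(pos, gain_db, vis_id)))
    return track

# An object linked to visual object 7: a bounding box in the video metadata
# would carry the same identifier, tying the sound to what is on screen.
frames = [((1.0, 0.0, 0.0), -3.0, 7), ((0.0, 1.0, 0.0), -3.0, 7)]
for dt, payload in build_track(frames):
    print(f"decode_time={dt} ticks ({dt / TIMESCALE * 1000:.0f} ms), {len(payload)} bytes")
```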
For delivery, OMASA supports adaptive streaming protocols such as DASH. At the client, the player passes the audio object streams and their metadata to an OMASA renderer (often part of the device's audio processing chain or a dedicated SDK), which computes the final audio signal for the listener's specific output setup (headphones or a speaker array) from the current object positions and the listener's orientation (obtained via head-tracking in VR). This lets sounds remain fixed in world space as the user turns their head. In a network context, OMASA is designed to work in tandem with video formats like OMAF, providing a complete audiovisual immersive experience in which sound objects are tied to visual objects or to general scene positions.
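As a toy illustration of head-tracked, world-locked rendering, the sketch below rotates a world-space object position into the listener's head frame (yaw only, for brevity) and derives constant-power stereo gains from the resulting azimuth. A real OMASA renderer would use the full rendering chain of the specifications (HRTF-based binauralization, speaker-layout panning, etc.); none of the function names here come from the standard.

```python
import math

def world_to_head(position, head_yaw_rad):
    """Rotate a world-space position into the listener's head frame.

    Applying the inverse of the head yaw keeps the source fixed in the
    world: turning the head left moves the source right in the head frame.
    """
    x, y, z = position                        # x forward, y left, z up
    c, s = math.cos(-head_yaw_rad), math.sin(-head_yaw_rad)
    return (c * x - s * y, s * x + c * y, z)

def stereo_gains(position_head):
    """Constant-power stereo panning from the head-frame azimuth."""
    x, y, _ = position_head
    azimuth = math.atan2(y, x)                           # 0 = ahead, +pi/2 = left
    pan = max(-1.0, min(1.0, azimuth / (math.pi / 2)))   # -1 right .. +1 left
    theta = (pan + 1.0) * math.pi / 4                    # 0 .. pi/2
    return math.sin(theta), math.cos(theta)              # (left, right) gains

# A source 2 m straight ahead; the listener then turns 90 degrees to the left,
# so the source should end up entirely in the right channel.
source = (2.0, 0.0, 0.0)
for yaw_deg in (0, 90):
    l, r = stereo_gains(world_to_head(source, math.radians(yaw_deg)))
    print(f"head yaw {yaw_deg:3d} deg: L={l:.2f} R={r:.2f}")
```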
Purpose & Motivation
OMASA was created to address the limitations of traditional audio formats in interactive and immersive media scenarios. For 360-degree video and virtual reality, static channel-based audio, and even first-order Ambisonics, can lack precision and flexibility: such formats cannot easily represent discrete, moving sound sources that correspond to specific visual objects (e.g., a character speaking while walking around the user). This breaks immersion and reduces the sense of presence. OMASA solves this by providing a standardized way to describe and deliver such dynamic audio objects.
The key problem it tackles is the synchronization and efficient delivery of audio that is intrinsically linked to visual objects and their metadata. Prior to OMASA, ad-hoc methods or proprietary formats were used, leading to interoperability issues. OMASA provides a unified, interoperable format that ensures an audio object rendered as 'behind and to the left' is consistently reproduced as such on any compliant playback device. This is critical for mass-market immersive services.
Its development in 3GPP Rel-18 was motivated by the evolution of immersive media beyond simple 360-degree video towards more interactive and object-rich experiences, sometimes referred to as 'volumetric media' or '6 Degrees of Freedom (6DoF) media'. As part of the broader Interactive Spatial Media (ISM) work item, OMASA enables new use cases like interactive storytelling, social VR, and immersive training, where audio objects need to respond to user interaction or scene changes. It builds upon the foundation of OMAF for video and MPEG-I standards for audio, creating a complete, standards-based toolkit for next-generation immersive services over 5G networks.
Key Features
- Object-based audio with dynamic 3D spatial metadata (position, size, velocity)
- Synchronization of audio objects with visual object metadata from immersive video
- Support for interactive scenarios where audio objects can be triggered or modified at runtime (see the sketch after this list)
- Delivery via adaptive streaming (DASH) using ISOBMFF encapsulation
- Rendering adapted to listener's head orientation for VR/360 experiences
- Utilizes advanced audio codecs like MPEG-H 3D Audio within the ISM framework
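The interaction feature above can be pictured as runtime edits to an object's metadata before the next render pass. The sketch below is purely hypothetical: the event names, scene layout, and `on_user_event` API are invented for illustration and are not part of the OMASA specifications.

```python
# Hypothetical runtime interaction: user events mutate object metadata
# before the next render pass. All names are illustrative, not from the spec.
scene = {
    "narrator":  {"position": [0.0, 0.0, 0.0], "gain_db": 0.0, "active": True},
    "footsteps": {"position": [3.0, 1.0, 0.0], "gain_db": -6.0, "active": False},
}

def on_user_event(event, payload):
    """Apply a user interaction to the scene's audio objects."""
    if event == "trigger":                      # start a one-shot object
        scene[payload["object"]]["active"] = True
    elif event == "move":                       # e.g. user drags a visual object
        scene[payload["object"]]["position"] = payload["position"]
    elif event == "mute":                       # effectively silence the object
        scene[payload["object"]]["gain_db"] = -120.0

on_user_event("trigger", {"object": "footsteps"})
on_user_event("move", {"object": "footsteps", "position": [1.0, -2.0, 0.0]})
print(scene["footsteps"])
```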
Evolution Across Releases
Rel-18: Initial standardization of the OMASA format within the Interactive Spatial Media (ISM) work item. This release defined the core architecture for delivering audio objects with spatial metadata, including the data formats, the synchronization mechanisms with visual object metadata, and the integration with the 3GPP media delivery framework (PSS/DASH). It established the foundation for interactive spatial audio in immersive media services.
Defining Specifications
| Specification | Title |
|---|---|
| TS 26.253 | Codec for Immersive Voice and Audio Services (IVAS); Detailed algorithmic description incl. RTP payload format and SDP parameter definitions |
| TS 26.255 | Codec for Immersive Voice and Audio Services (IVAS); Error concealment of lost packets |
| TS 26.260 | Objective test methodologies for the evaluation of immersive audio systems |
| TS 26.261 | Terminal audio quality performance requirements for immersive audio services |