What is OMASA? Objects (ISM) with Metadata-Assisted Spatial Audio

Description

OMASA, standardized in 3GPP TS 26.253 and related specifications, is an immersive audio format that combines object-based audio, carried as Independent Streams with Metadata (ISM), with Metadata-Assisted Spatial Audio (MASA). It focuses on the delivery of audio objects (discrete audio signals, each associated with a specific sound source in a scene) accompanied by rich metadata that describes their spatial behavior. Unlike channel-based (e.g., 5.1 surround) or scene-based (e.g., Ambisonics) audio, object-based audio like OMASA treats each sound as an independent entity with dynamic attributes such as 3D position (X, Y, Z coordinates), size, velocity, and gain. This allows for precise rendering and interaction.
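
To make the idea of an audio object with spatial attributes concrete, here is a minimal Python sketch. The `AudioObject` class, its field names, and the axis convention (x forward, y to the left, z up) are assumptions chosen for illustration; they are not taken from TS 26.253 or any real codec API.

```python
import math
from dataclasses import dataclass


@dataclass
class AudioObject:
    """Hypothetical container for one audio object and its spatial metadata."""
    name: str
    x: float  # metres in front of the listener (assumed axis convention)
    y: float  # metres to the listener's left
    z: float  # metres above the listener
    gain: float = 1.0  # linear gain


def to_spherical(obj):
    """Return (azimuth_deg, elevation_deg, distance) for an object.

    Azimuth is measured from straight ahead, positive to the left;
    elevation is measured from the horizontal plane, positive upwards.
    """
    distance = math.sqrt(obj.x ** 2 + obj.y ** 2 + obj.z ** 2)
    azimuth = math.degrees(math.atan2(obj.y, obj.x))
    elevation = math.degrees(math.atan2(obj.z, math.hypot(obj.x, obj.y)))
    return azimuth, elevation, distance


voice = AudioObject("voice", x=1.0, y=1.0, z=0.0)
print(to_spherical(voice))
```

With these assumed conventions, an object one metre ahead and one metre to the left sits at an azimuth of about 45 degrees and an elevation of 0 degrees.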

The architecture of an OMASA system involves a creation side and a playback side. During creation, audio objects are captured or synthesized, and their spatial metadata (for example, position over time) is authored. This data is then encoded and packaged. OMASA leverages the ISO Base Media File Format (ISOBMFF) and typically uses the ISM framework for object metadata. The audio objects can be encoded using codecs like MPEG-H 3D Audio or AC-4. Crucially, the metadata is synchronized with the media timeline and can also be linked to visual metadata from the accompanying video (e.g., a bounding box for a visual object).
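
The notion of position metadata that is authored over time and kept in sync with the media timeline can be sketched as a keyframe lookup. The keyframe format and the `position_at` helper below are hypothetical, and linear interpolation is an assumption made for illustration; a real codec may instead update metadata once per audio frame.

```python
from bisect import bisect_right


def position_at(keyframes, t):
    """Linearly interpolate an (x, y, z) position at media time t (seconds).

    keyframes: list of (time, (x, y, z)) tuples sorted by time.
    Times before the first or after the last keyframe clamp to that keyframe.
    """
    times = [time for time, _ in keyframes]
    if t <= times[0]:
        return keyframes[0][1]
    if t >= times[-1]:
        return keyframes[-1][1]
    i = bisect_right(times, t)  # index of the first keyframe later than t
    t0, p0 = keyframes[i - 1]
    t1, p1 = keyframes[i]
    w = (t - t0) / (t1 - t0)  # blend weight between the two keyframes
    return tuple(a + w * (b - a) for a, b in zip(p0, p1))


# A sound source that moves from the origin to (2, 4, 0) over two seconds.
path = [(0.0, (0.0, 0.0, 0.0)), (2.0, (2.0, 4.0, 0.0))]
print(position_at(path, 1.0))
```

Querying the path at the timestamp of the audio frame being rendered keeps the object's position aligned with the media timeline.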

For delivery, OMASA supports adaptive streaming protocols like DASH. At the client, the player receives the audio object streams and their metadata. A renderer (often part of the device's audio processing or a dedicated SDK) then computes the final audio signal for the listener's specific output setup (headphones or a speaker array) based on the current object positions and the listener's orientation, which in VR is obtained from head-tracking. This allows sounds to remain fixed in world space as the user turns their head. In a network context, OMASA is designed to work in tandem with video formats like OMAF, providing a complete audiovisual immersive experience where sound objects are tied to visual objects or general scene positions.
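
Keeping a sound fixed in world space while the head turns amounts to rotating each object position by the inverse of the head rotation before rendering. The sketch below handles yaw only and assumes x forward, y left, z up, with positive yaw meaning a turn to the left; the function names are hypothetical and do not come from any particular renderer or SDK.

```python
import math


def world_to_head(position, yaw_deg):
    """Rotate a world-space (x, y, z) position into head-relative coordinates.

    yaw_deg is the listener's head yaw, positive when turning to the left.
    The inverse rotation is applied so the source stays put in the world.
    """
    x, y, z = position
    yaw = math.radians(yaw_deg)
    xh = x * math.cos(yaw) + y * math.sin(yaw)
    yh = -x * math.sin(yaw) + y * math.cos(yaw)
    return xh, yh, z


def relative_azimuth_deg(position, yaw_deg):
    """Azimuth of the source as heard by the listener, positive to the left."""
    xh, yh, _ = world_to_head(position, yaw_deg)
    return math.degrees(math.atan2(yh, xh))


# A source straight ahead in the world, heard after a 90-degree left turn.
print(relative_azimuth_deg((1.0, 0.0, 0.0), 90.0))
```

After the listener turns 90 degrees to the left, a source that was straight ahead is rendered at roughly -90 degrees, that is, to the listener's right. A full renderer would also handle pitch and roll, typically with a rotation matrix or quaternion.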

Purpose & Motivation

OMASA was created to address the limitations of traditional audio formats in interactive and immersive media scenarios. For 360-degree video and virtual reality, static channel-based or even first-order Ambisonics audio can lack precision and flexibility. They cannot easily represent discrete, moving sound sources that correspond to specific visual objects (e.g., a character speaking as they walk around the user). This breaks immersion and reduces the sense of presence. OMASA solves this by providing a standardized way to describe and deliver such dynamic audio objects.

The key problem it tackles is the synchronization and efficient delivery of audio that is intrinsically linked to visual objects and their metadata. Prior to OMASA, ad-hoc methods or proprietary formats were used, leading to interoperability issues. OMASA provides a unified, interoperable format that ensures an audio object rendered as 'behind and to the left' is consistently reproduced as such on any compliant playback device. This is critical for mass-market immersive services.

Its development in 3GPP Rel-18 was motivated by the evolution of immersive media beyond simple 360-degree video towards more interactive and object-rich experiences, sometimes referred to as 'volumetric media' or '6 Degrees of Freedom (6DoF) media'. As part of the broader IVAS codec work, OMASA enables new use cases like interactive storytelling, social VR, and immersive training, where audio objects need to respond to user interaction or scene changes. It builds upon the foundation of OMAF for video and MPEG-I standards for audio, creating a complete, standards-based toolkit for next-generation immersive services over 5G networks.

Classification

Part of OMAF

Detected Changes Across Releases

from 3GPP Change Requests

Specific changes extracted from the "Change history" tables of 3GPP specifications (1 CR across 1 release). These complement the general historical overview above with the evidence-based evolution of this function.

Rel-18: 1 change

In Release 18, the OMASA (Objects with Metadata-Assisted Spatial Audio) function was introduced as a combined immersive audio format, supporting the encoding and decoding of both object-based audio (ISM) and metadata-assisted spatial audio (MASA) together. This combined format operates within a bitrate range of 13.2 to 512 kbps and supports wideband, super-wideband, and full-band audio. Its inclusion expanded the IVAS codec framework's capabilities for low-delay, high-quality immersive communication.

Correction of sensitivity calculation for immersive audio playback (TS 26.260, CR002)

Defining Specifications

3GPP specifications that define or reference OMASA, with the latest known release. Sourced from the 3GPP document catalog.

Specification	Title	Release
TS 26.253 vj00	IVAS Codec Algorithmic Description	Rel-19
TS 26.255 vj00	IVAS Frame Loss Concealment Procedure	Rel-19
TS 26.260 vj00	Immersive Audio Objective Test Methods	Rel-19
TS 26.261 vj00	Electro-acoustic specs for immersive terminals	Rel-19