OSBA

Objects (ISM) with Scene-Based Audio

Services
Introduced in Rel-18
Objects (ISM) with Scene-Based Audio (OSBA) is a 3GPP standard for immersive audio content delivery. It defines how to represent and stream audio scenes that combine individual sound objects, each with associated metadata, with a scene-based (Ambisonics) audio component, enabling personalized, interactive, and high-quality spatial audio experiences for media services such as XR.

Description

Objects (ISM) with Scene-Based Audio (OSBA) is a media delivery format in which object-based audio, carried as Independent Streams with Metadata (ISM), is combined with a scene-based (Ambisonics) audio component to represent and render complex audio scenes. An audio scene in OSBA is not a single monolithic audio track but a composition of multiple individual audio objects, each with its own audio essence (the sound data) and rich spatial metadata, together with the scene-based component. The metadata precisely defines each object's position, movement, size, and other acoustic properties within a three-dimensional coordinate system. The core of OSBA's operation involves the authoring, encapsulation, delivery, and client-side rendering of these scenes: content creators author scenes using tools that output audio objects and metadata, which are then packaged according to 3GPP specifications, typically within ISOBMFF (MP4) containers for streaming.
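
To make the object model concrete, here is a minimal sketch, in Python, of how a scene of audio objects with timed spatial metadata might be represented. The type and field names (SpatialMetadataSample, azimuth_deg, spread_deg, and so on) are illustrative assumptions, not the normative 3GPP metadata fields.

```python
# Illustrative data model for an object-based audio scene.
# Field names are assumptions, not the normative 3GPP metadata fields.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SpatialMetadataSample:
    """One timed spatial-metadata sample for an object (time in seconds)."""
    time: float
    azimuth_deg: float        # horizontal angle; 0 = front, positive = left
    elevation_deg: float      # vertical angle; 0 = ear level
    distance_m: float         # radial distance from the listener
    spread_deg: float = 0.0   # apparent source width ("size")
    gain_db: float = 0.0      # per-object level adjustment

@dataclass
class AudioObject:
    """An audio object: essence (coded sound data) plus timed metadata."""
    name: str
    essence: bytes                                    # coded audio payload
    metadata: List[SpatialMetadataSample] = field(default_factory=list)

@dataclass
class AudioScene:
    """A scene is a composition of independent audio objects."""
    objects: List[AudioObject] = field(default_factory=list)
```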

For delivery, OSBA leverages existing adaptive streaming protocols like DASH or HLS. The audio objects and their dynamic metadata are packaged as separate media components or tracks within a media presentation. This allows the streaming client to request and receive only the components necessary for the current scene and user perspective. A key technical aspect is the synchronization of object audio essence with its time-varying spatial metadata, ensuring that sounds are rendered at the correct location at the correct time. The client-side renderer, which could be on a smartphone, XR headset, or home theater system, receives these components, decodes the audio objects, and uses the metadata to spatially render the audio scene in real-time, often using binaural rendering for headphones or channel-based rendering for speaker arrays.
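
A minimal sketch, assuming the data model above, of how a client might keep essence and metadata in sync: for each audio frame, look up the metadata sample in effect at the frame's timestamp, then render. The constant-power stereo panner is a deliberately crude stand-in for a real binaural (HRTF-based) renderer, used here only to show the timing logic.

```python
# Illustrative essence/metadata synchronization for one object.
# Assumes the SpatialMetadataSample type sketched earlier, with the
# metadata list sorted by time.
import bisect
import math

def active_sample(metadata, t):
    """Return the metadata sample in effect at time t (latest time <= t)."""
    times = [m.time for m in metadata]
    i = bisect.bisect_right(times, t) - 1
    return metadata[max(i, 0)]

def render_frame(mono_frame, metadata, t):
    """Pan one mono frame (a list of floats) to stereo using the azimuth
    active at time t. A real renderer would use HRTF-based binauralization."""
    m = active_sample(metadata, t)
    # Map azimuth in [-90, +90] degrees to a constant-power pan position.
    theta = (m.azimuth_deg + 90.0) / 180.0 * (math.pi / 2.0)
    left_gain, right_gain = math.sin(theta), math.cos(theta)
    gain = 10.0 ** (m.gain_db / 20.0)  # apply the per-object level
    return ([s * gain * left_gain for s in mono_frame],
            [s * gain * right_gain for s in mono_frame])
```

In a real client the lookup would be driven by the container's timed-metadata track rather than a Python list, but the contract is the same: every rendered frame must use the metadata valid at its presentation time.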

The role of OSBA in the network is as an application-layer media format standard. It sits atop the core network's data transport capabilities, enabling service providers to offer next-generation audio experiences. It is integral to media services like extended reality (XR), interactive live events, and personalized audio for video. By separating the audio scene description (metadata) from the audio essence, OSBA enables advanced features like selective object enhancement, accessibility features (e.g., boosting commentary audio), and bandwidth efficiency, as objects can be added, removed, or substituted based on network conditions or user preferences without re-encoding the entire scene.
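
A rough illustration of the client-side personalization this separation enables: boost one named object (say, the commentary) and drop trailing objects under a bandwidth cap, leaving the rest of the scene untouched. The selection policy and object names are assumptions for illustration, not behavior mandated by the specifications.

```python
# Illustrative per-object personalization, assuming the AudioScene model
# sketched earlier. Policy choices here are examples, not normative rules.
def personalize(scene, boost_name=None, boost_db=6.0, max_objects=None):
    """Return the objects to request, applying an optional per-object boost."""
    objects = list(scene.objects)
    if max_objects is not None:
        # Crude bandwidth adaptation: keep only the first N objects.
        objects = objects[:max_objects]
    for obj in objects:
        if obj.name == boost_name:
            for sample in obj.metadata:
                sample.gain_db += boost_db  # e.g., accessibility boost
    return objects
```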

Purpose & Motivation

OSBA was created to address the limitations of traditional channel-based (e.g., 5.1, 7.1) and scene-based (e.g., Ambisonics) audio formats in delivering truly immersive and interactive audio experiences for emerging media. Channel-based audio is tied to a specific speaker layout and offers no interactivity, while first-order Ambisonics has limited spatial resolution. The rise of applications like virtual reality (VR), augmented reality (AR), and interactive 360-degree video demanded an audio format that could provide precise, dynamic spatial audio that reacts to user head movements and interactions.

The primary problem OSBA solves is how to efficiently stream complex, multi-object audio scenes over potentially constrained mobile networks while allowing for client-side personalization and adaptation. Previous approaches either required pre-mixing audio for a specific output (losing flexibility) or transmitted higher-order Ambisonics sound fields (which can be bandwidth-inefficient and lack object-level control). OSBA's object-based approach allows the network to transmit a scene description and discrete audio elements, enabling the end-user's device to perform the final, personalized rendering. This is crucial for XR, where the audio must update in real time based on the user's head orientation.

Therefore, the motivation for OSBA was to standardize an interoperable format for object-based immersive audio, ensuring content created by one provider can be rendered correctly on devices from different manufacturers. This standardization, part of the broader 3GPP media codec and delivery work, aims to catalyze the ecosystem for immersive media services over 5G and beyond, making personalized, cinematic-quality audio a viable service for mobile users.

Key Features

  • Object-based audio scene representation with individual audio objects and metadata
  • Precise 3D spatial metadata defining object position, width, and movement
  • Synchronized delivery of audio essence and dynamic spatial metadata streams
  • Support for client-side, personalized rendering based on user perspective (e.g., head-tracking; see the sketch after this list)
  • Efficient packaging and streaming using ISOBMFF and DASH/HLS protocols
  • Enables interactive audio features like object selection, emphasis, and accessibility adjustments
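
As a small illustration of the head-tracking point above: counter-rotating a world-referenced object azimuth by the listener's yaw keeps sources anchored in the world as the head turns. Restricting the rotation to yaw is an assumption made to keep the sketch short; a real XR renderer applies full 3-DoF or 6-DoF pose compensation before binaural rendering.

```python
# Illustrative yaw compensation for head-tracked rendering.
# Convention (an assumption): positive azimuth and positive yaw are
# both counter-clockwise (to the listener's left).
def world_to_head_azimuth(object_azimuth_deg, head_yaw_deg):
    """Rotate a world-referenced azimuth into the listener's head frame."""
    relative = object_azimuth_deg - head_yaw_deg
    return (relative + 180.0) % 360.0 - 180.0  # wrap into [-180, 180)

# An object 30 degrees to the left sits dead ahead once the head has
# turned 30 degrees to the left.
assert world_to_head_azimuth(30.0, 30.0) == 0.0
```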

Evolution Across Releases

Rel-18 Initial

OSBA was initially introduced, defining the complete architecture for object-based scene audio delivery. This included the specification of the Independent Streams with Metadata (ISM) object format and its spatial metadata, the formats for encapsulating audio objects and their metadata within ISOBMFF, and the protocols for streaming these components. It established the end-to-end workflow from authoring to client-side rendering for immersive media services.

Defining Specifications

  • 3GPP TS 26.253
  • 3GPP TS 26.255
  • 3GPP TS 26.260
  • 3GPP TS 26.261