SBA (Scene-Based Audio (Ambisonics)) — 3GPP Glossary

A 3D audio format for capturing and reproducing a full spherical sound field, enabling immersive audio experiences like virtual reality. It represents sound as spherical harmonics, independent of specific speaker layouts. 3GPP standardizes its delivery over mobile networks.

Description

Scene-Based Audio (SBA), based on the Ambisonics technique, is a full-sphere surround sound format that captures and represents a three-dimensional sound field. Unlike channel-based audio (e.g., 5.1, 7.1.4), which encodes audio for specific speaker positions, or object-based audio, which encodes individual sound objects together with positional metadata, SBA encodes the sound field itself as a set of spherical harmonic components. At first order, this representation corresponds to the sound pressure and the three components of particle velocity at a single point in space; higher orders add finer spatial detail. Because the representation does not assume any speaker layout, the sound field can be reconstructed over a variety of playback systems, from headphones with binaural rendering to large speaker arrays.

The core of SBA is the B-format signal. In its first-order form it consists of four channels: W (omnidirectional pressure) and X, Y, Z (three orthogonal figure-of-eight components representing pressure gradients). This first-order Ambisonics (FOA) can be extended to higher-order Ambisonics (HOA) by including more spherical harmonic components: a full-sphere representation of order N uses (N+1)^2 channels, so second order needs 9 channels and third order needs 16. Higher orders increase the spatial resolution and accuracy of the reconstructed sound field, including for elevated sources, and allow more precise localization, at the cost of more channels to compress and transport. The 3GPP standardization focuses on efficiently compressing, transporting, and rendering these Ambisonics components within media services, such as streaming for virtual reality (VR), augmented reality (AR), and 360-degree video.
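The relationship between a source direction and the four first-order channels can be sketched in a few lines. The example below is illustrative only and is not taken from any 3GPP specification. It assumes the traditional B-format convention (W scaled by 1/sqrt(2), azimuth measured counter-clockwise from the front, elevation measured upward from the horizontal plane); other normalizations and channel orders are in common use. The function names `hoa_channel_count` and `encode_foa` are hypothetical.

```python
import math


def hoa_channel_count(order: int) -> int:
    """Number of channels in a full-sphere Ambisonics signal of the given order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (order + 1) ** 2


def encode_foa(sample: float, azimuth_deg: float, elevation_deg: float):
    """Encode one mono sample arriving from a direction into (W, X, Y, Z).

    Assumed convention (traditional B-format): W carries the signal scaled by
    1/sqrt(2); X points to the front, Y to the left, Z upward.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)                   # omnidirectional component
    x = sample * math.cos(az) * math.cos(el)      # front-back figure-of-eight
    y = sample * math.sin(az) * math.cos(el)      # left-right figure-of-eight
    z = sample * math.sin(el)                     # up-down figure-of-eight
    return (w, x, y, z)


if __name__ == "__main__":
    print([hoa_channel_count(n) for n in range(4)])
    print(encode_foa(1.0, azimuth_deg=90.0, elevation_deg=0.0))
```

A source directly to the left (azimuth 90 degrees under the assumed convention) lands almost entirely in Y, while W is the same for every direction. That is why the channel count depends only on the Ambisonics order and not on the number of sources in the scene.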

Within the 3GPP architecture, SBA is integrated into the media delivery pipeline. The specifications define how SBA content is encapsulated in media containers (like ISOBMFF), compressed using audio codecs (with specific handling for the spherical harmonic channels), and described in media presentation descriptions. A key aspect is the support for dynamic rendering: the SBA bitstream, containing the sound field coefficients, is delivered to the client device. The device's audio renderer then applies a set of decoding matrices, potentially tailored to the user's head orientation (tracked via a head-mounted display) and output setup (headphones or speakers), to binauralize or decode the audio for immersive playback. Because an Ambisonics sound field is defined around a single listening point, head tracking of this kind provides three degrees of freedom (3DoF): the listener can rotate their head freely while the scene stays fixed in space. Letting the listener also move through the scene (six degrees of freedom, 6DoF) requires additional information or processing beyond a single sound field.
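Head-tracked rendering works by rotating the sound field before it is decoded, which for first-order content is a small matrix operation. The sketch below handles yaw (rotation about the vertical axis) only and assumes a B-format frame with X to the front, Y to the left, and Z upward. It is not a normative 3GPP renderer, and the function names are hypothetical.

```python
import math


def rotate_foa_yaw(frame, yaw_deg: float):
    """Rotate a first-order (W, X, Y, Z) frame counter-clockwise about the Z axis."""
    w, x, y, z = frame
    a = math.radians(yaw_deg)
    c, s = math.cos(a), math.sin(a)
    # W (omnidirectional) and Z (vertical) are unaffected by a yaw rotation.
    return (w, c * x - s * y, s * x + c * y, z)


def compensate_head_yaw(frame, head_yaw_deg: float):
    """Counter-rotate the scene so sources stay fixed in the world as the head turns."""
    return rotate_foa_yaw(frame, -head_yaw_deg)


if __name__ == "__main__":
    front_source = (1.0 / math.sqrt(2.0), 1.0, 0.0, 0.0)  # source straight ahead
    # The listener turns 90 degrees to the left; the source should move to the right.
    print(compensate_head_yaw(front_source, head_yaw_deg=90.0))
```

After the listener turns 90 degrees to the left, the source that was straight ahead appears in the negative Y direction (the listener's right), as expected. Pitch and roll are handled the same way with full 3D rotation matrices, and higher orders use larger rotation matrices per order.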

3GPP's work on SBA spans multiple technical specifications (TS) covering codecs, file formats, system protocols, and security, with the aim of ensuring interoperability for immersive audio services across different networks and devices. The specifications also address metadata for coordinating SBA with 360-degree video, so that audio and video stay synchronized as the user's viewpoint changes. This makes SBA a foundational technology for delivering interactive, immersive media experiences over 5G and later networks.

Purpose & Motivation

Scene-Based Audio (Ambisonics) was standardized by 3GPP to address the growing market for immersive media, particularly driven by virtual and augmented reality. Traditional channel-based audio is tied to fixed speaker configurations and cannot adapt to user head movement or different playback environments. Object-based audio provides flexibility but requires significant metadata and computational power for rendering many objects. SBA was motivated by the need for a format that inherently describes a complete sound scene in a compact, playback-agnostic manner.

The historical context is the rise of 360-degree video and VR content. Early VR experiences often used static binaural audio or simple multi-channel mixes, which broke immersion when the user turned their head because the sound turned with them. Ambisonics, a technique dating back to the 1970s, was identified as a suitable solution because it encodes the sound field mathematically and can therefore be rotated before rendering. 3GPP's role was to standardize its use in a telecommunications ecosystem, solving the problems of efficient compression for transmission over bandwidth-constrained mobile networks and defining how clients receive and render the audio in sync with video.

It addresses key limitations of previous audio formats for immersive applications. Channel-based audio lacks adaptability. Object-based audio can become computationally complex for dense scenes. SBA offers a middle ground: a scene description whose size depends on the Ambisonics order rather than on the number of sources, that is independent of the output setup, and that is well suited to head-tracked binaural rendering, which is essential for VR. Its standardization enables content creators to produce a single audio stream that works on any compliant device, from mobile phones with headphones to dedicated VR systems, fostering an interoperable ecosystem for immersive 3GPP media services.

Key Features

Encodes the sound field using spherical harmonics (B-format)
Playback-agnostic; supports rendering to headphones, speaker arrays, etc.
Enables head-tracked rendering with three rotational degrees of freedom (3DoF)
Integrated with 360-degree video delivery in 3GPP media services
Supports compression using standardized audio codecs, with a defined mapping of Ambisonics components to codec channels
Defines metadata for synchronization with visual viewpoint and playback configuration
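
The playback-agnostic property listed above can be illustrated with a very simple decoder: the same four-channel frame is turned into speaker feeds for whatever layout the device reports. The sketch below uses a basic "virtual cardioid microphone" decode, chosen here only for clarity; it is not the renderer defined by any 3GPP specification, and practical decoders use more refined matrices. It assumes traditional B-format with W scaled by 1/sqrt(2), and the names `SQUARE_LAYOUT_DEG` and `decode_foa_virtual_cardioids` are hypothetical.

```python
import math

# Hypothetical horizontal square layout: (azimuth, elevation) in degrees,
# azimuth counter-clockwise from the front.
SQUARE_LAYOUT_DEG = [(45.0, 0.0), (135.0, 0.0), (-135.0, 0.0), (-45.0, 0.0)]


def decode_foa_virtual_cardioids(frame, layout_deg):
    """Derive one feed per speaker by pointing a virtual cardioid at it."""
    w, x, y, z = frame
    feeds = []
    for az_deg, el_deg in layout_deg:
        az, el = math.radians(az_deg), math.radians(el_deg)
        # Projection of the first-order components onto the speaker direction.
        directional = (
            x * math.cos(az) * math.cos(el)
            + y * math.sin(az) * math.cos(el)
            + z * math.sin(el)
        )
        # sqrt(2) undoes the assumed 1/sqrt(2) scaling of W.
        feeds.append(0.5 * (math.sqrt(2.0) * w + directional))
    return feeds


if __name__ == "__main__":
    # Unit-amplitude source at azimuth 45 degrees, on the horizontal plane.
    a = math.radians(45.0)
    frame = (1.0 / math.sqrt(2.0), math.cos(a), math.sin(a), 0.0)
    print(decode_foa_virtual_cardioids(frame, SQUARE_LAYOUT_DEG))
```

Swapping `SQUARE_LAYOUT_DEG` for a different list of speaker directions changes the output configuration without touching the transmitted signal, which is the sense in which the format is independent of the playback setup.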

Evolution Across Releases

Rel-14 Initial

Initial standardization of Scene-Based Audio (Ambisonics) within 3GPP. Defined the core framework for representing, transporting, and rendering SBA content. This included specifying the use of first-order Ambisonics (FOA) channels, their identification in media containers, and initial support for integration with VR/360-degree video services.

TS 23.433 TS 23.501 TS 23.540 TS 23.700 TS 24.229 TS 26.253 TS 26.255 TS 26.258 TS 26.260 TS 26.261 TS 26.501 TS 26.918 TS 26.997 TS 28.541 TS 29.309 TS 29.829 TS 33.117 TS 33.514 TS 33.794 TS 33.835 TS 33.841 TS 33.848

Defining Specifications

Specification	Title
TS 23.433	3GPP TS 23.433
TS 23.501	3GPP TS 23.501
TS 23.540	3GPP TS 23.540
TS 23.700	3GPP TS 23.700
TS 24.229	3GPP TS 24.229
TS 26.253	3GPP TS 26.253
TS 26.255	3GPP TS 26.255
TS 26.258	3GPP TS 26.258
TS 26.260	3GPP TS 26.260
TS 26.261	3GPP TS 26.261
TS 26.501	3GPP TS 26.501
TS 26.918	3GPP TS 26.918
TS 26.997	3GPP TS 26.997
TS 28.541	3GPP TS 28.541
TS 29.309	3GPP TS 29.309
TS 29.829	3GPP TS 29.829
TS 33.117	3GPP TR 33.117
TS 33.514	3GPP TR 33.514
TS 33.794	3GPP TR 33.794
TS 33.835	3GPP TR 33.835
TS 33.841	3GPP TR 33.841
TS 33.848	3GPP TR 33.848