Description
Scene-Based Audio (SBA), based on the Ambisonics technique, is a full-sphere surround sound format that captures and represents a three-dimensional sound field. Unlike channel-based audio (e.g., 5.1, 7.1.4), which encodes audio for specific speaker positions, or object-based audio (as supported by, e.g., MPEG-H 3D Audio), which encodes individual sound objects with positional metadata, SBA encodes the sound field itself as a set of spherical harmonic components. At first order, this representation describes the sound pressure and the pressure gradient (particle velocity) at a single point in space; higher orders add finer directional detail. Because the representation is independent of any loudspeaker layout, the sound field can be reconstructed over a variety of playback systems, from headphones with binaural rendering to complex speaker arrays.
The core of SBA is the B-format signal. In its first-order form it consists of four channels: W (omnidirectional pressure) and X, Y, Z (three orthogonal figure-of-eight components representing pressure gradients). This first-order Ambisonics (FOA) can be extended to higher-order Ambisonics (HOA) by including more spherical harmonic components: a full-sphere representation of order N uses (N+1)^2 channels, so second order needs 9 and third order needs 16. Higher orders increase the spatial resolution and accuracy of the reconstructed sound field and allow more precise localization. The 3GPP standardization focuses on efficiently compressing, transporting, and rendering these Ambisonics components within media services, such as streaming for virtual reality (VR), augmented reality (AR), and 360-degree video.
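As an illustration of the B-format channels described above, the following minimal sketch pans a single mono sample into first-order B-format. It assumes the traditional B-format weighting (W scaled by 1/sqrt(2)), azimuth measured counterclockwise from the front, and elevation measured upward from the horizontal plane. The function names `encode_foa` and `hoa_channel_count` are hypothetical, and the sketch is not taken from any 3GPP specification.

```python
import math


def encode_foa(sample, azimuth_deg, elevation_deg):
    """Pan one mono sample into first-order B-format (W, X, Y, Z).

    Assumption: traditional B-format weighting, where W carries the
    omnidirectional pressure scaled by 1/sqrt(2).
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)                   # omnidirectional component
    x = sample * math.cos(az) * math.cos(el)      # front-back figure-of-eight
    y = sample * math.sin(az) * math.cos(el)      # left-right figure-of-eight
    z = sample * math.sin(el)                     # up-down figure-of-eight
    return (w, x, y, z)


def hoa_channel_count(order):
    """Number of spherical harmonic channels for a full-sphere order-N scene."""
    return (order + 1) ** 2


if __name__ == "__main__":
    print(encode_foa(1.0, 90.0, 0.0))             # source directly to the left
    print([hoa_channel_count(n) for n in range(4)])
```

A source directly in front contributes only to W and X, while a source directly to the left contributes only to W and Y. Other normalization conventions exist, so the exact gains differ between toolchains.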
Within the 3GPP architecture, SBA is integrated into the media delivery pipeline. The specifications define how SBA content is encapsulated in media containers (like ISOBMFF), compressed using audio codecs (with specific handling for the spherical harmonic channels), and described in media presentation descriptions. A key aspect is the support for dynamic rendering: the SBA bitstream, containing the sound field coefficients, is delivered to the client device. The device's audio renderer then applies a set of decoding matrices, potentially tailored to the user's current head orientation (tracked via head-mounted displays) and output setup (headphones or speakers), to binauralize or decode the audio for immersive playback. Because a head rotation corresponds to a simple rotation of the sound field coefficients, this gives three-degrees-of-freedom (3DoF) audio in which the scene stays fixed as the listener turns. A single-point Ambisonics representation does not by itself let the listener translate through the scene, so six degrees of freedom (6DoF) requires additional information or processing.
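The head-tracked rendering step described above can be sketched for first-order content as a yaw rotation followed by a simple loudspeaker decode. This is a minimal sketch under stated assumptions, not the renderer defined by any 3GPP specification: it assumes traditional B-format weighting (W scaled by 1/sqrt(2)), handles yaw only, and uses a basic "virtual cardioid microphone" decode for a horizontal layout. The names `rotate_yaw`, `decode_virtual_cardioid`, and `SQUARE` are hypothetical.

```python
import math


def rotate_yaw(bformat, head_yaw_deg):
    """Counter-rotate a B-format frame (W, X, Y, Z) for a head yaw.

    When the head turns counterclockwise by head_yaw_deg, the scene is
    rotated by the opposite angle so sources stay fixed in the world.
    """
    w, x, y, z = bformat
    t = math.radians(head_yaw_deg)
    x_r = x * math.cos(t) + y * math.sin(t)
    y_r = -x * math.sin(t) + y * math.cos(t)
    return (w, x_r, y_r, z)           # W and Z are unaffected by yaw


def decode_virtual_cardioid(bformat, speaker_azimuths_deg):
    """Decode to horizontal speakers by pointing a virtual cardioid at each."""
    w, x, y, _z = bformat
    feeds = []
    for az_deg in speaker_azimuths_deg:
        az = math.radians(az_deg)
        feeds.append(0.5 * (math.sqrt(2.0) * w
                            + x * math.cos(az)
                            + y * math.sin(az)))
    return feeds


# Hypothetical square layout: front-left, rear-left, rear-right, front-right.
SQUARE = [45.0, 135.0, -135.0, -45.0]

if __name__ == "__main__":
    source_left = (1.0 / math.sqrt(2.0), 0.0, 1.0, 0.0)  # unit source at 90 deg
    compensated = rotate_yaw(source_left, 90.0)          # listener turns left
    print(decode_virtual_cardioid(compensated, SQUARE))
```

After the listener turns 90 degrees to the left, the source that was on the left appears directly in front, so the two front speakers receive equal and dominant feeds. A binaural renderer would replace the speaker decode with filtering through head-related transfer functions, which this sketch does not attempt.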
3GPP's work on SBA spans multiple technical specifications (TS) covering codecs, file formats, system protocols, and security, with the aim of ensuring interoperability for immersive audio services across different networks and devices. The specifications also address metadata for coordinating SBA with 360-degree video, so that audio and video stay synchronized as the user's viewpoint changes. This makes SBA a foundational technology for delivering interactive media experiences over 5G and later networks.
Purpose & Motivation
Scene-Based Audio (Ambisonics) was standardized by 3GPP to address the growing market for immersive media, particularly driven by virtual and augmented reality. Traditional channel-based audio is tied to fixed speaker configurations and cannot adapt to user head movement or different playback environments. Object-based audio provides flexibility but requires significant metadata and computational power for rendering many objects. SBA was motivated by the need for a format that inherently describes a complete sound scene in a compact, playback-agnostic manner.
The historical context is the rise of 360-degree video and VR content. Early VR experiences often used basic binaural audio or simple multi-channel mixes, which broke immersion when the user turned their head. Ambisonics, a decades-old academic concept, was identified as a suitable solution because it encodes the sound field mathematically. 3GPP's role was to standardize its use in a telecommunications ecosystem, solving the problems of efficient compression for transmission over bandwidth-constrained mobile networks and defining how clients receive and render the audio in sync with video.
SBA addresses key limitations of previous audio formats for immersive applications. Channel-based audio lacks adaptability. Object-based audio can become computationally complex for dense scenes. SBA sits between the two: a scene description that is relatively compact, independent of the output setup, and well suited to head-tracked binaural rendering, which is essential for VR. Its standardization enables content creators to produce a single audio stream that works on any compliant device, from mobile phones with headphones to dedicated VR systems, fostering an interoperable ecosystem for immersive 3GPP media services.
Key Features
- Encodes the sound field using spherical harmonics (B-format)
- Playback-agnostic; supports rendering to headphones, speaker arrays, etc.
- Enables three-degrees-of-freedom (3DoF) audio with head-tracking, by rotating the sound field to follow head orientation
- Integrated with 360-degree video delivery in 3GPP media services
- Supports compression using standardized audio codecs (e.g., with specific channel mapping)
- Defines metadata for synchronization with visual viewpoint and playback configuration
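The channel mapping mentioned in the list above depends on the encoder and decoder agreeing on an order for the spherical harmonic channels. One widely used convention is Ambisonic Channel Number (ACN) ordering, in which the component of degree n and order m gets index n^2 + n + m. Whether a particular 3GPP codec or container uses exactly this ordering is an assumption here, not something stated in this entry. The helper names `acn_index` and `acn_to_degree_order` are hypothetical.

```python
import math


def acn_index(degree, order):
    """Return the ACN channel index for spherical harmonic (degree n, order m)."""
    if abs(order) > degree:
        raise ValueError("order m must satisfy -n <= m <= n")
    return degree * degree + degree + order


def acn_to_degree_order(acn):
    """Invert acn_index: recover (degree n, order m) from a channel index."""
    degree = math.isqrt(acn)
    order = acn - degree * degree - degree
    return (degree, order)


if __name__ == "__main__":
    # First-order channels in ACN order correspond to W, Y, Z, X.
    for name, (n, m) in [("W", (0, 0)), ("Y", (1, -1)), ("Z", (1, 0)), ("X", (1, 1))]:
        print(name, acn_index(n, m))
```

Under this convention the first-order channels are ordered W, Y, Z, X rather than the traditional W, X, Y, Z, which is a common source of swapped axes when content moves between toolchains.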
Evolution Across Releases
Initial standardization of Scene-Based Audio (Ambisonics) within 3GPP. Defined the core framework for representing, transporting, and rendering SBA content. This included specifying the use of first-order Ambisonics (FOA) channels, their identification in media containers, and initial support for integration with VR/360-degree video services.
Defining Specifications
| Specification | Title |
|---|---|
| TS 23.433 | 3GPP TS 23.433 |
| TS 23.501 | 3GPP TS 23.501 |
| TS 23.540 | 3GPP TS 23.540 |
| TS 23.700 | 3GPP TS 23.700 |
| TS 24.229 | 3GPP TS 24.229 |
| TS 26.253 | 3GPP TS 26.253 |
| TS 26.255 | 3GPP TS 26.255 |
| TS 26.258 | 3GPP TS 26.258 |
| TS 26.260 | 3GPP TS 26.260 |
| TS 26.261 | 3GPP TS 26.261 |
| TS 26.501 | 3GPP TS 26.501 |
| TS 26.918 | 3GPP TS 26.918 |
| TS 26.997 | 3GPP TS 26.997 |
| TS 28.541 | 3GPP TS 28.541 |
| TS 29.309 | 3GPP TS 29.309 |
| TS 29.829 | 3GPP TS 29.829 |
| TS 33.117 | 3GPP TR 33.117 |
| TS 33.514 | 3GPP TR 33.514 |
| TS 33.794 | 3GPP TR 33.794 |
| TS 33.835 | 3GPP TR 33.835 |
| TS 33.841 | 3GPP TR 33.841 |
| TS 33.848 | 3GPP TR 33.848 |