Description
The Spatially Oriented Format for Acoustics (SOFA) is a technical specification developed by 3GPP (primarily in TS 26.118) for the carriage of spatial audio data. Unlike traditional channel-based (e.g., stereo, 5.1) or scene-based (e.g., Ambisonics) audio, SOFA is an object-based audio format. It encapsulates individual audio objects, each comprising an audio essence (the sound signal) and associated spatial metadata that describes the object's position, orientation, and other acoustic properties in a three-dimensional coordinate system. This separation of audio essence from rendering metadata is a key architectural principle.
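The separation of essence from metadata described above can be pictured as a simple data model. The following Python sketch is purely illustrative: the field names (`essence`, `position`, `orientation`, `gain_db`) are invented for this example and are not the syntax defined by the specification.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AudioObject:
    """Illustrative model of one audio object: essence plus spatial metadata.

    Field names are hypothetical; TS 26.118 defines its own syntax.
    """
    essence: List[float]                      # mono PCM samples (the sound signal)
    position: Tuple[float, float, float]      # x, y, z in metres
    orientation: Tuple[float, float, float]   # yaw, pitch, roll in radians
    gain_db: float = 0.0                      # one example of an acoustic property

# 10 ms of silence at 48 kHz, placed 1 m in front of the origin.
obj = AudioObject(essence=[0.0] * 480,
                  position=(1.0, 0.0, 0.0),
                  orientation=(0.0, 0.0, 0.0))
```

Because the renderer consumes the metadata rather than a fixed channel layout, the same object can later be synthesized for headphones or for a speaker array.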
SOFA operation spans three phases: content creation, delivery, and playback. During creation, audio objects are produced together with their spatial trajectories, and the SOFA format packages them into a structured file or streaming format; the specification defines the syntax for storing the audio data (e.g., PCM) and the XML-based metadata. During delivery over a network, the content can be adapted or partially delivered based on network conditions or device capabilities. At the playback device, a SOFA renderer receives the audio objects and metadata. Its core function is to synthesize the final binaural or multichannel output based on the *current* listener position and orientation, which is typically provided in real time by head-tracking sensors in AR/VR headsets. This keeps the soundscape stable relative to the virtual world as the user moves their head.
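The world-stable rendering step can be sketched with a small piece of coordinate math: applying the inverse of the tracked head rotation to each object's world position yields a listener-relative position for the binaural stage. This is a minimal yaw-only sketch, not the renderer defined by the specification.

```python
import math

def world_to_listener(obj_pos, listener_yaw_rad):
    """Rotate a world-space object position into the listener's head frame.

    Applying the inverse of the head rotation keeps the rendered source
    fixed in the virtual world as the listener turns (yaw-only sketch,
    rotation about the vertical z axis; +y is the listener's left).
    """
    x, y, z = obj_pos
    c, s = math.cos(-listener_yaw_rad), math.sin(-listener_yaw_rad)
    return (c * x - s * y, s * x + c * y, z)

# A source 1 m directly ahead; the listener turns 90 degrees to the left,
# so the source should now appear 90 degrees to the listener's right.
rel = world_to_listener((1.0, 0.0, 0.0), math.pi / 2)
# rel is approximately (0.0, -1.0, 0.0)
```

A full renderer would extend this to pitch and roll (e.g., with quaternions) and feed the relative direction into HRTF filtering, but the principle of compensating head motion is the same.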
SOFA's role in the 3GPP ecosystem is to provide a standardized, interoperable format for immersive audio services within media-rich applications like VR telephony, AR remote assistance, and 360-degree video. It is part of 3GPP's broader Media Streaming (4G/5G Media) work item. The specification details profiles and levels to ensure interoperability, defining subsets of features for different complexity devices. It also specifies how SOFA content can be delivered using Dynamic Adaptive Streaming over HTTP (DASH), aligning it with 3GPP's media delivery framework.
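The DASH-based delivery mentioned above rests on rate adaptation: the client picks, from the representations declared in the manifest, the best one that fits the measured link. The sketch below shows that generic selection logic; the representation ids and bitrates are made up for illustration and are not values from the specification.

```python
def pick_representation(reps, available_bps):
    """Pick the highest-bandwidth representation that fits the link.

    `reps` maps representation id -> declared bandwidth in bits/s, as a
    DASH client would read them from the MPD (ids here are hypothetical).
    """
    fitting = {rid: bw for rid, bw in reps.items() if bw <= available_bps}
    if not fitting:
        # Nothing fits: fall back to the lowest-rate representation.
        return min(reps, key=reps.get)
    return max(fitting, key=fitting.get)

reps = {"audio-low": 64_000, "audio-mid": 128_000, "audio-high": 256_000}
choice = pick_representation(reps, 150_000)  # -> "audio-mid"
```

For object-based audio this same mechanism can also drop or simplify individual objects rather than only switching bitrates, which is one reason the format suits adaptive streaming.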
Purpose & Motivation
SOFA was created to address the lack of a standardized, efficient format for spatial audio in mobile and streaming contexts, particularly for emerging AR/VR applications. Previous approaches to 3D audio, such as higher-order Ambisonics or multi-channel audio, were computationally heavy, not dynamically adaptable to listener movement, or tied to a fixed speaker setup, making them unsuitable for personalized, head-tracked experiences on mobile devices. The underlying problem was delivering compelling, immersive audio, which is as critical to a sense of presence in virtual environments as visual immersion.
The motivation for its standardization within 3GPP stemmed from the industry's move towards 5G-enabled immersive media. 5G's high bandwidth and low latency are enablers for rich media streaming, but without a standard audio format, fragmentation would occur. SOFA solves this by providing an object-based format that is bandwidth-efficient (only necessary objects are transmitted) and rendering-agnostic (the same content can be rendered optimally on different output devices, from headphones to speaker arrays). It addresses the limitations of channel-based audio, which is tied to a specific playback configuration, and provides more flexibility and interactivity than static binaural recordings. Its creation was driven by the need to ensure interoperability across content creators, network streaming services, and end-user devices in the 5G media landscape.
Key Features
- Object-based spatial audio format separating audio essence from spatial metadata.
- XML-based metadata defining 3D position, orientation, and acoustic properties of audio objects.
- Support for dynamic, real-time adaptation of soundscape based on listener head tracking.
- Designed for efficient streaming and adaptation using DASH (Dynamic Adaptive Streaming over HTTP).
- Defines interoperability profiles and levels for different device capabilities.
- Enables immersive audio experiences for AR, VR, and 360-degree video services.
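To make the XML-metadata feature above concrete, the snippet below parses a hypothetical per-object metadata fragment with Python's standard library. The element and attribute names (`audioObject`, `position`, `gain`, etc.) are invented for this sketch and are not the schema defined in TS 26.118.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata fragment; element names are illustrative only.
METADATA = """
<audioObject id="obj1">
  <position x="1.0" y="0.5" z="0.0"/>
  <orientation yaw="0.0" pitch="0.0" roll="0.0"/>
  <gain db="-3.0"/>
</audioObject>
"""

root = ET.fromstring(METADATA)
pos = root.find("position")
position = tuple(float(pos.get(axis)) for axis in ("x", "y", "z"))
gain_db = float(root.find("gain").get("db"))
# position == (1.0, 0.5, 0.0), gain_db == -3.0
```

A renderer would read fragments like this for each object and combine them with the tracked listener pose when synthesizing output.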
Evolution Across Releases
SOFA was initially standardized in 3GPP Release 15 as part of the 5G media streaming enhancements. The specification TS 26.118 defined the core file format structure, metadata schema based on XML, and the conceptual model for object-based audio rendering. It established the foundation for streaming spatial audio over mobile networks, with initial integration profiles for DASH.
Defining Specifications
| Specification | Title |
|---|---|
| TS 26.118 | Virtual Reality (VR) profiles for streaming applications |