ISAR

Immersive Audio for Split Rendering Scenarios

Services
Introduced in Rel-18
ISAR is a 3GPP media codec and delivery framework for immersive audio experiences in Extended Reality (XR) applications using split rendering. It enables high-quality, low-latency spatial audio streaming where audio rendering is split between the device and network.

Description

Immersive Audio for Split Rendering Scenarios (ISAR) is a 3GPP media codec and system specification designed to deliver high-quality, object-based spatial audio for Extended Reality (XR) applications, particularly in network architectures where rendering is split between a user device (e.g., an XR headset) and a network edge server. ISAR addresses the unique challenges of streaming immersive audio, which requires six-degrees-of-freedom (6DoF) rendering, low end-to-end latency, and high compression efficiency to conserve bandwidth.

The architecture typically involves an XR application server in the network (e.g., at the edge) that generates or processes the raw audio scene, which contains multiple audio objects with metadata describing their positions, orientations, and acoustic properties. The ISAR encoder compresses this audio scene. A key innovation is the split of the rendering pipeline: part of the rendering (e.g., early reflections, basic binauralization) can be performed on the server, while the final stage (e.g., late reverberation, personalized head-related transfer function (HRTF) application, and compensation for last-moment head movements) is performed on the user equipment (UE). This split reduces the data rate that needs to be transmitted compared to sending fully rendered binaural audio, while also offloading complex processing from the potentially resource-constrained UE.

The ISAR stream, containing encoded audio objects and rendering metadata, is delivered over the 5G network. The UE's ISAR decoder and renderer then complete the audio rendering based on the latest sensor data (head position) to create a precise, personalized spatial audio experience. The specifications (TS 26.249, 26.251, etc.) define the codec formats, metadata schemas, APIs, and system interfaces that enable this interoperable, low-latency immersive audio service.
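The final UE-side stage described above, compensating for last-moment head movements, can be sketched as follows. This is an illustrative example only: it assumes the pre-rendered signal reaches the UE as first-order ambisonics (FOA, ACN channel order W, Y, Z, X) and applies a simple yaw rotation; the normative ISAR renderer and signal formats are defined in the 3GPP specifications.

```python
# Illustrative sketch (not the normative ISAR renderer): rotating a
# first-order ambisonics frame on the UE to compensate a late head yaw.
# Channel layout assumed here: ACN order W (omni), Y, Z (vertical), X.
import numpy as np

def rotate_foa_yaw(frame: np.ndarray, yaw_rad: float) -> np.ndarray:
    """Rotate an FOA frame (shape: 4 x samples) by a head yaw angle.

    A source at azimuth phi encodes X = cos(phi), Y = sin(phi); after the
    listener turns by yaw psi, the source appears at azimuth phi - psi.
    W (omni) and Z (vertical) are unaffected by a pure yaw rotation.
    """
    w, y, z, x = frame                       # unpack ACN channels
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    return np.stack([w, y * c - x * s, z, x * c + y * s])
```

Because the rotation is a cheap per-frame matrix operation, it can run at the very end of the UE pipeline on the freshest head-tracking sample, which is exactly the latency benefit the split is meant to provide.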

Purpose & Motivation

ISAR was created to solve the audio delivery challenges of truly immersive, interactive XR experiences over mobile networks. Traditional audio codecs (such as MPEG-H 3D Audio or Dolby Atmos) are designed for cinematic or broadcast scenarios with fixed playback environments and higher latency tolerance. For interactive XR, where a user can move their head and body in real time, audio must be rendered dynamically with ultra-low latency (under 20 ms) to match the visual scene and prevent motion sickness. Transmitting fully rendered binaural audio for every possible head position is prohibitively bandwidth-intensive.

ISAR's purpose is to enable efficient streaming by adopting a split-rendering model, aligned with the overall XR split rendering paradigm studied in 3GPP. This model leverages the compute resources of the 5G network edge for heavy audio processing while keeping final, user-specific rendering on the device. It addresses the limitations of previous approaches: either high bandwidth consumption (sending pre-rendered audio) or high device compute load (rendering everything locally from raw objects, which may not be feasible on lightweight XR glasses). By standardizing ISAR, 3GPP aims to ensure interoperability between XR application providers, network operators, and device manufacturers, fostering an ecosystem for high-quality cloud/edge-rendered XR services over 5G and beyond.
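The bandwidth argument above can be made concrete with back-of-the-envelope arithmetic. All figures here are illustrative assumptions, not ISAR requirements: uncompressed mono objects scale linearly with the object count, whereas a coded spatial-audio stream targets a roughly fixed bitrate budget.

```python
# Illustrative arithmetic only: aggregate bitrate of sending raw audio
# objects uncompressed, versus a coded stream. Figures are assumptions
# for illustration, not values from the ISAR specifications.

def pcm_bitrate_kbps(num_objects: int, sample_rate_hz: int = 48_000,
                     bits_per_sample: int = 16) -> float:
    """Aggregate bitrate of uncompressed mono PCM objects, in kbit/s."""
    return num_objects * sample_rate_hz * bits_per_sample / 1000

# e.g. 16 raw mono objects at 48 kHz / 16-bit:
#   16 * 48000 * 16 / 1000 = 12288 kbit/s (about 12.3 Mbit/s),
# versus a coded spatial-audio stream of a few hundred kbit/s.
```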

Key Features

  • Object-based spatial audio codec optimized for low-latency, interactive XR applications
  • Split rendering architecture dividing audio processing between network edge and user device
  • Support for six-degrees-of-freedom (6DoF) audio with dynamic update of audio object metadata
  • Efficient compression of audio scenes and associated spatial metadata to reduce bandwidth
  • Defined system interfaces and APIs for integration between XR application servers, media functions, and UE
  • Personalized audio rendering on the UE using device-specific parameters like HRTF
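The object metadata mentioned in the features above can be pictured with a small sketch. The field names and JSON shape here are assumptions for illustration only; the normative metadata schema is defined in the ISAR specifications.

```python
# Illustrative sketch of per-object scene metadata (position, orientation,
# gain) carried alongside the audio. Field names are assumptions, not the
# normative ISAR metadata schema.
import json
from dataclasses import dataclass, asdict

@dataclass
class AudioObjectMetadata:
    object_id: int
    position: tuple     # (x, y, z) in metres, scene coordinates
    yaw_deg: float      # source orientation, relevant for directive sources
    gain_db: float

def to_update_message(objects: list) -> str:
    """Serialize a metadata update for transmission with the audio frames."""
    return json.dumps({"objects": [asdict(o) for o in objects]})
```

Dynamic 6DoF scenes would resend such updates whenever objects move, which is why the metadata itself is compressed together with the audio essence.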

Evolution Across Releases

Rel-18 Initial

Initial specification of the ISAR framework within the 3GPP SA4 working group. This included the definition of the core codec, the split rendering model, metadata formats, and the end-to-end system architecture for delivering immersive audio in XR split rendering scenarios. Key specs like TS 26.249 (codec) and TS 26.251 (system) were created.

Rel-19 Enhancements

Enhancements and refinements to the ISAR specifications based on implementation feedback and evolving XR requirements. This may include performance optimizations, support for new audio object types, improved compression efficiency, and tighter integration with other 3GPP XR and media streaming features like dynamic adaptive streaming.

Defining Specifications

Specification	Title
TS 26.249 3GPP TS 26.249
TS 26.251 3GPP TS 26.251
TS 26.252 3GPP TS 26.252
TS 26.258 3GPP TS 26.258
TS 26.260 3GPP TS 26.260
TS 26.996 3GPP TS 26.996
TS 26.997 3GPP TS 26.997