IVAS (Immersive Voice and Audio Services) — 3GPP Glossary

A 3GPP-standardized media codec and service framework for delivering immersive, multi-channel spatial audio over mobile networks. It enables realistic soundscapes for voice calls, music, and extended reality (XR) applications by supporting both channel-based and object-based audio. IVAS is a key enabler of next-generation communication services such as augmented reality telephony.

Description

Immersive Voice and Audio Services (IVAS) is a comprehensive 3GPP standard introduced in Release 16 that defines a new media codec and an associated service framework designed to deliver high-quality, immersive audio experiences over 5G and evolved packet core networks. At its core is the IVAS codec, a highly efficient and flexible audio codec capable of encoding not just traditional stereo or mono signals, but also complex spatial audio scenes comprising multiple audio channels (e.g., 5.1, 7.1.4) and discrete audio objects with associated metadata (like position, size, and gain). This allows for the rendering of sound in a three-dimensional space around the listener.
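As an illustration of what "audio objects with associated metadata" can mean in practice, the following minimal sketch models one object with position, size, and gain. The class name, field names, units, and axis convention are assumptions chosen for this example; they do not reflect the IVAS bitstream syntax or any 3GPP-defined data structure.

```python
import math
from dataclasses import dataclass


@dataclass
class AudioObject:
    """One discrete sound source plus the metadata a renderer needs.

    All field names and units are hypothetical, chosen for illustration only.
    """
    name: str
    azimuth_deg: float      # horizontal angle, 0 = straight ahead, positive = to the left
    elevation_deg: float    # vertical angle, 0 = ear level, positive = above
    distance_m: float       # distance from the listener in metres
    size: float = 0.0       # perceived extent, 0 = point source (arbitrary scale)
    gain: float = 1.0       # linear gain applied at render time

    def to_cartesian(self):
        """Return (x, y, z) with x pointing forward, y to the left, z up."""
        az = math.radians(self.azimuth_deg)
        el = math.radians(self.elevation_deg)
        x = self.distance_m * math.cos(el) * math.cos(az)
        y = self.distance_m * math.cos(el) * math.sin(az)
        z = self.distance_m * math.sin(el)
        return (x, y, z)


# A remote talker placed 2 m to the listener's left at ear level.
talker = AudioObject("remote_talker", azimuth_deg=90.0, elevation_deg=0.0, distance_m=2.0)
print(talker.to_cartesian())
```

A renderer would combine metadata of this kind with the decoded audio signal of each object to place it in the three-dimensional scene.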

Architecturally, IVAS integrates into the 3GPP Multimedia Telephony Service for IMS (MTSI) and other media streaming frameworks, and operates within the media plane of the IP Multimedia Subsystem (IMS). Key components include the IVAS encoder, which compresses the immersive audio scene; the IVAS decoder, which reconstructs it; and the IVAS renderer, which uses head-related transfer functions (HRTFs) and information about the playback system to spatialize the audio for the listener's specific setup (headphones or speaker arrays). The service framework, detailed in specifications such as TS 26.114 and TS 26.119, defines session negotiation procedures that use the Session Description Protocol (SDP) to establish IVAS-capable media sessions, including support for dynamic switching between codec modes based on network conditions.
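The negotiation step can be pictured as a simple codec match against the `a=rtpmap` lines of an SDP offer. The sketch below is a simplification under stated assumptions: the payload type numbers, the `IVAS/16000` encoding name, and the function names are illustrative and are not taken from the specifications, and real MTSI negotiation also involves format parameters that are omitted here.

```python
def offered_codecs(sdp_text):
    """Map RTP payload type -> codec name from the a=rtpmap lines of an SDP body."""
    codecs = {}
    for line in sdp_text.splitlines():
        line = line.strip()
        if line.startswith("a=rtpmap:"):
            payload_type, encoding = line[len("a=rtpmap:"):].split(" ", 1)
            # "IVAS/16000" -> "IVAS"; the clock rate is ignored in this sketch.
            codecs[int(payload_type)] = encoding.split("/")[0].upper()
    return codecs


def select_codec(sdp_offer, local_preference):
    """Return (payload_type, codec) for the most preferred codec found in the offer."""
    offered = offered_codecs(sdp_offer)
    for wanted in local_preference:
        for payload_type, codec in offered.items():
            if codec == wanted:
                return payload_type, codec
    return None  # no codec in common


# Hypothetical offer: payload types and the IVAS rtpmap entry are assumptions.
EXAMPLE_OFFER = """\
v=0
o=- 0 0 IN IP4 192.0.2.1
s=-
c=IN IP4 192.0.2.1
t=0 0
m=audio 49152 RTP/AVP 96 97 98
a=rtpmap:96 IVAS/16000
a=rtpmap:97 EVS/16000
a=rtpmap:98 AMR-WB/16000
"""

# An IVAS-capable endpoint picks IVAS; a legacy endpoint falls back to EVS.
print(select_codec(EXAMPLE_OFFER, ["IVAS", "EVS", "AMR-WB"]))
print(select_codec(EXAMPLE_OFFER, ["EVS", "AMR-WB"]))
```

Listing a legacy codec alongside the immersive one lets an endpoint that does not support IVAS still complete the session.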

How it works: During a call or streaming session, endpoints first negotiate IVAS support. The capturing device (e.g., a 360-degree microphone array or an XR headset) captures a spatial audio scene. The IVAS encoder compresses this scene, efficiently representing ambient channels and moving audio objects. The resulting bitstream is packetized and transmitted over the 5G network, where real-time applications can benefit from ultra-reliable low-latency communication (URLLC). The receiving device's IVAS decoder reconstructs the scene, and the renderer adapts it in real time to the listener's head orientation (using head-tracking data), so that the sound field stays fixed in the environment as the head turns and creates a compelling sense of presence. In this way, IVAS serves as the enabling audio technology for telepresence, social XR, and immersive entertainment.
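The head-tracking step comes down to counter-rotating the scene: if the listener turns the head by some yaw angle, each source must be rendered at its world angle minus that yaw so it appears to stay put. The sketch below shows this for the horizontal plane only. The function names and the sign convention (positive angles to the left) are assumptions for this example, and an actual renderer would handle full three-dimensional rotation and apply HRTFs afterwards.

```python
def wrap_degrees(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def head_relative_azimuth(source_azimuth_deg, head_yaw_deg):
    """Azimuth of a world-fixed source as seen from the rotated head.

    Both angles are measured in the same world frame, positive to the left.
    """
    return wrap_degrees(source_azimuth_deg - head_yaw_deg)


# A talker fixed at 30 degrees to the left in the room.
source_azimuth = 30.0
for yaw in (0.0, 30.0, 90.0):
    relative = head_relative_azimuth(source_azimuth, yaw)
    print(f"head yaw {yaw:6.1f} -> render source at {relative:6.1f}")
```

When the listener turns to face the talker (yaw equal to the source azimuth), the relative angle becomes zero and the voice is rendered straight ahead.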

Purpose & Motivation

IVAS was created to address the limitations of traditional voice and audio codecs (such as AMR and EVS) in the emerging era of extended reality (XR), telepresence, and immersive media. Legacy codecs were designed for mono or stereo playback and cannot convey the spatial cues needed for realistic virtual environments or for group communication, where understanding who is speaking and from where is critical. The motivation was to define a single, efficient standard for all immersive audio use cases and so avoid fragmentation.

The historical context is the evolution of 5G, which promises enhanced mobile broadband (eMBB), massive IoT, and URLLC. While 5G provides the pipe, IVAS provides the next-generation audio content that justifies the need for high bandwidth and low latency. It solves the problem of delivering cinema-quality, object-based audio over wireless networks for applications like multi-player VR gaming, remote collaboration in virtual spaces, and immersive live music streaming. Prior approaches required proprietary codecs or bulky, uncompressed multi-channel audio, which were inefficient and not interoperable.

Furthermore, IVAS enables new service paradigms such as "Augmented Reality Telephony," in which remote participants are represented as spatial audio objects in the user's environment. It addresses the need for a codec that offers both high quality for music and low bitrates for conversational speech, with seamless switching between modes. Its creation was motivated by convergence among the telecom, broadcasting, and consumer electronics industries on a universal immersive audio standard for 5G.

Key Features

Spatial Audio Encoding: Supports encoding of channel-based audio (up to 22.2), object-based audio, and mixed scenes with metadata.
High Efficiency and Scalability: Delivers high audio quality at bitrates from 32 kbps for speech to over 512 kbps for rich music scenes, with scalable complexity.
Dynamic Mode Switching: Allows seamless switching between dedicated speech and general audio modes within an ongoing session for optimal quality.
Low Latency Operation: Designed for real-time conversational services with end-to-end latency targets suitable for XR applications.
Head-Tracked Rendering: Integrates with head-tracking data to render binaural audio that adapts to listener head movement, preserving sound field stability.
Standardized IMS Integration: Defined as a media codec within 3GPP's MTSI and streaming services, ensuring interoperability across networks and devices.
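
The scalability and mode-switching features above can be pictured as picking an operating point from a bitrate ladder as the available bandwidth changes. In the sketch below, only the 32 kbps and 512 kbps endpoints come from the figures quoted in this entry; the intermediate steps, the function name, and the selection policy are hypothetical and do not reproduce the codec's actual rate set or adaptation rules.

```python
import bisect

# Hypothetical ladder in kbps, sorted ascending. Only the endpoints are taken
# from the text above; the intermediate values are made up for illustration.
BITRATE_LADDER_KBPS = [32, 48, 64, 96, 128, 256, 512]


def select_bitrate(available_kbps, ladder=BITRATE_LADDER_KBPS):
    """Return the highest ladder entry not exceeding available_kbps.

    If even the lowest entry does not fit, return the lowest entry, since the
    session still needs some operating point.
    """
    index = bisect.bisect_right(ladder, available_kbps)
    return ladder[max(index - 1, 0)]


# Simulated bandwidth estimates over the course of a session.
for estimate in (600, 100, 40, 20):
    print(f"estimated {estimate:4d} kbps -> encode at {select_bitrate(estimate)} kbps")
```

A real endpoint would also smooth its bandwidth estimates and avoid switching too often, which this sketch leaves out.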

Evolution Across Releases

Rel-16 Initial

IVAS was initially standardized in this release as a codec and service framework. Release 16 defined the core codec tools, the service architecture for integration with MTSI, and the initial set of profiles and levels. It established the fundamental capabilities for encoding spatial audio objects and channels, as well as the session negotiation procedures for immersive communication over IMS.

Defining Specifications

Specification	Title
TS 23.333	3GPP TS 23.333
TS 23.334	3GPP TS 23.334
TS 26.114	3GPP TS 26.114
TS 26.119	3GPP TS 26.119
TS 26.244	3GPP TS 26.244
TS 26.249	3GPP TS 26.249
TS 26.250	3GPP TS 26.250
TS 26.251	3GPP TS 26.251
TS 26.252	3GPP TS 26.252
TS 26.254	3GPP TS 26.254
TS 26.255	3GPP TS 26.255
TS 26.256	3GPP TS 26.256
TS 26.258	3GPP TS 26.258
TS 26.260	3GPP TS 26.260
TS 26.261	3GPP TS 26.261
TS 26.511	3GPP TS 26.511
TS 26.865	3GPP TS 26.865
TS 26.926	3GPP TS 26.926
TS 26.928	3GPP TS 26.928
TS 26.933	3GPP TS 26.933
TS 26.996	3GPP TS 26.996
TS 26.997	3GPP TS 26.997
TS 26.998	3GPP TS 26.998
TS 29.162	3GPP TS 29.162
TS 29.163	3GPP TS 29.163
TS 29.232	3GPP TS 29.232
TS 29.238	3GPP TS 29.238
TS 29.292	3GPP TS 29.292
TS 29.332	3GPP TS 29.332
TS 29.333	3GPP TS 29.333
TS 29.334	3GPP TS 29.334