Audio Enablers

Specification: 26818

Rel-15

Summary

This document defines audio operation points, parameters, and media profiles for 3GPP. It focuses on interoperability with media decoders and defines requirements for audio decoders and renderers.


Full Document vf00

3GPP TSG-SA WG4 Meeting #99 S4-180977

Rome, Italy, 9-13 July 2018 (Revision of S4-180966)

CR-Form-v11.2

PSEUDO CHANGE REQUEST

26.118   CR   CRNum   rev   3   Current version:   1.0.0

For HELP on using this form: comprehensive instructions can be found at
http://www.3gpp.org/Change-Requests.

Proposed change affects:   UICC apps [ ]   ME [X]   Radio Access Network [ ]   Core Network [ ]

Title: OMAF 3D Audio Baseline Media Profile for VRStream

Source to WG: Fraunhofer IIS, Qualcomm Incorporated, Orange, Deutsche Telekom AG, VoiceAge Corporation, Ericsson LM, WILUS Inc., Philips International B.V., Huawei Technologies Co. Ltd.

Source to TSG: S4

Work item code: VRStream

Date: 2018-07-03

Category: B

Release: Rel-15

Use one of the following categories:
  F (correction)
  A (mirror corresponding to a change in an earlier release)
  B (addition of feature)
  C (functional modification of feature)
  D (editorial modification)
Detailed explanations of the above categories can be found in 3GPP TR 21.900.

Use one of the following releases:
  Rel-8 (Release 8)
  Rel-9 (Release 9)
  Rel-10 (Release 10)
  Rel-11 (Release 11)
  Rel-12 (Release 12)
  Rel-13 (Release 13)
  Rel-14 (Release 14)
  Rel-15 (Release 15)
  Rel-16 (Release 16)

Reason for change: Audio media profiles are required for audiovisual VR streams, and industry alignment is needed between the 3GPP VRStream and MPEG OMAF specifications.

Summary of change: Adds the MPEG OMAF 3D Audio Baseline Profile together with an additional specification of an external binaural renderer.

Consequences if not approved: Lack of audio for VR streams, resulting in silent streams, and missing industry alignment between MPEG, VR-IF and 3GPP.

Clauses affected:

Other specs affected (show related CRs):
  Other core specifications:  Y  - TR 26.918
  Test specifications:  N
  O&M Specifications:  N

Other comments:

*** Start change 1 ***

2 References

[X1] ISO/IEC FDIS 23090-2: "Information technology -- Coded representation of immersive media -- Part 2: Omnidirectional media format"

[X2] ISO/IEC 23008-3:2015: "Information technology -- High efficiency coding and media delivery in heterogeneous environments -- Part 3: 3D audio", ISO/IEC 23008-3:2015/Amd 2:2016: "MPEG-H 3D Audio File Format Support", ISO/IEC 23008-3:2015/Amd 3:2017: "MPEG-H 3D Audio Phase 2", ISO/IEC 23008-3:2015/DAmd 5: "Audio metadata enhancements".

[X3] IETF RFC 6381: "The 'Codecs' and 'Profiles' Parameters for "Bucket" Media Types", R. Gellens, D. Singer, P. Frojdh, August 2011.

[X4] AES69-2015: "AES standard for file exchange - Spatial acoustic data file format", Audio Engineering Society, 2015.

*** Start change 2 ***

6 Audio Enablers

6.1 Audio Operation Points

6.1.1 Definition of Operation Point

For the purpose of defining interfaces to a conforming audio decoder, audio operation points are defined. In this context the following definitions hold:

- Operation Point: A collection of discrete combinations of content formats, VR-specific rendering metadata, etc., and the encoding format.

- Receiver: A receiver that can decode and render any bitstream that is conforming to a certain Operation Point.

- Bitstream: An audio bitstream that conforms to an audio format.

Figure 6.1: Audio Operation Points

This clause focuses on the interoperability point to a media decoder as indicated in Figure 5.1. This clause does not deal with the access engine and the file parser, which address aspects of how the audio bitstream is delivered.

In all audio operation points, the VR Presentation can be rendered using a single media decoder or multiple media decoders, which provide decoded PCM signals and rendering metadata to the audio renderer.

6.1.2 Parameters of Audio Operation Point

This clause defines the potential parameters of Audio Operation Points. This includes the detailed audio decoder requirements and audio rendering metadata. The requirements are defined from the perspective of the audio decoder and renderer.

Parameters for an Audio Operation Point include:

- the audio decoder that the bitstream needs to conform to

- the permitted rendering data to be included in the audio bitstream

6.1.3 Summary of Audio Operation Points

Table 1 provides an informative overview of the Audio Operation Points. The detailed, normative specification for each Audio Operation Point is subsequently provided in the referenced clause.

Table 1 - Overview of Audio Operation Points (informative)

Operation Point                     Codec          Profile          Level       Max Sampling Rate   Clause
3GPP MPEG-H Audio Operation Point   MPEG-H Audio   Low Complexity   1, 2 or 3   48 kHz              6.1.4

6.1.4 3GPP MPEG-H Audio Operation Point

6.1.4.1 Overview

The 3GPP MPEG-H Audio Operation Point fulfills the requirements to support 3D audio and is specified in ISO/IEC 23090-2, clause 10.2.2 [X1]. Channels, Objects and First/Higher-Order Ambisonics (FOA/HOA) are supported, as well as combinations of those. The Operation Point is based on MPEG-H 3D Audio [X2].

A bitstream conforming to the 3GPP MPEG-H Audio Operation Point shall conform to the requirements of clause 6.1.4.2.

A receiver conforming to the 3GPP MPEG-H Audio Operation Point shall support decoding and rendering a Bitstream conforming to the 3GPP MPEG-H Audio Operation Point. Detailed receiver requirements are provided in clause 6.1.4.3.

6.1.4.2 Bitstream requirements

The audio stream shall comply with the MPEG-H 3D Audio Low Complexity (LC) Profile, Levels 1, 2 or 3 as defined in ISO/IEC 23008-3, clause 4.8 [X2]. The values of the mpegh3daProfileLevelIndication for LC Profile Levels 1, 2 and 3 are "0x0B", "0x0C" and "0x0D", respectively, as specified in ISO/IEC 23008-3, clause 5.3.2 [X2].

Audio encapsulation shall be done according to ISO/IEC 23090-2, clause 10.2.2.2 [X1].

All Low Complexity Profile and Levels restrictions specified in ISO/IEC 23008-3, clause 4.8.2 [X2] shall apply. The constraints on input and output configurations are provided in Table 3, "Levels and their corresponding restrictions for the Low Complexity Profile", of ISO/IEC 23008-3 [X2]. This includes the following for Low Complexity Profile Level 3:

  • Maximum number of core coded channels (in compressed data stream): 32,
  • Maximum number of decoder processed core channels: 16,
  • Maximum number of loudspeaker output channels: 12
  • Maximum number of decoded objects: 16
  • Maximum HOA order: 6
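
The following is an informative sketch (not part of the specification; the configuration structure and field names are illustrative) of how a receiver implementation might check a parsed stream configuration against the Level 3 limits listed above:

```python
# Informative sketch: field names are illustrative, not taken from ISO/IEC 23008-3.
from dataclasses import dataclass

@dataclass
class StreamConfig:
    core_coded_channels: int          # channels in the compressed data stream
    decoder_processed_channels: int   # decoder processed core channels
    loudspeaker_output_channels: int
    decoded_objects: int
    hoa_order: int

LC_LEVEL3_LIMITS = StreamConfig(
    core_coded_channels=32,
    decoder_processed_channels=16,
    loudspeaker_output_channels=12,
    decoded_objects=16,
    hoa_order=6,
)

def within_lc_level3(cfg: StreamConfig) -> bool:
    """Return True if every parameter stays within the LC Profile Level 3 limits."""
    return all(getattr(cfg, f) <= getattr(LC_LEVEL3_LIMITS, f)
               for f in cfg.__dataclass_fields__)
```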

MPEG-H Audio sync samples contain Immediate Playout Frames (IPFs), as specified in ISO/IEC 23008-3, clause 20.2 [X2], and shall follow the requirements specified in ISO/IEC 23090-2, clause 10.2.2.3.1 [X1].

6.1.4.3 Receiver requirements

A receiver supporting the 3GPP MPEG-H Audio Operation Point shall fulfill all requirements specified in this clause.

6.1.4.3.1 Decoding process

The receiver shall be capable of decoding MPEG-H Audio LC Profile Level 1, Level 2 and Level 3 bitstreams as specified in ISO/IEC 23008-3, clause 4.8 [X2] with the following relaxations:

  • The Immersive Renderer defined in ISO/IEC 23008-3, clause 11 [X2] is optional.
  • The carriage of generic data defined in ISO/IEC 23008-3, clause 14.7 [X2] is optional; thus MHAS packets of type PACTYP_GENDATA are optional and the decoder may ignore packets of this type.

The decoder shall read and process MHAS packets of the following types in accordance with ISO/IEC 23008-3, clause 14 [X2]: PACTYP_SYNC, PACTYP_MPEGH3DACFG, PACTYP_AUDIOSCENEINFO, PACTYP_AUDIOTRUNCATION, PACTYP_MPEGH3DAFRAME, PACTYP_USERINTERACTION, PACTYP_LOUDNESS_DRC, PACTYP_EARCON, PACTYP_PCMCONFIG and PACTYP_PCMDATA.

The decoder may read and process MHAS packets of the following types: PACTYP_SYNCGAP, PACTYP_BUFFERINFO, PACTYP_MARKER and PACTYP_DESCRIPTOR.

Other MHAS packets may be present in an MHAS elementary stream and may be ignored.
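
As an informative illustration of the packet handling rules above (the decoder object and its process/supports methods are hypothetical; the packet-type names are those of ISO/IEC 23008-3):

```python
# Informative sketch: packet-type names follow ISO/IEC 23008-3; the decoder
# interface (process/supports) is hypothetical.
REQUIRED_PACKET_TYPES = {
    "PACTYP_SYNC", "PACTYP_MPEGH3DACFG", "PACTYP_AUDIOSCENEINFO",
    "PACTYP_AUDIOTRUNCATION", "PACTYP_MPEGH3DAFRAME", "PACTYP_USERINTERACTION",
    "PACTYP_LOUDNESS_DRC", "PACTYP_EARCON", "PACTYP_PCMCONFIG", "PACTYP_PCMDATA",
}
OPTIONAL_PACKET_TYPES = {
    "PACTYP_SYNCGAP", "PACTYP_BUFFERINFO", "PACTYP_MARKER", "PACTYP_DESCRIPTOR",
}

def handle_mhas_packet(decoder, packet_type: str, payload: bytes) -> None:
    if packet_type in REQUIRED_PACKET_TYPES:
        decoder.process(packet_type, payload)   # shall be read and processed
    elif packet_type in OPTIONAL_PACKET_TYPES and decoder.supports(packet_type):
        decoder.process(packet_type, payload)   # may be read and processed
    # any other packet type may be ignored
```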

The Earcon metadata shall be processed and applied as described in ISO/IEC 23008-3, clause 28 [X2].

6.1.4.3.2 Tune-in

At tune-in into a live stream, the audio decoder is able to start decoding a new audio stream at every random access point (RAP). As defined in clause 6.1.4.2, the sync sample (RAP) contains the configuration information (PACTYP_MPEGH3DACFG and PACTYP_AUDIOSCENEINFO) that is used to initialize the audio decoder. After initialization, the audio decoder reads encoded audio frames (PACTYP_MPEGH3DAFRAME) and decodes them.

To optimize the start-up delay at tune-in, the information from the MHAS PACTYP_BUFFERINFO packet should be taken into account. The input buffer should be filled at least to the state indicated in the MHAS PACTYP_BUFFERINFO packet before decoding of audio frames starts.

NOTE: It may be necessary to feed several audio frames into the decoder before the first decoded PCM output buffer is available, as described in ISO/IEC 23008-3, clauses 5.5.6.3 and 22 [X2].

It is recommended that, on tune-in, the receiving device performs a 100 ms fade-in on the first PCM output buffer that it receives from the audio decoder.
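
A minimal sketch of the recommended fade-in, assuming float PCM arranged as (samples, channels) and a linear ramp (the recommendation above only fixes the 100 ms duration, not the ramp shape):

```python
import numpy as np

def fade_in_first_buffer(pcm: np.ndarray, sample_rate: int, fade_ms: float = 100.0) -> np.ndarray:
    """Apply the recommended tune-in fade-in to the first decoded PCM buffer.

    pcm: float array of shape (num_samples, num_channels).
    A linear ramp is used here; the text only recommends a 100 ms fade-in.
    """
    fade_len = min(len(pcm), int(round(sample_rate * fade_ms / 1000.0)))
    out = pcm.copy()
    out[:fade_len] *= np.linspace(0.0, 1.0, fade_len)[:, None]
    return out
```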

6.1.4.3.3 Configuration change

If the decoder receives an MHAS stream that contains a configuration change, the decoder shall perform a configuration change according to ISO/IEC 23008-3, clause 5.5.6 [X2]. The configuration change can, for instance, be detected through the change of the MHASPacketLabel of the packet PACTYP_MPEGH3DACFG compared to the value of the MHASPacketLabel of previous MHAS packets.

If MHAS packets of type PACTYP_AUDIOTRUNCATION are present, they shall be used as described in ISO/IEC 23008‑3, clause 14 [X2].

The Access Unit that contains the configuration change and the last Access Unit before the configuration change may contain a truncation message (PACTYP_AUDIOTRUNCATION) as defined in ISO/IEC 23008-3, clause 14 [X2]. The MHAS packet of type PACTYP_AUDIOTRUNCATION enables synchronization between video and audio elementary streams at program boundaries. When used, sample-accurate splicing and reconfiguration of the audio stream are possible.
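
The label-based detection described above could look as follows (informative sketch; class and method names are illustrative):

```python
# Informative sketch of configuration-change detection via MHASPacketLabel.
class ConfigChangeDetector:
    def __init__(self) -> None:
        self._last_cfg_label = None

    def is_config_change(self, packet_type: str, packet_label: int) -> bool:
        """Return True when a PACTYP_MPEGH3DACFG packet carries a new MHASPacketLabel."""
        if packet_type != "PACTYP_MPEGH3DACFG":
            return False
        changed = (self._last_cfg_label is not None
                   and packet_label != self._last_cfg_label)
        self._last_cfg_label = packet_label
        return changed
```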

6.1.4.3.4 MPEG-H Multi-stream Audio

The VRStream Client shall be capable of simultaneously receiving at least 3 MHAS streams. The MHAS streams can be simultaneously decoded or combined into a single stream prior to the decoder, by utilizing the field mae_bsMetaDataElementIDoffset in the Audio Scene Information as described in ISO/IEC 23008-3, clause 14.6 [X2].

6.1.4.3.5 Rendering requirements

The 3GPP MPEG-H Audio Operation Point builds on the MPEG-H 3D Audio codec, which includes rendering to loudspeakers, binaural rendering and also provides an interface for external rendering. Legacy binaural rendering using fixed loudspeaker setups can be supported by using loudspeaker feeds as output of the decoder.

6.1.4.3.5.1 Rendering to Loudspeakers

Rendering to loudspeakers shall be done according to ISO/IEC 23008-3 [X2] using the interface for local loudspeaker setup and rendering as defined in ISO/IEC 23008-3, clause 17.3 [X2].

NOTE: ISO/IEC 23008-3 [X2] specifies rendering to predefined loudspeaker setups as well as rendering to arbitrary setups.

6.1.4.3.5.2 Binaural Rendering of MPEG-H 3D Audio

MPEG-H 3D Audio specifies methods for binauralizing the presentation of immersive content for playback via headphones, as is needed for omnidirectional media presentations. MPEG-H 3D Audio specifies a normative interface for the user's viewing orientation and permits low-complexity, low-latency rendering of the audio scene to any user orientation.

The binaural rendering of MPEG-H 3D Audio shall be applied as described in ISO/IEC 23008-3, clause 13 [X2] according to the Low Complexity Profile and Levels restrictions for binaural rendering specified in ISO/IEC 23008-3, clause 4.8.2.2 [X2].

6.1.4.3.5.2.1 Head Tracking Interface

For binaural rendering using head tracking, the useTrackingMode flag in the BinauralRendering() syntax element shall be set to 1, as described in ISO/IEC 23008-3, clause 17.4 [X2]. This flag indicates that a tracker device is connected and that the binaural rendering shall be processed in a special head-tracking mode, using the scene displacement values (yaw, pitch and roll).

The values for the scene displacement data shall be sent using the interface for scene displacement data specified in ISO/IEC 23008-3, clause 17.9 [X2]. The syntax of the mpegh3daSceneDisplacementData() interface provided in ISO/IEC 23008-3, clause 17.9.3 [X2] shall be used.

6.1.4.3.5.2.2 Signaling and processing of diegetic and non-diegetic audio

The metadata flag fixedPosition in SignalGroupInformation() indicates whether the corresponding audio signals are updated during the processing of scene displacement angles. If the flag is equal to one, the positions of the corresponding audio signals are not updated during the processing of scene displacement angles.

Channel groups for which the flag gca_directHeadphone is set to "1" in the mpegh3da_getChannelMetadata() syntax element are routed directly to the left and right output channels and are excluded from binaural rendering using scene displacement data (non-diegetic content). Non-diegetic content may be in stereo or mono format. For mono, the signal is mixed to the left and right headphone channels with a gain factor of 0.707.

6.1.4.3.5.2.3 HRIR/BRIR Interface processing

The interface for binaural room impulse responses (BRIRs) specified in ISO/IEC 23008-3, clause 17.4 [X2] shall be used for external BRIRs and HRIRs. The HRIR/BRIR data for the binaural rendering can be fed to the decoder by using the syntax element BinauralRendering(). The number of BRIR/HRIR pairs in each BRIR/HRIR set shall correspond to the number indicated in the relevant level-dependent row in Table 9 - "The binaural restrictions for the LC profile" of ISO/IEC 23008-3 [X2] according to the Low Complexity Profile and Levels restrictions in ISO/IEC 23008‑3, clause 4.8.2.2 [X2].

The measured BRIR positions are passed to the mpegh3daLocalSetupInformation(), as specified in ISO/IEC 23008-3, clause 4.8.2.2 [X2]. Thus, all renderer stages are set to the target layout that is equal to the transmitted channel configuration. As one BRIR is available per regular input channel, the Format Converter can be passed through in case regular input channel positions are used. Preferably, the BRIR measurement positions for standard target layouts 2.0, 5.1, 10.2 and 7.1.4 should be provided.

6.1.4.3.5.3 Rendering with External Binaural Renderer

MPEG-H 3DA provides the output interfaces for the delivery of un-rendered channels, objects, and HOA content and associated metadata as specified in clause 6.1.4.3.5.4. External binaural renderers can connect to this interface, e.g. for playback of head-tracked audio via headphones. An example of such an external binaural renderer that connects to the external rendering interface of MPEG-H 3DA is specified in Annex X.

6.1.4.3.5.4 External Renderer Interface

ISO/IEC 23008-3, clause 17.10 [X2] specifies the output interfaces for the delivery of un-rendered channels, objects, and HOA content and associated metadata. For connecting to external renderers, a VRStream client shall implement the interfaces for object output, channel output and HOA output as specified in ISO/IEC 23008-3, clause 17.10 [X2], including the additional specification of production metadata defined in ISO/IEC 23008-3, clause 27 [X2]. Any external renderer should apply the metadata provided in this interface and related audio data in the same manner as if MPEG-H internal rendering is applied:

  • Correct handling of loudness-related metadata in particular with the aim of preserving intended target loudness
  • Preserving artistic intent, such as applying transmitted Downmix and HOA Rendering matrices correctly
  • Rendering spatial attributes of objects appropriately (position, spatial extent, etc.)

NOTE: The example external binaural renderer in Annex X only handles a subset of the parameters to illustrate the use of the output interface. Alternative external binaural renderers are expected to apply and handle the metadata provided in this interface and related audio data in the same manner as if internal rendering is applied.

In this interface, the PCM data of the channels and objects interfaces is provided through the decoder PCM buffer, which first contains the regular rendered PCM signals (e.g. 12 signals for a 7.1+4 setup). Subsequently, additional signals carry the PCM data of the originally transmitted channel representation. These are followed by signals carrying the PCM data of the un-rendered output objects. Then additional signals carry the HOA audio PCM data, the number of which is indicated in the HOA metadata interface via the HOA order (e.g. 16 signals for HOA order 3). The HOA audio PCM data in the HOA output interface is provided in the so-called Equivalent Spatial Domain (ESD) representation. The conversion from the HOA domain into the ESD representation and vice versa is described in ISO/IEC 23008-3, Annex C.5.1 [X2].

The metadata for channels, objects, and HOA is available once per frame and their syntax is specified in mpegh3da_getChannelMetadata(), mpegh3da_getObjectAudioAndMetadata(), and mpegh3da_getHoaMetadata() respectively. The metadata and PCM data shall be aligned for an external renderer to match each metadata element with the respective PCM frame.
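
As an informative illustration of this buffer ordering (function and argument names are not from the specification):

```python
# Informative sketch of the decoder PCM buffer layout at the external renderer
# interface: rendered loudspeaker signals, then the transmitted channel
# representation, then un-rendered objects, then (hoa_order + 1)**2 ESD signals.
def pcm_buffer_layout(num_rendered: int, num_channels: int,
                      num_objects: int, hoa_order: int) -> dict:
    num_esd = (hoa_order + 1) ** 2          # e.g. 16 ESD signals for HOA order 3
    layout, start = {}, 0
    for name, count in (("rendered", num_rendered),
                        ("transmitted_channels", num_channels),
                        ("objects", num_objects),
                        ("hoa_esd", num_esd)):
        layout[name] = range(start, start + count)
        start += count
    return layout

# Example: 7.1+4 rendered output and transmitted channels, 4 objects, HOA order 3
print(pcm_buffer_layout(num_rendered=12, num_channels=12, num_objects=4, hoa_order=3))
```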

6.2 Audio Media Profiles

6.2.1 Introduction and Overview

This clause defines the media profiles for audio. Media profiles include specifications of the following:

- Elementary stream constraints based on the audio operation points defined in clause 6.1.

- File format encapsulation constraints and signalling, including capability signalling. This defines a 3GPP VR Track as defined above.

- DASH Adaptation Set constraints and signalling including capability signalling. This defines a DASH content format profile.

Table 6.2-1 provides an overview of the Media Profiles defined in the remainder of clause 6.2.

6.2.2 OMAF 3D Audio Baseline Media Profile

6.2.2.1 Overview

MPEG-H 3D Audio [X2] specifies coding of immersive audio material and the storage of the coded representation in an ISOBMFF track. The MPEG-H 3D Audio decoder has a constant latency, see Table 1, "MPEG-H 3DA functional blocks and internal processing domain", of ISO/IEC 23008-3 [X2]. With this information, content authors can synchronize the audio and video portions of a media presentation, e.g. to ensure lip-sync.

ISO BMFF integration for this profile is provided following the requirements and recommendations in ISO/IEC 23090-2, clause 10.2.2.3 [X1].

6.2.2.2 File Format Signaling and Encapsulation

3GP VR Tracks conforming to this media profile, used in the context of this specification, shall conform to the 3GP File Format [7] with the following further requirements:

- The audio track shall comply with the Bitstream requirements and recommendations for the Operation Point as defined in clause 6.1.

- The sample entry 'mhm1' shall be used for encapsulation of MHAS packets into ISOBMFF files, per ISO/IEC 23008‑3, clause 20.6 [X2].

- All ISO Base Media File Format constraints specified in ISO/IEC 23090-2, clause 10.2.2.3 [X1] shall apply.

- ISO BMFF Tracks shall be encoded following the requirements in ISO/IEC 23090-2, clause 10.2.2.3.1 [X1].

6.2.2.2.1 Configuration change constraints

A configuration change takes place in an audio stream when the content setup or the Audio Scene Information changes (e.g., when changes occur in the channel layout, the number of objects etc.), and therefore new PACTYP_MPEGH3DACFG and PACTYP_AUDIOSCENEINFO packets are required upon such occurrences. A configuration change usually happens at program boundaries, but it may also occur within a program.

Configuration change constraints specified in ISO/IEC 23090-2, clause 10.2.2.3.2 [X1] shall apply.

6.2.2.2.2 Multi-stream constraints

The multi-stream-enabled MPEG‑H Audio System is capable of handling Audio Programme Components delivered in several different elementary streams (e.g., the main MHAS stream containing one complete audio main, and one or more auxiliary MHAS streams, containing different languages and audio descriptions). The MPEG-H Audio Metadata information (MAE) allows the MPEG‑H Audio Decoder to correctly decode several MHAS streams.

The sample entry 'mhm2' shall be used in cases of multi-stream delivery, i.e., the MPEG‑H Audio Scene is split into two or more streams for delivery as described in ISO/IEC 23008-3, clause 14.6 [X2]. All constraints for file formats using the sample entry 'mhm2' specified in ISO/IEC 23090-2, clause 10.2.2.3.3 [X1] shall apply.

6.2.2.3 Additional Restrictions for DASH Representations

DASH Integration is provided following the requirements and recommendations in ISO/IEC 23090-2, clause B.2.1 [X1]. All constraints in ISO/IEC 23090-2, clause B.2.1 [X1] shall apply.

6.2.2.4 DASH Adaptation Set Constraints

An instantiation of the OMAF 3D Audio Baseline Profile in DASH should be represented as one Adaptation Set. If so, the Adaptation Set should provide the signalling according to RFC 6381 [X3] and ISO/IEC 23008-3, clause 21 [X2], as shown in Table B.1.

Table B.1 – MPEG-H Audio MIME parameters according to RFC 6381 and ISO/IEC 23008-3

Codec                                            MIME type   codecs parameter   profiles   ISOBMFF Encapsulation
MPEG-H Audio LC Profile Level 1                  audio/mp4   mhm1.0x0B          'oabl'     ISO/IEC 23008-3
MPEG-H Audio LC Profile Level 2                  audio/mp4   mhm1.0x0C          'oabl'     ISO/IEC 23008-3
MPEG-H Audio LC Profile Level 3                  audio/mp4   mhm1.0x0D          'oabl'     ISO/IEC 23008-3
MPEG-H Audio LC Profile Level 1, multi-stream    audio/mp4   mhm2.0x0B          'oabl'     ISO/IEC 23008-3
MPEG-H Audio LC Profile Level 2, multi-stream    audio/mp4   mhm2.0x0C          'oabl'     ISO/IEC 23008-3
MPEG-H Audio LC Profile Level 3, multi-stream    audio/mp4   mhm2.0x0D          'oabl'     ISO/IEC 23008-3

Mapping of relevant MPD elements and attributes to MPEG-H Audio as well as the Preselection Element and Preselection descriptor are specified in ISO/IEC 23090-2, clause B.2.1.2 [X1].
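
An informative sketch of deriving the codecs parameter value from the profile level and delivery mode, following Table B.1 (the helper name is illustrative):

```python
# Informative sketch: maps LC profile level and single-/multi-stream delivery
# to the RFC 6381 codecs parameter value of Table B.1.
LEVEL_INDICATION = {1: "0x0B", 2: "0x0C", 3: "0x0D"}   # mpegh3daProfileLevelIndication

def mpegh_codecs_parameter(level: int, multi_stream: bool = False) -> str:
    sample_entry = "mhm2" if multi_stream else "mhm1"
    return f"{sample_entry}.{LEVEL_INDICATION[level]}"

assert mpegh_codecs_parameter(3) == "mhm1.0x0D"
assert mpegh_codecs_parameter(1, multi_stream=True) == "mhm2.0x0B"
# The MIME type is audio/mp4 and the profiles parameter is 'oabl' in all rows.
```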

6.2.2.4.1 DASH Adaptive Bitrate Switching

MPEG-H 3D Audio enables seamless bitrate switching in a DASH environment between different Representations (i.e., bitstreams encoded at different bitrates) of the same content, where those Representations are part of the same Adaptation Set.

If the decoder receives a DASH Segment of another Representation of the same Adaptation Set, the decoder shall perform an adaptive switch according to ISO/IEC 23008-3, clause 5.5.6 [X2].

*** Start change 3 ***

Annex X:
Example External Binaural Renderer (Informative)

X.1 General

Binaural rendering allows 3D audio content to be played back via headphones. The rendering is performed as a fast convolution of point sound source streams in the 3D space with head-related impulse responses (HRIRs) or binaural room impulse responses (BRIRs) corresponding to the direction of incidence relative to the listener. HRIRs shall be provided from an external source.


Figure 1: High level overview of an external binaural renderer setup.

The renderer has three input interfaces (see Fig. 1): the audio streams and metadata from the MPEG-H decoder, a head tracking interface for scene displacement information (for listener tracking), and a head-related impulse response (HRIR) interface providing binaural impulse responses for a given direction of incidence. The metadata as described in X.3, together with the scene displacement information, is used to construct a scene model, from which the renderer can infer the proper listener-relative point source positions.

The audio input streams may include Channel content, Object content, and HOA content. The renderer performs preprocessing steps to translate the respective content type into several point sources that are then processed for binaural rendering. Channel groups and objects that are marked as non-diegetic in the metadata are excluded from any scene displacement processing.

X.2 Interfaces

X.2.1 Interface for Audio Data and Metadata

The example external binaural renderer has an interface for the input of un-rendered channels, objects, and HOA content and associated metadata. The syntax of this input interface follows the specification of the External Renderer Interface for MPEG-H 3D Audio to output un-rendered channels, objects, and HOA content and associated metadata according to clause 6.1.4.3.5.4.

The input PCM data of the channels and objects interfaces is provided through an input PCM buffer, which first contains signals carrying the PCM data of the channel content. These are followed by signals carrying the PCM data of the un-rendered objects. Then additional signals carry the HOA data, the number of which is indicated in the HOA metadata via the HOA order (e.g. 16 signals for HOA order 3). The HOA audio data in the HOA interface is provided in the ESD representation. The conversion from the HOA domain into the equivalent spatial domain representation and vice versa is described in ISO/IEC 23008-3, Annex C.5.1 [X2].

The metadata for channels, objects, and HOA is received via the input interface once per frame and their syntax is specified in mpegh3da_getChannelMetadata(), mpegh3da_getObjectAudioAndMetadata(), and mpegh3da_getHoaMetadata() respectively, see ISO/IEC 23008-3, clause 17.10 [X2]. The metadata and PCM data shall be aligned to match each metadata element with the respective PCM frame.

X.2.2 Head Tracking Interface

The external binaural renderer receives scene displacement values (yaw, pitch and roll) e.g. from an external head tracking device via the head tracking interface. The syntax is specified in mpegh3daSceneDisplacementData() as defined in ISO/IEC 23008-3, clause 17.9.3 [X2].

X.2.3 Interface for Head-Related Impulse Responses

An interface is provided to specify the set of HRIRs used for the binaural rendering. These directional FIR filters shall be input using the SOFA (Spatially Oriented Format for Acoustics) file format according to AES69-2015 [X4]. The SimpleFreeFieldHRIR convention shall be used, where binaural filters are indexed by polar coordinates (azimuth φ in radians, elevation ϕ in radians, and radius r in meters) relative to the listener.

X.3 Preprocessing

X.3.1 Channel Content

Channel input content is converted into a corresponding set of point sources with associated positions using the loudspeaker configuration data included in mpegh3da_getChannelMetadata() and the associated PCM data obtained via the interface specified in clause X.2.1.

X.3.2 Object Content

Object input content is converted into corresponding point sources with associated positions using the metadata included in mpegh3da_getObjectAudioAndMetadata() and the associated PCM data obtained via the interface specified in clause X.2.1.

X.3.3 HOA Content

As specified in clause X.2.1, HOA content is input in the ESD representation together with the metadata included in mpegh3da_getHoaMetadata(). As a preprocessing step, the ESD representation is first converted into HOA coefficients. All coefficients associated with HOA of order larger than three are discarded to limit the maximum computational complexity.
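
An informative sketch of the order truncation, assuming the HOA coefficient channels are ordered so that all coefficients of order n ≤ 3 come first (e.g. ACN ordering, an assumption of this sketch); the function name is illustrative:

```python
import numpy as np

def truncate_hoa_order(hoa_coeffs: np.ndarray, max_order: int = 3) -> np.ndarray:
    """Discard all HOA coefficients of order larger than max_order.

    hoa_coeffs: array of shape ((N + 1)**2, num_samples); the first
    (max_order + 1)**2 rows are assumed to hold all orders <= max_order.
    """
    keep = (max_order + 1) ** 2             # 16 coefficient signals for order 3
    return hoa_coeffs[:keep, :]
```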

X.3.4 Non-diegetic Content

Channel groups for which the gca_directHeadphone flag is set in mpegh3da_getChannelMetadata() are routed directly to the left and right output channels and are excluded from binaural rendering using scene displacement data (non-diegetic content). Non-diegetic content may be in stereo or mono format. For mono, the signal is mixed to the left and right headphone channels with a gain factor of 0.707.

For each channel group, mpegh3da_getChannelMetadata() has to be checked to determine whether the gca_fixedChannelsPosition flag is equal to 0 or 1. A channel group with 'gca_fixedChannelsPosition == 1' is included in the binaural rendering but excluded from the scene displacement processing according to clause X.4, i.e. its position is not updated.

For each object, mpegh3da_getObjectAudioAndMetadata() has to be checked to determine whether the goa_fixedPosition flag is equal to 0 or 1. An object with 'goa_fixedPosition == 1' is included in the binaural rendering but excluded from the scene displacement processing according to clause X.4, i.e. its position is not updated.
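
The flag handling in clause X.3.4 can be summarized by the following informative sketch (the Routing categories are illustrative; the flag names are those used above):

```python
# Informative sketch of the per-channel-group / per-object routing decision.
from enum import Enum, auto

class Routing(Enum):
    DIRECT_TO_HEADPHONES = auto()   # non-diegetic, bypasses binaural rendering
    BINAURAL_FIXED = auto()         # binauralized, position not scene-displaced
    BINAURAL_TRACKED = auto()       # binauralized with scene displacement applied

def route_channel_group(gca_directHeadphone: int, gca_fixedChannelsPosition: int) -> Routing:
    if gca_directHeadphone == 1:
        return Routing.DIRECT_TO_HEADPHONES
    if gca_fixedChannelsPosition == 1:
        return Routing.BINAURAL_FIXED
    return Routing.BINAURAL_TRACKED

def route_object(goa_fixedPosition: int) -> Routing:
    return Routing.BINAURAL_FIXED if goa_fixedPosition == 1 else Routing.BINAURAL_TRACKED
```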

X.4 Scene Displacement Processing

The position of each point source derived from the channels and objects input is represented by a 3-dimensional position vector in a Cartesian coordinate system. The scene displacement information is used to compute an updated version of this position vector as described in clause X.4.1. The position of point sources that result from non-diegetic channel groups with 'gca_fixedChannelsPosition == 1' or from non-diegetic objects with 'goa_fixedPosition == 1' (see clause X.3.4) is not updated, i.e. the updated position is equal to the original position.

X.4.1 Applying Scene Displacement Information

The vector representation of a point source is transformed to the listener-relative coordinate system by rotation based on the scene displacement values obtained via the head tracking interface. This is achieved by multiplying the position vector $\mathbf{p}$ with a rotation matrix $\mathbf{R}$ calculated from the orientation of the listener:

$\mathbf{p}' = \mathbf{R}\,\mathbf{p}$

The determination of the rotation matrix $\mathbf{R}$ is defined in ISO/IEC 23008-3, Annex I [X2].
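
As an illustration, a common yaw-pitch-roll composition of such a rotation matrix is sketched below; the axis and sign conventions actually used are those of ISO/IEC 23008-3, Annex I [X2], which may differ from this sketch:

```latex
\mathbf{R} \;=\; \mathbf{R}_z(\mathrm{yaw})\,\mathbf{R}_y(\mathrm{pitch})\,\mathbf{R}_x(\mathrm{roll}),
\qquad
\mathbf{p}' \;=\; \mathbf{R}\,\mathbf{p}
```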

For HOA content, the rotation matrix suited for rotating the spherical harmonic representation is calculated as defined in ISO/IEC 23008-3, Annex I [X2]. After the rotation, the HOA coefficients are transformed back into the ESD representation. Each ESD component is then converted to the corresponding point source with its associated positional information. For the ESD components the position information is fixed, i.e. it is not updated, as the rotation due to scene displacement is performed in the spherical harmonic representation.

X.5 Headphone Output Signal Computation

The overall Scene Model is represented by the collection of all point sources with updated positions obtained from the rotated channels, objects, and the ESD components, as well as the non-diegetic channels and objects for which 'gca_fixedChannelsPosition == 1' or 'goa_fixedPosition == 1'. The overall number of point sources in the Scene Model is denoted K.

X.5.1 HRIR Selection

The position of each point source in the listener-relative coordinate system is used to query a best-match HRIR pair from the set of available HRIRs. For lookup, the polar coordinates of the HRIR locations are transformed into the internally used Cartesian coordinates and the closest-match available HRIR for a given point source position is selected. As no interpolation between different HRIRs is performed, HRIR datasets with sufficient spatial resolution should be provided.
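
An informative sketch of the closest-match lookup, assuming angles in radians (as stated in clause X.2.3) and an x-front, y-left, z-up axis convention (an assumption of this sketch):

```python
import numpy as np

def polar_to_cartesian(azimuth: float, elevation: float, radius: float) -> np.ndarray:
    """Convert a polar HRIR position (radians, metres) to Cartesian coordinates."""
    return radius * np.array([np.cos(elevation) * np.cos(azimuth),
                              np.cos(elevation) * np.sin(azimuth),
                              np.sin(elevation)])

def select_hrir(source_pos: np.ndarray, hrir_positions: np.ndarray) -> int:
    """Return the index of the closest available HRIR pair for a point source.

    hrir_positions: array of shape (num_hrirs, 3) in Cartesian coordinates.
    No interpolation is performed, matching the behaviour described above.
    """
    distances = np.linalg.norm(hrir_positions - source_pos, axis=1)
    return int(np.argmin(distances))
```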

X.5.2 Initialization

The HRIR filters used for binauralization are asynchronously partitioned and transformed into the frequency domain using a Fast Fourier Transform (FFT). The necessary steps for each of the HRIR filter pairs are as follows:

  1. Uniformly partition the length-N HRIR filter pairs into filter partitions of length B, where B is the processing block length.
  2. Zero-pad the filter partitions to length 2B.
  3. Transform all filter partitions into the frequency domain using a real-to-complex FFT to obtain the frequency-domain filter pairs, indexed per partition and frequency bin.
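
An informative sketch of these initialization steps, assuming filter partitions of length B (the processing block length) and an FFT length of 2B:

```python
import numpy as np

def partition_hrir(hrir: np.ndarray, block_len: int) -> np.ndarray:
    """Uniformly partition one length-N HRIR filter and transform the partitions.

    Returns the frequency-domain partitions as an array of shape
    (num_partitions, block_len + 1), i.e. the rfft of 2 * block_len points.
    """
    num_partitions = int(np.ceil(len(hrir) / block_len))
    padded = np.zeros(num_partitions * block_len)
    padded[:len(hrir)] = hrir
    partitions = padded.reshape(num_partitions, block_len)
    # zero-pad each length-B partition to 2B and apply a real-to-complex FFT
    return np.fft.rfft(partitions, n=2 * block_len, axis=1)
```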

X.5.3 Convolution and Crossfade

Each audio block of a point source of the Scene Model is convolved with its selected HRIR filter pair for the left and right ear respectively. To reduce the computational complexity, a fast frequency domain convolution technique of uniformly partitioned overlap-save processing is useful for typical FIR filter lengths for HRIRs/BRIRs. The required processing steps are described in the following.

The following block processing steps are performed for each of the K point sources of the Scene Model:

  1. Obtain a block of B new input samples of the point source.
  2. Perform a real-to-complex FFT of length 2B to obtain the frequency-domain representation of the input block.
  3. Compute the frequency-domain headphone output signal pair for the point source by multiplying each HRIR frequency-domain filter partition with the associated frequency-domain input block and adding the product results over all partitions.
  4. 2B samples of the time-domain output signal pair are obtained by performing a complex-to-real IFFT.
  5. Only the last B output samples represent valid output samples. The preceding B samples are time-aliased and are discarded.
  6. In case an HRIR filter exchange happens due to changes in the scene displacement, steps 3-5 are computed for both the current HRIR filter pair and the one used in the previous block. A time-domain crossfade is performed over the B output samples obtained in step 5.

The crossfade envelopes are chosen such that a constant power of the resulting output signal is preserved.

The crossfade operation defined in step 6 is only applied to point sources of the Scene Model that have been generated from channel or object content. For HOA content, the crossfade is applied between the current and the previous rotation matrices (see clause X.4.1).
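
An informative sketch of the block processing for one ear, using the partitions produced by the initialization sketch above; a sine/cosine envelope is used for the constant-power crossfade as one possible choice (the annex text fixes only the constant-power property):

```python
import numpy as np

class PartitionedConvolver:
    """Uniformly partitioned overlap-save convolution for one ear (informative sketch)."""

    def __init__(self, freq_partitions: np.ndarray, block_len: int):
        self.h = freq_partitions                    # shape (P, B + 1), from partition_hrir()
        self.B = block_len
        self.x_fd = np.zeros_like(freq_partitions)  # frequency-domain delay line, newest first
        self.prev_input = np.zeros(block_len)

    def process_block(self, x: np.ndarray) -> np.ndarray:
        """Convolve one block of B new input samples; return B valid output samples."""
        frame = np.concatenate([self.prev_input, x])        # steps 1-2: 2B input frame
        self.prev_input = x.copy()
        self.x_fd = np.roll(self.x_fd, 1, axis=0)
        self.x_fd[0] = np.fft.rfft(frame)
        y_fd = np.sum(self.x_fd * self.h, axis=0)           # step 3: multiply-accumulate
        y = np.fft.irfft(y_fd, n=2 * self.B)                # step 4: complex-to-real IFFT
        return y[self.B:]                                   # step 5: keep the last B samples

def crossfade(y_new: np.ndarray, y_old: np.ndarray) -> np.ndarray:
    """Step 6: constant-power crossfade over one output block after a filter exchange."""
    n = np.arange(1, len(y_new) + 1)
    w_in = np.sin(0.5 * np.pi * n / len(y_new))
    w_out = np.cos(0.5 * np.pi * n / len(y_new))
    return w_in * y_new + w_out * y_old
```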

X.5.4 Binaural Downmix

The rendered headphone output signal is computed as the sum over all binauralized point source signal pairs. In case the metadata provided together with the audio data at the input interface (see X.3.1) includes gain values applicable to a specific channel group (gca_channelGain in mpegh3da_getChannelMetadata()) or object (goa_objectGainFactor in mpegh3da_getObjectAudioAndMetadata()), these gain values are applied to the corresponding binauralized point source signals before the summation.

Finally, any additional non-binauralized non-diegetic audio input ('gca_directHeadphone == 1', see X.3.4) is added time-aligned to the two downmix channels.
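
In compact form, and writing g_i for the per-group or per-object gain, y^(i)_L/R for the binauralized point-source signal pairs and d_L/R for the non-binauralized non-diegetic content (notation introduced here for illustration only), the downmix reads:

```latex
y_{L}[n] \;=\; d_{L}[n] + \sum_{i=1}^{K} g_i \, y^{(i)}_{L}[n],
\qquad
y_{R}[n] \;=\; d_{R}[n] + \sum_{i=1}^{K} g_i \, y^{(i)}_{R}[n]
```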

X.5.5 Complexity

The algorithmic complexity of the external binaural renderer using a fast convolution approach can be evaluated for the following computations:

Convolution (X.5.3)

  1. RFFT (with an estimated additional complexity factor for the FFT):
  2. complex multiplications:
  3. complex additions:
  4. IRFFT:

Downmix (X.5.4)

  1. real multiplications:
  2. real additions:

Filter Exchange and Crossfade (X.5.3)

  1. RFFT:
  2. Time-domain crossfade (real multiplications):
  3. Time-domain crossfade (real additions):


Additional computations are required for scene displacement processing (see clause X.4).

The total complexity per output sample can be determined by adding the complexity estimates for convolution and downmix and dividing by the block length B. In blocks where a filter exchange is performed, items 2-4 from the convolution contribute twice to the overall complexity, in addition to the time-domain crossfade multiplications and additions (filter exchange items 2 and 3). The partitioning and FFT for the filter exchange, as well as the scene displacement processing, can be performed independently of the input block processing.

X.5.6 Motion Latency

The Scene Model can be updated with arbitrary temporal precision, but the resulting HRIR exchange is only done at processing block boundaries of the convolution. With a standard block size of B = 256 samples at 48 kHz sampling rate, this leads to a maximum onset latency of 5.3 ms until there is an audible effect of a motion of sources or the listener. In the following block, a time-domain crossfade between the new and the previous filtered signal is performed (see clauses X.5.2 and X.5.3), so that a discrete, instantaneous motion is completed after a maximum of two convolution processing blocks (10.6 ms for 512 samples at 48 kHz sampling rate). Additional latency from head trackers, audio buffering, etc. is not considered.
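
The quoted onset latency follows directly from the block length and the sampling rate:

```latex
t_{\mathrm{onset,max}} \;=\; \frac{B}{f_s} \;=\; \frac{256}{48\,000\ \mathrm{Hz}} \;\approx\; 5.3\ \mathrm{ms}
```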

The rotation of the HOA content is performed at a block boundary resulting in a maximum latency of one processing block, until a motion is completed.

*** End of changes ***
