TSM

Time Scale Modification

Services
Introduced in Rel-8
A media processing function that adjusts the playback speed of audio or video streams without altering the pitch. It is used in telecommunication services like Voice over IP (VoIP) to compensate for network jitter, synchronize streams, or enable features like playback speed control.

Description

Time Scale Modification (TSM) is a digital signal processing technique standardized within 3GPP for use in multimedia telecommunication services. Its primary function is to compress or expand the time axis of an audio (or video) signal. Crucially, it achieves this without changing the perceptual pitch of the audio. For example, speeding up a speech signal by 10% using TSM results in faster speech, but the speaker's voice does not sound higher-pitched. This is a key distinction from simple sample rate conversion, which would change both speed and pitch.
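The distinction from sample rate conversion can be illustrated with a small experiment. The sketch below is not from any 3GPP specification; the helper names `sine`, `naive_speedup`, and `estimate_freq` are hypothetical, and the 8 kHz rate and 200 Hz tone are arbitrary illustrative choices. It speeds up a tone by naive resampling and shows that the measured frequency rises along with the speed, which is the side effect TSM is meant to avoid.

```python
import math

def sine(freq_hz, duration_s, rate_hz=8000):
    """Generate a mono sine wave as a list of floats."""
    n = int(duration_s * rate_hz)
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n)]

def naive_speedup(samples, speed):
    """Speed up by reading the input `speed` samples per output sample
    (nearest-neighbour resampling, played back at the original rate)."""
    n_out = int(len(samples) / speed)
    return [samples[int(i * speed)] for i in range(n_out)]

def estimate_freq(samples, rate_hz=8000):
    """Rough frequency estimate from the count of upward zero crossings."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * rate_hz / len(samples)

tone = sine(200.0, 1.0)          # 1 s of a 200 Hz tone
fast = naive_speedup(tone, 1.1)  # 10% faster by resampling

# The output is shorter, but its pitch has risen by about 10% as well.
print(len(tone), len(fast))
print(estimate_freq(tone), estimate_freq(fast))
```

The zero-crossing estimate is only approximate, but it is enough to show the tone moving from roughly 200 Hz to roughly 220 Hz. A TSM algorithm would shorten the signal by the same amount while leaving the tone near 200 Hz.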

Architecturally, TSM can be implemented in various network elements or in user equipment (UE), depending on the service. In a Voice over IP (VoIP) or Video Telephony service, a TSM function may reside in a Media Resource Function (MRF) within the IP Multimedia Subsystem (IMS) or in an application server. It can also be a capability of the UE's media codec or post-processing software. The TSM algorithm works by analyzing the input media stream, typically after decoding it to a linear PCM format. It then segments the signal, often using frequency-domain techniques based on the Short-Time Fourier Transform (STFT), such as the phase vocoder, or time-domain methods such as waveform similarity overlap-add (WSOLA), to find optimal points for removing or duplicating small segments of signal without creating audible artifacts.
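As an illustration of the time-domain approach, the following is a minimal WSOLA-style sketch operating on decoded PCM held as a list of floats. It is an assumption-laden teaching example, not the algorithm of any 3GPP specification: the function name `wsola`, the frame length, and the search tolerance are all hypothetical choices. For each output frame it searches around the nominal input position for the segment most similar to the natural continuation of the previous one, then windows and overlap-adds it.

```python
import math

def wsola(x, stretch, frame=256, tol=32):
    """Return `x` time-scaled to about `stretch` times its length.

    stretch < 1.0 shortens (faster playback); stretch > 1.0 lengthens.
    `frame` and `tol` are illustrative values in samples, not taken
    from any specification.
    """
    hop_out = frame // 2                  # synthesis hop: 50% overlap
    hop_in = hop_out / stretch            # analysis hop in the input
    # Periodic Hann window: copies shifted by hop_out sum to 1.
    win = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame) for n in range(frame)]
    # Zero-pad so every search position and segment stays in range.
    xp = [0.0] * tol + list(x) + [0.0] * (tol + frame + hop_out)
    n_frames = int(len(x) / hop_in)
    out = [0.0] * (n_frames * hop_out + frame)
    prev = None                           # start of the last copied segment
    for k in range(n_frames):
        nominal = tol + int(round(k * hop_in))
        if prev is None:
            best = nominal
        else:
            # Natural continuation of what was last written to the output.
            target = xp[prev + hop_out: prev + hop_out + frame]

            def score(start):
                seg = xp[start:start + frame]
                return sum(a * b for a, b in zip(seg, target))

            # Pick the candidate with the highest cross-correlation.
            best = max(range(nominal - tol, nominal + tol + 1), key=score)
        for n in range(frame):
            out[k * hop_out + n] += win[n] * xp[best + n]
        prev = best
    return out[:int(len(x) * stretch)]

# Usage: shorten 1 s of a 200 Hz tone (8 kHz sampling) by 10%.
tone = [math.sin(2 * math.pi * 200 * i / 8000) for i in range(8000)]
faster = wsola(tone, 0.9)
print(len(tone), len(faster))
```

The first half frame of the output fades in because only one window covers it; a production implementation would handle the edges and would normally use an optimized correlation search rather than this brute-force loop.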

In the synthesis phase, the modified segments are overlapped and added to construct the output signal at the new time scale. For time compression, redundant or less perceptually critical periods (such as silences or steady-state vowel sounds) are shortened or removed. For time expansion, additional segments are inserted by carefully overlapping and cross-fading similar waveform sections. The process is controlled by a scaling factor applied to the signal duration (e.g., 0.9 shortens playback by 10%, 1.1 lengthens it by 10%). In a network context, a common application is jitter buffer management: a receiver's jitter buffer uses TSM to slightly adjust the playback rate to match the long-term average arrival rate of packets, preventing buffer underflow or overflow without requiring clock synchronization between sender and receiver.
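The jitter buffer use case can be sketched as a simple feedback loop. The example below is hypothetical throughout: `playout_scale` and `simulate` are invented names, and the proportional gain, the 60 ms target, and the 20 ms frame size are illustrative assumptions rather than values from a specification. It derives a duration scaling factor from the buffer fill level and simulates a sender whose clock runs slightly fast.

```python
def playout_scale(buffered_ms, target_ms, gain=0.002, max_dev=0.1):
    """Duration scaling factor for the next frame.

    Below 1.0 plays faster (drains a buffer that is too full);
    above 1.0 plays slower (lets a nearly empty buffer refill).
    """
    error_ms = buffered_ms - target_ms    # positive: buffer is too full
    scale = 1.0 - gain * error_ms
    return max(1.0 - max_dev, min(1.0 + max_dev, scale))

def simulate(drift_ppm, frames=5000, target_ms=60.0, frame_ms=20.0,
             use_tsm=True):
    """Buffer level (ms) after `frames` frames when the sender clock
    runs `drift_ppm` faster than the receiver clock."""
    buffered = target_ms
    for _ in range(frames):
        scale = playout_scale(buffered, target_ms) if use_tsm else 1.0
        play_time = frame_ms * scale      # wall-clock time for one frame
        buffered -= frame_ms              # one frame of media consumed
        # Media arriving from the (slightly fast) sender meanwhile.
        buffered += play_time * (1 + drift_ppm * 1e-6)
    return buffered

print(simulate(100, use_tsm=False))  # buffer keeps growing
print(simulate(100, use_tsm=True))   # buffer settles near the target
```

With time scaling disabled the buffer level grows steadily, while the controlled version settles a fraction of a millisecond above the target. A real receiver would also smooth the fill-level measurement and restrict scaling to what the TSM algorithm can deliver without audible artifacts.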

Its role in the network extends beyond jitter compensation. TSM is used for synchronizing independently delivered media streams, such as aligning audio with video in multimedia messaging or broadcast services. It also enables user-centric features like fast-forward or slow-motion playback of recorded voice messages or lecture videos without unnatural pitch distortion. The specifications detail performance requirements, such as the acceptable range of scale factors and the maximum permissible degradation in speech quality, ensuring interoperability between different implementations from various vendors.

Purpose & Motivation

Time Scale Modification was introduced to solve practical problems arising in packet-based multimedia communication, where perfect isochronous delivery cannot be guaranteed. In traditional circuit-switched voice networks, a dedicated, synchronous channel ensured constant delay. In VoIP and 3GPP packet-switched multimedia services, packets experience variable delay (jitter) as they traverse the IP network. A simple playout buffer can absorb this jitter, but if the sender's and receiver's clocks drift even slightly, the buffer will eventually underflow or overflow, causing audible gaps or skips in speech.

TSM provides an elegant solution to this clock drift problem without requiring complex, network-wide clock synchronization (like IEEE 1588). By applying very slight, imperceptible time scaling (e.g., ±50 ppm), the playout buffer can adjust its consumption rate to match the long-term average arrival rate of packets. This is far more efficient and lower cost than attempting to synchronize every endpoint and network node to a common clock source. It directly addresses the limitation of simple buffering in asynchronous packet networks.
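The scale of the drift problem can be checked with simple arithmetic. In the sketch below, the function name `seconds_until_slip` and the 60 ms of buffer headroom are illustrative assumptions; only the 50 ppm figure comes from the text above.

```python
def seconds_until_slip(headroom_ms, drift_ppm):
    """Time until an uncorrected clock offset of `drift_ppm` uses up
    `headroom_ms` of playout buffer, causing underflow or overflow."""
    drift_ratio = drift_ppm * 1e-6        # e.g. 50 ppm -> 0.00005
    return (headroom_ms / 1000.0) / drift_ratio

# 60 ms of headroom at 50 ppm lasts about 1200 s, i.e. 20 minutes.
print(seconds_until_slip(60.0, 50.0))

# The matching TSM correction is a duration scale factor within
# 50 ppm of unity, in whichever direction the clocks differ.
print(1.0 + 50e-6, 1.0 - 50e-6)
```

Under these assumptions, a long call would hit a gap or skip roughly every 20 minutes without correction, whereas the required time scaling is only a few hundredths of a millisecond per second of audio.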

Furthermore, TSM enables enhanced user services. The ability to change playback speed without pitch alteration was a desired feature for messaging services (e.g., listening to voicemail faster) and for accessibility (e.g., slowing down instructional audio). Before standardized TSM algorithms, proprietary solutions led to interoperability issues. 3GPP standardization ensured a consistent level of quality and functionality across networks and devices, promoting a better user experience for time-adjusted media playback and robust, resilient real-time communication over unreliable packet networks.

Key Features

  • Modifies playback duration (speed) of audio/video without altering perceptual pitch
  • Used for dynamic jitter buffer control to compensate for network clock drift
  • Enables audio-video synchronization in multimedia services
  • Supports user-controlled playback speed for messaging and streaming
  • Based on advanced DSP algorithms like WSOLA or phase vocoders
  • Standardized performance requirements to ensure quality and interoperability

Evolution Across Releases

Rel-4 Initial

Time Scale Modification concepts began appearing in the context of adaptive multi-rate (AMR) codec and voice services over packet networks, addressing the need for playout buffer control.

With the introduction of IMS, the need for standardized media processing functions increased. TSM was identified as a key component for media adaptation.

Further work on packet-switched streaming services (PSS) and multimedia messaging (MMS) included requirements for time-scale modification of audio content.

Specifications for Voice over IP (VoIP) and video telephony over IMS detailed the use of TSM for jitter buffer management and media synchronization.

TSM was formally specified for use in the Enhanced Voice Services (EVS) codec framework and for IMS-based telephony. Requirements and test procedures for TSM performance were standardized to ensure quality in LTE voice services (VoLTE).

Integration of TSM capabilities into the Media Resource Function (MRF) for network-based media processing in IMS.

Enhancements for high-definition voice services and continued refinement of TSM for jitter management in real-time communication.

Support for TSM in evolved multimedia broadcast/multicast services (eMBMS) for stream synchronization.

No major architectural changes. TSM remained a stable component for media handling.

Considerations for TSM in WebRTC integration and VoWiFi (Voice over Wi-Fi) scenarios.

Enhanced support for immersive audio and video services, where precise synchronization is critical, leveraging TSM.

TSM carried into the 5G system for Voice over New Radio (VoNR) and real-time interactive services, ensuring continuity from 4G.

Integration with 5G Media Streaming and enhanced support for industrial IoT applications requiring precise media timing.

Continued role in 5G Advanced for ultra-reliable low-latency communication (URLLC) media streams and extended reality (XR) applications.

TSM is maintained as a core media processing function within the 5G service-based architecture for all real-time communication services.

Ongoing specification support for TSM in evolving multimedia codecs and service frameworks.

Defining Specifications

Specification    Title
TS 26.253        3GPP TS 26.253
TS 26.256        3GPP TS 26.256
TS 26.448        3GPP TS 26.448
TS 28.062        3GPP TS 28.062