TSM (Time Scale Modification) — 3GPP Glossary

A media processing function that adjusts the playback speed of audio or video streams without altering the pitch. It is used in telecommunication services like Voice over IP (VoIP) to compensate for network jitter, synchronize streams, or enable features like playback speed control.

Description

Time Scale Modification (TSM) is a digital signal processing technique standardized within 3GPP for use in multimedia telecommunication services. Its primary function is to compress or expand the time axis of an audio (or video) signal without changing the perceived pitch of the audio. For example, speeding up a speech signal by 10% using TSM results in faster speech, but the speaker's voice does not sound higher-pitched. This distinguishes TSM from naive resampling (playing the same samples back at a different rate), which changes speed and pitch together.
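
To make the distinction concrete, the following minimal sketch speeds up a 200 Hz tone by naive resampling and then estimates the resulting frequency. It uses only the Python standard library; the function names, the 8 kHz sample rate, and the test tone are illustrative assumptions and are not taken from any 3GPP specification. The tone gets shorter, but its pitch rises by the same factor, which is the side effect TSM is designed to avoid.

```python
import math

def sine(freq_hz, duration_s, sample_rate=8000):
    """Generate a sine tone as a list of float samples."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

def naive_speed_change(samples, speed):
    """Read the input `speed` samples at a time, using linear interpolation.

    This is plain resampling, not TSM: duration and pitch change together.
    """
    out = []
    pos = 0.0
    while pos < len(samples) - 1:
        i = int(pos)
        frac = pos - i
        out.append((1 - frac) * samples[i] + frac * samples[i + 1])
        pos += speed
    return out

def estimate_freq(samples, sample_rate=8000):
    """Rough frequency estimate from the count of upward zero crossings."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * sample_rate / len(samples)

tone = sine(200.0, 1.0)
fast = naive_speed_change(tone, 1.1)
print(len(fast) / len(tone))   # duration ratio, roughly 1 / 1.1
print(estimate_freq(fast))     # roughly 220 Hz rather than 200 Hz
```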

Architecturally, TSM can be implemented in network elements or in user equipment (UE), depending on the service. In a Voice over IP (VoIP) or Video Telephony service, a TSM function may reside in a Media Resource Function (MRF) within the IP Multimedia Subsystem (IMS) or in an application server. It can also be a capability of the UE's media codec or post-processing software. The TSM algorithm analyzes the input media stream, typically after decoding it to linear PCM, and splits it into short overlapping frames. Time-domain methods such as Waveform Similarity Overlap-Add (WSOLA) search for points where small segments of signal can be removed or duplicated without creating audible artifacts; frequency-domain methods based on the Short-Time Fourier Transform (STFT), such as the phase vocoder, instead respace the frames and adjust their phases.

In the synthesis phase, the selected segments are overlapped and added back together to construct the output signal at the new time scale. For time compression, redundant or less perceptually critical periods (like silences or steady-state vowel sounds) are shortened or removed. For time expansion, additional segments are inserted by carefully overlapping and cross-fading similar waveform sections. The process is controlled by a scaling factor, expressed here as the ratio of output duration to input duration (e.g., 0.9 for a 10% shorter output, 1.1 for a 10% longer one). In a network context, a common application is jitter buffer management. A receiver's jitter buffer uses TSM to slightly adjust the playback rate to match the long-term average arrival rate of packets, preventing buffer underflow or overflow without requiring clock synchronization between sender and receiver.
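
The analysis and overlap-add steps described above can be sketched as a small WSOLA-style routine. This is a simplified illustration under stated assumptions, not the algorithm of any particular 3GPP codec: the function name `wsola`, the 20 ms frame at 8 kHz, the search tolerance, and the use of plain (unnormalized) cross-correlation as the similarity measure are all choices made for this example.

```python
import math

def wsola(x, ratio, frame=160, tol=40):
    """Return `x` time-scaled so that it lasts about `ratio` times as long.

    ratio < 1 shortens the signal (speed-up); ratio > 1 lengthens it.
    """
    hop_out = frame // 2              # synthesis hop: frames overlap by 50%
    hop_in = hop_out / ratio          # analysis hop: how fast we walk the input
    # Periodic Hann window; copies shifted by frame/2 sum to 1.
    win = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame) for n in range(frame)]

    out = []
    prev = None                       # input position of the previous frame
    m = 0
    while True:
        nominal = tol + int(round(m * hop_in))
        # Stop once a candidate frame or the template would run past the input.
        if nominal + tol + frame + hop_out > len(x):
            break
        if prev is None:
            start = nominal
        else:
            # Template: the samples that naturally follow the previous frame.
            template = x[prev + hop_out: prev + hop_out + frame]
            # Choose the candidate near the nominal position that matches best.
            start = max(
                range(nominal - tol, nominal + tol + 1),
                key=lambda s: sum(a * b for a, b in zip(template, x[s:s + frame])),
            )
        # Overlap-add the windowed frame at its output position.
        base = m * hop_out
        out.extend([0.0] * (base + frame - len(out)))
        for n in range(frame):
            out[base + n] += win[n] * x[start + n]
        prev = start
        m += 1
    return out

rate = 8000
tone = [math.sin(2 * math.pi * 200 * i / rate) for i in range(rate)]  # 1 s, 200 Hz
slow = wsola(tone, 1.25)
quick = wsola(tone, 0.8)
print(len(slow) / len(tone))   # close to 1.25, slightly less due to edge effects
print(len(quick) / len(tone))  # close to 0.8
```

Because each frame is chosen to continue the waveform of the previous one, the stretched tone keeps its 200 Hz pitch; only its duration changes. A production implementation would normally use a normalized similarity measure and handle the signal edges more carefully.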

Its role in the network extends beyond jitter compensation. TSM is used for synchronizing independently delivered media streams, such as aligning audio with video in multimedia messaging or broadcast services. It also enables user-centric features like fast-forward or slow-motion playback of recorded voice messages or lecture videos without unnatural pitch distortion. The specifications detail performance requirements, such as the acceptable range of scale factors and the maximum permissible degradation in speech quality, ensuring interoperability between different implementations from various vendors.
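
For the synchronization case, the required scaling factor follows from simple arithmetic. The sketch below assumes a hypothetical helper, `sync_ratio`, that spreads the correction of a measured audio lag over a chosen playout window; both the helper and the example numbers are illustrative rather than specified values.

```python
def sync_ratio(audio_lag_ms, window_ms):
    """Duration ratio (output / input) that removes `audio_lag_ms` over `window_ms`.

    A positive lag means audio is behind video and must play slightly faster.
    """
    if window_ms <= 0 or window_ms + audio_lag_ms <= 0:
        raise ValueError("correction window too short for this lag")
    # In `window_ms` of playout time the audio must cover window + lag of media.
    return window_ms / (window_ms + audio_lag_ms)

print(sync_ratio(80, 4000))    # about 0.98: play audio roughly 2% faster for 4 s
print(sync_ratio(-80, 4000))   # about 1.02: audio is ahead, so stretch it
```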

Purpose & Motivation

Time Scale Modification was introduced to solve practical problems arising in packet-based multimedia communication, where perfect isochronous delivery cannot be guaranteed. In traditional circuit-switched voice networks, a dedicated, synchronous channel ensured constant delay. In VoIP and 3GPP packet-switched multimedia services, packets experience variable delay (jitter) as they traverse the IP network. A simple playout buffer can absorb this jitter, but if the sender's and receiver's clocks drift even slightly, the buffer will eventually underflow or overflow, causing audible gaps or skips in speech.

TSM provides an elegant solution to this clock drift problem without requiring complex, network-wide clock synchronization (like IEEE 1588). By applying very slight, imperceptible time scaling (e.g., ±50 ppm), the playout buffer can adjust its consumption rate to match the long-term average arrival rate of packets. This is far more efficient and lower cost than attempting to synchronize every endpoint and network node to a common clock source. It directly addresses the limitation of simple buffering in asynchronous packet networks.
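
A rough simulation illustrates both the drift problem and the correction. The sketch below assumes a sender clock that runs 50 ppm fast, a 60 ms playout target, 20 ms frames, and a simple proportional controller; the controller, its gain, and its clamp are hypothetical values chosen for illustration and do not come from a 3GPP specification.

```python
def seconds_until_buffer_fails(headroom_ms, drift_ppm):
    """Time for clock drift alone to use up `headroom_ms` of buffer margin."""
    return (headroom_ms / 1000.0) / (drift_ppm * 1e-6)

def playout_ratio(buffer_ms, target_ms, gain_per_ms=1e-5, max_adjust=1e-3):
    """Output/input duration ratio for the next frame (hypothetical controller).

    A buffer above target gives a ratio below 1: play faster, drain the buffer.
    """
    adjust = gain_per_ms * (buffer_ms - target_ms)
    adjust = max(-max_adjust, min(max_adjust, adjust))
    return 1.0 - adjust

def simulate(seconds, drift_ppm, use_tsm, target_ms=60.0, frame_ms=20.0):
    """Buffer level (ms of media) after `seconds` with a fast sender clock."""
    buffer_ms = target_ms
    for _ in range(int(seconds * 1000 / frame_ms)):
        ratio = playout_ratio(buffer_ms, target_ms) if use_tsm else 1.0
        arrived = frame_ms * (1 + drift_ppm * 1e-6)   # media received per tick
        played = frame_ms / ratio                     # media consumed per tick
        buffer_ms += arrived - played
    return buffer_ms

print(seconds_until_buffer_fails(60, 50))   # about 1200 s, i.e. roughly 20 minutes
print(simulate(3600, 50, use_tsm=False))    # about 240 ms and still growing
print(simulate(3600, 50, use_tsm=True))     # settles about 5 ms above the target
```

Without correction the buffer level grows without bound, so any finite buffer eventually overflows. With the controller, the playout rate converges on the sender's rate and the buffer level stabilizes at a small offset from the target.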

Furthermore, TSM enables enhanced user services. The ability to change playback speed without pitch alteration was a desired feature for messaging services (e.g., listening to voicemail faster) and for accessibility (e.g., slowing down instructional audio). Before standardized TSM algorithms, proprietary solutions led to interoperability issues. 3GPP standardization ensured a consistent level of quality and functionality across networks and devices, promoting a better user experience for time-adjusted media playback and robust, resilient real-time communication over unreliable packet networks.

Key Features

Modifies playback duration (speed) of audio/video without altering perceptual pitch
Used for dynamic jitter buffer control to compensate for network clock drift
Enables audio-video synchronization in multimedia services
Supports user-controlled playback speed for messaging and streaming
Based on advanced DSP algorithms like WSOLA or phase vocoders
Standardized performance requirements to ensure quality and interoperability

Evolution Across Releases

Rel-4 Initial

Time Scale Modification concepts began appearing in the context of adaptive multi-rate (AMR) codec and voice services over packet networks, addressing the need for playout buffer control.

Defining Specifications

Specification	Title
TS 26.253	3GPP TS 26.253
TS 26.256	3GPP TS 26.256
TS 26.448	3GPP TS 26.448
TS 28.062	3GPP TS 28.062
