Description
Time Scale Modification (TSM) is a digital signal processing technique standardized within 3GPP for use in multimedia telecommunication services. It compresses or expands the time axis of an audio (or video) signal without changing the perceived pitch of the audio. For example, speeding up a speech signal by 10% using TSM results in faster speech, but the speaker's voice does not sound higher-pitched. This distinguishes TSM from simply playing the samples back at a different rate (plain resampling), which changes speed and pitch together.
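To put a number on the contrast with plain resampling, the short sketch below computes the pitch shift that results when audio is sped up by changing the playback rate alone. The function name is hypothetical and chosen for illustration; the formula is the standard relation between a frequency ratio and musical semitones.

```python
import math

def resampling_pitch_shift_semitones(speed):
    """Pitch shift (in semitones) heard when audio is played at `speed` times
    its original rate without TSM. Every frequency is multiplied by `speed`,
    and one semitone corresponds to a frequency ratio of 2 ** (1 / 12)."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return 12.0 * math.log2(speed)

if __name__ == "__main__":
    # A 10% speed-up by resampling alone raises the pitch noticeably.
    print(round(resampling_pitch_shift_semitones(1.10), 2))
```

A 10% speed-up by resampling alone therefore raises the pitch by roughly 1.65 semitones, whereas a TSM algorithm aims to keep this shift at zero for the same change in duration.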
Architecturally, TSM can be implemented in various network elements or in user equipment (UE), depending on the service. In a Voice over IP (VoIP) or Video Telephony service, a TSM function may reside in a Media Resource Function (MRF) within the IP Multimedia Subsystem (IMS) or in an application server. It can also be a capability of the UE's media codec or post-processing software. The TSM algorithm analyzes the input media stream, typically after decoding it to a linear PCM format. It then divides the signal into short overlapping frames, often using techniques based on the Short-Time Fourier Transform (STFT) or waveform similarity overlap-add (WSOLA), and looks for points where small segments of signal can be removed or duplicated without creating audible artifacts.
In the synthesis phase, the modified segments are overlapped and added back together to construct the output signal at the new time scale. For time compression, redundant or less perceptually critical periods (like silences or steady-state vowel sounds) are shortened or removed. For time expansion, additional segments are inserted by carefully overlapping and cross-fading similar waveform sections. The process is controlled by a scaling factor, expressed here as the ratio of output duration to input duration (e.g., 0.9 shortens playback by 10%, 1.1 lengthens it by 10%). In a network context, a common application is jitter buffer management. A receiver's jitter buffer uses TSM to slightly adjust the playback rate to match the long-term average arrival rate of packets, preventing buffer underflow or overflow without requiring clock synchronization between sender and receiver.
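The analysis and overlap-add steps described above can be illustrated with a minimal WSOLA-style sketch. It is a simplified teaching example under stated assumptions, not the algorithm of any 3GPP specification: the function name `wsola`, the frame length, the search tolerance, the Hann window, and the unnormalized cross-correlation used as the similarity measure are all choices made here for illustration. The `scale` argument is the ratio of output duration to input duration.

```python
import math

def wsola(x, scale, frame=256, tol=64):
    """Time-scale the sample list x to roughly scale * len(x) samples.

    scale < 1 shortens the signal (faster playback), scale > 1 lengthens it.
    frame and tol are illustrative values in samples (assumptions).
    """
    if len(x) < frame:
        raise ValueError("input must contain at least one full frame")
    hop_out = frame // 2          # synthesis hop: 50% overlap in the output
    hop_in = hop_out / scale      # analysis hop: step through the input
    # Periodic Hann window: two copies shifted by frame/2 sum to 1.
    win = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame) for i in range(frame)]
    last_start = len(x) - frame   # last index where a full frame still fits
    n_frames = int(last_start / hop_in) + 1
    out = [0.0] * ((n_frames - 1) * hop_out + frame)
    pos = 0                       # input start of the most recently copied frame
    for k in range(n_frames):
        if k > 0:
            ideal = round(k * hop_in)
            # The segment that would naturally follow the previous frame.
            target = min(pos + hop_out, last_start)
            lo = max(0, ideal - tol)
            hi = min(last_start, ideal + tol)
            best = -math.inf
            # Pick the candidate near the ideal position that is most similar
            # to the natural continuation, so the overlap stays in phase.
            for cand in range(lo, hi + 1):
                score = sum(x[cand + i] * x[target + i] for i in range(frame))
                if score > best:
                    best, pos = score, cand
        # Overlap-add the windowed segment at the output position.
        base = k * hop_out
        for i in range(frame):
            out[base + i] += win[i] * x[pos + i]
    return out

if __name__ == "__main__":
    fs = 8000  # sample rate in Hz (assumption for the demo)
    tone = [math.sin(2 * math.pi * 200 * n / fs) for n in range(2000)]
    for s in (0.9, 1.1):
        print(s, len(tone), len(wsola(tone, s)))
```

Because each copied segment is aligned with the natural continuation of the previous one, a periodic input keeps its period in the output while the total length changes. The first and last half-frames are tapered by the window, and the exhaustive search is slow; a practical implementation would typically normalize the similarity measure and restrict or subsample the search.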
Its role in the network extends beyond jitter compensation. TSM is used for synchronizing independently delivered media streams, such as aligning audio with video in multimedia messaging or broadcast services. It also enables user-centric features like fast-forward or slow-motion playback of recorded voice messages or lecture videos without unnatural pitch distortion. The specifications detail performance requirements, such as the acceptable range of scale factors and the maximum permissible degradation in speech quality, ensuring interoperability between implementations from different vendors.
Purpose & Motivation
Time Scale Modification was introduced to solve practical problems arising in packet-based multimedia communication, where perfect isochronous delivery cannot be guaranteed. In traditional circuit-switched voice networks, a dedicated, synchronous channel ensured constant delay. In VoIP and 3GPP packet-switched multimedia services, packets experience variable delay (jitter) as they traverse the IP network. A simple playout buffer can absorb this jitter, but if the sender's and receiver's clocks drift even slightly, the buffer will eventually underflow or overflow, causing audible gaps or skips in speech.
TSM solves this clock drift problem without requiring complex, network-wide clock synchronization (like IEEE 1588). By applying very slight, imperceptible time scaling (e.g., ±50 ppm), the playout buffer can adjust its consumption rate to match the long-term average arrival rate of packets. This is simpler and cheaper than attempting to synchronize every endpoint and network node to a common clock source, and it directly addresses the limitation of simple buffering in asynchronous packet networks.
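The effect of drift on a playout buffer, and how a small scale factor counters it, can be sketched with a toy simulation. Everything in it is an assumption made for illustration: the function names, the 20 ms frame size, the 60 ms target depth, and the proportional controller with its gain and clamp are hypothetical and are not taken from any 3GPP specification. The scale factor is again the ratio of output duration to input duration, so a value below 1 drains the buffer slightly faster.

```python
def playout_scale(buffer_ms, target_ms=60.0, gain=0.01, max_dev=0.05):
    """Hypothetical proportional controller: choose a duration scale factor
    from the current buffer depth. A fuller buffer yields scale < 1."""
    error = (buffer_ms - target_ms) / target_ms
    dev = max(-max_dev, min(max_dev, gain * error))
    return 1.0 - dev

def simulate(drift_ppm, seconds, use_tsm, frame_ms=20.0, target_ms=60.0):
    """Return the buffer depth (ms of media) after `seconds` of playout when
    the sender clock runs `drift_ppm` faster than the receiver clock."""
    buffer_ms = target_ms
    ticks = int(seconds * 1000 / frame_ms)
    # Media arriving per receiver-side frame interval.
    arriving_ms = frame_ms * (1.0 + drift_ppm * 1e-6)
    for _ in range(ticks):
        scale = playout_scale(buffer_ms, target_ms) if use_tsm else 1.0
        # Media played for scale * d ms lasts d ms, so one frame interval
        # consumes frame_ms / scale of buffered media.
        buffer_ms += arriving_ms - frame_ms / scale
    return buffer_ms

if __name__ == "__main__":
    print("no TSM  :", round(simulate(50, 3600, use_tsm=False), 1), "ms")
    print("with TSM:", round(simulate(50, 3600, use_tsm=True), 1), "ms")
```

Under these assumptions, a 50 ppm drift adds about 0.001 ms of surplus media per 20 ms frame, which accumulates to roughly 180 ms over one hour if nothing is done. With the controller enabled, the buffer settles a fraction of a millisecond above its target, because the steady-state scale factor only needs to deviate from 1 by about the drift itself.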
Furthermore, TSM enables enhanced user services. The ability to change playback speed without pitch alteration was a desired feature for messaging services (e.g., listening to voicemail faster) and for accessibility (e.g., slowing down instructional audio). Before standardized TSM algorithms, proprietary solutions led to interoperability issues. 3GPP standardization ensured a consistent level of quality and functionality across networks and devices, improving both time-adjusted media playback and the resilience of real-time communication over unreliable packet networks.
Key Features
- Modifies playback duration (speed) of audio/video without altering perceptual pitch
- Used for dynamic jitter buffer control to compensate for clock drift between sender and receiver
- Enables audio-video synchronization in multimedia services
- Supports user-controlled playback speed for messaging and streaming
- Based on advanced DSP algorithms like WSOLA or phase vocoders
- Standardized performance requirements to ensure quality and interoperability
Evolution Across Releases
Defining Specifications
| Specification | Title |
|---|---|
| TS 26.253 | 3GPP TS 26.253 |
| TS 26.256 | 3GPP TS 26.256 |
| TS 26.448 | 3GPP TS 26.448 |
| TS 28.062 | 3GPP TS 28.062 |