PESQ

Perceptual Evaluation of Speech Quality

Other
Introduced in Rel-13
An ITU-T standardized objective method (ITU-T P.862) for automated assessment of speech quality as perceived by human listeners. It compares a degraded speech signal (e.g., after network transmission) to a clean reference signal, predicting a Mean Opinion Score (MOS). It is widely used for benchmarking and monitoring voice service quality in telecom networks.

Description

Perceptual Evaluation of Speech Quality (PESQ) is an algorithm defined by ITU-T Recommendation P.862 for objectively predicting the subjective quality of narrowband telephony speech (a wideband extension is defined in P.862.2) as it would be rated by human listeners in a listening test. Unlike simple signal-to-noise ratio measurements, PESQ models the human auditory system to provide an assessment that correlates highly with subjective Mean Opinion Score (MOS) tests. The algorithm takes two inputs: the original, undistorted reference speech signal and the degraded output signal that has passed through the system under test (e.g., a voice codec, packet network, or a complete voice call path). PESQ aligns the two signals in level and time to compensate for gain differences and delays, transforms both into a perceptual representation, and then compares them to compute a disturbance value that quantifies the perceived differences.

The internal processing of PESQ involves several key stages. First, it performs level alignment and time alignment to ensure a fair comparison, correcting for gain variations and bulk transmission delays. Next, both signals are transformed into a perceptually relevant representation using a model of the human auditory system, which includes frequency warping (to the Bark scale) to mimic the ear's non-linear frequency sensitivity. The algorithm then calculates two perceptual disturbances: a "symmetric" disturbance, which treats all audible differences between the two signals alike, and an "asymmetric" disturbance, which weights components added by the system (such as noise or coding artifacts) more heavily than components that were removed. These disturbances are aggregated across frequency and time to produce two intermediate values, one for each disturbance type.
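Two of the stages above can be illustrated with a minimal Python sketch. It is not the P.862 procedure: the bulk delay is estimated with a plain brute-force cross-correlation, and the Bark conversion uses Traunmüller's closed-form approximation rather than the warping defined in the Recommendation. The function names `estimate_bulk_delay` and `hz_to_bark` are hypothetical and chosen only for illustration.

```python
def estimate_bulk_delay(reference, degraded, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation.

    A positive result means the degraded signal lags the reference.
    Assumption: simple brute-force search, not the P.862 alignment.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Correlate only the samples that overlap at this lag.
        score = sum(
            r * degraded[i + lag]
            for i, r in enumerate(reference)
            if 0 <= i + lag < len(degraded)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def hz_to_bark(freq_hz):
    """Approximate Bark value for a frequency in Hz.

    Assumption: Traunmueller's approximation stands in for the
    frequency warping that P.862 actually specifies.
    """
    return 26.81 * freq_hz / (1960.0 + freq_hz) - 0.53


# Toy signals: the degraded copy is attenuated and delayed by 3 samples.
reference = [0.0, 0.0, 1.0, 0.5, -0.3, 0.0, 0.0, 0.0]
degraded = [0.0, 0.0, 0.0, 0.0, 0.0, 0.8, 0.4, -0.24]

print(estimate_bulk_delay(reference, degraded, max_lag=5))  # prints 3
print(hz_to_bark(1000.0))
```

The cross-correlation peak is unaffected by the gain change, which is why delay can be estimated before or alongside level alignment. The Bark mapping compresses high frequencies, so equal steps in Bark correspond to progressively wider steps in Hz.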

Finally, PESQ maps these aggregated disturbance values to a prediction of the subjective listening quality score. The output is a raw PESQ score ranging from -0.5 to 4.5, which can be further mapped (per ITU-T P.862.1) to a MOS-LQO (Mean Opinion Score - Listening Quality Objective) value on the familiar MOS scale from 1 (bad) to 5 (excellent); in practice the mapped values span roughly 1.02 to 4.55. While PESQ is highly effective for evaluating one-way speech quality impairments such as codec distortions, packet loss, and noise, it has limitations. As a listening-only model, it does not account for conversational impairments such as very long delays, echo, or sidetone, which must be assessed by other means. Its successor, POLQA (ITU-T P.863), extends listening-quality prediction to super-wideband speech and newer codecs. In 3GPP, PESQ is referenced (e.g., in TS 22.179 for Mission Critical Push-to-Talk services) as a standard methodology for defining minimum speech quality performance requirements for codecs and end-to-end systems, ensuring a consistent and repeatable quality benchmark across the industry.
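The score computation and the MOS-LQO mapping described above can be sketched in a few lines of Python. The coefficients are the commonly cited values from ITU-T P.862 (linear combination of the two disturbances) and P.862.1 (logistic mapping); they are reproduced here from memory as an assumption and should be checked against the Recommendations before any serious use. The function names are hypothetical.

```python
import math


def raw_pesq_score(d_sym, d_asym):
    """Combine the aggregated symmetric and asymmetric disturbances.

    Assumption: coefficients as commonly quoted for P.862.
    """
    return 4.5 - 0.1 * d_sym - 0.0309 * d_asym


def mos_lqo(raw_score):
    """Map a raw PESQ score to MOS-LQO with a logistic function.

    Assumption: coefficients as commonly quoted for P.862.1.
    """
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.4945 * raw_score + 4.6607))


# Zero disturbance gives the top raw score of 4.5.
best_raw = raw_pesq_score(0.0, 0.0)
print(best_raw)

# The ends of the raw range map to roughly 1.02 and 4.55 on the MOS-LQO scale.
print(round(mos_lqo(-0.5), 2), round(mos_lqo(4.5), 2))
```

Because the mapping is a monotonic S-curve, it preserves the ranking of systems by raw score while compressing the extremes, which is why MOS-LQO never quite reaches 1 or 5.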

Purpose & Motivation

PESQ was developed to meet the need for an efficient, repeatable, and standardized method of evaluating speech quality in telecommunications, reducing reliance on expensive and time-consuming subjective listening tests. Before objective models like PESQ, the only reliable way to assess the perceptual quality of a speech codec or network path was to conduct formal subjective tests with human listeners, which are costly, slow, and difficult to repeat consistently across different labs and conditions. As digital voice codecs and packet-based networks (like VoIP) proliferated, the industry required a tool for rapid development, optimization, and benchmarking of speech processing algorithms.

The primary motivation was to create an algorithm that could accurately emulate the results of an Absolute Category Rating (ACR) listening test, the gold standard for subjective quality defined in ITU-T P.800. PESQ addressed the shortcomings of earlier objective models (like PSQM), which did not perform well with network impairments such as variable delay and frame erasures that are common in packet networks. By incorporating a more sophisticated perceptual model and robust time alignment, PESQ provided a high correlation with human judgments for a wide range of impairments including coding distortions, packet loss, jitter, and transcoding effects.

Its adoption by standards bodies like 3GPP and ITU-T allowed equipment vendors and network operators to specify and verify speech quality performance in a consistent manner. For example, 3GPP uses PESQ-derived scores to define minimum quality thresholds for voice services over LTE (VoLTE) and 5G (VoNR), ensuring a baseline user experience. It became an indispensable tool for R&D, network planning, quality monitoring, and service level agreement (SLA) verification, enabling the industry to confidently deploy new voice technologies while maintaining or improving perceived call quality.

Key Features

  • Objective prediction of subjective listening quality (MOS)
  • High correlation with ITU-T P.800 subjective listening tests
  • Robust time alignment to handle variable network delays
  • Perceptual modeling based on human auditory system (Bark scale)
  • Evaluation of impairments from codecs, packet loss, and noise
  • Output of raw PESQ score and mapped MOS-LQO value

Evolution Across Releases

Rel-13 Initial

First referenced in 3GPP specifications, notably in TS 22.179 for Mission Critical Push-to-Talk (MCPTT) services. It was adopted as a standardized methodology to define minimum speech quality performance requirements for codecs used in critical communications, ensuring a consistent and measurable quality benchmark for these services.

Defining Specifications

Specification    Title
TS 22.179        3GPP TS 22.179