DSR

Distributed Speech Recognition

Services
Introduced in R99
A service architecture where speech recognition processing is split between the mobile device (front-end) and a network server (back-end). The device extracts and compresses speech features, which are transmitted over the network for final recognition. This enables robust, network-based voice control and dictation services even in limited bandwidth conditions.

Description

Distributed Speech Recognition (DSR) is a client-server architecture designed to provide accurate speech recognition services over mobile networks. It operates by dividing the recognition task between the User Equipment (UE) and a remote recognition server in the network. The UE runs the 'front-end' processing: it captures the audio via the microphone, applies acoustic preprocessing (noise suppression, echo cancellation), and then extracts a set of compact parametric representations (features) of the speech signal, typically Mel-Frequency Cepstral Coefficients (MFCCs). These features are then encoded using a standardized, bit-efficient codec and transmitted over the data channel to the network.
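The front-end stage described above can be sketched in Python. The code below frames the signal and computes MFCC-style features with NumPy; the parameter choices (8 kHz audio, 25 ms frames, 10 ms step, 23 mel filters, 13 cepstral coefficients) echo the ETSI ES 201 108 front-end, but this is a simplified illustration, not the standardized algorithm (it omits, for example, the standard's noise suppression and exact quantization).

```python
import numpy as np

def mfcc_frontend(signal, sample_rate=8000, frame_ms=25, step_ms=10,
                  n_filters=23, n_ceps=13):
    """Toy DSR-style front-end: frame the speech and extract MFCCs.

    Illustrative sketch only; parameter values mirror ETSI ES 201 108
    but this is not the standardized algorithm.
    """
    # Pre-emphasis boosts high frequencies, flattening the speech spectrum.
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # Split into overlapping frames and apply a Hamming window.
    frame_len = sample_rate * frame_ms // 1000
    step = sample_rate * step_ms // 1000
    n_frames = 1 + (len(x) - frame_len) // step
    frames = np.stack([x[i * step : i * step + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)

    # Power spectrum of each frame.
    n_fft = 512
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # Triangular mel filterbank between 0 Hz and the Nyquist frequency.
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel2hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)

    # Log filterbank energies, then a DCT-II decorrelates them into MFCCs.
    log_e = np.log(spec @ fbank.T + 1e-10)
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                  (2 * k + 1) / (2 * n_filters)))
    return log_e @ dct.T   # shape: (n_frames, n_ceps)
```

For a one-second 8 kHz input this yields 98 frames of 13 coefficients each, i.e. one compact feature vector per 10 ms of speech.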

In the network, a dedicated DSR server receives the feature stream. This server hosts the 'back-end' recognition engine, which includes the acoustic models, pronunciation dictionaries, and language models. The server decodes the feature stream and uses statistical pattern matching (like Hidden Markov Models or deep neural networks) to convert the features into a text string or a semantic command. The result is then sent back to the UE or to another application server. This separation is key; it allows the computationally intensive and memory-heavy modeling and search processes to reside on powerful, updatable servers, while the UE handles the lighter, standardized front-end.
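As a toy illustration of the statistical pattern matching the back-end performs, the sketch below runs Viterbi decoding over a two-state HMM. The states (silence/speech) and all probabilities are invented for the example; a real recognition engine searches vastly larger acoustic and language models.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely HMM state path for a discrete observation sequence."""
    T, N = len(obs), len(pi)
    back = np.zeros((T, N), dtype=int)
    delta = np.log(pi) + np.log(B[:, obs[0]])     # initial log scores
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)       # N x N transition scores
        back[t] = scores.argmax(axis=0)           # best predecessor per state
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                 # backtrack
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Two hypothetical states (0 = silence, 1 = speech) observing quantized
# frame energies (0 = low, 1 = high); all probabilities are made up.
pi = np.array([0.9, 0.1])
A = np.array([[0.8, 0.2], [0.2, 0.8]])   # transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # emission probabilities
print(viterbi([0, 0, 1, 1, 1, 0], pi, A, B))   # [0, 0, 1, 1, 1, 0]
```

The log-domain scores avoid numerical underflow on long feature streams, which is the standard trick in practical decoders.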

DSR's role is to deliver a consistent, high-accuracy recognition experience independent of the UE's processing power and of the varying quality of the audio channel. By transmitting only features (a few kbit/s) instead of the full audio stream (e.g., 64 kbit/s for telephony PCM), it conserves bandwidth, and it sidesteps the low-bit-rate voice-codec distortions and transmission errors that degrade recognition accuracy when the server must work from decoded telephony audio. It is a service enabler for network-based voice assistants, automated voice dialing, and voice-controlled services in vehicles.
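A quick back-of-envelope calculation makes the bandwidth argument concrete. The 44-bit-per-frame figure below is the commonly cited quantized payload of the ETSI ES 201 108 front-end and should be read as an assumption here.

```python
# Back-of-envelope bitrate comparison for a DSR feature stream vs PCM audio.
frame_rate = 100                       # feature frames per second (10 ms step)
bits_per_frame = 44                    # quantized 13 MFCCs + log-energy (assumed)
dsr_bps = frame_rate * bits_per_frame  # 4400 bit/s of features

pcm_bps = 8000 * 8                     # 64 kbit/s G.711 telephony PCM
print(dsr_bps, pcm_bps, pcm_bps // dsr_bps)   # 4400 64000 14
```

Even before protocol overhead, the feature stream is more than an order of magnitude leaner than uncompressed telephony audio.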

Purpose & Motivation

DSR was created to solve the problem of providing high-quality, server-based speech recognition in the variable and sometimes constrained environment of early mobile networks (2G, 3G). Traditional 'server-only' recognition, where the UE sends compressed audio (e.g., using AMR), suffered because the voice codecs were optimized for human listening, not machine recognition. Codec artifacts and transmission errors could significantly degrade recognition accuracy.

The purpose of DSR was to standardize the interface between the mobile device and the recognition server, ensuring interoperability. It addressed the limitations of device-only recognition, which was constrained by the UE's limited processing and memory, making it impossible to host large vocabulary or complex models. By distributing the process, DSR leveraged the network's computational resources to provide a more powerful and updatable service, while the standardized front-end ensured the features sent to the server were clean and optimized for recognition, not listening, thus improving overall accuracy and reliability across different networks and devices.

Key Features

  • Standardized front-end feature extraction (ETSI ES 201 108 / 3GPP TS 26.243)
  • Robust transmission of speech features over error-prone channels
  • Separation of acoustic processing (client) from linguistic decoding (server)
  • Bandwidth efficiency compared to transmitting full-bandwidth audio
  • Independence from the voice telephony codec and its artifacts
  • Support for server-side updates to acoustic and language models
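
The feature stream itself must also be framed for transport (RFC 3557 defines an RTP payload format for ES 201 108 features). The toy packer below illustrates the idea of quantizing a feature vector and protecting it with a CRC so the server can detect channel errors; the layout is invented for illustration and is not the standardized bit packing.

```python
import struct
import zlib

def pack_feature_frame(features, seq):
    """Toy uplink packet: byte-quantize a feature vector and append a CRC.

    Illustrative layout only -- the real ETSI payload (carried over RTP
    per RFC 3557) uses much tighter bit-level packing of the features.
    """
    # Crude scalar quantizer: clamp each value into an unsigned byte.
    q = bytes(min(255, max(0, int(f * 4 + 128))) for f in features)
    header = struct.pack("!HB", seq, len(q))         # sequence number, length
    crc = struct.pack("!I", zlib.crc32(header + q))  # error detection
    return header + q + crc

# One 14-value feature vector (13 MFCCs + log-energy) per 10 ms frame:
pkt = pack_feature_frame([1.5, -2.0, 0.25] + [0.0] * 11, seq=7)
# 3-byte header + 14 feature bytes + 4-byte CRC = 21 bytes
```

On the server side, recomputing the CRC over the received header and payload reveals corrupted frames, which the back-end can then drop or interpolate over.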

Evolution Across Releases

R99 Initial

Initially standardized to enable voice-driven services over circuit-switched data channels. The architecture defined the basic split between terminal-based front-end feature extraction and network-based back-end recognition. The specifications defined the feature extraction algorithm and the packet format for transporting the features over the network.

Defining Specifications

Specification     Title
3GPP TS 22.977    Speech enabled services
3GPP TS 26.177    Speech Enabled Services (SES); DSR extended advanced front-end test sequences
3GPP TS 26.235    Packet switched conversational multimedia applications; Default codecs
3GPP TS 26.236    Packet switched conversational multimedia applications; Transport protocols
3GPP TS 26.243    ANSI C code for the fixed-point distributed speech recognition extended advanced front-end
3GPP TR 26.943    Recognition performance evaluations of codecs for speech enabled services
3GPP TS 38.300    NR; Overall description; Stage 2
3GPP TS 38.306    NR; User Equipment (UE) radio access capabilities
3GPP TS 38.321    NR; Medium Access Control (MAC) protocol specification
3GPP TS 38.322    NR; Radio Link Control (RLC) protocol specification
3GPP TS 38.323    NR; Packet Data Convergence Protocol (PDCP) specification
3GPP TS 38.331    NR; Radio Resource Control (RRC); Protocol specification
3GPP TR 45.912    Feasibility study for evolved GSM/EDGE Radio Access Network (GERAN)