What is DSR? Distributed Speech Recognition

Description

Distributed Speech Recognition (DSR) is a client-server architecture designed to provide accurate speech recognition services over mobile networks. It operates by dividing the recognition task between the User Equipment (UE) and a remote recognition server in the network. The UE runs the 'front-end' processing: it captures the audio via the microphone, applies acoustic preprocessing (noise suppression, echo cancellation), and then extracts a set of compact parametric representations (features) of the speech signal, typically Mel-Frequency Cepstral Coefficients (MFCCs). These features are then encoded using a standardized, bit-efficient codec and transmitted over the data channel to the network.
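
The feature-extraction step described above can be sketched in a few lines of plain Python. This is only an illustrative MFCC pipeline (windowing, power spectrum, mel filterbank, log, DCT), not the standardized DSR front-end, which additionally specifies noise reduction and feature compression. The frame length (25 ms), hop (10 ms), 8 kHz sample rate, 23 filters and 13 coefficients are assumptions chosen for the example, and all function names are hypothetical.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (slow, but fine for a sketch)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        spec.append((re * re + im * im) / n)
    return spec

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mel_points = [low + (high - low) * i / (num_filters + 1)
                  for i in range(num_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate))
            for m in mel_points]
    bank = []
    for j in range(1, num_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(left, centre):      # rising edge of the triangle
            filt[k] = (k - left) / (centre - left)
        for k in range(centre, right):     # falling edge of the triangle
            filt[k] = (right - k) / (right - centre)
        bank.append(filt)
    return bank

def mfcc_frames(samples, sample_rate=8000, frame_len=200, hop=80,
                num_filters=23, num_ceps=13):
    """Return one list of cepstral coefficients per analysis frame."""
    bank = mel_filterbank(num_filters, frame_len, sample_rate)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]   # Hamming window
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_len)]
        spec = power_spectrum(frame)
        # Floor the filter energies so the logarithm is always defined.
        log_energies = [math.log(max(sum(f * s for f, s in zip(filt, spec)), 1e-10))
                        for filt in bank]
        # DCT-II turns the log filterbank energies into cepstral coefficients.
        ceps = [sum(e * math.cos(math.pi * c * (j + 0.5) / num_filters)
                    for j, e in enumerate(log_energies))
                for c in range(num_ceps)]
        features.append(ceps)
    return features

# Usage: 50 ms of a 1 kHz tone sampled at 8 kHz.
tone = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(400)]
feats = mfcc_frames(tone)
print(len(feats), "frames of", len(feats[0]), "coefficients")
```

In a real front-end the per-frame vectors would then be compressed and packetized before transmission; the sketch stops at the uncompressed features.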

In the network, a dedicated DSR server receives the feature stream. This server hosts the 'back-end' recognition engine, which includes the acoustic models, pronunciation dictionaries, and language models. The server decodes the feature stream and uses statistical pattern matching (such as Hidden Markov Models or deep neural networks) to convert the features into a text string or a semantic command. The result is then sent back to the UE or to another application server. This separation is key: it allows the computationally intensive and memory-heavy modeling and search processes to reside on powerful, updatable servers, while the UE handles the lighter, standardized front-end.
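
To make the server-side role concrete, the following toy sketch matches a received feature stream against stored reference sequences. It deliberately swaps in dynamic time warping (DTW) template matching for the HMM or neural-network models a production back-end would use, because DTW fits in a few lines. The two-dimensional feature vectors, the labels and the function names are all hypothetical.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b):
    """Accumulated DTW alignment cost between two feature sequences."""
    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b
                                 cost[i][j - 1],      # stretch seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[rows][cols]

def recognise(feature_stream, templates):
    """Return the label whose template is closest to the received features."""
    return min(templates,
               key=lambda label: dtw_distance(feature_stream, templates[label]))

# Hypothetical reference templates held on the server.
templates = {
    "call": [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]],
    "stop": [[3.0, 3.0], [3.0, 2.0], [2.0, 3.0]],
}

# Feature stream as it might arrive from the UE (slightly noisy, one extra frame).
received = [[0.1, 0.9], [0.9, 1.1], [1.0, 1.0], [2.1, 0.1]]
result = recognise(received, templates)
print(result)  # prints: call
```

Because the templates and the matcher live only on the server, they can be replaced or enlarged without touching the UE, which is the point of the split described above.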

DSR's role is to deliver a consistent, high-accuracy recognition experience independent of the UE's processing power and the varying quality of the audio channel. By transmitting only features (a few kbit/s) instead of the full audio stream (e.g., 64 kbit/s for PCM), it conserves bandwidth and is more robust to transmission errors. It also avoids the distortions that low-bitrate voice codecs introduce, which would degrade recognition if the server had to work from decoded audio. It is a service enabler for network-based voice assistants, automated voice dialing, and voice-controlled services in vehicles.
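
The bandwidth comparison above is simple arithmetic. The sketch below works it through under stated assumptions: 8-bit PCM at 8000 samples per second for the audio stream, and a hypothetical budget of 48 bits per 10 ms feature frame (100 frames per second, overhead included) for the feature stream. The per-frame budget is an illustrative figure, not a value taken from this document.

```python
def bitrate_kbps(bits_per_unit, units_per_second):
    """Bitrate in kbit/s for a stream of fixed-size units."""
    return bits_per_unit * units_per_second / 1000.0

# Full audio: 8-bit PCM samples at 8 kHz.
pcm_kbps = bitrate_kbps(bits_per_unit=8, units_per_second=8000)

# Feature stream: assumed 48 bits per frame, 100 frames per second.
feature_kbps = bitrate_kbps(bits_per_unit=48, units_per_second=100)

saving_ratio = pcm_kbps / feature_kbps

print(f"PCM audio:      {pcm_kbps:.1f} kbit/s")
print(f"Feature stream: {feature_kbps:.1f} kbit/s")
print(f"Reduction:      {saving_ratio:.1f}x")
```

With these assumed figures the feature stream needs roughly one thirteenth of the PCM bitrate.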

Purpose & Motivation

DSR was created to solve the problem of providing high-quality, server-based speech recognition in the variable and sometimes constrained environment of early mobile networks (2G, 3G). Traditional 'server-only' recognition, where the UE sends compressed audio (e.g., using AMR), suffered because the voice codecs were optimized for human listening, not machine recognition. Codec artifacts and transmission errors could significantly degrade recognition accuracy.

The purpose of DSR was to standardize the interface between the mobile device and the recognition server, ensuring interoperability. It also addressed the limitations of device-only recognition: the UE's limited processing power and memory made it impractical to host large-vocabulary or otherwise complex models. By distributing the process, DSR leveraged the network's computational resources to provide a more powerful and updatable service. The standardized front-end, in turn, ensured that the features sent to the server were clean and optimized for recognition rather than listening, improving accuracy and reliability across different networks and devices.

Classification

Part of IMS

Detected Changes Across Releases

from 3GPP Change Requests

Specific changes extracted from the 'Change history' tables of 3GPP specifications (2 CRs across 2 releases). Complements the general historical overview above with the evidence-based evolution of this function.

Rel-18: 1 change

In Release 18, one change request to TS 38.323 (PDCP) addressed data volume calculation for DSR when it is associated with at least two RLC entities. In the NR radio protocol specifications, however, the abbreviation DSR denotes Delay Status Report rather than Distributed Speech Recognition, so this change concerns radio-layer buffer and delay reporting, not the DSR Optimised Codec or the speech recognition framework.

Data volume calculation for DSR when associated with at least two RLC entities (TS 38.323, CR 0133)

Rel-19: 1 change

In Release 19, one change request to TS 38.321 (MAC) corrected the DSR triggering procedure. In the NR MAC specification, DSR abbreviates Delay Status Report rather than Distributed Speech Recognition, so this correction concerns when the UE triggers a delay report for buffered uplink data, not the activation of the DSR Optimised Codec that extracts and encodes acoustic features for server-side speech engines.

Correction on DSR triggering (TS 38.321, CR 2140)

Explore further

Broader topics and technologies where DSR plays a role.

Topics

Lawful Intercept, Services & Applications, Radio Access Network

Defining Specifications

3GPP specifications that define or reference DSR, with the latest known release. Sourced from the 3GPP document catalog.

Specification	Title	Release
TR 22.977 vj00	Speech Enabled Services and Multimodal Framework	Rel-19
TS 26.177 vj00	DSR Extended Advanced Front-end Test Sequences	Rel-19
TS 26.235 vc00	Default Codecs for 3GPP IP Multimedia Subsystem	Rel-12
TS 26.236 vc00	Packet Switched Conversational Multimedia Protocols	Rel-12
TS 26.243 vj00	DSR Extended Advanced Front-end C Code	Rel-19
TR 26.943 vj00	SES Codec Selection Report	Rel-19
TS 38.300 vj00	NG-RAN Overall Description	Rel-19
TS 38.306 vj00	NR UE Radio Access Capability Parameters	Rel-19
TS 38.321 vj00	NR MAC Protocol Specification	Rel-19
TS 38.322 vj00	NR Radio Link Control (RLC) Protocol	Rel-19
TS 38.323 vj00	Packet Data Convergence Protocol (PDCP)	Rel-19
TS 38.331 vj00	NR Radio Resource Control (RRC) Protocol Specification	Rel-19
TR 45.912 vj00	GERAN Evolution Feasibility Study	Rel-19