Description
UTF-8 is a character encoding that maps each Unicode code point to a sequence of 8-bit bytes. It is variable-width, using 1 to 4 bytes per code point. ASCII characters (U+0000 to U+007F) are encoded as a single byte identical to their ASCII representation, which provides full backward compatibility. Code points outside this range use multi-byte sequences in which the first byte indicates the length of the sequence and every continuation byte carries a fixed bit pattern. Because the encoding is defined in terms of single bytes, it has no byte-order (endianness) variants and can be processed one byte at a time.
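As a quick illustration of the variable width and ASCII compatibility described above, the following minimal sketch uses Python's built-in `str.encode`. The sample characters and the helper name `utf8_length` are chosen here purely for illustration.

```python
# One sample character from each UTF-8 length class (illustrative choice):
# U+0041 LATIN CAPITAL LETTER A, U+00E9 e with acute, U+20AC euro sign,
# U+1F600 grinning face.
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    print(f"U+{ord(ch):04X} -> {hex_bytes} ({len(encoded)} byte(s))")

# ASCII-only text is byte-for-byte identical in ASCII and UTF-8.
ascii_text = "3GPP"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
```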
In 3GPP standards, UTF-8 is specified across multiple technical specifications (TS) for various services. For example, TS 26.140 (Multimedia Messaging Service; Media formats and codecs) and TS 26.141 (IMS Messaging and Presence; Media formats and codecs) define its use for text in messaging and presence information. TS 26.234 (Transparent end-to-end packet-switched streaming service; Protocols and codecs), TS 26.245 (timed text format), TS 26.246 (SMIL language profile), and TS 26.247 (progressive download and DASH over HTTP) specify UTF-8 for metadata, session description, and text tracks. The encoding is used in protocols such as SIP and HTTP and within multimedia containers to ensure that text data is universally interpretable.
The encoding works by splitting the bits of the Unicode code point value and distributing them across the bytes according to a defined pattern. A single-byte character has its high bit set to 0. In a multi-byte sequence, the first byte begins with as many 1 bits as there are bytes in the sequence (two to four), followed by a 0; each continuation byte begins with the bits '10' and carries six bits of the code point. Because the first byte and the continuation bytes are distinguishable from their leading bits alone, sequences are easy to validate and parse. Within the 3GPP network architecture, UTF-8 encoded text is typically carried in the payload of application-layer protocols. Its role is crucial for services that require text interchange, such as MMS, IMS messaging, and streaming services, because it supports global languages while remaining efficient for ASCII-heavy text and compatible with existing internet infrastructure.
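The bit-distribution rule described above can be sketched as a hand-written encoder. This is a minimal illustration, not a reference implementation: the function names `encode_code_point` and `sequence_length` are hypothetical, and `sequence_length` inspects only the prefix bits of the first byte, so it is not a complete validator (it does not reject overlong forms, for instance).

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by distributing its bits."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a valid Unicode scalar value: {cp:#x}")
    if cp <= 0x7F:
        return bytes([cp])                          # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),             # 110xxxxx
                      0x80 | (cp & 0x3F)])          # 10xxxxxx
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),            # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),    # 10xxxxxx
                      0x80 | (cp & 0x3F)])          # 10xxxxxx
    return bytes([0xF0 | (cp >> 18),                # 11110xxx
                  0x80 | ((cp >> 12) & 0x3F),       # 10xxxxxx
                  0x80 | ((cp >> 6) & 0x3F),        # 10xxxxxx
                  0x80 | (cp & 0x3F)])              # 10xxxxxx


def sequence_length(first_byte: int) -> int:
    """Return the total sequence length implied by the first byte's prefix."""
    if first_byte & 0x80 == 0x00:
        return 1
    if first_byte & 0xE0 == 0xC0:
        return 2
    if first_byte & 0xF0 == 0xE0:
        return 3
    if first_byte & 0xF8 == 0xF0:
        return 4
    # Prefix 10xxxxxx is a continuation byte; 11111xxx is never valid.
    raise ValueError(f"not a valid first byte: {first_byte:#04x}")


euro = encode_code_point(0x20AC)  # U+20AC EURO SIGN
print(euro.hex(), sequence_length(euro[0]))
```

Comparing the result against Python's built-in encoder (`chr(cp).encode("utf-8")`) is a convenient way to check a hand-written encoder of this kind.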
Purpose & Motivation
UTF-8 was developed to provide a Unicode encoding that is backward compatible with the widely used ASCII standard and efficient for network transmission. Before Unicode, multiple incompatible encodings (such as the ISO-8859 series) caused interoperability issues, especially on the internet. UTF-8, created by Ken Thompson and Rob Pike, offered a solution in which ASCII text remains valid UTF-8, easing adoption. Its design minimizes overhead for English and other Latin-script languages while still being able to encode all Unicode characters.
3GPP adopted UTF-8 starting from Release 8 to align with internet protocols and ensure seamless integration with web services. As mobile networks evolved to support IP-based services (IMS, streaming), using UTF-8 became essential for protocols like SIP and HTTP that dominate internet communication. It solved the problem of text corruption when exchanging messages between different systems and regions. For multimedia services, UTF-8 allowed metadata and subtitles to be encoded efficiently. This is particularly beneficial where text is predominantly ASCII, since UTF-8 then needs one byte per character, whereas UTF-16 needs at least two and the fixed-width UTF-32 needs four.
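The bandwidth argument can be checked with a small size comparison. The sketch below uses Python's built-in codecs; the helper name `encoded_sizes` and the sample metadata string are hypothetical and chosen only for illustration.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded size of text in bytes for three Unicode encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The "-be" codecs are used so that no byte-order mark is counted.
        "utf-16": len(text.encode("utf-16-be")),
        "utf-32": len(text.encode("utf-32-be")),
    }


# Hypothetical ASCII-only metadata string.
sample = "title=News clip; lang=en"
print(encoded_sizes(sample))
```

For ASCII-only input, UTF-8 uses half the bytes of UTF-16 and a quarter of UTF-32. The advantage is not universal: characters in the range U+0800 to U+FFFF, which includes most CJK text, take three bytes in UTF-8 but two in UTF-16.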
The motivation was driven by the need for a universal, efficient, and robust text encoding for global mobile services. UTF-8's byte-oriented nature avoids byte-order issues, simplifying processing. By specifying UTF-8 in core specs, 3GPP ensured that mobile devices and network elements could interoperate with servers and services on the broader internet, supporting the trend toward all-IP networks and rich communication services.
Detected Changes Across Releases
From 3GPP Change Requests: specific changes extracted from the "Change history" tables of 3GPP specifications (2 CRs across 1 release). This complements the general historical overview above with the evidence-based evolution of this function.
Studied in Rel-8, normative work from Rel-18.
In Release 18, the specification updates for codecs and formats introduced explicit normative references for UTF-8, citing IETF RFC 2279 (which has since been obsoleted by RFC 3629). This formalizes the use of UTF-8 as a character encoding for MMS message bodies, aligning with the existing requirement that any charset used must contain a subset of Unicode characters.
Defining Specifications
3GPP specifications that define or reference UTF-8, with the latest known release. Sourced from the 3GPP document catalog — see methodology.
| Specification | Title | Release |
|---|---|---|
| TS 26.140 vj00 | MMS Media Formats and Codecs Specification | Rel-19 |
| TS 26.141 vj00 | IMS Messaging & Presence Media Formats | Rel-19 |
| TS 26.234 vj00 | 3GPP PSS Protocols and Codecs Specification | Rel-19 |
| TS 26.245 vj00 | 3GPP Timed Text Format Specification | Rel-19 |
| TS 26.246 vj00 | 3GPP SMIL Language Profile Specification | Rel-19 |
| TS 26.247 vj00 | 3GPP Progressive Download & DASH over HTTP | Rel-19 |