What is UTF-8? Unicode Transformation Format - 8-bit

Description

UTF-8 is a character encoding that maps Unicode code points to a sequence of 8-bit bytes. It is variable-width, using 1 to 4 bytes to represent a character. ASCII characters (U+0000 to U+007F) are encoded as a single byte, identical to their ASCII representation, providing full backward compatibility. Characters outside this range use multi-byte sequences where the first byte indicates the number of continuation bytes, and continuation bytes have a specific bit pattern. This design allows efficient processing and avoids issues with byte order, as UTF-8 is byte-order agnostic.
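The variable-width behaviour described above can be observed with a short Python sketch. The sample characters are chosen here purely for illustration (one per length class) and are not taken from any 3GPP specification.

```python
# Illustrative samples: one character from each UTF-8 length class.
samples = ["A", "é", "€", "😀"]

# Number of bytes each sample occupies once encoded as UTF-8.
lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, lengths):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {n} byte(s): {encoded.hex(' ')}")

# ASCII text is byte-for-byte identical in ASCII and in UTF-8.
ascii_compatible = "MMS".encode("ascii") == "MMS".encode("utf-8")

# UTF-16 yields different bytes depending on byte order; UTF-8 has one form.
utf16_order_matters = "é".encode("utf-16-le") != "é".encode("utf-16-be")
```

The last two checks mirror the backward-compatibility and byte-order points: an ASCII string encodes identically either way, while UTF-16 needs the byte order to be known.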

In 3GPP standards, UTF-8 is specified across multiple technical specifications (TS) for various services. For example, TS 26.140 (Multimedia Messaging Service; Media formats and codecs) and TS 26.141 (IMS Messaging and Presence; Media formats and codecs) define its use for text in messaging and presence information. TS 26.234 (Transparent end-to-end packet-switched streaming service; Protocols and codecs), TS 26.245 (timed text format), TS 26.246 (SMIL language profile), and TS 26.247 (progressive download and DASH over HTTP) specify UTF-8 for metadata, session description, and text tracks. The encoding is used in protocols like SIP and HTTP and within multimedia containers so that text data can be interpreted consistently across systems.

The encoding works by dividing the Unicode code point value into bits and distributing them across the bytes according to a defined pattern. A single-byte character has the high bit set to 0. For multi-byte characters, the first byte has several high bits set to 1 followed by a 0, indicating the total number of bytes, and continuation bytes start with '10'. This structure allows easy validation and parsing. Within the 3GPP network architecture, UTF-8 encoded text is typically carried in the payload of application-layer protocols. Its role is crucial for services requiring text interchange, such as MMS, IMS messaging, and streaming services, as it supports global languages while being efficient for ASCII-heavy text and compatible with existing internet infrastructure.
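The bit distribution described above can be sketched by hand in Python. The helper names `utf8_encode` and `sequence_length` are hypothetical and exist only for this illustration; in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by hand (illustrative helper)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0x7F:
        return bytes([code_point])                        # 0xxxxxxx
    if code_point <= 0x7FF:
        return bytes([0xC0 | (code_point >> 6),           # 110xxxxx
                      0x80 | (code_point & 0x3F)])        # 10xxxxxx
    if code_point <= 0xFFFF:
        return bytes([0xE0 | (code_point >> 12),          # 1110xxxx
                      0x80 | ((code_point >> 6) & 0x3F),  # 10xxxxxx
                      0x80 | (code_point & 0x3F)])        # 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),              # 11110xxx
                  0x80 | ((code_point >> 12) & 0x3F),     # 10xxxxxx
                  0x80 | ((code_point >> 6) & 0x3F),      # 10xxxxxx
                  0x80 | (code_point & 0x3F)])            # 10xxxxxx


def sequence_length(lead_byte: int) -> int:
    """Return the total sequence length announced by a lead byte.

    Sketch only: it checks the prefix bits but does not reject lead
    bytes that are always invalid (0xC0, 0xC1, 0xF5 to 0xF7).
    """
    if lead_byte & 0x80 == 0x00:
        return 1
    if lead_byte & 0xE0 == 0xC0:
        return 2
    if lead_byte & 0xF0 == 0xE0:
        return 3
    if lead_byte & 0xF8 == 0xF0:
        return 4
    raise ValueError("continuation byte or invalid lead byte")
```

For example, `utf8_encode(0x20AC)` splits the euro sign's 16 bits into groups of 4, 6, and 6, and `sequence_length` recovers the length 3 from the first byte alone. A full validator would also need to reject overlong forms and truncated sequences.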

Purpose & Motivation

UTF-8 was developed to provide a Unicode encoding that is backward compatible with the widely used ASCII standard and efficient for network transmission. Before Unicode, multiple incompatible encodings (like the ISO-8859 series) caused interoperability issues, especially on the internet. UTF-8, created by Ken Thompson and Rob Pike, offered a solution in which ASCII text remains valid UTF-8, easing adoption. Its design minimizes overhead for English and other Latin-script languages while still being capable of encoding all Unicode characters.

3GPP adopted UTF-8 starting from Release 8 to align with internet protocols and ensure seamless integration with web services. As mobile networks evolved to support IP-based services (IMS, streaming), using UTF-8 became essential for protocols like SIP and HTTP that dominate internet communication. It solved the problem of text corruption when exchanging messages between different systems and regions. For multimedia services, UTF-8 allowed metadata and subtitles to be efficiently encoded. This is particularly beneficial where text is predominantly ASCII, because each ASCII character takes one byte in UTF-8 but at least two bytes in UTF-16.
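The bandwidth point can be checked with a small comparison. The two sample strings below are made up for illustration: one ASCII-heavy header-style line and one short Japanese phrase.

```python
# Illustrative texts (assumptions, not taken from any specification).
texts = {
    "ascii_heavy": "Content-Type: text/plain; charset=utf-8",
    "cjk": "日本語のテキスト",
}

# Map each label to (UTF-8 size, UTF-16 size) in bytes.
# "utf-16-le" is used so that no byte order mark is counted.
sizes = {
    label: (len(text.encode("utf-8")), len(text.encode("utf-16-le")))
    for label, text in texts.items()
}

for label, (utf8_size, utf16_size) in sizes.items():
    print(f"{label}: UTF-8 = {utf8_size} bytes, UTF-16 = {utf16_size} bytes")
```

The ASCII-heavy line is half the size in UTF-8, while the Japanese phrase is larger in UTF-8 than in UTF-16, so the saving depends on the text being mostly ASCII.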

The motivation was driven by the need for a universal, efficient, and robust text encoding for global mobile services. UTF-8's byte-oriented nature avoids byte-order issues, simplifying processing. By specifying UTF-8 in core specs, 3GPP ensured that mobile devices and network elements could interoperate with servers and services on the broader internet, supporting the trend toward all-IP networks and rich communication services.

Classification

Specific types: UCS

Related approaches

Detected Changes Across Releases

from 3GPP Change Requests

Specific changes extracted from the "Change history" tables of 3GPP specifications (2 CRs across 1 release). Complements the general historical overview above with the evidence-based evolution of this function.

Studied in Rel-8, normative work from Rel-18.

Rel-18: 2 changes

In Release 18, the specification updates for codecs and formats introduced explicit normative references for UTF-8, citing IETF RFC 2279. This formalizes the use of UTF-8 as a character encoding for MMS message bodies, aligning with the existing requirement that any charset used must contain a subset of Unicode characters.

CR 26.140-0021r7: Updates to codecs and formats (Rel-18), TS 26.140 CR 0021
CR 26.141-0011r2: Updates to codecs and formats (Rel-18), TS 26.141 CR 0011

Explore further

Broader topics and technologies where UTF-8 plays a role.

Topics

SON (Self-Organizing Networks)
IMS & Voice (VoLTE, VoNR)
Lawful Intercept
SMS & Messaging
Services & Applications
Protocols & Interfaces

Defining Specifications

3GPP specifications that define or reference UTF-8, with the latest known release. Sourced from the 3GPP document catalog.

Specification	Title	Release
TS 26.140 vj00	MMS Media Formats and Codecs Specification	Rel-19
TS 26.141 vj00	IMS Messaging & Presence Media Formats	Rel-19
TS 26.234 vj00	3GPP PSS Protocols and Codecs Specification	Rel-19
TS 26.245 vj00	3GPP Timed Text Format Specification	Rel-19
TS 26.246 vj00	3GPP SMIL Language Profile Specification	Rel-19
TS 26.247 vj00	3GPP Progressive Download & DASH over HTTP	Rel-19