UTF-8

Unicode Transformation Format - 8-bit

Other
Introduced in Rel-8
UTF-8 is a variable-length character encoding for Unicode that uses one to four 8-bit bytes per character. It is widely used in 3GPP for messaging, presence, and multimedia services due to its backward compatibility with ASCII and efficiency for text dominated by Latin scripts, making it fundamental for internet-based communication.

Description

UTF-8 is a character encoding that maps each Unicode code point to a sequence of one to four 8-bit bytes. ASCII characters (U+0000 to U+007F) are encoded as a single byte identical to their ASCII representation, which provides full backward compatibility. Characters outside this range use multi-byte sequences in which the first byte indicates how many continuation bytes follow, and every continuation byte carries a fixed marker bit pattern. Because the encoding is defined as a sequence of bytes rather than as multi-byte integers, it has no byte-order (endianness) variants.
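As a quick illustration of the variable width, the following Python sketch encodes one character from each length class using the standard library encoder and prints the resulting bytes. The sample characters and the helper name utf8_hex are chosen here for illustration only; they do not come from any 3GPP specification.

```python
# Sample characters chosen for illustration: one per UTF-8 length class.
SAMPLES = ["A", "é", "€", "😀"]


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated hex (hypothetical helper)."""
    return ch.encode("utf-8").hex(" ")


for ch in SAMPLES:
    encoded = ch.encode("utf-8")
    # Show the code point, the encoded length, and the raw bytes.
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {utf8_hex(ch)}")
```

The ASCII letter comes out as the single byte 41, unchanged from its ASCII value, while the other samples take two, three, and four bytes respectively.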

In 3GPP standards, UTF-8 is specified across multiple technical specifications (TS) for various services. For example, TS 26.140 (Multimedia Messaging Service (MMS); Media formats and codecs) and TS 26.141 (IP Multimedia System (IMS) Messaging and Presence; Media formats and codecs) define its use for text in messaging and presence information. TS 26.234 (Transparent end-to-end Packet-switched Streaming Service (PSS); Protocols and codecs), together with TS 26.245, TS 26.246, and TS 26.247 from the same streaming family, specify UTF-8 for metadata, session description, and text tracks. The encoding is used in protocols such as SIP and HTTP and within multimedia containers so that text data can be interpreted consistently by all parties.

The encoding works by splitting the bits of the Unicode code point across one or more bytes according to a fixed pattern. A single-byte character has its high bit set to 0 (0xxxxxxx). In a multi-byte sequence, the first byte starts with as many 1 bits as there are bytes in the sequence, followed by a 0 (110xxxxx, 1110xxxx, or 11110xxx), and every continuation byte starts with 10 (10xxxxxx). The remaining bit positions carry the code point value, most significant bits first. This structure makes validation and parsing straightforward. Within the 3GPP network architecture, UTF-8 encoded text is typically carried in the payload of application-layer protocols. It is central to services that exchange text, such as MMS, IMS messaging, and streaming, because it supports all languages while remaining compact for ASCII-heavy text and compatible with existing internet infrastructure.
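The bit distribution described in this section can be sketched as a small hand-written encoder. This is a minimal illustration in Python, not an implementation taken from any 3GPP specification; the function name encode_code_point is hypothetical, and in practice the built-in str.encode should be used instead.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch only)."""
    # Surrogates and values above U+10FFFF are not encodable in UTF-8.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp <= 0x7F:
        # 0xxxxxxx: 7 payload bits, identical to ASCII.
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx: 11 payload bits.
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx: 16 payload bits.
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 21 payload bits.
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Cross-check the sketch against Python's built-in encoder.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert encode_code_point(cp) == chr(cp).encode("utf-8")
```

Each branch fills the lead byte's prefix first and then emits the remaining bits six at a time, which is why every continuation byte is masked with 0x3F and tagged with 0x80.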

Purpose & Motivation

UTF-8 was developed to provide a Unicode encoding that is backward compatible with the widely used ASCII standard and efficient for network transmission. Before Unicode, multiple incompatible encodings (such as the ISO-8859 series) caused interoperability problems, especially on the internet. UTF-8, created by Ken Thompson and Rob Pike, offered a solution in which ASCII text remains valid UTF-8, easing adoption. Its design minimizes overhead for English and other Latin-script languages while still being able to encode every Unicode character.

3GPP adopted UTF-8 starting from Release 8 to align with internet protocols and ensure seamless integration with web services. As mobile networks evolved to support IP-based services (IMS, streaming), using UTF-8 became essential for protocols like SIP and HTTP that dominate internet communication. A single agreed encoding reduced the risk of text corruption when messages are exchanged between different systems and regions. For multimedia services, UTF-8 allowed metadata and subtitles to be encoded compactly. This is particularly beneficial where text is predominantly ASCII, since each ASCII character takes one byte in UTF-8 but at least two bytes in UTF-16.

The motivation was driven by the need for a universal, efficient, and robust text encoding for global mobile services. UTF-8's byte-oriented nature avoids byte-order issues, simplifying processing. By specifying UTF-8 in core specs, 3GPP ensured that mobile devices and network elements could interoperate with servers and services on the broader internet, supporting the trend toward all-IP networks and rich communication services.

Key Features

  • Variable-width encoding using 1 to 4 bytes per character
  • Full backward compatibility with ASCII (ASCII is a subset of UTF-8)
  • Byte-order agnostic, eliminating the need for Byte Order Marks (BOMs) in most contexts
  • Widely used in 3GPP for messaging (MMS), presence, streaming protocols, and metadata
  • Efficient for text with many ASCII characters, reducing size compared to UTF-16 for such content
  • Self-synchronizing design allows recovery from partial data streams and easy validation
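
Two of the features listed above, self-synchronization and compactness for ASCII-heavy text, can be illustrated with a short Python sketch. The helper names (next_boundary, resync_and_decode, size_pair) and the sample strings are hypothetical and chosen only for this example. The sketch handles a fragment that starts in the middle of a character; a fragment that is also cut off at the end would need similar trimming there.

```python
def next_boundary(data: bytes, pos: int = 0) -> int:
    """Return the first index >= pos that is not a continuation byte (10xxxxxx)."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


def resync_and_decode(fragment: bytes) -> str:
    """Skip any leading partial character, then decode the rest as UTF-8."""
    start = next_boundary(fragment)
    return fragment[start:].decode("utf-8")


def size_pair(text: str) -> tuple[int, int]:
    """Return (UTF-8 size, UTF-16 size) in bytes; no BOM is included."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


full = "naïve €5".encode("utf-8")
# Drop the first 3 bytes so the fragment begins inside the 2-byte "ï".
fragment = full[3:]
print(resync_and_decode(fragment))

# A pure-ASCII header line takes half the space in UTF-8.
print(size_pair("Content-Type: text/plain"))
```

Because continuation bytes are always recognizable by their 10 prefix, a receiver never needs to look back further than the current byte to find the next character boundary.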

Evolution Across Releases

Rel-8 Initial

UTF-8 was introduced in 3GPP Release 8 across several specifications, including TS 26.140 for MMS media formats, TS 26.141 for presence data, and TS 26.234 for streaming protocols. These specifications established it as a mandatory or recommended encoding for text in these services, enabling interoperability with internet standards and supporting global character sets for emerging IP-based multimedia applications.

Defining Specifications

Specification    Title
TS 26.140        3GPP TS 26.140
TS 26.141        3GPP TS 26.141
TS 26.234        3GPP TS 26.234
TS 26.245        3GPP TS 26.245
TS 26.246        3GPP TS 26.246
TS 26.247        3GPP TS 26.247