A code point, codepoint or code position is a unique position in a quantized n-dimensional space that has been assigned a semantic meaning.

In other words, a code point is a particular position in a table, where the position has been assigned a meaning. The table has discrete positions (1, 2, 3, 4, but not fractions) and may be one-dimensional (a column), two-dimensional (like cells in a spreadsheet), three-dimensional (sheets in a workbook), and so on, in any number of dimensions.

Code points are used in many formal information processing and telecommunication standards.^[1]^[2] For example, ITU-T Recommendation T.35^[3] contains a set of country codes for telecommunications equipment (originally fax machines) that allow equipment to indicate its country of manufacture or operation. In T.35, Argentina is represented by the code point 0x07, Canada by 0x20, Gambia by 0x41, and so on.
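
As a minimal sketch, a one-dimensional code table can be modeled as a mapping from integer positions to meanings. The three entries below are the T.35 values quoted above; the names `T35_COUNTRY_CODES` and `country_for` are hypothetical, chosen only for illustration, and the table is deliberately incomplete.

```python
from typing import Optional

# Partial, illustrative table: only the three code points quoted in the text.
T35_COUNTRY_CODES = {
    0x07: "Argentina",
    0x20: "Canada",
    0x41: "Gambia",
}

def country_for(code_point: int) -> Optional[str]:
    """Return the meaning assigned to a code point, or None if this sketch has no entry."""
    return T35_COUNTRY_CODES.get(code_point)

print(country_for(0x20))  # Canada
print(country_for(0x99))  # None: this position is not in the partial table
```

The position itself carries no meaning; the meaning comes entirely from the table that the standard attaches to it.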

In character encoding

Code points are commonly used in character encoding, where a code point is a numerical value that maps to a specific character. In character encoding, a code point usually represents a single grapheme (typically a letter, digit, punctuation mark, or whitespace), but may instead represent a symbol, a control character, or formatting.^[4] The set of all possible code points within a given encoding or character set makes up that encoding's codespace.^[5]^[6]
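
The mapping between characters and numerical values can be observed directly in a language whose strings are Unicode-based. The sketch below uses Python's built-in `ord()` and `chr()`; the helper name `describe` is hypothetical and exists only for this illustration.

```python
def describe(ch: str) -> str:
    """Format the code point of a single character in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"

# Characters are written as escapes so the example does not depend on file encoding.
for ch in ["A", "\u00e9", "\u20ac"]:   # 'A', 'e' with acute accent, euro sign
    code_point = ord(ch)               # character -> code point
    assert chr(code_point) == ch       # code point -> character (round trip)
    print(repr(ch), describe(ch))
```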

For example, the character encoding scheme ASCII comprises 128 code points in the range 0x00 to 0x7F, Extended ASCII comprises 256 code points in the range 0x00 to 0xFF, and Unicode comprises 1,114,112 code points in the range 0x0 to 0x10FFFF. The Unicode code space is divided into seventeen planes (the basic multilingual plane, and 16 supplementary planes), each with 65,536 (= 2¹⁶) code points. Thus the total size of the Unicode code space is 17 × 65,536 = 1,114,112.
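
The plane arithmetic above can be checked with a short sketch. The constant and function names are hypothetical; the only assumption beyond the text is that, because each plane holds 2¹⁶ code points, the plane number is the code point shifted right by 16 bits.

```python
PLANE_SIZE = 2 ** 16                        # 65,536 code points per plane
PLANE_COUNT = 17                            # basic multilingual plane + 16 supplementary planes
CODESPACE_SIZE = PLANE_COUNT * PLANE_SIZE   # total number of Unicode code points
MAX_CODE_POINT = CODESPACE_SIZE - 1         # highest valid code point

def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains the given code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"{code_point:#x} is outside the Unicode code space")
    return code_point >> 16                 # same as code_point // PLANE_SIZE

print(CODESPACE_SIZE)           # 1114112
print(hex(MAX_CODE_POINT))      # 0x10ffff
print(plane_of(0x0041))         # 0  (basic multilingual plane)
print(plane_of(0x1F600))        # 1  (first supplementary plane)
print(plane_of(0x10FFFF))       # 16 (last plane)
```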

In Unicode

In Unicode, a code point is distinct from the sequence of bits that represents it; that sequence is made up of code units. In the UCS-4 encoding, every code point is encoded as a single 4-byte (octet) binary number, while in the UTF-8 encoding, different code points are encoded as sequences of one to four bytes, forming a self-synchronizing code. See comparison of Unicode encodings for details. Code points are normally assigned to abstract characters. An abstract character is not a graphical glyph but a unit of textual data. However, code points may also be left reserved for future assignment (most of the Unicode code space is unassigned), or given other designated functions.^{[citation needed]}
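
The difference between a fixed-width and a variable-width encoding can be sketched as follows. The function name `encoded_lengths` is hypothetical, and Python's `utf-32-be` codec is used here as a stand-in for UCS-4, on the assumption that the two produce the same 4-byte form for every valid code point.

```python
from typing import Tuple

def encoded_lengths(code_point: int) -> Tuple[int, int]:
    """Return (UTF-8 length, UTF-32 length) in bytes for one code point."""
    ch = chr(code_point)
    utf8 = ch.encode("utf-8")        # variable width: one to four bytes
    utf32 = ch.encode("utf-32-be")   # fixed width: always four bytes (no byte order mark)
    return len(utf8), len(utf32)

for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    utf8_len, utf32_len = encoded_lengths(cp)
    print(f"U+{cp:04X}: UTF-8 {utf8_len} byte(s), UTF-32 {utf32_len} bytes")
```

The same code point therefore has several possible bit-level representations, depending on which encoding is chosen.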

The distinction between a code point and the corresponding abstract character is not pronounced in Unicode but is evident for many other encoding schemes, where numerous code pages may exist for a single code space.^{[citation needed]}
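
A minimal sketch of that situation, assuming two legacy 8-bit code pages that ship with Python's standard codecs (`latin-1` and `cp437`): the same code point in the 256-position code space is assigned a different abstract character by each code page. The helper name `decode_byte` is hypothetical.

```python
def decode_byte(value: int, code_page: str) -> str:
    """Interpret a single 8-bit code point under the named code page."""
    return bytes([value]).decode(code_page)

code_point = 0xE9
for code_page in ("latin-1", "cp437"):
    ch = decode_byte(code_point, code_page)
    # Show the Unicode code point of the resulting character for comparison.
    print(f"{code_page}: 0x{code_point:02X} -> U+{ord(ch):04X}")
```

Under `latin-1` the position 0xE9 is a Latin small letter e with acute (U+00E9), while under `cp437` it is a Greek capital letter theta (U+0398).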

History

The concept of a code point dates to the earliest standards for digital information processing and digital telecommunications.

Code points are part of Unicode's solution to a difficult conundrum faced by character encoding developers in the 1980s.^[7] If they added more bits per character to accommodate larger character sets, that design decision would constitute an unacceptable waste of then-scarce computing resources for Latin script users (who constituted the vast majority of computer users at the time), since those extra bits would always be zeroed out for such users.^[8] The code point avoids this problem by breaking the old idea of a direct one-to-one correspondence between characters and particular sequences of bits.

References

^ ETSI TS 101 773 (section 4), https://www.etsi.org/deliver/etsi_ts/101700_101799/101773/01.02.01_60/ts_101773v010201p.pdf
^ RFC4190 (section 1), https://datatracker.ietf.org/doc/html/rfc4190
^ "T.35 : Procedure for the allocation of ITU-T defined codes for non-standard facilities".
^ "The Unicode® Standard Version 11.0 – Core Specification" (PDF). Unicode Consortium. 30 June 2018. p. 23. Archived from the original (PDF) on 19 September 2018. Retrieved 25 December 2018. Format: Invisible but affects neighboring characters; includes line/paragraph separators
^ Unicode. "Glossary of Unicode Terms". unicode.org. Retrieved 20 March 2023.
^ "The Unicode® Standard Version 11.0 – Core Specification" (PDF). Unicode Consortium. 30 June 2018. p. 22. Archived from the original (PDF) on 19 September 2018. Retrieved 25 December 2018. On a computer, abstract characters are encoded internally as numbers. To create a complete character encoding, it is necessary to define the list of all characters to be encoded and to establish systematic rules for how the numbers represent the characters. The range of integers used to code the abstract characters is called the codespace. A particular integer in this set is called a code point. When an abstract character is mapped or assigned to a particular code point in the codespace, it is then referred to as an encoded character.
^ Constable, Peter (13 June 2001). "Understanding Unicode™ - I". NRSI: Computers & Writing Systems. Archived from the original (html) on 16 September 2010. Retrieved 25 December 2018. By the early 1980s, the software industry was starting to recognise the need for a solution to the problems involved with using multiple character encoding standards. Some particularly innovative work was begun at Xerox. The Xerox Star workstation used a multi-byte encoding that allowed it to support a single character set with potentially millions of characters.
^ Mark Davis; Ken Whistler (23 March 2001). "Unicode Technical Standard #10 UNICODE COLLATION ALGORITHM". Unicode Consortium. Archived from the original (html) on 25 August 2001. Retrieved 25 December 2018. 6.2 Large Weight Values
