Unicode
Alias(es)	Universal Coded Character Set (UCS), ISO/IEC 10646
Language(s)	See list of scripts
Standard	Unicode Standard
Encoding formats	UTF-8, UTF-16, GB18030, UTF-32, BOCU, SCSU (uncommon), UTF-7, UTF-1 (obsolete)
Preceded by	ISO/IEC 8859, various others

Unicode, formally The Unicode Standard,^[note 1] is a text encoding standard maintained by the Unicode Consortium and designed to support the use of text in all of the world's writing systems that can be digitized. Version 15.1 of the standard^[A] defines 149,813 characters^[3] and 161 scripts used in various ordinary, literary, academic, and technical contexts.

Many common characters, including numerals, punctuation, and other symbols, are unified within the standard and are not treated as specific to any given writing system. Unicode encodes thousands of emoji, whose continued development is conducted by the Consortium as part of the standard.^[4] The widespread adoption of Unicode was in large part responsible for the initial popularization of emoji outside of Japan. Unicode is ultimately capable of encoding more than 1.1 million characters.
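The figure of more than 1.1 million follows from the size of the codespace. The following is a minimal sketch, assuming the codespace layout of 17 planes of 65,536 code points each (a detail not stated in the paragraph above); the variable names are illustrative only.

```python
import sys

PLANES = 17                      # assumption: planes 0 through 16
CODE_POINTS_PER_PLANE = 0x10000  # 65,536 code points in each plane

# Total number of code points, U+0000 through U+10FFFF.
codespace_size = PLANES * CODE_POINTS_PER_PLANE

# Surrogate code points (U+D800..U+DFFF) are reserved for UTF-16 and
# cannot represent characters on their own.
surrogate_count = 0xDFFF - 0xD800 + 1
scalar_value_count = codespace_size - surrogate_count

print(f"{codespace_size:,}")      # 1,114,112
print(f"{scalar_value_count:,}")  # 1,112,064

# Python strings cover the same range: the highest code point is U+10FFFF.
print(sys.maxunicode == codespace_size - 1)
```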

Unicode has largely supplanted the earlier patchwork of incompatible character sets, each used in different locales and on different computer architectures. Unicode is used to encode the vast majority of text on the Internet, including most web pages, and Unicode support has become a common consideration in contemporary software development.

The Unicode character repertoire is synchronized with ISO/IEC 10646, the two being code-for-code identical. However, The Unicode Standard is more than just a repertoire within which characters are assigned. To aid developers and designers, the standard also provides charts and reference data, as well as annexes that explain concepts germane to various scripts and give guidance for their implementation. Topics covered by these annexes include character normalization, character composition and decomposition, collation, and directionality.^[5]
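Normalization and directionality can be illustrated with Python's standard `unicodedata` module. This is a small sketch rather than a full treatment; the helper name `canonically_equal` is hypothetical, and the property data comes from whichever Unicode version the Python build ships with.

```python
import unicodedata

precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, one code point
combining = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT, two code points


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The two spellings differ as raw code point sequences...
print(precomposed == combining)
# ...but compare equal once normalized.
print(canonically_equal(precomposed, combining))

# Decomposition (NFD) splits the precomposed letter into base + mark.
print([hex(ord(c)) for c in unicodedata.normalize("NFD", precomposed)])

# Directionality is exposed as the Bidi_Class property.
print(unicodedata.bidirectional("A"))       # left-to-right letter
print(unicodedata.bidirectional("\u05d0"))  # HEBREW LETTER ALEF, right-to-left
```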

Unicode text is processed and stored as binary data using one of several encodings, which define how to translate the standard's abstract character codes into sequences of bytes. The Unicode Standard itself defines three encodings: UTF-8, UTF-16, and UTF-32, though several others exist. Of these, UTF-8 is the most widely used by a large margin, in part due to its backward compatibility with ASCII.
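As a rough illustration of how the three encodings differ, the sketch below encodes the same three characters with Python's built-in codecs. The sample string and the helper name `encoded_lengths` are chosen here for illustration; the big-endian codec variants are used so that no byte order mark is prepended.

```python
SAMPLE = "A\u20ac\U0001F600"  # "A", EURO SIGN, and an emoji outside the BMP

CODECS = ("utf-8", "utf-16-be", "utf-32-be")


def encoded_lengths(text: str) -> dict:
    """Return the number of bytes each encoding needs for the given text."""
    return {codec: len(text.encode(codec)) for codec in CODECS}


for ch in SAMPLE:
    # Show the byte sequence of each character under every encoding.
    parts = ", ".join(f"{codec}: {ch.encode(codec).hex(' ')}" for codec in CODECS)
    print(f"U+{ord(ch):04X} -> {parts}")

print(encoded_lengths(SAMPLE))

# ASCII text has identical bytes in ASCII and UTF-8.
print("Unicode".encode("ascii") == "Unicode".encode("utf-8"))
```

UTF-8 uses one to four bytes per code point, UTF-16 uses two or four, and UTF-32 always uses four, which is why the totals differ for mixed text.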

Version	Date	Book	UCS edition	Scripts (total)	Characters (total)^[a]	Details
1.0.0^[22]	October 1991	ISBN 0-201-56788-1 (vol. 1)	—	24	7,129	Initial scripts covered: Arabic, Armenian, Bengali, Bopomofo, Cyrillic, Devanagari, Georgian, Greek and Coptic, Gujarati, Gurmukhi, Hangul, Hebrew, Hiragana, Kannada, Katakana, Lao, Latin, Malayalam, Odia, Tamil, Telugu, Thai, and Tibetan
1.0.1^[23]	June 1992	ISBN 0-201-60845-6 (vol. 2)	—	25	28,327 (+21,204, −6)	The initial 20,902 CJK Unified Ideographs
1.1^[24]	June 1993	—	ISO/IEC 10646-1:1993 ^[b]	24	34,168 (+5,963, −89)	33 reclassified as control characters. 4,306 Hangul syllables, Tibetan removed
2.0^[25]	July 1996	ISBN 0-201-48345-9		25	38,885 (+11,373, −6,656)	Original set of Hangul syllables removed, new set of 11,172 Hangul syllables added at new location, Tibetan added back in a new location and with a different character repertoire, surrogate character mechanism defined, Plane 15 and Plane 16 Private Use Areas allocated
2.1^[26]	May 1998	—		25	38,887 (+2)	U+20AC € EURO SIGN, U+FFFC OBJECT REPLACEMENT CHARACTER^[26]
3.0^[27]	September 1999	ISBN 0-201-61633-5	ISO/IEC 10646-1:2000	38	49,194 (+10,307)	Cherokee, Geʽez, Khmer, Mongolian, Burmese, Ogham, runes, Sinhala, Syriac, Thaana, Canadian Aboriginal syllabics, and Yi Syllables, Braille patterns
3.1^[28]	March 2001	—	ISO/IEC 10646-1:2000^[c] ISO/IEC 10646-2:2001	41	94,140 (+44,946)	Deseret, Gothic and Old Italic, sets of symbols for Western and Byzantine music, 42,711 additional CJK Unified Ideographs
3.2^[29]	March 2002	—	ISO/IEC 10646-1:2000^[c] ISO/IEC 10646-2:2001	45	95,156 (+1,016)	Philippine scripts (Buhid, Hanunoo, Tagalog, and Tagbanwa)
4.0^[30]	April 2003	ISBN 0-321-18578-1	ISO/IEC 10646:2003 ^[d]	52	96,382 (+1,226)	Cypriot syllabary, Limbu, Linear B, Osmanya, Shavian, Tai Le, and Ugaritic, Hexagram symbols
4.1^[31]	March 2005	—		59	97,655 (+1,273)	Buginese, Glagolitic, Kharosthi, New Tai Lue, Old Persian, Sylheti Nagri, and Tifinagh, Coptic disunified from Greek, ancient Greek numbers and musical symbols. The first named character sequences were introduced.^[32]
5.0	July 2006	ISBN 0-321-48091-0		64	99,024 (+1,369)	Balinese, cuneiform, N'Ko, ʼPhags-pa, Phoenician^[33]
5.1^[34]	April 2008	—		75	100,648 (+1,624)	Carian, Cham, Kayah Li, Lepcha, Lycian, Lydian, Ol Chiki, Rejang, Saurashtra, Sundanese, and Vai, sets of symbols for the Phaistos Disc, Mahjong tiles, Domino tiles, additions to Burmese, scribal abbreviations, U+1E9E ẞ LATIN CAPITAL LETTER SHARP S
5.2^[35]	October 2009	ISBN 978-1-936213-00-9		90	107,296 (+6,648)	Avestan, Bamum, Gardiner's sign list of Egyptian hieroglyphs, Imperial Aramaic, Inscriptional Pahlavi, Inscriptional Parthian, Javanese, Kaithi, Lisu, Meetei Mayek, Old South Arabian, Old Turkic, Samaritan, Tai Tham and Tai Viet, additional CJK Unified Ideographs, Jamo for Old Hangul, Vedic Sanskrit
6.0^[36]	October 2010	ISBN 978-1-936213-01-6	ISO/IEC 10646:2010 ^[e]	93	109,384 (+2,088)	Batak, Brahmi, Mandaic, playing card symbols, transport and map symbols, alchemical symbols, emoticons and emoji,^[37] additional CJK Unified Ideographs
6.1^[38]	January 2012	ISBN 978-1-936213-02-3	ISO/IEC 10646:2012 ^[f]	100	110,116 (+732)	Chakma, Meroitic cursive, Meroitic hieroglyphs, Miao, Sharada, Sora Sompeng, and Takri
6.2^[39]	September 2012	ISBN 978-1-936213-07-8		100	110,117 (+1)	U+20BA ₺ TURKISH LIRA SIGN
6.3^[40]	September 2013	ISBN 978-1-936213-08-5		100	110,122 (+5)	5 bidirectional formatting characters
7.0^[41]	June 2014	ISBN 978-1-936213-09-2		123	112,956 (+2,834)	Bassa Vah, Caucasian Albanian, Duployan, Elbasan, Grantha, Khojki, Khudawadi, Linear A, Mahajani, Manichaean, Mende Kikakui, Modi, Mro, Nabataean, Old North Arabian, Old Permic, Pahawh Hmong, Palmyrene, Pau Cin Hau, Psalter Pahlavi, Siddham, Tirhuta, Warang Citi, and dingbats
8.0^[42]	June 2015	ISBN 978-1-936213-10-8	ISO/IEC 10646:2014 ^[g]	129	120,672 (+7,716)	Ahom, Anatolian hieroglyphs, Hatran, Multani, Old Hungarian, SignWriting, additional CJK Unified Ideographs, lowercase letters for Cherokee, 5 emoji skin tone modifiers
9.0^[45]	June 2016	ISBN 978-1-936213-13-9	ISO/IEC 10646:2014 ^[g]	135	128,172 (+7,500)	Adlam, Bhaiksuki, Marchen, Newa, Osage, Tangut, 72 emoji^[46]
10.0^[47]	June 2017	ISBN 978-1-936213-16-0	ISO/IEC 10646:2017 ^[h]	139	136,690 (+8,518)	Zanabazar Square, Soyombo, Masaram Gondi, Nüshu, hentaigana, 7,494 CJK Unified Ideographs, 56 emoji, bitcoin symbol
11.0^[48]	June 2018	ISBN 978-1-936213-19-1		146	137,374 (+684)	Dogra, Georgian Mtavruli capital letters, Gunjala Gondi, Hanifi Rohingya, Indic Siyaq Numbers, Makasar, Medefaidrin, Old Sogdian and Sogdian, Maya numerals, 5 CJK Unified Ideographs, symbols for xiangqi and star ratings, 145 emoji
12.0^[49]	March 2019	ISBN 978-1-936213-22-1		150	137,928 (+554)	Elymaic, Nandinagari, Nyiakeng Puachue Hmong, Wancho, Miao script, hiragana and katakana small letters, Tamil historic fractions and symbols, Lao letters for Pali, Latin letters for Egyptological and Ugaritic transliteration, hieroglyph format controls, 61 emoji
12.1^[50]	May 2019	ISBN 978-1-936213-25-2		150	137,929 (+1)	U+32FF ㋿ SQUARE ERA NAME REIWA
13.0^[51]	March 2020	ISBN 978-1-936213-26-9	ISO/IEC 10646:2020 ^[52]	154	143,859 (+5,930)	Chorasmian, Dhives Akuru, Khitan small script, Yezidi, 4,969 CJK ideographs, Arabic script additions used to write Hausa, Wolof, and other African languages, additions used to write Hindko and Punjabi in Pakistan, Bopomofo additions used for Cantonese, Creative Commons license symbols, graphic characters for compatibility with teletext and home computer systems, 55 emoji
14.0^[53]	September 2021	ISBN 978-1-936213-29-0		159	144,697 (+838)	Toto, Cypro-Minoan, Vithkuqi, Old Uyghur, Tangsa, extended IPA, Arabic script additions for use in languages across Africa and in Iran, Pakistan, Malaysia, Indonesia, Java, and Bosnia, additions for honorifics and Quranic use, additions to support languages in North America, the Philippines, India, and Mongolia, U+20C0 ⃀ SOM SIGN, Znamenny musical notation, 37 emoji
15.0^[54]	September 2022	ISBN 978-1-936213-32-0		161	149,186 (+4,489)	Kawi and Mundari, 20 emoji, 4,192 CJK ideographs, control characters for Egyptian hieroglyphs
15.1^[55]	September 2023	ISBN 978-1-936213-33-7		161	149,813 (+627)	Additional CJK ideographs
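The added and removed counts in the Characters column can be cross-checked against the running totals. The following is a minimal sketch with figures transcribed from the rows for versions 2.0 through 15.1; the starting value is the total listed for version 1.1, and the function name `find_mismatches` is purely illustrative.

```python
# Total listed for version 1.1, used as the starting point.
START_TOTAL = 34_168

# (version, listed total, characters added, characters removed)
ROWS = [
    ("2.0", 38_885, 11_373, 6_656), ("2.1", 38_887, 2, 0),
    ("3.0", 49_194, 10_307, 0), ("3.1", 94_140, 44_946, 0),
    ("3.2", 95_156, 1_016, 0), ("4.0", 96_382, 1_226, 0),
    ("4.1", 97_655, 1_273, 0), ("5.0", 99_024, 1_369, 0),
    ("5.1", 100_648, 1_624, 0), ("5.2", 107_296, 6_648, 0),
    ("6.0", 109_384, 2_088, 0), ("6.1", 110_116, 732, 0),
    ("6.2", 110_117, 1, 0), ("6.3", 110_122, 5, 0),
    ("7.0", 112_956, 2_834, 0), ("8.0", 120_672, 7_716, 0),
    ("9.0", 128_172, 7_500, 0), ("10.0", 136_690, 8_518, 0),
    ("11.0", 137_374, 684, 0), ("12.0", 137_928, 554, 0),
    ("12.1", 137_929, 1, 0), ("13.0", 143_859, 5_930, 0),
    ("14.0", 144_697, 838, 0), ("15.0", 149_186, 4_489, 0),
    ("15.1", 149_813, 627, 0),
]


def find_mismatches(start, rows):
    """Return versions whose total is not previous + added - removed."""
    mismatches = []
    previous = start
    for version, total, added, removed in rows:
        if previous + added - removed != total:
            mismatches.append(version)
        previous = total
    return mismatches


print(find_mismatches(START_TOTAL, ROWS))
```

An empty list means every listed total agrees with the previous total plus the net change for that version.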

General Category (Unicode Character Property)^[a]
Value	Category Major, minor	Basic type^[b]	Character assigned^[b]	Count^[c] (as of 15.1)	Remarks

L, Letter; LC, Cased Letter (Lu, Ll, and Lt only)^[d]
Lu	Letter, uppercase	Graphic	Character	1,831
Ll	Letter, lowercase	Graphic	Character	2,233
Lt	Letter, titlecase	Graphic	Character	31	Ligatures or digraphs containing an uppercase followed by a lowercase part (e.g., ǅ, ǈ, ǋ, and ǲ)
Lm	Letter, modifier	Graphic	Character	397	A modifier letter
Lo	Letter, other	Graphic	Character	132,234	An ideograph or a letter in a unicase alphabet
M, Mark
Mn	Mark, nonspacing	Graphic	Character	1,985
Mc	Mark, spacing combining	Graphic	Character	452
Me	Mark, enclosing	Graphic	Character	13
N, Number
Nd	Number, decimal digit	Graphic	Character	680	All these, and only these, have Numeric Type = De^[e]
Nl	Number, letter	Graphic	Character	236	Numerals composed of letters or letterlike symbols (e.g., Roman numerals)
No	Number, other	Graphic	Character	915	E.g., vulgar fractions, superscript and subscript digits
P, Punctuation
Pc	Punctuation, connector	Graphic	Character	10	Includes spacing underscore characters such as "_", and other spacing tie characters. Unlike other punctuation characters, these may be classified as "word" characters by regular expression libraries.^[f]
Pd	Punctuation, dash	Graphic	Character	26	Includes several hyphen characters
Ps	Punctuation, open	Graphic	Character	79	Opening bracket characters
Pe	Punctuation, close	Graphic	Character	77	Closing bracket characters
Pi	Punctuation, initial quote	Graphic	Character	12	Opening quotation mark. Does not include the ASCII "neutral" quotation mark. May behave like Ps or Pe depending on usage
Pf	Punctuation, final quote	Graphic	Character	10	Closing quotation mark. May behave like Ps or Pe depending on usage
Po	Punctuation, other	Graphic	Character	628
S, Symbol
Sm	Symbol, math	Graphic	Character	948	Mathematical symbols (e.g., +, −, =, ×, ÷, √, ∊, ≠). Does not include parentheses and brackets, which are in categories Ps and Pe. Also does not include !, *, -, or /, which despite frequent use as mathematical operators, are primarily considered to be "punctuation".
Sc	Symbol, currency	Graphic	Character	63	Currency symbols
Sk	Symbol, modifier	Graphic	Character	125
So	Symbol, other	Graphic	Character	6,639
Z, Separator
Zs	Separator, space	Graphic	Character	17	Includes the space, but not TAB, CR, or LF, which are Cc
Zl	Separator, line	Format	Character	1	Only U+2028 LINE SEPARATOR (LSEP)
Zp	Separator, paragraph	Format	Character	1	Only U+2029 PARAGRAPH SEPARATOR (PSEP)
C, Other
Cc	Other, control	Control	Character	65 (will never change)^[e]	No name,^[g] <control>
Cf	Other, format	Format	Character	170	Includes the soft hyphen, joining control characters (ZWNJ and ZWJ), control characters to support bidirectional text, and language tag characters
Cs	Other, surrogate	Surrogate	Not (only used in UTF-16)	2,048 (will never change)^[e]	No name,^[g] <surrogate>
Co	Other, private use	Private-use	Character (but no interpretation specified)	137,468 total (will never change)^[e] (6,400 in BMP, 131,068 in Planes 15–16)	No name,^[g] <private-use>
Cn	Other, not assigned	Noncharacter	Not	66 (will not change unless the range of Unicode code points is expanded)^[e]	No name,^[g] <noncharacter>
Cn	Other, not assigned	Reserved	Not	824,652	No name,^[g] <reserved>
^[a] "Table 4-4: General Category" (PDF). The Unicode Standard. Unicode Consortium. September 2022.
^[b] "Table 2-3: Types of code points" (PDF). The Unicode Standard. Unicode Consortium. September 2022.
^[c] "DerivedGeneralCategory.txt". The Unicode Consortium. 2022-04-26.
^[d] "5.7.1 General Category Values". UTR #44: Unicode Character Database. Unicode Consortium. 2020-03-04.
^[e] Unicode Character Encoding Stability Policies: Property Value Stability. Stability policy: Some gc groups will never change. gc=Nd corresponds with Numeric Type=De (decimal).
^[f] "Annex C: Compatibility Properties (§ word)". Unicode Regular Expressions. Version 23. Unicode Consortium. 2022-02-08. Unicode Technical Standard #18.
^[g] "Table 4-9: Construction of Code Point Labels" (PDF). The Unicode Standard. Unicode Consortium. September 2022. A Code Point Label may be used to identify a nameless code point. E.g. <control-hhhh>, <control-0088>. The Name remains blank, which can prevent inadvertently replacing, in documentation, a Control Name with a true Control code. Unicode also uses <not a character> for <noncharacter>.
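The two-letter values in the table can be looked up programmatically. The sketch below uses `unicodedata.category` from Python's standard library; the sample characters and the helper name `describe` are chosen here for illustration, and results for newer characters depend on the Unicode version bundled with the Python build.

```python
import unicodedata

# One sample character for a selection of General Category values.
SAMPLES = [
    "A",        # Lu: uppercase letter
    "a",        # Ll: lowercase letter
    "\u01c5",   # Lt: titlecase digraph
    "5",        # Nd: decimal digit
    "\u2160",   # Nl: ROMAN NUMERAL ONE
    "\u00bd",   # No: VULGAR FRACTION ONE HALF
    "_",        # Pc: connector punctuation
    "-",        # Pd: dash punctuation
    "*",        # Po: punctuation, despite its use as an operator
    "+",        # Sm: math symbol
    " ",        # Zs: space separator
    "\t",       # Cc: control
    "\u200d",   # Cf: ZERO WIDTH JOINER
    "\ue000",   # Co: private use
    "\uffff",   # Cn: noncharacter
]


def describe(ch: str) -> str:
    """Format a code point together with its General Category value."""
    return f"U+{ord(ch):04X} {unicodedata.category(ch)}"


for ch in SAMPLES:
    print(describe(ch))
```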

Row	Cells	Range(s)
00	20–7E	Basic Latin (00–7F)
00	A0–FF	Latin-1 Supplement (80–FF)
01	00–13, 14–15, 16–2B, 2C–2D, 2E–4D, 4E–4F, 50–7E, 7F	Latin Extended-A (00–7F)
01	8F, 92, B7, DE–EF, FA–FF	Latin Extended-B (80–FF ...)
02	18–1B, 1E–1F	Latin Extended-B (... 00–4F)
02	59, 7C, 92	IPA Extensions (50–AF)
02	BB–BD, C6, C7, C9, D6, D8–DB, DC, DD, DF, EE	Spacing Modifier Letters (B0–FF)
03	74–75, 7A, 7E, 84–8A, 8C, 8E–A1, A3–CE, D7, DA–E1	Greek (70–FF)
04	00–5F, 90–91, 92–C4, C7–C8, CB–CC, D0–EB, EE–F5, F8–F9	Cyrillic (00–FF)
1E	02–03, 0A–0B, 1E–1F, 40–41, 56–57, 60–61, 6A–6B, 80–85, 9B, F2–F3	Latin Extended Additional (00–FF)
1F	00–15, 18–1D, 20–45, 48–4D, 50–57, 59, 5B, 5D, 5F–7D, 80–B4, B6–C4, C6–D3, D6–DB, DD–EF, F2–F4, F6–FE	Greek Extended (00–FF)
20	13–14, 15, 17, 18–19, 1A–1B, 1C–1D, 1E, 20–22, 26, 30, 32–33, 39–3A, 3C, 3E, 44, 4A	General Punctuation (00–6F)
20	7F, 82	Superscripts and Subscripts (70–9F)
20	A3–A4, A7, AC, AF	Currency Symbols (A0–CF)
21	05, 13, 16, 22, 26, 2E	Letterlike Symbols (00–4F)
21	5B–5E	Number Forms (50–8F)
21	90–93, 94–95, A8	Arrows (90–FF)
22	00, 02, 03, 06, 08–09, 0F, 11–12, 15, 19–1A, 1E–1F, 27–28, 29, 2A, 2B, 48, 59, 60–61, 64–65, 82–83, 95, 97	Mathematical Operators (00–FF)
23	02, 0A, 20–21, 29–2A	Miscellaneous Technical (00–FF)
25	00, 02, 0C, 10, 14, 18, 1C, 24, 2C, 34, 3C, 50–6C	Box Drawing (00–7F)
25	80, 84, 88, 8C, 90–93	Block Elements (80–9F)
25	A0–A1, AA–AC, B2, BA, BC, C4, CA–CB, CF, D8–D9, E6	Geometric Shapes (A0–FF)
26	3A–3C, 40, 42, 60, 63, 65–66, 6A, 6B	Miscellaneous Symbols (00–FF)
F0	(01–02)	Private Use Area (00–FF ...)
FB	01–02	Alphabetic Presentation Forms (00–4F)
FF	FD	Specials

Origin and development

History

Unicode Consortium

Scripts covered

Script Encoding Initiative

Versions

Projected versions

Architecture and terminology

Codespace and code points

Code planes and blocks

General Category property

Abstract characters

Ready-made versus composite characters

Ligatures

Standardized subsets

Mapping and encodings

Adoption

Operating systems

Input methods

Email

Web

Fonts

Newlines

Issues

Character unification

Han unification

Italic or cursive characters in Cyrillic

Localised case pairs

Diacritics on lowercase I

Security

Mapping to legacy character sets

Indic scripts

Combining characters

Anomalies

See also

Notes

References

Further reading

External links