Unicode

From Wikipedia for FEVERv2

For what the term "Unicode" means in Microsoft documentation, see UTF-16. Unicode_sentence_0

Unicode_table_infobox_0

UnicodeUnicode_table_caption_0
Alias(es)Unicode_header_cell_0_0_0 Universal Coded Character Set (UCS)Unicode_cell_0_0_1
Language(s)Unicode_header_cell_0_1_0 InternationalUnicode_cell_0_1_1
StandardUnicode_header_cell_0_2_0 Unicode StandardUnicode_cell_0_2_1
Encoding formatsUnicode_header_cell_0_3_0 UTF-8, UTF-16, GB18030; less common: UTF-32, BOCU, SCSU, UTF-7 (obsolete)Unicode_cell_0_3_1

Preceded byUnicode_header_cell_0_4_0 ISO/IEC 8859, various othersUnicode_cell_0_4_1

Unicode is an information technology (IT) standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. Unicode_sentence_1

The standard is maintained by the Unicode Consortium. As of March 2020, with Unicode 13.0, it defines a total of 143,859 characters (143,696 graphic characters and 163 format characters) covering 154 modern and historic scripts, as well as multiple symbol sets and emoji. Unicode_sentence_2

The character repertoire of the Unicode Standard is synchronized with ISO/IEC 10646, and both are code-for-code identical. Unicode_sentence_3

The Unicode Standard consists of a set of code charts for visual reference, an encoding method and set of standard character encodings, a set of reference data files, and a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering, and bidirectional text display order (for the correct display of text containing both right-to-left scripts, such as Arabic and Hebrew, and left-to-right scripts). Unicode_sentence_4

Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. Unicode_sentence_5

The standard has been implemented in many recent technologies, including modern operating systems, XML, Java (and other programming languages), and the .NET Framework. Unicode_sentence_6

Unicode can be implemented by different character encodings. Unicode_sentence_7

The Unicode standard defines Unicode Transformation Formats (UTF) UTF-8, UTF-16, and UTF-32, and several other encodings. Unicode_sentence_8

The most commonly used encodings are UTF-8, UTF-16, and UCS-2 (a precursor of UTF-16 without full support for Unicode); GB18030 is standardized in China and implements Unicode fully, although it is not an official Unicode standard. Unicode_sentence_9

UTF-8, the dominant encoding on the World Wide Web (used in over 95% of websites as of 2020, and up to 100% for some languages), uses one byte for the first 128 code points, and up to 4 bytes for other characters. Unicode_sentence_10

The first 128 Unicode code points represent the ASCII characters, which means that any ASCII text is also a UTF-8 text. Unicode_sentence_11
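As a rough illustration of the byte counts described above, the following Python sketch encodes a few sample characters and checks that pure ASCII text yields the same bytes under ASCII and UTF-8. The helper name `utf8_length` and the choice of sample characters are assumptions made for this example only.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for a single code point."""
    return len(ch.encode("utf-8"))

# Sample code points: U+0041 (A), U+00E9 (e with acute),
# U+20AC (euro sign), U+1F600 (an emoji outside the BMP).
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [utf8_length(ch) for ch in samples]

# Pure ASCII text is encoded to identical bytes by ASCII and UTF-8.
ascii_text = "plain ASCII text"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

print(lengths)     # one entry per sample character
print(same_bytes)
```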

UCS-2 uses two bytes (16 bits) for each character but can only encode the first 65,536 code points, the so-called Basic Multilingual Plane (BMP). Unicode_sentence_12

With 1,112,064 possible Unicode code points corresponding to characters (see below) on 17 planes, and with over 143,000 code points defined as of version 13.0, UCS-2 is only able to represent less than half of all encoded Unicode characters. Unicode_sentence_13

Therefore, UCS-2 is outdated, though still widely used in software. Unicode_sentence_14

UTF-16 extends UCS-2, by using the same 16-bit encoding as UCS-2 for the Basic Multilingual Plane, and a 4-byte encoding for the other planes. Unicode_sentence_15

As long as it contains no code points in the reserved range U+D800–U+DFFF, a UCS-2 text is valid UTF-16 text. Unicode_sentence_16
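The relationship between UCS-2 and UTF-16 can be sketched in a few lines of Python. The function below is a minimal, illustrative encoder (its name and error handling are assumptions for this example): code points in the BMP map to one 16-bit unit, exactly as in UCS-2, while code points in the other planes are split into a pair of units taken from the reserved range U+D800–U+DFFF.

```python
from typing import List

def to_utf16_code_units(cp: int) -> List[int]:
    """Encode one code point as a list of 16-bit UTF-16 code units."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if cp < 0x10000:
        # BMP: a single unit, identical to UCS-2.
        return [cp]
    # Supplementary planes: 20 bits split across two surrogates.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]

bmp_units = to_utf16_code_units(0x20AC)      # euro sign, in the BMP
astral_units = to_utf16_code_units(0x1F600)  # emoji in Plane 1

# Cross-check against Python's built-in UTF-16 encoder (big-endian, no BOM).
builtin = "\U0001F600".encode("utf-16-be")
print([hex(u) for u in astral_units], builtin.hex())
```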

UTF-32 (also referred to as UCS-4) uses four bytes to encode any given codepoint, but not necessarily any given user-perceived character (loosely speaking, a grapheme), since a user-perceived character may be represented by a grapheme cluster (a sequence of multiple codepoints). Unicode_sentence_17

Like UCS-2, the number of bytes per codepoint is fixed, facilitating character indexing; but unlike UCS-2, UTF-32 is able to encode all Unicode code points. Unicode_sentence_18

However, because each character uses four bytes, UTF-32 takes significantly more space than other encodings, and is not widely used. Unicode_sentence_19

UTF-32 is nevertheless variable-length in a different sense (as are all the other encodings), because a single user-perceived character may span several code points. Examples include: "Devanagari kshi is encoded by 4 code points [..] Flag emojis are also grapheme clusters and composed of two code point characters – for example, the flag of Japan" and all "combining character sequences are graphemes, but there are other sequences of code points that are as well; for example \r\n is one." Unicode_sentence_20
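The difference between code points and user-perceived characters can be checked with a short Python sketch. It only counts code points and UTF-32 bytes; it does not perform grapheme cluster segmentation, which the Python standard library does not provide. The variable names are illustrative.

```python
# Flag of Japan: two regional indicator code points (J and P).
flag_japan = "\U0001F1EF\U0001F1F5"

# "e" followed by a combining acute accent: one perceived character.
e_acute = "e\u0301"

# Python strings are sequences of code points, so len() counts code points.
flag_code_points = len(flag_japan)
accent_code_points = len(e_acute)

# UTF-32 spends exactly four bytes per code point (big-endian, no BOM here).
flag_utf32_bytes = len(flag_japan.encode("utf-32-be"))
accent_utf32_bytes = len(e_acute.encode("utf-32-be"))

print(flag_code_points, flag_utf32_bytes)
print(accent_code_points, accent_utf32_bytes)
```

Indexing by code point is therefore constant-time in UTF-32, but indexing by user-perceived character still requires scanning for cluster boundaries.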

Origin and development Unicode_section_0

Unicode has the explicit aim of transcending the limitations of traditional character encodings, such as those defined by the ISO/IEC 8859 standard, which find wide usage in various countries of the world but remain largely incompatible with each other. Unicode_sentence_21

Many traditional character encodings share a common problem in that they allow bilingual computer processing (usually using Latin characters and the local script), but not multilingual computer processing (computer processing of arbitrary scripts mixed with each other). Unicode_sentence_22

Unicode, in intent, encodes the underlying characters—graphemes and grapheme-like units—rather than the variant glyphs (renderings) for such characters. Unicode_sentence_23

In the case of Chinese characters, this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs (see Han unification). Unicode_sentence_24

In text processing, Unicode takes the role of providing a unique code point—a number, not a glyph—for each character. Unicode_sentence_25

In other words, Unicode represents a character in an abstract way and leaves the visual rendering (size, shape, font, or style) to other software, such as a web browser or word processor. Unicode_sentence_26

This simple aim becomes complicated, however, because of concessions made by Unicode's designers in the hope of encouraging a more rapid adoption of Unicode. Unicode_sentence_27

The first 256 code points were made identical to the content of ISO/IEC 8859-1 so as to make it trivial to convert existing western text. Unicode_sentence_28

Many essentially identical characters were encoded multiple times at different code points to preserve distinctions used by legacy encodings and therefore, allow conversion from those encodings to Unicode (and back) without losing any information. Unicode_sentence_29

For example, the "fullwidth forms" section of code points encompasses a full duplicate of the Latin alphabet because Chinese, Japanese, and Korean (CJK) fonts contain two versions of these letters, "fullwidth" matching the width of the CJK characters, and normal width. Unicode_sentence_30

For other examples, see duplicate characters in Unicode. Unicode_sentence_31
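The fullwidth duplicates can be inspected with Python's standard `unicodedata` module. The sketch below is illustrative only: it shows that the fullwidth letter A (U+FF21) is a distinct code point from ASCII "A", and that compatibility normalization (NFKC) folds it to the ordinary letter, whereas canonical normalization (NFC) keeps the distinction that round-trip conversion relies on.

```python
import unicodedata

fullwidth_a = "\uFF21"  # FULLWIDTH LATIN CAPITAL LETTER A
ascii_a = "A"           # U+0041

# The two are different code points, so plain comparison fails.
distinct = fullwidth_a != ascii_a

# East Asian Width property: "F" (fullwidth) versus "Na" (narrow).
widths = (unicodedata.east_asian_width(fullwidth_a),
          unicodedata.east_asian_width(ascii_a))

# NFKC applies compatibility mappings; NFC does not.
folded = unicodedata.normalize("NFKC", fullwidth_a)
preserved = unicodedata.normalize("NFC", fullwidth_a)

print(distinct, widths, folded == ascii_a, preserved == fullwidth_a)
```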

History Unicode_section_1

Based on experiences with the Xerox Character Code Standard (XCCS) since 1980, the origins of Unicode date to 1987, when Joe Becker from Xerox, with Lee Collins and Mark Davis from Apple, started investigating the practicalities of creating a universal character set. Unicode_sentence_32

With additional input from Peter Fenwick and Dave Opstad, Joe Becker published a draft proposal for an "international/multilingual text character encoding system", tentatively called Unicode, in August 1988. Unicode_sentence_33

He explained that "[t]he name 'Unicode' is intended to suggest a unique, unified, universal encoding". Unicode_sentence_34

In this document, entitled Unicode 88, Becker outlined a 16-bit character model: Unicode_sentence_35

His original 16-bit design was based on the assumption that only those scripts and characters in modern use would need to be encoded: Unicode_sentence_36

In early 1989, the Unicode working group expanded to include Ken Whistler and Mike Kernaghan of Metaphor, Karen Smith-Yoshimura and Joan Aliprand of RLG, and Glenn Wright of Sun Microsystems, and in 1990, Michel Suignard and Asmus Freytag from Microsoft and Rick McGowan of NeXT joined the group. Unicode_sentence_37

By the end of 1990, most of the work on mapping existing character encoding standards had been completed, and a final review draft of Unicode was ready. Unicode_sentence_38

The Unicode Consortium was incorporated in California on 3 January 1991, and in October 1991, the first volume of the Unicode standard was published. Unicode_sentence_39

The second volume, covering Han ideographs, was published in June 1992. Unicode_sentence_40

In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits. Unicode_sentence_41

This increased the Unicode codespace to over a million code points, which allowed for the encoding of many historic scripts (e.g., Egyptian hieroglyphs) and thousands of rarely used or obsolete characters that had not been anticipated as needing encoding. Unicode_sentence_42

Among the characters not originally intended for Unicode are rarely used Kanji or Chinese characters, many of which are part of personal and place names; although rare, they are much more essential than envisioned in the original architecture of Unicode. Unicode_sentence_43

The Microsoft TrueType specification version 1.0 from 1992 used the name Apple Unicode instead of Unicode for the Platform ID in the naming table. Unicode_sentence_44

Unicode Consortium Unicode_section_2

Main article: Unicode Consortium Unicode_sentence_45

The Unicode Consortium is a nonprofit organization that coordinates Unicode's development. Unicode_sentence_46

Full members include most of the main computer software and hardware companies with any interest in text-processing standards, including Adobe, Apple, Facebook, Google, IBM, Microsoft, Netflix, and SAP SE. Unicode_sentence_47

Over the years several countries or government agencies have been members of the Unicode Consortium. Unicode_sentence_48

Presently only the Ministry of Endowments and Religious Affairs (Oman) is a full member with voting rights. Unicode_sentence_49

The Consortium has the ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard Unicode Transformation Format (UTF) schemes, as many of the existing schemes are limited in size and scope and are incompatible with multilingual environments. Unicode_sentence_50

Scripts covered Unicode_section_3

Main article: Script (Unicode) Unicode_sentence_51

Unicode covers almost all scripts (writing systems) in current use today. Unicode_sentence_52

A total of 154 scripts are included in the latest version of Unicode (covering alphabets, abugidas and syllabaries), although there are still scripts that are not yet encoded, particularly those mainly used in historical, liturgical, and academic contexts. Unicode_sentence_53

Further additions of characters to the already encoded scripts, as well as symbols, in particular for mathematics and music (in the form of notes and rhythmic symbols), also occur. Unicode_sentence_54

The Unicode Roadmap Committee (Michael Everson, Rick McGowan, Ken Whistler, V.S. Umamaheswaran) maintains the list of scripts that are candidates or potential candidates for encoding and their tentative code block assignments on the Unicode Roadmap page of the Unicode Consortium Web site. Unicode_sentence_56

For some scripts on the Roadmap, such as Jurchen and Khitan small script, encoding proposals have been made and they are working their way through the approval process. Unicode_sentence_57

For other scripts, such as Mayan (besides numbers) and Rongorongo, no proposal has yet been made, and they await agreement on character repertoire and other details from the user communities involved. Unicode_sentence_58

Some modern invented scripts which have not yet been included in Unicode (e.g., Tengwar) or which do not qualify for inclusion in Unicode due to lack of real-world use (e.g., Klingon) are listed in the ConScript Unicode Registry, along with unofficial but widely used Private Use Areas code assignments. Unicode_sentence_59

There is also a Medieval Unicode Font Initiative focused on special Latin medieval characters. Unicode_sentence_60

Some of these proposals have already been included in Unicode. Unicode_sentence_61

The Script Encoding Initiative, a project run by Deborah Anderson at the University of California, Berkeley, was founded in 2002 with the goal of funding proposals for scripts not yet encoded in the standard. Unicode_sentence_62

The project has become a major source of proposed additions to the standard in recent years. Unicode_sentence_63

Versions Unicode_section_4

Unicode is developed in conjunction with the International Organization for Standardization and shares the character repertoire with ISO/IEC 10646: the Universal Character Set. Unicode_sentence_64

Unicode and ISO/IEC 10646 function equivalently as character encodings, but The Unicode Standard contains much more information for implementers, covering—in depth—topics such as bitwise encoding, collation and rendering. Unicode_sentence_65

The Unicode Standard enumerates a multitude of character properties, including those needed for supporting bidirectional text. Unicode_sentence_66
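One such property is the bidirectional class, which Python exposes through `unicodedata.bidirectional`. The sketch below simply looks up the class of a few sample characters; the sample set is an assumption for illustration and says nothing about how a full bidirectional layout algorithm would be implemented.

```python
import unicodedata

samples = {
    "latin_a": "A",            # left-to-right letter
    "hebrew_alef": "\u05D0",   # right-to-left letter
    "arabic_alef": "\u0627",   # right-to-left Arabic letter
    "digit_one": "1",          # European number
}

# Map each sample name to its bidirectional class abbreviation.
bidi_classes = {name: unicodedata.bidirectional(ch)
                for name, ch in samples.items()}

print(bidi_classes)
```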

The two standards do use slightly different terminology. Unicode_sentence_67

The Unicode Consortium first published The Unicode Standard in 1991 (version 1.0), and has published new versions on a regular basis since then. Unicode_sentence_68

The latest version of the Unicode Standard, version 13.0, was released in March 2020, and is available in electronic format from the consortium's website. Unicode_sentence_69

The last version of the standard that was published completely in book form (including the code charts) was version 5.0 in 2006, but since version 5.2 (2009) the core specification of the standard has been published as a print-on-demand paperback. Unicode_sentence_70

The entire text of each version of the standard, including the core specification, standard annexes and code charts, is freely available in PDF format on the Unicode website. Unicode_sentence_71

In April 2020, the release of version 14.0 was postponed by six months from its originally planned date of March 2021 because of the COVID-19 pandemic. Unicode_sentence_72

Thus far, the following major and minor versions of the Unicode standard have been published. Unicode_sentence_73

Update versions, which do not include any changes to character repertoire, are signified by the third number (e.g., "version 4.0.1") and are omitted in the table below. Unicode_sentence_74

Unicode_table_general_1

Unicode versionsUnicode_table_caption_1
VersionUnicode_header_cell_1_0_0 DateUnicode_header_cell_1_0_1 BookUnicode_header_cell_1_0_2 Corresponding ISO/IEC 10646 editionUnicode_header_cell_1_0_3 ScriptsUnicode_header_cell_1_0_4 CharactersUnicode_header_cell_1_0_5
TotalUnicode_header_cell_1_1_0 Notable additionsUnicode_header_cell_1_1_1
1.0.0Unicode_cell_1_2_0 October 1991Unicode_cell_1_2_1 ISBN 0-201-56788-1 (Vol. 1)Unicode_cell_1_2_2 Unicode_cell_1_2_3 24Unicode_cell_1_2_4 7,129, not counting 'space' or 33 non-printing characters (7,163 total)Unicode_cell_1_2_5 Initial repertoire covers these scripts: Arabic, Armenian, Bengali, Bopomofo, Cyrillic, Devanagari, Georgian, Greek and Coptic, Gujarati, Gurmukhi, Hangul, Hebrew, Hiragana, Kannada, Katakana, Lao, Latin, Malayalam, Oriya, Tamil, Telugu, Thai, and Tibetan.Unicode_cell_1_2_6
1.0.1Unicode_cell_1_3_0 June 1992Unicode_cell_1_3_1 ISBN 0-201-60845-6 (Vol. 2)Unicode_cell_1_3_2 Unicode_cell_1_3_3 25Unicode_cell_1_3_4 28,327 (21,204 added; 6 removed)Unicode_cell_1_3_5 The initial set of 20,902 CJK Unified Ideographs is defined.Unicode_cell_1_3_6
1.1Unicode_cell_1_4_0 June 1993Unicode_cell_1_4_1 Unicode_cell_1_4_2 ISO/IEC 10646-1:1993Unicode_cell_1_4_3 24Unicode_cell_1_4_4 34,168 (5,963 added; 89 removed; 33 reclassified as control characters)Unicode_cell_1_4_5 4,306 more Hangul syllables added to original set of 2,350 characters. Tibetan removed.Unicode_cell_1_4_6
2.0Unicode_cell_1_5_0 July 1996Unicode_cell_1_5_1 ISBN 0-201-48345-9Unicode_cell_1_5_2 ISO/IEC 10646-1:1993 plus Amendments 5, 6 and 7Unicode_cell_1_5_3 25Unicode_cell_1_5_4 38,885 (11,373 added; 6,656 removed)Unicode_cell_1_5_5 Original set of Hangul syllables removed, and a new set of 11,172 Hangul syllables added at a new location. Tibetan added back in a new location and with a different character repertoire. Surrogate character mechanism defined, and Plane 15 and Plane 16 Private Use Areas allocated.Unicode_cell_1_5_6
2.1Unicode_cell_1_6_0 May 1998Unicode_cell_1_6_1 Unicode_cell_1_6_2 ISO/IEC 10646-1:1993 plus Amendments 5, 6 and 7, as well as two characters from Amendment 18Unicode_cell_1_6_3 25Unicode_cell_1_6_4 38,887 (2 added)Unicode_cell_1_6_5 Euro sign and Object Replacement Character added.Unicode_cell_1_6_6
3.0Unicode_cell_1_7_0 September 1999Unicode_cell_1_7_1 ISBN 0-201-61633-5Unicode_cell_1_7_2 ISO/IEC 10646-1:2000Unicode_cell_1_7_3 38Unicode_cell_1_7_4 49,194 (10,307 added)Unicode_cell_1_7_5 Cherokee, Ethiopic, Khmer, Mongolian, Burmese, Ogham, Runic, Sinhala, Syriac, Thaana, Unified Canadian Aboriginal Syllabics, and Yi Syllables added, as well as a set of Braille patterns.Unicode_cell_1_7_6
3.1Unicode_cell_1_8_0 March 2001Unicode_cell_1_8_1 Unicode_cell_1_8_2 ISO/IEC 10646-1:2000; ISO/IEC 10646-2:2001Unicode_cell_1_8_3 41Unicode_cell_1_8_4 94,140 (44,946 added)Unicode_cell_1_8_5 Deseret, Gothic and Old Italic added, as well as sets of symbols for Western music and Byzantine music, and 42,711 additional CJK Unified Ideographs.Unicode_cell_1_8_6
3.2Unicode_cell_1_9_0 March 2002Unicode_cell_1_9_1 Unicode_cell_1_9_2 ISO/IEC 10646-1:2000 plus Amendment 1; ISO/IEC 10646-2:2001Unicode_cell_1_9_3 45Unicode_cell_1_9_4 95,156 (1,016 added)Unicode_cell_1_9_5 Philippine scripts Buhid, Hanunó'o, Tagalog, and Tagbanwa added.Unicode_cell_1_9_6
4.0Unicode_cell_1_10_0 April 2003Unicode_cell_1_10_1 ISBN 0-321-18578-1Unicode_cell_1_10_2 ISO/IEC 10646:2003Unicode_cell_1_10_3 52Unicode_cell_1_10_4 96,382 (1,226 added)Unicode_cell_1_10_5 Cypriot syllabary, Limbu, Linear B, Osmanya, Shavian, Tai Le, and Ugaritic added, as well as Hexagram symbols.Unicode_cell_1_10_6
4.1Unicode_cell_1_11_0 March 2005Unicode_cell_1_11_1 Unicode_cell_1_11_2 ISO/IEC 10646:2003 plus Amendment 1Unicode_cell_1_11_3 59Unicode_cell_1_11_4 97,655 (1,273 added)Unicode_cell_1_11_5 Buginese, Glagolitic, Kharoshthi, New Tai Lue, Old Persian, Syloti Nagri, and Tifinagh added, and Coptic was disunified from Greek. Ancient Greek numbers and musical symbols were also added.Unicode_cell_1_11_6
5.0Unicode_cell_1_12_0 July 2006Unicode_cell_1_12_1 ISBN 0-321-48091-0Unicode_cell_1_12_2 ISO/IEC 10646:2003 plus Amendments 1 and 2, as well as four characters from Amendment 3Unicode_cell_1_12_3 64Unicode_cell_1_12_4 99,024 (1,369 added)Unicode_cell_1_12_5 Balinese, Cuneiform, N'Ko, Phags-pa, and Phoenician added.Unicode_cell_1_12_6
5.1Unicode_cell_1_13_0 April 2008Unicode_cell_1_13_1 Unicode_cell_1_13_2 ISO/IEC 10646:2003 plus Amendments 1, 2, 3 and 4Unicode_cell_1_13_3 75Unicode_cell_1_13_4 100,648 (1,624 added)Unicode_cell_1_13_5 Carian, Cham, Kayah Li, Lepcha, Lycian, Lydian, Ol Chiki, Rejang, Saurashtra, Sundanese, and Vai added, as well as sets of symbols for the Phaistos Disc, Mahjong tiles, and Domino tiles. There were also important additions for Burmese, additions of letters and Scribal abbreviations used in medieval manuscripts, and the addition of Capital ẞ.Unicode_cell_1_13_6
5.2Unicode_cell_1_14_0 October 2009Unicode_cell_1_14_1 ISBN 978-1-936213-00-9Unicode_cell_1_14_2 ISO/IEC 10646:2003 plus Amendments 1, 2, 3, 4, 5 and 6Unicode_cell_1_14_3 90Unicode_cell_1_14_4 107,296 (6,648 added)Unicode_cell_1_14_5 Avestan, Bamum, Egyptian hieroglyphs (the Gardiner Set, comprising 1,071 characters), Imperial Aramaic, Inscriptional Pahlavi, Inscriptional Parthian, Javanese, Kaithi, Lisu, Meetei Mayek, Old South Arabian, Old Turkic, Samaritan, Tai Tham and Tai Viet added. 4,149 additional CJK Unified Ideographs (CJK-C), as well as extended Jamo for Old Hangul, and characters for Vedic Sanskrit.Unicode_cell_1_14_6
6.0Unicode_cell_1_15_0 October 2010Unicode_cell_1_15_1 ISBN 978-1-936213-01-6Unicode_cell_1_15_2 ISO/IEC 10646:2010 plus the Indian rupee signUnicode_cell_1_15_3 93Unicode_cell_1_15_4 109,384 (2,088 added)Unicode_cell_1_15_5 Batak, Brahmi, Mandaic, playing card symbols, transport and map symbols, alchemical symbols, emoticons and emoji. 222 additional CJK Unified Ideographs (CJK-D) added.Unicode_cell_1_15_6
6.1Unicode_cell_1_16_0 January 2012Unicode_cell_1_16_1 ISBN 978-1-936213-02-3Unicode_cell_1_16_2 ISO/IEC 10646:2012Unicode_cell_1_16_3 100Unicode_cell_1_16_4 110,116 (732 added)Unicode_cell_1_16_5 Chakma, Meroitic cursive, Meroitic hieroglyphs, Miao, Sharada, Sora Sompeng, and Takri.Unicode_cell_1_16_6
6.2Unicode_cell_1_17_0 September 2012Unicode_cell_1_17_1 ISBN 978-1-936213-07-8Unicode_cell_1_17_2 ISO/IEC 10646:2012 plus the Turkish lira signUnicode_cell_1_17_3 100Unicode_cell_1_17_4 110,117 (1 added)Unicode_cell_1_17_5 Turkish lira sign.Unicode_cell_1_17_6
6.3Unicode_cell_1_18_0 September 2013Unicode_cell_1_18_1 ISBN 978-1-936213-08-5Unicode_cell_1_18_2 ISO/IEC 10646:2012 plus six charactersUnicode_cell_1_18_3 100Unicode_cell_1_18_4 110,122 (5 added)Unicode_cell_1_18_5 5 bidirectional formatting characters.Unicode_cell_1_18_6
7.0Unicode_cell_1_19_0 June 2014Unicode_cell_1_19_1 ISBN 978-1-936213-09-2Unicode_cell_1_19_2 ISO/IEC 10646:2012 plus Amendments 1 and 2, as well as the Ruble signUnicode_cell_1_19_3 123Unicode_cell_1_19_4 112,956 (2,834 added)Unicode_cell_1_19_5 Bassa Vah, Caucasian Albanian, Duployan, Elbasan, Grantha, Khojki, Khudawadi, Linear A, Mahajani, Manichaean, Mende Kikakui, Modi, Mro, Nabataean, Old North Arabian, Old Permic, Pahawh Hmong, Palmyrene, Pau Cin Hau, Psalter Pahlavi, Siddham, Tirhuta, Warang Citi, and Dingbats.Unicode_cell_1_19_6
8.0Unicode_cell_1_20_0 June 2015Unicode_cell_1_20_1 ISBN 978-1-936213-10-8Unicode_cell_1_20_2 ISO/IEC 10646:2014 plus Amendment 1, as well as the Lari sign, nine CJK unified ideographs, and 41 emoji charactersUnicode_cell_1_20_3 129Unicode_cell_1_20_4 120,672 (7,716 added)Unicode_cell_1_20_5 Ahom, Anatolian hieroglyphs, Hatran, Multani, Old Hungarian, SignWriting, 5,771 CJK unified ideographs, a set of lowercase letters for Cherokee, and five emoji skin tone modifiersUnicode_cell_1_20_6
9.0Unicode_cell_1_21_0 June 2016Unicode_cell_1_21_1 ISBN 978-1-936213-13-9Unicode_cell_1_21_2 ISO/IEC 10646:2014 plus Amendments 1 and 2, as well as Adlam, Newa, Japanese TV symbols, and 74 emoji and symbolsUnicode_cell_1_21_3 135Unicode_cell_1_21_4 128,172 (7,500 added)Unicode_cell_1_21_5 Adlam, Bhaiksuki, Marchen, Newa, Osage, Tangut, and 72 emojiUnicode_cell_1_21_6
10.0Unicode_cell_1_22_0 June 2017Unicode_cell_1_22_1 ISBN 978-1-936213-16-0Unicode_cell_1_22_2 ISO/IEC 10646:2017 plus 56 emoji characters, 285 hentaigana characters, and 3 Zanabazar Square charactersUnicode_cell_1_22_3 139Unicode_cell_1_22_4 136,690 (8,518 added)Unicode_cell_1_22_5 Zanabazar Square, Soyombo, Masaram Gondi, Nüshu, hentaigana (non-standard hiragana), 7,494 CJK unified ideographs, and 56 emojiUnicode_cell_1_22_6
11.0Unicode_cell_1_23_0 June 2018Unicode_cell_1_23_1 ISBN 978-1-936213-19-1Unicode_cell_1_23_2 ISO/IEC 10646:2017 plus Amendment 1, as well as 46 Mtavruli Georgian capital letters, 5 CJK unified ideographs, and 66 emoji characters.Unicode_cell_1_23_3 146Unicode_cell_1_23_4 137,374 (684 added)Unicode_cell_1_23_5 Dogra, Georgian Mtavruli capital letters, Gunjala Gondi, Hanifi Rohingya, Indic Siyaq numbers, Makasar, Medefaidrin, Old Sogdian and Sogdian, Mayan numerals, 5 urgently needed CJK unified ideographs, symbols for xiangqi (Chinese chess) and star ratings, and 145 emojiUnicode_cell_1_23_6
12.0Unicode_cell_1_24_0 March 2019Unicode_cell_1_24_1 ISBN 978-1-936213-22-1Unicode_cell_1_24_2 ISO/IEC 10646:2017 plus Amendments 1 and 2, as well as 62 additional characters.Unicode_cell_1_24_3 150Unicode_cell_1_24_4 137,928 (554 added)Unicode_cell_1_24_5 Elymaic, Nandinagari, Nyiakeng Puachue Hmong, Wancho, Miao script additions for several Miao and Yi dialects in China, hiragana and katakana small letters for writing archaic Japanese, Tamil historic fractions and symbols, Lao letters for Pali, Latin letters for Egyptological and Ugaritic transliteration, hieroglyph format controls, and 61 emojiUnicode_cell_1_24_6
12.1Unicode_cell_1_25_0 May 2019Unicode_cell_1_25_1 ISBN 978-1-936213-25-2Unicode_cell_1_25_2 Unicode_cell_1_25_3 150Unicode_cell_1_25_4 137,929 (1 added)Unicode_cell_1_25_5 Adds a single character at U+32FF for the square ligature form of the name of the Reiwa era.Unicode_cell_1_25_6
13.0Unicode_cell_1_26_0 March 2020Unicode_cell_1_26_1 ISBN 978-1-936213-26-9Unicode_cell_1_26_2 ISO/IEC 10646:2020Unicode_cell_1_26_3 154Unicode_cell_1_26_4 143,859 (5,930 added)Unicode_cell_1_26_5 Chorasmian, Dives Akuru, Khitan small script, Yezidi, 4,969 CJK unified ideographs added (including 4,939 in Ext. G), Arabic script additions used to write Hausa, Wolof, and other languages in Africa and other additions used to write Hindko and Punjabi in Pakistan, Bopomofo additions used for Cantonese, Creative Commons license symbols, graphic characters for compatibility with teletext and home computer systems from the 1970s and 1980s, and 55 emojiUnicode_cell_1_26_6

Architecture and terminology Unicode_section_5

See also: Universal Character Set characters Unicode_sentence_75

The Unicode Standard defines a codespace of numerical values ranging from 0 through 10FFFF in hexadecimal, called code points and denoted as U+0000 through U+10FFFF, respectively. The notation is "U+" plus the code point value in hexadecimal, prepended with leading zeros as necessary to result in a minimum of four digits, e.g., U+00F7 for the division sign, ÷, versus U+13254 for the Egyptian hieroglyph designating a reed shelter. Unicode_sentence_76

Out of these $2^{16} + 2^{20}$ defined code points, the code points from U+D800 through U+DFFF, which are used to encode surrogate pairs in UTF-16, are reserved by the Standard and may not be used to encode valid characters, resulting in a net total of $2^{16} - 2^{11} + 2^{20}$ = 1,112,064 possible code points corresponding to valid Unicode characters. Unicode_sentence_77

Not all of these code points necessarily correspond to visible characters; several, for example, are assigned to control codes such as carriage return. Unicode_sentence_78
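The "U+" notation and the code point arithmetic in this section can be reproduced with a short Python sketch. The function and constant names below are chosen for illustration; the numeric ranges are the ones stated in the text.

```python
def format_code_point(cp: int) -> str:
    """Format a code point as "U+" plus at least four hex digits."""
    return f"U+{cp:04X}"

# Size of the whole codespace: 0 through 0x10FFFF inclusive.
CODESPACE_SIZE = 0x10FFFF + 1

# Surrogate range U+D800 through U+DFFF, reserved for UTF-16.
SURROGATE_COUNT = 0xDFFF - 0xD800 + 1

# Code points that can correspond to valid characters.
VALID_COUNT = CODESPACE_SIZE - SURROGATE_COUNT

print(format_code_point(0xF7), format_code_point(0x13254))
print(CODESPACE_SIZE, SURROGATE_COUNT, VALID_COUNT)
```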

Code point planes and blocks Unicode_section_6

Main article: Plane (Unicode) Unicode_sentence_79

The Unicode codespace is divided into seventeen planes, numbered 0 to 16: Unicode_sentence_80

All code points in the BMP are accessed as a single code unit in UTF-16 encoding and can be encoded in one, two or three bytes in UTF-8. Unicode_sentence_81

Code points in Planes 1 through 16 (supplementary planes) are accessed as surrogate pairs in UTF-16 and encoded in four bytes in UTF-8. Unicode_sentence_82
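Because each plane holds 65,536 code points, the plane number is simply the code point shifted right by 16 bits. The following Python sketch (with illustrative helper names) computes the plane and then measures how many UTF-16 code units and UTF-8 bytes Python's built-in encoders produce for a few sample code points.

```python
def plane_of(cp: int) -> int:
    """Return the plane number (0 to 16) of a code point."""
    return cp >> 16

def encoded_sizes(cp: int) -> tuple:
    """Return (plane, UTF-16 code units, UTF-8 bytes) for a code point."""
    ch = chr(cp)
    utf16_units = len(ch.encode("utf-16-be")) // 2  # two bytes per unit
    utf8_bytes = len(ch.encode("utf-8"))
    return (plane_of(cp), utf16_units, utf8_bytes)

# BMP samples (1, 2 and 3 UTF-8 bytes) and supplementary-plane samples.
sizes = {cp: encoded_sizes(cp)
         for cp in (0x41, 0xE9, 0x20AC, 0x1F600, 0x10FFFF)}

for cp, info in sizes.items():
    print(f"U+{cp:04X}", info)
```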

Within each plane, characters are allocated within named blocks of related characters. Unicode_sentence_83

Although blocks can be of arbitrary size, their size is always a multiple of 16 code points and often a multiple of 128 code points. Unicode_sentence_84

Characters required for a given script may be spread out over several different blocks. Unicode_sentence_85

General Category property Unicode_section_7

Each code point has a single General Category property. Unicode_sentence_86

The major categories are denoted: Letter, Mark, Number, Punctuation, Symbol, Separator and Other. Unicode_sentence_87

Within these categories, there are subdivisions. Unicode_sentence_88

In most cases other properties must be used to sufficiently specify the characteristics of a code point. Unicode_sentence_89
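Python's standard `unicodedata.category` function returns the two-letter General Category value of a character, which makes the classification easy to explore. The sample characters below are an assumption for illustration; the exact counts per category depend on the Unicode version bundled with the interpreter, so the sketch looks up only long-established characters.

```python
import unicodedata

samples = {
    "A": "A",                     # uppercase letter
    "a": "a",                     # lowercase letter
    "digit": "5",                 # decimal digit
    "roman_eight": "\u2167",      # Roman numeral eight
    "underscore": "_",            # connector punctuation
    "plus": "+",                  # math symbol
    "hyphen_minus": "-",          # dash punctuation, not a math symbol
    "dollar": "$",                # currency symbol
    "space": " ",                 # space separator
    "tab": "\t",                  # control character
    "line_separator": "\u2028",   # the only line separator
    "soft_hyphen": "\u00AD",      # format character
}

# Look up the two-letter General Category value for each sample.
categories = {name: unicodedata.category(ch) for name, ch in samples.items()}

for name, value in categories.items():
    print(f"{name}: {value}")
```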

The possible General Categories are: Unicode_sentence_90

Unicode_table_general_2

General Category (Unicode Character Property)Unicode_header_cell_2_0_0
ValueUnicode_header_cell_2_1_0 Category Major, minorUnicode_header_cell_2_1_1 Basic typeUnicode_header_cell_2_1_2 Character assignedUnicode_header_cell_2_1_3 Count (as of 13.0)Unicode_header_cell_2_1_4 RemarksUnicode_header_cell_2_1_5
Unicode_header_cell_2_2_0 Unicode_header_cell_2_2_1 Unicode_header_cell_2_2_2 Unicode_header_cell_2_2_3 Unicode_header_cell_2_2_4 Unicode_header_cell_2_2_5
Letter (L)Unicode_cell_2_3_0
LuUnicode_cell_2_4_0 Letter, uppercaseUnicode_cell_2_4_1 GraphicUnicode_cell_2_4_2 CharacterUnicode_cell_2_4_3 1,791Unicode_cell_2_4_4 Unicode_cell_2_4_5
LlUnicode_cell_2_5_0 Letter, lowercaseUnicode_cell_2_5_1 GraphicUnicode_cell_2_5_2 CharacterUnicode_cell_2_5_3 2,155Unicode_cell_2_5_4 Unicode_cell_2_5_5
LtUnicode_cell_2_6_0 Letter, titlecaseUnicode_cell_2_6_1 GraphicUnicode_cell_2_6_2 CharacterUnicode_cell_2_6_3 31Unicode_cell_2_6_4 Ligatures containing uppercase followed by lowercase letters (e.g., Dž, Lj, Nj, and Dz)Unicode_cell_2_6_5
LmUnicode_cell_2_7_0 Letter, modifierUnicode_cell_2_7_1 GraphicUnicode_cell_2_7_2 CharacterUnicode_cell_2_7_3 260Unicode_cell_2_7_4 A modifier letterUnicode_cell_2_7_5
LoUnicode_cell_2_8_0 Letter, otherUnicode_cell_2_8_1 GraphicUnicode_cell_2_8_2 CharacterUnicode_cell_2_8_3 127,004Unicode_cell_2_8_4 An ideograph or a letter in a unicase alphabetUnicode_cell_2_8_5
Mark (M)Unicode_cell_2_9_0
MnUnicode_cell_2_10_0 Mark, nonspacingUnicode_cell_2_10_1 GraphicUnicode_cell_2_10_2 CharacterUnicode_cell_2_10_3 1,839Unicode_cell_2_10_4 Unicode_cell_2_10_5
McUnicode_cell_2_11_0 Mark, spacing combiningUnicode_cell_2_11_1 GraphicUnicode_cell_2_11_2 CharacterUnicode_cell_2_11_3 443Unicode_cell_2_11_4 Unicode_cell_2_11_5
MeUnicode_cell_2_12_0 Mark, enclosingUnicode_cell_2_12_1 GraphicUnicode_cell_2_12_2 CharacterUnicode_cell_2_12_3 13Unicode_cell_2_12_4 Unicode_cell_2_12_5
Number (N)Unicode_cell_2_13_0
NdUnicode_cell_2_14_0 Number, decimal digitUnicode_cell_2_14_1 GraphicUnicode_cell_2_14_2 CharacterUnicode_cell_2_14_3 650Unicode_cell_2_14_4 All these, and only these, have Numeric Type = DeUnicode_cell_2_14_5
NlUnicode_cell_2_15_0 Number, letterUnicode_cell_2_15_1 GraphicUnicode_cell_2_15_2 CharacterUnicode_cell_2_15_3 236Unicode_cell_2_15_4 Numerals composed of letters or letterlike symbols (e.g., Roman numerals)Unicode_cell_2_15_5
NoUnicode_cell_2_16_0 Number, otherUnicode_cell_2_16_1 GraphicUnicode_cell_2_16_2 CharacterUnicode_cell_2_16_3 895Unicode_cell_2_16_4 E.g., vulgar fractions, superscript and subscript digitsUnicode_cell_2_16_5
Punctuation (P)Unicode_cell_2_17_0
PcUnicode_cell_2_18_0 Punctuation, connectorUnicode_cell_2_18_1 GraphicUnicode_cell_2_18_2 CharacterUnicode_cell_2_18_3 10Unicode_cell_2_18_4 Includes "_" underscoreUnicode_cell_2_18_5
PdUnicode_cell_2_19_0 Punctuation, dashUnicode_cell_2_19_1 GraphicUnicode_cell_2_19_2 CharacterUnicode_cell_2_19_3 25Unicode_cell_2_19_4 Includes several hyphen charactersUnicode_cell_2_19_5
PsUnicode_cell_2_20_0 Punctuation, openUnicode_cell_2_20_1 GraphicUnicode_cell_2_20_2 CharacterUnicode_cell_2_20_3 75Unicode_cell_2_20_4 Opening bracket charactersUnicode_cell_2_20_5
PeUnicode_cell_2_21_0 Punctuation, closeUnicode_cell_2_21_1 GraphicUnicode_cell_2_21_2 CharacterUnicode_cell_2_21_3 73Unicode_cell_2_21_4 Closing bracket charactersUnicode_cell_2_21_5
PiUnicode_cell_2_22_0 Punctuation, initial quoteUnicode_cell_2_22_1 GraphicUnicode_cell_2_22_2 CharacterUnicode_cell_2_22_3 12Unicode_cell_2_22_4 Opening quotation mark. Does not include the ASCII "neutral" quotation mark. May behave like Ps or Pe depending on usageUnicode_cell_2_22_5
PfUnicode_cell_2_23_0 Punctuation, final quoteUnicode_cell_2_23_1 GraphicUnicode_cell_2_23_2 CharacterUnicode_cell_2_23_3 10Unicode_cell_2_23_4 Closing quotation mark. May behave like Ps or Pe depending on usageUnicode_cell_2_23_5
PoUnicode_cell_2_24_0 Punctuation, otherUnicode_cell_2_24_1 GraphicUnicode_cell_2_24_2 CharacterUnicode_cell_2_24_3 593Unicode_cell_2_24_4 Unicode_cell_2_24_5
Symbol (S)Unicode_cell_2_25_0
SmUnicode_cell_2_26_0 Symbol, mathUnicode_cell_2_26_1 GraphicUnicode_cell_2_26_2 CharacterUnicode_cell_2_26_3 948Unicode_cell_2_26_4 Mathematical symbols (e.g., +, =, ×, ÷). Does not include parentheses and brackets, which are in categories Ps and Pe. Also does not include !, *, -, or /, which, despite frequent use as mathematical operators, are primarily considered to be "punctuation".Unicode_cell_2_26_5
ScUnicode_cell_2_27_0 Symbol, currencyUnicode_cell_2_27_1 GraphicUnicode_cell_2_27_2 CharacterUnicode_cell_2_27_3 62Unicode_cell_2_27_4 Currency symbolsUnicode_cell_2_27_5
SkUnicode_cell_2_28_0 Symbol, modifierUnicode_cell_2_28_1 GraphicUnicode_cell_2_28_2 CharacterUnicode_cell_2_28_3 123Unicode_cell_2_28_4 Unicode_cell_2_28_5
SoUnicode_cell_2_29_0 Symbol, otherUnicode_cell_2_29_1 GraphicUnicode_cell_2_29_2 CharacterUnicode_cell_2_29_3 6,431Unicode_cell_2_29_4 Unicode_cell_2_29_5
Separator (Z)Unicode_cell_2_30_0
ZsUnicode_cell_2_31_0 Separator, spaceUnicode_cell_2_31_1 GraphicUnicode_cell_2_31_2 CharacterUnicode_cell_2_31_3 17Unicode_cell_2_31_4 Includes the space, but not TAB, CR, or LF, which are CcUnicode_cell_2_31_5
ZlUnicode_cell_2_32_0 Separator, lineUnicode_cell_2_32_1 FormatUnicode_cell_2_32_2 CharacterUnicode_cell_2_32_3 1Unicode_cell_2_32_4 Only U+2028 LINE SEPARATOR (LSEP)Unicode_cell_2_32_5
ZpUnicode_cell_2_33_0 Separator, paragraphUnicode_cell_2_33_1 FormatUnicode_cell_2_33_2 CharacterUnicode_cell_2_33_3 1Unicode_cell_2_33_4 Only U+2029 PARAGRAPH SEPARATOR (PSEP)Unicode_cell_2_33_5
Other (C)Unicode_cell_2_34_0
CcUnicode_cell_2_35_0 Other, controlUnicode_cell_2_35_1 ControlUnicode_cell_2_35_2 CharacterUnicode_cell_2_35_3 65 (will never change)Unicode_cell_2_35_4 No name, <control>Unicode_cell_2_35_5
CfUnicode_cell_2_36_0 Other, formatUnicode_cell_2_36_1 FormatUnicode_cell_2_36_2 CharacterUnicode_cell_2_36_3 161Unicode_cell_2_36_4 Includes the soft hyphen, joining control characters (zwnj and zwj), control characters to support bi-directional text, and language tag charactersUnicode_cell_2_36_5
CsUnicode_cell_2_37_0 Other, surrogateUnicode_cell_2_37_1 SurrogateUnicode_cell_2_37_2 Not (but abstract)Unicode_cell_2_37_3 2,048 (will never change)Unicode_cell_2_37_4 No name, <surrogate>Unicode_cell_2_37_5
CoUnicode_cell_2_38_0 Other, private useUnicode_cell_2_38_1 Private-useUnicode_cell_2_38_2 Not (but abstract)Unicode_cell_2_38_3 137,468 total (will never change) (6,400 in BMP, 131,068 in Planes 15–16)Unicode_cell_2_38_4 No name, <private-use>Unicode_cell_2_38_5
CnUnicode_cell_2_39_0 Other, not assignedUnicode_cell_2_39_1 NoncharacterUnicode_cell_2_39_2 NotUnicode_cell_2_39_3 66 (will never change)Unicode_cell_2_39_4 No name, <noncharacter>Unicode_cell_2_39_5
ReservedUnicode_cell_2_40_0 NotUnicode_cell_2_40_1 830,606Unicode_cell_2_40_2 No name, <reserved>Unicode_cell_2_40_3
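
The general categories listed in the table can be inspected programmatically. The following is a minimal sketch using Python's standard unicodedata module; the sample characters are chosen here purely for illustration, and the values reported depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Illustrative sample characters (an assumption of this sketch, not a list
# taken from the standard): digits, punctuation, symbols, separators, others.
samples = ["5", "\u2167", "\u00bd", "_", "-", "(", ")", "!", "+", "$",
           " ", "\u2028", "\u2029", "\t", "\u200d", "\ud800", "\ue000", "\ufdd0"]

# Map each sample to its two-letter General Category abbreviation.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    # Print the code point rather than the character: some samples are invisible.
    print(f"U+{ord(ch):04X} -> {cat}")
```

In this sketch the hyphen-minus and the exclamation mark are reported as punctuation (Pd and Po) rather than as math symbols, which matches the note in the Sm row.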

Code points in the range U+D800–U+DBFF (1,024 code points) are known as high-surrogate code points, and code points in the range U+DC00–U+DFFF (1,024 code points) are known as low-surrogate code points. Unicode_sentence_91

A high-surrogate code point followed by a low-surrogate code point forms a surrogate pair in UTF-16 to represent a code point greater than U+FFFF. Unicode_sentence_92

Outside such pairs, these code points cannot otherwise be used (a rule that is often ignored in practice, especially when UTF-16 is not used). Unicode_sentence_93
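
The arithmetic behind a surrogate pair can be sketched as follows. This is an illustrative Python sketch rather than a reference implementation; the function names are hypothetical, and U+1F600 is used only as a sample code point above U+FFFF.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> U+DC00..U+DFFF
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")   # U+1F600 -> D83D DE00
```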

A small set of code points are guaranteed never to be used for encoding characters, although applications may make use of these code points internally if they wish. Unicode_sentence_94

There are sixty-six of these noncharacters: U+FDD0–U+FDEF and any code point ending in the value FFFE or FFFF (i.e., U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF). Unicode_sentence_95

The set of noncharacters is stable, and no new noncharacters will ever be defined. Unicode_sentence_96

As with surrogates, the rule that noncharacters cannot be used is often ignored, although the operation of the byte order mark assumes that U+FFFE will never be the first code point in a text. Unicode_sentence_97

Excluding surrogates and noncharacters leaves 1,111,998 code points available for use. Unicode_sentence_98
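
The counts above can be checked with a few lines of arithmetic. The sketch below, in Python, enumerates the noncharacters from the two rules just described and subtracts them, together with the surrogates, from the size of the codespace (U+0000 to U+10FFFF, i.e. 17 planes of 65,536 code points).

```python
# Noncharacters: the contiguous range U+FDD0..U+FDEF ...
noncharacters = list(range(0xFDD0, 0xFDEF + 1))
# ... plus the last two code points (xxFFFE, xxFFFF) of each of the 17 planes.
for plane in range(17):
    noncharacters += [plane * 0x10000 + 0xFFFE, plane * 0x10000 + 0xFFFF]

codespace = 17 * 0x10000               # U+0000..U+10FFFF
surrogates = 0xDFFF - 0xD800 + 1       # U+D800..U+DFFF

available = codespace - surrogates - len(noncharacters)
print(len(noncharacters), surrogates, available)   # 66 2048 1111998
```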

Private-use code points are considered to be assigned characters, but they have no interpretation specified by the Unicode standard so any interchange of such characters requires an agreement between sender and receiver on their interpretation. Unicode_sentence_99

There are three private-use areas in the Unicode codespace: Unicode_sentence_100

Unicode_unordered_list_0

  • Private Use Area: U+E000–U+F8FF (6,400 characters),Unicode_item_0_0
  • Supplementary Private Use Area-A: U+F0000–U+FFFFD (65,534 characters),Unicode_item_0_1
  • Supplementary Private Use Area-B: U+100000–U+10FFFD (65,534 characters).Unicode_item_0_2
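
The sizes of the three areas follow directly from their endpoints. As a small illustrative check in Python (the dictionary and variable names are hypothetical), the range sizes can be computed and the general category of the boundary code points confirmed as Co using the standard unicodedata module:

```python
import unicodedata

# Inclusive (start, end) ranges as listed above.
private_use_areas = {
    "Private Use Area": (0xE000, 0xF8FF),
    "Supplementary Private Use Area-A": (0xF0000, 0xFFFFD),
    "Supplementary Private Use Area-B": (0x100000, 0x10FFFD),
}

sizes = {name: end - start + 1 for name, (start, end) in private_use_areas.items()}
total = sum(sizes.values())

# Every boundary code point should report the general category "Co".
boundaries_are_co = all(
    unicodedata.category(chr(cp)) == "Co"
    for bounds in private_use_areas.values()
    for cp in bounds
)

print(sizes, total, boundaries_are_co)
```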

Graphic characters are characters defined by Unicode to have particular semantics, and either have a visible glyph shape or represent a visible space. Unicode_sentence_101

As of Unicode 13.0 there are 143,696 graphic characters. Unicode_sentence_102

Format characters are characters that do not have a visible appearance, but may have an effect on the appearance or behavior of neighboring characters. Unicode_sentence_103

For example, U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER may be used to change the default shaping behavior of adjacent characters (e.g., to inhibit ligatures or request ligature formation). Unicode_sentence_104

There are 163 format characters in Unicode 13.0. Unicode_sentence_105

Sixty-five code points (U+0000–U+001F and U+007F–U+009F) are reserved as control codes, and correspond to the C0 and C1 control codes defined in ISO/IEC 6429. Unicode_sentence_106

U+0009 (Tab), U+000A (Line Feed), and U+000D (Carriage Return) are widely used in Unicode-encoded texts. Unicode_sentence_107

In practice, the C1 code points are often improperly translated (mojibake) as the legacy Windows-1252 characters used by some English and Western European texts. Unicode_sentence_108
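
The confusion between C1 control codes and Windows-1252 can be illustrated with a short Python sketch. The sample bytes are an assumption made for the example: the same byte values decode to C1 controls under ISO/IEC 8859-1 (Python's "latin-1" codec) and to printable quotation marks under Windows-1252 ("cp1252").

```python
import unicodedata

raw = b"\x93quoted\x94"   # bytes as a Windows-1252 writer might produce them

as_latin1 = raw.decode("latin-1")   # 0x93 / 0x94 -> C1 controls U+0093 / U+0094
as_cp1252 = raw.decode("cp1252")    # 0x93 / 0x94 -> curly quotes U+201C / U+201D

print([f"U+{ord(ch):04X}" for ch in (as_latin1[0], as_cp1252[0])])
print(unicodedata.category(as_latin1[0]), unicodedata.category(as_cp1252[0]))
```

Under the first interpretation the text contains control codes (category Cc); under the second it contains initial and final quotation marks (Pi and Pf).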

Graphic characters, format characters, control code characters, and private use characters are known collectively as assigned characters. Unicode_sentence_109

Reserved code points are those code points which are available for use, but are not yet assigned. Unicode_sentence_110

As of Unicode 13.0 there are 830,606 reserved code points. Unicode_sentence_111

Abstract characters Unicode_section_8

The set of graphic and format characters defined by Unicode does not correspond directly to the repertoire of abstract characters that is representable under Unicode. Unicode_sentence_112

Unicode encodes characters by associating an abstract character with a particular code point. Unicode_sentence_113

However, not all abstract characters are encoded as a single Unicode character, and some abstract characters may be represented in Unicode by a sequence of two or more characters. Unicode_sentence_114

For example, a Latin small letter "i" with an ogonek, a dot above, and an acute accent, which is required in Lithuanian, is represented by the character sequence U+012F, U+0307, U+0301. Unicode_sentence_115

Unicode maintains a list of uniquely named character sequences for abstract characters that are not directly encoded in Unicode. Unicode_sentence_116

All graphic, format, and private use characters have a unique and immutable name by which they may be identified. Unicode_sentence_117

This immutability has been guaranteed since Unicode version 2.0 by the Name Stability policy. Unicode_sentence_118

In cases where the name is seriously defective and misleading, or has a serious typographical error, a formal alias may be defined, and applications are encouraged to use the formal alias in place of the official character name. Unicode_sentence_119

For example, U+A015 ꀕ YI SYLLABLE WU has the formal alias YI SYLLABLE ITERATION MARK, and U+FE18 ︘ PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET (sic) has the formal alias PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET. Unicode_sentence_120
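
Character names and formal aliases can be queried in many programming environments. The sketch below uses Python's standard unicodedata module, whose lookup function also accepts name aliases (in Python 3.3 and later); the results depend on the Unicode data shipped with the interpreter.

```python
import unicodedata

# The Lithuanian sequence mentioned above: three separately named characters.
sequence = "\u012f\u0307\u0301"
names = [unicodedata.name(ch) for ch in sequence]

# The official (immutable) name of U+FE18 keeps its misspelling ...
official = unicodedata.name("\ufe18")
# ... while the corrected formal alias resolves to the same character.
via_alias = unicodedata.lookup(
    "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET")
yi_alias = unicodedata.lookup("YI SYLLABLE ITERATION MARK")

print(names)
print(official, via_alias == "\ufe18", yi_alias == "\ua015")
```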

Ready-made versus composite characters Unicode_section_9

Unicode includes a mechanism for modifying characters that greatly extends the supported glyph repertoire. Unicode_sentence_121

This covers the use of combining diacritical marks that may be added after the base character by the user. Unicode_sentence_122

Multiple combining diacritics may be simultaneously applied to the same character. Unicode_sentence_123

Unicode also contains precomposed versions of most letter/diacritic combinations in normal use. Unicode_sentence_124

These make conversion to and from legacy encodings simpler, and allow applications to use Unicode as an internal text format without having to implement combining characters. Unicode_sentence_125

For example, é can be represented in Unicode as U+0065 (LATIN SMALL LETTER E) followed by U+0301 (COMBINING ACUTE ACCENT), but it can also be represented as the precomposed character U+00E9 (LATIN SMALL LETTER E WITH ACUTE). Unicode_sentence_126

Thus, in many cases, users have multiple ways of encoding the same character. Unicode_sentence_127

To deal with this, Unicode provides the mechanism of canonical equivalence. Unicode_sentence_128
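
As an illustration, the two representations of é compare as unequal code point sequences until they are normalized to a common form. The following Python sketch uses the standard unicodedata.normalize function with the NFC (composed) and NFD (decomposed) normalization forms; the variable names are hypothetical.

```python
import unicodedata

decomposed = "e\u0301"    # U+0065 followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE

raw_equal = decomposed == precomposed                                  # False
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed    # True
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed    # True

print(raw_equal, nfc_equal, nfd_equal)   # False True True
```

Applications that compare or search text typically normalize both sides to the same form first.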

An example of this arises with Hangul, the Korean alphabet. Unicode_sentence_129

Unicode provides a mechanism for composing Hangul syllables with their individual subcomponents, known as Hangul Jamo. Unicode_sentence_130

However, it also provides 11,172 combinations of precomposed syllables made from the most common jamo. Unicode_sentence_131
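
The figure of 11,172 is the product of 19 leading consonants, 21 vowels, and 28 trailing positions (27 trailing consonants plus the case of none), and the precomposed syllables are laid out so that the code point can be computed arithmetically. The Python sketch below assumes the base values used by the standard's Hangul composition algorithm; the function name is hypothetical and no range validation is performed.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # T_COUNT includes "no trailing consonant"


def compose_hangul(lead, vowel, trail=None):
    """Compose conjoining jamo into one precomposed Hangul syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


jamo = "\u1112\u1161\u11ab"              # leading H, vowel A, trailing N
syllable = compose_hangul(*jamo)
print(f"U+{ord(syllable):04X}", L_COUNT * V_COUNT * T_COUNT)   # U+D55C 11172
```

The result agrees with what NFC normalization of the same jamo sequence produces.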

The CJK characters currently have codes only for their precomposed form. Unicode_sentence_132

Still, most of those characters comprise simpler elements (called radicals), so in principle Unicode could have decomposed them as it did with Hangul. Unicode_sentence_133

This would have greatly reduced the number of required code points, while allowing the display of virtually every conceivable character (which might do away with some of the problems caused by Han unification). Unicode_sentence_134

A similar idea is used by some input methods, such as Cangjie and Wubi. Unicode_sentence_135

However, attempts to do this for character encoding have stumbled over the fact that Chinese characters do not decompose as simply or as regularly as Hangul does. Unicode_sentence_136

A set of radicals was provided in Unicode 3.0 (CJK radicals between U+2E80 and U+2EFF, KangXi radicals in U+2F00 to U+2FDF, and ideographic description characters from U+2FF0 to U+2FFB), but the Unicode standard (ch. 12.2 of Unicode 5.2) warns against using ideographic description sequences as an alternate representation for previously encoded characters. Unicode_sentence_137 Unicode_sentence_138

Ligatures Unicode_section_10

Many scripts, including Arabic and Devanāgarī, have special orthographic rules that require certain combinations of letterforms to be combined into special ligature forms. Unicode_sentence_139

The rules governing ligature formation can be quite complex, requiring special script-shaping technologies such as ACE (Arabic Calligraphic Engine, developed by DecoType in the 1980s and used to generate all the Arabic examples in the printed editions of the Unicode Standard), which became the proof of concept for OpenType (by Adobe and Microsoft), Graphite (by SIL International), or AAT (by Apple). Unicode_sentence_140

Instructions are also embedded in fonts to tell the operating system how to properly output different character sequences. Unicode_sentence_141

A simple solution to the placement of combining marks or diacritics is assigning the marks a width of zero and placing the glyph itself to the left or right of the left sidebearing (depending on the direction of the script they are intended to be used with). Unicode_sentence_142

A mark handled this way will appear over whatever character precedes it, but will not adjust its position relative to the width or height of the base glyph; it may be visually awkward and it may overlap some glyphs. Unicode_sentence_143

Real stacking is impossible, but can be approximated in limited cases (for example, Thai top-combining vowels and tone marks can just be at different heights to start with). Unicode_sentence_144

Generally this approach is only effective in monospaced fonts, but may be used as a fallback rendering method when more complex methods fail. Unicode_sentence_145

Standardized subsets Unicode_section_11

Several subsets of Unicode are standardized: Microsoft Windows since Windows NT 4.0 supports WGL-4 with 656 characters, which is considered to support all contemporary European languages using the Latin, Greek, or Cyrillic script. Unicode_sentence_146

Other standardized subsets of Unicode include the Multilingual European Subsets: Unicode_sentence_147

MES-1 (Latin scripts only, 335 characters), MES-2 (Latin, Greek, and Cyrillic; 1,062 characters), and MES-3A and MES-3B (two larger subsets, not shown here). Unicode_sentence_148

Note that MES-2 includes every character in MES-1 and WGL-4. Unicode_sentence_149

Unicode_table_general_3

WGL-4, MES-1 and MES-2Unicode_table_caption_3
RowUnicode_header_cell_3_0_0 CellsUnicode_header_cell_3_0_1 Range(s)Unicode_header_cell_3_0_2
00Unicode_header_cell_3_1_0 20–7EUnicode_cell_3_1_1 Basic Latin (00–7F)Unicode_cell_3_1_2
A0–FFUnicode_cell_3_2_0 Latin-1 Supplement (80–FF)Unicode_cell_3_2_1
01Unicode_header_cell_3_3_0 00–13, 14–15, 16–2B, 2C–2D, 2E–4D, 4E–4F, 50–7E, 7FUnicode_cell_3_3_1 Latin Extended-A (00–7F)Unicode_cell_3_3_2
8F, 92, B7, DE-EF, FA–FFUnicode_cell_3_4_0 Latin Extended-B (80–FF ...)Unicode_cell_3_4_1
02Unicode_header_cell_3_5_0 18–1B, 1E–1FUnicode_cell_3_5_1 Latin Extended-B (... 00–4F)Unicode_cell_3_5_2
59, 7C, 92Unicode_cell_3_6_0 IPA Extensions (50–AF)Unicode_cell_3_6_1
BB–BD, C6, C7, C9, D6, D8–DB, DC, DD, DF, EEUnicode_cell_3_7_0 Spacing Modifier Letters (B0–FF)Unicode_cell_3_7_1
03Unicode_header_cell_3_8_0 74–75, 7A, 7E, 84–8A, 8C, 8E–A1, A3–CE, D7, DA–E1Unicode_cell_3_8_1 Greek (70–FF)Unicode_cell_3_8_2
04Unicode_header_cell_3_9_0 00–5F, 90–91, 92–C4, C7–C8, CB–CC, D0–EB, EE–F5, F8–F9Unicode_cell_3_9_1 Cyrillic (00–FF)Unicode_cell_3_9_2
1EUnicode_header_cell_3_10_0 02–03, 0A–0B, 1E–1F, 40–41, 56–57, 60–61, 6A–6B, 80–85, 9B, F2–F3Unicode_cell_3_10_1 Latin Extended Additional (00–FF)Unicode_cell_3_10_2
1FUnicode_header_cell_3_11_0 00–15, 18–1D, 20–45, 48–4D, 50–57, 59, 5B, 5D, 5F–7D, 80–B4, B6–C4, C6–D3, D6–DB, DD–EF, F2–F4, F6–FEUnicode_cell_3_11_1 Greek Extended (00–FF)Unicode_cell_3_11_2
20Unicode_header_cell_3_12_0 13–14, 15, 17, 18–19, 1A–1B, 1C–1D, 1E, 20–22, 26, 30, 32–33, 39–3A, 3C, 3E, 44, 4AUnicode_cell_3_12_1 General Punctuation (00–6F)Unicode_cell_3_12_2
7F, 82Unicode_cell_3_13_0 Superscripts and Subscripts (70–9F)Unicode_cell_3_13_1
A3–A4, A7, AC, AFUnicode_cell_3_14_0 Currency Symbols (A0–CF)Unicode_cell_3_14_1
21Unicode_header_cell_3_15_0 05, 13, 16, 22, 26, 2EUnicode_cell_3_15_1 Letterlike Symbols (00–4F)Unicode_cell_3_15_2
5B–5EUnicode_cell_3_16_0 Number Forms (50–8F)Unicode_cell_3_16_1
90–93, 94–95, A8Unicode_cell_3_17_0 Arrows (90–FF)Unicode_cell_3_17_1
22Unicode_header_cell_3_18_0 00, 02, 03, 06, 08–09, 0F, 11–12, 15, 19–1A, 1E–1F, 27–28, 29, 2A, 2B, 48, 59, 60–61, 64–65, 82–83, 95, 97Unicode_cell_3_18_1 Mathematical Operators (00–FF)Unicode_cell_3_18_2
23Unicode_header_cell_3_19_0 02, 0A, 20–21, 29–2AUnicode_cell_3_19_1 Miscellaneous Technical (00–FF)Unicode_cell_3_19_2
25Unicode_header_cell_3_20_0 00, 02, 0C, 10, 14, 18, 1C, 24, 2C, 34, 3C, 50–6CUnicode_cell_3_20_1 Box Drawing (00–7F)Unicode_cell_3_20_2
80, 84, 88, 8C, 90–93Unicode_cell_3_21_0 Block Elements (80–9F)Unicode_cell_3_21_1
A0–A1, AA–AC, B2, BA, BC, C4, CA–CB, CF, D8–D9, E6Unicode_cell_3_22_0 Geometric Shapes (A0–FF)Unicode_cell_3_22_1
26Unicode_header_cell_3_23_0 3A–3C, 40, 42, 60, 63, 65–66, 6A, 6BUnicode_cell_3_23_1 Miscellaneous Symbols (00–FF)Unicode_cell_3_23_2
F0Unicode_header_cell_3_24_0 (01–02)Unicode_cell_3_24_1 Private Use Area (00–FF ...)Unicode_cell_3_24_2
FBUnicode_header_cell_3_25_0 01–02Unicode_cell_3_25_1 Alphabetic Presentation Forms (00–4F)Unicode_cell_3_25_2
FFUnicode_header_cell_3_26_0 FDUnicode_cell_3_26_1 SpecialsUnicode_cell_3_26_2

Rendering software which cannot process a Unicode character appropriately often displays it as an open rectangle, or the Unicode "replacement character" (U+FFFD, �), to indicate the position of the unrecognized character. Unicode_sentence_150

Some systems have made attempts to provide more information about such characters. Unicode_sentence_151

Apple's Last Resort font will display a substitute glyph indicating the Unicode range of the character, and SIL International's Unicode Fallback font will display a box showing the hexadecimal scalar value of the character. Unicode_sentence_152

Mapping and encodings Unicode_section_12

Several mechanisms have been specified for storing a series of code points as a series of bytes. Unicode_sentence_153

Unicode defines two mapping methods: the Unicode Transformation Format (UTF) encodings, and the Universal Coded Character Set (UCS) encodings. Unicode_sentence_154

An encoding maps (possibly a subset of) the range of Unicode code points to sequences of values in some fixed-size range, termed code units. Unicode_sentence_155

All UTF encodings map code points to a unique sequence of bytes. Unicode_sentence_156

The numbers in the names of the encodings indicate the number of bits per code unit (for UTF encodings) or the number of bytes per code unit (for UCS encodings and UTF-1). Unicode_sentence_157

UTF-8 and UTF-16 are the most commonly used encodings. Unicode_sentence_158

UCS-2 is an obsolete subset of UTF-16; UCS-4 and UTF-32 are functionally equivalent. Unicode_sentence_159

UTF encodings include: Unicode_sentence_160

Unicode_unordered_list_1

  • UTF-1, a retired predecessor of UTF-8, maximizes compatibility with ISO 2022, no longer part of The Unicode StandardUnicode_item_1_3
  • UTF-7, an obsolete 7-bit encoding sometimes used in e-mail (not part of The Unicode Standard, but only documented as an informational RFC, i.e., not on the Internet Standards Track)Unicode_item_1_4
  • UTF-8, uses one to four bytes for each code point, maximizes compatibility with ASCIIUnicode_item_1_5
  • UTF-EBCDIC, similar to UTF-8 but designed for compatibility with EBCDIC (not part of The Unicode Standard)Unicode_item_1_6
  • UTF-16, uses one or two 16-bit code units per code point, cannot encode surrogatesUnicode_item_1_7
  • UTF-32, uses one 32-bit code unit per code pointUnicode_item_1_8

UTF-8 uses one to four bytes per code point and, being compact for Latin scripts and ASCII-compatible, provides the de facto standard encoding for interchange of Unicode text. Unicode_sentence_161
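
The difference in code unit counts between the encodings can be seen by encoding a few sample characters. The following Python sketch uses the built-in codecs; the helper name and the sample characters are chosen for illustration, and the little-endian variants are used only so that no byte order mark is prepended.

```python
def code_units(text):
    """Count the code units needed for text in each encoding form."""
    return {
        "UTF-8": len(text.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF-32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }


for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", code_units(ch))
```

A character outside the Basic Multilingual Plane, such as U+1F600, needs four UTF-8 code units, two UTF-16 code units (a surrogate pair), and one UTF-32 code unit.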

It is used by FreeBSD and most recent Linux distributions as a direct replacement for legacy encodings in general text handling. Unicode_sentence_162

The UCS-2 and UTF-16 encodings specify the Unicode Byte Order Mark (BOM) for use at the beginnings of text files, which may be used for byte ordering detection (or byte endianness detection). Unicode_sentence_163

The BOM, code point U+FEFF, has the important property of unambiguity on byte reorder, regardless of the Unicode encoding used: U+FFFE (the result of byte-swapping U+FEFF) does not equate to a legal character, and U+FEFF anywhere other than the beginning of text conveys the zero-width non-break space (a character with no appearance and no effect other than preventing the formation of ligatures). Unicode_sentence_164

The same character converted to UTF-8 becomes the byte sequence EF BB BF. Unicode_sentence_165
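
Byte order detection from a leading BOM can be sketched as follows. This Python example relies on the BOM constants in the standard codecs module; the function name is hypothetical, and the UTF-32 signatures are included as an addition of this sketch because the little-endian UTF-32 BOM begins with the same two bytes as the little-endian UTF-16 BOM.

```python
import codecs

# Longer signatures first: BOM_UTF32_LE (FF FE 00 00) starts with BOM_UTF16_LE (FF FE).
_SIGNATURES = (
    ("utf-32-le", codecs.BOM_UTF32_LE),
    ("utf-32-be", codecs.BOM_UTF32_BE),
    ("utf-8", codecs.BOM_UTF8),
    ("utf-16-le", codecs.BOM_UTF16_LE),
    ("utf-16-be", codecs.BOM_UTF16_BE),
)


def detect_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for name, bom in _SIGNATURES:
        if data.startswith(bom):
            return name
    return None


print("\ufeff".encode("utf-8").hex(" "))                # ef bb bf
print(detect_bom("\ufeffhi".encode("utf-16-be")))       # utf-16-be
```

A result of None only means that no signature is present; it says nothing about which encoding the data actually uses.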

The Unicode Standard allows that the BOM "can serve as signature for UTF-8 encoded text where the character set is unmarked". Unicode_sentence_166

Some software developers have adopted it for other encodings, including UTF-8, in an attempt to distinguish UTF-8 from local 8-bit code pages. Unicode_sentence_167

However, the RFC that defines the UTF-8 standard recommends that byte order marks be forbidden in protocols using UTF-8, but discusses the cases where this may not be possible. Unicode_sentence_168

In addition, the large restriction on possible patterns in UTF-8 (for instance there cannot be any lone bytes with the high bit set) means that it should be possible to distinguish UTF-8 from other character encodings without relying on the BOM. Unicode_sentence_169
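
A simple form of this check is to attempt a strict UTF-8 decode and treat failure as evidence of another encoding. The Python sketch below is a heuristic only (the function name is hypothetical): pure ASCII data passes trivially, and success does not prove that the author intended UTF-8.

```python
def looks_like_utf8(data):
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


text = "caf\u00e9"
print(looks_like_utf8(text.encode("utf-8")))     # True
print(looks_like_utf8(text.encode("latin-1")))   # False: lone byte 0xE9
```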

In UTF-32 and UCS-4, one 32-bit code unit serves as a fairly direct representation of any character's code point (although the endianness, which varies across different platforms, affects how the code unit manifests as a byte sequence). Unicode_sentence_170

In the other encodings, each code point may be represented by a variable number of code units. Unicode_sentence_171

UTF-32 is widely used as an internal representation of text in programs (as opposed to stored or transmitted text), since every Unix operating system that uses the gcc compilers to generate software uses it as the standard "wide character" encoding. Unicode_sentence_172

Some programming languages, such as Seed7, use UTF-32 as internal representation for strings and characters. Unicode_sentence_173

Recent versions of the Python programming language (beginning with 2.2) may also be configured to use UTF-32 as the representation for Unicode strings, effectively disseminating such encoding in high-level coded software. Unicode_sentence_174

Punycode, another encoding form, enables the encoding of Unicode strings into the limited character set supported by the ASCII-based Domain Name System (DNS). Unicode_sentence_175

The encoding is used as part of IDNA, which is a system enabling the use of Internationalized Domain Names in all scripts that are supported by Unicode. Unicode_sentence_176
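
As an illustration, Python ships both a raw "punycode" codec and an "idna" codec that adds the ASCII-compatible "xn--" prefix to each label. The built-in idna codec implements the older IDNA 2003 rules, so the sketch below should be read as a demonstration of the encoding rather than of current registry practice; the sample label is an assumption of the example.

```python
label = "b\u00fccher"                 # "bücher", a commonly used sample label

puny = label.encode("punycode")       # bare Punycode, no prefix
ace = label.encode("idna")            # ASCII-compatible form used in DNS

print(puny, ace)                      # b'bcher-kva' b'xn--bcher-kva'
print(ace.decode("idna") == label)    # True
```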

Earlier and now historical proposals include UTF-5 and UTF-6. Unicode_sentence_177

GB18030 is another encoding form for Unicode, from the Standardization Administration of China. Unicode_sentence_178

It is the official character set of the People's Republic of China (PRC). Unicode_sentence_179

BOCU-1 and SCSU are Unicode compression schemes. Unicode_sentence_180

The April Fools' Day RFC of 2005 specified two parody UTF encodings, UTF-9 and UTF-18. Unicode_sentence_181

Adoption Unicode_section_13

Operating systems Unicode_section_14

Unicode has become the dominant scheme for internal processing and storage of text. Unicode_sentence_182

Although a great deal of text is still stored in legacy encodings, Unicode is used almost exclusively for building new information processing systems. Unicode_sentence_183

Early adopters tended to use UCS-2 (the fixed-width two-byte precursor to UTF-16) and later moved to UTF-16 (the variable-width current standard), as this was the least disruptive way to add support for non-BMP characters. Unicode_sentence_184

The best known such system is Windows NT (and its descendants, Windows 2000, Windows XP, Windows Vista, Windows 7, Windows 8 and Windows 10), which uses UTF-16 as the sole internal character encoding. Unicode_sentence_185

The Java and .NET bytecode environments, macOS, and KDE also use it for internal representation. Unicode_sentence_186

Partial support for Unicode can be installed on Windows 9x through the Microsoft Layer for Unicode. Unicode_sentence_187

UTF-8 (originally developed for Plan 9) has become the main storage encoding on most Unix-like operating systems (though others are also used by some libraries) because it is a relatively easy replacement for traditional extended ASCII character sets. Unicode_sentence_188

UTF-8 is also the most common Unicode encoding used in HTML documents on the World Wide Web. Unicode_sentence_189

Multilingual text-rendering engines which use Unicode include Uniscribe and DirectWrite for Microsoft Windows, ATSUI and Core Text for macOS, and Pango for GTK+ and the GNOME desktop. Unicode_sentence_190

Input methods Unicode_section_15

Main article: Unicode input Unicode_sentence_191

Because keyboard layouts cannot have simple key combinations for all characters, several operating systems provide alternative input methods that allow access to the entire repertoire. Unicode_sentence_192

ISO/IEC 14755, which standardises methods for entering Unicode characters from their code points, specifies several methods. Unicode_sentence_193

There is the Basic method, where a beginning sequence is followed by the hexadecimal representation of the code point and the ending sequence. Unicode_sentence_194

There is also a screen-selection entry method specified, where the characters are listed in a table in a screen, such as with a character map program. Unicode_sentence_195

Online tools for finding the code point for a known character include Unicode Lookup by Jonathan Hedley and Shapecatcher by Benjamin Milde. Unicode_sentence_196

In Unicode Lookup, one enters a search key (e.g. "fractions"), and a list of corresponding characters with their code points is returned. Unicode_sentence_197

In Shapecatcher, based on Shape context, one draws the character in a box and a list of characters approximating the drawing, with their code points, is returned. Unicode_sentence_198

Email Unicode_section_16

Main article: Unicode and email Unicode_sentence_199

MIME defines two different mechanisms for encoding non-ASCII characters in email, depending on whether the characters are in email headers (such as the "Subject:"), or in the text body of the message; in both cases, the original character set is identified as well as a transfer encoding. Unicode_sentence_200

For email transmission of Unicode, the UTF-8 character set and the Base64 or the Quoted-printable transfer encoding are recommended, depending on whether much of the message consists of ASCII characters. Unicode_sentence_201
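
The two mechanisms can be observed with Python's standard email package, which chooses the transfer encodings itself. The following is a minimal sketch; the subject and body strings are invented for the example, and the exact encoded form of the header is left to the library.

```python
from email.header import Header, decode_header, make_header
from email.mime.text import MIMEText

subject_text = "Gr\u00fc\u00dfe aus K\u00f6ln"   # illustrative non-ASCII subject
body_text = "Sch\u00f6ne Gr\u00fc\u00dfe!\n"     # illustrative non-ASCII body

# Header mechanism: the value becomes an ASCII "encoded word" naming its charset.
encoded_subject = Header(subject_text, charset="utf-8").encode()

# Body mechanism: a charset parameter plus a Content-Transfer-Encoding header.
message = MIMEText(body_text, "plain", "utf-8")
message["Subject"] = encoded_subject

# Decoding the header again recovers the original text.
round_trip = str(make_header(decode_header(encoded_subject)))

print(encoded_subject)
print(message["Content-Type"], message["Content-Transfer-Encoding"])
```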

The details of the two different mechanisms are specified in the MIME standards and generally are hidden from users of email software. Unicode_sentence_202

The adoption of Unicode in email has been very slow. Unicode_sentence_203

Some East Asian text is still encoded in encodings such as ISO-2022, and some devices, such as mobile phones, still cannot correctly handle Unicode data. Unicode_sentence_204

Support has been improving, however. Unicode_sentence_205

Many major free mail providers such as Yahoo, Google (Gmail), and Microsoft (Outlook.com) support it. Unicode_sentence_206

Web Unicode_section_17

Main article: Unicode and HTML Unicode_sentence_207

All W3C recommendations have used Unicode as their document character set since HTML 4.0. Unicode_sentence_208

Web browsers have supported Unicode, especially UTF-8, for many years. Unicode_sentence_209

There used to be display problems resulting primarily from font-related issues; e.g., version 6 and older of Microsoft Internet Explorer did not render many code points unless explicitly told to use a font that contains them. Unicode_sentence_210

Although syntax rules may affect the order in which characters are allowed to appear, XML (including XHTML) documents, by definition, comprise characters from most of the Unicode code points, with the exception of: Unicode_sentence_211

Unicode_unordered_list_2

  • most of the C0 control codes,Unicode_item_2_9
  • the permanently unassigned code points D800–DFFF,Unicode_item_2_10
  • FFFE or FFFF.Unicode_item_2_11

HTML characters manifest either directly as bytes according to the document's encoding, if the encoding supports them, or users may write them as numeric character references based on the character's Unicode code point. Unicode_sentence_212

For example, the references &#916;, &#1049;, &#1511;, &#1605;, &#3671;, &#12354;, &#21494;, &#33865;, and &#47568; (or the same numeric values expressed in hexadecimal, with &#x as the prefix) should display on all browsers as Δ, Й, ק, م, ๗, あ, 叶, 葉, and 말. Unicode_sentence_213
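
Numeric character references are derived directly from code points, so they can be generated and resolved mechanically. The Python sketch below builds decimal and hexadecimal references for the nine characters in the example above and resolves them again with the standard html.unescape function; the variable names are hypothetical.

```python
import html

# The nine example characters, written as escapes to keep the source ASCII-only.
chars = "\u0394\u0419\u05e7\u0645\u0e57\u3042\u53f6\u8449\ub9d0"

decimal_refs = [f"&#{ord(ch)};" for ch in chars]      # e.g. &#916;
hex_refs = [f"&#x{ord(ch):X};" for ch in chars]       # e.g. &#x394;

print(", ".join(decimal_refs))
print(html.unescape("".join(decimal_refs)) == chars)  # True
print(html.unescape("".join(hex_refs)) == chars)      # True
```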

When specifying URIs, for example as URLs in HTTP requests, non-ASCII characters must be percent-encoded. Unicode_sentence_214
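
Percent-encoding is normally applied to the UTF-8 bytes of each non-ASCII character. A minimal Python sketch using the standard urllib.parse module follows; the path is an invented example.

```python
from urllib.parse import quote, unquote

path = "/wiki/\u00dcbersicht"        # illustrative path containing U+00DC

# quote() percent-encodes the UTF-8 bytes; "/" is left untouched by default.
encoded = quote(path)
decoded = unquote(encoded)

print(encoded)                       # /wiki/%C3%9Cbersicht
print(decoded == path)               # True
```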

Fonts Unicode_section_18

Main article: Unicode font Unicode_sentence_215

Unicode is not in principle concerned with fonts per se, seeing them as implementation choices. Unicode_sentence_216

Any given character may have many allographs, from the more common bold, italic and base letterforms to complex decorative styles. Unicode_sentence_217

A font is "Unicode compliant" if the glyphs in the font can be accessed using code points defined in the Unicode standard. Unicode_sentence_218

The standard does not specify a minimum number of characters that must be included in the font; some fonts have quite a small repertoire. Unicode_sentence_219

Free and retail fonts based on Unicode are widely available, since TrueType and OpenType support Unicode. Unicode_sentence_220

These font formats map Unicode code points to glyphs, but a TrueType font is restricted to 65,535 glyphs. Unicode_sentence_221

Thousands of fonts exist on the market, but fewer than a dozen fonts—sometimes described as "pan-Unicode" fonts—attempt to support the majority of Unicode's character repertoire. Unicode_sentence_222

Instead, Unicode-based fonts typically focus on supporting only basic ASCII and particular scripts or sets of characters or symbols. Unicode_sentence_223

Several reasons justify this approach: applications and documents rarely need to render characters from more than one or two writing systems; fonts tend to demand resources in computing environments; and operating systems and applications show increasing intelligence in regard to obtaining glyph information from separate font files as needed, i.e., font substitution. Unicode_sentence_224

Furthermore, designing a consistent set of rendering instructions for tens of thousands of glyphs constitutes a monumental task; such a venture passes the point of diminishing returns for most typefaces. Unicode_sentence_225

Newlines Unicode_section_19

Unicode partially addresses the newline problem that occurs when trying to read a text file on different platforms. Unicode_sentence_226

Unicode defines a large number of characters that conforming applications should recognize as line terminators. Unicode_sentence_227

In terms of the newline, Unicode introduced U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR. Unicode_sentence_228

This was an attempt to provide a Unicode solution to encoding paragraphs and lines semantically, potentially replacing all of the various platform solutions. Unicode_sentence_229

In doing so, Unicode does provide a way around the historical platform dependent solutions. Unicode_sentence_230

Nonetheless, few if any Unicode solutions have adopted these Unicode line and paragraph separators as the sole canonical line ending characters. Unicode_sentence_231

However, a common approach to solving this issue is through newline normalization. Unicode_sentence_232

This is achieved with the Cocoa text system in Mac OS X and also with W3C XML and HTML recommendations. Unicode_sentence_233

In this approach every possible newline character is converted internally to a common newline (which one does not really matter since it is an internal operation just for rendering). Unicode_sentence_234

In other words, the text system can correctly treat the character as a newline, regardless of the input's actual encoding. Unicode_sentence_235
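
Newline normalization of this kind can be sketched with a single regular expression. The Python example below is an illustration, not the method used by any particular text system: it assumes the set of terminators given in Unicode's newline guidelines (CR LF, LF, CR, VT, FF, NEL, LINE SEPARATOR, and PARAGRAPH SEPARATOR) and maps each to LF, and the function name is hypothetical.

```python
import re

# CR LF is listed first so that the pair is replaced as one terminator.
_NEWLINES = re.compile("\r\n|[\n\r\x0b\x0c\x85\u2028\u2029]")


def normalize_newlines(text, newline="\n"):
    """Replace every recognized line terminator with a single common newline."""
    return _NEWLINES.sub(newline, text)


mixed = "a\r\nb\rc\u2028d\u2029e\x85f"
print(normalize_newlines(mixed).split("\n"))   # ['a', 'b', 'c', 'd', 'e', 'f']
```

Collapsing PARAGRAPH SEPARATOR to the same newline discards the line/paragraph distinction, which is acceptable only when, as described above, the result is used internally.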

Issues Unicode_section_20

Philosophical and completeness criticisms Unicode_section_21

Han unification (the identification of forms in the East Asian languages which one can treat as stylistic variations of the same historical character) has become one of the most controversial aspects of Unicode, despite the presence of a majority of experts from all three regions in the Ideographic Research Group (IRG), which advises the Consortium and ISO on additions to the repertoire and on Han unification. Unicode_sentence_236

Unicode has been criticized for failing to separately encode older and alternative forms of kanji which, critics argue, complicates the processing of ancient Japanese and uncommon Japanese names. Unicode_sentence_237

This is often due to the fact that Unicode encodes characters rather than glyphs (the visual representations of the basic character that often vary from one language to another). Unicode_sentence_238

Unification of glyphs leads to the perception that the languages themselves, not just the basic character representation, are being merged. Unicode_sentence_239

There have been several attempts to create alternative encodings that preserve the stylistic differences between Chinese, Japanese, and Korean characters in opposition to Unicode's policy of Han unification. Unicode_sentence_240

An example of one is TRON (although it is not widely adopted in Japan, there are some users who need to handle historical Japanese text and favor it). Unicode_sentence_241

Although the repertoire of fewer than 21,000 Han characters in the earliest version of Unicode was largely limited to characters in common modern usage, Unicode now includes more than 92,000 Han characters, and work is continuing to add thousands more historic and dialectal characters used in China, Japan, Korea, Taiwan, and Vietnam. Unicode_sentence_242

Modern font technology provides a means to address the practical issue of needing to depict a unified Han character in terms of a collection of alternative glyph representations, in the form of Unicode variation sequences. Unicode_sentence_243

For example, the Advanced Typographic tables of OpenType permit one of a number of alternative glyph representations to be selected when performing the character to glyph mapping process. Unicode_sentence_244

In this case, information can be provided within plain text to designate which alternate character form to select. Unicode_sentence_245
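As a rough illustration of how a variation sequence appears in plain text, the sketch below appends a variation selector to a unified Han character and then strips it again. The helper names are hypothetical, the base character U+845B with selector U+E0100 is only an illustrative pair, and whether a particular font honors that sequence is an assumption. Only the two Variation Selectors blocks (U+FE00 to U+FE0F and U+E0100 to U+E01EF) are checked.

```python
def is_variation_selector(ch: str) -> bool:
    """True for code points in the two Variation Selectors blocks."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def strip_variation_selectors(text: str) -> str:
    """Drop selectors, leaving only the unified base characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


base = "\u845b"                    # a unified Han character
variant = base + "\U000E0100"      # same character plus a variation selector

# The sequence is two code points; a font may use the second one
# to pick an alternate glyph for the first.
print(len(variant), [f"U+{ord(ch):04X}" for ch in variant])
print(strip_variation_selectors(variant) == base)
```

Software that compares or searches text may want to ignore the selector, which is why the stripping helper is shown alongside the check.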

If the appropriate glyphs for two characters in the same script differ only in the italic, Unicode has generally unified them, as can be seen in the comparison between Russian (labeled standard) and Serbian characters at right, meaning that the differences are displayed through smart font technology or by manually changing fonts. Unicode_sentence_246

Mapping to legacy character sets Unicode_section_22

Unicode was designed to provide code-point-by-code-point round-trip format conversion to and from any preexisting character encodings, so that text files in older character sets can be converted to Unicode and back again to yield the same file, without employing context-dependent interpretation. Unicode_sentence_247

That has meant that inconsistent legacy architectures, such as combining diacritics and precomposed characters, both exist in Unicode, giving more than one method of representing some text. Unicode_sentence_248

This is most pronounced in the three different encoding forms for Korean Hangul. Unicode_sentence_249
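The sketch below shows one reading of those three forms for a single syllable: a precomposed syllable, a sequence of conjoining jamo, and a sequence of compatibility jamo. Treating these as "the three forms" is an assumption made for illustration, as is the choice of the syllable U+D55C. The standard library's unicodedata.normalize relates the first two forms; the compatibility jamo are shown only as a distinct third spelling.

```python
import unicodedata

precomposed = "\ud55c"                  # one precomposed Hangul syllable
conjoining = "\u1112\u1161\u11ab"       # leading consonant, vowel, trailing consonant
compatibility = "\u314e\u314f\u3134"    # compatibility jamo for the same letters


def code_points(text: str) -> list:
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


for label, form in [("precomposed", precomposed),
                    ("conjoining", conjoining),
                    ("compatibility", compatibility)]:
    print(label, code_points(form))

# Canonical normalization converts between the first two forms.
print(unicodedata.normalize("NFD", precomposed) == conjoining)
print(unicodedata.normalize("NFC", conjoining) == precomposed)
```

A plain string comparison treats all three spellings as different, which is why text is usually normalized before it is compared.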

Since version 3.0, any precomposed characters that can be represented by a combining sequence of already existing characters can no longer be added to the standard in order to preserve interoperability between software using different versions of Unicode. Unicode_sentence_250

Injective mappings must be provided between characters in existing legacy character sets and characters in Unicode to facilitate conversion to Unicode and allow interoperability with legacy software. Unicode_sentence_251

Lack of consistency in various mappings between earlier Japanese encodings such as Shift-JIS or EUC-JP and Unicode led to round-trip format conversion mismatches, particularly the mapping of the character JIS X 0208 '〜' (1-33, WAVE DASH), heavily used in legacy database data, to either U+FF5E ～ FULLWIDTH TILDE (in Microsoft Windows) or U+301C 〜 WAVE DASH (other vendors). Unicode_sentence_252

Some Japanese computer programmers objected to Unicode because it requires them to separate the use of U+005C \ REVERSE SOLIDUS (backslash) and U+00A5 ¥ YEN SIGN, which was mapped to 0x5C in JIS X 0201, and a lot of legacy code exists with this usage. Unicode_sentence_253

(This encoding also replaces tilde '~' 0x7E with macron '¯', now 0xAF.) Unicode_sentence_254

The separation of these characters exists in ISO 8859-1, from long before Unicode. Unicode_sentence_255

Indic scripts Unicode_section_23

Indic scripts such as Tamil and Devanagari are each allocated only 128 code points, matching the ISCII standard. Unicode_sentence_256

The correct rendering of Unicode Indic text requires transforming the stored logical order characters into visual order and the forming of ligatures (aka conjuncts) out of components. Unicode_sentence_257
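To make the distinction between stored order and rendered form concrete, the following sketch lists the code points behind two short Devanagari strings. The two strings are chosen here purely for illustration: a conjunct stored as consonant, virama, consonant, and a syllable whose vowel sign is stored after the consonant even though it is drawn to its left. The helper describe is a hypothetical name.

```python
import unicodedata

conjunct = "\u0915\u094d\u0937"   # KA + VIRAMA + SSA, usually drawn as one ligature
syllable = "\u0915\u093f"         # KA + VOWEL SIGN I, the sign is drawn left of KA


def describe(text: str) -> list:
    """Return (code point, character name) pairs in stored, logical order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for sample in (conjunct, syllable):
    for code_point, name in describe(sample):
        print(code_point, name)
    print()
```

The stored sequence never changes; forming the ligature and moving the vowel sign are left to the rendering engine and font.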

Some local scholars argued in favor of assignments of Unicode code points to these ligatures, going against the practice for other writing systems, though Unicode contains some Arabic and other ligatures for backward compatibility purposes only. Unicode_sentence_258

Encoding of any new ligatures in Unicode will not happen, in part because the set of ligatures is font-dependent, and Unicode is an encoding independent of font variations. Unicode_sentence_259

The same kind of issue arose for the Tibetan script in 2003 when the Standardization Administration of China proposed encoding 956 precomposed Tibetan syllables, but these were rejected for encoding by the relevant ISO committee (ISO/IEC JTC 1/SC 2). Unicode_sentence_260

Thai alphabet support has been criticized for its ordering of Thai characters. Unicode_sentence_261

The vowels เ, แ, โ, ใ, ไ that are written to the left of the preceding consonant are in visual order instead of phonetic order, unlike the Unicode representations of other Indic scripts. Unicode_sentence_262

This complication is due to Unicode inheriting the Thai Industrial Standard 620, which worked in the same way, and was the way in which Thai had always been written on keyboards. Unicode_sentence_263

This ordering problem complicates the Unicode collation process slightly, requiring table lookups to reorder Thai characters for collation. Unicode_sentence_264

Even if Unicode had adopted encoding according to spoken order, it would still be problematic to collate words in dictionary order. Unicode_sentence_265

E.g., the word แสดง [sa dɛːŋ] "perform" starts with a consonant cluster "สด" (with an inherent vowel for the consonant "ส"); the vowel แ- would come after the ด in spoken order, but in a dictionary the word is collated as it is written, with the vowel following the ส. Unicode_sentence_266
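The reordering step mentioned above can be approximated with a small preprocessing function. This is a simplified sketch, not the Unicode Collation Algorithm: it only swaps each of the five leading vowels with the consonant that follows it, and the names PREFIX_VOWELS, is_thai_consonant, and collation_order are hypothetical. The sample word is assumed to be แสดง, the word for "perform" discussed above.

```python
# The five vowels that are written (and stored) before their consonant.
PREFIX_VOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")


def is_thai_consonant(ch: str) -> bool:
    """Thai consonants occupy U+0E01 through U+0E2E."""
    return "\u0e01" <= ch <= "\u0e2e"


def collation_order(text: str) -> str:
    """Swap each leading vowel with the consonant that follows it."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in PREFIX_VOWELS and is_thai_consonant(chars[i + 1]):
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2      # skip past the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


word = "\u0e41\u0e2a\u0e14\u0e07"   # stored order: vowel, then three consonants
print([f"U+{ord(ch):04X}" for ch in collation_order(word)])
```

After the swap the vowel follows the first consonant, which matches the dictionary ordering described in the example rather than the spoken order.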

Combining characters Unicode_section_24

Main article: Combining character Unicode_sentence_267

See also: Unicode normalization § Normalization Unicode_sentence_268

Characters with diacritical marks can generally be represented either as a single precomposed character or as a decomposed sequence of a base letter plus one or more non-spacing marks. Unicode_sentence_269

For example, ḗ (precomposed e with macron and acute above) and ḗ (e followed by the combining macron above and combining acute above) should be rendered identically, both appearing as an e with a macron and acute accent, but in practice, their appearance may vary depending upon what rendering engine and fonts are being used to display the characters. Unicode_sentence_270
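The equivalence between the two spellings can be checked with Python's standard unicodedata module. The sketch below assumes the precomposed character is U+1E17 and the decomposed sequence is e followed by U+0304 and U+0301, as the description above suggests.

```python
import unicodedata

precomposed = "\u1e17"            # e with macron and acute, one code point
decomposed = "e\u0304\u0301"      # e + combining macron + combining acute

# The raw strings differ in length and content ...
print(len(precomposed), len(decomposed), precomposed == decomposed)

# ... but they are canonically equivalent, so normalizing both to the
# same form (NFC or NFD) makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

Normalization addresses comparison and storage only; whether the two forms look identical on screen still depends on the rendering engine and font.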

Similarly, underdots, as needed in the romanization of Indic languages, will often be placed incorrectly. Unicode characters that map to precomposed glyphs can be used in many cases, thus avoiding the problem, but where no precomposed character has been encoded, the problem can often be solved by using a specialist Unicode font such as Charis SIL that uses Graphite, OpenType, or AAT technologies for advanced rendering features. Unicode_sentence_271

Anomalies Unicode_section_25

Main article: Unicode alias names and abbreviations Unicode_sentence_272

The Unicode standard has imposed rules intended to guarantee stability. Unicode_sentence_273

Depending on the strictness of a rule, a change can be prohibited or allowed. Unicode_sentence_274

For example, a "name" given to a code point cannot and will not change. Unicode_sentence_275

But a "script" property is more flexible, by Unicode's own rules. Unicode_sentence_276

In version 2.0, Unicode changed many code point "names" from version 1. Unicode_sentence_277

At the same time, Unicode stated that from then on, a name assigned to a code point would never change. Unicode_sentence_278

This implies that when mistakes are published, these mistakes cannot be corrected, even if they are trivial (as happened in one instance with the spelling BRAKCET for BRACKET in a character name). Unicode_sentence_279

In 2006 a list of anomalies in character names was first published, and, as of April 2017, there were 94 characters with identified issues, for example: Unicode_sentence_280

Unicode_unordered_list_3

  • U+2118 ℘ SCRIPT CAPITAL P: This is a small letter. The capital is U+1D4AB 𝒫 MATHEMATICAL SCRIPT CAPITAL PUnicode_item_3_12
  • U+034F ͏ COMBINING GRAPHEME JOINER: Does not join graphemes.Unicode_item_3_13
  • U+A015 ꀕ YI SYLLABLE WU: This is not a Yi syllable, but a Yi iteration mark.Unicode_item_3_14
  • U+FE18 ︘ PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET: bracket is spelled incorrectly.Unicode_item_3_15

Spelling errors are resolved by using Unicode alias names and abbreviations. Unicode_sentence_281
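The interplay between an immutable name and its corrective alias can be seen in Python, whose unicodedata.lookup accepts name aliases as of Python 3.3. The sketch below uses U+FE18 from the list above; the corrected spelling is assumed to be registered as a formal alias, and the variable names are illustrative.

```python
import unicodedata

ch = "\ufe18"

# The formal name keeps the misspelling, because names never change.
formal_name = unicodedata.name(ch)
print(formal_name)

# The corrected spelling is accepted as an alias and finds the same character.
corrected = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"
print(unicodedata.lookup(corrected) == ch)

# Looking the character up by its formal (misspelled) name still works too.
print(unicodedata.lookup(formal_name) == ch)
```

Both spellings resolve to the same code point, so existing data that relies on the original name stays valid while new code can use the corrected one.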


Credits to the contents of this page go to the authors of the corresponding Wikipedia page: en.wikipedia.org/wiki/Unicode.