Extended Unix Code

From Wikipedia for FEVERv2

Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and simplified Chinese. Extended Unix Code_sentence_0

The structure of EUC is based on the ISO-2022 standard, which specifies a way to represent character sets containing a maximum of 94 characters, or 8836 (94^2) characters, or 830584 (94^3) characters, as sequences of 7-bit codes. Extended Unix Code_sentence_1

Only ISO-2022 compliant character sets can have EUC forms. Extended Unix Code_sentence_2

Up to four coded character sets (referred to as G0, G1, G2, and G3 or as code sets 0, 1, 2, and 3) can be represented with the EUC scheme. Extended Unix Code_sentence_3

G0 is almost always an ISO-646 compliant coded character set such as US-ASCII, ISO 646:KR (KS X 1003) or ISO 646:JP (the lower half of JIS X 0201) that is invoked on GL (i.e. with the most significant bit cleared). Extended Unix Code_sentence_4

One common deviation from US-ASCII is that 0x5C (backslash in US-ASCII) is often used to represent a yen sign in EUC-JP (see below) and a won sign in EUC-KR. Extended Unix Code_sentence_5

To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code. Extended Unix Code_sentence_6
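The high-bit transformation described above can be sketched in Python. The helper name `to_euc` is hypothetical. The sample bytes 0x56 0x50 are the 7-bit GB 2312 code for 中, and Python's built-in `gb2312` codec (which decodes the EUC-CN form) is used only to check the result.

```python
def to_euc(seven_bit: bytes) -> bytes:
    """Set the most significant bit of each 7-bit byte (hypothetical helper)."""
    if any(b > 0x7F for b in seven_bit):
        raise ValueError("expected 7-bit ISO 2022 bytes")
    # Adding 128 and OR-ing with 0x80 are equivalent for 7-bit values.
    return bytes(b | 0x80 for b in seven_bit)


def is_g0_byte(b: int) -> bool:
    """A byte with the high bit cleared belongs to the ISO 646 (G0) set."""
    return b < 0x80


euc = to_euc(b"\x56\x50")
print(euc.hex())             # d6d0
print(euc.decode("gb2312"))  # 中
```

Because every byte of a multibyte character has its high bit set, a scanner can classify each byte in isolation with a test like `is_g0_byte`.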

The most commonly used EUC codes are variable-width encodings with a character belonging to G0 (ISO-646 compliant coded character set) taking one byte and a character belonging to G1 (taken by a 94x94 coded character set) represented in two bytes. Extended Unix Code_sentence_7

The EUC-CN form of GB 2312 and EUC-KR are examples of such two-byte EUC codes. Extended Unix Code_sentence_8

EUC-JP includes characters represented by up to three bytes whereas a single character in EUC-TW can take up to four bytes. Extended Unix Code_sentence_9

Modern applications are more likely to use UTF-8, which can represent every character in the EUC codes and many more, and is generally more portable, with fewer vendor deviations and errors. Extended Unix Code_sentence_10

EUC is, however, still very popular, especially EUC-KR in South Korea. Extended Unix Code_sentence_11

EUC-CN Extended Unix Code_section_0

Extended Unix Code_table_infobox_0

EUC-CNExtended Unix Code_table_caption_0
MIME / IANAExtended Unix Code_header_cell_0_0_0 GB2312Extended Unix Code_cell_0_0_1
Alias(es)Extended Unix Code_header_cell_0_1_0 csGB2312Extended Unix Code_cell_0_1_1
Language(s)Extended Unix Code_header_cell_0_2_0 Simplified Chinese, English, RussianExtended Unix Code_cell_0_2_1
StandardExtended Unix Code_header_cell_0_3_0 GB 2312 (1980)Extended Unix Code_cell_0_3_1
ClassificationExtended Unix Code_header_cell_0_4_0 Extended ASCII, variable-width encoding, CJK encoding, EUCExtended Unix Code_cell_0_4_1
ExtendsExtended Unix Code_header_cell_0_5_0 US-ASCIIExtended Unix Code_cell_0_5_1
ExtensionsExtended Unix Code_header_cell_0_6_0 748, GBK, GB 18030, x-mac-chinesesimpExtended Unix Code_cell_0_6_1
Transforms / EncodesExtended Unix Code_header_cell_0_7_0 GB 2312Extended Unix Code_cell_0_7_1
Succeeded byExtended Unix Code_header_cell_0_8_0 GBK, GB 18030Extended Unix Code_cell_0_8_1

EUC-CN is the usual encoded form of the GB 2312 standard for simplified Chinese characters. Extended Unix Code_sentence_12

Unlike the case of Japanese JIS X 0208 and ISO-2022-JP, GB 2312 is not normally used in a 7-bit ISO 2022 code version, although a variant form called HZ (which delimits GB 2312 text with ASCII sequences) was sometimes used on USENET. Extended Unix Code_sentence_13

An ASCII character is represented in its usual encoding. Extended Unix Code_sentence_14

A character from GB 2312 is represented by two bytes, both from the range 0xA1–0xFE. Extended Unix Code_sentence_15
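As a rough illustration of these byte ranges, the following Python sketch checks whether a byte string has the structure of EUC-CN. The function name `looks_like_euc_cn` is hypothetical. The check is purely structural and does not verify that a two-byte code is actually assigned in GB 2312.

```python
def looks_like_euc_cn(data: bytes) -> bool:
    """Structural check: ASCII bytes, or pairs of bytes in 0xA1-0xFE."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # ASCII, one byte
            i += 1
        elif 0xA1 <= b <= 0xFE:           # lead byte of a GB 2312 character
            if i + 1 >= len(data) or not 0xA1 <= data[i + 1] <= 0xFE:
                return False              # missing or out-of-range trail byte
            i += 2
        else:
            return False                  # 0x80-0xA0 and 0xFF are not used
    return True


sample = "GB 2312: 中文".encode("gb2312")
print(looks_like_euc_cn(sample))          # True
print(looks_like_euc_cn(b"\x81\x40"))     # False
```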

Related Simplified Chinese encoding systems Extended Unix Code_section_1

748 code Extended Unix Code_section_2

An encoding related to EUC-CN is the "748" code used in the WITS typesetting system developed by Beijing's Founder Technology (now obsoleted by its newer FITS typesetting system). Extended Unix Code_sentence_16

The 748 code contains all of GB 2312, but is not ISO 2022–compliant and therefore not a true EUC code. Extended Unix Code_sentence_17

(It uses an 8-bit lead byte but distinguishes between a second byte with its most significant bit set and one with its most significant bit cleared, and is therefore more similar in structure to Big5 and other non–ISO 2022–compliant DBCS encoding systems.) Extended Unix Code_sentence_18

The non-GB2312 portion of the 748 code contains traditional and Hong Kong characters and other glyphs used in newspaper typesetting. Extended Unix Code_sentence_19

GBK and GB 18030 Extended Unix Code_section_3

Main articles: GBK (character encoding) and GB 18030 Extended Unix Code_sentence_20

GBK is an extension to GB 2312. Extended Unix Code_sentence_21

It defines an extended form of the EUC-CN encoding capable of representing a larger array of CJK characters sourced largely from Unicode 1.1, including traditional Chinese characters and characters used only in Japanese. Extended Unix Code_sentence_22

It is not, however, a true EUC code, because ASCII bytes may appear as trail bytes (and C1 bytes, not limited to the single shifts, may appear as lead or trail bytes), due to a larger encoding space being required. Extended Unix Code_sentence_23
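The ASCII trail byte issue can be seen with Python's built-in `gbk` codec. This is a small sketch: the helper name `has_ascii_trail_byte` is hypothetical, and 丂 (U+4E02) is chosen as a character that GBK adds beyond GB 2312.

```python
def has_ascii_trail_byte(ch: str, encoding: str) -> bool:
    """True if a multibyte encoding of ch contains a byte below 0x80 after the lead byte."""
    encoded = ch.encode(encoding)
    return len(encoded) > 1 and any(b < 0x80 for b in encoded[1:])


print("丂".encode("gbk"))                # b'\x81@' (lead 0x81, trail 0x40)
print(has_ascii_trail_byte("丂", "gbk"))  # True
print(has_ascii_trail_byte("中", "gbk"))  # False: 0xD6 0xD0, same as EUC-CN
```

A byte-oriented search for `@` in GBK text can therefore match the second half of a character, which cannot happen in a true EUC code.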

Variants of GBK are implemented by Windows code page 936 (the Microsoft Windows code page for simplified Chinese), and by IBM's code page 1386. Extended Unix Code_sentence_24

The Unicode-based GB 18030 character encoding defines an extension of GBK capable of encoding the entirety of Unicode. Extended Unix Code_sentence_25

However, Unicode encoded as GB 18030 is a variable-width encoding which may use up to four bytes per character, due to an even larger encoding space being required. Extended Unix Code_sentence_26

Being an extension of GBK, it is a superset of EUC-CN but is not itself a true EUC code. Extended Unix Code_sentence_27

Being a Unicode encoding, its repertoire is identical to that of other Unicode transformation formats such as UTF-8. Extended Unix Code_sentence_28
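A short Python sketch of these properties, using the built-in `gb18030` and `gb2312` codecs. The sample characters are arbitrary choices for illustration.

```python
samples = ["A", "中", "丂", "\U00010000"]

# Byte length of each sample in GB 18030: ASCII is one byte, GB 2312 and GBK
# characters are two bytes, and supplementary-plane characters are four bytes.
lengths = [len(ch.encode("gb18030")) for ch in samples]
print(lengths)  # [1, 2, 2, 4]

# GB 2312 characters keep their EUC-CN bytes in GB 18030.
print("中".encode("gb18030") == "中".encode("gb2312"))  # True

# Text round-trips through GB 18030 just as it does through UTF-8.
text = "EUC 中文 \U00010000"
print(text.encode("gb18030").decode("gb18030") == text)  # True
```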

Mac OS Chinese Simplified Extended Unix Code_section_4

Other EUC-CN variants deviating from the EUC mechanism include the Mac OS Chinese Simplified script (known as Code page 10008 or x-mac-chinesesimp). Extended Unix Code_sentence_29

It uses the bytes 0x80, 0x81, 0x82, 0xA0, 0xFD, 0xFE and 0xFF for the U with umlaut (ü), two special font metric characters, the non-breaking space, the copyright sign (©), the trademark sign (™) and the ellipsis (…) respectively. Extended Unix Code_sentence_30

This differs in what is regarded as a single-byte character versus the first byte of a two-byte character from both EUC (where, of those, 0xFD and 0xFE are defined as lead bytes) and GBK (where, of those, 0x81, 0x82, 0xFD and 0xFE are defined as lead bytes). Extended Unix Code_sentence_31

This use of 0xA0, 0xFD, 0xFE and 0xFF matches Apple's Shift_JIS variant. Extended Unix Code_sentence_32

EUC-JP Extended Unix Code_section_5

Extended Unix Code_table_infobox_1

EUC-JPExtended Unix Code_table_caption_1
MIME / IANAExtended Unix Code_header_cell_1_0_0 EUC-JPExtended Unix Code_cell_1_0_1
Alias(es)Extended Unix Code_header_cell_1_1_0 Unixized JIS (UJIS), csEUCPkdFmtJapaneseExtended Unix Code_cell_1_1_1
Language(s)Extended Unix Code_header_cell_1_2_0 Japanese, English, RussianExtended Unix Code_cell_1_2_1
ClassificationExtended Unix Code_header_cell_1_3_0 Extended ISO 646, variable-width encoding, CJK encoding, EUCExtended Unix Code_cell_1_3_1
ExtendsExtended Unix Code_header_cell_1_4_0 US-ASCII or ISO 646:JPExtended Unix Code_cell_1_4_1
Transforms / EncodesExtended Unix Code_header_cell_1_5_0 JIS X 0208, JIS X 0212, JIS X 0201Extended Unix Code_cell_1_5_1
Succeeded byExtended Unix Code_header_cell_1_6_0 EUC-JISx0213Extended Unix Code_cell_1_6_1

Extended Unix Code_table_infobox_2

EUC-JIS-2004Extended Unix Code_table_caption_2
Alias(es)Extended Unix Code_header_cell_2_0_0 EUC-JISx0213Extended Unix Code_cell_2_0_1
Language(s)Extended Unix Code_header_cell_2_1_0 Japanese, Ainu, English, RussianExtended Unix Code_cell_2_1_1
StandardExtended Unix Code_header_cell_2_2_0 JIS X 0213Extended Unix Code_cell_2_2_1
ClassificationExtended Unix Code_header_cell_2_3_0 Extended ASCII, variable-width encoding, CJK encoding, EUCExtended Unix Code_cell_2_3_1
ExtendsExtended Unix Code_header_cell_2_4_0 US-ASCIIExtended Unix Code_cell_2_4_1
Transforms / EncodesExtended Unix Code_header_cell_2_5_0 JIS X 0213, JIS X 0201 (Kana)Extended Unix Code_cell_2_5_1
Preceded byExtended Unix Code_header_cell_2_6_0 EUC-JPExtended Unix Code_cell_2_6_1

EUC-JP is a variable-width encoding used to represent the elements of three Japanese character set standards, namely JIS X 0208, JIS X 0212, and JIS X 0201. Extended Unix Code_sentence_33

Other names for this encoding include Unixized JIS (or UJIS) and AT&T JIS. Extended Unix Code_sentence_34

Since August 2018, 0.1% of all web pages have used EUC-JP, while 3.2% of Japanese web sites use this encoding (less than either Shift JIS or UTF-8). Extended Unix Code_sentence_35

It is called Code page 954 by IBM. Extended Unix Code_sentence_36

Microsoft has two code page numbers for this encoding (51932 and 20932). Extended Unix Code_sentence_37

This encoding scheme allows the easy mixing of 7-bit ASCII and 8-bit Japanese without the need for the escape characters employed by ISO-2022-JP, which is based on the same character set standards, and without ASCII bytes appearing as trail bytes (unlike Shift JIS). Extended Unix Code_sentence_38
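The contrast with Shift JIS can be illustrated with Python's built-in `shift_jis` and `euc_jp` codecs. The katakana ソ is a well-known case whose Shift JIS trail byte is 0x5C, the ASCII backslash. The helper name `trail_bytes_in_ascii_range` is hypothetical.

```python
def trail_bytes_in_ascii_range(encoded: bytes) -> bool:
    """True if any byte after the first falls in the 7-bit ASCII range."""
    return any(b < 0x80 for b in encoded[1:])


sjis = "ソ".encode("shift_jis")
eucjp = "ソ".encode("euc_jp")

print(sjis.hex(), trail_bytes_in_ascii_range(sjis))    # 835c True
print(eucjp.hex(), trail_bytes_in_ascii_range(eucjp))  # a5bd False
```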

A related and partially compatible encoding, called EUC-JISx0213 or EUC-JIS-2004, encodes JIS X 0201 and JIS X 0213 (similarly to Shift_JISx0213, its Shift_JIS-based counterpart). Extended Unix Code_sentence_39

Compared to EUC-CN or EUC-KR, EUC-JP did not become as widely adopted on PC and Macintosh systems in Japan, which used Shift JIS or its extensions (Windows code page 932 on Microsoft Windows, and MacJapanese on classic Mac OS), although it became heavily used by Unix or Unix-like operating systems (except for HP-UX). Extended Unix Code_sentence_40

Therefore, whether Japanese web sites use EUC-JP or Shift_JIS often depends on what OS the author uses. Extended Unix Code_sentence_41

Vendor extensions to EUC-JP were usually allocated within the individual code sets, as opposed to using invalid EUC sequences (as in popular extensions of EUC-CN and EUC-KR). Extended Unix Code_sentence_42

Characters are encoded as follows: Extended Unix Code_sentence_43

Extended Unix Code_unordered_list_0

  • As an EUC/ISO 2022 compliant encoding, the C0 control characters, space and DEL are represented as in ASCII.Extended Unix Code_item_0_0
  • A graphical character from ASCII (code set 0) is represented as its usual one-byte representation, in the range 0x21 – 0x7E. While some variants of EUC-JP encode the lower half of JIS X 0201 here, most encode ASCII, including the W3C/WHATWG Encoding standard used by HTML5, and so does EUC-JIS-2004. While this means that 0x5C is typically mapped to Unicode as U+005C REVERSE SOLIDUS (the ASCII backslash), U+005C may be displayed as a Yen sign by certain Japanese-locale fonts, e.g. on Microsoft Windows, for compatibility with the lower half of JIS X 0201.Extended Unix Code_item_0_1
  • A character from JIS X 0208 (code set 1) is represented by two bytes, both in the range 0xA1 – 0xFE. This differs from the ISO-2022-JP representation by having the high bit set. This code set may also contain vendor extensions in some EUC-JP variants. In EUC-JIS-2004, the first plane of JIS X 0213 is encoded here, which is effectively a superset of standard JIS X 0208.Extended Unix Code_item_0_2
  • A character from the upper half of JIS X 0201 (half-width kana, code set 2) is represented by two bytes, the first being 0x8E, the second being the usual JIS X 0201 representation in the range 0xA1 – 0xDF. This set may contain IBM vendor extensions in some variants.Extended Unix Code_item_0_3
  • A character from JIS X 0212 (code set 3) is represented in EUC-JP by three bytes, the first being 0x8F, the following two being in the range 0xA1–0xFE, i.e. with the high bit set. In addition to standard JIS X 0212, code set 3 of some EUC-JP variants may also contain extensions in rows 83 and 84 to represent characters from IBM's Shift JIS extensions which lack standard JIS X 0212 mappings, which may be coded in either of two layouts, one defined by IBM themselves and one defined by the OSF. In EUC-JIS-2004, the second plane of JIS X 0213 is encoded here, which does not collide with the allocated rows in standard JIS X 0212. Some implementations of EUC-JIS-2004, such as the one used by Python, allow both JIS X 0212 and JIS X 0213 plane 2 characters in this set.Extended Unix Code_item_0_4
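The rules in this list can be sketched as a small Python scanner that splits packed EUC-JP bytes into characters and labels each with its code set. The function name `split_euc_jp` is hypothetical. The sketch checks only the byte structure of standard EUC-JP, not whether a code is assigned, and it ignores vendor extensions.

```python
SS2, SS3 = 0x8E, 0x8F  # single-shift bytes introducing code sets 2 and 3


def _is_gr(b: int) -> bool:
    return 0xA1 <= b <= 0xFE


def split_euc_jp(data: bytes):
    """Return a list of (code_set, raw_bytes) pairs for packed EUC-JP input."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            code_set, width = 0, 1   # ASCII, C0 controls, space, DEL
        elif b == SS2:
            code_set, width = 2, 2   # half-width kana from JIS X 0201
        elif b == SS3:
            code_set, width = 3, 3   # JIS X 0212
        elif _is_gr(b):
            code_set, width = 1, 2   # JIS X 0208
        else:
            raise ValueError(f"unexpected byte 0x{b:02X} at offset {i}")
        chunk = data[i:i + width]
        if len(chunk) != width:
            raise ValueError(f"truncated sequence at offset {i}")
        trail = chunk[1:]
        if code_set == 2:
            valid = 0xA1 <= trail[0] <= 0xDF
        else:
            valid = all(_is_gr(t) for t in trail)
        if not valid:
            raise ValueError(f"bad trail byte at offset {i}")
        out.append((code_set, chunk))
        i += width
    return out


for code_set, raw in split_euc_jp(b"A\xa4\xa2\x8e\xb1\x8f\xb0\xa1"):
    print(code_set, raw.hex())
```

In this sample, 0xA4 0xA2 is あ (JIS X 0208) and 0x8E 0xB1 is the half-width katakana ｱ.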

EUC-KR Extended Unix Code_section_6

"EUC-KR" redirects here. Extended Unix Code_sentence_44

For the variant so named in HTML standards, see Unified Hangul Code. Extended Unix Code_sentence_45

Extended Unix Code_table_infobox_3

EUC-KRExtended Unix Code_table_caption_3
MIME / IANAExtended Unix Code_header_cell_3_0_0 EUC-KRExtended Unix Code_cell_3_0_1
Alias(es)Extended Unix Code_header_cell_3_1_0 Wansung, IBM-970Extended Unix Code_cell_3_1_1
Language(s)Extended Unix Code_header_cell_3_2_0 Korean, English, RussianExtended Unix Code_cell_3_2_1
StandardExtended Unix Code_header_cell_3_3_0 KS X 2901 (KS C 5861)Extended Unix Code_cell_3_3_1
ClassificationExtended Unix Code_header_cell_3_4_0 Extended ISO 646, variable-width encoding, CJK encoding, EUCExtended Unix Code_cell_3_4_1
ExtendsExtended Unix Code_header_cell_3_5_0 US-ASCII or ISO 646:KRExtended Unix Code_cell_3_5_1
ExtensionsExtended Unix Code_header_cell_3_6_0 Mac OS Korean, IBM-949, Unified Hangul Code (Windows-949)Extended Unix Code_cell_3_6_1
Transforms / EncodesExtended Unix Code_header_cell_3_7_0 KS X 1001Extended Unix Code_cell_3_7_1
Succeeded byExtended Unix Code_header_cell_3_8_0 Unified Hangul Code (web standards)Extended Unix Code_cell_3_8_1

EUC-KR is a variable-width encoding to represent Korean text using two coded character sets, KS X 1001 (formerly KS C 5601) and either ISO 646:KR (KS X 1003, formerly KS C 5636) or US-ASCII, depending on variant. Extended Unix Code_sentence_46

KS X 2901 (formerly KS C 5861) stipulates the encoding and RFC  dubbed it as EUC-KR. Extended Unix Code_sentence_47

A character drawn from KS X 1001 (G1, code set 1) is encoded as two bytes in GR (0xA1–0xFE) and a character from KS X 1003 or US-ASCII (G0, code set 0) takes one byte in GL (0x21–0x7E). Extended Unix Code_sentence_48
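For illustration, Python's built-in `euc_kr` codec shows the one-byte and two-byte widths directly. The helper name `byte_widths` is hypothetical.

```python
def byte_widths(text: str, encoding: str = "euc_kr"):
    """Number of bytes each character of text occupies in the given encoding."""
    return [len(ch.encode(encoding)) for ch in text]


encoded = "한글 OK".encode("euc_kr")
print(encoded)                 # b'\xc7\xd1\xb1\xdb OK'
print(byte_widths("한글 OK"))  # [2, 2, 1, 1, 1]
```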

When used with ASCII, it is called Code page 970 by IBM. Extended Unix Code_sentence_49

It is known as Code page 51949 by Microsoft. Extended Unix Code_sentence_50

It is usually referred to as Wansung (Korean: 완성, romanized: Wanseong, lit. 'precomposed') in the Republic of Korea. Extended Unix Code_sentence_51

As of July 2020, 0.1% of all web pages globally use EUC-KR. That global figure understates its regional importance: 15.6% of South Korean web pages use it (South Korea being the only country the encoding is meant for), making it the most popular non-UTF-8/Unicode encoding for a language or web domain, while only 8.4% of web pages in the Korean language use it. This makes UTF-8 less popular in South Korea than in seemingly any other country in the world. Extended Unix Code_sentence_53

Including extensions, it is the most widely used legacy character encoding in Korea on all three major platforms (macOS, other Unix-like OSes, and Windows), but its use has been very slowly shifting to UTF-8 as it gains popularity, especially on Linux and macOS. Extended Unix Code_sentence_54

As with most other encodings, UTF-8 is now preferred for new use, solving problems with consistency between platforms and vendors. Extended Unix Code_sentence_55

Related Korean encoding systems Extended Unix Code_section_7

Unified Hangul Code Extended Unix Code_section_8

Main article: Unified Hangul Code Extended Unix Code_sentence_56

A common extension of EUC-KR is the Unified Hangul Code (통합형 한글 코드, Tonghabhyeong Hangeul Kodeu, or 통합 완성형, Tonghab Wansunghyung), which is the default Korean codepage on Microsoft Windows (code page 949, numbered 1363 by IBM). Extended Unix Code_sentence_57

IBM's code page 949 is a different, unrelated, EUC-KR extension. Extended Unix Code_sentence_58

Unified Hangul Code extends EUC-KR by using codes which do not conform to the EUC structure to incorporate additional syllable blocks, completing the coverage of the composed syllable blocks available in Johab and Unicode. Extended Unix Code_sentence_59
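The structural difference can be sketched with Python's built-in `cp949` codec, which implements Unified Hangul Code. The helper name `fits_euc_structure` is hypothetical. 똠 is used as an example of a syllable block commonly cited as missing from KS X 1001.

```python
def fits_euc_structure(encoded: bytes) -> bool:
    """True for one ASCII byte or for two bytes that are both in 0xA1-0xFE."""
    if len(encoded) == 1:
        return encoded[0] < 0x80
    return len(encoded) == 2 and all(0xA1 <= b <= 0xFE for b in encoded)


in_ks_x_1001 = "한".encode("cp949")  # same bytes as in EUC-KR
extension = "똠".encode("cp949")     # available only through the extension

print(in_ks_x_1001.hex(), fits_euc_structure(in_ks_x_1001))
print(extension.hex(), fits_euc_structure(extension))
```

The first character keeps its EUC-KR bytes, while the extension character still takes two bytes but at least one of them falls outside 0xA1–0xFE.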

The W3C/WHATWG Encoding Standard used by HTML5 incorporates the Unified Hangul Code extensions into its definition of EUC-KR. Extended Unix Code_sentence_60

Mac OS Korean (HangulTalk) Extended Unix Code_section_9

Other EUC-KR compatible extensions include the Mac OS Korean encoding, used by the classic Mac OS. Extended Unix Code_sentence_61

EUC-TW Extended Unix Code_section_10

EUC-TW is a variable-width encoding that supports US-ASCII and 16 planes of CNS 11643, each of which is 94x94. Extended Unix Code_sentence_62

It is a rarely used encoding for traditional Chinese characters as used in Taiwan. Extended Unix Code_sentence_63

Variants of Big5 are much more common than EUC-TW, although Big5 encodes only the hanzi in the first two planes of CNS 11643; UTF-8 is also becoming more common. Extended Unix Code_sentence_64

Extended Unix Code_unordered_list_1

  • As an EUC/ISO 2022 encoding, the C0 control characters, ASCII space and DEL are encoded as in ASCII.Extended Unix Code_item_1_5
  • A graphical character from US-ASCII (G0, code set 0) is encoded in GL as its usual single byte representation (0x21–0x7E).Extended Unix Code_item_1_6
  • A character from CNS 11643 plane 1 (code set 1) is encoded as two bytes in GR (0xA1–0xFE).Extended Unix Code_item_1_7
  • A character in planes 1 through 16 of CNS 11643 (code set 2) is encoded as four bytes:Extended Unix Code_item_1_8
    • The first byte is always 0x8E (Single Shift 2).Extended Unix Code_item_1_9
    • The second byte (0xA1–0xB0) indicates the plane, the number of which is obtained by subtracting 0xA0 from that byte.Extended Unix Code_item_1_10
    • The third and fourth bytes are in GR (0xA1–0xFE).Extended Unix Code_item_1_11

Note that plane 1 of CNS 11643 can be encoded in two ways: as code set 1, and as part of code set 2. Extended Unix Code_sentence_65
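Python's standard library does not ship an EUC-TW codec, so the following is a hand-written sketch of the byte layout described above. It maps each character to its CNS 11643 plane, row and cell and does not convert to Unicode. The function name `parse_euc_tw` and the tuple layout are hypothetical.

```python
SS2 = 0x8E  # Single Shift 2


def _is_gr(b: int) -> bool:
    return 0xA1 <= b <= 0xFE


def parse_euc_tw(data: bytes):
    """Return ("ascii", char) or ("cns", plane, row, cell) for each character."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                                   # code set 0
            out.append(("ascii", chr(b)))
            i += 1
        elif b == SS2:                                 # code set 2, four bytes
            chunk = data[i:i + 4]
            if (len(chunk) != 4 or not 0xA1 <= chunk[1] <= 0xB0
                    or not (_is_gr(chunk[2]) and _is_gr(chunk[3]))):
                raise ValueError(f"bad four-byte sequence at offset {i}")
            out.append(("cns", chunk[1] - 0xA0, chunk[2] - 0xA0, chunk[3] - 0xA0))
            i += 4
        elif _is_gr(b):                                # code set 1, plane 1
            chunk = data[i:i + 2]
            if len(chunk) != 2 or not _is_gr(chunk[1]):
                raise ValueError(f"bad two-byte sequence at offset {i}")
            out.append(("cns", 1, b - 0xA0, chunk[1] - 0xA0))
            i += 2
        else:
            raise ValueError(f"unexpected byte 0x{b:02X} at offset {i}")
    return out


# The same plane 1 position written in its two-byte and four-byte forms.
print(parse_euc_tw(b"\xa4\xa1"))          # [('cns', 1, 4, 1)]
print(parse_euc_tw(b"\x8e\xa1\xa4\xa1"))  # [('cns', 1, 4, 1)]
```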

Packed versus fixed-length form Extended Unix Code_section_11

The encodings described above (using bytes in 0x21–0x7E for code set 0, bytes in 0xA1–0xFE for code set 1, 0x8E followed by bytes in 0xA1–0xFE for code set 2 and 0x8F followed by bytes in 0xA1–0xFE for code set 3) are in a variable-width form referred to as the EUC packed format. Extended Unix Code_sentence_66

This is the form usually labelled as EUC. Extended Unix Code_sentence_67

Internal processing may make use of a fixed-length alternative form called the EUC complete two-byte format. Extended Unix Code_sentence_68

This represents: Extended Unix Code_sentence_69

Extended Unix Code_unordered_list_2

  • Code set 0 as two bytes in the range 0x21–0x7E (except that the first may be 0x00).Extended Unix Code_item_2_12
  • Code set 1 as two bytes in the range 0xA0–0xFF (except that the first may be 0x80).Extended Unix Code_item_2_13
  • Code set 2 as a byte in the range 0x20–0x7E (or 0x00) followed by a byte in the range 0xA0–0xFF.Extended Unix Code_item_2_14
  • Code set 3 as a byte in the range 0xA0–0xFF (or 0x80) followed by a byte in the range 0x21–0x7E.Extended Unix Code_item_2_15

Initial bytes of 0x00 and 0x80 are used in cases where the code set uses only one byte. Extended Unix Code_sentence_70
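As a rough sketch of how the packed and complete two-byte forms relate, the following Python function converts packed EUC-JP to the two-byte layout listed above. It is an interpretation of that list rather than a reference implementation. The function name `euc_jp_packed_to_fixed` is hypothetical, and well-formed input is assumed (no validation is performed).

```python
def euc_jp_packed_to_fixed(data: bytes) -> bytes:
    """Convert packed EUC-JP to the complete two-byte form (sketch only)."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Code set 0 uses one byte, so it is padded with an initial 0x00.
            out += bytes([0x00, b])
            i += 1
        elif b == 0x8E:
            # Code set 2: drop the single shift, pad with 0x00, keep the GR byte.
            out += bytes([0x00, data[i + 1]])
            i += 2
        elif b == 0x8F:
            # Code set 3: drop the single shift, keep the first byte in GR,
            # clear the high bit of the second byte.
            out += bytes([data[i + 1], data[i + 2] & 0x7F])
            i += 3
        else:
            # Code set 1 is already two GR bytes.
            out += data[i:i + 2]
            i += 2
    return bytes(out)


fixed = euc_jp_packed_to_fixed(b"A\xa4\xa2\x8e\xb1\x8f\xb0\xa1")
print(fixed.hex(" "))  # 00 41 a4 a2 00 b1 b0 21
```

In the result every character occupies exactly two bytes, and the high bits of the two bytes together identify the code set, so no single-shift bytes are needed.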

There is also a four-byte fixed-length format. Extended Unix Code_sentence_71

These fixed-length forms are suited to internal processing and are not usually encountered in interchange. Extended Unix Code_sentence_72

EUC-JP is registered with the IANA in both formats, the packed format as "EUC-JP" or "csEUCPkdFmtJapanese" and the fixed width format as "csEUCFixWidJapanese". Extended Unix Code_sentence_73

Only the packed format is included in the WHATWG Encoding Standard used by HTML5. Extended Unix Code_sentence_74


Credits to the contents of this page go to the authors of the corresponding Wikipedia page: en.wikipedia.org/wiki/Extended Unix Code.