root/main/trunk/release-kits/shared/linux/XML-Parser/64-bit/perl-5.22/XML/Parser/Encodings/Japanese_Encodings.msg @ 31809

Revision 31809, 4.7 KB (checked in by ak19, 2 years ago)

Added the XML-Parser for perl 5.22 as well for the release-kit to include in binaries. Note that, just as for the perl-5.24 version committed before, the auto/XML/Parser/Expat.bs (bootstrap file?) is copied from the perl 5.18 that was committed much longer ago, since the .bs file didn't differ between 5.10 and 5.18.

Mapping files for Japanese encodings

1998 12/25

Fuji Xerox Information Systems
MURATA Makoto

1.  Overview

This version of XML::Parser and XML::Encoding does not come with map files for
the charset "Shift_JIS" and the charset "euc-jp".  Unfortunately, each of these
charsets has more than one mapping to Unicode, and none of these mappings is
considered authoritative.

Therefore, we have come to believe that it is dangerous to provide map files
for these charsets.  Rather, we introduce several private charsets and map
files for these private charsets.  If IANA, the Unicode Consortium, and JIS
eventually reach a consensus, we will be able to provide map files for
"Shift_JIS" and "euc-jp".

2.  Different mappings from existing charsets to Unicode

1) Different mappings in JIS X0221 and Unicode

The mapping between JIS X0208:1990 and Unicode 1.1 and the mapping
between JIS X0212:1990 and Unicode 1.1 are published by the Unicode
Consortium.  They are available at
ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0208.TXT and
ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0212.TXT,
respectively.  These mapping files carry the following note:

#   The kanji mappings are a normative part of ISO/IEC 10646.  The
#       non-kanji mappings are provisional, pending definition of
#       official mappings by Japanese standards bodies.

Unfortunately, the non-kanji mappings in the Japanese standard for
ISO/IEC 10646-1, namely JIS X 0221:1995, differ from the Unicode Consortium
mapping: 0x213D of JIS X 0208 is mapped to U+2014 (em dash) rather than U+2015
(horizontal bar).  Furthermore, JIS X 0221 clearly says that its mapping is
informative and non-normative.  As a result, some companies (e.g., Microsoft
and Apple) have introduced slightly different mappings.  Therefore, neither
the Unicode Consortium mapping nor the JIS X 0221 mapping is considered
authoritative.
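
To make the disputed code point concrete, here is a minimal Python sketch.  It
is not part of XML::Parser or XML::Encoding; the function names and the
CANDIDATES table are hypothetical and chosen only for illustration.  It
converts the JIS X 0208 code 0x213D to its EUC-JP and Shift JIS byte sequences
using the usual arithmetic transformations, and prints the two Unicode code
points that the competing mappings assign to it.

```python
import unicodedata


def jis_to_eucjp(jis: int) -> bytes:
    """Convert a two-byte JIS X 0208 code to EUC-JP by setting both high bits."""
    hi, lo = jis >> 8, jis & 0xFF
    return bytes([hi | 0x80, lo | 0x80])


def jis_to_sjis(jis: int) -> bytes:
    """Convert a two-byte JIS X 0208 code to its Shift JIS byte sequence."""
    hi, lo = jis >> 8, jis & 0xFF
    # Two JIS rows share one Shift JIS lead byte.
    lead = ((hi - 0x21) >> 1) + 0x81
    if lead > 0x9F:
        lead += 0x40          # skip to the second lead-byte range, E0-EF
    if hi & 1:                # odd row: trail bytes 0x40-0x9E, skipping 0x7F
        trail = lo + 0x1F
        if trail >= 0x7F:
            trail += 1
    else:                     # even row: trail bytes 0x9F-0xFC
        trail = lo + 0x7E
    return bytes([lead, trail])


# The two candidate Unicode targets for JIS X 0208 0x213D named in the text.
CANDIDATES = {
    "Unicode Consortium mapping": 0x2015,
    "JIS X 0221 mapping": 0x2014,
}

print("EUC-JP bytes:   ", jis_to_eucjp(0x213D).hex())
print("Shift JIS bytes:", jis_to_sjis(0x213D).hex())
for source, cp in CANDIDATES.items():
    print(f"{source}: U+{cp:04X} {unicodedata.name(chr(cp))}")
```

The same byte sequence therefore decodes to two different characters depending
on which table a decoder happens to ship with.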

2) Shift-JIS

This charset is especially problematic, since its definition has been unclear
since its inception.

The current registration of the charset "Shift_JIS" is as follows:

>Name: Shift_JIS  (preferred MIME name)
>MIBenum: 17
>Source: A Microsoft code that extends csHalfWidthKatakana to include
>       kanji by adding a second byte when the value of the first
>       byte is in the ranges 81-9F or E0-EF.
>Alias: MS_Kanji
>Alias: csShiftJIS

First, this registration does not refer to the mapping "Shift-JIS to Unicode"
published by the Unicode Consortium (available at
ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/SHIFTJIS.TXT).

Second, "kanji" in this registration can be interpreted in different ways.
Does "kanji" refer to JIS X0208:1978, JIS X0208:1983, or JIS
X0208:1990 (== JIS X0208:1997)?  These three standards are *incompatible* with
each other.  Moreover, one could even argue that "kanji" refers to JIS X0212
or to ideographic characters used in other countries.

Third, each company has extended Shift JIS.  For example, Microsoft introduced
OEM extensions (NEC extensions and IBM extensions).

Fourth, Shift JIS uses JIS X0201, which is almost, but not quite,
upward-compatible with US-ASCII.  5C and 7E of JIS X 0201 are the yen sign and
the overline, not backslash and tilde.  However, many programming languages
(e.g., Java) ignore this difference and assume that 5C and 7E of Shift JIS are
backslash and tilde.
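
The lead-byte rule in the registration quoted above can be sketched in Python
as follows.  This is only an illustration under stated assumptions: the
function names are hypothetical, and the ranges are exactly those in the
registration (81-9F and E0-EF), so vendor extensions that use further lead
bytes are not covered.

```python
def is_sjis_lead_byte(b: int) -> bool:
    """True if b starts a two-byte sequence per the registration (81-9F, E0-EF)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF


def split_sjis(data: bytes) -> list:
    """Split a Shift JIS byte string into one-byte and two-byte units."""
    units = []
    i = 0
    while i < len(data):
        if is_sjis_lead_byte(data[i]):
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte sequence at offset %d" % i)
            units.append(data[i:i + 2])
            i += 2
        else:
            units.append(data[i:i + 1])
            i += 1
    return units


# 0x5C appears here as the trail byte of 0x815C, not as a single-byte character.
print(split_sjis(b"a\x81\x5cb"))
```

Note that the rule only says where characters begin and end.  It says nothing
about which Unicode character each unit maps to, which is the ambiguity
described in this section.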


3.  Proposed charsets and mappings

As a tentative solution, we introduce two private charsets for EUC-JP and four
private charsets for Shift JIS.

1) EUC-JP

We have two charsets, namely "x-eucjp-unicode" and "x-eucjp-jisx0221".  They
differ in only one code point.  The mapping for the former is based
on the Unicode Consortium mapping, while the latter is based on the JIS X0221
mapping.

2) Shift JIS

We have four charsets, namely x-sjis-unicode, x-sjis-jisx0221,
x-sjis-jdk117, and x-sjis-cp932.

The mapping for the charset x-sjis-unicode is the one published by the Unicode
Consortium.  The mapping for x-sjis-jisx0221 is almost equivalent to
x-sjis-unicode, but 0x213D of JIS X 0208 is mapped to U+2014 (em dash) rather
than U+2015.  The charset x-sjis-jdk117 is again almost equivalent to
x-sjis-unicode, but 0x5C and 0x7E of JIS X0201 are mapped to backslash and
tilde.

The charset x-sjis-cp932 is used by Microsoft Windows, and its mapping is
published by the Unicode Consortium (available at
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.txt).  The
coded character set for this charset includes NEC extensions and
IBM extensions.  0x5C and 0x7E of JIS X0201 are mapped to backslash and tilde;
0x213D is mapped to U+2015; and 0x2140, 0x2141, 0x2142, and 0x215E of JIS X
0208 are mapped to compatibility characters.
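
The relationship between the four Shift JIS charsets can be sketched as a base
table plus per-charset overrides.  The Python below is a hypothetical
illustration, not the format of the actual map files: BASE, OVERRIDES, and
lookup are invented names, the tables contain only the code points discussed
in this document, and the Shift JIS byte value 0x815C is assumed to be the
encoding of JIS X 0208 0x213D.  The entry for x-sjis-cp932 is deliberately
partial and omits the vendor extensions and compatibility characters.

```python
# Subset of the Unicode Consortium Shift JIS mapping (x-sjis-unicode).
BASE = {
    0x5C: 0x00A5,    # JIS X 0201 0x5C -> YEN SIGN
    0x7E: 0x203E,    # JIS X 0201 0x7E -> OVERLINE
    0x815C: 0x2015,  # JIS X 0208 0x213D -> HORIZONTAL BAR
}

# Differences of each private charset from x-sjis-unicode, as described above.
OVERRIDES = {
    "x-sjis-unicode": {},
    "x-sjis-jisx0221": {0x815C: 0x2014},            # EM DASH
    "x-sjis-jdk117": {0x5C: 0x005C, 0x7E: 0x007E},  # backslash, tilde
    "x-sjis-cp932": {0x5C: 0x005C, 0x7E: 0x007E},   # partial: extensions omitted
}


def lookup(charset: str, code: int) -> int:
    """Return the Unicode code point for a Shift JIS code in the given charset."""
    table = {**BASE, **OVERRIDES[charset]}
    return table[code]


for name in OVERRIDES:
    mapped = ", ".join(f"{code:#06x}->U+{lookup(name, code):04X}" for code in BASE)
    print(f"{name}: {mapped}")
```

An XML document that declares one of these private charset names selects one
row of this table explicitly, instead of leaving the choice to the parser.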

Makoto

Fuji Xerox Information Systems

Tel: +81-44-812-7230   Fax: +81-44-812-7231
E-mail: murata@apsdc.ksp.fujixerox.co.jp