=head1 NAME

Encode::Supported -- Encodings supported by Encode

=head1 DESCRIPTION

=head2 Encoding Names

Encoding names are case insensitive. White space in names
is ignored. In addition, an encoding may have aliases.
Each encoding has one "canonical" name. The "canonical"
name is chosen from the names of the encoding by picking
the first in the following sequence (with a few exceptions).

=over 4

=item *

The name used by the Perl community. That includes 'utf8' and 'ascii'.
Unlike aliases, canonical names directly reach the method, so
frequently used names like 'utf8' don't need an alias lookup.

=item *

The MIME name as defined in IETF RFCs. This includes all "iso-"s.

=item *

The name in the IANA registry.

=item *

The name used by the organization that defined it.

=back

Where the I<de jure> canonical name differs from the one used by the
Encode module, it is always provided as an alias if the encoding is
implemented at all. So you can safely tell whether a given encoding
is implemented just by passing its canonical name.

Because of all the alias issues, and because in the general case
encodings have state, "Encode" uses an encoding object internally
once an operation is in progress.

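The name lookup described above can be tried out directly. The following
is a minimal sketch, assuming only the C<find_encoding> and
C<resolve_alias> functions of Encode; the name C<no-such-encoding> is a
made-up example of an unimplemented encoding.

```perl
use strict;
use warnings;
use Encode qw(find_encoding resolve_alias);

# find_encoding() accepts a canonical name or an alias, in any case,
# and returns an encoding object (or undef if nothing matches).
my $latin1    = find_encoding('Latin1');
my $canonical = $latin1->name;    # canonical name of that object

# resolve_alias() returns the canonical name, or a false value when
# the encoding is not implemented.
my $known   = resolve_alias('latin1');
my $unknown = resolve_alias('no-such-encoding');    # hypothetical name

print "Latin1 resolves to $canonical\n";
print "no-such-encoding is ",
    ( $unknown ? "implemented" : "not implemented" ), "\n";
```

Holding on to the object returned by C<find_encoding> avoids repeating
the alias lookup when the same encoding is used many times.
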
=head1 Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'.
In other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules,
but in most cases you don't have to C<use Encode::XX> to make them
available. Encode.pm will automatically load those modules on demand.

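As a rough illustration of on-demand loading, the sketch below lists the
encodings Encode knows about and then decodes EUC-JP octets without an
explicit C<use Encode::JP>. It assumes the C<encodings> class method
with its C<:all> argument; the sample octets are just an example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Without arguments, only the encodings loaded so far are listed;
# ":all" also lists those that can be loaded on demand.
my @loaded    = Encode->encodings();
my @available = Encode->encodings(':all');
my $has_eucjp = grep { $_ eq 'euc-jp' } @available;

# No "use Encode::JP" is needed: the module is loaded on first use.
my $char = decode( 'euc-jp', "\xA4\xA2" );

printf "%d loaded, %d available\n", scalar @loaded, scalar @available;
printf "decoded U+%04X\n", ord $char;
```
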
=head2 Built-in Encodings

The following encodings are always available.

  Canonical   Aliases                      Comments & References
  ----------------------------------------------------------------
  ascii       US-ascii ISO-646-US                         [ECMA]
  ascii-ctrl                                    Special Encoding
  iso-8859-1  latin1                                       [ISO]
  null                                          Special Encoding
  utf8        UTF-8                                    [RFC2279]
  ----------------------------------------------------------------

I<null> and I<ascii-ctrl> are special. "null" fails for all characters,
so when you set the fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL
CHARACTERS will fall back to character references. Ditto for
"ascii-ctrl" except for control characters. For fallback modes, see
L<Encode>.

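The fallback behaviour can be sketched as follows. This is an
illustration only, assuming the C<FB_HTMLCREF> and C<FB_XMLCREF>
constants documented in L<Encode>; the variable names are arbitrary.
Each source string is a separate variable because C<encode> may modify
its argument when a CHECK value is given.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "caf" followed by U+00E9, which ascii cannot represent.
my $text = "caf\x{e9}";
my ( $for_html, $for_xml, $for_null ) = ( $text, $text, 'ab' );

# The unmappable character is replaced by a character reference.
my $html = encode( 'ascii', $for_html, Encode::FB_HTMLCREF() );
my $xml  = encode( 'ascii', $for_xml,  Encode::FB_XMLCREF() );

# "null" maps nothing at all, so every character falls back.
my $all = encode( 'null', $for_null, Encode::FB_HTMLCREF() );

print "$html\n$xml\n$all\n";
```
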
=head2 Encode::Unicode -- other Unicode encodings

Unicode coding schemes other than native utf8 are supported by
Encode::Unicode, which will be autoloaded on demand.

  ----------------------------------------------------------------
  UCS-2BE       UCS-2, iso-10646-1                      [IANA, UC]
  UCS-2LE                                                     [UC]
  UTF-16                                                      [UC]
  UTF-16BE                                                    [UC]
  UTF-16LE                                                    [UC]
  UTF-32                                                      [UC]
  UTF-32BE      UCS-4                                         [UC]
  UTF-32LE                                                    [UC]
  UTF-7                                                  [RFC2152]
  ----------------------------------------------------------------

To find out how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
see L<Encode::Unicode>.

UTF-7 is a special encoding which "re-encodes" UTF-16BE into a 7-bit
encoding. It is implemented separately by Encode::Unicode::UTF7.

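To make the byte-order differences concrete, here is a small sketch that
encodes a single character under several of the names above and dumps
the resulting octets. It only assumes C<encode> from Encode; the hash
name is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "A";    # U+0041

# Encode the same character under several Unicode encoding names.
my %octets = map { ( $_ => encode( $_, $char ) ) }
    qw(UTF-16BE UTF-16LE UTF-16 UTF-32BE);

for my $name ( sort keys %octets ) {
    my $hex = join ' ', map { sprintf '%02X', ord $_ } split //, $octets{$name};
    printf "%-8s %s\n", $name, $hex;
}
```

The plain C<UTF-16> name prepends a byte order mark when encoding, while
the BE and LE variants do not.
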
=head2 Encode::Byte -- Extended ASCII

Encode::Byte implements most single-byte encodings except for
Symbols and EBCDIC. The following encodings are based on single-byte
encodings implemented as extended ASCII. Most of them map
\x80-\xff (upper half) to non-ASCII characters.

=over 4

=item ISO-8859 and corresponding vendor mappings

Since there are so many, they are presented in table format with
languages and corresponding encoding names by vendors. Note that
the table is sorted in order of ISO-8859 and that the corresponding
vendor mappings are slightly different from those of ISO. See
L<http://czyborra.com/charsets/iso8859.html> for details.

  Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
  ----------------------------------------------------------------
  N. America    (ASCII)         cp437        AdobeStandardEncoding
                                cp863 (DOSCanadaF)
  W. Europe     iso-8859-1      cp850   cp1252  MacRoman  nextstep
                                                         hp-roman8
                                cp860 (DOSPortuguese)
  Cntrl. Europe iso-8859-2      cp852   cp1250  MacCentralEurRoman
                                                MacCroatian
                                                MacRomanian
                                                MacRumanian
  Latin3[1]     iso-8859-3
  Latin4[2]     iso-8859-4
  Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
    (See also next section)     cp866           MacUkrainian
  Arabic        iso-8859-6      cp864   cp1256  MacArabic
                                cp1006          MacFarsi
  Greek         iso-8859-7      cp737   cp1253  MacGreek
                                cp869 (DOSGreek2)
  Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
  Turkish       iso-8859-9      cp857   cp1254  MacTurkish
  Nordics       iso-8859-10     cp865
                                cp861           MacIcelandic
                                                MacSami
  Thai          iso-8859-11[3]  cp874           MacThai
  (iso-8859-12 is nonexistent. Reserved for Indics?)
  Baltics       iso-8859-13     cp775   cp1257
  Celtics       iso-8859-14
  Latin9 [4]    iso-8859-15
  Latin10       iso-8859-16
  Vietnamese    viscii                  cp1258  MacVietnamese
  ----------------------------------------------------------------

  [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
  [2] Baltics. Now on 8859-10, except for Latvian.
  [3] TIS 620 + Non-Breaking Space (0xA0 / U+00A0)
  [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
      letters that are missing from 8859-1 were added.

All cp* are also available as ibm-*, ms-*, and windows-* . See also
L<http://czyborra.com/charsets/codepages.html>.

Macintosh encodings don't seem to be registered in such entities as
IANA. "Canonical" names in Encode are based upon Apple's Tech Note
1150. See L<http://developer.apple.com/technotes/tn/tn1150.html>
for details.

=item KOI8 - De Facto Standard for the Cyrillic world

Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
popular on the Net. L<Encode> comes with the following KOI charsets.
For gory details, see L<http://czyborra.com/charsets/cyrillic.html>.

  ----------------------------------------------------------------
  koi8-f
  koi8-r cp878                                           [RFC1489]
  koi8-u                                                 [RFC2319]
  ----------------------------------------------------------------

=item gsm0338 - Hentai Latin 1

GSM0338 is for GSM handsets. Though it shares alphanumerals with
ASCII, control character ranges and other parts are mapped very
differently, mainly to store Greek characters. There are also escape
sequences (starting with 0x1B) to cover e.g. the Euro sign. Some
special cases like a trailing 0x00 byte or a lone 0x1B byte are not
well-defined and decode() will return an empty string for them.
One possible workaround is

   $gsm =~ s/\x00\z/\x00\x00/;
   $uni = decode("gsm0338", $gsm);
   $uni .= "\xA0" if $gsm =~ /\x1B\z/;

Note that the Encode implementation of GSM0338 does not implement the
reuse of Latin capital letters as Greek capital letters (for example,
0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396 (GREEK CAPITAL
LETTER ZETA)).

GSM0338 is also covered in Encode::Byte even though it is not
an "extended ASCII" encoding.

=back

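The practical consequence of the mappings in this section is that the
same octet decodes to different characters depending on the encoding
chosen. A minimal sketch, using only C<decode> and encoding names listed
above:

```perl
use strict;
use warnings;
use Encode qw(decode);

# One upper-half byte, two ISO-8859 interpretations.
my $byte   = "\xA4";
my $latin1 = decode( 'iso-8859-1',  $byte );    # U+00A4 CURRENCY SIGN
my $latin9 = decode( 'iso-8859-15', $byte );    # U+20AC EURO SIGN

# Vendor and KOI8 mappings place characters elsewhere again.
my $cp1252 = decode( 'cp1252', "\x80" );        # U+20AC EURO SIGN
my $koi8r  = decode( 'koi8-r', "\xC1" );        # U+0430 CYRILLIC SMALL LETTER A

printf "U+%04X\n", ord($_) for $latin1, $latin9, $cp1252, $koi8r;
```
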
=head2 CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above. Also read "Encoding vs Charset"
below. Also note that these are implemented in distinct modules by
country, due to size concerns (simplified Chinese is mapped
to 'CN', continental China, while traditional Chinese is mapped to
'TW', Taiwan). Please refer to their respective documentation pages.

=over 4

=item Encode::CN -- Continental China

    Standard   DOS/Win Macintosh                Comment/Reference
    ----------------------------------------------------------------
    euc-cn [1]            MacChineseSimp
    (gbk)      cp936 [2]
    gb12345-raw                      { GB12345 without CES }
    gb2312-raw                       { GB2312  without CES }
    hz
    iso-ir-165
    ----------------------------------------------------------------

    [1] GB2312 is aliased to this. See L<Microsoft-related naming mess>
    [2] gbk is aliased to this. See L<Microsoft-related naming mess>

=item Encode::JP -- Japan

    Standard   DOS/Win Macintosh                Comment/Reference
    ----------------------------------------------------------------
    euc-jp
    shiftjis   cp932      macJapanese
    7bit-jis
    iso-2022-jp                                          [RFC1468]
    iso-2022-jp-1                                        [RFC2237]
    jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
    jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
    jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
    ----------------------------------------------------------------

=item Encode::KR -- Korea

    Standard   DOS/Win Macintosh                Comment/Reference
    ----------------------------------------------------------------
    euc-kr                MacKorean                      [RFC1557]
               cp949 [1]
    iso-2022-kr                                          [RFC1557]
    johab                                  [KS X 1001:1998, Annex 3]
    ksc5601-raw                              { KSC5601 without CES }
    ----------------------------------------------------------------

    [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
        See below.

=item Encode::TW -- Taiwan

    Standard   DOS/Win Macintosh                Comment/Reference
    ----------------------------------------------------------------
    big5-eten  cp950      MacChineseTrad {big5 aliased to big5-eten}
    big5-hkscs
    ----------------------------------------------------------------

=item Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, the additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.

    Standard   DOS/Win Macintosh                Comment/Reference
    ----------------------------------------------------------------
    big5ext                   CMEX's Big5e Extension
    big5plus                  CMEX's Big5+ Extension
    cccii         Chinese Character Code for Information Interchange
    euc-tw                             EUC (Extended Unix Character)
    gb18030                          GBK with Traditional Characters
    ----------------------------------------------------------------

=item Encode::JIS2K -- JIS X 0213 encodings via CPAN

Due to size concerns, the additional Japanese encodings below are
distributed separately on CPAN, under the name Encode::JIS2K.

    Standard   DOS/Win Macintosh                Comment/Reference
    ----------------------------------------------------------------
    euc-jisx0213
    shiftjisx0213
    iso-2022-jp-3
    jis0213-1-raw
    jis0213-2-raw
    ----------------------------------------------------------------

=back

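As a quick illustration of the multibyte encodings bundled with Encode,
the sketch below encodes one Japanese and one Chinese character under
several of the canonical names from the tables above. The variable names
are arbitrary, and the encodings from Encode::HanExtra and Encode::JIS2K
are not used because they require separate installation.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $hiragana_a = "\x{3042}";    # HIRAGANA LETTER A
my $zhong      = "\x{4E2D}";    # CJK ideograph meaning "middle"

# The same character has a different two-byte form in each encoding.
my $eucjp = encode( 'euc-jp',    $hiragana_a );
my $sjis  = encode( 'shiftjis',  $hiragana_a );
my $euccn = encode( 'euc-cn',    $zhong );
my $big5  = encode( 'big5-eten', $zhong );

for my $pair ( [ 'euc-jp', $eucjp ], [ 'shiftjis', $sjis ],
               [ 'euc-cn', $euccn ], [ 'big5-eten', $big5 ] ) {
    printf "%-10s %s\n", $pair->[0], uc unpack( 'H*', $pair->[1] );
}

# Decoding with the matching name restores the original character.
my $round_trip = decode( 'shiftjis', $sjis );
```
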
=head2 Miscellaneous encodings

=over 4

=item Encode::EBCDIC

See L<perlebcdic> for details.

  ----------------------------------------------------------------
  cp37
  cp500
  cp875
  cp1026
  cp1047
  posix-bc
  ----------------------------------------------------------------

=item Encode::Symbol

For symbols and dingbats.

  ----------------------------------------------------------------
  symbol
  dingbats
  MacDingbats
  AdobeZdingbat
  AdobeSymbol
  ----------------------------------------------------------------

=item Encode::MIME::Header

Strictly speaking, the MIME header encoding documented in RFC 2047 is
more of an encapsulation than an encoding. However, support for it is
imperative in the modern world, so it is supported.

  ----------------------------------------------------------------
  MIME-Header                                            [RFC2047]
  MIME-B                                                 [RFC2047]
  MIME-Q                                                 [RFC2047]
  ----------------------------------------------------------------

=item Encode::Guess

This one is not the name of an encoding but a utility that lets you
pick the most appropriate encoding for a piece of data out of given
I<suspects>. See L<Encode::Guess> for details.

=back

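Two of the encodings above can be exercised with a short sketch. It
assumes nothing beyond C<encode> and C<decode>; the header value is a
made-up example containing a single HIRAGANA LETTER A.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Decode an RFC 2047 encoded-word into a character string.
my $subject = decode( 'MIME-Header', '=?UTF-8?B?44GC?=' );

# Encode a character string for use in a mail header, then read it back.
my $header     = encode( 'MIME-Header', "\x{3042}" );
my $round_trip = decode( 'MIME-Header', $header );

# EBCDIC code pages place even ASCII letters at different code points.
my $ebcdic_a = encode( 'cp37', 'A' );

printf "subject: U+%04X\n", ord $subject;
print  "header:  $header\n";
printf "cp37 A:  0x%02X\n", ord $ebcdic_a;
```
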
=head1 Unsupported encodings

The following encodings are not supported as yet; some because they
are rarely used, some because of technical difficulties. They may
be supported by external modules via CPAN in the future, however.

=over 4

=item ISO-2022-JP-2 [RFC1554]

Not very popular yet. Needs the Unicode Database or equivalent to
implement encode() (because it includes JIS X 0208/0212, KSC5601, and
GB2312 simultaneously, whose code points in Unicode overlap, so you
need to look up the database to determine to which character set a
given Unicode character should belong).

=item ISO-2022-CN [RFC1922]

Not very popular. Needs CNS 11643-1 and -2 which are not available in
this module. CNS 11643 is supported (via euc-tw) in Encode::HanExtra.
Autrijus Tang may add support for this encoding in his module in the
future.

=item Various HP-UX encodings

The following are unsupported due to the lack of mapping data.

  '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
  '15' - japanese15, korean15, and roi15

=item Cyrillic encoding ISO-IR-111

Anton Tagunov doubts its usefulness.

=item ISO-8859-8-1 [Hebrew]

None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255
and MacHebrew are supported only because there were mappings
available at L<http://www.unicode.org/>). Contributions welcome.

=item ISIRI 3342, Iran System, ISIRI 2900 [Farsi]

Ditto.

=item Thai encoding TCVN

Ditto.

=item Vietnamese encodings VPS

Though Jungshik Shin has reported that Mozilla supports this encoding,
it was too late before 5.8.0 for us to add it. In the future, it
may be available via a separate module. See
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
and
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
if you are interested in helping us.

=item Various Mac encodings

The following are unsupported due to the lack of mapping data.

  MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
  MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
  MacLaotian,   MacMalayalam, MacMongolian, MacOriya
  MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
  MacVietnamese

The rest which are already available are based upon the vendor mappings
at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .

=item (Mac) Indic encodings

The maps for the following are available at L<http://www.unicode.org/>
but remain unsupported because those encodings need an algorithmic
approach, which F<enc2xs> currently does not support:

  MacDevanagari
  MacGurmukhi
  MacGujarati

For details, please see C<Unicode mapping issues and notes:> at
L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .

I believe this issue is prevalent not only for Mac Indics but also in
other Indic encodings, but the above were the only Indic encoding
maps that I could find at L<http://www.unicode.org/> .

=back

=head1 Encoding vs. Charset -- terminology

We are used to using the terms (character) I<encoding> and I<character
set> interchangeably. But just as confusing the terms byte and
character is dangerous and the terms should be differentiated when
needed, we need to differentiate I<encoding> and I<character set>.

To understand that, here is a description of how we make computers
grok our characters.

=over 4

=item *

First we start with which characters to include. We call this
collection of characters a I<character repertoire>.

=item *

Then we have to give each character a unique ID so your computer can
tell the difference between 'a' and 'A'. This itemized character
repertoire is now a I<character set>.

=item *

If your computer can grok the character set without further
processing, you can go ahead and use it. This is called a I<coded
character set> (CCS) or I<raw character encoding>. ASCII is used this
way for most cases.

=item *

But in many cases, especially multi-byte CJK encodings, you have to
tweak a little more. Your network connection may not accept any data
with the Most Significant Bit set, and your computer may not be able to
tell if a given byte is a whole character or just half of it. So you
have to I<encode> the character set to use it.

A I<character encoding scheme> (CES) determines how to encode a given
character set, or a set of multiple character sets. 7bit ISO-2022 is
an example of a CES. You switch between character sets via I<escape
sequences>.

=back

Technically, or mathematically, speaking, a character set encoded by
a CES that maps character by character may itself form a CCS. EUC is
such an example. The CES of EUC is as follows:

=over 4

=item *

Map ASCII unchanged.

=item *

Map a character set that consists of 94 or 96 to the power of N
members (N bytes per character) by adding 0x80 to each byte.

=item *

You can also use 0x8e and 0x8f to indicate that the following sequence of
characters belongs to yet another character set. The value 0x80 is
added to each following byte.

=back

By carefully looking at the encoded byte sequence, you can see that
each byte sequence corresponds to a unique number. In that sense, EUC
is a CCS generated by the CES above from up to four CCS (complicated?).
UTF-8 falls into this category. See L<perlunicode/"UTF-8"> to find out
how UTF-8 maps Unicode to a byte sequence.

You may also have found out by now why 7bit ISO-2022 cannot comprise
a CCS. If you look at a byte sequence \x21\x21, you can't tell if
it is two !'s or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1
so you have no trouble differentiating between "!!" and S<"  ">.

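The IDEOGRAPHIC SPACE example can be reproduced with the Japanese
encodings that ship with Encode. This is a sketch only, assuming
C<euc-jp> as the EUC variant and C<iso-2022-jp> as the 7bit ISO-2022
variant; the variable names are arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $space = "\x{3000}";    # IDEOGRAPHIC SPACE

# EUC adds 0x80 to each byte, so the result cannot be mistaken for ASCII.
my $euc = encode( 'euc-jp', $space );

# 7bit ISO-2022 keeps the bytes in the ASCII range and relies on escape
# sequences to say which character set is in effect.
my $iso2022 = encode( 'iso-2022-jp', $space );
my $bangs   = encode( 'iso-2022-jp', '!!' );

# Show escape characters as "ESC" so the output is printable.
( my $visible = $iso2022 ) =~ s/\e/ESC/g;

printf "euc-jp:      %s\n", uc unpack( 'H*', $euc );
print  "iso-2022-jp: $visible\n";
print  "two bangs:   $bangs\n";
```

Stripped of its escape sequences, the ISO-2022 form of the space is the
same two bytes as the plain C<"!!"> string, which is exactly the
ambiguity described above.
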
=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their
applicability for information exchange over the Internet and to
choose the most suitable aliases to name them in the context of
such communication.

=over 4

=item *

To (en|de)code encodings marked by C<(**)>, you need
C<Encode::HanExtra>, available from CPAN.

=back

Encoding names

  US-ASCII    UTF-8    ISO-8859-*  KOI8-R
  Shift_JIS   EUC-JP   ISO-2022-JP ISO-2022-JP-1
  EUC-KR      Big5     GB2312

are registered with IANA as preferred MIME names and may
be used over the Internet.

C<Shift_JIS> has been made official by JIS X 0208:1997.
L<Microsoft-related naming mess> gives details.

C<GB2312> is the IANA name for C<EUC-CN>.
See L<Microsoft-related naming mess> for details.

C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
with Encode. See L<Encode::CN> for details.

  EUC-CN
  KOI8-U        [RFC2319]

have not been registered with IANA (as of March 2002) but
seem to be supported by major web browsers.
The IANA name for C<EUC-CN> is C<GB2312>.

  KS_C_5601-1987

is heavily misused.
See L<Microsoft-related naming mess> for details.

C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
with Encode. See L<Encode::KR> for details.

  UTF-16 UTF-16BE UTF-16LE

are IANA-registered C<charset>s. See [RFC 2781] for details.
Jungshik Shin reports that UTF-16 with a BOM is well accepted
by MS IE 5/6 and NS 4/6. Beware however that

=over 4

=item *

C<UTF-16> support in any software you're going to be
using/interoperating with has probably been less tested
than C<UTF-8> support

=item *

C<UTF-8> coded data seamlessly passes traditional
command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded
data is likely to cause confusion (with its zero bytes,
for example)

=item *

it is beyond the power of words to describe the way HTML browsers
encode non-C<ASCII> form data. To get a general impression, visit
L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
While encoding of form data has stabilized for C<UTF-8> encoded pages
(at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
expect fun (and cross-browser discrepancies) with C<UTF-16> encoded
pages!

=back

The rule of thumb is to use C<UTF-8> unless you know what
you're doing and unless you really benefit from using C<UTF-16>.

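The remark about zero bytes is easy to check. A minimal sketch, assuming
only C<encode> and a pure ASCII sample string:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "abc";

# ASCII text is unchanged in UTF-8 ...
my $utf8 = encode( 'UTF-8', $text );

# ... but every character gains a zero byte in UTF-16.
my $utf16 = encode( 'UTF-16LE', $text );
my $nuls  = ( $utf16 =~ tr/\0// );

printf "UTF-8:    %d bytes\n", length $utf8;
printf "UTF-16LE: %d bytes, %d of them zero\n", length $utf16, $nuls;
```

Tools that treat a zero byte as a string terminator will therefore
truncate or mangle UTF-16 data that they would pass through untouched
as UTF-8.
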
  ISO-IR-165    [RFC1345]
  VISCII
  GB 12345
  GB 18030 (**)  (see links below)
  EUC-TW   (**)

are totally valid encodings but not registered at IANA.
The names under which they are listed here are probably the
most widely-known names for these encodings and are recommended
names.

  BIG5PLUS (**)

is a proprietary name.

=head2 Microsoft-related naming mess

Microsoft products misuse the following names:

=over 4

=item KS_C_5601-1987

Microsoft extension to C<EUC-KR>.

Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla).

See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
for details.

Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
misuse. I<Raw> C<KS_C_5601-1987> encoding is available as
C<ksc5601-raw>.

See L<Encode::KR> for details.

=item GB2312

Microsoft extension to C<EUC-CN>.

Proper names: C<CP936>, C<GBK>.

C<GB2312> has been registered in the C<EUC-CN> meaning at
IANA. This has partially repaired the situation: Microsoft's
C<GB2312> has become a superset of the official C<GB2312>.

Encode aliases C<GB2312> to C<euc-cn> in full agreement with
IANA registration. C<cp936> is supported separately.
I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.

See L<Encode::CN> for details.

=item Big5

Microsoft extension to C<Big5>.

Proper name: C<CP950>.

Encode separately supports C<Big5> and C<cp950>.

=item Shift_JIS

Microsoft's understanding of C<Shift_JIS>.

JIS has not endorsed the full Microsoft standard, however.
The official C<Shift_JIS> includes only the JIS X 0201 and JIS X 0208
character sets, while Microsoft has always used C<Shift_JIS>
to encode a wider character repertoire. See the C<IANA> registration
for C<Windows-31J>.

As the historical predecessor, Microsoft's variant
probably has a stronger claim to the name, though it may be objected
that Microsoft shouldn't have used JIS as part of the name
in the first place.

Unambiguous name: C<CP932>. C<IANA> name (also used by Mozilla, and
provided as an alias by Encode): C<Windows-31J>.

Encode separately supports C<Shift_JIS> and C<cp932>.

=back

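How Encode resolves these contested names can be inspected with
C<find_encoding>. The sketch below assumes that the Encode::CN,
Encode::JP, Encode::KR and Encode::TW modules bundled with Encode are
installed; the hash name is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Map each commonly seen name to the canonical name Encode uses for it.
my %canonical = map { ( $_ => find_encoding($_)->name ) }
    qw(GB2312 GBK KS_C_5601-1987 UHC Big5 Shift_JIS Windows-31J);

printf "%-15s => %s\n", $_, $canonical{$_} for sort keys %canonical;
```

To get Microsoft's variant rather than the official one, ask for the
code page name (C<cp936>, C<cp949>, C<cp950> or C<cp932>) explicitly.
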
=head1 Glossary

=over 4

=item character repertoire

A collection of unique characters. A I<character> set in the strictest
sense. At this stage, characters are not numbered.

=item coded character set (CCS)

A character set that is mapped in a way computers can use directly.
Many character encodings, including EUC, fall in this category.

=item character encoding scheme (CES)

An algorithm to map a character set to a byte sequence. You don't
have to be able to tell which character set a given byte sequence
belongs to. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an
example of being both a CCS and a CES.

=item charset (in MIME context)

Has long been used to mean C<encoding>, that is, a CES.

While the word combination C<character set> has lost this meaning
in MIME context since [RFC 2130], the C<charset> abbreviation has
retained it. This is how [RFC 2277] and [RFC 2278] bless C<charset>:

  This document uses the term "charset" to mean a set of rules for
  mapping from a sequence of octets to a sequence of characters, such
  as the combination of a coded character set and a character encoding
  scheme; this is also what is used as an identifier in MIME "charset="
  parameters, and registered in the IANA charset registry ...  (Note
  that this is NOT a term used by other standards bodies, such as ISO).
  [RFC 2277]

711 | =item EUC
|
---|
712 |
|
---|
713 | Extended Unix Character. See ISO-2022.
|
---|
714 |
|
---|
715 | =item ISO-2022
|
---|
716 |
|
---|
717 | A CES that was carefully designed to coexist with ASCII. There are a 7
|
---|
718 | bit version and an 8 bit version.
|
---|
719 |
|
---|
720 | The 7 bit version switches character set via escape sequence so it
|
---|
721 | cannot form a CCS. Since this is more difficult to handle in programs
|
---|
722 | than the 8 bit version, the 7 bit version is not very popular except for
|
---|
723 | iso-2022-jp, the I<de facto> standard CES for e-mails.
|
---|
724 |
|
---|
725 | The 8 bit version can form a CCS. EUC and ISO-8859 are two examples
|
---|
726 | thereof. Pre-5.6 perl could use them as string literals.
|
---|
727 |
|
---|
=item UCS

Short for I<Universal Character Set>, the character set defined by
ISO/IEC 10646. Used on its own, UCS effectively means I<Unicode>.

=item UCS-2

ISO/IEC 10646 encoding form: Universal Character Set coded in two
octets.

=item Unicode

A character set that aims to include all character repertoires of the
world. Many character sets in various national as well as industrial
standards have become, in a way, just subsets of Unicode.

=item UTF

Short for I<Unicode Transformation Format>. A UTF determines how a
Unicode character is mapped to a byte sequence.

=item UTF-16

A UTF that uses 16-bit code units. It can be either big endian or
little endian. The big endian version is called UTF-16BE (equal to
UCS-2 + surrogate support) and the little endian version is called
UTF-16LE.

=back

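The byte-order and escape-sequence behaviour described in the UTF-16
and ISO-2022 entries above can be checked with a short script. This is
a minimal sketch, not part of Encode itself: C<hex_octets> is a helper
invented here for illustration, the sample characters are arbitrary,
and it assumes L<Encode::JP> is installed so that C<iso-2022-jp> is
available.

    use strict;
    use warnings;
    use Encode qw(encode);

    # Hypothetical helper: render a byte string as hex octets.
    sub hex_octets {
        my ($bytes) = @_;
        return join ' ', map { sprintf '%02X', ord } split //, $bytes;
    }

    # U+3042 lies in the BMP; U+10000 needs a surrogate pair in UTF-16.
    my $bmp    = "\x{3042}";
    my $astral = "\x{10000}";

    print hex_octets(encode('UTF-16BE', $bmp)),    "\n";  # 30 42
    print hex_octets(encode('UTF-16LE', $bmp)),    "\n";  # 42 30
    print hex_octets(encode('UTF-16BE', $astral)), "\n";  # D8 00 DC 00

    # The 7-bit iso-2022-jp form switches character sets with escapes.
    print hex_octets(encode('iso-2022-jp', $bmp)), "\n";

The last line is expected to contain C<1B> (ESC) octets that mark the
switch into the Japanese character set and back to ASCII; inspect the
output on your own installation rather than relying on a fixed value.
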
=head1 See Also

L<Encode>,
L<Encode::Byte>,
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>,
L<Encode::MIME::Header>, L<Encode::Guess>

=head1 References

=over 4

=item ECMA

European Computer Manufacturers Association
L<http://www.ecma.ch>

=over 4

=item ECMA-035 (eq C<ISO-2022>)

L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>

The specification of ISO-2022 is available from the link above.

=back

=item IANA

Internet Assigned Numbers Authority
L<http://www.iana.org/>

=over 4

=item Assigned Charset Names by IANA

L<http://www.iana.org/assignments/character-sets>

Most of the C<canonical names> in Encode derive from this list, so you
can directly use the strings you extract from the MIME headers of mail
messages and web pages.

=back

=item ISO

International Organization for Standardization
L<http://www.iso.ch/>

=item RFC

Request For Comments -- need I say more?
L<http://www.rfc-editor.org/>, L<http://www.rfc.net/>,
L<http://www.faqs.org/rfcs/>

=item UC

Unicode Consortium
L<http://www.unicode.org/>

=over 4

=item Unicode Glossary

L<http://www.unicode.org/glossary/>

The glossary of this document is based upon this site.

=back

=back

=head2 Other Notable Sites

=over 4

=item czyborra.com

L<http://czyborra.com/>

Contains a lot of useful information, especially gory details of ISO
vs. vendor mappings.

=item CJK.inf

L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>

Somewhat obsolete (last update in 1996), but still useful. Also try:

L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>

It gives brief information on C<EUC-CN> and C<GBK>, and deals mostly
with C<GB 18030>.

=item Jungshik Shin's Hangul FAQ

L<http://jshin.net/faq>

See especially its subject 8:

L<http://jshin.net/faq/qa8.html>

A comprehensive overview of the Korean (C<KS *>) standards.

=item debian.org: "Introduction to i18n"

A brief description of most of the CJK encodings mentioned here is
contained in
L<http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>

=back

=head2 Offline sources

=over 4

=item C<CJKV Information Processing> by Ken Lunde

CJKV Information Processing
1999 O'Reilly & Associates, ISBN 1-56592-224-7

The modern successor of C<CJK.inf>.

Features comprehensive coverage of CJKV character sets and
encodings, along with many other issues faced by anyone trying
to better support CJKV languages/scripts in all the areas of
information processing.

To purchase this book, visit
L<http://www.oreilly.com/catalog/cjkvinfo/>
or your favourite bookstore.

=back

=cut