=head1 NAME

Encode::PerlIO -- a detailed document on Encode and PerlIO

=head1 Overview

It is very common to want to do encoding transformations when
reading or writing files, network connections, pipes etc.
If Perl is configured to use the new 'perlio' IO system then
C<Encode> provides a "layer" (see L<PerlIO>) which can transform
data as it is read or written.

Here is how the blind poet would modernise the encoding:

    use Encode;
    open(my $iliad,'<:encoding(iso-8859-7)','iliad.greek');
    open(my $utf8,'>:utf8','iliad.utf8');
    my @epic = <$iliad>;
    print $utf8 @epic;
    close($utf8);
    close($iliad);

In addition, the new IO system can also be configured to read/write
UTF-8 encoded characters (this is efficient):

    use charnames ':full';    # needed for \N{...} on older perls
    open(my $fh,'>:utf8','anything');
    print $fh "Any \x{0021} string \N{WHITE SMILING FACE}\n";

Either of the above forms of "layer" specifications can be made the default
for a lexical scope with the C<use open ...> pragma. See L<open>.

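As a minimal sketch of the pragma in use, the block below makes
C<:encoding(UTF-8)> the default layer for every C<open> inside one lexical
scope. The scratch file from C<File::Temp> and the sample string are
assumptions made only so the example is self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the demonstration (hypothetical; any path would do).
my (undef, $path) = tempfile(UNLINK => 1);
my $line;

{
    # Inside this block, open() calls that name no layer get :encoding(UTF-8).
    use open IO => ':encoding(UTF-8)';

    open(my $out, '>', $path) or die "$path: $!";
    print $out "caf\x{e9} \x{263A}\n";    # characters are encoded on the way out
    close($out) or die "$path: $!";

    open(my $in, '<', $path) or die "$path: $!";
    $line = <$in>;                        # octets are decoded on the way in
    close($in);
}

printf "%d characters read, %d bytes on disk\n", length($line), -s $path;
```

Outside the block the pragma no longer applies, so handles opened there fall
back to the default layers.
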
Once a handle is open, its layers can be altered using C<binmode>.

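The following sketch shows one way to do that: both handles are opened
without an encoding layer and C<binmode> adds one afterwards.
C<PerlIO::get_layers> is used only to show what is on the handle. The
temporary file is an assumption of the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# The handle returned by tempfile() starts with the default layers only.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':encoding(UTF-8)');       # add an encoding layer after the fact
print $out "\x{3000}\n";                 # IDEOGRAPHIC SPACE, three octets in UTF-8
close($out) or die "$path: $!";

open(my $in, '<', $path) or die "$path: $!";
binmode($in, ':encoding(UTF-8)');        # same idea on the reading side
my @layers = PerlIO::get_layers($in);    # inspect the layer stack
my $line   = <$in>;
close($in);

print "layers: @layers\n";
printf "read %d characters from %d bytes\n", length($line), -s $path;
```
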
Without any such configuration, or if Perl itself is built using the
system's own IO, then write operations assume that the file handle
accepts only I<bytes>. Writing a character larger than 255 to such a
handle triggers a C<Wide character> warning, and the character is output
as UTF-8. When reading, each octet from the handle becomes
a byte-in-a-character. Note that this default is the same behaviour
as bytes-only languages (including Perl before v5.6) would have,
and is sufficient to handle native 8-bit encodings e.g. iso-8859-1,
EBCDIC etc. and any legacy mechanisms for handling other encodings
and binary data.

In other cases, it is the program's responsibility to transform
characters into bytes using the C<Encode> API (see L<Encode>) before
doing writes, and to transform the bytes read from a handle into
characters before doing "character operations" (e.g. C<lc>, C</\W+/>, ...).

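A minimal sketch of that manual approach, assuming a bytes-only handle and
the C<encode> and C<decode> functions from L<Encode>. The temporary file and
the Greek sample text are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempfile);

my ($out, $path) = tempfile(UNLINK => 1);
binmode($out);                                # bytes only, no encoding layer

my $text = "\x{3b1}\x{3b2}\x{3b3}\n";         # Greek alpha, beta, gamma
print $out encode('iso-8859-7', $text);       # characters -> bytes before writing
close($out) or die "$path: $!";

open(my $in, '<:raw', $path) or die "$path: $!";
my $octets = <$in>;                           # still bytes at this point
close($in);

my $chars = decode('iso-8859-7', $octets);    # bytes -> characters
my $upper = uc($chars);                       # character operations are now safe

printf "%d bytes on disk, %d characters after decoding\n",
    length($octets), length($chars);
```
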
You can also use PerlIO to convert larger amounts of data you don't
want to bring into memory.  For example, to convert between ISO-8859-1
(Latin 1) and UTF-8 (or UTF-EBCDIC on EBCDIC machines):

    open(F, "<:encoding(iso-8859-1)", "data.txt") or die $!;
    open(G, ">:utf8",                 "data.utf") or die $!;
    while (<F>) { print G }

    # Could also do "print G <F>" but that would pull
    # the whole file into memory just to write it out again.

More examples:

    # $in and $out are placeholders for your own file names
    open(my $f, "<:encoding(cp1252)",     $in)  or die $!;
    open(my $g, ">:encoding(iso-8859-2)", $out) or die $!;
    open(my $h, ">:encoding(latin9)",     $out) or die $!;   # iso-8859-15

See also L<encoding> for how to change the default encoding of the
data in your script.

=head1 How does it work?

Here is a crude diagram of how filehandle, PerlIO, and Encode
interact.

    filehandle <-> PerlIO        PerlIO <-> scalar (read/printed)
                         \      /
                          Encode

When PerlIO receives data from either direction, it fills a buffer
(currently with 1024 bytes) and passes the buffer to Encode.
Encode tries to convert the valid part and passes it back to PerlIO,
leaving invalid parts (usually a partial character) in the buffer.
PerlIO then appends more data to the buffer, calls Encode again,
and so on until the data stream ends.

To do so, PerlIO always calls (de|en)code methods with CHECK set to 1.
This ensures that the method stops at the right place when it
encounters a partial character.  The following is what happens when
PerlIO and Encode try to encode (from utf8) more than 1024 bytes
and the buffer boundary happens to be in the middle of a character.

       A   B   C   ....   ~     \x{3000}    ....
      41  42  43   ....  7E   e3   80   80  ....
      <- buffer --------------->
      << encoded >>>>>>>>>>
                           <- next buffer ------

Encode converts from the beginning to \x7E, leaving \xe3 in the buffer
because it is invalid (a partial character).

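The same stop-at-the-partial-character behaviour can be reproduced by hand.
The sketch below decodes UTF-8 octets (the reverse direction of the diagram)
in two chunks and uses C<Encode::FB_QUIET> as the CHECK value. That is an
assumption chosen to keep the example standalone; it does not claim to be the
exact flag set PerlIO passes internally. The chunk contents are made up so
that the split falls inside the three-octet sequence for \x{3000}.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two chunks standing in for successive buffers; the boundary falls
# inside the UTF-8 sequence e3 80 80 (U+3000).
my @chunks = ("ABC~\xe3", "\x80\x80!");

my $pending = '';    # octets not yet converted
my $text    = '';    # characters decoded so far
my @leftover;        # what stayed in the buffer after each pass

for my $chunk (@chunks) {
    $pending .= $chunk;
    # With FB_QUIET, decode() returns what it could convert and
    # rewrites $pending so that it holds only the unconverted tail.
    $text .= decode('UTF-8', $pending, Encode::FB_QUIET);
    push @leftover, $pending;
}

printf "decoded %d characters, %d octet(s) held back after the first chunk\n",
    length($text), length($leftover[0]);
```

After the first pass only \xe3 remains in C<$pending>; the second chunk
completes the character and the buffer ends up empty.
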
Unfortunately, this scheme does not work well with escape-based
encodings such as ISO-2022-JP.

=head1 Line Buffering

Now let's see what happens when you try to decode from ISO-2022-JP and
the buffer ends in the middle of a character.

                        JIS208-ESC   \x{5f3e}
       A   B   C   ....   ~   \e   $   B  |DAN | ....
      41  42  43   ....  7E   1b  24  42  43  46 ....
      <- buffer --------------------------->
      << encoded >>>>>>>>>>>>>>>>>>>>>>>

As you see, the next buffer begins with \x43.  But \x43 is 'C' in
ASCII, which is wrong in this case because we are now in the JIS X 0208
area, so it has to convert \x43\x46, not \x43.  Unlike utf8 and EUC,
in escape-based encodings you can't tell if a given octet is a whole
character or just part of it.

Fortunately PerlIO also supports a line buffer if you tell PerlIO to use
one instead of a fixed buffer.  Since ISO-2022-JP is guaranteed to revert
to ASCII at the end of the line, a partial character will never happen
when a line buffer is used.

To tell PerlIO to use a line buffer, implement the -E<gt>needs_lines method
for your encoding object.  See L<Encode::Encoding> for details.

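To see which buffering an existing encoding asks for, you can query its
encoding object. This is only a sketch: it assumes the C<iso-2022-jp> and
C<utf8> encodings are available in your Encode installation, and it guards
the call with C<can> in case an encoding class does not provide the method.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

for my $name (qw(iso-2022-jp utf8)) {
    my $enc = find_encoding($name) or die "unknown encoding: $name";
    # A true value from needs_lines asks PerlIO for a line buffer.
    my $wants_lines = $enc->can('needs_lines') && $enc->needs_lines;
    printf "%-12s %s\n", $name, $wants_lines ? 'line buffer' : 'fixed buffer';
}
```
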
Thanks to these efforts, most encodings that come with Encode support
PerlIO, but that still leaves the following encodings:

  iso-2022-kr
  MIME-B
  MIME-Header
  MIME-Q

Fortunately iso-2022-kr is hardly used (according to Jungshik) and
MIME-* are very unlikely to be fed to PerlIO because they are for mail
headers.  See L<Encode::MIME::Header> for details.

=head2 How can I tell whether my encoding fully supports PerlIO?

As of this writing, any encoding whose class belongs to Encode::XS or
Encode::Unicode works.  The Encode module has a C<perlio_ok> function
which you can use before applying a PerlIO encoding layer to the filehandle.
Here is an example:

    use Encode qw(decode perlio_ok);

    my $use_perlio = perlio_ok($enc);
    my $layer      = $use_perlio ? "<:encoding($enc)" : "<:raw";
    open my $fh, $layer, $file or die "$file : $!";
    while(<$fh>){
        # decode by hand only when the PerlIO layer could not be used
        $_ = decode($enc, $_) unless $use_perlio;
        # ....
    }

=head1 SEE ALSO

L<Encode>,
L<Encode::Encoding>,
L<Encode::Supported>,
L<encoding>,
L<perlebcdic>,
L<perlfunc/open>,
L<perlunicode>,
L<utf8>,
the Perl Unicode Mailing List

=cut