=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

=over 4

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC on EBCDIC platforms) if the filehandle is opened
with the ":utf8" layer.  Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer.  See L<open>.

To indicate that Perl source itself is using a particular encoding,
see L<encoding>.

=item Regular Expressions

The regular expression compiler produces polymorphic opcodes.  That is,
the pattern adapts to the data and automatically switches to the Unicode
character scheme when presented with Unicode data--or instead uses
a traditional byte scheme when presented with byte data.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines.  B<These are the only times when an explicit C<use utf8>
is needed.>  See L<utf8>.

You can also use the C<encoding> pragma to change the default encoding
of the data in your script; see L<encoding>.

=item BOM-marked scripts and UTF-16 scripts autodetected

If a Perl script begins with a Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding.  This happens because the first 256
code points in Unicode happen to agree with Latin-1.

If you wish to interpret byte strings as UTF-8 instead, use the
C<encoding> pragma:

    use encoding 'utf8';

See L</"Byte and Character Semantics"> for more details.

=back

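
The layer behavior described under "Input and Output Layers" above can
be illustrated with a small round trip.  This is only a sketch: it
assumes an ASCII-based platform and uses an in-memory filehandle so that
it needs no real file, and the variable names are invented for the
example.

```perl
use strict;
use warnings;

# A one-character string: U+263A WHITE SMILING FACE.
my $smiley = "\x{263A}";

# Write the character through an encoding layer into an in-memory file.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open for write: $!";
print {$out} $smiley;
close($out) or die "close: $!";

# The buffer now holds the encoded bytes, not characters.
my $byte_count = length $buffer;

# Reading through the same layer decodes the bytes back to one character.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open for read: $!";
my $decoded = <$in>;
close($in);

my $char_count = length $decoded;
printf "bytes=%d chars=%d code point=U+%04X\n",
    $byte_count, $char_count, ord $decoded;
```

The same three-argument C<open> form applies to ordinary files; only the
last argument changes from a scalar reference to a file name.
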
=head2 Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

In the future, Perl-level operations will be expected to work with
characters rather than bytes.

However, as an interim compatibility measure, Perl aims to
provide a safe migration path from byte semantics to character
semantics for programs.  For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to
character semantics.  For operations where this determination cannot
be made without additional information from the user, Perl decides in
favor of compatibility and chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data.  Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

The C<bytes> pragma will always, regardless of platform, force byte
semantics in a particular lexical scope.  See L<bytes>.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op.  See L<utf8>.

Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data.
The decision to use character semantics is made transparently.  If
input data comes from a Unicode source--for example, if a character
encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect.  The C<bytes> pragma should
be used to force byte semantics on Unicode data.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will be created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC.  This translation is done without
regard to the system's native 8-bit encoding.  To change this for
systems with non-Latin-1 and non-EBCDIC native encodings, use the
C<encoding> pragma.  See L<encoding>.

Under character semantics, many operations that formerly operated on
bytes now operate on characters.  A character in Perl is
logically just a number ranging from 0 to 2**31 or so.  Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.

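
As a rough illustration of the two semantics, the sketch below compares
C<length> with and without the C<bytes> pragma and then concatenates a
byte string with a Unicode string.  It assumes an ASCII-based platform,
where the internal encoding is UTF-8; the variable names are invented
for the example.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";    # one character with an ordinal above 255

# Character semantics: the string is one character long.
my $chars = length $smiley;

# Byte semantics, forced lexically: the internal UTF-8 form is visible.
my $bytes = do { use bytes; length $smiley };

# Concatenation upgrades the byte string as if it were Latin-1, so
# \xE9 becomes the single character U+00E9 in the result.
my $latin1 = "\xE9";
my $joined = $latin1 . $smiley;
my $first  = ord substr($joined, 0, 1);

printf "chars=%d bytes=%d joined length=%d first=U+%04X\n",
    $chars, $bytes, length $joined, $first;
```
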
=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters
may occur directly within the literal strings in one of the various
Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
as such and converted to Perl's internal representation only if the
appropriate L<encoding> is specified.

Unicode characters can also be added to a string by using the
C<\x{...}> notation.  The Unicode code for the desired character, in
hexadecimal, should be placed in the braces.  For instance, a smiley
face is C<\x{263A}>.  This encoding scheme only works for characters
with a code of 0x100 or above.

Additionally, if you

    use charnames ':full';

you can use the C<\N{...}> notation and put the official Unicode
character name within the braces, such as C<\N{WHITE SMILING FACE}>.

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs.  Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes.  "." matches
a character instead of a byte.  The C<\C> pattern is provided to force
a match of a single byte--a C<char> in C, hence C<\C>.

=item *

Character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database.  C<\w> can be used to match a Japanese
ideograph, for instance.

(However, and as a limitation of the current implementation, using
C<\w> or C<\W> I<inside> a C<[...]> character class will still match
with byte semantics.)

=item *

Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".

For instance, C<\p{Lu}> matches any character with the Unicode "Lu"
(Letter, uppercase) property, while C<\p{M}> matches any character
with an "M" (mark--accents and such) property.  Braces are not
required for single letter properties, so C<\p{M}> is equivalent to
C<\pM>.  Many predefined properties are available, such as
C<\p{Mirrored}> and C<\p{Tibetan}>.

The official Unicode script and block names have spaces and dashes as
separators, but for convenience you can use dashes, spaces, or
underscores, and case is unimportant.  It is recommended, however, that
for consistency you use the following naming: the official Unicode
script, property, or block name (see below for the additional rules
that apply to block names) with whitespace and dashes removed, and the
words "uppercase-first-lowercase-rest".  C<Latin-1 Supplement> thus
becomes C<Latin1Supplement>.

You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.

B<NOTE: the properties, scripts, and blocks listed here are as of
Unicode 3.2.0, March 2002, or Perl 5.8.0, July 2002.  Unicode 4.0.0
came out in April 2003, and Perl 5.8.1 in September 2003.>

Here are the basic Unicode General Category properties, followed by their
long form.  You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
for instance, are identical.

    Short       Long

    L           Letter
    LC          CasedLetter
    Lu          UppercaseLetter
    Ll          LowercaseLetter
    Lt          TitlecaseLetter
    Lm          ModifierLetter
    Lo          OtherLetter

    M           Mark
    Mn          NonspacingMark
    Mc          SpacingMark
    Me          EnclosingMark

    N           Number
    Nd          DecimalNumber
    Nl          LetterNumber
    No          OtherNumber

    P           Punctuation
    Pc          ConnectorPunctuation
    Pd          DashPunctuation
    Ps          OpenPunctuation
    Pe          ClosePunctuation
    Pi          InitialPunctuation
                (may behave like Ps or Pe depending on usage)
    Pf          FinalPunctuation
                (may behave like Ps or Pe depending on usage)
    Po          OtherPunctuation

    S           Symbol
    Sm          MathSymbol
    Sc          CurrencySymbol
    Sk          ModifierSymbol
    So          OtherSymbol

    Z           Separator
    Zs          SpaceSeparator
    Zl          LineSeparator
    Zp          ParagraphSeparator

    C           Other
    Cc          Control
    Cf          Format
    Cs          Surrogate   (not usable)
    Co          PrivateUse
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<LC> and C<L&> are special cases, which are aliases for the set of
C<Ll>, C<Lu>, and C<Lt>.

Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates.  C<Cs> is therefore not
supported.

Because scripts differ in their directionality--Hebrew is
written right to left, for example--Unicode supplies these properties in
the BidiClass class:

    Property    Meaning

    L           Left-to-Right
    LRE         Left-to-Right Embedding
    LRO         Left-to-Right Override
    R           Right-to-Left
    AL          Right-to-Left Arabic
    RLE         Right-to-Left Embedding
    RLO         Right-to-Left Override
    PDF         Pop Directional Format
    EN          European Number
    ES          European Number Separator
    ET          European Number Terminator
    AN          Arabic Number
    CS          Common Number Separator
    NSM         Non-Spacing Mark
    BN          Boundary Neutral
    B           Paragraph Separator
    S           Segment Separator
    WS          Whitespace
    ON          Other Neutrals

For example, C<\p{BidiClass:R}> matches characters that are normally
written right to left.

=back

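
The notations from the list above can be tried out in a few lines.  The
following is a minimal sketch rather than a complete tour; the sample
characters and variable names were chosen for the example, and each
result is normalized to 1 or 0 so that it is easy to print.

```perl
use strict;
use warnings;
use charnames ':full';

# \x{...} and \N{...} are two spellings of the same character.
my $by_code = "\x{263A}";
my $by_name = "\N{WHITE SMILING FACE}";
my $same    = $by_code eq $by_name ? 1 : 0;

# General Category properties, short and long form.
my $upper_short = "A" =~ /^\p{Lu}$/              ? 1 : 0;
my $upper_long  = "A" =~ /^\p{UppercaseLetter}$/ ? 1 : 0;

# \P{...} and \p{^...} both negate the property.
my $neg_big_p = "a" =~ /^\P{Lu}$/  ? 1 : 0;
my $neg_caret = "a" =~ /^\p{^Lu}$/ ? 1 : 0;

# Single-letter properties need no braces: U+0301 is a combining mark.
my $mark = "\x{0301}" =~ /^\pM$/ ? 1 : 0;

# \w matches an ideograph (U+65E5) under character semantics.
my $word = "\x{65E5}" =~ /^\w$/ ? 1 : 0;

print "same=$same Lu=$upper_short/$upper_long ",
      "negated=$neg_big_p/$neg_caret mark=$mark word=$word\n";
```
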
=head2 Scripts

The script names which can be used by C<\p{...}> and C<\P{...}>,
such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:

    Arabic
    Armenian
    Bengali
    Bopomofo
    Buhid
    CanadianAboriginal
    Cherokee
    Cyrillic
    Deseret
    Devanagari
    Ethiopic
    Georgian
    Gothic
    Greek
    Gujarati
    Gurmukhi
    Han
    Hangul
    Hanunoo
    Hebrew
    Hiragana
    Inherited
    Kannada
    Katakana
    Khmer
    Lao
    Latin
    Malayalam
    Mongolian
    Myanmar
    Ogham
    OldItalic
    Oriya
    Runic
    Sinhala
    Syriac
    Tagalog
    Tagbanwa
    Tamil
    Telugu
    Thaana
    Thai
    Tibetan
    Yi

Extended property classes can supplement the basic
properties, defined by the F<PropList> Unicode database:

    ASCIIHexDigit
    BidiControl
    Dash
    Deprecated
    Diacritic
    Extender
    GraphemeLink
    HexDigit
    Hyphen
    Ideographic
    IDSBinaryOperator
    IDSTrinaryOperator
    JoinControl
    LogicalOrderException
    NoncharacterCodePoint
    OtherAlphabetic
    OtherDefaultIgnorableCodePoint
    OtherGraphemeExtend
    OtherLowercase
    OtherMath
    OtherUppercase
    QuotationMark
    Radical
    SoftDotted
    TerminalPunctuation
    UnifiedIdeograph
    WhiteSpace

and there are further derived properties:

    Alphabetic      Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
    Lowercase       Ll + OtherLowercase
    Uppercase       Lu + OtherUppercase
    Math            Sm + OtherMath

    ID_Start        Lu + Ll + Lt + Lm + Lo + Nl
    ID_Continue     ID_Start + Mn + Mc + Nd + Pc

    Any             Any character
    Assigned        Any non-Cn character (i.e. synonym for \P{Cn})
    Unassigned      Synonym for \p{Cn}
    Common          Any character (or unassigned code point)
                    not explicitly assigned to a script

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for
example, is equal to C<\P{Lu}>.

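
A short sketch of script matching follows.  The sample characters are
chosen for the example; the digit case shows that characters shared
across scripts fall under C<Common> rather than under C<Latin>.

```perl
use strict;
use warnings;

# Letters belong to a specific script.
my $latin    = "A"        =~ /^\p{Latin}$/    ? 1 : 0;
my $cyrillic = "\x{0416}" =~ /^\p{Cyrillic}$/ ? 1 : 0;   # CYRILLIC ZHE

# Digits are shared across scripts, so they are Common, not Latin.
my $digit_latin  = "7" =~ /^\p{Latin}$/  ? 1 : 0;
my $digit_common = "7" =~ /^\p{Common}$/ ? 1 : 0;

# The "Is" prefix is accepted for backward compatibility.
my $is_prefix = "A" =~ /^\p{IsLu}$/ ? 1 : 0;

print "latin=$latin cyrillic=$cyrillic ",
      "digit: latin=$digit_latin common=$digit_common ",
      "IsLu=$is_prefix\n";
```
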
=head2 Blocks

In addition to B<scripts>, Unicode also defines B<blocks> of
characters.  The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of 256
Unicode characters.  For example, the C<Latin> script contains letters
from many blocks but does not contain all the characters from those
blocks.  It does not, for example, contain digits, because digits are
shared across many scripts.  Digits and similar groups, like
punctuation, are in a category called C<Common>.

For more about scripts, see the UTR #24:

    http://www.unicode.org/unicode/reports/tr24/

For more about blocks, see:

    http://www.unicode.org/Public/UNIDATA/Blocks.txt

Block names are given with the C<In> prefix.  For example, the
Katakana block is referenced via C<\p{InKatakana}>.  The C<In>
prefix may be omitted if there is no naming conflict with a script
or any other property, but it is recommended that C<In> always be used
for block tests to avoid confusion.

These block names are supported:

    InAlphabeticPresentationForms
    InArabic
    InArabicPresentationFormsA
    InArabicPresentationFormsB
    InArmenian
    InArrows
    InBasicLatin
    InBengali
    InBlockElements
    InBopomofo
    InBopomofoExtended
    InBoxDrawing
    InBraillePatterns
    InBuhid
    InByzantineMusicalSymbols
    InCJKCompatibility
    InCJKCompatibilityForms
    InCJKCompatibilityIdeographs
    InCJKCompatibilityIdeographsSupplement
    InCJKRadicalsSupplement
    InCJKSymbolsAndPunctuation
    InCJKUnifiedIdeographs
    InCJKUnifiedIdeographsExtensionA
    InCJKUnifiedIdeographsExtensionB
    InCherokee
    InCombiningDiacriticalMarks
    InCombiningDiacriticalMarksforSymbols
    InCombiningHalfMarks
    InControlPictures
    InCurrencySymbols
    InCyrillic
    InCyrillicSupplementary
    InDeseret
    InDevanagari
    InDingbats
    InEnclosedAlphanumerics
    InEnclosedCJKLettersAndMonths
    InEthiopic
    InGeneralPunctuation
    InGeometricShapes
    InGeorgian
    InGothic
    InGreekExtended
    InGreekAndCoptic
    InGujarati
    InGurmukhi
    InHalfwidthAndFullwidthForms
    InHangulCompatibilityJamo
    InHangulJamo
    InHangulSyllables
    InHanunoo
    InHebrew
    InHighPrivateUseSurrogates
    InHighSurrogates
    InHiragana
    InIPAExtensions
    InIdeographicDescriptionCharacters
    InKanbun
    InKangxiRadicals
    InKannada
    InKatakana
    InKatakanaPhoneticExtensions
    InKhmer
    InLao
    InLatin1Supplement
    InLatinExtendedA
    InLatinExtendedAdditional
    InLatinExtendedB
    InLetterlikeSymbols
    InLowSurrogates
    InMalayalam
    InMathematicalAlphanumericSymbols
    InMathematicalOperators
    InMiscellaneousMathematicalSymbolsA
    InMiscellaneousMathematicalSymbolsB
    InMiscellaneousSymbols
    InMiscellaneousTechnical
    InMongolian
    InMusicalSymbols
    InMyanmar
    InNumberForms
    InOgham
    InOldItalic
    InOpticalCharacterRecognition
    InOriya
    InPrivateUseArea
    InRunic
    InSinhala
    InSmallFormVariants
    InSpacingModifierLetters
    InSpecials
    InSuperscriptsAndSubscripts
    InSupplementalArrowsA
    InSupplementalArrowsB
    InSupplementalMathematicalOperators
    InSupplementaryPrivateUseAreaA
    InSupplementaryPrivateUseAreaB
    InSyriac
    InTagalog
    InTagbanwa
    InTags
    InTamil
    InTelugu
    InThaana
    InThai
    InTibetan
    InUnifiedCanadianAboriginalSyllabics
    InVariationSelectors
    InYiRadicals
    InYiSyllables

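
The difference between a block test and a script test can be seen with
a digit, which sits in the Basic Latin block but does not belong to the
Latin script.  This is a sketch only: it assumes that the C<In>-prefixed
block names listed above are accepted by the Perl in use, and the
variable names are invented for the example.

```perl
use strict;
use warnings;

# A block test: U+30A2 KATAKANA LETTER A lies in the Katakana block.
my $kana_block = "\x{30A2}" =~ /^\p{InKatakana}$/ ? 1 : 0;

# The digit 5 lies in the Basic Latin block ...
my $digit_in_block = "5" =~ /^\p{InBasicLatin}$/ ? 1 : 0;

# ... but it is not part of the Latin script.
my $digit_in_script = "5" =~ /^\p{Latin}$/ ? 1 : 0;

print "katakana block=$kana_block ",
      "digit: block=$digit_in_block script=$digit_in_script\n";
```
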
=over 4

=item *

The special pattern C<\X> matches any extended Unicode
sequence--"a combining character sequence" in Standardese--where the
first character is a base character and subsequent characters are mark
characters that apply to the base character.  C<\X> is equivalent to
C<(?:\PM\pM*)>.

=item *

The C<tr///> operator translates characters instead of bytes.  Note
that the C<tr///CU> functionality has been removed.  For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when character input is provided.  Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction.

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>.  Operators that
specifically do not switch include C<vec()>, C<pack()>, and
C<unpack()>.  Operators that really don't care include
operators that treat strings as a bucket of bits such as C<sort()>,
and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letters C<c> and C<C> do I<not> change,
since they are often used for byte-oriented formats.  Again, think
C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("U")> and C<unpack("U")>, I<not> C<pack("C")> and
C<unpack("C")>.  C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use C<~> (the bit complement) on strings that mix characters
with values less than 256 and characters with values greater than 255.
Most importantly, De Morgan's laws (C<~($x|$y) eq ~$x&~$y> and
C<~($x&$y) eq ~$x|~$y>) will not hold.  The reason for this mathematical
I<faux pas> is that the complement cannot return B<both> the 8-bit
(byte-wide) bit complement B<and> the full character-wide bit complement.

=item *

C<lc()>, C<uc()>, C<lcfirst()>, and C<ucfirst()> work for the following
cases:

=over 8

=item *

the case mapping is from a single Unicode character to another
single Unicode character, or

=item *

the case mapping is from a single Unicode character to more
than one Unicode character.

=back

Locale-specific case mappings (Lithuanian, Turkish, Azeri) do B<not> work
since Perl does not understand the concept of Unicode locales.

See the Unicode Technical Report #21, Case Mappings, for more details.

=back

=over 4

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back

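
Several of the operators in the list above can be exercised together.
The following sketch uses sample strings chosen for the example:
U+01C6 is a letter whose uppercase (U+01C4) and titlecase (U+01C5)
forms differ, which makes the C<uc()> versus C<ucfirst()> distinction
visible.

```perl
use strict;
use warnings;

# \X matches a base character plus its combining marks as one unit.
my $accented = "e\x{0301}";              # 'e' + COMBINING ACUTE ACCENT
my $one_unit = $accented =~ /^\X$/ ? 1 : 0;

# Positions and lengths count characters, not bytes.
my $text     = "a\x{263A}b";
my $len      = length $text;
my $middle   = substr($text, 1, 1);
my $pos_b    = index($text, "b");
my $reversed = scalar reverse $text;

# tr/// translates whole characters.
(my $starred = $text) =~ tr/\x{263A}/*/;

# uc() gives uppercase, ucfirst() gives titlecase.
my $upper = uc "\x{01C6}";
my $title = ucfirst "\x{01C6}";

# chr()/ord() agree with pack("U")/unpack("U").
my $packed   = pack("U", 0x263A);
my $unpacked = unpack("U", "\x{263A}");

printf "len=%d pos=%d starred=%s upper=U+%04X title=U+%04X\n",
    $len, $pos_b, $starred, ord $upper, ord $title;
```
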
=head2 User-Defined Character Properties

You can define your own character properties by defining subroutines
whose names begin with "In" or "Is".  The subroutines can be defined in
any package.  The user-defined properties can be used in the regular
expression C<\p> and C<\P> constructs; if you are using a user-defined
property from a package other than the one you are in, you must specify
its package in the C<\p> or C<\P> construct.

    # assuming property IsForeign defined in Lang::
    package main;  # property package name required
    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

    package Lang;  # property package name not required
    if ($txt =~ /\p{IsForeign}+/) { ... }

Note that the effect is compile-time and immutable once defined.

The subroutines must return a specially-formatted string, with one
or more newline-separated lines.  Each line must be one of the following:

=over 4

=item *

Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of Unicode code points to include.

=item *

Something to include, prefixed by "+": a built-in character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=item *

Something to exclude, prefixed by "-": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=item *

Something to negate, prefixed by "!": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=item *

Something to intersect with, prefixed by "&": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to keep only the characters that are also in that property; two
hexadecimal code points for a range; or a single hexadecimal code point.

=back

For example, to define a property that covers both the Japanese
syllabaries (hiragana and katakana), you can define

    sub InKana {
        return <<END;
    3040\t309F
    30A0\t30FF
    END
    }

Imagine that the here-doc end marker is at the beginning of the line.
Now you can use C<\p{InKana}> and C<\P{InKana}>.

You could also have used the existing block property names:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    END
    }

Suppose you wanted to match only the allocated characters,
not the raw block ranges: in other words, you want to remove
the unassigned code points:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }

The negation is useful for defining (surprise!) negated classes.

    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    END
    }

Intersection is useful for getting the common characters matched by
two (or more) classes.  Because property names must begin with "In" or
"Is", the two classes here are named C<IsFoo> and C<IsBar>.

    sub InFooAndBar {
        return <<'END';
    +main::IsFoo
    &main::IsBar
    END
    }

It's important to remember not to use "&" for the first set -- that
would be intersecting with nothing (resulting in an empty set).

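
The property definitions above can be put together into a small
program.  This is a sketch under stated assumptions: C<IsAssignedKana>
is a hypothetical name made up for the example, the range lines are
built with C<join> instead of a here-document to avoid indentation
issues, and U+3040 is assumed to be an unassigned code point inside the
Hiragana block.

```perl
use strict;
use warnings;

# Raw block ranges: each line is "start<TAB>end" in hexadecimal.
sub InKana {
    return join "\n",
        "3040\t309F",    # Hiragana block
        "30A0\t30FF",    # Katakana block
        "";
}

# Hypothetical refinement: the same ranges minus unassigned code points.
sub IsAssignedKana {
    return join "\n",
        "+main::InKana",
        "-utf8::IsCn",
        "";
}

my $hiragana = "\x{3042}" =~ /^\p{InKana}$/ ? 1 : 0;   # HIRAGANA LETTER A
my $katakana = "\x{30A2}" =~ /^\p{InKana}$/ ? 1 : 0;   # KATAKANA LETTER A
my $latin    = "A"        =~ /^\P{InKana}$/ ? 1 : 0;   # not kana

# U+3040 is inside the raw range but has no character assigned to it.
my $raw_range = "\x{3040}" =~ /^\p{InKana}$/         ? 1 : 0;
my $assigned  = "\x{3040}" =~ /^\p{IsAssignedKana}$/ ? 1 : 0;

print "hiragana=$hiragana katakana=$katakana not-kana=$latin ",
      "U+3040: raw=$raw_range assigned=$assigned\n";
```
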
759 | You can also define your own mappings to be used in the lc(),
|
---|
760 | lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
|
---|
761 | The principle is the same: define subroutines in the C<main> package
|
---|
762 | with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for
|
---|
763 | the first character in ucfirst()), and C<ToUpper> (for uc(), and the
|
---|
764 | rest of the characters in ucfirst()).
|
---|
765 |
|
---|
766 | The string returned by the subroutines needs now to be three
|
---|
767 | hexadecimal numbers separated by tabulators: start of the source
|
---|
768 | range, end of the source range, and start of the destination range.
|
---|
769 | For example:
|
---|
770 |
|
---|
771 | sub ToUpper {
|
---|
772 | return <<END;
|
---|
773 | 0061\t0063\t0041
|
---|
774 | END
|
---|
775 | }
|
---|
776 |
|
---|
777 | defines a uc() mapping that causes only the characters "a", "b", and
|
---|
778 | "c" to be mapped to "A", "B", and "C"; all other characters remain
|
---|
779 | unchanged.
|
---|
780 |
|
---|
781 | If there is no source range to speak of, that is, the mapping is from
|
---|
782 | a single character to another single character, leave the end of the
|
---|
783 | source range empty, but the two tab characters are still needed.
|
---|
784 | For example:
|
---|
785 |
|
---|
786 | sub ToLower {
|
---|
787 | return <<END;
|
---|
788 | 0041\t\t0061
|
---|
789 | END
|
---|
790 | }
|
---|
791 |
|
---|
792 | defines a lc() mapping that causes only "A" to be mapped to "a"; all
|
---|
793 | other characters remain unchanged.
|
---|
794 |
|
---|
795 | (For serious hackers only) If you want to introspect the default
|
---|
796 | mappings, you can find the data in the directory
|
---|
797 | C<$Config{privlib}>/F<unicore/To/>. The mapping data is returned as
|
---|
798 | the here-document, and the C<utf8::ToSpecFoo> are special exception
|
---|
799 | mappings derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>.
|
---|
800 | The C<Digit> and C<Fold> mappings that one can see in the directory
|
---|
801 | are not directly user-accessible; one can use either the
|
---|
802 | C<Unicode::UCD> module, or just match case-insensitively (that's when
|
---|
803 | the C<Fold> mapping is used).
|
---|
804 |
|
---|
805 | A final note on the user-defined property tests and mappings: they
|
---|
806 | will be used only if the scalar has been marked as having Unicode
|
---|
807 | characters. Old byte-style strings will not be affected.
|
---|
808 |
|
---|
809 | =head2 Character Encodings for Input and Output
|
---|
810 |
|
---|
811 | See L<Encode>.
|
---|
812 |
|
---|
813 | =head2 Unicode Regular Expression Support Level
|
---|
814 |
|
---|
815 | The following list of Unicode support for regular expressions describes
|
---|
816 | all the features currently supported. The references to "Level N"
|
---|
817 | and the section numbers refer to the Unicode Technical Report 18,
|
---|
818 | "Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0,
|
---|
819 | Perl 5.8.0).
|
---|
820 |
|
---|
821 | =over 4
|
---|
822 |
|
---|
823 | =item *
|
---|
824 |
|
---|
825 | Level 1 - Basic Unicode Support
|
---|
826 |
|
---|
827 | 2.1 Hex Notation - done [1]
|
---|
828 | Named Notation - done [2]
|
---|
829 | 2.2 Categories - done [3][4]
|
---|
830 | 2.3 Subtraction - MISSING [5][6]
|
---|
831 | 2.4 Simple Word Boundaries - done [7]
|
---|
832 | 2.5 Simple Loose Matches - done [8]
|
---|
833 | 2.6 End of Line - MISSING [9][10]
|
---|
834 |
|
---|
835 | [ 1] \x{...}
|
---|
836 | [ 2] \N{...}
|
---|
837 | [ 3] . \p{...} \P{...}
|
---|
838 | [ 4] support for scripts (see UTR#24 Script Names), blocks,
|
---|
839 | binary properties, enumerated non-binary properties, and
|
---|
840 | numeric properties (as listed in UTR#18 Other Properties)
|
---|
841 | [ 5] have negation
|
---|
842 | [ 6] can use regular expression look-ahead [a]
|
---|
843 | or user-defined character properties [b] to emulate subtraction
|
---|
844 | [ 7] include Letters in word characters
|
---|
845 | [ 8] note that Perl does Full case-folding in matching, not Simple:
|
---|
846 | for example U+1F88 is equivalent with U+1F00 U+03B9,
|
---|
847 | not with U+1F80. This difference matters for certain Greek
|
---|
848 | capital letters with certain modifiers: the Full case-folding
|
---|
849 | decomposes the letter, while the Simple case-folding would map
|
---|
850 | it to a single character.
|
---|
851 | [ 9] see UTR #13 Unicode Newline Guidelines
|
---|
852 | [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029}
|
---|
853 | (should also affect <>, $., and script line numbers)
|
---|
854 | (the \x{85}, \x{2028} and \x{2029} do match \s)
|
---|
855 |
|
---|
856 | [a] You can mimic class subtraction using lookahead.
|
---|
857 | For example, what UTR #18 might write as
|
---|
858 |
|
---|
859 | [{Greek}-[{UNASSIGNED}]]
|
---|
860 |
|
---|
861 | in Perl can be written as:
|
---|
862 |
|
---|
863 | (?!\p{Unassigned})\p{InGreekAndCoptic}
|
---|
864 | (?=\p{Assigned})\p{InGreekAndCoptic}
|
---|
865 |
|
---|
866 | But in this particular example, you probably really want
|
---|
867 |
|
---|
868 | \p{Greek}
|
---|
869 |
|
---|
870 | which will match assigned characters known to be part of the Greek script.
|
---|
871 |
|
---|
872 | Also see the Unicode::Regex::Set module; it does implement the full
|
---|
873 | UTR #18 grouping, intersection, union, and removal (subtraction) syntax.
|
---|
874 |
|
---|
875 | [b] See L</"User-Defined Character Properties">.
|
---|
876 |
|
---|
877 | =item *
|
---|
878 |
|
---|
879 | Level 2 - Extended Unicode Support
|
---|
880 |
|
---|
881 | 3.1 Surrogates - MISSING [11]
|
---|
882 | 3.2 Canonical Equivalents - MISSING [12][13]
|
---|
883 | 3.3 Locale-Independent Graphemes - MISSING [14]
|
---|
884 | 3.4 Locale-Independent Words - MISSING [15]
|
---|
885 | 3.5 Locale-Independent Loose Matches - MISSING [16]
|
---|
886 |
|
---|
887 | [11] Surrogates are solely a UTF-16 concept and Perl's internal
|
---|
888 | representation is UTF-8. The Encode module does UTF-16, though.
|
---|
889 | [12] see UTR#15 Unicode Normalization
|
---|
890 | [13] have Unicode::Normalize but not integrated to regexes
|
---|
891 | [14] have \X but at this level . should equal that
|
---|
892 | [15] need three classes, not just \w and \W
|
---|
893 | [16] see UTR#21 Case Mappings
|
---|
894 |
|
---|
895 | =item *
|
---|
896 |
|
---|
897 | Level 3 - Locale-Sensitive Support
|
---|
898 |
|
---|
899 | 4.1 Locale-Dependent Categories - MISSING
|
---|
900 | 4.2 Locale-Dependent Graphemes - MISSING [17][18]
|
---|
901 | 4.3 Locale-Dependent Words - MISSING
|
---|
902 | 4.4 Locale-Dependent Loose Matches - MISSING
|
---|
903 | 4.5 Locale-Dependent Ranges - MISSING
|
---|
904 |
|
---|
905 | [17] see UTR#10 Unicode Collation Algorithms
|
---|
906 | [18] have Unicode::Collate but not integrated to regexes
|
---|
907 |
|
---|
908 | =back
|
---|
909 |
|
---|
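The lookahead emulation of class subtraction described in note [a] can
be tried out directly. The following is a minimal sketch; the helper
name C<is_assigned_greek> is hypothetical, and the sample code points
were chosen on the assumption that U+03A2 remains a reserved
(unassigned) position inside the Greek and Coptic block.

```perl
use strict;
use warnings;

# Block membership minus unassigned code points, via negative lookahead.
my $assigned_greek = qr/(?!\p{Unassigned})\p{InGreekAndCoptic}/;

# Hypothetical helper: true if the code point is an assigned character
# of the Greek and Coptic block.
sub is_assigned_greek {
    my ($cp) = @_;
    return chr($cp) =~ /\A$assigned_greek\z/ ? 1 : 0;
}

# GREEK SMALL LETTER ALPHA, a reserved gap in the block, LATIN CAPITAL A
for my $cp (0x03B1, 0x03A2, 0x0041) {
    printf "U+%04X %s\n", $cp,
        is_assigned_greek($cp) ? "assigned Greek" : "rejected";
}
```

The lookahead consumes nothing, so the property test that follows it
still matches exactly one character.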
910 | =head2 Unicode Encodings
|
---|
911 |
|
---|
912 | Unicode characters are assigned to I<code points>, which are abstract
|
---|
913 | numbers. To use these numbers, various encodings are needed.
|
---|
914 |
|
---|
915 | =over 4
|
---|
916 |
|
---|
917 | =item *
|
---|
918 |
|
---|
919 | UTF-8
|
---|
920 |
|
---|
921 | UTF-8 is a variable-length (1 to 6 bytes; current character allocations
|
---|
922 | require at most 4 bytes), byte-order independent encoding. For ASCII (and we
|
---|
923 | really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is
|
---|
924 | transparent.
|
---|
925 |
|
---|
926 | The following table is from Unicode 3.2.
|
---|
927 |
|
---|
928 | Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
|
---|
929 |
|
---|
930 | U+0000..U+007F 00..7F
|
---|
931 | U+0080..U+07FF C2..DF 80..BF
|
---|
932 | U+0800..U+0FFF E0 A0..BF 80..BF
|
---|
933 | U+1000..U+CFFF E1..EC 80..BF 80..BF
|
---|
934 | U+D000..U+D7FF ED 80..9F 80..BF
|
---|
935 | U+D800..U+DFFF ******* ill-formed *******
|
---|
936 | U+E000..U+FFFF EE..EF 80..BF 80..BF
|
---|
937 | U+10000..U+3FFFF F0 90..BF 80..BF 80..BF
|
---|
938 | U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
|
---|
939 | U+100000..U+10FFFF F4 80..8F 80..BF 80..BF
|
---|
940 |
|
---|
941 | Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in
|
---|
942 | C<U+D000..U+D7FF>, the C<90..BF> in C<U+10000..U+3FFFF>, and the
|
---|
943 | C<80..8F> in C<U+100000..U+10FFFF>. The "gaps" are caused by legal
|
---|
944 | UTF-8 avoiding non-shortest encodings: it is technically possible to
|
---|
945 | UTF-8-encode a single code point in different ways, but that is
|
---|
946 | explicitly forbidden, and the shortest possible encoding should always
|
---|
947 | be used. So that's what Perl does.
|
---|
948 |
|
---|
949 | Another way to look at it is via bits:
|
---|
950 |
|
---|
951 | Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
|
---|
952 |
|
---|
953 | 0aaaaaaa 0aaaaaaa
|
---|
954 | 00000bbbbbaaaaaa 110bbbbb 10aaaaaa
|
---|
955 | ccccbbbbbbaaaaaa 1110cccc 10bbbbbb 10aaaaaa
|
---|
956 | 00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa
|
---|
957 |
|
---|
958 | As you can see, the continuation bytes all begin with C<10>, and the
|
---|
959 | leading bits of the start byte tell how many bytes there are in the
|
---|
960 | encoded character.
|
---|
961 |
|
---|
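The bit layout in the table above can be turned into code almost
mechanically. The following is a sketch only: C<encode_utf8_char> is a
hypothetical helper written for illustration, it handles code points up
to U+10FFFF, and it makes no attempt to reject surrogates. Real programs
should use the C<Encode> module instead.

```perl
use strict;
use warnings;

# Hypothetical helper: encode one code point into UTF-8 octets by
# following the bit layout: a leading byte that announces the length,
# then continuation bytes of the form 10xxxxxx.
sub encode_utf8_char {
    my ($cp) = @_;
    return pack('C', $cp) if $cp < 0x80;
    return pack('C2', 0xC0 | ($cp >> 6),
                      0x80 | ($cp & 0x3F)) if $cp < 0x800;
    return pack('C3', 0xE0 | ($cp >> 12),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    return pack('C4', 0xF0 | ($cp >> 18),
                      0x80 | (($cp >> 12) & 0x3F),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F));
}

for my $cp (0x41, 0xE9, 0x20AC, 0x1D11E) {
    my $octets = encode_utf8_char($cp);
    printf "U+%04X => %s\n", $cp,
        join ' ', map { sprintf '%02X', $_ } unpack 'C*', $octets;
}
```

For U+20AC this yields the three octets E2 82 AC, which is the same
result C<utf8::encode()> produces for that character.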
962 | =item *
|
---|
963 |
|
---|
964 | UTF-EBCDIC
|
---|
965 |
|
---|
966 | Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
|
---|
967 |
|
---|
968 | =item *
|
---|
969 |
|
---|
970 | UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
|
---|
971 |
|
---|
972 | The following items are mostly for reference and general Unicode
|
---|
973 | knowledge; Perl doesn't use these constructs internally.
|
---|
974 |
|
---|
975 | UTF-16 is a 2 or 4 byte encoding. The Unicode code points
|
---|
976 | C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code
|
---|
977 | points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is
|
---|
978 | using I<surrogates>, the first 16-bit unit being the I<high
|
---|
979 | surrogate>, and the second being the I<low surrogate>.
|
---|
980 |
|
---|
981 | Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
|
---|
982 | range of Unicode code points in pairs of 16-bit units. The I<high
|
---|
983 | surrogates> are the range C<U+D800..U+DBFF>, and the I<low surrogates>
|
---|
984 | are the range C<U+DC00..U+DFFF>. The surrogate encoding is
|
---|
985 |
|
---|
986 | $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
|
---|
987 | $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
|
---|
988 |
|
---|
989 | and the decoding is
|
---|
990 |
|
---|
991 | $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
|
---|
992 |
|
---|
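As a worked example of the surrogate arithmetic, here is a small
round-trip sketch. The helper names C<to_surrogates> and
C<from_surrogates> are hypothetical. Note that Perl's C</> operator is
floating-point division, so the sketch truncates the quotient with
C<int()>.

```perl
use strict;
use warnings;

# Hypothetical helper: split a supplementary code point into a
# (high, low) surrogate pair.
sub to_surrogates {
    my ($uni) = @_;
    die sprintf("U+%04X is outside U+10000..U+10FFFF\n", $uni)
        if $uni < 0x10000 || $uni > 0x10FFFF;
    my $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    my $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
    return ($hi, $lo);
}

# Hypothetical helper: recombine a surrogate pair into a code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1D11E);   # MUSICAL SYMBOL G CLEF
printf "U+1D11E => %04X %04X => U+%04X\n",
    $hi, $lo, from_surrogates($hi, $lo);
```

U+1D11E becomes the pair D834 DD1E, and decoding that pair gives back
the original code point.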
993 | If you try to generate surrogates (for example by using chr()), you
|
---|
994 | will get a warning if warnings are turned on, because those code
|
---|
995 | points are not valid for a Unicode character.
|
---|
996 |
|
---|
997 | Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
|
---|
998 | itself can be used for in-memory computations, but if storage or
|
---|
999 | transfer is required either UTF-16BE (big-endian) or UTF-16LE
|
---|
1000 | (little-endian) encodings must be chosen.
|
---|
1001 |
|
---|
1002 | This introduces another problem: what if you just know that your data
|
---|
1003 | is UTF-16, but you don't know which endianness? Byte Order Marks, or
|
---|
1004 | BOMs, are a solution to this. A special character has been reserved
|
---|
1005 | in Unicode to function as a byte order marker: the character with the
|
---|
1006 | code point C<U+FEFF> is the BOM.
|
---|
1007 |
|
---|
1008 | The trick is that if you read a BOM, you will know the byte order,
|
---|
1009 | since if it was written on a big-endian platform, you will read the
|
---|
1010 | bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
|
---|
1011 | you will read the bytes C<0xFF 0xFE>. (And if the originating platform
|
---|
1012 | was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
|
---|
1013 |
|
---|
1014 | The way this trick works is that the character with the code point
|
---|
1015 | C<U+FFFE> is guaranteed not to be a valid Unicode character, so the
|
---|
1016 | sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
|
---|
1017 | little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
|
---|
1018 | format".
|
---|
1019 |
|
---|
1020 | =item *
|
---|
1021 |
|
---|
1022 | UTF-32, UTF-32BE, UTF-32LE
|
---|
1023 |
|
---|
1024 | The UTF-32 family is pretty much like the UTF-16 family, except that
|
---|
1025 | the units are 32-bit, and therefore the surrogate scheme is not
|
---|
1026 | needed. The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and
|
---|
1027 | C<0xFF 0xFE 0x00 0x00> for LE.
|
---|
1028 |
|
---|
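The BOM signatures listed for the UTF-16 and UTF-32 families (and for
UTF-8) can be checked with a few prefix comparisons. This is a heuristic
sketch, not a complete encoding detector; the helper name C<sniff_bom>
is hypothetical. The four-byte signatures are tested first because the
UTF-32LE signature begins with the same two bytes as the UTF-16LE one.

```perl
use strict;
use warnings;

# Hypothetical helper: guess an encoding from the BOM at the start of a
# string of raw octets. Returns nothing if no BOM is recognized.
sub sniff_bom {
    my ($octets) = @_;
    return 'UTF-32BE' if index($octets, "\x00\x00\xFE\xFF") == 0;
    return 'UTF-32LE' if index($octets, "\xFF\xFE\x00\x00") == 0;
    return 'UTF-8'    if index($octets, "\xEF\xBB\xBF") == 0;
    return 'UTF-16BE' if index($octets, "\xFE\xFF") == 0;
    return 'UTF-16LE' if index($octets, "\xFF\xFE") == 0;
    return;
}

for my $sample ("\xFE\xFF\x00A", "\xFF\xFEA\x00", "\xEF\xBB\xBFA", "A") {
    my $guess = sniff_bom($sample);
    printf "%-12s %s\n",
        join('', map { sprintf '%02X', $_ } unpack 'C*', $sample),
        defined $guess ? $guess : 'no BOM';
}
```

A UTF-16LE stream whose first character after the BOM is U+0000 would be
misread as UTF-32LE by this sketch; that ambiguity is inherent in the
signatures themselves.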
1029 | =item *
|
---|
1030 |
|
---|
1031 | UCS-2, UCS-4
|
---|
1032 |
|
---|
1033 | Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
|
---|
1034 | encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
|
---|
1035 | because it does not use surrogates. UCS-4 is a 32-bit encoding,
|
---|
1036 | functionally identical to UTF-32.
|
---|
1037 |
|
---|
1038 | =item *
|
---|
1039 |
|
---|
1040 | UTF-7
|
---|
1041 |
|
---|
1042 | A seven-bit safe (non-eight-bit) encoding, which is useful if the
|
---|
1043 | transport or storage is not eight-bit safe. Defined by RFC 2152.
|
---|
1044 |
|
---|
1045 | =back
|
---|
1046 |
|
---|
1047 | =head2 Security Implications of Unicode
|
---|
1048 |
|
---|
1049 | =over 4
|
---|
1050 |
|
---|
1051 | =item *
|
---|
1052 |
|
---|
1053 | Malformed UTF-8
|
---|
1054 |
|
---|
1055 | Unfortunately, the specification of UTF-8 leaves some room for
|
---|
1056 | interpretation of how many bytes of encoded output one should generate
|
---|
1057 | from one input Unicode character. Strictly speaking, the shortest
|
---|
1058 | possible sequence of UTF-8 bytes should be generated,
|
---|
1059 | because otherwise there is potential for an input buffer overflow at
|
---|
1060 | the receiving end of a UTF-8 connection. Perl always generates the
|
---|
1061 | shortest length UTF-8, and with warnings on Perl will warn about
|
---|
1062 | non-shortest length UTF-8 along with other malformations, such as the
|
---|
1063 | surrogates, which are not real Unicode code points.
|
---|
1064 |
|
---|
1065 | =item *
|
---|
1066 |
|
---|
1067 | Regular expressions behave slightly differently between byte data and
|
---|
1068 | character (Unicode) data. For example, the "word character" character
|
---|
1069 | class C<\w> will work differently depending on whether the data is eight-bit bytes
|
---|
1070 | or Unicode.
|
---|
1071 |
|
---|
1072 | In the first case, the set of C<\w> characters is either small--the
|
---|
1073 | default set of alphabetic characters, digits, and the "_"--or, if you
|
---|
1074 | are using a locale (see L<perllocale>), the C<\w> might contain a few
|
---|
1075 | more letters according to your language and country.
|
---|
1076 |
|
---|
1077 | In the second case, the C<\w> set of characters is much, much larger.
|
---|
1078 | Most importantly, even in the set of the first 256 characters, it will
|
---|
1079 | probably match different characters: unlike most locales, which are
|
---|
1080 | specific to a language and country pair, Unicode classifies all the
|
---|
1081 | characters that are letters I<somewhere> as C<\w>. For example, your
|
---|
1082 | locale might not think that LATIN SMALL LETTER ETH is a letter (unless
|
---|
1083 | you happen to speak Icelandic), but Unicode does.
|
---|
1084 |
|
---|
1085 | As discussed elsewhere, Perl has one foot (two hooves?) planted in
|
---|
1086 | each of two worlds: the old world of bytes and the new world of
|
---|
1087 | characters, upgrading from bytes to characters when necessary.
|
---|
1088 | If your legacy code does not explicitly use Unicode, no automatic
|
---|
1089 | switch-over to characters should happen. Characters shouldn't get
|
---|
1090 | downgraded to bytes, either. It is possible to accidentally mix bytes
|
---|
1091 | and characters, however (see L<perluniintro>), in which case C<\w> in
|
---|
1092 | regular expressions might start behaving differently. Review your
|
---|
1093 | code. Use warnings and the C<strict> pragma.
|
---|
1094 |
|
---|
1095 | =back
|
---|
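To illustrate the point about malformed UTF-8, the following sketch asks
the C<Encode> module to decode a few octet sequences strictly. The
helper name C<is_strict_utf8> is hypothetical. The second sample is a
non-shortest ("overlong") encoding of "/", and the third is the UTF-8
style encoding of the surrogate U+D800.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $octets decode cleanly as strict UTF-8.
sub is_strict_utf8 {
    my ($octets) = @_;
    my $ok = eval {
        Encode::decode('UTF-8', $octets,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

for my $sample ("/", "\xC0\xAF", "\xED\xA0\x80", "\xE2\x82\xAC") {
    printf "%-8s %s\n",
        join('', map { sprintf '%02X', $_ } unpack 'C*', $sample),
        is_strict_utf8($sample) ? "well-formed" : "rejected";
}
```

Rejecting the overlong form matters because a filter that looks for the
single byte 2F would otherwise miss a "/" smuggled in as C0 AF.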
1096 |
|
---|
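The difference in C<\w> between byte data and character data can be
observed with a single character. This sketch assumes Perl's default
matching rules, that is, no C<use locale> and nothing in scope that
forces Unicode rules on byte strings; the helper name C<matches_word> is
hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string is exactly one \w character.
sub matches_word {
    my ($str) = @_;
    return $str =~ /\A\w\z/ ? 1 : 0;
}

my $bytes = "\xF0";        # LATIN SMALL LETTER ETH held as a byte string
my $chars = $bytes;
utf8::upgrade($chars);     # same character, now marked as Unicode data

printf "byte string:      \\w %s\n",
    matches_word($bytes) ? "matches" : "does not match";
printf "character string: \\w %s\n",
    matches_word($chars) ? "matches" : "does not match";
printf "strings compare equal: %s\n", $bytes eq $chars ? "yes" : "no";
```

The two scalars compare equal, yet under the stated assumptions only the
upgraded one is treated as a word character, which is exactly the kind
of surprise the text above warns about.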
1097 | =head2 Unicode in Perl on EBCDIC
|
---|
1098 |
|
---|
1099 | The way Unicode is handled on EBCDIC platforms is still
|
---|
1100 | experimental. On such platforms, references to UTF-8 encoding in this
|
---|
1101 | document and elsewhere should be read as meaning the UTF-EBCDIC
|
---|
1102 | specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
|
---|
1103 | are specifically discussed. There is no C<utfebcdic> pragma or
|
---|
1104 | ":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
|
---|
1105 | the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
|
---|
1106 | for more discussion of the issues.
|
---|
1107 |
|
---|
1108 | =head2 Locales
|
---|
1109 |
|
---|
1110 | Usually locale settings and Unicode do not affect each other, but
|
---|
1111 | there are a couple of exceptions:
|
---|
1112 |
|
---|
1113 | =over 4
|
---|
1114 |
|
---|
1115 | =item *
|
---|
1116 |
|
---|
1117 | You can enable automatic UTF-8-ification of your standard file
|
---|
1118 | handles, default C<open()> layer, and C<@ARGV> by using either
|
---|
1119 | the C<-C> command line switch or the C<PERL_UNICODE> environment
|
---|
1120 | variable, see L<perlrun> for the documentation of the C<-C> switch.
|
---|
1121 |
|
---|
1122 | =item *
|
---|
1123 |
|
---|
1124 | Perl tries really hard to work both with Unicode and the old
|
---|
1125 | byte-oriented world. Most often this is nice, but sometimes Perl's
|
---|
1126 | straddling of the proverbial fence causes problems.
|
---|
1127 |
|
---|
1128 | =back
|
---|
1129 |
|
---|
1130 | =head2 When Unicode Does Not Happen
|
---|
1131 |
|
---|
1132 | While Perl does have extensive ways to input and output in Unicode,
|
---|
1133 | and a few other 'entry points' like @ARGV which can be interpreted
|
---|
1134 | as Unicode (UTF-8), there still are many places where Unicode (in some
|
---|
1135 | encoding or another) could be given as arguments or received as
|
---|
1136 | results, or both, but it is not.
|
---|
1137 |
|
---|
1138 | The following are such interfaces. For all of these interfaces Perl
|
---|
1139 | currently (as of 5.8.3) simply assumes byte strings both as arguments
|
---|
1140 | and results, or UTF-8 strings if the C<encoding> pragma has been used.
|
---|
1141 |
|
---|
1142 | One reason why Perl does not attempt to resolve the role of Unicode in
|
---|
1143 | these cases is that the answers are highly dependent on the operating
|
---|
1144 | system and the file system(s). For example, whether filenames can be
|
---|
1145 | in Unicode, and in exactly what kind of encoding, is not exactly a
|
---|
1146 | portable concept. Similarly for qx and system: how well will the
|
---|
1147 | 'command line interface' (and which of them?) handle Unicode?
|
---|
1148 |
|
---|
1149 | =over 4
|
---|
1150 |
|
---|
1151 | =item *
|
---|
1152 |
|
---|
1153 | chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
|
---|
1154 | rename, rmdir, stat, symlink, truncate, unlink, utime, -X
|
---|
1155 |
|
---|
1156 | =item *
|
---|
1157 |
|
---|
1158 | %ENV
|
---|
1159 |
|
---|
1160 | =item *
|
---|
1161 |
|
---|
1162 | glob (aka the <*>)
|
---|
1163 |
|
---|
1164 | =item *
|
---|
1165 |
|
---|
1166 | open, opendir, sysopen
|
---|
1167 |
|
---|
1168 | =item *
|
---|
1169 |
|
---|
1170 | qx (aka the backtick operator), system
|
---|
1171 |
|
---|
1172 | =item *
|
---|
1173 |
|
---|
1174 | readdir, readlink
|
---|
1175 |
|
---|
1176 | =back
|
---|
1177 |
|
---|
1178 | =head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
|
---|
1179 |
|
---|
1180 | Sometimes (see L</"When Unicode Does Not Happen">) there are
|
---|
1181 | situations where you simply need to force Perl to believe that a byte
|
---|
1182 | string is UTF-8, or vice versa. The low-level calls
|
---|
1183 | utf8::upgrade($bytestring) and utf8::downgrade($utf8string) are
|
---|
1184 | the answers.
|
---|
1185 |
|
---|
1186 | Do not use them without careful thought, though: Perl may easily get
|
---|
1187 | very confused, angry, or even crash, if you suddenly change the 'nature'
|
---|
1188 | of a scalar like that. Be especially careful if you use
|
---|
1189 | utf8::upgrade(): not every byte string is valid UTF-8.
|
---|
1190 |
|
---|
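A short sketch of what the two calls do to a scalar, using
C<utf8::is_utf8()> to peek at the internal flag. The variable names are
illustrative only. The second argument to C<utf8::downgrade()> asks it
to return false on failure instead of dying.

```perl
use strict;
use warnings;

my $str = "caf\xE9";                      # four characters, held as bytes
my $before = utf8::is_utf8($str) ? 1 : 0;

utf8::upgrade($str);                      # change the internal storage
my $after = utf8::is_utf8($str) ? 1 : 0;

printf "flag before: %d, after: %d, length: %d\n",
    $before, $after, length $str;

utf8::downgrade($str);                    # back to one byte per character

# A character above 255 cannot be stored one byte per character.
my $wide = "\x{263A}";
my $ok = utf8::downgrade($wide, 1) ? 1 : 0;
printf "downgrade of U+263A succeeded: %s\n", $ok ? "yes" : "no";
```

The logical string is the same before and after the upgrade; only the
internal representation and the flag change.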
1191 | =head2 Using Unicode in XS
|
---|
1192 |
|
---|
1193 | If you want to handle Perl Unicode in XS extensions, you may find the
|
---|
1194 | following C APIs useful. See also L<perlguts/"Unicode Support"> for an
|
---|
1195 | explanation about Unicode at the XS level, and L<perlapi> for the API
|
---|
1196 | details.
|
---|
1197 |
|
---|
1198 | =over 4
|
---|
1199 |
|
---|
1200 | =item *
|
---|
1201 |
|
---|
1202 | C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
|
---|
1203 | pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8>
|
---|
1204 | flag is on; the bytes pragma is ignored. The C<UTF8> flag being on
|
---|
1205 | does B<not> mean that there are any characters of code points greater
|
---|
1206 | than 255 (or 127) in the scalar or that there are even any characters
|
---|
1207 | in the scalar. What the C<UTF8> flag means is that the sequence of
|
---|
1208 | octets in the representation of the scalar is the sequence of UTF-8
|
---|
1209 | encoded code points of the characters of a string. The C<UTF8> flag
|
---|
1210 | being off means that each octet in this representation encodes a
|
---|
1211 | single character with code point 0..255 within the string. Perl's
|
---|
1212 | Unicode model is not to use UTF-8 until it is absolutely necessary.
|
---|
1213 |
|
---|
1214 | =item *
|
---|
1215 |
|
---|
1216 | C<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into
|
---|
1217 | a buffer encoding the code point as UTF-8, and returns a pointer
|
---|
1218 | pointing after the UTF-8 bytes.
|
---|
1219 |
|
---|
1220 | =item *
|
---|
1221 |
|
---|
1222 | C<utf8_to_uvuni(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
|
---|
1223 | returns the Unicode character code point and, optionally, the length of
|
---|
1224 | the UTF-8 byte sequence.
|
---|
1225 |
|
---|
1226 | =item *
|
---|
1227 |
|
---|
1228 | C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
|
---|
1229 | in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
|
---|
1230 | scalar.
|
---|
1231 |
|
---|
1232 | =item *
|
---|
1233 |
|
---|
1234 | C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
|
---|
1235 | encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if
|
---|
1236 | possible. C<sv_utf8_encode(sv)> is like sv_utf8_upgrade except that
|
---|
1237 | it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the
|
---|
1238 | opposite of C<sv_utf8_encode()>. Note that none of these are to be
|
---|
1239 | used as general-purpose encoding or decoding interfaces: C<use Encode>
|
---|
1240 | for that. C<sv_utf8_upgrade()> is affected by the encoding pragma
|
---|
1241 | but C<sv_utf8_downgrade()> is not (since the encoding pragma is
|
---|
1242 | designed to be a one-way street).
|
---|
1243 |
|
---|
1244 | =item *
|
---|
1245 |
|
---|
1246 | C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
|
---|
1247 | character.
|
---|
1248 |
|
---|
1249 | =item *
|
---|
1250 |
|
---|
1251 | C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
|
---|
1252 | are valid UTF-8.
|
---|
1253 |
|
---|
1254 | =item *
|
---|
1255 |
|
---|
1256 | C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
|
---|
1257 | character in the buffer. C<UNISKIP(chr)> will return the number of bytes
|
---|
1258 | required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()>
|
---|
1259 | is useful for example for iterating over the characters of a UTF-8
|
---|
1260 | encoded buffer; C<UNISKIP()> is useful, for example, in computing
|
---|
1261 | the size required for a UTF-8 encoded buffer.
|
---|
1262 |
|
---|
1263 | =item *
|
---|
1264 |
|
---|
1265 | C<utf8_distance(a, b)> will tell the distance in characters between the
|
---|
1266 | two pointers pointing to the same UTF-8 encoded buffer.
|
---|
1267 |
|
---|
1268 | =item *
|
---|
1269 |
|
---|
1270 | C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
|
---|
1271 | that is C<off> (positive or negative) Unicode characters displaced
|
---|
1272 | from the UTF-8 buffer C<s>. Be careful not to overstep the buffer:
|
---|
1273 | C<utf8_hop()> will merrily run off the end or the beginning of the
|
---|
1274 | buffer if told to do so.
|
---|
1275 |
|
---|
1276 | =item *
|
---|
1277 |
|
---|
1278 | C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
|
---|
1279 | C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
|
---|
1280 | output of Unicode strings and scalars. By default they are useful
|
---|
1281 | only for debugging--they display B<all> characters as hexadecimal code
|
---|
1282 | points--but with the flags C<UNI_DISPLAY_ISPRINT>,
|
---|
1283 | C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
|
---|
1284 | output more readable.
|
---|
1285 |
|
---|
1286 | =item *
|
---|
1287 |
|
---|
1288 | C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
|
---|
1289 | compare two strings case-insensitively in Unicode. For case-sensitive
|
---|
1290 | comparisons you can just use C<memEQ()> and C<memNE()> as usual.
|
---|
1291 |
|
---|
1292 | =back
|
---|
1293 |
|
---|
1294 | For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
|
---|
1295 | in the Perl source code distribution.
|
---|
1296 |
|
---|
1297 | =head1 BUGS
|
---|
1298 |
|
---|
1299 | =head2 Interaction with Locales
|
---|
1300 |
|
---|
1301 | Use of locales with Unicode data may lead to odd results. Currently,
|
---|
1302 | Perl attempts to attach 8-bit locale info to characters in the range
|
---|
1303 | 0..255, but this technique is demonstrably incorrect for locales that
|
---|
1304 | use characters above that range when mapped into Unicode. Perl's
|
---|
1305 | Unicode support will also tend to run slower. Use of locales with
|
---|
1306 | Unicode is discouraged.
|
---|
1307 |
|
---|
1308 | =head2 Interaction with Extensions
|
---|
1309 |
|
---|
1310 | When Perl exchanges data with an extension, the extension should be
|
---|
1311 | able to understand the UTF-8 flag and act accordingly. If the
|
---|
1312 | extension doesn't know about the flag, it's likely that the extension
|
---|
1313 | will return incorrectly-flagged data.
|
---|
1314 |
|
---|
1315 | So if you're working with Unicode data, consult the documentation of
|
---|
1316 | every module you're using to see whether there are any issues with Unicode data
|
---|
1317 | exchange. If the documentation does not talk about Unicode at all,
|
---|
1318 | suspect the worst and probably look at the source to learn how the
|
---|
1319 | module is implemented. Modules written completely in Perl shouldn't
|
---|
1320 | cause problems. Modules that directly or indirectly access code written
|
---|
1321 | in other programming languages are at risk.
|
---|
1322 |
|
---|
1323 | For affected functions, the simple strategy to avoid data corruption is
|
---|
1324 | to always make the encoding of the exchanged data explicit. Choose an
|
---|
1325 | encoding that you know the extension can handle. Convert arguments passed
|
---|
1326 | to the extensions to that encoding and convert results back from that
|
---|
1327 | encoding. Write wrapper functions that do the conversions for you, so
|
---|
1328 | you can later change the functions when the extension catches up.
|
---|
1329 |
|
---|
1330 | To provide an example, let's say the popular Foo::Bar::escape_html
|
---|
1331 | function doesn't deal with Unicode data yet. The wrapper function
|
---|
1332 | would convert the argument to raw UTF-8 and convert the result back to
|
---|
1333 | Perl's internal representation like so:
|
---|
1334 |
|
---|
1335 | sub my_escape_html ($) {
|
---|
1336 | my($what) = shift;
|
---|
1337 | return unless defined $what;
|
---|
1338 | Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what)));
|
---|
1339 | }
|
---|
1340 |
|
---|
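To make the wrapper above runnable on its own, the following sketch
supplies a stand-in for the extension. The C<Foo::Bar::escape_html>
defined here is purely hypothetical: it is a byte-oriented pure-Perl
substitute for whatever the real extension would do.

```perl
use strict;
use warnings;
use Encode ();

{
    package Foo::Bar;

    # Hypothetical stand-in for a byte-oriented extension function.
    sub escape_html {
        my ($octets) = @_;
        $octets =~ s/&/&amp;/g;
        $octets =~ s/</&lt;/g;
        $octets =~ s/>/&gt;/g;
        return $octets;
    }
}

# Encode to UTF-8 octets on the way in, decode on the way out.
sub my_escape_html {
    my ($what) = @_;
    return unless defined $what;
    return Encode::decode_utf8(
        Foo::Bar::escape_html(Encode::encode_utf8($what)));
}

my $escaped = my_escape_html("\x{263A} < \x{2665}");
printf "%d characters, first is U+%04X\n",
    length $escaped, ord $escaped;
```

Because the conversion is confined to the wrapper, it can be dropped
later without touching any of the callers.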
1341 | Sometimes, when the extension does not convert data but just stores
|
---|
1342 | and retrieves them, you will be in a position to use the otherwise
|
---|
1343 | dangerous Encode::_utf8_on() function. Let's say the popular
|
---|
1344 | C<Foo::Bar> extension, written in C, provides a C<param> method that
|
---|
1345 | lets you store and retrieve data according to these prototypes:
|
---|
1346 |
|
---|
1347 | $self->param($name, $value); # set a scalar
|
---|
1348 | $value = $self->param($name); # retrieve a scalar
|
---|
1349 |
|
---|
1350 | If it does not yet provide support for any encoding, one could write a
|
---|
1351 | derived class with such a C<param> method:
|
---|
1352 |
|
---|
1353 | sub param {
|
---|
1354 | my($self,$name,$value) = @_;
|
---|
1355 | utf8::upgrade($name); # make sure it is UTF-8 encoded
|
---|
1356 | if (defined $value) {
|
---|
1357 | utf8::upgrade($value); # make sure it is UTF-8 encoded
|
---|
1358 | return $self->SUPER::param($name,$value);
|
---|
1359 | } else {
|
---|
1360 | my $ret = $self->SUPER::param($name);
|
---|
1361 | Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
|
---|
1362 | return $ret;
|
---|
1363 | }
|
---|
1364 | }
|
---|
1365 |
|
---|
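What C<Encode::_utf8_on()> does to a scalar can be seen in isolation.
This is a sketch for illustration, with made-up variable names: the
function only flips the flag and performs no validation, which is why it
is safe only when the octets are known to be well-formed UTF-8.

```perl
use strict;
use warnings;
use Encode ();

my $octets = "\xE2\x98\xBA";          # UTF-8 octets of U+263A
my $len_before = length $octets;      # counted as bytes

Encode::_utf8_on($octets);            # reinterpret in place, no checking
my $len_after = length $octets;       # counted as characters

printf "length before: %d, after: %d, code point: U+%04X\n",
    $len_before, $len_after, ord $octets;
```

When the origin of the data is less certain, C<Encode::decode_utf8()>
gives the same result for valid input and copes with invalid input
instead of producing a corrupt scalar.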
1366 | Some extensions provide filters on data entry/exit points, such as
|
---|
1367 | DB_File::filter_store_key and family. Look out for such filters in
|
---|
1368 | the documentation of your extensions; they can make the transition to
|
---|
1369 | Unicode data much easier.
|
---|
1370 |
|
---|
1371 | =head2 Speed
|
---|
1372 |
|
---|
1373 | Some functions are slower when working on UTF-8 encoded strings than
|
---|
1374 | on byte encoded strings. All functions that need to hop over
|
---|
1375 | characters such as length(), substr() or index(), or matching regular
|
---|
1376 | expressions can work B<much> faster when the underlying data are
|
---|
1377 | byte-encoded.
|
---|
1378 |
|
---|
1379 | In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
|
---|
1380 | a caching scheme was introduced which will hopefully make the slowness
|
---|
1381 | somewhat less spectacular, at least for some operations. In general,
|
---|
1382 | operations with UTF-8 encoded strings are still slower. As an example,
|
---|
1383 | the Unicode properties (character classes) like C<\p{Nd}> are known to
|
---|
1384 | be quite a bit slower (5-20 times) than their simpler counterparts
|
---|
1385 | like C<\d> (then again, there are 268 Unicode characters matching C<Nd>
|
---|
1386 | compared with the 10 ASCII characters matching C<\d>).
|
---|
1387 |
|
---|
1388 | =head2 Porting code from perl-5.6.X
|
---|
1389 |
|
---|
1390 | Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
|
---|
1391 | was required to use the C<utf8> pragma to declare that a given scope
|
---|
1392 | expected to deal with Unicode data and had to make sure that only
|
---|
1393 | Unicode data were reaching that scope. If you have code that is
|
---|
1394 | working with 5.6, you will need some of the following adjustments to
|
---|
1395 | your code. The examples are written such that the code will continue
|
---|
1396 | to work under 5.6, so you should be safe to try them out.
|
---|
1397 |
|
---|
1398 | =over 4
|
---|
1399 |
|
---|
1400 | =item *
|
---|
1401 |
|
---|
1402 | A filehandle that should read or write UTF-8
|
---|
1403 |
|
---|
1404 | if ($] > 5.007) {
|
---|
1405 | binmode $fh, ":utf8";
|
---|
1406 | }
|
---|
1407 |
|
---|
=item *

A scalar that is going to be passed to some extension

Be it Compress::Zlib, Apache::Request, or any extension that has no
mention of Unicode in its manpage, you need to make sure that the
UTF-8 flag is stripped off. Note that at the time of this writing
(October 2002) the mentioned modules are not UTF-8-aware. Please
check the documentation to verify whether this is still true.

    if ($] > 5.007) {
        require Encode;
        $val = Encode::encode_utf8($val); # make octets
    }

=item *

A scalar we got back from an extension

If you believe the scalar comes back as UTF-8, you will most likely
want the UTF-8 flag restored:

    if ($] > 5.007) {
        require Encode;
        $val = Encode::decode_utf8($val);
    }

=item *

Same thing, if you are really sure it is UTF-8

    if ($] > 5.007) {
        require Encode;
        Encode::_utf8_on($val);
    }

=item *

A wrapper for fetchrow_array and fetchrow_hashref

When the database contains only UTF-8, a wrapper function or method is
a convenient way to replace all your fetchrow_array and
fetchrow_hashref calls. A wrapper function will also make it easier to
adapt to future enhancements in your database driver. Note that at the
time of this writing (October 2002), the DBI has no standardized way
to deal with UTF-8 data. Please check the documentation to verify
whether that is still true.

    sub fetchrow {
        # $what is one of fetchrow_{array,hashref}
        my($self, $sth, $what) = @_;
        if ($] < 5.007) {
            return $sth->$what;
        } else {
            require Encode;
            if (wantarray) {
                # List context: flag every defined column that contains
                # a character outside the 7-bit ASCII range.
                my @arr = $sth->$what;
                for (@arr) {
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_);
                }
                return @arr;
            } else {
                my $ret = $sth->$what;
                if (ref $ret) {
                    # Hash reference: flag the values, not the keys.
                    for my $k (keys %$ret) {
                        defined
                            && /[^\000-\177]/
                            && Encode::_utf8_on($_) for $ret->{$k};
                    }
                    return $ret;
                } else {
                    # Plain scalar result.
                    defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
                    return $ret;
                }
            }
        }
    }

=item *

A large scalar that you know can only contain ASCII

Scalars that contain only ASCII and are marked as UTF-8 can sometimes
slow your program down. If you recognize such a situation, just remove
the UTF-8 flag:

    utf8::downgrade($val) if $] > 5.007;

=back

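The flag handling in the items above can be seen end to end in a short
round trip. The following is only a sketch: the variable names and the
sample string are made up for illustration, and it assumes Perl 5.8.1
or later with the standard C<Encode> module.

    use strict;
    use warnings;
    use Encode ();

    # Hypothetical sample: "caf" followed by U+00E9, as characters.
    my $chars = "caf\x{e9}";

    # Characters to octets, as for an extension that is not
    # UTF-8-aware.
    my $octets = Encode::encode_utf8($chars);
    printf "octets: length %d, flag %s\n",
        length($octets), utf8::is_utf8($octets) ? "on" : "off";

    # Octets back to characters, as for a value returned by an
    # extension.
    my $back = Encode::decode_utf8($octets);
    printf "chars:  length %d, flag %s\n",
        length($back), utf8::is_utf8($back) ? "on" : "off";

    # A pure-ASCII string can always be downgraded; the call
    # returns true on success.
    my $ascii = "plain text";
    utf8::upgrade($ascii);
    print "downgraded\n" if utf8::downgrade($ascii);

The octet form is one element longer than the character form because
U+00E9 takes two bytes in UTF-8, which is why the two forms should not
be mixed when a length or offset is passed to an extension.
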
=head1 SEE ALSO

L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">

=cut