=head1 NAME

perlretut - Perl regular expressions tutorial

=head1 DESCRIPTION

This page provides a basic tutorial on understanding, creating and
using regular expressions in Perl. It serves as a complement to the
reference page on regular expressions L<perlre>. Regular expressions
are an integral part of the C<m//>, C<s///>, C<qr//> and C<split>
operators and so this tutorial also overlaps with
L<perlop/"Regexp Quote-Like Operators"> and L<perlfunc/split>.

Perl is widely renowned for excellence in text processing, and regular
expressions are one of the big factors behind this fame. Perl regular
expressions display an efficiency and flexibility unknown in most
other computer languages. Mastering even the basics of regular
expressions will allow you to manipulate text with surprising ease.

What is a regular expression? A regular expression is simply a string
that describes a pattern. Patterns are in common use these days;
examples are the patterns typed into a search engine to find web pages
and the patterns used to list files in a directory, e.g., C<ls *.txt>
or C<dir *.*>. In Perl, the patterns described by regular expressions
are used to search strings, extract desired parts of strings, and to
do search and replace operations.

Regular expressions have the undeserved reputation of being abstract
and difficult to understand. Regular expressions are constructed using
simple concepts like conditionals and loops and are no more difficult
to understand than the corresponding C<if> conditionals and C<while>
loops in the Perl language itself. In fact, the main challenge in
learning regular expressions is just getting used to the terse
notation used to express these concepts.

This tutorial flattens the learning curve by discussing regular
expression concepts, along with their notation, one at a time and with
many examples. The first part of the tutorial will progress from the
simplest word searches to the basic regular expression concepts. If
you master the first part, you will have all the tools needed to solve
about 98% of your needs. The second part of the tutorial is for those
comfortable with the basics and hungry for more power tools. It
discusses the more advanced regular expression operators and
introduces the latest cutting edge innovations in 5.6.0.

A note: to save time, 'regular expression' is often abbreviated as
regexp or regex. Regexp is a more natural abbreviation than regex, but
is harder to pronounce. The Perl pod documentation is evenly split on
regexp vs regex; in Perl, there is more than one way to abbreviate it.
We'll use regexp in this tutorial.

=head1 Part 1: The basics

=head2 Simple word matching

The simplest regexp is simply a word, or more generally, a string of
characters. A regexp consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

What is this perl statement all about? C<"Hello World"> is a simple
double quoted string. C<World> is the regular expression and the
C<//> enclosing C</World/> tells perl to search a string for a match.
The operator C<=~> associates the string with the regexp match and
produces a true value if the regexp matched, or false if the regexp
did not match. In our case, C<World> matches the second word in
C<"Hello World">, so the expression is true. Expressions like this
are useful in conditionals:

    if ("Hello World" =~ /World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

There are useful variations on this theme. The sense of the match can
be reversed by using the C<!~> operator:

    if ("Hello World" !~ /World/) {
        print "It doesn't match\n";
    }
    else {
        print "It matches\n";
    }

The literal string in the regexp can be replaced by a variable:

    $greeting = "World";
    if ("Hello World" =~ /$greeting/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

If you're matching against the special default variable C<$_>, the
C<$_ =~> part can be omitted:

    $_ = "Hello World";
    if (/World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

And finally, the C<//> default delimiters for a match can be changed
to arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;      # matches, delimited by '!'
    "Hello World" =~ m{World};      # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl";    # matches after '/usr/bin',
                                    # '/' becomes an ordinary char

C</World/>, C<m!World!>, and C<m{World}> all represent the
same thing. When, e.g., C<""> is used as a delimiter, the forward
slash C<'/'> becomes an ordinary character and can be used in a regexp
without trouble.

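As a quick check of the points so far, here is a minimal sketch (the
variable names C<$string>, C<@results> and C<$negated> are illustrative
only) that records the outcome of the same match written with three
different delimiters, along with the outcome of C<!~>:

```perl
use strict;
use warnings;

my $string = "Hello World";

# The same regexp written with three different delimiters.
my @results = (
    ($string =~ /World/)  ? 1 : 0,
    ($string =~ m!World!) ? 1 : 0,
    ($string =~ m{World}) ? 1 : 0,
);

# !~ is true exactly when =~ is false.
my $negated = ($string !~ /World/) ? 1 : 0;

print "delimiter results: @results\n";
print "negated match: $negated\n";
```

All three delimiter forms should report a match, and the C<!~> test
should report false for this string.
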
Let's consider how different regexps would match C<"Hello World">:

    "Hello World" =~ /world/;   # doesn't match
    "Hello World" =~ /o W/;     # matches
    "Hello World" =~ /oW/;      # doesn't match
    "Hello World" =~ /World /;  # doesn't match

The first regexp C<world> doesn't match because regexps are
case-sensitive. The second regexp matches because the substring
S<C<'o W'> > occurs in the string S<C<"Hello World"> >. The space
character ' ' is treated like any other character in a regexp and is
needed to match in this case. The lack of a space character is the
reason the third regexp C<'oW'> doesn't match. The fourth regexp
C<'World '> doesn't match because there is a space at the end of the
regexp, but not at the end of the string. The lesson here is that
regexps must match a part of the string I<exactly> in order for the
statement to be true.

If a regexp matches in more than one place in the string, perl will
always match at the earliest possible point in the string:

    "Hello World" =~ /o/;        # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/;  # matches 'hat' in 'That'

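One way to see the "earliest possible point" rule at work is to ask
perl where the match started. The sketch below assumes the special
array C<@-> (documented in L<perlvar>), whose first element holds the
start offset of the last successful match; the helper name
C<match_offset> is made up for this illustration:

```perl
use strict;
use warnings;

# Report the offset at which a regexp first matches, or -1 if it fails.
# $-[0] holds the start offset of the last successful match.
sub match_offset {
    my ($string, $regexp) = @_;
    return $string =~ /$regexp/ ? $-[0] : -1;
}

my $o_offset   = match_offset("Hello World", 'o');        # 'o' in 'Hello'
my $hat_offset = match_offset("That hat is red", 'hat');  # 'hat' in 'That'
my $no_offset  = match_offset("Hello World", 'world');    # case mismatch

print "$o_offset $hat_offset $no_offset\n";   # prints "4 1 -1"
```

Offsets count from zero, so C<4> is the C<'o'> that ends C<'Hello'> and
C<1> is the C<'hat'> inside C<'That'>.
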
With respect to character matching, there are a few more points you
need to know about. First of all, not all characters can be used 'as
is' in a match. Some characters, called B<metacharacters>, are reserved
for use in regexp notation. The metacharacters are

    {}[]()^$.|*+?\

The significance of each of these will be explained
in the rest of the tutorial, but for now, it is important only to know
that a metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    "The interval is [0,1)." =~ /[0,1)./     # is a syntax error!
    "The interval is [0,1)." =~ /\[0,1\)\./  # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;   # matches

In the last regexp, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regexp. This can lead to LTS
(leaning toothpick syndrome), however, and it is often more readable
to change delimiters.

    "/usr/bin/perl" =~ m!/usr/bin/perl!;    # easier to read

The backslash character C<'\'> is a metacharacter itself and needs to
be backslashed:

    'C:\WIN32' =~ /C:\\WIN/;   # matches

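To see why the unescaped C</2+2/> fails on C<"2+2=4">, it helps to
know what it does match. As a metacharacter, C<+> means "one or more of
the preceding item" (quantifiers are treated properly later in the
tutorial), so C</2+2/> looks for at least two consecutive C<2>s. The
following sketch uses made-up variable names and an extra test string,
C<"1220">, chosen only for illustration:

```perl
use strict;
use warnings;

# Unescaped, '+' is a quantifier: /2+2/ means "one or more 2s, then a 2".
my $plain_on_sum   = ("2+2=4" =~ /2+2/)  ? 1 : 0;   # no "22" in the string
my $plain_on_twos  = ("1220"  =~ /2+2/)  ? 1 : 0;   # "22" is present

# Escaped, '\+' matches a literal plus sign.
my $escaped_on_sum = ("2+2=4" =~ /2\+2/) ? 1 : 0;

print "$plain_on_sum $plain_on_twos $escaped_on_sum\n";   # prints "0 1 1"
```
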
In addition to the metacharacters, there are some ASCII characters
which don't have printable character equivalents and are instead
represented by B<escape sequences>. Common examples are C<\t> for a
tab, C<\n> for a newline, C<\r> for a carriage return and C<\a> for a
bell. If your string is better thought of as a sequence of arbitrary
bytes, the octal escape sequence, e.g., C<\033>, or hexadecimal escape
sequence, e.g., C<\x1B> may be a more natural representation for your
bytes. Here are some examples of escapes:

    "1000\t2000" =~ m(0\t2)     # matches
    "1000\n2000" =~ /0\n20/     # matches
    "1000\t2000" =~ /\000\t2/   # doesn't match, "0" ne "\000"
    "cat" =~ /\143\x61\x74/     # matches, but a weird way to spell cat

If you've been around Perl a while, all this talk of escape sequences
may seem familiar. Similar escape sequences are used in double-quoted
strings and in fact the regexps in Perl are mostly treated as
double-quoted strings. This means that variables can be used in
regexps as well. Just like double-quoted strings, the values of the
variables in the regexp will be substituted in before the regexp is
evaluated for matching purposes. So we have:

    $foo = 'house';
    'housecat' =~ /$foo/;       # matches
    'cathouse' =~ /cat$foo/;    # matches
    'housecat' =~ /${foo}cat/;  # matches

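One consequence of interpolation is worth a short sketch: because the
variable's value is substituted in I<before> the regexp is evaluated,
any metacharacters it contains keep their special meaning. The
variables C<$dot>, C<$braced> and C<$wild> below are illustrative names,
not part of the tutorial:

```perl
use strict;
use warnings;

my $foo = 'house';

# Without braces, perl would look for a variable named $foocat.
my $braced = ('housecat' =~ /${foo}cat/) ? 1 : 0;

# The interpolated text is used as regexp source, so the '.' inside
# $dot still acts as a metacharacter and matches the 'X'.
my $dot  = 'h.use';
my $wild = ('hXuse' =~ /$dot/) ? 1 : 0;

print "$braced $wild\n";   # prints "1 1"
```
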
So far, so good. With the knowledge above you can already perform
searches with just about any literal string regexp you can dream up.
Here is a I<very simple> emulation of the Unix grep program:

    % cat > simple_grep
    #!/usr/bin/perl
    $regexp = shift;
    while (<>) {
        print if /$regexp/;
    }
    ^D

    % chmod +x simple_grep

    % simple_grep abba /usr/dict/words
    Babbage
    cabbage
    cabbages
    sabbath
    Sabbathize
    Sabbathizes
    sabbatical
    scabbard
    scabbards

This program is easy to understand. C<#!/usr/bin/perl> is the standard
way to invoke a perl program from the shell.
S<C<$regexp = shift;> > saves the first command line argument as the
regexp to be used, leaving the rest of the command line arguments to
be treated as files. S<C<< while (<>) >> > loops over all the lines in
all the files. For each line, S<C<print if /$regexp/;> > prints the
line if the regexp matches the line. In this line, both C<print> and
C</$regexp/> use the default variable C<$_> implicitly.

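The same idea can be tried without a dictionary file. The sketch below
is an in-memory variant, not the program above: it assumes a small
made-up word list and uses Perl's built-in C<grep> function, which,
like the C<while> loop, sets C<$_> to each element in turn:

```perl
use strict;
use warnings;

# Hypothetical word list standing in for a dictionary file.
my @words = qw(Babbage cabbage sabbath rabbit abbey scabbard);

my $regexp = 'abba';

# Keep the words the regexp matches; /$regexp/ is tested against $_
# implicitly, just as in simple_grep.
my @hits = grep { /$regexp/ } @words;

print "$_\n" for @hits;
```

With this list, C<rabbit> and C<abbey> are filtered out because neither
contains the four characters C<abba> in sequence.
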
With all of the regexps above, if the regexp matched anywhere in the
string, it was considered a match. Sometimes, however, we'd like to
specify I<where> in the string the regexp should try to match. To do
this, we would use the B<anchor> metacharacters C<^> and C<$>. The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string. Here is how they are used:

    "housekeeper" =~ /keeper/;     # matches
    "housekeeper" =~ /^keeper/;    # doesn't match
    "housekeeper" =~ /keeper$/;    # matches
    "housekeeper\n" =~ /keeper$/;  # matches

The second regexp doesn't match because C<^> constrains C<keeper> to
match only at the beginning of the string, but C<"housekeeper"> has
keeper starting in the middle. The third regexp does match, since the
C<$> constrains C<keeper> to match only at the end of the string.

When both C<^> and C<$> are used at the same time, the regexp has to
match both the beginning and the end of the string, i.e., the regexp
matches the whole string. Consider

    "keeper" =~ /^keep$/;    # doesn't match
    "keeper" =~ /^keeper$/;  # matches
    "" =~ /^$/;              # ^$ matches an empty string

The first regexp doesn't match because the string has more to it than
C<keep>. Since the second regexp is exactly the string, it
matches. Using both C<^> and C<$> in a regexp forces the complete
string to match, so it gives you complete control over which strings
match and which don't. Suppose you are looking for a fellow named
bert, off in a string by himself:

    "dogbert" =~ /bert/;    # matches, but not what you want

    "dilbert" =~ /^bert/;   # doesn't match, but ..
    "bertram" =~ /^bert/;   # matches, so still not good enough

    "bertram" =~ /^bert$/;  # doesn't match, good
    "dilbert" =~ /^bert$/;  # doesn't match, good
    "bert" =~ /^bert$/;     # matches, perfect

Of course, in the case of a literal string, one could just as easily
use the string equivalence S<C<$string eq 'bert'> > and it would be
more efficient. The C<^...$> regexp really becomes useful when we
add in the more powerful regexp tools below.

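The regexp C</^bert$/> and the test C<eq 'bert'> are close but not
identical, because C<$> also matches before a newline at the end of
the string. A minimal sketch, using a made-up list of names, makes the
difference visible:

```perl
use strict;
use warnings;

my @names = ("dogbert", "bertram", "bert", "bert\n", "dilbert");

# ^bert$ accepts 'bert', optionally followed by one trailing newline.
my @exact = grep { /^bert$/ } @names;

# eq is stricter: the trailing newline makes the strings differ.
my @equal = grep { $_ eq 'bert' } @names;

printf "%d matched by regexp, %d by eq\n", scalar @exact, scalar @equal;
```

Here the regexp accepts both C<"bert"> and C<"bert\n">, while C<eq>
accepts only the first. This matters when lines are read from a file
without removing their newlines.
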
=head2 Using character classes

Although one can already do quite a lot with the literal string
regexps above, we've only scratched the surface of regular expression
technology. In this and subsequent sections we will introduce regexp
concepts (and associated metacharacter notations) that will allow a
regexp to not just represent a single character sequence, but a I<whole
class> of them.

One such concept is that of a B<character class>. A character class
allows a set of possible characters, rather than just a single
character, to match at a particular point in a regexp. Character
classes are denoted by brackets C<[...]>, with the set of characters
to be possibly matched inside. Here are some examples:

    /cat/;               # matches 'cat'
    /[bcr]at/;           # matches 'bat', 'cat', or 'rat'
    /item[0123456789]/;  # matches 'item0' or ... or 'item9'
    "abc" =~ /[cab]/;    # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, C<'a'> matches because the first character position in the
string is the earliest point at which the regexp can match.

    /[yY][eE][sS]/;  # match 'yes' in a case-insensitive way
                     # 'yes', 'Yes', 'YES', etc.

This regexp displays a common task: perform a case-insensitive
match. Perl provides a way of avoiding all those brackets by simply
appending an C<'i'> to the end of the match. Then C</[yY][eE][sS]/;>
can be rewritten as C</yes/i;>. The C<'i'> stands for
case-insensitive and is an example of a B<modifier> of the matching
operation. We will meet other modifiers later in the tutorial.

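The equivalence of the bracketed form and the C<//i> form can be
checked on a handful of inputs. The list of answers below is made up
for the purpose of this sketch:

```perl
use strict;
use warnings;

my @answers = qw(yes Yes YES yEs no);

# Both forms should accept exactly the same answers.
my @by_class    = grep { /[yY][eE][sS]/ } @answers;
my @by_modifier = grep { /yes/i } @answers;

print "class:    @by_class\n";
print "modifier: @by_modifier\n";
```

Both lines should list C<yes Yes YES yEs>, with C<no> rejected.
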
We saw in the section above that there were ordinary characters, which
represented themselves, and special characters, which needed a
backslash C<\> to represent themselves. The same is true in a
character class, but the sets of ordinary and special characters
inside a character class are different from those outside a character
class. The special characters for a character class are C<-]\^$>. C<]>
is special because it denotes the end of a character class. C<$> is
special because it denotes a scalar variable. C<\> is special because
it is used in escape sequences, just like above. Here is how the
special characters C<]$\> are handled:

    /[\]c]def/;  # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;    # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;   # matches '$at' or 'xat'
    /[\\$x]at/;  # matches '\at', 'bat', 'cat', or 'rat'

The last two are a little tricky. In C<[\$x]>, the backslash protects
the dollar sign, so the character class has two members C<$> and C<x>.
In C<[\\$x]>, the backslash is protected, so C<$x> is treated as a
variable and substituted in double quote fashion.

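Since these three classes are easy to confuse, here is a small sketch
that runs each one over the same candidates. The candidate list is
invented for illustration, and the regexps are anchored with C<^> and
C<$> so that each string must match in full:

```perl
use strict;
use warnings;

my $x = 'bcr';
my @candidates = ('bat', '$at', 'xat', '\at');

my @interpolated = grep { /^[$x]at$/ }   @candidates;  # class is b, c, r
my @literal      = grep { /^[\$x]at$/ }  @candidates;  # class is $, x
my @with_slash   = grep { /^[\\$x]at$/ } @candidates;  # class is \, b, c, r

print "interpolated: @interpolated\n";
print "literal:      @literal\n";
print "with slash:   @with_slash\n";
```

Only C<bat> passes the first test, C<$at> and C<xat> pass the second,
and C<bat> and C<\at> pass the third.
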
The special character C<'-'> acts as a range operator within character
classes, so that a contiguous set of characters can be written as a
range. With ranges, the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>. Some examples are

    /item[0-9]/;     # matches 'item0' or ... or 'item9'
    /[0-9bx-z]aa/;   # matches '0aa', ..., '9aa',
                     # 'baa', 'xaa', 'yaa', or 'zaa'
    /[0-9a-fA-F]/;   # matches a hexadecimal digit
    /[0-9a-zA-Z_]/;  # matches a "word" character,
                     # like those in a perl variable name

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character; C<[-ab]>, C<[ab-]> and C<[a\-b]> are
all equivalent.

The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the brackets. Both C<[...]> and C<[^...]> must match a
character, or the match fails. Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

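Ranges and negation combine naturally. As a sketch, the hypothetical
helper C<all_hex> below decides whether a string consists only of
hexadecimal digits by checking that no character falls I<outside> the
hex class; this uses only the notation introduced so far:

```perl
use strict;
use warnings;

# A non-empty string is all-hex if no character lies outside the class.
sub all_hex {
    my ($s) = @_;
    return ($s ne '' && $s !~ /[^0-9a-fA-F]/) ? 1 : 0;
}

my @checks = map { all_hex($_) } ('1aF', 'beef', 'xyz', '12-4');

print "@checks\n";   # prints "1 1 0 0"
```

The string C<'12-4'> is rejected because C<'-'> is not in the class;
inside C<[0-9a-fA-F]> each hyphen is a range operator, not a member.
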
Now, even C<[0-9]> can be a bother to write multiple times, so in the
interest of saving keystrokes and making regexps more readable, Perl
has several abbreviations for common character classes:

=over 4

=item *

\d is a digit and represents [0-9]

=item *

\s is a whitespace character and represents [\ \t\r\n\f]

=item *

\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]

=item *

\D is a negated \d; it represents any character but a digit [^0-9]

=item *

\S is a negated \s; it represents any non-whitespace character [^\s]

=item *

\W is a negated \w; it represents any non-word character [^\w]

=item *

The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes. Here are some in use:

    /\d\d:\d\d:\d\d/;  # matches a hh:mm:ss time format
    /[\d\s]/;          # matches any digit or whitespace character
    /\w\W\w/;          # matches a word char, followed by a
                       # non-word char, followed by a word char
    /..rt/;            # matches any two chars, followed by 'rt'
    /end\./;           # matches 'end.'
    /end[.]/;          # same thing, matches 'end.'

Because a period is a metacharacter, it needs to be escaped to match
as an ordinary period. Because, for example, C<\d> and C<\w> are sets
of characters, it is incorrect to think of C<[^\d\w]> as C<[\D\W]>; in
fact C<[^\d\w]> is the same as C<[^\w]>, which is the same as
C<[\W]>. Think DeMorgan's laws.

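The DeMorgan point can be verified by brute force. The sketch below
tests every ASCII character (codes 0 to 127, a range chosen here for
simplicity) against the three classes and compares the results; the
array names are illustrative:

```perl
use strict;
use warnings;

# Collect the character codes each class matches, over ASCII only.
my (@neg_union, @not_word, @union_of_neg);
for my $code (0 .. 127) {
    my $ch = chr $code;
    push @neg_union,    $code if $ch =~ /[^\d\w]/;
    push @not_word,     $code if $ch =~ /[\W]/;
    push @union_of_neg, $code if $ch =~ /[\D\W]/;
}

my $same_as_W = ("@neg_union" eq "@not_word") ? 1 : 0;

print "[^\\d\\w] equals [\\W]: $same_as_W\n";
print "chars matched by [\\W]:    ", scalar(@not_word), "\n";
print "chars matched by [\\D\\W]: ", scalar(@union_of_neg), "\n";
```

C<[^\d\w]> and C<[\W]> match the same 65 characters, whereas
C<[\D\W]> matches 118: everything except the ten digits, since every
letter and underscore is a non-digit.
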
An anchor useful in basic regexps is the S<B<word anchor> >
C<\b>. This matches a boundary between a word character and a non-word
character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /cat/;      # matches cat in 'housecat'
    $x =~ /\bcat/;    # matches cat in 'catenates'
    $x =~ /cat\b/;    # matches cat in 'housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

Note in the last example, the end of the string is considered a word
boundary.

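To confirm which C<cat> each regexp picks, the following sketch
records the start offset of each match. It assumes the C<@-> array from
L<perlvar> (C<$-[0]> is the start of the last successful match) and
precompiles the patterns with C<qr//>; the variable names are
illustrative:

```perl
use strict;
use warnings;

my $x = "Housecat catenates house and cat";

my @patterns = (qr/cat/, qr/\bcat/, qr/cat\b/, qr/\bcat\b/);

# Start offset of the first match for each pattern, or -1 on failure.
my @offsets = map { ($x =~ $_) ? $-[0] : -1 } @patterns;

print "@offsets\n";   # prints "5 9 5 29"
```

Offset 5 is the C<cat> in C<Housecat>, offset 9 is the start of
C<catenates>, and offset 29 is the final word of the string.
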
You might wonder why C<'.'> matches everything but C<"\n"> - why not
every character? The reason is that often one is matching against
lines and would like to ignore the newline characters. For instance,
while the string C<"\n"> represents one line, we would like to think
of it as empty. Then

    "" =~ /^$/;      # matches
    "\n" =~ /^$/;    # matches, "\n" is ignored

    "" =~ /./;       # doesn't match; it needs a char
    "" =~ /^.$/;     # doesn't match; it needs a char
    "\n" =~ /^.$/;   # doesn't match; it needs a char other than "\n"
    "a" =~ /^.$/;    # matches
    "a\n" =~ /^.$/;  # matches, ignores the "\n"

This behavior is convenient, because we usually want to ignore
newlines when we count and match characters in a line. Sometimes,
however, we want to keep track of newlines. We might even want C<^>
and C<$> to anchor at the beginning and end of lines within the
string, rather than just the beginning and end of the string. Perl
allows us to choose between ignoring and paying attention to newlines
by using the C<//s> and C<//m> modifiers. C<//s> and C<//m> stand for
single line and multi-line and they determine whether a string is to
be treated as one continuous string, or as a set of lines. The two
modifiers affect two aspects of how the regexp is interpreted: 1) how
the C<'.'> character class is defined, and 2) where the anchors C<^>
and C<$> are able to match. Here are the four possible combinations:

=over 4

=item *

no modifiers (//): Default behavior. C<'.'> matches any character
except C<"\n">. C<^> matches only at the beginning of the string and
C<$> matches only at the end or before a newline at the end.

=item *

s modifier (//s): Treat string as a single long line. C<'.'> matches
any character, even C<"\n">. C<^> matches only at the beginning of
the string and C<$> matches only at the end or before a newline at the
end.

=item *

m modifier (//m): Treat string as a set of multiple lines. C<'.'>
matches any character except C<"\n">. C<^> and C<$> are able to match
at the start or end of I<any> line within the string.

=item *

both s and m modifiers (//sm): Treat string as a single long line, but
detect multiple lines. C<'.'> matches any character, even
C<"\n">. C<^> and C<$>, however, are able to match at the start or end
of I<any> line within the string.

=back

Here are examples of C<//s> and C<//m> in action:

    $x = "There once was a girl\nWho programmed in Perl\n";

    $x =~ /^Who/;    # doesn't match, "Who" not at start of string
    $x =~ /^Who/s;   # doesn't match, "Who" not at start of string
    $x =~ /^Who/m;   # matches, "Who" at start of second line
    $x =~ /^Who/sm;  # matches, "Who" at start of second line

    $x =~ /girl.Who/;    # doesn't match, "." doesn't match "\n"
    $x =~ /girl.Who/s;   # matches, "." matches "\n"
    $x =~ /girl.Who/m;   # doesn't match, "." doesn't match "\n"
    $x =~ /girl.Who/sm;  # matches, "." matches "\n"

Most of the time, the default behavior is what is wanted, but C<//s> and
C<//m> are occasionally very useful. If C<//m> is being used, the start
of the string can still be matched with C<\A> and the end of string
can still be matched with the anchors C<\Z> (matches both the end and
the newline before, like C<$>), and C<\z> (matches only the end):

    $x =~ /^Who/m;    # matches, "Who" at start of second line
    $x =~ /\AWho/m;   # doesn't match, "Who" is not at start of string

    $x =~ /girl$/m;   # matches, "girl" at end of first line
    $x =~ /girl\Z/m;  # doesn't match, "girl" is not at end of string

    $x =~ /Perl\Z/m;  # matches, "Perl" is at newline before end
    $x =~ /Perl\z/m;  # doesn't match, "Perl" is not at end of string

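The four modifier combinations can be laid out as a small truth table.
This sketch reuses the limerick string from the examples; the
C<%result> hash and its key names are made up for the illustration:

```perl
use strict;
use warnings;

my $x = "There once was a girl\nWho programmed in Perl\n";

# For each modifier combination: does ^Who match? does girl.Who match?
my %result = (
    'none' => [ ($x =~ /^Who/)   ? 1 : 0, ($x =~ /girl.Who/)   ? 1 : 0 ],
    's'    => [ ($x =~ /^Who/s)  ? 1 : 0, ($x =~ /girl.Who/s)  ? 1 : 0 ],
    'm'    => [ ($x =~ /^Who/m)  ? 1 : 0, ($x =~ /girl.Who/m)  ? 1 : 0 ],
    'sm'   => [ ($x =~ /^Who/sm) ? 1 : 0, ($x =~ /girl.Who/sm) ? 1 : 0 ],
);

for my $mods (qw(none s m sm)) {
    printf "%-4s  ^Who: %d  girl.Who: %d\n", $mods, @{ $result{$mods} };
}
```

The table shows that the two modifiers are independent: C<//m> alone
changes only the C<^Who> column, and C<//s> alone changes only the
C<girl.Who> column.
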
We now know how to create choices among classes of characters in a
regexp. What about choices among words or character strings? Such
choices are described in the next section.

=head2 Matching this or that

Sometimes we would like our regexp to be able to match different
possible words or character strings. This is accomplished by using
the B<alternation> metacharacter C<|>. To match C<dog> or C<cat>, we
form the regexp C<dog|cat>. As before, perl will try to match the
regexp at the earliest possible point in the string. At each
character position, perl will first try to match the first
alternative, C<dog>. If C<dog> doesn't match, perl will then try the
next alternative, C<cat>. If C<cat> doesn't match either, then the
match fails and perl moves to the next position in the string. Some
examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regexp,
C<cat> is able to match earlier in the string.

    "cats" =~ /c|ca|cat|cats/;  # matches "c"
    "cats" =~ /cats|cat|ca|c/;  # matches "cats"

Here, all the alternatives match at the first string position, so the
first alternative is the one that matches. If some of the
alternatives are truncations of the others, put the longest ones first
to give them a chance to match.

    "cab" =~ /a|b|c/  # matches "c"
                      # /a|b|c/ == /[abc]/

The last example points out that character classes are like
alternations of characters. At a given character position, the first
alternative that allows the regexp match to succeed will be the one
that matches.

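To see which alternative actually matched, one can inspect the matched
text. The sketch below assumes the special variable C<$&> (documented
in L<perlvar>), which holds the text of the last successful match; the
helper name C<matched_text> is hypothetical:

```perl
use strict;
use warnings;

# Return the text matched by the regexp, or undef if it fails.
# $& holds the text of the last successful match.
sub matched_text {
    my ($string, $regexp) = @_;
    return $string =~ $regexp ? $& : undef;
}

my $first = matched_text("cats and dogs", qr/dog|cat|bird/);  # earliest position wins
my $short = matched_text("cats", qr/c|ca|cat|cats/);          # first alternative wins
my $long  = matched_text("cats", qr/cats|cat|ca|c/);          # longest listed first

print "$first $short $long\n";   # prints "cat c cats"
```
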
=head2 Grouping things and hierarchical matching

Alternation allows a regexp to choose among alternatives, but by
itself it is unsatisfying. The reason is that each alternative is a whole
regexp, but sometimes we want alternatives for just part of a
regexp. For instance, suppose we want to search for housecats or
housekeepers. The regexp C<housecat|housekeeper> fits the bill, but is
inefficient because we had to type C<house> twice. It would be nice to
have parts of the regexp be constant, like C<house>, and some
parts have alternatives, like C<cat|keeper>.

The B<grouping> metacharacters C<()> solve this problem. Grouping
allows parts of a regexp to be treated as a single unit. Parts of a
regexp are grouped by enclosing them in parentheses. Thus we could solve
the C<housecat|housekeeper> problem by forming the regexp as
C<house(cat|keeper)>. The regexp C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>. Some more examples
are

    /(a|b)b/;     # matches 'ab' or 'bb'
    /(ac|b)b/;    # matches 'acb' or 'bb'
    /(^a|b)c/;    # matches 'ac' at start of string or 'bc' anywhere
    /(a|[bc])d/;  # matches 'ad', 'bd', or 'cd'

    /house(cat|)/;      # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'. Note groups can be nested.

    /(19|20|)\d\d/;          # match years 19xx, 20xx, or the Y2K problem, xx
    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

Alternations behave the same way in groups as out of them: at a given
string position, the leftmost alternative that allows the regexp to
match is taken. So in the last example at the first string position,
C<"20"> matches the second alternative, but there is nothing left over
to match the next two digits C<\d\d>. So perl moves on to the next
alternative, which is the null alternative and that works, since
C<"20"> is two digits.

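The choice of alternative in the year example can be observed directly.
This sketch assumes two special variables from L<perlvar>: C<$&>, the
whole matched text, and C<$1>, the text matched by the first group.
The helper name C<year_parts> is invented for the illustration:

```perl
use strict;
use warnings;

# Return [whole match, contents of the first group], or undef on failure.
sub year_parts {
    my ($string) = @_;
    return $string =~ /(19|20|)\d\d/ ? [ $&, $1 ] : undef;
}

my $full  = year_parts("1999");  # group takes the '19' alternative
my $short = year_parts("20");    # group falls back to the null alternative

printf "'%s' '%s'\n", @$full;    # prints "'1999' '19'"
printf "'%s' '%s'\n", @$short;   # prints "'20' ''"
```

For C<"20">, the group ends up empty: the C<20> alternative is tried
first but leaves nothing for C<\d\d>, so perl settles on the null
alternative.
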
The process of trying one alternative, seeing if it matches, and
moving on to the next alternative if it doesn't, is called
B<backtracking>. The term 'backtracking' comes from the idea that
matching a regexp is like a walk in the woods. Successfully matching
a regexp is like arriving at a destination. There are many possible
trailheads, one for each string position, and each one is tried in
order, left to right. From each trailhead there may be many paths,
some of which get you there, and some which are dead ends. When you
walk along a trail and hit a dead end, you have to backtrack along the
trail to an earlier point to try another trail. If you hit your
destination, you stop immediately and forget about trying all the
other trails. You are persistent, and only if you have tried all the
trails from all the trailheads and not arrived at your destination, do
you declare failure. To be concrete, here is a step-by-step analysis
of what perl does when it tries to match the regexp

    "abcde" =~ /(abd|abc)(df|d|de)/;

615 | =over 4
|
---|
616 |
|
---|
617 | =item 0
|
---|
618 |
|
---|
619 | Start with the first letter in the string 'a'.
|
---|
620 |
|
---|
621 | =item 1
|
---|
622 |
|
---|
623 | Try the first alternative in the first group 'abd'.
|
---|
624 |
|
---|
625 | =item 2
|
---|
626 |
|
---|
627 | Match 'a' followed by 'b'. So far so good.
|
---|
628 |
|
---|
629 | =item 3
|
---|
630 |
|
---|
631 | 'd' in the regexp doesn't match 'c' in the string - a dead
|
---|
632 | end. So backtrack two characters and pick the second alternative in
|
---|
633 | the first group 'abc'.
|
---|
634 |
|
---|
635 | =item 4
|
---|
636 |
|
---|
637 | Match 'a' followed by 'b' followed by 'c'. We are on a roll
|
---|
638 | and have satisfied the first group. Set $1 to 'abc'.
|
---|
639 |
|
---|
640 | =item 5
|
---|
641 |
|
---|
642 | Move on to the second group and pick the first alternative
|
---|
643 | 'df'.
|
---|
644 |
|
---|
645 | =item 6
|
---|
646 |
|
---|
647 | Match the 'd'.
|
---|
648 |
|
---|
649 | =item 7
|
---|
650 |
|
---|
651 | 'f' in the regexp doesn't match 'e' in the string, so a dead
|
---|
652 | end. Backtrack one character and pick the second alternative in the
|
---|
653 | second group 'd'.
|
---|
654 |
|
---|
655 | =item 8
|
---|
656 |
|
---|
657 | 'd' matches. The second grouping is satisfied, so set $2 to
|
---|
658 | 'd'.
|
---|
659 |
|
---|
660 | =item 9
|
---|
661 |
|
---|
662 | We are at the end of the regexp, so we are done! We have
|
---|
663 | matched 'abcd' out of the string "abcde".
|
---|
664 |
|
---|
665 | =back
|
---|
666 |
|
---|
667 | There are a couple of things to note about this analysis. First, the
|
---|
668 | third alternative in the second group 'de' also allows a match, but we
|
---|
669 | stopped before we got to it - at a given character position, leftmost
|
---|
670 | wins. Second, we were able to get a match at the first character
|
---|
671 | position of the string 'a'. If there were no matches at the first
|
---|
672 | position, perl would move to the second character position 'b' and
|
---|
673 | attempt the match all over again. Only when all possible paths at all
|
---|
674 | possible character positions have been exhausted does perl give
|
---|
675 | up and declare S<C<$string =~ /(abd|abc)(df|d|de)/;> > to be false.
|
---|
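
The walkthrough above can be checked directly. The following is only a small sketch (the variable names C<$first>, C<$second> and C<$century> are ours, chosen for illustration); it prints the groups that the two backtracking examples end up with:

```perl
use strict;
use warnings;

# At each string position the leftmost alternative that lets the
# whole regexp succeed is the one that is kept.
my ($first, $second);
if ("abcde" =~ /(abd|abc)(df|d|de)/) {
    ($first, $second) = ($1, $2);
    printf "group 1: '%s', group 2: '%s'\n", $first, $second;
}

# The null alternative is only used after '20' leads to a dead end.
my ($century) = "20" =~ /(19|20|)\d\d/;
printf "century prefix: '%s'\n", $century;
```

The first match leaves 'abc' and 'd' in the two groups, so 'de' is never tried. In the second match the group captures the empty string, which is defined but has length zero.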
676 |
|
---|
677 | Even with all this work, regexp matching happens remarkably fast. To
|
---|
678 | speed things up, during the compilation stage, perl compiles the regexp
|
---|
679 | into a compact sequence of opcodes that can often fit inside a
|
---|
680 | processor cache. When the code is executed, these opcodes can then run
|
---|
681 | at full throttle and search very quickly.
|
---|
682 |
|
---|
683 | =head2 Extracting matches
|
---|
684 |
|
---|
685 | The grouping metacharacters C<()> also serve another completely
|
---|
686 | different function: they allow the extraction of the parts of a string
|
---|
687 | that matched. This is very useful to find out what matched and for
|
---|
688 | text processing in general. For each grouping, the part that matched
|
---|
689 | inside goes into the special variables C<$1>, C<$2>, etc. They can be
|
---|
690 | used just as ordinary variables:
|
---|
691 |
|
---|
692 | # extract hours, minutes, seconds
|
---|
693 | if ($time =~ /(\d\d):(\d\d):(\d\d)/) { # match hh:mm:ss format
|
---|
694 | $hours = $1;
|
---|
695 | $minutes = $2;
|
---|
696 | $seconds = $3;
|
---|
697 | }
|
---|
698 |
|
---|
699 | Now, we know that in scalar context,
|
---|
700 | S<C<$time =~ /(\d\d):(\d\d):(\d\d)/> > returns a true or false
|
---|
701 | value. In list context, however, it returns the list of matched values
|
---|
702 | C<($1,$2,$3)>. So we could write the code more compactly as
|
---|
703 |
|
---|
704 | # extract hours, minutes, seconds
|
---|
705 | ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
|
---|
706 |
|
---|
707 | If the groupings in a regexp are nested, C<$1> gets the group with the
|
---|
708 | leftmost opening parenthesis, C<$2> the next opening parenthesis,
|
---|
709 | etc. For example, here is a complex regexp and the matching variables
|
---|
710 | indicated below it:
|
---|
711 |
|
---|
712 | /(ab(cd|ef)((gi)|j))/;
|
---|
713 |  1  2      34
|
---|
714 |
|
---|
715 | so that if the regexp matched, e.g., C<$2> would contain 'cd' or 'ef'. For
|
---|
716 | convenience, perl sets C<$+> to the string held by the highest numbered
|
---|
717 | C<$1>, C<$2>, ... that got assigned (and, somewhat related, C<$^N> to the
|
---|
718 | value of the C<$1>, C<$2>, ... most-recently assigned; i.e. the C<$1>,
|
---|
719 | C<$2>, ... associated with the rightmost closing parenthesis used in the
|
---|
720 | match).
|
---|
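
As a quick illustration of the numbering rule, the sketch below matches the nested regexp against the sample string C<"abefj"> (our choice, not taken from the text above) and prints each group. A group that took no part in the match comes back undefined:

```perl
use strict;
use warnings;

# Groups are numbered by the position of their opening parenthesis.
# In list context the match returns ($1, $2, $3, $4).
my @groups = "abefj" =~ /(ab(cd|ef)((gi)|j))/;

for my $n (1 .. @groups) {
    my $value = $groups[$n - 1];
    printf "group %d = %s\n", $n, defined $value ? "'$value'" : 'undef';
}
```

Here group 1 holds the whole 'abefj', group 2 holds 'ef', group 3 holds 'j', and group 4 is undefined because the C<(gi)> alternative was not the one that matched.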
721 |
|
---|
722 | Closely associated with the matching variables C<$1>, C<$2>, ... are
|
---|
723 | the B<backreferences> C<\1>, C<\2>, ... . Backreferences are simply
|
---|
724 | matching variables that can be used I<inside> a regexp. This is a
|
---|
725 | really nice feature - what matches later in a regexp can depend on
|
---|
726 | what matched earlier in the regexp. Suppose we wanted to look
|
---|
727 | for doubled words in text, like 'the the'. The following regexp finds
|
---|
728 | all 3-letter doubles with a space in between:
|
---|
729 |
|
---|
730 | /(\w\w\w)\s\1/;
|
---|
731 |
|
---|
732 | The grouping assigns a value to C<\1>, so that the same 3-letter sequence
|
---|
733 | is used for both parts. Here are some words with repeated parts:
|
---|
734 |
|
---|
735 | % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\1$' /usr/dict/words
|
---|
736 | beriberi
|
---|
737 | booboo
|
---|
738 | coco
|
---|
739 | mama
|
---|
740 | murmur
|
---|
741 | papa
|
---|
742 |
|
---|
743 | The regexp has a single grouping which considers 4-letter
|
---|
744 | combinations, then 3-letter combinations, etc. and uses C<\1> to look for
|
---|
745 | a repeat. Although C<$1> and C<\1> represent the same thing, care should be
|
---|
746 | taken to use matched variables C<$1>, C<$2>, ... only outside a regexp
|
---|
747 | and backreferences C<\1>, C<\2>, ... only inside a regexp; not doing
|
---|
748 | so may lead to surprising and/or undefined results.
|
---|
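
If no word list is at hand, the same regexp can be tried on a short list inside a script. This is a minimal sketch; the word list C<@words> is made up for the example:

```perl
use strict;
use warnings;

# \1 must repeat exactly what the first group captured, so only
# words made of one chunk written twice get through.
my @words   = qw(beriberi banana coco mama murmur paper);
my @repeats = grep { /^(\w\w\w\w|\w\w\w|\w\w|\w)\1$/ } @words;

print "@repeats\n";
```

This prints C<beriberi coco mama murmur>. 'banana' and 'paper' are rejected because no leading chunk of one to four letters is immediately repeated up to the end of the word.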
749 |
|
---|
750 | In addition to what was matched, Perl 5.6.0 also provides the
|
---|
751 | positions of what was matched with the C<@-> and C<@+>
|
---|
752 | arrays. C<$-[0]> is the position of the start of the entire match and
|
---|
753 | C<$+[0]> is the position of the end. Similarly, C<$-[n]> is the
|
---|
754 | position of the start of the C<$n> match and C<$+[n]> is the position
|
---|
755 | of the end. If C<$n> is undefined, so are C<$-[n]> and C<$+[n]>. Then
|
---|
756 | this code
|
---|
757 |
|
---|
758 | $x = "Mmm...donut, thought Homer";
|
---|
759 | $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches
|
---|
760 | foreach $expr (1..$#-) {
|
---|
761 | print "Match $expr: '${$expr}' at position ($-[$expr],$+[$expr])\n";
|
---|
762 | }
|
---|
763 |
|
---|
764 | prints
|
---|
765 |
|
---|
766 | Match 1: 'Mmm' at position (0,3)
|
---|
767 | Match 2: 'donut' at position (6,11)
|
---|
768 |
|
---|
769 | Even if there are no groupings in a regexp, it is still possible to
|
---|
770 | find out what exactly matched in a string. If you use them, perl
|
---|
771 | will set C<$`> to the part of the string before the match, will set C<$&>
|
---|
772 | to the part of the string that matched, and will set C<$'> to the part
|
---|
773 | of the string after the match. An example:
|
---|
774 |
|
---|
775 | $x = "the cat caught the mouse";
|
---|
776 | $x =~ /cat/; # $` = 'the ', $& = 'cat', $' = ' caught the mouse'
|
---|
777 | $x =~ /the/; # $` = '', $& = 'the', $' = ' cat caught the mouse'
|
---|
778 |
|
---|
779 | In the second match, S<C<$` = ''> > because the regexp matched at the
|
---|
780 | first character position in the string and stopped; it never saw the
|
---|
781 | second 'the'. It is important to note that using C<$`> and C<$'>
|
---|
782 | slows down regexp matching quite a bit, and C< $& > slows it down to a
|
---|
783 | lesser extent, because if they are used in one regexp in a program,
|
---|
784 | they are generated for I<all> regexps in the program. So if raw
|
---|
785 | performance is a goal of your application, they should be avoided.
|
---|
786 | If you need them, use C<@-> and C<@+> instead:
|
---|
787 |
|
---|
788 | $` is the same as substr( $x, 0, $-[0] )
|
---|
789 | $& is the same as substr( $x, $-[0], $+[0]-$-[0] )
|
---|
790 | $' is the same as substr( $x, $+[0] )
|
---|
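
The equivalences listed above can be sketched as follows. The names C<$before>, C<$matched> and C<$after> are illustrative only; they stand in for C<$`>, C<$&> and C<$'>:

```perl
use strict;
use warnings;

my $x = "the cat caught the mouse";
my ($before, $matched, $after);

if ($x =~ /cat/) {
    # @- and @+ hold the start and end offsets of the last match
    $before  = substr($x, 0, $-[0]);               # like $`
    $matched = substr($x, $-[0], $+[0] - $-[0]);   # like $&
    $after   = substr($x, $+[0]);                  # like $'
    printf "before='%s' matched='%s' after='%s'\n",
        $before, $matched, $after;
}
```

The three pieces come out as 'the ', 'cat' and ' caught the mouse', the same values shown for the first match in the example above.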
791 |
|
---|
792 | =head2 Matching repetitions
|
---|
793 |
|
---|
794 | The examples in the previous section display an annoying weakness. We
|
---|
795 | were only matching 3-letter words, or syllables of 4 letters or
|
---|
796 | fewer. We'd like to be able to match words or syllables of any length,
|
---|
797 | without writing out tedious alternatives like
|
---|
798 | C<\w\w\w\w|\w\w\w|\w\w|\w>.
|
---|
799 |
|
---|
800 | This is exactly the problem the B<quantifier> metacharacters C<?>,
|
---|
801 | C<*>, C<+>, and C<{}> were created for. They allow us to determine the
|
---|
802 | number of repeats of a portion of a regexp we consider to be a
|
---|
803 | match. Quantifiers are put immediately after the character, character
|
---|
804 | class, or grouping that we want to specify. They have the following
|
---|
805 | meanings:
|
---|
806 |
|
---|
807 | =over 4
|
---|
808 |
|
---|
809 | =item *
|
---|
810 |
|
---|
811 | C<a?> = match 'a' 1 or 0 times
|
---|
812 |
|
---|
813 | =item *
|
---|
814 |
|
---|
815 | C<a*> = match 'a' 0 or more times, i.e., any number of times
|
---|
816 |
|
---|
817 | =item *
|
---|
818 |
|
---|
819 | C<a+> = match 'a' 1 or more times, i.e., at least once
|
---|
820 |
|
---|
821 | =item *
|
---|
822 |
|
---|
823 | C<a{n,m}> = match at least C<n> times, but not more than C<m>
|
---|
824 | times.
|
---|
825 |
|
---|
826 | =item *
|
---|
827 |
|
---|
828 | C<a{n,}> = match at least C<n> times
|
---|
829 |
|
---|
830 | =item *
|
---|
831 |
|
---|
832 | C<a{n}> = match exactly C<n> times
|
---|
833 |
|
---|
834 | =back
|
---|
835 |
|
---|
836 | Here are some examples:
|
---|
837 |
|
---|
838 | /[a-z]+\s+\d*/; # match a lowercase word, at least some space, and
|
---|
839 | # any number of digits
|
---|
840 | /(\w+)\s+\1/; # match doubled words of arbitrary length
|
---|
841 | /y(es)?/i; # matches 'y', 'Y', or a case-insensitive 'yes'
|
---|
842 | $year =~ /\d{2,4}/; # make sure year is at least 2 but not more
|
---|
843 | # than 4 digits
|
---|
844 | $year =~ /\d{4}|\d{2}/; # better match; throw out 3 digit dates
|
---|
845 | $year =~ /\d{2}(\d{2})?/; # same thing written differently. However,
|
---|
846 | # this produces $1 and the other does not.
|
---|
847 |
|
---|
848 | % simple_grep '^(\w+)\1$' /usr/dict/words # isn't this easier?
|
---|
849 | beriberi
|
---|
850 | booboo
|
---|
851 | coco
|
---|
852 | mama
|
---|
853 | murmur
|
---|
854 | papa
|
---|
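
One caveat about the C<$year> examples: they are not anchored, so they only check that a suitable run of digits occurs somewhere in the string. The sketch below (the list C<@years> is invented for the example) contrasts the unanchored form with one that uses C<^> and C<$> to constrain the whole string:

```perl
use strict;
use warnings;

my @years = qw(99 1999 199 19999);

# Unanchored: succeeds whenever two digits occur somewhere in the string.
my @loose  = grep { /\d{4}|\d{2}/ } @years;

# Anchored: the whole string must be exactly four or exactly two digits.
my @strict = grep { /^(\d{4}|\d{2})$/ } @years;

print "loose:  @loose\n";
print "strict: @strict\n";
```

The unanchored regexp accepts all four strings, while the anchored one keeps only C<99> and C<1999>.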
855 |
|
---|
856 | For all of these quantifiers, perl will try to match as much of the
|
---|
857 | string as possible, while still allowing the regexp to succeed. Thus
|
---|
858 | with C</a?.../>, perl will first try to match the regexp with the C<a>
|
---|
859 | present; if that fails, perl will try to match the regexp without the
|
---|
860 | C<a> present. For the quantifier C<*>, we get the following:
|
---|
861 |
|
---|
862 | $x = "the cat in the hat";
|
---|
863 | $x =~ /^(.*)(cat)(.*)$/; # matches,
|
---|
864 | # $1 = 'the '
|
---|
865 | # $2 = 'cat'
|
---|
866 | # $3 = ' in the hat'
|
---|
867 |
|
---|
868 | This is what we might expect: the match finds the only C<cat> in the
|
---|
869 | string and locks onto it. Consider, however, this regexp:
|
---|
870 |
|
---|
871 | $x =~ /^(.*)(at)(.*)$/; # matches,
|
---|
872 | # $1 = 'the cat in the h'
|
---|
873 | # $2 = 'at'
|
---|
874 | # $3 = '' (0 matches)
|
---|
875 |
|
---|
876 | One might initially guess that perl would find the C<at> in C<cat> and
|
---|
877 | stop there, but that wouldn't give the longest possible string to the
|
---|
878 | first quantifier C<.*>. Instead, the first quantifier C<.*> grabs as
|
---|
879 | much of the string as possible while still having the regexp match. In
|
---|
880 | this example, that means matching the C<at> element against the final C<at>
|
---|
881 | in the string. The other important principle illustrated here is that
|
---|
882 | when there are two or more elements in a regexp, the I<leftmost>
|
---|
883 | quantifier, if there is one, gets to grab as much of the string as
|
---|
884 | possible, leaving the rest of the regexp to fight over scraps. Thus in
|
---|
885 | our example, the first quantifier C<.*> grabs most of the string, while
|
---|
886 | the second quantifier C<.*> gets the empty string. Quantifiers that
|
---|
887 | grab as much of the string as possible are called B<maximal match> or
|
---|
888 | B<greedy> quantifiers.
|
---|
889 |
|
---|
890 | When a regexp can match a string in several different ways, we can use
|
---|
891 | the principles above to predict which way the regexp will match:
|
---|
892 |
|
---|
893 | =over 4
|
---|
894 |
|
---|
895 | =item *
|
---|
896 |
|
---|
897 | Principle 0: Taken as a whole, any regexp will be matched at the
|
---|
898 | earliest possible position in the string.
|
---|
899 |
|
---|
900 | =item *
|
---|
901 |
|
---|
902 | Principle 1: In an alternation C<a|b|c...>, the leftmost alternative
|
---|
903 | that allows a match for the whole regexp will be the one used.
|
---|
904 |
|
---|
905 | =item *
|
---|
906 |
|
---|
907 | Principle 2: The maximal matching quantifiers C<?>, C<*>, C<+> and
|
---|
908 | C<{n,m}> will in general match as much of the string as possible while
|
---|
909 | still allowing the whole regexp to match.
|
---|
910 |
|
---|
911 | =item *
|
---|
912 |
|
---|
913 | Principle 3: If there are two or more elements in a regexp, the
|
---|
914 | leftmost greedy quantifier, if any, will match as much of the string
|
---|
915 | as possible while still allowing the whole regexp to match. The next
|
---|
916 | leftmost greedy quantifier, if any, will try to match as much of the
|
---|
917 | string remaining available to it as possible, while still allowing the
|
---|
918 | whole regexp to match. And so on, until all the regexp elements are
|
---|
919 | satisfied.
|
---|
920 |
|
---|
921 | =back
|
---|
922 |
|
---|
923 | As we have seen above, Principle 0 overrides the others - the regexp
|
---|
924 | will be matched as early as possible, with the other principles
|
---|
925 | determining how the regexp matches at that earliest character
|
---|
926 | position.
|
---|
927 |
|
---|
928 | Here is an example of these principles in action:
|
---|
929 |
|
---|
930 | $x = "The programming republic of Perl";
|
---|
931 | $x =~ /^(.+)(e|r)(.*)$/; # matches,
|
---|
932 | # $1 = 'The programming republic of Pe'
|
---|
933 | # $2 = 'r'
|
---|
934 | # $3 = 'l'
|
---|
935 |
|
---|
936 | This regexp matches at the earliest string position, C<'T'>. One
|
---|
937 | might think that C<e>, being leftmost in the alternation, would be
|
---|
938 | matched, but C<r> produces the longest string in the first quantifier.
|
---|
939 |
|
---|
940 | $x =~ /(m{1,2})(.*)$/; # matches,
|
---|
941 | # $1 = 'mm'
|
---|
942 | # $2 = 'ing republic of Perl'
|
---|
943 |
|
---|
944 | Here, the earliest possible match is at the first C<'m'> in
|
---|
945 | C<programming>. C<m{1,2}> is the first quantifier, so it gets to match
|
---|
946 | a maximal C<mm>.
|
---|
947 |
|
---|
948 | $x =~ /.*(m{1,2})(.*)$/; # matches,
|
---|
949 | # $1 = 'm'
|
---|
950 | # $2 = 'ing republic of Perl'
|
---|
951 |
|
---|
952 | Here, the regexp matches at the start of the string. The first
|
---|
953 | quantifier C<.*> grabs as much as possible, leaving just a single
|
---|
954 | C<'m'> for the second quantifier C<m{1,2}>.
|
---|
955 |
|
---|
956 | $x =~ /(.?)(m{1,2})(.*)$/; # matches,
|
---|
957 | # $1 = 'a'
|
---|
958 | # $2 = 'mm'
|
---|
959 | # $3 = 'ing republic of Perl'
|
---|
960 |
|
---|
961 | Here, C<.?> eats its maximal one character at the earliest possible
|
---|
962 | position in the string, C<'a'> in C<programming>, leaving C<m{1,2}>
|
---|
963 | the opportunity to match both C<m>'s. Finally,
|
---|
964 |
|
---|
965 | "aXXXb" =~ /(X*)/; # matches with $1 = ''
|
---|
966 |
|
---|
967 | because it can match zero copies of C<'X'> at the beginning of the
|
---|
968 | string. If you definitely want to match at least one C<'X'>, use
|
---|
969 | C<X+>, not C<X*>.
|
---|
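
To see Principle 0 at work in this last case, the following sketch records both what was captured and where the match started (via C<$-[0]>). The variable names are ours:

```perl
use strict;
use warnings;

my $x = "aXXXb";

# X* is satisfied by zero X's, so it matches at the very first position.
$x =~ /(X*)/;
my ($star, $star_at) = ($1, $-[0]);

# X+ needs at least one X, so the match has to start one character later.
$x =~ /(X+)/;
my ($plus, $plus_at) = ($1, $-[0]);

printf "X* captured '%s' at position %d\n", $star, $star_at;
printf "X+ captured '%s' at position %d\n", $plus, $plus_at;
```

C<X*> captures the empty string at position 0, whereas C<X+> captures 'XXX' starting at position 1.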
970 |
|
---|
971 | Sometimes greed is not good. At times, we would like quantifiers to
|
---|
972 | match a I<minimal> piece of string, rather than a maximal piece. For
|
---|
973 | this purpose, Larry Wall created the S<B<minimal match> > or
|
---|
974 | B<non-greedy> quantifiers C<??>, C<*?>, C<+?>, and C<{}?>. These are
|
---|
975 | the usual quantifiers with a C<?> appended to them. They have the
|
---|
976 | following meanings:
|
---|
977 |
|
---|
978 | =over 4
|
---|
979 |
|
---|
980 | =item *
|
---|
981 |
|
---|
982 | C<a??> = match 'a' 0 or 1 times. Try 0 first, then 1.
|
---|
983 |
|
---|
984 | =item *
|
---|
985 |
|
---|
986 | C<a*?> = match 'a' 0 or more times, i.e., any number of times,
|
---|
987 | but as few times as possible
|
---|
988 |
|
---|
989 | =item *
|
---|
990 |
|
---|
991 | C<a+?> = match 'a' 1 or more times, i.e., at least once, but
|
---|
992 | as few times as possible
|
---|
993 |
|
---|
994 | =item *
|
---|
995 |
|
---|
996 | C<a{n,m}?> = match at least C<n> times, not more than C<m>
|
---|
997 | times, as few times as possible
|
---|
998 |
|
---|
999 | =item *
|
---|
1000 |
|
---|
1001 | C<a{n,}?> = match at least C<n> times, but as few times as
|
---|
1002 | possible
|
---|
1003 |
|
---|
1004 | =item *
|
---|
1005 |
|
---|
1006 | C<a{n}?> = match exactly C<n> times. Because we match exactly
|
---|
1007 | C<n> times, C<a{n}?> is equivalent to C<a{n}> and is just there for
|
---|
1008 | notational consistency.
|
---|
1009 |
|
---|
1010 | =back
|
---|
1011 |
|
---|
1012 | Let's look at the example above, but with minimal quantifiers:
|
---|
1013 |
|
---|
1014 | $x = "The programming republic of Perl";
|
---|
1015 | $x =~ /^(.+?)(e|r)(.*)$/; # matches,
|
---|
1016 | # $1 = 'Th'
|
---|
1017 | # $2 = 'e'
|
---|
1018 | # $3 = ' programming republic of Perl'
|
---|
1019 |
|
---|
1020 | The minimal string that will allow both the start of the string C<^>
|
---|
1021 | and the alternation to match is C<Th>, with the alternation C<e|r>
|
---|
1022 | matching C<e>. The second quantifier C<.*> is free to gobble up the
|
---|
1023 | rest of the string.
|
---|
1024 |
|
---|
1025 | $x =~ /(m{1,2}?)(.*?)$/; # matches,
|
---|
1026 | # $1 = 'm'
|
---|
1027 | # $2 = 'ming republic of Perl'
|
---|
1028 |
|
---|
1029 | The first string position that this regexp can match is at the first
|
---|
1030 | C<'m'> in C<programming>. At this position, the minimal C<m{1,2}?>
|
---|
1031 | matches just one C<'m'>. Although the second quantifier C<.*?> would
|
---|
1032 | prefer to match no characters, it is constrained by the end-of-string
|
---|
1033 | anchor C<$> to match the rest of the string.
|
---|
1034 |
|
---|
1035 | $x =~ /(.*?)(m{1,2}?)(.*)$/; # matches,
|
---|
1036 | # $1 = 'The progra'
|
---|
1037 | # $2 = 'm'
|
---|
1038 | # $3 = 'ming republic of Perl'
|
---|
1039 |
|
---|
1040 | In this regexp, you might expect the first minimal quantifier C<.*?>
|
---|
1041 | to match the empty string, because it is not constrained by a C<^>
|
---|
1042 | anchor to match the beginning of the string. Principle 0 applies here,
|
---|
1043 | however. Because it is possible for the whole regexp to match at the
|
---|
1044 | start of the string, it I<will> match at the start of the string. Thus
|
---|
1045 | the first quantifier has to match everything up to the first C<m>. The
|
---|
1046 | second minimal quantifier matches just one C<m> and the third
|
---|
1047 | quantifier matches the rest of the string.
|
---|
1048 |
|
---|
1049 | $x =~ /(.??)(m{1,2})(.*)$/; # matches,
|
---|
1050 | # $1 = 'a'
|
---|
1051 | # $2 = 'mm'
|
---|
1052 | # $3 = 'ing republic of Perl'
|
---|
1053 |
|
---|
1054 | Just as in the previous regexp, the first quantifier C<.??> can match
|
---|
1055 | earliest at position C<'a'>, so it does. The second quantifier is
|
---|
1056 | greedy, so it matches C<mm>, and the third matches the rest of the
|
---|
1057 | string.
|
---|
1058 |
|
---|
1059 | We can modify principle 3 above to take into account non-greedy
|
---|
1060 | quantifiers:
|
---|
1061 |
|
---|
1062 | =over 4
|
---|
1063 |
|
---|
1064 | =item *
|
---|
1065 |
|
---|
1066 | Principle 3: If there are two or more elements in a regexp, the
|
---|
1067 | leftmost greedy (non-greedy) quantifier, if any, will match as much
|
---|
1068 | (little) of the string as possible while still allowing the whole
|
---|
1069 | regexp to match. The next leftmost greedy (non-greedy) quantifier, if
|
---|
1070 | any, will try to match as much (little) of the string remaining
|
---|
1071 | available to it as possible, while still allowing the whole regexp to
|
---|
1072 | match. And so on, until all the regexp elements are satisfied.
|
---|
1073 |
|
---|
1074 | =back
|
---|
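
A common place where the greedy/non-greedy distinction matters is text between delimiters. As a rough sketch (the sample string C<$html> is invented, and this is not a recommended way to parse markup), compare:

```perl
use strict;
use warnings;

my $html = '<b>bold</b> and <i>italic</i>';

# Greedy: .+ runs to the end and backs off only as far as the last '>'.
my ($greedy)  = $html =~ /<(.+)>/;

# Non-greedy: .+? grows one character at a time and stops at the first '>'.
my ($minimal) = $html =~ /<(.+?)>/;

print "greedy:  $greedy\n";
print "minimal: $minimal\n";
```

The greedy form captures everything between the first C<< < >> and the last C<< > >>, namely C<< b>bold</b> and <i>italic</i >>, while the non-greedy form captures just C<b>.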
1075 |
|
---|
1076 | Just like alternation, quantifiers are also susceptible to
|
---|
1077 | backtracking. Here is a step-by-step analysis of the example
|
---|
1078 |
|
---|
1079 | $x = "the cat in the hat";
|
---|
1080 | $x =~ /^(.*)(at)(.*)$/; # matches,
|
---|
1081 | # $1 = 'the cat in the h'
|
---|
1082 | # $2 = 'at'
|
---|
1083 | # $3 = '' (0 matches)
|
---|
1084 |
|
---|
1085 | =over 4
|
---|
1086 |
|
---|
1087 | =item 0
|
---|
1088 |
|
---|
1089 | Start with the first letter in the string 't'.
|
---|
1090 |
|
---|
1091 | =item 1
|
---|
1092 |
|
---|
1093 | The first quantifier '.*' starts out by matching the whole
|
---|
1094 | string 'the cat in the hat'.
|
---|
1095 |
|
---|
1096 | =item 2
|
---|
1097 |
|
---|
1098 | 'a' in the regexp element 'at' doesn't match the end of the
|
---|
1099 | string. Backtrack one character.
|
---|
1100 |
|
---|
1101 | =item 3
|
---|
1102 |
|
---|
1103 | 'a' in the regexp element 'at' still doesn't match the last
|
---|
1104 | letter of the string 't', so backtrack one more character.
|
---|
1105 |
|
---|
1106 | =item 4
|
---|
1107 |
|
---|
1108 | Now we can match the 'a' and the 't'.
|
---|
1109 |
|
---|
1110 | =item 5
|
---|
1111 |
|
---|
1112 | Move on to the third element '.*'. Since we are at the end of
|
---|
1113 | the string and '.*' can match 0 times, assign it the empty string.
|
---|
1114 |
|
---|
1115 | =item 6
|
---|
1116 |
|
---|
1117 | We are done!
|
---|
1118 |
|
---|
1119 | =back
|
---|
1120 |
|
---|
1121 | Most of the time, all this moving forward and backtracking happens
|
---|
1122 | quickly and searching is fast. There are some pathological regexps,
|
---|
1123 | however, whose execution time grows exponentially with the size of the
|
---|
1124 | string. A typical structure that blows up in your face is of the form
|
---|
1125 |
|
---|
1126 | /(a|b+)*/;
|
---|
1127 |
|
---|
1128 | The problem is the nested indeterminate quantifiers. There are many
|
---|
1129 | different ways of partitioning a string of length n between the C<+>
|
---|
1130 | and C<*>: one repetition with C<b+> of length n, two repetitions with
|
---|
1131 | the first C<b+> length k and the second with length n-k, m repetitions
|
---|
1132 | whose bits add up to length n, etc. In fact there are an exponential
|
---|
1133 | number of ways to partition a string as a function of its length. A
|
---|
1134 | regexp may get lucky and match early in the process, but if there is
|
---|
1135 | no match, perl will try I<every> possibility before giving up. So be
|
---|
1136 | careful with nested C<*>'s, C<{n,m}>'s, and C<+>'s. The book
|
---|
1137 | I<Mastering Regular Expressions> by Jeffrey Friedl gives a wonderful
|
---|
1138 | discussion of this and other efficiency issues.
|
---|
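
To get a feel for how quickly the number of partitions grows, here is a small counting sketch. It does not model perl's regexp engine; the hypothetical helper C<count_partitions> simply counts the ways a run of n characters can be cut into consecutive non-empty pieces, each piece standing for one pass of C<b+>:

```perl
use strict;
use warnings;

# Count the ways to cut a run of $n characters into consecutive,
# non-empty pieces.
sub count_partitions {
    my ($n) = @_;
    return 1 if $n == 0;    # nothing left: one complete split
    my $ways = 0;
    for my $first_piece (1 .. $n) {
        $ways += count_partitions($n - $first_piece);
    }
    return $ways;
}

printf "%2d characters: %d ways\n", $_, count_partitions($_) for 1 .. 10;
```

The count doubles with every extra character (1, 2, 4, 8, ... up to 512 for ten characters), which is the exponential growth described above.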
1139 |
|
---|
1140 | =head2 Building a regexp
|
---|
1141 |
|
---|
1142 | At this point, we have all the basic regexp concepts covered, so let's
|
---|
1143 | give a more involved example of a regular expression. We will build a
|
---|
1144 | regexp that matches numbers.
|
---|
1145 |
|
---|
1146 | The first task in building a regexp is to decide what we want to match
|
---|
1147 | and what we want to exclude. In our case, we want to match both
|
---|
1148 | integers and floating point numbers and we want to reject any string
|
---|
1149 | that isn't a number.
|
---|
1150 |
|
---|
1151 | The next task is to break the problem down into smaller problems that
|
---|
1152 | are easily converted into a regexp.
|
---|
1153 |
|
---|
1154 | The simplest case is integers. These consist of a sequence of digits,
|
---|
1155 | with an optional sign in front. The digits we can represent with
|
---|
1156 | C<\d+> and the sign can be matched with C<[+-]>. Thus the integer
|
---|
1157 | regexp is
|
---|
1158 |
|
---|
1159 | /[+-]?\d+/; # matches integers
|
---|
1160 |
|
---|
1161 | A floating point number potentially has a sign, an integral part, a
|
---|
1162 | decimal point, a fractional part, and an exponent. One or more of these
|
---|
1163 | parts is optional, so we need to check out the different
|
---|
1164 | possibilities. Floating point numbers which are in proper form include
|
---|
1165 | 123., 0.345, .34, -1e6, and 25.4E-72. As with integers, the sign out
|
---|
1166 | front is completely optional and can be matched by C<[+-]?>. We can
|
---|
1167 | see that if there is no exponent, floating point numbers must have a
|
---|
1168 | decimal point, otherwise they are integers. We might be tempted to
|
---|
1169 | model these with C<\d*\.\d*>, but this would also match just a single
|
---|
1170 | decimal point, which is not a number. So the three cases of floating
|
---|
1171 | point number sans exponent are
|
---|
1172 |
|
---|
1173 | /[+-]?\d+\./; # 1., 321., etc.
|
---|
1174 | /[+-]?\.\d+/; # .1, .234, etc.
|
---|
1175 | /[+-]?\d+\.\d+/; # 1.0, 30.56, etc.
|
---|
1176 |
|
---|
1177 | These can be combined into a single regexp with a three-way alternation:
|
---|
1178 |
|
---|
1179 | /[+-]?(\d+\.\d+|\d+\.|\.\d+)/; # floating point, no exponent
|
---|
1180 |
|
---|
1181 | In this alternation, it is important to put C<'\d+\.\d+'> before
|
---|
1182 | C<'\d+\.'>. If C<'\d+\.'> were first, the regexp would happily match that
|
---|
1183 | and ignore the fractional part of the number.
|
---|
1184 |
|
---|
1185 | Now consider floating point numbers with exponents. The key
|
---|
1186 | observation here is that I<both> integers and numbers with decimal
|
---|
1187 | points are allowed in front of an exponent. Then exponents, like the
|
---|
1188 | overall sign, are independent of whether we are matching numbers with
|
---|
1189 | or without decimal points, and can be 'decoupled' from the
|
---|
1190 | mantissa. The overall form of the regexp now becomes clear:
|
---|
1191 |
|
---|
1192 | /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/;
|
---|
1193 |
|
---|
1194 | The exponent is an C<e> or C<E>, followed by an integer. So the
|
---|
1195 | exponent regexp is
|
---|
1196 |
|
---|
1197 | /[eE][+-]?\d+/; # exponent
|
---|
1198 |
|
---|
1199 | Putting all the parts together, we get a regexp that matches numbers:
|
---|
1200 |
|
---|
1201 | /^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/; # Ta da!
|
---|
1202 |
|
---|
1203 | Long regexps like this may impress your friends, but can be hard to
|
---|
1204 | decipher. In complex situations like this, the C<//x> modifier for a
|
---|
1205 | match is invaluable. It allows one to put nearly arbitrary whitespace
|
---|
1206 | and comments into a regexp without affecting its meaning. Using it,
|
---|
1207 | we can rewrite our 'extended' regexp in the more pleasing form
|
---|
1208 |
|
---|
1209 | /^
|
---|
1210 | [+-]? # first, match an optional sign
|
---|
1211 | ( # then match integers or f.p. mantissas:
|
---|
1212 | \d+\.\d+ # mantissa of the form a.b
|
---|
1213 | |\d+\. # mantissa of the form a.
|
---|
1214 | |\.\d+ # mantissa of the form .b
|
---|
1215 | |\d+ # integer of the form a
|
---|
1216 | )
|
---|
1217 | ([eE][+-]?\d+)? # finally, optionally match an exponent
|
---|
1218 | $/x;
|
---|
1219 |
|
---|
1220 | If whitespace is mostly irrelevant, how does one include space
|
---|
1221 | characters in an extended regexp? The answer is to backslash them
|
---|
1222 | S<C<'\ '> > or put them in a character class S<C<[ ]> >. The same thing
|
---|
1223 | goes for pound signs: use C<\#> or C<[#]>. For instance, Perl allows
|
---|
1224 | a space between the sign and the mantissa/integer, and we could add
|
---|
1225 | this to our regexp as follows:
|
---|
1226 |
|
---|
1227 | /^
|
---|
1228 | [+-]?\ * # first, match an optional sign *and space*
|
---|
1229 | ( # then match integers or f.p. mantissas:
|
---|
1230 | \d+\.\d+ # mantissa of the form a.b
|
---|
1231 | |\d+\. # mantissa of the form a.
|
---|
1232 | |\.\d+ # mantissa of the form .b
|
---|
1233 | |\d+ # integer of the form a
|
---|
1234 | )
|
---|
1235 | ([eE][+-]?\d+)? # finally, optionally match an exponent
|
---|
1236 | $/x;
|
---|
1237 |
|
---|
1238 | In this form, it is easier to see a way to simplify the
|
---|
1239 | alternation. Alternatives 1, 2, and 4 all start with C<\d+>, so it
|
---|
1240 | could be factored out:
|
---|
1241 |
|
---|
1242 | /^
|
---|
1243 | [+-]?\ * # first, match an optional sign
|
---|
1244 | ( # then match integers or f.p. mantissas:
|
---|
1245 | \d+ # start out with a ...
|
---|
1246 | (
|
---|
1247 | \.\d* # mantissa of the form a.b or a.
|
---|
1248 | )? # ? takes care of integers of the form a
|
---|
1249 | |\.\d+ # mantissa of the form .b
|
---|
1250 | )
|
---|
1251 | ([eE][+-]?\d+)? # finally, optionally match an exponent
|
---|
1252 | $/x;
|
---|
1253 |
|
---|
1254 | or written in the compact form,
|
---|
1255 |
|
---|
1256 | /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;
|
---|
1257 |
|
---|
1258 | This is our final regexp. To recap, we built a regexp by
|
---|
1259 |
|
---|
1260 | =over 4
|
---|
1261 |
|
---|
1262 | =item *
|
---|
1263 |
|
---|
1264 | specifying the task in detail,
|
---|
1265 |
|
---|
1266 | =item *
|
---|
1267 |
|
---|
1268 | breaking down the problem into smaller parts,
|
---|
1269 |
|
---|
1270 | =item *
|
---|
1271 |
|
---|
1272 | translating the small parts into regexps,
|
---|
1273 |
|
---|
1274 | =item *
|
---|
1275 |
|
---|
1276 | combining the regexps,
|
---|
1277 |
|
---|
1278 | =item *
|
---|
1279 |
|
---|
1280 | and optimizing the final combined regexp.
|
---|
1281 |
|
---|
1282 | =back
|
---|
1283 |
|
---|
1284 | These are also the typical steps involved in writing a computer
|
---|
1285 | program. This makes perfect sense, because regular expressions are
|
---|
1286 | essentially programs written in a little computer language that specifies
|
---|
1287 | patterns.
|
---|
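
A regexp built this way is worth testing against both good and bad input. The sketch below stores the compact form with C<qr//> and runs it over a handful of candidates; the list C<@candidates> is ours and mixes the well-formed numbers mentioned earlier with a few strings that should be rejected:

```perl
use strict;
use warnings;

# The compact form of the final regexp, stored with qr// for reuse.
my $number = qr/^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;

my @candidates = ('123.', '0.345', '.34', '-1e6', '25.4E-72', '- 7',
                  '.', '1e', 'abc', '1.2.3');

my @accepted = grep { $_ =~ $number } @candidates;
my @rejected = grep { $_ !~ $number } @candidates;

print "accepted: ", join(' | ', @accepted), "\n";
print "rejected: ", join(' | ', @rejected), "\n";
```

The first six candidates are accepted, including C<'- 7'> with a space after the sign. A lone decimal point, a dangling exponent letter, plain text, and a string with two decimal points are all rejected.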
1288 |
|
---|
1289 | =head2 Using regular expressions in Perl
|
---|
1290 |
|
---|
1291 | The last topic of Part 1 briefly covers how regexps are used in Perl
|
---|
1292 | programs. Where do they fit into Perl syntax?
|
---|
1293 |
|
---|
1294 | We have already introduced the matching operator in its default
|
---|
1295 | C</regexp/> and arbitrary delimiter C<m!regexp!> forms. We have used
|
---|
1296 | the binding operator C<=~> and its negation C<!~> to test for string
|
---|
1297 | matches. Associated with the matching operator, we have discussed the
|
---|
1298 | single line C<//s>, multi-line C<//m>, case-insensitive C<//i> and
|
---|
1299 | extended C<//x> modifiers.
|
---|
1300 |
|
---|
1301 | There are a few more things you might want to know about matching
|
---|
1302 | operators. First, we pointed out earlier that variables in regexps are
|
---|
1303 | substituted before the regexp is evaluated:
|
---|
1304 |
|
---|
1305 | $pattern = 'Seuss';
|
---|
1306 | while (<>) {
|
---|
1307 | print if /$pattern/;
|
---|
1308 | }
|
---|
1309 |
|
---|
1310 | This will print any lines containing the word C<Seuss>. It is not as
|
---|
1311 | efficient as it could be, however, because perl has to re-evaluate
|
---|
1312 | C<$pattern> each time through the loop. If C<$pattern> won't be
|
---|
1313 | changing over the lifetime of the script, we can add the C<//o>
|
---|
1314 | modifier, which directs perl to only perform variable substitutions
|
---|
1315 | once:
|
---|
1316 |
|
---|
1317 | #!/usr/bin/perl
|
---|
1318 | # Improved simple_grep
|
---|
1319 | $regexp = shift;
|
---|
1320 | while (<>) {
|
---|
1321 | print if /$regexp/o; # a good deal faster
|
---|
1322 | }
|
---|
1323 |
|
---|
1324 | If you change C<$regexp> after the first substitution happens, perl
|
---|
1325 | will ignore it. If you don't want any substitutions at all, use the
|
---|
1326 | special delimiter C<m''>:
|
---|
1327 |
|
---|
1328 | @pattern = ('Seuss');
|
---|
1329 | while (<>) {
|
---|
1330 | print if m'@pattern'; # matches literal '@pattern', not 'Seuss'
|
---|
1331 | }
|
---|
1332 |
|
---|
1333 | C<m''> acts like single quotes on a regexp; all other C<m> delimiters
|
---|
1334 | act like double quotes. If the regexp evaluates to the empty string,
|
---|
1335 | the regexp in the I<last successful match> is used instead. So we have
|
---|
1336 |
|
---|
1337 | "dog" =~ /d/; # 'd' matches
|
---|
1338 | "dogbert" =~ //; # this matches the 'd' regexp used before
|
---|
1339 |
|
---|
1340 | The final two modifiers C<//g> and C<//c> concern multiple matches.
|
---|
1341 | The modifier C<//g> stands for global matching and allows the
|
---|
1342 | matching operator to match within a string as many times as possible.
|
---|
1343 | In scalar context, successive invocations against a string will have
|
---|
1344 | C<//g> jump from match to match, keeping track of position in the
|
---|
1345 | string as it goes along. You can get or set the position with the
|
---|
1346 | C<pos()> function.
|
---|
1347 |
|
---|
1348 | The use of C<//g> is shown in the following example. Suppose we have
|
---|
1349 | a string that consists of words separated by spaces. If we know how
|
---|
1350 | many words there are in advance, we could extract the words using
|
---|
1351 | groupings:
|
---|
1352 |
|
---|
1353 | $x = "cat dog house"; # 3 words
|
---|
1354 | $x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/; # matches,
|
---|
1355 | # $1 = 'cat'
|
---|
1356 | # $2 = 'dog'
|
---|
1357 | # $3 = 'house'
|
---|
1358 |
|
---|
1359 | But what if we had an indeterminate number of words? This is the sort
|
---|
1360 | of task C<//g> was made for. To extract all words, form the simple
|
---|
1361 | regexp C<(\w+)> and loop over all matches with C</(\w+)/g>:
|
---|
1362 |
|
---|
1363 | while ($x =~ /(\w+)/g) {
|
---|
1364 | print "Word is $1, ends at position ", pos $x, "\n";
|
---|
1365 | }
|
---|
1366 |
|
---|
1367 | prints
|
---|
1368 |
|
---|
1369 | Word is cat, ends at position 3
|
---|
1370 | Word is dog, ends at position 7
|
---|
1371 | Word is house, ends at position 13
|
---|
1372 |
|
---|
1373 | A failed match or changing the target string resets the position. If
|
---|
1374 | you don't want the position reset after failure to match, add the
|
---|
1375 | C<//c>, as in C</regexp/gc>. The current position in the string is
|
---|
1376 | associated with the string, not the regexp. This means that different
|
---|
1377 | strings have different positions and their respective positions can be
|
---|
1378 | set or read independently.
|
---|
1379 |
|
---|
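The effect of C<//c> on the stored position can be seen in a small experiment. This is a sketch only; the helper C<$where> and the array C<@seen> are hypothetical names introduced for illustration.

```perl
use strict;
use warnings;

my $x = "cat dog house";

# Helper (not part of the tutorial): report pos($x), or 'undef' after a reset.
my $where = sub { defined pos($x) ? pos($x) : 'undef' };

my @seen;
$x =~ /\w+/g;   push @seen, $where->();   # matched 'cat', position is 3
$x =~ /\d+/g;   push @seen, $where->();   # failed, so the position was reset
$x =~ /\w+/g;                             # starts over and matches 'cat' again
$x =~ /\d+/gc;  push @seen, $where->();   # failed, but //c kept position 3
$x =~ /\w+/g;   push @seen, $where->();   # resumes at 3, matches 'dog', position 7

print "@seen\n";
```

Because the position belongs to C<$x> and not to any particular regexp, every match against C<$x> above reads and updates the same value.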
1380 | In list context, C<//g> returns a list of matched groupings, or if
|
---|
1381 | there are no groupings, a list of matches to the whole regexp. So if
|
---|
1382 | we wanted just the words, we could use
|
---|
1383 |
|
---|
1384 | @words = ($x =~ /(\w+)/g); # matches,
|
---|
1385 | # $words[0] = 'cat'
|
---|
1386 | # $words[1] = 'dog'
|
---|
1387 | # $words[2] = 'house'
|
---|
1388 |
|
---|
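When the regexp has more than one grouping, list-context C<//g> returns the groupings of every match in order, which makes it convenient for filling a hash. The following sketch assumes a made-up settings string; C<$config>, C<%setting> and C<$count> are illustrative names.

```perl
use strict;
use warnings;

my $config = "width=80 height=24 depth=8";

# Each match contributes two list elements, ($1, $2), so the flat list
# ('width', 80, 'height', 24, 'depth', 8) initializes the hash directly.
my %setting = $config =~ /(\w+)=(\d+)/g;

# Assigning to an empty list and then to a scalar counts the matches.
my $count = () = $config =~ /=/g;

print "$_ is $setting{$_}\n" for sort keys %setting;
print "$count settings found\n";
```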
1389 | Closely associated with the C<//g> modifier is the C<\G> anchor. The
|
---|
1390 | C<\G> anchor matches at the point where the previous C<//g> match left
|
---|
1391 | off. C<\G> allows us to easily do context-sensitive matching:
|
---|
1392 |
|
---|
1393 | $metric = 1; # use metric units
|
---|
1394 | ...
|
---|
1395 | $x = <FILE>; # read in measurement
|
---|
1396 | $x =~ /^([+-]?\d+)\s*/g; # get magnitude
|
---|
1397 | $weight = $1;
|
---|
1398 | if ($metric) { # error checking
|
---|
1399 | print "Units error!" unless $x =~ /\Gkg\./g;
|
---|
1400 | }
|
---|
1401 | else {
|
---|
1402 | print "Units error!" unless $x =~ /\Glbs\./g;
|
---|
1403 | }
|
---|
1404 | $x =~ /\G\s+(widget|sprocket)/g; # continue processing
|
---|
1405 |
|
---|
1406 | The combination of C<//g> and C<\G> allows us to process the string a
|
---|
1407 | bit at a time and use arbitrary Perl logic to decide what to do next.
|
---|
1408 | Currently, the C<\G> anchor is only fully supported when used to anchor
|
---|
1409 | to the start of the pattern.
|
---|
1410 |
|
---|
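One common use of this bit-at-a-time style is a small tokenizer, where each branch tries a different regexp at the current position and C<//c> keeps the position when a branch fails. The sketch below is a minimal illustration under assumed token rules (integers, identifiers, and the operators C<=> and C<+>); the token labels are invented for the example.

```perl
use strict;
use warnings;

my $src = "x = 42 + y";
my @tokens;

pos($src) = 0;                       # start scanning at the beginning
while (pos($src) < length($src)) {
    if    ($src =~ /\G\s+/gc)            { next }                       # skip whitespace
    elsif ($src =~ /\G(\d+)/gc)          { push @tokens, "NUM($1)" }    # integer
    elsif ($src =~ /\G([A-Za-z_]\w*)/gc) { push @tokens, "ID($1)"  }    # identifier
    elsif ($src =~ /\G([=+])/gc)         { push @tokens, "OP($1)"  }    # operator
    else  { die "Unexpected character at position ", pos($src), "\n" }
}

print "@tokens\n";
```

Every regexp starts with C<\G>, so a branch can only match exactly where the previous token ended; without it, a branch could skip over unrecognized text.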
1411 | C<\G> is also invaluable in processing fixed length records with
|
---|
1412 | regexps. Suppose we have a snippet of coding region DNA, encoded as
|
---|
1413 | base pair letters C<ATCGTTGAAT...> and we want to find all the stop
|
---|
1414 | codons C<TGA>. In a coding region, codons are 3-letter sequences, so
|
---|
1415 | we can think of the DNA snippet as a sequence of 3-letter records. The
|
---|
1416 | naive regexp
|
---|
1417 |
|
---|
1418 | # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
|
---|
1419 | $dna = "ATCGTTGAATGCAAATGACATGAC";
|
---|
1420 | $dna =~ /TGA/;
|
---|
1421 |
|
---|
1422 | doesn't work; it may match a C<TGA>, but there is no guarantee that
|
---|
1423 | the match is aligned with codon boundaries, e.g., the substring
|
---|
1424 | S<C<GTT GAA> > gives a match. A better solution is
|
---|
1425 |
|
---|
1426 | while ($dna =~ /(\w\w\w)*?TGA/g) { # note the minimal *?
|
---|
1427 | print "Got a TGA stop codon at position ", pos $dna, "\n";
|
---|
1428 | }
|
---|
1429 |
|
---|
1430 | which prints
|
---|
1431 |
|
---|
1432 | Got a TGA stop codon at position 18
|
---|
1433 | Got a TGA stop codon at position 23
|
---|
1434 |
|
---|
1435 | Position 18 is good, but position 23 is bogus. What happened?
|
---|
1436 |
|
---|
1437 | The answer is that our regexp works well until we get past the last
|
---|
1438 | real match. Then the regexp will fail to match a synchronized C<TGA>
|
---|
1439 | and start stepping ahead one character position at a time, not what we
|
---|
1440 | want. The solution is to use C<\G> to anchor the match to the codon
|
---|
1441 | alignment:
|
---|
1442 |
|
---|
1443 | while ($dna =~ /\G(\w\w\w)*?TGA/g) {
|
---|
1444 | print "Got a TGA stop codon at position ", pos $dna, "\n";
|
---|
1445 | }
|
---|
1446 |
|
---|
1447 | This prints
|
---|
1448 |
|
---|
1449 | Got a TGA stop codon at position 18
|
---|
1450 |
|
---|
1451 | which is the correct answer. This example illustrates that it is
|
---|
1452 | important not only to match what is desired, but to reject what is not
|
---|
1453 | desired.
|
---|
1454 |
|
---|
1455 | B<search and replace>
|
---|
1456 |
|
---|
1457 | Regular expressions also play a big role in B<search and replace>
|
---|
1458 | operations in Perl. Search and replace is accomplished with the
|
---|
1459 | C<s///> operator. The general form is
|
---|
1460 | C<s/regexp/replacement/modifiers>, with everything we know about
|
---|
1461 | regexps and modifiers applying in this case as well. The
|
---|
1462 | C<replacement> is a Perl double quoted string that replaces in the
|
---|
1463 | string whatever is matched with the C<regexp>. The operator C<=~> is
|
---|
1464 | also used here to associate a string with C<s///>. If matching
|
---|
1465 | against C<$_>, the S<C<$_ =~> > can be dropped. If there is a match,
|
---|
1466 | C<s///> returns the number of substitutions made, otherwise it returns
|
---|
1467 | false. Here are a few examples:
|
---|
1468 |
|
---|
1469 | $x = "Time to feed the cat!";
|
---|
1470 | $x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!"
|
---|
1471 | if ($x =~ s/^(Time.*hacker)!$/$1 now!/) {
|
---|
1472 | $more_insistent = 1;
|
---|
1473 | }
|
---|
1474 | $y = "'quoted words'";
|
---|
1475 | $y =~ s/^'(.*)'$/$1/; # strip single quotes,
|
---|
1476 | # $y contains "quoted words"
|
---|
1477 |
|
---|
1478 | In the last example, the whole string was matched, but only the part
|
---|
1479 | inside the single quotes was grouped. With the C<s///> operator, the
|
---|
1480 | matched variables C<$1>, C<$2>, etc. are immediately available for use
|
---|
1481 | in the replacement expression, so we use C<$1> to replace the quoted
|
---|
1482 | string with just what was quoted. With the global modifier, C<s///g>
|
---|
1483 | will search and replace all occurrences of the regexp in the string:
|
---|
1484 |
|
---|
1485 | $x = "I batted 4 for 4";
|
---|
1486 | $x =~ s/4/four/; # doesn't do it all:
|
---|
1487 | # $x contains "I batted four for 4"
|
---|
1488 | $x = "I batted 4 for 4";
|
---|
1489 | $x =~ s/4/four/g; # does it all:
|
---|
1490 | # $x contains "I batted four for four"
|
---|
1491 |
|
---|
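The return value of C<s///> is often as useful as the modified string. As a small sketch (the variable names C<$count> and C<$none> are illustrative):

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";

my $count = ($x =~ s/4/four/g);   # number of substitutions made
my $none  = ($x =~ s/5/five/g);   # nothing to replace, so a false value

print "Replaced $count occurrences: $x\n";
print "No 5s were found\n" unless $none;
```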
1492 | If you prefer 'regex' over 'regexp' in this tutorial, you could use
|
---|
1493 | the following program to replace it:
|
---|
1494 |
|
---|
1495 | % cat > simple_replace
|
---|
1496 | #!/usr/bin/perl
|
---|
1497 | $regexp = shift;
|
---|
1498 | $replacement = shift;
|
---|
1499 | while (<>) {
|
---|
1500 | s/$regexp/$replacement/go;
|
---|
1501 | print;
|
---|
1502 | }
|
---|
1503 | ^D
|
---|
1504 |
|
---|
1505 | % simple_replace regexp regex perlretut.pod
|
---|
1506 |
|
---|
1507 | In C<simple_replace> we used the C<s///g> modifier to replace all
|
---|
1508 | occurrences of the regexp on each line and the C<s///o> modifier to
|
---|
1509 | compile the regexp only once. As with C<simple_grep>, both the
|
---|
1510 | C<print> and the C<s/$regexp/$replacement/go> use C<$_> implicitly.
|
---|
1511 |
|
---|
1512 | A modifier available specifically to search and replace is the
|
---|
1513 | C<s///e> evaluation modifier. C<s///e> wraps an C<eval{...}> around
|
---|
1514 | the replacement string and the evaluated result is substituted for the
|
---|
1515 | matched substring. C<s///e> is useful if you need to do a bit of
|
---|
1516 | computation in the process of replacing text. This example counts
|
---|
1517 | character frequencies in a line:
|
---|
1518 |
|
---|
1519 | $x = "Bill the cat";
|
---|
1520 | $x =~ s/(.)/$chars{$1}++;$1/eg; # final $1 replaces char with itself
|
---|
1521 | print "frequency of '$_' is $chars{$_}\n"
|
---|
1522 | foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);
|
---|
1523 |
|
---|
1524 | This prints
|
---|
1525 |
|
---|
1526 | frequency of ' ' is 2
|
---|
1527 | frequency of 't' is 2
|
---|
1528 | frequency of 'l' is 2
|
---|
1529 | frequency of 'B' is 1
|
---|
1530 | frequency of 'c' is 1
|
---|
1531 | frequency of 'e' is 1
|
---|
1532 | frequency of 'h' is 1
|
---|
1533 | frequency of 'i' is 1
|
---|
1534 | frequency of 'a' is 1
|
---|
1535 |
|
---|
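A simpler use of C<s///e> is arithmetic on matched numbers. In this sketch the sample strings are made up for illustration:

```perl
use strict;
use warnings;

my $x = "I have 2 cats and 10 dogs";
$x =~ s/(\d+)/$1 * 2/eg;          # the replacement is code: double each number

my $y = "3 + 4 = ?";
$y =~ s/(\d+) \+ (\d+) = \?/"$1 + $2 = " . ($1 + $2)/e;   # fill in the sum

print "$x\n";
print "$y\n";
```

Without the C<e> modifier the replacement would be an ordinary double quoted string, and the first substitution would insert the literal text C<2 * 2> instead of C<4>.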
1536 | As with the match C<m//> operator, C<s///> can use other delimiters,
|
---|
1537 | such as C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are
|
---|
1538 | used C<s'''>, then the regexp and replacement are treated as single
|
---|
1539 | quoted strings and there are no substitutions. C<s///> in list context
|
---|
1540 | returns the same thing as in scalar context, i.e., the number of
|
---|
1541 | matches.
|
---|
1542 |
|
---|
1543 | B<The split operator>
|
---|
1544 |
|
---|
1545 | The B<C<split> > function can also optionally use a matching operator
|
---|
1546 | C<m//> to split a string. C<split /regexp/, string, limit> splits
|
---|
1547 | C<string> into a list of substrings and returns that list. The regexp
|
---|
1548 | is used to match the character sequence that the C<string> is split
|
---|
1549 | with respect to. The C<limit>, if present, constrains splitting into
|
---|
1550 | no more than C<limit> number of strings. For example, to split a
|
---|
1551 | string into words, use
|
---|
1552 |
|
---|
1553 | $x = "Calvin and Hobbes";
|
---|
1554 | @words = split /\s+/, $x; # $words[0] = 'Calvin'
|
---|
1555 | # $words[1] = 'and'
|
---|
1556 | # $words[2] = 'Hobbes'
|
---|
1557 |
|
---|
1558 | If the empty regexp C<//> is used, the regexp always matches and
|
---|
1559 | the string is split into individual characters. If the regexp has
|
---|
1560 | groupings, then the list produced contains the matched substrings from the
|
---|
1561 | groupings as well. For instance,
|
---|
1562 |
|
---|
1563 | $x = "/usr/bin/perl";
|
---|
1564 | @dirs = split m!/!, $x; # $dirs[0] = ''
|
---|
1565 | # $dirs[1] = 'usr'
|
---|
1566 | # $dirs[2] = 'bin'
|
---|
1567 | # $dirs[3] = 'perl'
|
---|
1568 | @parts = split m!(/)!, $x; # $parts[0] = ''
|
---|
1569 | # $parts[1] = '/'
|
---|
1570 | # $parts[2] = 'usr'
|
---|
1571 | # $parts[3] = '/'
|
---|
1572 | # $parts[4] = 'bin'
|
---|
1573 | # $parts[5] = '/'
|
---|
1574 | # $parts[6] = 'perl'
|
---|
1575 |
|
---|
1576 | Since the first character of C<$x> matched the regexp, C<split> prepended
|
---|
1577 | an empty initial element to the list.
|
---|
1578 |
|
---|
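The C<limit> argument and the empty regexp are easiest to understand side by side. The following is a sketch with a made-up colon-separated record; the variable names are illustrative.

```perl
use strict;
use warnings;

my $record = "root:x:0:0:superuser:/root:/bin/sh";

my @all   = split /:/, $record;       # no limit: every field is split out
my @first = split /:/, $record, 3;    # at most 3 strings; the rest stays intact
my @chars = split //, "perl";         # empty regexp: individual characters

print scalar(@all), " fields\n";
print "third piece with limit 3: $first[2]\n";
print "characters: @chars\n";
```

With a limit of 3, splitting stops after the second colon, so the final element still contains the remaining colons.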
1579 | If you have read this far, congratulations! You now have all the basic
|
---|
1580 | tools needed to use regular expressions to solve a wide range of text
|
---|
1581 | processing problems. If this is your first time through the tutorial,
|
---|
1582 | why not stop here and play around with regexps a while... S<Part 2>
|
---|
1583 | concerns the more esoteric aspects of regular expressions and those
|
---|
1584 | concepts certainly aren't needed right at the start.
|
---|
1585 |
|
---|
1586 | =head1 Part 2: Power tools
|
---|
1587 |
|
---|
1588 | OK, you know the basics of regexps and you want to know more. If
|
---|
1589 | matching regular expressions is analogous to a walk in the woods, then
|
---|
1590 | the tools discussed in Part 1 are analogous to topo maps and a
|
---|
1591 | compass, basic tools we use all the time. Most of the tools in Part 2
|
---|
1592 | are analogous to flare guns and satellite phones. They aren't used
|
---|
1593 | too often on a hike, but when we are stuck, they can be invaluable.
|
---|
1594 |
|
---|
1595 | What follows are the more advanced, less used, or sometimes esoteric
|
---|
1596 | capabilities of perl regexps. In Part 2, we will assume you are
|
---|
1597 | comfortable with the basics and concentrate on the new features.
|
---|
1598 |
|
---|
1599 | =head2 More on characters, strings, and character classes
|
---|
1600 |
|
---|
1601 | There are a number of escape sequences and character classes that we
|
---|
1602 | haven't covered yet.
|
---|
1603 |
|
---|
1604 | There are several escape sequences that convert characters or strings
|
---|
1605 | between upper and lower case. C<\l> and C<\u> convert the next
|
---|
1606 | character to lower or upper case, respectively:
|
---|
1607 |
|
---|
1608 | $x = "perl";
|
---|
1609 | $string =~ /\u$x/; # matches 'Perl' in $string
|
---|
1610 | $x = "M(rs?|s)\\."; # note the double backslash
|
---|
1611 | $string =~ /\l$x/; # matches 'mr.', 'mrs.', and 'ms.'
|
---|
1612 |
|
---|
1613 | C<\L> and C<\U> convert a whole substring, delimited by C<\L> or
|
---|
1614 | C<\U> and C<\E>, to lower or upper case:
|
---|
1615 |
|
---|
1616 | $x = "This word is in lower case:\L SHOUT\E";
|
---|
1617 | $x =~ /shout/; # matches
|
---|
1618 | $x = "I STILL KEYPUNCH CARDS FOR MY 360";
|
---|
1619 | $x =~ /\Ukeypunch/; # matches punch card string
|
---|
1620 |
|
---|
1621 | If there is no C<\E>, case is converted until the end of the
|
---|
1622 | string. The regexps C<\L\u$word> or C<\u\L$word> convert the first
|
---|
1623 | character of C<$word> to uppercase and the rest of the characters to
|
---|
1624 | lowercase.
|
---|
1625 |
|
---|
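The C<\u\L> combination is handy for normalizing names with inconsistent capitalization. A minimal sketch, with illustrative data and variable names:

```perl
use strict;
use warnings;

my @raw = ('aLICE', 'BOB', 'carol');

# \L lowercases the interpolated word, \u then uppercases its first character.
my @names;
push @names, "\u\L$_\E" for @raw;

# The same escapes work inside a regexp: here the pattern becomes \bPerl\b.
my $word  = "pERL";
my $found = ("I like Perl." =~ /\b\u\L$word\E\b/) ? 1 : 0;

print "@names\n";
print "found: $found\n";
```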
1626 | Control characters can be escaped with C<\c>, so that a control-Z
|
---|
1627 | character would be matched with C<\cZ>. The escape sequence
|
---|
1628 | C<\Q>...C<\E> quotes, or protects, most non-alphabetic characters. For
|
---|
1629 | instance,
|
---|
1630 |
|
---|
1631 | $x = "That !^*&%~& cat!";
|
---|
1632 | $x =~ /\Q!^*&%~&\E/; # check for rough language
|
---|
1633 |
|
---|
1634 | It does not protect C<$> or C<@>, so that variables can still be
|
---|
1635 | substituted.
|
---|
1636 |
|
---|
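Because variables are still substituted inside C<\Q>...C<\E>, the sequence is a convenient way to match text held in a variable literally. In this sketch, C<$literal> and C<$text> are made-up examples:

```perl
use strict;
use warnings;

my $literal = "1+1=2";
my $text    = "We all know that 1+1=2 holds.";

# Without \Q the '+' is a quantifier, so the regexp looks for '11=2' and fails.
my $plain  = ($text =~ /$literal/)     ? 1 : 0;

# With \Q the '+' is escaped and the string is matched character for character.
my $quoted = ($text =~ /\Q$literal\E/) ? 1 : 0;

print "plain: $plain, quoted: $quoted\n";
```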
1637 | With the advent of 5.6.0, perl regexps can handle more than just the
|
---|
1638 | standard ASCII character set. Perl now supports B<Unicode>, a standard
|
---|
1639 | for encoding the character sets from many of the world's written
|
---|
1640 | languages. Unicode does this by allowing characters to be more than
|
---|
1641 | one byte wide. Perl uses the UTF-8 encoding, in which ASCII characters
|
---|
1642 | are still encoded as one byte, but characters greater than C<chr(127)>
|
---|
1643 | may be stored as two or more bytes.
|
---|
1644 |
|
---|
1645 | What does this mean for regexps? Well, regexp users don't need to know
|
---|
1646 | much about perl's internal representation of strings. But they do need
|
---|
1647 | to know 1) how to represent Unicode characters in a regexp and 2) when
|
---|
1648 | a matching operation will treat the string to be searched as a
|
---|
1649 | sequence of bytes (the old way) or as a sequence of Unicode characters
|
---|
1650 | (the new way). The answer to 1) is that Unicode characters greater
|
---|
1651 | than C<chr(127)> may be represented using the C<\x{hex}> notation,
|
---|
1652 | with C<hex> a hexadecimal integer:
|
---|
1653 |
|
---|
1654 | /\x{263a}/; # match a Unicode smiley face :)
|
---|
1655 |
|
---|
1656 | Unicode characters in the range of 128-255 use two hexadecimal digits
|
---|
1657 | with braces: C<\x{ab}>. Note that this is different than C<\xab>,
|
---|
1658 | which is just a hexadecimal byte with no Unicode significance.
|
---|
1659 |
|
---|
1660 | B<NOTE>: in Perl 5.6.0 it used to be that one needed to say C<use
|
---|
1661 | utf8> to use any Unicode features. This is no longer the case: for
|
---|
1662 | almost all Unicode processing, the explicit C<utf8> pragma is not
|
---|
1663 | needed. (The only case where it matters is if your Perl script is in
|
---|
1664 | Unicode and encoded in UTF-8, then an explicit C<use utf8> is needed.)
|
---|
1665 |
|
---|
1666 | Figuring out the hexadecimal sequence of a Unicode character you want
|
---|
1667 | or deciphering someone else's hexadecimal Unicode regexp is about as
|
---|
1668 | much fun as programming in machine code. So another way to specify
|
---|
1669 | Unicode characters is to use the S<B<named character> > escape
|
---|
1670 | sequence C<\N{name}>. C<name> is a name for the Unicode character, as
|
---|
1671 | specified in the Unicode standard. For instance, if we wanted to
|
---|
1672 | represent or match the astrological sign for the planet Mercury, we
|
---|
1673 | could use
|
---|
1674 |
|
---|
1675 | use charnames ":full"; # use named chars with Unicode full names
|
---|
1676 | $x = "abc\N{MERCURY}def";
|
---|
1677 | $x =~ /\N{MERCURY}/; # matches
|
---|
1678 |
|
---|
1679 | One can also use short names or restrict names to a certain alphabet:
|
---|
1680 |
|
---|
1681 | use charnames ':full';
|
---|
1682 | print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n";
|
---|
1683 |
|
---|
1684 | use charnames ":short";
|
---|
1685 | print "\N{greek:Sigma} is an upper-case sigma.\n";
|
---|
1686 |
|
---|
1687 | use charnames qw(greek);
|
---|
1688 | print "\N{sigma} is Greek sigma\n";
|
---|
1689 |
|
---|
1690 | A list of full names is found in the file Names.txt in the
|
---|
1691 | lib/perl5/5.X.X/unicore directory.
|
---|
1692 |
|
---|
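The two notations describe the same characters, which a short check can confirm. This sketch assumes the standard code points U+263A (the smiley) and U+03C3 (Greek small sigma); the variable names are illustrative.

```perl
use strict;
use warnings;
use charnames ':full';

my $smiley = "\x{263a}";
my $sigma  = "\N{GREEK SMALL LETTER SIGMA}";

my $one_char = length($smiley);                   # one character, however it is stored
my $same     = ($sigma eq "\x{3c3}") ? 1 : 0;     # named and hex forms agree
my $hit      = ("abc${sigma}def" =~ /\N{GREEK SMALL LETTER SIGMA}/) ? 1 : 0;
my $code     = sprintf "%04X", ord $sigma;        # code point in hexadecimal

print "length $one_char, same $same, hit $hit, code U+$code\n";
```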
1693 | The answer to requirement 2), as of 5.6.0, is that if a regexp
|
---|
1694 | contains Unicode characters, the string is searched as a sequence of
|
---|
1695 | Unicode characters. Otherwise, the string is searched as a sequence of
|
---|
1696 | bytes. If the string is being searched as a sequence of Unicode
|
---|
1697 | characters, but matching a single byte is required, we can use the C<\C>
|
---|
1698 | escape sequence. C<\C> is a character class akin to C<.> except that
|
---|
1699 | it matches I<any> byte 0-255. So
|
---|
1700 |
|
---|
1701 | use charnames ":full"; # use named chars with Unicode full names
|
---|
1702 | $x = "a";
|
---|
1703 | $x =~ /\C/; # matches 'a', eats one byte
|
---|
1704 | $x = "";
|
---|
1705 | $x =~ /\C/; # doesn't match, no bytes to match
|
---|
1706 | $x = "\N{MERCURY}"; # multi-byte Unicode character
|
---|
1707 | $x =~ /\C/; # matches, but dangerous!
|
---|
1708 |
|
---|
1709 | The last regexp matches, but is dangerous because the string
|
---|
1710 | I<character> position is no longer synchronized to the string I<byte>
|
---|
1711 | position. This generates the warning 'Malformed UTF-8
|
---|
1712 | character'. The C<\C> is best used for matching the binary data in strings
|
---|
1713 | with binary data intermixed with Unicode characters.
|
---|
1714 |
|
---|
1715 | Let us now discuss the rest of the character classes. Just as with
|
---|
1716 | Unicode characters, there are named Unicode character classes
|
---|
1717 | represented by the C<\p{name}> escape sequence. Closely associated is
|
---|
1718 | the C<\P{name}> character class, which is the negation of the
|
---|
1719 | C<\p{name}> class. For example, to match lower and uppercase
|
---|
1720 | characters,
|
---|
1721 |
|
---|
1722 | use charnames ":full"; # use named chars with Unicode full names
|
---|
1723 | $x = "BOB";
|
---|
1724 | $x =~ /^\p{IsUpper}/; # matches, uppercase char class
|
---|
1725 | $x =~ /^\P{IsUpper}/; # doesn't match, char class sans uppercase
|
---|
1726 | $x =~ /^\p{IsLower}/; # doesn't match, lowercase char class
|
---|
1727 | $x =~ /^\P{IsLower}/; # matches, char class sans lowercase
|
---|
1728 |
|
---|
1729 | Here is the association between some Perl named classes and the
|
---|
1730 | traditional Unicode classes:
|
---|
1731 |
|
---|
1732 | Perl class name Unicode class name or regular expression
|
---|
1733 |
|
---|
1734 | IsAlpha /^[LM]/
|
---|
1735 | IsAlnum /^[LMN]/
|
---|
1736 | IsASCII $code <= 127
|
---|
1737 | IsCntrl /^C/
|
---|
1738 | IsBlank $code =~ /^(0020|0009)$/ || /^Z[^lp]/
|
---|
1739 | IsDigit Nd
|
---|
1740 | IsGraph /^([LMNPS]|Co)/
|
---|
1741 | IsLower Ll
|
---|
1742 | IsPrint /^([LMNPS]|Co|Zs)/
|
---|
1743 | IsPunct /^P/
|
---|
1744 | IsSpace /^Z/ || ($code =~ /^(0009|000A|000B|000C|000D)$/)
|
---|
1745 | IsSpacePerl /^Z/ || ($code =~ /^(0009|000A|000C|000D|0085|2028|2029)$/)
|
---|
1746 | IsUpper /^L[ut]/
|
---|
1747 | IsWord /^[LMN]/ || $code eq "005F"
|
---|
1748 | IsXDigit $code =~ /^00(3[0-9]|[46][1-6])$/
|
---|
1749 |
|
---|
1750 | You can also use the official Unicode class names with the C<\p> and
|
---|
1751 | C<\P>, like C<\p{L}> for Unicode 'letters', or C<\p{Lu}> for uppercase
|
---|
1752 | letters, or C<\P{Nd}> for non-digits. If a C<name> is just one
|
---|
1753 | letter, the braces can be dropped. For instance, C<\pM> is the
|
---|
1754 | character class of Unicode 'marks', for example accent marks.
|
---|
1755 | For the full list see L<perlunicode>.
|
---|
1756 |
|
---|
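The official class names combine well with list-context C<//g> for counting characters of a given kind. The following is a sketch with a made-up string that begins with a Greek capital delta (U+0394); the counter names are illustrative.

```perl
use strict;
use warnings;

my $s = "\x{394}elta and Alpha";     # starts with GREEK CAPITAL LETTER DELTA

my $upper   = () = $s =~ /\p{Lu}/g;  # uppercase letters: the delta and 'A'
my $letters = () = $s =~ /\pL/g;     # any letter; braces dropped for a one-letter name
my $others  = () = $s =~ /\PL/g;     # negated class: here, the two spaces

print "upper: $upper, letters: $letters, others: $others\n";
```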
1757 | Unicode has also been separated into various sets of characters
|
---|
1758 | which you can test with C<\p{In...}> (in) and C<\P{In...}> (not in),
|
---|
1759 | for example C<\p{Latin}>, C<\p{Greek}>, or C<\P{Katakana}>.
|
---|
1760 | For the full list see L<perlunicode>.
|
---|
1761 |
|
---|
1762 | C<\X> is an abbreviation for a character class sequence that includes
|
---|
1763 | the Unicode 'combining character sequences'. A 'combining character
|
---|
1764 | sequence' is a base character followed by any number of combining
|
---|
1765 | characters. An example of a combining character is an accent. Using
|
---|
1766 | the Unicode full names, e.g., S<C<A + COMBINING RING> > is a combining
|
---|
1767 | character sequence with base character C<A> and combining character
|
---|
1768 | S<C<COMBINING RING> >, which translates in Danish to A with the circle
|
---|
1769 | atop it, as in the word Angstrom. C<\X> is equivalent to C<\PM\pM*>,
|
---|
1770 | i.e., a non-mark followed by zero or more marks.
|
---|
1771 |
|
---|
1772 | For the full and latest information about Unicode see the latest
|
---|
1773 | Unicode standard, or the Unicode Consortium's website http://www.unicode.org/
|
---|
1774 |
|
---|
1775 | As if all those classes weren't enough, Perl also defines POSIX style
|
---|
1776 | character classes. These have the form C<[:name:]>, with C<name> the
|
---|
1777 | name of the POSIX class. The POSIX classes are C<alpha>, C<alnum>,
|
---|
1778 | C<ascii>, C<cntrl>, C<digit>, C<graph>, C<lower>, C<print>, C<punct>,
|
---|
1779 | C<space>, C<upper>, and C<xdigit>, and two extensions, C<word> (a Perl
|
---|
1780 | extension to match C<\w>), and C<blank> (a GNU extension). If C<utf8>
|
---|
1781 | is being used, then these classes are defined the same as their
|
---|
1782 | corresponding perl Unicode classes: C<[:upper:]> is the same as
|
---|
1783 | C<\p{IsUpper}>, etc. The POSIX character classes, however, don't
|
---|
1784 | require using C<utf8>. The C<[:digit:]>, C<[:word:]>, and
|
---|
1785 | C<[:space:]> correspond to the familiar C<\d>, C<\w>, and C<\s>
|
---|
1786 | character classes. To negate a POSIX class, put a C<^> in front of
|
---|
1787 | the name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and under
|
---|
1788 | C<utf8>, C<\P{IsDigit}>. The Unicode and POSIX character classes can
|
---|
1789 | be used just like C<\d>, with the exception that POSIX character
|
---|
1790 | classes can only be used inside of a character class:
|
---|
1791 |
|
---|
1792 | /\s+[abc[:digit:]xyz]\s*/; # match a,b,c,x,y,z, or a digit
|
---|
1793 | /^=item\s[[:digit:]]/; # match '=item',
|
---|
1794 | # followed by a space and a digit
|
---|
1795 | use charnames ":full";
|
---|
1796 | /\s+[abc\p{IsDigit}xyz]\s+/; # match a,b,c,x,y,z, or a digit
|
---|
1797 | /^=item\s\p{IsDigit}/; # match '=item',
|
---|
1798 | # followed by a space and a digit
|
---|
1799 |
|
---|
1800 | Whew! That is all the rest of the characters and character classes.
|
---|
1801 |
|
---|
1802 | =head2 Compiling and saving regular expressions
|
---|
1803 |
|
---|
1804 | In Part 1 we discussed the C<//o> modifier, which compiles a regexp
|
---|
1805 | just once. This suggests that a compiled regexp is some data structure
|
---|
1806 | that can be stored once and used again and again. The regexp quote
|
---|
1807 | C<qr//> does exactly that: C<qr/string/> compiles the C<string> as a
|
---|
1808 | regexp and transforms the result into a form that can be assigned to a
|
---|
1809 | variable:
|
---|
1810 |
|
---|
1811 | $reg = qr/foo+bar?/; # reg contains a compiled regexp
|
---|
1812 |
|
---|
1813 | Then C<$reg> can be used as a regexp:
|
---|
1814 |
|
---|
1815 | $x = "fooooba";
|
---|
1816 | $x =~ $reg; # matches, just like /foo+bar?/
|
---|
1817 | $x =~ /$reg/; # same thing, alternate form
|
---|
1818 |
|
---|
1819 | C<$reg> can also be interpolated into a larger regexp:
|
---|
1820 |
|
---|
1821 | $x =~ /(abc)?$reg/; # still matches
|
---|
1822 |
|
---|
1823 | As with the matching operator, the regexp quote can use different
|
---|
1824 | delimiters, e.g., C<qr!!>, C<qr{}> and C<qr~~>. The single quote
|
---|
1825 | delimiters C<qr''> prevent any interpolation from taking place.
|
---|
1826 |
|
---|
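Interpolation lets larger regexps be assembled from small, named pieces. As a minimal sketch (the names C<$int> and C<$pair> and the sample string are invented for illustration):

```perl
use strict;
use warnings;

my $int  = qr/[+-]?\d+/;                          # a signed integer
my $pair = qr/\(\s*($int)\s*,\s*($int)\s*\)/;     # "(int, int)", capturing both

# In list context the match returns the two captured integers.
my ($px, $py) = "move to (3, -4) now" =~ $pair;

print "x = $px, y = $py\n";
```

Each piece can be tested on its own before it is combined, which is the same divide-and-conquer approach used to build the number regexp in Part 1.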
1827 | Pre-compiled regexps are useful for creating dynamic matches that
|
---|
1828 | don't need to be recompiled each time they are encountered. Using
|
---|
1829 | pre-compiled regexps, the C<simple_grep> program can be expanded into a
|
---|
1830 | program that matches multiple patterns:
|
---|
1831 |
|
---|
1832 | % cat > multi_grep
|
---|
1833 | #!/usr/bin/perl
|
---|
1834 | # multi_grep - match any of <number> regexps
|
---|
1835 | # usage: multi_grep <number> regexp1 regexp2 ... file1 file2 ...
|
---|
1836 |
|
---|
1837 | $number = shift;
|
---|
1838 | $regexp[$_] = shift foreach (0..$number-1);
|
---|
1839 | @compiled = map qr/$_/, @regexp;
|
---|
1840 | while ($line = <>) {
|
---|
1841 | foreach $pattern (@compiled) {
|
---|
1842 | if ($line =~ /$pattern/) {
|
---|
1843 | print $line;
|
---|
1844 | last; # we matched, so move onto the next line
|
---|
1845 | }
|
---|
1846 | }
|
---|
1847 | }
|
---|
1848 | ^D
|
---|
1849 |
|
---|
1850 | % multi_grep 2 last for multi_grep
|
---|
1851 | $regexp[$_] = shift foreach (0..$number-1);
|
---|
1852 | foreach $pattern (@compiled) {
|
---|
1853 | last;
|
---|
1854 |
|
---|
1855 | Storing pre-compiled regexps in an array C<@compiled> allows us to
|
---|
1856 | simply loop through the regexps without any recompilation, thus gaining
|
---|
1857 | flexibility without sacrificing speed.
|
---|
1858 |
|
---|
1859 | =head2 Embedding comments and modifiers in a regular expression
|
---|
1860 |
|
---|
1861 | Starting with this section, we will be discussing Perl's set of
|
---|
1862 | B<extended patterns>. These are extensions to the traditional regular
|
---|
1863 | expression syntax that provide powerful new tools for pattern
|
---|
1864 | matching. We have already seen extensions in the form of the minimal
|
---|
1865 | matching constructs C<??>, C<*?>, C<+?>, C<{n,m}?>, and C<{n,}?>. The
|
---|
1866 | rest of the extensions below have the form C<(?char...)>, where the
|
---|
1867 | C<char> is a character that determines the type of extension.
|
---|
1868 |
|
---|
1869 | The first extension is an embedded comment C<(?#text)>. This embeds a
|
---|
1870 | comment into the regular expression without affecting its meaning. The
|
---|
1871 | comment should not have any closing parentheses in the text. An
|
---|
1872 | example is
|
---|
1873 |
|
---|
1874 | /(?# Match an integer:)[+-]?\d+/;
|
---|
1875 |
|
---|
1876 | This style of commenting has been largely superseded by the raw,
|
---|
1877 | freeform commenting that is allowed with the C<//x> modifier.
|
---|
1878 |
|
---|
1879 | The modifiers C<//i>, C<//m>, C<//s>, and C<//x> can also be embedded in
|
---|
1880 | a regexp using C<(?i)>, C<(?m)>, C<(?s)>, and C<(?x)>. For instance,
|
---|
1881 |
|
---|
1882 | /(?i)yes/; # match 'yes' case insensitively
|
---|
1883 | /yes/i; # same thing
|
---|
1884 | /(?x)( # freeform version of an integer regexp
|
---|
1885 | [+-]? # match an optional sign
|
---|
1886 | \d+ # match a sequence of digits
|
---|
1887 | )
|
---|
1888 | /x;
|
---|
1889 |
|
---|
1890 | Embedded modifiers can have two important advantages over the usual
|
---|
1891 | modifiers. Embedded modifiers allow a custom set of modifiers for
|
---|
1892 | I<each> regexp pattern. This is great for matching an array of regexps
|
---|
1893 | that must have different modifiers:
|
---|
1894 |
|
---|
1895 | $pattern[0] = '(?i)doctor';
|
---|
1896 | $pattern[1] = 'Johnson';
|
---|
1897 | ...
|
---|
1898 | while (<>) {
|
---|
1899 | foreach $patt (@pattern) {
|
---|
1900 | print if /$patt/;
|
---|
1901 | }
|
---|
1902 | }
|
---|
1903 |
|
---|
1904 | The second advantage is that embedded modifiers only affect the regexp
|
---|
1905 | inside the group the embedded modifier is contained in. So grouping
|
---|
1906 | can be used to localize the modifier's effects:
|
---|
1907 |
|
---|
1908 | /Answer: ((?i)yes)/; # matches 'Answer: yes', 'Answer: YES', etc.
|
---|
1909 |
|
---|
1910 | Embedded modifiers can also turn off any modifiers already present
|
---|
1911 | by using, e.g., C<(?-i)>. Modifiers can also be combined into
|
---|
1912 | a single expression, e.g., C<(?s-i)> turns on single line mode and
|
---|
1913 | turns off case insensitivity.
|
---|
1914 |
|
---|
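The scoping rules for embedded modifiers can be tried out with a short sketch. The pattern list and the sample lines here are hypothetical; the point is only to show a per-pattern modifier, a modifier confined to a group, and C<(?-i)> switching a modifier back off.

```perl
use strict;
use warnings;

# Hypothetical pattern list: only the first entry asks for case insensitivity.
my @pattern = ('(?i)doctor', 'Johnson');
my @lines   = ('DOCTOR Who', 'johnson', 'Dr. Johnson');

my @hits;
for my $line (@lines) {
    for my $patt (@pattern) {
        push @hits, "${line} matches ${patt}" if $line =~ /$patt/;
    }
}
print "$_\n" for @hits;

# A modifier placed inside a group stops applying at the closing parenthesis.
my $inside  = 'Answer: YES' =~ /Answer: ((?i)yes)/ ? 1 : 0;   # 'YES' is inside the group
my $outside = 'ANSWER: yes' =~ /Answer: ((?i)yes)/ ? 1 : 0;   # 'ANSWER' is outside it

# (?-i) switches case insensitivity back off partway through a pattern.
my $mixed = 'YES' =~ /(?i)y(?-i)es/ ? 1 : 0;                  # 'ES' would have to be lowercase
print "inside=$inside outside=$outside mixed=$mixed\n";
```

Here C<johnson> is not reported, because the second pattern carries no C<(?i)> and stays case sensitive.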
1915 | =head2 Non-capturing groupings
|
---|
1916 |
|
---|
1917 | We noted in Part 1 that groupings C<()> had two distinct functions: 1)
|
---|
1918 | group regexp elements together as a single unit, and 2) extract, or
|
---|
1919 | capture, substrings that matched the regexp in the
|
---|
1920 | grouping. Non-capturing groupings, denoted by C<(?:regexp)>, allow the
|
---|
1921 | regexp to be treated as a single unit, but don't extract substrings or
|
---|
1922 | set matching variables C<$1>, etc. Both capturing and non-capturing
|
---|
1923 | groupings are allowed to co-exist in the same regexp. Because there is
|
---|
1924 | no extraction, non-capturing groupings are faster than capturing
|
---|
1925 | groupings. Non-capturing groupings are also handy for choosing exactly
|
---|
1926 | which parts of a regexp are to be extracted to matching variables:
|
---|
1927 |
|
---|
1928 | # match a number, $1-$4 are set, but we only want $1
|
---|
1929 | /([+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)/;
|
---|
1930 |
|
---|
1931 | # match a number faster, only $1 is set
|
---|
1932 | /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)/;
|
---|
1933 |
|
---|
1934 | # match a number, get $1 = whole number, $2 = exponent
|
---|
1935 | /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;
|
---|
1936 |
|
---|
1937 | Non-capturing groupings are also useful for removing nuisance
|
---|
1938 | elements gathered from a split operation:
|
---|
1939 |
|
---|
1940 | $x = '12a34b5';
|
---|
1941 | @num = split /(a|b)/, $x; # @num = ('12','a','34','b','5')
|
---|
1942 | @num = split /(?:a|b)/, $x; # @num = ('12','34','5')
|
---|
1943 |
|
---|
1944 | Non-capturing groupings may also have embedded modifiers:
|
---|
1945 | C<(?i-m:regexp)> is a non-capturing grouping that matches C<regexp>
|
---|
1946 | case insensitively and turns off multi-line mode.
|
---|
1947 |
|
---|
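To see how the choice of grouping changes what comes back from a match, here is a small sketch that runs the three number regexps in list context. The sample value C<-1.5e10> and the array names are assumptions made for the illustration.

```perl
use strict;
use warnings;

my $number = '-1.5e10';   # hypothetical sample input

# All groups capture: four values come back even though only the first is wanted.
my @all = $number =~ /([+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)/;

# Inner groups made non-capturing: only the whole number is returned.
my @whole = $number =~ /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)/;

# Whole number plus the exponent value.
my @parts = $number =~ /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;

printf "%d captures, %d capture, %d captures\n",
    scalar @all, scalar @whole, scalar @parts;

# An embedded modifier in a non-capturing group: split on 'a' or 'b' in either case.
my @num = split /(?i:a|b)/, '12A34b5';
print "@num\n";
```

In the first match the extra elements are the mantissa, its fractional part, and the exponent with its leading C<e>, none of which the caller asked for.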
1948 | =head2 Looking ahead and looking behind
|
---|
1949 |
|
---|
1950 | This section concerns the lookahead and lookbehind assertions. First,
|
---|
1951 | a little background.
|
---|
1952 |
|
---|
1953 | In Perl regular expressions, most regexp elements 'eat up' a certain
|
---|
1954 | amount of string when they match. For instance, the regexp element
|
---|
1955 | C<[abc]> eats up one character of the string when it matches, in the
|
---|
1956 | sense that perl moves to the next character position in the string
|
---|
1957 | after the match. There are some elements, however, that don't eat up
|
---|
1958 | characters (advance the character position) if they match. The examples
|
---|
1959 | we have seen so far are the anchors. The anchor C<^> matches the
|
---|
1960 | beginning of the line, but doesn't eat any characters. Similarly, the
|
---|
1961 | word boundary anchor C<\b> matches, e.g., if the character to the left
|
---|
1962 | is a word character and the character to the right is a non-word
|
---|
1963 | character, but it doesn't eat up any characters itself. Anchors are
|
---|
1964 | examples of 'zero-width assertions'. Zero-width, because they consume
|
---|
1965 | no characters, and assertions, because they test some property of the
|
---|
1966 | string. In the context of our walk in the woods analogy to regexp
|
---|
1967 | matching, most regexp elements move us along a trail, but anchors have
|
---|
1968 | us stop a moment and check our surroundings. If the local environment
|
---|
1969 | checks out, we can proceed forward. But if the local environment
|
---|
1970 | doesn't satisfy us, we must backtrack.
|
---|
1971 |
|
---|
1972 | Checking the environment entails either looking ahead on the trail,
|
---|
1973 | looking behind, or both. C<^> looks behind, to see that there are no
|
---|
1974 | characters before. C<$> looks ahead, to see that there are no
|
---|
1975 | characters after. C<\b> looks both ahead and behind, to see if the
|
---|
1976 | characters on either side differ in their 'word'-ness.
|
---|
1977 |
|
---|
1978 | The lookahead and lookbehind assertions are generalizations of the
|
---|
1979 | anchor concept. Lookahead and lookbehind are zero-width assertions
|
---|
1980 | that let us specify which characters we want to test for. The
|
---|
1981 | lookahead assertion is denoted by C<(?=regexp)> and the lookbehind
|
---|
1982 | assertion is denoted by C<< (?<=fixed-regexp) >>. Some examples are
|
---|
1983 |
|
---|
1984 | $x = "I catch the housecat 'Tom-cat' with catnip";
|
---|
1985 | $x =~ /cat(?=\s+)/; # matches 'cat' in 'housecat'
|
---|
1986 | @catwords = ($x =~ /(?<=\s)cat\w+/g); # matches,
|
---|
1987 | # $catwords[0] = 'catch'
|
---|
1988 | # $catwords[1] = 'catnip'
|
---|
1989 | $x =~ /\bcat\b/; # matches 'cat' in 'Tom-cat'
|
---|
1990 | $x =~ /(?<=\s)cat(?=\s)/; # doesn't match; no isolated 'cat' in
|
---|
1991 | # middle of $x
|
---|
1992 |
|
---|
1993 | Note that the parentheses in C<(?=regexp)> and C<< (?<=regexp) >> are
|
---|
1994 | non-capturing, since these are zero-width assertions. Thus in the
|
---|
1995 | second regexp, the substrings captured are those of the whole regexp
|
---|
1996 | itself. Lookahead C<(?=regexp)> can match arbitrary regexps, but
|
---|
1997 | lookbehind C<< (?<=fixed-regexp) >> only works for regexps of fixed
|
---|
1998 | width, i.e., a fixed number of characters long. Thus
|
---|
1999 | C<< (?<=(ab|bc)) >> is fine, but C<< (?<=(ab)*) >> is not. The
|
---|
2000 | negated versions of the lookahead and lookbehind assertions are
|
---|
2001 | denoted by C<(?!regexp)> and C<< (?<!fixed-regexp) >> respectively.
|
---|
2002 | They evaluate true if the regexps do I<not> match:
|
---|
2003 |
|
---|
2004 | $x = "foobar";
|
---|
2005 | $x =~ /foo(?!bar)/; # doesn't match, 'bar' follows 'foo'
|
---|
2006 | $x =~ /foo(?!baz)/; # matches, 'baz' doesn't follow 'foo'
|
---|
2007 | $x =~ /(?<!\s)foo/; # matches, there is no \s before 'foo'
|
---|
2008 |
|
---|
2009 | The C<\C> is unsupported in lookbehind, because the already
|
---|
2010 | treacherous definition of C<\C> would become even more so
|
---|
2011 | when going backwards.
|
---|
2012 |
|
---|
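The idea that lookahead and lookbehind generalize the anchors can be made concrete with a short sketch. It spells out C<\b> using lookarounds, uses a lookbehind and lookahead pair to insert thousands separators, and filters a word list with a negative lookahead. The sample strings and variable names are made up for the illustration.

```perl
use strict;
use warnings;

# \b written out with lookarounds: a word character on exactly one side.
(my $with_b    = 'Tom-cat') =~ s/\b/|/g;
(my $with_look = 'Tom-cat') =~ s/(?<=\w)(?!\w)|(?<!\w)(?=\w)/|/g;

# Insert thousands separators. The lookbehind needs a digit to the left, and the
# lookahead needs a whole number of three-digit groups up to the end of the digits.
(my $commified = '1234567') =~ s/(?<=\d)(?=(?:\d{3})+(?!\d))/,/g;

# Negative lookahead: 'foo' counts only when 'bar' does not follow it.
my @foos = grep { /foo(?!bar)/ } qw(foobar foobaz food);

print "$with_b\n$with_look\n$commified\n@foos\n";
```

Because none of these assertions consume characters, the substitutions only insert text at the positions found; nothing in the original strings is replaced.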
2013 | =head2 Using independent subexpressions to prevent backtracking
|
---|
2014 |
|
---|
2015 | The last few extended patterns in this tutorial are experimental as of
|
---|
2016 | 5.6.0. Play with them, use them in some code, but don't rely on them
|
---|
2017 | just yet for production code.
|
---|
2018 |
|
---|
2019 | S<B<Independent subexpressions> > are regular expressions, in the
|
---|
2020 | context of a larger regular expression, that function independently of
|
---|
2021 | the larger regular expression. That is, they consume as much or as
|
---|
2022 | little of the string as they wish without regard for the ability of
|
---|
2023 | the larger regexp to match. Independent subexpressions are represented
|
---|
2024 | by C<< (?>regexp) >>. We can illustrate their behavior by first
|
---|
2025 | considering an ordinary regexp:
|
---|
2026 |
|
---|
2027 | $x = "ab";
|
---|
2028 | $x =~ /a*ab/; # matches
|
---|
2029 |
|
---|
2030 | This obviously matches, but in the process of matching, the
|
---|
2031 | subexpression C<a*> first grabbed the C<a>. Doing so, however,
|
---|
2032 | wouldn't allow the whole regexp to match, so after backtracking, C<a*>
|
---|
2033 | eventually gave back the C<a> and matched the empty string. Here, what
|
---|
2034 | C<a*> matched was I<dependent> on what the rest of the regexp matched.
|
---|
2035 |
|
---|
2036 | Contrast that with an independent subexpression:
|
---|
2037 |
|
---|
2038 | $x =~ /(?>a*)ab/; # doesn't match!
|
---|
2039 |
|
---|
2040 | The independent subexpression C<< (?>a*) >> doesn't care about the rest
|
---|
2041 | of the regexp, so it sees an C<a> and grabs it. Then the rest of the
|
---|
2042 | regexp C<ab> cannot match. Because C<< (?>a*) >> is independent, there
|
---|
2043 | is no backtracking and the independent subexpression does not give
|
---|
2044 | up its C<a>. Thus the match of the regexp as a whole fails. A similar
|
---|
2045 | behavior occurs with completely independent regexps:
|
---|
2046 |
|
---|
2047 | $x = "ab";
|
---|
2048 | $x =~ /a*/g; # matches, eats an 'a'
|
---|
2049 | $x =~ /\Gab/g; # doesn't match, no 'a' available
|
---|
2050 |
|
---|
2051 | Here C<//g> and C<\G> create a 'tag team' handoff of the string from
|
---|
2052 | one regexp to the other. Regexps with an independent subexpression are
|
---|
2053 | much like this, with a handoff of the string to the independent
|
---|
2054 | subexpression, and a handoff of the string back to the enclosing
|
---|
2055 | regexp.
|
---|
2056 |
|
---|
2057 | The ability of an independent subexpression to prevent backtracking
|
---|
2058 | can be quite useful. Suppose we want to match a non-empty string
|
---|
2059 | enclosed in parentheses up to two levels deep. Then the following
|
---|
2060 | regexp matches:
|
---|
2061 |
|
---|
2062 | $x = "abc(de(fg)h"; # unbalanced parentheses
|
---|
2063 | $x =~ /\( ( [^()]+ | \([^()]*\) )+ \)/x;
|
---|
2064 |
|
---|
2065 | The regexp matches an open parenthesis, one or more copies of an
|
---|
2066 | alternation, and a close parenthesis. The alternation is two-way, with
|
---|
2067 | the first alternative C<[^()]+> matching a substring with no
|
---|
2068 | parentheses and the second alternative C<\([^()]*\)> matching a
|
---|
2069 | substring delimited by parentheses. The problem with this regexp is
|
---|
2070 | that it is pathological: it has nested indeterminate quantifiers
|
---|
2071 | of the form C<(a+|b)+>. We discussed in Part 1 how nested quantifiers
|
---|
2072 | like this could take an exponentially long time to execute if there
|
---|
2073 | was no match possible. To prevent the exponential blowup, we need to
|
---|
2074 | prevent useless backtracking at some point. This can be done by
|
---|
2075 | enclosing the inner quantifier as an independent subexpression:
|
---|
2076 |
|
---|
2077 | $x =~ /\( ( (?>[^()]+) | \([^()]*\) )+ \)/x;
|
---|
2078 |
|
---|
2079 | Here, C<< (?>[^()]+) >> breaks the degeneracy of string partitioning
|
---|
2080 | by gobbling up as much of the string as possible and keeping it. Then
|
---|
2081 | match failures fail much more quickly.
|
---|
2082 |
|
---|
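A small sketch can confirm the behavior described above. It repeats the C<ab> comparison and then runs both forms of the parenthesis regexp on the unbalanced string. An outer capturing group has been added so the matched text can be inspected, and the variable names are assumptions of this example.

```perl
use strict;
use warnings;

my $x = 'ab';
my $plain  = $x =~ /a*ab/     ? 1 : 0;   # a* gives back its 'a', so this matches
my $atomic = $x =~ /(?>a*)ab/ ? 1 : 0;   # (?>a*) keeps its 'a', so this fails

# Both forms of the parenthesis regexp find the same text; the independent
# subexpression only changes how much backtracking happens on the way.
my $unbalanced = 'abc(de(fg)h';
my ($slow) = $unbalanced =~ /( \( (?: [^()]+     | \([^()]*\) )+ \) )/x;
my ($fast) = $unbalanced =~ /( \( (?: (?>[^()]+) | \([^()]*\) )+ \) )/x;

print "plain=$plain atomic=$atomic slow=$slow fast=$fast\n";
```

On a string this short the difference in running time is not noticeable; the saving shows up on long inputs where the attempt starting at the outer parenthesis has many ways to split the text before it finally fails.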
2083 | =head2 Conditional expressions
|
---|
2084 |
|
---|
2085 | A S<B<conditional expression> > is a form of if-then-else statement
|
---|
2086 | that allows one to choose which patterns are to be matched, based on
|
---|
2087 | some condition. There are two types of conditional expression:
|
---|
2088 | C<(?(condition)yes-regexp)> and
|
---|
2089 | C<(?(condition)yes-regexp|no-regexp)>. C<(?(condition)yes-regexp)> is
|
---|
2090 | like an S<C<'if () {}'> > statement in Perl. If the C<condition> is true,
|
---|
2091 | the C<yes-regexp> will be matched. If the C<condition> is false, the
|
---|
2092 | C<yes-regexp> will be skipped and perl will move on to the next regexp
|
---|
2093 | element. The second form is like an S<C<'if () {} else {}'> > statement
|
---|
2094 | in Perl. If the C<condition> is true, the C<yes-regexp> will be
|
---|
2095 | matched, otherwise the C<no-regexp> will be matched.
|
---|
2096 |
|
---|
2097 | The C<condition> can have two forms. The first form is simply an
|
---|
2098 | integer in parentheses C<(integer)>. It is true if the corresponding
|
---|
2099 | backreference C<\integer> matched earlier in the regexp. The second
|
---|
2100 | form is a bare zero-width assertion C<(?...)>, either a
|
---|
2101 | lookahead, a lookbehind, or a code assertion (discussed in the next
|
---|
2102 | section).
|
---|
2103 |
|
---|
2104 | The integer form of the C<condition> allows us to choose, with more
|
---|
2105 | flexibility, what to match based on what matched earlier in the
|
---|
2106 | regexp. This searches for words of the form C<"$x$x"> or
|
---|
2107 | C<"$x$y$y$x">:
|
---|
2108 |
|
---|
2109 | % simple_grep '^(\w+)(\w+)?(?(2)\2\1|\1)$' /usr/dict/words
|
---|
2110 | beriberi
|
---|
2111 | coco
|
---|
2112 | couscous
|
---|
2113 | deed
|
---|
2114 | ...
|
---|
2115 | toot
|
---|
2116 | toto
|
---|
2117 | tutu
|
---|
2118 |
|
---|
2119 | The lookbehind C<condition> allows, along with backreferences,
|
---|
2120 | an earlier part of the match to influence a later part of the
|
---|
2121 | match. For instance,
|
---|
2122 |
|
---|
2123 | /[ATGC]+(?(?<=AA)G|C)$/;
|
---|
2124 |
|
---|
2125 | matches a DNA sequence such that it either ends in C<AAG>, or some
|
---|
2126 | other base pair combination and C<C>. Note that the form is
|
---|
2127 | C<< (?(?<=AA)G|C) >> and not C<< (?((?<=AA))G|C) >>; for the
|
---|
2128 | lookahead, lookbehind or code assertions, the parentheses around the
|
---|
2129 | conditional are not needed.
|
---|
2130 |
|
---|
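Both kinds of condition can be exercised without a dictionary file. The following sketch applies the two regexps above to short, hypothetical word and sequence lists; the list contents and array names are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical word list standing in for /usr/dict/words.
my @words = qw(coco deed door toot perl beriberi);

# If group 2 took part in the match, require \2\1 (xyyx); otherwise require \1 (xx).
my @doubled = grep { /^(\w+)(\w+)?(?(2)\2\1|\1)$/ } @words;

# Lookbehind condition: after 'AA' the final base must be G, otherwise C.
my @dna = grep { /[ATGC]+(?(?<=AA)G|C)$/ } qw(GATTAAG GATTC GATTG GAAC);

print "@doubled\n@dna\n";
```

Note that C<GAAC> is rejected even though it ends in C<C>: the two bases before the final position are C<AA>, so the condition is true and only C<G> is acceptable there.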
2131 | =head2 A bit of magic: executing Perl code in a regular expression
|
---|
2132 |
|
---|
2133 | Normally, regexps are a part of Perl expressions.
|
---|
2134 | S<B<Code evaluation> > expressions turn that around by allowing
|
---|
2135 | arbitrary Perl code to be a part of a regexp. A code evaluation
|
---|
2136 | expression is denoted C<(?{code})>, with C<code> a string of Perl
|
---|
2137 | statements.
|
---|
2138 |
|
---|
2139 | Code expressions are zero-width assertions, and the value they return
|
---|
2140 | depends on their environment. There are two possibilities: either the
|
---|
2141 | code expression is used as a conditional in a conditional expression
|
---|
2142 | C<(?(condition)...)>, or it is not. If the code expression is a
|
---|
2143 | conditional, the code is evaluated and the result (i.e., the result of
|
---|
2144 | the last statement) is used to determine truth or falsehood. If the
|
---|
2145 | code expression is not used as a conditional, the assertion always
|
---|
2146 | evaluates true and the result is put into the special variable
|
---|
2147 | C<$^R>. The variable C<$^R> can then be used in code expressions later
|
---|
2148 | in the regexp. Here are some silly examples:
|
---|
2149 |
|
---|
2150 | $x = "abcdef";
|
---|
2151 | $x =~ /abc(?{print "Hi Mom!";})def/; # matches,
|
---|
2152 | # prints 'Hi Mom!'
|
---|
2153 | $x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match,
|
---|
2154 | # no 'Hi Mom!'
|
---|
2155 |
|
---|
2156 | Pay careful attention to the next example:
|
---|
2157 |
|
---|
2158 | $x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match,
|
---|
2159 | # no 'Hi Mom!'
|
---|
2160 | # but why not?
|
---|
2161 |
|
---|
2162 | At first glance, you'd think that it shouldn't print, because obviously
|
---|
2163 | the C<ddd> isn't going to match the target string. But look at this
|
---|
2164 | example:
|
---|
2165 |
|
---|
2166 | $x =~ /abc(?{print "Hi Mom!";})[d]dd/; # doesn't match,
|
---|
2167 | # but _does_ print
|
---|
2168 |
|
---|
2169 | Hmm. What happened here? If you've been following along, you know that
|
---|
2170 | the above pattern should be effectively the same as the last one --
|
---|
2171 | enclosing the d in a character class isn't going to change what it
|
---|
2172 | matches. So why does the first not print while the second one does?
|
---|
2173 |
|
---|
2174 | The answer lies in the optimizations the REx engine makes. In the first
|
---|
2175 | case, all the engine sees are plain old characters (aside from the
|
---|
2176 | C<?{}> construct). It's smart enough to realize that the string 'ddd'
|
---|
2177 | doesn't occur in our target string before actually running the pattern
|
---|
2178 | through. But in the second case, we've tricked it into thinking that our
|
---|
2179 | pattern is more complicated than it is. It takes a look, sees our
|
---|
2180 | character class, and decides that it will have to actually run the
|
---|
2181 | pattern to determine whether or not it matches, and in the process of
|
---|
2182 | running it hits the print statement before it discovers that we don't
|
---|
2183 | have a match.
|
---|
2184 |
|
---|
2185 | To take a closer look at how the engine does optimizations, see the
|
---|
2186 | section L<"Pragmas and debugging"> below.
|
---|
2187 |
|
---|
2188 | More fun with C<?{}>:
|
---|
2189 |
|
---|
2190 | $x =~ /(?{print "Hi Mom!";})/; # matches,
|
---|
2191 | # prints 'Hi Mom!'
|
---|
2192 | $x =~ /(?{$c = 1;})(?{print "$c";})/; # matches,
|
---|
2193 | # prints '1'
|
---|
2194 | $x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches,
|
---|
2195 | # prints '1'
|
---|
2196 |
|
---|
2197 | The bit of magic mentioned in the section title occurs when the regexp
|
---|
2198 | backtracks in the process of searching for a match. If the regexp
|
---|
2199 | backtracks over a code expression and if the variables used within are
|
---|
2200 | localized using C<local>, the changes in the variables produced by the
|
---|
2201 | code expression are undone! Thus, if we wanted to count how many times
|
---|
2202 | a character got matched inside a group, we could use, e.g.,
|
---|
2203 |
|
---|
2204 | $x = "aaaa";
|
---|
2205 | $count = 0; # initialize 'a' count
|
---|
2206 | $c = "bob"; # test if $c gets clobbered
|
---|
2207 | $x =~ /(?{local $c = 0;}) # initialize count
|
---|
2208 | ( a # match 'a'
|
---|
2209 | (?{local $c = $c + 1;}) # increment count
|
---|
2210 | )* # do this any number of times,
|
---|
2211 | aa # but match 'aa' at the end
|
---|
2212 | (?{$count = $c;}) # copy local $c var into $count
|
---|
2213 | /x;
|
---|
2214 | print "'a' count is $count, \$c variable is '$c'\n";
|
---|
2215 |
|
---|
2216 | This prints
|
---|
2217 |
|
---|
2218 | 'a' count is 2, $c variable is 'bob'
|
---|
2219 |
|
---|
2220 | If we replace the S<C< (?{local $c = $c + 1;})> > with
|
---|
2221 | S<C< (?{$c = $c + 1;})> >, the variable changes are I<not> undone
|
---|
2222 | during backtracking, and we get
|
---|
2223 |
|
---|
2224 | 'a' count is 4, $c variable is 'bob'
|
---|
2225 |
|
---|
2226 | Note that only localized variable changes are undone. Other side
|
---|
2227 | effects of code expression execution are permanent. Thus
|
---|
2228 |
|
---|
2229 | $x = "aaaa";
|
---|
2230 | $x =~ /(a(?{print "Yow\n";}))*aa/;
|
---|
2231 |
|
---|
2232 | produces
|
---|
2233 |
|
---|
2234 | Yow
|
---|
2235 | Yow
|
---|
2236 | Yow
|
---|
2237 | Yow
|
---|
2238 |
|
---|
2239 | The result C<$^R> is automatically localized, so that it will behave
|
---|
2240 | properly in the presence of backtracking.
|
---|
2241 |
|
---|
2242 | This example uses a code expression in a conditional to match the
|
---|
2243 | article 'the' in either English or German:
|
---|
2244 |
|
---|
2245 | $lang = 'DE'; # use German
|
---|
2246 | ...
|
---|
2247 | $text = "das";
|
---|
2248 | print "matched\n"
|
---|
2249 | if $text =~ /(?(?{
|
---|
2250 | $lang eq 'EN'; # is the language English?
|
---|
2251 | })
|
---|
2252 | the | # if so, then match 'the'
|
---|
2253 | (die|das|der) # else, match 'die|das|der'
|
---|
2254 | )
|
---|
2255 | /xi;
|
---|
2256 |
|
---|
2257 | Note that the syntax here is C<(?(?{...})yes-regexp|no-regexp)>, not
|
---|
2258 | C<(?((?{...}))yes-regexp|no-regexp)>. In other words, in the case of a
|
---|
2259 | code expression, we don't need the extra parentheses around the
|
---|
2260 | conditional.
|
---|
2261 |
|
---|
2262 | If you try to use code expressions with interpolating variables, perl
|
---|
2263 | may surprise you:
|
---|
2264 |
|
---|
2265 | $bar = 5;
|
---|
2266 | $pat = '(?{ 1 })';
|
---|
2267 | /foo(?{ $bar })bar/; # compiles ok, $bar not interpolated
|
---|
2268 | /foo(?{ 1 })$bar/; # compile error!
|
---|
2269 | /foo${pat}bar/; # compile error!
|
---|
2270 |
|
---|
2271 | $pat = qr/(?{ $foo = 1 })/; # precompile code regexp
|
---|
2272 | /foo${pat}bar/; # compiles ok
|
---|
2273 |
|
---|
2274 | If a regexp has (1) code expressions and interpolating variables, or
|
---|
2275 | (2) a variable that interpolates a code expression, perl treats the
|
---|
2276 | regexp as an error. If the code expression is precompiled into a
|
---|
2277 | variable, however, interpolating is ok. The question is, why is this
|
---|
2278 | an error?
|
---|
2279 |
|
---|
2280 | The reason is that variable interpolation and code expressions
|
---|
2281 | together pose a security risk. The combination is dangerous because
|
---|
2282 | programmers who write search engines often take user input and
|
---|
2283 | plug it directly into a regexp:
|
---|
2284 |
|
---|
2285 | $regexp = <>; # read user-supplied regexp
|
---|
2286 | chomp $regexp; # get rid of possible newline
|
---|
2287 | $text =~ /$regexp/; # search $text for the $regexp
|
---|
2288 |
|
---|
2289 | If the C<$regexp> variable contains a code expression, the user could
|
---|
2290 | then execute arbitrary Perl code. For instance, some joker could
|
---|
2291 | search for S<C<(?{ system('rm -rf *'); })> > to erase your files. In this
|
---|
2292 | sense, the combination of interpolation and code expressions B<taints>
|
---|
2293 | your regexp. So by default, using both interpolation and code
|
---|
2294 | expressions in the same regexp is not allowed. If you're not
|
---|
2295 | concerned about malicious users, it is possible to bypass this
|
---|
2296 | security check by invoking S<C<use re 'eval'> >:
|
---|
2297 |
|
---|
2298 | use re 'eval'; # throw caution out the door
|
---|
2299 | $bar = 5;
|
---|
2300 | $pat = '(?{ 1 })';
|
---|
2301 | /foo(?{ 1 })$bar/; # compiles ok
|
---|
2302 | /foo${pat}bar/; # compiles ok
|
---|
2303 |
|
---|
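When the input comes from someone else, the safer route is to leave the check in place. The sketch below, which assumes a hypothetical hostile string in C<$user_input>, shows the default refusal being caught with C<eval>, and then shows C<\Q...\E> treating the same input as plain text so that no code expression is ever compiled.

```perl
use strict;
use warnings;

# Hypothetical hostile input: a code expression disguised as a search term.
my $user_input = '(?{ print "gotcha\n" })';

# Without "use re 'eval'", interpolating it is refused when the pattern is compiled.
my $ok    = eval { 'some text' =~ /$user_input/; 1 };
my $error = $ok ? '' : $@;

# \Q...\E (quotemeta) makes the input match only as literal text.
my $haystack = 'x(?{ print "gotcha\n" })y';
my $literal  = $haystack =~ /\Q$user_input\E/ ? 1 : 0;

print "refused: ", ($error ? 'yes' : 'no'), ", literal match: $literal\n";
```

Quoting with C<\Q...\E> also disables every other metacharacter in the input, so it suits plain-text search boxes rather than interfaces that are meant to accept regexps.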
2304 | Another form of code expression is the S<B<pattern code expression> >.
|
---|
2305 | The pattern code expression is like a regular code expression, except
|
---|
2306 | that the result of the code evaluation is treated as a regular
|
---|
2307 | expression and matched immediately. A simple example is
|
---|
2308 |
|
---|
2309 | $length = 5;
|
---|
2310 | $char = 'a';
|
---|
2311 | $x = 'aaaaabb';
|
---|
2312 | $x =~ /(??{$char x $length})/x; # matches, there are 5 of 'a'
|
---|
2313 |
|
---|
2314 |
|
---|
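Because the code in a pattern code expression runs at match time, it can return the very regexp it is part of, which gives a recursive pattern. The following is a minimal sketch of that idea for balanced parentheses nested to any depth; the variable name C<$balanced> and the test strings are assumptions of this example, and the independent subexpression is used for the reason given earlier.

```perl
use strict;
use warnings;

# A package variable, so the code inside the pattern can see the finished regexp.
our $balanced;
$balanced = qr/
    \(                      # an opening parenthesis
    (?:
        (?> [^()]+ )        # a run of non-parentheses, never given back
      |
        (??{ $balanced })   # or a nested balanced group, matched recursively
    )*
    \)                      # the matching closing parenthesis
/x;

my ($found)   = 'f(a(b)c)(d' =~ /($balanced)/;
my $whole_ok  = '((a)(b))' =~ /^$balanced\z/ ? 1 : 0;
my $whole_bad = '((a)(b)'  =~ /^$balanced\z/ ? 1 : 0;

print "found=$found ok=$whole_ok bad=$whole_bad\n";
```

Interpolating C<$balanced> into the later matches is allowed because it was precompiled with C<qr//>.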
2315 | This final example contains both ordinary and pattern code
|
---|
2316 | expressions. It detects if a binary string C<1101010010001...> has a
|
---|
2317 | Fibonacci spacing 0,1,1,2,3,5,... of the C<1>'s:
|
---|
2318 |
|
---|
2319 | $s0 = 0; $s1 = 1; # initial conditions
|
---|
2320 | $x = "1101010010001000001";
|
---|
2321 | print "It is a Fibonacci sequence\n"
|
---|
2322 | if $x =~ /^1 # match an initial '1'
|
---|
2323 | (
|
---|
2324 | (??{'0' x $s0}) # match $s0 of '0'
|
---|
2325 | 1 # and then a '1'
|
---|
2326 | (?{
|
---|
2327 | $largest = $s0; # largest seq so far
|
---|
2328 | $s2 = $s1 + $s0; # compute next term
|
---|
2329 | $s0 = $s1; # in Fibonacci sequence
|
---|
2330 | $s1 = $s2;
|
---|
2331 | })
|
---|
2332 | )+ # repeat as needed
|
---|
2333 | $ # that is all there is
|
---|
2334 | /x;
|
---|
2335 | print "Largest sequence matched was $largest\n";
|
---|
2336 |
|
---|
2337 | This prints
|
---|
2338 |
|
---|
2339 | It is a Fibonacci sequence
|
---|
2340 | Largest sequence matched was 5
|
---|
2341 |
|
---|
2342 | Ha! Try that with your garden variety regexp package...
|
---|
2343 |
|
---|
2344 | Note that the variables C<$s0> and C<$s1> are not substituted when the
|
---|
2345 | regexp is compiled, as happens for ordinary variables outside a code
|
---|
2346 | expression. Rather, the code expressions are evaluated when perl
|
---|
2347 | encounters them during the search for a match.
|
---|
2348 |
|
---|
2349 | The regexp without the C<//x> modifier is
|
---|
2350 |
|
---|
2351 | /^1((??{'0'x$s0})1(?{$largest=$s0;$s2=$s1+$s0;$s0=$s1;$s1=$s2;}))+$/;
|
---|
2352 |
|
---|
2353 | and is a great start on an Obfuscated Perl entry :-) When working with
|
---|
2354 | code and conditional expressions, the extended form of regexps is
|
---|
2355 | almost necessary in creating and debugging regexps.
|
---|
2356 |
|
---|
2357 | =head2 Pragmas and debugging
|
---|
2358 |
|
---|
2359 | Speaking of debugging, there are several pragmas available to control
|
---|
2360 | and debug regexps in Perl. We have already encountered one pragma in
|
---|
2361 | the previous section, S<C<use re 'eval';> >, that allows variable
|
---|
2362 | interpolation and code expressions to coexist in a regexp. The other
|
---|
2363 | pragmas are
|
---|
2364 |
|
---|
2365 | use re 'taint';
|
---|
2366 | $tainted = <>;
|
---|
2367 | @parts = ($tainted =~ /(\w+)\s+(\w+)/); # @parts is now tainted
|
---|
2368 |
|
---|
2369 | The C<taint> pragma causes any substrings from a match with a tainted
|
---|
2370 | variable to be tainted as well. This is not normally the case, as
|
---|
2371 | regexps are often used to extract the safe bits from a tainted
|
---|
2372 | variable. Use C<taint> when you are not extracting safe bits, but are
|
---|
2373 | performing some other processing. Both C<taint> and C<eval> pragmas
|
---|
2374 | are lexically scoped, which means they are in effect only until
|
---|
2375 | the end of the block enclosing the pragmas.
|
---|
2376 |
|
---|
2377 | use re 'debug';
|
---|
2378 | /^(.*)$/s; # output debugging info
|
---|
2379 |
|
---|
2380 | use re 'debugcolor';
|
---|
2381 | /^(.*)$/s; # output debugging info in living color
|
---|
2382 |
|
---|
2383 | The global C<debug> and C<debugcolor> pragmas allow one to get
|
---|
2384 | detailed debugging info about regexp compilation and
|
---|
2385 | execution. C<debugcolor> is the same as C<debug>, except the debugging
|
---|
2386 | information is displayed in color on terminals that can display
|
---|
2387 | termcap color sequences. Here is example output:
|
---|
2388 |
|
---|
2389 | % perl -e 'use re "debug"; "abc" =~ /a*b+c/;'
|
---|
2390 | Compiling REx `a*b+c'
|
---|
2391 | size 9 first at 1
|
---|
2392 | 1: STAR(4)
|
---|
2393 | 2: EXACT <a>(0)
|
---|
2394 | 4: PLUS(7)
|
---|
2395 | 5: EXACT <b>(0)
|
---|
2396 | 7: EXACT <c>(9)
|
---|
2397 | 9: END(0)
|
---|
2398 | floating `bc' at 0..2147483647 (checking floating) minlen 2
|
---|
2399 | Guessing start of match, REx `a*b+c' against `abc'...
|
---|
2400 | Found floating substr `bc' at offset 1...
|
---|
2401 | Guessed: match at offset 0
|
---|
2402 | Matching REx `a*b+c' against `abc'
|
---|
2403 | Setting an EVAL scope, savestack=3
|
---|
2404 | 0 <> <abc> | 1: STAR
|
---|
2405 | EXACT <a> can match 1 times out of 32767...
|
---|
2406 | Setting an EVAL scope, savestack=3
|
---|
2407 | 1 <a> <bc> | 4: PLUS
|
---|
2408 | EXACT <b> can match 1 times out of 32767...
|
---|
2409 | Setting an EVAL scope, savestack=3
|
---|
2410 | 2 <ab> <c> | 7: EXACT <c>
|
---|
2411 | 3 <abc> <> | 9: END
|
---|
2412 | Match successful!
|
---|
2413 | Freeing REx: `a*b+c'
|
---|
2414 |
|
---|
2415 | If you have gotten this far into the tutorial, you can probably guess
|
---|
2416 | what the different parts of the debugging output tell you. The first
|
---|
2417 | part
|
---|
2418 |
|
---|
2419 | Compiling REx `a*b+c'
|
---|
2420 | size 9 first at 1
|
---|
2421 | 1: STAR(4)
|
---|
2422 | 2: EXACT <a>(0)
|
---|
2423 | 4: PLUS(7)
|
---|
2424 | 5: EXACT <b>(0)
|
---|
2425 | 7: EXACT <c>(9)
|
---|
2426 | 9: END(0)
|
---|
2427 |
|
---|
2428 | describes the compilation stage. C<STAR(4)> means that there is a
|
---|
2429 | starred object, in this case C<'a'>, and if it matches, goto line 4,
|
---|
2430 | i.e., C<PLUS(7)>. The middle lines describe some heuristics and
|
---|
2431 | optimizations performed before a match:
|
---|
2432 |
|
---|
2433 | floating `bc' at 0..2147483647 (checking floating) minlen 2
|
---|
2434 | Guessing start of match, REx `a*b+c' against `abc'...
|
---|
2435 | Found floating substr `bc' at offset 1...
|
---|
2436 | Guessed: match at offset 0
|
---|
2437 |
|
---|
2438 | Then the match is executed and the remaining lines describe the
|
---|
2439 | process:
|
---|
2440 |
|
---|
2441 | Matching REx `a*b+c' against `abc'
|
---|
2442 | Setting an EVAL scope, savestack=3
|
---|
2443 | 0 <> <abc> | 1: STAR
|
---|
2444 | EXACT <a> can match 1 times out of 32767...
|
---|
2445 | Setting an EVAL scope, savestack=3
|
---|
2446 | 1 <a> <bc> | 4: PLUS
|
---|
2447 | EXACT <b> can match 1 times out of 32767...
|
---|
2448 | Setting an EVAL scope, savestack=3
|
---|
2449 | 2 <ab> <c> | 7: EXACT <c>
|
---|
2450 | 3 <abc> <> | 9: END
|
---|
2451 | Match successful!
|
---|
2452 | Freeing REx: `a*b+c'
|
---|
2453 |
|
---|
2454 | Each step is of the form S<C<< n <x> <y> >> >, with C<< <x> >> the
|
---|
2455 | part of the string matched and C<< <y> >> the part not yet
|
---|
2456 | matched. The S<C<< | 1: STAR >> > says that perl is at line number 1
|
---|
2457 | in the compilation list above. See
|
---|
2458 | L<perldebguts/"Debugging regular expressions"> for much more detail.
|
---|
2459 |
|
---|
2460 | An alternative method of debugging regexps is to embed C<print>
|
---|
2461 | statements within the regexp. This provides a blow-by-blow account of
|
---|
2462 | the backtracking in an alternation:
|
---|
2463 |
|
---|
2464 | "that this" =~ m@(?{print "Start at position ", pos, "\n";})
|
---|
2465 | t(?{print "t1\n";})
|
---|
2466 | h(?{print "h1\n";})
|
---|
2467 | i(?{print "i1\n";})
|
---|
2468 | s(?{print "s1\n";})
|
---|
2469 | |
|
---|
2470 | t(?{print "t2\n";})
|
---|
2471 | h(?{print "h2\n";})
|
---|
2472 | a(?{print "a2\n";})
|
---|
2473 | t(?{print "t2\n";})
|
---|
2474 | (?{print "Done at position ", pos, "\n";})
|
---|
2475 | @x;
|
---|
2476 |
|
---|
2477 | prints
|
---|
2478 |
|
---|
2479 | Start at position 0
|
---|
2480 | t1
|
---|
2481 | h1
|
---|
2482 | t2
|
---|
2483 | h2
|
---|
2484 | a2
|
---|
2485 | t2
|
---|
2486 | Done at position 4
|
---|
2487 |
|
---|
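The same technique works when the trace should be inspected by the program rather than read off the screen. This variation is only a sketch: it pushes labels onto a package array, here given the hypothetical name C<@trace>, and labels the final C<t> as C<t3> so that it can be told apart from the first C<t> of the second alternative.

```perl
use strict;
use warnings;

our @trace;   # package variable filled in by the code expressions
my $matched = 'that this' =~ /
      t (?{ push @trace, 't1' })
      h (?{ push @trace, 'h1' })
      i (?{ push @trace, 'i1' })
      s (?{ push @trace, 's1' })
    |
      t (?{ push @trace, 't2' })
      h (?{ push @trace, 'h2' })
      a (?{ push @trace, 'a2' })
      t (?{ push @trace, 't3' })
/x ? 1 : 0;

print "matched=$matched trace=@trace\n";
```

Pushing onto an array is a side effect just like printing, so labels recorded before a backtrack stay in the trace; that is what makes the abandoned first alternative visible.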
2488 | =head1 BUGS
|
---|
2489 |
|
---|
2490 | Code expressions, conditional expressions, and independent expressions
|
---|
2491 | are B<experimental>. Don't use them in production code. Yet.
|
---|
2492 |
|
---|
2493 | =head1 SEE ALSO
|
---|
2494 |
|
---|
2495 | This is just a tutorial. For the full story on perl regular
|
---|
2496 | expressions, see the L<perlre> regular expressions reference page.
|
---|
2497 |
|
---|
2498 | For more information on the matching C<m//> and substitution C<s///>
|
---|
2499 | operators, see L<perlop/"Regexp Quote-Like Operators">. For
|
---|
2500 | information on the C<split> operation, see L<perlfunc/split>.
|
---|
2501 |
|
---|
2502 | For an excellent all-around resource on the care and feeding of
|
---|
2503 | regular expressions, see the book I<Mastering Regular Expressions> by
|
---|
2504 | Jeffrey Friedl (published by O'Reilly, ISBN 1-56592-257-3).
|
---|
2505 |
|
---|
2506 | =head1 AUTHOR AND COPYRIGHT
|
---|
2507 |
|
---|
2508 | Copyright (c) 2000 Mark Kvale
|
---|
2509 | All rights reserved.
|
---|
2510 |
|
---|
2511 | This document may be distributed under the same terms as Perl itself.
|
---|
2512 |
|
---|
2513 | =head2 Acknowledgments
|
---|
2514 |
|
---|
2515 | The inspiration for the stop codon DNA example came from the ZIP
|
---|
2516 | code example in chapter 7 of I<Mastering Regular Expressions>.
|
---|
2517 |
|
---|
2518 | The author would like to thank Jeff Pinyan, Andrew Johnson, Peter
|
---|
2519 | Haworth, Ronald J Kimball, and Joe Smith for all their helpful
|
---|
2520 | comments.
|
---|
2521 |
|
---|
2522 | =cut
|
---|
2523 |
|
---|