source: gsdl/trunk/perllib/cpan/XML/Rules.pm@ 19051

Last change on this file since 19051 was 19051, checked in by davidb, 15 years ago

Rule based way to parse XML files. Built on top of XML::Parser. For certain types of tasks provides more convenient/concise way to express things.

File size: 73.2 KB
Line 
1package XML::Rules;
2
3use warnings;
4no warnings qw(uninitialized);
5use strict;
6use Carp;
7
8require Exporter;
9our @ISA = qw(Exporter);
10our @EXPORT_OK = qw(paths2rules);
11
12use XML::Parser::Expat;
13
14=head1 NAME
15
16XML::Rules - parse XML and specify what and how to keep/process for individual tags
17
18=head1 VERSION
19
20Version 1.05
21
22=cut
23
24our $VERSION = '1.05';
25
26=head1 SYNOPSIS
27
28 use XML::Rules;
29
30 $xml = <<'*END*';
31 <doc>
32 <person>
33 <fname>...</fname>
34 <lname>...</lname>
35 <email>...</email>
36 <address>
37 <street>...</street>
38 <city>...</city>
39 <country>...</country>
40 <bogus>...</bogus>
41 </address>
42 <phones>
43 <phone type="home">123-456-7890</phone>
44 <phone type="office">663-486-7890</phone>
45 <phone type="fax">663-486-7000</phone>
46 </phones>
47 </person>
48 <person>
49 <fname>...</fname>
50 <lname>...</lname>
51 <email>...</email>
52 <address>
53 <street>...</street>
54 <city>...</city>
55 <country>...</country>
56 <bogus>...</bogus>
57 </address>
58 <phones>
59 <phone type="office">663-486-7891</phone>
60 </phones>
61 </person>
62 </doc>
63 *END*
64
65 @rules = (
66 _default => sub {$_[0] => $_[1]->{_content}},
67 # by default I'm only interested in the content of the tag, not the attributes
68 bogus => undef,
69 # let's ignore this tag and all inner ones as well
70 address => sub {address => "$_[1]->{street}, $_[1]->{city} ($_[1]->{country})"},
71 # merge the address into a single string
72 phone => sub {$_[1]->{type} => $_[1]->{_content}},
73 # let's use the "type" attribute as the key and the content as the value
74 phones => sub {delete $_[1]->{_content}; %{$_[1]}},
75 # remove the text content and pass along the type => content from the child nodes
76 person => sub { # lets print the values, all the data is readily available in the attributes
77 print "$_[1]->{lname}, $_[1]->{fname} <$_[1]->{email}>\n";
78 print "Home phone: $_[1]->{home}\n" if $_[1]->{home};
79 print "Office phone: $_[1]->{office}\n" if $_[1]->{office};
80 print "Fax: $_[1]->{fax}\n" if $_[1]->{fax};
81 print "$_[1]->{address}\n\n";
82 return; # the <person> tag is processed, no need to remember what it contained
83 },
84 );
85 $parser = XML::Rules->new(rules => \@rules);
86 $parser->parse( $xml);
87
88=head1 INTRODUCTION
89
90There are several ways to extract data from XML. One that's often used is to read the whole file and transform it into a huge maze of objects and then write code like
91
92 foreach my $obj ($XML->forTheLifeOfMyMotherGiveMeTheFirstChildNamed("Peter")->pleaseBeSoKindAndGiveMeAllChildrenNamedSomethingLike("Jane")) {
93 my $obj2 = $obj->sorryToKeepBotheringButINeedTheChildNamed("Theophile");
94 my $birth = $obj2->whatsTheValueOfAttribute("BirthDate");
95 print "Theophile was bort at $birth\n";
96 }
97
98I'm exagerating of course, but you probably know what I mean. You can of course shorten the path and call just one method ... that is if you spend the time to learn one more "cool" thing starting with X. XPath.
99
100You can also use XML::Simple and generate an almost equaly huge maze of hashes and arrays ... which may make the code more or less complex. In either case you need to have enough memory
101to store all that data, even if you only need a piece here and there.
102
103Another way to parse the XML is to create some subroutines that handle the start and end tags and the text and whatever else may appear in the XML. Some modules will let you specify just one for start tag, one for text and one for end tag, others will let you install different handlers for different tags. The catch is that you have to build your data structures yourself, you have to know where you are, what tag is just open and what is the parent and it's parent etc. so that you could add the attributes and especially the text to the right place. And the handlers have to do everything as their side effect. Does anyone remember what do they say about side efects? They make the code hard to debug, they tend to change the code into a maze of interdependent snippets of code.
104
105So what's the difference in the way XML::Rules works? At the first glance, not much. You can also specify subroutines to be called for the tags encountered while parsing the XML, just like the other even based XML parsers. The difference is that you do not have to rely on sideeffects if all you want is to store the value of a tag. You simply return whatever you need from the current tag and the module will add it at the right place in the data structure it builds and will provide it to the handlers for the parent tag. And if the parent tag does return that data again it will be passed to its parent and so forth. Until we get to the level at which it's convenient to handle all the data we accumulated from the twig.
106
107Do we want to keep just the content and access it in the parent tag handler under a specific name?
108
109 foo => sub {return 'foo' => $_[1]->{_content}}
110
111Do we want to ornament the content a bit and add it to the parent tag's content?
112
113 u => sub {return '_' . $_[1]->{_content} . '_'}
114 strong => sub {return '*' . $_[1]->{_content} . '*'}
115 uc => sub {return uc($_[1]->{_content})}
116
117Do we want to merge the attributes into a string and access the string from the parent tag under a specified name?
118
119 address => sub {return 'Address' => "Street: $_[1]->{street} $_[1]->{bldngNo}\nCity: $_[1]->{city}\nCountry: $_[1]->{country}\nPostal code: $_[1]->{zip}"}
120
121and in this case the $_[1]->{street} may either be an attribute of the <address> tag or it may be ther result of the handler (rule)
122
123 street => sub {return 'street' => $_[1]->{_content}}
124
125and thus come from a child tag <street>. You may also use the rules to convert codes to values
126
127 our %states = (
128 AL => 'Alabama',
129 AK => 'Alaska',
130 ...
131 );
132 ...
133 state => sub {return 'state' => $states{$_[1]->{_content}}; }
134
135 or
136
137 address => sub {
138 if (exists $_[1]->{id}) {
139 $sthFetchAddress->execute($_[1]->{id});
140 my $addr = $sthFetchAddress->fetchrow_hashref();
141 $sthFetchAddress->finish();
142 return 'address' => $addr;
143 } else {
144 return 'address' => $_[1];
145 }
146 }
147
148so that you do not have to care whether there was
149
150 <address id="147"/>
151
152or
153
154 <address><street>Larry Wall's St.</street><streetno>478</streetno><city>Core</city><country>The Programming Republic of Perl</country></address>
155
156At each level in the tree structure serialized as XML you can decide what to keep, what to throw away, what to transform and then just return the stuff you care about and it will be available to the handler at the next level.
157
158=head1 CONSTRUCTOR
159
160 my $parser = XML::Rules->new(
161 rules => \@rules,
162 [ start_rules => \@start_rules, ]
163 [ stripspaces => 0 / 1 / 2 / 3 + 0 / 4 + 0 / 8, ]
164 [ normalisespaces => 0 / 1, ]
165 [ style => 'parser' / 'filter', ]
166 [ ident => ' ', [reformat_all => 0 / 1] ],
167 [ encode => 'encoding specification', ]
168 [ output_encoding => 'encoding specification', ]
169 [ namespaces => \%namespace2alias_mapping, ]
170 # and optionaly parameters passed to XML::Parser::Expat
171 );
172
173Options passed to XML::Parser::Expat : ProtocolEncoding Namespaces NoExpand Stream_Delimiter ErrorContext ParseParamEnt Base
174
175The "stripspaces" controls the handling of whitespace. Please see the C<Whitespace handling> bellow.
176
177The "style" specifies whether you want to build a parser used to extract stuff from the XML or filter/modify the XML. If you specify
178style => 'filter' then all tags for which you do not specify a subroutine rule or that occure inside such a tag are copied to the output filehandle
179passed to the ->filter() or ->filterfile() methods.
180
181The "ident" specifies what character(s) to use to ident the tags when filtering, by default the tags are not formatted in any way. If the
182"reformat_all" is not set then this affects only the tags that have a rule and their subtags. And in case of subtags only those that were
183added into the attribute hash by their rules, not those left in the _content array!
184
185The "warnoverwrite" instructs XML::Rules to issue a warning whenever the rule cause a key in a tag's hash to be overwritten by new
186data produced by the rule of a subtag. This happens eg. if a tag is repeated and its rule doesn't expect it.
187
188The "encode" allows you to ask the module to run all data through Encode::encode( 'encoding_specification', ...)
189before being passed to the rules. Otherwise all data comes as UTF8.
190
191The "output_encoding" on the other hand specifies in what encoding is the resulting data going to be, the default is again UTF8.
192This means that if you specify
193
194 encode => 'windows-1250',
195 output_encoding => 'utf8',
196
197and the XML is in ISO-8859-2 (Latin2) then the filter will 1) convert the content and attributes of the tags you are not interested in from Latin2
198directly to utf8 and output and 2) convert the content and attributes of the tags you want to process from Latin2 to Windows-1250, let you mangle
199the data and then convert the results to utf8 for the output.
200
201The C<encode> and C<output_enconding> affects also the C<$parser->toXML(...)>, if they are different then the data are converted from
202one encoding to the other.
203
204=head2 The Rules
205
206The rules option may be either an arrayref or a hashref, the module doesn't care, but if you want to use regexps to specify the groups of tags to be handled
207by the same rule you should use the array ref. The rules array/hash is made of pairs in form
208
209 tagspecification => action
210
211where the tagspecification may be either a name of a tag, a string containing comma or pipe ( "|" ) delimited list of tag names
212or a string containing a regexp enclosed in // with optional parameters or a qr// compiled regular expressions. The tag names and tag name lists
213take precedence to the regexps, the regexps are (in case of arrayref only!!!) tested in the order in which they are specified.
214
215These rules are evaluated/executed whenever a tag if fully parsed including all the content and child tags and they may access the content and attributes of the
216specified tag plus the stuff produced by the rules evaluated for the child tags.
217
218The action may be either
219
220 an undef or empty string = ignore the tag and all its children
221 a subroutine reference = the subroutine will be called to handle the tag data&contents
222 sub { my ($tagname, $attrHash, $contexArray, $parentDataArray, $parser) = @_; ...}
223 'content' = only the content of the tag is preserved and added to
224 the parent tag's hash as an attribute named after the tag
225 sub { $_[0] => $_[1]->{_content}}
226 'content trim' = only the content of the tag is preserved, trimmed and added to
227 the parent tag's hash as an attribute named after the tag
228 sub { s/^\s+//,s/\s+$// for ($_[1]->{_content}); $_[0] => $_[1]->{_content}}
229 'content array' = only the content of the tag is preserved and pushed
230 to the array pointed to by the attribute
231 sub { '@' . $_[0] => $_[1]->{_content}}
232 'as is' = the tag's hash is added to the parent tag's hash
233 as an attribute named after the tag
234 sub { $_[0] => $_[1]}
235 'as is trim' = the tag's hash is added to the parent tag's hash
236 as an attribute named after the tag, the content is trimmed
237 sub { $_[0] => $_[1]}
238 'as array' = the tag's hash is pushed to the attribute named after the tag
239 in the parent tag's hash
240 sub { '@'.$_[0] => $_[1]}
241 'as array trim' = the tag's hash is pushed to the attribute named after the tag
242 in the parent tag's hash, the content is trimmed
243 sub { '@'.$_[0] => $_[1]}
244 'no content' = the _content is removed from the tag's hash and the hash
245 is added to the parent's hash into the attribute named after the tag
246 sub { delete $_[1]->{_content}; $_[0] => $_[1]}
247 'no content array' = similar to 'no content' except the hash is pushed
248 into the array referenced by the attribute
249 'as array no content' = same as 'no content array'
250 'pass' = the tag's hash is dissolved into the parent's hash,
251 that is all tag's attributes become the parent's attributes.
252 The _content is appended to the parent's _content.
253 sub { %{$_[0]}}
254 'pass no content' = the _content is removed and the hash is dissolved
255 into the parent's hash.
256 sub { delete $_[1]->{_content}; %{$_[0]}}
257 'pass without content' = same as 'pass no content'
258 'raw' = the [tagname => attrs] is pushed to the parent tag's _content.
259 You would use this style if you wanted to be able to print
260 the parent tag as XML preserving the whitespace or other textual content
261 sub { [$_[0] => $_[1]]}
262 'raw extended' = the [tagname => attrs] is pushed to the parent tag's _content
263 and the attrs are added to the parent's attribute hash with ":$tagname" as the key
264 sub { (':'.$_[0] => $_[1], [$_[0] => $_[1]])};
265 'raw extended array' = the [tagname => attrs] is pushed to the parent tag's _content
266 and the attrs are pushed to the parent's attribute hash with ":$tagname" as the key
267 sub { ('@:'.$_[0] => $_[1], [$_[0] => $_[1]])};
268 'by <attrname>' = uses the value of the specified attribute as the key when adding the
269 attribute hash into the parent tag's hash. You can specify more names, in that case
270 the first found is used.
271 sub {delete($_[1]->{name}) => $_[1]}
272 'content by <attrname>' = uses the value of the specified attribute as the key when adding the
273 tags content into the parent tag's hash. You can specify more names, in that case
274 the first found is used.
275 sub {$_[1]->{name} => $_[1]->{_content}}
276 'no content by <attrname>' = uses the value of the specified attribute as the key when adding the
277 attribute hash into the parent tag's hash. The content is dropped. You can specify more names,
278 in that case the first found is used.
279 sub {delete($_[1]->{_content}); delete($_[1]->{name}) => $_[1]}
280 '==...' = replace the tag by the specified string. That is the string will be added to
281 the parent tag's _content
282 sub { return '...' }
283 '=...' = replace the tag contents by the specified string and forget the attributes.
284 sub { return $_[0] => '...' }
285
286You may also add " no xmlns" at the end of all those predefined rules to strip the namespace
287alias from the $_[0] (tag name).
288
289The subroutines in the rules specification receive five parameters:
290
291 $rule->( $tag_name, \%attrs, \@context, \@parent_data, $parser)
292
293It's OK to destroy the first two parameters, but you should treat the other three as read only
294or at least treat them with care!
295
296 $tag_name = string containing the tag name
297 \%attrs = hash containing the attributes of the tag plus the _content key
298 containing the text content of the tag. If it's not a leaf tag it may
299 also contain the data returned by the rules invoked for the child tags.
300 \@context = an array containing the names of the tags enclosing the current
301 one. The parent tag name is the last element of the array.
302 \@parent_data = an array containing the hashes with the attributes
303 and content read&produced for the enclosing tags so far.
304 You may need to access this for example to find out the version
305 of the format specified as an attribute of the root tag. You may
306 safely add, change or delete attributes in the hashes, but all bets
307 are off if you change the number or type of elements of this array!
308 $parser = the parser object.
309
310The subroutine may decide to handle the data and return nothing or
311tweak the data as necessary and return just the relevant bits. It may also
312load more information from elsewhere based on the ids found in the XML
313and provide it to the rules of the ancestor tags as if it was part of the XML.
314
315The possible return values of the subroutines are:
316
3171) nothing or undef or "" - nothing gets added to the parent tag's hash
318
3192) a single string - if the parent's _content is a string then the one produced by this rule is appended to the parent's _content.
320If the parent's _content is an array, then the string is push()ed to the array.
321
3223) a single reference - if the parent's _content is a string then it's changed to an array containing the original string and this reference.
323If the parent's _content is an array, then the string is push()ed to the array.
324
3254) an even numbered list - it's a list of key & value pairs to be added to the parent's hash.
326
327The handling of the attributes may be changed by adding '@', '%', '+', '*' or '.' before the attribute name.
328
329Without any "sigil" the key & value is added to the hash overwriting any previous values.
330
331The values for the keys starting with '@' are push()ed to the arrays referenced by the key name
332without the @. If there already is an attribute of the same name then the value will be preserved and will become
333the first element in the array.
334
335The values for the keys starting with '%' have to be either hash or array references. The key&value pairs
336in the referenced hash or array will be added to the hash referenced by the key. This is nice for rows of tags like this:
337
338 <field name="foo" value="12"/>
339 <field name="bar" value="24"/>
340
341if you specify the rule as
342
343 field => sub { '%fields' => [$_[1]->{name} => $_[1]->{value}]}
344
345then the parent tag's has will contain
346
347 fields => {
348 foo => 12,
349 bar => 24,
350 }
351
352The values for the keys starting with '+' are added to the current value, the ones starting with '.' are
353appended to the current value and the ones starting with '*' multiply the current value.
354
3555) an odd numbered list - the last element is appended or push()ed to the parent's _content, the rest is handled as in the previous case.
356
357
358Since 0.19 it's possible to specify several actions for a tag if you need to do something different based on the path to the tag like this:
359
360 tagname => [
361 'tag/path' => action,
362 '/root/tag/path' => action,
363 '/root/*/path' => action,
364 qr{^root/ns:[^/]+/par$} => action,
365 default_action
366 ],
367
368The path is matched against the list of parent tags joined by slashes.
369
370=head2 The Start Rules
371
372Apart from the normal rules that get invoked once the tag is fully parsed, including the contents and child tags, you may want to
373attach some code to the start tag to (optionaly) skip whole branches of XML or set up attributes and variables. You may set up
374the start rules either in a separate parameter to the constructor or in the rules=> by prepending the tag name(s) by ^.
375
376These rules are in form
377
378 tagspecification => undef / '' / 'skip' --> skip the element, including child tags
379 tagspecification => 1 / 'handle' --> handle the element, may be needed
380 if you specify the _default rule.
381 tagspecification => \&subroutine
382
383The subroutines receive the same parameters as for the "end tag" rules except of course the _content, but their return value is treated differently.
384If the subroutine returns a false value then the whole branch enclosed by the current tag is skipped, no data are stored and no rules are
385executed. You may modify the hash referenced by $attr.
386
387Both types of rules are free to store any data they want in $parser->{pad}. This property is NOT emptied
388after the parsing!
389
390=head2 Whitespace handling
391
392There are two options that affect the whitespace handling: stripspaces and normalisespaces. The normalisespaces is a simple flag that controls
393whether multiple spaces/tabs/newlines are collapsed into a single space or not. The stripspaces is more complex, it's a bit-mask,
394an ORed combination of the following options:
395
396 0 - don't remove whitespace around tags
397 (around tags means before the opening tag and after the closing tag, not in the tag's content!)
398 1 - remove whitespace before tags whose rules did not return any text content
399 (the rule specified for the tag caused the data of the tag to be ignored,
400 processed them already or added them as attributes to parent's \%attr)
401 2 - remove whitespace around tags whose rules did not return any text content
402 3 - remove whitespace around all tags
403
404 0 - remove only whitespace-only content
405 (that is remove the whitespace around <foo/> in this case "<bar> <foo/> </bar>"
406 but not this one "<bar>blah <foo/> blah</bar>")
407 4 - remove trailing/leading whitespace
408 (remove the whitespace in both cases above)
409
410 0 - don't trim content
411 8 - do trim content
412 (That is for "<foo> blah </foo>" only pass to the rule {_content => 'blah'})
413
414
415That is if you have a data oriented XML in which each tag contains either text content or subtags, but not both,
416you want to use stripspaces => 3 or stripspaces => 3|4. This will not only make sure you don't need to bother
417with the whitespace-only _content of the tags with subtags, but will also make sure you do not keep on wasting
418memory while parsing a huge XML and processing the "twigs". Without that option the parent tag of
419the repeated tag would keep on accumulating unneeded whitespace in its _content.
420
421=cut
422
423sub new {
424 my $class = shift;
425 my %params = @_;
426 croak "Please specify the rules=> for the parser!" unless $params{rules} and ref($params{rules});
427
428 my $self = {rules => {}, start_rules => {}};
429 bless $self, $class;
430
431 my @rules = (ref($params{rules}) eq 'HASH' ? %{$params{rules}} : @{$params{rules}}); # dereference and copy
432 delete $params{rules};
433
434 my @start_rules;
435 if ($params{start_rules} and ref($params{start_rules})) {
436 @start_rules = ref($params{start_rules}) eq 'HASH' ? %{$params{start_rules}} : @{$params{start_rules}}; # dereference and copy
437 };
438 delete $params{start_rules};
439
440 for (my $i=0; $i <= $#rules; $i+=2) {
441 next unless $rules[$i] =~ s/^\^//;
442 push @start_rules, splice( @rules, $i, 2);
443 $i-=2;
444 }
445
446 $self->_split_rules( \@rules, 'rules', 'as is');
447 $self->_split_rules( \@start_rules, 'start_rules', 'handle');
448
449 $self->{for_parser} = {};
450 { # extract the params for the XML::Parser::Expat constructor
451 my @for_parser = grep exists($params{$_}), qw(ProtocolEncoding Namespaces NoExpand Stream_Delimiter ErrorContext ParseParamEnt Base);
452 if (@for_parser) {
453 @{$self->{for_parser}}{@for_parser} = @params{@for_parser};
454 delete @params{@for_parser};
455 }
456 }
457
458 $self->{namespaces} = delete($params{namespaces});
459 if (defined($self->{namespaces})) {
460 croak 'XML::Rules->new( ... , namespaces => ...HERE...) must be a hash reference!'
461 unless ref($self->{namespaces}) eq 'HASH';
462 $self->{xmlns_map} = {};
463 }
464
465 $self->{style} = delete($params{style}) || 'parser';
466
467 $self->{opt}{lc $_} = $params{$_} for keys %params;
468
469 delete $self->{opt}{encode} if $self->{opt}{encode} =~ /^utf-?8$/i;
470 delete $self->{opt}{output_encoding} if $self->{opt}{output_encoding} =~ /^utf-?8$/i;
471
472 for (qw(normalisespace normalizespace normalizespaces)) {
473 last if defined($self->{opt}{normalisespaces});
474 $self->{opt}{normalisespaces} = $self->{opt}{$_};
475 delete $self->{opt}{$_};
476 }
477 $self->{opt}{normalisespaces} = 0 unless(defined($self->{opt}{normalisespaces}));
478 $self->{opt}{stripspaces} = 0 unless(defined($self->{opt}{stripspaces}));
479
480 require 'Encode.pm' if $self->{opt}{encode};
481 require 'Encode.pm' if $self->{opt}{output_encoding};
482
483 return $self;
484}
485
486sub _split_rules {
487 my ($self, $rules, $type, $default) = @_;
488
489 $self->{$type}{_default} = $default unless exists($self->{$type}{_default});
490
491 while (@$rules) {
492 my ($tag, $code) = (shift(@$rules), shift(@$rules));
493
494 if (ref($code) eq 'ARRAY') {
495 for( my $i = 0; $i < $#$code; $i+=2) {
496 $code->[$i] = _xpath2re($code->[$i]);
497 }
498 push @$code, $self->{$type}{_default} if @$code % 2 == 0; # add the default type if there's even number of items (path => code, path => code)
499 }
500
501 if ($tag =~ m{^/([^/].*)/([imosx]*)$}) { # string with a '/regexp/'
502 if ($2) {
503 push @{$self->{$type.'_re'}}, qr/(?$2)$1/;
504 } else {
505 push @{$self->{$type.'_re'}}, qr/$1/;
506 }
507 push @{$self->{$type.'_re_code'}}, $code;
508 } elsif (ref($tag) eq 'Regexp') { # a qr// created regexp
509 push @{$self->{$type.'_re'}}, $tag;
510 push @{$self->{$type.'_re_code'}}, $code;
511 } elsif ($tag =~ /[,\|]/) { # a , or | separated list
512 if ($tag =~ s/^\^//) {
513 my @tags = split(/\s*[,\|]\s*/, $tag);
514 $self->{$type}{'^'.$_} = $code for (@tags);
515 } else {
516 my @tags = split(/\s*[,\|]\s*/, $tag);
517 $self->{$type}{$_} = $code for (@tags);
518 }
519 } else { # a single tag
520 $self->{$type}{$tag} = $code;
521 }
522 }
523}
524
525sub _xpath2re {
526 my $s = shift;
527 return $s if ref($s);
528 for ($s) {
529 s/([\.\[\]+{}\-])/\\$1/g;
530 s{\*}{.+}g;
531 s{^//}{}s;
532 s{^/}{^}s;
533 }
534 return qr{$s$};
535}
536
537sub skip_rest {
538 die "[XML::Rules] skip rest\n";
539}
540
541sub return_nothing {
542 die "[XML::Rules] return nothing\n";
543}
544
545sub _run {
546 my $self = shift;
547 my $string = shift;
548 $self->{parameters} = shift;
549
550 $self->{parser} = XML::Parser::Expat->new( %{$self->{for_parser}});
551
552 $self->{parser}->setHandlers(
553 Start => _Start($self),
554 End => _End($self),
555 Char => _Char($self),
556 XMLDecl => _XMLDecl($self),
557 );
558
559 $self->{data} = [];
560 $self->{context} = [];
561 $self->{_ltrim} = [0];
562
563 if (! eval {
564 $self->{parser}->parse($string) and 1;
565 }) {
566 my $err = $@;
567 undef $@;
568
569 if ($err =~ /^\[XML::Rules\] skip rest/) {
570 my (undef, $handler) = $self->{parser}->setHandlers(End => undef);
571 foreach my $tag (reverse @{$self->{context} = []}) {
572 $handler->( $self->{parser}, $tag);
573 }
574 } else {
575
576 delete $self->{normal_handlers};
577 delete $self->{ignore_handlers};
578 delete $self->{parameters};
579 $self->{parser}->release();
580
581 $self->{data} = [];
582 $self->{context} = [];
583
584 if ($err =~ /^\[XML::Rules\] return nothing/) {
585 return;
586 }
587
588 $err =~ s/at \S+Rules\.pm line \d+$//
589 and croak $err or die $err;
590 }
591 };
592
593 delete $self->{normal_handlers};
594 delete $self->{ignore_handlers};
595 $self->{parser}->release();
596
597 delete $self->{parameters};
598 my $data; # return the accumulated data, without keeping a copy inside the object
599 ($data, $self->{data}) = ($self->{data}[0], undef);
600 if (!defined(wantarray()) or ! keys(%$data)) {
601 return;
602
603 } elsif (keys(%$data) == 1 and exists(${$data}{_content})) {
604 if (ref(${$data}{_content}) eq 'ARRAY' and @{${$data}{_content}} == 1) {
605 return ${${$data}{_content}}[0]
606 } else {
607 return ${$data}{_content}
608 }
609
610 } else {
611 return $data;
612 }
613}
614
615
616sub parsestring;
617*parsestring = \&parse;
618sub parse {
619 my $self = shift;
620 croak("This XML::Rules object may only be used as a filter!") if ($self->{style} eq 'filter');
621 $self->_run(@_);
622}
623
624sub parsefile {
625 my $self = shift;
626 croak("This XML::Rules object may only be used as a filter!") if ($self->{style} eq 'filter');
627 my $filename = shift;
628 open my $IN, '<', $filename or croak "Cannot open '$filename' for reading: $^E";
629 return $self->_run($IN, @_);
630}
631
632
633sub filterstring;
634*filterstring = \&filter;
635sub filter {
636 my $self = shift;
637 croak("This XML::Rules object may only be used as a parser!") unless ($self->{style} eq 'filter');
638
639 my $XML = shift;
640 $self->{FH} = shift || select(); # either passed or the selected filehandle
641 if (!ref($self->{FH})) {
642 if ($self->{FH} =~ /^main::(?:STDOUT|STDERR)$/) {
643 # yeah, select sometimes returns the name of the filehandle, not the filehandle itself. eg. "main::STDOUT"
644 no strict;
645 $self->{FH} = \*{$self->{FH}};
646 } else {
647 open my $FH, '>', $self->{FH} or croak(qq{Failed to open "$self->{FH}" for writing: $^E});
648 $self->{FH} = $FH;
649 }
650 } elsif (ref($self->{FH}) eq 'SCALAR') {
651 open my $FH, '>', $self->{FH};
652 $self->{FH} = $FH;
653 }
654 if ($self->{opt}{output_encoding}) {
655 print {$self->{FH}} qq{<?xml version="1.0" encoding="$self->{opt}{output_encoding}"?>\n};
656 } else {
657 print {$self->{FH}} qq{<?xml version="1.0"?>\n};
658 }
659 $self->_run($XML, @_);
660 print {$self->{FH}} "\n";
661 delete $self->{FH};
662}
663
664sub filterfile {
665 my $self = shift;
666 croak("This XML::Rules object may only be used as a parser!") unless ($self->{style} eq 'filter');
667
668 my $filename = shift;
669 open my $IN, '<', $filename or croak "Cannot open '$filename' for reading: $^E";
670
671 $self->{FH} = shift || select(); # either passed or the selected filehandle
672 if (!ref($self->{FH})) {
673 if ($self->{FH} =~ /^main::(?:STDOUT|STDERR)$/) {
674 # yeah, select sometimes returns the name of the filehandle, not the filehandle itself. eg. "main::STDOUT"
675 no strict;
676 $self->{FH} = \*{$self->{FH}};
677 } else {
678 open my $FH, '>', $self->{FH} or croak(qq{Failed to open "$self->{FH}" for writing: $^E});
679 $self->{FH} = $FH;
680 }
681 } elsif (ref($self->{FH}) eq 'SCALAR') {
682 open $self->{FH}, '>', $self->{FH};
683 }
684 if ($self->{opt}{output_encoding}) {
685 print {$self->{FH}} qq{<?xml version="1.0" encoding="$self->{opt}{output_encoding}"?>\n};
686 } else {
687 print {$self->{FH}} qq{<?xml version="1.0"?>\n};
688 }
689 $self->_run($IN, @_);
690 print {$self->{FH}} "\n";
691 delete $self->{FH};
692}
693
694sub _XMLDecl {
695 my $self = shift;
696 return sub {
697 my ( $Parser, $Version, $Encoding, $Standalone) = @_;
698 $self->{opt}{original_encoding} = $Encoding
699 }
700}
701
702=begin comment
703
704start tag
705 & 3 = 3 -> rtrim parent's _content
706 & 8 = 8 -> $ltrim = 1
707
708string content
709 $ltrim -> ltrim the string, if not completely whitespace set $ltrim 0
710
711end tag
712 & 8 = 8 -> rtrim own content
713 & 3 = 3 -> $ltrim = 1
714 empty_returned_content and & 3 in (1,2) -> rtrim parent content
715 empty_returned_content and & 3 = 2 -> $ltrim
716
717=end comment
718
719=cut
720
721sub _rtrim {
722 my ($self, $attr, $more) = @_;
723
724 if ($more) {
725 if (ref $attr->{_content}) {
726 if (!ref($attr->{_content}[-1])) {
727 $attr->{_content}[-1] =~ s/\s+$//s;
728 pop @{$attr->{_content}} if $attr->{_content}[-1] eq '';
729 delete $attr->{_content} unless @{$attr->{_content}};
730 }
731 } else {
732 $attr->{_content} =~ s/\s+$//s;
733 delete $attr->{_content} if $attr->{_content} eq '';
734 }
735 } else {
736 if (ref $attr->{_content}) {
737 if (!ref($attr->{_content}[-1]) and $attr->{_content}[-1] =~ /^\s*$/s) {
738 pop @{$attr->{_content}} ;
739 delete $attr->{_content} unless @{$attr->{_content}};
740 }
741 } else {
742 delete $attr->{_content} if $attr->{_content} =~ /^\s*$/s;
743 }
744 }
745}
746
747sub _Start {
748 my $self = shift;
749 my $encode = $self->{opt}{encode};
750 my $output_encoding = $self->{opt}{output_encoding};
751 return sub {
752 my ( $Parser, $Element , %Attr) = @_;
753
754 if (($self->{opt}{stripspaces} & 3) == 3) {
755 #rtrim parent
756#print "rtrim parent content in _Start\n";
757 if ($self->{data}[-1] and $self->{data}[-1]{_content}) {
758 $self->_rtrim( $self->{data}[-1], ($self->{opt}{stripspaces} & 4));
759 }
760 }
761 if ($self->{opt}{stripspaces} & 8) {
762#print "ltrim own content in _Start\n";
763 push @{$self->{_ltrim}}, 2;
764 } else {
765 push @{$self->{_ltrim}}, 0;
766 }
767
768 if ($self->{namespaces}) {
769 my %restore;
770 if (exists $Attr{xmlns}) { # find the default namespace
771#print "Found a xmlns attribute in $Element!\n";
772 $restore{''} = $self->{xmlns_map}{''};
773 if (!exists($self->{namespaces}{ $Attr{xmlns} })) {
774 warn qq{Unexpected namespace "$Attr{xmlns}" found in the XML!\n} unless exists $self->{namespaces}{'*'};
775 delete $self->{xmlns_map}{''};
776 delete $restore{''} unless defined($restore{''});
777 delete($Attr{xmlns});
778 } else {
779 $self->{xmlns_map}{''} = $self->{namespaces}{ delete($Attr{xmlns}) };
780 }
781 }
782 foreach my $attr (keys %Attr) { # find the namespace aliases
783 next unless $attr =~ /^xmlns:(.*)$/;
784 $restore{$1} = $self->{xmlns_map}{$1};
785 if (!exists($self->{namespaces}{ $Attr{$attr} })) {
786 warn qq{Unexpected namespace "$Attr{$attr}" found in the XML!\n} unless exists $self->{namespaces}{'*'};
787 delete $self->{xmlns_map}{$1};
788 delete $restore{$1} unless defined($restore{$1});
789 delete($Attr{$attr});
790 } else {
791 $self->{xmlns_map}{$1} = $self->{namespaces}{ delete($Attr{$attr}) };
792 }
793 }
794 if (%restore) {
795 push @{$self->{xmlns_restore}}, \%restore;
796 } else {
797 push @{$self->{xmlns_restore}}, undef;
798 }
799
800 if (%{$self->{xmlns_map}}) {
801#print "About to map aliases for $Element\n";
802 # add or map the alias for the tag
803 if ($Element =~ /^([^:]+):(.*)$/) {
804#print "Mapping an alias $1 for tag $Element\n";
805 if (exists($self->{xmlns_map}{$1})) {
806 if ($self->{xmlns_map}{$1} eq '') {
807 $Element = $2 ;
808 } else {
809 $Element = $self->{xmlns_map}{$1} . ':' . $2 ;
810 }
811 }
812#print " -> $Element\n";
813 } elsif (defined($self->{xmlns_map}{''}) and $self->{xmlns_map}{''} ne '') { # no namespace alias in the tag and there's a default
814#print "Adding default alias $self->{xmlns_map}{''}:\n";
815 $Element = $self->{xmlns_map}{''} . ':' . $Element;
816#print " -> $Element\n";
817 }
818
819 # map the aliases for the attributes
820 foreach my $attr (keys %Attr) {
821 next unless $attr =~ /^([^:]+):(.*)$/; # there's an alias
822 next unless exists($self->{xmlns_map}{$1}); # and there's a mapping
823 if ($self->{xmlns_map}{$1} eq '') {
824 $Attr{$2} = delete($Attr{$attr}); # rename the attribute
825 } else {
826 $Attr{$self->{xmlns_map}{$1} . ':' . $2} = delete($Attr{$attr}); # rename the attribute
827 }
828 }
829 }
830 } # /of namespace handling
831
832
833 my ( $start_rule, $end_rule) = map {
834 if ($self->{$_}{$Element} and ref($self->{$_}{$Element}) ne 'ARRAY') {
835 $self->{$_}{$Element}
836 } else {
837 $self->_find_rule( $_, $Element, $self->{context})
838 }
839 } ( 'start_rules', 'rules');
840
841 if ($start_rule ne 'handle'
842 and (
843 !$start_rule
844 or $start_rule eq 'skip'
845 or !$start_rule->($Element,\%Attr, $self->{context}, $self->{data}, $self)
846 )
847 ) {
848 # ignore the tag and the ones below
849 if (exists $self->{ignore_handlers}) {
850#print "SETHANDLERS!\n";
851 $Parser->setHandlers(@{$self->{ignore_handlers}})
852 } else {
853 $self->{ignore_handlers} = [
854 Start => _StartIgnore($self),
855 Char => undef,
856 End => _EndIgnore($self),
857 ];
858#print "SETHANDLERS2\n";
859 $self->{normal_handlers} = [$Parser->setHandlers(@{$self->{ignore_handlers}})];
860 }
861 $self->{ignore_level}=1;
862
863 } else {
864 # process the tag and the ones below
865 if ($encode) {
866 foreach my $value (values %Attr) {
867 $value = Encode::encode( $encode, $value);
868 }
869 }
870
871 push @{$self->{context}}, $Element;
872 push @{$self->{data}}, \%Attr;
873 $self->{lastempty} = 0;
874
875 if ($self->{style} eq 'filter') {
876 $self->{in_interesting}++ if ref($end_rule) or $end_rule =~ /^=/s; # is this tag interesting?
877
878 if (! $self->{in_interesting}) { # it neither this tag not an acestor is interesting, just copy the tag
879#print "Start:R ".$Parser->recognized_string()."\n";
880#print "Start:O ".$Parser->original_string()."\n";
881#print "Start:R ".$Parser->recognized_string()."\n";
882#print "Start:O ".$Parser->original_string()."\n";
883 if (! $output_encoding) {
884 print {$self->{FH}} $Parser->recognized_string();
885 } elsif ($output_encoding eq $self->{opt}{original_encoding}) {
886 print {$self->{FH}} $Parser->original_string();
887 } else {
888 print {$self->{FH}} $self->toXML($Element, \%Attr, "don't close");
889 }
890 }
891 }
892
893 }
894 }
895}
896
897sub _find_rule {
898 my ($self, $type, $Element, $path) = @_;
899
900 if (exists($self->{$type.'_re'})) {
901 for(my $i = 0; $i < @{$self->{$type.'_re'}}; $i++) {
902 if ($Element =~ $self->{$type.'_re'}[$i]) {
903 $self->{$type}{$Element} = $self->{$type.'_re_code'}[$i];
904 last;
905 }
906 }
907 }
908 if (! exists $self->{$type}{$Element}) {
909 $self->{$type}{$Element} = $self->{$type}{_default};
910 }
911
912 if (ref $self->{$type}{$Element} eq 'ARRAY') {
913 $path = join( '/', @$path);
914 for(my $i=0; $i < $#{$self->{$type}{$Element}}; $i+=2) {
915 if ($path =~ $self->{$type}{$Element}[$i]) {
916 return $self->{$type}{$Element}[$i+1];
917 }
918 }
919 return $self->{$type}{$Element}[-1];
920 } else {
921 return $self->{$type}{$Element};
922 }
923}
924
925sub _Char {
926 my $self = shift;
927 my $encode = $self->{opt}{encode};
928 return sub {
929 my ( $Parser, $String) = @_;
930
931 if ($self->{style} eq 'filter' and ! $self->{in_interesting}) {
932 if (! $self->{opt}{output_encoding}) {
933 print {$self->{FH}} $Parser->recognized_string();
934 } elsif ($self->{opt}{output_encoding} eq $self->{opt}{original_encoding}) {
935 print {$self->{FH}} $Parser->original_string();
936 } else {
937 print {$self->{FH}} encode($self->{opt}{output_encoding}, $Parser->recognized_string());
938 }
939 return;
940 }
941
942 if ($encode) {
943 $String = Encode::encode( $encode, $String);
944 }
945
946 if ($self->{_ltrim}[-1]) {
947#print "ltrim in $self->{context}[-1] ($String)\n";
948 if ($self->{_ltrim}[-1] == 2) {
949 $String =~ s/^\s+//s;
950 return if $String eq '';
951 } else {
952 return if $String =~ /^\s*$/s;
953 }
954 $self->{_ltrim}[-1] = 0;
955#print " ($String)\n";
956 }
957 $String =~ s/\s+/ /gs if ($self->{opt}{normalisespaces});
958
959 if (!exists $self->{data}[-1]{_content}) {
960 $self->{data}[-1]{_content} = $String;
961 } elsif (!ref $self->{data}[-1]{_content}) {
962 if ($self->{opt}{normalisespaces} and $self->{data}[-1]{_content} =~ /\s$/ and $String =~ /^\s/) {
963 $String =~ s/^\s+//s;
964 }
965 $self->{data}[-1]{_content} .= $String;
966 } else {
967 if (ref $self->{data}[-1]{_content}[-1]) {
968 push @{$self->{data}[-1]{_content}}, $String;
969 } else {
970 if ($self->{opt}{normalisespaces} and $self->{data}[-1]{_content}[-1] =~ /\s$/ and $String =~ /^\s/) {
971 $String =~ s/^\s+//s;
972 }
973 $self->{data}[-1]{_content}[-1] .= $String;
974 }
975 }
976 }
977}
978
979sub _End {
980 my $self = shift;
981 return sub {
982 my ( $Parser, $Element) = @_;
983 $Element = pop @{$self->{context}}; # the element name may have been mangled by XMLNS aliasing
984
985 if ($self->{opt}{stripspaces} & 8) {
986#print "rtrim own content\n";
987 if ($self->{data}[-1] and $self->{data}[-1]{_content}) {
988 $self->_rtrim( $self->{data}[-1], 1);
989 }
990 }
991 pop(@{$self->{_ltrim}});
992
993 if ($self->{namespaces}) {
994 if (my $restore = pop @{$self->{xmlns_restore}}) { # restore the old default namespace and/or alias mapping
995 while (my ($their, $our) = each %$restore) {
996 if (defined($our)) {
997 $self->{xmlns_map}{$their} = $our;
998 } else {
999 delete $self->{xmlns_map}{$their};
1000 }
1001 }
1002 }
1003 }
1004
1005 my ($rule) = map {
1006 if ($self->{$_}{$Element} and ref($self->{$_}{$Element}) ne 'ARRAY') {
1007 $self->{$_}{$Element}
1008 } else {
1009 $self->_find_rule( $_, $Element, $self->{context})
1010 }
1011 } ('rules');
1012
1013 my $data = pop @{$self->{data}};
1014
1015 my @results;
1016 if (ref $rule or $rule =~ /^=/s) {
1017 if ($rule =~ /^==(.*)$/s) { # change the whole tag to a string
1018 @results = ($1);
1019 } elsif ($rule =~ /^=(.*)$/s) { # change the contents to a string
1020 @results = ($Element => $1);
1021 } else {
1022 @results = $rule->($Element, $data, $self->{context}, $self->{data}, $self);
1023 }
1024
1025 if ($self->{style} eq 'filter') {
1026
1027 $self->{in_interesting}--;
1028 if (!$self->{in_interesting}) {
1029 if (@{$self->{data}}) {
1030 print {$self->{FH}} $self->escape_value($self->{data}[-1]{_content});
1031 delete $self->{data}[-1]{_content};
1032 }
1033 my $base;
1034 if ($self->{opt}{ident} ne '') {
1035 $base = $self->{opt}{ident} x scalar(@{$self->{context}});
1036 }
1037 @results and $results[0] =~ s/^[\@%\+\*\.]//;
1038 while (@results) {
1039#use Data::Dumper;
1040#print "\@results=".Dumper(\@results)."\n";
1041 if (ref($results[0])) {
1042 croak(ref($results[0]) . " not supported as the return value of a filter") unless ref($results[0]) eq 'ARRAY';
1043 if (@{$results[0]} ==2 and ref($results[0][1]) eq 'HASH') {
1044 print {$self->{FH}} $self->toXML(@{$results[0]}[0,1], 0, $self->{opt}{ident}, $base);
1045 } else {
1046 foreach my $item (@{$results[0]}) {
1047 if (ref($item)) {
1048 croak(ref($item) . " not supported in the return value of a filter") unless ref($item) eq 'ARRAY';
1049 croak("Empty array not supported in the return value of a filter") unless @$item;
1050 if (@$item <= 2) {
1051 print {$self->{FH}} $self->toXML(@{$item}[0,1], 0, $self->{opt}{ident}, $base);
1052 } else { # we suppose the 3rd and following elements are parameters to ->toXML()
1053 print {$self->{FH}} $self->toXML(@$item);
1054 }
1055 } else {
1056 print {$self->{FH}} $self->escape_value($item);
1057 }
1058 }
1059 }
1060 shift(@results);
1061 } else {
1062 if (@results == 1) {
1063 print {$self->{FH}} $self->escape_value($results[0]);
1064 @results = ();last;
1065 } else {
1066 print {$self->{FH}} $self->toXML(shift(@results), shift(@results), 0, $self->{opt}{ident}, $base);
1067 }
1068 }
1069 }
1070 }
1071 }
1072 } elsif ($self->{style} eq 'filter' and ! $self->{in_interesting}) {
1073#print "End: \$Element=$Element; \$Parser->recognized_string()=".$Parser->recognized_string()."; \$Parser->original_string()=".$Parser->original_string()."\n";
1074die "Unexpected \$data->{content}={$data->{_content}} in filter outside interesting nodes!\n" if $data->{_content} ne '';
1075 if (! $self->{opt}{output_encoding}) {
1076 print {$self->{FH}} $Parser->recognized_string();
1077 } elsif ($self->{opt}{output_encoding} eq $self->{opt}{original_encoding}) {
1078 print {$self->{FH}} $Parser->original_string();
1079 } else {
1080 print {$self->{FH}} encode($self->{opt}{output_encoding}, $Parser->recognized_string());
1081 }
1082# print {$self->{FH}} $self->escape_value($data->{_content})."</$Element>";
1083
1084 } else { # a predefined rule
1085
1086 if ($rule =~ s/(?:^| )no xmlns$//) {
1087 $Element =~ s/^\w+://;
1088 $rule = 'as_is' if $rule eq '';
1089 }
1090
1091 if ($rule eq '') {
1092 @results = ();
1093 } elsif ($rule eq 'content') {
1094 @results = ($Element => $data->{_content});
1095 } elsif ($rule eq 'content trim') {
1096 s/^\s+//,s/\s+$// for ($data->{_content});
1097 @results = ($Element => $data->{_content});
1098 } elsif ($rule eq 'content array') {
1099 @results = ('@'.$Element => $data->{_content});
1100 } elsif ($rule eq 'as is') {
1101 @results = ($Element => $data);
1102 } elsif ($rule eq 'as is trim') {
1103 s/^\s+//,s/\s+$// for ($data->{_content});
1104 @results = ($Element => $data);
1105 } elsif ($rule eq 'as array') {
1106 @results = ('@'.$Element => $data);
1107 } elsif ($rule eq 'as array trim') {
1108 s/^\s+//,s/\s+$// for ($data->{_content});
1109 @results = ('@'.$Element => $data);
1110 } elsif ($rule eq 'no content') {
1111 delete ${$data}{_content}; @results = ($Element => $data);
1112 } elsif ($rule eq 'no content array' or $rule eq 'as array no content') {
1113 delete ${$data}{_content}; @results = ('@' . $Element => $data);
1114
1115 } elsif ($rule eq 'pass') {
1116 @results = (%$data);
1117 } elsif ($rule eq 'pass trim') {
1118 s/^\s+//,s/\s+$// for ($data->{_content});
1119 @results = (%$data);
1120 } elsif ($rule eq 'pass no content' or $rule eq 'pass without content') {
1121 delete ${$data}{_content}; @results = (%$data);
1122
1123 } elsif ($rule eq 'raw') {
1124 @results = [$Element => $data];
1125
1126 } elsif ($rule eq 'raw extended') {
1127 @results = (':'.$Element => $data, [$Element => $data]);
1128
1129 } elsif ($rule eq 'raw extended array') {
1130 @results = ('@:'.$Element => $data, [$Element => $data]);
1131
1132 } elsif ($rule =~ /^((?:no )?content )?by\s+(\S+)$/) {
1133 my ($cnt,$attr) = ($1,$2);
1134 if ($cnt eq 'no content ') {
1135 delete $data->{_content};
1136 }
1137 if ($attr =~ /,/) {
1138 my @attr = split /,/, $attr;
1139 foreach (@attr) {
1140 next unless exists ($data->{$_});
1141 if ($cnt eq 'content ') {
1142 @results = ($data->{$_} => $data->{_content})
1143 } else {
1144 @results = (delete $data->{$_} => $data)
1145 }
1146 last;
1147 }
1148 } else {
1149 if ($cnt eq 'content ') {
1150 @results = ($data->{$attr} => $data->{_content})
1151 } else {
1152 @results = (delete $data->{$attr} => $data);
1153 }
1154 }
1155
1156 } else {
1157 croak "Unknown predefined rule '$rule'!";
1158 }
1159 }
1160
1161 if (! @results or (@results % 2 == 0) or $results[-1] eq '') {
1162 if ($self->{opt}{stripspaces} & 3 and @{$self->{data}} and $self->{data}[-1]{_content}) { # stripping some spaces, it's not root and it did not return content
1163#print "maybe stripping some spaces in $Element, it's not root and it did not return content\n";
1164 if (($self->{opt}{stripspaces} & 3) < 3 and $self->{data}[-1]{_content}) {
1165 # rtrim parent content
1166#print " yes, rtrim parent '$self->{data}[-1]{_content}'\n";
1167 $self->_rtrim( $self->{data}[-1], ($self->{opt}{stripspaces} & 4));
1168#print " result '$self->{data}[-1]{_content}'\n";
1169 }
1170
1171 $self->{_ltrim}[-1] = (($self->{opt}{stripspaces} & 4) ? 2 : 1)
1172 if ($self->{opt}{stripspaces} & 3) == 2;
1173 }
1174 } else {
1175 $self->{_ltrim}[-1] = 0;
1176 }
1177 if (($self->{opt}{stripspaces} & 3) == 3) {
1178 $self->{_ltrim}[-1] = (($self->{opt}{stripspaces} & 4) ? 2 : 1);
1179 }
1180
1181
1182 return unless scalar(@results) or scalar(@results) == 1 and ($results[0] eq '' or !defined($results[0]));
1183
1184 @{$self->{data}} = ({}) unless @{$self->{data}}; # oops we are already closing the root tag! We do need there to be at least one hashref in $self->{data}
1185
1186 if (scalar(@results) % 2) {
1187 # odd number of items, last is content
1188 my $value = pop(@results);
1189 _add_content( $self->{data}[-1], $value);
1190 }
1191
1192 while (@results) {
1193 my ($key, $value) = ( shift(@results), shift(@results));
1194 if ($key eq '_content') {
1195 _add_content( $self->{data}[-1], $value);
1196 } elsif ($key =~ s/^\@//) {
1197 if (exists($self->{data}[-1]{$key}) and ref($self->{data}[-1]{$key}) ne 'ARRAY') {
1198 $self->{data}[-1]{$key} = [$self->{data}[-1]{$key}, $value];
1199 } else {
1200 push @{$self->{data}[-1]{$key}}, $value;
1201 }
1202 } elsif ($key =~ s/^\+//) {
1203 if (exists($self->{data}[-1]{$key})) {
1204 $self->{data}[-1]{$key} += $value;
1205 } else {
1206 $self->{data}[-1]{$key} = $value;
1207 }
1208 } elsif ($key =~ s/^\*//) {
1209 if (exists($self->{data}[-1]{$key})) {
1210 $self->{data}[-1]{$key} *= $value;
1211 } else {
1212 $self->{data}[-1]{$key} = $value;
1213 }
1214 } elsif ($key =~ s/^\.//) {
1215 if (exists($self->{data}[-1]{$key})) {
1216 $self->{data}[-1]{$key} .= $value;
1217 } else {
1218 $self->{data}[-1]{$key} = $value;
1219 }
1220 } elsif ($key =~ s/^\%//) {
1221 if (exists($self->{data}[-1]{$key})) {
1222 if (ref($value) eq 'HASH') {
1223 %{$self->{data}[-1]{$key}} = (%{$self->{data}[-1]{$key}}, %$value);
1224 } elsif (ref($value) eq 'ARRAY') {
1225 %{$self->{data}[-1]{$key}} = (%{$self->{data}[-1]{$key}}, @$value);
1226 } else {
1227 croak "The value of the rule return \%$key must be a hash or array ref!";
1228 }
1229 } else {
1230 if (ref($value) eq 'HASH') {
1231 $self->{data}[-1]{$key} = $value;
1232 } elsif (ref($value) eq 'ARRAY') {
1233 $self->{data}[-1]{$key} = {@$value};
1234 } else {
1235 croak "The value of the rule return \%$key must be a hash or array ref!";
1236 }
1237 }
1238 } else {
1239 warn "The attribute '$key' already exists for tag $self->{context}[-1].\n old value: $self->{data}[-1]{$key}\n new value: $value\n"
1240 if ($self->{opt}{warnoverwrite} and exists $self->{data}[-1]{$key} and $self->{data}[-1]{$key} ne $value);
1241 $self->{data}[-1]{$key} = $value;
1242 }
1243 }
1244 }
1245}
1246
1247sub _StartIgnore {
1248 my ($self) = shift;
1249 return sub {
1250 $self->{ignore_level}++
1251 }
1252}
1253
1254sub _EndIgnore {
1255 my ($self) = shift;
1256 return sub {
1257 return if --$self->{ignore_level};
1258
1259 $self->{parser}->setHandlers(@{$self->{normal_handlers}})
1260 }
1261}
1262
1263sub _add_content {
1264 my ($hash, $value) = @_;
1265 if (ref($value)) {
1266 if (ref($hash->{_content})) {
1267 # both are refs, push to @_content
1268 push @{$hash->{_content}}, $value;
1269 } elsif (exists($hash->{_content})) {
1270 # result is ref, _content is not -> convert to an arrayref containing old _content and result
1271 $hash->{_content} = [ $hash->{_content}, $value]
1272 } else {
1273 # result is ref, _content is not present
1274 $hash->{_content} = [ $value]
1275 }
1276 } else {
1277 if (ref($hash->{_content})) {
1278 # _content is an arrayref, value is a string
1279 if (ref $hash->{_content}[-1]) {
1280 # the last element is a ref -> push
1281 push @{$hash->{_content}}, $value;
1282 } else {
1283 # the last element is a string -> concatenate
1284 $hash->{_content}[-1] .= $value;
1285 }
1286 } else {
1287 # neither is ref, concatenate
1288 $hash->{_content} .= $value;
1289 }
1290 }
1291}
1292
1293=head1 METHODS
1294
1295=head2 parse
1296
1297 $parser->parse( $string [, $parameters]);
1298 $parser->parse( $IOhandle [, $parameters]);
1299
1300Parses the XML in the string or reads and parses the XML from the opened IO handle,
1301executes the rules as it encounters the closing tags and returns the resulting structure.
1302
1303The scalar or reference passed as the second parameter to the parse() method is assigned to
1304$parser->{parameters} for the parsing of the file or string. Once the XML is parsed the key is
1305deleted. This means that the $parser does not retain a reference to the $parameters after the parsing.
1306
1307=head2 parsestring
1308
1309 $parser->parsestring( $string [, $parameters]);
1310
1311Just an alias to ->parse().
1312
1313=head2 parsefile
1314
1315 $parser->parsefile( $filename [, $parameters]);
1316
1317Opens the specified file and parses the XML and executes the rules as it encounters
1318the closing tags and returns the resulting structure.
1319
1320
1321=head2 filter
1322
1323 $parser->filter( $string);
1324 $parser->filter( $string, $OutputIOhandle [, $parameters]);
1325 $parser->filter( $InputIOhandle, $OutputIOhandle [, $parameters]);
1326 $parser->filter( $string, $OutputFilename [, $parameters]);
1327 $parser->filter( $InputIOhandle, $OutputFilename [, $parameters]);
1328 $parser->filter( $string, $StringReference [, $parameters]);
1329 $parser->filter( $InputIOhandle, $StringReference [, $parameters]);
1330
1331Parses the XML in the string or reads and parses the XML from the opened IO handle,
1332copies the tags that do not have a subroutine rule specified and do not occure under such a tag,
1333executes the specified rules and prints the results to select()ed filehandle, $OutputFilename or
1334$OutputIOhandle or stores them in the scalar referenced by $StringReference.
1335
1336The scalar or reference passed as the third parameter to the filter() method is assigned to
1337$parser->{parameters} for the parsing of the file or string. Once the XML is parsed the key is
1338deleted. This means that the $parser does not retain a reference to the $parameters after the parsing.
1339
1340=head2 filterstring
1341
1342 $parser->filterstring( ...);
1343
1344Just an alias to ->filter().
1345
1346=head2 filterfile
1347
1348 $parser->filterfile( $filename, $OutputIOhandle [, $parameters]);
1349 $parser->filterfile( $filename, $OutputFilename [, $parameters]);
1350
1351Parses the XML in the specified file, copies the tags that do not have a subroutine rule specified
1352and do not occure under such a tag, executes the specified rules and prints the results to select()ed
1353filehandle, $OutputFilename or $OutputIOhandle or stores them in the scalar
1354referenced by $StringReference.
1355
1356The scalar or reference passed as the third parameter to the filter() method is assigned to
1357$parser->{parameters} for the parsing of the file or string. Once the XML is parsed the key is
1358deleted. This means that the $parser does not retain a reference to the $parameters after the parsing.
1359
1360=cut
1361
1362sub escape_value {
1363 my($self, $data, $level) = @_;
1364
1365 return '' unless(defined($data) and $data ne '');
1366
1367 if ($self->{opt}{output_encoding} ne $self->{opt}{encode}) {
1368 $data = Encode::decode( $self->{opt}{encode}, $data) if $self->{opt}{encode};
1369 $data = Encode::encode( $self->{opt}{output_encoding}, $data) if $self->{opt}{output_encoding};
1370 }
1371
1372 $data =~ s/&/&amp;/sg;
1373 $data =~ s/</&lt;/sg;
1374 $data =~ s/>/&gt;/sg;
1375 $data =~ s/"/&quot;/sg;
1376
1377 $level = $self->{opt}->{numericescape} unless defined $level;
1378 return $data unless $level;
1379
1380 return $self->_numeric_escape($data, $level);
1381}
1382
1383sub _numeric_escape {
1384 my($self, $data, $level) = @_;
1385
1386 use utf8; # required for 5.6
1387
1388 if($self->{opt}->{numericescape} eq '2') {
1389 $data =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/gse;
1390 }
1391 else {
1392 $data =~ s/([^\x00-\xFF])/'&#' . ord($1) . ';'/gse;
1393 }
1394
1395 return $data;
1396}
1397
1398=head2 escape_value
1399
1400 $parser->escape_value( $data [, $numericescape])
1401
1402This method escapes the $data for inclusion in XML, the $numericescape may be 0, 1 or 2
1403and controls whether to convert 'high' (non ASCII) characters to XML entities.
1404
14050 - default: no numeric escaping (OK if you're writing out UTF8)
1406
14071 - only characters above 0xFF are escaped (ie: characters in the 0x80-FF range are not escaped), possibly useful with ISO8859-1 output
1408
14092 - all characters above 0x7F are escaped (good for plain ASCII output)
1410
1411You can also specify the default value in the constructor
1412
1413 my $parser = XML::Rules->new(
1414 ...
1415 NumericEscape => 2,
1416 );
1417
1418=cut
1419
1420sub ToXML;*ToXML=\&toXML;
1421sub toXML {
1422 my ($self, $tag, $attrs, $no_close, $ident, $base) = @_;
1423
1424 my $prefix = (defined ($ident) ? "\n$base" : '');
1425
1426 $attrs = undef if (ref $attrs eq 'HASH' and ! %{$attrs}); # ->toXML( $tagname, {}, ...)
1427
1428 if ($tag eq '') {
1429 if (! ref($attrs)) { # ->toXML( '', $string_content, ...)
1430 return $self->escape_value($attrs);
1431 } elsif (ref($attrs) eq 'ARRAY') {
1432 if (@$attrs) {
1433 return join( '', map {
1434 if (!ref($_)) {
1435 $self->escape_value($_)
1436 } elsif (ref($_) eq 'ARRAY' and @$_ == 2) {
1437 $self->toXML($_->[0], $_->[1])
1438 } else {
1439 croak "The content in XML::Rules->ToXML( '', here) must be a string or an arrayref containing strings and two element arrayrefs!";
1440 }
1441 } @$attrs);
1442 } else {
1443 return "<$attrs/>"
1444 }
1445 }
1446 }
1447
1448 if (! ref($attrs)) { # ->toXML( $tagname, $string_content, ...)
1449 if ($no_close) {
1450 return "<$tag>" . $self->escape_value($attrs);
1451 } elsif (! defined $attrs) {
1452 return "<$tag/>";
1453 } else {
1454 return "<$tag>" . $self->escape_value($attrs) . "</$tag>";
1455 }
1456 } elsif (ref($attrs) eq 'ARRAY') {
1457 return join( $prefix, map $self->toXML($tag, $_, 0, $ident, $base), @$attrs);
1458 }
1459
1460
1461 my $content = $attrs->{_content};
1462 my $result = "<$tag";
1463 my $subtags = '';
1464 foreach my $key (sort keys %$attrs) {
1465 next if $key =~ /^:/ or $key eq '_content';
1466 if (ref $attrs->{$key}) {
1467 if (ref $attrs->{$key} eq 'ARRAY') {
1468 if (@{$attrs->{$key}}) {
1469 foreach my $subtag (@{$attrs->{$key}}) {
1470 $subtags .= $prefix . $ident . $self->toXML($key, $subtag, 0, $ident, $base.$ident);
1471 }
1472 } else {
1473 $subtags .= $prefix . $ident . "<$key/>";
1474 }
1475 } elsif (ref $attrs->{$key} eq 'HASH') {
1476 $subtags .= $prefix . $ident . $self->toXML($key, $attrs->{$key}, 0, $ident, $base.$ident)
1477 } else {
1478 croak(ref($attrs->{$key}) . " attributes not supported in XML::Rules->toXML()!");
1479 }
1480 } else {
1481 $result .= qq{ $key="} . $self->escape_value($attrs->{$key}) . qq{"};
1482 }
1483 }
1484 if (! defined $content and $subtags eq '') {
1485 if ($no_close) {
1486 return $result.">";
1487 } else {
1488 return $result."/>";
1489 }
1490
1491 } elsif (!ref($content)) { # content is a string, not an array of strings and subtags
1492 if ($no_close) {
1493 return "$result>$subtags" . $self->escape_value($content);
1494 } elsif ($content eq '' and $subtags ne '') {
1495 return "$result>$subtags$prefix</$tag>";
1496 } else {
1497 return "$result>$subtags" . $self->escape_value($content) ."</$tag>";
1498 }
1499
1500 } elsif (ref($content) eq 'ARRAY') {
1501 $result .= ">$subtags";
1502 foreach my $snippet (@$content) {
1503 if (!ref($snippet)) {
1504 $result .= $self->escape_value($snippet);
1505 } elsif (ref($snippet) eq 'ARRAY') {
1506 if (@$snippet == 2) {
1507 $result .= $self->toXML($snippet->[0], $snippet->[1]);
1508 } else {
1509 croak("Arrays in _content must be in format [\$tagname => \\\%attrs, ...] in XML::Rules->toXML()!");
1510 }
1511 } else {
1512 croak(ref($snippet) . " not supported in _content in XML::Rules->toXML()!");
1513 }
1514 }
1515 if ($no_close) {
1516 return $result;
1517 } else {
1518 return $result."</$tag>";
1519 }
1520 } else {
1521 croak(ref($content) . " _content not supported in XML::Rules->toXML()!");
1522 }
1523}
1524
1525sub parentsToXML {
1526 my ($self, $level) = @_;
1527 my $tag_names = $self->{context};
1528 my $tag_attrs = $self->{data};
1529
1530 $level = scalar(@$tag_names) unless $level;
1531
1532 my $result = '';
1533 for (my $i = -1; -$i <= $level; $i--) {
1534 $result = $self->toXML( ${$tag_names}[$i], ${$tag_attrs}[$i], 1) . $result;
1535 }
1536 return $result;
1537}
1538
1539sub closeParentsToXML {
1540 my ($self, $level) = @_;
1541 my $tag_names = $self->{context};
1542
1543 if ($level) {
1544 return '</' . join( '></', (reverse(@{$tag_names}))[0..$level-1]) . '>';
1545 } else {
1546 return '</' . join( '></', reverse(@$tag_names)) . '>';
1547 }
1548}
1549
1550=head2 toXML / ToXML
1551
1552 $xml = $parser->toXML( $tagname, \%attrs[, $do_not_close, $ident, $base])
1553
1554You may use this method to convert the datastructures created by parsing the XML into the XML format.
1555Not all data structures may be printed! I'll add more docs later, for now please do experiment.
1556
1557The $ident and $base, if defined, turn on and control the pretty-printing. The $ident specifies the character(s)
1558used for one level of identation, the base contains the identation of the current tag. That is if you want to include the data inside of
1559
1560 <data>
1561 <some>
1562 <subtag>$here</subtag>
1563 </some>
1564 </data>
1565
1566you will call
1567
1568 $parser->toXML( $tagname, \%attrs, 0, "\t", "\t\t\t");
1569
1570The method does NOT validate that the $ident and $base are whitespace only, but of course if it's not you end up with invalid
1571XML. Newlines are added only before the start tag and (if the tag has only child tags and no content) before the closing tag,
1572but not after the closing tag! Newlines are added even if the $ident is an empty string.
1573
1574=head2 parentsToXML
1575
1576 $xml = $parser->parentsToXML( [$level])
1577
1578Prints all or only the topmost $level ancestor tags, including the attributes and content (parsed so far),
1579but without the closing tags. You may use this to print the header of the file you are parsing,
1580followed by calling toXML() on a structure you build and then by closeParentsToXML() to close
1581the tags left opened by parentsToXML(). You most likely want to use the style => 'filter' option
1582for the constructor instead.
1583
1584=head2 closeParentsToXML
1585
1586 $xml = $parser->closeParentsToXML( [$level])
1587
1588Prints the closing tags for all or the topmost $level ancestor tags of the one currently processed.
1589
1590=head2 paths2rules
1591
1592 my $parser = XML::Rules->new(
1593 rules => paths2rules {
1594 '/root/subtag/tag' => sub { ...},
1595 '/root/othertag/tag' => sub {...},
1596 'tag' => sub{ ... the default code for this tag ...},
1597 ...
1598 }
1599 );
1600
1601This helper function converts a hash of "somewhat xpath-like" paths and subs/rules into the format required by the module.
1602Due to backwards compatibility and efficiency I can't directly support paths in the rules and the direct syntax for their
1603specification is a bit awkward. So if you need the paths and not the regexps, you may use this helper instead of:
1604
1605 my $parser = XML::Rules->new(
1606 rules => {
1607 'tag' => [
1608 '/root/subtag' => sub { ...},
1609 '/root/othertag' => sub {...},
1610 sub{ ... the default code for this tag ...},
1611 ],
1612 ...
1613 }
1614 );
1615
1616=cut
1617
1618sub paths2rules {
1619 my ($paths) = @_;
1620
1621 my %rules;
1622 while ( my ($tag, $val) = each %$paths) {
1623
1624 if ($tag =~ m{^(.*)/(.*)$}) {
1625 my ($path, $tagname) = ($1, $2);
1626
1627 if (exists $rules{$tagname} and ref($rules{$tagname}) eq 'ARRAY') {
1628 if (@{$rules{$tagname}} % 2) {
1629 push @{$rules{$tagname}}, $path, $val;
1630 } else {
1631 splice @{$rules{$tagname}}, -1, 0, $path, $val;
1632 }
1633 } else {
1634 $rules{$tagname} = [ $path => $val]
1635 }
1636
1637 } elsif (exists $rules{$tag} and ref($rules{$tag}) eq 'ARRAY') {
1638 push @{$rules{$tag}}, $val;
1639 } else {
1640 $rules{$tag} = $val
1641 }
1642 }
1643
1644 return \%rules;
1645}
1646
1647=head2 return_nothing
1648
1649Stop parsing the XML, forget any data we already have and return from the $parser->parse().
1650This is only supposed to be used within rules and may be called both as a method and as
1651an ordinary function (it's not exported).
1652
1653=head2 skip_rest
1654
1655Stop parsing the XML and return whatever data we already have from the $parser->parse().
1656The rules for the currently opened tags are evaluated as if the XML contained all
1657the closing tags in the right order.
1658
1659These two work via raising an exception, the exception is caught within the $parser->parse() and do not propagate outside.
1660It's also safe to raise any other exception within the rules, the exception will be caught as well, the internal state of the $parser object
1661is cleaned and the exception is rethrown.
1662
1663=head2 inferRulesFromExample
1664
1665 Dumper(XML::Rules::inferRulesFromExample( $fileOrXML, $fileOrXML, $fileOrXML, ...)
1666 Dumper(XML::Rules->inferRulesFromExample( $fileOrXML, $fileOrXML, $fileOrXML, ...)
1667 Dumper($parser->inferRulesFromExample( $fileOrXML, $fileOrXML, $fileOrXML, ...)
1668
1669The subroutine parses the listed files and infers the rules that would produce the minimal, but complete datastructure.
1670It finds out what tags may be repeated, whether they contain text content, attributes etc. You may want to give
1671the subroutine several examples to make sure it knows about all possibilities. You should use this once and store
1672the generated rules in your script or even take this as the basis of a more specific set of rules.
1673
1674=cut
1675
1676sub inferRulesFromExample {
1677 shift(@_) if $_[0] eq 'XML::Rules' or ref($_[0]);
1678 my @files = @_;
1679
1680 my %rules;
1681
1682 my $parser = XML::Rules->new(
1683 namespaces => { '*' => ''},
1684 rules => {
1685 _default => sub {
1686 my ($tag, $attrs, $context, $parent_data, $parser) = @_;
1687 my $repeated = (exists $parent_data->[-1] and exists $parent_data->[-1]{$tag});
1688 my $has_content = (exists $attrs->{_content});
1689 my $has_children = grep ref($_) eq 'HASH', values %$attrs;
1690 my $has_attr = grep {$_ ne '_content' and !ref($attrs->{$_})} keys %$attrs;
1691
1692 my $rule = do {
1693 if ($repeated) {
1694 if ($has_content) {
1695 if ($has_attr or $has_children) {
1696 'as array'
1697 } else {
1698 'content array'
1699 }
1700 } else {
1701 if ($has_attr or $has_children) {
1702 'as array no content'
1703 } else {
1704 'content array'
1705 }
1706 }
1707 } else {
1708 if ($has_content) {
1709 if ($has_attr or $has_children) {
1710 'as is'
1711 } else {
1712 'content'
1713 }
1714 } else {
1715 if ($has_attr or $has_children) {
1716 'no content'
1717 } else {
1718 'content'
1719 }
1720 }
1721 }
1722 };
1723
1724 if (not exists $rules{$tag}) {
1725 $rules{$tag} = $rule
1726 } elsif($rules{$tag} ne $rule) {
1727 # we've already seen the tag and it had different type
1728 if ($rules{$tag} eq 'raw extended array') {
1729 } elsif ($rule eq 'raw extended array') {
1730 $rules{$tag} = 'raw extended array';
1731 } elsif ($rules{$tag} eq 'raw extended' and $rule =~ /array/
1732 or $rule eq 'raw extended' and $rules{$tag} =~ /array/) {
1733 $rules{$tag} = 'raw extended array'
1734 } elsif ($rules{$tag} eq 'as array' or $rule eq 'as array') {
1735 $rules{$tag} = 'as array'
1736 } elsif ($rules{$tag} eq 'content array' and $rule eq 'content'
1737 or $rule eq 'content array' and $rules{$tag} eq 'content') {
1738 $rules{$tag} = 'content array'
1739 } elsif ($rules{$tag} eq 'content array' and $rule eq 'as array no content'
1740 or $rule eq 'content array' and $rules{$tag} eq 'as array no content') {
1741 $rules{$tag} = 'as array'
1742 } elsif ($rules{$tag} eq 'content array' and $rule eq 'as is'
1743 or $rule eq 'content array' and $rules{$tag} eq 'as is') {
1744 $rules{$tag} = 'as array'
1745 } elsif ($rules{$tag} eq 'content array' and $rule eq 'no content'
1746 or $rule eq 'content array' and $rules{$tag} eq 'no content') {
1747 $rules{$tag} = 'as array'
1748 } elsif ($rules{$tag} eq 'as array no content' and $rule eq 'as is'
1749 or $rule eq 'as array no content' and $rules{$tag} eq 'as is') {
1750 $rules{$tag} = 'as array'
1751 } elsif ($rules{$tag} eq 'as array no content' and $rule eq 'content'
1752 or $rule eq 'as array no content' and $rules{$tag} eq 'content') {
1753 $rules{$tag} = 'as array'
1754 } elsif ($rules{$tag} eq 'as array no content' and $rule eq 'no content'
1755 or $rule eq 'as array no content' and $rules{$tag} eq 'no content') {
1756 $rules{$tag} = 'as array no content'
1757 } elsif ($rules{$tag} eq 'as is' and ($rule eq 'no content' or $rule eq 'content')
1758 or $rule eq 'as is' and ($rules{$tag} eq 'no content' or $rules{$tag} eq 'content')) {
1759 $rules{$tag} = 'as is'
1760 } elsif ($rules{$tag} eq 'content' and $rule eq 'no content'
1761 or $rule eq 'content' and $rules{$tag} eq 'no content') {
1762 $rules{$tag} = 'as is'
1763 } else {
1764 die "Unexpected combination of rules: old=$rules{$tag}, new=$rule for tag $tag\n";
1765 }
1766 }
1767
1768 if ($has_content and $has_children) { # the tag contains both text content and subtags!, need to use the raw extended rules
1769 foreach my $child (grep ref($attrs->{$_}) eq 'HASH', keys %$attrs) {
1770 next if $rules{$child} =~ /^raw extended/;
1771 if ($rules{$child} =~ /array/) {
1772 $rules{$child} = 'raw extended array'
1773 } else {
1774 $rules{$child} = 'raw extended'
1775 }
1776 }
1777 }
1778 return $tag => {};
1779 }
1780 },
1781 stripspaces => 7,
1782 );
1783
1784 for (@files) {
1785 eval {
1786 if (! ref($_) and $_ !~ /\n/ and $_ !~ /^\s*</) {
1787 $parser->parsefile($_);
1788 } else {
1789 $parser->parse($_);
1790 }
1791 } or croak "Error parsing $_: $@\n";
1792 }
1793
1794 my %short_rules;
1795 foreach my $tag (sort keys %rules) {
1796 push @{$short_rules{$rules{$tag}}}, $tag
1797 }
1798
1799 foreach my $tags (values %short_rules) {
1800 $tags = join ',', @$tags;
1801 }
1802 %short_rules = reverse %short_rules;
1803
1804 return \%short_rules;
1805}
1806
1807=head2 inferRulesFromDTD
1808
1809 Dumper(XML::Rules::inferRulesFromDTD( $DTDfile, [$enableExtended]))
1810 Dumper(XML::Rules->inferRulesFromDTD( $DTDfile, [$enableExtended]))
1811 Dumper($parser->inferRulesFromDTD( $DTDfile, [$enableExtended]))
1812
1813The subroutine parses the DTD and infers the rules that would produce the minimal, but complete datastructure.
1814It finds out what tags may be repeated, whether they contain text content, attributes etc. You may use this
1815each time you are about to parse the XML, once and store the generated rules in your script or even take this
1816as the basis of a more specific set of rules.
1817
1818With the second parameter set to a true value, the tags included in a mixed content will use the "raw extended"
1819or "raw extended array" types instead of just "raw". This makes sure the tag data both stay at the right place in
1820the content and are accessible easily from the parent tag's atrribute hash.
1821
1822This subroutine requires the XML::DTDParser module!
1823
1824=cut
1825
1826sub inferRulesFromDTD {
1827 require XML::DTDParser;
1828
1829 my ($DTDfile, $enable_extended) = @_;
1830
1831 my $DTD = XML::DTDParser::ParseDTDFile($DTDfile);
1832
1833 my $has_mixed = 0;
1834 foreach my $tag (values %$DTD) {
1835 $tag->{is_mixed} = (($tag->{content} and $tag->{children}) ? 1 : 0)
1836 and $has_mixed = 1;
1837 }
1838
1839 my %settings;
1840 foreach my $tagname (keys %$DTD) {
1841 my $tag = $DTD->{$tagname};
1842
1843 my $repeated = ($tag->{option} =~ /^[+*]$/ ? 1 : 0);
1844 my $has_content = $tag->{content};
1845
1846 my $in_mixed = grep {$DTD->{$_}{is_mixed}} @{$tag->{parent}};
1847
1848 if ($in_mixed) {
1849 if ($enable_extended) {
1850 if ($repeated) {
1851 $settings{$tagname} = "raw extended array"
1852 } else {
1853 $settings{$tagname} = "raw extended"
1854 }
1855 } else {
1856 $settings{$tagname} = "raw"
1857 }
1858 } else {
1859 if (exists $DTD->{attributes} or exists $tag->{children}) {
1860 my @ids ;
1861 if (exists $DTD->{attributes}) {
1862 @ids = grep {$DTD->{attributes}{$_}[0] eq 'ID' and $DTD->{attributes}{$_}[0] eq '#REQUIRED'} keys %{$DTD->{attributes}};
1863 }
1864 if (scalar(@ids) == 1) {
1865 if ($has_content) {
1866 $settings{$tagname} = "by $ids[0]"
1867 } else {
1868 $settings{$tagname} = "no content by $ids[0]"
1869 }
1870 } else {
1871 if ($has_content) {
1872 if ($repeated) {
1873 $settings{$tagname} = "as array"
1874 } else {
1875 $settings{$tagname} = "as is"
1876 }
1877 } else {
1878 if ($repeated) {
1879 $settings{$tagname} = "as array no content"
1880 } else {
1881 $settings{$tagname} = "no content"
1882 }
1883 }
1884 }
1885 } elsif ($repeated) {
1886 $settings{$tagname} = "content array"
1887 } else {
1888 $settings{$tagname} = "content array"
1889 }
1890 }
1891 }
1892
1893# use Data::Dumper;
1894# print Dumper(\%settings);
1895
1896 my %compressed;
1897 {
1898 my %tmp;
1899 while (my ($tag, $option) = each %settings) {
1900 push @{$tmp{$option}}, $tag;
1901 }
1902
1903 while (my ($option, $tags) = each %tmp) {
1904 $compressed{join ',', @$tags} = $option
1905 }
1906 }
1907
1908 if ($has_mixed) {
1909 $compressed{"#stripspaces"} = 0;
1910 } else {
1911 $compressed{"#stripspaces"} = 7;
1912 }
1913
1914 return \%compressed;
1915}
1916
1917=head1 Properties
1918
1919=head2 parameters
1920
1921You can pass a parameter (scalar or reference) to the parse...() or filter...() methods, this parameter
1922is later available to the rules as $parser->{parameters}. The module will never use this parameter for
1923any other purpose so you are free to use it for any purposes provided that you expect it to be reset by
1924each call to parse...() or filter...() first to the passed value and then, after the parsing is complete, to undef.
1925
1926=head2 pad
1927
1928The $parser->{pad} key is specificaly reserved by the module as a place where the module users can
1929store their data. The module doesn't and will not use this key in any way, doesn't set or reset it under any
1930circumstances. If you need to share some data between the rules and do not want to use the structure built
1931by applying the rules you are free to use this key.
1932
1933You should refrain from modifying or accessing other properties of the XML::Rules object!
1934
1935=head1 Namespace support
1936
1937By default the module doesn't handle namespaces in any way, it doesn't check for
1938xmlns or xmlns:alias attributes and it doesn't strip or mangle the namespace aliases
1939in tag or attribute names. This means that if you know for sure what namespace
1940aliases will be used you can set up rules for tags including the aliases and unless
1941someone decides to use a different alias or makes use of the default namespace
1942change your script will work.
1943
1944If you do specify any namespace to alias mapping in the constructor it does
1945start processing the namespace stuff. The xmlns and xmlns:alias attributes
1946are stripped from the datastructures and the aliases are transformed from
1947whatever the XML author decided to use to whatever your namespace mapping
1948specifies. Aliases are also added to all tags that belong to a default namespace.
1949
1950Assuming the constructor parameters contain
1951
1952 namespaces => {
1953 'http://my.namespaces.com/foo' => 'foo',
1954 'http://my.namespaces.com/bar' => 'bar',
1955 }
1956
1957and the XML looks like this:
1958
1959 <root>
1960 <Foo xmlns="http://my.namespaces.com/foo">
1961 <subFoo>Hello world</subfoo>
1962 </Foo>
1963 <other xmlns:b="http://my.namespaces.com/bar">
1964 <b:pub>
1965 <b:name>NaRuzku</b:name>
1966 <b:address>at any crossroads</b:address>
1967 <b:desc>Fakt <b>desnej</b> pajzl.</b:desc>
1968 </b:pub>
1969 </other>
1970 </root>
1971
1972then the rules wil be called as if the XML looked like this:
1973
1974 <root>
1975 <foo:Foo>
1976 <foo:subFoo>Hello world</foo:subfoo>
1977 </foo:Foo>
1978 <other>
1979 <bar:pub>
1980 <bar:name>NaRuzku</bar:name>
1981 <bar:address>at any crossroads</bar:address>
1982 <bar:desc>Fakt <b>desnej</b> pajzl.</bar:desc>
1983 </bar:pub>
1984 </other>
1985 </root>
1986
1987
1988This means that the namespace handling will only normalize the aliases used.
1989
1990It is possible to specify an empty alias, so eg. in case you are processing a SOAP XML
1991and know the tags defined by SOAP do not colide with the tags in the enclosed XML
1992you may simplify the parsing by removing all namespace aliases.
1993
1994If the XML references a namespace not present in the map you will get a warning
1995and the alias used for that namespace will be left intact! If you do not want to get the warnings specify
1996a namespace with URI "*".
1997
1998
1999=head1 HOW TO USE
2000
2001You may view the module either as a XML::Simple on steriods and use it to build a data structure
2002similar to the one produced by XML::Simple with the added benefit of being able
2003to specify what tags or attributes to ignore, when to take just the content, what to store as an array etc.
2004
2005Or you could view it as yet another event based XML parser that differs from all the others only in one thing.
2006It stores the data for you so that you do not have to use globals or closures and wonder where to attach
2007the snippet of data you just received onto the structure you are building.
2008
2009You can use it in a way similar to XML::Twig with simplify(), specify the rules to transform the lower
2010level tags into a XML::Simple like (simplify()ed) structure and then handle the structure in the rule for
2011the tag(s) you'd specify in XML::Twig's twig_roots.
2012
2013=head1 Unrelated tricks
2014
2015If you need to parse a XML file without the top-most tag (something that each and any sane person would allow,
2016but the XML comitee did not), you can parse
2017
2018 <!DOCTYPE doc [<!ENTITY real_doc SYSTEM "$the_file_name">]><doc>&real_doc;</doc>
2019
2020instead.
2021
2022=head1 AUTHOR
2023
2024Jan Krynicky, C<< <Jenda at CPAN.org> >>
2025
2026=head1 BUGS
2027
2028Please report any bugs or feature requests to
2029C<bug-xml-rules at rt.cpan.org>, or through the web interface at
2030L<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=XML-Rules>.
2031I will be notified, and then you'll automatically be notified of progress on
2032your bug as I make changes.
2033
2034=head1 SUPPORT
2035
2036You can find documentation for this module with the perldoc command.
2037
2038 perldoc XML::Rules
2039
2040You can also look for information at:
2041
2042=over 4
2043
2044=item * AnnoCPAN: Annotated CPAN documentation
2045
2046L<http://annocpan.org/dist/XML-Rules>
2047
2048=item * CPAN Ratings
2049
2050L<http://cpanratings.perl.org/d/XML-Rules>
2051
2052=item * RT: CPAN's request tracker
2053
2054L<http://rt.cpan.org/NoAuth/Bugs.html?Dist=XML-Rules>
2055
2056=item * Search CPAN
2057
2058L<http://search.cpan.org/dist/XML-Rules>
2059
2060=item * PerlMonks
2061
2062Please see L<http://www.perlmonks.org/?node_id=581313> or
2063L<http://www.perlmonks.org/?node=XML::Rules> for discussion.
2064
2065=back
2066
2067=head1 SEE ALSO
2068
2069L<XML::Twig>, L<XML::LibXML>, L<XML::Pastor>
2070
2071=head1 ACKNOWLEDGEMENTS
2072
2073The escape_value() method is taken with minor changes from XML::Simple.
2074
2075=head1 COPYRIGHT & LICENSE
2076
2077Copyright 2006-2007 Jan Krynicky, all rights reserved.
2078
2079This program is free software; you can redistribute it and/or modify it
2080under the same terms as Perl itself.
2081
2082=cut
2083
20841; # End of XML::Rules
Note: See TracBrowser for help on using the repository browser.