Context Navigation

← Previous Revision
Latest Revision
Next Revision →
Blame
Revision Log

source: gsdl/trunk/perllib/cpan/XML/Rules.pm@ 19051

Last change on this file since 19051 was 19051, checked in by davidb, 15 years ago
Rule based way to parse XML files. Built on top of XML::Parser. For certain types of tasks provides more convenient/concise way to express things.
File size: 73.2 KB

Line
1	package XML::Rules;
2
3	use warnings;
4	no warnings qw(uninitialized);
5	use strict;
6	use Carp;
7
8	require Exporter;
9	our @ISA = qw(Exporter);
10	our @EXPORT_OK = qw(paths2rules);
11
12	use XML::Parser::Expat;
13
14	=head1 NAME
15
16	XML::Rules - parse XML and specify what and how to keep/process for individual tags
17
18	=head1 VERSION
19
20	Version 1.05
21
22	=cut
23
24	our $VERSION = '1.05';
25
26	=head1 SYNOPSIS
27
28	use XML::Rules;
29
30	$xml = <<'END';
31	<doc>
32	<person>
33	<fname>...</fname>
34	<lname>...</lname>
35	<email>...</email>
36	<address>
37	<street>...</street>
38	<city>...</city>
39	<country>...</country>
40	<bogus>...</bogus>
41	</address>
42	<phones>
43	<phone type="home">123-456-7890</phone>
44	<phone type="office">663-486-7890</phone>
45	<phone type="fax">663-486-7000</phone>
46	</phones>
47	</person>
48	<person>
49	<fname>...</fname>
50	<lname>...</lname>
51	<email>...</email>
52	<address>
53	<street>...</street>
54	<city>...</city>
55	<country>...</country>
56	<bogus>...</bogus>
57	</address>
58	<phones>
59	<phone type="office">663-486-7891</phone>
60	</phones>
61	</person>
62	</doc>
63	END
64
65	@rules = (
66	_default => sub {$_[0] => $_[1]->{_content}},
67	# by default I'm only interested in the content of the tag, not the attributes
68	bogus => undef,
69	# let's ignore this tag and all inner ones as well
70	address => sub {address => "$_[1]->{street}, $_[1]->{city} ($_[1]->{country})"},
71	# merge the address into a single string
72	phone => sub {$_[1]->{type} => $_[1]->{_content}},
73	# let's use the "type" attribute as the key and the content as the value
74	phones => sub {delete $_[1]->{_content}; %{$_[1]}},
75	# remove the text content and pass along the type => content from the child nodes
76	person => sub { # lets print the values, all the data is readily available in the attributes
77	print "$_[1]->{lname}, $_[1]->{fname} <$_[1]->{email}>\n";
78	print "Home phone: $_[1]->{home}\n" if $_[1]->{home};
79	print "Office phone: $_[1]->{office}\n" if $_[1]->{office};
80	print "Fax: $_[1]->{fax}\n" if $_[1]->{fax};
81	print "$_[1]->{address}\n\n";
82	return; # the <person> tag is processed, no need to remember what it contained
83	},
84	);
85	$parser = XML::Rules->new(rules => \@rules);
86	$parser->parse( $xml);
87
88	=head1 INTRODUCTION
89
90	There are several ways to extract data from XML. One that's often used is to read the whole file and transform it into a huge maze of objects and then write code like
91
92	foreach my $obj ($XML->forTheLifeOfMyMotherGiveMeTheFirstChildNamed("Peter")->pleaseBeSoKindAndGiveMeAllChildrenNamedSomethingLike("Jane")) {
93	my $obj2 = $obj->sorryToKeepBotheringButINeedTheChildNamed("Theophile");
94	my $birth = $obj2->whatsTheValueOfAttribute("BirthDate");
95	print "Theophile was bort at $birth\n";
96	}
97
98	I'm exagerating of course, but you probably know what I mean. You can of course shorten the path and call just one method ... that is if you spend the time to learn one more "cool" thing starting with X. XPath.
99
100	You can also use XML::Simple and generate an almost equaly huge maze of hashes and arrays ... which may make the code more or less complex. In either case you need to have enough memory
101	to store all that data, even if you only need a piece here and there.
102
103	Another way to parse the XML is to create some subroutines that handle the start and end tags and the text and whatever else may appear in the XML. Some modules will let you specify just one for start tag, one for text and one for end tag, others will let you install different handlers for different tags. The catch is that you have to build your data structures yourself, you have to know where you are, what tag is just open and what is the parent and it's parent etc. so that you could add the attributes and especially the text to the right place. And the handlers have to do everything as their side effect. Does anyone remember what do they say about side efects? They make the code hard to debug, they tend to change the code into a maze of interdependent snippets of code.
104
105	So what's the difference in the way XML::Rules works? At the first glance, not much. You can also specify subroutines to be called for the tags encountered while parsing the XML, just like the other even based XML parsers. The difference is that you do not have to rely on sideeffects if all you want is to store the value of a tag. You simply return whatever you need from the current tag and the module will add it at the right place in the data structure it builds and will provide it to the handlers for the parent tag. And if the parent tag does return that data again it will be passed to its parent and so forth. Until we get to the level at which it's convenient to handle all the data we accumulated from the twig.
106
107	Do we want to keep just the content and access it in the parent tag handler under a specific name?
108
109	foo => sub {return 'foo' => $_[1]->{_content}}
110
111	Do we want to ornament the content a bit and add it to the parent tag's content?
112
113	u => sub {return '_' . $_[1]->{_content} . '_'}
114	strong => sub {return '' . $_[1]->{_content} . ''}
115	uc => sub {return uc($_[1]->{_content})}
116
117	Do we want to merge the attributes into a string and access the string from the parent tag under a specified name?
118
119	address => sub {return 'Address' => "Street: $_[1]->{street} $_[1]->{bldngNo}\nCity: $_[1]->{city}\nCountry: $_[1]->{country}\nPostal code: $_[1]->{zip}"}
120
121	and in this case the $_[1]->{street} may either be an attribute of the <address> tag or it may be ther result of the handler (rule)
122
123	street => sub {return 'street' => $_[1]->{_content}}
124
125	and thus come from a child tag <street>. You may also use the rules to convert codes to values
126
127	our %states = (
128	AL => 'Alabama',
129	AK => 'Alaska',
130	...
131	);
132	...
133	state => sub {return 'state' => $states{$_[1]->{_content}}; }
134
135	or
136
137	address => sub {
138	if (exists $_[1]->{id}) {
139	$sthFetchAddress->execute($_[1]->{id});
140	my $addr = $sthFetchAddress->fetchrow_hashref();
141	$sthFetchAddress->finish();
142	return 'address' => $addr;
143	} else {
144	return 'address' => $_[1];
145	}
146	}
147
148	so that you do not have to care whether there was
149
150	<address id="147"/>
151
152	or
153
154	<address><street>Larry Wall's St.</street><streetno>478</streetno><city>Core</city><country>The Programming Republic of Perl</country></address>
155
156	At each level in the tree structure serialized as XML you can decide what to keep, what to throw away, what to transform and then just return the stuff you care about and it will be available to the handler at the next level.
157
158	=head1 CONSTRUCTOR
159
160	my $parser = XML::Rules->new(
161	rules => \@rules,
162	[ start_rules => \@start_rules, ]
163	[ stripspaces => 0 / 1 / 2 / 3 + 0 / 4 + 0 / 8, ]
164	[ normalisespaces => 0 / 1, ]
165	[ style => 'parser' / 'filter', ]
166	[ ident => ' ', [reformat_all => 0 / 1] ],
167	[ encode => 'encoding specification', ]
168	[ output_encoding => 'encoding specification', ]
169	[ namespaces => \%namespace2alias_mapping, ]
170	# and optionaly parameters passed to XML::Parser::Expat
171	);
172
173	Options passed to XML::Parser::Expat : ProtocolEncoding Namespaces NoExpand Stream_Delimiter ErrorContext ParseParamEnt Base
174
175	The "stripspaces" controls the handling of whitespace. Please see the C<Whitespace handling> bellow.
176
177	The "style" specifies whether you want to build a parser used to extract stuff from the XML or filter/modify the XML. If you specify
178	style => 'filter' then all tags for which you do not specify a subroutine rule or that occure inside such a tag are copied to the output filehandle
179	passed to the ->filter() or ->filterfile() methods.
180
181	The "ident" specifies what character(s) to use to ident the tags when filtering, by default the tags are not formatted in any way. If the
182	"reformat_all" is not set then this affects only the tags that have a rule and their subtags. And in case of subtags only those that were
183	added into the attribute hash by their rules, not those left in the _content array!
184
185	The "warnoverwrite" instructs XML::Rules to issue a warning whenever the rule cause a key in a tag's hash to be overwritten by new
186	data produced by the rule of a subtag. This happens eg. if a tag is repeated and its rule doesn't expect it.
187
188	The "encode" allows you to ask the module to run all data through Encode::encode( 'encoding_specification', ...)
189	before being passed to the rules. Otherwise all data comes as UTF8.
190
191	The "output_encoding" on the other hand specifies in what encoding is the resulting data going to be, the default is again UTF8.
192	This means that if you specify
193
194	encode => 'windows-1250',
195	output_encoding => 'utf8',
196
197	and the XML is in ISO-8859-2 (Latin2) then the filter will 1) convert the content and attributes of the tags you are not interested in from Latin2
198	directly to utf8 and output and 2) convert the content and attributes of the tags you want to process from Latin2 to Windows-1250, let you mangle
199	the data and then convert the results to utf8 for the output.
200
201	The C<encode> and C<output_enconding> affects also the C<$parser->toXML(...)>, if they are different then the data are converted from
202	one encoding to the other.
203
204	=head2 The Rules
205
206	The rules option may be either an arrayref or a hashref, the module doesn't care, but if you want to use regexps to specify the groups of tags to be handled
207	by the same rule you should use the array ref. The rules array/hash is made of pairs in form
208
209	tagspecification => action
210
211	where the tagspecification may be either a name of a tag, a string containing comma or pipe ( "\|" ) delimited list of tag names
212	or a string containing a regexp enclosed in // with optional parameters or a qr// compiled regular expressions. The tag names and tag name lists
213	take precedence to the regexps, the regexps are (in case of arrayref only!!!) tested in the order in which they are specified.
214
215	These rules are evaluated/executed whenever a tag if fully parsed including all the content and child tags and they may access the content and attributes of the
216	specified tag plus the stuff produced by the rules evaluated for the child tags.
217
218	The action may be either
219
220	an undef or empty string = ignore the tag and all its children
221	a subroutine reference = the subroutine will be called to handle the tag data&contents
222	sub { my ($tagname, $attrHash, $contexArray, $parentDataArray, $parser) = @_; ...}
223	'content' = only the content of the tag is preserved and added to
224	the parent tag's hash as an attribute named after the tag
225	sub { $_[0] => $_[1]->{_content}}
226	'content trim' = only the content of the tag is preserved, trimmed and added to
227	the parent tag's hash as an attribute named after the tag
228	sub { s/^\s+//,s/\s+$// for ($_[1]->{_content}); $_[0] => $_[1]->{_content}}
229	'content array' = only the content of the tag is preserved and pushed
230	to the array pointed to by the attribute
231	sub { '@' . $_[0] => $_[1]->{_content}}
232	'as is' = the tag's hash is added to the parent tag's hash
233	as an attribute named after the tag
234	sub { $_[0] => $_[1]}
235	'as is trim' = the tag's hash is added to the parent tag's hash
236	as an attribute named after the tag, the content is trimmed
237	sub { $_[0] => $_[1]}
238	'as array' = the tag's hash is pushed to the attribute named after the tag
239	in the parent tag's hash
240	sub { '@'.$_[0] => $_[1]}
241	'as array trim' = the tag's hash is pushed to the attribute named after the tag
242	in the parent tag's hash, the content is trimmed
243	sub { '@'.$_[0] => $_[1]}
244	'no content' = the _content is removed from the tag's hash and the hash
245	is added to the parent's hash into the attribute named after the tag
246	sub { delete $_[1]->{_content}; $_[0] => $_[1]}
247	'no content array' = similar to 'no content' except the hash is pushed
248	into the array referenced by the attribute
249	'as array no content' = same as 'no content array'
250	'pass' = the tag's hash is dissolved into the parent's hash,
251	that is all tag's attributes become the parent's attributes.
252	The _content is appended to the parent's _content.
253	sub { %{$_[0]}}
254	'pass no content' = the _content is removed and the hash is dissolved
255	into the parent's hash.
256	sub { delete $_[1]->{_content}; %{$_[0]}}
257	'pass without content' = same as 'pass no content'
258	'raw' = the [tagname => attrs] is pushed to the parent tag's _content.
259	You would use this style if you wanted to be able to print
260	the parent tag as XML preserving the whitespace or other textual content
261	sub { [$_[0] => $_[1]]}
262	'raw extended' = the [tagname => attrs] is pushed to the parent tag's _content
263	and the attrs are added to the parent's attribute hash with ":$tagname" as the key
264	sub { (':'.$_[0] => $_[1], [$_[0] => $_[1]])};
265	'raw extended array' = the [tagname => attrs] is pushed to the parent tag's _content
266	and the attrs are pushed to the parent's attribute hash with ":$tagname" as the key
267	sub { ('@:'.$_[0] => $_[1], [$_[0] => $_[1]])};
268	'by <attrname>' = uses the value of the specified attribute as the key when adding the
269	attribute hash into the parent tag's hash. You can specify more names, in that case
270	the first found is used.
271	sub {delete($_[1]->{name}) => $_[1]}
272	'content by <attrname>' = uses the value of the specified attribute as the key when adding the
273	tags content into the parent tag's hash. You can specify more names, in that case
274	the first found is used.
275	sub {$_[1]->{name} => $_[1]->{_content}}
276	'no content by <attrname>' = uses the value of the specified attribute as the key when adding the
277	attribute hash into the parent tag's hash. The content is dropped. You can specify more names,
278	in that case the first found is used.
279	sub {delete($_[1]->{_content}); delete($_[1]->{name}) => $_[1]}
280	'==...' = replace the tag by the specified string. That is the string will be added to
281	the parent tag's _content
282	sub { return '...' }
283	'=...' = replace the tag contents by the specified string and forget the attributes.
284	sub { return $_[0] => '...' }
285
286	You may also add " no xmlns" at the end of all those predefined rules to strip the namespace
287	alias from the $_[0] (tag name).
288
289	The subroutines in the rules specification receive five parameters:
290
291	$rule->( $tag_name, \%attrs, \@context, \@parent_data, $parser)
292
293	It's OK to destroy the first two parameters, but you should treat the other three as read only
294	or at least treat them with care!
295
296	$tag_name = string containing the tag name
297	\%attrs = hash containing the attributes of the tag plus the _content key
298	containing the text content of the tag. If it's not a leaf tag it may
299	also contain the data returned by the rules invoked for the child tags.
300	\@context = an array containing the names of the tags enclosing the current
301	one. The parent tag name is the last element of the array.
302	\@parent_data = an array containing the hashes with the attributes
303	and content read&produced for the enclosing tags so far.
304	You may need to access this for example to find out the version
305	of the format specified as an attribute of the root tag. You may
306	safely add, change or delete attributes in the hashes, but all bets
307	are off if you change the number or type of elements of this array!
308	$parser = the parser object.
309
310	The subroutine may decide to handle the data and return nothing or
311	tweak the data as necessary and return just the relevant bits. It may also
312	load more information from elsewhere based on the ids found in the XML
313	and provide it to the rules of the ancestor tags as if it was part of the XML.
314
315	The possible return values of the subroutines are:
316
317	1) nothing or undef or "" - nothing gets added to the parent tag's hash
318
319	2) a single string - if the parent's _content is a string then the one produced by this rule is appended to the parent's _content.
320	If the parent's _content is an array, then the string is push()ed to the array.
321
322	3) a single reference - if the parent's _content is a string then it's changed to an array containing the original string and this reference.
323	If the parent's _content is an array, then the string is push()ed to the array.
324
325	4) an even numbered list - it's a list of key & value pairs to be added to the parent's hash.
326
327	The handling of the attributes may be changed by adding '@', '%', '+', '*' or '.' before the attribute name.
328
329	Without any "sigil" the key & value is added to the hash overwriting any previous values.
330
331	The values for the keys starting with '@' are push()ed to the arrays referenced by the key name
332	without the @. If there already is an attribute of the same name then the value will be preserved and will become
333	the first element in the array.
334
335	The values for the keys starting with '%' have to be either hash or array references. The key&value pairs
336	in the referenced hash or array will be added to the hash referenced by the key. This is nice for rows of tags like this:
337
338	<field name="foo" value="12"/>
339	<field name="bar" value="24"/>
340
341	if you specify the rule as
342
343	field => sub { '%fields' => [$_[1]->{name} => $_[1]->{value}]}
344
345	then the parent tag's has will contain
346
347	fields => {
348	foo => 12,
349	bar => 24,
350	}
351
352	The values for the keys starting with '+' are added to the current value, the ones starting with '.' are
353	appended to the current value and the ones starting with '*' multiply the current value.
354
355	5) an odd numbered list - the last element is appended or push()ed to the parent's _content, the rest is handled as in the previous case.
356
357
358	Since 0.19 it's possible to specify several actions for a tag if you need to do something different based on the path to the tag like this:
359
360	tagname => [
361	'tag/path' => action,
362	'/root/tag/path' => action,
363	'/root/*/path' => action,
364	qr{^root/ns:[^/]+/par$} => action,
365	default_action
366	],
367
368	The path is matched against the list of parent tags joined by slashes.
369
370	=head2 The Start Rules
371
372	Apart from the normal rules that get invoked once the tag is fully parsed, including the contents and child tags, you may want to
373	attach some code to the start tag to (optionaly) skip whole branches of XML or set up attributes and variables. You may set up
374	the start rules either in a separate parameter to the constructor or in the rules=> by prepending the tag name(s) by ^.
375
376	These rules are in form
377
378	tagspecification => undef / '' / 'skip' --> skip the element, including child tags
379	tagspecification => 1 / 'handle' --> handle the element, may be needed
380	if you specify the _default rule.
381	tagspecification => \&subroutine
382
383	The subroutines receive the same parameters as for the "end tag" rules except of course the _content, but their return value is treated differently.
384	If the subroutine returns a false value then the whole branch enclosed by the current tag is skipped, no data are stored and no rules are
385	executed. You may modify the hash referenced by $attr.
386
387	Both types of rules are free to store any data they want in $parser->{pad}. This property is NOT emptied
388	after the parsing!
389
390	=head2 Whitespace handling
391
392	There are two options that affect the whitespace handling: stripspaces and normalisespaces. The normalisespaces is a simple flag that controls
393	whether multiple spaces/tabs/newlines are collapsed into a single space or not. The stripspaces is more complex, it's a bit-mask,
394	an ORed combination of the following options:
395
396	0 - don't remove whitespace around tags
397	(around tags means before the opening tag and after the closing tag, not in the tag's content!)
398	1 - remove whitespace before tags whose rules did not return any text content
399	(the rule specified for the tag caused the data of the tag to be ignored,
400	processed them already or added them as attributes to parent's \%attr)
401	2 - remove whitespace around tags whose rules did not return any text content
402	3 - remove whitespace around all tags
403
404	0 - remove only whitespace-only content
405	(that is remove the whitespace around <foo/> in this case "<bar> <foo/> </bar>"
406	but not this one "<bar>blah <foo/> blah</bar>")
407	4 - remove trailing/leading whitespace
408	(remove the whitespace in both cases above)
409
410	0 - don't trim content
411	8 - do trim content
412	(That is for "<foo> blah </foo>" only pass to the rule {_content => 'blah'})
413
414
415	That is if you have a data oriented XML in which each tag contains either text content or subtags, but not both,
416	you want to use stripspaces => 3 or stripspaces => 3\|4. This will not only make sure you don't need to bother
417	with the whitespace-only _content of the tags with subtags, but will also make sure you do not keep on wasting
418	memory while parsing a huge XML and processing the "twigs". Without that option the parent tag of
419	the repeated tag would keep on accumulating unneeded whitespace in its _content.
420
421	=cut
422
423	sub new {
424	my $class = shift;
425	my %params = @_;
426	croak "Please specify the rules=> for the parser!" unless $params{rules} and ref($params{rules});
427
428	my $self = {rules => {}, start_rules => {}};
429	bless $self, $class;
430
431	my @rules = (ref($params{rules}) eq 'HASH' ? %{$params{rules}} : @{$params{rules}}); # dereference and copy
432	delete $params{rules};
433
434	my @start_rules;
435	if ($params{start_rules} and ref($params{start_rules})) {
436	@start_rules = ref($params{start_rules}) eq 'HASH' ? %{$params{start_rules}} : @{$params{start_rules}}; # dereference and copy
437	};
438	delete $params{start_rules};
439
440	for (my $i=0; $i <= $#rules; $i+=2) {
441	next unless $rules[$i] =~ s/^\^//;
442	push @start_rules, splice( @rules, $i, 2);
443	$i-=2;
444	}
445
446	$self->_split_rules( \@rules, 'rules', 'as is');
447	$self->_split_rules( \@start_rules, 'start_rules', 'handle');
448
449	$self->{for_parser} = {};
450	{ # extract the params for the XML::Parser::Expat constructor
451	my @for_parser = grep exists($params{$_}), qw(ProtocolEncoding Namespaces NoExpand Stream_Delimiter ErrorContext ParseParamEnt Base);
452	if (@for_parser) {
453	@{$self->{for_parser}}{@for_parser} = @params{@for_parser};
454	delete @params{@for_parser};
455	}
456	}
457
458	$self->{namespaces} = delete($params{namespaces});
459	if (defined($self->{namespaces})) {
460	croak 'XML::Rules->new( ... , namespaces => ...HERE...) must be a hash reference!'
461	unless ref($self->{namespaces}) eq 'HASH';
462	$self->{xmlns_map} = {};
463	}
464
465	$self->{style} = delete($params{style}) \|\| 'parser';
466
467	$self->{opt}{lc $_} = $params{$_} for keys %params;
468
469	delete $self->{opt}{encode} if $self->{opt}{encode} =~ /^utf-?8$/i;
470	delete $self->{opt}{output_encoding} if $self->{opt}{output_encoding} =~ /^utf-?8$/i;
471
472	for (qw(normalisespace normalizespace normalizespaces)) {
473	last if defined($self->{opt}{normalisespaces});
474	$self->{opt}{normalisespaces} = $self->{opt}{$_};
475	delete $self->{opt}{$_};
476	}
477	$self->{opt}{normalisespaces} = 0 unless(defined($self->{opt}{normalisespaces}));
478	$self->{opt}{stripspaces} = 0 unless(defined($self->{opt}{stripspaces}));
479
480	require 'Encode.pm' if $self->{opt}{encode};
481	require 'Encode.pm' if $self->{opt}{output_encoding};
482
483	return $self;
484	}
485
486	sub _split_rules {
487	my ($self, $rules, $type, $default) = @_;
488
489	$self->{$type}{_default} = $default unless exists($self->{$type}{_default});
490
491	while (@$rules) {
492	my ($tag, $code) = (shift(@$rules), shift(@$rules));
493
494	if (ref($code) eq 'ARRAY') {
495	for( my $i = 0; $i < $#$code; $i+=2) {
496	$code->[$i] = _xpath2re($code->[$i]);
497	}
498	push @$code, $self->{$type}{_default} if @$code % 2 == 0; # add the default type if there's even number of items (path => code, path => code)
499	}
500
501	if ($tag =~ m{^/([^/].)/([imosx])$}) { # string with a '/regexp/'
502	if ($2) {
503	push @{$self->{$type.'_re'}}, qr/(?$2)$1/;
504	} else {
505	push @{$self->{$type.'_re'}}, qr/$1/;
506	}
507	push @{$self->{$type.'_re_code'}}, $code;
508	} elsif (ref($tag) eq 'Regexp') { # a qr// created regexp
509	push @{$self->{$type.'_re'}}, $tag;
510	push @{$self->{$type.'_re_code'}}, $code;
511	} elsif ($tag =~ /[,\\|]/) { # a , or \| separated list
512	if ($tag =~ s/^\^//) {
513	my @tags = split(/\s[,\\|]\s/, $tag);
514	$self->{$type}{'^'.$_} = $code for (@tags);
515	} else {
516	my @tags = split(/\s[,\\|]\s/, $tag);
517	$self->{$type}{$_} = $code for (@tags);
518	}
519	} else { # a single tag
520	$self->{$type}{$tag} = $code;
521	}
522	}
523	}
524
525	sub _xpath2re {
526	my $s = shift;
527	return $s if ref($s);
528	for ($s) {
529	s/([\.\[\]+{}\-])/\\$1/g;
530	s{\*}{.+}g;
531	s{^//}{}s;
532	s{^/}{^}s;
533	}
534	return qr{$s$};
535	}
536
537	sub skip_rest {
538	die "[XML::Rules] skip rest\n";
539	}
540
541	sub return_nothing {
542	die "[XML::Rules] return nothing\n";
543	}
544
545	sub _run {
546	my $self = shift;
547	my $string = shift;
548	$self->{parameters} = shift;
549
550	$self->{parser} = XML::Parser::Expat->new( %{$self->{for_parser}});
551
552	$self->{parser}->setHandlers(
553	Start => _Start($self),
554	End => _End($self),
555	Char => _Char($self),
556	XMLDecl => _XMLDecl($self),
557	);
558
559	$self->{data} = [];
560	$self->{context} = [];
561	$self->{_ltrim} = [0];
562
563	if (! eval {
564	$self->{parser}->parse($string) and 1;
565	}) {
566	my $err = $@;
567	undef $@;
568
569	if ($err =~ /^\[XML::Rules\] skip rest/) {
570	my (undef, $handler) = $self->{parser}->setHandlers(End => undef);
571	foreach my $tag (reverse @{$self->{context} = []}) {
572	$handler->( $self->{parser}, $tag);
573	}
574	} else {
575
576	delete $self->{normal_handlers};
577	delete $self->{ignore_handlers};
578	delete $self->{parameters};
579	$self->{parser}->release();
580
581	$self->{data} = [];
582	$self->{context} = [];
583
584	if ($err =~ /^\[XML::Rules\] return nothing/) {
585	return;
586	}
587
588	$err =~ s/at \S+Rules\.pm line \d+$//
589	and croak $err or die $err;
590	}
591	};
592
593	delete $self->{normal_handlers};
594	delete $self->{ignore_handlers};
595	$self->{parser}->release();
596
597	delete $self->{parameters};
598	my $data; # return the accumulated data, without keeping a copy inside the object
599	($data, $self->{data}) = ($self->{data}[0], undef);
600	if (!defined(wantarray()) or ! keys(%$data)) {
601	return;
602
603	} elsif (keys(%$data) == 1 and exists(${$data}{_content})) {
604	if (ref(${$data}{_content}) eq 'ARRAY' and @{${$data}{_content}} == 1) {
605	return ${${$data}{_content}}[0]
606	} else {
607	return ${$data}{_content}
608	}
609
610	} else {
611	return $data;
612	}
613	}
614
615
616	sub parsestring;
617	*parsestring = \&parse;
618	sub parse {
619	my $self = shift;
620	croak("This XML::Rules object may only be used as a filter!") if ($self->{style} eq 'filter');
621	$self->_run(@_);
622	}
623
624	sub parsefile {
625	my $self = shift;
626	croak("This XML::Rules object may only be used as a filter!") if ($self->{style} eq 'filter');
627	my $filename = shift;
628	open my $IN, '<', $filename or croak "Cannot open '$filename' for reading: $^E";
629	return $self->_run($IN, @_);
630	}
631
632
633	sub filterstring;
634	*filterstring = \&filter;
635	sub filter {
636	my $self = shift;
637	croak("This XML::Rules object may only be used as a parser!") unless ($self->{style} eq 'filter');
638
639	my $XML = shift;
640	$self->{FH} = shift \|\| select(); # either passed or the selected filehandle
641	if (!ref($self->{FH})) {
642	if ($self->{FH} =~ /^main::(?:STDOUT\|STDERR)$/) {
643	# yeah, select sometimes returns the name of the filehandle, not the filehandle itself. eg. "main::STDOUT"
644	no strict;
645	$self->{FH} = \*{$self->{FH}};
646	} else {
647	open my $FH, '>', $self->{FH} or croak(qq{Failed to open "$self->{FH}" for writing: $^E});
648	$self->{FH} = $FH;
649	}
650	} elsif (ref($self->{FH}) eq 'SCALAR') {
651	open my $FH, '>', $self->{FH};
652	$self->{FH} = $FH;
653	}
654	if ($self->{opt}{output_encoding}) {
655	print {$self->{FH}} qq{<?xml version="1.0" encoding="$self->{opt}{output_encoding}"?>\n};
656	} else {
657	print {$self->{FH}} qq{<?xml version="1.0"?>\n};
658	}
659	$self->_run($XML, @_);
660	print {$self->{FH}} "\n";
661	delete $self->{FH};
662	}
663
664	sub filterfile {
665	my $self = shift;
666	croak("This XML::Rules object may only be used as a parser!") unless ($self->{style} eq 'filter');
667
668	my $filename = shift;
669	open my $IN, '<', $filename or croak "Cannot open '$filename' for reading: $^E";
670
671	$self->{FH} = shift \|\| select(); # either passed or the selected filehandle
672	if (!ref($self->{FH})) {
673	if ($self->{FH} =~ /^main::(?:STDOUT\|STDERR)$/) {
674	# yeah, select sometimes returns the name of the filehandle, not the filehandle itself. eg. "main::STDOUT"
675	no strict;
676	$self->{FH} = \*{$self->{FH}};
677	} else {
678	open my $FH, '>', $self->{FH} or croak(qq{Failed to open "$self->{FH}" for writing: $^E});
679	$self->{FH} = $FH;
680	}
681	} elsif (ref($self->{FH}) eq 'SCALAR') {
682	open $self->{FH}, '>', $self->{FH};
683	}
684	if ($self->{opt}{output_encoding}) {
685	print {$self->{FH}} qq{<?xml version="1.0" encoding="$self->{opt}{output_encoding}"?>\n};
686	} else {
687	print {$self->{FH}} qq{<?xml version="1.0"?>\n};
688	}
689	$self->_run($IN, @_);
690	print {$self->{FH}} "\n";
691	delete $self->{FH};
692	}
693
694	sub _XMLDecl {
695	my $self = shift;
696	return sub {
697	my ( $Parser, $Version, $Encoding, $Standalone) = @_;
698	$self->{opt}{original_encoding} = $Encoding
699	}
700	}
701
702	=begin comment
703
704	start tag
705	& 3 = 3 -> rtrim parent's _content
706	& 8 = 8 -> $ltrim = 1
707
708	string content
709	$ltrim -> ltrim the string, if not completely whitespace set $ltrim 0
710
711	end tag
712	& 8 = 8 -> rtrim own content
713	& 3 = 3 -> $ltrim = 1
714	empty_returned_content and & 3 in (1,2) -> rtrim parent content
715	empty_returned_content and & 3 = 2 -> $ltrim
716
717	=end comment
718
719	=cut
720
721	sub _rtrim {
722	my ($self, $attr, $more) = @_;
723
724	if ($more) {
725	if (ref $attr->{_content}) {
726	if (!ref($attr->{_content}[-1])) {
727	$attr->{_content}[-1] =~ s/\s+$//s;
728	pop @{$attr->{_content}} if $attr->{_content}[-1] eq '';
729	delete $attr->{_content} unless @{$attr->{_content}};
730	}
731	} else {
732	$attr->{_content} =~ s/\s+$//s;
733	delete $attr->{_content} if $attr->{_content} eq '';
734	}
735	} else {
736	if (ref $attr->{_content}) {
737	if (!ref($attr->{_content}[-1]) and $attr->{_content}[-1] =~ /^\s*$/s) {
738	pop @{$attr->{_content}} ;
739	delete $attr->{_content} unless @{$attr->{_content}};
740	}
741	} else {
742	delete $attr->{_content} if $attr->{_content} =~ /^\s*$/s;
743	}
744	}
745	}
746
747	sub _Start {
748	my $self = shift;
749	my $encode = $self->{opt}{encode};
750	my $output_encoding = $self->{opt}{output_encoding};
751	return sub {
752	my ( $Parser, $Element , %Attr) = @_;
753
754	if (($self->{opt}{stripspaces} & 3) == 3) {
755	#rtrim parent
756	#print "rtrim parent content in _Start\n";
757	if ($self->{data}[-1] and $self->{data}[-1]{_content}) {
758	$self->_rtrim( $self->{data}[-1], ($self->{opt}{stripspaces} & 4));
759	}
760	}
761	if ($self->{opt}{stripspaces} & 8) {
762	#print "ltrim own content in _Start\n";
763	push @{$self->{_ltrim}}, 2;
764	} else {
765	push @{$self->{_ltrim}}, 0;
766	}
767
768	if ($self->{namespaces}) {
769	my %restore;
770	if (exists $Attr{xmlns}) { # find the default namespace
771	#print "Found a xmlns attribute in $Element!\n";
772	$restore{''} = $self->{xmlns_map}{''};
773	if (!exists($self->{namespaces}{ $Attr{xmlns} })) {
774	warn qq{Unexpected namespace "$Attr{xmlns}" found in the XML!\n} unless exists $self->{namespaces}{'*'};
775	delete $self->{xmlns_map}{''};
776	delete $restore{''} unless defined($restore{''});
777	delete($Attr{xmlns});
778	} else {
779	$self->{xmlns_map}{''} = $self->{namespaces}{ delete($Attr{xmlns}) };
780	}
781	}
782	foreach my $attr (keys %Attr) { # find the namespace aliases
783	next unless $attr =~ /^xmlns:(.*)$/;
784	$restore{$1} = $self->{xmlns_map}{$1};
785	if (!exists($self->{namespaces}{ $Attr{$attr} })) {
786	warn qq{Unexpected namespace "$Attr{$attr}" found in the XML!\n} unless exists $self->{namespaces}{'*'};
787	delete $self->{xmlns_map}{$1};
788	delete $restore{$1} unless defined($restore{$1});
789	delete($Attr{$attr});
790	} else {
791	$self->{xmlns_map}{$1} = $self->{namespaces}{ delete($Attr{$attr}) };
792	}
793	}
794	if (%restore) {
795	push @{$self->{xmlns_restore}}, \%restore;
796	} else {
797	push @{$self->{xmlns_restore}}, undef;
798	}
799
800	if (%{$self->{xmlns_map}}) {
801	#print "About to map aliases for $Element\n";
802	# add or map the alias for the tag
803	if ($Element =~ /^([^:]+):(.*)$/) {
804	#print "Mapping an alias $1 for tag $Element\n";
805	if (exists($self->{xmlns_map}{$1})) {
806	if ($self->{xmlns_map}{$1} eq '') {
807	$Element = $2 ;
808	} else {
809	$Element = $self->{xmlns_map}{$1} . ':' . $2 ;
810	}
811	}
812	#print " -> $Element\n";
813	} elsif (defined($self->{xmlns_map}{''}) and $self->{xmlns_map}{''} ne '') { # no namespace alias in the tag and there's a default
814	#print "Adding default alias $self->{xmlns_map}{''}:\n";
815	$Element = $self->{xmlns_map}{''} . ':' . $Element;
816	#print " -> $Element\n";
817	}
818
819	# map the aliases for the attributes
820	foreach my $attr (keys %Attr) {
821	next unless $attr =~ /^([^:]+):(.*)$/; # there's an alias
822	next unless exists($self->{xmlns_map}{$1}); # and there's a mapping
823	if ($self->{xmlns_map}{$1} eq '') {
824	$Attr{$2} = delete($Attr{$attr}); # rename the attribute
825	} else {
826	$Attr{$self->{xmlns_map}{$1} . ':' . $2} = delete($Attr{$attr}); # rename the attribute
827	}
828	}
829	}
830	} # /of namespace handling
831
832
833	my ( $start_rule, $end_rule) = map {
834	if ($self->{$_}{$Element} and ref($self->{$_}{$Element}) ne 'ARRAY') {
835	$self->{$_}{$Element}
836	} else {
837	$self->_find_rule( $_, $Element, $self->{context})
838	}
839	} ( 'start_rules', 'rules');
840
841	if ($start_rule ne 'handle'
842	and (
843	!$start_rule
844	or $start_rule eq 'skip'
845	or !$start_rule->($Element,\%Attr, $self->{context}, $self->{data}, $self)
846	)
847	) {
848	# ignore the tag and the ones below
849	if (exists $self->{ignore_handlers}) {
850	#print "SETHANDLERS!\n";
851	$Parser->setHandlers(@{$self->{ignore_handlers}})
852	} else {
853	$self->{ignore_handlers} = [
854	Start => _StartIgnore($self),
855	Char => undef,
856	End => _EndIgnore($self),
857	];
858	#print "SETHANDLERS2\n";
859	$self->{normal_handlers} = [$Parser->setHandlers(@{$self->{ignore_handlers}})];
860	}
861	$self->{ignore_level}=1;
862
863	} else {
864	# process the tag and the ones below
865	if ($encode) {
866	foreach my $value (values %Attr) {
867	$value = Encode::encode( $encode, $value);
868	}
869	}
870
871	push @{$self->{context}}, $Element;
872	push @{$self->{data}}, \%Attr;
873	$self->{lastempty} = 0;
874
875	if ($self->{style} eq 'filter') {
876	$self->{in_interesting}++ if ref($end_rule) or $end_rule =~ /^=/s; # is this tag interesting?
877
878	if (! $self->{in_interesting}) { # it neither this tag not an acestor is interesting, just copy the tag
879	#print "Start:R ".$Parser->recognized_string()."\n";
880	#print "Start:O ".$Parser->original_string()."\n";
881	#print "Start:R ".$Parser->recognized_string()."\n";
882	#print "Start:O ".$Parser->original_string()."\n";
883	if (! $output_encoding) {
884	print {$self->{FH}} $Parser->recognized_string();
885	} elsif ($output_encoding eq $self->{opt}{original_encoding}) {
886	print {$self->{FH}} $Parser->original_string();
887	} else {
888	print {$self->{FH}} $self->toXML($Element, \%Attr, "don't close");
889	}
890	}
891	}
892
893	}
894	}
895	}
896
897	sub _find_rule {
898	my ($self, $type, $Element, $path) = @_;
899
900	if (exists($self->{$type.'_re'})) {
901	for(my $i = 0; $i < @{$self->{$type.'_re'}}; $i++) {
902	if ($Element =~ $self->{$type.'_re'}[$i]) {
903	$self->{$type}{$Element} = $self->{$type.'_re_code'}[$i];
904	last;
905	}
906	}
907	}
908	if (! exists $self->{$type}{$Element}) {
909	$self->{$type}{$Element} = $self->{$type}{_default};
910	}
911
912	if (ref $self->{$type}{$Element} eq 'ARRAY') {
913	$path = join( '/', @$path);
914	for(my $i=0; $i < $#{$self->{$type}{$Element}}; $i+=2) {
915	if ($path =~ $self->{$type}{$Element}[$i]) {
916	return $self->{$type}{$Element}[$i+1];
917	}
918	}
919	return $self->{$type}{$Element}[-1];
920	} else {
921	return $self->{$type}{$Element};
922	}
923	}
924
925	sub _Char {
926	my $self = shift;
927	my $encode = $self->{opt}{encode};
928	return sub {
929	my ( $Parser, $String) = @_;
930
931	if ($self->{style} eq 'filter' and ! $self->{in_interesting}) {
932	if (! $self->{opt}{output_encoding}) {
933	print {$self->{FH}} $Parser->recognized_string();
934	} elsif ($self->{opt}{output_encoding} eq $self->{opt}{original_encoding}) {
935	print {$self->{FH}} $Parser->original_string();
936	} else {
937	print {$self->{FH}} encode($self->{opt}{output_encoding}, $Parser->recognized_string());
938	}
939	return;
940	}
941
942	if ($encode) {
943	$String = Encode::encode( $encode, $String);
944	}
945
946	if ($self->{_ltrim}[-1]) {
947	#print "ltrim in $self->{context}[-1] ($String)\n";
948	if ($self->{_ltrim}[-1] == 2) {
949	$String =~ s/^\s+//s;
950	return if $String eq '';
951	} else {
952	return if $String =~ /^\s*$/s;
953	}
954	$self->{_ltrim}[-1] = 0;
955	#print " ($String)\n";
956	}
957	$String =~ s/\s+/ /gs if ($self->{opt}{normalisespaces});
958
959	if (!exists $self->{data}[-1]{_content}) {
960	$self->{data}[-1]{_content} = $String;
961	} elsif (!ref $self->{data}[-1]{_content}) {
962	if ($self->{opt}{normalisespaces} and $self->{data}[-1]{_content} =~ /\s$/ and $String =~ /^\s/) {
963	$String =~ s/^\s+//s;
964	}
965	$self->{data}[-1]{_content} .= $String;
966	} else {
967	if (ref $self->{data}[-1]{_content}[-1]) {
968	push @{$self->{data}[-1]{_content}}, $String;
969	} else {
970	if ($self->{opt}{normalisespaces} and $self->{data}[-1]{_content}[-1] =~ /\s$/ and $String =~ /^\s/) {
971	$String =~ s/^\s+//s;
972	}
973	$self->{data}[-1]{_content}[-1] .= $String;
974	}
975	}
976	}
977	}
978
979	sub _End {
980	my $self = shift;
981	return sub {
982	my ( $Parser, $Element) = @_;
983	$Element = pop @{$self->{context}}; # the element name may have been mangled by XMLNS aliasing
984
985	if ($self->{opt}{stripspaces} & 8) {
986	#print "rtrim own content\n";
987	if ($self->{data}[-1] and $self->{data}[-1]{_content}) {
988	$self->_rtrim( $self->{data}[-1], 1);
989	}
990	}
991	pop(@{$self->{_ltrim}});
992
993	if ($self->{namespaces}) {
994	if (my $restore = pop @{$self->{xmlns_restore}}) { # restore the old default namespace and/or alias mapping
995	while (my ($their, $our) = each %$restore) {
996	if (defined($our)) {
997	$self->{xmlns_map}{$their} = $our;
998	} else {
999	delete $self->{xmlns_map}{$their};
1000	}
1001	}
1002	}
1003	}
1004
1005	my ($rule) = map {
1006	if ($self->{$_}{$Element} and ref($self->{$_}{$Element}) ne 'ARRAY') {
1007	$self->{$_}{$Element}
1008	} else {
1009	$self->_find_rule( $_, $Element, $self->{context})
1010	}
1011	} ('rules');
1012
1013	my $data = pop @{$self->{data}};
1014
1015	my @results;
1016	if (ref $rule or $rule =~ /^=/s) {
1017	if ($rule =~ /^==(.*)$/s) { # change the whole tag to a string
1018	@results = ($1);
1019	} elsif ($rule =~ /^=(.*)$/s) { # change the contents to a string
1020	@results = ($Element => $1);
1021	} else {
1022	@results = $rule->($Element, $data, $self->{context}, $self->{data}, $self);
1023	}
1024
1025	if ($self->{style} eq 'filter') {
1026
1027	$self->{in_interesting}--;
1028	if (!$self->{in_interesting}) {
1029	if (@{$self->{data}}) {
1030	print {$self->{FH}} $self->escape_value($self->{data}[-1]{_content});
1031	delete $self->{data}[-1]{_content};
1032	}
1033	my $base;
1034	if ($self->{opt}{ident} ne '') {
1035	$base = $self->{opt}{ident} x scalar(@{$self->{context}});
1036	}
1037	@results and $results[0] =~ s/^[\@%\+\*\.]//;
1038	while (@results) {
1039	#use Data::Dumper;
1040	#print "\@results=".Dumper(\@results)."\n";
1041	if (ref($results[0])) {
1042	croak(ref($results[0]) . " not supported as the return value of a filter") unless ref($results[0]) eq 'ARRAY';
1043	if (@{$results[0]} ==2 and ref($results[0][1]) eq 'HASH') {
1044	print {$self->{FH}} $self->toXML(@{$results[0]}[0,1], 0, $self->{opt}{ident}, $base);
1045	} else {
1046	foreach my $item (@{$results[0]}) {
1047	if (ref($item)) {
1048	croak(ref($item) . " not supported in the return value of a filter") unless ref($item) eq 'ARRAY';
1049	croak("Empty array not supported in the return value of a filter") unless @$item;
1050	if (@$item <= 2) {
1051	print {$self->{FH}} $self->toXML(@{$item}[0,1], 0, $self->{opt}{ident}, $base);
1052	} else { # we suppose the 3rd and following elements are parameters to ->toXML()
1053	print {$self->{FH}} $self->toXML(@$item);
1054	}
1055	} else {
1056	print {$self->{FH}} $self->escape_value($item);
1057	}
1058	}
1059	}
1060	shift(@results);
1061	} else {
1062	if (@results == 1) {
1063	print {$self->{FH}} $self->escape_value($results[0]);
1064	@results = ();last;
1065	} else {
1066	print {$self->{FH}} $self->toXML(shift(@results), shift(@results), 0, $self->{opt}{ident}, $base);
1067	}
1068	}
1069	}
1070	}
1071	}
1072	} elsif ($self->{style} eq 'filter' and ! $self->{in_interesting}) {
1073	#print "End: \$Element=$Element; \$Parser->recognized_string()=".$Parser->recognized_string()."; \$Parser->original_string()=".$Parser->original_string()."\n";
1074	die "Unexpected \$data->{content}={$data->{_content}} in filter outside interesting nodes!\n" if $data->{_content} ne '';
1075	if (! $self->{opt}{output_encoding}) {
1076	print {$self->{FH}} $Parser->recognized_string();
1077	} elsif ($self->{opt}{output_encoding} eq $self->{opt}{original_encoding}) {
1078	print {$self->{FH}} $Parser->original_string();
1079	} else {
1080	print {$self->{FH}} encode($self->{opt}{output_encoding}, $Parser->recognized_string());
1081	}
1082	# print {$self->{FH}} $self->escape_value($data->{_content})."</$Element>";
1083
1084	} else { # a predefined rule
1085
1086	if ($rule =~ s/(?:^\| )no xmlns$//) {
1087	$Element =~ s/^\w+://;
1088	$rule = 'as_is' if $rule eq '';
1089	}
1090
1091	if ($rule eq '') {
1092	@results = ();
1093	} elsif ($rule eq 'content') {
1094	@results = ($Element => $data->{_content});
1095	} elsif ($rule eq 'content trim') {
1096	s/^\s+//,s/\s+$// for ($data->{_content});
1097	@results = ($Element => $data->{_content});
1098	} elsif ($rule eq 'content array') {
1099	@results = ('@'.$Element => $data->{_content});
1100	} elsif ($rule eq 'as is') {
1101	@results = ($Element => $data);
1102	} elsif ($rule eq 'as is trim') {
1103	s/^\s+//,s/\s+$// for ($data->{_content});
1104	@results = ($Element => $data);
1105	} elsif ($rule eq 'as array') {
1106	@results = ('@'.$Element => $data);
1107	} elsif ($rule eq 'as array trim') {
1108	s/^\s+//,s/\s+$// for ($data->{_content});
1109	@results = ('@'.$Element => $data);
1110	} elsif ($rule eq 'no content') {
1111	delete ${$data}{_content}; @results = ($Element => $data);
1112	} elsif ($rule eq 'no content array' or $rule eq 'as array no content') {
1113	delete ${$data}{_content}; @results = ('@' . $Element => $data);
1114
1115	} elsif ($rule eq 'pass') {
1116	@results = (%$data);
1117	} elsif ($rule eq 'pass trim') {
1118	s/^\s+//,s/\s+$// for ($data->{_content});
1119	@results = (%$data);
1120	} elsif ($rule eq 'pass no content' or $rule eq 'pass without content') {
1121	delete ${$data}{_content}; @results = (%$data);
1122
1123	} elsif ($rule eq 'raw') {
1124	@results = [$Element => $data];
1125
1126	} elsif ($rule eq 'raw extended') {
1127	@results = (':'.$Element => $data, [$Element => $data]);
1128
1129	} elsif ($rule eq 'raw extended array') {
1130	@results = ('@:'.$Element => $data, [$Element => $data]);
1131
1132	} elsif ($rule =~ /^((?:no )?content )?by\s+(\S+)$/) {
1133	my ($cnt,$attr) = ($1,$2);
1134	if ($cnt eq 'no content ') {
1135	delete $data->{_content};
1136	}
1137	if ($attr =~ /,/) {
1138	my @attr = split /,/, $attr;
1139	foreach (@attr) {
1140	next unless exists ($data->{$_});
1141	if ($cnt eq 'content ') {
1142	@results = ($data->{$_} => $data->{_content})
1143	} else {
1144	@results = (delete $data->{$_} => $data)
1145	}
1146	last;
1147	}
1148	} else {
1149	if ($cnt eq 'content ') {
1150	@results = ($data->{$attr} => $data->{_content})
1151	} else {
1152	@results = (delete $data->{$attr} => $data);
1153	}
1154	}
1155
1156	} else {
1157	croak "Unknown predefined rule '$rule'!";
1158	}
1159	}
1160
1161	if (! @results or (@results % 2 == 0) or $results[-1] eq '') {
1162	if ($self->{opt}{stripspaces} & 3 and @{$self->{data}} and $self->{data}[-1]{_content}) { # stripping some spaces, it's not root and it did not return content
1163	#print "maybe stripping some spaces in $Element, it's not root and it did not return content\n";
1164	if (($self->{opt}{stripspaces} & 3) < 3 and $self->{data}[-1]{_content}) {
1165	# rtrim parent content
1166	#print " yes, rtrim parent '$self->{data}[-1]{_content}'\n";
1167	$self->_rtrim( $self->{data}[-1], ($self->{opt}{stripspaces} & 4));
1168	#print " result '$self->{data}[-1]{_content}'\n";
1169	}
1170
1171	$self->{_ltrim}[-1] = (($self->{opt}{stripspaces} & 4) ? 2 : 1)
1172	if ($self->{opt}{stripspaces} & 3) == 2;
1173	}
1174	} else {
1175	$self->{_ltrim}[-1] = 0;
1176	}
1177	if (($self->{opt}{stripspaces} & 3) == 3) {
1178	$self->{_ltrim}[-1] = (($self->{opt}{stripspaces} & 4) ? 2 : 1);
1179	}
1180
1181
1182	return unless scalar(@results) or scalar(@results) == 1 and ($results[0] eq '' or !defined($results[0]));
1183
1184	@{$self->{data}} = ({}) unless @{$self->{data}}; # oops we are already closing the root tag! We do need there to be at least one hashref in $self->{data}
1185
1186	if (scalar(@results) % 2) {
1187	# odd number of items, last is content
1188	my $value = pop(@results);
1189	_add_content( $self->{data}[-1], $value);
1190	}
1191
1192	while (@results) {
1193	my ($key, $value) = ( shift(@results), shift(@results));
1194	if ($key eq '_content') {
1195	_add_content( $self->{data}[-1], $value);
1196	} elsif ($key =~ s/^\@//) {
1197	if (exists($self->{data}[-1]{$key}) and ref($self->{data}[-1]{$key}) ne 'ARRAY') {
1198	$self->{data}[-1]{$key} = [$self->{data}[-1]{$key}, $value];
1199	} else {
1200	push @{$self->{data}[-1]{$key}}, $value;
1201	}
1202	} elsif ($key =~ s/^\+//) {
1203	if (exists($self->{data}[-1]{$key})) {
1204	$self->{data}[-1]{$key} += $value;
1205	} else {
1206	$self->{data}[-1]{$key} = $value;
1207	}
1208	} elsif ($key =~ s/^\*//) {
1209	if (exists($self->{data}[-1]{$key})) {
1210	$self->{data}[-1]{$key} *= $value;
1211	} else {
1212	$self->{data}[-1]{$key} = $value;
1213	}
1214	} elsif ($key =~ s/^\.//) {
1215	if (exists($self->{data}[-1]{$key})) {
1216	$self->{data}[-1]{$key} .= $value;
1217	} else {
1218	$self->{data}[-1]{$key} = $value;
1219	}
1220	} elsif ($key =~ s/^\%//) {
1221	if (exists($self->{data}[-1]{$key})) {
1222	if (ref($value) eq 'HASH') {
1223	%{$self->{data}[-1]{$key}} = (%{$self->{data}[-1]{$key}}, %$value);
1224	} elsif (ref($value) eq 'ARRAY') {
1225	%{$self->{data}[-1]{$key}} = (%{$self->{data}[-1]{$key}}, @$value);
1226	} else {
1227	croak "The value of the rule return \%$key must be a hash or array ref!";
1228	}
1229	} else {
1230	if (ref($value) eq 'HASH') {
1231	$self->{data}[-1]{$key} = $value;
1232	} elsif (ref($value) eq 'ARRAY') {
1233	$self->{data}[-1]{$key} = {@$value};
1234	} else {
1235	croak "The value of the rule return \%$key must be a hash or array ref!";
1236	}
1237	}
1238	} else {
1239	warn "The attribute '$key' already exists for tag $self->{context}[-1].\n old value: $self->{data}[-1]{$key}\n new value: $value\n"
1240	if ($self->{opt}{warnoverwrite} and exists $self->{data}[-1]{$key} and $self->{data}[-1]{$key} ne $value);
1241	$self->{data}[-1]{$key} = $value;
1242	}
1243	}
1244	}
1245	}
1246
1247	sub _StartIgnore {
1248	my ($self) = shift;
1249	return sub {
1250	$self->{ignore_level}++
1251	}
1252	}
1253
1254	sub _EndIgnore {
1255	my ($self) = shift;
1256	return sub {
1257	return if --$self->{ignore_level};
1258
1259	$self->{parser}->setHandlers(@{$self->{normal_handlers}})
1260	}
1261	}
1262
1263	sub _add_content {
1264	my ($hash, $value) = @_;
1265	if (ref($value)) {
1266	if (ref($hash->{_content})) {
1267	# both are refs, push to @_content
1268	push @{$hash->{_content}}, $value;
1269	} elsif (exists($hash->{_content})) {
1270	# result is ref, _content is not -> convert to an arrayref containing old _content and result
1271	$hash->{_content} = [ $hash->{_content}, $value]
1272	} else {
1273	# result is ref, _content is not present
1274	$hash->{_content} = [ $value]
1275	}
1276	} else {
1277	if (ref($hash->{_content})) {
1278	# _content is an arrayref, value is a string
1279	if (ref $hash->{_content}[-1]) {
1280	# the last element is a ref -> push
1281	push @{$hash->{_content}}, $value;
1282	} else {
1283	# the last element is a string -> concatenate
1284	$hash->{_content}[-1] .= $value;
1285	}
1286	} else {
1287	# neither is ref, concatenate
1288	$hash->{_content} .= $value;
1289	}
1290	}
1291	}
1292
1293	=head1 METHODS
1294
1295	=head2 parse
1296
1297	$parser->parse( $string [, $parameters]);
1298	$parser->parse( $IOhandle [, $parameters]);
1299
1300	Parses the XML in the string or reads and parses the XML from the opened IO handle,
1301	executes the rules as it encounters the closing tags and returns the resulting structure.
1302
1303	The scalar or reference passed as the second parameter to the parse() method is assigned to
1304	$parser->{parameters} for the parsing of the file or string. Once the XML is parsed the key is
1305	deleted. This means that the $parser does not retain a reference to the $parameters after the parsing.
1306
1307	=head2 parsestring
1308
1309	$parser->parsestring( $string [, $parameters]);
1310
1311	Just an alias to ->parse().
1312
1313	=head2 parsefile
1314
1315	$parser->parsefile( $filename [, $parameters]);
1316
1317	Opens the specified file and parses the XML and executes the rules as it encounters
1318	the closing tags and returns the resulting structure.
1319
1320
1321	=head2 filter
1322
1323	$parser->filter( $string);
1324	$parser->filter( $string, $OutputIOhandle [, $parameters]);
1325	$parser->filter( $InputIOhandle, $OutputIOhandle [, $parameters]);
1326	$parser->filter( $string, $OutputFilename [, $parameters]);
1327	$parser->filter( $InputIOhandle, $OutputFilename [, $parameters]);
1328	$parser->filter( $string, $StringReference [, $parameters]);
1329	$parser->filter( $InputIOhandle, $StringReference [, $parameters]);
1330
1331	Parses the XML in the string or reads and parses the XML from the opened IO handle,
1332	copies the tags that do not have a subroutine rule specified and do not occure under such a tag,
1333	executes the specified rules and prints the results to select()ed filehandle, $OutputFilename or
1334	$OutputIOhandle or stores them in the scalar referenced by $StringReference.
1335
1336	The scalar or reference passed as the third parameter to the filter() method is assigned to
1337	$parser->{parameters} for the parsing of the file or string. Once the XML is parsed the key is
1338	deleted. This means that the $parser does not retain a reference to the $parameters after the parsing.
1339
1340	=head2 filterstring
1341
1342	$parser->filterstring( ...);
1343
1344	Just an alias to ->filter().
1345
1346	=head2 filterfile
1347
1348	$parser->filterfile( $filename, $OutputIOhandle [, $parameters]);
1349	$parser->filterfile( $filename, $OutputFilename [, $parameters]);
1350
1351	Parses the XML in the specified file, copies the tags that do not have a subroutine rule specified
1352	and do not occure under such a tag, executes the specified rules and prints the results to select()ed
1353	filehandle, $OutputFilename or $OutputIOhandle or stores them in the scalar
1354	referenced by $StringReference.
1355
1356	The scalar or reference passed as the third parameter to the filter() method is assigned to
1357	$parser->{parameters} for the parsing of the file or string. Once the XML is parsed the key is
1358	deleted. This means that the $parser does not retain a reference to the $parameters after the parsing.
1359
1360	=cut
1361
1362	sub escape_value {
1363	my($self, $data, $level) = @_;
1364
1365	return '' unless(defined($data) and $data ne '');
1366
1367	if ($self->{opt}{output_encoding} ne $self->{opt}{encode}) {
1368	$data = Encode::decode( $self->{opt}{encode}, $data) if $self->{opt}{encode};
1369	$data = Encode::encode( $self->{opt}{output_encoding}, $data) if $self->{opt}{output_encoding};
1370	}
1371
1372	$data =~ s/&/&/sg;
1373	$data =~ s/</</sg;
1374	$data =~ s/>/>/sg;
1375	$data =~ s/"/"/sg;
1376
1377	$level = $self->{opt}->{numericescape} unless defined $level;
1378	return $data unless $level;
1379
1380	return $self->_numeric_escape($data, $level);
1381	}
1382
1383	sub _numeric_escape {
1384	my($self, $data, $level) = @_;
1385
1386	use utf8; # required for 5.6
1387
1388	if($self->{opt}->{numericescape} eq '2') {
1389	$data =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/gse;
1390	}
1391	else {
1392	$data =~ s/([^\x00-\xFF])/'&#' . ord($1) . ';'/gse;
1393	}
1394
1395	return $data;
1396	}
1397
1398	=head2 escape_value
1399
1400	$parser->escape_value( $data [, $numericescape])
1401
1402	This method escapes the $data for inclusion in XML, the $numericescape may be 0, 1 or 2
1403	and controls whether to convert 'high' (non ASCII) characters to XML entities.
1404
1405	0 - default: no numeric escaping (OK if you're writing out UTF8)
1406
1407	1 - only characters above 0xFF are escaped (ie: characters in the 0x80-FF range are not escaped), possibly useful with ISO8859-1 output
1408
1409	2 - all characters above 0x7F are escaped (good for plain ASCII output)
1410
1411	You can also specify the default value in the constructor
1412
1413	my $parser = XML::Rules->new(
1414	...
1415	NumericEscape => 2,
1416	);
1417
1418	=cut
1419
1420	sub ToXML;*ToXML=\&toXML;
1421	sub toXML {
1422	my ($self, $tag, $attrs, $no_close, $ident, $base) = @_;
1423
1424	my $prefix = (defined ($ident) ? "\n$base" : '');
1425
1426	$attrs = undef if (ref $attrs eq 'HASH' and ! %{$attrs}); # ->toXML( $tagname, {}, ...)
1427
1428	if ($tag eq '') {
1429	if (! ref($attrs)) { # ->toXML( '', $string_content, ...)
1430	return $self->escape_value($attrs);
1431	} elsif (ref($attrs) eq 'ARRAY') {
1432	if (@$attrs) {
1433	return join( '', map {
1434	if (!ref($_)) {
1435	$self->escape_value($_)
1436	} elsif (ref($_) eq 'ARRAY' and @$_ == 2) {
1437	$self->toXML($_->[0], $_->[1])
1438	} else {
1439	croak "The content in XML::Rules->ToXML( '', here) must be a string or an arrayref containing strings and two element arrayrefs!";
1440	}
1441	} @$attrs);
1442	} else {
1443	return "<$attrs/>"
1444	}
1445	}
1446	}
1447
1448	if (! ref($attrs)) { # ->toXML( $tagname, $string_content, ...)
1449	if ($no_close) {
1450	return "<$tag>" . $self->escape_value($attrs);
1451	} elsif (! defined $attrs) {
1452	return "<$tag/>";
1453	} else {
1454	return "<$tag>" . $self->escape_value($attrs) . "</$tag>";
1455	}
1456	} elsif (ref($attrs) eq 'ARRAY') {
1457	return join( $prefix, map $self->toXML($tag, $_, 0, $ident, $base), @$attrs);
1458	}
1459
1460
1461	my $content = $attrs->{_content};
1462	my $result = "<$tag";
1463	my $subtags = '';
1464	foreach my $key (sort keys %$attrs) {
1465	next if $key =~ /^:/ or $key eq '_content';
1466	if (ref $attrs->{$key}) {
1467	if (ref $attrs->{$key} eq 'ARRAY') {
1468	if (@{$attrs->{$key}}) {
1469	foreach my $subtag (@{$attrs->{$key}}) {
1470	$subtags .= $prefix . $ident . $self->toXML($key, $subtag, 0, $ident, $base.$ident);
1471	}
1472	} else {
1473	$subtags .= $prefix . $ident . "<$key/>";
1474	}
1475	} elsif (ref $attrs->{$key} eq 'HASH') {
1476	$subtags .= $prefix . $ident . $self->toXML($key, $attrs->{$key}, 0, $ident, $base.$ident)
1477	} else {
1478	croak(ref($attrs->{$key}) . " attributes not supported in XML::Rules->toXML()!");
1479	}
1480	} else {
1481	$result .= qq{ $key="} . $self->escape_value($attrs->{$key}) . qq{"};
1482	}
1483	}
1484	if (! defined $content and $subtags eq '') {
1485	if ($no_close) {
1486	return $result.">";
1487	} else {
1488	return $result."/>";
1489	}
1490
1491	} elsif (!ref($content)) { # content is a string, not an array of strings and subtags
1492	if ($no_close) {
1493	return "$result>$subtags" . $self->escape_value($content);
1494	} elsif ($content eq '' and $subtags ne '') {
1495	return "$result>$subtags$prefix</$tag>";
1496	} else {
1497	return "$result>$subtags" . $self->escape_value($content) ."</$tag>";
1498	}
1499
1500	} elsif (ref($content) eq 'ARRAY') {
1501	$result .= ">$subtags";
1502	foreach my $snippet (@$content) {
1503	if (!ref($snippet)) {
1504	$result .= $self->escape_value($snippet);
1505	} elsif (ref($snippet) eq 'ARRAY') {
1506	if (@$snippet == 2) {
1507	$result .= $self->toXML($snippet->[0], $snippet->[1]);
1508	} else {
1509	croak("Arrays in _content must be in format [\$tagname => \\\%attrs, ...] in XML::Rules->toXML()!");
1510	}
1511	} else {
1512	croak(ref($snippet) . " not supported in _content in XML::Rules->toXML()!");
1513	}
1514	}
1515	if ($no_close) {
1516	return $result;
1517	} else {
1518	return $result."</$tag>";
1519	}
1520	} else {
1521	croak(ref($content) . " _content not supported in XML::Rules->toXML()!");
1522	}
1523	}
1524
1525	sub parentsToXML {
1526	my ($self, $level) = @_;
1527	my $tag_names = $self->{context};
1528	my $tag_attrs = $self->{data};
1529
1530	$level = scalar(@$tag_names) unless $level;
1531
1532	my $result = '';
1533	for (my $i = -1; -$i <= $level; $i--) {
1534	$result = $self->toXML( ${$tag_names}[$i], ${$tag_attrs}[$i], 1) . $result;
1535	}
1536	return $result;
1537	}
1538
1539	sub closeParentsToXML {
1540	my ($self, $level) = @_;
1541	my $tag_names = $self->{context};
1542
1543	if ($level) {
1544	return '</' . join( '></', (reverse(@{$tag_names}))[0..$level-1]) . '>';
1545	} else {
1546	return '</' . join( '></', reverse(@$tag_names)) . '>';
1547	}
1548	}
1549
1550	=head2 toXML / ToXML
1551
1552	$xml = $parser->toXML( $tagname, \%attrs[, $do_not_close, $ident, $base])
1553
1554	You may use this method to convert the datastructures created by parsing the XML into the XML format.
1555	Not all data structures may be printed! I'll add more docs later, for now please do experiment.
1556
1557	The $ident and $base, if defined, turn on and control the pretty-printing. The $ident specifies the character(s)
1558	used for one level of identation, the base contains the identation of the current tag. That is if you want to include the data inside of
1559
1560	<data>
1561	<some>
1562	<subtag>$here</subtag>
1563	</some>
1564	</data>
1565
1566	you will call
1567
1568	$parser->toXML( $tagname, \%attrs, 0, "\t", "\t\t\t");
1569
1570	The method does NOT validate that the $ident and $base are whitespace only, but of course if it's not you end up with invalid
1571	XML. Newlines are added only before the start tag and (if the tag has only child tags and no content) before the closing tag,
1572	but not after the closing tag! Newlines are added even if the $ident is an empty string.
1573
1574	=head2 parentsToXML
1575
1576	$xml = $parser->parentsToXML( [$level])
1577
1578	Prints all or only the topmost $level ancestor tags, including the attributes and content (parsed so far),
1579	but without the closing tags. You may use this to print the header of the file you are parsing,
1580	followed by calling toXML() on a structure you build and then by closeParentsToXML() to close
1581	the tags left opened by parentsToXML(). You most likely want to use the style => 'filter' option
1582	for the constructor instead.
1583
1584	=head2 closeParentsToXML
1585
1586	$xml = $parser->closeParentsToXML( [$level])
1587
1588	Prints the closing tags for all or the topmost $level ancestor tags of the one currently processed.
1589
1590	=head2 paths2rules
1591
1592	my $parser = XML::Rules->new(
1593	rules => paths2rules {
1594	'/root/subtag/tag' => sub { ...},
1595	'/root/othertag/tag' => sub {...},
1596	'tag' => sub{ ... the default code for this tag ...},
1597	...
1598	}
1599	);
1600
1601	This helper function converts a hash of "somewhat xpath-like" paths and subs/rules into the format required by the module.
1602	Due to backwards compatibility and efficiency I can't directly support paths in the rules and the direct syntax for their
1603	specification is a bit awkward. So if you need the paths and not the regexps, you may use this helper instead of:
1604
1605	my $parser = XML::Rules->new(
1606	rules => {
1607	'tag' => [
1608	'/root/subtag' => sub { ...},
1609	'/root/othertag' => sub {...},
1610	sub{ ... the default code for this tag ...},
1611	],
1612	...
1613	}
1614	);
1615
1616	=cut
1617
1618	sub paths2rules {
1619	my ($paths) = @_;
1620
1621	my %rules;
1622	while ( my ($tag, $val) = each %$paths) {
1623
1624	if ($tag =~ m{^(.)/(.)$}) {
1625	my ($path, $tagname) = ($1, $2);
1626
1627	if (exists $rules{$tagname} and ref($rules{$tagname}) eq 'ARRAY') {
1628	if (@{$rules{$tagname}} % 2) {
1629	push @{$rules{$tagname}}, $path, $val;
1630	} else {
1631	splice @{$rules{$tagname}}, -1, 0, $path, $val;
1632	}
1633	} else {
1634	$rules{$tagname} = [ $path => $val]
1635	}
1636
1637	} elsif (exists $rules{$tag} and ref($rules{$tag}) eq 'ARRAY') {
1638	push @{$rules{$tag}}, $val;
1639	} else {
1640	$rules{$tag} = $val
1641	}
1642	}
1643
1644	return \%rules;
1645	}
1646
1647	=head2 return_nothing
1648
1649	Stop parsing the XML, forget any data we already have and return from the $parser->parse().
1650	This is only supposed to be used within rules and may be called both as a method and as
1651	an ordinary function (it's not exported).
1652
1653	=head2 skip_rest
1654
1655	Stop parsing the XML and return whatever data we already have from the $parser->parse().
1656	The rules for the currently opened tags are evaluated as if the XML contained all
1657	the closing tags in the right order.
1658
1659	These two work via raising an exception, the exception is caught within the $parser->parse() and do not propagate outside.
1660	It's also safe to raise any other exception within the rules, the exception will be caught as well, the internal state of the $parser object
1661	is cleaned and the exception is rethrown.
1662
1663	=head2 inferRulesFromExample
1664
1665	Dumper(XML::Rules::inferRulesFromExample( $fileOrXML, $fileOrXML, $fileOrXML, ...)
1666	Dumper(XML::Rules->inferRulesFromExample( $fileOrXML, $fileOrXML, $fileOrXML, ...)
1667	Dumper($parser->inferRulesFromExample( $fileOrXML, $fileOrXML, $fileOrXML, ...)
1668
1669	The subroutine parses the listed files and infers the rules that would produce the minimal, but complete datastructure.
1670	It finds out what tags may be repeated, whether they contain text content, attributes etc. You may want to give
1671	the subroutine several examples to make sure it knows about all possibilities. You should use this once and store
1672	the generated rules in your script or even take this as the basis of a more specific set of rules.
1673
1674	=cut
1675
1676	sub inferRulesFromExample {
1677	shift(@_) if $_[0] eq 'XML::Rules' or ref($_[0]);
1678	my @files = @_;
1679
1680	my %rules;
1681
1682	my $parser = XML::Rules->new(
1683	namespaces => { '*' => ''},
1684	rules => {
1685	_default => sub {
1686	my ($tag, $attrs, $context, $parent_data, $parser) = @_;
1687	my $repeated = (exists $parent_data->[-1] and exists $parent_data->[-1]{$tag});
1688	my $has_content = (exists $attrs->{_content});
1689	my $has_children = grep ref($_) eq 'HASH', values %$attrs;
1690	my $has_attr = grep {$_ ne '_content' and !ref($attrs->{$_})} keys %$attrs;
1691
1692	my $rule = do {
1693	if ($repeated) {
1694	if ($has_content) {
1695	if ($has_attr or $has_children) {
1696	'as array'
1697	} else {
1698	'content array'
1699	}
1700	} else {
1701	if ($has_attr or $has_children) {
1702	'as array no content'
1703	} else {
1704	'content array'
1705	}
1706	}
1707	} else {
1708	if ($has_content) {
1709	if ($has_attr or $has_children) {
1710	'as is'
1711	} else {
1712	'content'
1713	}
1714	} else {
1715	if ($has_attr or $has_children) {
1716	'no content'
1717	} else {
1718	'content'
1719	}
1720	}
1721	}
1722	};
1723
1724	if (not exists $rules{$tag}) {
1725	$rules{$tag} = $rule
1726	} elsif($rules{$tag} ne $rule) {
1727	# we've already seen the tag and it had different type
1728	if ($rules{$tag} eq 'raw extended array') {
1729	} elsif ($rule eq 'raw extended array') {
1730	$rules{$tag} = 'raw extended array';
1731	} elsif ($rules{$tag} eq 'raw extended' and $rule =~ /array/
1732	or $rule eq 'raw extended' and $rules{$tag} =~ /array/) {
1733	$rules{$tag} = 'raw extended array'
1734	} elsif ($rules{$tag} eq 'as array' or $rule eq 'as array') {
1735	$rules{$tag} = 'as array'
1736	} elsif ($rules{$tag} eq 'content array' and $rule eq 'content'
1737	or $rule eq 'content array' and $rules{$tag} eq 'content') {
1738	$rules{$tag} = 'content array'
1739	} elsif ($rules{$tag} eq 'content array' and $rule eq 'as array no content'
1740	or $rule eq 'content array' and $rules{$tag} eq 'as array no content') {
1741	$rules{$tag} = 'as array'
1742	} elsif ($rules{$tag} eq 'content array' and $rule eq 'as is'
1743	or $rule eq 'content array' and $rules{$tag} eq 'as is') {
1744	$rules{$tag} = 'as array'
1745	} elsif ($rules{$tag} eq 'content array' and $rule eq 'no content'
1746	or $rule eq 'content array' and $rules{$tag} eq 'no content') {
1747	$rules{$tag} = 'as array'
1748	} elsif ($rules{$tag} eq 'as array no content' and $rule eq 'as is'
1749	or $rule eq 'as array no content' and $rules{$tag} eq 'as is') {
1750	$rules{$tag} = 'as array'
1751	} elsif ($rules{$tag} eq 'as array no content' and $rule eq 'content'
1752	or $rule eq 'as array no content' and $rules{$tag} eq 'content') {
1753	$rules{$tag} = 'as array'
1754	} elsif ($rules{$tag} eq 'as array no content' and $rule eq 'no content'
1755	or $rule eq 'as array no content' and $rules{$tag} eq 'no content') {
1756	$rules{$tag} = 'as array no content'
1757	} elsif ($rules{$tag} eq 'as is' and ($rule eq 'no content' or $rule eq 'content')
1758	or $rule eq 'as is' and ($rules{$tag} eq 'no content' or $rules{$tag} eq 'content')) {
1759	$rules{$tag} = 'as is'
1760	} elsif ($rules{$tag} eq 'content' and $rule eq 'no content'
1761	or $rule eq 'content' and $rules{$tag} eq 'no content') {
1762	$rules{$tag} = 'as is'
1763	} else {
1764	die "Unexpected combination of rules: old=$rules{$tag}, new=$rule for tag $tag\n";
1765	}
1766	}
1767
1768	if ($has_content and $has_children) { # the tag contains both text content and subtags!, need to use the raw extended rules
1769	foreach my $child (grep ref($attrs->{$_}) eq 'HASH', keys %$attrs) {
1770	next if $rules{$child} =~ /^raw extended/;
1771	if ($rules{$child} =~ /array/) {
1772	$rules{$child} = 'raw extended array'
1773	} else {
1774	$rules{$child} = 'raw extended'
1775	}
1776	}
1777	}
1778	return $tag => {};
1779	}
1780	},
1781	stripspaces => 7,
1782	);
1783
1784	for (@files) {
1785	eval {
1786	if (! ref($_) and $_ !~ /\n/ and $_ !~ /^\s*</) {
1787	$parser->parsefile($_);
1788	} else {
1789	$parser->parse($_);
1790	}
1791	} or croak "Error parsing $_: $@\n";
1792	}
1793
1794	my %short_rules;
1795	foreach my $tag (sort keys %rules) {
1796	push @{$short_rules{$rules{$tag}}}, $tag
1797	}
1798
1799	foreach my $tags (values %short_rules) {
1800	$tags = join ',', @$tags;
1801	}
1802	%short_rules = reverse %short_rules;
1803
1804	return \%short_rules;
1805	}
1806
1807	=head2 inferRulesFromDTD
1808
1809	Dumper(XML::Rules::inferRulesFromDTD( $DTDfile, [$enableExtended]))
1810	Dumper(XML::Rules->inferRulesFromDTD( $DTDfile, [$enableExtended]))
1811	Dumper($parser->inferRulesFromDTD( $DTDfile, [$enableExtended]))
1812
1813	The subroutine parses the DTD and infers the rules that would produce the minimal, but complete datastructure.
1814	It finds out what tags may be repeated, whether they contain text content, attributes etc. You may use this
1815	each time you are about to parse the XML, once and store the generated rules in your script or even take this
1816	as the basis of a more specific set of rules.
1817
1818	With the second parameter set to a true value, the tags included in a mixed content will use the "raw extended"
1819	or "raw extended array" types instead of just "raw". This makes sure the tag data both stay at the right place in
1820	the content and are accessible easily from the parent tag's atrribute hash.
1821
1822	This subroutine requires the XML::DTDParser module!
1823
1824	=cut
1825
1826	sub inferRulesFromDTD {
1827	require XML::DTDParser;
1828
1829	my ($DTDfile, $enable_extended) = @_;
1830
1831	my $DTD = XML::DTDParser::ParseDTDFile($DTDfile);
1832
1833	my $has_mixed = 0;
1834	foreach my $tag (values %$DTD) {
1835	$tag->{is_mixed} = (($tag->{content} and $tag->{children}) ? 1 : 0)
1836	and $has_mixed = 1;
1837	}
1838
1839	my %settings;
1840	foreach my $tagname (keys %$DTD) {
1841	my $tag = $DTD->{$tagname};
1842
1843	my $repeated = ($tag->{option} =~ /^[+*]$/ ? 1 : 0);
1844	my $has_content = $tag->{content};
1845
1846	my $in_mixed = grep {$DTD->{$_}{is_mixed}} @{$tag->{parent}};
1847
1848	if ($in_mixed) {
1849	if ($enable_extended) {
1850	if ($repeated) {
1851	$settings{$tagname} = "raw extended array"
1852	} else {
1853	$settings{$tagname} = "raw extended"
1854	}
1855	} else {
1856	$settings{$tagname} = "raw"
1857	}
1858	} else {
1859	if (exists $DTD->{attributes} or exists $tag->{children}) {
1860	my @ids ;
1861	if (exists $DTD->{attributes}) {
1862	@ids = grep {$DTD->{attributes}{$_}[0] eq 'ID' and $DTD->{attributes}{$_}[0] eq '#REQUIRED'} keys %{$DTD->{attributes}};
1863	}
1864	if (scalar(@ids) == 1) {
1865	if ($has_content) {
1866	$settings{$tagname} = "by $ids[0]"
1867	} else {
1868	$settings{$tagname} = "no content by $ids[0]"
1869	}
1870	} else {
1871	if ($has_content) {
1872	if ($repeated) {
1873	$settings{$tagname} = "as array"
1874	} else {
1875	$settings{$tagname} = "as is"
1876	}
1877	} else {
1878	if ($repeated) {
1879	$settings{$tagname} = "as array no content"
1880	} else {
1881	$settings{$tagname} = "no content"
1882	}
1883	}
1884	}
1885	} elsif ($repeated) {
1886	$settings{$tagname} = "content array"
1887	} else {
1888	$settings{$tagname} = "content array"
1889	}
1890	}
1891	}
1892
1893	# use Data::Dumper;
1894	# print Dumper(\%settings);
1895
1896	my %compressed;
1897	{
1898	my %tmp;
1899	while (my ($tag, $option) = each %settings) {
1900	push @{$tmp{$option}}, $tag;
1901	}
1902
1903	while (my ($option, $tags) = each %tmp) {
1904	$compressed{join ',', @$tags} = $option
1905	}
1906	}
1907
1908	if ($has_mixed) {
1909	$compressed{"#stripspaces"} = 0;
1910	} else {
1911	$compressed{"#stripspaces"} = 7;
1912	}
1913
1914	return \%compressed;
1915	}
1916
1917	=head1 Properties
1918
1919	=head2 parameters
1920
1921	You can pass a parameter (scalar or reference) to the parse...() or filter...() methods, this parameter
1922	is later available to the rules as $parser->{parameters}. The module will never use this parameter for
1923	any other purpose so you are free to use it for any purposes provided that you expect it to be reset by
1924	each call to parse...() or filter...() first to the passed value and then, after the parsing is complete, to undef.
1925
1926	=head2 pad
1927
1928	The $parser->{pad} key is specificaly reserved by the module as a place where the module users can
1929	store their data. The module doesn't and will not use this key in any way, doesn't set or reset it under any
1930	circumstances. If you need to share some data between the rules and do not want to use the structure built
1931	by applying the rules you are free to use this key.
1932
1933	You should refrain from modifying or accessing other properties of the XML::Rules object!
1934
1935	=head1 Namespace support
1936
1937	By default the module doesn't handle namespaces in any way, it doesn't check for
1938	xmlns or xmlns:alias attributes and it doesn't strip or mangle the namespace aliases
1939	in tag or attribute names. This means that if you know for sure what namespace
1940	aliases will be used you can set up rules for tags including the aliases and unless
1941	someone decides to use a different alias or makes use of the default namespace
1942	change your script will work.
1943
1944	If you do specify any namespace to alias mapping in the constructor it does
1945	start processing the namespace stuff. The xmlns and xmlns:alias attributes
1946	are stripped from the datastructures and the aliases are transformed from
1947	whatever the XML author decided to use to whatever your namespace mapping
1948	specifies. Aliases are also added to all tags that belong to a default namespace.
1949
1950	Assuming the constructor parameters contain
1951
1952	namespaces => {
1953	'http://my.namespaces.com/foo' => 'foo',
1954	'http://my.namespaces.com/bar' => 'bar',
1955	}
1956
1957	and the XML looks like this:
1958
1959	<root>
1960	<Foo xmlns="http://my.namespaces.com/foo">
1961	<subFoo>Hello world</subfoo>
1962	</Foo>
1963	<other xmlns:b="http://my.namespaces.com/bar">
1964	<b:pub>
1965	<b:name>NaRuzku</b:name>
1966	<b:address>at any crossroads</b:address>
1967	<b:desc>Fakt <b>desnej</b> pajzl.</b:desc>
1968	</b:pub>
1969	</other>
1970	</root>
1971
1972	then the rules wil be called as if the XML looked like this:
1973
1974	<root>
1975	<foo:Foo>
1976	<foo:subFoo>Hello world</foo:subfoo>
1977	</foo:Foo>
1978	<other>
1979	<bar:pub>
1980	<bar:name>NaRuzku</bar:name>
1981	<bar:address>at any crossroads</bar:address>
1982	<bar:desc>Fakt <b>desnej</b> pajzl.</bar:desc>
1983	</bar:pub>
1984	</other>
1985	</root>
1986
1987
1988	This means that the namespace handling will only normalize the aliases used.
1989
1990	It is possible to specify an empty alias, so eg. in case you are processing a SOAP XML
1991	and know the tags defined by SOAP do not colide with the tags in the enclosed XML
1992	you may simplify the parsing by removing all namespace aliases.
1993
1994	If the XML references a namespace not present in the map you will get a warning
1995	and the alias used for that namespace will be left intact! If you do not want to get the warnings specify
1996	a namespace with URI "*".
1997
1998
1999	=head1 HOW TO USE
2000
2001	You may view the module either as a XML::Simple on steriods and use it to build a data structure
2002	similar to the one produced by XML::Simple with the added benefit of being able
2003	to specify what tags or attributes to ignore, when to take just the content, what to store as an array etc.
2004
2005	Or you could view it as yet another event based XML parser that differs from all the others only in one thing.
2006	It stores the data for you so that you do not have to use globals or closures and wonder where to attach
2007	the snippet of data you just received onto the structure you are building.
2008
2009	You can use it in a way similar to XML::Twig with simplify(), specify the rules to transform the lower
2010	level tags into a XML::Simple like (simplify()ed) structure and then handle the structure in the rule for
2011	the tag(s) you'd specify in XML::Twig's twig_roots.
2012
2013	=head1 Unrelated tricks
2014
2015	If you need to parse a XML file without the top-most tag (something that each and any sane person would allow,
2016	but the XML comitee did not), you can parse
2017
2018	<!DOCTYPE doc [<!ENTITY real_doc SYSTEM "$the_file_name">]><doc>&real_doc;</doc>
2019
2020	instead.
2021
2022	=head1 AUTHOR
2023
2024	Jan Krynicky, C<< <Jenda at CPAN.org> >>
2025
2026	=head1 BUGS
2027
2028	Please report any bugs or feature requests to
2029	C<bug-xml-rules at rt.cpan.org>, or through the web interface at
2030	L<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=XML-Rules>.
2031	I will be notified, and then you'll automatically be notified of progress on
2032	your bug as I make changes.
2033
2034	=head1 SUPPORT
2035
2036	You can find documentation for this module with the perldoc command.
2037
2038	perldoc XML::Rules
2039
2040	You can also look for information at:
2041
2042	=over 4
2043
2044	=item * AnnoCPAN: Annotated CPAN documentation
2045
2046	L<http://annocpan.org/dist/XML-Rules>
2047
2048	=item * CPAN Ratings
2049
2050	L<http://cpanratings.perl.org/d/XML-Rules>
2051
2052	=item * RT: CPAN's request tracker
2053
2054	L<http://rt.cpan.org/NoAuth/Bugs.html?Dist=XML-Rules>
2055
2056	=item * Search CPAN
2057
2058	L<http://search.cpan.org/dist/XML-Rules>
2059
2060	=item * PerlMonks
2061
2062	Please see L<http://www.perlmonks.org/?node_id=581313> or
2063	L<http://www.perlmonks.org/?node=XML::Rules> for discussion.
2064
2065	=back
2066
2067	=head1 SEE ALSO
2068
2069	L<XML::Twig>, L<XML::LibXML>, L<XML::Pastor>
2070
2071	=head1 ACKNOWLEDGEMENTS
2072
2073	The escape_value() method is taken with minor changes from XML::Simple.
2074
2075	=head1 COPYRIGHT & LICENSE
2076
2077	Copyright 2006-2007 Jan Krynicky, all rights reserved.
2078
2079	This program is free software; you can redistribute it and/or modify it
2080	under the same terms as Perl itself.
2081
2082	=cut
2083
2084	1; # End of XML::Rules

Note: See TracBrowser for help on using the repository browser.

Download in other formats: