will have the new attribute listed at the end.
+
+ #
+ $token->set_attr(class => 'some_class');
+ print $token->as_is;
+ #
+
+ #
+ $token->set_attr('bgcolor','red');
+ print $token->as_is;
+ #
+
+After this method is called, if successful, the C, C
+and C methods will all return updated results.
+
+=item * C
+
+Under the premise that C methods should accept what their corresponding
+C methods emit, the following works:
+
+ $tag->set_attr($tag->get_attr);
+
+Theoretically that's a no-op and for purposes of rendering HTML, it should be.
+However, internally this calls C<$tag-Erewrite_tag>, so see that method to
+understand how this may affect you.
+
+Of course, this is useless if you want to actually change the attributes, so you
+can do this:
+
+ my $attrs = {
+ class => 'headline',
+ valign => 'top'
+ };
+ $token->set_attr($attrs)
+ if $token->is_start_tag('td') && $token->get_attr('class') eq 'stories';
+
+=item * C
+
+This method rewrites the tag. The tag name and the name of all attributes will
+be lower-cased. Values that are not quoted with double quotes will be. This
+may be called on both start or end tags. Note that both C and
+C call this method prior to returning.
+
+If called on a token that is not a tag, it simply returns. Regardless of how
+it is called, it returns the token.
+
+ #
+ $token->rewrite_tag;
+ print $token->as_is;
+ #
+
+A quick cleanup of sloppy HTML is now the following:
+
+ my $parser = HTML::TokeParser::Simple->new( string => $ugly_html );
+ while (my $token = $parser->get_token) {
+ $token->rewrite_tag;
+ print $token->as_is;
+ }
+
+=back
+
+=head1 PARSER VERSUS TOKENS
+
+The parser returns tokens that are blessed into appropriate classes. Some
+people get confused and try to call parser methods on tokens and token methods
+on the parser. To prevent this, C versions 1.4 and
+above now bless all tokens into appropriate token classes. Please keep this in
+mind while using this module (and many thanks to PodMaster
+L for pointing out this issue
+to me.)
+
+=head1 EXAMPLES
+
+=head2 Finding comments
+
+For some strange reason, your Pointy-Haired Boss (PHB) is convinced that the
+graphics department is making fun of him by embedding rude things about him in
+HTML comments. You need to get all HTML comments from the HTML.
+
+ use strict;
+ use HTML::TokeParser::Simple;
+
+ my @html_docs = glob( "*.html" );
+
+ open PHB, "> phbreport.txt" or die "Cannot open phbreport for writing: $!";
+
+ foreach my $doc ( @html_docs ) {
+ print "Processing $doc\n";
+ my $p = HTML::TokeParser::Simple->new( file => $doc );
+ while ( my $token = $p->get_token ) {
+ next unless $token->is_comment;
+ print PHB $token->as_is, "\n";
+ }
+ }
+
+ close PHB;
+
+=head2 Stripping Comments
+
+Uh oh. Turns out that your PHB was right for a change. Many of the comments
+in the HTML weren't very polite. Since your entire graphics department was
+just fired, it falls on you need to strip those comments from the HTML.
+
+ use strict;
+ use HTML::TokeParser::Simple;
+
+ my $new_folder = 'no_comment/';
+ my @html_docs = glob( "*.html" );
+
+ foreach my $doc ( @html_docs ) {
+ print "Processing $doc\n";
+ my $new_file = "$new_folder$doc";
+
+ open PHB, "> $new_file" or die "Cannot open $new_file for writing: $!";
+
+ my $p = HTML::TokeParser::Simple->new( $file => doc );
+ while ( my $token = $p->get_token ) {
+ next if $token->is_comment;
+ print PHB $token->as_is;
+ }
+ close PHB;
+ }
+
+=head2 Changing form tags
+
+Your company was foo.com and now is bar.com. Unfortunately, whoever wrote your
+HTML decided to hardcode "http://www.foo.com/" into the C attribute of
+the form tags. You need to change it to "http://www.bar.com/".
+
+ use strict;
+ use HTML::TokeParser::Simple;
+
+ my $new_folder = 'new_html/';
+ my @html_docs = glob( "*.html" );
+
+ foreach my $doc ( @html_docs ) {
+ print "Processing $doc\n";
+ my $new_file = "$new_folder$doc";
+
+ open FILE, "> $new_file" or die "Cannot open $new_file for writing: $!";
+
+ my $p = HTML::TokeParser::Simple->new( file => $doc );
+ while ( my $token = $p->get_token ) {
+ if ( $token->is_start_tag('form') ) {
+ my $action = $token->get_attr(action);
+ $action =~ s/www\.foo\.com/www.bar.com/;
+ $token->set_attr('action', $action);
+ }
+ print FILE $token->as_is;
+ }
+ close FILE;
+ }
+
+=head1 CAVEATS
+
+For compatability reasons with C, methods that return
+references are violating encapsulation and altering the references directly
+B alter the state of the object. Subsequent calls to C
+can thus have unexpected results. Do not alter these references directly
+unless you are following behavior described in these docs. In the future,
+certain methods such as C, C and others may return a
+copy of the reference rather than the original reference. This behavior has
+not yet been changed in order to maintain compatability with previous versions
+of this module. At the present time, your author is not aware of anyone taking
+advantage of this "feature," but it's better to be safe than sorry.
+
+Use of C<$HTML::Parser::VERSION> which is less than 3.25 may result in
+incorrect behavior as older versions do not always handle XHTML correctly. It
+is the programmer's responsibility to verify that the behavior of this code
+matches the programmer's needs.
+
+Note that C processes text in 512 byte chunks. This sometimes
+will cause strange behavior and cause text to be broken into more than one
+token. You can suppress this behavior with the following command:
+
+ $p->unbroken_text( [$bool] );
+
+See the C documentation and
+http://www.perlmonks.org/index.pl?node_id=230667 for more information.
+
+=head1 BUGS
+
+There are no known bugs, but that's no guarantee.
+
+Address bug reports and comments to: Eeop_divo_sitruc@yahoo.comE. When
+sending bug reports, please provide the version of C,
+C, C, the version of Perl, and the
+version of the operating system you are using.
+
+Reverse the name to email the author.
+
+=head1 SUBCLASSING
+
+You may wish to change the behavior of this module. You probably do not want
+to subclass C. Instead, you'll want to subclass one
+of the token classes. C is the base class for
+all tokens. Global behavioral changes should go there. Otherwise, see the
+appropriate token class for the behavior you wish to alter.
+
+=head1 SEE ALSO
+
+L
+
+L
+
+L
+
+L
+
+L
+
+L
+
+=head1 COPYRIGHT
+
+Copyright (c) 2004 by Curtis "Ovid" Poe. All rights reserved. This program is
+free software; you may redistribute it and/or modify it under the same terms as
+Perl itself
+
+=head1 AUTHOR
+
+Curtis "Ovid" Poe Eeop_divo_sitruc@yahoo.comE
+
+Reverse the name to email the author.
+
+=cut
Index: /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token.pm
===================================================================
--- /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token.pm (revision 28214)
+++ /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token.pm (revision 28214)
@@ -0,0 +1,133 @@
+package HTML::TokeParser::Simple::Token;
+
+use strict;
+
+use vars qw/ $VERSION $REVISION /;
+$REVISION = '$Id$';
+$VERSION = '3.0';
+
+sub new {
+ my ($class, $token) = @_;
+ $class->_croak("This class should not be instantiated") if __PACKAGE__ eq $class;
+ return bless $token, $class;
+}
+
+sub _croak {
+ my ($proto, $message) = @_;
+ require Carp;
+ Carp::croak($message);
+}
+
+sub _carp {
+ my ($proto, $message) = @_;
+ require Carp;
+ Carp::carp($message);
+}
+
+sub is_tag {}
+sub is_start_tag {}
+sub is_end_tag {}
+sub is_text {}
+sub is_comment {}
+sub is_declaration {}
+sub is_pi {}
+sub is_process_instruction {}
+
+sub rewrite_tag { shift }
+sub delete_attr {}
+sub set_attr {}
+sub get_tag {}
+sub return_tag {} # deprecated
+sub get_attr {}
+sub return_attr {} # deprecated
+sub get_attrseq {}
+sub return_attrseq {} # deprecated
+sub get_token0 {}
+sub return_token0 {} # deprecated
+
+# get_foo methods
+
+sub return_text {
+ my ($self) = @_;
+ $self->_carp('return_text() is deprecated. Use as_is() instead');
+ goto &as_is;
+}
+
+sub as_is { return shift->[-1] }
+
+1;
+
+__END__
+
+=head1 NAME
+
+HTML::TokeParser::Simple::Token - Base class for C
+tokens.
+
+=head1 SYNOPSIS
+
+ use HTML::TokeParser::Simple;
+ my $p = HTML::TokeParser::Simple->new( $somefile );
+
+ while ( my $token = $p->get_token ) {
+ # This prints all text in an HTML doc (i.e., it strips the HTML)
+ next unless $token->is_text;
+ print $token->as_is;
+ }
+
+=head1 DESCRIPTION
+
+This is the base class for all returned tokens. It should never be
+instantiated directly. In fact, it will C if it is.
+
+=head1 METHODS
+
+The following list of methods are provided by this class. Most of these are
+stub methods which must be overridden in a subclass. See
+L for descriptions of these methods.
+
+=over 4
+
+=item * as_is
+
+=item * delete_attr
+
+=item * get_attr
+
+=item * get_attrseq
+
+=item * get_tag
+
+=item * get_token0
+
+=item * is_comment
+
+=item * is_declaration
+
+=item * is_end_tag
+
+=item * is_pi
+
+=item * is_process_instruction
+
+=item * is_start_tag
+
+=item * is_tag
+
+=item * is_text
+
+=item * return_attr
+
+=item * return_attrseq
+
+=item * return_tag
+
+=item * return_text
+
+=item * return_token0
+
+=item * rewrite_tag
+
+=item * set_attr
+
+=back
Index: /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token/Comment.pm
===================================================================
--- /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token/Comment.pm (revision 28214)
+++ /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token/Comment.pm (revision 28214)
@@ -0,0 +1,44 @@
+package HTML::TokeParser::Simple::Token::Comment;
+
+use strict;
+
+use vars qw/ $VERSION $REVISION /;
+$REVISION = '$Id$';
+$VERSION = '1.0';
+use base 'HTML::TokeParser::Simple::Token';
+
+sub is_comment { 1 }
+
+1;
+
+__END__
+
+=head1 NAME
+
+HTML::TokeParser::Simple::Token::Comment - Token.pm comment class.
+
+=head1 SYNOPSIS
+
+ use HTML::TokeParser::Simple;
+ my $p = HTML::TokeParser::Simple->new( $somefile );
+
+ while ( my $token = $p->get_token ) {
+ # This prints all text in an HTML doc (i.e., it strips the HTML)
+ next unless $token->is_text;
+ print $token->as_is;
+ }
+
+=head1 DESCRIPTION
+
+This is the class for comment tokens.
+
+See L for detailed information about comments.
+
+=head1 OVERRIDDEN METHODS
+
+=head2 is_comment
+
+C will return true if the token is the DTD at the top of the
+HTML.
+
+=cut
Index: /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token/Declaration.pm
===================================================================
--- /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token/Declaration.pm (revision 28214)
+++ /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token/Declaration.pm (revision 28214)
@@ -0,0 +1,42 @@
+package HTML::TokeParser::Simple::Token::Declaration;
+
+use strict;
+
+use vars qw/ $VERSION $REVISION /;
+$REVISION = '$Id$';
+$VERSION = '1.0';
+use base 'HTML::TokeParser::Simple::Token';
+
+sub is_declaration { 1 }
+
+1;
+
+__END__
+
+=head1 NAME
+
+HTML::TokeParser::Simple::Token::Declaration - Token.pm declaration class.
+
+=head1 SYNOPSIS
+
+ use HTML::TokeParser::Simple;
+ my $p = HTML::TokeParser::Simple->new( $somefile );
+
+ while ( my $token = $p->get_token ) {
+ # This prints all text in an HTML doc (i.e., it strips the HTML)
+ next unless $token->is_text;
+ print $token->as_is;
+ }
+
+=head1 DESCRIPTION
+
+This is the declaration class for tokens.
+
+=head1 OVERRIDDEN METHODS
+
+=head2 is_declaration
+
+C will return true if the token is the DTD at the top of the
+HTML.
+
+=cut
Index: /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token/ProcessInstruction.pm
===================================================================
--- /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token/ProcessInstruction.pm (revision 28214)
+++ /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token/ProcessInstruction.pm (revision 28214)
@@ -0,0 +1,63 @@
+package HTML::TokeParser::Simple::Token::ProcessInstruction;
+
+use strict;
+
+use vars qw/ $VERSION $REVISION /;
+$REVISION = '$Id$';
+$VERSION = '2.0';
+use base 'HTML::TokeParser::Simple::Token';
+
+sub return_token0 { goto &get_token0 } # deprecated
+
+sub get_token0 {
+ return shift->[1];
+}
+
+sub is_pi { 1 }
+
+sub is_process_instruction { 1 }
+
+1;
+
+__END__
+
+=head1 NAME
+
+HTML::TokeParser::Simple::Token::ProcessInstruction - Token.pm process instruction class.
+
+=head1 SYNOPSIS
+
+ use HTML::TokeParser::Simple;
+ my $p = HTML::TokeParser::Simple->new( $somefile );
+
+ while ( my $token = $p->get_token ) {
+ # This prints all text in an HTML doc (i.e., it strips the HTML)
+ next unless $token->is_text;
+ print $token->as_is;
+ }
+
+=head1 DESCRIPTION
+
+Process Instructions are from XML. This is very handy if you need to parse out
+PHP and similar things with a parser.
+
+Currently, there appear to be some problems with process instructions. You can
+override this class if you need finer grained handling of process instructions.
+
+C and C both return true.
+
+=head1 OVERRIDDEN METHODS
+
+=over 4
+
+=item * get_token0
+
+=item * is_pi
+
+=item * is_process_instruction
+
+=item * return_token0
+
+=back
+
+=cut
Index: /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token/Tag.pm
===================================================================
--- /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token/Tag.pm (revision 28214)
+++ /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token/Tag.pm (revision 28214)
@@ -0,0 +1,84 @@
+package HTML::TokeParser::Simple::Token::Tag;
+
+use strict;
+
+use vars qw/ $VERSION $REVISION /;
+$REVISION = '$Id$';
+$VERSION = '1.2';
+use base 'HTML::TokeParser::Simple::Token';
+
+my %INSTANCE;
+
+sub new {
+ my ($class, $object) = @_;
+ $class->_croak("This is a base class that should not be instantiated")
+ if __PACKAGE__ eq $class;
+ my $self = bless $object, $class;
+ $self->_init;
+}
+
+sub _get_attrseq { return [] }
+
+sub _get_attr { return {} }
+
+sub _set_text {
+ my $self = shift;
+ $self->[-1] = shift;
+ return $self;
+}
+
+# attribute munging methods
+# get_foo methods
+
+sub return_text {
+ carp('return_text() is deprecated. Use as_is() instead');
+ goto &as_is;
+}
+
+sub as_is {
+ return shift->_get_text;
+}
+
+sub get_tag {
+ return shift->_get_tag;
+}
+
+1;
+
+__END__
+
+=head1 NAME
+
+HTML::TokeParser::Simple::Token::Tag - Token.pm tag class.
+
+=head1 SYNOPSIS
+
+ use HTML::TokeParser::Simple;
+ my $p = HTML::TokeParser::Simple->new( $somefile );
+
+ while ( my $token = $p->get_token ) {
+ # This prints all text in an HTML doc (i.e., it strips the HTML)
+ next unless $token->is_text;
+ print $token->as_is;
+ }
+
+=head1 DESCRIPTION
+
+This is the base class for start and end tokens. It should not be
+instantiated. See C and
+C for details.
+
+=head1 OVERRIDDEN METHODS
+
+The following list of methods are provided by this class. See
+L for descriptions of these methods.
+
+=over 4
+
+=item * as_is
+
+=item * get_tag
+
+=item * return_text
+
+=back
Index: /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token/Tag/End.pm
===================================================================
--- /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token/Tag/End.pm (revision 28214)
+++ /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token/Tag/End.pm (revision 28214)
@@ -0,0 +1,135 @@
+package HTML::TokeParser::Simple::Token::Tag::End;
+
+use strict;
+
+use vars qw/ $VERSION $REVISION /;
+$REVISION = '$Id$';
+$VERSION = '1.0';
+use base 'HTML::TokeParser::Simple::Token::Tag';
+
+my %TOKEN = (
+ tag => 1,
+ text => 2
+);
+
+# in order to maintain the 'drop-in replacement' ability with HTML::TokeParser,
+# we cannot alter the array refs. Thus we must store instance data here. Ugh.
+
+my %INSTANCE;
+
+sub _init {
+ my $self = shift;
+ if ('E' eq $self->[0]) {
+ $INSTANCE{$self}{offset} = 0;
+ $INSTANCE{$self}{tag} = $self->[1];
+ }
+ else {
+ $INSTANCE{$self}{offset} = -1;
+ my $tag = $self->[0];
+ $tag =~ s/^\///;
+ $INSTANCE{$self}{tag} = $tag;
+ }
+ return $self;
+}
+
+sub _get_offset { return $INSTANCE{+shift}{offset} }
+sub _get_text { return shift->[-1] }
+
+sub _get_tag {
+ my $self = shift;
+ return $INSTANCE{$self}{tag};
+}
+
+sub DESTROY { delete $INSTANCE{+shift} }
+
+sub rewrite_tag {
+ my $self = shift;
+ # capture the final slash if the tag is self-closing
+ my ($self_closing) = $self->_get_text =~ m{(\s?/)>$};
+ $self_closing ||= '';
+
+ my $first = $self->is_end_tag ? '/' : '';
+ my $tag = sprintf '<%s%s%s>', $first, $self->get_tag, $self_closing;
+ $self->_set_text($tag);
+ return $self;
+}
+
+sub return_text {
+ require Carp;
+ Carp::carp('return_text() is deprecated. Use as_is() instead');
+ goto &as_is;
+}
+
+sub as_is {
+ return shift->_get_text;
+}
+
+sub get_tag {
+ return shift->_get_tag;
+}
+
+# is_foo methods
+
+sub is_tag {
+ my $self = shift;
+ return $self->is_end_tag( @_ );
+}
+
+sub is_end_tag {
+ my ($self, $tag) = @_;
+ return $tag ? $self->_match_tag($tag) : 1;
+}
+
+sub _match_tag {
+ my ($self, $tag) = @_;
+ if ('Regexp' eq ref $tag) {
+ return $self->_get_tag =~ $tag;
+ }
+ else {
+ $tag = lc $tag;
+ $tag =~ s/^\///;
+ return $self->_get_tag eq $tag;
+ }
+}
+
+1;
+
+__END__
+
+=head1 NAME
+
+HTML::TokeParser::Simple::Token::Tag::End - Token.pm "end tag" class.
+
+=head1 SYNOPSIS
+
+ use HTML::TokeParser::Simple;
+ my $p = HTML::TokeParser::Simple->new( $somefile );
+
+ while ( my $token = $p->get_token ) {
+ # This prints all text in an HTML doc (i.e., it strips the HTML)
+ next unless $token->is_text;
+ print $token->as_is;
+ }
+
+=head1 DESCRIPTION
+
+This class does most of the heavy lifting for C. See
+the C docs for details.
+
+=head1 OVERRIDDEN METHODS
+
+=over 4
+
+=item * as_is
+
+=item * get_tag
+
+=item * is_end_tag
+
+=item * is_tag
+
+=item * return_text
+
+=item * rewrite_tag
+
+=cut
Index: /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token/Tag/Start.pm
===================================================================
--- /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token/Tag/Start.pm (revision 28214)
+++ /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token/Tag/Start.pm (revision 28214)
@@ -0,0 +1,229 @@
+package HTML::TokeParser::Simple::Token::Tag::Start;
+
+use strict;
+
+use vars qw/ $VERSION $REVISION /;
+$REVISION = '$Id$';
+$VERSION = '1.0';
+use base 'HTML::TokeParser::Simple::Token::Tag';
+
+use HTML::Entities qw/encode_entities/;
+
+my %TOKEN = (
+ tag => 1,
+ attr => 2,
+ attrseq => 3,
+ text => 4
+);
+
+my %INSTANCE;
+
+sub _init {
+ my $self = shift;
+ if ('S' eq $self->[0]) {
+ $INSTANCE{$self}{offset} = 0;
+ $INSTANCE{$self}{tag} = $self->[1];
+ }
+ else {
+ $INSTANCE{$self}{offset} = -1;
+ my $tag = $self->[0];
+ $tag =~ s/^\///;
+ $INSTANCE{$self}{tag} = $tag;
+ }
+ return $self;
+}
+
+sub _get_offset { return $INSTANCE{+shift}{offset} }
+sub _get_text { return shift->[-1] }
+
+sub _get_tag {
+ my $self = shift;
+ return $INSTANCE{$self}{tag};
+}
+
+sub _get_attrseq {
+ my $self = shift;
+ my $index = $TOKEN{attrseq} + $self->_get_offset;
+ return $self->[$index];
+}
+
+sub _get_attr {
+ my $self = shift;
+ my $index = $TOKEN{attr} + $self->_get_offset;
+ return $self->[$index];
+}
+
+sub DESTROY { delete $INSTANCE{+shift} }
+
+sub return_attr { goto &get_attr }
+sub return_attrseq { goto &get_attrseq }
+sub return_tag { goto &get_tag }
+
+# attribute munging methods
+
+sub set_attr {
+ my ($self, $name, $value) = @_;
+ return 'HASH' eq ref $name
+ ? $self->_set_attr_from_hashref($name)
+ : $self->_set_attr_from_string($name, $value);
+}
+
+sub _set_attr_from_string {
+ my ($self, $name, $value) = @_;
+ $name = lc $name;
+ my $attr = $self->get_attr;
+ my $attrseq = $self->get_attrseq;
+ unless (exists $attr->{$name}) {
+ push @$attrseq => $name;
+ }
+ $attr->{$name} = $value;
+ $self->rewrite_tag;
+}
+
+sub _set_attr_from_hashref {
+ my ($self, $attr_hash) = @_;
+ while (my ($attr, $value) = each %$attr_hash) {
+ $self->set_attr($attr, $value);
+ }
+ return $self;
+}
+
+sub rewrite_tag {
+ my $self = shift;
+ my $attr = $self->get_attr;
+ my $attrseq = $self->get_attrseq;
+
+ # capture the final slash if the tag is self-closing
+ my ($self_closing) = $self->_get_text =~ m{(\s?/)>$};
+ $self_closing ||= '';
+
+ my $tag = '';
+ foreach ( @$attrseq ) {
+ next if $_ eq '/'; # is this a bug in HTML::TokeParser?
+ $tag .= sprintf qq{ %s="%s"} => $_, encode_entities($attr->{$_});
+ }
+ my $first = $self->is_end_tag ? '/' : '';
+ $tag = sprintf '<%s%s%s%s>', $first, $self->get_tag, $tag, $self_closing;
+ $self->_set_text($tag);
+ return $self;
+}
+
+sub delete_attr {
+ my ($self,$name) = @_;
+ $name = lc $name;
+ my $attr = $self->get_attr;
+ return unless exists $attr->{$name};
+ delete $attr->{$name};
+ my $attrseq = $self->get_attrseq;
+ @$attrseq = grep { $_ ne $name } @$attrseq;
+ $self->rewrite_tag;
+}
+
+# get_foo methods
+
+sub return_text {
+ require Carp;
+ Carp::carp('return_text() is deprecated. Use as_is() instead');
+ goto &as_is;
+}
+
+sub as_is {
+ return shift->_get_text;
+}
+
+sub get_tag {
+ return shift->_get_tag;
+}
+
+sub get_token0 {
+ return '';
+}
+
+sub get_attr {
+ my $self = shift;
+ my $attributes = $self->_get_attr;
+ return @_ ? $attributes->{lc shift} : $attributes;
+}
+
+sub get_attrseq {
+ my $self = shift;
+ $self->_get_attrseq;
+}
+
+# is_foo methods
+
+sub is_tag {
+ my $self = shift;
+ return $self->is_start_tag( @_ );
+}
+
+sub is_start_tag {
+ my ($self, $tag) = @_;
+ return $tag ? $self->_match_tag($tag) : 1;
+}
+
+sub _match_tag {
+ my ($self, $tag) = @_;
+ return 'Regexp' eq ref $tag
+ ? $self->_get_tag =~ $tag
+ : $self->_get_tag eq lc $tag;
+}
+
+1;
+
+__END__
+
+=head1 NAME
+
+HTML::TokeParser::Simple::Token::Tag::Start - Token.pm "start tag" class.
+
+=head1 SYNOPSIS
+
+ use HTML::TokeParser::Simple;
+ my $p = HTML::TokeParser::Simple->new( $somefile );
+
+ while ( my $token = $p->get_token ) {
+ # This prints all text in an HTML doc (i.e., it strips the HTML)
+ next unless $token->is_text;
+ print $token->as_is;
+ }
+
+=head1 DESCRIPTION
+
+This class does most of the heavy lifting for C. See
+the C docs for details.
+
+=head1 OVERRIDDEN METHODS
+
+=over 4
+
+=item * as_is
+
+=item * delete_attr
+
+=item * get_attr
+
+=item * get_attrseq
+
+=item * get_tag
+
+=item * get_token0
+
+=item * is_start_tag
+
+=item * is_tag
+
+=item * return_attr
+
+=item * return_attrseq
+
+=item * return_tag
+
+=item * return_text
+
+=item * rewrite_tag
+
+=item * set_attr
+
+=cut
+
Index: /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token/Text.pm
===================================================================
--- /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token/Text.pm (revision 28214)
+++ /main/trunk/greenstone2/perllib/cpan/HTML/TokeParser/Simple/Token/Text.pm (revision 28214)
@@ -0,0 +1,50 @@
+package HTML::TokeParser::Simple::Token::Text;
+
+use strict;
+
+use vars qw/ $VERSION $REVISION /;
+$REVISION = '$Id$';
+$VERSION = '1.0';
+use base 'HTML::TokeParser::Simple::Token';
+
+sub as_is {
+ return shift->[1];
+}
+
+sub is_text { 1 }
+
+1;
+
+__END__
+
+=head1 NAME
+
+HTML::TokeParser::Simple::Token::Text - Token.pm text class.
+
+=head1 SYNOPSIS
+
+ use HTML::TokeParser::Simple;
+ my $p = HTML::TokeParser::Simple->new( $somefile );
+
+ while ( my $token = $p->get_token ) {
+ # This prints all text in an HTML doc (i.e., it strips the HTML)
+ next unless $token->is_text;
+ print $token->as_is;
+ }
+
+=head1 DESCRIPTION
+
+This class represents "text" tokens. See the C
+documentation for details.
+
+=head1 OVERRIDDEN METHODS
+
+=over 4
+
+=item * as_is
+
+=item * is_text
+
+=back
+
+=cut
Index: in/trunk/greenstone2/perllib/cpan/perl-5.8/HTML/TokeParser.pm
===================================================================
--- /main/trunk/greenstone2/perllib/cpan/perl-5.8/HTML/TokeParser.pm (revision 28213)
+++ (revision )
@@ -1,371 +1,0 @@
-package HTML::TokeParser;
-
-# $Id$
-
-require HTML::PullParser;
-@ISA=qw(HTML::PullParser);
-#$VERSION = sprintf("%d.%02d", q$Revision$ =~ /(\d+)\.(\d+)/);
-
-use strict;
-use Carp ();
-use HTML::Entities qw(decode_entities);
-use HTML::Tagset ();
-
-my %ARGS =
-(
- start => "'S',tagname,attr,attrseq,text",
- end => "'E',tagname,text",
- text => "'T',text,is_cdata",
- process => "'PI',token0,text",
- comment => "'C',text",
- declaration => "'D',text",
-
- # options that default on
- unbroken_text => 1,
-);
-
-
-sub new
-{
- my $class = shift;
- my %cnf;
- if (@_ == 1) {
- my $type = (ref($_[0]) eq "SCALAR") ? "doc" : "file";
- %cnf = ($type => $_[0]);
- }
- else {
- %cnf = @_;
- }
-
- my $textify = delete $cnf{textify} || {img => "alt", applet => "alt"};
-
- my $self = $class->SUPER::new(%cnf, %ARGS) || return undef;
-
- $self->{textify} = $textify;
- $self;
-}
-
-
-sub get_tag
-{
- my $self = shift;
- my $token;
- while (1) {
- $token = $self->get_token || return undef;
- my $type = shift @$token;
- next unless $type eq "S" || $type eq "E";
- substr($token->[0], 0, 0) = "/" if $type eq "E";
- return $token unless @_;
- for (@_) {
- return $token if $token->[0] eq $_;
- }
- }
-}
-
-
-sub _textify {
- my($self, $token) = @_;
- my $tag = $token->[1];
- return undef unless exists $self->{textify}{$tag};
-
- my $alt = $self->{textify}{$tag};
- my $text;
- if (ref($alt)) {
- $text = &$alt(@$token);
- } else {
- $text = $token->[2]{$alt || "alt"};
- $text = "[\U$tag]" unless defined $text;
- }
- return $text;
-}
-
-
-sub get_text
-{
- my $self = shift;
- my @text;
- while (my $token = $self->get_token) {
- my $type = $token->[0];
- if ($type eq "T") {
- my $text = $token->[1];
- decode_entities($text) unless $token->[2];
- push(@text, $text);
- } elsif ($type =~ /^[SE]$/) {
- my $tag = $token->[1];
- if ($type eq "S") {
- if (defined(my $text = _textify($self, $token))) {
- push(@text, $text);
- next;
- }
- } else {
- $tag = "/$tag";
- }
- if (!@_ || grep $_ eq $tag, @_) {
- $self->unget_token($token);
- last;
- }
- push(@text, " ")
- if $tag eq "br" || !$HTML::Tagset::isPhraseMarkup{$token->[1]};
- }
- }
- join("", @text);
-}
-
-
-sub get_trimmed_text
-{
- my $self = shift;
- my $text = $self->get_text(@_);
- $text =~ s/^\s+//; $text =~ s/\s+$//; $text =~ s/\s+/ /g;
- $text;
-}
-
-sub get_phrase {
- my $self = shift;
- my @text;
- while (my $token = $self->get_token) {
- my $type = $token->[0];
- if ($type eq "T") {
- my $text = $token->[1];
- decode_entities($text) unless $token->[2];
- push(@text, $text);
- } elsif ($type =~ /^[SE]$/) {
- my $tag = $token->[1];
- if ($type eq "S") {
- if (defined(my $text = _textify($self, $token))) {
- push(@text, $text);
- next;
- }
- }
- if (!$HTML::Tagset::isPhraseMarkup{$tag}) {
- $self->unget_token($token);
- last;
- }
- push(@text, " ") if $tag eq "br";
- }
- }
- my $text = join("", @text);
- $text =~ s/^\s+//; $text =~ s/\s+$//; $text =~ s/\s+/ /g;
- $text;
-}
-
-1;
-
-
-__END__
-
-=head1 NAME
-
-HTML::TokeParser - Alternative HTML::Parser interface
-
-=head1 SYNOPSIS
-
- require HTML::TokeParser;
- $p = HTML::TokeParser->new("index.html") ||
- die "Can't open: $!";
- $p->empty_element_tags(1); # configure its behaviour
-
- while (my $token = $p->get_token) {
- #...
- }
-
-=head1 DESCRIPTION
-
-The C is an alternative interface to the
-C class. It is an C subclass with a
-predeclared set of token types. If you wish the tokens to be reported
-differently you probably want to use the C directly.
-
-The following methods are available:
-
-=over 4
-
-=item $p = HTML::TokeParser->new( $filename, %opt );
-
-=item $p = HTML::TokeParser->new( $filehandle, %opt );
-
-=item $p = HTML::TokeParser->new( \$document, %opt );
-
-The object constructor argument is either a file name, a file handle
-object, or the complete document to be parsed. Extra options can be
-provided as key/value pairs and are processed as documented by the base
-classes.
-
-If the argument is a plain scalar, then it is taken as the name of a
-file to be opened and parsed. If the file can't be opened for
-reading, then the constructor will return C and $! will tell
-you why it failed.
-
-If the argument is a reference to a plain scalar, then this scalar is
-taken to be the literal document to parse. The value of this
-scalar should not be changed before all tokens have been extracted.
-
-Otherwise the argument is taken to be some object that the
-C can read() from when it needs more data. Typically
-it will be a filehandle of some kind. The stream will be read() until
-EOF, but not closed.
-
-A newly constructed C differ from its base classes
-by having the C attribute enabled by default. See
-L for a description of this and other attributes that
-influence how the document is parsed. It is often a good idea to enable
-C behaviour.
-
-Note that the parsing result will likely not be valid if raw undecoded
-UTF-8 is used as a source. When parsing UTF-8 encoded files turn
-on UTF-8 decoding:
-
- open(my $fh, "<:utf8", "index.html") || die "Can't open 'index.html': $!";
- my $p = HTML::TokeParser->new( $fh );
- # ...
-
-If a $filename is passed to the constructor the file will be opened in
-raw mode and the parsing result will only be valid if its content is
-Latin-1 or pure ASCII.
-
-If parsing from an UTF-8 encoded string buffer decode it first:
-
- utf8::decode($document);
- my $p = HTML::TokeParser->new( \$document );
- # ...
-
-=item $p->get_token
-
-This method will return the next I found in the HTML document,
-or C at the end of the document. The token is returned as an
-array reference. The first element of the array will be a string
-denoting the type of this token: "S" for start tag, "E" for end tag,
-"T" for text, "C" for comment, "D" for declaration, and "PI" for
-process instructions. The rest of the token array depend on the type
-like this:
-
- ["S", $tag, $attr, $attrseq, $text]
- ["E", $tag, $text]
- ["T", $text, $is_data]
- ["C", $text]
- ["D", $text]
- ["PI", $token0, $text]
-
-where $attr is a hash reference, $attrseq is an array reference and
-the rest are plain scalars. The L explains the
-details.
-
-=item $p->unget_token( @tokens )
-
-If you find you have read too many tokens you can push them back,
-so that they are returned the next time $p->get_token is called.
-
-=item $p->get_tag
-
-=item $p->get_tag( @tags )
-
-This method returns the next start or end tag (skipping any other
-tokens), or C if there are no more tags in the document. If
-one or more arguments are given, then we skip tokens until one of the
-specified tag types is found. For example:
-
- $p->get_tag("font", "/font");
-
-will find the next start or end tag for a font-element.
-
-The tag information is returned as an array reference in the same form
-as for $p->get_token above, but the type code (first element) is
-missing. A start tag will be returned like this:
-
- [$tag, $attr, $attrseq, $text]
-
-The tagname of end tags are prefixed with "/", i.e. end tag is
-returned like this:
-
- ["/$tag", $text]
-
-=item $p->get_text
-
-=item $p->get_text( @endtags )
-
-This method returns all text found at the current position. It will
-return a zero length string if the next token is not text. Any
-entities will be converted to their corresponding character.
-
-If one or more arguments are given, then we return all text occurring
-before the first of the specified tags found. For example:
-
- $p->get_text("p", "br");
-
-will return the text up to either a paragraph of linebreak element.
-
-The text might span tags that should be I. This is
-controlled by the $p->{textify} attribute, which is a hash that
-defines how certain tags can be treated as text. If the name of a
-start tag matches a key in this hash then this tag is converted to
-text. The hash value is used to specify which tag attribute to obtain
-the text from. If this tag attribute is missing, then the upper case
-name of the tag enclosed in brackets is returned, e.g. "[IMG]". The
-hash value can also be a subroutine reference. In this case the
-routine is called with the start tag token content as its argument and
-the return value is treated as the text.
-
-The default $p->{textify} value is:
-
- {img => "alt", applet => "alt"}
-
-This means that and