Changeset 16753


Timestamp: 08/13/08 13:10:23
Author: ak19
Message:

get_language_encoding for HTMLFiles strips out the comments before trying to match on html tags

File:
1 edited

  • gsdl/trunk/perllib/plugins/ReadTextFile.pm

    --- r16724
    +++ r16753
    @@ -340,4 +340,8 @@
         (exists $self->{'converted_to'} && $self->{'converted_to'} eq 'HTML')){
     
    +    # remove comments, including multiline ones, so that we don't match on
    +    # inactive tags (those that are nested inside comments)
    +    $text =~ s/<!--.*?-->//sg;
    +
         # remove <title>stuff</title> -- as titles tend often to be in English
         # for foreign language documents
    @@ -349,5 +353,5 @@
         }
         # check the meta http-equiv charset tag unless it is commented out
    -    elsif (($text !~ /<!--[^<>]?<meta http-equiv/i) && ($text =~ /<meta http-equiv.*content-type.*charset=(.+?)\"/i)) {
    +    elsif ($text =~ m/<meta http-equiv.*content-type.*charset=(.+?)\"/i) {
             $best_encoding = $1;
     #       print STDERR "**** meta tag found, encoding is: $best_encoding\n";
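
The effect of the change can be tried outside the plugin with a small standalone script. This is only a sketch: `charset_from_meta` and the sample HTML are hypothetical and not part of ReadTextFile.pm; the two regular expressions are copied from the r16753 side of the diff above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of ReadTextFile.pm: returns the charset named
# in the first active <meta http-equiv ...> tag, or undef if there is none.
sub charset_from_meta {
    my ($text) = @_;

    # Same substitution as r16753: /s lets '.' match newlines so multiline
    # comments are removed, /g removes every comment, and the non-greedy
    # '.*?' stops each match at the first closing '-->'.
    $text =~ s/<!--.*?-->//sg;

    # Same pattern as the simplified elsif branch in r16753.
    if ($text =~ m/<meta http-equiv.*content-type.*charset=(.+?)\"/i) {
        return $1;
    }
    return;
}

# Made-up sample: a commented-out meta tag spanning several lines,
# followed by an active one.
my $html = <<'END_HTML';
<html><head>
<!--
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
-->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body></body></html>
END_HTML

my $charset = charset_from_meta($html);
print defined $charset ? "$charset\n" : "no charset found\n";   # prints "utf-8"
```

On this sample the commented-out `iso-8859-1` tag is removed before matching, so the active `utf-8` tag is the one that gets captured. The old guard from r16724 would instead have matched the `<!--` immediately before the first `<meta http-equiv` and skipped the charset check altogether.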