Changeset 16753

Timestamp:
13.08.2008 13:10:23
Author:
ak19
Message:

get_language_encoding for HTML files now strips out comments before trying to match on HTML tags

Files:
1 modified

  • gsdl/trunk/perllib/plugins/ReadTextFile.pm

    --- gsdl/trunk/perllib/plugins/ReadTextFile.pm (r16724)
    +++ gsdl/trunk/perllib/plugins/ReadTextFile.pm (r16753)
    @@ -340,4 +340,8 @@
         (exists $self->{'converted_to'} && $self->{'converted_to'} eq 'HTML')){
     
    +    # remove comments, including multiline ones, so that we don't match on
    +    # inactive tags (those that are nested inside comments)
    +    $text =~ s/<!--.*?-->//sg;
    +
         # remove <title>stuff</title> -- as titles tend often to be in English
         # for foreign language documents
    @@ -349,5 +353,5 @@
         }
         # check the meta http-equiv charset tag unless it is commented out
    -    elsif (($text !~ /<!--[^<>]?<meta http-equiv/i) && ($text =~ /<meta http-equiv.*content-type.*charset=(.+?)\"/i)) {
    +    elsif ($text =~ m/<meta http-equiv.*content-type.*charset=(.+?)\"/i) {
             $best_encoding = $1;
     #       print STDERR "**** meta tag found, encoding is: $best_encoding\n";
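
The effect of the change can be tried outside the plugin with a small standalone script. This is only a sketch: the helper name charset_from_meta and the sample HTML are made up for illustration and are not part of ReadTextFile.pm; only the two regular expressions are taken from the changeset. It assumes Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;

# Hypothetical helper (not in ReadTextFile.pm): returns the charset named in
# the first active <meta http-equiv> tag, or undef if none is found.
sub charset_from_meta {
    my ($text) = @_;

    # Strip comments first, as r16753 does. The /s modifier lets '.' match
    # newlines, so multiline comments are removed; /g removes every comment.
    $text =~ s/<!--.*?-->//sg;

    # Same meta tag pattern as in the changeset.
    if ($text =~ m/<meta http-equiv.*content-type.*charset=(.+?)\"/i) {
        return $1;
    }
    return;
}

# Sample document: one commented-out meta tag, then an active one.
my $html = <<'HTML';
<html><head>
<!--
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
-->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head></html>
HTML

my $charset = charset_from_meta($html);
print(($charset // 'none'), "\n");
```

For this sample the script prints utf-8. Without the substitution, the leftmost match would be the commented-out tag, and the pattern would capture iso-8859-1 instead.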