Changeset 30593

Timestamp: 27.06.2016 18:11:27
Author: ak19
Message:

Dr Bainbridge found another point in the code where HTMLPlugin encounters UTF-16 surrogate pairs, which lead to malformed UTF-8 character errors. This code path is reached when PDFPlugin has pdfbox_conversion set. PDFBox had produced HTML containing numeric entities for code points that are not valid in UTF-8, and processing failed on Diego's test PDF until Dr Bainbridge's bugfix.
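To make the problem in the message concrete, here is a minimal Perl sketch of what a surrogate half is and how a pair maps to one supplementary code point. The helpers `is_surrogate` and `combine_surrogates` are hypothetical names used only for illustration and are not part of ghtml.pm. The sample values assume one way such entities can arise: each UTF-16 code unit of U+1F600 written as its own numeric entity.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of ghtml.pm): true if $code is a UTF-16
# surrogate half, i.e. lies in the reserved range U+D800..U+DFFF.
sub is_surrogate {
    my ($code) = @_;
    return ($code >= 0xD800 && $code <= 0xDFFF) ? 1 : 0;
}

# Hypothetical helper: combine a high (lead) and a low (trail) surrogate
# into the single supplementary code point they jointly encode.
sub combine_surrogates {
    my ($high, $low) = @_;
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

# Assumed sample input: "&#55357;&#56832;", i.e. 0xD83D followed by 0xDE00.
my ($high, $low) = (55357, 56832);
printf "U+%04X surrogate? %d\n", $high, is_surrogate($high);
printf "U+%04X surrogate? %d\n", $low,  is_surrogate($low);
printf "combined: U+%X\n", combine_surrogates($high, $low);
```

Neither half is a valid character on its own, which is why converting a single half to UTF-8 yields a malformed sequence. The changeset does not combine pairs; it replaces each half with '?'.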

Files: 1 modified

  • main/trunk/greenstone2/perllib/ghtml.pm

    --- main/trunk/greenstone2/perllib/ghtml.pm (r23371)
    +++ main/trunk/greenstone2/perllib/ghtml.pm (r30593)
    @@ -219,5 +219,13 @@
         }
     
    +
         if (defined $code) {
    +
    +        # malformed UTF-8 character used in UTF-16
    +        if($code >= 0xD800 && $code <= 0xDFFF) {
    +            print STDERR "Warning: encountered the HTML entity \&#$code; which represents part of a UTF-16 surrogate pair, which is not supported in ghtml::getcharequiv(). Replacing with '?'.\n";
    +            $code = ord("?");
    +        }
    +
             # non-standard Microsoft breakage, as usual
             if ($code < 0x9f) { # code page 1252 uses reserved bytes
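The effect of the added guard on a piece of HTML can be sketched as follows. This is a simplified stand-in, not the real ghtml::getcharequiv(): it assumes decoding reduces to calling chr() on the decimal code, and the names `code_to_char` and `decode_numeric_entities` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: map a numeric code to a character, applying the
# same range test as the changeset so surrogate halves become '?'.
sub code_to_char {
    my ($code) = @_;
    $code = ord("?") if $code >= 0xD800 && $code <= 0xDFFF;
    return chr($code);
}

# Hypothetical helper: replace every decimal entity (&#NNN;) in $html
# with the character chosen by code_to_char().
sub decode_numeric_entities {
    my ($html) = @_;
    $html =~ s/&#(\d+);/code_to_char($1)/ge;
    return $html;
}

# &#66; is a normal character; the other two are surrogate halves.
print decode_numeric_entities('A&#66;C &#55357;&#56832;'), "\n";
```

With this input, the ordinary entity decodes normally and each surrogate half is replaced by its own '?', matching the per-entity replacement described in the warning text.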