Changeset 32089 for gs2-extensions


Timestamp:
2017-12-08T19:16:12+13:00
Author:
ak19
Message:
  1. Attempted fix by Kathy and me for Diego's problem with PDFBox's handling of a PDF. With the output set to convert_to_html the collection built fine, but convert_to_text produced invalid XML in doc.xml and the build failed. Diego reasoned correctly that if building succeeds in one case it ought to succeed in both. Kathy found the correct fix for escaping the ampersand character (it wasn't the &amp; to &amp;amp; replacement that I'd attempted, nor did HTML::Entities' encode work).
  2. The fix needs to read and write files, so this introduces readUTF8File() and writeUTF8File() into FileUtils.pm for reusability. Still need to contact John Thompson to ask whether and how these functions must be modified to support parallel processing, for which FileUtils was written.
File:
1 edited

  • gs2-extensions/pdf-box/trunk/java/perllib/plugins/PDFBoxConverter.pm

    r27510  r32089
        32      32  no strict 'subs'; # allow barewords (eg STDERR) as function arguments
        33      33
                34  #use HTML::Entities; # for encoding characters into their HTML entities when PDFBox converts to text
                35
        34      36  use gsprintf 'gsprintf';
                37  use FileUtils;
        35      38
        36      39  # these two variables mustn't be initialised here or they will get stuck

       257     260      #print STDERR "**** item file: $target_file_path\n";
       258     261      }
       259
               262      elsif ($self->{'converted_to'} eq "text") {
               263          # ensure html entities are doubly escaped for pdfbox to text conversion: &amp; -> &amp;amp;
               264          # conversion to html does it automatically, but conversion to text doesn't
               265          # and this results in illegal characters in doc.xml
               266
               267          my $fulltext = &FileUtils::readUTF8File($target_file_path);
               268          #$fulltext = &HTML::Entities::encode($fulltext); # doesn't seem to help
               269          $fulltext =~ s@&@&amp;@sg; # Kathy's fix to ensure doc contents don't break XML
               270          &FileUtils::writeUTF8File($target_file_path, \$fulltext);
               271      }
               272
       260     273      if ($had_error) {
       261     274      return (0, $result,$target_file_path);
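
For readers without FileUtils.pm to hand, the read, escape and write-back step added above can be sketched as a standalone script. This is only an illustration: read_utf8_file(), write_utf8_file() and escape_ampersands() are hypothetical stand-ins, and the real readUTF8File() and writeUTF8File() are not shown in this changeset and may well differ (for instance in error handling or parallel-processing support). The sketch assumes the substitution replaces every ampersand with &amp;.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for FileUtils::readUTF8File (assumption: the real
# function is not shown in this changeset). Returns the whole file as a
# decoded string, or undef if the file cannot be opened.
sub read_utf8_file {
    my ($path) = @_;
    open(my $fh, '<:encoding(UTF-8)', $path) or return undef;
    local $/ = undef;    # slurp mode: read the whole file in one go
    my $contents = <$fh>;
    close($fh);
    return $contents;
}

# Hypothetical stand-in for FileUtils::writeUTF8File. Takes a reference to
# the text, mirroring the \$fulltext argument used in the changeset.
sub write_utf8_file {
    my ($path, $contents_ref) = @_;
    open(my $fh, '>:encoding(UTF-8)', $path) or return 0;
    print $fh $$contents_ref;
    close($fh) or return 0;
    return 1;
}

# Escape every ampersand, including those that already begin an entity,
# so that the text no longer breaks the XML it is embedded in.
sub escape_ampersands {
    my ($text) = @_;
    $text =~ s@&@&amp;@sg;
    return $text;
}

# Usage: round-trip a small sample through a temporary file.
binmode(STDOUT, ':encoding(UTF-8)');
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
close($tmp_fh);

my $sample = "Fish & chips &amp; caf\x{e9}\n";
write_utf8_file($tmp_path, \$sample) or die "cannot write $tmp_path";

my $fulltext = read_utf8_file($tmp_path);
die "cannot read $tmp_path" unless defined $fulltext;

$fulltext = escape_ampersands($fulltext);
write_utf8_file($tmp_path, \$fulltext) or die "cannot write $tmp_path";

print read_utf8_file($tmp_path);
```

In this sketch a bare & becomes &amp; and an existing &amp; becomes &amp;amp;, which is the double escaping the in-code comment describes. Unlike the changeset, the sketch checks that the read succeeded before substituting.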