06/05/18 21:11:04 (3 years ago)

All the *essential* changes for the PDFBox modifications Kathy asked for. The PDFBox app was previously used either to generate an image for every PDF page or to extract txt from the PDF. Kathy ideally wanted paged images together with the extracted text, where available, so that the paged document would be searchable: images AND extracted text. Her idea was to modify the pdfbox app code to do it: a new class, based on the existing one that generates an image for each page, modified (based on Kathy's answers to my questions) to additionally extract the text of each page, so that txt search results match the img page presented.

Might as well upgrade the pdfbox app version our GS code uses at the same time. After testing that the latest version (2.09) did not have any of the issues for which we had previously settled on v 1.8.2 (lower than the then most up-to-date version), I made the necessary code changes. All of these are documented in the newly included GS_PDFBox_README.txt.

The new java file is called GS_PDFToImagesAndText.java and is located in the new java/src subfolder. It needs to be put into the pdfbox app 2.09 *src* code to be built, and the generated class file should then be copied into java/lib/java/pdfbox-app.jar, all as explained in GS_PDFBox_README.txt.

The other files modified for the changes Kathy requested are PDFBoxConverter.pm, which now refers to our new class and its new java package location (packages have changed in 2.09), and util.pm's create_itemfile() function, which may now additionally deal with txt files matching each img file generated. (Not committing the minor adjustment to ReadTextFile.pm that prevents a warning, as my fix seems hacky, but the fix is described in the Readme.) The pdfbox ext zip/tarballs have also been modified to contain the changed PDFBoxConverter.pm and the pdfbox-app jar file for 2.09 with our custom new class file.

Have not yet renamed anything to gs-pdfbox-app, as there will be flow-on effects elsewhere as described in the Readme; can do all this in a separate commit.

1 edited


  • main/trunk/greenstone2/perllib/util.pm

    r32096  r32193
    1717    1717    # Used by pdfpstoimg.pl and PDFBoxConverter to create a .item file from
    1718            # a directory containing sequentially numbered images.
            1718    # a directory containing sequentially numbered images (and optional matching sequentially numbered .txt files).
    1719    1719    sub create_itemfile

    1752    1752        print $item_fh "<PagedDocument>\n";
            1754        # In the past, sub create_itemfile() never output txtfile names into the item file (they were left as empty strings),
            1755        # only image file names. Now that PDFBox is being customised for GS with the new GS_PDFToImagesAndText.java class to
            1756        # create images of each PDF page and extract text for that page if extractable, we can have matching txt files for
            1757        # each img file. So now we can output txt file names if we're working with txt files.
            1758        # We just test if a text file exists in the same dir that matches the name of the first image file
            1759        # if a matching txt file does not exist, don't output txtfile names into the item file
            1761        my ($tailname, $dirname, $suffix) = &File::Basename::fileparse($firstfile, "\\.[^\\.]+\$"); # relative filenames so no dirname
            1762        my $txtfilename = &FileUtils::filenameConcatenate($output_dir, $tailname . ".txt");
            1763        my $hasTxtFile = &FileUtils::fileExists($txtfilename);
    1754    1765        foreach my $file (@dir_files){
    1755                    if ($file !~ /\.item/i){
            1766            if ($file !~ /\.item/i && $file !~ /\.txt/i){
    1756    1767                $page_num = page_number($file);
    1757    1768                $page_num++ if $starts_at_0; # image numbers start at 0, so add 1
    1758                        print $item_fh "   <Page pagenum=\"$page_num\" imgfile=\"$file\" txtfile=\"\"/>\n";
            1769                if($hasTxtFile) {
            1770                    print $item_fh "   <Page pagenum=\"$page_num\" imgfile=\"$file\" txtfile=\"$page_num.txt\"/>\n";
            1771                } else {
            1772                    print $item_fh "   <Page pagenum=\"$page_num\" imgfile=\"$file\" txtfile=\"\"/>\n";
            1773                }
    1759    1774            }
    1760    1775        }
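
For readers who want to try the new item-file logic outside Greenstone, here is a minimal standalone sketch of the changed loop. It is not the code in util.pm. The names item_page_lines and page_number_sketch are hypothetical, and page_number_sketch only guesses at what util.pm's page_number() does, since that function is not part of this changeset. The core File::Spec->catfile and the -e file test stand in for FileUtils::filenameConcatenate and FileUtils::fileExists, and the first image in the list is assumed to play the role of $firstfile.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Spec;

# Hypothetical stand-in for util.pm's page_number(): treat the last run of
# digits before the file extension as the page number.
sub page_number_sketch {
    my ($file) = @_;
    return ($file =~ /(\d+)\.[^.]+$/) ? $1 + 0 : 0;
}

# Return the <Page .../> lines for an item file.
# $dir_files is an array ref of relative file names, already in page order.
sub item_page_lines {
    my ($output_dir, $dir_files, $starts_at_0) = @_;

    # Only image files become pages; skip .item and .txt entries,
    # using the same patterns as the changeset.
    my @images = grep { $_ !~ /\.item/i && $_ !~ /\.txt/i } @$dir_files;
    return () unless @images;

    # Probe once: is there a .txt file next to the first image?
    # (Assumption: the first image plays the role of $firstfile.)
    my ($tailname) = fileparse($images[0], qr/\.[^.]+$/);
    my $has_txt = -e File::Spec->catfile($output_dir, "$tailname.txt");

    my @lines;
    foreach my $file (@images) {
        my $page_num = page_number_sketch($file);
        $page_num++ if $starts_at_0;    # image numbers start at 0, so add 1
        # As in the changeset, the txt name is built from the page number.
        my $txtfile = $has_txt ? "$page_num.txt" : "";
        push @lines, qq{   <Page pagenum="$page_num" imgfile="$file" txtfile="$txtfile"/>\n};
    }
    return @lines;
}

# Usage with made-up file names. This directory does not exist, so no
# matching .txt file is found and every txtfile attribute stays empty.
print item_page_lines("/nonexistent/output_dir", ["1.png", "2.png", "doc.item"], 0);
```

As in the changeset, the txtfile value is built from the page number rather than from the image's base name, so the sketch also relies on the txt files being named to match the page numbers.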