Changeset 32193 for main

Timestamp:

2018-06-05T21:11:04+12:00

Author:

ak19

Message:

All the *essential* changes related to the PDFBox modifications Kathy asked for. The PDFBox app was previously used either to generate images for every PDF page or to extract text from the PDF. Kathy ideally wanted to produce paged images with extracted text, where available, so that the pages would be searchable: images AND extracted text. Her idea was to modify the pdfbox app code to do it: a new class, based on the existing one that generated the images for each page, that would (based on Kathy's answers to my questions) additionally extract the text of each page, so that txt search results match the correct img page presented.

Might as well upgrade the pdfbox app version our GS code used. After testing that the latest version (2.09) did not have any of the issues for which we previously settled on v 1.8.2 (lower than the then most up-to-date version), the necessary code changes were made. All of these are documented in the newly included GS_PDFBox_README.txt. The new java file is called GS_PDFToImagesAndText.java and is located in the new java/src subfolder. It needs to be put into the pdfbox app 2.09 *src* code to be built, and the generated class file should then be copied into java/lib/java/pdfbox-app.jar, all as explained in GS_PDFBox_README.txt.

Other files modified for the changes requested by Kathy are PDFBoxConverter.pm, to refer to our new class and its new java package location (packages have changed in 2.09), and util.pm's create_itemfile() function, which may now additionally deal with txt files matching each img file generated. (Not committing the minor adjustment to ReadTextFile.pm that prevents a warning, as my fix seems hacky, but the fix is described in the Readme.) The pdfbox ext zip/tarballs were also modified to contain the changed PDFBoxConverter.pm and the pdfbox-app jar file for 2.09 with our custom new class file. Nothing has been renamed to gs-pdfbox-app yet, as there will be flow-on effects elsewhere as described in the Readme; all of that can be done in a separate commit.

File:

1 edited

main/trunk/greenstone2/perllib/util.pm (modified) (2 diffs)


main/trunk/greenstone2/perllib/util.pm

-              r32096
+              r32193
 # Used by pdfpstoimg.pl and PDFBoxConverter to create a .item file from
-# a directory containing sequentially numbered images.
+# a directory containing sequentially numbered images (and optional matching sequentially numbered .txt files).
 sub create_itemfile
 {
 …
     print $item_fh "<PagedDocument>\n";
+    # In the past, sub create_itemfile() never output txtfile names into the item file (they were left as empty strings),
+    # only image file names. Now that PDFBox is being customised for GS with the new GS_PDFToImagesAndText.java class to
+    # create images of each PDF page and extract text for that page if extractable, we can have matching txt files for
+    # each img file. So now we can output txt file names if we're working with txt files.
+    # We just test if a text file exists in the same dir that matches the name of the first image file
+    # if a matching txt file does not exist, don't output txtfile names into the item file
+    my ($tailname, $dirname, $suffix) = &File::Basename::fileparse($firstfile, "\\.[^\\.]+\$"); # relative filenames so no dirname
+    my $txtfilename = &FileUtils::filenameConcatenate($output_dir, $tailname . ".txt");
+    my $hasTxtFile = &FileUtils::fileExists($txtfilename);
     foreach my $file (@dir_files){
-    if ($file !~ /\.item/i){
+    if ($file !~ /\.item/i && $file !~ /\.txt/i){
         $page_num = page_number($file);
         $page_num++ if $starts_at_0; # image numbers start at 0, so add 1
-        print $item_fh "   <Page pagenum=\"$page_num\" imgfile=\"$file\" txtfile=\"\"/>\n";
+        if($hasTxtFile) {
+        print $item_fh "   <Page pagenum=\"$page_num\" imgfile=\"$file\" txtfile=\"$page_num.txt\"/>\n";
+        } else {
+        print $item_fh "   <Page pagenum=\"$page_num\" imgfile=\"$file\" txtfile=\"\"/>\n";
+        }
     }
     }
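
The effect of the second diff on the generated item file can be illustrated with a small standalone sketch. It is not the code in util.pm: page_lines() is a hypothetical helper, the page number is assumed to be the trailing run of digits in the file name (standing in for util.pm's page_number(), whose implementation is not shown here), and the .item/.txt patterns are anchored to the end of the name, which the committed code does not do.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper (not part of util.pm): returns the <Page .../> lines
# for a list of file names. $has_txt says whether matching .txt files exist,
# $starts_at_0 says whether image numbering begins at 0.
sub page_lines {
    my ($has_txt, $starts_at_0, @files) = @_;
    my @lines;
    foreach my $file (@files) {
        # Skip the .item file itself and the per-page text files.
        next if $file =~ /\.item$/i || $file =~ /\.txt$/i;

        # Assumption: the page number is the trailing digits of the base name.
        my ($base) = fileparse($file, qr/\.[^.]+$/);
        my ($page_num) = $base =~ /(\d+)$/;
        next unless defined $page_num;
        $page_num += 0;                      # "01" becomes 1
        $page_num++ if $starts_at_0;         # image numbers start at 0, so add 1

        # As in the diff, the txt file name is built from the page number.
        my $txt = $has_txt ? "$page_num.txt" : "";
        push @lines, qq{   <Page pagenum="$page_num" imgfile="$file" txtfile="$txt"/>};
    }
    return @lines;
}

my @files = ("doc.item", "1.jpg", "1.txt", "2.jpg", "2.txt");
print "$_\n" for page_lines(1, 0, @files);
```

Because the txtfile attribute is derived from the (possibly incremented) page number rather than from the image's own base name, the two only line up when the text files are numbered the same way as the resulting page numbers.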
