PDFBoxConverter.pm

Timestamp:

2018-07-17T22:13:17+12:00

Author:

ak19

Message:

This was meant to be part of commit 32278, where I forgot to commit the updated PDFBoxConverter.pm. The commit message for 32278 was: Our custom pdf-box class PDFBoxToImagesAndText.java now takes two additional flags, textOnly and imagesOnly, which can be used to support paged_text and the original pagedimg_ output formats, besides pagedimgtxt_.

File:

1 edited

gs2-extensions/pdf-box/trunk/java/perllib/plugins/PDFBoxConverter.pm (modified) (8 diffs)

gs2-extensions/pdf-box/trunk/java/perllib/plugins/PDFBoxConverter.pm

-              r32273
+              r32282
     $self->{'pdfbox_txt_launch_cmd'} = "$java -cp \"$pbajar\" org.apache.pdfbox.tools.ExtractText";
     $self->{'pdfbox_html_launch_cmd'} = "$java -cp \"$pbajar\" -Dline.separator=\"<br />\" org.apache.pdfbox.tools.ExtractText";
+    #$self->{'pdfbox_img_launch_cmd'} = "java -cp \"$pbajar\" org.apache.pdfbox.tools.PDFToImage"; # pdfbox 2.09 cmd for converting each PDF page to an image (gif, jpg, png)
+    # Now: use this cmd to launch our new custom PDFBox class (PDFBoxToImagesAndText.java) to convert each PDF page into an image (gif, jpg, png)
+    # AND its extracted text. An item file is still generated, but this time referring to txtfiles too, not just the images. Result: searchable paged output.
+#   $self->{'pdfbox_img_launch_cmd'} = "java -cp \"$pbajar\" org.apache.pdfbox.tools.PDFToImage"; # pdfbox 2.09 cmd for converting each PDF page to an image (gif, jpg, png)
+    # We use this next cmd to launch our new custom PDFBox class (PDFBoxToImagesAndText.java) to convert each PDF page into an image (gif, jpg, png)
+    # AND its extracted text. Or just each page's extracted text. An item file is still generated,
+    # but this time referring to txtfiles too, not just the images. Result: searchable paged output.
     # Our new custom class PDFBoxToImagesAndText.java lives in the new build folder, so add that to the classpath for the launch cmd
     my $pdfbox_build = &FileUtils::filenameConcatenate($gextpb_home,"build");
     my $classpath = &util::pathname_cat($pbajar,$pdfbox_build);
     $self->{'pdfbox_img_launch_cmd'} = "java -cp \"$classpath\" org.greenstone.pdfbox.PDFBoxToImagesAndText";
+    $self->{'pdfbox_imgtxt_launch_cmd'} = "java -cp \"$classpath\" org.greenstone.pdfbox.PDFBoxToImagesAndText";
+    }
     else {
 …
     my $img_output_mode = 0;
+    my $convert_to = $self->{'convert_to'};
+    my $paged_txt_output_mode = ($convert_to =~ /(pagedimgtxt|paged_text)/) ? 1 : 0;
     # the following line is necessary to avoid 'uninitialised variable' error
     # messages concerning the converted_to member variable when PDFPlugin's
 …
     if ($target_file_type eq "html") {
     $self->{'converted_to'} = "HTML";
+    } elsif ($target_file_type eq "jpg" || $target_file_type eq "gif" || $target_file_type eq "png") {
+    } elsif ($target_file_type eq "jpg" || $target_file_type eq "gif" || $target_file_type eq "png") {
+    # GIF not supported by PDFBox at present, see https://pdfbox.apache.org/1.8/commandline.html#pdftoimage
     $self->{'converted_to'} = $target_file_type;
     $img_output_mode = 1;
 …
     # append the output filetype suffix only for non-image output formats, since for
     # images we can be outputting multiple image files per single PDF input file
     my $target_file = $img_output_mode ? "$file_root" : "$file_root.$target_file_type";
+    my $target_file = ($img_output_mode || $paged_txt_output_mode) ? "$file_root" : "$file_root.$target_file_type";
     $target_file_path = &FileUtils::filenameConcatenate($cache_dir,$target_file);
 …
     # for image files, remove the suffix, since we can have many output image files
     # per input PDF (one img for each page of the PDF, for example)
     if($img_output_mode) {
+    if($img_output_mode || $paged_txt_output_mode) {
         $target_file_path =~ s/\.[^.]*$//g;
         if(!&FileUtils::directoryExists($target_file_path)) {
 …
         # item file generated in it can be deleted in one go on clean_up
+    }
     push(@{$self->{'pbtmp_file_paths'}}, $target_file_path);
+    }
 …
     my ($tailname, $dirname, $suffix) = &File::Basename::fileparse($source_file_full_path, "\\.[^\\.]+\$");
     if($img_output_mode) { # converting to images
+    if($img_output_mode || $paged_txt_output_mode) { # converting each page to image and/or text
     my $output_prefix = &FileUtils::filenameConcatenate($target_file_path, $tailname);
+    $convert_cmd = $self->{'pdfbox_img_launch_cmd'};
+    $convert_cmd .= " -imageType $target_file_type";
+    #$convert_cmd = $paged_txt_output_mode ? $self->{'pdfbox_imgtxt_launch_cmd'} : $self->{'pdfbox_img_launch_cmd'};
+    $convert_cmd = $self->{'pdfbox_imgtxt_launch_cmd'};
+    $convert_cmd .= " -textOnly" unless($img_output_mode); # if paged txt only and no images
+    $convert_cmd .= " -imagesOnly" unless($paged_txt_output_mode); # set to images only unless there's text too
+    $convert_cmd .= " -imageType $target_file_type" if($img_output_mode);
     $convert_cmd .= " -outputPrefix \"$output_prefix\"";
     $convert_cmd .= " \"$source_file_full_path\"";
     } else { # html or text
+    } else { # single stream of text or html
     if ($target_file_type eq "html") {
 …
     = $self->autorun_general_cmd($convert_cmd,$source_file_full_path, $target_file_path,$print_info);
     if($img_output_mode) {
+    if($img_output_mode || $paged_txt_output_mode) {
     # now the images have been generated, generate the "$target_file_path/tailname.item"
     # item file for them, which is also the target_file_path that needs to be returned
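
The flag selection in this change can be sketched in isolation as follows. This is a minimal illustration, not code from PDFBoxConverter.pm: the helper name build_paged_args, the example convert_to values, the "txt" target type and the file paths are all assumptions made for the example. Only the -textOnly, -imagesOnly, -imageType and -outputPrefix flags and the two mode tests are taken from the diff above. In the plugin this logic runs only when at least one of the two modes is set; the sketch does not guard against both being off.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PDFBoxConverter.pm): builds the argument
# string appended to the launch command, mirroring the flag logic in the diff.
sub build_paged_args {
    my ($convert_to, $target_file_type, $output_prefix, $source_file) = @_;

    # Image output is requested when the target type is an image format.
    my $img_output_mode = ($target_file_type =~ /^(jpg|gif|png)$/) ? 1 : 0;
    # Per-page text is requested for the pagedimgtxt and paged_text formats.
    my $paged_txt_output_mode = ($convert_to =~ /(pagedimgtxt|paged_text)/) ? 1 : 0;

    my $args = "";
    $args .= " -textOnly" unless $img_output_mode;          # paged text, no images
    $args .= " -imagesOnly" unless $paged_txt_output_mode;  # images, no text
    $args .= " -imageType $target_file_type" if $img_output_mode;
    $args .= " -outputPrefix \"$output_prefix\"";
    $args .= " \"$source_file\"";
    return $args;
}

# The convert_to values and paths below are assumed, for illustration only.
print build_paged_args("pagedimg_jpg",    "jpg", "/tmp/out/doc", "/tmp/doc.pdf"), "\n";
print build_paged_args("pagedimgtxt_png", "png", "/tmp/out/doc", "/tmp/doc.pdf"), "\n";
print build_paged_args("paged_text",      "txt", "/tmp/out/doc", "/tmp/doc.pdf"), "\n";
```

With these inputs, images-only output gets -imagesOnly plus -imageType, combined image and text output gets only -imageType, and text-only output gets only -textOnly.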

Changeset 32282 for gs2-extensions/pdf-box/trunk/java/perllib/plugins/PDFBoxConverter.pm
