Ticket #839 (new enhancement)

Opened 7 years ago

Last modified 6 years ago

PDFBox extension should handle convert_to image formats

Reported by: ak19
Owned by: nobody
Priority: low
Milestone:
Component: Collection Building
Severity: major
Keywords:
Cc:

Description

At present, the PDFBox extension always runs ExtractText to convert newer PDF versions to HTML or text.

However, PDFPlugin's configuration options include convert_to pagedimg_jpg/gif/png, and the PDFBox app can convert PDFs to images using the PDFToImage command instead of the ExtractText command that the PDFBox extension normally uses.

There are two ways to run PDFBox with PDFToImage:

1. java -jar "/research/<>/gs2-svn2/ext/pdf-box/lib/java/pdfbox-app.jar" PDFToImage -imageType png -outputPrefix "/home/<>/Desktop/dump/pinky" "/research/<>/tutorial_sample_files/Word_and_PDF/difficult_pdf/pdf05-notext.pdf"

2. java -cp "/research/<>/gs2-svn2/ext/pdf-box/lib/java/pdfbox-app.jar" org.apache.pdfbox.PDFToImage -imageType png -outputPrefix "/home/<>/Desktop/dump/lala" "/research/<>/tutorial_sample_files/Word_and_PDF/difficult_pdf/pdf05-notext.pdf"
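
A minimal Perl sketch of the second form, assuming the command is assembled as an argument list rather than a single string, so that paths containing spaces need no manual quoting. The helper name build_pdftoimage_cmd and its parameters are hypothetical; only the class name and the -imageType and -outputPrefix options come from the commands above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the PDFToImage invocation as a list of
# arguments (java, classpath, class name, options, input PDF).
sub build_pdftoimage_cmd {
    my ($jar, $image_type, $output_prefix, $pdf) = @_;
    return (
        "java", "-cp", $jar,
        "org.apache.pdfbox.PDFToImage",
        "-imageType", $image_type,
        "-outputPrefix", $output_prefix,
        $pdf,
    );
}
```

The returned list could be handed to system(@cmd); with more than one argument, Perl's system does not go through the shell.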

When I tried adding this to the extension in PDFBoxConverter.pm, I got stuck in the part of the code that reads the generated output file. With ExtractText the output was a single HTML or text file, but with PDFToImage multiple output files can be created (one image per page), and they are stored in a temporary output folder.

At present the Perl building code appears to choke when it tries to read the output folder as though it were a single file.
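
One way the building code could treat the output as a set of files is sketched below. This is not existing Greenstone code: collect_page_images is a hypothetical helper, and it assumes each generated image name ends with a page number immediately before the extension.

```perl
use strict;
use warnings;

# Hypothetical helper: list the image files of one type inside $dir,
# ordered by the page number that precedes the extension.
sub collect_page_images {
    my ($dir, $image_type) = @_;
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    my @names = grep { /\.\Q$image_type\E\z/i } readdir($dh);
    closedir($dh);

    # Extract the trailing digits (e.g. "pinky12.png" gives 12); 0 if absent.
    my $page_of = sub {
        my ($name) = @_;
        return $name =~ /(\d+)\.[^.]+\z/ ? $1 : 0;
    };

    # Numeric sort on page number, falling back to name order on ties.
    my @sorted = sort { $page_of->($a) <=> $page_of->($b) or $a cmp $b } @names;
    return map { "$dir/$_" } @sorted;
}
```

Sorting numerically keeps page 10 after page 2, which a plain string sort would not.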

Change History

Changed 7 years ago by ak19

Changes to PDFBoxConverter.pm:

1. the new() constructor:

my $launch_cmd = "java -cp \"$pbajar\" -Dline.separator=\"<br />\" org.apache.pdfbox.ExtractText";

$self->{'pdfbox_launch_cmd'} = $launch_cmd; # for html and text extract
$self->{'pdfbox_img_launch_cmd'} = "java -cp \"$pbajar\" org.apache.pdfbox.PDFToImage"; # for image extraction (gif, jpg, png)

2. In convert(), it needs to accept jpg/gif/png output file types:

if ($target_file_type eq "html") {
    $self->{'converted_to'} = "HTML";
} elsif ($target_file_type eq "jpg" || $target_file_type eq "gif" || $target_file_type eq "png") {
    $self->{'converted_to'} = $target_file_type;
} else {
    $self->{'converted_to'} = "text";
}

3. Still in convert(), we need to work out the target file path without appending the file suffix (which it does by default):

# Determine the full name and path of the output file
my $target_file_path;
if ($self->{'enable_cache'}) {
    $self->init_cache_for_file($source_file_full_path);
    my $cache_dir = $self->{'cached_dir'};
    my $file_root = $self->{'cached_file_root'};
    #$file_root .= "_$convert_id" if ($convert_id ne "");
    my $target_file = "$file_root";
    # append the output filetype suffix only for non-image output formats, since for images we
    # can be outputting multiple image files per single PDF input file
    if ($target_file_type ne "jpg" && $target_file_type ne "gif" && $target_file_type ne "png") {
        $target_file .= ".$target_file_type"; # suffix includes the "." separator
    }
    $target_file_path = &util::filename_cat($cache_dir,$target_file);
} else {
    # this is in gsdl/tmp. get a tmp filename in collection instead???
    $target_file_path = &util::get_tmp_filename($target_file_type);

    # for image files, remove the suffix, since we can have many output image files per input PDF
    # (one img for each page of the PDF, for example)
    if($target_file_type eq "jpg" || $target_file_type eq "gif" || $target_file_type eq "png") {
        $target_file_path =~ s/\.[^.]*$//g;
        if(!&util::dir_exists($target_file_path)) {
            mkdir($target_file_path);
        }
        #my ($tailname, $dirname, $suffix) = &File::Basename::fileparse($target_file_path, "\\.[^\\.]+\$");
        #$target_file_path = $dirname;
    }
    push(@{$self->{'pbtmp_file_paths'}}, $target_file_path);
}

4. Finally, make it run the special image convert command when the target file type is an image format:

# Generate and run the convert command
my $convert_cmd = "";

if($target_file_type eq "jpg" || $target_file_type eq "gif" || $target_file_type eq "png") { # converting to images
    # want the filename without extension, because the images are to be generated with the same filename as the PDF
    my ($tailname, $dirname, $suffix) = &File::Basename::fileparse($source_file_full_path, "\\.[^\\.]+\$");
    #$tailname =~ s/ //g; # removing spaces in input filename

    $dirname = &util::filename_cat($target_file_path, $tailname);
    #mkdir("/research/<>/gs2-svn2/tmp/F11");
    #$dirname = "/research/<>/gs2-svn2/tmp/F11/pdf05-notext"; ####

    $convert_cmd = $self->{'pdfbox_img_launch_cmd'};
    $convert_cmd .= " -imageType $target_file_type";
    $convert_cmd .= " -outputPrefix $dirname";
    $convert_cmd .= " \"$source_file_full_path\"";
} else { # html or text
    $convert_cmd = $self->{'pdfbox_launch_cmd'};
    $convert_cmd .= " -html" if ($target_file_type eq "html");
    $convert_cmd .= " \"$source_file_full_path\" \"$target_file_path\"";
}
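
The image-type test and the suffix handling recur across steps 2 to 4, so they could be factored into small helpers. The sketch below is standalone and uses only core modules; is_image_type, strip_suffix and image_output_prefix are hypothetical names, not functions in PDFBoxConverter.pm or util.pm.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Spec;

# Hypothetical helper: true for the image formats named in this ticket.
sub is_image_type {
    my ($target_file_type) = @_;
    return $target_file_type =~ /\A(?:jpg|gif|png)\z/ ? 1 : 0;
}

# Hypothetical helper: drop the final ".suffix" from a path, if present.
# Excluding "/" from the suffix leaves dots in directory names untouched.
sub strip_suffix {
    my ($path) = @_;
    $path =~ s/\.[^.\/]*\z//;
    return $path;
}

# Hypothetical helper: build the -outputPrefix value from the output
# directory and the source PDF's file name without its extension.
sub image_output_prefix {
    my ($output_dir, $source_pdf) = @_;
    my ($tailname) = fileparse($source_pdf, qr/\.[^.]+/);
    return File::Spec->catfile($output_dir, $tailname);
}
```

With helpers like these, each branch in convert() would reduce to a single is_image_type($target_file_type) check.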

Changed 7 years ago by ak19

  • priority changed from moderate to low
  • type changed from defect to enhancement


Changed 7 years ago by ak19

The above code fails with the following error message:

import.pl> File "pdf05-notext.pdf" matches filespec "pdf05-notext\.pdf"

import.pl> DirectoryPlugin recurring: pdf05-notext.pdf

import.pl> Convert command: java -cp "/research/ak19/gs2-svn2/ext/pdf-box/lib/java/pdfbox-app.jar" org.apache.pdfbox.PDFToImage -imageType png -outputPrefix /research/ak19/gs2-svn2/tmp/F551/pdf05-notext "/research/ak19/gs2-svn2/collect/pdf/import/pdf05-notext.pdf"

import.pl> PDFBox Conversion: java -cp "/research/ak19/gs2-svn2/ext/pdf-box/lib/java/pdfbox-app.jar" org.apache.pdfbox.PDFToImage -imageType png -outputPrefix /research/ak19/gs2-svn2/tmp/F551/pdf05-notext "/research/ak19/gs2-svn2/collect/pdf/import/pdf05-notext.pdf"

import.pl> Converting pdf05-notext.pdf to: png ...

import.pl> ...done

import.pl> Use of uninitialized value $gc in ord at /research/ak19/gs2-svn2/perllib/multiread.pm line 225.

import.pl> ConvertToPlug::write_file {ConvertToPlug.could_not_open_for_writing} (Is a directory)

import.pl> Error: Failed to run: "/usr/bin/perl" -S import.pl -removeold "-gli" "-language" "en" "-collectdir" "/research/ak19/gs2-svn2/collect" "-verbosity" "5" "pdf"

import.pl> Command failed.
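
The "(Is a directory)" message suggests the conversion result is being opened as if it were a single file. A guard along the following lines could let the caller branch first; conversion_output_kind is a hypothetical helper and not part of ConvertToPlug.

```perl
use strict;
use warnings;

# Hypothetical helper: classify what a conversion left at $path so the
# caller can decide how to read it before trying to open it.
sub conversion_output_kind {
    my ($path) = @_;
    return "directory" if -d $path;
    return "file"      if -f $path;
    return "missing";
}
```

A "directory" result would mean iterating over the page images inside it rather than opening the path directly.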
