Opened 12 years ago

Last modified 10 years ago

#839 new enhancement

PDFBox extension should handle convert_to image formats

Reported by: ak19 Owned by: nobody
Priority: low Milestone:
Component: Collection Building Severity: major
Keywords: Cc:

Description

At present, the PDFBox extension always executes ExtractText to convert newer PDF versions to HTML or text.

However, PDFPlugin's configuration options include convert_to pagedimg_jpg/gif/png, and the PDFBox app can actually convert PDFs to images using the PDFToImage command, instead of the usual ExtractText command used by the PDFBox ext.

There are two ways to run PDFBox with PDFToImage:

  1. java -jar "/research/<>/gs2-svn2/ext/pdf-box/lib/java/pdfbox-app.jar" PDFToImage -imageType png -outputPrefix "/home/<>/Desktop/dump/pinky" "/research/<>/tutorial_sample_files/Word_and_PDF/difficult_pdf/pdf05-notext.pdf"
  2. java -cp "/research/<>/gs2-svn2/ext/pdf-box/lib/java/pdfbox-app.jar" org.apache.pdfbox.PDFToImage -imageType png -outputPrefix "/home/<>/Desktop/dump/lala" "/research/<>/tutorial_sample_files/Word_and_PDF/difficult_pdf/pdf05-notext.pdf"

When I tried adding this to the extension in PDFBoxConverter.pm, I got stuck at the part of the code that reads the generated output file. This used to be a single html or text file when ExtractText was used, but when the PDFToImage cmd is used instead, multiple output files can be created (an image for each page) and are stored in a temporary output folder.

At present the Perl building code appears to choke when it tries to read the output folder, treating it as a single file.
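A minimal sketch of how the reading side could tell the two cases apart, assuming the converter hands back either a single file (ExtractText) or a folder of page images (PDFToImage). The helper name collect_output_files is hypothetical and not part of PDFBoxConverter.pm, and the assumption that each image name ends in a page number followed by the extension may not hold for every PDFBox version.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PDFBoxConverter.pm): given the path the
# converter produced, return the list of output files to process.
sub collect_output_files {
    my ($target_path, $image_type) = @_;

    # ExtractText case: a single html or text file
    return ($target_path) if -f $target_path;

    # PDFToImage case: a folder holding one image per page
    return () unless -d $target_path;
    opendir(my $dh, $target_path) or die "Cannot open $target_path: $!";
    my @images = grep { /\.\Q$image_type\E$/i } readdir($dh);
    closedir($dh);

    # Sort on the trailing page number so that page 10 follows page 9
    # (assumes names end in <digits>.<ext>; files without digits sort first)
    my @sorted = map  { $_->[1] }
                 sort { $a->[0] <=> $b->[0] || $a->[1] cmp $b->[1] }
                 map  { [ (/(\d+)\.[^.]+$/ ? $1 : 0), $_ ] } @images;

    return map { "$target_path/$_" } @sorted;
}
```

With a helper of this shape, the caller could loop over the returned list instead of opening the target path directly.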

Change History (4)

comment:1 by ak19, 12 years ago

Changes to PDFBoxConverter.pm:

  1. the new() constructor:

my $launch_cmd = "java -cp \"$pbajar\" -Dline.separator=\"<br />\" org.apache.pdfbox.ExtractText";
$self->{'pdfbox_launch_cmd'} = $launch_cmd; # for html and text extract
$self->{'pdfbox_img_launch_cmd'} = "java -cp \"$pbajar\" org.apache.pdfbox.PDFToImage"; # for image extraction (gif, jpg, png)

  2. In convert(), it needs to accept jpg/gif/png output file types:

if ($target_file_type eq "html") {
    $self->{'converted_to'} = "HTML";
} elsif ($target_file_type eq "jpg" || $target_file_type eq "gif" || $target_file_type eq "png") {
    $self->{'converted_to'} = $target_file_type;
} else {
    $self->{'converted_to'} = "text";
}

  3. Still in convert(), we need to work out the target file path without appending the file suffix (which it does by default):

# Determine the full name and path of the output file
my $target_file_path;
if ($self->{'enable_cache'}) {
    $self->init_cache_for_file($source_file_full_path);
    my $cache_dir = $self->{'cached_dir'};
    my $file_root = $self->{'cached_file_root'};
    #$file_root .= "_$convert_id" if ($convert_id ne "");
    my $target_file = "$file_root";
    # append the output filetype suffix only for non-image output formats, since for images we
    # can be outputting multiple image files per single PDF input file
    if ($target_file_type ne "jpg" && $target_file_type ne "gif" && $target_file_type ne "png") {
        $target_file .= $target_file_type;
    }
    $target_file_path = &util::filename_cat($cache_dir,$target_file);
} else {
    # this is in gsdl/tmp. get a tmp filename in collection instead???
    $target_file_path = &util::get_tmp_filename($target_file_type);

    # for image files, remove the suffix, since we can have many output image files per input PDF
    # (one img for each page of the PDF, for example)
    if ($target_file_type eq "jpg" || $target_file_type eq "gif" || $target_file_type eq "png") {
        $target_file_path =~ s/\.[^.]*$//g;
        if (!&util::dir_exists($target_file_path)) {
            mkdir($target_file_path);
        }
        #my ($tailname, $dirname, $suffix) = &File::Basename::fileparse($target_file_path, "\\.[^\\.]+\$");
        #$target_file_path = $dirname;
    }
    push(@{$self->{'pbtmp_file_paths'}}, $target_file_path);
}

  4. Finally, make it run the special image convert command when the target file type is an image format:

# Generate and run the convert command
my $convert_cmd = "";
if ($target_file_type eq "jpg" || $target_file_type eq "gif" || $target_file_type eq "png") { # converting to images
    # want the filename without extension, because the images are to be generated with the same filename as the PDF
    my ($tailname, $dirname, $suffix) = &File::Basename::fileparse($source_file_full_path, "\\.[^\\.]+\$");
    #$tailname =~ s/ //g; # removing spaces in input filename

    $dirname = &util::filename_cat($target_file_path, $tailname);
    #mkdir("/research/<>/gs2-svn2/tmp/F11");
    #$dirname = "/research/<>/gs2-svn2/tmp/F11/pdf05-notext"; ####

    $convert_cmd = $self->{'pdfbox_img_launch_cmd'};
    $convert_cmd .= " -imageType $target_file_type";
    $convert_cmd .= " -outputPrefix $dirname";
    $convert_cmd .= " \"$source_file_full_path\"";
} else { # html or text
    $convert_cmd = $self->{'pdfbox_launch_cmd'};
    $convert_cmd .= " -html" if ($target_file_type eq "html");
    $convert_cmd .= " \"$source_file_full_path\" \"$target_file_path\"";
}
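The jpg/gif/png test is repeated in each of the steps above. A minimal sketch of how it could be factored out, together with the command assembly from the last step. The names is_image_type and build_convert_cmd are hypothetical and not part of PDFBoxConverter.pm. Unlike the snippet above, this sketch wraps the -outputPrefix value in quotes, which is an assumption about what is wanted when paths contain spaces.

```perl
use strict;
use warnings;

# Image formats that the convert_to pagedimg_* options map to in this ticket
my %IMAGE_TYPES = map { $_ => 1 } qw(jpg gif png);

# Hypothetical helper: true if the target type is one of the image formats
sub is_image_type {
    my ($type) = @_;
    return exists $IMAGE_TYPES{lc $type} ? 1 : 0;
}

# Hypothetical helper: assemble the convert command string from its parts.
# $launch_cmd and $img_launch_cmd correspond to the 'pdfbox_launch_cmd' and
# 'pdfbox_img_launch_cmd' values set up in new().
sub build_convert_cmd {
    my ($launch_cmd, $img_launch_cmd, $type, $source, $target) = @_;

    if (is_image_type($type)) {
        # PDFToImage: $target is used as the output prefix for the page images
        return "$img_launch_cmd -imageType $type -outputPrefix \"$target\" \"$source\"";
    }

    # ExtractText: $target is the single html or text output file
    my $cmd = $launch_cmd;
    $cmd .= " -html" if $type eq "html";
    return "$cmd \"$source\" \"$target\"";
}
```

Keeping the list of image formats in one place means a new format only has to be added once.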

comment:2 by ak19, 12 years ago

Priority: moderate → low
Type: defect → enhancement

The above code fails with the following error message:

import.pl> File "pdf05-notext.pdf" matches filespec "pdf05-notext\.pdf"

import.pl> DirectoryPlugin recurring: pdf05-notext.pdf

import.pl> Convert command: java -cp "/research/<>/gs2-svn2/ext/pdf-box/lib/java/pdfbox-app.jar" org.apache.pdfbox.PDFToImage -imageType png -outputPrefix /research/<>/gs2-svn2/tmp/F551/pdf05-notext "/research/<>/gs2-svn2/collect/pdf/import/pdf05-notext.pdf"

import.pl> PDFBox Conversion: java -cp "/research/<>/gs2-svn2/ext/pdf-box/lib/java/pdfbox-app.jar" org.apache.pdfbox.PDFToImage -imageType png -outputPrefix /research/<>/gs2-svn2/tmp/F551/pdf05-notext "/research/<>/gs2-svn2/collect/pdf/import/pdf05-notext.pdf"

import.pl> Converting pdf05-notext.pdf to: png ...

import.pl> ...done

import.pl> Use of uninitialized value $gc in ord at /research/<>/gs2-svn2/perllib/multiread.pm line 225.

import.pl> ConvertToPlug::write_file {ConvertToPlug.could_not_open_for_writing} (Is a directory)

import.pl> Error: Failed to run: "/usr/bin/perl" -S import.pl -removeold "-gli" "-language" "en" "-collectdir" "/research/<>/gs2-svn2/collect" "-verbosity" "5" "pdf"

import.pl> Command failed.

comment:3 by ak19, 12 years ago

The above code fails with the following error message:

import.pl> File "pdf05-notext.pdf" matches filespec "pdf05-notext\.pdf"

import.pl> DirectoryPlugin recurring: pdf05-notext.pdf

import.pl> Convert command: java -cp "/research/ak19/gs2-svn2/ext/pdf-box/lib/java/pdfbox-app.jar" org.apache.pdfbox.PDFToImage -imageType png -outputPrefix /research/ak19/gs2-svn2/tmp/F551/pdf05-notext "/research/ak19/gs2-svn2/collect/pdf/import/pdf05-notext.pdf"

import.pl> PDFBox Conversion: java -cp "/research/ak19/gs2-svn2/ext/pdf-box/lib/java/pdfbox-app.jar" org.apache.pdfbox.PDFToImage -imageType png -outputPrefix /research/ak19/gs2-svn2/tmp/F551/pdf05-notext "/research/ak19/gs2-svn2/collect/pdf/import/pdf05-notext.pdf"

import.pl> Converting pdf05-notext.pdf to: png ...

import.pl> ...done

import.pl> Use of uninitialized value $gc in ord at /research/ak19/gs2-svn2/perllib/multiread.pm line 225.

import.pl> ConvertToPlug::write_file {ConvertToPlug.could_not_open_for_writing} (Is a directory)

import.pl> Error: Failed to run: "/usr/bin/perl" -S import.pl -removeold "-gli" "-language" "en" "-collectdir" "/research/ak19/gs2-svn2/collect" "-verbosity" "5" "pdf"

import.pl> Command failed.
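The "(Is a directory)" text in the log is the operating system's error for opening a directory as if it were a file. A minimal sketch that reproduces that failure in isolation, assuming write_file opens the target path for writing as the log suggests. The helper name try_open_for_writing is hypothetical, and the exact error text depends on the platform.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: try to open a path for writing the way a single-file
# writer would, and return the OS error text on failure ("" on success).
sub try_open_for_writing {
    my ($path) = @_;
    if (open(my $fh, '>', $path)) {
        close($fh);
        return "";
    }
    return "$!";
}

# A temporary folder stands in for the image output folder (tmp/F551 in the log)
my $dir = tempdir(CLEANUP => 1);
my $err = try_open_for_writing($dir);
print "open failed: $err\n" if $err ne "";

# A guard of this shape would let the caller branch before attempting the write
print "target is a directory, expect one image per page\n" if -d $dir;
```

A check such as -d on the target path before the write would turn this failure into an explicit branch for the image case.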

