Opened 11 years ago
Last modified 9 years ago
#839 new enhancement
PDFBox extension should handle convert_to image formats
Reported by: | ak19 | Owned by: | nobody |
---|---|---|---|
Priority: | low | Milestone: | |
Component: | Collection Building | Severity: | major |
Keywords: | | Cc: | |
Description
At present, the PDFBox extension always runs ExtractText to convert newer PDF versions to HTML or text.
However, PDFPlugin's configuration options include convert_to pagedimg_jpg/gif/png, and the PDFBox app can convert PDFs to images with the PDFToImage command instead of the ExtractText command that the PDFBox extension normally uses.
There are two ways to run PDFBox with PDFToImage:
- java -jar "/research/<>/gs2-svn2/ext/pdf-box/lib/java/pdfbox-app.jar" PDFToImage -imageType png -outputPrefix "/home/<>/Desktop/dump/pinky" "/research/<>/tutorial_sample_files/Word_and_PDF/difficult_pdf/pdf05-notext.pdf"
- java -cp "/research/<>/gs2-svn2/ext/pdf-box/lib/java/pdfbox-app.jar" org.apache.pdfbox.PDFToImage -imageType png -outputPrefix "/home/<>/Desktop/dump/lala" "/research/<>/tutorial_sample_files/Word_and_PDF/difficult_pdf/pdf05-notext.pdf"
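As a rough sketch of how the convert_to values might select between the two commands, the shell functions below map pagedimg_jpg/gif/png to a PDFToImage -imageType argument. The function names and the PDFBOX_JAR variable are hypothetical and not part of the extension; only the -imageType and -outputPrefix flags and the class name come from the commands above. Whether a given image type is actually written depends on the image writers available to the Java runtime.

```shell
# Map a PDFPlugin convert_to value to a PDFToImage -imageType argument.
# Prints the image type and returns 0 for pagedimg_* values.
# Returns 1 for anything else, where the caller would keep using ExtractText.
image_type_for() {
    case "$1" in
        pagedimg_jpg) echo jpg ;;
        pagedimg_gif) echo gif ;;
        pagedimg_png) echo png ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper. PDFBOX_JAR must point at pdfbox-app.jar.
# Usage: pdf_to_images CONVERT_TO OUTPUT_PREFIX INPUT_PDF
# The body runs in a subshell so img_type does not leak to the caller.
pdf_to_images() (
    img_type=$(image_type_for "$1") || return 1
    java -cp "$PDFBOX_JAR" org.apache.pdfbox.PDFToImage \
        -imageType "$img_type" -outputPrefix "$2" "$3"
)
```

A call such as `pdf_to_images pagedimg_png /some/tmp/prefix input.pdf` would then run the second form of the command above, while a non-image convert_to value returns a non-zero status without starting Java.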
When I tried adding this to the extension in PDFBoxConverter.pm, I got stuck in the part of the code that reads the generated output file. With ExtractText this was a single HTML or text file, but PDFToImage can create multiple output files (one image per page), which are stored in a temporary output folder.
At present the Perl building code appears to choke when it tries to read the output folder as if it were a single file.
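To illustrate what the calling code would need to do instead of opening a single file, here is a minimal shell sketch that collects the per-page images for a given output prefix and prints them in page order. It assumes PDFToImage names each image as the output prefix followed by the page number and the image extension, possibly with a separator in between; that naming should be checked against the PDFBox version in use. The function name list_page_images is hypothetical.

```shell
# Print the page images generated for PREFIX with extension EXT, one per line,
# ordered by page number. Assumes names of the form PREFIX<page>.EXT, with an
# optional non-digit separator before the page number.
# Usage: list_page_images PREFIX EXT
# The body runs in a subshell so its variables do not leak to the caller.
list_page_images() (
    prefix=$1
    ext=$2
    for f in "$prefix"*."$ext"; do
        [ -f "$f" ] || continue                      # skips the literal pattern when nothing matches
        page=${f#"$prefix"}                          # drop the prefix
        page=${page%".$ext"}                         # drop the extension
        page=$(printf '%s' "$page" | tr -cd '0-9')   # keep only the digits
        [ -n "$page" ] && printf '%s\t%s\n' "$page" "$f"
    done | sort -n -k1,1 | cut -f2-
)
```

The numeric sort matters because a plain lexical listing would place page 10 before page 2. The pattern is deliberately loose, so any other file in the folder that starts with the same prefix and contains digits would also be listed; file names containing tabs or newlines are not handled.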
Change History (4)
comment:1 by , 11 years ago
comment:2 by , 11 years ago
Priority: | moderate → low |
---|---|
Type: | defect → enhancement |
comment:3 by , 11 years ago
The above code fails with the following error message:
import.pl> File "pdf05-notext.pdf" matches filespec "pdf05-notext\.pdf"
import.pl> DirectoryPlugin recurring: pdf05-notext.pdf
import.pl> Convert command: java -cp "/research/ak19/gs2-svn2/ext/pdf-box/lib/java/pdfbox-app.jar" org.apache.pdfbox.PDFToImage -imageType png -outputPrefix /research/ak19/gs2-svn2/tmp/F551/pdf05-notext "/research/ak19/gs2-svn2/collect/pdf/import/pdf05-notext.pdf"
import.pl> PDFBox Conversion: java -cp "/research/ak19/gs2-svn2/ext/pdf-box/lib/java/pdfbox-app.jar" org.apache.pdfbox.PDFToImage -imageType png -outputPrefix /research/ak19/gs2-svn2/tmp/F551/pdf05-notext "/research/ak19/gs2-svn2/collect/pdf/import/pdf05-notext.pdf"
import.pl> Converting pdf05-notext.pdf to: png ...
import.pl> ...done
import.pl> Use of uninitialized value $gc in ord at /research/ak19/gs2-svn2/perllib/multiread.pm line 225.
import.pl> ConvertToPlug::write_file {ConvertToPlug.could_not_open_for_writing} (Is a directory)
import.pl> Error: Failed to run: "/usr/bin/perl" -S import.pl -removeold "-gli" "-language" "en" "-collectdir" "/research/ak19/gs2-svn2/collect" "-verbosity" "5" "pdf"
import.pl> Command failed.
comment:4 by , 9 years ago
Changes to PDFBoxConverter.pm: