Opened 11 years ago

Last modified 11 years ago

#841 new defect

Upgrade to PDFBox 1.7 as it can convert text pages to images

Reported by: ak19 Owned by: nobody
Priority: moderate Milestone: Possible 2.88 Release
Component: Collection Building Severity: major
Keywords: PDFBox extension, PDFToImage Cc:


The -pagedimg_FORMAT option is now supported when using the PDFBox extension. However, our pdfbox jar file is at version 1.5, which only "generates pages as images" when the PDF pages are themselves images.

The pdfbox jar version 1.7 can generate pages as images from PDFs containing text. However, the output images aren't always clean: the columns of multi-column documents sometimes overlap. This may be because the PDFToImage command of PDFBox is still in beta.

In other respects, including line spacing (an issue we had in the past), the 1.7 pdfbox jar file appears to perform the same as the 1.5 version.

Should we upgrade now, or wait until the PDFToImage command works reliably, since little is gained at present?

Change History (1)

comment:1 by ak19, 11 years ago

Milestone: 2.86 Release