Ticket #841 (new defect)

Opened 5 years ago

Last modified 5 years ago

Upgrade to PDFBox 1.7 as it can convert text pages to images

Reported by: ak19 Owned by: nobody
Priority: moderate Milestone: 2.87 Release
Component: Collection Building Severity: major
Keywords: PDFBox extension, PDFToImage Cc:

Description

The -pagedimg_FORMAT option is now supported when using the PDFBox extension. However, our pdfbox jar file is version 1.5, which only "generates pages as images" when the PDF pages are themselves images.

Version 1.7 of the pdfbox jar is able to generate pages as images from PDFs containing text. However, the output images aren't always clean: sometimes the columns of multi-column documents overlap. This may be because the PDFToImage command of PDFBox is still in beta.
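For anyone wanting to reproduce the column overlap outside of Greenstone, the PDFToImage command can be run directly from the PDFBox app jar. The following is only a rough sketch of how such a command line might be assembled. The jar name, input file and output prefix are placeholders, the class name PdfToImageCommand is made up for illustration, and none of this is taken from the PDFBox extension's own code.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: assembles a command line for PDFBox's PDFToImage tool.
// Nothing here comes from Greenstone's PDFBox extension.
class PdfToImageCommand {

    // jarPath, pdfPath, imageType and outputPrefix are all caller-supplied
    // placeholders; adjust them to the local installation.
    static List<String> build(String jarPath, String pdfPath,
                              String imageType, String outputPrefix) {
        List<String> cmd = new ArrayList<String>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("PDFToImage");
        cmd.add("-imageType");
        cmd.add(imageType);
        cmd.add("-outputPrefix");
        cmd.add(outputPrefix);
        cmd.add(pdfPath); // the PDF file goes last, after the options
        return cmd;
    }

    public static void main(String[] args) {
        // Hypothetical file names, for illustration only.
        List<String> cmd = build("pdfbox-app-1.7.0.jar", "input.pdf", "png", "page");
        // Print the command instead of running it, so the sketch needs no jar.
        System.out.println(String.join(" ", cmd));
    }
}
```

Passing the resulting list to java.lang.ProcessBuilder would actually launch the conversion, assuming the jar is present at the given path. Running the same command against the 1.5 and 1.7 jars on a multi-column PDF should make the difference in output easy to compare.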

In other respects, including line spacing (an issue we had in the past), the 1.7 pdfbox jar file appears to perform the same as the 1.5 version.

Should we upgrade now, or wait until the PDFToImage command works well, given that little is gained at present?

Change History

Changed 5 years ago by ak19

  • milestone set to 2.86 Release