Ticket #287 (new defect)

Opened 11 years ago

Last modified 9 years ago

PDFPlugin: pdftohtml requires '-c' option to properly process multi-column pdf files

Reported by: mcennis
Owned by: nobody
Priority: moderate
Milestone: Collection building wishlist
Component: Collection Building: Plugins
Severity: minor
Keywords: PDFPlugin pdftohtml
Cc:

Description

When PDFPlugin uses pdftohtml to generate HTML from which HTMLPlugin extracts text, multi-column PDFs are handled incorrectly. In the simple mode (currently used by Greenstone), text is extracted left to right, ignoring columns. The resulting HTML is correct and the text is positioned properly for viewing, but it is not in logical reading order: when the text is extracted, the contents of the columns are interleaved line by line. This can be fixed by passing the '-c' option to pdftohtml when it is called. This has the added benefit of providing additional structure that can be extracted (sections, etc.).
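
The interleaving described above can be illustrated with a small sketch. This is not Greenstone or pdftohtml code; the fragment coordinates, the function names and the column boundary are all assumptions made up for illustration. It only shows why reading left to right across a two-column page mixes the columns, while reading one column at a time does not.

```python
# Hypothetical text fragments from a two-column page: (x, y, text).
# Column A starts at x=0, column B at x=300 (made-up coordinates).
fragments = [
    (0, 0, "A1"), (300, 0, "B1"),
    (0, 10, "A2"), (300, 10, "B2"),
    (0, 20, "A3"), (300, 20, "B3"),
]


def simple_order(frags):
    """Read top to bottom, left to right, ignoring columns."""
    ordered = sorted(frags, key=lambda f: (f[1], f[0]))
    return [text for _x, _y, text in ordered]


def column_order(frags, column_boundary=150):
    """Read the left column top to bottom, then the right column.

    column_boundary is an assumed x position separating the columns.
    """
    ordered = sorted(frags, key=lambda f: (f[0] >= column_boundary, f[1]))
    return [text for _x, _y, text in ordered]


simple = simple_order(fragments)
by_column = column_order(fragments)
print(simple)     # columns interleaved line by line
print(by_column)  # logical reading order
```

In this toy model the simple ordering yields A1, B1, A2, B2, A3, B3, which matches the line-by-line interleaving reported here, while the column-aware ordering yields A1, A2, A3, B1, B2, B3.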

Change History

Changed 11 years ago by kjdon

  • severity changed from major to minor
  • milestone Release 3.06 deleted

Can -c be enabled all the time? What if some documents have columns and some don't? Can it go wrong?

Changed 9 years ago by kjdon

  • milestone set to Collection building wishlist