Opened 13 years ago

Last modified 12 years ago

#287 new defect

PDFPlugin: pdftohtml requires '-c' option to properly process multi-column pdf files

Reported by: mcennis
Owned by: nobody
Priority: moderate
Milestone: Collection building wishlist
Component: Collection Building: Plugins
Severity: minor
Keywords: PDFPlugin pdftohtml
Cc:

Description

When the PDFPlugin uses pdftohtml to generate HTML from which the HTMLPlugin extracts text, multi-column PDFs are handled incorrectly. In the simple mode (currently used by Greenstone), text is extracted left to right, ignoring columns. The resulting HTML is laid out correctly for viewing, but the text is not in logical reading order: when it is extracted, the contents of the columns are interleaved line by line. This can be fixed by passing the '-c' option to pdftohtml when it is called. This has the added benefit of providing additional structure that can be extracted (sections, etc.).
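The following is a minimal sketch of the problem and the proposed fix, not Greenstone's actual code (the PDFPlugin itself is written in Perl). The function names, the `complex_output` parameter, and the sample column text are hypothetical and exist only for illustration. The only pdftohtml detail relied on is the '-c' option named above and the usual `pdftohtml [options] <PDF-file> <output-base>` argument order.

```python
from itertools import zip_longest


def read_across(left_column, right_column):
    """Mimic simple-mode extraction: each visual line is read left to
    right across both columns, so the two columns end up interleaved."""
    merged = []
    for left_line, right_line in zip_longest(left_column, right_column):
        if left_line is not None:
            merged.append(left_line)
        if right_line is not None:
            merged.append(right_line)
    return merged


def read_by_column(left_column, right_column):
    """Logical reading order: the whole left column, then the right."""
    return list(left_column) + list(right_column)


def build_pdftohtml_command(pdf_path, output_base, complex_output=True):
    """Build the argument list for a pdftohtml call.

    complex_output is a hypothetical toggle; when true, the '-c'
    option proposed in this ticket is added.
    """
    command = ["pdftohtml"]
    if complex_output:
        command.append("-c")
    command.extend([pdf_path, output_base])
    return command


if __name__ == "__main__":
    left = ["Intro line 1", "Intro line 2"]
    right = ["Methods line 1", "Methods line 2"]
    print(read_across(left, right))     # interleaved, as described above
    print(read_by_column(left, right))  # the order a reader expects
    print(build_pdftohtml_command("paper.pdf", "paper"))
```

Keeping '-c' behind a toggle rather than hard-coding it is only one possible design. It would let a collection with mixed single-column and multi-column documents be tested with either setting.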

Change History (2)

comment:1 by kjdon, 13 years ago

Milestone: Release 3.06
Severity: major → minor

Can '-c' be turned on all the time? What if some docs have columns and some don't? Can it go wrong?

comment:2 by kjdon, 12 years ago

Milestone: Collection building wishlist