#287 new defect

PDFPlugin: pdftohtml requires '-c' option to properly process multi-column pdf files

Reported by:	mcennis	Owned by:	nobody
Priority:	moderate	Milestone:	Collection building wishlist
Component:	Collection Building: Plugins	Severity:	minor
Keywords:	PDFPlugin pdftohtml	Cc:

Description

When PDFPlugin uses pdftohtml to generate HTML from which HTMLPlugin extracts text, multi-column PDFs are handled incorrectly. In the simple mode (currently used by Greenstone), text is extracted left to right across the page, ignoring columns. The resulting HTML is laid out correctly for viewing, but the underlying text is not in logical reading order: when the text is extracted, the contents of the columns are interleaved line by line. This can be fixed by passing the '-c' option to pdftohtml when it is called. This has the added benefit of providing additional structure that can be extracted (sections, etc.).
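The interleaving described above can be illustrated with a small sketch. It is not pdftohtml's algorithm and not Greenstone code; the fragment coordinates, the `column_boundary` value, and both function names are hypothetical, chosen only to show how reading straight across a two-column page differs from reading column by column.

```python
# Hypothetical text fragments from a two-column page: (x, y, text).
# x grows to the right, y grows down the page.
fragments = [
    (50, 100, "Left column, line 1."),
    (300, 100, "Right column, line 1."),
    (50, 120, "Left column, line 2."),
    (300, 120, "Right column, line 2."),
]


def row_order(frags):
    """Read straight across the page: top to bottom, then left to right."""
    ordered = sorted(frags, key=lambda f: (f[1], f[0]))
    return [text for _x, _y, text in ordered]


def column_order(frags, column_boundary):
    """Read the whole left column first, then the whole right column.

    column_boundary is an assumed x position separating the two columns.
    """
    def key(frag):
        x, y, _text = frag
        # False sorts before True, so left-column fragments come first.
        return (x >= column_boundary, y, x)

    return [text for _x, _y, text in sorted(frags, key=key)]


if __name__ == "__main__":
    print("Across the page:", row_order(fragments))
    print("Column by column:", column_order(fragments, column_boundary=200))
```

In the first ordering the left and right columns alternate line by line, which is the symptom reported here; the second ordering keeps each column's text together.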

Change History (2)

comment:1 by kjdon, 16 years ago

Milestone:	Release 3.06
Severity:	major → minor

Can -c go on all the time? What if there are some docs with columns, some without? Can it go wrong?

comment:2 by kjdon, 15 years ago

Milestone:	→ Collection building wishlist
