Opened 16 years ago
Last modified 15 years ago
#287 new defect
PDFPlugin: pdftohtml requires '-c' option to properly process multi-column pdf files
Reported by: | mcennis | Owned by: | nobody |
---|---|---|---|
Priority: | moderate | Milestone: | Collection building wishlist |
Component: | Collection Building: Plugins | Severity: | minor |
Keywords: | PDFPlugin pdftohtml | Cc: |
Description
When PDFPlugin uses pdftohtml to generate HTML from which HTMLPlugin extracts the text, multi-column PDFs are handled badly. In the simple mode (currently used by Greenstone), text is extracted left to right, ignoring columns. The resulting HTML is laid out correctly for viewing, but the text is not in logical reading order: when it is extracted, the contents of the columns are interleaved line by line. This can be fixed by passing the '-c' option to pdftohtml when it is called. This has the added benefit of providing additional structure that can be extracted (sections, etc.).
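To illustrate the interleaving described above, here is a minimal sketch in Python. It does not call pdftohtml or reproduce its algorithm; the `PAGE` data and both function names are made up for illustration, and the page is assumed to be a toy two-column layout.

```python
# Toy model of a two-column page (hypothetical data): each tuple holds the
# text found on one visual line, left column first, then right column.
PAGE = [
    ("Left 1", "Right 1"),
    ("Left 2", "Right 2"),
    ("Left 3", "Right 3"),
]


def extract_row_by_row(page):
    """Read each visual line left to right, ignoring column boundaries."""
    return [cell for row in page for cell in row]


def extract_column_by_column(page):
    """Read the whole left column first, then the whole right column."""
    return [cell for column in zip(*page) for cell in column]


if __name__ == "__main__":
    print(extract_row_by_row(PAGE))        # columns interleaved line by line
    print(extract_column_by_column(PAGE))  # logical reading order
```

The first function mimics the behaviour reported in this ticket; the second gives the order a reader would expect from a multi-column document.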
Change History (2)
comment:1 by , 16 years ago
Milestone: | Release 3.06 |
---|---|
Severity: | major → minor |
comment:2 by , 15 years ago
Milestone: | → Collection building wishlist |
---|---|
Can -c go on all the time? What if there are some docs with columns, some without? Can it go wrong?
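One way to keep that question open while testing would be to make the flag optional per document or per collection. The sketch below is only an assumption about how such a toggle could look: `build_pdftohtml_command` and its `complex_mode` parameter are hypothetical names, it is written in Python rather than Greenstone's own code, and it only builds the argument list without running anything.

```python
def build_pdftohtml_command(pdf_path, output_base, complex_mode=True):
    """Build an argument list for pdftohtml (hypothetical helper).

    complex_mode is an assumed toggle: when True, the '-c' option
    discussed in this ticket is added; when False, the simple mode is used.
    """
    command = ["pdftohtml"]
    if complex_mode:
        command.append("-c")
    # Input PDF first, then the base name for the generated HTML.
    command += [pdf_path, output_base]
    return command


if __name__ == "__main__":
    # The list could be handed to subprocess.run() on a machine where
    # pdftohtml is installed; here it is only printed.
    print(build_pdftohtml_command("paper.pdf", "paper"))
    print(build_pdftohtml_command("paper.pdf", "paper", complex_mode=False))
```

Running both variants over a mixed set of single-column and multi-column documents and comparing the extracted text would show whether '-c' is safe to leave on everywhere.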