Ticket #699 (new enhancement)

Opened 10 years ago

Last modified 9 years ago

handling sections in PDF

Reported by: kjdon
Owned by: sjm84
Priority: moderate
Milestone: Possible 2.88 Release
Component: Collection Building: Plugins
Severity: major
Keywords:
Cc:


Users want the ability to extract section information from PDF documents, as we already can from HTML and Word.

Does -complex work for this?

Will new converters handle this better?

Change History

Changed 10 years ago by kjdon

The Adobe PDF Reference manual apparently has sections and a table of contents. Test on that.

Changed 10 years ago by kjdon

  • milestone changed from 2.84 Release to 2.85 Release

The PDFBox API has some hooks that will let us get section information out of a PDF (assuming the information is actually present in the PDF). This goes beyond the default PDFtoHTML/txt utility provided by Apache, but should be doable with a bit of programming effort on our part.
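As a rough illustration of the kind of programming effort involved: a PDF's bookmarks form a tree, which PDFBox exposes through its PDDocumentOutline and PDOutlineItem classes. The sketch below does not call PDFBox at all. OutlineNode and OutlineSections are hypothetical stand-ins, used only to show how such a tree could be walked depth-first into a numbered list of section titles. It assumes the PDF actually carries an outline, and it is not taken from Greenstone's PDFPlugin.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one PDF bookmark (outline item).
// With PDFBox, the title would come from PDOutlineItem.getTitle() and the
// children from getFirstChild()/getNextSibling(); nothing here calls PDFBox.
class OutlineNode {
    final String title;
    final List<OutlineNode> children = new ArrayList<>();

    OutlineNode(String title) {
        this.title = title;
    }

    // Appends a child bookmark and returns this node so calls can be chained.
    OutlineNode add(OutlineNode child) {
        children.add(child);
        return this;
    }
}

class OutlineSections {
    // Depth-first walk producing one line per section: a dotted number
    // (1, 1.1, 1.2, 2, ...) followed by the bookmark title.
    static List<String> flatten(List<OutlineNode> roots) {
        List<String> out = new ArrayList<>();
        walk(roots, "", out);
        return out;
    }

    private static void walk(List<OutlineNode> nodes, String prefix, List<String> out) {
        int n = 1;
        for (OutlineNode node : nodes) {
            String number = prefix.isEmpty() ? String.valueOf(n) : prefix + "." + n;
            out.add(number + " " + node.title);
            walk(node.children, number, out); // nested bookmarks become subsections
            n++;
        }
    }

    public static void main(String[] args) {
        // Made-up outline, purely for illustration.
        List<OutlineNode> roots = new ArrayList<>();
        roots.add(new OutlineNode("Introduction")
                .add(new OutlineNode("Scope"))
                .add(new OutlineNode("Conventions")));
        roots.add(new OutlineNode("Syntax"));
        for (String line : flatten(roots)) {
            System.out.println(line);
        }
    }
}
```

A PDF without bookmarks would yield an empty list here, so any real implementation would still need a fallback for such documents.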

Changed 9 years ago by sjm84

  • milestone changed from 2.85 Release to 2.86 Release

Changed 9 years ago by ak19

PDFBox now works with PDFPlugin's use_sections flag (http://trac.greenstone.org/ticket/753)

Regarding this ticket, what sort of section info is to be extracted (is it metadata embedded in the PDF)?
