Ticket #699 (new enhancement)

Opened 7 years ago

Last modified 6 years ago

handling sections in PDF

Reported by: kjdon
Owned by: sjm84
Priority: moderate
Milestone: 2.87 Release
Component: Collection Building: Plugins
Severity: major
Keywords:
Cc:

Description

Users want the ability to extract section info from PDF documents, as we already can from HTML or Word.

Does -complex work for this?

Will new converters handle this better?

Change History

Changed 7 years ago by kjdon

The Adobe PDF reference manual apparently has sections and a table of contents. Test on that.

Changed 7 years ago by kjdon

  • milestone changed from 2.84 Release to 2.85 Release

The PDFBox API has some hooks that will let us get section information out of a PDF (assuming the info is actually there in the PDF). This goes beyond the default PDFtoHTML/txt utility provided by Apache, but should be doable with a bit of programming effort on our part.
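As a rough illustration of what that programming effort might involve, here is a minimal sketch of turning a flat, document-ordered list of bookmark entries into a nested section tree. It assumes the section info is stored as PDF bookmarks (the document outline), which is only one possibility. It does not call PDFBox: in practice the entries would presumably be collected by walking PDFBox's PDDocumentOutline and PDOutlineItem objects. The names OutlineSections, Entry, Section, buildTree and render are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Greenstone or PDFBox.
class OutlineSections {

    // One bookmark entry: nesting depth (1 = top level), title, 1-based start page.
    static final class Entry {
        final int depth;
        final String title;
        final int page;

        Entry(int depth, String title, int page) {
            this.depth = depth;
            this.title = title;
            this.page = page;
        }
    }

    // A section with its nested subsections.
    static final class Section {
        final String title;
        final int startPage;
        final List<Section> children = new ArrayList<>();

        Section(String title, int startPage) {
            this.title = title;
            this.startPage = startPage;
        }
    }

    // Builds a section tree from entries listed in document order.
    static List<Section> buildTree(List<Entry> entries) {
        List<Section> roots = new ArrayList<>();
        // open.get(i) is the currently open section at depth i + 1.
        List<Section> open = new ArrayList<>();
        for (Entry e : entries) {
            // Clamp the depth so an entry can never skip a level.
            int depth = Math.max(1, Math.min(e.depth, open.size() + 1));
            // Close every open section at this depth or deeper.
            while (open.size() >= depth) {
                open.remove(open.size() - 1);
            }
            Section s = new Section(e.title, e.page);
            if (open.isEmpty()) {
                roots.add(s);
            } else {
                open.get(open.size() - 1).children.add(s);
            }
            open.add(s);
        }
        return roots;
    }

    // Renders the tree as indented text, one section per line.
    static String render(List<Section> sections, String indent) {
        StringBuilder sb = new StringBuilder();
        for (Section s : sections) {
            sb.append(indent).append(s.title)
              .append(" (p. ").append(s.startPage).append(")\n");
            sb.append(render(s.children, indent + "  "));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up entries standing in for bookmarks read from a PDF.
        List<Entry> entries = Arrays.asList(
            new Entry(1, "Introduction", 1),
            new Entry(2, "Background", 2),
            new Entry(2, "Scope", 4),
            new Entry(1, "Syntax", 7));
        System.out.print(render(buildTree(entries), ""));
    }
}
```

A tree like this could then be mapped onto whatever section markup the plugin emits. If a PDF has no bookmarks, this approach yields nothing and the sections would have to come from some other source.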

Changed 6 years ago by sjm84

  • milestone changed from 2.85 Release to 2.86 Release

Changed 6 years ago by ak19

PDFBox now works with PDFPlugin's use_sections flag (see http://trac.greenstone.org/ticket/753).

Regarding this ticket, what sort of section info is to be extracted (is it metadata embedded in the PDF)?
