Opened 13 years ago

#781 new defect

Section extraction for PDF using Word

Reported by: ak19
Owned by: nobody
Priority: moderate
Milestone: Possible 2.88 Release
Component: Collection Building
Severity: major
Keywords:
Cc:


John Rose wanted to know two things:

a) whether extracting metadata (including section data) from Word documents using a Microsoft utility is also possible on Macs.

b) whether the automatic section information extraction facility, which exists only for Word, could be used for PDF files as follows:

  1. Generate the document in Word format (in Word or Open Office).
  2. Import it into the Windows version of Greenstone and, by right-clicking on the file in the Gather view, generate an HTML file with the section information incorporated but hidden.
  3. Generate a PDF file from the Word file (for example in Open Office) and put it in the collection.
  4. Find a way to tell Greenstone that the PDF document is the srclink for the HTML document (initially by adding a link manually in the archives file, but later perhaps by finding a way to set this with a parameter in HTMLPlugin, something like an associated file). One would also have to make sure that the associated PDF file is not treated a second time as a primary file.

In this way one could do section searching on the HTML file and display the PDF file. The problem is step 4; could you advise?
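
To make the manual variant of step 4 concrete, here is a minimal Python sketch of adding a source link to an archives file by hand. It is only an illustration under stated assumptions: the XML layout in SAMPLE_ARCHIVE is a simplified stand-in for an archives file, the metadata name "srclink" is taken from the wording of step 4 and may not be the exact name Greenstone expects, and add_source_link is a hypothetical helper, not part of Greenstone.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for an archives file. ASSUMPTION: the real Greenstone
# layout may differ; adjust the element names to match an actual file.
SAMPLE_ARCHIVE = """<Archive>
  <Section>
    <Description>
      <Metadata name="Title">Example document</Metadata>
    </Description>
    <Content>Section text</Content>
  </Section>
</Archive>"""


def add_source_link(archive_xml, pdf_name, meta_name="srclink"):
    """Return archive_xml with a top-level metadata entry pointing at pdf_name.

    Hypothetical helper: meta_name is an assumption based on the ticket text.
    """
    root = ET.fromstring(archive_xml)
    description = root.find("./Section/Description")
    if description is None:
        raise ValueError("no top-level Section/Description element found")

    # Drop any earlier entry of the same name so repeated runs do not pile up.
    for existing in description.findall("Metadata"):
        if existing.get("name") == meta_name:
            description.remove(existing)

    link = ET.SubElement(description, "Metadata", {"name": meta_name})
    link.text = pdf_name
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(add_source_link(SAMPLE_ARCHIVE, "report.pdf"))
```

This covers only the manual link. It does not address the second half of step 4, which is keeping the PDF from also being processed as a primary document.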

Change History (0)
