Ticket #753 (closed defect: fixed)

Opened 7 years ago

Last modified 6 years ago

PDFBOX, pdfbox_conversion and use_sections incompatibility

Reported by: ak19 Owned by: ak19
Priority: moderate Milestone: 2.86 Release
Component: Collection Building: Plugins Severity: enhancement
Keywords: Cc:

Description

Diego reported 19/05/11 09:44:

Is there any kind of incompatibility between the pdfbox_conversion and use_sections options?

When I configure PDFPlugin with -use_sections (without -pdfbox_conversion), it cuts the document into sections and also saves the numpages metadata with the number of pages it has. But when I add -pdfbox_conversion, it doesn't split the PDF into sections and the numpages metadata has a value of 0.

Import process also gives the following error:

MetadataXMLPlugin: processing 20100819083927/metadata.xml

Converting selcom.pdf to: html ... ...done

Use of uninitialized value in string eq at /opt/gsdl-2.84/perllib/plugins/PDFPlugin.pm line 273.

HTMLPlugin processing /opt/gsdl-2.84/tmp/F420.html

Use of uninitialized value in string eq at /opt/gsdl-2.84/perllib/plugins/PDFPlugin.pm line 378.

Both of these lines are in code blocks related to the use_sections parameter.

Change History

Changed 7 years ago by ak19

  • milestone set to 2.85 Release

Changed 7 years ago by ak19

  • summary changed from pdfbox_conversion and use_sections incompatibility to PDFBOX, pdfbox_conversion and use_sections incompatibility

Robert Ntalaka reported ("Re: [greenstone-devel] GS Admin Password") on 14/05/11 21:00:

Under the GS2 server, some of the PDF documents are not processed. When you install the extension for PDFs and use the local server, all PDFs are processed. When you switch to the GS2 server and build the collections, new PDFs are rejected until you configure it to convert to pageimg.

Changed 7 years ago by sjm84

  • milestone changed from 2.85 Release to 2.86 Release

Changed 7 years ago by ak19

PDFBox conversion does not extract pages one by one, but the text of the entire document in one go.

1. If the user defines the sections by using PDF's bookmark feature, then we have to write some Java code to extract the text between bookmarks.

 http://pdfbox.apache.org/userguide/text_extraction.html

2. An easier route we can take is to split each page into its own section (even if it's not a meaningful section, it will be meaningful as a set of pages to click through, as happens with the usual PDFPlugin), although that will require launching PDFBox to convert each page.

 http://pdfbox.apache.org/commandlineutilities/ExtractText.html

Finally, we will need to mark up the start of each page with <a name="">, since that is what PDFPlugin expects at this stage. (See these lines in PDFPlugin:

# we have "<a name=1></a>" etc for each page

# it may be <A name=

my @sections = split('<[Aa] name=', $text);

)
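As a rough illustration of what that split does, here is a minimal Java sketch using the same pattern as the Perl line above. It is not PDFPlugin's code; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (PDFPlugin itself is Perl).
class AnchorSplitSketch {
    // Same pattern as the Perl split: matches "<a name=" or "<A name=".
    private static final Pattern PAGE_ANCHOR = Pattern.compile("<[Aa] name=");

    /**
     * Splits converted HTML at each page anchor. The first chunk is whatever
     * precedes the first anchor; each later chunk starts with the page number.
     */
    static List<String> splitOnPageAnchors(String html) {
        List<String> sections = new ArrayList<>();
        // Like Perl's split, the default limit drops trailing empty chunks.
        for (String chunk : PAGE_ANCHOR.split(html)) {
            sections.add(chunk);
        }
        return sections;
    }

    public static void main(String[] args) {
        String html = "<body><a name=1></a>Page one<A name=2></A>Page two</body>";
        for (String section : splitOnPageAnchors(html)) {
            System.out.println("[" + section + "]");
        }
    }
}
```

Note that the anchor prefix itself is consumed by the split, so each page chunk begins with the remainder of the tag (for example `1></a>Page one`).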

Changed 7 years ago by ak19

Actually, an easier way to implement the split on pages would be to update PDFBox with custom code that extracts text from one page at a time (based a little on the sample code provided at  http://pdfbox.apache.org/userguide/text_extraction.html) and then inserts the <a name="pagenum"></a> at the start of each extracted page's text. This shouldn't be too hard in theory, and modifying the java code means we wouldn't need to keep calling PDFBox for extracting each page either.
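The anchor-insertion half of that idea could look roughly like the sketch below. It assumes the text of each page has already been extracted into a list (the PDFBox extraction step is not shown), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only; not code from PDFBox or Greenstone.
class PageAnchorSketch {
    /**
     * Concatenates per-page text, prefixing each page with <a name="N"></a>,
     * where N is the 1-based page number.
     */
    static String joinWithPageAnchors(List<String> pageTexts) {
        StringBuilder out = new StringBuilder();
        int pageNum = 1;
        for (String pageText : pageTexts) {
            out.append("<a name=\"").append(pageNum).append("\"></a>");
            out.append(pageText);
            pageNum++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // ASSUMPTION: these strings stand in for text extracted one page at a time.
        List<String> pages = Arrays.asList("First page text", "Second page text");
        System.out.println(joinWithPageAnchors(pages));
    }
}
```

Because the output still contains `<a name=` at the start of every page, the existing split in PDFPlugin would find one section per page.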

Changed 7 years ago by ak19

  • owner changed from nobody to ak19
  • component changed from Collection Building to Collection Building: Plugins
  • severity changed from major to enhancement

use_sections is now implemented:

1. Updated PDFBoxConverter to set the convert_to member field to HTML if html was the output type chosen.

2. Got the latest version of the PDFBox pre-built binary jar file: pdfbox-app-1.5.0.jar

This generates div tags (with page-attributes) to embed each page in.

3. Updated PDFPlugin to work with the paging-related div tags that PDFBox outputs, so that the PDFPlugin can use that to split the pages.

(And tested that PDFPlugin still works with pdf2html and its paging-related anchor tags. Besides, PDFPlugin adds the same tags to the PDFBox output, in order to process it all the same way.)
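To illustrate the idea of adding the anchor tags to the PDFBox output, here is a small Java sketch. The real change is in PDFPlugin's Perl code and is not reproduced here. The exact form of the page div's opening tag is an assumption made for the example, as are the class and method names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class PageDivSketch {
    // ASSUMPTION: a page div opens with a style attribute starting with
    // "page-break-before:always". The ticket only says the divs carry
    // page attributes, so adjust this pattern to the actual output.
    private static final Pattern PAGE_DIV =
            Pattern.compile("<div style=\"page-break-before:always[^\"]*\">");

    /** Inserts <a name=N></a> in front of each page div, numbering pages from 1. */
    static String addPageAnchors(String html) {
        Matcher m = PAGE_DIV.matcher(html);
        StringBuffer out = new StringBuffer();
        int pageNum = 0;
        while (m.find()) {
            pageNum++;
            String replacement = "<a name=" + pageNum + "></a>" + m.group();
            // quoteReplacement keeps any '$' or '\' in the tag literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String div = "<div style=\"page-break-before:always; page-break-after:always\">";
        String html = "<body>" + div + "<p>one</p></div>" + div + "<p>two</p></div></body>";
        System.out.println(addPageAnchors(html));
    }
}
```

Keeping the original div in place and only prepending the anchor means the rest of the HTML is untouched, while the anchor-based split can be reused for both converters.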

Will move the other PDF related problem into its own ticket so I can close this one.

Changed 7 years ago by ak19

  • status changed from new to closed
  • resolution set to fixed

Changed 7 years ago by ak19

The pdfbox jar can be tested on a collection "pdfcol" containing the PDF Tutorial's pdf03.pdf, saving the output as "3.html", with the following cmd (replace "/full/path/to"):

/full/path/to/greenstone2/ext/pdf-box/lib/java>java -cp /full/path/to/greenstone2/ext/pdf-box/lib/java/pdfbox-app.jar:/full/path/to/greenstone2/gli/lib/apache.jar:/full/path/to/greenstone2/gli/lib/qfslib.jar org.apache.pdfbox.ExtractText -html /full/path/to/greenstone2/collect/pdfcol/import/pdf03.pdf /full/path/to/greenstone2/collect/pdfcol/import/3.html

Changed 6 years ago by ak19

Additionally, commit 24197:

Fixed a bug Sam discovered in the latest PDFBox, which I recently committed in place of the earlier one (the latest one adds page separator tags, so that we can do paging when PDFPlugin's use_sections is turned on). The PDFBox code uses the System.lineSeparator whenever a newline is required, but this resolves to the empty string for me, so the formatting of the html output is incorrect. This is especially noticeable when processing the GS3 manual PDF, where the table of contents ends up appearing all on one line. The change made to the PDFBox src code is in its util folder, in PDFText2HTML.java: setLineSeparator(htmlLineSeparator); (instead of systemLineSeparator), where String htmlLineSeparator = "<br />" + systemLineSeparator;
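The effect of that change can be shown with a small standalone Java sketch. It does not use PDFBox. The joinLines helper and the class name are hypothetical, and only the "<br />" + systemLineSeparator expression comes from the commit description.

```java
// Hypothetical demo, for illustration only; not the PDFText2HTML source.
class HtmlLineSeparatorSketch {
    /** Builds the separator described above: "<br />" followed by the system separator. */
    static String htmlLineSeparator(String systemLineSeparator) {
        return "<br />" + systemLineSeparator;
    }

    /** Appends the given separator after every line, as a text writer would. */
    static String joinLines(String[] lines, String separator) {
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            out.append(line).append(separator);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] toc = {"1 Introduction", "2 Installation"};
        // With an empty system separator, the entries run together on one line.
        System.out.println(joinLines(toc, ""));
        // With the HTML separator, each entry still ends in a visible line break.
        System.out.println(joinLines(toc, htmlLineSeparator("")));
    }
}
```

Even when the system separator is empty, the "<br />" keeps the lines apart once the HTML is rendered.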
