Ticket #941 (closed enhancement: fixed)

Opened 2 weeks ago

Last modified 9 days ago

PDF conversion should output searchable paged images, so text needs to be extracted as well

Reported by: ak19 Owned by: ak19
Priority: moderate Milestone: 3.09 Release
Component: Greenstone2&3 Severity: major
Keywords: pdfbox Cc:


Kathy discovered there was a need for the following (copied from her email):

"how we can go about generating image plus text for pdfs. currently you can have item file with images, or html text which is ugly. It would be good to have item file with images plus text. then you can view images but search on the extracted text. do we write our own pdf box program? or maybe icecite already does this??"

Change History

Changed 2 weeks ago by ak19

  • status changed from new to closed
  • resolution set to fixed

Committed the solution along with Kathy's more narrowed-down problem description that points out how PDFBox may be modified to achieve this (whereas icecite can only convert to txt, json or xml):


For further details, also read the Readme file committed with the above. It also contains notes on additional work to be done, particularly renaming the customised pdfbox-app to something like gs-pdfbox-app, to contain the usual preferred gs prefix.

Changed 9 days ago by ak19

Further modifications at http://trac.greenstone.org/changeset/32197

The "Remaining work" section has been resolved:

* Licence issues taken care of, but still need to confirm that it's done correctly.

* No need for a gs prefix on pdfbox-app.jar, because the new custom Java file just lives in our pdfbox code location in a new greenstone Java package and is compiled with a basic javac command (javac -cp classpath file.java) against an out-of-the-box pdfbox-app.jar. The produced class file ends up in the new build folder (not in pdfbox-app.jar). The class is launched by its new Java package name from PDFBoxConverter.pm.

* Extracted text for each page does not cause the "too little text" message to appear. This used to appear when I had a syntax error in PDFBoxConverter.pm, which resulted in pdfbox_conversion becoming deactivated and other things failing further on in the build process when processing the newer-version PDF test file (with pdfbox_conversion off, the old pdftohtml tool is invoked, and it can't handle newer versions of PDF).
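
For illustration only, the compile-and-launch arrangement described in the list above might look roughly like the following sketch. It only builds the two command lines and does not run them. All paths, the package name org.greenstone.pdfbox and the class name PagedImagesAndText are hypothetical placeholders, not the names actually committed. The single PDF path argument passed to the custom class is also an assumption.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

public class PdfBoxCommandSketch {
    // Hypothetical locations and names; the real ones live in the Greenstone pdfbox code area.
    static final String PDFBOX_JAR = "lib/java/pdfbox-app.jar";   // unmodified, out-of-the-box jar
    static final String BUILD_DIR = "build";                      // class files go here, not into the jar
    static final String SOURCE_FILE = "src/org/greenstone/pdfbox/PagedImagesAndText.java";
    static final String MAIN_CLASS = "org.greenstone.pdfbox.PagedImagesAndText";

    /** Compile the custom source against the stock jar, writing class files to BUILD_DIR. */
    static List<String> compileCommand() {
        return Arrays.asList("javac", "-cp", PDFBOX_JAR, "-d", BUILD_DIR, SOURCE_FILE);
    }

    /** Launch the class by its fully qualified name, with the jar and the build folder on the classpath. */
    static List<String> launchCommand(String pdfPath) {
        String classpath = PDFBOX_JAR + File.pathSeparator + BUILD_DIR;
        // Passing the PDF path as the only argument is an assumption made for this sketch.
        return Arrays.asList("java", "-cp", classpath, MAIN_CLASS, pdfPath);
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", compileCommand()));
        System.out.println(String.join(" ", launchCommand("example.pdf")));
    }
}
```

Because the class file sits in the build folder beside an untouched pdfbox-app.jar, the jar itself needs no renaming; a caller such as PDFBoxConverter.pm only has to put both locations on the classpath.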
