Ticket #942 (new defect)

Opened 5 months ago

Last modified 4 months ago

New pdftohtml with Xpdf tools - works with newer PDFs too

Reported by: ak19 Owned by: ak19
Priority: moderate Milestone: 3.09 Release
Component: Collection Building Severity: major
Keywords: xpdf Cc:

Description

Kathy found that users on the mailing list wanted more HTML output options with PDFPlugin. PDFBox's pagedimg output option was modified to produce img+text, but Kathy was hoping there were more possibilities for actual PDF to HTML support out there.

Dr Bainbridge first found PDFtoDOM, which is based on PDFBox, but it produced unsatisfactory HTML: fonts were sometimes not extracted, the fonts often made the display hard to read because characters overlapped, and a <div> element was placed around every word rather than every line.

Then Dr Bainbridge found Xpdf Tools, which contains a newer pdftohtml that produced results we liked. This pdftohtml tool outputs an image of each PDF page's background with the text overlaid on it, all as HTML. It produces one HTML document per page, and we manipulate these into a single sectionalised HTML document.

To get Xpdf tools to work with GS in this way:

1. Downloaded Xpdf tools binaries for Lin/Win/Mac, eventually to be compiled up for Lin & Mac

2. To manipulate the HTML DOM produced, Dr Bainbridge found the perl module Mojo::DOM, which he compiled up.

3. Then the code was modified to make use of these. The list of commit revisions so far follows below.

4. This led to the conclusion that PDFPlugin needed to be restructured: its configuration options were already complicated and mutually contradictory once pdfbox_conversion was included, and would become even more so with the inclusion of Xpdf Tools.

The commit revisions thus far that make use of Xpdf Tools' pdftohtml and its pdftotext to finally support PDF to text conversion on Windows are as follows. None of these commits concern restructuring the PDFPlugin as yet.

http://trac.greenstone.org/changeset/32205 - http://trac.greenstone.org/changeset/32210,

http://trac.greenstone.org/changeset/32215,

http://trac.greenstone.org/changeset/32219 - http://trac.greenstone.org/changeset/32224
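Assuming the ranges above are inclusive, they expand to thirteen individual changesets. A minimal sketch of that expansion in plain POSIX shell follows; `changeset_urls` is a made-up helper name, not something provided by Greenstone or Trac.

```shell
#!/bin/sh
# Print one changeset URL per revision in the inclusive range FIRST..LAST.
# changeset_urls is a hypothetical helper used only for illustration.
changeset_urls() {
    first=$1
    last=${2:-$1}   # a single revision is treated as a range of length one
    rev=$first
    while [ "$rev" -le "$last" ]; do
        printf 'http://trac.greenstone.org/changeset/%s\n' "$rev"
        rev=$((rev + 1))
    done
}

# The three groups listed in this ticket description.
changeset_urls 32205 32210
changeset_urls 32215
changeset_urls 32219 32224
```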

Note that the Xpdf tools binaries for mac have been committed to an svn ignored folder and that they're not yet automatically checked out. Either we get Xpdf tools to compile from src (if we can get past the fact that Xpdf tools use CMake to configure and build rather than autotools' configure script that we're used to) or we find a better SVN location to put the Mac binaries of Xpdf tools.

Change History

Changed 5 months ago by ak19

XPDF

Xpdf was last modified in 2017 and includes its own pdftohtml utility, whereas the old "pdftohtml" tool that GS used was last updated in 2013 (and itself made use of Xpdf, possibly an older version).

1.  https://www.xpdfreader.com/download.html

According to the README file in the Linux binary distribution of Xpdf Tools, the Xpdf viewer requires the Qt toolkit, but the Xpdf Tools do not. I have not read the INSTALL file to confirm whether the same holds when compiling the command-line tools. (But in that case, can't we just include the tools binaries available for all 3 OSes instead of compiling on each platform?)

Using Xpdf's pdftohtml tool:

greenstone@machine:~/Downloads/xpdf-tools-linux-4.00/bin64$ ./pdftohtml -z 1.5 ~/Downloads/ApacheLicence.pdf licence

where licence is the output folder for the generated HTML.
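The call above can be wrapped so that it fails early on bad input, and the per-page files can then be listed in page order ready for merging. This is only a sketch. The function names and the `PDFTOHTML` variable are invented here, and the zoom of 1.5 is taken from the command above. The `page<N>.html` naming is an assumption based on the pdftohtml man page, not something verified in this ticket. The check for an existing output folder likewise assumes that pdftohtml wants to create that folder itself.

```shell
#!/bin/sh
# Convert one PDF into a folder of per-page HTML files at zoom 1.5.
# PDFTOHTML (hypothetical variable) may name a specific binary;
# otherwise whatever pdftohtml is on PATH gets used.
pdf_to_html_dir() {
    pdf=$1
    outdir=$2
    tool=${PDFTOHTML:-pdftohtml}
    if [ ! -f "$pdf" ]; then
        echo "no such PDF: $pdf" >&2
        return 2
    fi
    if [ -e "$outdir" ]; then
        # Assumption: the tool creates the folder and objects if it exists.
        echo "output folder already exists: $outdir" >&2
        return 3
    fi
    "$tool" -z 1.5 "$pdf" "$outdir"
}

# List page<N>.html files in numeric page order (page2 before page10).
# The page<N>.html naming is an assumption, see the note above.
pages_in_order() {
    dir=$1
    for f in "$dir"/page*.html; do
        [ -e "$f" ] || continue      # glob matched nothing
        n=${f##*/page}               # drop the directory and "page" prefix
        n=${n%.html}                 # drop the extension, leaving the number
        printf '%s\t%s\n' "$n" "$f"
    done | sort -n | cut -f2
}
```

Sorting on the extracted page number matters because a plain glob would put page10.html before page2.html.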

*  https://www.xpdfreader.com/pdftohtml-man.html

*  https://linux.die.net/man/5/xpdfrc

(Configuration flags you can put into ~/.xpdfrc to use as defaults when running xpdf tool commands)
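As an illustration of such defaults, the sketch below writes a small configuration file for text extraction. It is hedged on a few points: `write_xpdfrc` is a made-up helper, the two settings shown (`textEncoding`, `textEOL`) are taken from the xpdfrc man page linked above rather than from anything Greenstone ships, and the file is written to a caller-chosen path instead of ~/.xpdfrc so that existing per-user settings are left alone.

```shell
#!/bin/sh
# Write a minimal xpdfrc-style config to the path given as $1.
# write_xpdfrc is a hypothetical helper used only for illustration.
write_xpdfrc() {
    cfg=$1
    cat > "$cfg" <<'EOF'
# Defaults for text extraction, see xpdfrc(5)
textEncoding UTF-8
textEOL unix
EOF
}
```

A file written this way could then be passed explicitly with the tools' `-cfg` switch, for example `pdftotext -cfg ./gs-xpdfrc in.pdf out.txt`, where `gs-xpdfrc`, `in.pdf` and `out.txt` are placeholder names.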

2. We're using Xpdf Tools version: xpdf-tools-linux-4.00

Changed 5 months ago by ak19

Mojo::DOM (Perl)

1. Before Dr Bainbridge found Mojo::DOM, he looked at
*  https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers
*  http://radar.oreilly.com/2014/02/parsing-html-with-perl-2.html

2. Main links for Mojo::DOM
*  https://mojolicious.org/perldoc/Mojo/DOM
*  https://metacpan.org/pod/Mojo::DOM

Dependencies:  http://deps.cpantesters.org/?module=Mojo%3A%3ADOM;perl=latest

Once Mojo::DOM's src has been downloaded, follow Dr Bainbridge's sequence of commands for building the Mojo::DOM CPAN module of perl below.
We're using this module to parse the HTML output by XPDF tool pdftohtml.

mkdir cpan
tar xvzf Mojolicious-7.84.tar.gz
cd Mojolicious-7.84/
perl ./Makefile.PL PREFIX=$(pwd)/installed
make
make install
cp -r installed/share/perl/5.18.2 ../cpan
cd ..
export PERL5LIB=$(pwd)/cpan

emacs -nw test.pl

Add the shebang and a version declaration at the top of test.pl:

#!/usr/bin/perl -w
use v5.10;

Save the file, make it executable and test-run the test.pl program included with the Mojo package:

chmod a+x test.pl
./test.pl

3. The relevant folders of the compiled up Mojo module were then committed to SVN to become part of perllib/cpan:
http://trac.greenstone.org/changeset/32205
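
The build sequence above hardcodes the Perl version (5.18.2) in the path that gets copied into cpan. As a hedged alternative, the sketch below looks for Mojo/DOM.pm under the install prefix and prints the library directory that contains it. `find_mojo_libdir` is an invented name, and the only assumption is that `make install` placed Mojo/DOM.pm somewhere below the given prefix.

```shell
#!/bin/sh
# find_mojo_libdir is a hypothetical helper: it prints the directory under
# PREFIX that contains Mojo/DOM.pm, whatever Perl version is in the path.
find_mojo_libdir() {
    prefix=$1
    found=$(find "$prefix" -type f -path '*/Mojo/DOM.pm' 2>/dev/null | head -n 1)
    [ -n "$found" ] || return 1     # module not installed under PREFIX
    # Strip the trailing /Mojo/DOM.pm to get the library root.
    printf '%s\n' "${found%/Mojo/DOM.pm}"
}
```

From inside Mojolicious-7.84/, `cp -r "$(find_mojo_libdir installed)"/. ../cpan/` would copy the modules without naming the Perl version. With PERL5LIB set, `perl -MMojo::DOM -e 1` is a quick check that the module loads.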

Changed 5 months ago by ak19

PDF2DOM: tried it out, but wasn't what we wanted

At present, Greenstone does not make use of PDFtoDOM.
We had earlier tried out the PDFBox-based PDFtoDOM for converting PDF to HTML, but settled on Xpdf Tools instead. In case we ever want to proceed with PDFtoDOM, here are some instructions.

Using PDFBox to convert a PDF to full HTML, where both images and text are produced and placed correctly with respect to each other, is tricky; see  https://stackoverflow.com/questions/9671239/pdfbox-convert-a-pdf-to-text-or-html-including-images-from-the-pdf
(Google: pdfbox to convert pdf to html with images)

PDF2DOM tool (based on PDFBox) can be used to convert PDF to HTML with images
*  http://cssbox.sourceforge.net/pdf2dom/documentation.php
* Got the command line jar tool, PDFToHTML.jar version 1.7, from  https://sourceforge.net/projects/cssbox/files/Pdf2DOM/
* Further information and source code at  https://github.com/radkovo/Pdf2Dom
* API:  http://cssbox.sourceforge.net/pdf2dom/api/index.html

Running

java -jar PDFToHTML.jar <infile> [<outfile>]

greenstone@machine-name:~/Downloads$ java -jar PDFToHTML.jar SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2

It will output the page, but you'll see the following output indicating that the logger is not displaying anything:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See  http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

See  https://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder

To see the error output, download the SLF4J simple jar and run as follows:

greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2

The PDF above was produced by MS Word (archive format) and works fine: a font folder is generated containing the extracted fonts.

The following is a PDF produced from the same doc file by the latest LibreOffice installed on Windows:

ApacheLicencePDFA_FromODT.pdf

But running the same command on it produces the following font errors:

greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA_FromODT.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2

[main] INFO org.reflections.Reflections - Reflections took 163 ms to scan 1 urls, producing 36 keys and 222 values
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException

So: fonts are extracted when the source PDF was generated by MS Word's doc-to-PDF conversion, but not when LibreOffice was used to convert the same .doc to the source PDF.
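
To compare PDFs from different producers more quickly, the font warnings can be reduced to a list of the fonts that failed to load. The sketch below assumes only the message format seen in the log above; `failed_fonts` is a made-up helper name.

```shell
#!/bin/sh
# Read Pdf2Dom log output on stdin and print each font named in an
# "Error loading font" warning once, sorted. Hypothetical helper.
failed_fonts() {
    sed -n "s/.*Error loading font '\([^']*\)'.*/\1/p" | sort -u
}

# Two of the warnings from this ticket as sample input.
failed_fonts <<'EOF'
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
EOF
# prints:
# BAAAAA+Georgia
# CAAAAA+Georgia-Bold
```

When piping the java command into this function, add `2>&1` first, since slf4j-simple writes to standard error by default.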

Further, the HTML file produced contains <div> elements around every word rather than around every line or paragraph, and this is a problem when we build the HTML in Greenstone: on preview, each word is obviously on its own line as the styling information in the <head> sections of the HTML was not maintained. Also, the HTML files generated by PDFtoDOM are unnecessarily large because of the <div>s around every word, each of which has long sections of duplicated inline styling.

Changed 4 months ago by ak19

The next step was to compile up xpdf-tools from source on both linux and mac to create static executables of pdftohtml, pdftotext and other tools. Then generate the tarballs that we will be extracting (or maybe commit each tool in the linux/bin and mac/bin folder as before).

The changesets to do with the above are from http://trac.greenstone.org/changeset/32225 to http://trac.greenstone.org/changeset/32254 inclusive.

Changed 4 months ago by ak19

Further commits related to compiling up static xpdf-tools binaries and getting them into the correct location on trac:

http://trac.greenstone.org/changeset/32255 to http://trac.greenstone.org/changeset/32268

Changed 4 months ago by ak19

Some further minor changes related to compiling up xpdf-tools statically on Linux/Mac:

http://trac.greenstone.org/changeset/32269 and http://trac.greenstone.org/changeset/32272

Changed 4 months ago by ak19

Restructuring and refactoring PDFPlugin (old pdftohtml + pdfbox) into

- PDFv1Plugin: uses only the old pdftohtml

- PDFv2Plugin: uses pdfbox and xpdftools (from xpdftools, just pdftohtml and pdftotext at present).

Commits related to the refactoring:

http://trac.greenstone.org/changeset/32273 - http://trac.greenstone.org/changeset/32275, http://trac.greenstone.org/changeset/32277 - http://trac.greenstone.org/changeset/32287

Note that "PDFPlugin", which is being deprecated, is for now still present for backwards compatibility with those migrating from an earlier version of GS to the newer one.

IMPORTANT: PDFv2Plugin assumes that the pdf-box extension is installed and that everything is set up for it, like perl. It will therefore work out of the box for GS3, but not for GS2, where the pdf-box extension has to be manually downloaded into the right folder.

Changed 4 months ago by ak19

To get all the files modified by this ticket (so far):

* gs2build/setup.bat

http://trac.greenstone.org/browser/main/trunk/greenstone2/setup.bat?format=txt

* gs2build/setup.bash

http://trac.greenstone.org/browser/main/trunk/greenstone2/setup.bash?format=txt

* gs2build/collect/modelcol/etc/collectionConfig.xml

http://trac.greenstone.org/browser/main/trunk/greenstone2/collect/modelcol/etc/collectionConfig.xml?format=txt

* gs2build/bin/script/mkcol.pl

http://trac.greenstone.org/browser/main/trunk/greenstone2/bin/script/mkcol.pl?format=txt

* gs2build/bin/script/gsConvert.pl

http://trac.greenstone.org/browser/main/trunk/greenstone2/bin/script/gsConvert.pl?format=txt

* gs2build/perllib/util.pm

http://trac.greenstone.org/browser/main/trunk/greenstone2/perllib/util.pm?format=txt

* gs2build/perllib/plugins/CommonUtil.pm

http://trac.greenstone.org/browser/main/trunk/greenstone2/perllib/plugins/CommonUtil.pm?format=txt

* gs2build/perllib/plugins/ConvertBinaryFile.pm

http://trac.greenstone.org/browser/main/trunk/greenstone2/perllib/plugins/ConvertBinaryFile.pm?format=txt

* gs2build/perllib/plugins/PDFPlugin.pm

http://trac.greenstone.org/browser/main/trunk/greenstone2/perllib/plugins/PDFPlugin.pm?format=txt

* gs2build/perllib/plugins/PDFv1Plugin.pm

http://trac.greenstone.org/browser/main/trunk/greenstone2/perllib/plugins/PDFv1Plugin.pm?format=txt

* gs2build/perllib/plugins/PDFv2Plugin.pm

http://trac.greenstone.org/browser/main/trunk/greenstone2/perllib/plugins/PDFv2Plugin.pm?format=txt

* gs2build/perllib/strings.properties

http://trac.greenstone.org/export/32290/main/trunk/greenstone2/perllib/strings.properties

PDF-Box tarball:

* http://trac.greenstone.org/export/32292/gs2-extensions/pdf-box/trunk/pdf-box-java.tar.gz

* http://trac.greenstone.org/export/32292/gs2-extensions/pdf-box/trunk/pdf-box-java.zip

Xpdf Tools binaries, one of the following depending on OS:

* http://trac.greenstone.org/browser/main/trunk/greenstone2/bin/linux/xpdf-tools

* http://trac.greenstone.org/browser/main/trunk/greenstone2/bin/darwin/xpdf-tools

* http://trac.greenstone.org/browser/main/trunk/binaries/windows/bin/xpdf-tools

* gs2build/perllib/cpan/Mojo

http://trac.greenstone.org/browser/main/trunk/greenstone2/perllib/cpan/Mojo

* gs2build/perllib/cpan/Mojolicious

http://trac.greenstone.org/browser/main/trunk/greenstone2/perllib/cpan/Mojolicious
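
Most of the single-file entries above follow one pattern: the gs2build/ prefix maps to main/trunk/greenstone2/ on Trac, with ?format=txt appended. A minimal sketch of that mapping follows, with `gs2_trac_url` as an invented helper name. It does not cover the directory entries (xpdf-tools, Mojo, Mojolicious) or strings.properties, which is listed with an export URL instead.

```shell
#!/bin/sh
# Map a gs2build/ path to its plain-text download URL on Trac.
# gs2_trac_url is a hypothetical helper; the URL pattern is copied
# from the entries listed in this comment.
gs2_trac_url() {
    rel=${1#gs2build/}    # drop the local checkout prefix, if present
    printf 'http://trac.greenstone.org/browser/main/trunk/greenstone2/%s?format=txt\n' "$rel"
}

# A few of the files from the list above.
for path in \
    gs2build/bin/script/gsConvert.pl \
    gs2build/perllib/plugins/PDFv1Plugin.pm \
    gs2build/perllib/plugins/PDFv2Plugin.pm
do
    gs2_trac_url "$path"
done
```

The printed URLs can be handed to any downloader. Which tool to use and where to save the files are left open here.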

Changed 4 months ago by ak19

Also related to restructuring and refactoring PDFPlugin:

http://trac.greenstone.org/changeset/32294
