==== Compiling Xpdf Tools ====
Compiling Xpdf Tools from source requires CMake.
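
A typical out-of-source CMake build is sketched below. This is an assumption based on the usual CMake workflow, not a transcript of what was run here: the function name build_xpdf_tools, its two arguments, and the CMAKE and MAKE override variables are all hypothetical. The INSTALL file in the Xpdf source remains the authoritative reference for the actual steps.

```shell
# Sketch only: configure and build an Xpdf source tree with CMake.
# Usage: build_xpdf_tools <xpdf-source-dir> <build-dir>
# CMAKE and MAKE are hypothetical knobs for pointing at other binaries.
build_xpdf_tools() (
    # Resolve the source dir to an absolute path, since we cd away below.
    src_dir=$(cd "$1" && pwd) || exit 1
    build_dir="$2"
    mkdir -p "$build_dir" || exit 1
    cd "$build_dir" || exit 1
    # Configure into the separate build dir, then compile.
    "${CMAKE:-cmake}" -DCMAKE_BUILD_TYPE=Release "$src_dir" || exit 1
    "${MAKE:-make}"
)
```

Example call (both paths are placeholders): build_xpdf_tools ~/Downloads/xpdf-4.00 ~/Downloads/xpdf-build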

==== Information related to Xpdf Tools, and to general PDF to HTML conversion ====
__________________________________________________________
Mojo::DOM (Perl)
__________________________________________________________

1. Before Dr Bainbridge found Mojo::DOM, he looked at
* https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers
* http://radar.oreilly.com/2014/02/parsing-html-with-perl-2.html

2. Main links for Mojo::DOM
* https://mojolicious.org/perldoc/Mojo/DOM
* https://metacpan.org/pod/Mojo::DOM
	Dependencies: http://deps.cpantesters.org/?module=Mojo%3A%3ADOM;perl=latest

Once you've downloaded the Mojo::DOM source (it is distributed as part of Mojolicious), follow Dr Bainbridge's sequence of commands below to build the Mojo::DOM CPAN module for Perl.
We'll use this module to parse the HTML output by the Xpdf tool pdftohtml.


mkdir cpan
tar xvzf Mojolicious-7.84.tar.gz
cd Mojolicious-7.84/
perl ./Makefile.PL PREFIX=`pwd`/installed
make
make install
cp -r installed/share/perl/5.18.2 ../cpan
cd ..
export PERL5LIB=`pwd`/cpan

emacs -nw test.pl

In test.pl, start with the shebang line below and add in 'use v5.10;' after it:
#!/usr/bin/perl -w

chmod a+x test.pl
./test.pl


__________________________________________________________
XPDF
__________________________________________________________

Xpdf was last modified in 2017 and includes its own pdftohtml utility, whereas the old "pdftohtml" tool that GS used was last updated in 2013 (and itself made use of Xpdf, possibly an older version).

1. https://www.xpdfreader.com/download.html

According to the README file in the Linux binary distribution of Xpdf Tools, the Xpdf Viewer requires the Qt toolkit, but the Xpdf Tools do not. I have not read the INSTALL file to confirm whether the same holds when compiling the command line tools. (But in that case, can't we just include the tools binaries available for all 3 OSes, instead of compiling on each platform?)

Using Xpdf's pdftohtml tool:
greenstone@machine-name:~/Downloads/xpdf-tools-linux-4.00/bin64$ ./pdftohtml -z 1.5 ~/Downloads/ApacheLicence.pdf licence

	where licence is the output folder that the HTML is written into
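
To convert several PDFs the same way, the invocation above can be wrapped in a small shell function. This is a minimal sketch: the function name pdf_to_html_dir, the PDFTOHTML override variable, and the choice of naming each output folder after its PDF are assumptions, not part of Xpdf. Only the -z 1.5 option and the "PDF file, then output folder" argument order are taken from the command above.

```shell
# Sketch only: run Xpdf's pdftohtml on one PDF, writing into <out-root>/<pdf-name>.
# Usage: pdf_to_html_dir <file.pdf> <out-root>
# PDFTOHTML is a hypothetical knob, e.g. to point at the bin64 copy of the tool.
pdf_to_html_dir() (
    pdf="$1"
    out_root="$2"
    name=$(basename "$pdf" .pdf)   # e.g. ApacheLicence.pdf -> ApacheLicence
    # Assumption: <out-root> already exists and pdftohtml creates the
    # per-document folder under it.
    "${PDFTOHTML:-pdftohtml}" -z 1.5 "$pdf" "$out_root/$name"
)
```

For example, with ~/Downloads/html as a placeholder output root:
	for f in ~/Downloads/*.pdf; do pdf_to_html_dir "$f" ~/Downloads/html; done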

https://www.xpdfreader.com/pdftohtml-man.html
https://linux.die.net/man/5/xpdfrc
(Configuration flags you can put into ~/.xpdfrc to use as defaults when running xpdf tool commands)

2. We're using Xpdf Tools version: xpdf-tools-linux-4.00

__________________________________________________________
PDF2DOM: tried it out, but wasn't what we wanted
__________________________________________________________
Using PDFBox to convert a PDF to full HTML, with both images and text placed correctly with respect to each other, is tricky. See https://stackoverflow.com/questions/9671239/pdfbox-convert-a-pdf-to-text-or-html-including-images-from-the-pdf
(Google: pdfbox to convert pdf to html with images)

PDF2DOM tool (based on PDFBox) to convert PDF to HTML with images
* http://cssbox.sourceforge.net/pdf2dom/documentation.php
* Got the command line jar tool, PDFToHTML.jar version 1.7, from https://sourceforge.net/projects/cssbox/files/Pdf2DOM/
* Further information and source code at https://github.com/radkovo/Pdf2Dom
* API: http://cssbox.sourceforge.net/pdf2dom/api/index.html


1. Running

java -jar PDFToHTML.jar <infile> [<outfile>]

	greenstone@machine-name:~/Downloads$ java -jar PDFToHTML.jar SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2


This outputs the page, but you'll also see the following messages, indicating that no logger is bound and so nothing is being logged:
	SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
	SLF4J: Defaulting to no-operation (NOP) logger implementation
	SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

See https://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder

To see error output, download the SLF4J simple jar (slf4j-simple-1.7.25.jar) and run as follows:

	greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2

The above PDF was produced by MS Word (archive format) and converts fine: a font folder is generated containing the extracted fonts.

The following PDF was produced from the same doc file by the latest LibreOffice installed on Windows:
	ApacheLicencePDFA_FromODT.pdf
But running the same command on it produces the following font errors:

greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA_FromODT.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
[main] INFO org.reflections.Reflections - Reflections took 163 ms to scan 1 urls, producing 36 keys and 222 values 
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException

In summary: fonts get extracted during conversion to HTML when the source PDF was generated by MS Word's .doc to PDF conversion, but not when LibreOffice was used to convert the .doc to the source PDF.

2. Check version of PDF
https://www.codeproject.com/Questions/167550/How-to-check-different-versions-of-PDF
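
A PDF file starts with a header of the form %PDF-x.y, so the version can usually be read straight from the first bytes of the file. The sketch below does that in shell. The function name pdf_header_version is hypothetical, and from PDF 1.4 onwards a document may override the header version inside the file, which this check does not detect.

```shell
# Sketch only: print the version from a PDF's "%PDF-x.y" header.
# Prints nothing and returns non-zero if the file does not start with one.
pdf_header_version() (
    # The header occupies the first 8 bytes, e.g. "%PDF-1.4".
    version=$(head -c 8 "$1" | LC_ALL=C sed -n 's/^%PDF-\([0-9]\.[0-9]\)$/\1/p')
    [ -n "$version" ] && printf '%s\n' "$version"
)
```

Example call: pdf_header_version ~/Downloads/ApacheLicencePDFA.pdf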


3. pdf to html command line conversion open source
https://stackoverflow.com/questions/8370014/how-to-convert-pdf-to-html

"Download

    pdfbox-2.0.3.jar
    fontbox-2.0.3.jar
    preflight-2.0.3.jar
    xmpbox-2.0.3.jar
    pdfbox-tools-2.0.3.jar
    pdfbox-debugger-2.0.3.jar

from http://pdfbox.apache.org/
...

PLEASE NOTE: Images do not get pushed to the HTML output."


4. Need a way to check whether a PDF contains images: if it does, use pdf2dom, else use the basic pdfbox conversion to HTML (fewer div tags with inline style markup)?
https://stackoverflow.com/questions/46215879/count-images-in-pdf-using-pdfbox
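
One possible check, sketched below, is to let pdfimages (one of the Xpdf Tools) extract the images into a temporary folder and see whether anything was written. This swaps in pdfimages for the PDFBox approach in the link above and has not been tried on our PDFs. The function names pdf_has_images and choose_converter and the PDFIMAGES override variable are hypothetical.

```shell
# Sketch only: succeed if pdfimages extracts at least one image from the PDF.
# Caveat: a missing or failing pdfimages looks the same as "no images" here.
pdf_has_images() (
    tmp_dir=$(mktemp -d) || exit 2
    # pdfimages takes <PDF-file> <image-root> and writes its output files
    # using image-root as the name prefix, here inside tmp_dir.
    "${PDFIMAGES:-pdfimages}" "$1" "$tmp_dir/img" >/dev/null 2>&1
    extracted=$(ls -A "$tmp_dir")
    rm -rf "$tmp_dir"
    [ -n "$extracted" ]
)

# Print the converter that the question above would pick for this PDF.
choose_converter() {
    if pdf_has_images "$1"; then
        echo pdf2dom
    else
        echo pdfbox
    fi
}
```

Example call: choose_converter ~/Downloads/SampleDoc1.pdf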


UNUSED
Googled for: java tool convert pdf version
* https://stackoverflow.com/questions/11137912/all-inclusive-tool-to-convert-different-types-of-documents-to-pdf
* https://www.qoppa.com/pdfprocess/
jPDFProcess – Java PDF Library to Create, Manipulate PDF
(appears to be payware)
* https://www.gnostice.com/nl_article.asp?id=95&t=How_to_Change_the_PDF_Version_of_a_Document
How to Convert a PDF Document to an Older or Newer Version
uses .NET
* http://www.baeldung.com/pdf-conversions-java
PDF Conversions in Java
e.g. PDF to html and html to PDF


__________________________________________________________

Running the released PDFToHTML.jar on SampleDoc1.pdf produces the same font errors:

greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
[main] INFO org.reflections.Reflections - Reflections took 153 ms to scan 1 urls, producing 36 keys and 222 values 
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException


Running a self-built pdf2dom-1.8-SNAPSHOT.jar instead fails outright, because the FontVerter classes it depends on are not on the classpath:

greenstone@machine-name:~/Downloads$ java -classpath Pdf2Dom/target/pdf2dom-1.8-SNAPSHOT.jar:pdfbox-app.jar:slf4j-jdk14-1.6.6.jar:log4j-over-slf4j-1.6.6.jar:slf4j-api-1.6.6.jar  org.fit.pdfdom.PDFToHTML SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
Exception in thread "main" java.lang.NoClassDefFoundError: org/mabb/fontverter/FontVerter
	at org.fit.pdfdom.FontTable$Entry.loadTrueTypeFont(FontTable.java:178)
	at org.fit.pdfdom.FontTable$Entry.getData(FontTable.java:147)
	at org.fit.pdfdom.FontTable$Entry.isEntryValid(FontTable.java:161)
	at org.fit.pdfdom.FontTable.addEntry(FontTable.java:48)
	at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:378)
	at org.fit.pdfdom.PDFBoxTree.updateFontTable(PDFBoxTree.java:361)
	at org.fit.pdfdom.PDFDomTree.updateFontTable(PDFDomTree.java:544)
	at org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:206)
	at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
	at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
	at org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218)
	at org.fit.pdfdom.PDFDomTree.writeText(PDFDomTree.java:194)
	at org.fit.pdfdom.PDFToHTML.main(PDFToHTML.java:77)
Caused by: java.lang.ClassNotFoundException: org.mabb.fontverter.FontVerter
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	... 13 more
greenstone@machine-name:~/Downloads$