==== Compiling Xpdf Tools ====

Needs CMake to compile.


==== Information related to Xpdf Tools, and to general PDF to html conversion ====

__________________________________________________________
Mojo::DOM (Perl)
__________________________________________________________

1. Before Dr Bainbridge found Mojo::DOM, he looked at
   * https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers
   * http://radar.oreilly.com/2014/02/parsing-html-with-perl-2.html

2. Main links for Mojo::DOM
   * https://mojolicious.org/perldoc/Mojo/DOM
   * https://metacpan.org/pod/Mojo::DOM

   Dependencies:
   http://deps.cpantesters.org/?module=Mojo%3A%3ADOM;perl=latest

Once you've downloaded Mojo::DOM's src (the Mojolicious tarball), follow Dr Bainbridge's sequence of commands below to build the Mojo::DOM CPAN module for perl. We'll use this module to parse the HTML output by the Xpdf tool pdftohtml.

   mkdir cpan
   tar xvzf Mojolicious-7.84.tar.gz
   cd Mojolicious-7.84/
   perl ./Makefile.PL PREFIX=`pwd`/installed
   make
   make install
   cp -r installed/share/perl/5.18.2 ../cpan
   cd ..
   export PERL5LIB=`pwd`/cpan
   emacs -nw test.pl
      (in test.pl: start with '#!/usr/bin/perl -w' and add in 'use v5.10;')
   chmod a+x test.pl
   ./test.pl

__________________________________________________________
XPDF
__________________________________________________________

Xpdf was last modified in 2017 and includes its own pdftohtml utility, whereas the old "pdftohtml" tool that GS used was last updated in 2013 (and itself made use of Xpdf, possibly an older version).

1. https://www.xpdfreader.com/download.html
   As per the Readme file found in the linux binary of Xpdf Tools, the Xpdf Viewer requires the Qt toolkit, but the Xpdf Tools do not. Have not read the Install file to confirm whether the same holds when compiling the command line tools.
   (But in that case, can't we just include the tools binary available for all 3 OS, instead of compiling on each platform?)

   Using Xpdf's pdftohtml tool:
      greenstone@machine-name:~/Downloads/xpdf-tools-linux-4.00/bin64$ ./pdftohtml -z 1.5 ~/Downloads/ApacheLicence.pdf licence
   where licence is the output folder.

   https://www.xpdfreader.com/pdftohtml-man.html
   https://linux.die.net/man/5/xpdfrc (Configuration flags you can put into ~/.xpdfrc to use as defaults when running xpdf tool commands)

2. We're using Xpdf Tools version: xpdf-tools-linux-4.00

__________________________________________________________
PDF2DOM: tried it out, but wasn't what we wanted
__________________________________________________________

Using PDFBox to convert a PDF to full HTML, with both images and text placed correctly with respect to each other, is tricky, see
https://stackoverflow.com/questions/9671239/pdfbox-convert-a-pdf-to-text-or-html-including-images-from-the-pdf
(Google: pdfbox to convert pdf to html with images)

PDF2DOM tool (based on PDFBox) to convert PDF to HTML with images
   * http://cssbox.sourceforge.net/pdf2dom/documentation.php
   * Got the command line jar tool, PDFToHTML.jar version 1.7, from https://sourceforge.net/projects/cssbox/files/Pdf2DOM/
   * Further information and source code at https://github.com/radkovo/Pdf2Dom
   * API: http://cssbox.sourceforge.net/pdf2dom/api/index.html

1. Running
      java -jar PDFToHTML.jar []

      greenstone@machine-name:~/Downloads$ java -jar PDFToHTML.jar SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2

   It will output the page, but you'll see the following output indicating that the logger is not displaying anything:
      SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
      SLF4J: Defaulting to no-operation (NOP) logger implementation
      SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
   See https://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder

   To see error output, download the SLF4J simple jar and run as follows:
      greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2

   The above is an MS Word produced PDF (archive format) and works fine: a font folder is generated containing the extracted fonts.

   The following is a PDF produced from the same doc file by the latest libreoffice installed on Windows: ApacheLicencePDFA_FromODT.pdf
   But running the same command on it produces the following font errors:
      greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA_FromODT.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
      [main] INFO org.reflections.Reflections - Reflections took 163 ms to scan 1 urls, producing 36 keys and 222 values
      [main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
      [main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
      [main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
      [main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException

   Fonts get extracted if the source PDF was generated by MS Word's doc to PDF conversion.
   Fonts didn't get extracted from the PDF upon conversion to HTML when libreoffice was used to convert a .doc to the source PDF.

2. Check version of PDF
   https://www.codeproject.com/Questions/167550/How-to-check-different-versions-of-PDF

3. pdf to html command line conversion open source
   https://stackoverflow.com/questions/8370014/how-to-convert-pdf-to-html
   "Download
      pdfbox-2.0.3.jar
      fontbox-2.0.3.jar
      preflight-2.0.3.jar
      xmpbox-2.0.3.jar
      pdfbox-tools-2.0.3.jar
      pdfbox-debugger-2.0.3.jar
   from http://pdfbox.apache.org/
   ...
   PLEASE NOTE: Images do not get pushed to the HTML output."

4. Need a way to check if a PDF contains images, then use pdf2dom, else basic pdfbox conversion to html (fewer div tags with inline style markup)?
   https://stackoverflow.com/questions/46215879/count-images-in-pdf-using-pdfbox

UNUSED
Googled for: java tool convert pdf version
   * https://stackoverflow.com/questions/11137912/all-inclusive-tool-to-convert-different-types-of-documents-to-pdf
   * https://www.qoppa.com/pdfprocess/
     jPDFProcess – Java PDF Library to Create, Manipulate PDF (appears to be payware)
   * https://www.gnostice.com/nl_article.asp?id=95&t=How_to_Change_the_PDF_Version_of_a_Document
     How to Convert a PDF Document to an Older or Newer Version (uses .NET)
   * http://www.baeldung.com/pdf-conversions-java
     PDF Conversions in Java, e.g. PDF to html and html to PDF

__________________________________________________________

The same font warnings appear when running PDFToHTML.jar on SampleDoc1.pdf:

greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
[main] INFO org.reflections.Reflections - Reflections took 153 ms to scan 1 urls, producing 36 keys and 222 values
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException

Running the pdf2dom-1.8-SNAPSHOT.jar under Pdf2Dom/target instead fails outright, because the FontVerter classes are not on the classpath:

greenstone@machine-name:~/Downloads$ java -classpath Pdf2Dom/target/pdf2dom-1.8-SNAPSHOT.jar:pdfbox-app.jar:slf4j-jdk14-1.6.6.jar:log4j-over-slf4j-1.6.6.jar:slf4j-api-1.6.6.jar org.fit.pdfdom.PDFToHTML SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
Exception in thread "main" java.lang.NoClassDefFoundError: org/mabb/fontverter/FontVerter
	at org.fit.pdfdom.FontTable$Entry.loadTrueTypeFont(FontTable.java:178)
	at org.fit.pdfdom.FontTable$Entry.getData(FontTable.java:147)
	at org.fit.pdfdom.FontTable$Entry.isEntryValid(FontTable.java:161)
	at org.fit.pdfdom.FontTable.addEntry(FontTable.java:48)
	at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:378)
	at org.fit.pdfdom.PDFBoxTree.updateFontTable(PDFBoxTree.java:361)
	at org.fit.pdfdom.PDFDomTree.updateFontTable(PDFDomTree.java:544)
	at org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:206)
	at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
	at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
	at org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218)
	at org.fit.pdfdom.PDFDomTree.writeText(PDFDomTree.java:194)
	at org.fit.pdfdom.PDFToHTML.main(PDFToHTML.java:77)
Caused by: java.lang.ClassNotFoundException: org.mabb.fontverter.FontVerter
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	... 13 more
greenstone@machine-name:~/Downloads$
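
__________________________________________________________

Regarding item 2 of the PDF2DOM notes (checking the version of a PDF): a PDF file starts with a header of the form %PDF-x.y, so a rough check can be done from the shell without any Java tooling. The following is only a sketch. The function name pdf_version is made up for these notes, and it assumes the header appears within the first 1024 bytes and that grep supports the -a and -o options (GNU and BSD grep do). Note that from PDF 1.4 onwards a /Version entry in the document catalog may override the header, which this sketch does not look at.

```shell
#!/bin/sh
# pdf_version FILE
# Hypothetical helper: prints the version declared in the PDF header
# (e.g. "1.5"), or nothing if no %PDF-x.y marker is found.
pdf_version() {
    # Assumption: the header sits within the first 1024 bytes of the file.
    # -a treats the binary data as text, -o prints only the matched marker.
    head -c 1024 "$1" \
        | LC_ALL=C grep -a -o -m 1 '%PDF-[0-9]\.[0-9]' \
        | head -n 1 \
        | cut -d- -f2
}
```

Possible usage, after sourcing the function into the current shell:

   pdf_version ~/Downloads/ApacheLicencePDFA.pdf
   pdf_version ~/Downloads/ApacheLicencePDFA_FromODT.pdf

Comparing the two outputs would show whether the MS Word and libreoffice produced PDFs declare different versions.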