==== Compiling Xpdf Tools ====
Needs CMake to compile.

==== Information related to Xpdf Tools, and to general PDF to HTML conversion ====
__________________________________________________________
Mojo::DOM (Perl)
__________________________________________________________

1. Before Dr Bainbridge found Mojo::DOM, he looked at
* https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers
* http://radar.oreilly.com/2014/02/parsing-html-with-perl-2.html

2. Main links for Mojo::DOM
* https://mojolicious.org/perldoc/Mojo/DOM
* https://metacpan.org/pod/Mojo::DOM
Dependencies: http://deps.cpantesters.org/?module=Mojo%3A%3ADOM;perl=latest

Once you've downloaded the Mojolicious source tarball, which contains Mojo::DOM, follow Dr Bainbridge's sequence of commands below to build the Mojo::DOM CPAN module for Perl.
We'll be using this module to parse the HTML output by the Xpdf tool pdftohtml.

mkdir cpan
tar xvzf Mojolicious-7.84.tar.gz
cd Mojolicious-7.84/
perl ./Makefile.PL PREFIX=`pwd`/installed
make
make install
cp -r installed/share/perl/5.18.2 ../cpan
cd ..
export PERL5LIB=`pwd`/cpan

emacs -nw test.pl

test.pl starts with the following line:
#!/usr/bin/perl -w
Also add in 'use v5.10;'

chmod a+x test.pl
./test.pl
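
Before running test.pl, it can help to confirm that Perl really picks up Mojo::DOM through PERL5LIB. The following is only a minimal sketch, assuming perl is on the PATH; check_mojo_dom is a made-up helper name and not part of Mojolicious.

```shell
# Sketch: confirm that Perl can load Mojo::DOM (e.g. via PERL5LIB).
# check_mojo_dom is a hypothetical helper name.
check_mojo_dom() {
    # Parse a tiny HTML fragment and print the text of its <p> element.
    if perl -MMojo::DOM -e 'print Mojo::DOM->new("<p>ok</p>")->at("p")->text' 2>/dev/null; then
        echo              # finish the "ok" line printed by perl
    else
        echo "missing"    # perl itself or Mojo::DOM could not be loaded
    fi
}

# Usage, from the folder that contains cpan/:
#   export PERL5LIB=`pwd`/cpan
#   check_mojo_dom
```

If this prints "missing", the PERL5LIB path is the first thing to recheck.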

__________________________________________________________
XPDF
__________________________________________________________

Xpdf's last mod date is in 2017 and it includes its own pdftohtml utility tool, whereas the old "pdftohtml" tool that GS used was last updated in 2013 (and itself made use of Xpdf, possibly older versions).

1. https://www.xpdfreader.com/download.html

As per the Readme file found in the linux binary of Xpdf Tools, the Xpdf Viewer requires the Qt toolkit, but the Xpdf Tools do not. Have not read the Install file to confirm whether the same is the case when compiling the command line tools. (But in that case, can't we just include the tools binaries available for all 3 OS, instead of compiling on each platform?)

Using Xpdf's pdftohtml tool:
greenstone@machine-name:~/Downloads/xpdf-tools-linux-4.00/bin64$ ./pdftohtml -z 1.5 ~/Downloads/ApacheLicence.pdf licence

where licence is the output folder
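
The pdftohtml call above could be wrapped so that it can be run from any folder. This is a sketch only: XPDF_BIN and xpdf_to_html are made-up names, and the default path simply repeats the download location used in these notes.

```shell
# XPDF_BIN is an assumed variable pointing at the bin64 folder of the
# unpacked Xpdf Tools; the default mirrors the path used in these notes.
XPDF_BIN="${XPDF_BIN:-$HOME/Downloads/xpdf-tools-linux-4.00/bin64}"

# Hypothetical wrapper: xpdf_to_html <pdf> <output-folder> [zoom]
xpdf_to_html() {
    pdf=$1
    outdir=$2
    zoom=${3:-1.5}    # same zoom value as in the command above
    if [ ! -f "$pdf" ]; then
        echo "No such PDF: $pdf" >&2
        return 1
    fi
    "$XPDF_BIN/pdftohtml" -z "$zoom" "$pdf" "$outdir"
}

# Usage:
#   xpdf_to_html ~/Downloads/ApacheLicence.pdf licence
```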

https://www.xpdfreader.com/pdftohtml-man.html
https://linux.die.net/man/5/xpdfrc
(Configuration flags you can put into ~/.xpdfrc to use as defaults when running Xpdf tool commands)

2. We're using Xpdf Tools version: xpdf-tools-linux-4.00

__________________________________________________________
PDF2DOM: tried it out, but wasn't what we wanted
__________________________________________________________
Using PDFBox to convert a PDF to full HTML, with both images and text placed correctly with respect to each other, is tricky; see https://stackoverflow.com/questions/9671239/pdfbox-convert-a-pdf-to-text-or-html-including-images-from-the-pdf
(Google: pdfbox to convert pdf to html with images)

PDF2DOM tool (based on PDFBox) to convert PDF to HTML with images
* http://cssbox.sourceforge.net/pdf2dom/documentation.php
* Got the command line jar tool, PDFToHTML.jar version 1.7, from https://sourceforge.net/projects/cssbox/files/Pdf2DOM/
* Further information and source code at https://github.com/radkovo/Pdf2Dom
* API: http://cssbox.sourceforge.net/pdf2dom/api/index.html


1. Running

java -jar PDFToHTML.jar <infile> [<outfile>]

greenstone@machine-name:~/Downloads$ java -jar PDFToHTML.jar SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2


It will output the page, but you'll see the following output, indicating that the logger is not displaying anything:
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

See https://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder

To see error output, download the SLF4J simple jar and run as follows:

greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2

The above is an MS Word produced PDF (archive format) and works fine: a font folder is generated containing the extracted fonts.

The following is a PDF produced from the same doc file by the latest LibreOffice installed on Windows:
ApacheLicencePDFA_FromODT.pdf
But running the same command on it produces the following font errors:

greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA_FromODT.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
[main] INFO org.reflections.Reflections - Reflections took 163 ms to scan 1 urls, producing 36 keys and 222 values
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException

Fonts get extracted if the source PDF was generated by MS Word's doc to PDF conversion. Fonts didn't get extracted from the PDF upon conversion to HTML when LibreOffice was used to convert the .doc to the source PDF.

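
The PDFToHTML invocations above differ only in the input file and the two output folders, so they could be wrapped as follows. This is a sketch under assumptions: pdf2dom, PDF2DOM_JAR and SLF4J_JAR are made-up names, both jars are assumed to sit in the current folder, and the folders are created up front in case the tool does not create them itself.

```shell
# Assumed jar locations: both jars downloaded into the current folder.
PDF2DOM_JAR="${PDF2DOM_JAR:-PDFToHTML.jar}"
SLF4J_JAR="${SLF4J_JAR:-slf4j-simple-1.7.25.jar}"

# Hypothetical wrapper: pdf2dom <pdf> <image-folder> <font-folder>
pdf2dom() {
    pdf=$1
    imgdir=$2
    fontdir=$3
    # Create both output folders first (harmless if they already exist).
    mkdir -p "$imgdir" "$fontdir" || return 1
    # The SLF4J simple jar goes on the classpath so warnings become visible.
    java -classpath "$SLF4J_JAR:$PDF2DOM_JAR" org.fit.pdfdom.PDFToHTML "$pdf" \
        -im=SAVE_TO_DIR -idir="$imgdir" -fm=SAVE_TO_DIR -fdir="$fontdir"
}

# Usage:
#   pdf2dom ApacheLicencePDFA.pdf /home/greenstone/Downloads/tmp1 /home/greenstone/Downloads/tmp2
```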
2. Check version of PDF
https://www.codeproject.com/Questions/167550/How-to-check-different-versions-of-PDF
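
A PDF file starts with a header of the form %PDF-x.y, so a quick version check from the command line is possible without any PDF library. The sketch below assumes head and grep support the -c, -a and -o options (GNU and BSD versions do); pdf_version is a made-up helper name. Note that a document can override the header version with a /Version entry in its catalog, which this check does not look at.

```shell
# Hypothetical helper: pdf_version <pdf>
# Prints the version from the "%PDF-x.y" header near the start of the file,
# or nothing if no such header is found in the first 1024 bytes.
pdf_version() {
    head -c 1024 "$1" | grep -a -o '%PDF-[0-9]\.[0-9]' | head -n 1 | cut -d- -f2
}

# Usage:
#   pdf_version ApacheLicencePDFA.pdf
```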


3. pdf to html command line conversion open source
https://stackoverflow.com/questions/8370014/how-to-convert-pdf-to-html

"Download

pdfbox-2.0.3.jar
fontbox-2.0.3.jar
preflight-2.0.3.jar
xmpbox-2.0.3.jar
pdfbox-tools-2.0.3.jar
pdfbox-debugger-2.0.3.jar

from http://pdfbox.apache.org/
...

PLEASE NOTE: Images do not get pushed to the HTML output."


4. Need a way to check if a PDF contains images, then use pdf2dom, else the basic pdfbox conversion to html (fewer div tags with inline style markup)?
https://stackoverflow.com/questions/46215879/count-images-in-pdf-using-pdfbox
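
As a rough alternative to counting images through the PDFBox API, the choice in item 4 could be sketched on the command line. This is a heuristic only and not what the linked answer does: it greps the raw file for image XObject dictionaries ("/Subtype /Image"), so inline images are missed and an image that is defined but never drawn still counts. pdf_has_images and choose_converter are made-up names.

```shell
# Hypothetical helper: succeeds if the raw PDF bytes contain an image
# XObject dictionary entry ("/Subtype /Image" or "/Subtype/Image").
pdf_has_images() {
    grep -a -q '/Subtype */Image' "$1"
}

# Hypothetical helper: prints which converter item 4 would pick.
choose_converter() {
    if pdf_has_images "$1"; then
        echo "pdf2dom"    # images present: keep images and layout
    else
        echo "pdfbox"     # text only: simpler HTML with fewer div tags
    fi
}

# Usage:
#   choose_converter SampleDoc1.pdf
```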


UNUSED
Googled for: java tool convert pdf version
* https://stackoverflow.com/questions/11137912/all-inclusive-tool-to-convert-different-types-of-documents-to-pdf
* https://www.qoppa.com/pdfprocess/
  jPDFProcess - Java PDF Library to Create, Manipulate PDF
  (appears to be payware)
* https://www.gnostice.com/nl_article.asp?id=95&t=How_to_Change_the_PDF_Version_of_a_Document
  How to Convert a PDF Document to an Older or Newer Version
  uses .NET
* http://www.baeldung.com/pdf-conversions-java
  PDF Conversions in Java
  e.g. PDF to html and html to PDF


__________________________________________________________

Running the PDFToHTML.jar command on SampleDoc1.pdf with the SLF4J simple jar on the classpath shows the same font warnings:

greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
[main] INFO org.reflections.Reflections - Reflections took 153 ms to scan 1 urls, producing 36 keys and 222 values
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException


Running the pdf2dom-1.8-SNAPSHOT.jar from Pdf2Dom/target instead fails with a NoClassDefFoundError, as the FontVerter classes are not on the classpath:

greenstone@machine-name:~/Downloads$ java -classpath Pdf2Dom/target/pdf2dom-1.8-SNAPSHOT.jar:pdfbox-app.jar:slf4j-jdk14-1.6.6.jar:log4j-over-slf4j-1.6.6.jar:slf4j-api-1.6.6.jar org.fit.pdfdom.PDFToHTML SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
Exception in thread "main" java.lang.NoClassDefFoundError: org/mabb/fontverter/FontVerter
    at org.fit.pdfdom.FontTable$Entry.loadTrueTypeFont(FontTable.java:178)
    at org.fit.pdfdom.FontTable$Entry.getData(FontTable.java:147)
    at org.fit.pdfdom.FontTable$Entry.isEntryValid(FontTable.java:161)
    at org.fit.pdfdom.FontTable.addEntry(FontTable.java:48)
    at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:378)
    at org.fit.pdfdom.PDFBoxTree.updateFontTable(PDFBoxTree.java:361)
    at org.fit.pdfdom.PDFDomTree.updateFontTable(PDFDomTree.java:544)
    at org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:206)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218)
    at org.fit.pdfdom.PDFDomTree.writeText(PDFDomTree.java:194)
    at org.fit.pdfdom.PDFToHTML.main(PDFToHTML.java:77)
Caused by: java.lang.ClassNotFoundException: org.mabb.fontverter.FontVerter
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 13 more
greenstone@machine-name:~/Downloads$
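
The NoClassDefFoundError in the trace above means that none of the jars on the classpath provides org.mabb.fontverter.FontVerter. One way to check which jar, if any, contains a given class is sketched below. It assumes the unzip command is installed (jar tf would be an alternative); find_class_jar is a made-up helper name.

```shell
# Hypothetical helper: find_class_jar <fully.qualified.Class> <classpath>
# Prints every jar on the colon-separated classpath whose listing contains
# the class; returns 1 if no jar contains it.
find_class_jar() {
    # org.mabb.fontverter.FontVerter -> org/mabb/fontverter/FontVerter.class
    entry=$(printf '%s' "$1" | tr . /).class
    found=1
    old_ifs=$IFS
    IFS=:
    for jar in $2; do
        # unzip -l lists the jar contents; -F makes grep match literally.
        if unzip -l "$jar" 2>/dev/null | grep -q -F "$entry"; then
            echo "$jar"
            found=0
        fi
    done
    IFS=$old_ifs
    return $found
}

# Usage:
#   find_class_jar org.mabb.fontverter.FontVerter \
#       Pdf2Dom/target/pdf2dom-1.8-SNAPSHOT.jar:pdfbox-app.jar
```

If nothing is printed, the jar that provides the class still has to be added to the -classpath argument.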