source: gs2-extensions/xpdf-tools/trunk/src/README.txt@ 32227

Last change on this file since 32227 was 32227, checked in by ak19, 6 years ago

First attempt at compiling up xpdf-tools and the cmake that it needs. Haven't yet tried compiling against the FreeType, libpng and zlib libraries needed for xpdf-tools, but at present the compile sequence runs successfully to completion and generates the binaries in the trunk/src/linux/bin folder.

==== Compiling Xpdf Tools ====
Xpdf Tools needs CMake to compile.

==== Information related to Xpdf Tools, and to general PDF to HTML conversion ====
__________________________________________________________
Mojo::DOM (Perl)
__________________________________________________________

1. Before Dr Bainbridge found Mojo::DOM, he looked at:
* https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers
* http://radar.oreilly.com/2014/02/parsing-html-with-perl-2.html

2. Main links for Mojo::DOM
* https://mojolicious.org/perldoc/Mojo/DOM
* https://metacpan.org/pod/Mojo::DOM
  Dependencies: http://deps.cpantesters.org/?module=Mojo%3A%3ADOM;perl=latest

Once you've downloaded the Mojolicious source (which contains Mojo::DOM), follow Dr Bainbridge's sequence of commands below to build the Mojo::DOM CPAN module.
We'll use this module to parse the HTML output by the Xpdf tool pdftohtml.


mkdir cpan
tar xvzf Mojolicious-7.84.tar.gz
cd Mojolicious-7.84/
perl ./Makefile.PL PREFIX=`pwd`/installed
make
make install
cp -r installed/share/perl/5.18.2 ../cpan
cd ..
export PERL5LIB=`pwd`/cpan

emacs -nw test.pl

(In test.pl, start with the line below and add in 'use v5.10;')
#!/usr/bin/perl -w

chmod a+x test.pl
./test.pl
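
A quick way to confirm that PERL5LIB now points at the built module, sketched as a small shell helper. The function name has_perl_module is made up for illustration; it relies only on perl's standard -M switch and is not part of Dr Bainbridge's sequence.

```shell
# Hypothetical helper: succeeds if perl can load the named module
# with the current PERL5LIB, fails otherwise.
has_perl_module() {
    perl -M"$1" -e 1 2>/dev/null
}

# Example (run after the export above):
#   has_perl_module Mojo::DOM && echo "Mojo::DOM found"
```

If the check fails, the likely suspects are the hard-coded 5.18.2 in the cp step (it must match the local perl version) or PERL5LIB having been exported from a different directory.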


__________________________________________________________
XPDF
__________________________________________________________

Xpdf was last modified in 2017 and includes its own pdftohtml utility, whereas the old "pdftohtml" tool that GS used was last updated in 2013 (and itself made use of Xpdf, possibly an older version).

1. https://www.xpdfreader.com/download.html

According to the Readme file found in the Linux binary of Xpdf Tools, the Xpdf Viewer requires the Qt toolkit, but the Xpdf Tools do not. Have not yet read the Install file to confirm whether the same holds when compiling the command line tools. (But in that case, can't we just include the tools binaries available for all 3 OSes, instead of compiling on each platform?)

Using Xpdf's pdftohtml tool:
greenstone@machine-name:~/Downloads/xpdf-tools-linux-4.00/bin64$ ./pdftohtml -z 1.5 ~/Downloads/ApacheLicence.pdf licence

 where licence is the folder that the HTML output is written to
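
The same call can be looped over a directory of PDFs. The following is only a sketch: the function name convert_pdfs and the PDFTOHTML and ZOOM variables are invented here for illustration, and it assumes pdftohtml creates each per-PDF output folder itself (worth double-checking against the pdftohtml man page).

```shell
# Sketch: run Xpdf's pdftohtml on every PDF in a directory.
# PDFTOHTML and ZOOM are hypothetical overrides, not Xpdf settings.
convert_pdfs() (
    in_dir=$1
    out_root=$2
    tool=${PDFTOHTML:-pdftohtml}
    zoom=${ZOOM:-1.5}
    mkdir -p "$out_root" || exit 1
    for pdf in "$in_dir"/*.pdf; do
        [ -e "$pdf" ] || continue       # the glob matched nothing
        name=$(basename "$pdf" .pdf)
        # One output folder per PDF, named after the file
        "$tool" -z "$zoom" "$pdf" "$out_root/$name" || echo "failed: $pdf" >&2
    done
)

# Example:
#   PDFTOHTML=~/Downloads/xpdf-tools-linux-4.00/bin64/pdftohtml
#   convert_pdfs ~/Downloads ~/Downloads/html-out
```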

https://www.xpdfreader.com/pdftohtml-man.html
https://linux.die.net/man/5/xpdfrc
(Configuration flags you can put into ~/.xpdfrc to use as defaults when running xpdf tool commands)

2. We're using Xpdf Tools version: xpdf-tools-linux-4.00

__________________________________________________________
PDF2DOM: tried it out, but wasn't what we wanted
__________________________________________________________
Using PDFBox to convert a PDF to full HTML, with both images and text placed correctly with respect to each other, is tricky. See https://stackoverflow.com/questions/9671239/pdfbox-convert-a-pdf-to-text-or-html-including-images-from-the-pdf
(Google: pdfbox to convert pdf to html with images)

PDF2DOM tool (based on PDFBox) to convert PDF to HTML with images
* http://cssbox.sourceforge.net/pdf2dom/documentation.php
* Got the command line jar tool, PDFToHTML.jar version 1.7, from https://sourceforge.net/projects/cssbox/files/Pdf2DOM/
* Further information and source code at https://github.com/radkovo/Pdf2Dom
* API: http://cssbox.sourceforge.net/pdf2dom/api/index.html


1. Running

java -jar PDFToHTML.jar <infile> [<outfile>]

 greenstone@machine-name:~/Downloads$ java -jar PDFToHTML.jar SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2


It will output the page, but you'll also see the following messages, indicating that the logger is not displaying anything:
 SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
 SLF4J: Defaulting to no-operation (NOP) logger implementation
 SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

See https://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder

To see error output, download the SLF4J simple jar and run as follows:

 greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2

The above is a PDF produced by MS Word (archive format) and works fine: a font folder is generated containing the extracted fonts.

The following is a PDF produced from the same doc file by the latest LibreOffice installed on Windows:
 ApacheLicencePDFA_FromODT.pdf
But running the same command on it produces the following font errors:

greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA_FromODT.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
[main] INFO org.reflections.Reflections - Reflections took 163 ms to scan 1 urls, producing 36 keys and 222 values
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException

In summary: fonts get extracted when the source PDF was generated by MS Word's doc to PDF conversion. Fonts did not get extracted on conversion to HTML when LibreOffice was used to convert the .doc to the source PDF.
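
To spot this failure without reading the whole log, the warnings can be reduced to a list of font names. This is a sketch keyed to the exact "Error loading font '...'" wording in the log above; the function name failed_fonts is invented for illustration.

```shell
# Sketch: print each font named in an "Error loading font" warning, once.
# Reads log text on stdin.
failed_fonts() {
    sed -n "s/.*Error loading font '\([^']*\)'.*/\1/p" | sort -u
}

# Example (slf4j-simple logs to stderr by default, hence 2>&1):
#   java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar \
#        org.fit.pdfdom.PDFToHTML ApacheLicencePDFA_FromODT.pdf 2>&1 | failed_fonts
```

An empty result means no font warnings of this form were logged. It does not by itself prove that the fonts were written out.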

2. Check version of PDF
https://www.codeproject.com/Questions/167550/How-to-check-different-versions-of-PDF
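
A PDF file starts with a header of the form %PDF-<major>.<minor>, so the declared version can be read from the first 8 bytes. The sketch below does only that; the function name pdf_version is invented for illustration. Note that a document can override the header version with a /Version entry in its catalog, which this does not look at.

```shell
# Sketch: print the version declared in a PDF's header, e.g. "1.4".
pdf_version() (
    # The header is the first 8 bytes of the file: %PDF-<major>.<minor>
    header=$(head -c 8 "$1" 2>/dev/null | tr -d '\0')
    case $header in
        %PDF-[0-9].[0-9]) printf '%s\n' "${header#"%PDF-"}" ;;
        *) echo "no PDF header found: $1" >&2; exit 1 ;;
    esac
)

# Example:
#   pdf_version ~/Downloads/ApacheLicence.pdf
```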


3. pdf to html command line conversion open source
https://stackoverflow.com/questions/8370014/how-to-convert-pdf-to-html

"Download

 pdfbox-2.0.3.jar
 fontbox-2.0.3.jar
 preflight-2.0.3.jar
 xmpbox-2.0.3.jar
 pdfbox-tools-2.0.3.jar
 pdfbox-debugger-2.0.3.jar

from http://pdfbox.apache.org/
...

PLEASE NOTE: Images do not get pushed to the HTML output."


4. Need a way to check whether a PDF contains images: if so, use pdf2dom, else use the basic pdfbox conversion to HTML (fewer div tags with inline style markup)?
https://stackoverflow.com/questions/46215879/count-images-in-pdf-using-pdfbox
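
As a rough stand-in until a PDFBox-based check like the one linked above is written, the raw PDF bytes can be searched for image XObject dictionaries. This is a heuristic sketch, not the PDFBox approach: the names pdf_has_images and choose_converter are invented here, it can miss inline images embedded in compressed content streams, and it says nothing about whether an image is actually placed on a page.

```shell
# Heuristic sketch: does the file contain an image XObject dictionary?
# Matches "/Subtype /Image" with or without whitespace between the names.
pdf_has_images() {
    LC_ALL=C grep -a -q -E '/Subtype[[:space:]]*/Image' "$1"
}

# Hypothetical dispatcher: prints which converter to try for this PDF.
choose_converter() {
    if pdf_has_images "$1"; then
        echo pdf2dom
    else
        echo pdfbox
    fi
}

# Example:
#   choose_converter SampleDoc1.pdf
```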


UNUSED
Googled for: java tool convert pdf version
* https://stackoverflow.com/questions/11137912/all-inclusive-tool-to-convert-different-types-of-documents-to-pdf
* https://www.qoppa.com/pdfprocess/
  jPDFProcess – Java PDF Library to Create, Manipulate PDF
  (appears to be payware)
* https://www.gnostice.com/nl_article.asp?id=95&t=How_to_Change_the_PDF_Version_of_a_Document
  How to Convert a PDF Document to an Older or Newer Version
  uses .NET
* http://www.baeldung.com/pdf-conversions-java
  PDF Conversions in Java
  e.g. PDF to HTML and HTML to PDF



__________________________________________________________

greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
[main] INFO org.reflections.Reflections - Reflections took 153 ms to scan 1 urls, producing 36 keys and 222 values
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException



greenstone@machine-name:~/Downloads$ java -classpath Pdf2Dom/target/pdf2dom-1.8-SNAPSHOT.jar:pdfbox-app.jar:slf4j-jdk14-1.6.6.jar:log4j-over-slf4j-1.6.6.jar:slf4j-api-1.6.6.jar org.fit.pdfdom.PDFToHTML SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
Exception in thread "main" java.lang.NoClassDefFoundError: org/mabb/fontverter/FontVerter
    at org.fit.pdfdom.FontTable$Entry.loadTrueTypeFont(FontTable.java:178)
    at org.fit.pdfdom.FontTable$Entry.getData(FontTable.java:147)
    at org.fit.pdfdom.FontTable$Entry.isEntryValid(FontTable.java:161)
    at org.fit.pdfdom.FontTable.addEntry(FontTable.java:48)
    at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:378)
    at org.fit.pdfdom.PDFBoxTree.updateFontTable(PDFBoxTree.java:361)
    at org.fit.pdfdom.PDFDomTree.updateFontTable(PDFDomTree.java:544)
    at org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:206)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218)
    at org.fit.pdfdom.PDFDomTree.writeText(PDFDomTree.java:194)
    at org.fit.pdfdom.PDFToHTML.main(PDFToHTML.java:77)
Caused by: java.lang.ClassNotFoundException: org.mabb.fontverter.FontVerter
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 13 more
greenstone@machine-name:~/Downloads$
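
The NoClassDefFoundError above means that no jar on the hand-written classpath provides org.mabb.fontverter.FontVerter. One way to avoid leaving a dependency out is to collect every required jar in one directory and build the classpath from it. The sketch below assumes such a directory exists (called lib in the example, a made-up name) and that a jar providing the FontVerter classes has been put there; the function name build_classpath is likewise invented.

```shell
# Sketch: join every .jar in a directory into a ':'-separated classpath.
build_classpath() (
    path_list=
    for jar in "$1"/*.jar; do
        [ -e "$jar" ] || continue       # the glob matched nothing
        path_list=${path_list:+$path_list:}$jar
    done
    printf '%s\n' "$path_list"
)

# Example (lib is a hypothetical directory holding all dependency jars):
#   java -classpath "$(build_classpath lib)" org.fit.pdfdom.PDFToHTML SampleDoc1.pdf
```

Java 6 and later also accept a quoted wildcard such as -classpath 'lib/*', which has the same effect without a helper.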