source: gs2-extensions/xpdf-tools/trunk/src/GS-README.txt@ 32269

Last change on this file since 32269 was 32269, checked in by ak19, 6 years ago

Some important general info notes added to our GS-README.txt for xpdf-tools.

__________________________________________________________
CONTENTS
__________________________________________________________

Xpdf-Tools related
A. XPDF
B. Mojo::DOM perl package for parsing HTML
C. Compiling Xpdf-Tools: statically or dynamically linked
D. How we got Xpdf-Tools to compile using CASCADE-MAKE
E. Getting more output when running CMake (verbosity)
F. APPENDIX - Useful links

LIBJPEG related
G. LIBJPEG and LIBTIFF
- Moving from 2008's libjpeg version 6b to the newer 2018 version 9c
- Issues building LIBJPEG version 6b on 64 bit machines and the patch

H. Licensing information and making the distributable tarball

I. PDF2DOM
   unused, replaced by Xpdf-Tools' more suited pdftohtml capabilities

__________________________________________________________
A. XPDF
__________________________________________________________

Xpdf was last modified in 2017 and includes its own pdftohtml utility, whereas the old "pdftohtml" tool that GS used was last updated in 2013 (and itself made use of Xpdf, possibly an older version).

The tool takes a PDF and produces an HTML file for each page of the PDF, consisting of selectable HTML text overlaid on top of a "screenshot" image of the page. (A page's text is not part of the screenshot.)

1. https://www.xpdfreader.com/download.html

As per the Readme file found in the linux binary of Xpdf Tools, the Xpdf Viewer requires the Qt toolkit, but the Xpdf Tools do not. We have not read the Install file to confirm whether the same is the case when compiling the command line tools. (But in that case, can't we just include the tools binaries available for all 3 OSes, instead of compiling on each platform?)

    - Using Xpdf's pdftohtml tool:
    greenstone@bedrock:~/Downloads/xpdf-tools-linux-4.00/bin64$ ./pdftohtml -z 1.5 ~/Downloads/ApacheLicence.pdf licence

    where licence is a folder.

    - Using Xpdf's pdftotext tool:
    greenstone@bedrock:~/Downloads/xpdf-tools-linux-4.00/bin64$ ./pdftotext -nopgbrk ~/Downloads/ApacheLicence.pdf ~/Downloads/ApacheLicence.txt

    where the output text file must be specified with a full path name.
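
Since pdftotext needs its output file given as a full path, a small shell helper can turn a relative path into an absolute one before calling the tool. This is only a sketch: the function name abs_path is made up for illustration and is not part of Xpdf or Greenstone.

```shell
# Hypothetical helper: print an absolute form of the given path.
# Paths that already start with "/" are printed unchanged;
# anything else is resolved against the current directory.
abs_path() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;
        *)  printf '%s/%s\n' "$PWD" "$1" ;;
    esac
}

# Example use (file names as in the commands above):
#   ./pdftotext -nopgbrk ~/Downloads/ApacheLicence.pdf "$(abs_path ApacheLicence.txt)"
```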


2. Documentation on Xpdf-Tools:
- https://www.xpdfreader.com/support.html
  for example, the pdftohtml man page: https://www.xpdfreader.com/pdftohtml-man.html
- https://linux.die.net/man/5/xpdfrc
  (Configuration flags you can put into ~/.xpdfrc to use as defaults when running xpdf tool commands)

3. We're using Xpdf Tools version: xpdf-tools-linux-4.00

4. We started by working with the ready-made Xpdf-Tools binaries available for download from the xpdf site for Win, Linux and Mac.

5. We're now moving to compiling up Xpdf-Tools ourselves using CASCADE-MAKE, which we have so far got to compile statically on Linux (LSB environment included) to build working binaries.

6. On Mac, it's not possible to produce statically linked binaries: they're dynamically linked against system libraries, but they at least use the static libraries for libpng, zlib, libjpeg and freetype that we compile up.

7. IMPORTANT:
- For Windows, we use the 32 bit precompiled binaries downloaded from the XPDF website. These work on 32 and 64 bit Windows and we don't compile them up ourselves.
  They're put into winbin on trac and end up in GS2/gs2build's GSDLOS/bin folder.
- For Linux, we build 32 bit binaries on the 32 bit LSB VM.
- For Mac, we build 64 bit binaries.

We build the binaries by running ./CASCADE-MAKE.sh on the xpdf-tools gs2-extension.
We then run "./CASCADE-MAKE.sh makedist", which generates the xpdf-tools tarball which we extract into GS2/gs2build's GSDLOS/bin folder.

__________________________________________________________
B. Mojo::DOM perl package for parsing HTML
__________________________________________________________

XPDF's pdftohtml conversion of a single PDF document produces multiple HTML files: one for each page in the source PDF.
We want the output to be "paged_html": a single HTML file that is sectionalised, each section representing a page of the
original PDF.

We need to be able to parse the many HTML pages produced by XPDF's pdftohtml conversion of a doc, in order to massage the output
into the single sectionalised HTML file. For this we needed an HTML parser package for Perl.

1. Before Dr Bainbridge found Mojo::DOM, he looked at
* https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers
* http://radar.oreilly.com/2014/02/parsing-html-with-perl-2.html

2. Main links for Mojo::DOM
* https://mojolicious.org/perldoc/Mojo/DOM
* https://metacpan.org/pod/Mojo::DOM
  Dependencies: http://deps.cpantesters.org/?module=Mojo%3A%3ADOM;perl=latest

3. Once you've downloaded Mojo::DOM's src, follow Dr Bainbridge's sequence of commands below for building the Mojo::DOM CPAN module for Perl.
We'll be using this module for parsing the HTML output by the XPDF tool pdftohtml.


    mkdir cpan
    tar xvzf Mojolicious-7.84.tar.gz
    cd Mojolicious-7.84/
    # install into a local "installed" folder rather than system-wide
    perl ./Makefile.PL PREFIX=`pwd`/installed
    make
    make install
    cp -r installed/share/perl/5.18.2 ../cpan
    cd ..
    # let perl find the copied modules
    export PERL5LIB=`pwd`/cpan

    emacs -nw test.pl

        #!/usr/bin/perl -w
        add in 'use v5.10;'

    chmod a+x test.pl
    ./test.pl
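
Note that the export of PERL5LIB above replaces any value that was already set. If other Perl libraries need to stay visible, the cpan folder can be prepended instead. A minimal sketch follows; prepend_perl5lib is a hypothetical helper name, not something that exists in Greenstone.

```shell
# Hypothetical helper: put a directory at the front of PERL5LIB
# while keeping whatever was already there.
prepend_perl5lib() {
    if [ -n "${PERL5LIB:-}" ]; then
        PERL5LIB="$1:$PERL5LIB"
    else
        PERL5LIB="$1"
    fi
    export PERL5LIB
}

# Example use, from the folder that contains cpan:
#   prepend_perl5lib "$(pwd)/cpan"
#   perl -MMojo::DOM -e 'print "Mojo::DOM loaded\n"'
```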


__________________________________________________________
C. Compiling Xpdf-Tools: statically or dynamically linked
__________________________________________________________

As explained in detail in section D below, we have a customised gs-CMakeLists.txt file which replaces the one in the xpdf-4.00.tar.gz package's xpdf subfolder after this is untarred. This customised CMake configure/make file now allows us to compile xpdf-tools either statically (which is what we've now set it up to do by default) or dynamically (which is what its CMake makefiles were originally set up for).

In the commands below, the "# <=========" markers are annotations for this readme. Leave them out of the actual script, since a comment placed after a trailing backslash breaks the line continuation.

1. To compile Xpdf-Tools statically, packages/CASCADE-MAKE/XPDFTOOLS.sh should contain:

    cmake -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=$prefix \
        -DZLIB_LIBRARY=$prefix/lib/libz.a \                    # <========= THIS
        -DPNG_LIBRARY=$prefix/lib/libpng15.a \                 # <========= THIS
        -DFREETYPE_LIBRARY=$prefix/lib/libfreetype.a \         # <========= THIS
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt4=1 \
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt5Widgets=1 \
        -DCMAKE_C_FLAGS="$CFLAGS" \
        -DCMAKE_CXX_FLAGS="$CXXFLAGS" \
        -DCMAKE_EXE_LINKER_FLAGS="$LDFLAGS" \
        -DGSDLFLAG_STATIC="$static_flag" \                     # <========= THIS
        $GEXT_XPDFTOOLS/packages/$package$version

In place of FREETYPE_LIBRARY above, we could also try the following,
        -DFREETYPE_DIR=$prefix \
but then check the built binaries by running "ldd" and "file" over them, to make sure they're not referencing any .so dynamic link libraries.
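
One way to do that check is to look for the phrase "statically linked" or "dynamically linked" that the "file" command prints for ELF executables on Linux. The sketch below assumes that wording (other platforms or newer versions of "file" may word it differently, in which case it reports "unknown"). The name linkage_of is a hypothetical helper, not part of the extension.

```shell
# Hypothetical helper: classify one line of "file" output.
# Prints "static", "dynamic" or "unknown".
linkage_of() {
    case "$1" in
        *"statically linked"*)  echo static ;;
        *"dynamically linked"*) echo dynamic ;;
        *)                      echo unknown ;;
    esac
}

# Example use over the built tools (run from the folder they were built into):
#   for tool in pdftohtml pdftotext pdftoppm; do
#       echo "$tool: $(linkage_of "$(file "$tool")")"
#   done
```

Running "ldd" over anything reported as "dynamic" then shows which .so files it still depends on.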


2. To compile Xpdf-Tools dynamically and make it find *our* dynamically linked libraries for its helper packages zlib, libpng, libjpeg and freetype, edit packages/CASCADE-MAKE/XPDFTOOLS.sh to contain:

    cmake -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=$prefix \
        -DZLIB_LIBRARY=$prefix/lib/libz.so.1.2.7 \                 # <========= THIS
        -DPNG_LIBRARY=$prefix/lib/libpng15.so.15.30.0 \            # <========= THIS
        -DJPEG_LIBRARY=$prefix/lib/libjpeg.so.PUT_THE_NUMBER_HERE \   # <========= THIS AND ENTER THE .SO VERSION NUMBER
        -DFREETYPE_LIBRARY=$prefix/lib/libfreetype.so.6.3.20 \     # <========= THIS
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt4=1 \
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt5Widgets=1 \
        -DCMAKE_C_FLAGS="$CFLAGS" \
        -DCMAKE_CXX_FLAGS="$CXXFLAGS" \
        -DCMAKE_EXE_LINKER_FLAGS="$LDFLAGS" \
        $GEXT_XPDFTOOLS/packages/$package$version                  # <=== -DGSDLFLAG_STATIC removed



    (1) In the above, you could also set
        -DFREETYPE_DIR=$prefix
    in place of
        -DGSDLFLAG_STATIC="$static_flag"

    In that case, it makes the xpdf-tools compilation find the "libfreetype.so" (no versioning at end) in our gs2-extension.
    After successfully building, make sure to have sourced the gs2-extension's setup.bash before running "ldd" over the
    generated xpdf-tools binaries, in order to let it use the $LD_LIBRARY_PATH we set to find our .so files.

    (2) Note that there is no equivalent for ZLIB and LIBPNG: doing -DZLIB_DIR=$prefix or -DPNG_DIR=$prefix will be
    ineffective, as neither is recognised by xpdf-tools' CMake set up.

__________________________________________________________
D. How we got Xpdf-Tools to compile using CASCADE-MAKE
__________________________________________________________

The process:

1. We set up a CASCADE-MAKE GS2-extension "xpdf-tools" at trac.greenstone.org/browser/gs2-extensions/xpdf-tools/trunk/src
Be aware that its lowercased "cascade-make" subfolder is an svn external; the original is at http://trac.greenstone.org/browser/other-projects/cascade-make/trunk/

So far, this CASCADE-MAKE project includes the Xpdf-Tools source tarball, its helper packages zlib, libpng and freetype, as well as CMake to compile the Xpdf-Tools source code.
The next step is to include JPEG and TIFF libraries too.

2a. We downloaded the Xpdf-Tools source tarball, xpdf-4.00.tar.gz, from the xpdf site at https://www.xpdfreader.com/download.html under section "Download the Xpdf source code".

The xpdf-tools source code tarball consists of the source for Xpdf-Tools and Xpdf (Xpdf-Reader). The Xpdf-Reader additionally requires Qt to build and run, but we don't want the Xpdf-Reader, just Xpdf-Tools.

2b. Compiling Xpdf-Tools from source and running them requires the following packages and libraries, as per the xpdf-tools source code INSTALL file:

To build xpdf-tools:
- CMake 2.8.8 or newer

Libraries to link against and used by xpdf-tools:
- FreeType 2.0.5 or newer
- libpng (for pdftoppm and pdftohtml)
- zlib (for pdftoppm and pdftohtml)


3. Compilation of xpdf-tools worked with CMake 3.11.4 on the linux resnet machine. However, CMake 3.11.4 itself failed to compile in the LSB environment and on the Mac Mountain Lion machine because of a version incompatibility between the older g++ installed there and the newer CMake 3.11.4.

CMake version 3.9.6 however is supposed to be compatible with older versions of g++, as per https://stackoverflow.com/questions/47886400/cmake-configure-error-in-3-10-1-but-not-in-3-9-6
To avoid installing newer versions of g++ and clang in the LSB virtual machine and the Mac, I've shifted the CMake version back to version 3.9.6, still


4a. On building xpdf-tools to work with dynamically linked libs found anywhere.

If compiling xpdf-tools against dynamically linked libraries for these packages, then the basic CMake command in packages/CASCADE-MAKE/XPDFTOOLS.sh can look like:
    cmake -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=$prefix \
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt4=1 \
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt5Widgets=1 \
        -DCMAKE_C_FLAGS="$CFLAGS" \
        -DCMAKE_CXX_FLAGS="$CXXFLAGS" \
        -DCMAKE_EXE_LINKER_FLAGS="$LDFLAGS" \
        $GEXT_XPDFTOOLS/packages/$package$version    # Note: no -DGSDLFLAG_STATIC=...

With the above, the xpdf-tools source code and its make files work out of the box.

4b. On building xpdf-tools to work with the dynamically linked libs for freetype, libpng and zlib that we produce when cascade-making the xpdf-tools gs2-extension.

Since we're compiling up freetype, libpng and zlib packages as part of the Xpdf-Tools GS2-extension with CASCADE-MAKE, the next step was to compile xpdf-tools by dynamically linking against our .so files for these 3 libraries. To do so, XPDFTOOLS.sh should have the following changes:

    (1) For linux, we need to build on the LSB environment.
    We're moreover hoping that 32 bit binaries generated this way will work on both 32 and 64 bit machines.

    However, on the 32 bit LSB environment, we additionally need to pass in "-march=i486|i586|i686" to gcc.
    Without it, things end up with the error
        undefined reference to `__sync_add_and_fetch_4'
    See https://stackoverflow.com/questions/130740/link-error-when-compiling-gcc-atomic-operation-in-32-bit-mode
    which further explains that
        "-march=" means "generate code for a particular CPU (and don't run on older CPUs)".
    So, although uname -m returns i686 on the 32 bit linux VM that generates the nightly bins, we
    still want to support i586 and i486 systems, so we pass in i486 as the architecture.
    Don't do this for 64 bit systems.
    And it seems it only needs to be set on CXXFLAGS in this case.

        arch=`uname -m`
        if [[ $arch = *"64"* ]]; then
            arch=
        else
            echo "@@@ 32 bit machine, need to pass in -march=i486 to avoid certain linking errors"
            arch="-march=i486"
        fi
        ...
        export CXXFLAGS="$CXXFLAGS -I$GEXTXPDFTOOLS_INSTALLED/include -I$GEXTXPDFTOOLS_INSTALLED/include/libpng15 $arch"
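
The same decision can also be written as a function that takes the machine name as an argument, which makes it possible to try out both branches without having a 32 bit machine at hand. This is a sketch only; march_flag_for is a hypothetical name and is not what XPDFTOOLS.sh actually uses.

```shell
# Hypothetical helper: choose the -march flag from a machine name
# as printed by "uname -m". Names containing "64" get no flag at all,
# everything else gets -march=i486 (same rule as the snippet above).
march_flag_for() {
    case "$1" in
        *64*) echo "" ;;
        *)    echo "-march=i486" ;;
    esac
}

arch=$(march_flag_for "$(uname -m)")
```

The resulting $arch can then be appended to CXXFLAGS exactly as in the export line above.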

    (2) Set up CFLAGS, CXXFLAGS, CPPFLAGS and LDFLAGS to help linkage of xpdf-tools find our .so versions of the necessary libs:

        export CFLAGS="$CFLAGS -I$GEXTXPDFTOOLS_INSTALLED/include -I$GEXTXPDFTOOLS_INSTALLED/include/libpng15"
        export CPPFLAGS="$CPPFLAGS -I$GEXTXPDFTOOLS_INSTALLED/include -I$GEXTXPDFTOOLS_INSTALLED/include/libpng15"
        export CXXFLAGS="$CXXFLAGS -I$GEXTXPDFTOOLS_INSTALLED/include -I$GEXTXPDFTOOLS_INSTALLED/include/libpng15 $arch"
        export LDFLAGS="$LDFLAGS -L$GEXTXPDFTOOLS_INSTALLED/lib"

    (3) The CMake command we run must pass the full paths to the actual .so library files (the ones with specific
    versions in their file names) rather than the symbolically linked generally-named .so files (the latter won't
    be found when building xpdf-tools and CMake will try to look for the .so library files elsewhere on the system):

        cmake -DCMAKE_BUILD_TYPE=Release \
            -DCMAKE_INSTALL_PREFIX=$prefix \
            -DZLIB_LIBRARY=$prefix/lib/libz.so.1.2.7 \                # <========= NEW
            -DPNG_LIBRARY=$prefix/lib/libpng15.so.15.30.0 \           # <========= NEW
            -DFREETYPE_LIBRARY=$prefix/lib/libfreetype.so.6.3.20 \    # <========= NEW
            -DCMAKE_DISABLE_FIND_PACKAGE_Qt4=1 \
            -DCMAKE_DISABLE_FIND_PACKAGE_Qt5Widgets=1 \
            -DCMAKE_C_FLAGS="$CFLAGS" \
            -DCMAKE_CXX_FLAGS="$CXXFLAGS" \
            -DCMAKE_EXE_LINKER_FLAGS="$LDFLAGS" \
            $GEXT_XPDFTOOLS/packages/$package$version    # Again: no -DGSDLFLAG_STATIC=...
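
Rather than typing each version number by hand, the versioned file behind a generally-named .so symlink can be looked up at build time. The sketch below assumes GNU "readlink -f" is available (it is on Linux, but not on older Macs); real_lib is a hypothetical helper name.

```shell
# Hypothetical helper: print the real file behind a library symlink,
# e.g. .../lib/libz.so -> .../lib/libz.so.1.2.7
# Assumes GNU readlink, which follows a whole chain of symlinks with -f.
real_lib() {
    readlink -f "$1"
}

# Example use in place of a hard-coded version number:
#   -DZLIB_LIBRARY=$(real_lib "$prefix/lib/libz.so") \
```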

Further, the "xpdf/CMakeLists.txt" file within the xpdf-4.00.tar.gz source code tarball needs to be modified to refer to ZLIB_LIBRARIES when linking pdftops and pdftoppm. The linking commands for *both* the "pdftops" and "pdftoppm" executable targets in xpdf/CMakeLists.txt should look like the following,

    target_link_libraries(pdftoppm goo fofi splash
                          ${FREETYPE_LIBRARY} ${FREETYPE_OTHER_LIBS}
                          ${DTYPE_LIBRARY}
                          ${LCMS_LIBRARY}
                          ${ZLIB_LIBRARIES})    # <========= NEW


    (4) Since CMakeLists.txt has been modified, we initially renamed the xpdf src tarball to gs-xpdf-4.00.tar.gz.
    However, the current version works with the regular downloaded xpdf-4.00.tar.gz tarball. But after extraction,
    XPDFTOOLS.sh copies across the custom packages/gs-CMakeLists.txt into the extracted tarball's xpdf subdirectory,
    renaming the file as CMakeLists.txt (so the path to it becomes "xpdf-4.00/xpdf/CMakeLists.txt"). In XPDFTOOLS.sh:

        # patch the original tarball with our custom makefile
        if [[ -d "$package$version/xpdf" && -f "gs-CMakeLists.txt" ]]; then
            echo "*******************************************************************"
            echo "Using our custom gs-CMakeLists.txt instead of the one included in $package$version"
            echo "Renaming gs-CMakeLists.txt to $package$version/xpdf/CMakeLists.txt"
            echo "*******************************************************************"

            cp "gs-CMakeLists.txt" "$package$version/xpdf/CMakeLists.txt"
        fi


4c. On building static xpdf-tools binaries using the static *.a freetype, libpng and zlib libraries that we produce when cascade-making the xpdf-tools gs2-extension.

In order to compile up xpdf-tools *statically*, so that it builds against the static *.a libraries of freetype, libpng and zlib that we produce during the gs2-extension's CASCADE-MAKE process, we have to make further modifications.

    (1) First, the XPDFTOOLS.sh cascade-make file should pass the full paths to the actual (non-symbolic link) .a file for each library.
    A custom GS flag, GSDLFLAG_STATIC, is also invented in gs-CMakeLists.txt and assigned "-static" for Linux
    and "-Bstatic" for Mac, to pass in during the linking stage of building xpdf-tools.

    For Mac OSX, when -static is passed in for linking as on linux, this produced the error
    "ld: library not found for -lcrt0.o" during the build of the xpdf-tools package. For information, see
    https://stackoverflow.com/questions/3801011/ld-library-not-found-for-lcrt0-o-on-osx-10-6-with-gcc-clang-static-flag
    The page https://stackoverflow.com/questions/844819/how-to-static-link-on-os-x mentions compiling
    with -Bstatic on Mac OSX instead. To do so, XPDFTOOLS.sh passes in the GSDLFLAG_STATIC set to either
    "-static" (for linux) or "-Bstatic" (for darwin).
    However, the last mentioned stackoverflow page also says that -Bstatic is a no-op, and this appears to be
    the case when "otool -L" is run over the generated xpdf-tools binaries: the binaries are all dynamically
    linked. Although they're finding our .so files of freetype, libpng and zlib, they're not finding the .a
    versions, even though XPDFTOOLS.sh tries to point gs-CMakeLists.txt to the correct .a files.

    The new modifications to XPDFTOOLS.sh:

        if [ "x$GSDLOS" = "xdarwin" ] ; then
            static_flag=-Bstatic
        else
            static_flag=-static
        fi

        ...
        cmake -DCMAKE_BUILD_TYPE=Release \
            -DCMAKE_INSTALL_PREFIX=$prefix \
            -DZLIB_LIBRARY=$prefix/lib/libz.a \                # <========= MODIFIED TO .a
            -DPNG_LIBRARY=$prefix/lib/libpng15.a \             # <========= MODIFIED TO .a
            -DFREETYPE_LIBRARY=$prefix/lib/libfreetype.a \     # <========= MODIFIED TO .a
            -DCMAKE_DISABLE_FIND_PACKAGE_Qt4=1 \
            -DCMAKE_DISABLE_FIND_PACKAGE_Qt5Widgets=1 \
            -DCMAKE_C_FLAGS="$CFLAGS" \
            -DCMAKE_CXX_FLAGS="$CXXFLAGS" \
            -DCMAKE_EXE_LINKER_FLAGS="$LDFLAGS" \
            -DGSDLFLAG_STATIC="$static_flag" \                 # <========= NEW
            $GEXT_XPDFTOOLS/packages/$package$version

    (2) Our customised gs-CMakeLists.txt file now checks for this flag GSDLFLAG_STATIC being set and, if it is,
    uses it during the linking stage. As in (1) above, it will be set to "-static" for Linux and "-Bstatic" for Mac.

    - When the flag is set, the linking flags passed into each occurrence of target_link_libraries() in
      gs-CMakeLists.txt are moreover manually written in the form of "-static -l<libs>" rather than using
      the default linking commands inherited from the original CMakeLists.txt.
    - If GSDLFLAG_STATIC isn't set, then we don't build statically, and the linking flags passed to each
      target_link_libraries() are mostly the original ones.

    For example,

        if(GSDLFLAG_STATIC)
            target_link_libraries(pdftoppm goo fofi splash
                ${GSDLFLAG_STATIC} -lfreetype ${DTYPE_LIBRARY} ${LCMS_LIBRARY} -lz -lm -lc -lpthread)
        else ()
            target_link_libraries(pdftoppm goo fofi splash
                ${FREETYPE_LIBRARY} ${FREETYPE_OTHER_LIBS}
                ${DTYPE_LIBRARY}
                ${LCMS_LIBRARY}
                ${ZLIB_LIBRARIES})
        endif ()

    DETAILED EXPLANATION:
    We found that when building *statically*, gs-CMakeLists.txt needed to NOT use the PNG_LIBRARIES, ZLIB_LIBRARIES
    and FREETYPE_LIBRARY in its linker commands, target_link_libraries(), as doing so produced partially dynamic
    xpdf-tools executables which were moreover BROKEN. They wouldn't run: attempting to run an xpdf-tool,
    like "./pdftohtml", would produce a file not found error, something like "bash: no such file or directory".

    Online discussions mentioned that this generally happens when attempting to run 32 bit executables on 64 bit
    linux when 32 bit loaders are not installed. (In such cases, the solution was to apt-get install some 32 bit package.)
    However, our broken binaries were all 64 bit, as indicated when running the "file" command on them. Nor did their
    being partially dynamically linked executables imply that they would be broken, as we were eventually
    able to produce partially dynamic executables that did work, before solving static linking altogether.

    The real issue was that including references to ${FREETYPE_LIBRARY} ${FREETYPE_OTHER_LIBS}, ${PNG_LIBRARIES} and
    ${ZLIB_LIBRARIES} in any target_link_libraries() resulted in the wrong linking command, producing broken binaries.

    Doing the regular target_link_libraries() in static mode results in building with
        "-Wl,-Bstatic -lfreetype -lpng15 -lz -Wl,-Bdynamic -lpthread" at end of link line
    and produces broken binaries for pdftohtml/pdftoppm/pdftops/pdftopng.

    Note that PNG_LIBRARIES includes zlib (-lz): "-lpng -lz", and along with freetype,
    these are linked statically. However, Threads/lpthread is included as a dynamically
    linked library instead of including a .a (regardless of whether it's appended
    as -lpthread or Threads::Threads in the target_link_libraries()), contributing to
    the pdftohtml binary produced being a partially static, partially dynamic one,
    so a dynamic executable overall.

    The order of dynamic .so files listed by ldd in the broken static binary of pdftohtml differs from
    a manually statically linked working version of pdftohtml, and seems to be the only difference
    between the two in ldd's output. Not using "-Wl,-Bstatic" and using -static (-Bstatic on Mac)
    in its place creates a partially static, partially dynamic executable that isn't broken, whereas
    additionally removing "-Wl,-Bdynamic -lpthread" and replacing it with -lpthread
    moreover produces a working pdftohtml that is a fully statically linked executable.

    The inclusion of the math lib and c lib (lm and lc) in the final link command
    is to completely bypass the remaining .so dependencies that were present in
    the executable and produce the fully static executable. The lm and lc libs were referenced
    by all xpdf-tool binaries (as indicated when generating dynamic ones and running ldd over them)
    but Dr Bainbridge said that -lm and -lc were libs passed in by the compiler by default,
    which would explain why explicitly setting them for some xpdftools and not others may not have
    mattered.

NOTES:
Initial attempts at modifying gs-CMakeLists.txt for static compiling that proved to be unnecessary:

    (i) Setting -static globally doesn't have a useful effect.

        # We want to build static xpdf-tools binaries. See
        # https://stackoverflow.com/questions/24648357/compiling-a-static-executable-with-cmake
        # Want to make the min number of changes for building statically, so using the way
        # below. Beware, must *append* "-static" to existing CMAKE_EXE_LINKER_FLAGS=LD_FLAGS
        ##SET(CMAKE_FIND_LIBRARY_SUFFIXES ".a")
        ##SET(BUILD_SHARED_LIBS OFF)
        ##SET(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -static")

    The above 3 lines just add a -static before the "-O2 -Wall -fPIC -rdynamic ..." during linking, such as below.
    But they have no further effect on whether static building actually succeeds or not. The only effective static
    linking command (for Linux so far) was to pass -static in the target_link_libraries() command followed by the
    "-l<libname>" for each library in the correct order.

----
/usr/bin/c++ -I/home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/include -I/home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/include -I/home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/include/libpng15 -O3 -Wall -fPIC -L/home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/lib -L/home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/lib -static ***** <- HERE ****** -O2 -Wall -fPIC -rdynamic CMakeFiles/pdftohtml.dir/HTMLGen.cc.o CMakeFiles/pdftohtml.dir/SplashOutputDev.cc.o CMakeFiles/pdftohtml.dir/TextOutputDev.cc.o CMakeFiles/pdftohtml.dir/pdftohtml.cc.o CMakeFiles/xpdf_objs.dir/AcroForm.cc.o CMakeFiles/xpdf_objs.dir/Annot.cc.o CMakeFiles/xpdf_objs.dir/Array.cc.o CMakeFiles/xpdf_objs.dir/BuiltinFont.cc.o CMakeFiles/xpdf_objs.dir/BuiltinFontTables.cc.o CMakeFiles/xpdf_objs.dir/Catalog.cc.o CMakeFiles/xpdf_objs.dir/CharCodeToUnicode.cc.o CMakeFiles/xpdf_objs.dir/CMap.cc.o CMakeFiles/xpdf_objs.dir/Decrypt.cc.o CMakeFiles/xpdf_objs.dir/Dict.cc.o CMakeFiles/xpdf_objs.dir/Error.cc.o CMakeFiles/xpdf_objs.dir/FontEncodingTables.cc.o CMakeFiles/xpdf_objs.dir/Form.cc.o CMakeFiles/xpdf_objs.dir/Function.cc.o CMakeFiles/xpdf_objs.dir/Gfx.cc.o CMakeFiles/xpdf_objs.dir/GfxFont.cc.o CMakeFiles/xpdf_objs.dir/GfxState.cc.o CMakeFiles/xpdf_objs.dir/GlobalParams.cc.o CMakeFiles/xpdf_objs.dir/JArithmeticDecoder.cc.o CMakeFiles/xpdf_objs.dir/JBIG2Stream.cc.o CMakeFiles/xpdf_objs.dir/JPXStream.cc.o CMakeFiles/xpdf_objs.dir/Lexer.cc.o CMakeFiles/xpdf_objs.dir/Link.cc.o CMakeFiles/xpdf_objs.dir/NameToCharCode.cc.o CMakeFiles/xpdf_objs.dir/Object.cc.o CMakeFiles/xpdf_objs.dir/OptionalContent.cc.o CMakeFiles/xpdf_objs.dir/Outline.cc.o CMakeFiles/xpdf_objs.dir/OutputDev.cc.o CMakeFiles/xpdf_objs.dir/Page.cc.o CMakeFiles/xpdf_objs.dir/Parser.cc.o CMakeFiles/xpdf_objs.dir/PDFDoc.cc.o CMakeFiles/xpdf_objs.dir/PDFDocEncoding.cc.o CMakeFiles/xpdf_objs.dir/PSTokenizer.cc.o CMakeFiles/xpdf_objs.dir/SecurityHandler.cc.o CMakeFiles/xpdf_objs.dir/Stream.cc.o CMakeFiles/xpdf_objs.dir/TextString.cc.o CMakeFiles/xpdf_objs.dir/UnicodeMap.cc.o CMakeFiles/xpdf_objs.dir/UnicodeTypeTable.cc.o CMakeFiles/xpdf_objs.dir/UTF8.cc.o CMakeFiles/xpdf_objs.dir/XFAForm.cc.o CMakeFiles/xpdf_objs.dir/XRef.cc.o CMakeFiles/xpdf_objs.dir/Zoox.cc.o -o pdftohtml ../goo/libgoo.a ../fofi/libfofi.a ../splash/libsplash.a -static -lfreetype -lpng -lz -lm -lc -lpthread
----

    (ii) Threads::Threads instead of -lpthread results in a partially dynamic executable.

        # The original, unmodified CMakeLists.txt was not set up sufficiently
        # for static compilation of xpdf-tools. As a result, compile would first fail
        # with errors about undefined refs to mutex / lpthread.
        # When building xpdf-tools statically, need to add the following 2 lines as well
        # as append "Threads::Threads" to the end of each "target_link_libraries(<list>)"
        # See https://stackoverflow.com/questions/1620918/cmake-and-libpthread
        # found googling cmake and "-lpthread" (pthread) after ERRORS to do with this, like:
        # undefined reference to `pthread_mutex_unlock'
        ##set(THREADS_PREFER_PTHREAD_FLAG ON)
        ##find_package(Threads REQUIRED)

    In instances when compilation was successful, including the above 2 lines in combination with "Threads::Threads"
    as the final argument to every target_link_libraries(...) occurrence in gs-CMakeLists.txt would only manage to
    produce partially dynamically linked xpdftools binaries. (Depending on what the linking command was when building
    Xpdf-Tools, the partially dynamically linked executables may work or may be broken. See explanation further above.)
    We wanted fully statically linked binaries, for which we needed to pass in "-lpthread" as the trailing argument
    to each target_link_libraries(...). So without either, compilation will fail. However, with "Threads::Threads"
    the binaries weren't fully static, whereas with -lpthread the xpdftools executables were fully static as CMake no
    longer tried to link against a dynamic Threads library.

438
[32256]4395. To view the unmodified CMakeLists.txt included in the xpdf-4.00 source code tarball, untar it and look for its "xpdf/CMakeLists.txt" (not the toplevel file of the same name).
[32248]440Run a 'diff' against gs-CMakeLists.txt to see further differences, such as debug statements and comments. Most comments have been removed and placed into this readme file instead.
441
442
[32256]4436. When CASCADE-MAKE is run on the xpdf-tools GS2-extension, it first compiles up CMake, needed to compile up xpdf-tools.
[32248]444Unlike the library packages like freetype, libpng and zlib that we also build for xpdf-tools as part of this gs2-extension, CMake's build products don't need to be included in the distribution tarball of our built xpdf-tools executables.
445
446There's a "move-cmake.sh" script in the xpdf-tools gs2-extension that can be run with the "away" and "back" options to move the CMake stuff out of the way (into a "devel" folder) after successfully building xpdf binaries and that can also be run to move them back if wanting to recompile.
447
448The script can be run manually, but it's also run by the extension:
449- packages/CASCADE-MAKE/XPDFTOOLS.sh runs "move-cmake.sh away" after xpdf-tools has been built, so that the extension's install location is ready for tarring up for distribution.
450- When recompiling the xpdf-tools extenion, the CASCADE-MAKE process will run packages/CASCADE-MAKE/CMAKE.sh file which in turn runs "move-cmake.sh back" if there's a prebuilt CMake which had earlier been moved out of the way.
451
452
__________________________________________________________
E. Getting more output when running CMake (verbosity)
__________________________________________________________
See https://www.linuxquestions.org/questions/programming-9/cmake-or-make-debug-output-show-command-624800/
To turn on debugging:
    export VERBOSE=1
    ./CASCADE-MAKE.sh

To turn off debugging, VERBOSE actually needs to be made empty again (don't set it to 0):
    export VERBOSE=
    ./CASCADE-MAKE.sh
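
To avoid forgetting to clear VERBOSE afterwards, it can also be set for a single command only, so that it never lingers in the shell. A small sketch; with_verbose is a hypothetical wrapper name, and it assumes VERBOSE has not already been exported in the current shell.

```shell
# Hypothetical wrapper: run one command with VERBOSE=1 in its
# environment, without exporting VERBOSE in the calling shell.
with_verbose() {
    VERBOSE=1 "$@"
}

# Example use:
#   with_verbose ./CASCADE-MAKE.sh
```

Typing "VERBOSE=1 ./CASCADE-MAKE.sh" directly on one line has the same effect.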
464
465
466__________________________________________________________
467F. APPENDIX - Useful links
468__________________________________________________________
469A. Helping CMake along. (Not all of this was necessary for compiling xpdftools statically, but they're generally useful links)
470
471https://github.com/SynoCommunity/spksrc/issues/1779
472https://stackoverflow.com/questions/1620918/cmake-and-libpthread
473https://cmake.org/cmake/help/v3.0/prop_tgt/LINK_FLAGS.html
474https://cmake.org/cmake/help/v3.11/command/target_link_libraries.html?highlight=target_link_libraries
475https://stackoverflow.com/questions/24648357/compiling-a-static-executable-with-cmake
476https://stackoverflow.com/questions/42815420/cmake-cant-find-my-static-libs
477https://cmake.org/cmake/help/v3.0/command/message.html
478https://stackoverflow.com/questions/30980383/cmake-compile-options-for-libpng
479 https://stackoverflow.com/questions/36220123/undefined-reference-to-png-set-longjmp-fn-when-compiling-pcl-source-file
480
481
482B. About the error "bash: no such file or directory" when run on a statically generated binary:
483
484https://askubuntu.com/questions/351827/unable-to-run-a-32-bit-program-on-64-bit-vm/353497#353497
485https://unix.stackexchange.com/questions/13391/getting-not-found-message-when-running-a-32-bit-binary-on-a-64-bit-system/13409#13409
486https://arstechnica.com/civis/viewtopic.php?f=16&t=1173118
487https://superuser.com/questions/344533/no-such-file-or-directory-error-in-bash-but-the-file-exists
488https://unix.stackexchange.com/questions/45277/executing-binary-file-file-not-found
489
490C. Other links
491
492https://unix.stackexchange.com/questions/279397/ldd-dont-find-path-how-to-add
493
494
[32251]495D. On why you can't build static binaries on Mac, but can build static libraries and link against them
496
497https://developer.apple.com/library/archive/qa/qa1118/_index.html (official page on how Mac doesn't support static binaries)
498https://stackoverflow.com/questions/3801011/ld-library-not-found-for-lcrt0-o-on-osx-10-6-with-gcc-clang-static-flag
499https://stackoverflow.com/questions/844819/how-to-static-link-on-os-x (mention of -Bstatic)
500https://www.allegro.cc/forums/thread/610923
[32252]501https://stackoverflow.com/questions/5259249/creating-static-mac-os-x-c-build (has some other suggestions)
502 http://www.network-theory.co.uk/docs/gccintro/gccintro_79.html
503Dead end: https://nelsonslog.wordpress.com/2013/04/24/macos-doesnt-support-static-binaries/
504https://dropline.net/2015/10/static-linking-on-mac-os-x/
505 explains that on Mac, .dylibs must be hidden for .a versions of libraries to be selected when linking
506 This must be true for non-system dylibs too.
507 This means that where possible we want to essentially do "--enable-static --disable-shared", or equivalent,
508 when generating freetype, libz, libpng, libjpg, libtiff library files, so that Xpdf-Tools links against the
509 .a files we generated rather than additional .dylib files
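
One way to check that only .a files are on offer to the linker is to list any shared libraries left in the lib folder. The following is a minimal sketch: list_shared_libs is a hypothetical helper and the demonstration folder is a throwaway stand-in for the real prefix.

```shell
# Hypothetical helper: lists shared libraries directly under lib folder $1.
# An empty result means only static .a files are available to the linker.
list_shared_libs() {
    find "$1" -maxdepth 1 \( -name '*.dylib' -o -name '*.so' -o -name '*.so.*' \) | sort
}

# Each dependency would be configured along the lines of
#   ./configure --prefix="$prefix" --enable-static --disable-shared
# or the package's equivalent, before running the check on "$prefix/lib".

# Demonstration on a throwaway folder rather than a real build prefix.
demo_lib=$(mktemp -d)
touch "$demo_lib/libz.a" "$demo_lib/libpng15.a" "$demo_lib/libtiff.5.dylib"
list_shared_libs "$demo_lib"
```

In the demonstration the stray libtiff.5.dylib is reported, which is the kind of file that would need to be hidden or not generated at all.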
[32251]510
511http://www.simplesystems.org/libtiff/build.html
512configuration options for building libtiff. Want to turn off the compile process for libtiff producing tiff binaries, but there appears to be no such option.
513
514
[32249]515__________________________________________________________
516G. LIBJPEG and LIBTIFF
517__________________________________________________________
518
[32253]5191. The first version of LIBJPEG to work out was version 6b, which required some patching up before it could be built, see point 2 below.
520Besides the fact that version 6b needed patching up, it was also from 2008. I've now found a version of libjpeg from Jan 2018, called "jpegsrc.v9c.tar.gz"
521which was downloadable from www.ijg.org at http://www.ijg.org/files/jpegsrc.v9c.tar.gz. Version 9c can build both static and dynamically linked libraries of
522libjpeg, though we only want the former. (The older version 6b could only generate the static libjpeg.a library file, contrary to online instructions.)
[32249]523
[32253]524As needed to be done with the older 6b version, this tarball was renamed to jpeg-9c.tar.gz to fit the naming pattern of its folder once extracted.
525
526There was an incompatibility between the existing CASCADE-MAKE/LIBJPEG.sh and the Makefile generated by configuring the Makefile.in/.am in the jpeg-9c tarball.
527The LIBJPEG.sh would run "make install-lib" at the end, to install the libjpeg.a in the lib folder and to install 4 header files. This is as per the install.txt
528instructions in the older and current version of jpeg src tarball. However, the header files never got installed when doing so, whether in version 6b or the
529current 9c. And install-lib is not a recognised target in 9c's Makefile, where the target is install-libLTLIBRARIES. So LIBJPEG.sh has been modified to use this
530target name and to moreover copy over the header files (even though they weren't necessary when compiling xpdftools against the libjpeg 6b library previously and
531possibly now with 9c).
532
533Since we want to only generate libjpeg.a and not the .so/.dylib dynamically linked versions, the latter is turned off during configure by passing --disable-shared.
534
535A final change made to LIBJPEG.sh was to undo it copying over the patch file "gs-libjpeg-config.sub" into the extracted jpeg tarball, since the patch was only
536necessary for libjpeg version 6b and not for 9c. These steps have been commented out in LIBJPEG.sh now.
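
The modified build and header-copying steps can be sketched roughly as follows. This is not the actual content of LIBJPEG.sh: copy_jpeg_headers is a hypothetical helper, and the four header names are an assumption based on the headers libjpeg's install instructions refer to.

```shell
# ASSUMPTION: these are the four headers libjpeg's install docs refer to.
JPEG_HEADERS="jconfig.h jpeglib.h jmorecfg.h jerror.h"

# Hypothetical helper: copy the headers from the extracted source folder ($1)
# into the include folder ($2), creating that folder if needed.
copy_jpeg_headers() {
    mkdir -p "$2" || return 1
    for header in $JPEG_HEADERS; do
        cp "$1/$header" "$2/" || return 1
    done
}

# The build steps described above would precede the copy, roughly:
#   ./configure --prefix="$prefix" --disable-shared
#   make
#   make install-libLTLIBRARIES
#   copy_jpeg_headers . "$prefix/include"

# Demonstration with empty stand-in files instead of a real jpeg-9c folder.
demo_src=$(mktemp -d)
demo_prefix=$(mktemp -d)
for header in $JPEG_HEADERS; do touch "$demo_src/$header"; done
copy_jpeg_headers "$demo_src" "$demo_prefix/include"
ls "$demo_prefix/include"
```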
537
538
5392. Issues building LIBJPEG VERSION 6b on 64 bit machines and the patch
540
541LIBJPEG version 6b is from 2008.
542
[32249]543I copied the LIBJPEG package from http://trac.greenstone.org/browser/other-projects/realistic-books/trunk/packages (also at http://trac.greenstone.org/browser/gs2-extensions/ocr/trunk/packages/cmdline).
544
545 * Configuring out of the box produced the following error:
546 checking host system type... Invalid configuration `x86_64-unknown-linux-gnu': machine `x86_64-unknown' not recognized
547
548 * As a consequence, when running make on the libjpeg package, make failed with the error:
549 ./libtool --mode=compile gcc -I/home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/include -fPIC -I/home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/include -I. -c ./jcapimin.c
550 make: ./libtool: Command not found
551 make: *** [jcapimin.lo] Error 127
552 Error encountered running *make * stage of ./CASCADE-MAKE/LIBJPEG.sh
553
554The same was true when I grabbed the libjpeg from sourceforge (https://sourceforge.net/projects/libjpeg/files/), which was also still version jpeg 6b from 2008.
555
556I found the following webpages discussing the above error messages:
557- https://unix.stackexchange.com/questions/80479/how-to-work-with-libtool
558- https://github.com/rwestlund/freesweep/issues/1
559- https://ubuntuforums.org/showthread.php?t=1232714
560- https://stackoverflow.com/questions/12828687/configure-fails-to-detect-proper-ld-on-a-64-bit-system-with-32-bit-userland
561- SOLUTION: https://sourceforge.net/p/libjpeg/bugs/12/
562
563However, the error only strikes when configure is run with --enable-static.
564
565Note also that contrary to the above pages, running configure with the additional options
566 --host=x86_64-linux-gnu --build=x86_64-linux-gnu --target=x86_64-linux-gnu --disable-shared --enable-static
567did not help. Nor did adding the above flags get rid of configure attempting to work with host=x86_64-unknown(-unknown)-linux-gnu
568
569The SOLUTION, found when searching for the error message along with "enable-static", as it's the combination that is relevant, is described
570at https://sourceforge.net/p/libjpeg/bugs/12/
571
572which was to patch up the config.sub file included in the jpeg-6b tarball, to also cover x86_64-* machines:
573 tahoe | i860 | x86_64-* | m32r | m68k | m68000 | m88k | ns32k | arc | arm \
574
575The above change is necessary because this libjpeg is outdated and has been superseded by other JPEG libraries, also discussed at https://sourceforge.net/p/libjpeg/bugs/12/
576I'm not sure if those libraries are compatible with XpdfTools however, so I'm sticking with libjpeg as long as I can get it to build and be recognised by XpdfTools.
577
578The solution is once more to have a patch file: after the jpeg-6b package is untarred, CASCADE-MAKE/LIBJPEG.sh replaces the config.sub within it with packages/gs-libjpeg-config.sub, which contains the patch.
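
A sketch of the replacement step is given below. It is an illustration only: patch_config_sub is a hypothetical helper, and the check for an existing x86_64 entry is an added safeguard that the real LIBJPEG.sh may not perform. The stand-in files merely mimic the relevant line of config.sub.

```shell
# Hypothetical helper: replace config.sub in the extracted package folder ($1)
# with the patched copy ($2), but only if the original has no x86_64 entry.
patch_config_sub() {
    if grep -q 'x86_64' "$1/config.sub"; then
        echo "config.sub already mentions x86_64, left as is"
    else
        cp "$2" "$1/config.sub" && echo "config.sub replaced with patched copy"
    fi
}

# Demonstration with stand-in files instead of the real jpeg-6b folder.
demo_pkg=$(mktemp -d)
demo_patch=$(mktemp)
echo 'tahoe | i860 | m32r | m68k | m68000 | m88k | ns32k | arc | arm' > "$demo_pkg/config.sub"
echo 'tahoe | i860 | x86_64-* | m32r | m68k | m68000 | m88k | ns32k | arc | arm' > "$demo_patch"

patch_config_sub "$demo_pkg" "$demo_patch"    # first run copies the patch in
patch_config_sub "$demo_pkg" "$demo_patch"    # second run finds x86_64 and does nothing
```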
579
580
5812. I followed the instructions at http://www.linuxfromscratch.org/blfs/view/6.3/general/libjpeg.html
582to try to build libjpeg with --enable-static and --enable-shared to produce both libjpeg.a and libjpeg.so.
583
584However, nothing I tried got it to generate a libjpeg.so. It seems to always produce a libjpeg.a in xpdf-tools/linux/lib
585regardless of whether CASCADE-MAKE/LIBJPEG.sh passes the --enable-static flag to the configure command or not, and regardless of whether --enable-shared is additionally or individually passed in.
586
587As a consequence, there's no libjpeg.so file to set the -DJPEG_LIBRARY flag in XPDFTOOLS.sh to for when building xpdf-tools against dynamically linked libraries.
588
589I tried the various combinations with the lib jpeg-6b source tarballs from
590- sourceforge, https://sourceforge.net/projects/libjpeg/files/, the latest tarball of this was from 2008
591- http://www.linuxfromscratch.org/blfs/view/6.3/general/libjpeg.html, which was last updated in 2007
592- http://trac.greenstone.org/browser/other-projects/realistic-books/trunk/packages/jpeg-6b.tar.gz, which was added to trac in 2009 but is probably the 2008 or 2007 version too.
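
Since which libjpeg file ends up in the lib folder decides what -DJPEG_LIBRARY can be set to, a small check like the following could be used before calling CMake. This is a sketch under assumptions: pick_jpeg_library is a hypothetical helper, and preferring a shared library over the static one is just one possible policy.

```shell
# Hypothetical helper: print the libjpeg file under prefix ($1) that could be
# passed to -DJPEG_LIBRARY, preferring a shared library when one exists.
pick_jpeg_library() {
    for candidate in "$1/lib/libjpeg.so" "$1/lib/libjpeg.dylib" "$1/lib/libjpeg.a"; do
        if [ -f "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    echo "no libjpeg found under $1/lib" >&2
    return 1
}

# Demonstration: a stand-in prefix holding only the static library,
# which is the situation described above for jpeg-6b.
demo_jpeg_prefix=$(mktemp -d)
mkdir -p "$demo_jpeg_prefix/lib"
touch "$demo_jpeg_prefix/lib/libjpeg.a"
pick_jpeg_library "$demo_jpeg_prefix"
```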
593
594
5953. Modifications for using TIFF and JPEG libraries when building Xpdf-Tools:
596
597* CASCADE-MAKE.sh, replaced
598 PACKAGES="CMAKE LIBZ LIBPNG FREETYPE XPDFTOOLS"
599with
600 PACKAGES="CMAKE LIBZ LIBTIFF LIBPNG LIBJPEG FREETYPE XPDFTOOLS"
601
602
603* XPDFTOOLS.sh
604If compiling statically make sure the CMake command contains the following changes:
605 -DTIFF_INCLUDE_DIR=$prefix/include \ # <========== new
606 -DJPEG_INCLUDE_DIR=$prefix/include \ # <========== new
607 -DZLIB_LIBRARY=$prefix/lib/libz.a \
608 -DTIFF_LIBRARY=$prefix/lib/libtiff.a \ # <========== new
609 -DPNG_LIBRARY=$prefix/lib/libpng15.a \
610 -DJPEG_LIBRARY=$prefix/lib/libjpeg.a \ # <========== new
611 -DFREETYPE_LIBRARY=$prefix/lib/libfreetype.a \
612 -DGSDLFLAG_STATIC="$static_flag" \
613
614
615
616The above flag names were discovered by deleting the untarred xpdf-4.00 folder.
617Then in a fresh terminal, source devel.bash from xpdf-tools and re-run CASCADE-MAKE.sh without the above modifications:
618
619 -- Found FreeType (new-style includes): /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/lib/libfreetype.a
620 -- Found ZLIB: /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/lib/libz.a (found version "1.2.8")
621 -- Found PNG: /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/lib/libpng15.a (found version "1.2.50")
622 -- Could NOT find JPEG (missing: JPEG_LIBRARY JPEG_INCLUDE_DIR)
623 -- Could NOT find TIFF (missing: TIFF_LIBRARY TIFF_INCLUDE_DIR)
624 -- lcms2 not found
625 -- No Qt library found
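
The variable names can also be pulled out of such output mechanically. The following is a minimal sketch: missing_cmake_vars is a hypothetical helper, and the sample lines are taken from the output above with the PNG path shortened.

```shell
# Hypothetical helper: read CMake configure output on stdin and print one
# "PACKAGE: VAR VAR ..." line per "Could NOT find" message.
missing_cmake_vars() {
    sed -n 's/.*Could NOT find \([A-Za-z0-9_]*\) (missing: \([^)]*\)).*/\1: \2/p'
}

# Sample lines from the output shown above (PNG path shortened).
missing_cmake_vars <<'EOF'
-- Found PNG: /path/to/libpng15.a (found version "1.2.50")
-- Could NOT find JPEG (missing: JPEG_LIBRARY JPEG_INCLUDE_DIR)
-- Could NOT find TIFF (missing: TIFF_LIBRARY TIFF_INCLUDE_DIR)
-- lcms2 not found
EOF
```

Each name printed after the colon corresponds to a -D flag that can be passed to the CMake command in XPDFTOOLS.sh.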
626
627
628* packages/gs-CMakeLists.txt was modified again,
629
630 - this time to also pass:
631 -ltiff and -ljpeg to all target_link_libraries() commands that run when GSDLFLAG_STATIC is set
632 and
633 ${TIFF_LIBRARY} and ${JPEG_LIBRARY} to all target_link_libraries() commands that run when GSDLFLAG_STATIC is not set
634
635 - And to add in the include directories and definitions if JPEG/TIFF libraries were provided:
636 if (JPEG_FOUND)
637 include_directories("${JPEG_INCLUDE_DIR}")
638 add_definitions("${JPEG_DEFINITIONS}")
639 message(STATUS "@@@@@@@@@@@@@@@ JPEG_FOUND (include_dir ; include_dirs): ${JPEG_INCLUDE_DIR} ; ${JPEG_INCLUDE_DIRS}")
640 else ()
641 message(STATUS "@@@@@@@@@@@@@@@ NO JPEG_FOUND")
642 endif ()
643 if (TIFF_FOUND)
644 include_directories("${TIFF_INCLUDE_DIRS}")
645 add_definitions("${TIFF_DEFINITIONS}")
646 message(STATUS "@@@@@@@@@@@@@@@ TIFF_FOUND ${TIFF_INCLUDE_DIRS}")
647 else ()
648 message(STATUS "@@@@@@@@@@@@@@@ NO TIFF_FOUND")
649 endif ()
650
651 Note however that although gs-CMakeLists.txt now knows what the pluralised TIFF_INCLUDE_DIRS is (and TIFF_INCLUDE_DIR),
652 as for PNG and ZLIB, gs-CMakeLists.txt does not have a value for the pluralised JPEG_INCLUDE_DIRS, only the
653 singular JPEG_INCLUDE_DIR set above. This is even though the CMake flags in XPDFTOOLS.sh for the tiff and jpeg libs seem to have been set up
654 in the same way now. Not sure where these automatically assigned variables come from in order to check up on them.
655
[32253]656__________________________________________________________
657H. Licensing information and making the distributable tarball
658__________________________________________________________
[32249]659
[32253]660XpdfTools' README lists which files need to be included as per its license when redistributing xpdf-tools binaries.
661
[32258]662Running "./CASCADE-MAKE.sh makedist" assembles a custom whitelist of files to include in the distribution tarball of the xpdf-tools we compile up.
[32253]663
[32258]664The files and folders that go into the distribution tarball xpdf-tools-GSDLOS.tar.gz are:
665- the GSDLOS/bin/pdf* statically linked binaries (or dynamic executables linked against mostly static libraries in the case of Macs),
[32259]666- the GSDLOS/man folder as well as the further compulsory files README, COPYING and COPYING3 as required for xpdf-tools' license.
[32253]667
[32258]668Beware that the cascade-make makedist function always maintains the directory structure of both the folders and the files included in the whitelist.
[32259]669So when untarred, the folder xpdf-tools is produced with subfolders like linux/bin (containing the pdf* binaries), a linux/man subfolder
670and files README, COPYING, COPYING3.
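
The effect of a whitelist on the tarball's layout can be sketched as below. This is not the actual makedist implementation: make_dist_tarball is a hypothetical helper, and the exact location of README, COPYING and COPYING3 inside the xpdf-tools folder is an assumption made for the demonstration.

```shell
# Hypothetical helper: archive the whitelisted paths (one per line in file $2,
# relative to folder $1) into tarball $3, keeping their directory structure.
# $2 and $3 must be absolute paths, since tar runs from inside $1.
make_dist_tarball() {
    ( cd "$1" && tar -czf "$3" -T "$2" )
}

# Demonstration on a throwaway tree standing in for the compiled extension.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/xpdf-tools/linux/bin" "$demo_root/xpdf-tools/linux/man"
touch "$demo_root/xpdf-tools/linux/bin/pdftohtml" \
      "$demo_root/xpdf-tools/README" \
      "$demo_root/xpdf-tools/COPYING" \
      "$demo_root/xpdf-tools/COPYING3"
printf '%s\n' xpdf-tools/linux/bin xpdf-tools/linux/man \
      xpdf-tools/README xpdf-tools/COPYING xpdf-tools/COPYING3 \
      > "$demo_root/whitelist.txt"

make_dist_tarball "$demo_root" "$demo_root/whitelist.txt" "$demo_root/xpdf-tools-linux.tar.gz"
tar -tzf "$demo_root/xpdf-tools-linux.tar.gz"
```

The listing shows every entry under xpdf-tools/, so untarring recreates that folder rather than dropping files into the current directory.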
[32258]671
672
[32250]673__________________________________________________________
[32253]674I. PDF2DOM: tried it out, but wasn't what we wanted
[32250]675__________________________________________________________
676Using PDFBox to convert a PDF to full HTML, with both images and text placed correctly with respect to each other, is tricky; see https://stackoverflow.com/questions/9671239/pdfbox-convert-a-pdf-to-text-or-html-including-images-from-the-pdf
677(Google: pdfbox to convert pdf to html with images)
678
679PDF2DOM tool (based on PDFBox) to convert PDF to HTML with images
680* http://cssbox.sourceforge.net/pdf2dom/documentation.php
681* Got the command line jar tool, PDFToHTML.jar version 1.7, from https://sourceforge.net/projects/cssbox/files/Pdf2DOM/
682* Further information and source code at https://github.com/radkovo/Pdf2Dom
683* API: http://cssbox.sourceforge.net/pdf2dom/api/index.html
684
685
6861. Running
687
688java -jar PDFToHTML.jar <infile> [<outfile>]
689
690 greenstone@machine-name:~/Downloads$ java -jar PDFToHTML.jar SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
691
692
693It will output the page, but you'll see the following output indicating that the logger is not displaying anything:
694 SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
695 SLF4J: Defaulting to no-operation (NOP) logger implementation
696 SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
697
698See https://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder
699
700To see error output, download the SLF4J simple jar and run as follows:
701
702 greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
703
704The above is a MS Word produced PDF (archive format) and works fine: font folder generated containing the extracted fonts
705
706The following is a PDF produced from the same doc file by the latest libreoffice installed on Windows:
707 ApacheLicencePDFA_FromODT.pdf
708But running the same command on it produces the following font errors:
709
710greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA_FromODT.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
711[main] INFO org.reflections.Reflections - Reflections took 163 ms to scan 1 urls, producing 36 keys and 222 values
712[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
713[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
714[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
715[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
716
717Fonts get extracted if the source PDF was generated by MS Word's doc to PDF conversion. Fonts didn't get extracted from PDF upon conversion to HTML when libreoffice was used to convert a .doc to the source PDF.
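
To compare PDFs quickly, the fonts that failed to load can be extracted from the log output. The following is a minimal sketch: failed_fonts is a hypothetical helper that relies only on the wording of the warnings shown above.

```shell
# Hypothetical helper: read Pdf2Dom log output on stdin and print, once each,
# the fonts named in "Error loading font" warnings.
failed_fonts() {
    sed -n "s/.*Error loading font '\([^']*\)'.*/\1/p" | sort -u
}

# Sample lines copied from the log output shown above.
failed_fonts <<'EOF'
[main] INFO org.reflections.Reflections - Reflections took 163 ms to scan 1 urls, producing 36 keys and 222 values
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
EOF
```

When piping from the java command directly, the log lines probably need redirecting with 2>&1 first, as the SLF4J simple logger writes to stderr by default.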
718
7192. Check version of PDF
720https://www.codeproject.com/Questions/167550/How-to-check-different-versions-of-PDF
721
722
7233. pdf to html command line conversion open source
724https://stackoverflow.com/questions/8370014/how-to-convert-pdf-to-html
725
726"Download
727
728 pdfbox-2.0.3.jar
729 fontbox-2.0.3.jar
730 preflight-2.0.3.jar
731 xmpbox-2.0.3.jar
732 pdfbox-tools-2.0.3.jar
733 pdfbox-debugger-2.0.3.jar
734
735from http://pdfbox.apache.org/
736...
737
738PLEASE NOTE: Images do not get pushed to the HTML output."
739
740
7414. Need a way to check if PDF contains images, then use pdf2dom, else basic pdfbox conversion to html (fewer div tags with inline style markup)?
742https://stackoverflow.com/questions/46215879/count-images-in-pdf-using-pdfbox
743
744
745UNUSED
746Googled for: java tool convert pdf version
747* https://stackoverflow.com/questions/11137912/all-inclusive-tool-to-convert-different-types-of-documents-to-pdf
748* https://www.qoppa.com/pdfprocess/
749jPDFProcess – Java PDF Library to Create, Manipulate PDF
750(appears to be payware)
751* https://www.gnostice.com/nl_article.asp?id=95&t=How_to_Change_the_PDF_Version_of_a_Document
752How to Convert a PDF Document to an Older or Newer Version
753uses .NET
754* http://www.baeldung.com/pdf-conversions-java
755PDF Conversions in Java
756e.g. PDF to html and html to PDF
757
758
759__________________________________________________________
760
761greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
762[main] INFO org.reflections.Reflections - Reflections took 153 ms to scan 1 urls, producing 36 keys and 222 values
763[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
764[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
765[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
766[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
767
768
769
770greenstone@machine-name:~/Downloads$ java -classpath Pdf2Dom/target/pdf2dom-1.8-SNAPSHOT.jar:pdfbox-app.jar:slf4j-jdk14-1.6.6.jar:log4j-over-slf4j-1.6.6.jar:slf4j-api-1.6.6.jar org.fit.pdfdom.PDFToHTML SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
771Exception in thread "main" java.lang.NoClassDefFoundError: org/mabb/fontverter/FontVerter
772 at org.fit.pdfdom.FontTable$Entry.loadTrueTypeFont(FontTable.java:178)
773 at org.fit.pdfdom.FontTable$Entry.getData(FontTable.java:147)
774 at org.fit.pdfdom.FontTable$Entry.isEntryValid(FontTable.java:161)
775 at org.fit.pdfdom.FontTable.addEntry(FontTable.java:48)
776 at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:378)
777 at org.fit.pdfdom.PDFBoxTree.updateFontTable(PDFBoxTree.java:361)
778 at org.fit.pdfdom.PDFDomTree.updateFontTable(PDFDomTree.java:544)
779 at org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:206)
780 at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
781 at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
782 at org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218)
783 at org.fit.pdfdom.PDFDomTree.writeText(PDFDomTree.java:194)
784 at org.fit.pdfdom.PDFToHTML.main(PDFToHTML.java:77)
785Caused by: java.lang.ClassNotFoundException: org.mabb.fontverter.FontVerter
786 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
787 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
788 at java.security.AccessController.doPrivileged(Native Method)
789 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
790 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
791 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
792 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
793 ... 13 more
794greenstone@machine-name:~/Downloads$