root/gs2-extensions/xpdf-tools/trunk/src/GS-README.txt @ 32269

Revision 32269, 49.0 KB (checked in by ak19, 2 years ago)

Some important general info notes added to our GS-README.txt for xpdf-tools.

__________________________________________________________
CONTENTS
__________________________________________________________

Xpdf-Tools related
A. XPDF
B. Mojo::DOM perl package for parsing HTML
C. Compiling Xpdf-Tools: statically or dynamically linked
D. How we got Xpdf-Tools to compile using CASCADE-MAKE
E. Getting more output when running CMake (verbosity)
F. APPENDIX - Useful links

LIBJPEG related
G. LIBJPEG and LIBTIFF
- Moving from 2008's libjpeg version 6b to the newer 2018 version 9c
- Issues building LIBJPEG version 6b on 64 bit machines and the patch

H. Licensing information and making the distributable tarball

I. PDF2DOM
    unused, replaced by Xpdf-Tools' more suited pdftohtml capabilities

__________________________________________________________
A. XPDF
__________________________________________________________

Xpdf was last modified in 2017 and includes its own pdftohtml utility, whereas the old "pdftohtml" tool that GS used was last updated in 2013 (and itself made use of Xpdf, possibly an older version).

The tool takes a PDF and produces an HTML file for each page of the PDF, consisting of selectable HTML text overlaid on top of a "screenshot" image of the page. (A page's text is not part of the screenshot.)

1. https://www.xpdfreader.com/download.html

As per the Readme file found in the linux binary of Xpdf Tools, the Xpdf Viewer requires the Qt toolkit, but the Xpdf Tools do not. We have not read the Install file to confirm whether the same is true when compiling the command line tools. (But in that case, can't we just include the tools binaries available for all 3 OSes, instead of compiling on each platform?)

    - Using Xpdf's pdftohtml tool:
    greenstone@bedrock:~/Downloads/xpdf-tools-linux-4.00/bin64$ ./pdftohtml -z 1.5 ~/Downloads/ApacheLicence.pdf licence

        where licence is the output folder.

    - Using Xpdf's pdftotext tool:
    greenstone@bedrock:~/Downloads/xpdf-tools-linux-4.00/bin64$ ./pdftotext -nopgbrk ~/Downloads/ApacheLicence.pdf ~/Downloads/ApacheLicence.txt

        where the output text file must be specified with a full path name.
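
Since pdftotext wants a full path for its output file, a small wrapper can take care of that. The following is only a sketch: the XPDF_BIN variable and the function names abs_path and pdf_to_text are made up for this readme and are not part of Xpdf.

```shell
#!/bin/bash
# Sketch only: XPDF_BIN, abs_path and pdf_to_text are assumed names, not part of Xpdf.

# Print an absolute form of the given path (relative paths are taken from $PWD).
abs_path() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;
        *)  printf '%s/%s\n' "$PWD" "$1" ;;
    esac
}

# Run pdftotext with -nopgbrk, always handing it a full output path.
# XPDF_BIN should point at the folder containing the xpdf-tools binaries.
pdf_to_text() {
    "${XPDF_BIN:-.}/pdftotext" -nopgbrk "$1" "$(abs_path "$2")"
}
```

For example, with XPDF_BIN set to the bin64 folder above, "pdf_to_text ~/Downloads/ApacheLicence.pdf ApacheLicence.txt" would write the text file into the current folder.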


2. Documentation on Xpdf-Tools:
- https://www.xpdfreader.com/support.html
    for example, the pdftohtml man page: https://www.xpdfreader.com/pdftohtml-man.html
- https://linux.die.net/man/5/xpdfrc
(Configuration flags you can put into ~/.xpdfrc to use as defaults when running xpdf tool commands)

3. We're using Xpdf Tools version: xpdf-tools-linux-4.00

4. We started by working with the ready-made Xpdf-tools binaries available for download from the xpdf site for Win, Linux and Mac.

5. We're now moving to compiling Xpdf-tools ourselves using CASCADE-MAKE, which so far compiles statically on Linux (LSB environment included) and produces working binaries.

6. On Mac, it's not possible to produce statically linked binaries: they're dynamically linked against system libraries, but they at least use the static libraries for libpng, zlib, libjpeg and freetype that we compile up.

7. IMPORTANT:
- for Windows we use the 32 bit precompiled binaries downloaded from the XPDF website. These work on 32 and 64 bit Windows and we don't compile them up ourselves.
They're put into winbin on trac and end up in GS2/gs2build's GSDLOS/bin folder.
- for Linux, we build 32 bit binaries on the 32 bit LSB VM.
- for Mac, we build 64 bit binaries.

We build the binaries by running ./CASCADE-MAKE.sh on the xpdf-tools gs2-extension.
We then run "./CASCADE-MAKE.sh makedist", which generates the xpdf-tools tarball which we extract into GS2/gs2build's GSDLOS/bin folder.

__________________________________________________________
B. Mojo::DOM perl package for parsing HTML
__________________________________________________________

XPDF's pdftohtml conversion of a single PDF document produces multiple HTML files: one for each page in the source PDF.
We want the output to be "paged_html": a single HTML file that is sectionalised, each section representing a page of the
original PDF.

We need to be able to parse the many HTML pages produced by XPDF's pdftohtml conversion of a doc, in order to massage the output
into the single sectionalised HTML file. For this we needed an HTML parser package for Perl.
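
To illustrate the shape of the "paged_html" output we're after, here is a rough sketch in shell. It is NOT the Mojo::DOM based approach: it's a crude sed stand-in that only shows the idea of one section per page. It assumes that the per-page files are named page1.html, page2.html, ... in the pdftohtml output folder and that the <body> and </body> tags each sit on their own line. The merge_pages name and the "pdf-page" class are made up for this example.

```shell
#!/bin/bash
# Sketch only: a crude sed-based stand-in for the Mojo::DOM parsing described above.
# Assumptions: page files are named page1.html, page2.html, ... and the
# <body ...> and </body> tags each sit on a line of their own.

# merge_pages DIR : write a single sectionalised HTML document to stdout
merge_pages() {
    local dir="$1" n=1
    echo "<html><body>"
    while [ -f "$dir/page$n.html" ]; do
        echo "<div class=\"pdf-page\" id=\"page$n\">"
        # keep only the lines strictly between the <body ...> and </body> lines
        sed -n '/<body/,/<\/body>/{/<body/d;/<\/body>/d;p;}' "$dir/page$n.html"
        echo "</div>"
        n=$((n + 1))
    done
    echo "</body></html>"
}
```

Running "merge_pages licence > licence.html" would then give one div per PDF page. A real HTML parser is still needed for anything beyond this, since the page markup is not guaranteed to be laid out line by line.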

1. Before Dr Bainbridge found Mojo::DOM, he looked at
* https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers
* http://radar.oreilly.com/2014/02/parsing-html-with-perl-2.html

2. Main links for Mojo::DOM
* https://mojolicious.org/perldoc/Mojo/DOM
* https://metacpan.org/pod/Mojo::DOM
    Dependencies: http://deps.cpantesters.org/?module=Mojo%3A%3ADOM;perl=latest

3. Once you've downloaded Mojo::DOM's src, follow Dr Bainbridge's sequence of commands below for building the Mojo::DOM CPAN module of perl.
We'll be using this module to parse the HTML output by the XPDF tool pdftohtml.


    mkdir cpan
    tar xvzf Mojolicious-7.84.tar.gz
    cd Mojolicious-7.84/
    perl ./Makefile.PL PREFIX=`pwd`/installed
    make
    make install
    cp -r installed/share/perl/5.18.2 ../cpan
    cd ..
    export PERL5LIB=`pwd`/cpan

    emacs -nw test.pl

    # test.pl starts with the line
    #     #!/usr/bin/perl -w
    # and needs 'use v5.10;' added in

    chmod a+x test.pl
    ./test.pl


__________________________________________________________
C. Compiling Xpdf-Tools: statically or dynamically linked
__________________________________________________________

As explained in detail in section D below, we have a customised gs-CMakeLists.txt file which replaces the one in the xpdf-4.00.tar.gz package's xpdf subfolder after this is untarred. This customised CMake configure/make file allows us to compile xpdf-tools either statically (as we've now set it up to do by default) or dynamically (as its CMake makefiles were originally set up to do).

1. To compile Xpdf-Tools statically, packages/CASCADE-MAKE/XPDFTOOLS.sh should contain:

    cmake -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=$prefix \
        -DZLIB_LIBRARY=$prefix/lib/libz.a \         # <========= THIS
        -DPNG_LIBRARY=$prefix/lib/libpng15.a \      # <========= THIS
        -DFREETYPE_LIBRARY=$prefix/lib/libfreetype.a \  # <========= THIS
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt4=1 \
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt5Widgets=1 \
        -DCMAKE_C_FLAGS="$CFLAGS" \
        -DCMAKE_CXX_FLAGS="$CXXFLAGS" \
        -DCMAKE_EXE_LINKER_FLAGS="$LDFLAGS" \
        -DGSDLFLAG_STATIC="$static_flag" \          # <========= THIS
        $GEXT_XPDFTOOLS/packages/$package$version

(The "# <=========" markers here and in the commands further below are annotations for this readme only. A comment after a trailing backslash breaks the shell line continuation, so leave them out of the actual XPDFTOOLS.sh.)

In place of FREETYPE_LIBRARY above, you could also try the following,
        -DFREETYPE_DIR=$prefix \
but then check the built binaries by running "ldd" and "file" over them, to make sure they're not referencing any .so dynamic link libraries.
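
A minimal sketch of such a check follows. The linkage_of function name is made up for this readme; it just classifies the text that "file -b" prints for a binary, relying on the "statically linked" / "dynamically linked" wording that file uses for ELF executables.

```shell
#!/bin/bash
# Sketch only: linkage_of is an assumed helper name, not part of any tool.

# linkage_of FILE_OUTPUT : classify the text printed by "file -b <binary>"
linkage_of() {
    case "$1" in
        *"statically linked"*)  echo static ;;
        *"dynamically linked"*) echo dynamic ;;
        *)                      echo unknown ;;
    esac
}

# Usage, from the folder containing the built xpdf-tools binaries:
#   for b in pdftohtml pdftotext; do
#       echo "$b: $(linkage_of "$(file -b "$b")")"
#       ldd "$b"    # for a fully static binary, ldd reports "not a dynamic executable"
#   done
```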


2. To compile Xpdf-Tools dynamically and make it find *our* dynamically linked libraries for its helper packages zlib, libpng, libjpeg and freetype, edit packages/CASCADE-MAKE/XPDFTOOLS.sh to contain:

    cmake -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=$prefix \
        -DZLIB_LIBRARY=$prefix/lib/libz.so.1.2.7 \          # <========= THIS
        -DPNG_LIBRARY=$prefix/lib/libpng15.so.15.30.0 \     # <========= THIS
        -DJPEG_LIBRARY=$prefix/lib/libjpeg.so.PUT_THE_NUMBER_HERE \ # <========= THIS AND ENTER THE .SO VERSION NUMBER
        -DFREETYPE_LIBRARY=$prefix/lib/libfreetype.so.6.3.20 \  # <========= THIS
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt4=1 \
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt5Widgets=1 \
        -DCMAKE_C_FLAGS="$CFLAGS" \
        -DCMAKE_CXX_FLAGS="$CXXFLAGS" \
        -DCMAKE_EXE_LINKER_FLAGS="$LDFLAGS" \
        $GEXT_XPDFTOOLS/packages/$package$version       # <=== -DGSDLFLAG_STATIC removed



    (1) In the above, you could also set
        -DFREETYPE_DIR=$prefix
    in place of
        -DFREETYPE_LIBRARY=$prefix/lib/libfreetype.so.6.3.20

    In that case, xpdf-tools compilation finds the "libfreetype.so" (no versioning at end) in our gs2-extension.
    After successfully building, make sure to have sourced the gs2-extension's setup.bash before running "ldd" over the
    generated xpdf-tools binaries, in order to let it use the $LD_LIBRARY_PATH we set to find our .so files.

    (2) Note that there is no equivalent for ZLIB and LIBPNG: doing -DZLIB_DIR=$prefix or -DPNG_DIR=$prefix will be
    ineffective, as neither is recognised by xpdf-tools' CMake set up.
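
When checking the dynamically linked binaries with "ldd", it helps to list only the libraries that were resolved from somewhere other than our gs2-extension. The following is a sketch under the assumption that ldd prints lines of the usual "name => /path (address)" form; the deps_outside function name is made up for this readme.

```shell
#!/bin/bash
# Sketch only: deps_outside is an assumed helper name.

# deps_outside PREFIX : read "ldd" output on stdin and print the resolved
# library paths that do NOT live under PREFIX (give PREFIX without a trailing slash)
deps_outside() {
    awk -v prefix="$1" '$2 == "=>" && $3 ~ /^\// && index($3, prefix "/") != 1 { print $3 }'
}

# Usage, after sourcing the gs2-extension's setup.bash:
#   ldd pdftohtml | deps_outside "$prefix"
```

System libraries such as libm and libc are expected in that list. If zlib, libpng or freetype show up in it, the binary is picking up the system copies rather than ours.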

__________________________________________________________
D. How we got Xpdf-Tools to compile using CASCADE-MAKE
__________________________________________________________

The process:

1. We set up a CASCADE-MAKE GS2-extension "xpdf-tools" at trac.greenstone.org/browser/gs2-extensions/xpdf-tools/trunk/src
Be aware that its lowercased "cascade-make" subfolder is an svn external; the original is at http://trac.greenstone.org/browser/other-projects/cascade-make/trunk/

So far, this CASCADE-MAKE project includes the Xpdf-Tools source tarball, its helper packages zlib, libpng and freetype, as well as CMake to compile the Xpdf-Tools source code.
The next step is to include JPEG and TIFF libraries too.

2a. We downloaded the Xpdf-Tools source tarball, xpdf-4.00.tar.gz, from the xpdf site at https://www.xpdfreader.com/download.html under section "Download the Xpdf source code".

The xpdf-tools source code tarball consists of the source for Xpdf-Tools and Xpdf (Xpdf-Reader). The Xpdf-Reader additionally requires Qt to build and run, but we don't want the Xpdf-Reader, just Xpdf-Tools.

2b. Compiling Xpdf-Tools from source and running them requires the following packages and libraries, as per the xpdf-tools source code INSTALL file:

To build xpdf-tools:
- CMake 2.8.8 or newer

Libraries to link against and used by xpdf-tools:
- FreeType 2.0.5 or newer
- libpng (for pdftoppm and pdftohtml)
- zlib (for pdftoppm and pdftohtml)


3. Compilation of xpdf-tools worked with CMake 3.11.4 on the linux resnet machine. However, CMake 3.11.4 itself failed to compile in the LSB environment and on the Mac Mountain Lion machine because of an incompatibility between the older g++ installed there and this newer version of CMake.

CMake version 3.9.6 however is supposed to be compatible with older versions of g++, as per https://stackoverflow.com/questions/47886400/cmake-configure-error-in-3-10-1-but-not-in-3-9-6
To avoid installing newer versions of g++ and clang in the LSB virtual machine and the Mac, I've shifted the CMake version back to version 3.9.6, still


4a. On building xpdf-tools to work with dynamically linked libs found anywhere.

If compiling xpdf-tools against dynamically linked libraries for these packages, then the basic CMake command in packages/CASCADE-MAKE/XPDFTOOLS.sh can look like:
    cmake -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=$prefix \
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt4=1 \
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt5Widgets=1 \
        -DCMAKE_C_FLAGS="$CFLAGS" \
        -DCMAKE_CXX_FLAGS="$CXXFLAGS" \
        -DCMAKE_EXE_LINKER_FLAGS="$LDFLAGS" \
        $GEXT_XPDFTOOLS/packages/$package$version   # Note: no -DGSDLFLAG_STATIC=...

With the above, the xpdf-tools source code and its make files work out of the box.

4b. On building xpdf-tools to work with the dynamically linked libs for freetype, libpng and zlib that we produce when cascade-making the xpdf-tools gs2-extension.

Since we're compiling up the freetype, libpng and zlib packages as part of the Xpdf-Tools GS2-extension with CASCADE-MAKE, the next step was to compile xpdf-tools by dynamically linking against our .so files for these 3 libraries. To do so, XPDFTOOLS.sh should have the following changes:

    (1) For linux, we need to build on the LSB environment.
    We're moreover hoping that 32 bit binaries generated this way will work on both 32 and 64 bit machines.

    However, on the 32 bit LSB environment, we additionally need to pass in "-march=i486|i586|i686" to gcc.
    Without it, things end up with the error
        undefined reference to `__sync_add_and_fetch_4'
    See https://stackoverflow.com/questions/130740/link-error-when-compiling-gcc-atomic-operation-in-32-bit-mode
    which further explains that
        "-march=" means "generate code for a particular CPU (and don't run on older CPUs)".
    So, although uname -m returns i686 on the 32 bit linux VM that generates the nightly bins, we
    still want to support i586 and i486 systems, so we pass in i486 as the architecture.
    Don't do this for 64 bit systems.
    And it seems it only needs to be set on CXXFLAGS in this case.

    arch=`uname -m`
    if [[ $arch = *"64"* ]]; then
        arch=
    else
        echo "@@@ 32 bit machine, need to pass in -march=i486 to avoid certain linking errors"
        arch="-march=i486"
    fi
    ...
    export CXXFLAGS="$CXXFLAGS -I$GEXTXPDFTOOLS_INSTALLED/include -I$GEXTXPDFTOOLS_INSTALLED/include/libpng15 $arch"

    (2) Set up CFLAGS, CXXFLAGS, CPPFLAGS and LDFLAGS to help the linking of xpdf-tools find our .so versions of the necessary libs:

    export CFLAGS="$CFLAGS -I$GEXTXPDFTOOLS_INSTALLED/include -I$GEXTXPDFTOOLS_INSTALLED/include/libpng15"
    export CPPFLAGS="$CPPFLAGS -I$GEXTXPDFTOOLS_INSTALLED/include -I$GEXTXPDFTOOLS_INSTALLED/include/libpng15"
    export CXXFLAGS="$CXXFLAGS -I$GEXTXPDFTOOLS_INSTALLED/include -I$GEXTXPDFTOOLS_INSTALLED/include/libpng15 $arch"
    export LDFLAGS="$LDFLAGS -L$GEXTXPDFTOOLS_INSTALLED/lib"

    (3) The CMake command we run must pass the full paths to the actual .so library files (the ones with specific
    versions in their file names) rather than the symbolically linked generally-named .so files (the latter won't
    be found when building xpdf-tools and CMake will try to look for the .so library files elsewhere on the system):

    cmake -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=$prefix \
        -DZLIB_LIBRARY=$prefix/lib/libz.so.1.2.7 \              # <========= NEW
        -DPNG_LIBRARY=$prefix/lib/libpng15.so.15.30.0 \         # <========= NEW
        -DFREETYPE_LIBRARY=$prefix/lib/libfreetype.so.6.3.20 \      # <========= NEW
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt4=1 \
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt5Widgets=1 \
        -DCMAKE_C_FLAGS="$CFLAGS" \
        -DCMAKE_CXX_FLAGS="$CXXFLAGS" \
        -DCMAKE_EXE_LINKER_FLAGS="$LDFLAGS" \
        $GEXT_XPDFTOOLS/packages/$package$version   # Again: no -DGSDLFLAG_STATIC=...
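
For item (3), rather than typing the versioned file names by hand, the actual file behind a generally-named .so symlink can be looked up. This is only a sketch: the real_lib function name is made up for this readme, and it relies on "readlink -f", which is available with GNU coreutils on linux but may be missing on older Macs.

```shell
#!/bin/bash
# Sketch only: real_lib is an assumed helper name; needs a readlink that supports -f.

# real_lib PATH : print the fully resolved file behind a (possibly symlinked) library
real_lib() {
    readlink -f "$1"
}

# Possible use in the cmake command line:
#   -DZLIB_LIBRARY="$(real_lib "$prefix/lib/libz.so")"
```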

Further, the "xpdf/CMakeLists.txt" file within the xpdf-4.00.tar.gz source code tarball needs to be modified to refer to ZLIB_LIBRARIES when linking pdftops and pdftoppm. The linking commands for *both* the "pdftops" and "pdftoppm" executable targets in xpdf/CMakeLists.txt should look like the following:

        target_link_libraries(pdftoppm goo fofi splash
                        ${FREETYPE_LIBRARY} ${FREETYPE_OTHER_LIBS}
                        ${DTYPE_LIBRARY}
                        ${LCMS_LIBRARY}
                        ${ZLIB_LIBRARIES})              # <========= NEW


    (4) Since CMakeLists.txt has been modified, we initially renamed the xpdf src tarball to gs-xpdf-4.00.tar.gz.
    However, the current version works with the regular downloaded xpdf-4.00.tar.gz tarball. But after extraction,
    XPDFTOOLS.sh copies across the custom packages/gs-CMakeLists.txt into the extracted tarball's xpdf subdirectory,
    renaming the file as CMakeLists.txt (so the path to it becomes "xpdf-4.00/xpdf/CMakeLists.txt"). In XPDFTOOLS.sh:

    # patch the original tarball with our custom makefile
    if [[ -d "$package$version/xpdf" && -f "gs-CMakeLists.txt" ]]; then
        echo "*******************************************************************"
        echo "Using our custom gs-CMakeLists.txt instead of the one included in $package$version"
        echo "Renaming gs-CMakeLists.txt to $package$version/xpdf/CMakeLists.txt"
        echo "*******************************************************************"

        cp "gs-CMakeLists.txt" "$package$version/xpdf/CMakeLists.txt"
    fi


4c. On building static xpdf-tools binaries using the static *.a freetype, libpng and zlib libraries that we produce when cascade-making the xpdf-tools gs2-extension.

In order to compile up xpdf-tools *statically*, so that it builds against the static *.a libraries of freetype, libpng and zlib that we produce during the gs2-extension's CASCADE-MAKE process, we have to make further modifications.

    (1) First, the XPDFTOOLS.sh cascade-make file should pass the full paths to the actual (non-symbolic link) .a file for each library.
    A custom GS flag, GSDLFLAG_STATIC, is also introduced in gs-CMakeLists.txt and assigned "-static" for linux
    and "-Bstatic" for Mac, to pass in during the linking stage of building xpdf-tools.

    For Mac OSX, when -static is passed in for linking as on linux, this produced the error
    "ld: library not found for -lcrt0.o" during the build of the xpdf-tools package. For information, see
    https://stackoverflow.com/questions/3801011/ld-library-not-found-for-lcrt0-o-on-osx-10-6-with-gcc-clang-static-flag
    The page https://stackoverflow.com/questions/844819/how-to-static-link-on-os-x mentions compiling
    with -Bstatic on Mac OSX instead. To do so, XPDFTOOLS.sh passes in GSDLFLAG_STATIC set to either
    "-static" (for linux) or "-Bstatic" (for darwin).
    However, the last mentioned stackoverflow page also says that -Bstatic is a no-op, and this appears to be
    the case when "otool -L" is run over the generated xpdf-tools binaries: the binaries are all dynamically
    linked. Although they're finding our .so files of freetype, libpng and zlib, they're not finding the .a
    versions, even though XPDFTOOLS.sh tries to point gs-CMakeLists.txt to the correct .a files.

    The new modifications to XPDFTOOLS.sh:

    if [ "x$GSDLOS" == "xdarwin" ] ; then
        static_flag=-Bstatic
    else
        static_flag=-static
    fi

    ...
    cmake -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=$prefix \
        -DZLIB_LIBRARY=$prefix/lib/libz.a \                 # <========= MODIFIED TO .a
        -DPNG_LIBRARY=$prefix/lib/libpng15.a \              # <========= MODIFIED TO .a
        -DFREETYPE_LIBRARY=$prefix/lib/libfreetype.a \          # <========= MODIFIED TO .a
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt4=1 \
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt5Widgets=1 \
        -DCMAKE_C_FLAGS="$CFLAGS" \
        -DCMAKE_CXX_FLAGS="$CXXFLAGS" \
        -DCMAKE_EXE_LINKER_FLAGS="$LDFLAGS" \
        -DGSDLFLAG_STATIC="$static_flag" \                  # <========= NEW
        $GEXT_XPDFTOOLS/packages/$package$version

    (2) Our customised gs-CMakeLists.txt file now checks whether this flag GSDLFLAG_STATIC is set and, if it is,
    uses it during the linking stage. As in (1) above, it will be set to "-static" for Linux and "-Bstatic" for Mac.

    - When the flag is set, the linking flags passed into each occurrence of target_link_libraries() in
    gs-CMakeLists.txt are moreover manually written in the form of "-static -l<libs>" rather than using
    the default linking commands inherited from the original CMakeLists.txt.
    - If GSDLFLAG_STATIC isn't set, then we don't build statically, and the linking flags passed to each
    target_link_libraries() are mostly the original ones.

    For example,

        if(GSDLFLAG_STATIC)
            target_link_libraries(pdftoppm goo fofi splash
                            ${GSDLFLAG_STATIC} -lfreetype ${DTYPE_LIBRARY} ${LCMS_LIBRARY} -lz -lm -lc -lpthread)
        else()
            target_link_libraries(pdftoppm goo fofi splash
                            ${FREETYPE_LIBRARY} ${FREETYPE_OTHER_LIBS}
                            ${DTYPE_LIBRARY}
                            ${LCMS_LIBRARY}
                            ${ZLIB_LIBRARIES})
        endif()

    DETAILED EXPLANATION:
    We found that when building *statically*, gs-CMakeLists.txt needed to NOT use the PNG_LIBRARIES, ZLIB_LIBRARIES
    and FREETYPE_LIBRARY in its linker commands, target_link_libraries(), as doing so produced partially dynamic
    xpdf-tools executables which were moreover BROKEN. They wouldn't run, and in fact attempting to run an xpdf-tool,
    like "./pdftohtml", would produce a file not found error. Something like "bash: no such file or directory".

    Online discussions mentioned that this generally happens when attempting to run 32 bit executables on 64 bit
    linux when 32 bit loaders are not installed. (In such cases, the solution was to apt-get install some 32 bit package.)
    Our broken binaries, however, were all 64 bit, as indicated by running the "file" command on them. Nor did their
    being partially dynamically linked executables imply that they would be broken, as we were eventually
    able to produce partially dynamic executables that did work, before solving static linking altogether.

    The real issue was that including references to ${FREETYPE_LIBRARY} ${FREETYPE_OTHER_LIBS}, ${PNG_LIBRARIES} and
    ${ZLIB_LIBRARIES} in any target_link_libraries() resulted in the wrong linking command, producing broken binaries.

    Doing the regular target_link_libraries() in static mode results in building with
    "-Wl,-Bstatic -lfreetype -lpng15 -lz -Wl,-Bdynamic -lpthread" at the end of the link line
    and produces broken binaries for pdftohtml/pdftoppm/pdftops/pdftopng.

    Note that PNG_LIBRARIES includes zlib/lz: "-lpng -lz", and along with freetype,
    these are linked statically. However, Threads/lpthread is included as a dynamically
    linked library instead of including a .a (regardless of whether it's appended
    as -lpthread or Threads::Threads in the target_link_libraries()), contributing to
    the pdftohtml binary produced being a partially static, partially dynamic one,
    so a dynamic executable overall.

    The order of dynamic .so files listed by ldd in the broken static binary of pdftohtml differs from
    a manually statically linked working version of pdftohtml, and seems to be the only difference
    between the two in ldd's output. Not using "-Wl,-Bstatic" and using -static (-Bstatic on Mac)
    in its place creates a partially static dynamic executable that isn't broken, whereas
    additionally removing "-Wl,-Bdynamic -lpthread" and replacing it with -lpthread
    produces a working pdftohtml that is a fully statically linked executable.

    The inclusion of the math lib and c lib (lm and lc) in the final link command
    is to completely bypass the remaining .so dependencies that were present in
    the executable and produce the fully static executable. The lm and lc libs were referenced
    by all xpdf-tool binaries (as indicated when generating dynamic ones and running ldd over them)
    but Dr Bainbridge said that -lm and -lc are libs passed in by the compiler by default,
    which would explain why explicitly setting them for some xpdftools and not others may not have
    mattered.

NOTES:
Initial attempts at modifying gs-CMakeLists.txt for static compiling that proved to be unnecessary:

    (i) Setting -static globally doesn't have a useful effect.

    # We want to build static xpdf-tools binaries. See
    # https://stackoverflow.com/questions/24648357/compiling-a-static-executable-with-cmake
    # Want to make the min number of changes for building statically, so using the way
    # below. Beware, must *append* "-static" to existing CMAKE_EXE_LINKER_FLAGS=LD_FLAGS
    ##SET(CMAKE_FIND_LIBRARY_SUFFIXES ".a")
    ##SET(BUILD_SHARED_LIBS OFF)
    ##SET(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -static")

    The above 3 lines just add a -static before the "-O2 -Wall -fPIC -rdynamic ..." during linking, as in the link command below.
    But they have no further effect on whether static building actually succeeds or not. The only effective static
    linking command (for Linux so far) was to pass -static in the target_link_libraries() command followed by the
    "-l<libname>" for each library in the correct order.

----
413/usr/bin/c++  -I/home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/include  -I/home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/include -I/home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/include/libpng15 -O3 -Wall -fPIC  -L/home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/lib  -L/home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/lib -static ***** <- HERE ****** -O2 -Wall -fPIC -rdynamic CMakeFiles/pdftohtml.dir/HTMLGen.cc.o CMakeFiles/pdftohtml.dir/SplashOutputDev.cc.o CMakeFiles/pdftohtml.dir/TextOutputDev.cc.o CMakeFiles/pdftohtml.dir/pdftohtml.cc.o CMakeFiles/xpdf_objs.dir/AcroForm.cc.o CMakeFiles/xpdf_objs.dir/Annot.cc.o CMakeFiles/xpdf_objs.dir/Array.cc.o CMakeFiles/xpdf_objs.dir/BuiltinFont.cc.o CMakeFiles/xpdf_objs.dir/BuiltinFontTables.cc.o CMakeFiles/xpdf_objs.dir/Catalog.cc.o CMakeFiles/xpdf_objs.dir/CharCodeToUnicode.cc.o CMakeFiles/xpdf_objs.dir/CMap.cc.o CMakeFiles/xpdf_objs.dir/Decrypt.cc.o CMakeFiles/xpdf_objs.dir/Dict.cc.o CMakeFiles/xpdf_objs.dir/Error.cc.o CMakeFiles/xpdf_objs.dir/FontEncodingTables.cc.o CMakeFiles/xpdf_objs.dir/Form.cc.o CMakeFiles/xpdf_objs.dir/Function.cc.o CMakeFiles/xpdf_objs.dir/Gfx.cc.o CMakeFiles/xpdf_objs.dir/GfxFont.cc.o CMakeFiles/xpdf_objs.dir/GfxState.cc.o CMakeFiles/xpdf_objs.dir/GlobalParams.cc.o CMakeFiles/xpdf_objs.dir/JArithmeticDecoder.cc.o CMakeFiles/xpdf_objs.dir/JBIG2Stream.cc.o CMakeFiles/xpdf_objs.dir/JPXStream.cc.o CMakeFiles/xpdf_objs.dir/Lexer.cc.o CMakeFiles/xpdf_objs.dir/Link.cc.o CMakeFiles/xpdf_objs.dir/NameToCharCode.cc.o CMakeFiles/xpdf_objs.dir/Object.cc.o CMakeFiles/xpdf_objs.dir/OptionalContent.cc.o CMakeFiles/xpdf_objs.dir/Outline.cc.o CMakeFiles/xpdf_objs.dir/OutputDev.cc.o CMakeFiles/xpdf_objs.dir/Page.cc.o CMakeFiles/xpdf_objs.dir/Parser.cc.o CMakeFiles/xpdf_objs.dir/PDFDoc.cc.o CMakeFiles/xpdf_objs.dir/PDFDocEncoding.cc.o CMakeFiles/xpdf_objs.dir/PSTokenizer.cc.o CMakeFiles/xpdf_objs.dir/SecurityHandler.cc.o 
CMakeFiles/xpdf_objs.dir/Stream.cc.o CMakeFiles/xpdf_objs.dir/TextString.cc.o CMakeFiles/xpdf_objs.dir/UnicodeMap.cc.o CMakeFiles/xpdf_objs.dir/UnicodeTypeTable.cc.o CMakeFiles/xpdf_objs.dir/UTF8.cc.o CMakeFiles/xpdf_objs.dir/XFAForm.cc.o CMakeFiles/xpdf_objs.dir/XRef.cc.o CMakeFiles/xpdf_objs.dir/Zoox.cc.o  -o pdftohtml ../goo/libgoo.a ../fofi/libfofi.a ../splash/libsplash.a -static -lfreetype -lpng -lz -lm -lc -lpthread
----

    (ii) Threads::Threads instead of -lpthread results in a partially dynamic executable.

    # The original, unmodified CMakeLists.txt was not set up sufficiently
    # for static compilation of xpdf-tools. As a result, compile would first fail
    # with errors about undefined refs to mutex / lpthread.
    # When building xpdf-tools statically, need to add the following 2 lines as well
    # as append "Threads::Threads" to the end of each "target_link_libraries(<list>)"
    # See https://stackoverflow.com/questions/1620918/cmake-and-libpthread
    # found googling cmake and "-lpthread" (pthread) after ERRORS to do with this, like:
    #   undefined reference to `pthread_mutex_unlock'
    ##set(THREADS_PREFER_PTHREAD_FLAG ON)
    ##find_package(Threads REQUIRED)

    In instances when compilation was successful, including the above 2 lines in combination with "Threads::Threads"
    as the final argument to every target_link_libraries(...) occurrence in gs-CMakeLists.txt would only manage to
    produce partially dynamically linked xpdftools binaries. (Depending on what the linking command was when building
    Xpdf-Tools, the partially dynamically linked executables may work or may be broken. See explanation further above.)
    We wanted fully statically linked binaries, for which we needed to pass in "-lpthread" as the trailing argument
    to each target_link_libraries(...). Without either of the two, compilation will fail. However, with "Threads::Threads"
    the binaries weren't fully static, whereas with -lpthread the xpdftools executables were fully static as CMake no
    longer tried to link against a dynamic Threads library.


5. To view the unmodified CMakeLists.txt included in the xpdf-4.00 source code tarball, untar it and look for its "xpdf/CMakeLists.txt" (not the toplevel file of the same name).
Run a 'diff' against gs-CMakeLists.txt to see further differences, such as debug statements and comments. Most comments have been removed and placed into this readme file instead.


6. When CASCADE-MAKE is run on the xpdf-tools GS2-extension, it first compiles up CMake, which is needed to compile up xpdf-tools.
Unlike the library packages like freetype, libpng and zlib that we also build for xpdf-tools as part of this gs2-extension, CMake's build products don't need to be included in the distribution tarball of our built xpdf-tools executables.

There's a "move-cmake.sh" script in the xpdf-tools gs2-extension that can be run with the "away" option to move the CMake stuff out of the way (into a "devel" folder) after successfully building the xpdf binaries, and with the "back" option to move it back if wanting to recompile.

The script can be run manually, but it's also run by the extension:
- packages/CASCADE-MAKE/XPDFTOOLS.sh runs "move-cmake.sh away" after xpdf-tools has been built, so that the extension's install location is ready for tarring up for distribution.
- When recompiling the xpdf-tools extension, the CASCADE-MAKE process will run the packages/CASCADE-MAKE/CMAKE.sh file, which in turn runs "move-cmake.sh back" if there's a prebuilt CMake which had earlier been moved out of the way.
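
The away/back idea can be sketched as below. This is NOT the actual move-cmake.sh, which may well differ: the function name, the list of CMake files in CMAKE_ITEMS and the way the install and devel folders are passed in are all assumptions made for illustration.

```shell
#!/bin/bash
# Sketch only, not the real move-cmake.sh.
# CMAKE_ITEMS is an assumed list of CMake build products, relative to the install folder.
CMAKE_ITEMS="bin/cmake bin/ctest bin/cpack"

# move_cmake away|back INSTALL_DIR DEVEL_DIR
#   away: move the CMake items out of INSTALL_DIR into DEVEL_DIR
#   back: move them from DEVEL_DIR back into INSTALL_DIR
move_cmake() {
    local mode="$1" install="$2" devel="$3" from to item
    case "$mode" in
        away) from="$install"; to="$devel" ;;
        back) from="$devel";   to="$install" ;;
        *) echo "usage: move_cmake away|back INSTALL_DIR DEVEL_DIR" >&2; return 1 ;;
    esac
    for item in $CMAKE_ITEMS; do
        # only move what is actually there, so running it twice does no harm
        if [ -e "$from/$item" ]; then
            mkdir -p "$to/$(dirname "$item")"
            mv "$from/$item" "$to/$item"
        fi
    done
}
```

Moving only the items that exist means "away" and "back" can each be run repeatedly without error, which suits a script that is called both by hand and from the CASCADE-MAKE files.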


__________________________________________________________
E. Getting more output when running CMake (verbosity)
__________________________________________________________
See https://www.linuxquestions.org/questions/programming-9/cmake-or-make-debug-output-show-command-624800/
To turn on debugging:
    export VERBOSE=1
    ./CASCADE-MAKE.sh

To turn off debugging, you need to actually make VERBOSE empty or undefined again (don't set it to 0):
    export VERBOSE=
    ./CASCADE-MAKE.sh
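
If switching back and forth a lot, the two settings can be wrapped in a small function to source into the shell. This is just a sketch and the cmake_verbose name is made up for this readme. The "off" case uses unset, which removes the variable altogether rather than leaving it set to an empty value.

```shell
#!/bin/bash
# Sketch only: cmake_verbose is an assumed helper name.

# cmake_verbose on|off : turn verbose make output on or off for later builds
cmake_verbose() {
    case "$1" in
        on)  export VERBOSE=1 ;;
        off) unset VERBOSE ;;
        *)   echo "usage: cmake_verbose on|off" >&2; return 1 ;;
    esac
}

# Usage:
#   cmake_verbose on
#   ./CASCADE-MAKE.sh
```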


__________________________________________________________
F. APPENDIX - Useful links
__________________________________________________________
A. Helping CMake along. (Not all of this was necessary for compiling xpdftools statically, but they're generally useful links)

https://github.com/SynoCommunity/spksrc/issues/1779
https://stackoverflow.com/questions/1620918/cmake-and-libpthread
https://cmake.org/cmake/help/v3.0/prop_tgt/LINK_FLAGS.html
https://cmake.org/cmake/help/v3.11/command/target_link_libraries.html?highlight=target_link_libraries
https://stackoverflow.com/questions/24648357/compiling-a-static-executable-with-cmake
https://stackoverflow.com/questions/42815420/cmake-cant-find-my-static-libs
https://cmake.org/cmake/help/v3.0/command/message.html
https://stackoverflow.com/questions/30980383/cmake-compile-options-for-libpng
    https://stackoverflow.com/questions/36220123/undefined-reference-to-png-set-longjmp-fn-when-compiling-pcl-source-file


B. About the error "bash: no such file or directory" when run on a statically generated binary:

https://askubuntu.com/questions/351827/unable-to-run-a-32-bit-program-on-64-bit-vm/353497#353497
https://unix.stackexchange.com/questions/13391/getting-not-found-message-when-running-a-32-bit-binary-on-a-64-bit-system/13409#13409
https://arstechnica.com/civis/viewtopic.php?f=16&t=1173118
https://superuser.com/questions/344533/no-such-file-or-directory-error-in-bash-but-the-file-exists
https://unix.stackexchange.com/questions/45277/executing-binary-file-file-not-found

C. Other links

https://unix.stackexchange.com/questions/279397/ldd-dont-find-path-how-to-add


D. On why you can't build static binaries on Mac, but can build static libraries and link against them

https://developer.apple.com/library/archive/qa/qa1118/_index.html (official page on how Mac doesn't support static binaries)
https://stackoverflow.com/questions/3801011/ld-library-not-found-for-lcrt0-o-on-osx-10-6-with-gcc-clang-static-flag
https://stackoverflow.com/questions/844819/how-to-static-link-on-os-x (mention of -Bstatic)
https://www.allegro.cc/forums/thread/610923
https://stackoverflow.com/questions/5259249/creating-static-mac-os-x-c-build (has some other suggestions)
    http://www.network-theory.co.uk/docs/gccintro/gccintro_79.html
Dead end: https://nelsonslog.wordpress.com/2013/04/24/macos-doesnt-support-static-binaries/
504https://dropline.net/2015/10/static-linking-on-mac-os-x/
505    explains that on Mac, .dylibs must be hidden for .a versions of libraries to be selected when linking
506    This must be true for non-system dylibs too.
507    This means that where possible we want to essentially do "--enable-static --disable-shared", or equivalent,
508    when generating freetype, libz, libpng, libjpg, libtiff library files, so that Xpdf-Tools links against the
509    .a files we generated rather than additional .dylib files
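
    A quick way to check for this after building the libraries is to look for shared libraries sitting next to a static
    library of the same name in the install lib folder. A minimal sketch, where shadowing_shared_libs is a hypothetical
    helper and not one of the extension's scripts:

```shell
# Lists shared libraries in <lib_dir> that sit next to a static library of
# the same base name, e.g. libpng15.dylib alongside libpng15.a.
# Only the unversioned names (libfoo.dylib, libfoo.so) are checked.
# Hypothetical helper, not part of the extension's scripts.
shadowing_shared_libs() {
    lib_dir=$1
    for static_lib in "$lib_dir"/*.a; do
        # The glob stays unexpanded when there are no .a files at all.
        [ -e "$static_lib" ] || continue
        base=${static_lib%.a}
        for shared_lib in "$base".dylib "$base".so; do
            [ -e "$shared_lib" ] && echo "$shared_lib"
        done
    done
    return 0
}
```

    Empty output means the linker has only the .a files to choose from in that folder.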
510
511http://www.simplesystems.org/libtiff/build.html
512configuration options for building libtiff. Want to turn off the compile process for libtiff producing tiff binaries, but there appears to be no such option.
513
514
515__________________________________________________________
516G. LIBJPEG and LIBTIFF
517__________________________________________________________
518
1. The first version of LIBJPEG that I got working was version 6b, which required some patching before it could be built (see point 2 below).
Besides needing the patch, version 6b was also from 2008. I've now found a version of libjpeg from Jan 2018, "jpegsrc.v9c.tar.gz",
downloadable from www.ijg.org at http://www.ijg.org/files/jpegsrc.v9c.tar.gz. Version 9c can build both statically and dynamically linked libraries of
libjpeg, though we only want the former. (The older version 6b could only generate the static libjpeg.a library file, contrary to online instructions.)
523
524As needed to be done with the older 6b version, this tarball was renamed to jpeg-9c.tar.gz to fit the naming pattern of its folder once extracted.
525
526There was an incompatibility between the existing CASCADE-MAKE/LIBJPEG.sh and the Makefile generated by configuring the Makefile.in/.am in the jpeg-9c tarball.
527The LIBJPEG.sh would run "make install-lib"  at the end, to install the libjpeg.a in the lib folder and to install 4 header files. This is as per the install.txt
528instructions in the older and current version of jpeg src tarball. However, the header files never got installed when doing so, whether in version 6b or the
529current 9c. And install-lib is not a recognised target in 9c's Makefile, where the target is install-libLTLIBRARIES. So LIBJPEG.sh has been modified to use this
530target name and to moreover copy over the header files (even though they weren't necessary when compiling xpdftools against the libjpeg 6b library previously and
531possibly now with 9c).
532
533Since we want to only generate libjpeg.a and not the .so/.dylib dynamically linked versions, the latter is turned off during configure by passing --disable-shared.
534
A final change made to LIBJPEG.sh was to stop it copying the patch file "gs-libjpeg-config.sub" into the extracted jpeg tarball, since the patch was only
necessary for libjpeg version 6b and not for 9c. These steps are now commented out in LIBJPEG.sh.
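
The 9c steps described above can be sketched roughly as follows. This is not a copy of LIBJPEG.sh: the function names are made up, and the four header names are the ones listed in libjpeg's install.txt for the install-lib step.

```shell
# Hypothetical sketch of the libjpeg 9c steps; the real LIBJPEG.sh may be
# organised differently. Plain POSIX sh, so variables are not function-local.

# Copies the four public headers named in libjpeg's install.txt.
install_jpeg_headers() {
    src_dir=$1; prefix=$2
    mkdir -p "$prefix/include" || return 1
    for header in jconfig.h jpeglib.h jmorecfg.h jerror.h; do
        cp "$src_dir/$header" "$prefix/include/" || return 1
    done
}

# Configures, builds and installs a static-only libjpeg, then the headers.
build_static_libjpeg() {
    src_dir=$1; prefix=$2
    (
        cd "$src_dir" || exit 1
        # Static library only: no .so/.dylib is generated.
        ./configure --prefix="$prefix" --disable-shared || exit 1
        make || exit 1
        # 9c's Makefile has no install-lib target.
        make install-libLTLIBRARIES || exit 1
    ) || return 1
    install_jpeg_headers "$src_dir" "$prefix"
}
```

The header copy is kept as a separate function because the library install target on its own does not install the headers.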
537
538
5392. Issues building LIBJPEG VERSION 6b on 64 bit machines and the patch
540
541LIBJPEG version 6b is from 2008.
542
543I copied the LIBJPEG package from http://trac.greenstone.org/browser/other-projects/realistic-books/trunk/packages (also at http://trac.greenstone.org/browser/gs2-extensions/ocr/trunk/packages/cmdline).
544
545    * Configuring out of the box produced the following error:
546       checking host system type... Invalid configuration `x86_64-unknown-linux-gnu': machine `x86_64-unknown' not recognized
547
    * As a consequence, running make on the libjpeg package failed with the error:
549       ./libtool --mode=compile gcc -I/home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/include -fPIC  -I/home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/include  -I. -c ./jcapimin.c
550       make: ./libtool: Command not found
551       make: *** [jcapimin.lo] Error 127
552        Error encountered running *make * stage of ./CASCADE-MAKE/LIBJPEG.sh
553
554The same was true when I grabbed the libjpeg from sourceforge (https://sourceforge.net/projects/libjpeg/files/), which was also still version jpeg 6b from 2008.
555
556I found the following webpages discussing the above error messages:
557- https://unix.stackexchange.com/questions/80479/how-to-work-with-libtool
558- https://github.com/rwestlund/freesweep/issues/1
559- https://ubuntuforums.org/showthread.php?t=1232714
560- https://stackoverflow.com/questions/12828687/configure-fails-to-detect-proper-ld-on-a-64-bit-system-with-32-bit-userland
561- SOLUTION: https://sourceforge.net/p/libjpeg/bugs/12/
562
563However, the error only strikes when configure is run with --enable-static.
564
565Note also that contrary to the above pages, running configure with the additional options
566    --host=x86_64-linux-gnu --build=x86_64-linux-gnu --target=x86_64-linux-gnu --disable-shared --enable-static
567did not help. Nor did adding the above flags get rid of configure attempting to work with host=x86_64-unknown(-unknown)-linux-gnu
568
569The SOLUTION, found when searching for the error message along with "enable-static", as it's the combination that is relevant, is described
570at https://sourceforge.net/p/libjpeg/bugs/12/
571
which was to patch up the config.sub file included in the jpeg-6b tarball, to also cover x86_64-* machines:
573        tahoe | i860 | x86_64-* | m32r | m68k | m68000 | m88k | ns32k | arc | arm \
574
The above change is necessary because this libjpeg is outdated and has been superseded by other JPEG libraries, also discussed at https://sourceforge.net/p/libjpeg/bugs/12/
576I'm not sure if those libraries are compatible with XpdfTools however, so I'm sticking with libjpeg as long as I can get it to build and be recognised by XpdfTools.
577
The solution is once more to have a patch file: after the jpeg-6b package is untarred, CASCADE-MAKE/LIBJPEG.sh replaces the config.sub within it with packages/gs-libjpeg-config.sub, which contains the patch.
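
As a rough sketch of that replacement step (not the actual LIBJPEG.sh code), the helper below swaps in the patched file only when the package's own config.sub has no x86_64 entry. The function name and the conditional check are assumptions; the real script may simply copy the file unconditionally.

```shell
# Hypothetical sketch: swap in the patched config.sub only when the
# untarred package's own copy has no x86_64 entry. Both paths are passed
# in by the caller.
# Usage: patch_config_sub <untarred_pkg_dir> <patched_config_sub_file>
patch_config_sub() {
    pkg_dir=$1; patched_file=$2
    if grep -q 'x86_64' "$pkg_dir/config.sub"; then
        return 0    # already recognises x86_64, nothing to do
    fi
    cp "$patched_file" "$pkg_dir/config.sub"
}
```

Checking first means the same call does no harm on a newer tarball, such as 9c, whose config.sub already covers x86_64.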
579
580
5812. I followed the instructions at http://www.linuxfromscratch.org/blfs/view/6.3/general/libjpeg.html
582to try to build libjpeg with --enable-static and --enable-shared to produce both libjpeg.a and libjpeg.so.
583
However, nothing I tried got it to generate a libjpeg.so. It always seems to produce a libjpeg.a in xpdf-tools/linux/lib
regardless of whether CASCADE-MAKE/LIBJPEG.sh passes the --enable-static flag to the configure command or not, and regardless of whether --enable-shared is passed in addition or on its own.

As a consequence, there's no libjpeg.so file to point the -DJPEG_LIBRARY flag in XPDFTOOLS.sh at when building xpdf-tools against dynamically linked libraries.
588
589I tried the various combinations with the lib jpeg-6b source tarballs from
590- sourceforge, https://sourceforge.net/projects/libjpeg/files/, the latest tarball of this was from 2008
591- http://www.linuxfromscratch.org/blfs/view/6.3/general/libjpeg.html, which was last updated in 2007
592- http://trac.greenstone.org/browser/other-projects/realistic-books/trunk/packages/jpeg-6b.tar.gz, which was added to trac in 2009 but is probably the 2008 or 2007 version too.
593
594
5953. Modifications for using TIFF and JPEG libraries when building Xpdf-Tools:
596   
597* CASCADE-MAKE.sh, replaced
598    PACKAGES="CMAKE LIBZ LIBPNG FREETYPE XPDFTOOLS"
599with
600    PACKAGES="CMAKE LIBZ LIBTIFF LIBPNG LIBJPEG FREETYPE XPDFTOOLS"
601
602
603* XPDFTOOLS.sh
604If compiling statically make sure the CMake command contains the following changes:
605        -DTIFF_INCLUDE_DIR=$prefix/include \        # <========== new
606        -DJPEG_INCLUDE_DIR=$prefix/include \        # <========== new
607        -DZLIB_LIBRARY=$prefix/lib/libz.a \
608        -DTIFF_LIBRARY=$prefix/lib/libtiff.a \      # <========== new
609        -DPNG_LIBRARY=$prefix/lib/libpng15.a \
610        -DJPEG_LIBRARY=$prefix/lib/libjpeg.a \      # <========== new
611        -DFREETYPE_LIBRARY=$prefix/lib/libfreetype.a \
612        -DGSDLFLAG_STATIC="$static_flag" \
613
614
615
616The above flag names were discovered by deleting the untarred xpdf-4.00 folder.
617Then in a fresh terminal, source devel.bash from xpdf-tools and re-run CASCADE-MAKE.sh without the above modifications:
618
619    -- Found FreeType (new-style includes): /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/lib/libfreetype.a
620    -- Found ZLIB: /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/lib/libz.a (found version "1.2.8")
621    -- Found PNG: /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/lib/libpng15.a (found version "1.2.50")
622    -- Could NOT find JPEG (missing: JPEG_LIBRARY JPEG_INCLUDE_DIR)
623    -- Could NOT find TIFF (missing: TIFF_LIBRARY TIFF_INCLUDE_DIR)
624    -- lcms2 not found
625    -- No Qt library found
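
To spot such gaps quickly in a long build log, the "Could NOT find" lines can be filtered out of CMake's configure output. A minimal sketch, where missing_cmake_packages is a made-up helper name and build.log stands for wherever the CASCADE-MAKE output was captured:

```shell
# Prints the name of every package CMake reported as "Could NOT find",
# reading CMake's configure output from stdin. Hypothetical helper.
missing_cmake_packages() {
    sed -n 's/^[[:space:]]*-- Could NOT find \([A-Za-z0-9_]*\).*/\1/p'
}

# Example use, assuming the build output was saved to build.log:
#     missing_cmake_packages < build.log
```

Each package name printed corresponds to a pair of <NAME>_LIBRARY and <NAME>_INCLUDE_DIR flags that CMake listed as missing.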
626
627
628* packages/gs-CMakeLists.txt was modified again,
629
630    - this time to also pass:
631        -ltiff and -ljpeg to all target_link_libraries() commands that run when GSDLFLAG_STATIC is set
632    and
633        ${TIFF_LIBRARY} and ${JPEG_LIBRARY} to all target_link_libraries() commands that run when GSDLFLAG_STATIC is not set
634
    - And to add in the include directories and definitions if JPEG/TIFF libraries were provided:
636        if (JPEG_FOUND)
637          include_directories("${JPEG_INCLUDE_DIR}")
638          add_definitions("${JPEG_DEFINITIONS}")
639          message(STATUS "@@@@@@@@@@@@@@@ JPEG_FOUND (include_dir ; include_dirs): ${JPEG_INCLUDE_DIR} ; ${JPEG_INCLUDE_DIRS}")
640        else ()
641          message(STATUS "@@@@@@@@@@@@@@@ NO JPEG_FOUND")
642        endif ()
643        if (TIFF_FOUND)
644          include_directories("${TIFF_INCLUDE_DIRS}")
645          add_definitions("${TIFF_DEFINITIONS}")
646          message(STATUS "@@@@@@@@@@@@@@@ TIFF_FOUND ${TIFF_INCLUDE_DIRS}")
647        else ()
648          message(STATUS "@@@@@@@@@@@@@@@ NO TIFF_FOUND")
649        endif ()
650
    Note however that although gs-CMakeLists.txt now has a value for the pluralised TIFF_INCLUDE_DIRS (and TIFF_INCLUDE_DIR),
    as it does for PNG and ZLIB, it does not have a value for the pluralised JPEG_INCLUDE_DIRS, only for the singular
    JPEG_INCLUDE_DIR set above, even though the CMake flags in XPDFTOOLS.sh for the tiff and jpeg libs seem to have been set up
    in the same way now. Not sure where these automatically assigned variables come from in order to check up on them.
655
656__________________________________________________________
657H. Licensing information and making the distributable tarball
658__________________________________________________________
659
660XpdfTools' README lists which files need to be included as per its license when redistributing xpdf-tools binaries.
661
662Running "./CASCADE-MAKE.sh makedist" assembles a custom whitelist of files to include in the distribution tarball of the xpdf-tools we compile up.
663
The files and folders that go into the distribution tarball xpdf-tools-GSDLOS.tar.gz are:
665- the GSDLOS/bin/pdf* statically linked binaries (or dynamic executables linked against mostly static libraries in the case of Macs),
666- the GSDLOS/man folder as well as the further compulsory files README, COPYING and COPYING3 as required for xpdf-tools' license.
667
Beware that the cascade-make makedist function always preserves the directory structure of both the folders and the files included in the whitelist.
So when untarred, the folder xpdf-tools is produced with subfolders like linux/bin (containing the pdf* binaries), a linux/man subfolder
and files README, COPYING, COPYING3.
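
The effect of such a whitelist can be sketched with plain tar. This is not the cascade-make makedist code: the function name is made up, and the sketch assumes that README, COPYING and COPYING3 sit directly inside the xpdf-tools folder.

```shell
# Hypothetical sketch of assembling the distribution tarball; the real
# makedist logic lives in cascade-make and may differ. Run from the
# folder that contains xpdf-tools.
# Usage: make_dist_tarball <os>   e.g. make_dist_tarball linux
make_dist_tarball() {
    os=$1
    # Paths are given relative to the parent folder so that the
    # xpdf-tools/<os>/... structure is preserved inside the tarball.
    # Anything not listed (lib, include, CMake files) is left out.
    tar -czf "xpdf-tools-$os.tar.gz" \
        xpdf-tools/"$os"/bin/pdf* \
        xpdf-tools/"$os"/man \
        xpdf-tools/README xpdf-tools/COPYING xpdf-tools/COPYING3
}
```

Listing the result with "tar -tzf xpdf-tools-linux.tar.gz" shows whether only the whitelisted paths made it in.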
671
672
673__________________________________________________________
674I. PDF2DOM: tried it out, but wasn't what we wanted
675__________________________________________________________
676Using PDFBox to convert a PDF to full HTML, both images and text and placed correctly with respect to each other, is tricky, see https://stackoverflow.com/questions/9671239/pdfbox-convert-a-pdf-to-text-or-html-including-images-from-the-pdf
677(Google: pdfbox to convert pdf to html with images)
678
679PDF2DOM tool (based on PDFBox) to convert PDF to HTML with images
680* http://cssbox.sourceforge.net/pdf2dom/documentation.php
681* Got the command line jar tool, PDFToHTML.jar version 1.7, from https://sourceforge.net/projects/cssbox/files/Pdf2DOM/
682* Further information and source code at https://github.com/radkovo/Pdf2Dom
683* API: http://cssbox.sourceforge.net/pdf2dom/api/index.html
684
685
6861. Running
687
688java -jar PDFToHTML.jar <infile> [<outfile>]
689
690    greenstone@machine-name:~/Downloads$ java -jar PDFToHTML.jar SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
691
692
693It will output the page, but you'll see the following output indicating that the logger is not displaying anything:
694    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
695    SLF4J: Defaulting to no-operation (NOP) logger implementation
696    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
697
698See https://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder
699
To see error output, download the SLF4J simple jar and run as follows:
701
702    greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
703
704The above is a MS Word produced PDF (archive format) and works fine: font folder generated containing the extracted fonts
705
706The following is a PDF produced from the same doc file by the latest libreoffice installed on Windows:
707    ApacheLicencePDFA_FromODT.pdf
708But running the same command on it produces the following font errors:
709
710greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA_FromODT.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
711[main] INFO org.reflections.Reflections - Reflections took 163 ms to scan 1 urls, producing 36 keys and 222 values
712[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
713[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
714[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
715[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
716
717Fonts get extracted if the source PDF was generated by MS Word's doc to PDF conversion. Fonts didn't get extracted from PDF upon conversion to HTML when libreoffice was used to convert a .doc to the source PDF.
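
When comparing PDFs from different producers, it helps to reduce the repeated warnings to a list of the fonts that failed. A minimal sketch that filters the logger output shown above; failed_fonts is a made-up helper name:

```shell
# Lists each font named in a FontTable "Error loading font" warning once,
# reading the logger output from stdin. Hypothetical helper.
failed_fonts() {
    sed -n "s/.*Error loading font '\([^']*\)'.*/\1/p" | sort -u
}

# Example use (the simple logger writes to stderr by default, hence 2>&1):
#     java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML <infile> 2>&1 | failed_fonts
```

An empty result means no font warnings were logged for that PDF.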
718
7192. Check version of PDF
720https://www.codeproject.com/Questions/167550/How-to-check-different-versions-of-PDF
721
722
7233. pdf to html command line conversion open source
724https://stackoverflow.com/questions/8370014/how-to-convert-pdf-to-html
725
726"Download
727
728    pdfbox-2.0.3.jar
729    fontbox-2.0.3.jar
730    preflight-2.0.3.jar
731    xmpbox-2.0.3.jar
732    pdfbox-tools-2.0.3.jar
733    pdfbox-debugger-2.0.3.jar
734
735from http://pdfbox.apache.org/
736...
737
738PLEASE NOTE: Images do not get pushed to the HTML output."
739
740
7414. Need a way to check if PDF contains images, then use pdf2dom, else basic pdfbox conversion to html (less div tags with inline style markup)?
742https://stackoverflow.com/questions/46215879/count-images-in-pdf-using-pdfbox
743
744
745UNUSED
746Googled for: java tool convert pdf version
747* https://stackoverflow.com/questions/11137912/all-inclusive-tool-to-convert-different-types-of-documents-to-pdf
748* https://www.qoppa.com/pdfprocess/
749jPDFProcess – Java PDF Library to Create, Manipulate PDF
750(appears to be payware)
751* https://www.gnostice.com/nl_article.asp?id=95&t=How_to_Change_the_PDF_Version_of_a_Document
752How to Convert a PDF Document to an Older or Newer Version
753uses .NET
754* http://www.baeldung.com/pdf-conversions-java
755PDF Conversions in Java
756e.g. PDF to html and html to PDF
757
758
759__________________________________________________________
760
761greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
762[main] INFO org.reflections.Reflections - Reflections took 153 ms to scan 1 urls, producing 36 keys and 222 values
763[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
764[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
765[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
766[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
767
768
769
770greenstone@machine-name:~/Downloads$ java -classpath Pdf2Dom/target/pdf2dom-1.8-SNAPSHOT.jar:pdfbox-app.jar:slf4j-jdk14-1.6.6.jar:log4j-over-slf4j-1.6.6.jar:slf4j-api-1.6.6.jar  org.fit.pdfdom.PDFToHTML SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
771Exception in thread "main" java.lang.NoClassDefFoundError: org/mabb/fontverter/FontVerter
772    at org.fit.pdfdom.FontTable$Entry.loadTrueTypeFont(FontTable.java:178)
773    at org.fit.pdfdom.FontTable$Entry.getData(FontTable.java:147)
774    at org.fit.pdfdom.FontTable$Entry.isEntryValid(FontTable.java:161)
775    at org.fit.pdfdom.FontTable.addEntry(FontTable.java:48)
776    at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:378)
777    at org.fit.pdfdom.PDFBoxTree.updateFontTable(PDFBoxTree.java:361)
778    at org.fit.pdfdom.PDFDomTree.updateFontTable(PDFDomTree.java:544)
779    at org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:206)
780    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
781    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
782    at org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218)
783    at org.fit.pdfdom.PDFDomTree.writeText(PDFDomTree.java:194)
784    at org.fit.pdfdom.PDFToHTML.main(PDFToHTML.java:77)
785Caused by: java.lang.ClassNotFoundException: org.mabb.fontverter.FontVerter
786    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
787    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
788    at java.security.AccessController.doPrivileged(Native Method)
789    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
790    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
791    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
792    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
793    ... 13 more
794greenstone@machine-name:~/Downloads$