__________________________________________________________

CONTENTS
__________________________________________________________

Xpdf-Tools related
A. XPDF
B. Mojo::DOM perl package for parsing HTML
C. Compiling Xpdf-Tools: statically or dynamically linked
D. How we got Xpdf-Tools to compile using CASCADE-MAKE
E. Getting more output when running CMake (verbosity)
F. APPENDIX - Useful links

LIBJPEG related
G. LIBJPEG and LIBTIFF - Issues building LIBJPEG on 64 bit machines and the patch
H. PDF2DOM unused, replaced by Xpdf-Tools' more suited pdftohtml capabilities

__________________________________________________________

A. XPDF
__________________________________________________________

Xpdf's last mod date is in 2017 and it includes its own pdftohtml utility tool, whereas the old "pdftohtml" tool that GS used was last updated in 2013 (and itself made use of Xpdf, possibly older versions).
The tool takes a PDF and produces an HTML file for each page of the PDF, consisting of selectable HTML text overlaid on top of a "screenshot" image of the page. (A page's text is not part of the screenshot.)

1. https://www.xpdfreader.com/download.html

As per the Readme file found in the linux binary of Xpdf Tools, the Xpdf Viewer requires the Qt toolkit, but the Xpdf Tools do not. Have not read the Install file to confirm whether the same is the case when compiling the command line tools. (But in that case, can't we just include the tools binary available for all 3 OS, instead of compiling on each platform?)

- Using Xpdf's pdftohtml tool:
    greenstone@bedrock:~/Downloads/xpdf-tools-linux-4.00/bin64$ ./pdftohtml -z 1.5 ~/Downloads/ApacheLicence.pdf licence
  where licence is a folder.

- Using Xpdf's pdftotext tool:
    greenstone@bedrock:~/Downloads/xpdf-tools-linux-4.00/bin64$ ./pdftotext -nopgbrk ~/Downloads/ApacheLicence.pdf ~/Downloads/ApacheLicence.txt
  where the output text file must be specified with a full path name.

2.
Documentation on Xpdf-Tools:
- https://www.xpdfreader.com/support.html
  for example, the pdftohtml man page: https://www.xpdfreader.com/pdftohtml-man.html
- https://linux.die.net/man/5/xpdfrc
  (Configuration flags you can put into ~/.xpdfrc to use as defaults when running xpdf tool commands)

3. We're using Xpdf Tools version: xpdf-tools-linux-4.00

4. We started by working with the ready-made Xpdf-tools binaries available for download from the xpdf site for Win, Linux and Mac.

5. We're now moving to compiling up Xpdf-tools ourselves using CASCADE-MAKE, which we have so far got to successfully compile statically on Linux (LSB environment inclusive) to build working binaries.
On Mac, I've been unable to get it to produce statically linked binaries: at this stage they're dynamically linked.

__________________________________________________________

B. Mojo::DOM perl package for parsing HTML
__________________________________________________________

XPDF's pdftohtml conversion of a single PDF document produces multiple HTML files: one for each page in the source PDF.
We want the output to be "paged_html": a single HTML file that is sectionalised, each section representing a page of the original PDF.
We need to be able to parse the many HTML pages produced by XPDF's pdftohtml conversion of a doc, in order to massage the output into the single sectionalised HTML file. For this we needed an HTML parser package for Perl.

1. Before Dr Bainbridge found Mojo::DOM, he looked at
* https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers
* http://radar.oreilly.com/2014/02/parsing-html-with-perl-2.html

2. Main links for Mojo::DOM
* https://mojolicious.org/perldoc/Mojo/DOM
* https://metacpan.org/pod/Mojo::DOM
Dependencies: http://deps.cpantesters.org/?module=Mojo%3A%3ADOM;perl=latest

3. Once you've downloaded Mojo::DOM's src, follow Dr Bainbridge's sequence of commands below for building the Mojo::DOM CPAN module of perl.
We'll be using this module for parsing the HTML output by the XPDF tool pdftohtml.

    mkdir cpan
    tar xvzf Mojolicious-7.84.tar.gz
    cd Mojolicious-7.84/
    perl ./Makefile.PL PREFIX=`pwd`/installed
    make
    make install
    cp -r installed/share/perl/5.18.2 ../cpan
    cd ..
    export PERL5LIB=`pwd`/cpan
    emacs -nw test.pl
        #!/usr/bin/perl -w
        add in 'use v5.10;'
    chmod a+x test.pl
    ./test.pl

__________________________________________________________

C. Compiling Xpdf-Tools: statically or dynamically linked
__________________________________________________________

As explained in detail in section D below, we have a customised gs-CMakeLists.txt file which replaces the one in the xpdf-4.00.tar.gz package's xpdf subfolder after this is untarred. This customised CMake configure/make file now allows us to compile xpdf-tools either statically (as we've now set it up for by default) or dynamically (as its CMake makefiles were originally set up for).

(The "# <=====" markers in the commands here and further below are annotations for this readme only. Leave them out of the actual script: anything following the trailing backslash breaks the line continuation.)

1. To compile Xpdf-Tools statically, packages/CASCADE-MAKE/XPDFTOOLS.sh should contain:

    cmake -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=$prefix \
        -DZLIB_LIBRARY=$prefix/lib/libz.a \                  # <========= THIS
        -DPNG_LIBRARY=$prefix/lib/libpng15.a \               # <========= THIS
        -DFREETYPE_LIBRARY=$prefix/lib/libfreetype.a \       # <========= THIS
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt4=1 \
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt5Widgets=1 \
        -DCMAKE_C_FLAGS="$CFLAGS" \
        -DCMAKE_CXX_FLAGS="$CXXFLAGS" \
        -DCMAKE_EXE_LINKER_FLAGS="$LDFLAGS" \
        -DGSDLFLAG_STATIC="$static_flag" \                   # <========= THIS
        $GEXT_XPDFTOOLS/packages/$package$version

In place of FREETYPE_LIBRARY above, could also try the following,
        -DFREETYPE_DIR=$prefix \
but then check the built binaries by running "ldd" and "file" over them, to make sure they're not referencing any .so dynamic link libraries.

2.
To compile Xpdf-Tools dynamically and make it find *our* dynamically linked libraries for its helper packages zlib, libpng and freetype, edit packages/CASCADE-MAKE/XPDFTOOLS.sh to contain:

    cmake -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=$prefix \
        -DZLIB_LIBRARY=$prefix/lib/libz.so.1.2.7 \                 # <========= THIS
        -DPNG_LIBRARY=$prefix/lib/libpng15.so.15.30.0 \            # <========= THIS
        -DFREETYPE_LIBRARY=$prefix/lib/libfreetype.so.6.3.20 \     # <========= THIS
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt4=1 \
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt5Widgets=1 \
        -DCMAKE_C_FLAGS="$CFLAGS" \
        -DCMAKE_CXX_FLAGS="$CXXFLAGS" \
        -DCMAKE_EXE_LINKER_FLAGS="$LDFLAGS" \
        $GEXT_XPDFTOOLS/packages/$package$version                  # <=== -DGSDLFLAG_STATIC removed

(1) In the above, you could also set
        -DFREETYPE_DIR=$prefix
in place of
        -DFREETYPE_LIBRARY=$prefix/lib/libfreetype.so.6.3.20
In that case, the xpdf-tools compilation finds the "libfreetype.so" (no versioning at end) in our gs2-extension. After successfully building, make sure to have sourced the gs2-extension's setup.bash before running "ldd" over the generated xpdf-tools binaries, in order to let it use the $LD_LIBRARY_PATH we set to find our .so files.

(2) Note that there is no equivalent for ZLIB and LIBPNG: doing -DZLIB_DIR=$prefix or -DPNG_DIR=$prefix will be ineffective, as neither is recognised by xpdf-tools' CMake set up.

__________________________________________________________

D. How we got Xpdf-Tools to compile using CASCADE-MAKE
__________________________________________________________

The process:

1. We set up a CASCADE-MAKE GS2-extension "xpdf-tools" at
trac.greenstone.org/browser/gs2-extensions/xpdf-tools/trunk/src
Be aware that its lowercased "cascade-make" subfolder is an svn external, the original is at
http://trac.greenstone.org/browser/other-projects/cascade-make/trunk/

So far, this CASCADE-MAKE project includes the Xpdf-Tools source tarball, its helper packages zlib, libpng and freetype, as well as CMake to compile the Xpdf-Tools source code.
The next step is to include JPEG and TIFF libraries too.

2a. We downloaded the Xpdf-Tools source tarball, xpdf-4.00.tar.gz, from the xpdf site at https://www.xpdfreader.com/download.html under section "Download the Xpdf source code".
The xpdf-tools source code tarball consists of the source for Xpdf-Tools and Xpdf (Xpdf-Reader). The Xpdf-Reader additionally requires Qt to build and run, but we don't want the Xpdf-Reader, just Xpdf-Tools.

b. Compiling Xpdf-Tools from source and running them requires the following packages and libraries, as per the xpdf-tools source code INSTALL file:

To build xpdf-tools:
- CMake 2.8.8 or newer

Libraries to link against and used by xpdf-tools:
- FreeType 2.0.5 or newer
- libpng (for pdftoppm and pdftohtml)
- zlib (for pdftoppm and pdftohtml)

3. Compilation of xpdf-tools worked with CMake 3.11.4 on the linux resnet machine. However, CMake 3.11.4 itself failed to compile in the LSB environment and on the Mac Mountain Lion machine, because of a version incompatibility between the older g++ installed there and this more recent version of CMake.
CMake version 3.9.6 however is supposed to be compatible with older versions of g++, as per
https://stackoverflow.com/questions/47886400/cmake-configure-error-in-3-10-1-but-not-in-3-9-6
To avoid installing newer versions of g++ and clang in the LSB virtual machine and the Mac, I've shifted the CMake version back to version 3.9.6.

4a. On building xpdf-tools to work with dynamically linked libs found anywhere.

If compiling xpdf-tools against dynamically linked libraries for these packages, then the basic CMake command in packages/CASCADE-MAKE/XPDFTOOLS.sh can look like:

    cmake -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=$prefix \
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt4=1 \
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt5Widgets=1 \
        -DCMAKE_C_FLAGS="$CFLAGS" \
        -DCMAKE_CXX_FLAGS="$CXXFLAGS" \
        -DCMAKE_EXE_LINKER_FLAGS="$LDFLAGS" \
        $GEXT_XPDFTOOLS/packages/$package$version      # Note: no -DGSDLFLAG_STATIC=...
With the above, the xpdf-tools source code and its make files work out of the box.

4b. On building xpdf-tools to work with the dynamically linked libs for freetype, libpng and zlib that we produce when cascade-making the xpdf-tools gs2-extension.

Since we're compiling up the freetype, libpng and zlib packages as part of the Xpdf-Tools GS2-extension with CASCADE-MAKE, the next step was to compile xpdf-tools by dynamically linking against our .so files for these 3 libraries. To do so, XPDFTOOLS.sh should have the following changes:

(1) Set up CFLAGS, CXXFLAGS, CPPFLAGS and LDFLAGS to help the linking of xpdf-tools find our .so versions of the necessary libs:

    export CFLAGS="$CFLAGS -I$GEXTXPDFTOOLS_INSTALLED/include -I$GEXTXPDFTOOLS_INSTALLED/include/libpng15"
    export CPPFLAGS="$CPPFLAGS -I$GEXTXPDFTOOLS_INSTALLED/include -I$GEXTXPDFTOOLS_INSTALLED/include/libpng15"
    export CXXFLAGS="$CXXFLAGS -I$GEXTXPDFTOOLS_INSTALLED/include -I$GEXTXPDFTOOLS_INSTALLED/include/libpng15"
    export LDFLAGS="$LDFLAGS -L$GEXTXPDFTOOLS_INSTALLED/lib"

(2) The CMake command we run must pass the full paths to the actual .so library files (the ones with specific versions in their file names) rather than the symbolically linked, generally-named .so files (the latter won't be found when building xpdf-tools, and CMake will try to look for the .so library files elsewhere on the system):

    cmake -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=$prefix \
        -DZLIB_LIBRARY=$prefix/lib/libz.so.1.2.7 \                 # <========= NEW
        -DPNG_LIBRARY=$prefix/lib/libpng15.so.15.30.0 \            # <========= NEW
        -DFREETYPE_LIBRARY=$prefix/lib/libfreetype.so.6.3.20 \     # <========= NEW
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt4=1 \
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt5Widgets=1 \
        -DCMAKE_C_FLAGS="$CFLAGS" \
        -DCMAKE_CXX_FLAGS="$CXXFLAGS" \
        -DCMAKE_EXE_LINKER_FLAGS="$LDFLAGS" \
        $GEXT_XPDFTOOLS/packages/$package$version      # Again: no -DGSDLFLAG_STATIC=...
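As an aside on point (2): rather than hard-coding the versioned file names, the versioned file behind a generally-named symlink could be looked up at build time. What follows is only a sketch under assumptions, not what XPDFTOOLS.sh does: resolve_lib is a hypothetical helper name, and "readlink -f" is the GNU coreutils form (older Mac OS X readlink has no -f option).

```shell
#!/bin/sh
# Sketch only: resolve a generally-named library symlink (e.g. libz.so)
# to the versioned file it finally points at (e.g. libz.so.1.2.7).
# resolve_lib is a hypothetical helper, not part of XPDFTOOLS.sh.
resolve_lib() {
    # readlink -f follows every symlink in the chain and prints the
    # absolute path of the final target (GNU coreutils behaviour).
    readlink -f "$1"
}

# Possible use in a build script, where $prefix is the install prefix:
#   cmake ... -DZLIB_LIBRARY="$(resolve_lib "$prefix/lib/libz.so")" ...
```

The resolved path could then be passed to the -D..._LIBRARY flags in the same way as the hard-coded versioned names above.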
Further, the "xpdf/CMakeLists.txt" file within the xpdf-4.00.tar.gz source code tarball needs to be modified to refer to ZLIB_LIBRARIES when linking pdftops and pdftoppm. The linking commands for *both* the "pdftops" and "pdftoppm" executable targets in xpdf/CMakeLists.txt should look like the following,

    target_link_libraries(pdftoppm goo fofi splash
                          ${FREETYPE_LIBRARY} ${FREETYPE_OTHER_LIBS}
                          ${DTYPE_LIBRARY}
                          ${LCMS_LIBRARY}
                          ${ZLIB_LIBRARIES})     # <========= NEW

(3) Since CMakeLists.txt has been modified, we initially renamed the xpdf src tarball to gs-xpdf-4.00.tar.gz. However, the current version works with the regular downloaded xpdf-4.00.tar.gz tarball. After extraction, XPDFTOOLS.sh copies the custom packages/gs-CMakeLists.txt across into the extracted tarball's xpdf subdirectory, renaming the file to CMakeLists.txt (so the path to it becomes "xpdf-4.00/xpdf/CMakeLists.txt").

In XPDFTOOLS.sh:

    # patch the original tarball with our custom makefile
    if [[ -d "$package$version/xpdf" && -f "gs-CMakeLists.txt" ]]; then
        echo "*******************************************************************"
        echo "Using our custom gs-CMakeLists.txt instead of the one included in $package$version"
        echo "Renaming gs-CMakeLists.txt to $package$version/xpdf/CMakeLists.txt"
        echo "*******************************************************************"
        cp "gs-CMakeLists.txt" "$package$version/xpdf/CMakeLists.txt"
    fi

4c. On building static xpdf-tools binaries using the static *.a freetype, libpng and zlib libraries that we produce when cascade-making the xpdf-tools gs2-extension.

In order to compile up xpdf-tools *statically*, so that it builds against the static *.a libraries of freetype, libpng and zlib that we produce during the gs2-extension's CASCADE-MAKE process, we have to make further modifications.

(1) First, the XPDFTOOLS.sh cascade-make file should pass the full paths to the actual (non-symbolic link) .a file for each library.
A custom GS flag, GSDLFLAG_STATIC, is also invented in gs-CMakeLists.txt and assigned "-static" for linux and "-Bstatic" for Mac, to pass in during the linking stage of building xpdf-tools.

For Mac OSX, when -static is passed in for linking as on linux, this produced the error "ld: library not found for -lcrt0.o" during the build of the xpdf-tools package. For information, see
https://stackoverflow.com/questions/3801011/ld-library-not-found-for-lcrt0-o-on-osx-10-6-with-gcc-clang-static-flag
The page https://stackoverflow.com/questions/844819/how-to-static-link-on-os-x mentions compiling with -Bstatic on Mac OSX instead. To do so, XPDFTOOLS.sh passes in GSDLFLAG_STATIC set to either "-static" (for linux) or "-Bstatic" (for darwin).
However, the last mentioned stackoverflow page also says that -Bstatic is a no-op, and this appears to be the case when "otool -L" is run over the generated xpdf-tools binaries: the binaries are all dynamically linked. Although they're finding our .so files of freetype, libpng and zlib, they're not finding the .a versions, even though XPDFTOOLS.sh tries to point gs-CMakeLists.txt to the correct .a files.

The new modifications to XPDFTOOLS.sh:

    if [ "x$GSDLOS" == "xdarwin" ] ; then
        static_flag=-Bstatic
    else
        static_flag=-static
    fi
    ...
    cmake -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=$prefix \
        -DZLIB_LIBRARY=$prefix/lib/libz.a \                  # <========= MODIFIED TO .a
        -DPNG_LIBRARY=$prefix/lib/libpng15.a \               # <========= MODIFIED TO .a
        -DFREETYPE_LIBRARY=$prefix/lib/libfreetype.a \       # <========= MODIFIED TO .a
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt4=1 \
        -DCMAKE_DISABLE_FIND_PACKAGE_Qt5Widgets=1 \
        -DCMAKE_C_FLAGS="$CFLAGS" \
        -DCMAKE_CXX_FLAGS="$CXXFLAGS" \
        -DCMAKE_EXE_LINKER_FLAGS="$LDFLAGS" \
        -DGSDLFLAG_STATIC="$static_flag" \                   # <========= NEW
        $GEXT_XPDFTOOLS/packages/$package$version

(2) Our customised gs-CMakeLists.txt file now checks for this flag GSDLFLAG_STATIC being set and, if it is, uses it during the linking stage.
As in (1) above, it will be set to "-static" for Linux and "-Bstatic" for Mac.
- When the flag is set, the linking flags passed into each occurrence of target_link_libraries() in gs-CMakeLists.txt are moreover manually written in the form of "-static -l<library>" rather than using the default linking commands inherited from the original CMakeLists.txt.
- If GSDLFLAG_STATIC isn't set, then we don't build statically, and the linking flags passed to each target_link_libraries() are mostly the original ones.

For example,

    if(GSDLFLAG_STATIC)
        target_link_libraries(pdftoppm goo fofi splash
                              ${GSDLFLAG_STATIC} -lfreetype
                              ${DTYPE_LIBRARY}
                              ${LCMS_LIBRARY}
                              -lz -lm -lc -lpthread)
    else ()
        target_link_libraries(pdftoppm goo fofi splash
                              ${FREETYPE_LIBRARY} ${FREETYPE_OTHER_LIBS}
                              ${DTYPE_LIBRARY}
                              ${LCMS_LIBRARY}
                              ${ZLIB_LIBRARIES})
    endif ()

DETAILED EXPLANATION:
We found that when building *statically*, gs-CMakeLists.txt needed to NOT use PNG_LIBRARIES, ZLIB_LIBRARIES and FREETYPE_LIBRARY in its linker commands, target_link_libraries(), as doing so produced partially dynamic xpdf-tools executables which were moreover BROKEN. They wouldn't run, and in fact attempting to run an xpdf-tool, like "./pdftohtml", would produce a file not found error. Something like "bash: no such file or directory".

Online discussions mentioned that this generally happened when attempting to run 32 bit executables on 64 bit linux when 32 bit loaders are not installed. (In such cases, the solution was to apt-get install some 32 bit package.) However, our broken binaries were all 64 bit, as indicated when running the "file" command on them. Nor did their being partially dynamically linked executables in itself imply that they would be broken, as we were eventually able to produce partially dynamic executables that did work, before solving static linking altogether.
The real issue was that including references to ${FREETYPE_LIBRARY} ${FREETYPE_OTHER_LIBS}, ${PNG_LIBRARIES} and ${ZLIB_LIBRARIES} in any target_link_libraries() resulted in the wrong linking command, producing broken binaries.

Doing the regular target_link_libraries() in static mode results in building with
    "-Wl,-Bstatic -lfreetype -lpng15 -lz -Wl,-Bdynamic -lpthread"
at the end of the link line and produces broken binaries for pdftohtml/pdftoppm/pdftops/pdftopng.
Note that PNG_LIBRARIES includes zlib (-lz): "-lpng -lz", and along with freetype, these are linked statically. However, Threads/lpthread is included as a dynamically linked library instead of including a .a (regardless of whether it's appended as -lpthread or Threads::Threads in the target_link_libraries()), contributing to the pdftohtml binary produced being a partially static, partially dynamic one, so a dynamic executable overall.

The order of dynamic .so files listed by ldd for the broken static binary of pdftohtml differs from that of a manually statically linked working version of pdftohtml, and this seems to be the only difference between the two in ldd's output.

Not using "-Wl,-Bstatic" and using -static (-Bstatic on Mac) in its place creates a partially static dynamic executable that isn't broken, whereas additionally removing "-Wl,-Bdynamic -lpthread" and replacing it with -lpthread moreover produces a working pdftohtml that is a fully statically linked executable.

The inclusion of the math lib and c lib (lm and lc) in the final link command is to completely bypass the remaining .so dependencies that were present in the executable and produce the fully static executable. The lm and lc libs were referenced by all xpdf-tool binaries (as indicated when generating dynamic ones and running ldd over them), but Dr Bainbridge said that -lm and -lc were libs passed in by the compiler by default, which would explain why explicitly setting them for some xpdf-tools and not others may not have mattered.
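Since the explanation above leans on running "file" over the binaries to tell static from dynamic executables, here is a small sketch of how that check could be scripted. It is an illustration only: linkage_of and report_linkage are hypothetical helper names (not part of the gs2-extension), and the sketch assumes that file(1) describes ELF executables with the phrases "statically linked" or "dynamically linked", as it does on Linux.

```shell
#!/bin/sh
# Sketch only: classify an executable from the description printed by file(1).
# linkage_of and report_linkage are hypothetical helpers for illustration.

# linkage_of DESCRIPTION: prints static, dynamic or unknown.
linkage_of() {
    case "$1" in
        *"statically linked"*)  echo static ;;
        *"dynamically linked"*) echo dynamic ;;
        *)                      echo unknown ;;
    esac
}

# report_linkage DIR: one line per xpdf-tools binary found in DIR.
report_linkage() {
    for tool in pdftohtml pdftotext pdftoppm pdftops pdftopng; do
        [ -x "$1/$tool" ] || continue
        # file -b prints the description without the leading file name
        printf '%s: %s\n' "$tool" "$(linkage_of "$(file -b "$1/$tool")")"
    done
}

# Possible use, with the directory holding the built binaries:
#   report_linkage /path/to/xpdf-tools/linux/bin
```

Any binary reported as dynamic can then be inspected further with "ldd" (or "otool -L" on Mac) to see which .so files it still references.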
NOTES:
Initial attempts at modifying gs-CMakeLists.txt for static compiling that proved to be unnecessary:

(1) Setting -static globally doesn't have a useful effect.

    # We want to build static xpdf-tools binaries. See
    # https://stackoverflow.com/questions/24648357/compiling-a-static-executable-with-cmake
    # Want to make the min number of changes for building statically, so using the way
    # below. Beware, must *append* "-static" to existing CMAKE_EXE_LINKER_FLAGS=LD_FLAGS
    ##SET(CMAKE_FIND_LIBRARY_SUFFIXES ".a")
    ##SET(BUILD_SHARED_LIBS OFF)
    ##SET(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -static")

The above 3 lines just add a -static before the "-O2 -Wall -fPIC -rdynamic ..." during linking, such as below. But they have no further effect on whether static building actually succeeds or not. The only effective static linking command (for Linux so far) was to pass -static in the target_link_libraries() command followed by the "-l" for each library in the correct order.

----
/usr/bin/c++
  -I/home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/include
  -I/home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/include
  -I/home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/include/libpng15
  -O3 -Wall -fPIC
  -L/home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/lib
  -L/home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/lib
  -static     ***** <- HERE ******
  -O2 -Wall -fPIC -rdynamic
  CMakeFiles/pdftohtml.dir/HTMLGen.cc.o CMakeFiles/pdftohtml.dir/SplashOutputDev.cc.o
  CMakeFiles/pdftohtml.dir/TextOutputDev.cc.o CMakeFiles/pdftohtml.dir/pdftohtml.cc.o
  CMakeFiles/xpdf_objs.dir/AcroForm.cc.o CMakeFiles/xpdf_objs.dir/Annot.cc.o
  CMakeFiles/xpdf_objs.dir/Array.cc.o CMakeFiles/xpdf_objs.dir/BuiltinFont.cc.o
  CMakeFiles/xpdf_objs.dir/BuiltinFontTables.cc.o CMakeFiles/xpdf_objs.dir/Catalog.cc.o
  CMakeFiles/xpdf_objs.dir/CharCodeToUnicode.cc.o CMakeFiles/xpdf_objs.dir/CMap.cc.o
  CMakeFiles/xpdf_objs.dir/Decrypt.cc.o
  CMakeFiles/xpdf_objs.dir/Dict.cc.o CMakeFiles/xpdf_objs.dir/Error.cc.o
  CMakeFiles/xpdf_objs.dir/FontEncodingTables.cc.o CMakeFiles/xpdf_objs.dir/Form.cc.o
  CMakeFiles/xpdf_objs.dir/Function.cc.o CMakeFiles/xpdf_objs.dir/Gfx.cc.o
  CMakeFiles/xpdf_objs.dir/GfxFont.cc.o CMakeFiles/xpdf_objs.dir/GfxState.cc.o
  CMakeFiles/xpdf_objs.dir/GlobalParams.cc.o CMakeFiles/xpdf_objs.dir/JArithmeticDecoder.cc.o
  CMakeFiles/xpdf_objs.dir/JBIG2Stream.cc.o CMakeFiles/xpdf_objs.dir/JPXStream.cc.o
  CMakeFiles/xpdf_objs.dir/Lexer.cc.o CMakeFiles/xpdf_objs.dir/Link.cc.o
  CMakeFiles/xpdf_objs.dir/NameToCharCode.cc.o CMakeFiles/xpdf_objs.dir/Object.cc.o
  CMakeFiles/xpdf_objs.dir/OptionalContent.cc.o CMakeFiles/xpdf_objs.dir/Outline.cc.o
  CMakeFiles/xpdf_objs.dir/OutputDev.cc.o CMakeFiles/xpdf_objs.dir/Page.cc.o
  CMakeFiles/xpdf_objs.dir/Parser.cc.o CMakeFiles/xpdf_objs.dir/PDFDoc.cc.o
  CMakeFiles/xpdf_objs.dir/PDFDocEncoding.cc.o CMakeFiles/xpdf_objs.dir/PSTokenizer.cc.o
  CMakeFiles/xpdf_objs.dir/SecurityHandler.cc.o CMakeFiles/xpdf_objs.dir/Stream.cc.o
  CMakeFiles/xpdf_objs.dir/TextString.cc.o CMakeFiles/xpdf_objs.dir/UnicodeMap.cc.o
  CMakeFiles/xpdf_objs.dir/UnicodeTypeTable.cc.o CMakeFiles/xpdf_objs.dir/UTF8.cc.o
  CMakeFiles/xpdf_objs.dir/XFAForm.cc.o CMakeFiles/xpdf_objs.dir/XRef.cc.o
  CMakeFiles/xpdf_objs.dir/Zoox.cc.o
  -o pdftohtml
  ../goo/libgoo.a ../fofi/libfofi.a ../splash/libsplash.a
  -static -lfreetype -lpng -lz -lm -lc -lpthread
----

(2) Threads::Threads instead of -lpthread results in a partially dynamic executable.

    # The original, unmodified CMakeLists.txt was not set up sufficiently
    # for static compilation of xpdf-tools. As a result, compile would first fail
    # with errors about undefined refs to mutex / lpthread.
    # When building xpdf-tools statically, need to add the following 2 lines as well
    # as append "Threads::Threads" to the end of each "target_link_libraries()"
    # See https://stackoverflow.com/questions/1620918/cmake-and-libpthread
    # found googling cmake and "-lpthread" (pthread) after ERRORS to do with this, like:
    # undefined reference to `pthread_mutex_unlock'
    ##set(THREADS_PREFER_PTHREAD_FLAG ON)
    ##find_package(Threads REQUIRED)

In instances when compilation was successful, including the above 2 lines in combination with "Threads::Threads" as the final argument to every target_link_libraries(...) occurrence in gs-CMakeLists.txt would only manage to produce partially dynamically linked xpdf-tools binaries. (Depending on what the linking command was when building Xpdf-Tools, the partially dynamically linked executables may work or may be broken. See the explanation further above.)

We wanted fully statically linked binaries, for which we needed to pass in "-lpthread" as the trailing argument to each target_link_libraries(...).

So without either, compilation will fail. However, with "Threads::Threads" the binaries weren't fully static, whereas with -lpthread the xpdf-tools executables were fully static, as CMake no longer tried to link against a dynamic Threads library.

(5) To view the unmodified CMakeLists.txt included in the xpdf-4.00 source code tarball, untar it and look for its "xpdf/CMakeLists.txt" (not the toplevel file of the same name). Run a 'diff' against gs-CMakeLists.txt to see further differences, such as debug statements and comments. Most comments have been removed and placed into this readme file instead.

(6) When CASCADE-MAKE is run on the xpdf-tools GS2-extension, it first compiles up CMake, needed to compile up xpdf-tools. Unlike the library packages like freetype, libpng and zlib that we also build for xpdf-tools as part of this gs2-extension, CMake's build products don't need to be included in the distribution tarball of our built xpdf-tools executables.
There's a "move-cmake.sh" script in the xpdf-tools gs2-extension that can be run with the "away" and "back" options: "away" moves the CMake stuff out of the way (into a "devel" folder) after successfully building the xpdf binaries, and "back" moves it back if wanting to recompile. The script can be run manually, but it's also run by the extension:
- packages/CASCADE-MAKE/XPDFTOOLS.sh runs "move-cmake.sh away" after xpdf-tools has been built, so that the extension's install location is ready for tarring up for distribution.
- When recompiling the xpdf-tools extension, the CASCADE-MAKE process will run the packages/CASCADE-MAKE/CMAKE.sh file, which in turn runs "move-cmake.sh back" if there's a prebuilt CMake which had earlier been moved out of the way.

__________________________________________________________

E. Getting more output when running CMake (verbosity)
__________________________________________________________

See https://www.linuxquestions.org/questions/programming-9/cmake-or-make-debug-output-show-command-624800/

To turn on debugging:
    export VERBOSE=1
    ./CASCADE-MAKE.sh

To turn off debugging, need to actually make VERBOSE empty again (don't set it to 0; running "unset VERBOSE" works as well):
    export VERBOSE=
    ./CASCADE-MAKE.sh

__________________________________________________________

F. APPENDIX - Useful links
__________________________________________________________

A. Helping CMake along.
(Not all of this was necessary for compiling xpdf-tools statically, but they're generally useful links)
https://github.com/SynoCommunity/spksrc/issues/1779
https://stackoverflow.com/questions/1620918/cmake-and-libpthread
https://cmake.org/cmake/help/v3.0/prop_tgt/LINK_FLAGS.html
https://cmake.org/cmake/help/v3.11/command/target_link_libraries.html?highlight=target_link_libraries
https://stackoverflow.com/questions/24648357/compiling-a-static-executable-with-cmake
https://stackoverflow.com/questions/42815420/cmake-cant-find-my-static-libs
https://cmake.org/cmake/help/v3.0/command/message.html
https://stackoverflow.com/questions/30980383/cmake-compile-options-for-libpng
https://stackoverflow.com/questions/36220123/undefined-reference-to-png-set-longjmp-fn-when-compiling-pcl-source-file

B. About the error "bash: no such file or directory" when run on a statically generated binary:
https://askubuntu.com/questions/351827/unable-to-run-a-32-bit-program-on-64-bit-vm/353497#353497
https://unix.stackexchange.com/questions/13391/getting-not-found-message-when-running-a-32-bit-binary-on-a-64-bit-system/13409#13409
https://arstechnica.com/civis/viewtopic.php?f=16&t=1173118
https://superuser.com/questions/344533/no-such-file-or-directory-error-in-bash-but-the-file-exists
https://unix.stackexchange.com/questions/45277/executing-binary-file-file-not-found

C. Other links
https://unix.stackexchange.com/questions/279397/ldd-dont-find-path-how-to-add

D.
On why you can't build static binaries on Mac, but can build static libraries and link against them
https://developer.apple.com/library/archive/qa/qa1118/_index.html (official page on how Mac doesn't support static binaries)
https://stackoverflow.com/questions/3801011/ld-library-not-found-for-lcrt0-o-on-osx-10-6-with-gcc-clang-static-flag
https://stackoverflow.com/questions/844819/how-to-static-link-on-os-x (mention of -Bstatic)
https://www.allegro.cc/forums/thread/610923
https://dropline.net/2015/10/static-linking-on-mac-os-x/ (explains that on Mac, .dylibs must be hidden for .a versions of libraries to be selected when linking)

This means that where possible we want to essentially do "--enable-static --disable-shared", or equivalent, when generating the freetype, libz, libpng, libjpg and libtiff library files, so that Xpdf-Tools links against the .a files we generated rather than additional .dylib files.

http://www.simplesystems.org/libtiff/build.html
Configuration options for building libtiff. Want to turn off the compile process for libtiff producing tiff binaries, but there appears to be no such option.

__________________________________________________________

G. LIBJPEG and LIBTIFF
__________________________________________________________

1. Issues building LIBJPEG on 64 bit machines and the patch

I copied the LIBJPEG package from
http://trac.greenstone.org/browser/other-projects/realistic-books/trunk/packages
(also at http://trac.greenstone.org/browser/gs2-extensions/ocr/trunk/packages/cmdline).

* Configuring out of the box produced the following error:
    checking host system type... Invalid configuration `x86_64-unknown-linux-gnu': machine `x86_64-unknown' not recognized

* As a consequence, when running make on the libjpeg package, make failed with the error:
    ./libtool --mode=compile gcc -I/home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/include -fPIC -I/home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/include -I.
    -c ./jcapimin.c
    make: ./libtool: Command not found
    make: *** [jcapimin.lo] Error 127
    Error encountered running *make * stage of ./CASCADE-MAKE/LIBJPEG.sh

The same was true when I grabbed the libjpeg from sourceforge (https://sourceforge.net/projects/libjpeg/files/), which was also still version jpeg 6b from 2008.

I found the following webpages discussing the above error messages:
- https://unix.stackexchange.com/questions/80479/how-to-work-with-libtool
- https://github.com/rwestlund/freesweep/issues/1
- https://ubuntuforums.org/showthread.php?t=1232714
- https://stackoverflow.com/questions/12828687/configure-fails-to-detect-proper-ld-on-a-64-bit-system-with-32-bit-userland
- SOLUTION: https://sourceforge.net/p/libjpeg/bugs/12/

However, the error only strikes when configure is run with --enable-static. Note also that, contrary to the above pages, running configure with the additional options
    --host=x86_64-linux-gnu --build=x86_64-linux-gnu --target=x86_64-linux-gnu --disable-shared --enable-static
did not help. Nor did adding the above flags stop configure from attempting to work with host=x86_64-unknown(-unknown)-linux-gnu

The SOLUTION, found when searching for the error message along with "enable-static" (as it's the combination that is relevant), is described at https://sourceforge.net/p/libjpeg/bugs/12/
It is to patch up the config.sub file included in the jpeg-6b tarball, so that it also covers x86_64-* machines:

    tahoe | i860 | x86_64-* | m32r | m68k | m68000 | m88k | ns32k | arc | arm \

The above change is necessary because this libjpeg is outdated and has been superseded by other JPEG libraries, also discussed at https://sourceforge.net/p/libjpeg/bugs/12/
I'm not sure if those libraries are compatible with XpdfTools however, so I'm sticking with libjpeg as long as I can get it to build and be recognised by XpdfTools.
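For illustration, the one-line config.sub change quoted above could be scripted roughly as follows. This is a sketch under assumptions and not how the extension applies the fix (it swaps in a whole patched copy of the file): add_x86_64_entry is a hypothetical helper name, and the sketch assumes the unpatched line reads exactly like the patched one minus the "x86_64-* |" entry.

```shell
#!/bin/sh
# Sketch only: insert an "x86_64-* |" entry after "tahoe | i860 |" in
# config.sub-style text read from stdin.
# add_x86_64_entry is a hypothetical helper, not part of LIBJPEG.sh.
add_x86_64_entry() {
    # Lines already mentioning x86_64-* are left alone, so running the
    # filter a second time changes nothing.
    sed '/x86_64-\*/!s/tahoe | i860 | /tahoe | i860 | x86_64-* | /'
}

# Possible use on an untarred jpeg-6b tree:
#   add_x86_64_entry < jpeg-6b/config.sub > config.sub.new
```

Writing to a new file and comparing it with the original via diff before replacing anything keeps the change easy to verify.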
The solution is once more to have a patch file: after the jpeg-6b package is untarred, CASCADE-MAKE/LIBJPEG.sh replaces the config.sub within it with packages/gs-libjpeg-config.sub, which contains the patch.

2. I followed the instructions at http://www.linuxfromscratch.org/blfs/view/6.3/general/libjpeg.html to try to build libjpeg with --enable-static and --enable-shared to produce both libjpeg.a and libjpeg.so.
However, nothing I try gets it to generate a libjpeg.so. It seems to always produce a libjpeg.a in xpdf-tools/linux/lib regardless of whether CASCADE-MAKE/LIBJPEG.sh passes the --enable-static flag to the configure command or not, and regardless of whether --enable-shared is additionally or individually passed in.
As a consequence, there's no libjpeg.so file to set the -DJPEG_LIBRARY flag in XPDFTOOLS.sh to when building xpdf-tools against dynamically linked libraries.

I tried the various combinations with the lib jpeg-6b source tarballs from
- sourceforge, https://sourceforge.net/projects/libjpeg/files/, the latest tarball of this was from 2008
- http://www.linuxfromscratch.org/blfs/view/6.3/general/libjpeg.html, which was last updated in 2007
- http://trac.greenstone.org/browser/other-projects/realistic-books/trunk/packages/jpeg-6b.tar.gz, which was added to trac in 2009 but is probably the 2008 or 2007 version too.

3.
Modifications for using TIFF and JPEG libraries when building Xpdf-Tools:

* CASCADE-MAKE.sh, replaced
    PACKAGES="CMAKE LIBZ LIBPNG FREETYPE XPDFTOOLS"
  with
    PACKAGES="CMAKE LIBZ LIBTIFF LIBPNG LIBJPEG FREETYPE XPDFTOOLS"

* XPDFTOOLS.sh
  If compiling statically, make sure the CMake command contains the following changes:
    -DTIFF_INCLUDE_DIR=$prefix/include \ # <========== new
    -DJPEG_INCLUDE_DIR=$prefix/include \ # <========== new
    -DZLIB_LIBRARY=$prefix/lib/libz.a \
    -DTIFF_LIBRARY=$prefix/lib/libtiff.a \ # <========== new
    -DPNG_LIBRARY=$prefix/lib/libpng15.a \
    -DJPEG_LIBRARY=$prefix/lib/libjpeg.a \ # <========== new
    -DFREETYPE_LIBRARY=$prefix/lib/libfreetype.a \
    -DGSDLFLAG_STATIC="$static_flag" \

  The above flag names were discovered by deleting the untarred xpdf-4.00 folder and then, in a fresh terminal, sourcing devel.bash from xpdf-tools and re-running CASCADE-MAKE.sh without the above modifications:
    -- Found FreeType (new-style includes): /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/lib/libfreetype.a
    -- Found ZLIB: /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/lib/libz.a (found version "1.2.8")
    -- Found PNG: /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/xpdf-tools/linux/lib/libpng15.a (found version "1.2.50")
    -- Could NOT find JPEG (missing: JPEG_LIBRARY JPEG_INCLUDE_DIR)
    -- Could NOT find TIFF (missing: TIFF_LIBRARY TIFF_INCLUDE_DIR)
    -- lcms2 not found
    -- No Qt library found

* packages/gs-CMakeLists.txt was modified again,
  - this time to also pass:
      -ltiff and -ljpeg to all target_link_libraries() commands that run when GSDLFLAG_STATIC is set
    and
      ${TIFF_LIBRARY} and ${JPEG_LIBRARY} to all target_link_libraries() commands that run when GSDLFLAG_STATIC is not set
  - and to add in the include directories and definitions if JPEG/TIFF libraries were provided:

    if (JPEG_FOUND)
      include_directories("${JPEG_INCLUDE_DIR}")
      add_definitions("${JPEG_DEFINITIONS}")
      message(STATUS "@@@@@@@@@@@@@@@ JPEG_FOUND (include_dir ; include_dirs): ${JPEG_INCLUDE_DIR} ; ${JPEG_INCLUDE_DIRS}")
    else ()
      message(STATUS "@@@@@@@@@@@@@@@ NO JPEG_FOUND")
    endif ()

    if (TIFF_FOUND)
      include_directories("${TIFF_INCLUDE_DIRS}")
      add_definitions("${TIFF_DEFINITIONS}")
      message(STATUS "@@@@@@@@@@@@@@@ TIFF_FOUND ${TIFF_INCLUDE_DIRS}")
    else ()
      message(STATUS "@@@@@@@@@@@@@@@ NO TIFF_FOUND")
    endif ()

  Note however that although gs-CMakeLists.txt now knows what the pluralised TIFF_INCLUDE_DIRS is (and TIFF_INCLUDE_DIR), as for PNG and ZLIB, gs-CMakeLists.txt does not have a value for the pluralised JPEG_INCLUDE_DIRS, only the singular JPEG_INCLUDE_DIR set above. This is even though the CMake flags in XPDFTOOLS.sh for the tiff and jpeg libs seem to have been set up in the same way now. Not sure where these automatically assigned variables come from in order to check up on them.

__________________________________________________________
H. PDF2DOM: tried it out, but wasn't what we wanted
__________________________________________________________

Using PDFBox to convert a PDF to full HTML, with both images and text placed correctly with respect to each other, is tricky, see
https://stackoverflow.com/questions/9671239/pdfbox-convert-a-pdf-to-text-or-html-including-images-from-the-pdf
(Google: pdfbox to convert pdf to html with images)

PDF2DOM tool (based on PDFBox) to convert PDF to HTML with images
* http://cssbox.sourceforge.net/pdf2dom/documentation.php
* Got the command line jar tool, PDFToHTML.jar version 1.7, from https://sourceforge.net/projects/cssbox/files/Pdf2DOM/
* Further information and source code at https://github.com/radkovo/Pdf2Dom
* API: http://cssbox.sourceforge.net/pdf2dom/api/index.html

1. 
Running java -jar PDFToHTML.jar []

    greenstone@machine-name:~/Downloads$ java -jar PDFToHTML.jar SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2

It will output the page, but you'll see the following output indicating that the logger is not displaying anything:

    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
    SLF4J: Defaulting to no-operation (NOP) logger implementation
    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

See https://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder

To see error output, download the SLF4J simple jar and run as follows:

    greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2

The above is a PDF produced by MS Word (archive format) and works fine: a font folder is generated containing the extracted fonts.

The following is a PDF produced from the same doc file by the latest libreoffice installed on Windows: ApacheLicencePDFA_FromODT.pdf
But running the same command on it produces the following font errors:

    greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA_FromODT.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
    [main] INFO org.reflections.Reflections - Reflections took 163 ms to scan 1 urls, producing 36 keys and 222 values
    [main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
    [main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
    [main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
    [main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException

Fonts get extracted if the source PDF was generated by MS Word's doc to PDF conversion.
Fonts didn't get extracted from the PDF upon conversion to HTML when libreoffice was used to convert a .doc to the source PDF.

2. Check version of PDF
https://www.codeproject.com/Questions/167550/How-to-check-different-versions-of-PDF

3. pdf to html command line conversion open source
https://stackoverflow.com/questions/8370014/how-to-convert-pdf-to-html
"Download
    pdfbox-2.0.3.jar
    fontbox-2.0.3.jar
    preflight-2.0.3.jar
    xmpbox-2.0.3.jar
    pdfbox-tools-2.0.3.jar
    pdfbox-debugger-2.0.3.jar
from http://pdfbox.apache.org/
...
PLEASE NOTE: Images do not get pushed to the HTML output."

4. Need a way to check if a PDF contains images, then use pdf2dom, else basic pdfbox conversion to html (fewer div tags with inline style markup)?
https://stackoverflow.com/questions/46215879/count-images-in-pdf-using-pdfbox

UNUSED
Googled for: java tool convert pdf version
* https://stackoverflow.com/questions/11137912/all-inclusive-tool-to-convert-different-types-of-documents-to-pdf
* https://www.qoppa.com/pdfprocess/
  jPDFProcess – Java PDF Library to Create, Manipulate PDF (appears to be payware)
* https://www.gnostice.com/nl_article.asp?id=95&t=How_to_Change_the_PDF_Version_of_a_Document
  How to Convert a PDF Document to an Older or Newer Version
  uses .NET
* http://www.baeldung.com/pdf-conversions-java
  PDF Conversions in Java e.g.
  PDF to html and html to PDF

__________________________________________________________

greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
[main] INFO org.reflections.Reflections - Reflections took 153 ms to scan 1 urls, producing 36 keys and 222 values
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException

greenstone@machine-name:~/Downloads$ java -classpath Pdf2Dom/target/pdf2dom-1.8-SNAPSHOT.jar:pdfbox-app.jar:slf4j-jdk14-1.6.6.jar:log4j-over-slf4j-1.6.6.jar:slf4j-api-1.6.6.jar org.fit.pdfdom.PDFToHTML SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
Exception in thread "main" java.lang.NoClassDefFoundError: org/mabb/fontverter/FontVerter
    at org.fit.pdfdom.FontTable$Entry.loadTrueTypeFont(FontTable.java:178)
    at org.fit.pdfdom.FontTable$Entry.getData(FontTable.java:147)
    at org.fit.pdfdom.FontTable$Entry.isEntryValid(FontTable.java:161)
    at org.fit.pdfdom.FontTable.addEntry(FontTable.java:48)
    at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:378)
    at org.fit.pdfdom.PDFBoxTree.updateFontTable(PDFBoxTree.java:361)
    at org.fit.pdfdom.PDFDomTree.updateFontTable(PDFDomTree.java:544)
    at org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:206)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218)
    at org.fit.pdfdom.PDFDomTree.writeText(PDFDomTree.java:194)
    at org.fit.pdfdom.PDFToHTML.main(PDFToHTML.java:77)
Caused by: java.lang.ClassNotFoundException: org.mabb.fontverter.FontVerter
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 13 more
greenstone@machine-name:~/Downloads$
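
The NoClassDefFoundError in the last run above means that none of the jars on that classpath contains org.mabb.fontverter.FontVerter. One way to check which jar, if any, provides a given class is to list each jar's entries and look for the class file. The following is a minimal sketch only: the function names class_to_entry and jars_providing are made up for illustration, and it assumes the unzip command is installed (jar tf from a JDK could be used to list entries instead).

```shell
#!/bin/sh
# Hypothetical helpers for checking a classpath by hand.

# Turn a fully qualified class name into the entry name it has inside a jar,
# e.g. a.b.C becomes a/b/C.class
class_to_entry() {
    printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Print every jar (from the remaining arguments) that contains the class
# named in the first argument. Prints nothing if no jar provides it.
# Assumption: unzip is available; unreadable or missing jars are skipped.
jars_providing() {
    entry=$(class_to_entry "$1")
    shift
    for jar in "$@"; do
        if unzip -l "$jar" 2>/dev/null | grep -q -F "$entry"; then
            printf '%s\n' "$jar"
        fi
    done
}
```

For the failing command above this would be run as, for example, jars_providing org.mabb.fontverter.FontVerter Pdf2Dom/target/pdf2dom-1.8-SNAPSHOT.jar pdfbox-app.jar; no output means the class is missing from those jars.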