Changeset 32250

Timestamp:
09.07.2018 21:46:18 (12 days ago)
Author:
ak19
Message:

Since we have GS-README.txt, we're no longer using README.txt. Moved the extra information from README, about the now unused PDF2DOM, into GS-README.txt

Location:
gs2-extensions/xpdf-tools/trunk/src
Files:
1 removed
1 modified

  • gs2-extensions/xpdf-tools/trunk/src/packages/GS-README.txt

    r32249 r32250
    14 14   G. LIBJPEG and LIBTIFF
    15 15   - Issues building LIBJPEG on 64 bit machines and the patch
       16
       17   H. PDF2DOM
       18       unused, replaced by Xpdf-Tools' more suited pdftohtml capabilities
    16 19
    17 20   __________________________________________________________

    574 577
    575 578
     579__________________________________________________________ 
     580H. PDF2DOM: tried it out, but wasn't what we wanted 
     581__________________________________________________________ 
     582Using PDFBox to convert a PDF to full HTML, both images and text and placed correctly with respect to each other, is tricky, see https://stackoverflow.com/questions/9671239/pdfbox-convert-a-pdf-to-text-or-html-including-images-from-the-pdf 
     583(Google: pdfbox to convert pdf to html with images) 
     584 
     585PDF2DOM tool (based on PDFBox) to convert PDF to HTML with images 
     586* http://cssbox.sourceforge.net/pdf2dom/documentation.php 
     587* Got the command line jar tool, PDFToHTML.jar version 1.7, from https://sourceforge.net/projects/cssbox/files/Pdf2DOM/ 
     588* Further information and source code at https://github.com/radkovo/Pdf2Dom 
     589* API: http://cssbox.sourceforge.net/pdf2dom/api/index.html 
     590 
     591 
     5921. Running 
     593 
     594java -jar PDFToHTML.jar <infile> [<outfile>] 
     595 
     596    greenstone@machine-name:~/Downloads$ java -jar PDFToHTML.jar SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2 
     597 
     598 
     599It will output the page, but you'll see the following output indicating that the logger is not displaying anything: 
     600    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". 
     601    SLF4J: Defaulting to no-operation (NOP) logger implementation 
     602    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. 
     603 
     604See https://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder 
     605 
     606To see the error output, download the SLF4J simple jar and run as follows: 
     607 
     608    greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2 
     609 
     610The above is a PDF produced by MS Word (archive format) and works fine: a font folder is generated containing the extracted fonts. 
     611 
     612The following is a PDF produced from the same doc file by the latest LibreOffice installed on Windows: 
     613    ApacheLicencePDFA_FromODT.pdf 
     614But running the same command on it produces the following font errors: 
     615 
     616greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA_FromODT.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2 
     617[main] INFO org.reflections.Reflections - Reflections took 163 ms to scan 1 urls, producing 36 keys and 222 values  
     618[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException 
     619[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException 
     620[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException 
     621[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException 
     622 
     623Fonts are extracted when the source PDF was generated by MS Word's doc to PDF conversion, but not when LibreOffice was used to convert the same .doc to the source PDF. 
     624 
     6252. Check version of PDF 
     626https://www.codeproject.com/Questions/167550/How-to-check-different-versions-of-PDF 
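A minimal sketch of one way to check the version without extra tools, assuming only that a PDF file begins with a header of the form "%PDF-M.m". The class and method names (PdfHeaderVersion, fromBytes, fromFile) are hypothetical and not part of PDFBox or Pdf2Dom. Note that from PDF 1.4 onwards the document catalog may carry a /Version entry that overrides the header, which this sketch does not look at.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reads the "%PDF-M.m" header at the start of a file.
final class PdfHeaderVersion {
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d+\\.\\d+)");

    /** Returns the header version such as "1.4", or null if no header is found. */
    static String fromBytes(byte[] head) {
        // ISO-8859-1 maps each byte to exactly one char, so binary data decodes safely.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    /** Reads at most the first 1024 bytes of the file and looks for the header there. */
    static String fromFile(Path pdf) throws IOException {
        byte[] buf = new byte[1024];
        int total = 0;
        try (InputStream in = Files.newInputStream(pdf)) {
            int n;
            while (total < buf.length
                    && (n = in.read(buf, total, buf.length - total)) > 0) {
                total += n;
            }
        }
        return fromBytes(Arrays.copyOf(buf, total));
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            System.out.println(name + ": " + fromFile(Paths.get(name)));
        }
    }
}
```

After compiling, it could be run as: java PdfHeaderVersion SampleDoc1.pdf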
     627 
     628 
     6293. pdf to html command line conversion open source 
     630https://stackoverflow.com/questions/8370014/how-to-convert-pdf-to-html 
     631 
     632"Download 
     633 
     634    pdfbox-2.0.3.jar 
     635    fontbox-2.0.3.jar 
     636    preflight-2.0.3.jar 
     637    xmpbox-2.0.3.jar 
     638    pdfbox-tools-2.0.3.jar 
     639    pdfbox-debugger-2.0.3.jar 
     640 
     641from http://pdfbox.apache.org/ 
     642... 
     643 
     644PLEASE NOTE: Images do not get pushed to the HTML output." 
     645 
     646 
     6474. Need a way to check whether a PDF contains images: if so, use pdf2dom, else fall back to the basic pdfbox conversion to html (fewer div tags with inline style markup)? 
     648https://stackoverflow.com/questions/46215879/count-images-in-pdf-using-pdfbox 
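The linked answer counts images through PDFBox. As a dependency-free alternative, the following is a rough heuristic sketch, not the PDFBox approach: it scans the raw bytes for an image XObject dictionary entry ("/Subtype /Image"). The class and method names are hypothetical. It will miss inline images embedded in content streams and unusual name encodings, so treat a negative result as a hint only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

// Hypothetical helper: byte-level heuristic, does not parse the PDF.
final class PdfImageSniffer {
    // Matches "/Subtype /Image" with optional whitespace between the names,
    // but not longer names that merely start with "Image".
    private static final Pattern IMAGE_XOBJECT =
            Pattern.compile("/Subtype\\s*/Image(?![A-Za-z0-9])");

    static boolean looksLikeItHasImages(byte[] pdfBytes) {
        // ISO-8859-1 keeps a one-to-one byte-to-char mapping for binary content.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return IMAGE_XOBJECT.matcher(raw).find();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(looksLikeItHasImages(bytes)
                ? "image XObject found: consider pdf2dom"
                : "no image XObject found: basic pdfbox conversion may do");
    }
}
```

The whole file is read into memory, which is fine for a quick check on typical documents but would want streaming for very large PDFs.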
     649 
     650 
     651UNUSED 
     652Googled for: java tool convert pdf version 
     653* https://stackoverflow.com/questions/11137912/all-inclusive-tool-to-convert-different-types-of-documents-to-pdf 
     654* https://www.qoppa.com/pdfprocess/ 
     655jPDFProcess – Java PDF Library to Create, Manipulate PDF 
     656(appears to be payware) 
     657* https://www.gnostice.com/nl_article.asp?id=95&t=How_to_Change_the_PDF_Version_of_a_Document 
     658How to Convert a PDF Document to an Older or Newer Version 
     659uses .NET 
     660* http://www.baeldung.com/pdf-conversions-java 
     661PDF Conversions in Java 
     662e.g. PDF to html and html to PDF 
     663 
     664 
     665__________________________________________________________ 
     666 
     667greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2 
     668[main] INFO org.reflections.Reflections - Reflections took 153 ms to scan 1 urls, producing 36 keys and 222 values  
     669[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException 
     670[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException 
     671[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException 
     672[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException 
     673 
     674 
     675 
     676greenstone@machine-name:~/Downloads$ java -classpath Pdf2Dom/target/pdf2dom-1.8-SNAPSHOT.jar:pdfbox-app.jar:slf4j-jdk14-1.6.6.jar:log4j-over-slf4j-1.6.6.jar:slf4j-api-1.6.6.jar  org.fit.pdfdom.PDFToHTML SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2 
     677Exception in thread "main" java.lang.NoClassDefFoundError: org/mabb/fontverter/FontVerter 
     678    at org.fit.pdfdom.FontTable$Entry.loadTrueTypeFont(FontTable.java:178) 
     679    at org.fit.pdfdom.FontTable$Entry.getData(FontTable.java:147) 
     680    at org.fit.pdfdom.FontTable$Entry.isEntryValid(FontTable.java:161) 
     681    at org.fit.pdfdom.FontTable.addEntry(FontTable.java:48) 
     682    at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:378) 
     683    at org.fit.pdfdom.PDFBoxTree.updateFontTable(PDFBoxTree.java:361) 
     684    at org.fit.pdfdom.PDFDomTree.updateFontTable(PDFDomTree.java:544) 
     685    at org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:206) 
     686    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319) 
     687    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266) 
     688    at org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218) 
     689    at org.fit.pdfdom.PDFDomTree.writeText(PDFDomTree.java:194) 
     690    at org.fit.pdfdom.PDFToHTML.main(PDFToHTML.java:77) 
     691Caused by: java.lang.ClassNotFoundException: org.mabb.fontverter.FontVerter 
     692    at java.net.URLClassLoader$1.run(URLClassLoader.java:366) 
     693    at java.net.URLClassLoader$1.run(URLClassLoader.java:355) 
     694    at java.security.AccessController.doPrivileged(Native Method) 
     695    at java.net.URLClassLoader.findClass(URLClassLoader.java:354) 
     696    at java.lang.ClassLoader.loadClass(ClassLoader.java:425) 
     697    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) 
     698    at java.lang.ClassLoader.loadClass(ClassLoader.java:358) 
     699    ... 13 more 
     700greenstone@machine-name:~/Downloads$
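
The NoClassDefFoundError above indicates that org.mabb.fontverter.FontVerter was not on the classpath given to java. A minimal sketch for checking a classpath before running the converter follows; ClasspathCheck and isLoadable are hypothetical names, and the default class list is simply taken from the stack trace above.

```java
// Hypothetical helper: reports whether given classes can be found on the classpath.
final class ClasspathCheck {
    static boolean isLoadable(String className) {
        try {
            // initialize=false: locate the class without running its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError raised for missing dependencies.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = args.length > 0 ? args : new String[] {
            "org.fit.pdfdom.PDFToHTML",
            "org.mabb.fontverter.FontVerter"
        };
        for (String name : required) {
            System.out.println((isLoadable(name) ? "found    " : "MISSING  ") + name);
        }
    }
}
```

Run it with the same -classpath value as the failing command, plus the directory holding ClasspathCheck.class; any line marked MISSING points at a jar that still needs to be added.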