Changeset 32250


Timestamp: 07/09/18 21:46:18 (3 years ago)
Author: ak19
Message: Since we have GS-README.txt, we're no longer using README.txt. Moved the extra information from README.txt, about the now unused PDF2DOM, into GS-README.txt.

Location: gs2-extensions/xpdf-tools/trunk/src
Files: 1 deleted, 1 edited

  • gs2-extensions/xpdf-tools/trunk/src/packages/GS-README.txt

    r32249 r32250
    14 14 G. LIBJPEG and LIBTIFF
    15 15 - Issues building LIBJPEG on 64 bit machines and the patch
       16
       17 H. PDF2DOM
       18     unused, replaced by Xpdf-Tools' more suited pdftohtml capabilities
    16 19
    17 20 __________________________________________________________
     
    574 577
    575 578
     579__________________________________________________________
     580H. PDF2DOM: tried it out, but wasn't what we wanted
     581__________________________________________________________
     582 Using PDFBox to convert a PDF to full HTML, with both images and text placed correctly relative to each other, is tricky. See https://stackoverflow.com/questions/9671239/pdfbox-convert-a-pdf-to-text-or-html-including-images-from-the-pdf
     583(Google: pdfbox to convert pdf to html with images)
     584
     585PDF2DOM tool (based on PDFBox) to convert PDF to HTML with images
     586* http://cssbox.sourceforge.net/pdf2dom/documentation.php
     587* Got the command line jar tool, PDFToHTML.jar version 1.7, from https://sourceforge.net/projects/cssbox/files/Pdf2DOM/
     588* Further information and source code at https://github.com/radkovo/Pdf2Dom
     589* API: http://cssbox.sourceforge.net/pdf2dom/api/index.html
     590
     591
     592 1. Running
     593
     594java -jar PDFToHTML.jar <infile> [<outfile>]
     595
     596    greenstone@machine-name:~/Downloads$ java -jar PDFToHTML.jar SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
     597
     598
     599 It will output the page, but you'll also see the following messages, indicating that no logger is bound and so nothing is being logged:
     600    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
     601    SLF4J: Defaulting to no-operation (NOP) logger implementation
     602    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
     603
     604See https://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder
     605
     606 To see error output, download the SLF4J simple jar and run as follows:
     607
     608    greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
     609
     610 The above is an MS Word-produced PDF (archive format) and works fine: a font folder is generated containing the extracted fonts.
     611
     612 The following is a PDF produced from the same doc file by the latest LibreOffice installed on Windows:
     613    ApacheLicencePDFA_FromODT.pdf
     614 But running the same command on it produces the following font errors:
     615
     616greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA_FromODT.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
     617[main] INFO org.reflections.Reflections - Reflections took 163 ms to scan 1 urls, producing 36 keys and 222 values
     618[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
     619[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
     620[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
     621[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
     622
     623 Fonts are extracted when the source PDF was generated by MS Word's doc-to-PDF conversion. They were not extracted during conversion to HTML when LibreOffice was used to convert the .doc to the source PDF.
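
The warnings above repeat each failing font, so a short summary can make the two cases easier to compare. The following is a minimal sketch, not part of PDF2DOM: the class name `FontWarningSummary` and its methods are hypothetical, and it assumes the log format stays as shown in the output above. The six-uppercase-letter prefix followed by `+` (as in `BAAAAA+Georgia`) is the PDF convention for naming embedded font subsets.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: summarises "Error loading font" warnings from a PDFToHTML log.
public class FontWarningSummary {
    // Matches the font name quoted in the FontTable warning lines shown above.
    private static final Pattern FONT_ERROR = Pattern.compile("Error loading font '([^']+)'");

    /** Returns the distinct font names that failed to load, in first-seen order. */
    public static Set<String> failedFonts(List<String> logLines) {
        Set<String> fonts = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = FONT_ERROR.matcher(line);
            if (m.find()) {
                fonts.add(m.group(1));
            }
        }
        return fonts;
    }

    /** Strips a six-letter subset tag, e.g. "BAAAAA+Georgia" becomes "Georgia". */
    public static String baseFontName(String name) {
        return name.replaceFirst("^[A-Z]{6}\\+", "");
    }

    public static void main(String[] args) {
        // Reads the captured log from standard input.
        List<String> lines = new BufferedReader(new InputStreamReader(System.in))
                .lines().collect(Collectors.toList());
        for (String font : failedFonts(lines)) {
            System.out.println(font + " (base font: " + baseFontName(font) + ")");
        }
    }
}
```

To use it, save the stderr output of the PDFToHTML run to a file and feed that file to the program on standard input.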
     624
     625 2. Check version of PDF
     626https://www.codeproject.com/Questions/167550/How-to-check-different-versions-of-PDF
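
One simple way to check the version is to read the `%PDF-x.y` header near the start of the file. The sketch below is an illustration only, using nothing beyond the Java standard library; the class name `PdfHeaderVersion` is hypothetical. It reads the header alone, so it will not notice a newer version declared in the document catalog's `/Version` entry, which a PDF 1.4 or later file may use to override the header.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reports the version named in a PDF file's header line.
public class PdfHeaderVersion {
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d\\.\\d)");

    /** Returns the version from the %PDF-x.y header, or null if none is found. */
    public static String readHeaderVersion(Path pdf) throws IOException {
        // The header is normally at byte 0; scan the first 1024 bytes to
        // tolerate files with a few leading junk bytes.
        byte[] buf = new byte[1024];
        int total = 0;
        try (InputStream in = Files.newInputStream(pdf)) {
            int n;
            while (total < buf.length
                    && (n = in.read(buf, total, buf.length - total)) > 0) {
                total += n;
            }
        }
        String head = new String(buf, 0, total, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws IOException {
        String version = readHeaderVersion(Paths.get(args[0]));
        System.out.println(version == null ? "No PDF header found" : "PDF version " + version);
    }
}
```

Run it with the path of the PDF as the only argument, for example on both ApacheLicencePDFA.pdf and ApacheLicencePDFA_FromODT.pdf to compare the versions produced by the two converters.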
     627
     628
     629 3. Open-source command-line PDF to HTML conversion
     630https://stackoverflow.com/questions/8370014/how-to-convert-pdf-to-html
     631
     632"Download
     633
     634    pdfbox-2.0.3.jar
     635    fontbox-2.0.3.jar
     636    preflight-2.0.3.jar
     637    xmpbox-2.0.3.jar
     638    pdfbox-tools-2.0.3.jar
     639    pdfbox-debugger-2.0.3.jar
     640
     641from http://pdfbox.apache.org/
     642...
     643
     644PLEASE NOTE: Images do not get pushed to the HTML output."
     645
     646
     647 4. Need a way to check whether a PDF contains images: if it does, use pdf2dom; otherwise use the basic pdfbox conversion to HTML (fewer div tags with inline style markup)?
     648https://stackoverflow.com/questions/46215879/count-images-in-pdf-using-pdfbox
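
The choice described in point 4 could be sketched as below. This is a crude heuristic under stated assumptions, not the PDFBox-based counting discussed at the link above: it scans the raw file bytes for image XObject dictionaries (`/Subtype /Image`), so it misses inline images inside compressed content streams and may count image objects that no page uses. The class name `PdfImageHeuristic` and the returned labels "pdf2dom" and "pdfbox" are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: guesses whether a PDF has images and picks a converter.
public class PdfImageHeuristic {
    // An image XObject dictionary contains "/Subtype /Image"; whitespace
    // between the two names is optional in PDF syntax.
    private static final Pattern IMAGE_XOBJECT =
            Pattern.compile("/Subtype\\s*/Image(?![A-Za-z0-9])");

    /** Counts image XObject markers in the raw bytes of the file. */
    public static int countImageMarkers(Path pdf) throws IOException {
        // ISO-8859-1 maps every byte to one char, so binary data is preserved.
        String raw = new String(Files.readAllBytes(pdf), StandardCharsets.ISO_8859_1);
        Matcher m = IMAGE_XOBJECT.matcher(raw);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    /** Suggests "pdf2dom" when image markers are present, else "pdfbox". */
    public static String chooseConverter(Path pdf) throws IOException {
        return countImageMarkers(pdf) > 0 ? "pdf2dom" : "pdfbox";
    }

    public static void main(String[] args) throws IOException {
        Path pdf = Paths.get(args[0]);
        System.out.println(countImageMarkers(pdf) + " image marker(s); suggested converter: "
                + chooseConverter(pdf));
    }
}
```

The whole file is read into memory, which is acceptable for a quick check but would need streaming for very large PDFs.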
     649
     650
     651UNUSED
     652Googled for: java tool convert pdf version
     653* https://stackoverflow.com/questions/11137912/all-inclusive-tool-to-convert-different-types-of-documents-to-pdf
     654* https://www.qoppa.com/pdfprocess/
     655jPDFProcess – Java PDF Library to Create, Manipulate PDF
     656(appears to be payware)
     657* https://www.gnostice.com/nl_article.asp?id=95&t=How_to_Change_the_PDF_Version_of_a_Document
     658How to Convert a PDF Document to an Older or Newer Version
     659uses .NET
     660* http://www.baeldung.com/pdf-conversions-java
     661PDF Conversions in Java
     662e.g. PDF to html and html to PDF
     663
     664
     665__________________________________________________________
     666
     667greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
     668[main] INFO org.reflections.Reflections - Reflections took 153 ms to scan 1 urls, producing 36 keys and 222 values
     669[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
     670[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
     671[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
     672[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
     673
     674
     675
     676greenstone@machine-name:~/Downloads$ java -classpath Pdf2Dom/target/pdf2dom-1.8-SNAPSHOT.jar:pdfbox-app.jar:slf4j-jdk14-1.6.6.jar:log4j-over-slf4j-1.6.6.jar:slf4j-api-1.6.6.jar  org.fit.pdfdom.PDFToHTML SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
     677Exception in thread "main" java.lang.NoClassDefFoundError: org/mabb/fontverter/FontVerter
     678    at org.fit.pdfdom.FontTable$Entry.loadTrueTypeFont(FontTable.java:178)
     679    at org.fit.pdfdom.FontTable$Entry.getData(FontTable.java:147)
     680    at org.fit.pdfdom.FontTable$Entry.isEntryValid(FontTable.java:161)
     681    at org.fit.pdfdom.FontTable.addEntry(FontTable.java:48)
     682    at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:378)
     683    at org.fit.pdfdom.PDFBoxTree.updateFontTable(PDFBoxTree.java:361)
     684    at org.fit.pdfdom.PDFDomTree.updateFontTable(PDFDomTree.java:544)
     685    at org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:206)
     686    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
     687    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
     688    at org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218)
     689    at org.fit.pdfdom.PDFDomTree.writeText(PDFDomTree.java:194)
     690    at org.fit.pdfdom.PDFToHTML.main(PDFToHTML.java:77)
     691Caused by: java.lang.ClassNotFoundException: org.mabb.fontverter.FontVerter
     692    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
     693    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
     694    at java.security.AccessController.doPrivileged(Native Method)
     695    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
     696    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
     697    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
     698    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
     699    ... 13 more
     700greenstone@machine-name:~/Downloads$
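
The NoClassDefFoundError in the trace above means that org.mabb.fontverter.FontVerter was not on the classpath given to this run, so the jar containing that class would need to be added to it. Before rerunning the conversion, a classpath can be checked with a small probe like the one below. It is a minimal sketch using only the Java standard library; the class name `ClasspathCheck` is hypothetical, and the two default class names are taken from the stack trace.

```java
// Hypothetical helper: reports whether named classes can be found on the classpath.
public class ClasspathCheck {
    /** Returns true if the class can be located without initialising it. */
    public static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults are the entry point and the missing class from the trace above.
        String[] names = args.length > 0 ? args : new String[] {
            "org.fit.pdfdom.PDFToHTML",
            "org.mabb.fontverter.FontVerter"
        };
        for (String name : names) {
            System.out.println((isLoadable(name) ? "found    " : "MISSING  ") + name);
        }
    }
}
```

Run it with the same -classpath value as the failing command, plus the directory holding the compiled ClasspathCheck class, and each class is reported as found or MISSING.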