Changeset 32250 for gs2-extensions/xpdf-tools
Timestamp: 2018-07-09T21:46:18+12:00
Location: gs2-extensions/xpdf-tools/trunk/src
Files: 1 deleted, 1 edited
gs2-extensions/xpdf-tools/trunk/src/packages/GS-README.txt
--- GS-README.txt (r32249)
+++ GS-README.txt (r32250)
@@ -14,4 +14,7 @@
 G. LIBJPEG and LIBTIFF
 - Issues building LIBJPEG on 64 bit machines and the patch
+
+H. PDF2DOM
+unused, replaced by Xpdf-Tools' more suited pdftohtml capabilities
 
 __________________________________________________________
@@ -574,2 +577,124 @@
 
 
+__________________________________________________________
+H. PDF2DOM: tried it out, but wasn't what we wanted
+__________________________________________________________
+Using PDFBox to convert a PDF to full HTML, both images and text and placed correctly with respect to each other, is tricky, see https://stackoverflow.com/questions/9671239/pdfbox-convert-a-pdf-to-text-or-html-including-images-from-the-pdf
+(Google: pdfbox to convert pdf to html with images)
+
+PDF2DOM tool (based on PDFBox) to convert PDF to HTML with images
+* http://cssbox.sourceforge.net/pdf2dom/documentation.php
+* Got the command line jar tool, PDFToHTML.jar version 1.7, from https://sourceforge.net/projects/cssbox/files/Pdf2DOM/
+* Further information and source code at https://github.com/radkovo/Pdf2Dom
+* API: http://cssbox.sourceforge.net/pdf2dom/api/index.html
+
+
+1. Running
+
+java -jar PDFToHTML.jar <infile> [<outfile>]
+
+greenstone@machine-name:~/Downloads$ java -jar PDFToHTML.jar SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
+
+
+It will output the page, but you'll see the following output indicating that the logger is not displaying anything:
+SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
+SLF4J: Defaulting to no-operation (NOP) logger implementation
+SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
+
+See https://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder
+
+To see error output download SLF4J simple jar, run as follows:
+
+greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
+
+The above is a MS Word produced PDF (archive format) and works fine: font folder generated containing the extracted fonts
+
+The following is a PDF produced from the same doc file by the latest libreoffice installed on Windows:
+ApacheLicencePDFA_FromODT.pdf
+But running the same command on it produces the following font errors:
+
+greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML ApacheLicencePDFA_FromODT.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
+[main] INFO org.reflections.Reflections - Reflections took 163 ms to scan 1 urls, producing 36 keys and 222 values
+[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
+[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
+[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
+[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
+
+Fonts get extracted if the source PDF was generated by MS Word's doc to PDF conversion. Fonts didn't get extracted from PDF upon conversion to HTML when libreoffice was used to convert a .doc to the source PDF.
+
+2. Check version of PDF
+https://www.codeproject.com/Questions/167550/How-to-check-different-versions-of-PDF
+
+
+3. pdf to html command line conversion open source
+https://stackoverflow.com/questions/8370014/how-to-convert-pdf-to-html
+
+"Download
+
+pdfbox-2.0.3.jar
+fontbox-2.0.3.jar
+preflight-2.0.3.jar
+xmpbox-2.0.3.jar
+pdfbox-tools-2.0.3.jar
+pdfbox-debugger-2.0.3.jar
+
+from http://pdfbox.apache.org/
+...
+
+PLEASE NOTE: Images do not get pushed to the HTML output."
+
+
+4. Need a way to check if PDF contains images, then use pdf2dom, else basic pdfbox conversion to html (less div tags with inline style markup)?
+https://stackoverflow.com/questions/46215879/count-images-in-pdf-using-pdfbox
+
+
+UNUSED
+Googled for: java tool convert pdf version
+* https://stackoverflow.com/questions/11137912/all-inclusive-tool-to-convert-different-types-of-documents-to-pdf
+* https://www.qoppa.com/pdfprocess/
+  jPDFProcess - Java PDF Library to Create, Manipulate PDF
+  (appears to be payware)
+* https://www.gnostice.com/nl_article.asp?id=95&t=How_to_Change_the_PDF_Version_of_a_Document
+  How to Convert a PDF Document to an Older or Newer Version
+  uses .NET
+* http://www.baeldung.com/pdf-conversions-java
+  PDF Conversions in Java
+  e.g. PDF to html and html to PDF
+
+
+__________________________________________________________
+
+greenstone@machine-name:~/Downloads$ java -classpath slf4j-simple-1.7.25.jar:PDFToHTML.jar org.fit.pdfdom.PDFToHTML SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
+[main] INFO org.reflections.Reflections - Reflections took 153 ms to scan 1 urls, producing 36 keys and 222 values
+[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
+[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
+[main] WARN org.fit.pdfdom.FontTable - Error loading font 'BAAAAA+Georgia' Message: FontVerter could not detect the input font's type. class java.io.IOException
+[main] WARN org.fit.pdfdom.FontTable - Error loading font 'CAAAAA+Georgia-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException
+
+
+
+greenstone@machine-name:~/Downloads$ java -classpath Pdf2Dom/target/pdf2dom-1.8-SNAPSHOT.jar:pdfbox-app.jar:slf4j-jdk14-1.6.6.jar:log4j-over-slf4j-1.6.6.jar:slf4j-api-1.6.6.jar org.fit.pdfdom.PDFToHTML SampleDoc1.pdf -im=SAVE_TO_DIR -idir=/home/greenstone/Downloads/tmp1 -fm=SAVE_TO_DIR -fdir=/home/greenstone/Downloads/tmp2
+Exception in thread "main" java.lang.NoClassDefFoundError: org/mabb/fontverter/FontVerter
+    at org.fit.pdfdom.FontTable$Entry.loadTrueTypeFont(FontTable.java:178)
+    at org.fit.pdfdom.FontTable$Entry.getData(FontTable.java:147)
+    at org.fit.pdfdom.FontTable$Entry.isEntryValid(FontTable.java:161)
+    at org.fit.pdfdom.FontTable.addEntry(FontTable.java:48)
+    at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:378)
+    at org.fit.pdfdom.PDFBoxTree.updateFontTable(PDFBoxTree.java:361)
+    at org.fit.pdfdom.PDFDomTree.updateFontTable(PDFDomTree.java:544)
+    at org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:206)
+    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
+    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
+    at org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218)
+    at org.fit.pdfdom.PDFDomTree.writeText(PDFDomTree.java:194)
+    at org.fit.pdfdom.PDFToHTML.main(PDFToHTML.java:77)
+Caused by: java.lang.ClassNotFoundException: org.mabb.fontverter.FontVerter
+    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
+    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
+    at java.security.AccessController.doPrivileged(Native Method)
+    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
+    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
+    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
+    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
+    ... 13 more
+greenstone@machine-name:~/Downloads$
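
Item 2 of the README notes in this changeset ("Check version of PDF") only links to an external question. As a rough illustration of one way to do that check, the sketch below reads the `%PDF-x.y` marker from the start of a file. It is not part of the changeset: the function names `pdf_header_version` and `pdf_file_version` are hypothetical, and the 1024-byte search window is an assumption based on the leniency of some PDF readers. It also reports only the header version; a `/Version` entry in the document catalog can override the header, and this sketch does not look for one.

```python
import re

# Matches the "%PDF-x.y" marker that opens a PDF file.
_HEADER = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_header_version(data):
    """Return (major, minor) from the header found in `data`, or None.

    `data` is expected to hold the first bytes of a PDF file.
    """
    # Assumption: the marker may be preceded by a few stray bytes,
    # so search the first 1024 bytes rather than only offset 0.
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pdf_file_version(path):
    """Read only the start of the file at `path` and report its header version."""
    with open(path, "rb") as handle:
        return pdf_header_version(handle.read(1024))
```

Calling `pdf_file_version("ApacheLicencePDFA.pdf")` on the two test PDFs mentioned in the notes would show whether MS Word and LibreOffice wrote different header versions, which may or may not relate to the font extraction differences observed.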