Changeset 32279


Timestamp:
2018-07-17T19:55:47+12:00
Author:
ak19
Message:

Adding details on the updates to the pdfbox extension's GS-README

File:
1 edited

  • gs2-extensions/pdf-box/trunk/GS_PDFBox_README.txt

    r32278 r32279  
    24 24  - grab the pdfbox 2.09 source code
    25 25  - introduce the new class PDFBoxToImagesAndText.java (originally added to the pdfbox app 2.09 src locally, now maintained separately with a GS Java package name). The new class is based on Apache PDFBox's PDFToImage.java, with added code to extract text based on Apache PDFBox's ExtractText.java
       26  Since 17 July 2018, this class also recognises 2 additional flags: -textOnly and -imagesOnly to support the new paged_text and the original pagedimg_<imgext> output formats, besides the recently introduced pagedimgtxt_<imgext> output format that outputs images and text for each page.
    26 27  - OLD: rebuild the source, a maven project. Then modify the pdfbox app 2.09 (non source) ext jar file by including the new .class file
    27 28  - compiling the new java class against the pdfbox-app.jar
     
    59 60
    60 61  To run, that build folder needs to be on the classpath, besides pdfbox-app.jar itself. See PDFBoxConverter.pm
       62
       63  Example of a run command, where -textOnly is thrown in to generate paged_text (no images). Leave out -textOnly if an image should still be generated for each page, besides the page's text:
       64
       65      java -cp "GS3/gs2build/ext/pdf-box/lib/java/pdfbox-app.jar:GS3/gs2build/ext/pdf-box/build" org.greenstone.pdfbox.PDFBoxToImagesAndText -textOnly -outputPrefix "GS3/gs2build/tmp/F228.txt/ApacheLicencePDFA" "GS3/web/sites/localsite/collect/pdfv2/import/ApacheLicencePDFA.pdf"
       66
    61 67
    62 68  4. For convenience PDFBoxToImagesAndText.java further no longer generates <PDF filename prefix><sequential number>.img-ext (jpg/gif/png) and matchingly named txt file, but just <sequential number>.img-ext and <sequential number>.txt.
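The README text in this changeset ties each output format to a flag: -textOnly for paged_text, -imagesOnly for pagedimg_<imgext>, and no flag for pagedimgtxt_<imgext>. A minimal shell sketch of that mapping follows. It only prints the command and does not run it. The helper name pdfbox_cmd and the GS3 variable are hypothetical, and it assumes -imagesOnly takes the same position on the command line as -textOnly does in the README's example.

```shell
#!/bin/sh
# Sketch only: prints the java command for one output format; it does not run it.
# GS3 is an assumed variable holding the path to the Greenstone 3 installation.
GS3="${GS3:-GS3}"

# pdfbox_cmd FORMAT OUTPUT_PREFIX PDF_FILE   (hypothetical helper, not part of Greenstone)
pdfbox_cmd() {
    case "$1" in
        paged_text)  flag="-textOnly" ;;    # text only, no images
        pagedimg)    flag="-imagesOnly" ;;  # images only, no text
        pagedimgtxt) flag="" ;;             # no flag: an image and text for each page
        *) echo "unknown output format: $1" >&2; return 1 ;;
    esac
    # Both the pdfbox-app.jar and the build folder must be on the classpath.
    cp="$GS3/gs2build/ext/pdf-box/lib/java/pdfbox-app.jar:$GS3/gs2build/ext/pdf-box/build"
    # ${flag:+...} emits the flag plus a trailing space only when a flag is set.
    printf 'java -cp "%s" org.greenstone.pdfbox.PDFBoxToImagesAndText %s-outputPrefix "%s" "%s"\n' \
        "$cp" "${flag:+$flag }" "$2" "$3"
}
```

For example, `pdfbox_cmd paged_text "GS3/gs2build/tmp/F228.txt/ApacheLicencePDFA" "GS3/web/sites/localsite/collect/pdfv2/import/ApacheLicencePDFA.pdf"` prints the same command as the README's example when GS3 is left at its default.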