Changeset 32279 for gs2-extensions

Timestamp: 17.07.2018 19:55:47
Author: ak19
Message: Adding details on the updates to the pdfbox extension's GS-README

Files:
1 modified

  • gs2-extensions/pdf-box/trunk/GS_PDFBox_README.txt

    r32278 r32279
    @@ -24,4 +24,5 @@
     - grab the pdfbox 2.09 source code
     - introduce the new class PDFBoxToImagesAndText.java (originally added to the pdfbox app 2.09 src locally, now maintained separately with a GS Java package name). The new class is based on Apache PDFBox's PDFToImage.java, with added code to extract text based on Apache PDFBox's ExtractText.java
    +Since 17 July 2018, this class also recognises 2 additional flags: -textOnly and -imagesOnly to support the new paged_text and the original pagedimg_<imgext> output formats, besides the recently introduced pagedimgtxt_<imgext> output format that outputs images and text for each page.
     - OLD: rebuild the source, a maven project. Then modify the pdfbox app 2.09 (non source) ext jar file by including the new .class file
     - compiling the new java class against the pdfbox-app.jar
    @@ -59,4 +60,9 @@

     To run, that build folder needs to be on the classpath, besides pdfbox-app.jar itself. See PDFBoxConverter.pm
    +
    +Example of a run command, where -textOnly is thrown in to generate paged_text (no images). Leave out -textOnly if an image should still be generated for each page, besides the page's text:
    +
    +    java -cp "GS3/gs2build/ext/pdf-box/lib/java/pdfbox-app.jar:GS3/gs2build/ext/pdf-box/build" org.greenstone.pdfbox.PDFBoxToImagesAndText -textOnly -outputPrefix "GS3/gs2build/tmp/F228.txt/ApacheLicencePDFA" "GS3/web/sites/localsite/collect/pdfv2/import/ApacheLicencePDFA.pdf"
    +

     4. For convenience PDFBoxToImagesAndText.java further no longer generates <PDF filename prefix><sequential number>.img-ext (jpg/gif/png) and matchingly named txt file, but just <sequential number>.img-ext and <sequential number>.txt.
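
The relationship between the two new flags, the three output formats and the per-page file names described in the README text above can be sketched roughly as follows. This is only an illustration, not the code of PDFBoxToImagesAndText.java: the class OutputModeSketch and its methods are hypothetical names, and rejecting the combination of -textOnly and -imagesOnly is an assumption, since the README does not say how the real class handles that case.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the actual PDFBoxToImagesAndText implementation. */
final class OutputModeSketch {

    /** Maps the two flags to the output format name used in the README. */
    static String outputFormat(boolean textOnly, boolean imagesOnly, String imgExt) {
        if (textOnly && imagesOnly) {
            // Assumption: the README does not describe this combination,
            // so this sketch simply treats the flags as mutually exclusive.
            throw new IllegalArgumentException(
                "-textOnly and -imagesOnly are treated as exclusive in this sketch");
        }
        if (textOnly) {
            return "paged_text";            // text only, no images
        }
        if (imagesOnly) {
            return "pagedimg_" + imgExt;    // images only, no text
        }
        return "pagedimgtxt_" + imgExt;     // an image and text for each page
    }

    /** Lists the files expected for one page: <sequential number>.<img-ext> and/or <sequential number>.txt. */
    static List<String> pageFiles(int pageNumber, boolean textOnly, boolean imagesOnly, String imgExt) {
        outputFormat(textOnly, imagesOnly, imgExt); // reuse the flag check above
        List<String> files = new ArrayList<>();
        if (!textOnly) {
            files.add(pageNumber + "." + imgExt);
        }
        if (!imagesOnly) {
            files.add(pageNumber + ".txt");
        }
        return files;
    }

    public static void main(String[] args) {
        System.out.println(outputFormat(true, false, "png"));
        System.out.println(pageFiles(1, false, false, "png"));
    }
}
```

With neither flag set, a page yields both an image file and a text file. The page number is passed in by the caller because the README does not state where the sequential numbering starts.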