Changeset 32279
Timestamp: 2018-07-17T19:55:47+12:00
Files: 1 edited
gs2-extensions/pdf-box/trunk/GS_PDFBox_README.txt
r32278  r32279
    24      24    - grab the pdfbox 2.09 source code
    25      25    - introduce the new class PDFBoxToImagesAndText.java (originally added to the pdfbox app 2.09 src locally, now maintained separately with a GS Java package name). The new class is based on Apache PDFBox's PDFToImage.java, with added code to extract text based on Apache PDFBox's ExtractText.java
            26  + Since 17 July 2018, this class also recognises 2 additional flags: -textOnly and -imagesOnly to support the new paged_text and the original pagedimg_<imgext> output formats, besides the recently introduced pagedimgtxt_<imgext> output format that outputs images and text for each page.
    26      27    - OLD: rebuild the source, a maven project. Then modify the pdfbox app 2.09 (non source) ext jar file by including the new .class file
    27      28    - compiling the new java class against the pdfbox-app.jar
     …       …
    59      60
    60      61    To run, that build folder needs to be on the classpath, besides pdfbox-app.jar itself. See PDFBoxConverter.pm
            62  +
            63  + Example of a run command, where -textOnly is thrown in to generate paged_text (no images). Leave out -textOnly if an image should still be generated for each page, besides the page's text:
            64  +
            65  + java -cp "GS3/gs2build/ext/pdf-box/lib/java/pdfbox-app.jar:GS3/gs2build/ext/pdf-box/build" org.greenstone.pdfbox.PDFBoxToImagesAndText -textOnly -outputPrefix "GS3/gs2build/tmp/F228.txt/ApacheLicencePDFA" "GS3/web/sites/localsite/collect/pdfv2/import/ApacheLicencePDFA.pdf"
            66  +
    61      67
    62      68    4. For convenience PDFBoxToImagesAndText.java further no longer generates <PDF filename prefix><sequential number>.img-ext (jpg/gif/png) and matchingly named txt file, but just <sequential number>.img-ext and <sequential number>.txt.
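The relationship between the three output formats and the two new flags can be sketched as a small helper that assembles the run command. This is only an illustration: the class PdfBoxCommandSketch and its methods are hypothetical and are not part of the changeset or of PDFBoxConverter.pm. The mapping of paged_text to -textOnly follows the example run command in the README; the mapping of pagedimg_<imgext> to -imagesOnly, and of pagedimgtxt_<imgext> to neither flag, is an assumption read off the README wording. Any option for selecting the image type is left out because the README does not show one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
public class PdfBoxCommandSketch {

    // Returns the flag assumed to correspond to a Greenstone output format,
    // or an empty string when both images and text should be produced.
    public static String flagFor(String outputFormat) {
        if (outputFormat.equals("paged_text")) {
            return "-textOnly";      // text per page, no images (as in the README example)
        }
        if (outputFormat.startsWith("pagedimgtxt_")) {
            return "";               // assumed: no flag means images and text
        }
        if (outputFormat.startsWith("pagedimg_")) {
            return "-imagesOnly";    // assumed: images per page, no text
        }
        throw new IllegalArgumentException("Unknown output format: " + outputFormat);
    }

    // Builds the argument list in the same order as the README's example command.
    public static List<String> buildCommand(String classpath, String outputFormat,
                                            String outputPrefix, String pdfFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(classpath);
        cmd.add("org.greenstone.pdfbox.PDFBoxToImagesAndText");
        String flag = flagFor(outputFormat);
        if (!flag.isEmpty()) {
            cmd.add(flag);
        }
        cmd.add("-outputPrefix");
        cmd.add(outputPrefix);
        cmd.add(pdfFile);
        return cmd;
    }

    public static void main(String[] args) {
        // Paths copied from the README's example run command.
        List<String> cmd = buildCommand(
            "GS3/gs2build/ext/pdf-box/lib/java/pdfbox-app.jar:GS3/gs2build/ext/pdf-box/build",
            "paged_text",
            "GS3/gs2build/tmp/F228.txt/ApacheLicencePDFA",
            "GS3/web/sites/localsite/collect/pdfv2/import/ApacheLicencePDFA.pdf");
        System.out.println(String.join(" ", cmd));
    }
}
```

The argument list is kept as a List<String> rather than a single string so that it could be handed to a process launcher without extra quoting of paths. The classpath uses the ":" separator from the README example, which is the Unix convention.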