Changeset 32197 for gs2-extensions/pdf-box/trunk/GS_PDFBox_README.txt
Timestamp: 2018-06-11T17:54:08+12:00
File: 1 edited
gs2-extensions/pdf-box/trunk/GS_PDFBox_README.txt
r32194  r32197
    23      23  - check the new version didn't break the stuff for which we specifically settled on version 1.8.2 before
    24      24  - grab the pdfbox 2.09 source code
    25          - introduce the new class GS_PDFToImagesAndText.java to the pdfbox app 2.09 src, based on its PDFToImage.java with added code to extract text based onExtractText.java
    26          - rebuild the source, a maven project
    27          - modify the pdfbox app 2.09 (non source) ext jar file by including the new .class file
    28          - Now we're possibl egenerating a txtfile for each img file (assuming there was extractable txt), some modifications needed to be made to the generated .item file to ensure these referred to the txt file.
            25  - introduce the new class PDFBoxToImagesAndText.java (originally added to the pdfbox app 2.09 src locally, now maintained separately with a GS Java package name). The new class is based on Apache PDFBox's PDFToImage.java, with added code to extract text based on Apache PDFBox's ExtractText.java
            26  - OLD: rebuild the source, a maven project. Then modify the pdfbox app 2.09 (non source) ext jar file by including the new .class file
            27  - compiling the new java class against the pdfbox-app.jar
            28  - Now we're possibly generating a txtfile for each img file (assuming there was extractable txt), some modifications needed to be made to the generated .item file to ensure these referred to the txt file.
    29      29  - work that remains to be done
    30      30
    31      31
    32          0. Used to use pdfbox app version 1.8.2, see commit message at http://trac.greenstone.org/changeset/29029/gs2-extensions/pdf-box/trunk/java/lib/java/pdfbox-app.jar
            32  1. Used to use pdfbox app version 1.8.2, see commit message at http://trac.greenstone.org/changeset/29029/gs2-extensions/pdf-box/trunk/java/lib/java/pdfbox-app.jar
    33      33
    34          1. Upgrading to pdfbox app 2.0.9 (downloaded from https://pdfbox.apache.org/download.cgi?Preferred=http%3A%2F%2Fwww-us.apache.org%2Fdist%2F):
            34  2. Upgrading to pdfbox app 2.0.9 (downloaded from https://pdfbox.apache.org/download.cgi?Preferred=http%3A%2F%2Fwww-us.apache.org%2Fdist%2F):
    35      35
    36      36  Doesn't work out of the box, such as to extract text from a PDF file.
     …       …
    50      50  if ($best_encoding eq "utf_8" || $best_encoding =~ /utf_8/) { $best_encoding = "utf8" }
    51      51
            52  3. To recompile the new PDFBoxToImagesAndText.java code:
    52      53
    53          2. Need to edit org.apache.pdfbox.tools.PDFToImage (not ExtractImages) as that's the tool that converts a PDF's individual pages to images
            54  - grab the svn version of the Greenstone pdfbox extension
            55  - Then from the svn checked out pdfbox folder, run
            56  $pdfbox > javac -cp /path/to/pdfbox/java/lib/java/pdfbox-app.jar -d /path/to/pdfbox/java/build java/src/org/greenstone/pdfbox/PDFBoxToImagesAndText.java
            57  which will compile against the pdfbox-app.jar in the classpath (-cp_ and output the .class file into the directory denoted by -d
            58
            59  To run, that build folder needs to be on the classpath, besides pdfbox-app.jar itself. See PDFBoxConverter.pm
            60
            61  4. For convenience PDFBoxToImagesAndText.java further no longer generates <PDF filename prefix><sequential number>.img-ext (jpg/gif/png) and matchingly named txt file, but just <sequential number>.img-ext and <sequential number>.txt.
            62
            63  Further, needed to modify GS perllib's util::create_itemfile(), so that it no longer leaves each txtfile="" property empty in the item file if we've generated matching text files for the image files. In the past, util::create_itemfile() only worked with images, but now it's modified to additionally deal with matching text files.
            64
            65
            66  5. No need to rename the pdf-box, its containing tarball and its jar to gs-pdf-box. Doing that may have required changes to get nightly builds, releases, diffcol to get them to still work.
            67
            68  However, Dr Bainbridge explained that the solution is to not rename pdfbox-app.jar to gs-pdfbox-app.jar, but have our new custom java file living separately rather than within the pdfbox-app.jar though compiled against it. The java file will have its own greenstone Java package (org.greenstone.pdfbox) and the class file produced from our custom class will then be run from its package by PDFBoxConverter.pm
            69
            70
            71  6. I was not sure what the licensing information for pdfbox/java/src/PDFBoxToImagesAndText.java should be in its preamble comment section, since most of the code is a copy or slight modification of the Apache PDFBox app's code. I asked Dr Nichols about the licensing, and he pointed out that Apache's PDFBox pages referred to https://www.apache.org/licenses/LICENSE-2.0
            72  Dr Nichols drew attention to its section "Redistribution", saying that that should cover things. I've tried to follow the points there, and have therefore included apache's LICENSE.txt and NOTICE.txt files into our pdfbox/java/src and pdfbox/java/src build folders where PDFBoxToImagesAndText.java and PDFBoxToImagesAndText.class, respectively, will live.
            73
            74  ------------------
            75
            76
            77  EXTRA INFORMATION
            78
            79  Older but useful information, as it covers how to use maven to build the pdfbox-app, and include a new java file to be built into the pdfbox-app and distribute the file within the app.
            80
            81  1. Originally I thought I would need to edit org.apache.pdfbox.tools.PDFToImage (not ExtractImages), as that's the tool that converts a PDF's individual pages to images. Eventually, since there was a new class PDFBoxToImagesAndText.java, Dr Bainbridge explained that this could just be compiled against an exiting pdfbox-app.jar, rather than including it inside the jar and recompiling the pdfbox-app in entirety.
            82
            83  When I was still compiling the pdfbox-app in entirety, with and without my new code included therein, to obtain the basecode and build the app, the steps were:
    54      84
    55      85
     …       …
    70     100
    71     101  - The source code to edit is at /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/src/main/java/org/apache/pdfbox/tools
    72          Create GS_PDFToImagesAndText.java based on PDFToImage.java into which to incorporate from ExtractText.java
           102  Create PDFBoxToImagesAndText.java based on PDFToImage.java into which to incorporate from ExtractText.java
    73     103
    74     104  - When built, will get /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/target/classes/org/apache/pdfbox/tools
     …       …
    76     106
    77     107
    78          3. Once the pdfbox-app has been entirely built the first time, want smaller rebuilds.
           108  2. Once the pdfbox-app has been entirely built the first time, want smaller rebuilds.
    79     109
    80     110  Final section of output of full rebuild:
     …       …
   118     148  greenstone@bedrock:~/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9$ mvn clean install -pl tools -am
   119     149
   120          (This will also compile our new GS_PDFToImagesAndText.java and put its class file GS_PDFToImagesAndText.class into the target dir /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/target/classes/org/apache/pdfbox/tools)
   121
   122
   123          4. For convenience GS_PDFToImagesAndText.java further no longer generates <PDF filename prefix><sequential number>.img-ext (jpg/gif/png) and matchingly named txt file, but just <sequential number>.img-ext and <sequential number>.txt.
   124
   125          Further, needed to modify GS perllib's util::create_itemfile(), so that it no longer leaves each txtfile="" property empty in the item file if we've generated matching text files for the image files. In the past, util::create_itemfile() only worked with images, but now it's modified to additionally deal with matching text files.
   126
   127
   128          5. Remaining Issues.
   129
   130          a. Build messages (when building lower level PDF conversion commands):
   131          ImageConverter: "/home/greenstone/gs3-svn-26Mar2018/gs2build/tmp/F143/A9-access-best-practices1.txt" too small, skipping
   132          Same for: A9-access-best-practices56.txt
   133
   134
   135          Lower level PDF conversion commands:
   136          calling cmd "/usr/bin/perl" -S gsConvert.pl -verbose 2 -pdf_zoom 2 -errlog "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/tmp/1528179077/err.log" -output pagedimg_jpg "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/tmp/1528179077/A9-access-best-practices.pdf"
   137          import.pl>
   138
   139
   140          "/usr/bin/perl" -S gsConvert.pl -verbose 2 -pdf_zoom 2 -errlog "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/tmp/1528179077/err.log" -output pagedimg_jpg "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.pdf"
   141
   142
   143          "/usr/bin/perl" -S pdfpstoimg.pl -convert_to jpg "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.pdf" "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices" > "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.out" 2> "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.err"
   144
   145
   146          @@@@ pdfpstoimg.pl::pdfps_to_img(): cmd = "/usr/bin/perl" -S gs-magick.pl convert "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.pdf" "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices/A9-access-best-practices.jpg"
   147
   148
   149          b. Have not yet renamed the pdf-box, its containing tarball and its jar to gs-pdf-box yet, since a lot of things foreseen (nightly builds, releases, diffcol) and unforeseen can break due to the name change.
   150
   151          c. Not sure what the licensing information for pdfbox/java/src/GS_PDFToImagesAndText.java should be in its preamble comment section, since most of the code is a copy or slight modification of the Apache PDFBox app's code.
           150  (This will also compile our new PDFBoxToImagesAndText.java and put its class file PDFBoxToImagesAndText.class into the target dir /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/target/classes/org/apache/pdfbox/tools)
   152     151
   153     152  __________________________________________________________
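The compile and run steps in item 3 of the r32197 README text can be sketched as shell functions, roughly as follows. This is only an illustration under stated assumptions: the PDFBOX_HOME variable and the three function names are hypothetical, the ':' classpath separator assumes a Unix shell, and the arguments that PDFBoxToImagesAndText expects are not given in the README, so they are passed through unchanged. PDFBoxConverter.pm remains the reference for how the class is really invoked.

```shell
#!/bin/sh
# Hypothetical helpers: PDFBOX_HOME and all function names below are
# illustrative assumptions, not part of the Greenstone pdfbox extension.

# Root of the svn checked out pdfbox extension folder.
PDFBOX_HOME="${PDFBOX_HOME:-/path/to/pdfbox}"

# Print the run-time classpath: the build folder holding the compiled
# class, then pdfbox-app.jar itself (':' separator assumes Unix).
gs_pdfbox_classpath() {
    printf '%s:%s\n' "$PDFBOX_HOME/java/build" "$PDFBOX_HOME/java/lib/java/pdfbox-app.jar"
}

# Compile the custom class against pdfbox-app.jar (-cp) and write the
# .class file under java/build (-d), mirroring the README's javac command.
gs_pdfbox_compile() {
    mkdir -p "$PDFBOX_HOME/java/build" &&
    ( cd "$PDFBOX_HOME" &&
      javac -cp "$PDFBOX_HOME/java/lib/java/pdfbox-app.jar" \
            -d "$PDFBOX_HOME/java/build" \
            java/src/org/greenstone/pdfbox/PDFBoxToImagesAndText.java )
}

# Run the class from its org.greenstone.pdfbox package. Arguments are
# passed through untouched because the README does not document them.
gs_pdfbox_run() {
    java -cp "$(gs_pdfbox_classpath)" org.greenstone.pdfbox.PDFBoxToImagesAndText "$@"
}
```

Keeping the build folder ahead of the jar on the classpath reflects the point made in item 5: the custom class lives outside pdfbox-app.jar, in its own package, and is only compiled against the jar.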