Timestamp:
2018-06-11T17:54:08+12:00
Author:
ak19
Message:

Updates to the recent commit's pdfbox modifications: the new class has been renamed from GS_PDFToImagesAndText.java to org/greenstone/pdfbox/PDFBoxToImagesAndText.java and now uses a GS package. The class file is no longer included in pdfbox-app.jar; it is just compiled against that jar. Added Apache v2.0 licensing-related files. PDFBoxConverter.pm now refers to the newly named Java class with the new org.greenstone.pdfbox package name. Updated the Readme with instructions for compiling the new java file in its new folder/package structure, and with information related to the Apache license. There's also the new java/build subfolder containing the precompiled class file (and Java pkg structure) for the new class. This new build folder with the new custom class, the modified PDFBoxConverter.pm and the modified pdfbox-app.jar (without the custom class) are modifications to the pdfbox tarball/zip files too.

File:
1 edited

  • gs2-extensions/pdf-box/trunk/GS_PDFBox_README.txt

    r32194 r32197  
    2323- check the new version didn't break the stuff for which we specifically settled on version 1.8.2 before
    2424- grab the pdfbox 2.09 source code
    25 - introduce the new class GS_PDFToImagesAndText.java to the pdfbox app 2.09 src, based on its PDFToImage.java with added code to extract text based on ExtractText.java
    26 - rebuild the source, a maven project
    27 - modify the pdfbox app 2.09 (non source) ext jar file by including the new .class file
    28 - Now we're possible generating a txtfile for each img file (assuming there was extractable txt), some modifications needed to be made to the generated .item file to ensure these referred to the txt file.
     25- introduce the new class PDFBoxToImagesAndText.java (originally added to the pdfbox app 2.09 src locally, now maintained separately with a GS Java package name). The new class is based on Apache PDFBox's PDFToImage.java, with added code to extract text based on Apache PDFBox's ExtractText.java
     26- OLD: rebuild the source, a maven project. Then modify the pdfbox app 2.09 (non source) ext jar file by including the new .class file
     27- compiling the new java class against the pdfbox-app.jar
     28- Now that we're possibly generating a txtfile for each img file (assuming there was extractable txt), some modifications needed to be made to the generated .item file to ensure these referred to the txt file.
    2929- work that remains to be done
    3030
    3131
    32 0. Used to use pdfbox app version 1.8.2, see commit message at http://trac.greenstone.org/changeset/29029/gs2-extensions/pdf-box/trunk/java/lib/java/pdfbox-app.jar
     321. Used to use pdfbox app version 1.8.2, see commit message at http://trac.greenstone.org/changeset/29029/gs2-extensions/pdf-box/trunk/java/lib/java/pdfbox-app.jar
    3333
    34 1. Upgrading to pdfbox app 2.0.9 (downloaded from https://pdfbox.apache.org/download.cgi?Preferred=http%3A%2F%2Fwww-us.apache.org%2Fdist%2F):
     342. Upgrading to pdfbox app 2.0.9 (downloaded from https://pdfbox.apache.org/download.cgi?Preferred=http%3A%2F%2Fwww-us.apache.org%2Fdist%2F):
    3535
    3636Doesn't work out of the box, such as to extract text from a PDF file.
     
    5050    if ($best_encoding eq "utf_8" ||  $best_encoding =~ /utf_8/) { $best_encoding = "utf8" }
    5151
     523. To recompile the new PDFBoxToImagesAndText.java code:
    5253
    53 2. Need to edit org.apache.pdfbox.tools.PDFToImage (not ExtractImages) as that's the tool that converts a PDF's individual pages to images
     54- grab the svn version of the Greenstone pdfbox extension
     55- Then from the svn checked out pdfbox folder, run
     56    $pdfbox > javac -cp /path/to/pdfbox/java/lib/java/pdfbox-app.jar -d /path/to/pdfbox/java/build java/src/org/greenstone/pdfbox/PDFBoxToImagesAndText.java
     57which will compile against the pdfbox-app.jar on the classpath (-cp) and output the .class file into the directory denoted by -d
     58
     59To run the class, that build folder needs to be on the classpath alongside pdfbox-app.jar itself. See PDFBoxConverter.pm
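
The run requirement above can be sketched as follows. This is only an illustration, not what PDFBoxConverter.pm actually does: the helper class PdfBoxClasspathSketch and its method names are hypothetical, the paths are the same placeholders used in the javac command, and any arguments the class itself expects are left out because this README does not list them.

```java
import java.io.File;
import java.util.List;

// Hypothetical helper, not part of Greenstone: assembles a java command line
// that puts both the build folder and pdfbox-app.jar on the classpath.
final class PdfBoxClasspathSketch {

    // Joins the two classpath entries with the platform's separator
    // (":" on Unix-like systems, ";" on Windows).
    static String classpath(String buildDir, String pdfboxJar) {
        return String.join(File.pathSeparator, buildDir, pdfboxJar);
    }

    // Builds the command as a list of tokens. The fully qualified class name
    // comes from the org.greenstone.pdfbox package described in this README.
    static List<String> command(String buildDir, String pdfboxJar) {
        return List.of("java", "-cp", classpath(buildDir, pdfboxJar),
                "org.greenstone.pdfbox.PDFBoxToImagesAndText");
    }

    public static void main(String[] args) {
        // Placeholder paths, as in the javac command.
        List<String> cmd = command("/path/to/pdfbox/java/build",
                "/path/to/pdfbox/java/lib/java/pdfbox-app.jar");
        System.out.println(String.join(" ", cmd));
    }
}
```

The point of the sketch is that the build folder (the root of the org/greenstone/pdfbox package tree) goes on the classpath, not the folder that directly contains the .class file.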
     60
     614. For convenience, PDFBoxToImagesAndText.java no longer generates <PDF filename prefix><sequential number>.img-ext (jpg/gif/png) and a matchingly named txt file, but just <sequential number>.img-ext and <sequential number>.txt.
     62
     63Further, needed to modify GS perllib's util::create_itemfile(), so that it no longer leaves each txtfile="" property empty in the item file if we've generated matching text files for the image files. In the past, util::create_itemfile() only worked with images, but now it's modified to additionally deal with matching text files.
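
The pairing of image and text files described in this step can be sketched as below. This is an illustration only and not the actual util::create_itemfile() code, which lives in GS perllib: the class PageFilePairingSketch and its method are hypothetical, and the empty string for a missing text file simply mirrors the txtfile="" case mentioned above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration: pairs <sequential number>.<img-ext> files with
// <sequential number>.txt files found in the same listing.
final class PageFilePairingSketch {

    // Returns image name -> matching text file name, or "" when no text file
    // was generated for that page. Entries are ordered by page number.
    static Map<String, String> pairImagesWithText(List<String> fileNames, String imgExt) {
        Set<String> present = new HashSet<>(fileNames);
        String suffix = "." + imgExt;

        // Keep only files named <digits>.<imgExt>; remember the digit part.
        List<String> pageNumbers = new ArrayList<>();
        for (String name : fileNames) {
            if (!name.endsWith(suffix)) continue;
            String stem = name.substring(0, name.length() - suffix.length());
            if (stem.matches("\\d{1,9}")) pageNumbers.add(stem);
        }

        // Sort numerically so that page 10 follows page 9 rather than page 1.
        pageNumbers.sort(Comparator.comparingInt((String s) -> Integer.parseInt(s)));

        Map<String, String> pairs = new LinkedHashMap<>();
        for (String page : pageNumbers) {
            String txt = page + ".txt";
            pairs.put(page + suffix, present.contains(txt) ? txt : "");
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> files = List.of("10.jpg", "2.jpg", "2.txt", "1.jpg", "1.txt");
        pairImagesWithText(files, "jpg")
                .forEach((img, txt) -> System.out.println(img + " -> \"" + txt + "\""));
    }
}
```

Dropping the PDF filename prefix is what makes this lookup simple: the text file for an image can be derived from the page number alone.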
     64
     65
     665. No need to rename the pdf-box, its containing tarball and its jar to gs-pdf-box. Doing that might have required changes to nightly builds, releases and diffcol to keep them working.
     67
     68However, Dr Bainbridge explained that the solution is to not rename pdfbox-app.jar to gs-pdfbox-app.jar, but have our new custom java file living separately rather than within the pdfbox-app.jar though compiled against it. The java file will have its own greenstone Java package (org.greenstone.pdfbox) and the class file produced from our custom class will then be run from its package by PDFBoxConverter.pm
     69
     70
     716. I was not sure what the licensing information for pdfbox/java/src/PDFBoxToImagesAndText.java should be in its preamble comment section, since most of the code is a copy or slight modification of the Apache PDFBox app's code. I asked Dr Nichols about the licensing, and he pointed out that Apache's PDFBox pages referred to https://www.apache.org/licenses/LICENSE-2.0
     72Dr Nichols drew attention to its section "Redistribution", saying that that should cover things. I've tried to follow the points there, and have therefore included Apache's LICENSE.txt and NOTICE.txt files in our pdfbox/java/src and pdfbox/java/build folders where PDFBoxToImagesAndText.java and PDFBoxToImagesAndText.class, respectively, will live.
     73
     74------------------
     75
     76
     77EXTRA INFORMATION
     78
     79Older but useful information, as it covers how to use maven to build the pdfbox-app, and include a new java file to be built into the pdfbox-app and distribute the file within the app.
     80
     811. Originally I thought I would need to edit org.apache.pdfbox.tools.PDFToImage (not ExtractImages), as that's the tool that converts a PDF's individual pages to images. Eventually, since there was a new class PDFBoxToImagesAndText.java, Dr Bainbridge explained that this could just be compiled against an existing pdfbox-app.jar, rather than including it inside the jar and recompiling the pdfbox-app in its entirety.
     82
     83When I was still compiling the pdfbox-app in its entirety, with and without my new code included therein, the steps to obtain the code base and build the app were:
    5484
    5585
     
    70100
    71101- The source code to edit is at /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/src/main/java/org/apache/pdfbox/tools
    72 Create GS_PDFToImagesAndText.java based on PDFToImage.java into which to incorporate from ExtractText.java
     102Create PDFBoxToImagesAndText.java based on PDFToImage.java into which to incorporate from ExtractText.java
    73103
    74104- When built, will get /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/target/classes/org/apache/pdfbox/tools
     
    76106
    77107
    78 3. Once the pdfbox-app has been entirely built the first time, want smaller rebuilds.
     1082. Once the pdfbox-app has been entirely built the first time, want smaller rebuilds.
    79109
    80110Final section of output of full rebuild:
     
    118148    greenstone@bedrock:~/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9$ mvn clean install -pl tools -am
    119149
    120 (This will also compile our new GS_PDFToImagesAndText.java and put its class file GS_PDFToImagesAndText.class into the target dir /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/target/classes/org/apache/pdfbox/tools)
    121 
    122 
    123 4. For convenience GS_PDFToImagesAndText.java further no longer generates <PDF filename prefix><sequential number>.img-ext (jpg/gif/png) and matchingly named txt file, but just <sequential number>.img-ext and <sequential number>.txt.
    124 
    125 Further, needed to modify GS perllib's util::create_itemfile(), so that it no longer leaves each txtfile="" property empty in the item file if we've generated matching text files for the image files. In the past, util::create_itemfile() only worked with images, but now it's modified to additionally deal with matching text files.
    126 
    127 
    128 5. Remaining Issues.
    129 
    130 a. Build messages (when building lower level PDF conversion commands):
    131 ImageConverter: "/home/greenstone/gs3-svn-26Mar2018/gs2build/tmp/F143/A9-access-best-practices1.txt" too small, skipping
    132 Same for: A9-access-best-practices56.txt
    133 
    134 
    135 Lower level PDF conversion commands:
    136 calling cmd "/usr/bin/perl" -S gsConvert.pl -verbose 2 -pdf_zoom 2 -errlog "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/tmp/1528179077/err.log" -output pagedimg_jpg "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/tmp/1528179077/A9-access-best-practices.pdf"
    137 import.pl>
    138 
    139 
    140 "/usr/bin/perl" -S gsConvert.pl -verbose 2 -pdf_zoom 2 -errlog "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/tmp/1528179077/err.log" -output pagedimg_jpg "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.pdf"
    141 
    142 
    143 "/usr/bin/perl" -S pdfpstoimg.pl -convert_to jpg "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.pdf" "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices" > "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.out" 2> "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.err"
    144 
    145 
    146 @@@@ pdfpstoimg.pl::pdfps_to_img(): cmd = "/usr/bin/perl" -S gs-magick.pl convert "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.pdf" "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices/A9-access-best-practices.jpg"
    147 
    148 
    149 b. Have not yet renamed the pdf-box, its containing tarball and its jar to gs-pdf-box yet, since a lot of things foreseen (nightly builds, releases, diffcol) and unforeseen can break due to the name change.
    150 
    151 c. Not sure what the licensing information for pdfbox/java/src/GS_PDFToImagesAndText.java should be in its preamble comment section, since most of the code is a copy or slight modification of the Apache PDFBox app's code.
     150(This will also compile our new PDFBoxToImagesAndText.java and put its class file PDFBoxToImagesAndText.class into the target dir /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/target/classes/org/apache/pdfbox/tools)
    152151
    153152__________________________________________________________