__________________________________________________________
README for GS modifications to the PDFBox App.
__________________________________________________________

5 June 2018

PROBLEM:
PDFBoxConverter.pm used the PDFBox App (v 1.8.2 at the time) to give us the option of either extracting text from the entire PDF or saving each page of the PDF as an image.
Kathy wanted the option to have both: the PDF's pages as images and also the text extracted for each page (if extractable), so that contents could be searched while being viewed in image format. Kathy said the HTML output was ugly, so we just pursue txt extraction.

IDEA FOR SOLUTION:
Kathy had found that the PDFBox app has an API. She looked at the API documentation for ExtractImages.java, which has a main function.
She explained that our GS perl code was calling the main function of classes such as this in order to do PDF to txt or img conversion.

Instead of doing that, Kathy's idea was to modify the PDFBox app's ExtractImages.java into a GS version that would additionally extract text. The additional code could be copied in from the ExtractText.java class.

Note: In looking into this, I found that ExtractImages literally extracts any images embedded in the PDF, rather than converting pages to images. The actual PDFBox class that our perl code calls to do the PDF pages to images conversion is PDFToImage.java.


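For orientation, calling the main function of one of these tool classes from a shell looks roughly like the sketch below. This is not taken from PDFBoxConverter.pm: the function name run_pdfbox_tool, the PDFBOX_PKG variable and the file names are made up for illustration, and the use of "java -cp" is an assumption about how the jar is put on the classpath.

```shell
# Sketch only: run a PDFBox command-line tool class via its main function.
#   $1   = path to the pdfbox-app jar
#   $2   = tool class name, e.g. ExtractText or PDFToImage
#   rest = arguments passed straight through to the tool
# PDFBOX_PKG (hypothetical variable) names the package holding the tool classes:
#   org.apache.pdfbox        in pdfbox-app 1.8.x
#   org.apache.pdfbox.tools  in pdfbox-app 2.0.x (see step 1)
run_pdfbox_tool() {
    jar_path=$1
    tool=$2
    shift 2
    java -cp "$jar_path" "${PDFBOX_PKG:-org.apache.pdfbox.tools}.$tool" "$@"
}

# Usage (file names are placeholders):
#   run_pdfbox_tool pdfbox-app.jar ExtractText input.pdf output.txt
#   run_pdfbox_tool pdfbox-app.jar PDFToImage input.pdf
```

A GS-specific class added to the same package could be started the same way, with whatever arguments that class defines.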
CONTENTS
The following are the steps I found I needed to go through, which are described in this readme:
- upgrade to pdfbox app 2.0.9
- check the new version didn't break the things for which we specifically settled on version 1.8.2 before
- grab the pdfbox 2.0.9 source code
- introduce the new class GS_PDFToImagesAndText.java to the pdfbox app 2.0.9 src, based on its PDFToImage.java with added code to extract text based on ExtractText.java
- rebuild the source, a maven project
- modify the pdfbox app 2.0.9 (non source) ext jar file by including the new .class file
- now that we're possibly generating a txt file for each img file (assuming there was extractable txt), modify the generated .item file to ensure it refers to the txt files
- work that remains to be done


0. We used to use pdfbox app version 1.8.2, see commit message at http://trac.greenstone.org/changeset/29029/gs2-extensions/pdf-box/trunk/java/lib/java/pdfbox-app.jar

1. Upgrading to pdfbox app 2.0.9 (downloaded from https://pdfbox.apache.org/download.cgi?Preferred=http%3A%2F%2Fwww-us.apache.org%2Fdist%2F):

Doesn't work out of the box, e.g. when extracting text from a PDF file.

But if references to org.apache.pdfbox.* in PDFBoxConverter.pm are changed to org.apache.pdfbox.tools.*, the PDF is processed again.

And the extracted text contains fi ("file") and fl ("workflow") as separate characters instead of single-character ligatures for these, so the upgrade has not broken that. Once again, refer to the commit message at http://trac.greenstone.org/changeset/29029/gs2-extensions/pdf-box/trunk/java/lib/java/pdfbox-app.jar


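One way to spot-check the ligature behaviour after an upgrade is to search the extracted text for the ligature code points themselves. A minimal sketch, assuming the extracted text is UTF-8 encoded; the function name has_ligatures and the file name in the usage line are made up for illustration:

```shell
# Sketch only: succeed (exit status 0) if the file contains the single-character
# ligatures for "fi" (U+FB01) or "fl" (U+FB02), assuming UTF-8 encoded text.
has_ligatures() {
    fi_lig=$(printf '\357\254\201')   # UTF-8 bytes EF AC 81 = U+FB01
    fl_lig=$(printf '\357\254\202')   # UTF-8 bytes EF AC 82 = U+FB02
    # LC_ALL=C makes grep compare raw bytes; -F treats the patterns as fixed strings.
    LC_ALL=C grep -qF -e "$fi_lig" -e "$fl_lig" "$1"
}

# Usage (file name is a placeholder):
#   if has_ligatures extracted.txt; then echo "ligatures present"; fi
```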
BUT we now get to see this warning message:
    ReadTextFile: WARNING: /home/greenstone/gs3-svn-26Mar2018/gs2build/tmp/F996.html appears to be encoded in an unsupported encoding ("utf_8) - using utf8


-> In ReadTextFile.pm change:
    if ($best_encoding eq "utf_8") { $best_encoding = "utf8" }
to:
    if ($best_encoding eq "utf_8" || $best_encoding =~ /utf_8/) { $best_encoding = "utf8" }
(The regex test on its own would be enough, since it also matches the exact string "utf_8".)


2. Need to edit org.apache.pdfbox.tools.PDFToImage (not ExtractImages), as that's the tool that converts a PDF's individual pages to images.


- unzipped the src version of pdfbox 2.0.9
- downloaded and unzipped maven
    Got it from https://maven.apache.org/download.cgi
- added maven/bin to PATH and set JAVA_TOOL_OPTIONS as below* in setenv.sh, then sourced setenv.sh (instructions at https://maven.apache.org/install.html)
    export PATH=$JAVA_HOME/bin:/home/greenstone/apache-maven-3.5.3/bin:$PATH
    export JAVA_TOOL_OPTIONS="-Dhttps.protocols=TLSv1.2"

    (* JAVA_TOOL_OPTIONS needs to be set as above because, without it, mvn failed to build the pdfbox src with the error "SSLException: Received fatal alert: protocol_version", explained at: https://github.com/jenkinsci/ghprb-plugin/issues/638)

- As per pdfbox 2.0.9 src's README.md, the build command is:
    mvn clean install

    Run that command from inside the extracted pdfbox-2.0.9 src code's folder, wherever this is extracted, e.g.
    greenstone@bedrock:~/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9

- The source code to edit is at /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/src/main/java/org/apache/pdfbox/tools
    Create GS_PDFToImagesAndText.java there, based on PDFToImage.java, and incorporate the text extraction code from ExtractText.java into it.

- When built, the class files end up in /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/target/classes/org/apache/pdfbox/tools
    Put the new .class file from there into the unzipped pdfbox app 2.0.9 version (as opposed to the unzipped pdfbox src 2.0.9)


3. Once the pdfbox-app has been entirely built the first time, we want smaller rebuilds.

Final section of output of full rebuild:


[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] PDFBox parent ...................................... SUCCESS [ 1.457 s]
[INFO] Apache FontBox ..................................... SUCCESS [ 16.135 s]
[INFO] Apache XmpBox ...................................... SUCCESS [ 7.046 s]
[INFO] Apache PDFBox ...................................... SUCCESS [01:37 min]
[INFO] Apache Preflight ................................... SUCCESS [ 20.115 s]
[INFO] Apache Preflight application ....................... SUCCESS [ 8.270 s]
[INFO] Apache PDFBox Debugger ............................. SUCCESS [ 2.246 s]
[INFO] Apache PDFBox tools ................................ SUCCESS [ 8.567 s]
[INFO] Apache PDFBox application .......................... SUCCESS [ 7.401 s]
[INFO] Apache PDFBox Debugger application ................. SUCCESS [ 7.184 s]
[INFO] Apache PDFBox examples ............................. SUCCESS [ 15.077 s]
[INFO] PDFBox reactor 2.0.9 ............................... SUCCESS [ 0.148 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:12 min
[INFO] Finished at: 2018-06-05T16:03:27+12:00
[INFO] ------------------------------------------------------------------------



The maven module in pdfbox-app that we actually need to build is "Apache PDFBox tools", as this contains ExtractText.java, PDFToImage.java and the new class we're going to put in there.

To build a single project component ("module") in maven, see
- https://maven.apache.org/guides/mini/guide-multiple-modules.html
- https://stackoverflow.com/questions/1114026/maven-modules-building-a-single-specific-module
- https://stackoverflow.com/questions/23075415/maven-build-second-level-child-projects-using-reactor-option-pl

So since we want to build the module "tools" (and the modules it depends on), we do:
    mvn [clean] install -pl tools -am

Again, run the above wherever the pdfbox app's src code v 2.0.9 is extracted:
    greenstone@bedrock:~/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9$ mvn clean install -pl tools -am

(This will also compile our new GS_PDFToImagesAndText.java and put its class file GS_PDFToImagesAndText.class into the target dir /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/target/classes/org/apache/pdfbox/tools)


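Once the class file exists in the target dir, it still has to get into the pdfbox app jar. The sketch below uses the JDK jar tool's update mode, which is an alternative to unzipping the app jar and copying the file in by hand. The function name add_gs_class_to_jar and the jar location in the usage line are made up for illustration.

```shell
# Path of the new class inside the jar, i.e. its package path.
CLASS_REL=org/apache/pdfbox/tools/GS_PDFToImagesAndText.class

# Sketch only: add the freshly built class to an existing pdfbox-app jar.
#   $1 = pdfbox-app jar to update
#   $2 = classes dir produced by the tools build (.../tools/target/classes)
add_gs_class_to_jar() {
    app_jar=$1
    classes_dir=$2
    if [ ! -f "$classes_dir/$CLASS_REL" ]; then
        echo "missing $classes_dir/$CLASS_REL; rebuild the tools module first" >&2
        return 1
    fi
    # "jar uf" updates an existing archive; -C changes into classes_dir first,
    # so the entry is stored under its package path rather than the full path.
    jar uf "$app_jar" -C "$classes_dir" "$CLASS_REL"
}

# Usage (jar location is a placeholder):
#   add_gs_class_to_jar /path/to/pdfbox-app.jar \
#       ~/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/target/classes
```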
4. For convenience, GS_PDFToImagesAndText.java no longer generates <PDF filename prefix><sequential number>.img-ext (jpg/gif/png) and a matchingly named txt file, but just <sequential number>.img-ext and <sequential number>.txt.

Further, needed to modify GS perllib's util::create_itemfile(), so that it no longer leaves each txtfile="" property empty in the item file if we've generated matching text files for the image files. In the past, util::create_itemfile() only worked with images, but now it's modified to additionally deal with matching text files.


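The pairing of image and text files described above can be sketched as follows. This is not the code of util::create_itemfile(). The layout of an item file line is not shown in this readme, so the <Page pagenum=... imgfile=... txtfile=.../> form below is an assumption for illustration; only the txtfile property is mentioned above. The sketch also assumes pages are numbered from 1 and treats an empty text file as "no text"; the function name item_page_lines is made up.

```shell
# Sketch only: print one line per page image named <n>.<ext> in a directory,
# filling in txtfile only when a non-empty <n>.txt sits next to the image.
#   $1 = directory holding the generated files
#   $2 = image extension, e.g. jpg
item_page_lines() {
    dir=$1
    ext=$2
    n=1                                   # assumption: numbering starts at 1
    while [ -f "$dir/$n.$ext" ]; do
        txt=""
        # -s is true only for files that exist and are not empty
        [ -s "$dir/$n.txt" ] && txt="$n.txt"
        # line layout is an assumption, see the note above
        printf '<Page pagenum="%s" imgfile="%s" txtfile="%s"/>\n' "$n" "$n.$ext" "$txt"
        n=$((n + 1))
    done
}

# Usage (directory is a placeholder):
#   item_page_lines /path/to/tmp/F143 jpg
```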
5. Remaining Issues.

a. Build messages (seen when building; the lower level PDF conversion commands involved are listed below):
    ImageConverter: "/home/greenstone/gs3-svn-26Mar2018/gs2build/tmp/F143/A9-access-best-practices1.txt" too small, skipping
    Same for: A9-access-best-practices56.txt


Lower level PDF conversion commands:
    calling cmd "/usr/bin/perl" -S gsConvert.pl -verbose 2 -pdf_zoom 2 -errlog "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/tmp/1528179077/err.log" -output pagedimg_jpg "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/tmp/1528179077/A9-access-best-practices.pdf"
    import.pl>


    "/usr/bin/perl" -S gsConvert.pl -verbose 2 -pdf_zoom 2 -errlog "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/tmp/1528179077/err.log" -output pagedimg_jpg "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.pdf"


    "/usr/bin/perl" -S pdfpstoimg.pl -convert_to jpg "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.pdf" "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices" > "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.out" 2> "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.err"


    @@@@ pdfpstoimg.pl::pdfps_to_img(): cmd = "/usr/bin/perl" -S gs-magick.pl convert "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.pdf" "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices/A9-access-best-practices.jpg"


b. Have not yet renamed pdf-box, its containing tarball and its jar to gs-pdf-box, since a lot of things foreseen (nightly builds, releases, diffcol) and unforeseen can break due to the name change.

c. Not sure what the licensing information for pdfbox/java/src/GS_PDFToImagesAndText.java should be in its preamble comment section, since most of the code is a copy or slight modification of the Apache PDFBox app's code.

__________________________________________________________
Unused, originally considered as starting points. Kathy's searches:

Google: pdfbox api extract text
https://stackoverflow.com/questions/23813727/how-to-extract-text-from-a-pdf-file-with-apache-pdfbox

Google: pdfbox extractimages
https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/tools/ExtractImages.html
https://stackoverflow.com/questions/8705163/extract-images-from-pdf-using-pdfbox