source: gs2-extensions/pdf-box/trunk/GS_PDFBox_README.txt@ 32278

Last change on this file since 32278 was 32278, checked in by ak19, 6 years ago

Our custom pdf-box class PDFToImagesAndText.java now takes two additional flags, textOnly and imagesOnly, which can be used to support paged_text and the original pagedimg_ output formats, besides pagedimgtxt_

__________________________________________________________
README for GS modifications to the PDFBox App.
__________________________________________________________

5 June 2018

PROBLEM:
PDFBoxConverter.pm used the PDFBox App (v 1.8.2 at the time) to give us the option of either extracting text from the entire PDF or saving each page of the PDF as an image.
Kathy wanted the option of having both: the PDF's pages as images and also the text extracted for each page (if extractable), so that contents could be searched while being viewed in image format. Kathy said the HTML output was ugly, so we only pursue txt extraction.

IDEA FOR SOLUTION:
Kathy had found that the PDFBox app had an API. She looked at the API documentation for ExtractImages.java, which had a main function.
She explained that our GS perl code was calling the main function of classes such as this in order to do PDF to txt or img conversion.

Instead of doing that, Kathy's idea was to modify the PDFBox app's ExtractImages.java into a GS version that would additionally extract text. The additional code could be copied in from the ExtractText.java class.

Note: In looking into this, I found that ExtractImages literally extracts any images embedded in the PDF, rather than converting pages to images. The actual PDFBox class that our perl code calls to do the PDF pages to images conversion is PDFToImage.java.


CONTENTS
The following are the steps I found I needed to go through and which are described in this readme:
- upgrade to pdfbox app 2.0.9,
- check the new version didn't break the stuff for which we specifically settled on version 1.8.2 before
- grab the pdfbox 2.0.9 source code
- introduce the new class PDFBoxToImagesAndText.java (originally added to the pdfbox app 2.0.9 src locally, now maintained separately with a GS Java package name). The new class is based on Apache PDFBox's PDFToImage.java, with added code to extract text based on Apache PDFBox's ExtractText.java
- OLD: rebuild the source, a maven project. Then modify the pdfbox app 2.0.9 (non source) ext jar file by including the new .class file
- compile the new java class against the pdfbox-app.jar
- Since we're now possibly generating a txt file for each img file (assuming there was extractable txt), some modifications needed to be made to the generated .item file to ensure it refers to the txt files.
- work that remains to be done


1. We used to use pdfbox app version 1.8.2, see commit message at http://trac.greenstone.org/changeset/29029/gs2-extensions/pdf-box/trunk/java/lib/java/pdfbox-app.jar

2. Upgrading to pdfbox app 2.0.9 (downloaded from https://pdfbox.apache.org/download.cgi?Preferred=http%3A%2F%2Fwww-us.apache.org%2Fdist%2F):

It doesn't work out of the box, for example to extract text from a PDF file.

But if references to org.apache.pdfbox.* in PDFBoxConverter.pm are changed to org.apache.pdfbox.tools.*, the PDF is processed as before.

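A minimal sketch of that package rename, assuming the two PDFBox command-line classes involved are ExtractText and PDFToImage (the tools named elsewhere in this readme). The function name and the sample command line are made up for illustration and are not taken from PDFBoxConverter.pm:

```shell
# Hypothetical helper: rewrite PDFBox 1.8.x tool class names to the
# org.apache.pdfbox.tools package used by PDFBox 2.0.x.
# Reads text on stdin and writes the rewritten text to stdout.
to_pdfbox2_tool_names() {
    sed -E 's/org\.apache\.pdfbox\.(ExtractText|PDFToImage)/org.apache.pdfbox.tools.\1/g'
}

# Illustrative input only, not a line copied from PDFBoxConverter.pm
echo 'java -cp pdfbox-app.jar org.apache.pdfbox.PDFToImage in.pdf' | to_pdfbox2_tool_names
```

Names that already carry the tools package do not match the pattern, so running the rewrite twice leaves them unchanged.
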
And the extracted text contains plain "fi" (as in "file") and "fl" (as in "workflow") rather than single-character ligatures for these. Once again, refer to the commit message at http://trac.greenstone.org/changeset/29029/gs2-extensions/pdf-box/trunk/java/lib/java/pdfbox-app.jar


BUT we now get to see this warning message:
 ReadTextFile: WARNING: /home/greenstone/gs3-svn-26Mar2018/gs2build/tmp/F996.html appears to be encoded in an unsupported encoding ("utf_8) - using utf8


-> In ReadTextFile.pm change:
 if ($best_encoding eq "utf_8") { $best_encoding = "utf8" }
to:
 if ($best_encoding eq "utf_8" || $best_encoding =~ /utf_8/) { $best_encoding = "utf8" }

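The widened check can be illustrated outside Perl. The warning above suggests (this is an inference from the message, not something confirmed here) that the detected encoding name arrived with a stray leading quote, which an exact string comparison does not match but a substring match does. A small shell sketch of the same logic, where normalise_encoding is a made-up name and not part of ReadTextFile.pm:

```shell
# Hypothetical stand-in for the check in ReadTextFile.pm:
# any encoding name containing "utf_8" is mapped to "utf8",
# everything else is passed through unchanged.
normalise_encoding() {
    case $1 in
        *utf_8*) printf '%s\n' utf8 ;;
        *)       printf '%s\n' "$1" ;;
    esac
}

normalise_encoding 'utf_8'      # exact name, matched by the old check too
normalise_encoding '"utf_8'     # stray quote, only matched by the new check
normalise_encoding 'iso_8859_1' # unrelated names are left alone
```
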
3. To recompile the new PDFBoxToImagesAndText.java code:

- grab the svn version of the Greenstone pdfbox extension
- then, from the svn checked out pdfbox (trunk/java) folder, run
 $ javac -cp `pwd`/lib/java/pdfbox-app.jar -d `pwd`/build src/org/greenstone/pdfbox/PDFBoxToImagesAndText.java

which will compile our custom PDFBoxToImagesAndText.java file against the pdfbox-app.jar in the classpath (-cp) and output the .class file into the directory denoted by -d

To run it, that build folder needs to be on the classpath, alongside pdfbox-app.jar itself. See PDFBoxConverter.pm

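As a rough sketch of what that classpath looks like, assuming the same trunk/java layout as the javac command above. The function names are invented here, and the converter's arguments are simply passed through because its options are not listed in this readme:

```shell
# Hypothetical helper: build the classpath for a pdfbox checkout.
# $1 is the trunk/java folder containing build/ and lib/java/pdfbox-app.jar
gs_pdfbox_classpath() {
    printf '%s:%s\n' "$1/build" "$1/lib/java/pdfbox-app.jar"
}

# Hypothetical wrapper: run the custom class from its Greenstone package.
# All arguments after the folder are handed to the class unchanged.
run_pdfbox_to_images_and_text() {
    java_dir=$1
    shift
    java -cp "$(gs_pdfbox_classpath "$java_dir")" \
        org.greenstone.pdfbox.PDFBoxToImagesAndText "$@"
}

# Print the classpath for the current folder
gs_pdfbox_classpath "$(pwd)"
```

The ":" separator is the Linux/macOS form; on Windows the java classpath separator is ";".
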
4. For convenience, PDFBoxToImagesAndText.java no longer generates <PDF filename prefix><sequential number>.img-ext (jpg/gif/png) with a matchingly named txt file, but just <sequential number>.img-ext and <sequential number>.txt.

Further, I needed to modify GS perllib's util::create_itemfile(), so that it no longer leaves each txtfile="" property empty in the item file if we've generated matching text files for the image files. In the past, util::create_itemfile() only worked with images, but now it's modified to additionally deal with matching text files.


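The pairing that the item file now has to reflect can be sketched in shell. This is not util::create_itemfile() itself: the function name is invented, and the output line only borrows the txtfile="" attribute mentioned above. The pagenum and imgfile names are assumptions, and the real .item file layout may differ.

```shell
# Hypothetical helper, not Greenstone's util::create_itemfile().
# $1 is the output folder, $2 the image extension (e.g. png).
# Prints one line per numbered page image, in numeric page order,
# with txtfile filled in only when a matching <number>.txt exists.
list_page_pairs() {
    dir=$1
    ext=$2
    for img in "$dir"/*."$ext"; do
        [ -e "$img" ] || continue
        base=$(basename "$img" ".$ext")
        # Only consider purely numeric names such as 1.png, 2.png, ...
        case $base in
            ''|*[!0-9]*) continue ;;
        esac
        printf '%s\n' "$base"
    done | sort -n | while read -r n; do
        txt=""
        if [ -f "$dir/$n.txt" ]; then
            txt="$n.txt"
        fi
        printf 'pagenum="%s" imgfile="%s" txtfile="%s"\n' "$n" "$n.$ext" "$txt"
    done
}

# Small demonstration in a throwaway folder: page 2 has no text file
demo=$(mktemp -d)
touch "$demo/1.png" "$demo/2.png" "$demo/10.png" "$demo/1.txt" "$demo/10.txt"
list_page_pairs "$demo" png
rm -rf "$demo"
```

In the demonstration, pages 1 and 10 get a txtfile value while page 2 keeps an empty one, and the numeric sort puts page 10 after page 2 rather than after page 1.
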
5. No need to rename pdf-box, its containing tarball and its jar to gs-pdf-box. Doing that may have required changes to the nightly builds, releases and diffcol to keep them working.

Dr Bainbridge explained that the solution is not to rename pdfbox-app.jar to gs-pdfbox-app.jar, but to have our new custom java file live separately, outside pdfbox-app.jar though compiled against it. The java file has its own greenstone Java package (org.greenstone.pdfbox), and the class file produced from our custom class is then run from its package by PDFBoxConverter.pm


6. I was not sure what the licensing information for pdfbox/java/src/PDFBoxToImagesAndText.java should be in its preamble comment section, since most of the code is a copy or slight modification of the Apache PDFBox app's code. I asked Dr Nichols about the licensing, and he pointed out that Apache's PDFBox pages referred to https://www.apache.org/licenses/LICENSE-2.0
Dr Nichols drew attention to its section "Redistribution", saying that that should cover things. I've tried to follow the points there, and have therefore included apache's LICENSE.txt and NOTICE.txt files in our pdfbox/java/src and pdfbox/java/build folders where PDFBoxToImagesAndText.java and PDFBoxToImagesAndText.class, respectively, will live.

------------------


EXTRA INFORMATION

Older but useful information, as it covers how to use maven to build the pdfbox-app, and how to include a new java file to be built into the pdfbox-app and distributed within the app.

1. Originally I thought I would need to edit org.apache.pdfbox.tools.PDFToImage (not ExtractImages), as that's the tool that converts a PDF's individual pages to images. Eventually, since there was a new class PDFBoxToImagesAndText.java, Dr Bainbridge explained that this could just be compiled against an existing pdfbox-app.jar, rather than including it inside the jar and recompiling the pdfbox-app in its entirety.

When I was still compiling the pdfbox-app in its entirety, with and without my new code included therein, the steps to obtain the base code and build the app were:


- unzipped the src version of pdfbox 2.0.9
- downloaded and unzipped maven
 Got it from https://maven.apache.org/download.cgi
- added maven/bin to PATH and set JAVA_TOOL_OPTIONS as below in setenv.sh, then sourced setenv.sh (instructions at https://maven.apache.org/install.html)
 export PATH=$JAVA_HOME/bin:/home/greenstone/apache-maven-3.5.3/bin:$PATH
 export JAVA_TOOL_OPTIONS="-Dhttps.protocols=TLSv1.2"

(JAVA_TOOL_OPTIONS needs to be set as above because, without it, mvn failed to build the pdfbox src with the error "SSLException: Received fatal alert: protocol_version", explained at: https://github.com/jenkinsci/ghprb-plugin/issues/638)

- As per the pdfbox 2.0.9 src's README.md, the build command is:
 mvn clean install

Run that command after first going into the extracted pdfbox-2.0.9 src code's folder, wherever this is extracted, e.g.
 greenstone@bedrock:~/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9

- The source code to edit is at /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/src/main/java/org/apache/pdfbox/tools
Create PDFBoxToImagesAndText.java based on PDFToImage.java, and incorporate the text extraction code from ExtractText.java into it.

- When built, will get /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/target/classes/org/apache/pdfbox/tools
Put that into the unzipped pdfbox app 2.0.9 version (as opposed to the unzipped pdfbox src 2.0.9)


2. Once the pdfbox-app has been entirely built the first time, want smaller rebuilds.

Final section of output of full rebuild:


[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] PDFBox parent ...................................... SUCCESS [ 1.457 s]
[INFO] Apache FontBox ..................................... SUCCESS [ 16.135 s]
[INFO] Apache XmpBox ...................................... SUCCESS [ 7.046 s]
[INFO] Apache PDFBox ...................................... SUCCESS [01:37 min]
[INFO] Apache Preflight ................................... SUCCESS [ 20.115 s]
[INFO] Apache Preflight application ....................... SUCCESS [ 8.270 s]
[INFO] Apache PDFBox Debugger ............................. SUCCESS [ 2.246 s]
[INFO] Apache PDFBox tools ................................ SUCCESS [ 8.567 s]
[INFO] Apache PDFBox application .......................... SUCCESS [ 7.401 s]
[INFO] Apache PDFBox Debugger application ................. SUCCESS [ 7.184 s]
[INFO] Apache PDFBox examples ............................. SUCCESS [ 15.077 s]
[INFO] PDFBox reactor 2.0.9 ............................... SUCCESS [ 0.148 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:12 min
[INFO] Finished at: 2018-06-05T16:03:27+12:00
[INFO] ------------------------------------------------------------------------



The maven module in pdfbox-app to actually build is "Apache PDFBox tools", as this contains ExtractText.java, PDFToImage.java and the new class we're going to put in there.

To build a project component ("module") in maven, see
- https://maven.apache.org/guides/mini/guide-multiple-modules.html
- https://stackoverflow.com/questions/1114026/maven-modules-building-a-single-specific-module
- https://stackoverflow.com/questions/23075415/maven-build-second-level-child-projects-using-reactor-option-pl

So since we want to build the module "tools" (and its dependencies), we do the following, where -pl selects the module and -am also builds the modules it depends on:
 mvn [clean] install -pl tools -am

Again, run the above wherever the pdfbox app's src code v 2.0.9 is extracted:
 greenstone@bedrock:~/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9$ mvn clean install -pl tools -am

(This will also compile our new PDFBoxToImagesAndText.java and put its class file PDFBoxToImagesAndText.class into the target dir /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/target/classes/org/apache/pdfbox/tools)

__________________________________________________________
Unused, originally considered as starting points. Kathy's searches:

Google: pdfbox api extract text
https://stackoverflow.com/questions/23813727/how-to-extract-text-from-a-pdf-file-with-apache-pdfbox

Google: pdfbox extractimages
https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/tools/ExtractImages.html
https://stackoverflow.com/questions/8705163/extract-images-from-pdf-using-pdfbox