source: gs2-extensions/pdf-box/trunk/GS_PDFBox_README.txt@ 32193

Last change on this file since 32193 was 32193, checked in by ak19, 6 years ago

All the *essential* changes related to the PDFBox modifications Kathy asked for. The PDFBox app used to be used either to generate images for every PDF page or to extract txt from the PDF. Kathy ideally wanted to produce paged images with extracted text, where available, so that this would be searchable. So images AND extracted text. Her idea was to modify the pdfbox app code to do it: a new class based on the existing one that generated the images for each page, which would (based on Kathy's answers to my questions) need to be modified to additionally extract the text of each page, so that txt search results matched the correct img page presented. Might as well upgrade the pdfbox app version our GS code used. After testing that the latest version (2.0.9) did not have any of the issues for which we previously settled on v 1.8.2 (lower than the then most up to date version), the necessary code changes were made. All of these are documented in the newly included GS_PDFBox_README.txt. The new java file is called GS_PDFToImagesAndText.java and is located in the new java/src subfolder. This will need to be put into the pdfbox app 2.0.9 *src* code to be built, and the generated class file should then be copied into the java/lib/java/pdfbox-app.jar, all as explained in the GS_PDFBox_README.txt. Other files modified for the changes requested by Kathy are PDFBoxConverter.pm, to refer to our new class and its new java package location as packages have changed in 2.0.9, and util.pm's create_itemfile() function, which now may additionally deal with txt files matching each img file generated. (Not committing a minor adjustment to ReadTextFile.pm to prevent a warning, as my fix seems hacky. But the fix is described in the Readme.) The pdfbox ext zip/tarballs were also modified to contain the changed PDFBoxConverter.pm and the pdfbox-app jar file for 2.0.9 with our custom new class file.
But have not yet renamed anything to gs-pdfbox-app, as there will be flow-on effects elsewhere as described in the Readme; can do all this in a separate commit.

__________________________________________________________
README for GS modifications to the PDFBox App.
__________________________________________________________

5 June 2018

PROBLEM:
PDFBoxConverter.pm used the PDFBox App (v 1.8.2 at the time) to give us the option of either extracting text from the entire PDF or saving each page of the PDF as an image.
Kathy wanted the option of having both: the PDF's pages as images plus the text extracted for each page (if extractable), so that the contents could be searched while being viewed in image format. Kathy said the HTML output was ugly, so we only pursue txt extraction.

IDEA FOR SOLUTION:
Kathy had found that the PDFBox app had an API. She looked at the API documentation for ExtractImages.java, which had a main function.
She explained that our GS perl code was calling the main function of classes such as this in order to do PDF to txt or img conversion.

Instead of doing that, Kathy's idea was to modify the PDFBox app's ExtractImages.java into a GS version that would additionally extract text. The additional code could be copied in from the ExtractText.java class.

Note: In looking into this, I found that ExtractImages literally extracts any images embedded in the PDF, rather than converting pages to images. The actual PDFBox class that our perl code calls to do the PDF pages to images conversion is PDFToImage.java.


CONTENTS
The following are the steps I found I needed to go through, and which are described in this readme:
- upgrade to pdfbox app 2.0.9
- check that the new version didn't break the things for which we specifically settled on version 1.8.2 before
- grab the pdfbox 2.0.9 source code
- introduce the new class GS_PDFToImagesAndText.java to the pdfbox app 2.0.9 src, based on its PDFToImage.java with added code to extract text based on ExtractText.java
- rebuild the source, a maven project
- modify the pdfbox app 2.0.9 (non-source) ext jar file by including the new .class file
- since we're now possibly generating a txt file for each img file (assuming there was extractable txt), modify the generated .item file to ensure it refers to the txt files
- work that remains to be done


0. We used to use pdfbox app version 1.8.2, see the commit message at http://trac.greenstone.org/browser/gs2-extensions/pdf-box/trunk/java/lib/java/pdfbox-app.jar

1. Upgrading to pdfbox app 2.0.9 (downloaded from https://pdfbox.apache.org/download.cgi?Preferred=http%3A%2F%2Fwww-us.apache.org%2Fdist%2F):

It doesn't work out of the box, e.g. when extracting text from a PDF file.

But if references to org.apache.pdfbox.* in PDFBoxConverter.pm are changed to org.apache.pdfbox.tools.*, the PDF is processed as before.
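
The package change could be applied mechanically. A minimal sketch, assuming the only affected references are the ExtractText and PDFToImage tool classes named in this readme (PDFBoxConverter.pm may refer to others); the function name update_pdfbox_packages is made up for illustration:

```shell
# Hypothetical helper: rewrite PDFBox 1.8.x tool class names to the
# org.apache.pdfbox.tools package used by PDFBox 2.0.x.
# Reads the named files (or stdin) and writes the result to stdout.
# Names that already include ".tools." are left alone, so it is safe to re-run.
update_pdfbox_packages() {
    sed -E 's/org\.apache\.pdfbox\.(ExtractText|PDFToImage)/org.apache.pdfbox.tools.\1/g' "$@"
}
```

For example, "update_pdfbox_packages PDFBoxConverter.pm > PDFBoxConverter.pm.new" would leave the original file untouched so that the result can be reviewed with diff first.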

And the extracted text contains plain "fi" (as in "file") and "fl" (as in "workflow") instead of ligatures. Once again, refer to the commit message at http://trac.greenstone.org/browser/gs2-extensions/pdf-box/trunk/java/lib/java/pdfbox-app.jar
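
The ligature check can be repeated on any extracted text file. A rough sketch, assuming the extracted text is UTF-8 and that the ligatures of interest are U+FB01 (fi) and U+FB02 (fl); the function name count_ligature_lines is made up for illustration:

```shell
# Hypothetical helper: print the number of lines that contain the
# U+FB01 (fi) or U+FB02 (fl) ligature characters.
# The ligatures are matched as raw UTF-8 bytes, so the input is assumed to be UTF-8.
# Reads one named file, or stdin if no file is given.
count_ligature_lines() {
    fi_lig=$(printf '\357\254\201')   # UTF-8 bytes of U+FB01
    fl_lig=$(printf '\357\254\202')   # UTF-8 bytes of U+FB02
    # grep exits with 1 when nothing matches; treat that as success,
    # but still let real errors (exit status 2) through.
    LC_ALL=C grep -c -e "$fi_lig" -e "$fl_lig" "$@" || [ $? -eq 1 ]
}
```

A count of 0 for a PDF whose text contains words like "file" and "workflow" would suggest the ligatures are being expanded to plain letters.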


BUT we get to see this warning message now:
 ReadTextFile: WARNING: /home/greenstone/gs3-svn-26Mar2018/gs2build/tmp/F996.html appears to be encoded in an unsupported encoding ("utf_8) - using utf8


-> In ReadTextFile.pm change:
 if ($best_encoding eq "utf_8") { $best_encoding = "utf8" }
to:
 if ($best_encoding eq "utf_8" || $best_encoding =~ /utf_8/) { $best_encoding = "utf8" }


2. Need to edit org.apache.pdfbox.tools.PDFToImage (not ExtractImages), as that's the tool that converts a PDF's individual pages to images.


- unzipped the src version of pdfbox 2.0.9
- downloaded and unzipped maven
 Got it from https://maven.apache.org/download.cgi
- added maven/bin to PATH and set JAVA_TOOL_OPTIONS as below in setenv.sh, then sourced setenv.sh (instructions at https://maven.apache.org/install.html)
 export PATH=$JAVA_HOME/bin:/home/greenstone/apache-maven-3.5.3/bin:$PATH
 export JAVA_TOOL_OPTIONS="-Dhttps.protocols=TLSv1.2"

(JAVA_TOOL_OPTIONS needs to be set as above because, without it, mvn failed to build the pdfbox src with the error "SSLException: Received fatal alert: protocol_version", explained at: https://github.com/jenkinsci/ghprb-plugin/issues/638)

- As per pdfbox 2.0.9 src's README.md, the build command is:
 mvn clean install

Run that command after first going into the extracted pdfbox-2.0.9 src code's folder, wherever this is extracted, e.g.
 greenstone@bedrock:~/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9

- The source code to edit is at /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/src/main/java/org/apache/pdfbox/tools
Create GS_PDFToImagesAndText.java there, based on PDFToImage.java, and incorporate the text extraction code from ExtractText.java into it.

- When built, the class files end up in /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/target/classes/org/apache/pdfbox/tools
Put the new GS_PDFToImagesAndText.class from there into the unzipped pdfbox app 2.0.9 version (as opposed to the unzipped pdfbox src 2.0.9)
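
As an alternative to unzipping the app jar and copying the class file in by hand, the JDK's jar tool can add a file to an existing jar directly. This is only a sketch of that alternative, not the procedure that was actually followed here. The function name update_pdfbox_jar and the JAR_TOOL override are made up for illustration, and it assumes the new class compiles to a single .class file (any inner classes, named GS_PDFToImagesAndText$....class, would need to be added as well):

```shell
# Hypothetical helper: add the newly built class to an existing pdfbox-app jar.
#   $1 = jar file to update, e.g. .../java/lib/java/pdfbox-app.jar
#   $2 = maven classes dir,  e.g. .../pdfbox-2.0.9/tools/target/classes
# "jar uf" updates the jar in place; "-C dir" makes the entry path inside the
# jar relative to dir, so the class lands under org/apache/pdfbox/tools/.
# Setting JAR_TOOL=echo prints the arguments instead of running jar.
update_pdfbox_jar() {
    jar_file=$1
    classes_dir=$2
    "${JAR_TOOL:-jar}" uf "$jar_file" -C "$classes_dir" \
        org/apache/pdfbox/tools/GS_PDFToImagesAndText.class
}
```

Working on a backup copy of pdfbox-app.jar first is advisable, since the jar is modified in place.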


3. Once the pdfbox-app has been entirely built the first time, we want smaller rebuilds.

Final section of output of full rebuild:


[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] PDFBox parent ...................................... SUCCESS [ 1.457 s]
[INFO] Apache FontBox ..................................... SUCCESS [ 16.135 s]
[INFO] Apache XmpBox ...................................... SUCCESS [ 7.046 s]
[INFO] Apache PDFBox ...................................... SUCCESS [01:37 min]
[INFO] Apache Preflight ................................... SUCCESS [ 20.115 s]
[INFO] Apache Preflight application ....................... SUCCESS [ 8.270 s]
[INFO] Apache PDFBox Debugger ............................. SUCCESS [ 2.246 s]
[INFO] Apache PDFBox tools ................................ SUCCESS [ 8.567 s]
[INFO] Apache PDFBox application .......................... SUCCESS [ 7.401 s]
[INFO] Apache PDFBox Debugger application ................. SUCCESS [ 7.184 s]
[INFO] Apache PDFBox examples ............................. SUCCESS [ 15.077 s]
[INFO] PDFBox reactor 2.0.9 ............................... SUCCESS [ 0.148 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:12 min
[INFO] Finished at: 2018-06-05T16:03:27+12:00
[INFO] ------------------------------------------------------------------------



The maven module in pdfbox-app to actually build is "Apache PDFBox tools", as this contains ExtractText.java, PDFToImage.java and the new class we're going to put in there.

To build a project component ("module") in maven, see
- https://maven.apache.org/guides/mini/guide-multiple-modules.html
- https://stackoverflow.com/questions/1114026/maven-modules-building-a-single-specific-module
- https://stackoverflow.com/questions/23075415/maven-build-second-level-child-projects-using-reactor-option-pl

So since we want to build the module "tools" (and its dependencies), we do:
 mvn [clean] install -pl tools -am

Again, run the above wherever the pdfbox app's src code v 2.0.9 is extracted:
 greenstone@bedrock:~/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9$ mvn clean install -pl tools -am

(This will also compile our new GS_PDFToImagesAndText.java and put its class file GS_PDFToImagesAndText.class into the target dir /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/target/classes/org/apache/pdfbox/tools)


4. For convenience, GS_PDFToImagesAndText.java no longer generates <PDF filename prefix><sequential number>.img-ext (jpg/gif/png) and a matchingly named txt file, but just <sequential number>.img-ext and <sequential number>.txt.
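
With this naming scheme it is easy to see which pages ended up without extracted text. A small sketch, assuming all output files sit in one directory and use the jpg, gif or png extension; the function name list_pages_without_text is made up for illustration:

```shell
# Hypothetical helper: list the page images in directory $1 that have no
# same-numbered .txt file next to them (e.g. 2.jpg without 2.txt).
list_pages_without_text() {
    dir=$1
    for img in "$dir"/*.jpg "$dir"/*.gif "$dir"/*.png; do
        [ -e "$img" ] || continue     # skip patterns that matched nothing
        base=${img%.*}                # strip the image extension
        [ -e "$base.txt" ] || printf '%s\n' "${img##*/}"
    done
}
```

An empty result means every generated page image has a matching txt file.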

Further, GS perllib's util::create_itemfile() needed to be modified so that it no longer leaves each txtfile="" property empty in the item file if we've generated matching text files for the image files. In the past, util::create_itemfile() only worked with images, but now it additionally deals with matching text files.
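
The pairing logic can be illustrated outside of perl. The sketch below is not util::create_itemfile() itself: the function name emit_page_entries is made up, the exact layout of a Page line in the item file is an assumption (only the txtfile property is named in this readme), and it assumes pages are numbered contiguously from 1:

```shell
# Hypothetical illustration of pairing each page image with its text file.
#   $1 = directory holding <number>.<ext> and <number>.txt files
#   $2 = image extension, e.g. jpg
# Prints one line per page; txtfile stays empty when no matching .txt exists.
# The <Page .../> line layout is an assumed format, for illustration only.
emit_page_entries() {
    dir=$1
    ext=$2
    n=1
    while [ -e "$dir/$n.$ext" ]; do
        txt=""
        if [ -e "$dir/$n.txt" ]; then
            txt="$n.txt"
        fi
        printf '<Page pagenum="%s" imgfile="%s" txtfile="%s"/>\n' "$n" "$n.$ext" "$txt"
        n=$((n + 1))
    done
}
```

Leaving txtfile empty for pages without a .txt file mirrors the old image-only behaviour for those pages.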


5. Remaining Issues.

a. Build messages (seen when the lower level PDF conversion commands are run):
ImageConverter: "/home/greenstone/gs3-svn-26Mar2018/gs2build/tmp/F143/A9-access-best-practices1.txt" too small, skipping
Same for: A9-access-best-practices56.txt


Lower level PDF conversion commands:
calling cmd "/usr/bin/perl" -S gsConvert.pl -verbose 2 -pdf_zoom 2 -errlog "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/tmp/1528179077/err.log" -output pagedimg_jpg "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/tmp/1528179077/A9-access-best-practices.pdf"
import.pl>


"/usr/bin/perl" -S gsConvert.pl -verbose 2 -pdf_zoom 2 -errlog "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/tmp/1528179077/err.log" -output pagedimg_jpg "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.pdf"


"/usr/bin/perl" -S pdfpstoimg.pl -convert_to jpg "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.pdf" "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices" > "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.out" 2> "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.err"


@@@@ pdfpstoimg.pl::pdfps_to_img(): cmd = "/usr/bin/perl" -S gs-magick.pl convert "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.pdf" "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices/A9-access-best-practices.jpg"


b. Have not yet renamed the pdf-box, its containing tarball and its jar to gs-pdf-box, since a lot of things, foreseen (nightly builds, releases, diffcol) and unforeseen, can break due to the name change.

c. Not sure what the licensing information for pdfbox/java/src/GS_PDFToImagesAndText.java should be in its preamble comment section, since most of the code is a copy or slight modification of the Apache PDFBox app's code.

__________________________________________________________
Unused, originally considered as starting points. Kathy's searches:

Google: pdfbox api extract text
https://stackoverflow.com/questions/23813727/how-to-extract-text-from-a-pdf-file-with-apache-pdfbox

Google: pdfbox extractimages
https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/tools/ExtractImages.html
https://stackoverflow.com/questions/8705163/extract-images-from-pdf-using-pdfbox