source: gs2-extensions/pdf-box/trunk/GS_PDFBox_README.txt@ 32194

Last change on this file since 32194 was 32194, checked in by ak19, 6 years ago

Minor fixes to Readme

__________________________________________________________
README for GS modifications to the PDFBox App.
__________________________________________________________

5 June 2018

PROBLEM:
PDFBoxConverter.pm used the PDFBox App (v 1.8.2 at the time) to give us the option of either extracting text from the entire PDF or saving each page of the PDF as an image.
Kathy wanted the option of having both: the PDF's pages as images and also the text extracted for each page (if extractable), so that the contents could be searched while being viewed in image format. Kathy said the HTML output was ugly, so we only pursue txt extraction.

IDEA FOR SOLUTION:
Kathy had found that the PDFBox app had an API. She looked at the API documentation for ExtractImages.java, which had a main function.
She explained that our GS perl code was calling the main function of classes such as this in order to do PDF to txt or img conversion.

Instead of doing that, Kathy's idea was to modify the PDFBox app's ExtractImages.java into a GS version that would additionally extract text. The additional code could be copied in from the ExtractText.java class.

Note: In looking into this, I found that ExtractImages literally extracts any images in the PDF, rather than converting pages to images. The actual PDFBox class that our perl code calls to do the PDF pages to images conversion is PDFToImage.java.


CONTENTS
The following are the steps I found I needed to go through, and which are described in this readme:
- upgrade to pdfbox app 2.0.9,
- check that the new version didn't break the things for which we had specifically settled on version 1.8.2 before,
- grab the pdfbox 2.0.9 source code,
- introduce the new class GS_PDFToImagesAndText.java into the pdfbox app 2.0.9 src, based on its PDFToImage.java with added code to extract text based on ExtractText.java,
- rebuild the source, which is a maven project,
- modify the pdfbox app 2.0.9 (non source) ext jar file by including the new .class file,
- since we're now possibly generating a txt file for each img file (assuming there was extractable txt), some modifications needed to be made to the generated .item file to ensure it referred to the txt files,
- work that remains to be done.


0. We used to use pdfbox app version 1.8.2, see the commit message at http://trac.greenstone.org/changeset/29029/gs2-extensions/pdf-box/trunk/java/lib/java/pdfbox-app.jar

1. Upgrading to pdfbox app 2.0.9 (downloaded from https://pdfbox.apache.org/download.cgi?Preferred=http%3A%2F%2Fwww-us.apache.org%2Fdist%2F):

It doesn't work out of the box, for example when extracting text from a PDF file.

But if references to org.apache.pdfbox.* in PDFBoxConverter.pm are changed to org.apache.pdfbox.tools.*, the PDF is processed as before.

And the extracted text still contains fi (as in "file") and fl (as in "workflow") rather than single character ligatures for these. Once again, refer to the commit message at http://trac.greenstone.org/changeset/29029/gs2-extensions/pdf-box/trunk/java/lib/java/pdfbox-app.jar

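A minimal sketch of the package rename described above, written as a text filter. The function name and the sample input line are made up for illustration and are not copied from PDFBoxConverter.pm; only the two tool classes named in this readme (ExtractText and PDFToImage) are handled.

```shell
# Hypothetical helper: rewrite 1.8.x-style references to the PDFBox tool
# classes so that they point at the org.apache.pdfbox.tools package instead.
# References that already use the tools package are left untouched.
rename_pdfbox_tools() {
    sed -E 's/org\.apache\.pdfbox\.(ExtractText|PDFToImage)/org.apache.pdfbox.tools.\1/g'
}

# Illustrative input only (not a line from PDFBoxConverter.pm)
echo 'java -cp pdfbox-app.jar org.apache.pdfbox.ExtractText in.pdf out.txt' | rename_pdfbox_tools
```

Because the pattern requires the class name to follow "org.apache.pdfbox." directly, running the filter a second time over already converted text changes nothing.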

BUT we now get to see this warning message:
 ReadTextFile: WARNING: /home/greenstone/gs3-svn-26Mar2018/gs2build/tmp/F996.html appears to be encoded in an unsupported encoding ("utf_8) - using utf8


-> In ReadTextFile.pm change:
 if ($best_encoding eq "utf_8") { $best_encoding = "utf8" }
to:
 if ($best_encoding eq "utf_8" || $best_encoding =~ /utf_8/) { $best_encoding = "utf8" }

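To illustrate what the widened test does, here is a small shell mirror of it (a sketch only, the real change is the Perl line above). It maps any detected name that contains "utf_8" to "utf8" and passes everything else through. The function name is hypothetical, and the second sample value, with a stray leading quote, is only a guess at what the detector returned, based on the warning message.

```shell
# Shell mirror of the Perl condition: a substring match on "utf_8"
# (which also covers the exact match) selects "utf8".
normalise_encoding() {
    case "$1" in
        *utf_8*) printf '%s\n' "utf8" ;;
        *)       printf '%s\n' "$1" ;;
    esac
}

normalise_encoding 'utf_8'
normalise_encoding '"utf_8'       # guessed value with a stray leading quote
normalise_encoding 'iso_8859_1'   # unrelated names pass through unchanged
```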

2. Need to edit org.apache.pdfbox.tools.PDFToImage (not ExtractImages), as that's the tool that converts a PDF's individual pages to images.


- unzipped the src version of pdfbox 2.0.9
- downloaded and unzipped maven
 Got it from https://maven.apache.org/download.cgi
- added maven/bin to PATH and set JAVA_TOOL_OPTIONS as below in setenv.sh, then sourced setenv.sh (instructions at https://maven.apache.org/install.html)
 export PATH=$JAVA_HOME/bin:/home/greenstone/apache-maven-3.5.3/bin:$PATH
 export JAVA_TOOL_OPTIONS="-Dhttps.protocols=TLSv1.2"

(JAVA_TOOL_OPTIONS needs to be set as above because, when using maven to build the pdfbox src without it, mvn failed to build with the error "SSLException: Received fatal alert: protocol_version", explained at: https://github.com/jenkinsci/ghprb-plugin/issues/638)

- As per the pdfbox 2.0.9 src's README.md, the build command is:
 mvn clean install

Run that command after first going into the extracted pdfbox-2.0.9 src code's folder, wherever this is extracted, e.g.
 greenstone@bedrock:~/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9

- The source code to edit is at /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/src/main/java/org/apache/pdfbox/tools
Create GS_PDFToImagesAndText.java there, based on PDFToImage.java, and incorporate into it the text extraction code from ExtractText.java

- When built, the class files end up in /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/target/classes/org/apache/pdfbox/tools
Put the new class file from there into the unzipped pdfbox app 2.0.9 version (as opposed to the unzipped pdfbox src 2.0.9)

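The copy step just described could be scripted roughly as follows. This is a sketch under assumptions: the function name is made up, the location of the unzipped pdfbox app is whatever you chose when unzipping it, and only the single class file named in this readme is copied (if the compiler also produced GS_PDFToImagesAndText$*.class files, those would need copying too).

```shell
# Copy the freshly compiled class from the maven build output into an
# unzipped copy of the pdfbox app jar, preserving the package directory layout.
# $1 = classes dir of the tools module (e.g. .../pdfbox-2.0.9/tools/target/classes)
# $2 = directory into which the pdfbox app jar was unzipped (your choice)
install_gs_class() {
    classes_dir=$1
    app_dir=$2
    rel=org/apache/pdfbox/tools/GS_PDFToImagesAndText.class

    if [ ! -f "$classes_dir/$rel" ]; then
        echo "missing $classes_dir/$rel (has the tools module been built?)" >&2
        return 1
    fi
    mkdir -p "$app_dir/org/apache/pdfbox/tools" &&
        cp "$classes_dir/$rel" "$app_dir/$rel"
}
```

It would be called as: install_gs_class <path to tools/target/classes> <path to unzipped pdfbox app>. Zipping the directory back up into the jar is not shown here.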

3. Once the pdfbox-app has been entirely built the first time, we want smaller rebuilds.

Final section of the output of a full rebuild:


[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] PDFBox parent ...................................... SUCCESS [ 1.457 s]
[INFO] Apache FontBox ..................................... SUCCESS [ 16.135 s]
[INFO] Apache XmpBox ...................................... SUCCESS [ 7.046 s]
[INFO] Apache PDFBox ...................................... SUCCESS [01:37 min]
[INFO] Apache Preflight ................................... SUCCESS [ 20.115 s]
[INFO] Apache Preflight application ....................... SUCCESS [ 8.270 s]
[INFO] Apache PDFBox Debugger ............................. SUCCESS [ 2.246 s]
[INFO] Apache PDFBox tools ................................ SUCCESS [ 8.567 s]
[INFO] Apache PDFBox application .......................... SUCCESS [ 7.401 s]
[INFO] Apache PDFBox Debugger application ................. SUCCESS [ 7.184 s]
[INFO] Apache PDFBox examples ............................. SUCCESS [ 15.077 s]
[INFO] PDFBox reactor 2.0.9 ............................... SUCCESS [ 0.148 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:12 min
[INFO] Finished at: 2018-06-05T16:03:27+12:00
[INFO] ------------------------------------------------------------------------



The maven module in pdfbox-app that we actually need to build is "Apache PDFBox tools", as this contains ExtractText.java, PDFToImage.java and the new class we're going to put in there.

To build a project component ("module") in maven, see
- https://maven.apache.org/guides/mini/guide-multiple-modules.html
- https://stackoverflow.com/questions/1114026/maven-modules-building-a-single-specific-module
- https://stackoverflow.com/questions/23075415/maven-build-second-level-child-projects-using-reactor-option-pl

So since we want to build the module "tools" (and its dependencies), we do:
 mvn [clean] install -pl tools -am
(-pl selects the module to build, and -am also builds the modules it depends on.)

Again, run the above wherever the pdfbox app's src code v 2.0.9 is extracted:
 greenstone@bedrock:~/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9$ mvn clean install -pl tools -am

(This will also compile our new GS_PDFToImagesAndText.java and put its class file GS_PDFToImagesAndText.class into the target dir /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/target/classes/org/apache/pdfbox/tools)


4. For convenience, GS_PDFToImagesAndText.java no longer generates <PDF filename prefix><sequential number>.img-ext (jpg/gif/png) and a matchingly named txt file, but just <sequential number>.img-ext and <sequential number>.txt.

Further, GS perllib's util::create_itemfile() needed to be modified so that it no longer leaves each txtfile="" property empty in the item file if we've generated matching text files for the image files. In the past, util::create_itemfile() only worked with images, but now it's modified to additionally deal with matching text files.

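The pairing logic can be sketched in shell as below. This is not the code in util::create_itemfile(), which is Perl, and the output line layout is illustrative only rather than the real item file format. The function name is made up, and the sketch assumes the page files are numbered 1, 2, 3, ... without gaps.

```shell
# For each page image <n>.<ext> in a directory, print the image name together
# with the matching <n>.txt if it exists, or an empty txtfile value if not.
list_page_pairs() {
    dir=$1
    ext=$2   # image extension without the dot, e.g. jpg
    n=1
    # Assumption: numbering starts at 1 and has no gaps.
    while [ -f "$dir/$n.$ext" ]; do
        if [ -f "$dir/$n.txt" ]; then
            txt="$n.txt"
        else
            txt=""
        fi
        printf 'imgfile="%s" txtfile="%s"\n' "$n.$ext" "$txt"
        n=$((n + 1))
    done
}
```

For a directory containing 1.jpg, 1.txt and 2.jpg, calling list_page_pairs <dir> jpg prints one line for page 1 with txtfile="1.txt" and one line for page 2 with txtfile="".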

5. Remaining Issues.

a. Build messages (when running the lower level PDF conversion commands):
ImageConverter: "/home/greenstone/gs3-svn-26Mar2018/gs2build/tmp/F143/A9-access-best-practices1.txt" too small, skipping
Same for: A9-access-best-practices56.txt


Lower level PDF conversion commands:
calling cmd "/usr/bin/perl" -S gsConvert.pl -verbose 2 -pdf_zoom 2 -errlog "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/tmp/1528179077/err.log" -output pagedimg_jpg "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/tmp/1528179077/A9-access-best-practices.pdf"
import.pl>


"/usr/bin/perl" -S gsConvert.pl -verbose 2 -pdf_zoom 2 -errlog "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/tmp/1528179077/err.log" -output pagedimg_jpg "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.pdf"


"/usr/bin/perl" -S pdfpstoimg.pl -convert_to jpg "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.pdf" "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices" > "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.out" 2> "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.err"


@@@@ pdfpstoimg.pl::pdfps_to_img(): cmd = "/usr/bin/perl" -S gs-magick.pl convert "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices.pdf" "/home/greenstone/gs3-svn-26Mar2018/web/sites/localsite/collect/pdftst1/import/A9-access-best-practices/A9-access-best-practices.jpg"


b. Have not yet renamed the pdf-box extension, its containing tarball and its jar to gs-pdf-box, since a lot of things foreseen (nightly builds, releases, diffcol) and unforeseen can break due to the name change.

c. Not sure what the licensing information for pdfbox/java/src/GS_PDFToImagesAndText.java should be in its preamble comment section, since most of the code is a copy or slight modification of the Apache PDFBox app's code.

__________________________________________________________
Unused, originally considered as starting points. Kathy's searches:

Google: pdfbox api extract text
https://stackoverflow.com/questions/23813727/how-to-extract-text-from-a-pdf-file-with-apache-pdfbox

Google: pdfbox extractimages
https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/tools/ExtractImages.html
https://stackoverflow.com/questions/8705163/extract-images-from-pdf-using-pdfbox