GS_PDFBox_README.txt

Last change on this file was 32279, checked in by ak19, 6 years ago
Adding details on the updates to the pdfbox extension's GS-README
File size: 11.7 KB

Line
1	__________________________________________________________
2	README for GS modifications to the PDFBox App.
3	__________________________________________________________
4
5	5 June 2018
6
7	PROBLEM:
8	PDFBoxConverter.pm used the PDFBox App (v 1.8.2 at the time) to allow us the option of either extracting text from the entire PDF or saving each page of the PDF as an image.
9	Kathy wanted the option to have both the PDF's pages as images and the text extracted for each page (if extractable), so that contents could be searched and viewed in image format. Kathy said the HTML output was ugly, so we only pursue txt extraction.
10
11	IDEA FOR SOLUTION:
12	Kathy had found that the PDFBox app had an API. She looked at the API documentation for ExtractImages.java, which had a main function.
13	She explained that our GS perl code was calling the main function of classes such as this in order to do PDF to txt or img conversion.
14
15	Instead of doing that, Kathy's idea was to modify the PDFBox app's ExtractImages.java into a GS version that would additionally extract text. The additional code could be copied in from the ExtractText.java class.
16
17	Note: In looking into this, I found that ExtractImages literally extracts any images in the PDF, rather than converting pages to images. The actual PDFBox class that our perl code calls to do the PDF pages to images conversion is PDFToImage.java.
18
19
20	CONTENTS
21	The following are the steps I found I needed to go through and which are described in this readme:
22	- upgrade to pdfbox app 2.0.9,
23	- check the new version didn't break the stuff for which we specifically settled on version 1.8.2 before
24	- grab the pdfbox 2.0.9 source code
25	- introduce the new class PDFBoxToImagesAndText.java (originally added to the pdfbox app 2.0.9 src locally, now maintained separately with a GS Java package name). The new class is based on Apache PDFBox's PDFToImage.java, with added code to extract text based on Apache PDFBox's ExtractText.java
26	Since 17 July 2018, this class also recognises 2 additional flags: -textOnly and -imagesOnly to support the new paged_text and the original pagedimg_<imgext> output formats, besides the recently introduced pagedimgtxt_<imgext> output format that outputs images and text for each page.
27	- OLD: rebuild the source, a maven project. Then modify the pdfbox app 2.0.9 (non source) ext jar file by including the new .class file
28	- compiling the new java class against the pdfbox-app.jar
29	- Since we now possibly generate a txt file for each img file (assuming there was extractable text), the generated .item file needed some modifications so that it refers to the txt files.
30	- work that remains to be done
31
32
33	1. Used to use pdfbox app version 1.8.2, see commit message at http://trac.greenstone.org/changeset/29029/gs2-extensions/pdf-box/trunk/java/lib/java/pdfbox-app.jar
34
35	2. Upgrading to pdfbox app 2.0.9 (downloaded from https://pdfbox.apache.org/download.cgi?Preferred=http%3A%2F%2Fwww-us.apache.org%2Fdist%2F):
36
37	It doesn't work out of the box, for example when extracting text from a PDF file.
38
39	But if references to org.apache.pdfbox.* in PDFBoxConverter.pm are changed to org.apache.pdfbox.tools.*, the pdf is still processed.
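
The package change described above can be sketched in shell as follows. This is only an illustration: pdfbox_tool_class is a hypothetical helper that is not part of Greenstone or PDFBox, and PDFBoxConverter.pm does the equivalent in Perl. It assumes the 1.8.x tools sit directly in org.apache.pdfbox and the 2.0.x tools in org.apache.pdfbox.tools.

```shell
#!/bin/sh
# Hypothetical helper: print the fully qualified class name of a PDFBox
# command-line tool for a given pdfbox-app version.
pdfbox_tool_class() {
    # $1 = pdfbox-app version (e.g. 1.8.2 or 2.0.9)
    # $2 = tool name (e.g. ExtractText or PDFToImage)
    case "$1" in
        1.8.*) echo "org.apache.pdfbox.$2" ;;
        2.0.*) echo "org.apache.pdfbox.tools.$2" ;;
        *)     echo "unhandled pdfbox-app version: $1" >&2; return 1 ;;
    esac
}

# Print (rather than run) a text extraction command for 2.0.9.
# in.pdf and out.txt are placeholder file names.
echo java -cp pdfbox-app.jar "$(pdfbox_tool_class 2.0.9 ExtractText)" in.pdf out.txt
```

Versions outside 1.8.x and 2.0.x are deliberately rejected, since this readme only covers those two.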
40
41	And the extracted text contains fi ("file") and fl ("workflow") instead of single character ligatures for these. Once again, refer to the commit message at http://trac.greenstone.org/changeset/29029/gs2-extensions/pdf-box/trunk/java/lib/java/pdfbox-app.jar
42
43
44	BUT we get to see this warning message now:
45	ReadTextFile: WARNING: /home/greenstone/gs3-svn-26Mar2018/gs2build/tmp/F996.html appears to be encoded in an unsupported encoding ("utf_8) - using utf8
46
47
48	-> In ReadTextFile.pm change:
49	if ($best_encoding eq "utf_8") { $best_encoding = "utf8" }
50	to:
51	if ($best_encoding eq "utf_8" || $best_encoding =~ /utf_8/) { $best_encoding = "utf8" }
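
Why the exact comparison was not enough can be shown with a small shell analogue of the Perl change. The warning above suggests that the detected encoding name arrives with a stray leading double quote, although that is an inference from the message rather than something checked in ReadTextFile.pm. normalise_encoding is a hypothetical name used only for this sketch.

```shell
#!/bin/sh
# Shell analogue of the modified Perl test: any encoding name that
# contains "utf_8" is mapped to "utf8", everything else passes through.
normalise_encoding() {
    case "$1" in
        *utf_8*) echo "utf8" ;;
        *)       echo "$1" ;;
    esac
}

# A name with a stray leading quote, as hinted at by the warning message,
# is not equal to "utf_8" but does contain it.
normalise_encoding '"utf_8'
normalise_encoding 'iso_8859_1'
```

Note that in the Perl version the pattern match on its own already covers the exact-match case, so the eq test is kept only for readability.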
52
53	3. To recompile the new PDFBoxToImagesAndText.java code:
54
55	- grab the svn version of the Greenstone pdfbox extension
56	- Then from the svn checked out pdfbox (trunk/java) folder, run
57	$ javac -cp `pwd`/lib/java/pdfbox-app.jar -d `pwd`/build src/org/greenstone/pdfbox/PDFBoxToImagesAndText.java
58
59	which will compile our custom PDFBoxToImagesAndText.java file against the pdfbox-app.jar in the classpath (-cp) and output the .class file into the directory denoted by -d
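
The compile step above could be wrapped in a small script along these lines. This is a sketch, not something that exists in the extension: compile_gs_pdfbox is a hypothetical helper. It creates the build folder first because some javac versions refuse a -d directory that does not exist yet.

```shell
#!/bin/sh
# Hypothetical helper: compile PDFBoxToImagesAndText.java against
# pdfbox-app.jar, writing the .class file under the build folder.
compile_gs_pdfbox() {
    # $1 = the svn checked out pdfbox trunk/java folder
    gs_jar="$1/lib/java/pdfbox-app.jar"
    gs_src="$1/src/org/greenstone/pdfbox/PDFBoxToImagesAndText.java"

    # Fail early with a readable message if the checkout is incomplete.
    if [ ! -f "$gs_jar" ] || [ ! -f "$gs_src" ]; then
        echo "missing $gs_jar or $gs_src" >&2
        return 1
    fi

    mkdir -p "$1/build" || return 1
    javac -cp "$gs_jar" -d "$1/build" "$gs_src"
}
```

It would be called as compile_gs_pdfbox "`pwd`" from the trunk/java folder.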
60
61	To run, that build folder needs to be on the classpath, besides pdfbox-app.jar itself. See PDFBoxConverter.pm
62
63	Example of a run command, where -textOnly is thrown in to generate paged_text (no images). Leave out -textOnly if an image should still be generated for each page, besides the page's text:
64
65	java -cp "GS3/gs2build/ext/pdf-box/lib/java/pdfbox-app.jar:GS3/gs2build/ext/pdf-box/build" org.greenstone.pdfbox.PDFBoxToImagesAndText -textOnly -outputPrefix "GS3/gs2build/tmp/F228.txt/ApacheLicencePDFA" "GS3/web/sites/localsite/collect/pdfv2/import/ApacheLicencePDFA.pdf"
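
How the output format relates to the flags in the run command can be sketched as below. The mapping follows the description of -textOnly and -imagesOnly in the CONTENTS section; gs_pdfbox_mode_flag is a hypothetical helper, and the real selection is done in Perl by PDFBoxConverter.pm.

```shell
#!/bin/sh
# Hypothetical helper: map a Greenstone output format onto the flag that
# PDFBoxToImagesAndText expects. pagedimgtxt_<imgext> needs no flag, as
# producing both images and text is the default.
gs_pdfbox_mode_flag() {
    case "$1" in
        paged_text)    echo "-textOnly" ;;
        pagedimgtxt_*) echo "" ;;
        pagedimg_*)    echo "-imagesOnly" ;;
        *)             echo "unknown output format: $1" >&2; return 1 ;;
    esac
}

# Both the jar and the build folder must be on the classpath.
GS_EXT="GS3/gs2build/ext/pdf-box"
GS_CP="$GS_EXT/lib/java/pdfbox-app.jar:$GS_EXT/build"

# Print (rather than run) the command. $gs_flag is left unquoted on
# purpose so that an empty flag disappears from the command line.
gs_flag=$(gs_pdfbox_mode_flag paged_text)
echo java -cp "$GS_CP" org.greenstone.pdfbox.PDFBoxToImagesAndText $gs_flag -outputPrefix "GS3/gs2build/tmp/F228.txt/ApacheLicencePDFA" "GS3/web/sites/localsite/collect/pdfv2/import/ApacheLicencePDFA.pdf"
```

The ":" classpath separator is the Unix one. On Windows it would be ";".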
66
67
68	4. For convenience, PDFBoxToImagesAndText.java no longer names its output files <PDF filename prefix><sequential number>.img-ext (jpg/gif/png) with a matchingly named txt file. It now generates just <sequential number>.img-ext and <sequential number>.txt.
69
70	Further, needed to modify GS perllib's util::create_itemfile(), so that it no longer leaves each txtfile="" property empty in the item file if we've generated matching text files for the image files. In the past, util::create_itemfile() only worked with images, but now it's modified to additionally deal with matching text files.
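
The behaviour described for util::create_itemfile() can be illustrated with the shell sketch below. It is not the Perl code itself. Only the txtfile="" property comes from this readme. The Page element, the pagenum and imgfile attribute names, and numbering pages from 1 are assumptions made for illustration, as is the helper name item_page_entries.

```shell
#!/bin/sh
# Hypothetical helper: print one item file entry per page, filling in
# txtfile only when a matching <sequential number>.txt was generated.
item_page_entries() {
    # $1 = directory holding the generated files
    # $2 = image extension (jpg, gif or png)
    # $3 = number of pages
    gs_n=1
    while [ "$gs_n" -le "$3" ]; do
        gs_txt=""
        if [ -f "$1/$gs_n.txt" ]; then
            gs_txt="$gs_n.txt"
        fi
        printf '<Page pagenum="%s" imgfile="%s" txtfile="%s"/>\n' \
            "$gs_n" "$gs_n.$2" "$gs_txt"
        gs_n=$((gs_n + 1))
    done
}
```

A page without extractable text keeps an empty txtfile="" value, which matches how the item file looked when only images were handled.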
71
72
73	5. No need to rename pdf-box, its containing tarball and its jar to gs-pdf-box. Doing that might have required changes to the nightly builds, releases and diffcol to keep them working.
74
75	However, Dr Bainbridge explained that the solution is to not rename pdfbox-app.jar to gs-pdfbox-app.jar, but have our new custom java file living separately rather than within the pdfbox-app.jar though compiled against it. The java file will have its own greenstone Java package (org.greenstone.pdfbox) and the class file produced from our custom class will then be run from its package by PDFBoxConverter.pm
76
77
78	6. I was not sure what the licensing information for pdfbox/java/src/PDFBoxToImagesAndText.java should be in its preamble comment section, since most of the code is a copy or slight modification of the Apache PDFBox app's code. I asked Dr Nichols about the licensing, and he pointed out that Apache's PDFBox pages referred to https://www.apache.org/licenses/LICENSE-2.0
79	Dr Nichols drew attention to its section "Redistribution", saying that that should cover things. I've tried to follow the points there, and have therefore included apache's LICENSE.txt and NOTICE.txt files into our pdfbox/java/src and pdfbox/java/src build folders where PDFBoxToImagesAndText.java and PDFBoxToImagesAndText.class, respectively, will live.
80
81	------------------
82
83
84	EXTRA INFORMATION
85
86	Older but useful information, as it covers how to use maven to build the pdfbox-app, and include a new java file to be built into the pdfbox-app and distribute the file within the app.
87
88	1. Originally I thought I would need to edit org.apache.pdfbox.tools.PDFToImage (not ExtractImages), as that's the tool that converts a PDF's individual pages to images. Eventually, since there was a new class PDFBoxToImagesAndText.java, Dr Bainbridge explained that this could just be compiled against an existing pdfbox-app.jar, rather than including it inside the jar and recompiling the pdfbox-app in entirety.
89
90	When I was still compiling the pdfbox-app in entirety, with and without my new code included therein, to obtain the basecode and build the app, the steps were:
91
92
93	- unzipped the src version of pdfbox 2.0.9
94	- downloaded and unzipped maven
95	Got it from https://maven.apache.org/download.cgi
96	- added maven/bin to PATH and set JAVA_TOOL_OPTIONS as below* in setenv.sh then sourced setenv.sh (instructions at https://maven.apache.org/install.html)
97	export PATH=$JAVA_HOME/bin:/home/greenstone/apache-maven-3.5.3/bin:$PATH
98	export JAVA_TOOL_OPTIONS="-Dhttps.protocols=TLSv1.2"
99
100	(JAVA_TOOL_OPTIONS needs to be set as above because when using maven to build pdfbox src without it, saw mvn fail to build with the error "SSLException: Received fatal alert: protocol_version" explained at: https://github.com/jenkinsci/ghprb-plugin/issues/638)
101
102	- As per pdfbox 2.0.9 src's README.md, the build command is:
103	mvn clean install
104
105	Run that command after first going into the extracted pdfbox-2.0.9 src code's folder, wherever this is extracted, e.g.
106	greenstone@bedrock:~/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9
107
108	- The source code to edit is at /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/src/main/java/org/apache/pdfbox/tools
109	Create PDFBoxToImagesAndText.java based on PDFToImage.java, and incorporate the text extraction code from ExtractText.java into it.
110
111	- When built, this produces /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/target/classes/org/apache/pdfbox/tools
112	Put that into the unzipped pdfbox app 2.0.9 version (as opposed to the unzipped pdfbox src 2.0.9)
113
114
115	2. Once the pdfbox-app has been entirely built the first time, want smaller rebuilds.
116
117	Final section of output of full rebuild:
118
119
120	[INFO] ------------------------------------------------------------------------
121	[INFO] Reactor Summary:
122	[INFO]
123	[INFO] PDFBox parent ...................................... SUCCESS [ 1.457 s]
124	[INFO] Apache FontBox ..................................... SUCCESS [ 16.135 s]
125	[INFO] Apache XmpBox ...................................... SUCCESS [ 7.046 s]
126	[INFO] Apache PDFBox ...................................... SUCCESS [01:37 min]
127	[INFO] Apache Preflight ................................... SUCCESS [ 20.115 s]
128	[INFO] Apache Preflight application ....................... SUCCESS [ 8.270 s]
129	[INFO] Apache PDFBox Debugger ............................. SUCCESS [ 2.246 s]
130	[INFO] Apache PDFBox tools ................................ SUCCESS [ 8.567 s]
131	[INFO] Apache PDFBox application .......................... SUCCESS [ 7.401 s]
132	[INFO] Apache PDFBox Debugger application ................. SUCCESS [ 7.184 s]
133	[INFO] Apache PDFBox examples ............................. SUCCESS [ 15.077 s]
134	[INFO] PDFBox reactor 2.0.9 ............................... SUCCESS [ 0.148 s]
135	[INFO] ------------------------------------------------------------------------
136	[INFO] BUILD SUCCESS
137	[INFO] ------------------------------------------------------------------------
138	[INFO] Total time: 03:12 min
139	[INFO] Finished at: 2018-06-05T16:03:27+12:00
140	[INFO] ------------------------------------------------------------------------
141
142
143
144	The maven module in pdfbox-app to actually build: "Apache PDFBox tools", as this contains ExtractText.java, PDFToImage.java and the new class we're going to put in there.
145
146	To build a project component ("module") in maven, see
147	- https://maven.apache.org/guides/mini/guide-multiple-modules.html
148	- https://stackoverflow.com/questions/1114026/maven-modules-building-a-single-specific-module
149	- https://stackoverflow.com/questions/23075415/maven-build-second-level-child-projects-using-reactor-option-pl
150
151	So since we want to build the module "tools" (and the modules it depends on), we do:
152	mvn [clean] install -pl tools -am
153
154	Again, run the above wherever the pdfbox app's src code v 2.0.9 is extracted:
155	greenstone@bedrock:~/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9$ mvn clean install -pl tools -am
156
157	(This will also compile our new PDFBoxToImagesAndText.java and put its class file PDFBoxToImagesAndText.class into the target dir /home/greenstone/gs3-svn-26Mar2018/gs2build/ext/pdfbox-2.0.9/tools/target/classes/org/apache/pdfbox/tools)
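
The smaller rebuild, followed by the OLD approach of adding the class to the app jar, could be scripted roughly as below. Both helper names are hypothetical. Using "jar uf" is one possible way of including the .class file in the jar and is an assumption, since the exact command originally used is not recorded here. In the mvn command, -pl selects the module to build and -am also builds the modules it depends on.

```shell
#!/bin/sh
# Path of the custom class, relative to tools/target/classes.
GS_CLASS_REL="org/apache/pdfbox/tools/PDFBoxToImagesAndText.class"

# Hypothetical helper: rebuild only the "tools" module and check that
# the custom class was produced. Prints the class file's path on success.
rebuild_tools_module() {
    # $1 = root of the extracted pdfbox 2.0.9 source tree
    (cd "$1" && mvn install -pl tools -am) || return 1
    if [ ! -f "$1/tools/target/classes/$GS_CLASS_REL" ]; then
        echo "not built: $1/tools/target/classes/$GS_CLASS_REL" >&2
        return 1
    fi
    echo "$1/tools/target/classes/$GS_CLASS_REL"
}

# Hypothetical helper (assumed method): add the built class to an
# existing pdfbox-app.jar, keeping its package path inside the jar.
add_class_to_app_jar() {
    # $1 = root of the source tree, $2 = path to pdfbox-app.jar
    jar uf "$2" -C "$1/tools/target/classes" "$GS_CLASS_REL"
}
```

With the current setup neither step is needed any more, as the class now lives in org.greenstone.pdfbox and is only compiled against pdfbox-app.jar.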
158
159	__________________________________________________________
160	Unused, originally considered as starting points. Kathy's searches:
161
162	Google: pdfbox api extract text
163	https://stackoverflow.com/questions/23813727/how-to-extract-text-from-a-pdf-file-with-apache-pdfbox
164
165	Google: pdfbox extractimages
166	https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/tools/ExtractImages.html
167	https://stackoverflow.com/questions/8705163/extract-images-from-pdf-using-pdfbox

source: gs2-extensions/pdf-box/trunk/GS_PDFBox_README.txt