PDFBoxConverter.pm@ 32193

Last change on this file since 32193 was 32193, checked in by ak19, 6 years ago

All the *essential* changes related to the PDFBox modifications Kathy asked for.

The PDFBox app used to be used to generate either images for every PDF page or extracted txt from the PDF. Kathy ideally wanted to produce paged images with extracted text, where available, so that the pages would be searchable: images AND extracted text. Her idea was to modify the pdfbox app code to do it: a new class, based on the existing one that generated the images for each page, that would (based on Kathy's answers to my questions) additionally extract the text of each page, so that txt search results match the correct img page presented.

Might as well upgrade the pdfbox app version our GS code used. After testing that the latest version (2.09) did not have any of the issues for which we previously settled on v 1.8.2 (lower than the then most up-to-date version), the necessary code changes were made. All of these are documented in the newly included GS_PDFBox_README.txt.

The new java file is called GS_PDFToImagesAndText.java and is located in the new java/src subfolder. It needs to be put into the pdfbox app 2.09 *src* code to be built, and the generated class file should then be copied into java/lib/java/pdfbox-app.jar, all as explained in GS_PDFBox_README.txt.

Other files modified for the changes requested by Kathy are PDFBoxConverter.pm, to refer to our new class and its new java package location (packages have changed in 2.09), and util.pm's create_itemfile() function, which may now additionally deal with txt files matching each img file generated. (Not committing the minor adjustment to ReadTextFile.pm that prevents a warning, as my fix seems hacky, but the fix is described in the Readme.) The pdfbox ext zip/tarballs were also modified to contain the changed PDFBoxConverter.pm and the pdfbox-app jar file for 2.09 with our custom new class file.

Have not yet renamed anything to gs-pdfbox-app, as there will be flow-on effects elsewhere as described in the Readme; can do all this in a separate commit.
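As a rough illustration of the paged image-plus-text mode described above, the sketch below assembles a command line for the custom class in the same way convert() does in the listing. The helper name build_img_convert_cmd and the sample paths are hypothetical and only for illustration; the class name and the -imageType and -outputPrefix options are the ones that appear in PDFBoxConverter.pm itself.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Greenstone): build the command that runs
# the custom PDFBox class on one PDF, producing per-page images and text.
sub build_img_convert_cmd {
    my ($java, $jar, $image_type, $output_prefix, $pdf_path) = @_;

    my $cmd = "$java -cp \"$jar\" org.apache.pdfbox.tools.GS_PDFToImagesAndText";
    $cmd .= " -imageType $image_type";            # gif, jpg or png
    $cmd .= " -outputPrefix \"$output_prefix\"";  # output dir plus PDF basename
    $cmd .= " \"$pdf_path\"";                     # quoted in case of spaces
    return $cmd;
}

# Sample (made-up) paths, purely to show the shape of the command.
my $cmd = build_img_convert_cmd("java", "/opt/pdfbox/lib/java/pdfbox-app.jar",
                                "png", "/tmp/out/report", "/docs/report.pdf");
print "$cmd\n";
```

Quoting the jar, prefix and source paths guards against spaces in file names, which is presumably why the plugin does the same.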

File size: 11.6 KB

###########################################################################
#
# PDFBoxConverter - helper plugin that does pdf document conversion with PDFBox
#
# A component of the Greenstone digital library software
# from the New Zealand Digital Library Project at the
# University of Waikato, New Zealand.
#
# Copyright (C) 2010 New Zealand Digital Library Project
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
###########################################################################
package PDFBoxConverter;

use BaseMediaConverter;

use strict;
no strict 'refs'; # allow filehandles to be variables and vice versa
no strict 'subs'; # allow barewords (eg STDERR) as function arguments

#use HTML::Entities; # for encoding characters into their HTML entities when PDFBox converts to text

use File::Basename; # basename() and fileparse() are used in convert()
use gsprintf 'gsprintf';
use FileUtils;
use util; # get_java_command(), get_tmp_filename() and create_itemfile()

# these two variables mustn't be initialised here or they will get stuck
# at those values.
our $pdfbox_conversion_available;
our $no_pdfbox_conversion_reason;

BEGIN {
    @PDFBoxConverter::ISA = ('BaseMediaConverter');

    # Check that PDFBox is installed and available on the path
    $pdfbox_conversion_available = 1;
    $no_pdfbox_conversion_reason = "";

    if (!defined $ENV{'GEXT_PDFBOX'}) {
        $pdfbox_conversion_available = 0;
        $no_pdfbox_conversion_reason = "gextpdfboxnotinstalled";
    }
    else {
        my $gextpb_home = $ENV{'GEXT_PDFBOX'};
        my $pbajar = &FileUtils::filenameConcatenate($gextpb_home,"lib","java","pdfbox-app.jar");

        if (!-e $pbajar) {
            &gsprintf(STDERR,"**** Failed to find $pbajar\n");
            $pdfbox_conversion_available = 0;
            $no_pdfbox_conversion_reason = "gextpdfboxjarnotinstalled";
        }
        else {
            # test to see if java is in path
            # Need to run java -version instead of just java, since the %ERRORLEVEL% returned
            # for `java` (which is checked below for failure of the command) is 0 for JDK 1.6*
            # while %ERRORLEVEL% is 1 for JDK 1.7*
            # If `java -version` is run however, %ERRORLEVEL% returned is 0 if java is
            # installed, regardless of whether the JDK version is 1.6* or 1.7*.
            my $java = &util::get_java_command();

            my $cmd = "$java -version";
            if ($ENV{'GSDLOS'} =~ /^windows/i) {
                $cmd .= " >nul 2>&1"; # java 2>&1 >nul or java >nul 2>&1 both work (%ERRORLEVEL% is 0)
            }
            else {
                # On Ubuntu, java >/dev/null 2>&1 works,
                # but java 2>&1 >/dev/null doesn't work: output goes to screen anyway
                $cmd .= " >/dev/null 2>&1"; # " >/dev/null 2>&1 &" - don't need & at end for Linux Centos anymore (Ubuntu was already fine without it)
            }

            my $status = system($cmd);

            if ($status != 0) {

                my $error_message = "**** Testing for java\n";
                $error_message .= "Failed to run: $cmd\n";
                $error_message .= "Error variable: |$!| and status: $status\n";

                &gsprintf(STDERR, "PDFBoxConverter: $error_message");

                $pdfbox_conversion_available = 0;
                $no_pdfbox_conversion_reason = "couldnotrunjava";
            }
        }
    }

}

my $arguments = [ ];

my $options = { 'name'     => "PDFBoxConverter",
                'desc'     => "{PDFBoxConverter.desc}",
                'abstract' => "yes",
                'inherits' => "yes",
                'args'     => $arguments };

sub new {
    my ($class) = shift (@_);
    my ($pluginlist,$inputargs,$hashArgOptLists,$auxilary) = @_;
    push(@$pluginlist, $class);

    push(@{$hashArgOptLists->{"ArgList"}},@{$arguments});
    push(@{$hashArgOptLists->{"OptList"}},$options);


    my $self = new BaseMediaConverter($pluginlist, $inputargs,
                                      $hashArgOptLists, $auxilary);

    if ($self->{'info_only'}) {
        # don't worry about any options etc
        return bless $self, $class;
    }
    if ($pdfbox_conversion_available) {
        my $gextpb_home = $ENV{'GEXT_PDFBOX'};
        my $pbajar = &FileUtils::filenameConcatenate($gextpb_home,"lib","java","pdfbox-app.jar");
        my $java = &util::get_java_command();
        my $launch_cmd = "$java -cp \"$pbajar\" -Dline.separator=\"<br />\" org.apache.pdfbox.tools.ExtractText";

        $self->{'pdfbox_launch_cmd'} = $launch_cmd;
        #$self->{'pdfbox_img_launch_cmd'} = "java -cp \"$pbajar\" org.apache.pdfbox.tools.PDFToImage"; # pdfbox 2.09 cmd for converting each PDF page to an image (gif, jpg, png)
        # Now: use this cmd to launch our new custom PDFBox class (GS_PDFToImagesAndText.java) to convert each PDF page into an image (gif, jpg, png)
        # AND its extracted text. An item file is still generated, but this time referring to txtfiles too, not just the images. Result: searchable paged output.
        # Use the same java command as the text/html launch command above, rather than a bare "java" from the path.
        $self->{'pdfbox_img_launch_cmd'} = "$java -cp \"$pbajar\" org.apache.pdfbox.tools.GS_PDFToImagesAndText";
    }
    else {
        $self->{'no_pdfbox_conversion_reason'} = $no_pdfbox_conversion_reason;

        my $outhandle = $self->{'outhandle'};
        &gsprintf($outhandle, "PDFBoxConverter: {PDFBoxConverter.noconversionavailable} ({PDFBoxConverter.$no_pdfbox_conversion_reason})\n");
    }

    $self->{'pdfbox_conversion_available'} = $pdfbox_conversion_available;

    return bless $self, $class;

}

sub init {
    my $self = shift(@_);
    my ($verbosity, $outhandle, $failhandle) = @_;

    # start with an empty array ref; assigning () in scalar context would just store undef
    $self->{'pbtmp_file_paths'} = [];
}

sub deinit {
    my $self = shift(@_);

    $self->clean_up_temporary_files();
}


sub convert {
    my $self = shift(@_);
    my ($source_file_full_path, $target_file_type) = @_;

    return 0 unless $pdfbox_conversion_available;
    # check the filename
    return 0 if ( !-f $source_file_full_path);

    my $img_output_mode = 0;

    # the following line is necessary to avoid 'uninitialised variable' error
    # messages concerning the converted_to member variable when PDFPlugin's
    # use_sections option is checked.
    # PDFBox plugin now processes use_sections option, when working with v1.5.0
    # of the PDFBox jar file (which embeds each page in special <div> tags).
    if ($target_file_type eq "html") {
        $self->{'converted_to'} = "HTML";
    } elsif ($target_file_type eq "jpg" || $target_file_type eq "gif" || $target_file_type eq "png") {
        $self->{'converted_to'} = $target_file_type;
        $img_output_mode = 1;
    } else {
        $self->{'converted_to'} = "text";
    }

    my $outhandle = $self->{'outhandle'};
    my $verbosity = $self->{'verbosity'};

    my $source_file_no_path = &File::Basename::basename($source_file_full_path);
    # Determine the full name and path of the output file
    my $target_file_path;
    if ($self->{'enable_cache'}) {
        $self->init_cache_for_file($source_file_full_path);
        my $cache_dir = $self->{'cached_dir'};
        my $file_root = $self->{'cached_file_root'};
        #$file_root .= "_$convert_id" if ($convert_id ne "");

        # append the output filetype suffix only for non-image output formats, since for
        # images we can be outputting multiple image files per single PDF input file
        my $target_file = $img_output_mode ? "$file_root" : "$file_root.$target_file_type";

        $target_file_path = &FileUtils::filenameConcatenate($cache_dir,$target_file);
    }
    else {
        # this is in gsdl/tmp. get a tmp filename in collection instead???
        $target_file_path = &util::get_tmp_filename($target_file_type);

        # for image files, remove the suffix, since we can have many output image files
        # per input PDF (one img for each page of the PDF, for example)
        if($img_output_mode) {
            $target_file_path =~ s/\.[^.]*$//g;
            if(!&FileUtils::directoryExists($target_file_path)) {
                mkdir($target_file_path);
            }

            # once the item file for the imgs has been created, need to adjust target_file_path

            # below, we'll store the dir just created to pbtmp_file_paths, so all imgs and the
            # item file generated in it can be deleted in one go on clean_up
        }

        push(@{$self->{'pbtmp_file_paths'}}, $target_file_path);
    }

    # Generate and run the convert command
    my $convert_cmd = "";

    # want the filename without extension, because any images
    # are to be generated with the same filename as the PDF
    my ($tailname, $dirname, $suffix) = &File::Basename::fileparse($source_file_full_path, "\\.[^\\.]+\$");

    if($img_output_mode) { # converting to images
        my $output_prefix = &FileUtils::filenameConcatenate($target_file_path, $tailname);

        $convert_cmd = $self->{'pdfbox_img_launch_cmd'};
        $convert_cmd .= " -imageType $target_file_type";
        $convert_cmd .= " -outputPrefix \"$output_prefix\"";
        $convert_cmd .= " \"$source_file_full_path\"";

    } else { # html or text
        $convert_cmd = $self->{'pdfbox_launch_cmd'};
        $convert_cmd .= " -html" if ($target_file_type eq "html");
        $convert_cmd .= " \"$source_file_full_path\" \"$target_file_path\"";
    }

    if ($verbosity>2) {
        &gsprintf($outhandle,"Convert command: $convert_cmd\n");
    }

    my $print_info = { 'message_prefix' => "PDFBox Conversion",
                       'message' => "Converting $source_file_no_path to: $target_file_type" };
    # $print_info->{'cache_mode'} = $cache_mode if ($cache_mode ne "");

    my ($regenerated,$result,$had_error)
        = $self->autorun_general_cmd($convert_cmd,$source_file_full_path, $target_file_path,$print_info);

    if($img_output_mode) {
        # now the images have been generated, generate the "$target_file_path/tailname.item"
        # item file for them, which is also the target_file_path that needs to be returned
        $target_file_path = &util::create_itemfile($target_file_path, $tailname, $target_file_type);
        #print STDERR "**** item file: $target_file_path\n";
    }
    elsif ($self->{'converted_to'} eq "text") {
        # ensure html entities are doubly escaped for pdfbox to text conversion: & -> &amp;
        # conversion to html does it automatically, but conversion to text doesn't
        # and this results in illegal characters in doc.xml

        my $fulltext = &FileUtils::readUTF8File($target_file_path);
        if(defined $fulltext) {
            #$fulltext = &HTML::Entities::encode($fulltext); # doesn't seem to help
            $fulltext =~ s@&@&amp;@sg; # Kathy's fix to ensure doc contents don't break XML
            &FileUtils::writeUTF8File($target_file_path, \$fulltext);
        } else {
            print STDERR "PDFBoxConverter::convert(): Unable to read from converted file\n";
            $had_error = 1;
        }
    }

    if ($had_error) {
        return (0, $result,$target_file_path);
    }
    return (1, $result,$target_file_path);
}

sub convert_without_result {
    my $self = shift(@_);

    my $source_file_path = shift(@_);
    my $target_file_type = shift(@_);
    my $convert_options = shift(@_) || "";
    my $convert_id = shift(@_) || "";

    return $self->convert($source_file_path,$target_file_type,
                          $convert_options,$convert_id,"without_result");
}

sub clean_up_temporary_files {
    my $self = shift(@_);

    foreach my $pbtmp_file_path (@{$self->{'pbtmp_file_paths'}}) {
        if (-d $pbtmp_file_path) {
            #print STDERR "@@@@@@ cleanup called on $pbtmp_file_path\n";
            &FileUtils::removeFilesRecursive($pbtmp_file_path);
        }
        elsif (-e $pbtmp_file_path) {
            &FileUtils::removeFiles($pbtmp_file_path);
        }
    }

    # reset to an empty array ref; assigning () in scalar context would just store undef
    $self->{'pbtmp_file_paths'} = [];
}


1;
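
The text branch of convert() escapes every ampersand so that the extracted text cannot break doc.xml. A minimal standalone sketch of that substitution follows; the helper name escape_ampersands is hypothetical and is not part of the plugin. Note that text which already contains an entity is escaped again, which matches the "doubly escaped" wording in the comment above the substitution.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the substitution in convert():
# every literal '&' becomes '&amp;', including those that start an entity.
sub escape_ampersands {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    return $text;
}

print escape_ampersands("Fish & Chips"), "\n";   # Fish &amp; Chips
print escape_ampersands("a &lt; b"), "\n";       # a &amp;lt; b
```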


source: gs2-extensions/pdf-box/trunk/java/perllib/plugins/PDFBoxConverter.pm@ 32193
