source: main/trunk/greenstone2/perllib/util.pm@ 32193

Last change on this file since 32193 was 32193, checked in by ak19, 6 years ago

All the *essential* changes related to the PDFBox modifications Kathy asked for. The PDFBox app used to be used to generated either images for every PDF page or extract txt from the PDF. Kathy wanted to ideally produce paged images with extracted text, where available, so that this would be searchable. So images AND extracted text. Her idea was to modify the pdfbox app code to do it: a new class based on the existing one that generated the images for each page that would (based on Kathy's answers to my questions) need to be modified to additionally extract the text of each page, so that txt search results matched the correct img page presented. Might as well upgrade the pdfbox app version our GS code used. After testing that the latest version (2.09) did not have any of the issues for which we previously settled on v 1.8.2 (lower than the then most up to date version), the necessary code changes were made. All of these are documented in the newly included GS_PDFBox_README.txt. The new java file is called GS_PDFToImagesAndText.java and is located in the new java/src subfolder. This will need to be put into the pdfbox app 2.09 *src* code to be built, and the generated class file should then be copied into the java/lib/java/pdfbox-app.jar, all as explained in the GS_PDFBox_README.txt. Other files modified for the changes requested by Kathy are PDFBoxConvertger.pm, to refer to our new class and its new java package location as packages have changed in 2.09, and util.pm's create_itemfile() function which now may additionally deal with txt files matching each img file generated. (Not committing minor adjustment to ReadTextFile.pm to prevent a warning, as my fix seems hacky. But the fix is described in the Readme). The pdfbox ext zip/tarballs also modified to contain the changed PDFBoxConverter.pm and pdfbox-app jar file for 2.09 with our custom new class file. But have not yet renamed anything to gs-pdfbox-app as there will be flow on effects elsewhere as described in the Readme, can do all this in a separate commit.

  • Property svn:keywords set to Author Date Id Revision
File size: 64.0 KB
Line 
1###########################################################################
2#
3# util.pm -- various useful utilities
4# A component of the Greenstone digital library software
5# from the New Zealand Digital Library Project at the
6# University of Waikato, New Zealand.
7#
8# Copyright (C) 1999 New Zealand Digital Library Project
9#
10# This program is free software; you can redistribute it and/or modify
11# it under the terms of the GNU General Public License as published by
12# the Free Software Foundation; either version 2 of the License, or
13# (at your option) any later version.
14#
15# This program is distributed in the hope that it will be useful,
16# but WITHOUT ANY WARRANTY; without even the implied warranty of
17# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
18# GNU General Public License for more details.
19#
20# You should have received a copy of the GNU General Public License
21# along with this program; if not, write to the Free Software
22# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
23#
24###########################################################################
25
26package util;
27
28use strict;
29no strict 'refs'; # make an exception so we can use variables as filehandles
30use FileUtils;
31
32use Encode;
33use Unicode::Normalize 'normalize';
34
35use File::Copy;
36use File::Basename;
37# Config for getting the perlpath in the recommended way, though it uses paths that are
38# hard-coded into the Config file that's generated upon configuring and compiling perl.
39# $^X works better in some cases to return the path to perl used to launch the script,
40# but if launched with plain "perl" (no full-path), that will be just what it returns.
41use Config;
42# New module for file related utility functions - intended as a
43# placeholder for an extension that allows a variety of different
44# filesystems (FTP, HTTP, SAMBA, WEBDav, HDFS etc)
45use FileUtils;
46
47if ($ENV{'GSDLOS'} =~ /^windows$/i) {
48 require Win32; # for working out Windows Long Filenames from Win 8.3 short filenames
49}
50
51# removes files (but not directories)
52sub rm {
53 warnings::warnif("deprecated", "util::rm() is deprecated, using FileUtils::removeFiles() instead");
54 return &FileUtils::removeFiles(@_);
55}
56
57# recursive removal
58sub filtered_rm_r {
59 warnings::warnif("deprecated", "util::filtered_rm_r() is deprecated, using FileUtils::removeFilesFiltered() instead");
60 return &FileUtils::removeFilesFiltered(@_);
61}
62
63# recursive removal
64sub rm_r {
65 warnings::warnif("deprecated", "util::rm_r() is deprecated, using FileUtils::removeFilesRecursive() instead");
66 return &FileUtils::removeFilesRecursive(@_);
67}
68
69# moves a file or a group of files
70sub mv {
71 warnings::warnif("deprecated", "util::mv() is deprecated, using FileUtils::moveFiles() instead");
72 return &FileUtils::moveFiles(@_);
73}
74
75# Move the contents of source directory into target directory
76# (as opposed to merely replacing target dir with the src dir)
77# This can overwrite any files with duplicate names in the target
78# but other files and folders in the target will continue to exist
79sub mv_dir_contents {
80 warnings::warnif("deprecated", "util::mv_dir_contents() is deprecated, using FileUtils::moveDirectoryContents() instead");
81 return &FileUtils::moveDirectoryContents(@_);
82}
83
84# copies a file or a group of files
85sub cp {
86 warnings::warnif("deprecated", "util::cp() is deprecated, using FileUtils::copyFiles() instead");
87 return &FileUtils::copyFiles(@_);
88}
89
90# recursively copies a file or group of files
91# syntax: cp_r (sourcefiles, destination directory)
92# destination must be a directory - to copy one file to
93# another use cp instead
94sub cp_r {
95 warnings::warnif("deprecated", "util::cp_r() is deprecated, using FileUtils::copyFilesrecursive() instead");
96 return &FileUtils::copyFilesRecursive(@_);
97}
98
99# recursively copies a file or group of files
100# syntax: cp_r (sourcefiles, destination directory)
101# destination must be a directory - to copy one file to
102# another use cp instead
103sub cp_r_nosvn {
104 warnings::warnif("deprecated", "util::cp_r_nosvn() is deprecated, using FileUtils::copyFilesRecursiveNoSVN() instead");
105 return &FileUtils::copyFilesRecursiveNoSVN(@_);
106}
107
108# copies a directory and its contents, excluding subdirectories, into a new directory
109sub cp_r_toplevel {
110 warnings::warnif("deprecated", "util::cp_r_toplevel() is deprecated, using FileUtils::recursiveCopyTopLevel() instead");
111 return &FileUtils::recursiveCopyTopLevel(@_);
112}
113
114sub mk_dir {
115 warnings::warnif("deprecated", "util::mk_dir() is deprecated, using FileUtils::makeDirectory() instead");
116 return &FileUtils::makeDirectory(@_);
117}
118
119# in case anyone cares - I did some testing (using perls Benchmark module)
120# on this subroutine against File::Path::mkpath (). mk_all_dir() is apparently
121# slightly faster (surprisingly) - Stefan.
122sub mk_all_dir {
123 warnings::warnif("deprecated", "util::mk_all_dir() is deprecated, using FileUtils::makeAllDirectories() instead");
124 return &FileUtils::makeAllDirectories(@_);
125}
126
127# make hard link to file if supported by OS, otherwise copy the file
128sub hard_link {
129 warnings::warnif("deprecated", "util::hard_link() is deprecated, using FileUtils::hardLink() instead");
130 return &FileUtils::hardLink(@_);
131}
132
133# make soft link to file if supported by OS, otherwise copy file
134sub soft_link {
135 warnings::warnif("deprecated", "util::soft_link() is deprecated, using FileUtils::softLink() instead");
136 return &FileUtils::softLink(@_);
137}
138
139# Primarily for filenames generated by processing
140# content of HTML files (which are mapped to UTF-8 internally)
141#
142# To turn this into an octet string that really exists on the file
143# system:
144# 1. don't need to do anything special for Unix-based systems
145# (as underlying file system is byte-code)
146# 2. need to map to short DOS filenames for Windows
147
148sub utf8_to_real_filename
149{
150 my ($utf8_filename) = @_;
151
152 my $real_filename;
153
154 if (($ENV{'GSDLOS'} =~ m/^windows$/i) && ($^O ne "cygwin")) {
155 require Win32;
156
157 my $unicode_filename = decode("utf8",$utf8_filename);
158 $real_filename = Win32::GetShortPathName($unicode_filename);
159 }
160 else {
161 $real_filename = $utf8_filename;
162 }
163
164 return $real_filename;
165}
166
167sub raw_filename_to_unicode
168{
169 my ($directory, $raw_file, $filename_encoding ) = @_;
170
171 my $unicode_filename = $raw_file;
172 if (($ENV{'GSDLOS'} =~ m/^windows$/i) && ($^O ne "cygwin")) {
173 # Try turning a short version to the long version
174 # If there are "funny" characters in the file name, that can't be represented in the ANSI code, then we will have a short weird version, eg E74~1.txt
175 $unicode_filename = &util::get_dirsep_tail(&util::upgrade_if_dos_filename(&FileUtils::filenameConcatenate($directory, $raw_file), 0));
176
177
178 if ($unicode_filename eq $raw_file) {
179 # This means the original filename *was* able to be encoded in the local ANSI file encoding (eg windows_1252), so now we turn it back to perl's unicode
180
181 $unicode_filename = &Encode::decode(locale_fs => $unicode_filename);
182 }
183 # else This means we did have one of the funny filenames. the getLongPathName (used in upgrade_if_dos_filename) will return unicode, so we don't need to do anything more.
184
185
186 } else {
187 # we had a utf-8 string, turn it into perl internal unicode
188 $unicode_filename = &Encode::decode("utf-8", $unicode_filename);
189
190
191 }
192 #Does the filename have url encoded chars in it?
193 if (&unicode::is_url_encoded($unicode_filename)) {
194 $unicode_filename = &unicode::url_decode($unicode_filename);
195 }
196
197 # Normalise the filename to canonical composition - on mac, filenames use decopmposed form for accented chars
198 if ($ENV{'GSDLOS'} =~ m/^darwin$/i) {
199 $unicode_filename = normalize('C', $unicode_filename); # Composed form 'C'
200 }
201 return $unicode_filename;
202
203}
204sub fd_exists {
205 warnings::warnif("deprecated", "util::fd_exists() is deprecated, using FileUtils::fileTest() instead");
206 return &FileUtils::fileTest(@_);
207}
208
209sub file_exists {
210 warnings::warnif("deprecated", "util::file_exists() is deprecated, using FileUtils::fileExists() instead");
211 return &FileUtils::fileExists(@_);
212}
213
214sub dir_exists {
215 warnings::warnif("deprecated", "util::dir_exists() is deprecated, using FileUtils::directoryExists() instead");
216 return &FileUtils::directoryExists(@_);
217}
218
219# updates a copy of a directory in some other part of the filesystem
220# verbosity settings are: 0=low, 1=normal, 2=high
221# both $fromdir and $todir should be absolute paths
222sub cachedir {
223 warnings::warnif("deprecated", "util::cachedir() is deprecated, using FileUtils::synchronizeDirectories() instead");
224 return &FileUtils::synchronizeDirectories(@_);
225}
226
227# this function returns -1 if either file is not found
228# assumes that $file1 and $file2 are absolute file names or
229# in the current directory
230# $file2 is allowed to be newer than $file1
231sub differentfiles {
232 warnings::warnif("deprecated", "util::differentfiles() is deprecated, using FileUtils::differentFiles() instead");
233 return &FileUtils::differentFiles(@_);
234}
235
236
237# works out the temporary directory, including in the case where Greenstone is not writable
238# In that case, gs3-setup.bat would already have set the GS_TMP_OUTPUT_DIR temp variable
239sub determine_tmp_dir
240{
241 my $try_collect_dir = shift(@_) || 0;
242
243 my $tmp_dirname;
244 if(defined $ENV{'GS_TMP_OUTPUT_DIR'}) {
245 $tmp_dirname = $ENV{'GS_TMP_OUTPUT_DIR'};
246 } elsif($try_collect_dir && defined $ENV{'GSDLCOLLECTDIR'}) {
247 $tmp_dirname = $ENV{'GSDLCOLLECTDIR'};
248 } elsif(defined $ENV{'GSDLHOME'}) {
249 $tmp_dirname = $ENV{'GSDLHOME'};
250 } else {
251 return undef;
252 }
253
254 if(!defined $ENV{'GS_TMP_OUTPUT_DIR'}) {
255 # test the tmp_dirname folder is writable, by trying to write out a file
256 # Unfortunately, cound not get if(-w $dirname) to work on directories on Windows
257 ## http://alvinalexander.com/blog/post/perl/perl-file-test-operators-reference-cheat-sheet (test file/dir writable)
258 ## http://www.makelinux.net/alp/083 (real and effective user IDs)
259
260 my $tmp_test_file = &FileUtils::filenameConcatenate($tmp_dirname, "writability_test.tmp");
261 if (open (FOUT, ">$tmp_test_file")) {
262 close(FOUT);
263 &FileUtils::removeFiles($tmp_test_file);
264 } else { # location not writable, use TMP location
265 if (defined $ENV{'TMP'}) {
266 $tmp_dirname = $ENV{'TMP'};
267 } else {
268 $tmp_dirname = "/tmp";
269 }
270 $tmp_dirname = &FileUtils::filenameConcatenate($tmp_dirname, "greenstone");
271 $ENV{'GS_TMP_OUTPUT_DIR'} = $tmp_dirname; # store for next time
272 }
273 }
274
275 $tmp_dirname = &FileUtils::filenameConcatenate($tmp_dirname, "tmp");
276 &FileUtils::makeAllDirectories ($tmp_dirname) unless -e $tmp_dirname;
277
278 return $tmp_dirname;
279}
280
281sub get_tmp_filename
282{
283 my $file_ext = shift(@_) || undef;
284
285 my $opt_dot_file_ext = "";
286 if (defined $file_ext) {
287 if ($file_ext !~ m/\./) {
288 # no dot, so needs one added in at start
289 $opt_dot_file_ext = ".$file_ext"
290 }
291 else {
292 # allow for "extensions" such as _metadata.txt to be handled
293 # gracefully
294 $opt_dot_file_ext = $file_ext;
295 }
296 }
297
298 my $tmpdir = &util::determine_tmp_dir(0);
299
300 my $count = 1000;
301 my $rand = int(rand $count);
302 my $full_tmp_filename = &FileUtils::filenameConcatenate($tmpdir, "F$rand$opt_dot_file_ext");
303
304 while (-e $full_tmp_filename) {
305 $rand = int(rand $count);
306 $full_tmp_filename = &FileUtils::filenameConcatenate($tmpdir, "F$rand$opt_dot_file_ext");
307 $count++;
308 }
309
310 return $full_tmp_filename;
311}
312
313# These 2 are "static" variables used by the get_timestamped_tmp_folder() subroutine below and
314# belong with that function. They help ensure the timestamped tmp folders generated are unique.
315my $previous_timestamp = undef;
316my $previous_timestamp_f = 0; # frequency
317
318sub get_timestamped_tmp_folder
319{
320 my $tmp_dirname = &util::determine_tmp_dir(1);
321
322 # add the timestamp into the path otherwise we can run into problems
323 # if documents have the same name
324 my $timestamp = time;
325
326 if (!defined $previous_timestamp || ($timestamp > $previous_timestamp)) {
327 $previous_timestamp_f = 0;
328 $previous_timestamp = $timestamp;
329 } else {
330 $previous_timestamp_f++;
331 }
332
333 my $time_tmp_dirname = &FileUtils::filenameConcatenate($tmp_dirname, $timestamp);
334 $tmp_dirname = $time_tmp_dirname;
335 my $i = $previous_timestamp_f;
336
337 if($previous_timestamp_f > 0) {
338 $tmp_dirname = $time_tmp_dirname."_".$i;
339 $i++;
340 }
341 while (-e $tmp_dirname) {
342 $tmp_dirname = $time_tmp_dirname."_".$i;
343 $i++;
344 }
345 &FileUtils::makeDirectory($tmp_dirname);
346
347 return $tmp_dirname;
348}
349
350sub get_timestamped_tmp_filename_in_collection
351{
352
353 my ($input_filename, $output_ext) = @_;
354 # derive tmp filename from input filename
355 my ($tailname, $dirname, $suffix)
356 = &File::Basename::fileparse($input_filename, "\\.[^\\.]+\$");
357
358 # softlink to collection tmp dir
359 my $tmp_dirname = &util::get_timestamped_tmp_folder();
360 $tmp_dirname = $dirname unless defined $tmp_dirname;
361
362 # following two steps copied from ConvertBinaryFile
363 # do we need them?? can't use them as is, as they use plugin methods.
364
365 #$tailname = $self->SUPER::filepath_to_utf8($tailname) unless &unicode::check_is_utf8($tailname);
366
367 # URLEncode this since htmls with images where the html filename is utf8 don't seem
368 # to work on Windows (IE or Firefox), as browsers are looking for filesystem-encoded
369 # files on the filesystem.
370 #$tailname = &util::rename_file($tailname, $self->{'file_rename_method'}, "without_suffix");
371 if (defined $output_ext) {
372 $output_ext = ".$output_ext"; # add the dot
373 } else {
374 $output_ext = $suffix;
375 }
376 $output_ext= lc($output_ext);
377 my $tmp_filename = &FileUtils::filenameConcatenate($tmp_dirname, "$tailname$output_ext");
378
379 return $tmp_filename;
380}
381
382sub get_toplevel_tmp_dir
383{
384 return &FileUtils::filenameConcatenate($ENV{'GSDLHOME'}, "tmp");
385}
386
387
388sub get_collectlevel_tmp_dir
389{
390 my $tmp_dirname = &FileUtils::filenameConcatenate($ENV{'GSDLCOLLECTDIR'}, "tmp");
391 &FileUtils::makeDirectory($tmp_dirname) if (!-e $tmp_dirname);
392
393 return $tmp_dirname;
394}
395
396sub get_parent_folder
397{
398 my ($path) = @_;
399 my ($tailname, $dirname, $suffix)
400 = &File::Basename::fileparse($path, "\\.[^\\.]+\$");
401
402 return &FileUtils::sanitizePath($dirname);
403}
404
405sub filename_to_regex {
406 my $filename = shift (@_);
407
408 # need to make single backslashes double so that regex works
409 $filename =~ s/\\/\\\\/g; # if ($ENV{'GSDLOS'} =~ /^windows$/i);
410
411 # note that the first part of a substitution is a regex, so RE chars need to be escaped,
412 # the second part of a substitution is not a regex, so for e.g. full-stop can be specified literally
413 $filename =~ s/\./\\./g; # in case there are extensions/other full stops, escape them
414 $filename =~ s@\(@\\(@g; # escape brackets
415 $filename =~ s@\)@\\)@g; # escape brackets
416 $filename =~ s@\[@\\[@g; # escape brackets
417 $filename =~ s@\]@\\]@g; # escape brackets
418
419 return $filename;
420}
421
422sub unregex_filename {
423 my $filename = shift (@_);
424
425 # need to put doubled backslashes for regex back to single
426 $filename =~ s/\\\./\./g; # remove RE syntax for .
427 $filename =~ s@\\\(@(@g; # remove RE syntax for ( => "\(" turns into "("
428 $filename =~ s@\\\)@)@g; # remove RE syntax for ) => "\)" turns into ")"
429 $filename =~ s@\\\[@[@g; # remove RE syntax for [ => "\[" turns into "["
430 $filename =~ s@\\\]@]@g; # remove RE syntax for ] => "\]" turns into "]"
431
432 # \\ goes to \
433 # This is the last step in reverse mirroring the order of steps in filename_to_regex()
434 $filename =~ s/\\\\/\\/g; # remove RE syntax for \
435 return $filename;
436}
437
438sub filename_cat {
439 # I've disabled this warning for now, as every Greenstone perl
440 # script seems to make use of this function and so you drown in a
441 # sea of deprecated warnings [jmt12]
442# warnings::warnif("deprecated", "util::filename_cat() is deprecated, using FileUtils::filenameConcatenate() instead");
443 return &FileUtils::filenameConcatenate(@_);
444}
445
446
447sub _pathname_cat {
448 my $join_char = shift(@_);
449 my $first_path = shift(@_);
450 my (@pathnames) = @_;
451
452 # If first_path is not null or empty, then add it back into the list
453 if (defined $first_path && $first_path =~ /\S/) {
454 unshift(@pathnames, $first_path);
455 }
456
457 my $pathname = join($join_char, @pathnames);
458
459 # remove duplicate slashes
460 if ($join_char eq ";") {
461 $pathname =~ s/[\\\/]+/\\/g;
462 if ($^O eq "cygwin") {
463 # Once we've collapsed muliple (potentialy > 2) slashes
464 # For cygwin, actually want things double-backslahed
465 $pathname =~ s/\\/\\\\/g;
466 }
467
468 } else {
469 $pathname =~ s/[\/]+/\//g;
470 # DB: want a pathname abc\de.html to remain like this
471 }
472
473 return $pathname;
474}
475
476
477sub pathname_cat {
478 my (@pathnames) = @_;
479
480 my $join_char;
481 if (($ENV{'GSDLOS'} =~ /^windows$/i) && ($^O ne "cygwin")) {
482 $join_char = ";";
483 } else {
484 $join_char = ":";
485 }
486 return _pathname_cat($join_char,@pathnames);
487}
488
489
490sub javapathname_cat {
491 my (@pathnames) = @_;
492
493 my $join_char;
494
495 # Unlike pathname_cat() above, not interested if running in a Cygwin environment
496 # This is because the java we run is actually a native Windows executable
497
498 if (($ENV{'GSDLOS'} =~ /^windows$/i)) {
499 $join_char = ";";
500 } else {
501 $join_char = ":";
502 }
503 return _pathname_cat($join_char,@pathnames);
504}
505
506
507sub makeFilenameJavaCygwinCompatible
508{
509 my ($java_filename) = @_;
510
511 if ($^O eq "cygwin") {
512 # To be used with a Java program, but under Cygwin
513 # Because the java binary that is native to Windows, need to
514 # convert the Cygwin paths (i.e. Unix style) to be Windows
515 # compatible
516
517 $java_filename = `cygpath -wp "$java_filename"`;
518 chomp($java_filename);
519 $java_filename =~ s%\\%\\\\%g;
520 }
521
522 return $java_filename;
523}
524
525sub tidy_up_oid {
526 my ($OID) = @_;
527 if ($OID =~ /[\.\/\\]/) {
528 print STDERR "Warning, identifier $OID contains periods or slashes(.\\\/), replacing them with _\n";
529 $OID =~ s/[\.\\\/]/_/g; #remove any periods
530 }
531 if ($OID =~ /^\s.*\s$/) {
532 print STDERR "Warning, identifier $OID starts or ends with whitespace. Removing it\n";
533 # remove starting and trailing whitespace
534 $OID =~ s/^\s+//;
535 $OID =~ s/\s+$//;
536 }
537 if ($OID =~ /^[\d]*$/) {
538 print STDERR "Warning, identifier $OID contains only digits. Prepending 'D'.\n";
539 $OID = "D" . $OID;
540 }
541
542 return $OID;
543}
544
545sub envvar_prepend {
546 my ($var,$val) = @_;
547
548 # 64 bit linux can't handle ";" as path separator, so make sure to set this to the right one for the OS
549## my $pathsep = (defined $ENV{'GSDLOS'} && $ENV{'GSDLOS'} !~ m/windows/) ? ":" : ";";
550
551 # Rewritten above to make ":" the default (Windows is the special
552 # case, anything else 'unusual' such as Solaris etc is Unix)
553 my $pathsep = (defined $ENV{'GSDLOS'} && (($ENV{'GSDLOS'} =~ m/windows/) && ($^O ne "cygwin"))) ? ";" : ":";
554
555 # do not prepend any value/path that's already in the environment variable
556
557 my $escaped_val = &filename_to_regex($val); # escape any backslashes and brackets for upcoming regex
558 if (!defined($ENV{$var})) {
559 $ENV{$var} = "$val";
560 }
561 elsif($ENV{$var} !~ m/$escaped_val/) {
562 $ENV{$var} = "$val".$pathsep.$ENV{$var};
563 }
564}
565
566sub envvar_append {
567 my ($var,$val) = @_;
568
569 # 64 bit linux can't handle ";" as path separator, so make sure to set this to the right one for the OS
570 my $pathsep = (defined $ENV{'GSDLOS'} && $ENV{'GSDLOS'} !~ m/windows/) ? ":" : ";";
571
572 # do not append any value/path that's already in the environment variable
573
574 my $escaped_val = &filename_to_regex($val); # escape any backslashes and brackets for upcoming regex
575 if (!defined($ENV{$var})) {
576 $ENV{$var} = "$val";
577 }
578 elsif($ENV{$var} !~ m/$escaped_val/) {
579 $ENV{$var} = $ENV{$var}.$pathsep."$val";
580 }
581}
582
583# debug aid
584sub print_env {
585 my ($handle, @envvars) = @_; # print to $handle, which can be STDERR/STDOUT/file, etc.
586
587 if (scalar(@envvars) == 0) {
588 #print $handle "@@@ All env vars requested\n";
589
590 my $output = "";
591
592 print $handle "@@@ Environment was:\n********\n";
593 foreach my $envvar (sort keys(%ENV)) {
594 if(defined $ENV{$envvar}) {
595 print $handle "\t$envvar = $ENV{$envvar}\n";
596 } else {
597 print $handle "\t$envvar = \n";
598 }
599 }
600 print $handle "********\n";
601 } else {
602 print $handle "@@@ Environment was:\n********\n";
603 foreach my $envvar (@envvars) {
604 if(defined $ENV{$envvar}) {
605 print $handle "\t$envvar = ".$ENV{$envvar}."\n";
606 } else {
607 print $handle "Env var '$envvar' was not set\n";
608 }
609 }
610 print $handle "********\n";
611 }
612}
613
614
615# splits a filename into a prefix and a tail extension using the tail_re, or
616# if that fails, splits on the file_extension . (dot)
617sub get_prefix_and_tail_by_regex {
618
619 my ($filename,$tail_re) = @_;
620
621 my ($file_prefix,$file_ext) = ($filename =~ m/^(.*?)($tail_re)$/);
622 if ((!defined $file_prefix) || (!defined $file_ext)) {
623 ($file_prefix,$file_ext) = ($filename =~ m/^(.*)(\..*?)$/);
624 }
625
626 return ($file_prefix,$file_ext);
627}
628
629# get full path and file only path from a base_dir (which may be empty) and
630# file (which may contain directories)
631sub get_full_filenames {
632 my ($base_dir, $file) = @_;
633
634# my ($cpackage,$cfilename,$cline,$csubr,$chas_args,$cwantarray) = caller(0);
635# my ($lcfilename) = ($cfilename =~ m/([^\\\/]*)$/);
636# print STDERR "** Calling method: $lcfilename:$cline $cpackage->$csubr\n";
637
638
639 my $filename_full_path = $file;
640 # add on directory if present
641 $filename_full_path = &FileUtils::filenameConcatenate($base_dir, $file) if $base_dir =~ /\S/;
642
643 my $filename_no_path = $file;
644
645 # remove directory if present
646 $filename_no_path =~ s/^.*[\/\\]//;
647 return ($filename_full_path, $filename_no_path);
648}
649
650# returns the path of a file without the filename -- ie. the directory the file is in
651sub filename_head {
652 my $filename = shift(@_);
653
654 if (($ENV{'GSDLOS'} =~ /^windows$/i) && ($^O ne "cygwin")) {
655 $filename =~ s/[^\\\\]*$//;
656 }
657 else {
658 $filename =~ s/[^\\\/]*$//;
659 }
660
661 return $filename;
662}
663
664
665
666# returns 1 if filename1 and filename2 point to the same
667# file or directory
668sub filenames_equal {
669 my ($filename1, $filename2) = @_;
670
671 # use filename_cat to clean up trailing slashes and
672 # multiple slashes
673 $filename1 = &FileUtils::filenameConcatenate($filename1);
674 $filename2 = &FileUtils::filenameConcatenate($filename2);
675
676 # filenames not case sensitive on windows
677 if ($ENV{'GSDLOS'} =~ /^windows$/i) {
678 $filename1 =~ tr/[A-Z]/[a-z]/;
679 $filename2 =~ tr/[A-Z]/[a-z]/;
680 }
681 return 1 if $filename1 eq $filename2;
682 return 0;
683}
684
685# If filename is relative to within_dir, returns the relative path of filename to that directory
686# with slashes in the filename returned as they were in the original (absolute) filename.
687sub filename_within_directory
688{
689 my ($filename,$within_dir) = @_;
690
691 if ($within_dir !~ m/[\/\\]$/) {
692 my $dirsep = &util::get_dirsep();
693 $within_dir .= $dirsep;
694 }
695
696 $within_dir = &filename_to_regex($within_dir); # escape DOS style file separator and brackets
697 if ($filename =~ m/^$within_dir(.*)$/) {
698 $filename = $1;
699 }
700
701 return $filename;
702}
703
704# If filename is relative to within_dir, returns the relative path of filename to that directory in URL format.
705# Filename and within_dir can be any type of slashes, but will be compared as URLs (i.e. unix-style slashes).
706# The subpath returned will also be a URL type filename.
707sub filename_within_directory_url_format
708{
709 my ($filename,$within_dir) = @_;
710
711 # convert parameters only to / slashes if Windows
712
713 my $filename_urlformat = &filepath_to_url_format($filename);
714 my $within_dir_urlformat = &filepath_to_url_format($within_dir);
715
716 #if ($within_dir_urlformat !~ m/\/$/) {
717 # make sure directory ends with a slash
718 #$within_dir_urlformat .= "/";
719 #}
720
721 my $within_dir_urlformat_re = &filename_to_regex($within_dir_urlformat); # escape any special RE characters, such as brackets
722
723 #print STDERR "@@@@@ $filename_urlformat =~ $within_dir_urlformat_re\n";
724
725 # dir prefix may or may not end with a slash (this is discarded when extracting the sub-filepath)
726 if ($filename_urlformat =~ m/^$within_dir_urlformat_re(?:\/)*(.*)$/) {
727 $filename_urlformat = $1;
728 }
729
730 return $filename_urlformat;
731}
732
733# Convert parameter to use / slashes if Windows (if on Linux leave any \ as is,
734# since on Linux it doesn't represent a file separator but an escape char).
735sub filepath_to_url_format
736{
737 my ($filepath) = @_;
738 if (($ENV{'GSDLOS'} =~ /^windows$/i) && ($^O ne "cygwin")) {
739 # Only need to worry about Windows, as Unix style directories already in url-format
740 # Convert Windows style \ => /
741 $filepath =~ s@\\@/@g;
742 }
743 return $filepath;
744}
745
746# regex filepaths on windows may include \\ as path separator. Convert \\ to /
747sub filepath_regex_to_url_format
748{
749 my ($filepath) = @_;
750 if (($ENV{'GSDLOS'} =~ /^windows$/i) && ($^O ne "cygwin")) {
751 # Only need to worry about Windows, as Unix style directories already in url-format
752 # Convert Windows style \\ => /
753 $filepath =~ s@\\\\@/@g;
754 }
755 return $filepath;
756
757}
758
759# Like File::Basename::fileparse, but expects filepath in url format (ie only / slash for dirsep)
760# and ignores trailing /
761# returns (file, dirs) dirs will be empty if no subdirs
762sub url_fileparse
763{
764 my ($filepath) = @_;
765 # remove trailing /
766 $filepath =~ s@/$@@;
767 if ($filepath !~ m@/@) {
768 return ($filepath, "");
769 }
770 my ($dirs, $file) = $filepath =~ m@(.+/)([^/]+)@;
771 return ($file, $dirs);
772
773}
774
775
776sub filename_within_collection
777{
778 my ($filename) = @_;
779
780 my $collect_dir = $ENV{'GSDLCOLLECTDIR'};
781
782 if (defined $collect_dir) {
783
784 # if from within GSDLCOLLECTDIR, then remove directory prefix
785 # so source_filename is realative to it. This is done to aid
786 # portability, i.e. the collection can be moved to somewhere
787 # else on the file system and the archives directory will still
788 # work. This is needed, for example in the applet version of
789 # GLI where GSDLHOME/collect on the server will be different to
790 # the collect directory of the remove user. Of course,
791 # GSDLCOLLECTDIR subsequently needs to be put back on to turn
792 # it back into a full pathname.
793
794 $filename = filename_within_directory($filename,$collect_dir);
795 }
796
797 return $filename;
798}
799
800sub prettyprint_file
801{
802 my ($base_dir,$file,$gli) = @_;
803
804 my $filename_full_path = &FileUtils::filenameConcatenate($base_dir,$file);
805
806 if (($ENV{'GSDLOS'} =~ m/^windows$/i) && ($^O ne "cygwin")) {
807 require Win32;
808
809 # For some reason base_dir in the form c:/a/b/c
810 # This leads to confusion later on, so turn it back into
811 # the more usual Windows form
812 $base_dir =~ s/\//\\/g;
813 my $long_base_dir = Win32::GetLongPathName($base_dir);
814 my $long_full_path = Win32::GetLongPathName($filename_full_path);
815
816 $file = filename_within_directory($long_full_path,$long_base_dir);
817 $file = encode("utf8",$file) if ($gli);
818 }
819
820 return $file;
821}
822
823
824sub upgrade_if_dos_filename
825{
826 my ($filename_full_path,$and_encode) = @_;
827
828 if (($ENV{'GSDLOS'} =~ m/^windows$/i) && ($^O ne "cygwin")) {
829 # Ensure any DOS-like filename, such as test~1.txt, has been upgraded
830 # to its long (Windows) version
831 my $long_filename = Win32::GetLongPathName($filename_full_path);
832 if (defined $long_filename) {
833
834 $filename_full_path = $long_filename;
835 }
836 # Make sure initial drive letter is lower-case (to fit in with rest of Greenstone)
837 $filename_full_path =~ s/^(.):/\u$1:/;
838
839 if ((defined $and_encode) && ($and_encode)) {
840 $filename_full_path = encode("utf8",$filename_full_path);
841 }
842 }
843
844 return $filename_full_path;
845}
846
847
848sub downgrade_if_dos_filename
849{
850 my ($filename_full_path) = @_;
851
852 if (($ENV{'GSDLOS'} =~ m/^windows$/i) && ($^O ne "cygwin")) {
853 require Win32;
854
855 # Ensure the given long Windows filename is in a form that can
856 # be opened by Perl => convert it to a short DOS-like filename
857
858 my $short_filename = Win32::GetShortPathName($filename_full_path);
859 if (defined $short_filename) {
860 $filename_full_path = $short_filename;
861 }
862 # Make sure initial drive letter is lower-case (to fit in
863 # with rest of Greenstone)
864 $filename_full_path =~ s/^(.):/\u$1:/;
865 }
866
867 return $filename_full_path;
868}
869
870
871sub filename_is_absolute
872{
873 warnings::warnif("deprecated", "util::filename_is_absolute() is deprecated, using FileUtils::isFilenameAbsolute() instead");
874 return &FileUtils::isFilenameAbsolute(@_);
875}
876
877
878## @method make_absolute()
879#
880# Ensure the given file path is absolute in respect to the given base path.
881#
882# @param $base_dir A string denoting the base path the given dir must be
883# absolute to.
884# @param $dir The directory to be made absolute as a string. Note that the
885# dir may already be absolute, in which case it will remain
886# unchanged.
887# @return The now absolute form of the directory as a string.
888#
889# @author John Thompson, DL Consulting Ltd.
890# @copy 2006 DL Consulting Ltd.
891#
892#used in buildcol.pl, doesn't work for all cases --kjdon
893sub make_absolute {
894
895 my ($base_dir, $dir) = @_;
896### print STDERR "dir = $dir\n";
897 $dir =~ s/[\\\/]+/\//g;
898 $dir = $base_dir . "/$dir" unless ($dir =~ m|^(\w:)?/|);
899 $dir =~ s|^/tmp_mnt||;
900 1 while($dir =~ s|/[^/]*/\.\./|/|g);
901 $dir =~ s|/[.][.]?/|/|g;
902 $dir =~ tr|/|/|s;
903### print STDERR "dir = $dir\n";
904
905 return $dir;
906}
907## make_absolute() ##
908
909sub get_dirsep {
910
911 if (($ENV{'GSDLOS'} =~ /^windows$/i) && ($^O ne "cygwin")) {
912 return "\\";
913 } else {
914 return "\/";
915 }
916}
917
918sub get_os_dirsep {
919
920 if (($ENV{'GSDLOS'} =~ /^windows$/i) && ($^O ne "cygwin")) {
921 return "\\\\";
922 } else {
923 return "\\\/";
924 }
925}
926
927sub get_re_dirsep {
928
929 return "\\\\|\\\/";
930}
931
932
933sub get_dirsep_tail {
934 my ($filename) = @_;
935
936 # returns last part of directory or filename
937 # On unix e.g. a/b.d => b.d
938 # a/b/c => c
939
940 my $dirsep = get_re_dirsep();
941 my @dirs = split (/$dirsep/, $filename);
942 my $tail = pop @dirs;
943
944 # - caused problems under windows
945 #my ($tail) = ($filename =~ m/^(?:.*?$dirsep)?(.*?)$/);
946
947 return $tail;
948}
949
950
951# if this is running on windows we want binaries to end in
952# .exe, otherwise they don't have to end in any extension
953sub get_os_exe {
954 return ".exe" if (($ENV{'GSDLOS'} =~ /^windows$/i) && ($^O ne "cygwin"));
955 return "";
956}
957
958
959# test to see whether this is a big or little endian machine
960sub is_little_endian
961{
962 # To determine the name of the operating system, the variable $^O is a cheap alternative to pulling it out of the Config module;
963 # If it is a Macintosh machine (i.e. the Darwin operating system), regardless if it's running on the IBM power-pc cpu or the x86 Intel-based chip with a power-pc emulator running on top of it, it's big-endian
964 # Otherwise, it's little endian
965
966 #return 0 if $^O =~ /^darwin$/i;
967 #return 0 if $ENV{'GSDLOS'} =~ /^darwin$/i;
968
969 # Going back to stating exactly whether the machine is little endian
970 # or big endian, without any special case for Macs. Since for rata it comes
971 # back with little endian and for shuttle with bigendian.
972 return (ord(substr(pack("s",1), 0, 1)) == 1);
973}
974
975
976# will return the collection name if successful, "" otherwise
977sub use_collection {
978 my ($collection, $collectdir) = @_;
979
980 if (!defined $collectdir || $collectdir eq "") {
981 $collectdir = &FileUtils::filenameConcatenate($ENV{'GSDLHOME'}, "collect");
982 }
983
984 if (!defined $ENV{'GREENSTONEHOME'}) { # for GS3, would have been defined in use_site_collection, to GSDL3HOME
985 $ENV{'GREENSTONEHOME'} = $ENV{'GSDLHOME'};
986 }
987
988 # get and check the collection
989 if (!defined($collection) || $collection eq "") {
990 if (defined $ENV{'GSDLCOLLECTION'}) {
991 $collection = $ENV{'GSDLCOLLECTION'};
992 } else {
993 print STDOUT "No collection specified\n";
994 return "";
995 }
996 }
997
998 if ($collection eq "modelcol") {
999 print STDOUT "You can't use modelcol.\n";
1000 return "";
1001 }
1002
1003 # make sure the environment variables GSDLCOLLECTION and GSDLCOLLECTDIR
1004 # are defined
1005 $ENV{'GSDLCOLLECTION'} = $collection;
1006 $ENV{'GSDLCOLLECTHOME'} = $collectdir;
1007 $ENV{'GSDLCOLLECTDIR'} = &FileUtils::filenameConcatenate($collectdir, $collection);
1008
1009 # make sure this collection exists
1010 if (!-e $ENV{'GSDLCOLLECTDIR'}) {
1011 print STDOUT "Invalid collection ($collection).\n";
1012 return "";
1013 }
1014
1015 # everything is ready to go
1016 return $collection;
1017}
1018
1019sub get_current_collection_name {
1020 return $ENV{'GSDLCOLLECTION'};
1021}
1022
1023
1024# will return the collection name if successful, "" otherwise.
1025# Like use_collection (above) but for greenstone 3 (taking account of site level)
1026
1027sub use_site_collection {
1028 my ($site, $collection, $collectdir) = @_;
1029
1030 if (!defined $collectdir || $collectdir eq "") {
1031 die "GSDL3HOME not set.\n" unless defined $ENV{'GSDL3HOME'};
1032 $collectdir = &FileUtils::filenameConcatenate($ENV{'GSDL3HOME'}, "sites", $site, "collect");
1033 }
1034
1035 if (defined $ENV{'GSDL3HOME'}) {
1036 $ENV{'GREENSTONEHOME'} = $ENV{'GSDL3HOME'};
1037 $ENV{'SITEHOME'} = &FileUtils::filenameConcatenate($ENV{'GREENSTONEHOME'}, "sites", $site);
1038 } elsif (defined $ENV{'GSDL3SRCHOME'}) {
1039 $ENV{'GREENSTONEHOME'} = &FileUtils::filenameConcatenate($ENV{'GSDL3SRCHOME'}, "web");
1040 $ENV{'SITEHOME'} = &FileUtils::filenameConcatenate($ENV{'GREENSTONEHOME'}, "sites", $site);
1041 } else {
1042 print STDERR "*** util::use_site_collection(). Warning: Neither GSDL3HOME nor GSDL3SRCHOME set.\n";
1043 }
1044
1045 # collectdir explicitly set by this point (using $site variable if required).
1046 # Can call "old" gsdl2 use_collection now.
1047
1048 return use_collection($collection,$collectdir);
1049}
1050
1051
1052
1053sub locate_config_file
1054{
1055 my ($file) = @_;
1056
1057 my $locations = locate_config_files($file);
1058
1059 return shift @$locations; # returns undef if 'locations' is empty
1060}
1061
1062
1063sub locate_config_files
1064{
1065 my ($file) = @_;
1066
1067 my @locations = ();
1068
1069 if (-e $file) {
1070 # Clearly specified (most likely full filename)
1071 # No need to hunt in 'etc' directories, return value unchanged
1072 push(@locations,$file);
1073 }
1074 else {
1075 # Check for collection specific one before looking in global GSDL 'etc'
1076 if (defined $ENV{'GSDLCOLLECTDIR'} && $ENV{'GSDLCOLLECTDIR'} ne "") {
1077 my $test_collect_etc_filename
1078 = &FileUtils::filenameConcatenate($ENV{'GSDLCOLLECTDIR'},"etc", $file);
1079
1080 if (-e $test_collect_etc_filename) {
1081 push(@locations,$test_collect_etc_filename);
1082 }
1083 }
1084 my $test_main_etc_filename
1085 = &FileUtils::filenameConcatenate($ENV{'GSDLHOME'},"etc", $file);
1086 if (-e $test_main_etc_filename) {
1087 push(@locations,$test_main_etc_filename);
1088 }
1089 }
1090
1091 return \@locations;
1092}
1093
1094
1095sub hyperlink_text
1096{
1097 my ($text) = @_;
1098
1099 $text =~ s/(http:\/\/[^\s]+)/<a href=\"$1\">$1<\/a>/mg;
1100 $text =~ s/(^|\s+)(www\.(\w|\.)+)/<a href=\"http:\/\/$2\">$2<\/a>/mg;
1101
1102 return $text;
1103}
1104
1105
1106# A method to check if a directory is empty (note that an empty directory still has non-zero size!!!)
1107# Code is from http://episteme.arstechnica.com/eve/forums/a/tpc/f/6330927813/m/436007700831
1108sub is_dir_empty {
1109 warnings::warnif("deprecated", "util::is_dir_empty() is deprecated, using FileUtils::isDirectoryEmpty() instead");
1110 return &FileUtils::isDirectoryEmpty(@_);
1111}
1112
1113# Returns the given filename converted using either URL encoding or base64
1114# encoding, as specified by $rename_method. If the given filename has no suffix
1115# (if it is just the tailname), then $no_suffix should be some defined value.
1116# rename_method can be url, none, base64
1117sub rename_file {
1118 my ($filename, $rename_method, $no_suffix) = @_;
1119
1120 if(!$filename) { # undefined or empty string
1121 return $filename;
1122 }
1123
1124 if (!$rename_method) {
1125 print STDERR "WARNING: no file renaming method specified. Defaulting to using URL encoding...\n";
1126 # Debugging information
1127 # my ($cpackage,$cfilename,$cline,$csubr,$chas_args,$cwantarray) = caller(1);
1128 # print STDERR "Called from method: $cfilename:$cline $cpackage->$csubr\n";
1129 $rename_method = "url";
1130 } elsif($rename_method eq "none") {
1131 return $filename; # would have already been renamed
1132 }
1133
1134 # No longer replace spaces with underscores, since underscores mess with incremental rebuild
1135 ### Replace spaces with underscore. Do this first else it can go wrong below when getting tailname
1136 ###$filename =~ s/ /_/g;
1137
1138 my ($tailname,$dirname,$suffix);
1139 if($no_suffix) { # given a tailname, no suffix
1140 ($tailname,$dirname) = File::Basename::fileparse($filename);
1141 }
1142 else {
1143 ($tailname,$dirname,$suffix) = File::Basename::fileparse($filename, "\\.(?:[^\\.]+?)\$");
1144 }
1145 if (!$suffix) {
1146 $suffix = "";
1147 }
1148 # This breaks GLI matching extracted metadata to files in Enrich panel, as
1149 # original is eg .JPG while gsdlsourcefilename ends up .jpg
1150 # Not sure why it was done in first place...
1151 #else {
1152 #$suffix = lc($suffix);
1153 #}
1154
1155 if ($rename_method eq "url") {
1156 $tailname = &unicode::url_encode($tailname);
1157 }
1158 elsif ($rename_method eq "base64") {
1159 $tailname = &unicode::base64_encode($tailname);
1160 $tailname =~ s/\s*//sg; # for some reason it adds spaces not just at end but also in middle
1161 }
1162
1163 $filename = "$tailname$suffix";
1164 $filename = "$dirname$filename" if ($dirname ne "./" && $dirname ne ".\\");
1165
1166 return $filename;
1167}
1168
1169
1170# BACKWARDS COMPATIBILITY: Just in case there are old .ldb/.bdb files
1171sub rename_ldb_or_bdb_file {
1172 my ($filename_no_ext) = @_;
1173
1174 my $new_filename = "$filename_no_ext.gdb";
1175 return if (-f $new_filename); # if the file has the right extension, don't need to do anything
1176 # try ldb
1177 my $old_filename = "$filename_no_ext.ldb";
1178
1179 if (-f $old_filename) {
1180 print STDERR "Renaming $old_filename to $new_filename\n";
1181 rename ($old_filename, $new_filename)
1182 || print STDERR "Rename failed: $!\n";
1183 return;
1184 }
1185 # try bdb
1186 $old_filename = "$filename_no_ext.bdb";
1187 if (-f $old_filename) {
1188 print STDERR "Renaming $old_filename to $new_filename\n";
1189 rename ($old_filename, $new_filename)
1190 || print STDERR "Rename failed: $!\n";
1191 return;
1192 }
1193}
1194
1195sub os_dir() {
1196
1197 my $gsdlarch = "";
1198 if(defined $ENV{'GSDLARCH'}) {
1199 $gsdlarch = $ENV{'GSDLARCH'};
1200 }
1201 return $ENV{'GSDLOS'}.$gsdlarch;
1202}
1203
1204# returns 1 if this (GS server) is a GS3 installation, returns 0 if it's GS2.
1205sub is_gs3() {
1206 if($ENV{'GSDL3SRCHOME'}) {
1207 return 1;
1208 } else {
1209 return 0;
1210 }
1211}
1212
1213# Returns the greenstone URL prefix extracted from the appropriate GS2/GS3 config file.
1214# By default, /greenstone3 for GS3 or /greenstone for GS2.
1215sub get_greenstone_url_prefix() {
1216 # if already set on a previous occasion, just return that
1217 # (Don't want to keep repeating this: cost of re-opening and scanning files.)
1218 return $ENV{'GREENSTONE_URL_PREFIX'} if($ENV{'GREENSTONE_URL_PREFIX'});
1219
1220 my ($configfile, $urlprefix, $defaultUrlprefix);
1221 my @propertynames = ();
1222
1223 if($ENV{'GSDL3SRCHOME'}) {
1224 $defaultUrlprefix = "/greenstone3";
1225 $configfile = &FileUtils::filenameConcatenate($ENV{'GSDL3SRCHOME'}, "packages", "tomcat", "conf", "Catalina", "localhost", "greenstone3.xml");
1226 push(@propertynames, qw/path\s*\=/);
1227 } else {
1228 $defaultUrlprefix = "/greenstone";
1229 $configfile = &FileUtils::filenameConcatenate($ENV{'GSDLHOME'}, "cgi-bin", &os_dir(), "gsdlsite.cfg");
1230 push(@propertynames, (qw/\nhttpprefix/, qw/\ngwcgi/)); # inspect one property then the other
1231 }
1232
1233 $urlprefix = &extract_propvalue_from_file($configfile, \@propertynames);
1234
1235 if(!$urlprefix) { # no values found for URL prefix, use default values
1236 $urlprefix = $defaultUrlprefix;
1237 } else {
1238 #gwcgi can contain more than the wanted prefix, we split on / to get the first "directory" level
1239 $urlprefix =~ s/^\///; # remove the starting slash
1240 my @dirs = split(/(\\|\/)/, $urlprefix);
1241 $urlprefix = shift(@dirs);
1242
1243 if($urlprefix !~ m/^\//) { # in all cases: ensure the required forward slash is at the front
1244 $urlprefix = "/$urlprefix";
1245 }
1246 }
1247
1248 # set for the future
1249 $ENV{'GREENSTONE_URL_PREFIX'} = $urlprefix;
1250# print STDERR "*** in get_greenstone_url_prefix(): $urlprefix\n\n";
1251 return $urlprefix;
1252}
1253
1254
1255
1256#
1257# The following comes from activate.pl
1258#
1259# Designed to work with a server included with GS.
1260# - For GS3, we ask ant for the library URL.
1261# - For GS2, we derive the URL from the llssite.cfg file.
1262
1263sub get_full_greenstone_url_prefix
1264{
1265 my ($gs_mode, $lib_name) = @_;
1266
1267 # if already set on a previous occasion, just return that
1268 # (Don't want to keep repeating this: cost of re-opening and scanning files.)
1269 return $ENV{'FULL_GREENSTONE_URL_PREFIX'} if($ENV{'FULL_GREENSTONE_URL_PREFIX'});
1270
1271 # set gs_mode if it was not passed in (servercontrol.pm would pass it in, any other callers won't)
1272 $gs_mode = ($ENV{'GSDL3SRCHOME'}) ? "gs3" : "gs2" unless defined $gs_mode;
1273
1274 my $url = undef;
1275
1276 if($gs_mode eq "gs2") {
1277 my $llssite_cfg = &FileUtils::filenameConcatenate($ENV{'GSDLHOME'}, "llssite.cfg");
1278
1279 if(-f $llssite_cfg) {
1280 # check llssite.cfg for line with url property
1281 # for server.exe also need to use portnumber and enterlib properties
1282 # The following file reading section is a candidate to use FileUtils::readUTF8File()
1283 # in place of calling sysread() directly. But only if we can reason we'd be working with UTF8
1284 # Read in the entire contents of the file in one hit
1285 if (!open (FIN, $llssite_cfg)) {
1286 print STDERR "util::get_full_greenstone_url_prefix() failed to open $llssite_cfg ($!)\n";
1287 return undef;
1288 }
1289
1290 my $contents;
1291 sysread(FIN, $contents, -s FIN);
1292 close(FIN);
1293
1294 my @lines = split(/[\n\r]+/, $contents); # split on carriage-returns and/or linefeeds
1295 my $enterlib = "";
1296 my $portnumber = "8282"; # will remain empty (implicit port 80) unless it's specifically been assigned
1297
1298 foreach my $line (@lines) {
1299 if($line =~ m/^url=(.*)$/) {
1300 $url = $1;
1301 } elsif($line =~ m/^enterlib=(.*)$/) {
1302 $enterlib = $1;
1303 } elsif($line =~ m/^portnumber=(.*)$/) {
1304 $portnumber = $1;
1305 }
1306 }
1307
1308 if(!$url) {
1309 return undef;
1310 }
1311 elsif($url eq "URL_pending") { # library is not running
1312 # do not process url=URL_pending in the file, since for server.exe
1313 # this just means the Enter Library button hasn't been pressed yet
1314 $url = undef;
1315 }
1316 else {
1317 # In the case of server.exe, need to do extra work to get the proper URL
1318 # But first, need to know whether we're indeed dealing with server.exe:
1319
1320 # compare the URL's domain to the full URL
1321 # E.g. for http://localhost:8383/greenstone3/cgi-bin, the domain is localhost:8383
1322 my $uri = URI->new( $url );
1323 my $host = $uri->host;
1324 #print STDERR "@@@@@ host: $host\n";
1325 if($url =~ m/https?:\/\/$host(\/)?$/) {
1326 #if($url !~ m/https?:\/\/$host:$portnumber(\/)?/ || $url =~ m/https?:\/\/$host(\/)?$/) {
1327 # (if the URL does not contain the portnumber, OR if the port is implicitly 80 and)
1328 # If the domain with http:// prefix is completely the same as the URL, assume server.exe
1329 # then the actual URL is the result of suffixing the port and enterlib properties in llssite.cfg
1330 $url = $url.":".$portnumber.$enterlib;
1331 } # else, apache web server
1332
1333 }
1334 }
1335 } elsif($gs_mode eq "gs3") {
1336 # Either check build.properties for tomcat.server, tomcat.port and app.name (and default servlet name).
1337 # app.name is stored in app.path by build.xml. Need to move app.name in build.properties from build.xml
1338
1339 # Or, run the new target get-default-servlet-url
1340 # the output can look like:
1341 #
1342 # Buildfile: build.xml
1343 # [echo] os.name: Windows Vista
1344 #
1345 # get-default-servlet-url:
1346 # [echo] http://localhost:8383/greenstone3/library
1347 # BUILD SUCCESSFUL
1348 # Total time: 0 seconds
1349
1350 #my $output = qx/ant get-default-servlet-url/; # backtick operator, to get STDOUT (else 2>&1)
1351 # - see http://stackoverflow.com/questions/799968/whats-the-difference-between-perls-backticks-system-and-exec
1352
1353 # The get-default-servlet-url ant target can be run from anywhere by specifying the
1354 # location of GS3's ant build.xml buildfile. Activate.pl can be run from anywhere for GS3
1355 # GSDL3SRCHOME will be set for GS3 by gs3-setup.sh, a step that would have been necessary
1356 # to run the activate.pl script in the first place
1357
1358 my $full_build_xml = &FileUtils::javaFilenameConcatenate($ENV{'GSDL3SRCHOME'},"build.xml");
1359
1360 my $perl_command = "ant -buildfile \"$full_build_xml\" get-default-servlet-url";
1361
1362 if (open(PIN, "$perl_command |")) {
1363 while (defined (my $perl_output_line = <PIN>)) {
1364
1365 if($perl_output_line =~ m@(https?):\/\/(\S*)@) { # grab all the non-whitespace chars
1366 $url="$1://".$2; # preserve the http protocol #$url="http://".$1;
1367 }
1368 }
1369 close(PIN);
1370
1371 if (defined $lib_name) { # url won't be undef now
1372 # replace the servlet_name portion of the url found, with the given library_name
1373 $url =~ s@/[^/]*$@/$lib_name@;
1374 }
1375 } else {
1376 print STDERR "util::get_full_greenstone_url_prefix() failed to run $perl_command to work out library URL for $gs_mode\n";
1377 }
1378 }
1379
1380 # either the url is still undef or it is now set
1381 #print STDERR "\n@@@@@ final URL:|$url|\n" if $url;
1382 #print STDERR "\n@@@@@ URL still undef\n" if !$url;
1383
1384 $ENV{'FULL_GREENSTONE_URL_PREFIX'} = $url;
1385
1386 return $url;
1387}
1388
1389
1390# Given a config file (xml or java properties file) and a list/array of regular expressions
1391# that represent property names to match on, this function will return the value for the 1st
1392# matching property name. If the return value is undefined, no matching property was found.
1393sub extract_propvalue_from_file() {
1394 my ($configfile, $propertynames) = @_;
1395
1396 my $value;
1397 unless(open(FIN, "<$configfile")) {
1398 print STDERR "extract_propvalue_from_file(): Unable to open $configfile. $!\n";
1399 return $value; # not initialised
1400 }
1401
1402 # Read the entire file at once, as one single line, then close it
1403 my $filecontents;
1404 {
1405 local $/ = undef;
1406 $filecontents = <FIN>;
1407 }
1408 close(FIN);
1409
1410 foreach my $regex (@$propertynames) {
1411 ($value) = $filecontents=~ m/$regex\s*(\S*)/s; # read value of the property given by regex up to the 1st space
1412 if($value) {
1413 $value =~ s/^\"//; # remove any startquotes
1414 $value =~ s/\".*$//; # remove the 1st endquotes (if any) followed by any xml
1415 last; # found value for a matching property, break from loop
1416 }
1417 }
1418
1419 return $value;
1420}
1421
1422# Subroutine that sources setup.bash, given GSDLHOME and GSDLOS and
1423# given that perllib is in @INC in order to invoke this subroutine.
1424# Call as follows -- after setting up INC to include perllib and
1425# after setting up GSDLHOME and GSDLOS:
1426#
1427# require util;
1428# &util::setup_greenstone_env($ENV{'GSDLHOME'}, $ENV{'GSDLOS'});
1429#
1430sub setup_greenstone_env() {
1431 my ($GSDLHOME, $GSDLOS) = @_;
1432
1433 #my %env_map = ();
1434 # Get the localised ENV settings of running a localised source setup.bash
1435 # and put it into the ENV here. Need to clear GSDLHOME before running setup
1436 #my $perl_command = "(cd $GSDLHOME; export GSDLHOME=; . ./setup.bash > /dev/null; env)";
1437 my $perl_command = "(cd $GSDLHOME; /bin/bash -c \"export GSDLHOME=; source setup.bash > /dev/null; env\")";
1438 if (($GSDLOS =~ m/windows/i) && ($^O ne "cygwin")) {
1439 #$perl_command = "cmd /C \"cd $GSDLHOME&& set GSDLHOME=&& setup.bat > nul&& set\"";
1440 $perl_command = "(cd $GSDLHOME&& set GSDLHOME=&& setup.bat > nul&& set)";
1441 }
1442 if (!open(PIN, "$perl_command |")) {
1443 print STDERR ("Unable to execute command: $perl_command. $!\n");
1444 }
1445
1446 my $lastkey;
1447 while (defined (my $perl_output_line = <PIN>)) {
1448 my($key,$value) = ($perl_output_line =~ m/^([^=]*)[=](.*)$/);
1449 if(defined $key) {
1450 #$env_map{$key}=$value;
1451 $ENV{$key}=$value;
1452 $lastkey = $key;
1453 } elsif($lastkey && $perl_output_line !~ m/^\s*$/) {
1454 # there was no equals sign in $perl_output_line, so this
1455 # $perl_output_line may be a spillover from the previous
1456 $ENV{$lastkey} = $ENV{$lastkey}."\n".$perl_output_line;
1457 }
1458 }
1459 close (PIN);
1460
1461 # If any keys in $ENV don't occur in Greenstone's localised env
1462 # (stored in $env_map), delete those entries from $ENV
1463 #foreach $key (keys %ENV) {
1464 # if(!defined $env_map{$key}) {
1465 # print STDOUT "**** DELETING ENV KEY: $key\tVALUE: $ENV{$key}\n";
1466 # delete $ENV{$key}; # del $ENV(key, value) pair
1467 # }
1468 #}
1469 #undef %env_map;
1470}
1471
1472sub get_perl_exec() {
1473 my $perl_exec = $^X; # may return just "perl"
1474
1475 if($ENV{'PERLPATH'}) {
1476 # OR: # $perl_exec = &FileUtils::filenameConcatenate($ENV{'PERLPATH'},"perl");
1477 if (($ENV{'GSDLOS'} =~ m/windows/) && ($^O ne "cygwin")) {
1478 $perl_exec = "$ENV{'PERLPATH'}\\Perl.exe";
1479 } else {
1480 $perl_exec = "$ENV{'PERLPATH'}/perl";
1481 }
1482 } else { # no PERLPATH, use Config{perlpath} else $^X: special variables
1483 # containing the full path to the current perl executable we're using
1484 $perl_exec = $Config{perlpath}; # configured path for perl
1485 if (!-e $perl_exec) { # may not point to location on this machine
1486 $perl_exec = $^X; # may return just "perl"
1487 if($perl_exec =~ m/^perl/i) { # warn if just perl or Perl.exe
1488 print STDERR "**** WARNING: Perl exec found contains no path: $perl_exec";
1489 }
1490 }
1491 }
1492
1493 return $perl_exec;
1494}
1495
1496# returns the path to the java command in the JRE included with GS (if any),
1497# quoted to safeguard any spaces in this path, otherwise a simple java
1498# command is returned which assumes and will try for a system java.
1499sub get_java_command {
1500 my $java = "java";
1501 if(defined $ENV{'GSDLHOME'}) { # should be, as this script would be launched from the cmd line
1502 # after running setup.bat or from GLI which also runs setup.bat
1503 my $java_bin = &FileUtils::filenameConcatenate($ENV{'GSDLHOME'},"packages","jre","bin");
1504 if(-d $java_bin) {
1505 $java = &FileUtils::filenameConcatenate($java_bin,"java");
1506 $java = "\"".$java."\""; # quoted to preserve spaces in path
1507 }
1508 }
1509 return $java;
1510}
1511
1512
1513# Given the qualified collection name (colgroup/collection),
1514# returns the collection and colgroup parts
1515sub get_collection_parts {
1516 # http://perldoc.perl.org/File/Basename.html
1517 # my($filename, $directories, $suffix) = fileparse($path);
1518 # "$directories contains everything up to and including the last directory separator in the $path
1519 # including the volume (if applicable). The remainder of the $path is the $filename."
1520 #my ($collection, $colgroup) = &File::Basename::fileparse($qualified_collection);
1521
1522 my $qualified_collection = shift(@_);
1523
1524 # Since activate.pl can be launched from the command-line, including by a user,
1525 # best not to assume colgroup uses URL-style slashes as would be the case with GLI
1526 # Also allow for the accidental inclusion of multiple slashes
1527 my ($colgroup, $collection) = split(/[\/\\]+/, $qualified_collection); #split('/', $qualified_collection);
1528
1529 if(!defined $collection) {
1530 $collection = $colgroup;
1531 $colgroup = "";
1532 }
1533 return ($collection, $colgroup);
1534}
1535
1536# work out the "collectdir/collection" location
1537sub resolve_collection_dir {
1538 my ($collect_dir, $qualified_collection, $site) = @_; #, $gs_mode
1539
1540 if (defined $ENV{'GSDLCOLLECTDIR'}) { # a predefined collection dir exists
1541 return $ENV{'GSDLCOLLECTDIR'};
1542 }
1543
1544 my ($colgroup, $collection) = &util::get_collection_parts($qualified_collection);
1545
1546 if (!defined $collect_dir || !$collect_dir) { # if undefined or empty string
1547 $collect_dir = &util::get_working_collect_dir($site);
1548 }
1549
1550 return &FileUtils::filenameConcatenate($collect_dir,$colgroup,$collection);
1551}
1552
1553# work out the full path to "collect" of this greenstone 2/3 installation
1554sub get_working_collect_dir {
1555 my ($site) = @_;
1556
1557 if (defined $ENV{'GSDLCOLLECTHOME'}) { # a predefined collect dir exists
1558 return $ENV{'GSDLCOLLECTHOME'};
1559 }
1560
1561 if (defined $site && $site) { # site non-empty, so get default collect dir for GS3
1562
1563 if (defined $ENV{'GSDL3HOME'}) {
1564 return &FileUtils::filenameConcatenate($ENV{'GSDL3HOME'},"sites",$site,"collect"); # web folder
1565 }
1566 elsif (defined $ENV{'GSDL3SRCHOME'}) {
1567 return &FileUtils::filenameConcatenate($ENV{'GSDL3SRCHOME'},"web","sites",$site,"collect");
1568 }
1569 }
1570
1571 elsif (defined $ENV{'SITEHOME'}) {
1572 return &FileUtils::filenameConcatenate($ENV{'SITEHOME'},"collect");
1573 }
1574
1575 else { # get default collect dir for GS2
1576 return &FileUtils::filenameConcatenate($ENV{'GSDLHOME'},"collect");
1577 }
1578}
1579
1580sub is_abs_path_any_os {
1581 my ($path) = @_;
1582
1583 # We can have filenames in our DBs that were produced on other OS, so this method exists
1584 # to help identify absolute paths in such cases.
1585
1586 return 1 if($path =~ m@^/@); # full paths begin with forward slash on linux/mac
1587 return 1 if($path =~ m@^([a-zA-Z]\:|\\)@); # full paths begin with drive letter colon for Win or \ for volume, http://stackoverflow.com/questions/13011013/get-only-volume-name-from-filepath
1588
1589 return 0;
1590}
1591
1592
1593# This subroutine is for improving portability of Greenstone collections from one OS to another,
1594# to be used to convert absolute paths going into db files into paths with placeholders instead.
1595# This sub works with util::get_common_gs_paths and takes a path to a greenstone file and, if it's
1596# an absolute path, then it will replace the longest matching greenstone-path prefix of the given
1597# path with a placeholder to match.
1598# The Greenstone-path prefixes that can be matched are the following common Greenstone paths:
1599# the path to the current (specific) collection, the path to the general GS collect directory,
1600# the path to the site directory if GS3, else the path to the GSDLHOME/GSDL3HOME folder.
1601# The longest matching prefix will be replaced with the equivalent placeholder:
1602# @THISCOLLECTPATH@, else @COLLECTHOME@, else @SITEHOME@, else @GSDLHOME@.
1603sub abspath_to_placeholders {
1604 my $path = shift(@_); # path to convert from absolute to one with placeholders
1605 my $opt_long_or_short_winfilenames = shift(@_) || "short"; # whether we want to force use of long file names even on windows, default uses short
1606
1607 return $path unless is_abs_path_any_os($path); # path is relative
1608
1609 if ($opt_long_or_short_winfilenames eq "long") {
1610 $path = &util::upgrade_if_dos_filename($path); # will only do something on windows
1611 }
1612
1613 # now we know we're dealing with absolute paths and have to replace gs prefixes with placeholders
1614 my @gs_paths = ($ENV{'GSDLCOLLECTDIR'}, $ENV{'GSDLCOLLECTHOME'}, $ENV{'SITEHOME'}, $ENV{'GREENSTONEHOME'}); # list in this order: from longest to shortest path
1615
1616 my %placeholder_map = ($ENV{'GREENSTONEHOME'} => '@GSDLHOME@', # can't use double-quotes around at-sign, else perl tries to evaluate it as referring to an array
1617 $ENV{'GSDLCOLLECTHOME'} => '@COLLECTHOME@',
1618 $ENV{'GSDLCOLLECTDIR'} => '@THISCOLLECTPATH@'
1619 );
1620 $placeholder_map{$ENV{'SITEHOME'}} = '@SITEHOME@' if defined $ENV{'SITEHOME'};
1621
1622 $path = &util::_abspath_to_placeholders($path, \@gs_paths, \%placeholder_map);
1623
1624 if ($ENV{'GSDLOS'} =~ /^windows$/i && $opt_long_or_short_winfilenames eq "short") {
1625 # for windows need to look for matches on short file names too
1626 # matched paths are again to be replaced with the usual placeholders
1627
1628 my $gsdlcollectdir = &util::downgrade_if_dos_filename($ENV{'GSDLCOLLECTDIR'});
1629 my $gsdlcollecthome = &util::downgrade_if_dos_filename($ENV{'GSDLCOLLECTHOME'});
1630 my $sitehome = (defined $ENV{'SITEHOME'}) ? &util::downgrade_if_dos_filename($ENV{'SITEHOME'}) : undef;
1631 my $greenstonehome = &util::downgrade_if_dos_filename($ENV{'GREENSTONEHOME'});
1632
1633 @gs_paths = ($gsdlcollectdir, $gsdlcollecthome, $sitehome, $greenstonehome); # order matters
1634
1635 %placeholder_map = ($greenstonehome => '@GSDLHOME@', # can't use double-quotes around at-sign, else perl tries to evaluate it as referring to an array
1636 $gsdlcollecthome => '@COLLECTHOME@',
1637 $gsdlcollectdir => '@THISCOLLECTPATH@'
1638 );
1639 $placeholder_map{$sitehome} = '@SITEHOME@' if defined $sitehome;
1640
1641 $path = &util::_abspath_to_placeholders($path, \@gs_paths, \%placeholder_map);
1642 }
1643
1644 return $path;
1645}
1646
1647sub _abspath_to_placeholders {
1648 my ($path, $gs_paths_ref, $placeholder_map_ref) = @_;
1649
1650 # The sequence of elements in @gs_paths matters
1651 # Need to loop starting from the *longest* matching path (the path to the specific collection)
1652 # to the shortest matching path (the path to gsdlhome/gsdl3home folder):
1653
1654 foreach my $gs_path (@$gs_paths_ref) {
1655 next if(!defined $gs_path); # site undefined for GS2
1656
1657 my $re_path = &util::filename_to_regex($gs_path); # escape for regex
1658
1659 if($path =~ m/^$re_path/i) { # case sensitive or not for OS?
1660
1661 my $placeholder = $placeholder_map_ref->{$gs_path}; # get the placeholder to replace the matched path with
1662
1663 $path =~ s/^$re_path/$placeholder/; #case sensitive or not?
1664 #$path =~ s/^[\\\/]//; # remove gs_path's trailing separator left behind at the start of the path
1665 # lowercase file extension, This is needed when shortfilenames are used, as case affects alphetical ordering, which affects diffcol
1666 $path =~ s/\.([A-Z]+)$/".".lc($1)/e;
1667 last; # done
1668 }
1669 }
1670
1671 return $path;
1672}
1673
1674# Function that does the reverse of the util::abspath_to_placeholders subroutine
1675# Once again, call this with the values returned from util::get_common_gs_paths
1676sub placeholders_to_abspath {
1677 my $path = shift(@_); # path that can contain placeholders to convert to resolved absolute path
1678 my $opt_long_or_short_winfilenames = shift(@_) || "short"; # whether we want to force use of long file names even on windows, default uses short
1679
1680 return $path if($path !~ m/@/); # path contains no placeholders
1681
1682 # replace placeholders with gs prefixes
1683 my @placeholders = ('@THISCOLLECTPATH@', '@COLLECTHOME@', '@SITEHOME@', '@GSDLHOME@'); # order of paths not crucial in this case,
1684 # but listed here from longest to shortest once placeholders are have been resolved
1685
1686 # can't use double-quotes around at-sign, else perl tries to evaluate it as referring to an array
1687 my %placeholder_to_gspath_map;
1688 if ($ENV{'GSDLOS'} =~ /^windows$/i && $opt_long_or_short_winfilenames eq "short") {
1689 # always replace placeholders with short file names of the absolute paths on windows?
1690 %placeholder_to_gspath_map = ('@GSDLHOME@' => &util::downgrade_if_dos_filename($ENV{'GREENSTONEHOME'}),
1691 '@COLLECTHOME@' => &util::downgrade_if_dos_filename($ENV{'GSDLCOLLECTHOME'}),
1692 '@THISCOLLECTPATH@' => &util::downgrade_if_dos_filename($ENV{'GSDLCOLLECTDIR'})
1693 );
1694 $placeholder_to_gspath_map{'@SITEHOME@'} = &util::downgrade_if_dos_filename($ENV{'SITEHOME'}) if defined $ENV{'SITEHOME'};
1695 } else {
1696 %placeholder_to_gspath_map = ('@GSDLHOME@' => $ENV{'GREENSTONEHOME'},
1697 '@SITEHOME@' => $ENV{'SITEHOME'}, # can be undef
1698 '@COLLECTHOME@' => $ENV{'GSDLCOLLECTHOME'},
1699 '@THISCOLLECTPATH@' => $ENV{'GSDLCOLLECTDIR'}
1700 ); # $placeholder_to_gspath_map{'@SITEHOME@'} = $ENV{'SITEHOME'} if defined $ENV{'SITEHOME'};
1701 }
1702
1703 foreach my $placeholder (@placeholders) {
1704 my $gs_path = $placeholder_to_gspath_map{$placeholder};
1705
1706 next if(!defined $gs_path); # sitehome for GS2 is undefined
1707
1708 if($path =~ m/^$placeholder/) {
1709 $path =~ s/^$placeholder/$gs_path/;
1710 last; # done
1711 }
1712 }
1713
1714 return $path;
1715}
1716
1717# Used by pdfpstoimg.pl and PDFBoxConverter to create a .item file from
1718# a directory containing sequentially numbered images (and optional matching sequentially numbered .txt files).
1719sub create_itemfile
1720{
1721 my ($output_dir, $convert_basename, $convert_to) = @_;
1722 my $page_num = "";
1723
1724 opendir(DIR, $output_dir) || die "can't opendir $output_dir: $!";
1725 my @dir_files = grep {-f "$output_dir/$_"} readdir(DIR);
1726 closedir DIR;
1727
1728 # Sort files in the directory by page_num
1729 sub page_number {
1730 my ($dir) = @_;
1731 my ($pagenum) =($dir =~ m/^.*?[-\.]?(\d+)(\.(jpg|gif|png))?$/i);
1732# my ($pagenum) =($dir =~ m/(\d+)(\.(jpg|gif|png))?$/i); # this works but is not as safe/strict about input filepatterns as the above
1733
1734 $pagenum = 1 unless defined $pagenum;
1735 return $pagenum;
1736 }
1737
1738 # sort the files in the directory in the order of page_num rather than lexically.
1739 @dir_files = sort { page_number($a) <=> page_number($b) } @dir_files;
1740
1741 # work out if the numbering of the now sorted image files starts at 0 or not
1742 # by checking the number of the first _image_ file (skipping item files)
1743 my $starts_at_0 = 0;
1744 my $firstfile = ($dir_files[0] !~ /\.item$/i) ? $dir_files[0] : $dir_files[1];
1745 if(page_number($firstfile) == 0) { # 00 will evaluate to 0 too in this condition
1746 $starts_at_0 = 1;
1747 }
1748
1749 my $item_file = &FileUtils::filenameConcatenate($output_dir, $convert_basename.".item");
1750 my $item_fh;
1751 &FileUtils::openFileHandle($item_file, 'w', \$item_fh);
1752 print $item_fh "<PagedDocument>\n";
1753
1754 # In the past, sub create_itemfile() never output txtfile names into the item file (they were left as empty strings),
1755 # only image file names. Now that PDFBox is being customised for GS with the new GS_PDFToImagesAndText.java class to
1756 # create images of each PDF page and extract text for that page if extractable, we can have matching txt files for
1757 # each img file. So now we can output txt file names if we're working with txt files.
1758 # We just test if a text file exists in the same dir that matches the name of the first image file
1759 # if a matching txt file does not exist, don't output txtfile names into the item file
1760
1761 my ($tailname, $dirname, $suffix) = &File::Basename::fileparse($firstfile, "\\.[^\\.]+\$"); # relative filenames so no dirname
1762 my $txtfilename = &FileUtils::filenameConcatenate($output_dir, $tailname . ".txt");
1763 my $hasTxtFile = &FileUtils::fileExists($txtfilename);
1764
1765 foreach my $file (@dir_files){
1766 if ($file !~ /\.item/i && $file !~ /\.txt/i){
1767 $page_num = page_number($file);
1768 $page_num++ if $starts_at_0; # image numbers start at 0, so add 1
1769 if($hasTxtFile) {
1770 print $item_fh " <Page pagenum=\"$page_num\" imgfile=\"$file\" txtfile=\"$page_num.txt\"/>\n";
1771 } else {
1772 print $item_fh " <Page pagenum=\"$page_num\" imgfile=\"$file\" txtfile=\"\"/>\n";
1773 }
1774 }
1775 }
1776
1777 print $item_fh "</PagedDocument>\n";
1778 &FileUtils::closeFileHandle($item_file, \$item_fh);
1779 return $item_file;
1780}
1781
1782# Sets the gnomelib_env. Based on the logic in wvware.pl which can perhaps be replaced with a call to this function in future
1783sub set_gnomelib_env
1784{
1785 ## SET THE ENVIRONMENT AS DONE IN SETUP.BASH/BAT OF GNOME-LIB
1786 # Though this is only needed for darwin Lion at this point (and android, though that is untested)
1787
1788 my $libext = "so";
1789 if ($ENV{'GSDLOS'} =~ m/^windows$/i) {
1790 return;
1791 } elsif ($ENV{'GSDLOS'} =~ m/^darwin$/i) {
1792 $libext = "dylib";
1793 }
1794
1795 if (!defined $ENV{'GEXTGNOME'}) {
1796 ##print STDERR "@@@ Setting GEXTGNOME env\n";
1797
1798 my $gnome_dir = &FileUtils::filenameConcatenate($ENV{'GSDLHOME'},"ext","gnome-lib-minimal");
1799
1800 if(! -d $gnome_dir) {
1801 $gnome_dir = &FileUtils::filenameConcatenate($ENV{'GSDLHOME'},"ext","gnome-lib");
1802
1803 if(! -d $gnome_dir) {
1804 $gnome_dir = "";
1805 }
1806 }
1807
1808 # now set other the related env vars,
1809 # IF we've found the gnome-lib dir installed in the ext folder
1810
1811 if ($gnome_dir ne "" && -f &FileUtils::filenameConcatenate($gnome_dir, $ENV{'GSDLOS'}, "lib", "libiconv.$libext")) {
1812 $ENV{'GEXTGNOME'} = $gnome_dir;
1813 $ENV{'GEXTGNOME_INSTALLED'}=&FileUtils::filenameConcatenate($ENV{'GEXTGNOME'}, $ENV{'GSDLOS'});
1814
1815 my $gnomelib_bin = &FileUtils::filenameConcatenate($ENV{'GEXTGNOME_INSTALLED'}, "bin");
1816 if(-d $gnomelib_bin) { # no bin subfolder in GS binary's cutdown gnome-lib-minimal folder
1817 &util::envvar_prepend("PATH", $gnomelib_bin);
1818 }
1819
1820 # util's prepend will create LD/DYLD_LIB_PATH if it doesn't exist yet
1821 my $gextlib = &FileUtils::filenameConcatenate($ENV{'GEXTGNOME_INSTALLED'}, "lib");
1822
1823 if($ENV{'GSDLOS'} eq "linux") {
1824 &util::envvar_prepend("LD_LIBRARY_PATH", $gextlib);
1825 }
1826 elsif ($ENV{'GSDLOS'} eq "darwin") {
1827 #&util::envvar_prepend("DYLD_LIBRARY_PATH", $gextlib);
1828 &util::envvar_prepend("DYLD_FALLBACK_LIBRARY_PATH", $gextlib);
1829 }
1830 }
1831
1832 # Above largely mimics the setup.bash of the gnome-lib-minimal.
1833 # Not doing the devel-srcpack that gnome-lib-minimal's setup.bash used to set
1834 # Not exporting GSDLEXTS variable either
1835 }
1836
1837# print STDERR "@@@@@ GEXTGNOME: ".$ENV{'GEXTGNOME'}."\n\tINSTALL".$ENV{'GEXTGNOME_INSTALLED'}."\n";
1838# print STDERR "\tPATH".$ENV{'PATH'}."\n";
1839# print STDERR "\tLD_LIB_PATH".$ENV{'LD_LIBRARY_PATH'}."\n" if $ENV{'LD_LIBRARY_PATH};
1840# print STDERR "\tDYLD_FALLBACK_LIB_PATH".$ENV{'DYLD_FALLBACK_LIBRARY_PATH'}."\n" if $ENV{'DYLD_FALLBACK_LIBRARY_PATH};
1841
1842 # if no GEXTGNOME, maybe users didn't need gnome-lib to run gnomelib/libiconv dependent binaries like hashfile, suffix, wget
1843 # (wvware is launched in a gnomelib env from its own script, but could possibly go through this script in future)
1844}
1845
1846
1847
1848## @function augmentINC()
1849#
1850# Prepend a path (if it exists) onto INC but only if it isn't already in INC
1851# @param $new_path The path to add
1852# @author jmt12
1853#
1854sub augmentINC
1855{
1856 my ($new_path) = @_;
1857 my $did_add_path = 0;
1858 # might need to be replaced with FileUtils::directoryExists() call eventually
1859 if (-d $new_path)
1860 {
1861 my $did_find_path = 0;
1862 foreach my $existing_path (@INC)
1863 {
1864 if ($existing_path eq $new_path)
1865 {
1866 $did_find_path = 1;
1867 last;
1868 }
1869 }
1870 if (!$did_find_path)
1871 {
1872 unshift(@INC, $new_path);
1873 $did_add_path = 1;
1874 }
1875 }
1876 return $did_add_path;
1877}
1878## augmentINC()
1879
1880
18811;
Note: See TracBrowser for help on using the repository browser.