source: gsdl/trunk/perllib/plugins/PagedImgPlug.pm@ 15018

Last change on this file since 15018 was 15018, checked in by davidb, 16 years ago

Marc mapping upgraded to support a richer set of operations, including subfields, multiple fields in one line (separated by commas), and the removal of rules, e.g. -245 at the start of a line. A Marc to Qualified Dublin Core crosswalk from the Library of Congress has been added as "etc/marc2qdc.txt". A collection can then choose to, for example, top up the mapping with its own version of the file stored in its local "etc" folder, specifying only the rules that are different. This is where a rule like "-245" might be used to override a more general rule from the main file that maps all subfields in 245 to one metadata item (Title). If the user specifies a different filename, through a plugin option, then they are free to devise a mapping from scratch and store it in the collection's local "etc" folder.

  • Property svn:executable set to *
  • Property svn:keywords set to Author Date Id Revision
File size: 33.9 KB
1###########################################################################
2#
3# PagedImgPlug.pm -- plugin for sets of images and OCR text that
4# make up a document
5# A component of the Greenstone digital library software
6# from the New Zealand Digital Library Project at the
7# University of Waikato, New Zealand.
8#
9# Copyright (C) 1999 New Zealand Digital Library Project
10#
11# This program is free software; you can redistribute it and/or modify
12# it under the terms of the GNU General Public License as published by
13# the Free Software Foundation; either version 2 of the License, or
14# (at your option) any later version.
15#
16# This program is distributed in the hope that it will be useful,
17# but WITHOUT ANY WARRANTY; without even the implied warranty of
18# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
19# GNU General Public License for more details.
20#
21# You should have received a copy of the GNU General Public License
22# along with this program; if not, write to the Free Software
23# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
24#
25###########################################################################
26
27# PagedImgPlug
28# processes sequences of images, with optional OCR text
29#
30# This plugin takes *.item files, which contain metadata and lists of image
31# files, and produces a document containing sections, one for each page.
32# Name the files something.item, so that you can have more than one
33# book in a directory. You will need to create these files, one for each
34# document/book.
35#
36#There are two formats for the item files: a plain text format, and an xml
37#format. You can use either format, and can have both formats in the same
38#collection if you like. If you use the plain format, you must not start the
39#file off with <PagedDocument>
40
41#### PLAIN FORMAT
42# The format of the xxx.item file is as follows:
43# The first lines contain any metadata for the whole document
44# <metadata-name>metadata-value
45# eg.
46# <Title>Snail farming
47# <Date>19230102
48# Then comes a list of pages, one page per line, each line has the format
49#
50# pagenum:imagefile:textfile:r
51#
52# pagenum and imagefile are required. pagenum is used for the Title
53# of the section, and in the display is shown as page <pagenum>.
54# imagefile is the image for the page. textfile is an optional text
55# file containing the OCR (or any) text for the page - this gets added
56# as the text for the section. r is optional, and signals that the image
57# should be rotated 180 degrees, e.g. use this if the image is upside down.
58# So an example item file looks like:
59# <Title>Snail farming
60# <Date>19960403
61# 1:p1.gif:p1.txt:
62# 2:p2.gif::
63# 3:p3.gif:p3.txt:
64# 3b:p3b.gif:p3b.txt:r
65# The second page has no text, the fourth page is a back page, and
66# should be rotated.
67#
68
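# A minimal sketch of how a plain-format item file could be split into
# document metadata and page records, following the description above. The
# helper name parse_plain_item and the record keys are assumptions made for
# illustration; the plugin itself does this work in process_item.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PagedImgPlug): parse plain-format
# .item content into document metadata and a list of page records.
sub parse_plain_item {
    my ($text) = @_;
    my (%metadata, @pages);
    foreach my $line (split /\r?\n/, $text) {
        next unless $line =~ /\w/;    # skip blank lines
        next if $line =~ /^#/;        # skip comment lines
        if ($line =~ /^<([^>]*)>\s*(.*?)\s*$/) {
            $metadata{$1} = $2;       # <metadata-name>metadata-value
            next;
        }
        $line =~ s/^\s+|\s+$//g;      # trim surrounding whitespace
        # pagenum:imagefile:textfile:r (textfile and r are optional)
        my ($pagenum, $imgfile, $txtfile, $rotate) = split /:/, $line;
        push @pages, {
            pagenum => $pagenum,
            imgfile => $imgfile,
            txtfile => (defined $txtfile && $txtfile ne "") ? $txtfile : undef,
            rotate  => (defined $rotate && $rotate eq "r") ? 1 : 0,
        };
    }
    return (\%metadata, \@pages);
}
```

# Calling parse_plain_item on the example item file above would give two
# metadata entries (Title and Date) and four page records, the last one
# flagged for rotation.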
69#### XML FORMAT
70# The xml format looks like the following
71#<PagedDocument>
72#<Metadata name="Title">The Title of the entire document</Metadata>
73#<Page pagenum="1" imgfile="xxx.jpg" txtfile="yyy.txt">
74#<Metadata name="Title">The Title of this page</Metadata>
75#</Page>
76#... more pages
77#</PagedDocument>
78#PagedDocument contains a list of Pages, Metadata and PageGroups. Any metadata
79#that is not inside another tag will belong to the document.
80#Each Page has a pagenum (stored as PageNum metadata, and used as the Title if no Title is given), an imgfile and/or a txtfile.
81#These are both optional - if neither is used, the section will have no content.
82#Pages can also have metadata associated with them.
83#PageGroups can be introduced at any point - they can contain Metadata and Pages and other PageGroups. They are used to introduce hierarchical structure into the document.
84#For example
85#<PagedDocument>
86#<PageGroup>
87#<Page>
88#<Page>
89#</PageGroup>
90#<Page>
91#</PagedDocument>
92#would generate a structure like
93#X
94#--X
95#  --X
96#  --X
97#--X
98#PageGroup tags can also have imgfile/txtfile attributes if you like - this way they get some content themselves.
99
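# To illustrate how PageGroups nest, here is a small sketch that prints the
# section outline for a nested structure. The data layout (hashes with a
# 'children' list) and the helper names outline and document_outline are
# assumptions for illustration only; the plugin builds its sections from
# XML parser callbacks instead.

```perl
use strict;
use warnings;

# Each node is a hash: { type => 'Page' } or
# { type => 'PageGroup', children => [ ... ] }.
sub outline {
    my ($children, $depth) = @_;
    my @lines;
    foreach my $node (@$children) {
        # one "--X" per section, indented two spaces per nesting level
        push @lines, ("  " x ($depth - 1)) . "--X";
        push @lines, outline($node->{children}, $depth + 1)
            if $node->{children};
    }
    return @lines;
}

# The document itself is the top-level "X".
sub document_outline {
    my ($children) = @_;
    return join("\n", "X", outline($children, 1)) . "\n";
}

my $doc = [
    { type => 'PageGroup',
      children => [ { type => 'Page' }, { type => 'Page' } ] },
    { type => 'Page' },
];
print document_outline($doc);
```

# For the PageGroup example above (one group holding two pages, followed by
# a single page), this prints the same five-line outline.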
100#Currently the XML structure doesn't work very well with the paged document type, unless you use numerical Titles for each section.
101#There is still a bit of work to do on this format:
102#* enable other text file types, eg html, pdf etc
103#* make the document paging work properly
104#* add pagenum as Title unless a Title is present?
105
106# All the supplementary image and text files should be in the same folder as
107# the .item file.
108#
109# To display the images instead of the document text, you can use [srcicon]
110# in the DocumentText format statement.
111# For example,
112#
113# format DocumentText "<center><table width=_pagewidth_><tr><td>[srcicon]</td></tr></table></center>"
114#
115# To have it create thumbnail size images, use the '-thumbnail' option.
116# To have it create medium size images for display, use the '-screenview'
117# option. As usual, running
118# 'perl -S pluginfo.pl PagedImgPlug' will list all the options.
119
120# If you want the resulting documents to be presented with a table of
121# contents, use '-documenttype hierarchy', otherwise they will have
122# next and previous arrows, and a goto page X box.
123
124# If you have used -screenview, you can also use [screenicon] in the format
125# statement to display the smaller image. Here is an example that switches
126# between the two:
127#
128# format DocumentText "<center><table width=_pagewidth_><tr><td>{If}{_cgiargp_ eq full,<a href='_httpdocument_&d=_cgiargd_&p=small'>Switch to small version.</a>,<a href='_httpdocument_&d=_cgiargd_&p=full'>Switch to fullsize version</a>}</td></tr><tr><td>{If}{_cgiargp_ eq full,<a href='_httpdocument_&d=_cgiargd_&p=small' title='Switch to small version'>[srcicon]</a>,<a href='_httpdocument_&d=_cgiargd_&p=full' title='Switch to fullsize version'>[screenicon]</a>}</td></tr></table></center>"
129#
130# Additional metadata can be added into the .item files. Alternatively, you can
131# use normal metadata.xml files, with the name of the xxx.item file as the
132# FileName (only for document level metadata).
133
134package PagedImgPlug;
135
136use XMLPlug;
137use strict;
138no strict 'refs'; # allow filehandles to be variables and vice versa
139
140sub BEGIN {
141 @PagedImgPlug::ISA = ('XMLPlug');
142}
143
144my $type_list =
145 [ { 'name' => "paged",
146 'desc' => "{PagedImgPlug.documenttype.paged}" },
147 { 'name' => "hierarchy",
148 'desc' => "{PagedImgPlug.documenttype.hierarchy}" } ];
149
150my $arguments =
151 [ { 'name' => "process_exp",
152 'desc' => "{BasPlug.process_exp}",
153 'type' => "string",
154 'deft' => &get_default_process_exp(),
155 'reqd' => "no" },
156 { 'name' => "block_exp",
157 'desc' => "{BasPlug.block_exp}",
158 'type' => "string",
159 'deft' => &get_default_block_exp(),
160 'reqd' => "no" },
161 { 'name' => "title_sub",
162 'desc' => "{HTMLPlug.title_sub}",
163 'type' => "string",
164 'deft' => "" },
165 { 'name' => "noscaleup",
166 'desc' => "{ImagePlug.noscaleup}",
167 'type' => "flag",
168 'reqd' => "no" },
169 { 'name' => "thumbnail",
170 'desc' => "{PagedImgPlug.thumbnail}",
171 'type' => "flag",
172 'reqd' => "no" },
173 { 'name' => "thumbnailsize",
174 'desc' => "{ImagePlug.thumbnailsize}",
175 'type' => "int",
176 'deft' => "100",
177 'range' => "1,",
178 'reqd' => "no" },
179 { 'name' => "thumbnailtype",
180 'desc' => "{ImagePlug.thumbnailtype}",
181 'type' => "string",
182 'deft' => "gif",
183 'reqd' => "no" },
184 { 'name' => "screenview",
185 'desc' => "{PagedImgPlug.screenview}",
186 'type' => "flag",
187 'reqd' => "no" },
188 { 'name' => "screenviewsize",
189 'desc' => "{PagedImgPlug.screenviewsize}",
190 'type' => "int",
191 'deft' => "500",
192 'range' => "1,",
193 'reqd' => "no" },
194 { 'name' => "screenviewtype",
195 'desc' => "{PagedImgPlug.screenviewtype}",
196 'type' => "string",
197 'deft' => "jpg",
198 'reqd' => "no" },
199 { 'name' => "converttotype",
200 'desc' => "{ImagePlug.converttotype}",
201 'type' => "string",
202 'deft' => "",
203 'reqd' => "no" },
204 { 'name' => "minimumsize",
205 'desc' => "{ImagePlug.minimumsize}",
206 'type' => "int",
207 'deft' => "100",
208 'range' => "1,",
209 'reqd' => "no" },
210 { 'name' => "headerpage",
211 'desc' => "{PagedImgPlug.headerpage}",
212 'type' => "flag",
213 'reqd' => "no" },
214 { 'name' => "documenttype",
215 'desc' => "{PagedImgPlug.documenttype}",
216 'type' => "enum",
217 'list' => $type_list,
218 'deft' => "paged",
219 'reqd' => "no" } ];
220
221
222my $options = { 'name' => "PagedImgPlug",
223 'desc' => "{PagedImgPlug.desc}",
224 'abstract' => "no",
225 'inherits' => "yes",
226 'args' => $arguments };
227
228sub new {
229 my ($class) = shift (@_);
230 my ($pluginlist,$inputargs,$hashArgOptLists) = @_;
231 push(@$pluginlist, $class);
232
233 if(defined $arguments){ push(@{$hashArgOptLists->{"ArgList"}},@{$arguments});}
234 if(defined $options) { push(@{$hashArgOptLists->{"OptList"}},$options)};
235
236 my $self = new XMLPlug($pluginlist, $inputargs, $hashArgOptLists);
237
238 return bless $self, $class;
239}
240
241sub get_default_process_exp {
242 my $self = shift (@_);
243
244 return q^\.item$^;
245}
246
247sub get_doctype {
248 my $self = shift(@_);
249
250 return "PagedDocument";
251}
252
253
254# want to block everything except the .item ones
255# but instead we will block images and txt files
256sub get_default_block_exp {
257 my $self = shift (@_);
258
259 return q^(?i)(\.jpe?g|\.gif|\.png|\.tif?f|\.te?xt|\.html?|~)$^;
260}
261
262# Create the thumbnail and screenview images, and discover the Image's
263# size, width, and height using the convert utility.
264sub process_image {
265 my $self = shift (@_);
266 my $filename = shift (@_); # filename with full path
267 my $srcfile = shift (@_); # filename without path
268 my $doc_obj = shift (@_);
269 my $section = shift (@_); #the current section
270 my $rotate = shift (@_); # whether to rotate the image or not
271 $rotate = 0 unless defined $rotate;
272
273 # check that the image file exists!!
274 if (!-f $filename) {
275 print "PagedImgPlug: ERROR: File $filename does not exist, skipping\n";
276 return 0;
277 }
278
279 my $top=0;
280 if ($section eq $doc_obj->get_top_section()) {
281 $top=1;
282 }
283 my $verbosity = $self->{'verbosity'};
284 my $outhandle = $self->{'outhandle'};
285
286 # check the filename is okay
287 return 0 if ($srcfile eq "" || $filename eq "");
288
289 my $minimumsize = $self->{'minimumsize'};
290 if (defined $minimumsize && (-s $filename < $minimumsize)) {
291 print $outhandle "PagedImgPlug: \"$filename\" too small, skipping\n"
292 if ($verbosity > 1);
293 }
294
295 # Convert the image to a new type (if required), and rotate if required.
296 my $converttotype = $self->{'converttotype'};
297 my $originalfilename = ""; # only set if we do a conversion
298 my $type = "unknown";
299 my $converted = 0;
300 my $rotated=0;
301
302 if ($converttotype ne "" && $filename !~ /$converttotype$/) {
303 $converted=1;
304 $originalfilename = $filename;
305 my $filehead = &util::get_tmp_filename();
306 $filename = $filehead . ".$converttotype";
307 my $n = 1;
308 while (-e $filename) {
309 $filename = "$filehead$n\.$converttotype";
310 $n++;
311 }
312 $self->{'tmp_filename1'} = $filename;
313
314 my $rotate_option = "";
315 if ($rotate eq "r") {
316 $rotate_option = "-rotate 180 ";
317 }
318
319 my $command = "convert -verbose \"$originalfilename\" $rotate_option \"$filename\"";
320 print $outhandle "CONVERT: $command\n" if ($verbosity > 2);
321 my $result = '';
322 $result = `$command`;
323 print $outhandle "CONVERT RESULT = $result\n" if ($verbosity > 2);
324
325 $type = $converttotype;
326 } elsif ($rotate eq "r") {
327 $rotated=1;
328 $originalfilename = $filename;
329 $filename = &util::get_tmp_filename();
330
331 my $command = "convert \"$originalfilename\" -rotate 180 \"$filename\"";
332 print $outhandle "ROTATE: $command\n" if ($verbosity > 2);
333 my $result = '';
334 $result = `$command`;
335 print $outhandle "ROTATE RESULT = $result\n" if ($verbosity > 2);
336
337 }
338
339
340 # Add the image metadata
341 my $file; # the new file name
342 my $id = $srcfile;
343 $id =~ s/\.([^\.]*)$//; # the new file name without an extension
344 if ($converted) {
345 # we have converted the image
346 # add on the new extension
347 $file = "$id.$converttotype";
348 } else {
349 $file = $srcfile;
350 }
351
352 my $url =$file; # the new file name prepared for a url
353 my $srcurl = $srcfile;
354 ##$url =~ s/ /%20/g;
355 ##$srcurl =~ s/ /%20/g;
356
357 $doc_obj->add_metadata ($section, "Image", $url);
358
359 # Also want to set filename as 'Source' metadata to be
360 # consistent with other plugins
361 $doc_obj->add_metadata ($section, "Source", $srcurl);
362
363 my ($image_type, $image_width, $image_height, $image_size)
364 = &identify($filename, $outhandle, $verbosity);
365
366 $doc_obj->add_metadata ($section, "ImageType", $image_type);
367 $doc_obj->add_metadata ($section, "ImageWidth", $image_width);
368 $doc_obj->add_metadata ($section, "ImageHeight", $image_height);
369 $doc_obj->add_metadata ($section, "ImageSize", $image_size);
370 $doc_obj->add_metadata ($section, "FileFormat", "PagedImg");
371 # add NoText metadata which can be used to suppress the dummy text
372 $doc_obj->add_metadata ($section, "NoText", "1");
373
374 if ($type eq "unknown" && $image_type) {
375 $type = $image_type;
376 }
377
378 if ($top) {
379 $doc_obj->add_metadata ($section, "srclink",
380 "<a href=\"_httpprefix_/collect/[collection]/index/assoc/[assocfilepath]/[Image]\">");
381 $doc_obj->add_metadata ($section, "srcicon", "<img src=\"_httpprefix_/collect/[collection]/index/assoc/[assocfilepath]/[Image]\">");
382
383 } else {
384 $doc_obj->add_metadata ($section, "srclink",
385 "<a href=\"_httpprefix_/collect/[collection]/index/assoc/[parent(Top):assocfilepath]/[Image]\">");
386 $doc_obj->add_metadata ($section, "srcicon", "<img src=\"_httpprefix_/collect/[collection]/index/assoc/[parent(Top):assocfilepath]/[Image]\">");
387
388 }
389 $doc_obj->add_metadata ($section, "/srclink", "</a>");
390
391
392 # Add the image as an associated file
393 $doc_obj->associate_file($filename,$file,"image/$type",$section);
394 print $outhandle "associating file $filename as name $file\n" if ($verbosity > 2);
395
396 if ($self->{'thumbnail'}) {
397 # Make the thumbnail image
398 my $thumbnailsize = $self->{'thumbnailsize'} || 100;
399 my $thumbnailtype = $self->{'thumbnailtype'} || 'gif';
400
401 my $filehead = &util::get_tmp_filename();
402 my $thumbnailfile = $filehead . ".$thumbnailtype";
403 my $n=1;
404 while (-e $thumbnailfile) {
405 $thumbnailfile = $filehead . $n . ".$thumbnailtype";
406 $n++;
407 }
408
409 $self->{'tmp_filename2'} = $thumbnailfile;
410
411 # Generate the thumbnail with convert
412 my $command = "convert -verbose -geometry $thumbnailsize"
413 . "x$thumbnailsize \"$filename\" \"$thumbnailfile\"";
414 print $outhandle "THUMBNAIL: $command\n" if ($verbosity > 2);
415 my $result = '';
416 $result = `$command 2>&1` ;
417 print $outhandle "THUMB RESULT: $result\n" if ($verbosity > 2);
418
419 # Add the thumbnail as an associated file ...
420 if (-e "$thumbnailfile") {
421 $doc_obj->associate_file("$thumbnailfile", $id."thumb.$thumbnailtype", "image/$thumbnailtype",$section);
422 $doc_obj->add_metadata ($section, "ThumbType", $thumbnailtype);
423 $doc_obj->add_metadata ($section, "Thumb", $id."thumb.$thumbnailtype");
424 if ($top) {
425 $doc_obj->add_metadata ($section, "thumbicon", "<img src=\"_httpprefix_/collect/[collection]/index/assoc/[assocfilepath]/[Thumb]\" width=[ThumbWidth] height=[ThumbHeight]>");
426 } else {
427 $doc_obj->add_metadata ($section, "thumbicon", "<img src=\"_httpprefix_/collect/[collection]/index/assoc/[parent(Top):assocfilepath]/[Thumb]\" width=[ThumbWidth] height=[ThumbHeight]>");
428 }
429 }
430
431 # Extract Thumbnail metadata from convert output
432 if ($result =~ m/[0-9]+x[0-9]+=>([0-9]+)x([0-9]+)/) {
433 $doc_obj->add_metadata ($section, "ThumbWidth", $1);
434 $doc_obj->add_metadata ($section, "ThumbHeight", $2);
435 }
436 }
437 # Make a screen-sized version of the picture if requested
438 if ($self->{'screenview'}) {
439
440 # To do: if the actual image is smaller than the screenview size,
441 # we should use the original !
442
443 my $screenviewsize = $self->{'screenviewsize'} || 500;
444 my $screenviewtype = $self->{'screenviewtype'} || 'jpeg';
445 my $filehead = &util::get_tmp_filename();
446 my $screenviewfilename = $filehead . ".$screenviewtype";
447 my $n=1;
448 while (-e $screenviewfilename) {
449 $screenviewfilename = "$filehead$n\.$screenviewtype";
450 $n++;
451 }
452 $self->{'tmp_filename3'} = $screenviewfilename;
453
454 # make the screenview image
455 my $command = "convert -verbose -geometry $screenviewsize"
456 . "x$screenviewsize \"$filename\" \"$screenviewfilename\"";
457 print $outhandle "SCREENVIEW: $command\n" if ($verbosity > 2);
458 my $result = "";
459 $result = `$command 2>&1` ;
460 print $outhandle "SCREENVIEW RESULT: $result\n" if ($verbosity > 3);
461
462 # get screenview dimensions, size and type
463 if ($result =~ m/[0-9]+x[0-9]+=>([0-9]+)x([0-9]+)/) {
464 $doc_obj->add_metadata ($section, "ScreenWidth", $1);
465 $doc_obj->add_metadata ($section, "ScreenHeight", $2);
466 }elsif ($result =~ m/([0-9]+)x([0-9]+)/) {
467 #if the image hasn't changed size, the previous regex doesn't match
468 $doc_obj->add_metadata ($section, "ScreenWidth", $1);
469 $doc_obj->add_metadata ($section, "ScreenHeight", $2);
470 }
471
472 #add the screenview as an associated file ...
473 if (-e "$screenviewfilename") {
474 $doc_obj->associate_file("$screenviewfilename", $id."sv.$screenviewtype",
475 "image/$screenviewtype",$section);
476 print $outhandle "associating screen file $screenviewfilename as name ${id}sv.$screenviewtype\n" if ($verbosity > 2);
477
478 $doc_obj->add_metadata ($section, "ScreenType", $screenviewtype);
479 $doc_obj->add_metadata ($section, "Screen", $id."sv.$screenviewtype");
480
481 if ($top) {
482 $doc_obj->add_metadata ($section, "screenicon", "<img src=\"_httpprefix_/collect/[collection]/index/assoc/[assocfilepath]/[Screen]\" width=[ScreenWidth] height=[ScreenHeight]>");
483 } else {
484 $doc_obj->add_metadata ($section, "screenicon", "<img src=\"_httpprefix_/collect/[collection]/index/assoc/[parent(Top):assocfilepath]/[Screen]\" width=[ScreenWidth] height=[ScreenHeight]>");
485
486 }
487 } else {
488 print $outhandle "PagedImgPlug: couldn't find \"$screenviewfilename\"\n";
489 }
490 }
491
492 return $type;
493
494
495}
496
497
498
499# Discover the characteristics of an image file with the ImageMagick
500# "identify" command.
501
502sub identify {
503 my ($image, $outhandle, $verbosity) = @_;
504
505 # Use the ImageMagick "identify" command to get the file specs
506 my $command = "identify \"$image\" 2>&1";
507 print $outhandle "$command\n" if ($verbosity > 2);
508 my $result = '';
509 $result = `$command`;
510 print $outhandle "$result\n" if ($verbosity > 3);
511
512 # Read the type, width, and height
513 my $type = 'unknown';
514 my $width = 'unknown';
515 my $height = 'unknown';
516
517 my $image_safe = quotemeta $image;
518 if ($result =~ /^$image_safe (\w+) (\d+)x(\d+)/) {
519 $type = $1;
520 $width = $2;
521 $height = $3;
522 }
523
524 # Read the size
525 my $size = "unknown";
526 if ($result =~ m/^.* ([0-9]+)b/) {
527 $size = $1;
528 } elsif ($result =~ m/^.* ([0-9]+)kb/) {
529 $size = 1024 * $1;
530 }
531
532 print $outhandle "file: $image:\t $type, $width, $height, $size\n"
533 if ($verbosity > 3);
534
535 # Return the specs
536 return ($type, $width, $height, $size);
537}
538
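# The parsing done in identify() can be exercised without running
# ImageMagick by separating it from the command call. The sketch below uses
# the same regular expressions as the sub above; the helper name
# parse_identify_line is an assumption, and the sample line in the usage
# note is made up for illustration, not captured ImageMagick output (the
# real format varies between versions).

```perl
use strict;
use warnings;

# Hypothetical helper: extract type, width, height and size (in bytes)
# from one line of identify-style output for the given image name.
sub parse_identify_line {
    my ($image, $result) = @_;
    my ($type, $width, $height, $size) = ('unknown') x 4;

    # The line is expected to start with the image name itself
    my $image_safe = quotemeta $image;
    if ($result =~ /^$image_safe (\w+) (\d+)x(\d+)/) {
        ($type, $width, $height) = ($1, $2, $3);
    }

    # Size is given either in bytes ("123b") or kilobytes ("12kb")
    if ($result =~ m/^.* ([0-9]+)(k?)b/) {
        $size = $2 eq 'k' ? 1024 * $1 : $1;
    }

    return ($type, $width, $height, $size);
}
```

# For example, parse_identify_line("p1.gif", "p1.gif GIF 640x480 12kb")
# yields ("GIF", 640, 480, 12288). Any field that cannot be matched stays
# "unknown", as in identify().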
539
540# The PagedImgPlug read() function. This function does all the right things
541# to make general options work for a given plugin. It calls the process()
542# function which does all the work specific to a plugin (like the old
543# read functions used to do). Most plugins should define their own
544# process() function and let this read() function keep control.
545#
546# PagedImgPlug overrides read() because there is no need to read the actual
547# text of the file in, because the contents of the file is not text...
548#
549# Return number of files processed, undef if can't process
550# Note that $base_dir might be "" and that $file might
551# include directories
552
553sub read_into_doc_obj {
554 my $self = shift (@_);
555 my ($pluginfo, $base_dir, $file, $metadata, $processor, $maxdocs, $total_count, $gli) = @_;
556 my $outhandle = $self->{'outhandle'};
557
558 #check process and block exps, smart block, etc
559 my ($block_status,$filename) = $self->read_block(@_);
560 return $block_status if ((!defined $block_status) || ($block_status==0));
561
562 print $outhandle "PagedImgPlug processing \"$filename\"\n"
563 if $self->{'verbosity'} > 1;
564 print STDERR "<Processing n='$file' p='PagedImgPlug'>\n" if ($gli);
565
566 # here we need to decide if we have an old text .item file, or a new xml
567 # .item file - for now the test is: if the first non-empty line contains
568 # <PagedDocument> then it's xml
569 my $xml_version = 0;
570 open (ITEMFILE, $filename) || die "couldn't open $filename\n";
571
572 my $backup_filename = "backup.item";
573 open (BACKUP,">$backup_filename")|| die "couldn't write to $backup_filename\n";
574 my $line = "";
575 my $num = 0;
576 $line = <ITEMFILE>;
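577 while (defined $line && $line !~ /\w/) { # stop at end of file instead of looping forever
578 $line = <ITEMFILE>;
579 }
580 chomp $line if defined $line;
581 if (defined $line && $line =~ /<PagedDocument/) {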
582 $xml_version = 1;
583 }
584 close ITEMFILE;
585 open (ITEMFILE, $filename) || die "couldn't open $filename\n";
586 $line = <ITEMFILE>;
587 $line =~ s/^\xEF\xBB\xBF//; # strip BOM
588 $line =~ s/\x0B+//g;
589 $line =~ s/&(?!\w+;|#\d+;|#x[0-9a-fA-F]+;)/&amp;/g; # escape bare ampersands only, so a rebuild doesn't double-escape
590 print BACKUP ($line);
591 # Tidy up the item file: some metadata titles contain vertical tabs (\x0B)
592 while ($line = <ITEMFILE>) {
593 $line =~ s/\x0B+//g;
594 $line =~ s/&(?!\w+;|#\d+;|#x[0-9a-fA-F]+;)/&amp;/g;
595 print BACKUP ($line);
596 }
597 close ITEMFILE;
598 close BACKUP;
599 &File::Copy::copy ($backup_filename, $filename);
600 &util::rm($backup_filename);
601
602 my $doc_obj;
603 if ($xml_version) {
604 $file =~ s/^[\/\\]+//; # $file often begins with / so we'll tidy it up
605 $self->{'file'} = $file;
606 $self->{'filename'} = $filename;
607 $self->{'processor'} = $processor;
608 $self->{'metadata'} = $metadata;
609
610 eval {
611 $@ = "";
612 my $xslt = $self->{'xslt'};
613 if (defined $xslt && ($xslt ne "")) {
614 # perform xslt
615 my $transformed_xml = $self->apply_xslt($xslt,$filename);
616
617 # feed transformed file (now in memory as string) into XML parser
618 #$self->{'parser'}->parse($transformed_xml);
619 $self->parse_string($transformed_xml);
620 }
621 else {
622 #$self->{'parser'}->parsefile($filename);
623 $self->parse_file($filename);
624 }
625 };
626
627
628
629 if ($@) {
630
631 # parsefile may either croak somewhere in XML::Parser (e.g. because
632 # the document is not well formed) or die somewhere in XMLPlug or a
633 # derived plugin (e.g. because we're attempting to process a
634 # document whose DOCTYPE is not meant for this plugin). For the
635 # first case we'll print a warning and continue, for the second
636 # we'll just continue quietly
637
638 print STDERR "**** XML Parse Error is: $@\n";
639
640 my ($msg) = $@ =~ /Carp::croak\(\'(.*?)\'\)/;
641 if (defined $msg) {
642 my $outhandle = $self->{'outhandle'};
643 my $plugin_name = ref ($self);
644 print $outhandle "$plugin_name failed to process $file ($msg)\n";
645 }
646
647 # reset ourself for the next document
648 $self->{'section_level'}=0;
649 print STDERR "<ProcessingError n='$file'>\n" if ($gli);
650 return -1; # error during processing
651 }
652 $doc_obj = $self->{'doc_obj'};
653 } else {
654 my ($dir);
655 ($dir, $file) = $filename =~ /^(.*?)([^\/\\]*)$/;
656
657 #process the .item file
658 $doc_obj = $self->process_item($filename, $dir, $file, $processor);
659
660 }
661
662 if ($self->{'cover_image'}) {
663 $self->associate_cover_image($doc_obj, $filename);
664 }
665
666 # include any metadata passed in from previous plugins
667 # note that this metadata is associated with the top level section
668 my $section = $doc_obj->get_top_section();
669 $self->extra_metadata ($doc_obj, $section, $metadata);
670 #my $text="";
671 # do plugin specific processing of doc_obj
672 #unless (defined ($self->process(\$text, $pluginfo, $base_dir, $file, $metadata, $doc_obj))) {
673 #print STDERR "<ProcessingError n='$file'>\n" if ($gli);
674 #return -1;
675 #}
676 # do any automatic metadata extraction
677 $self->auto_extract_metadata ($doc_obj);
678
679 $self->{'num_processed'}++;
680 return (1,$doc_obj);
681}
682
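# The clean-up applied to each line of an item file in read_into_doc_obj
# (strip a byte order mark, drop vertical tabs, escape ampersands) can be
# sketched as a standalone function. The helper name tidy_item_line is an
# assumption, and this variant leaves existing entities such as &amp;
# untouched so that it can safely run more than once on the same file.

```perl
use strict;
use warnings;

# Hypothetical helper: clean one line of an .item file.
# $is_first should be true only for the first line of the file.
sub tidy_item_line {
    my ($line, $is_first) = @_;
    $line =~ s/^\xEF\xBB\xBF// if $is_first;    # strip a UTF-8 byte order mark
    $line =~ s/\x0B+//g;                        # drop vertical tabs
    # Escape only ampersands that do not already start an entity
    # (named, decimal or hexadecimal), so the result is idempotent.
    $line =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $line;
}
```

# Because the function is idempotent, tidy_item_line(tidy_item_line($l, 0), 0)
# gives the same string as a single call, which matters when the same item
# file is cleaned on every build.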
683sub read
684{
685 my $self = shift (@_);
686 my ($pluginfo, $base_dir, $file, $metadata, $processor, $maxdocs, $total_count, $gli) = @_; my ($process_status,$doc_obj) = $self->read_into_doc_obj(@_);
687
688 if ((defined $process_status) && ($process_status == 1)) {
689 # process the document
690 $processor->process($doc_obj);
691
692 #if(defined($self->{'places_filename'})){
693 # &util::rm($self->{'places_filename'});
694 # $self->{'places_filename'} = undef;
695 #}
696 #$self->{'num_processed'} ++;
697 undef $doc_obj;
698 }
699
700 # clean up temporary files - we do this here instead of in
701 # process_image because associated files aren't actually copied
702 # until after process has been run.
703 if (defined $self->{'tmp_filename1'} &&
704 -e $self->{'tmp_filename1'}) {
705 &util::rm($self->{'tmp_filename1'})
706 }
707 if (defined $self->{'tmp_filename2'} &&
708 -e $self->{'tmp_filename2'}) {
709 &util::rm($self->{'tmp_filename2'})
710 }
711 if (defined $self->{'tmp_filename3'} &&
712 -e $self->{'tmp_filename3'}) {
713 &util::rm($self->{'tmp_filename3'})
714 }
715 # if process_status == 1, then the file has been processed.
716 return $process_status;
717}
718
719sub xml_start_tag {
720 my $self = shift(@_);
721 my ($expat, $element) = @_;
722 $self->{'element'} = $element;
723
724 my $doc_obj = $self->{'doc_obj'};
725 if ($element eq "PagedDocument") {
726 $self->{'current_section'} = $doc_obj->get_top_section();
727 } elsif ($element eq "PageGroup" || $element eq "Page") {
728 # create a new section as a child
729 $self->{'current_section'} = $doc_obj->insert_section($doc_obj->get_end_child($self->{'current_section'}));
730 $self->{'num_pages'}++;
731 # assign pagenum as what??
732 my $pagenum = $_{'pagenum'}; #TODO!!
733 if (defined $pagenum) {
734 $doc_obj->set_utf8_metadata_element($self->{'current_section'}, 'PageNum', $pagenum);
735 }
736 my ($imgfile) = $_{'imgfile'};
737 if (defined $imgfile) {
738 $self->process_image($self->{'base_dir'}.$imgfile, $imgfile, $doc_obj, $self->{'current_section'});
739 }
740 my ($txtfile) = $_{'txtfile'};
741 if (defined($txtfile)&& $txtfile ne "") {
742 $self->process_text ($self->{'base_dir'}.$txtfile, $txtfile, $doc_obj, $self->{'current_section'});
743 $doc_obj->set_metadata_element($self->{'current_section'},"NoText","0");
744 } else {
745 # otherwise add in some dummy text
746 #create an empty text string so we don't break downstream plugins
747 my $text = &gsprintf::lookup_string("{BasPlug.dummy_text}",1);
748 $doc_obj->add_utf8_text($self->{'current_section'}, $text);
749 $doc_obj->add_metadata($self->{'current_section'},"NoText","1");
750 }
751 } elsif ($element eq "Metadata") {
752 $self->{'metadata_name'} = $_{'name'};
753 }
754}
755
756sub xml_end_tag {
757 my $self = shift(@_);
758 my ($expat, $element) = @_;
759
760 my $doc_obj = $self->{'doc_obj'};
761 if ($element eq "Page" || $element eq "PageGroup") {
762 # if Title hasn't been assigned, set PageNum as Title
763 if (!defined $doc_obj->get_metadata_element ($self->{'current_section'}, "Title") && defined $doc_obj->get_metadata_element ($self->{'current_section'}, "PageNum" )) {
764 $doc_obj->add_utf8_metadata ($self->{'current_section'}, "Title", $doc_obj->get_metadata_element ($self->{'current_section'}, "PageNum" ));
765 }
766 # move the current section back to the parent
767 $self->{'current_section'} = $doc_obj->get_parent_section($self->{'current_section'});
768 } elsif ($element eq "Metadata") {
769
770 $doc_obj->add_utf8_metadata ($self->{'current_section'}, $self->{'metadata_name'}, $self->{'metadata_value'});
771 $self->{'metadata_name'} = "";
772 $self->{'metadata_value'} = "";
773
774 }
775 # otherwise we ignore the end tag
776}
777
778
779sub xml_text {
780 my $self = shift(@_);
781 my ($expat) = @_;
782
783 if ($self->{'element'} eq "Metadata" && $self->{'metadata_name'}) {
784 $self->{'metadata_value'} .= $_;
785 }
786}
787
788sub xml_doctype {
789}
790
791sub open_document {
792 my $self = shift(@_);
793
794 # create a new document
795 $self->{'doc_obj'} = new doc ($self->{'filename'}, "indexed_doc");
796 my $doc_obj = $self->{'doc_obj'};
797 $doc_obj->set_OIDtype ($self->{'processor'}->{'OIDtype'});
798 my ($dir, $file) = $self->{'filename'} =~ /^(.*?)([^\/\\]*)$/;
799 $self->{'base_dir'} = $dir;
800 $self->{'num_pages'} = 0;
801 my $topsection = $doc_obj->get_top_section();
802 if ($self->{'documenttype'} eq 'paged') {
803 # set the gsdlthistype metadata to Paged - this ensures this document will
804 # be treated as a Paged doc, even if Titles are not numeric
805
806 $doc_obj->set_utf8_metadata_element ($topsection, "gsdlthistype", "Paged");
807 } else {
808 $doc_obj->set_utf8_metadata_element ($topsection, "gsdlthistype", "Hierarchy");
809 }
810
811 $doc_obj->add_metadata ($topsection, "Source", $file);
812 if ($self->{'headerpage'}) {
813 $doc_obj->add_text($topsection, &gsprintf::lookup_string("{BasPlug.dummy_text}"));
814 }
815
816}
817
818sub close_document {
819 my $self = shift(@_);
820 my $doc_obj = $self->{'doc_obj'};
821
822 $doc_obj->add_utf8_metadata($doc_obj->get_top_section(), "Plugin", "$self->{'plugin_type'}");
823 $doc_obj->add_metadata($doc_obj->get_top_section(), "FileFormat", "PagedImg");
824
825 # add numpages metadata
826 $doc_obj->set_utf8_metadata_element ($doc_obj->get_top_section(), 'NumPages', $self->{'num_pages'});
827
828 # add an OID
829 $doc_obj->set_OID();
830
831}

sub process_item {
    my $self = shift (@_);
    my ($filename, $dir, $file, $processor) = @_;

    my $doc_obj = new doc ($filename, "indexed_doc");
    $doc_obj->set_OIDtype ($processor->{'OIDtype'}, $processor->{'OIDmetadata'});
    my $topsection = $doc_obj->get_top_section();
    $doc_obj->add_utf8_metadata($topsection, "Plugin", "$self->{'plugin_type'}");
    $doc_obj->add_metadata($topsection, "FileFormat", "PagedImg");

    if ($self->{'documenttype'} eq 'paged') {
        # set the gsdlthistype metadata to Paged - this ensures this document will
        # be treated as a Paged doc, even if Titles are not numeric
        $doc_obj->set_utf8_metadata_element ($topsection, "gsdlthistype", "Paged");
    } else {
        $doc_obj->set_utf8_metadata_element ($topsection, "gsdlthistype", "Hierarchy");
    }

    $doc_obj->add_metadata ($topsection, "Source", $file);

    # include the system error ($!) so a failed open can be diagnosed
    open (ITEMFILE, $filename) || die "couldn't open $filename: $!\n";
    my $line = "";
    my $num = 0;
    while (defined ($line = <ITEMFILE>)) {
        next unless $line =~ /\w/;
        chomp $line;
        next if $line =~ /^#/; # ignore comment lines
        if ($line =~ /^<([^>]*)>\s*(.*?)\s*$/) {
            # a <Name>value line sets metadata on the top section
            $doc_obj->set_utf8_metadata_element ($topsection, $1, $2);
            #$meta->{$1} = $2;
        } else {
            $num++;
            # line should be like page:imagefilename:textfilename:r - the r is optional -> means rotate the image 180 deg
            $line =~ s/^\s+//; #remove space at the front
            $line =~ s/\s+$//; #remove space at the end
            my ($pagenum, $imgname, $txtname, $rotate) = split /:/, $line;

            # create a new section for each image file
            my $cursection = $doc_obj->insert_section($doc_obj->get_end_child($topsection));
            # the page number becomes the Title
            $doc_obj->set_utf8_metadata_element($cursection, 'Title', $pagenum);

            # process the image for this page if there is one
            if (defined $imgname && $imgname ne "") {
                my $result1 = $self->process_image($dir.$imgname, $imgname, $doc_obj, $cursection, $rotate);

                if (!defined $result1)
                {
                    # interpolate the path directly; a "." inside the string would be printed literally
                    print "PagedImgPlug: couldn't process image \"$dir$imgname\" for item \"$filename\"\n";
                }
            }
            # process the text file if one is there
            if (defined $txtname && $txtname ne "") {
                my $result2 = $self->process_text ($dir.$txtname, $txtname, $doc_obj, $cursection);

                # process_text returns 0 (not undef) on failure, so test for falsehood
                if (!$result2) {
                    print "PagedImgPlug: couldn't process text file \"$dir$txtname\" for item \"$filename\"\n";
                }
                else{
                    $doc_obj->set_metadata_element($cursection, "NoText", "0");
                }
            } else {
                # otherwise add in some dummy text
                $doc_obj->add_text($cursection, &gsprintf::lookup_string("{BasPlug.dummy_text}"));
                # add NoText metadata which can be used to suppress the dummy text
            }
        }
    }

    close ITEMFILE;

    # if we want a header page, we need to add some text into the top section, otherwise this section will become invisible
    if ($self->{'headerpage'}) {
        $doc_obj->add_text($topsection, &gsprintf::lookup_string("{BasPlug.dummy_text}"));
    }
    $file =~ s/\.item//i;
    $doc_obj->set_OID ();
    # add numpages metadata
    $doc_obj->set_utf8_metadata_element ($topsection, 'NumPages', "$num");
    return $doc_obj;
}
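
# Illustrative sketch of an .item file as read by process_item above.
# The file names and metadata values are made up for this example:
#
#   # lines starting with '#' are ignored
#   <Title>Sample Newspaper
#   <Date>1901
#   1:page001.gif:page001.txt
#   2:page002.gif:page002.txt:r
#   3:page003.gif:
#   4::page004.txt
#
# <Name>value lines become metadata on the top section. Every other
# non-blank line is split on ':' into page number, image file, text file
# and an optional rotate flag, and creates one new section. Page 3 names
# no text file, so it receives the dummy text; page 4 names no image, so
# only its text file is processed.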

sub process_text {
    my $self = shift (@_);
    my ($fullpath, $file, $doc_obj, $cursection) = @_;

    # check that the text file exists!!
    if (!-f $fullpath) {
        print "PagedImgPlug: ERROR: File $fullpath does not exist, skipping\n";
        return 0;
    }

    # Do encoding stuff
    my ($language, $encoding) = $self->textcat_get_language_encoding ($fullpath);

    my $text="";
    &BasPlug::read_file($self, $fullpath, $encoding, $language, \$text);
    if (!length ($text)) {
        # It's a bit unusual but not out of the question to have no text, so just give a warning
        print "PagedImgPlug: WARNING: $fullpath contains no text\n";
    }

    # we need to escape the escape character, or else mg will convert into
    # eg literal newlines, instead of leaving the text as '\n'
    $text =~ s/\\/\\\\/g; # macro language
    $text =~ s/_/\\_/g; # macro language


    if ($text =~ m/<html.*?>\s*<head.*?>.*<\/head>\s*<body.*?>(.*)<\/body>\s*<\/html>\s*$/s) {
        # looks like HTML input
        # no need to escape < and > or put in <pre> tags

        $text = $1;

        # add the body content to the document object as is (no <pre> tags)
        $doc_obj->add_utf8_text($cursection, "$text");
    }
    else {
        $text =~ s/</&lt;/g;
        $text =~ s/>/&gt;/g;

        # insert preformat tags and add text to document object
        $doc_obj->add_utf8_text($cursection, "<pre>\n$text\n</pre>");
    }


    return 1;
}
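
# A rough illustration of the escaping done by process_text above for
# plain (non-HTML) input. The sample text is made up, and \n here stands
# for a literal backslash followed by 'n', not a newline:
#
#   text read from file:   see <b>_id\n
#   text added to section: <pre>
#                          see &lt;b&gt;\_id\\n
#                          </pre>
#
# Backslashes are doubled first, then underscores are escaped, and only
# then are < and > turned into entities.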

# do plugin specific processing of doc_obj
sub process {
    my $self = shift (@_);
    my ($textref, $pluginfo, $base_dir, $file, $metadata, $doc_obj) = @_;
    my $outhandle = $self->{'outhandle'};

    return 1;
}

1;