source: main/trunk/greenstone2/perllib/plugins/PagedImagePlugin.pm@ 28836

Last change on this file since 28836 was 28355, checked in by ak19, 11 years ago
  1. Now gsConvert.pl calls the new pptextract.vbs VBScript (which creates .item files and ppt slide.txt files in utf-8) instead of the older VB pptextract.exe executable which created .item and slide.txt files in windows default utf-16 LE. 2. PagedImagePlugin.pm::tidy_item_file now reads in the .item files in utf-8 mode, so that its strings are unicode aware. Substitutions are of unicode code points instead of byte sequences, since the strings in the file are now unicode aware.
  • Property svn:executable set to *
  • Property svn:keywords set to Author Date Id Revision
File size: 29.1 KB
###########################################################################
#
# PagedImagePlugin.pm -- plugin for sets of images and OCR text that
# make up a document
# A component of the Greenstone digital library software
# from the New Zealand Digital Library Project at the
# University of Waikato, New Zealand.
#
# Copyright (C) 1999 New Zealand Digital Library Project
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
###########################################################################

# PagedImagePlugin
# processes sequences of images, with optional OCR text
#
# This plugin takes *.item files, which contain metadata and lists of image
# files, and produces a document containing sections, one for each page.
# The files should be named something.item, so that you can have more than
# one book in a directory. You will need to create these files, one for each
# document/book.
#
# There are two formats for the item files: a plain text format, and an XML
# format. You can use either format, and can have both formats in the same
# collection if you like. If you use the plain format, you must not start the
# file off with <PagedDocument>

#### PLAIN FORMAT
# The format of the xxx.item file is as follows:
# The first lines contain any metadata for the whole document
# <metadata-name>metadata-value
# e.g.
# <Title>Snail farming
# <Date>19230102
# Then comes a list of pages, one page per line. Each line has the format
#
# pagenum:imagefile:textfile:r
#
# pagenum and imagefile are required. pagenum is used for the Title
# of the section, and in the display is shown as page <pagenum>.
# imagefile is the image for the page. textfile is an optional text
# file containing the OCR (or any) text for the page - this gets added
# as the text for the section. r is optional, and signals that the image
# should be rotated 180 degrees, e.g. use this if the image is upside down.
# So an example item file looks like:
# <Title>Snail farming
# <Date>19960403
# 1:p1.gif:p1.txt:
# 2:p2.gif::
# 3:p3.gif:p3.txt:
# 3b:p3b.gif:p3b.txt:r
# The second page has no text, the fourth page is a back page, and
# should be rotated.
#

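=pod

The plain format described above can be read with a few lines of Perl. The
following is a minimal sketch, not the plugin's own code: parse_plain_item is
a hypothetical helper that mirrors the line-matching rules used further down
in process_item (blank lines and lines starting with # are skipped,
<name>value lines are metadata, everything else is split on colons). The hash
keys it returns are assumptions made for this illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the plugin): split plain-format .item
# text into document metadata and a list of page records.
sub parse_plain_item {
    my ($text) = @_;
    my (%metadata, @pages);
    foreach my $line (split /\n/, $text) {
        next unless $line =~ /\w/;    # skip blank lines
        next if $line =~ /^#/;        # skip comment lines
        if ($line =~ /^<([^>]*)>\s*(.*?)\s*$/) {
            $metadata{$1} = $2;       # <metadata-name>metadata-value
            next;
        }
        $line =~ s/^\s+|\s+$//g;      # trim surrounding whitespace
        # split drops trailing empty fields, so textfile and r may be undef
        my ($pagenum, $imgfile, $txtfile, $rotate) = split /:/, $line;
        push @pages, {
            pagenum => $pagenum,
            imgfile => $imgfile,
            txtfile => (defined $txtfile ? $txtfile : ""),
            rotate  => (defined $rotate && $rotate eq "r") ? 1 : 0,
        };
    }
    return (\%metadata, \@pages);
}

my $item = <<'END_ITEM';
<Title>Snail farming
<Date>19960403
1:p1.gif:p1.txt:
2:p2.gif::
3:p3.gif:p3.txt:
3b:p3b.gif:p3b.txt:r
END_ITEM

my ($metadata, $pages) = parse_plain_item($item);
print "Title: $metadata->{Title}\n";
foreach my $page (@$pages) {
    printf "page %s: image=%s text=%s rotate=%d\n",
        $page->{pagenum}, $page->{imgfile},
        ($page->{txtfile} eq "" ? "(none)" : $page->{txtfile}),
        $page->{rotate};
}
```

The sketch keeps the page label as a string because labels such as 3b are
allowed in the example item file.

=cut
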
#### XML FORMAT
# The xml format looks like the following
#<PagedDocument>
#<Metadata name="Title">The Title of the entire document</Metadata>
#<Page pagenum="1" imgfile="xxx.jpg" txtfile="yyy.txt">
#<Metadata name="Title">The Title of this page</Metadata>
#</Page>
#... more pages
#</PagedDocument>
#PagedDocument contains a list of Pages, Metadata and PageGroups. Any metadata
#that is not inside another tag will belong to the document.
#Each Page has a pagenum (not used at the moment), an imgfile and/or a txtfile.
#These are both optional - if neither is used, the section will have no content.
#Pages can also have metadata associated with them.
#PageGroups can be introduced at any point - they can contain Metadata and Pages
#and other PageGroups. They are used to introduce hierarchical structure into
#the document.
#For example
#<PagedDocument>
#<PageGroup>
#<Page>
#<Page>
#</PageGroup>
#<Page>
#</PagedDocument>
#would generate a structure like
#X
#--X
#  --X
#  --X
#--X
#PageGroup tags can also have imgfile/textfile metadata if you like - this way
#they get some content themselves.

#Currently the XML structure doesn't work very well with the paged document
#type, unless you use numerical Titles for each section.
#There is still a bit of work to do on this format:
#* enable other text file types, eg html, pdf etc
#* make the document paging work properly
#* add pagenum as Title unless a Title is present?

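=pod

The way PageGroup and Page tags map onto nested sections can be illustrated
with a small stand-alone script. This is only a sketch under stated
assumptions: outline_sections is a hypothetical helper, it scans tags with a
regular expression rather than using a real XML parser (the plugin itself
goes through ReadXMLFile), and the sample input uses self-closing Page tags
so that it is well-formed.

```perl
use strict;
use warnings;

# Hypothetical helper: list the section nesting implied by the Page and
# PageGroup tags. A regex scan for illustration only, not an XML parser.
sub outline_sections {
    my ($xml) = @_;
    my @outline = ("X");                  # the top-level document section
    my $depth = 0;
    while ($xml =~ m{<(/?)(PageGroup|Page)\b([^>]*)>}g) {
        my ($closing, $tag, $attrs) = ($1, $2, $3);
        if ($closing) {
            $depth--;                     # end tag: back to the parent section
            next;
        }
        $depth++;                         # start tag: new child section
        push @outline, ("  " x ($depth - 1)) . "--X";
        $depth-- if $attrs =~ m{/\s*$};   # self-closing tag such as <Page ... />
    }
    return @outline;
}

my $xml = <<'END_XML';
<PagedDocument>
<PageGroup>
<Page imgfile="p1.gif"/>
<Page imgfile="p2.gif"/>
</PageGroup>
<Page imgfile="p3.gif"/>
</PagedDocument>
END_XML

my @outline = outline_sections($xml);
print "$_\n" for @outline;
```

Each start tag adds one child section under the current one and each end tag
steps back to the parent, which is the same bookkeeping that xml_start_tag
and xml_end_tag do with current_section.

=cut
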
# All the supplementary image and text files should be in the same folder as
# the .item file.
#
# To display the images instead of the document text, you can use [srcicon]
# in the DocumentText format statement.
# For example,
#
# format DocumentText "<center><table width=_pagewidth_><tr><td>[srcicon]</td></tr></table></center>"
#
# To have it create thumbnail size images, use the '-create_thumbnail' option.
# To have it create medium size images for display, use the '-create_screenview'
# option. As usual, running
# 'perl -S pluginfo.pl PagedImagePlugin' will list all the options.

# If you want the resulting documents to be presented with a table of
# contents, use '-documenttype hierarchy', otherwise they will have
# next and previous arrows, and a goto page X box.

# If you have used -create_screenview, you can also use [screenicon] in the format
# statement to display the smaller image. Here is an example that switches
# between the two:
#
# format DocumentText "<center><table width=_pagewidth_><tr><td>{If}{_cgiargp_ eq full,<a href='_httpdocument_&d=_cgiargd_&p=small'>Switch to small version.</a>,<a href='_httpdocument_&d=_cgiargd_&p=full'>Switch to fullsize version</a>}</td></tr><tr><td>{If}{_cgiargp_ eq full,<a href='_httpdocument_&d=_cgiargd_&p=small' title='Switch to small version'>[srcicon]</a>,<a href='_httpdocument_&d=_cgiargd_&p=full' title='Switch to fullsize version'>[screenicon]</a>}</td></tr></table></center>"
#
# Additional metadata can be added into the .item files. Alternatively, you can
# use normal metadata.xml files, with the name of the xxx.item file as the
# FileName (only for document level metadata).

package PagedImagePlugin;

use Encode;
use File::Copy (); # core module; tidy_item_file calls File::Copy::copy by its full name
use ReadXMLFile;
use ReadTextFile;
use ImageConverter;
use MetadataRead;

use strict;
no strict 'refs'; # allow filehandles to be variables and vice versa

sub BEGIN {
    @PagedImagePlugin::ISA = ('MetadataRead', 'ReadXMLFile', 'ReadTextFile', 'ImageConverter');
}

my $gs2_type_list =
    [ { 'name' => "auto",
        'desc' => "{PagedImagePlugin.documenttype.auto2}" },
      { 'name' => "paged",
        'desc' => "{PagedImagePlugin.documenttype.paged2}" },
      { 'name' => "hierarchy",
        'desc' => "{PagedImagePlugin.documenttype.hierarchy}" }
    ];

my $gs3_type_list =
    [ { 'name' => "auto",
        'desc' => "{PagedImagePlugin.documenttype.auto3}" },
      { 'name' => "paged",
        'desc' => "{PagedImagePlugin.documenttype.paged3}" },
      { 'name' => "hierarchy",
        'desc' => "{PagedImagePlugin.documenttype.hierarchy}" },
      { 'name' => "pagedhierarchy",
        'desc' => "{PagedImagePlugin.documenttype.pagedhierarchy}" }
    ];

my $arguments =
    [ { 'name' => "process_exp",
        'desc' => "{BasePlugin.process_exp}",
        'type' => "string",
        'deft' => &get_default_process_exp(),
        'reqd' => "no" },
      { 'name' => "title_sub",
        'desc' => "{HTMLPlugin.title_sub}",
        'type' => "string",
        'deft' => "" },
      { 'name' => "headerpage",
        'desc' => "{PagedImagePlugin.headerpage}",
        'type' => "flag",
        'reqd' => "no" },
#     { 'name' => "documenttype",
#       'desc' => "{PagedImagePlugin.documenttype}",
#       'type' => "enum",
#       'list' => $type_list,
#       'deft' => "auto",
#       'reqd' => "no" },
      { 'name' => "processing_tmp_files",
        'desc' => "{BasePlugin.processing_tmp_files}",
        'type' => "flag",
        'hiddengli' => "yes" }
    ];

my $doc_type_opt = { 'name' => "documenttype",
                     'desc' => "{PagedImagePlugin.documenttype}",
                     'type' => "enum",
                     'deft' => "auto",
                     'reqd' => "no" };

my $options = { 'name' => "PagedImagePlugin",
                'desc' => "{PagedImagePlugin.desc}",
                'abstract' => "no",
                'inherits' => "yes",
                'args' => $arguments };

sub new {
    my ($class) = shift (@_);
    my ($pluginlist,$inputargs,$hashArgOptLists) = @_;
    push(@$pluginlist, $class);

    push(@{$hashArgOptLists->{"OptList"}},$options);

    my $imc_self = new ImageConverter($pluginlist, $inputargs, $hashArgOptLists);

    # we can use this plugin to check gs3 version
    if ($imc_self->{'gs_version'} eq "3") {
        $doc_type_opt->{'list'} = $gs3_type_list;
    }
    else {
        $doc_type_opt->{'list'} = $gs2_type_list;
    }
    push(@$arguments,$doc_type_opt);
    # now we add the args to the list for parsing
    push(@{$hashArgOptLists->{"ArgList"}},@{$arguments});

    my $rtf_self = new ReadTextFile($pluginlist, $inputargs, $hashArgOptLists, 1);
    my $rxf_self = new ReadXMLFile($pluginlist, $inputargs, $hashArgOptLists);

    my $self = BasePlugin::merge_inheritance($imc_self,$rtf_self,$rxf_self);

    # Update $self used by XML::Parser so it finds callback functions
    # such as start_document here and not in ReadXMLFile (which is what
    # $self was when new XML::Parser was done)
    #
    # If the $self returned by this constructor is the same as the one
    # used in ReadXMLFile (e.g. in the GreenstoneXMLPlugin) then this step isn't necessary
    #
    # Consider embedding this type of assignment into merge_inheritance
    # to help catch all cases?

    $rxf_self->{'parser'}->{'PluginObj'} = $self;

    return bless $self, $class;
}


sub init {
    my $self = shift (@_);
    my ($verbosity, $outhandle, $failhandle) = @_;

    $self->SUPER::init(@_);
    $self->ImageConverter::init();
}

sub begin {
    my $self = shift (@_);
    my ($pluginfo, $base_dir, $processor, $maxdocs) = @_;

    $self->SUPER::begin(@_);
    $self->ImageConverter::begin(@_);
}

sub get_default_process_exp {
    my $self = shift (@_);

    return q^\.item$^;
}

sub get_doctype {
    my $self = shift(@_);

    return "PagedDocument";
}


# want to use BasePlugin's version of this, not ReadXMLFile's
sub can_process_this_file {
    my $self = shift(@_);
    return $self->BasePlugin::can_process_this_file(@_);
}

# instead of a block exp, now we scan the file and record all text and img files mentioned there for blocking.
sub store_block_files
{
    my $self = shift (@_);
    my ($filename_full_path, $block_hash) = @_;

    my $xml_version = $self->is_xml_item_file($filename_full_path);

    # do we need to do this?
    # does BOM interfere just with XML parsing? In that case don't need it here
    # if we do it here, we are modifying the file before we have worked out if
    # it's new or not, so it will always be reimported.
    #$self->tidy_item_file($filename_full_path);

    my ($dir, $file) = $filename_full_path =~ /^(.*?)([^\/\\]*)$/;
    if ($xml_version) {

        # XML item file: collect the imgfile/txtfile attribute values
        $self->scan_xml_for_files_to_block($filename_full_path, $dir, $block_hash);
    } else {

        $self->scan_item_for_files_to_block($filename_full_path, $dir, $block_hash);
    }

}

# we want to use BasePlugin's read, not ReadXMLFile's
sub read
{
    my $self = shift (@_);

    $self->BasePlugin::read(@_);
}



sub read_into_doc_obj {
    my $self = shift (@_);
    my ($pluginfo, $base_dir, $file, $block_hash, $metadata, $processor, $maxdocs, $total_count, $gli) = @_;
    my $outhandle = $self->{'outhandle'};
    my $verbosity = $self->{'verbosity'};

    my ($filename_full_path, $filename_no_path) = &util::get_full_filenames($base_dir, $file);

    print $outhandle "PagedImagePlugin processing \"$filename_full_path\"\n"
        if $verbosity > 1;
    print STDERR "<Processing n='$file' p='PagedImagePlugin'>\n" if ($gli);

    $self->{'MaxImageWidth'} = 0;
    $self->{'MaxImageHeight'} = 0;

    # here we need to decide if we have an old text .item file, or a new xml
    # .item file
    my $xml_version = $self->is_xml_item_file($filename_full_path);

    $self->tidy_item_file($filename_full_path);

    my $doc_obj;
    if ($xml_version) {
        # careful checking needed here!! are we using local xml handlers or super ones
        $self->ReadXMLFile::read($pluginfo, $base_dir, $file, $block_hash, $metadata, $processor, $maxdocs, $total_count, $gli);
        $doc_obj = $self->{'doc_obj'};
    } else {
        my ($dir, $item_file);
        ($dir, $item_file) = $filename_full_path =~ /^(.*?)([^\/\\]*)$/;

        #process the .item file
        $doc_obj = $self->process_item($filename_full_path, $dir, $item_file, $processor, $metadata);

    }

    my $section = $doc_obj->get_top_section();

    $doc_obj->add_utf8_metadata($section, "Plugin", "$self->{'plugin_type'}");
    $doc_obj->add_metadata($section, "FileFormat", "PagedImage");

    # include any metadata passed in from previous plugins
    # note that this metadata is associated with the top level section
    $self->add_associated_files($doc_obj, $filename_full_path);
    $self->extra_metadata ($doc_obj, $section, $metadata);
    $self->auto_extract_metadata ($doc_obj);
    $self->plugin_specific_process($base_dir, $file, $doc_obj, $gli);
    # if we haven't found any Title so far, assign one
    $self->title_fallback($doc_obj,$section,$filename_no_path);

    $self->add_OID($doc_obj);
    return (1,$doc_obj);
}
# override this for an inheriting plugin to add extra metadata etc
sub plugin_specific_process {
    my $self = shift(@_);
    my ($base_dir, $file, $doc_obj, $gli) = @_;

}

# for now, the test is: if the first non-empty line contains <PagedDocument>, then it's XML
sub is_xml_item_file {
    my $self = shift(@_);
    my ($filename) = @_;

    my $xml_version = 0;
    open (ITEMFILE, $filename) || die "couldn't open $filename\n";

    my $line = "";
    my $num = 0;

    $line = <ITEMFILE>;
    while (defined ($line) && ($line !~ /\w/)) {
        $line = <ITEMFILE>;
    }

    if (defined $line) {
        chomp $line;
        if ($line =~ /<PagedDocument/) {
            $xml_version = 1;
        }
    }

    close ITEMFILE;
    return $xml_version;
}

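=pod

The format test above can be tried without touching the file system. The
sketch below is an illustration only: looks_like_xml_item and check are
hypothetical helpers that apply the same rule (the first line containing a
word character decides the format) to an in-memory filehandle, and they use a
lexical filehandle instead of the bareword ITEMFILE.

```perl
use strict;
use warnings;

# Hypothetical helper: same test as is_xml_item_file, but reading from any
# filehandle so it can be tried on in-memory strings.
sub looks_like_xml_item {
    my ($fh) = @_;
    while (defined(my $line = <$fh>)) {
        next unless $line =~ /\w/;      # skip lines with no word characters
        return ($line =~ /<PagedDocument/) ? 1 : 0;
    }
    return 0;                           # nothing but blank lines: treat as plain format
}

# Hypothetical wrapper: open a string as a file and run the test on it.
sub check {
    my ($content) = @_;
    open(my $fh, '<', \$content) or die "couldn't open in-memory file: $!\n";
    my $is_xml = looks_like_xml_item($fh);
    close $fh;
    return $is_xml;
}

print check("\n\n<PagedDocument>\n<Page/>\n</PagedDocument>\n"), "\n";
print check("<Title>Snail farming\n1:p1.gif::\n"), "\n";
```

Note that a plain-format file whose first metadata line happened to contain
the text <PagedDocument would be classified as XML by this rule, which is why
the header comment asks plain files not to start that way.

=cut
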
sub tidy_item_file {
    my $self = shift(@_);
    my ($filename) = @_;

    open (ITEMFILE, "<:encoding(UTF-8)", $filename) || die "couldn't open $filename\n";
    my $backup_filename = "backup.item";
    open (BACKUP,">$backup_filename")|| die "couldn't write to $backup_filename\n";
    binmode(BACKUP, ":utf8");
    my $line = "";
    $line = <ITEMFILE>;
    #$line =~ s/^\xEF\xBB\xBF//; # strip BOM in text file read in as a sequence of bytes (not unicode aware strings)
    $line =~ s/^\x{FEFF}//; # strip BOM in file opened *as UTF-8*. Strings in the file just read in are now unicode-aware,
                            # this means the BOM is now a unicode codepoint instead of a byte sequence
                            # See http://en.wikipedia.org/wiki/Byte_order_mark and http://perldoc.perl.org/5.14.0/perlunicode.html
    $line =~ s/\x{0B}+//ig; # removing \vt-vertical tabs using the unicode codepoint for \vt
    $line =~ s/&/&amp;/g;
    print BACKUP ($line);
    # Tidy up the rest of the item file: some metadata titles contain \vt (vertical tab) characters
    while ($line = <ITEMFILE>) {
        $line =~ s/\x{0B}+//ig; # removing \vt-vertical tabs using the unicode codepoint for \vt
        $line =~ s/&/&amp;/g;
        print BACKUP ($line);
    }
    close ITEMFILE;
    close BACKUP;
    &File::Copy::copy ($backup_filename, $filename);
    &FileUtils::removeFiles($backup_filename);

}

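=pod

The per-line clean-up in tidy_item_file can be looked at in isolation. The
following is a sketch under stated assumptions, not a drop-in replacement:
tidy_line is a hypothetical helper working on an already-decoded string, and
it deliberately differs from the sub above in one respect. The sub escapes
every ampersand, so tidying the same file twice would turn &amp; into
&amp;amp;. The sketch uses a negative lookahead to leave an existing &amp;
alone, which is an assumption about what is wanted rather than documented
plugin behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: the per-line clean-up, applied to a decoded string.
sub tidy_line {
    my ($line, $is_first_line) = @_;
    $line =~ s/^\x{FEFF}// if $is_first_line;  # BOM is one code point once decoded
    $line =~ s/\x{0B}+//g;                     # drop vertical tabs
    $line =~ s/&(?!amp;)/&amp;/g;              # ASSUMPTION: skip '&' already written as &amp;
    return $line;
}

my $raw  = "\x{FEFF}<Title>Fish \x{0B}& Chips\n";
my $once = tidy_line($raw, 1);
my $twice = tidy_line($once, 1);
print $once;
print(($once eq $twice) ? "second pass changed nothing\n" : "second pass changed the line\n");
```

Other entities (for example &lt;) are still escaped again by this variant, so
it only addresses the repeated &amp; case.

=cut
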
sub rotate_image {
    my $self = shift (@_);
    my ($filename_full_path) = @_;

    my ($this_filetype) = $filename_full_path =~ /\.([^\.]*)$/;
    my $result = $self->convert($filename_full_path, $this_filetype, "-rotate 180", "ROTATE");
    my ($new_filename) = ($result =~ /=>(.*\.$this_filetype)/);
    if (-e "$new_filename") {
        return $new_filename;
    }
    # something's gone wrong: fall back to the unrotated image
    return $filename_full_path;

}

sub process_image {
    my $self = shift(@_);
    my ($filename_full_path, $filename_no_path, $doc_obj, $section, $rotate) = @_;
    # check the filenames
    return 0 if ($filename_no_path eq "" || !-f $filename_full_path);

    # remember that this image file was one of our source files, but only
    # if we are not processing a tmp file
    if (!$self->{'processing_tmp_files'} ) {
        $doc_obj->associate_source_file($filename_full_path);
    }
    # do rotation
    if ((defined $rotate) && ($rotate eq "r")) {
        # we get a new temporary file which is rotated
        $filename_full_path = $self->rotate_image($filename_full_path);
    }

    # do generate images
    my $result = 0;
    if ($self->{'image_conversion_available'} == 1) {
        # do we need to convert $filename_no_path to utf8/url encoded?
        # We are already reading in from a file, what encoding is it in???
        my $url_encoded_full_filename
            = &unicode::raw_filename_to_url_encoded($filename_full_path);
        $result = $self->generate_images($filename_full_path, $url_encoded_full_filename, $doc_obj, $section);
    }
    #overwrite one set in ImageConverter
    $doc_obj->set_metadata_element ($section, "FileFormat", "PagedImage");
    return $result;
}


sub xml_start_tag {
    my $self = shift(@_);
    my ($expat, $element) = @_;
    $self->{'element'} = $element;

    my $doc_obj = $self->{'doc_obj'};
    if ($element eq "PagedDocument") {
        $self->{'current_section'} = $doc_obj->get_top_section();
    } elsif ($element eq "PageGroup" || $element eq "Page") {
        if ($element eq "PageGroup") {
            $self->{'has_internal_structure'} = 1;
        }
        # create a new section as a child
        $self->{'current_section'} = $doc_obj->insert_section($doc_obj->get_end_child($self->{'current_section'}));
        $self->{'num_pages'}++;
        # assign pagenum as what??
        my $pagenum = $_{'pagenum'}; #TODO!!
        if (defined $pagenum) {
            $doc_obj->set_utf8_metadata_element($self->{'current_section'}, 'PageNum', $pagenum);
        }
        my ($imgfile) = $_{'imgfile'};
        if (defined $imgfile) {
            # *****
            # What about support for rotate image (e.g. old ':r' notation)?
            $self->process_image($self->{'xml_file_dir'}.$imgfile, $imgfile, $doc_obj, $self->{'current_section'});
        }
        my ($txtfile) = $_{'txtfile'};
        if (defined($txtfile)&& $txtfile ne "") {
            $self->process_text ($self->{'xml_file_dir'}.$txtfile, $txtfile, $doc_obj, $self->{'current_section'});
        } else {
            $self->add_dummy_text($doc_obj, $self->{'current_section'});
        }
    } elsif ($element eq "Metadata") {
        $self->{'metadata_name'} = $_{'name'};
    }
}

sub xml_end_tag {
    my $self = shift(@_);
    my ($expat, $element) = @_;

    my $doc_obj = $self->{'doc_obj'};
    if ($element eq "Page" || $element eq "PageGroup") {
        # if Title hasn't been assigned, set PageNum as Title
        if (!defined $doc_obj->get_metadata_element ($self->{'current_section'}, "Title") && defined $doc_obj->get_metadata_element ($self->{'current_section'}, "PageNum" )) {
            $doc_obj->add_utf8_metadata ($self->{'current_section'}, "Title", $doc_obj->get_metadata_element ($self->{'current_section'}, "PageNum" ));
        }
        # move the current section back to the parent
        $self->{'current_section'} = $doc_obj->get_parent_section($self->{'current_section'});
    } elsif ($element eq "Metadata") {

        # text read in by XML::Parser is in Perl's binary byte value
        # form ... need to explicitly make it UTF-8
        my $meta_name = decode("utf-8",$self->{'metadata_name'});
        my $metadata_value = decode("utf-8",$self->{'metadata_value'});

        if ($meta_name =~ /\./) {
            $meta_name = "ex.$meta_name";
        }

        $doc_obj->add_utf8_metadata ($self->{'current_section'}, $meta_name, $metadata_value);
        $self->{'metadata_name'} = "";
        $self->{'metadata_value'} = "";

    }
    # otherwise we ignore the end tag
}


sub xml_text {
    my $self = shift(@_);
    my ($expat) = @_;

    if ($self->{'element'} eq "Metadata" && $self->{'metadata_name'}) {
        $self->{'metadata_value'} .= $_;
    }
}

sub xml_doctype {
}

sub open_document {
    my $self = shift(@_);

    # create a new document
    $self->{'doc_obj'} = new doc ($self->{'filename'}, "indexed_doc", $self->{'file_rename_method'});
    # TODO is file filename_no_path??
    $self->set_initial_doc_fields($self->{'doc_obj'}, $self->{'filename'}, $self->{'processor'}, $self->{'metadata'});

    my ($dir, $file) = $self->{'filename'} =~ /^(.*?)([^\/\\]*)$/;
    $self->{'xml_file_dir'} = $dir;
    $self->{'num_pages'} = 0;
    $self->{'has_internal_structure'} = 0;

}

sub close_document {
    my $self = shift(@_);
    my $doc_obj = $self->{'doc_obj'};

    my $topsection = $doc_obj->get_top_section();

    # add numpages metadata
    $doc_obj->set_utf8_metadata_element ($topsection, 'NumPages', $self->{'num_pages'});

    # set the document type
    my $final_doc_type = "";
    if ($self->{'documenttype'} eq "auto") {
        if ($self->{'has_internal_structure'}) {
            if ($self->{'gs_version'} eq "3") {
                $final_doc_type = "pagedhierarchy";
            }
            else {
                $final_doc_type = "hierarchy";
            }
        } else {
            $final_doc_type = "paged";
        }
    } else {
        # set to what doc type option was set to
        $final_doc_type = $self->{'documenttype'};
    }
    $doc_obj->set_utf8_metadata_element ($topsection, "gsdlthistype", $final_doc_type);
    ### capitalisation????
#    if ($self->{'documenttype'} eq 'paged') {
#        # set the gsdlthistype metadata to Paged - this ensures this document will
#        # be treated as a Paged doc, even if Titles are not numeric
#        $doc_obj->set_utf8_metadata_element ($topsection, "gsdlthistype", "Paged");
#    } else {
#        $doc_obj->set_utf8_metadata_element ($topsection, "gsdlthistype", "Hierarchy");
#    }

    $doc_obj->set_utf8_metadata_element($topsection,"MaxImageWidth",$self->{'MaxImageWidth'});
    $doc_obj->set_utf8_metadata_element($topsection,"MaxImageHeight",$self->{'MaxImageHeight'});
    $self->{'MaxImageWidth'} = undef;
    $self->{'MaxImageHeight'} = undef;

}


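=pod

The document type decision made in close_document can be written as a small
pure function, which makes the possible outcomes easier to see. This is a
sketch for illustration: resolve_doc_type is a hypothetical name, and its
three arguments stand in for the documenttype option, the
has_internal_structure flag and the gs_version value held on the plugin
object.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the branch structure in close_document.
sub resolve_doc_type {
    my ($requested, $has_internal_structure, $gs_version) = @_;
    # an explicit -documenttype setting always wins
    return $requested unless $requested eq "auto";
    # auto: flat lists of pages are paged documents
    return "paged" unless $has_internal_structure;
    # auto with PageGroups: the result depends on the Greenstone version
    return ($gs_version eq "3") ? "pagedhierarchy" : "hierarchy";
}

foreach my $case (["auto", 0, "2"], ["auto", 1, "2"], ["auto", 1, "3"], ["hierarchy", 0, "3"]) {
    my ($requested, $structured, $version) = @$case;
    printf "documenttype=%s structure=%d gs=%s -> %s\n",
        $requested, $structured, $version,
        resolve_doc_type($requested, $structured, $version);
}
```

For plain-format item files the plugin does not go through this logic:
process_item uses paged whenever the option is auto.

=cut
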
sub set_initial_doc_fields {
    my $self = shift(@_);
    my ($doc_obj, $filename_full_path, $processor, $metadata) = @_;

    my $topsection = $doc_obj->get_top_section();

    my $plugin_filename_encoding = $self->{'filename_encoding'};
    my $filename_encoding = $self->deduce_filename_encoding($filename_full_path,$metadata,$plugin_filename_encoding);
    $self->set_Source_metadata($doc_obj, $filename_full_path, $filename_encoding);

    # if we want a header page, we need to add some text into the top section, otherwise this section will become invisible
    if ($self->{'headerpage'}) {
        $self->add_dummy_text($doc_obj, $topsection);
    }
}

sub scan_xml_for_files_to_block
{
    my $self = shift (@_);
    my ($filename_full_path, $dir, $block_hash) = @_;

    open (ITEMFILE, $filename_full_path) || die "couldn't open $filename_full_path to work out which files to block\n";
    my $line = "";
    while (defined ($line = <ITEMFILE>)) {
        next unless $line =~ /\w/;

        if ($line =~ /imgfile=\"([^\"]+)\"/) {
            &util::block_filename($block_hash,&FileUtils::filenameConcatenate($dir,$1));
        }
        if ($line =~ /txtfile=\"([^\"]+)\"/) {
            &util::block_filename($block_hash,&FileUtils::filenameConcatenate($dir,$1));
        }
    }
    close ITEMFILE;

}

sub scan_item_for_files_to_block
{
    my $self = shift (@_);
    my ($filename_full_path, $dir, $block_hash) = @_;


    open (ITEMFILE, $filename_full_path) || die "couldn't open $filename_full_path to work out which files to block\n";
    my $line = "";
    while (defined ($line = <ITEMFILE>)) {
        next unless $line =~ /\w/;
        chomp $line;
        next if $line =~ /^#/; # ignore comment lines
        next if ($line =~ /^<([^>]*)>\s*(.*?)\s*$/); # ignore metadata lines
        # line should be like page:imagefilename:textfilename:r
        $line =~ s/^\s+//; #remove space at the front
        $line =~ s/\s+$//; #remove space at the end
        my ($pagenum, $imgname, $txtname, $rotate) = split /:/, $line;

        # find the image file if there is one
        if (defined $imgname && $imgname ne "") {
            &util::block_filename($block_hash, &FileUtils::filenameConcatenate( $dir,$imgname));
        }
        # find the text file if there is one
        if (defined $txtname && $txtname ne "") {
            &util::block_filename($block_hash, &FileUtils::filenameConcatenate($dir,$txtname));
        }
    }
    close ITEMFILE;

}

sub process_item {
    my $self = shift (@_);
    my ($filename_full_path, $dir, $filename_no_path, $processor, $metadata) = @_;

    my $doc_obj = new doc ($filename_full_path, "indexed_doc", $self->{'file_rename_method'});
    $self->set_initial_doc_fields($doc_obj, $filename_full_path, $processor, $metadata);
    my $topsection = $doc_obj->get_top_section();
    # simple item files are always paged unless user specified
    if ($self->{'documenttype'} eq "auto") {
        $doc_obj->set_utf8_metadata_element ($topsection, "gsdlthistype", "paged");
    } else {
        $doc_obj->set_utf8_metadata_element ($topsection, "gsdlthistype", $self->{'documenttype'});
    }
    open (ITEMFILE, $filename_full_path) || die "couldn't open $filename_full_path\n";
    my $line = "";
    my $num = 0;
    while (defined ($line = <ITEMFILE>)) {

        # Since process_item is called not on an XML item file, but a text item file
        # don't decode into UTF8 the text that was read in, since it's already UTF-8
        #$line = decode("utf-8",$line);

        next unless $line =~ /\w/;
        chomp $line;
        next if $line =~ /^#/; # ignore comment lines
        if ($line =~ /^<([^>]*)>\s*(.*?)\s*$/) {
            my $meta_name = $1;
            my $meta_value = $2;
            if ($meta_name =~ /\./) {
                $meta_name = "ex.$meta_name";
            }
            $doc_obj->set_utf8_metadata_element ($topsection, $meta_name, $meta_value);
            #$meta->{$1} = $2;
        } else {
            $num++;
            # line should be like page:imagefilename:textfilename:r - the r is optional -> means rotate the image 180 deg
            $line =~ s/^\s+//; #remove space at the front
            $line =~ s/\s+$//; #remove space at the end
            my ($pagenum, $imgname, $txtname, $rotate) = split /:/, $line;

            # create a new section for each image file
            my $cursection = $doc_obj->insert_section($doc_obj->get_end_child($topsection));
            # the page number becomes the Title
            $doc_obj->set_utf8_metadata_element($cursection, 'Title', $pagenum);

            # process the image for this page if there is one
            if (defined $imgname && $imgname ne "") {
                my $result1 = $self->process_image($dir.$imgname, $imgname, $doc_obj, $cursection, $rotate);
                if (!defined $result1)
                {
                    print "PagedImagePlugin: couldn't process image \"$dir$imgname\" for item \"$filename_full_path\"\n";
                }
            }
            # process the text file if one is there
            if (defined $txtname && $txtname ne "") {
                my $result2 = $self->process_text ($dir.$txtname, $txtname, $doc_obj, $cursection);

                # process_text returns 0 on failure and 1 on success (never undef),
                # so test for a false value rather than for definedness
                if (!$result2) {
                    print "PagedImagePlugin: couldn't process text file \"$dir$txtname\" for item \"$filename_full_path\"\n";
                    $self->add_dummy_text($doc_obj, $cursection);
                }
            } else {
                # otherwise add in some dummy text
                $self->add_dummy_text($doc_obj, $cursection);
            }
        }
    }

    close ITEMFILE;

    # add numpages metadata
    $doc_obj->set_utf8_metadata_element ($topsection, 'NumPages', "$num");

    $doc_obj->set_utf8_metadata_element($topsection,"MaxImageWidth",$self->{'MaxImageWidth'});
    $doc_obj->set_utf8_metadata_element($topsection,"MaxImageHeight",$self->{'MaxImageHeight'});
    $self->{'MaxImageWidth'} = undef;
    $self->{'MaxImageHeight'} = undef;


    return $doc_obj;
}

sub process_text {
    my $self = shift (@_);
    my ($filename_full_path, $file, $doc_obj, $cursection) = @_;

    # check that the text file exists!!
    if (!-f $filename_full_path) {
        print "PagedImagePlugin: ERROR: File $filename_full_path does not exist, skipping\n";
        return 0;
    }

    # remember that this text file was one of our source files, but only
    # if we are not processing a tmp file
    if (!$self->{'processing_tmp_files'} ) {
        $doc_obj->associate_source_file($filename_full_path);
    }
    # Do encoding stuff
    my ($language, $encoding) = $self->textcat_get_language_encoding ($filename_full_path);

    my $text="";
    &ReadTextFile::read_file($self, $filename_full_path, $encoding, $language, \$text); # already decoded as utf8
    if (!length ($text)) {
        # It's a bit unusual but not out of the question to have no text, so just give a warning
        print "PagedImagePlugin: WARNING: $filename_full_path contains no text\n";
    }

    # we need to escape the escape character, or else mg will convert into
    # eg literal newlines, instead of leaving the text as '\n'
    $text =~ s/\\/\\\\/g; # macro language
    $text =~ s/_/\\_/g; # macro language


    if ($text =~ m/<html.*?>\s*<head.*?>.*<\/head>\s*<body.*?>(.*)<\/body>\s*<\/html>\s*$/is) {
        # looks like HTML input
        # no need to escape < and > or put in <pre> tags

        $text = $1;

        # add text to document object
        $doc_obj->add_utf8_text($cursection, "$text");
    }
    else {
        $text =~ s/</&lt;/g;
        $text =~ s/>/&gt;/g;

        # insert preformat tags and add text to document object
        $doc_obj->add_utf8_text($cursection, "<pre>\n$text\n</pre>");
    }


    return 1;
}


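=pod

The text handling at the end of process_text can be exercised on plain
strings. The sketch below is illustrative only: text_to_section_html is a
hypothetical helper that returns the string instead of adding it to a
document object, while the substitutions and the HTML test are copied from
the sub above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn page text into the markup stored for a section.
sub text_to_section_html {
    my ($text) = @_;
    $text =~ s/\\/\\\\/g;   # escape backslashes for the macro language
    $text =~ s/_/\\_/g;     # escape underscores for the macro language
    if ($text =~ m/<html.*?>\s*<head.*?>.*<\/head>\s*<body.*?>(.*)<\/body>\s*<\/html>\s*$/is) {
        return $1;          # HTML input: keep only the body content
    }
    $text =~ s/</&lt;/g;    # plain text: make angle brackets safe
    $text =~ s/>/&gt;/g;
    return "<pre>\n$text\n</pre>";
}

print text_to_section_html("a_b < c"), "\n";
print text_to_section_html("<html><head><title>t</title></head><body><p>Hi</p></body></html>"), "\n";
```

Because the macro-language escaping happens before the HTML test, underscores
and backslashes inside an HTML body are escaped as well.

=cut
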
sub clean_up_after_doc_obj_processing {
    my $self = shift(@_);

    $self->ImageConverter::clean_up_temporary_files();
}

1;