source: gsdl/trunk/perllib/plugins/PagedImgPlug.pm@ 15018

Last change on this file since 15018 was 15018, checked in by davidb, 16 years ago
Marc mapping upgraded to support a richer set of operations, including subfields, multiple fields in one line (separated by commas), and the removal of rules, e.g. -245 at the start of a line. A MARC to Qualified Dublin Core crosswalk from the Library of Congress has been added as "etc/marc2qdc.txt". A collection can then choose to, for example, top up the mapping with its own version of the file stored in its local "etc" folder, specifying only the rules that are different. This is where a rule like "-245" might be used to override a more general rule from the main file that has all subfields in 245 mapping to one metadata item (Title). If the user specifies a different filename -- through a plugin option -- then they are free to devise a mapping from scratch and store it in the collection's local "etc" folder.
Property svn:executable set to `*`
Property svn:keywords set to `Author Date Id Revision`
File size: 33.9 KB

1	###########################################################################
2	#
3	# PagedImgPlug.pm -- plugin for sets of images and OCR text that
4	# make up a document
5	# A component of the Greenstone digital library software
6	# from the New Zealand Digital Library Project at the
7	# University of Waikato, New Zealand.
8	#
9	# Copyright (C) 1999 New Zealand Digital Library Project
10	#
11	# This program is free software; you can redistribute it and/or modify
12	# it under the terms of the GNU General Public License as published by
13	# the Free Software Foundation; either version 2 of the License, or
14	# (at your option) any later version.
15	#
16	# This program is distributed in the hope that it will be useful,
17	# but WITHOUT ANY WARRANTY; without even the implied warranty of
18	# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
19	# GNU General Public License for more details.
20	#
21	# You should have received a copy of the GNU General Public License
22	# along with this program; if not, write to the Free Software
23	# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
24	#
25	###########################################################################
26
27	# PagedImgPlug
28	# processes sequences of images, with optional OCR text
29	#
30	# This plugin takes *.item files, which contain metadata and lists of image
31	# files, and produces a document containing sections, one for each page.
32	# The files should be named something.item, then you can have more than one
33	# book in a directory. You will need to create these files, one for each
34	# document/book.
35	#
36	#There are two formats for the item files: a plain text format, and an xml
37	#format. You can use either format, and can have both formats in the same
38	#collection if you like. If you use the plain format, you must not start the
39	#file off with <PagedDocument>
40
41	#### PLAIN FORMAT
42	# The format of the xxx.item file is as follows:
43	# The first lines contain any metadata for the whole document
44	# <metadata-name>metadata-value
45	# eg.
46	# <Title>Snail farming
47	# <Date>19230102
48	# Then comes a list of pages, one page per line, each line has the format
49	#
50	# pagenum:imagefile:textfile:r
51	#
52	# page num and imagefile are required. pagenum is used for the Title
53	# of the section, and in the display is shown as page <pagenum>.
54	# imagefile is the image for the page. textfile is an optional text
55	# file containing the OCR (or any) text for the page - this gets added
56	# as the text for the section. r is optional, and signals that the image
57	# should be rotated 180 degrees, e.g. when the page was scanned upside down.
58	# So an example item file looks like:
59	# <Title>Snail farming
60	# <Date>19960403
61	# 1:p1.gif:p1.txt:
62	# 2:p2.gif::
63	# 3:p3.gif:p3.txt:
64	# 3b:p3b.gif:p3b.txt:r
65	# The second page has no text, the fourth page is a back page, and
66	# should be rotated.
67	#
68
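# A minimal sketch of how the plain format above can be read, assuming the
# rules as described: "<name>value" lines are document metadata and every
# other non-blank line is "pagenum:imagefile:textfile:r". parse_item_lines
# is a hypothetical helper for illustration; it is not part of PagedImgPlug.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PagedImgPlug): split plain-format
# item lines into document metadata and page records.
sub parse_item_lines {
    my @lines = @_;
    my (%metadata, @pages);
    foreach my $line (@lines) {
        next unless $line =~ /\w/;    # skip blank lines
        next if $line =~ /^#/;        # skip comment lines
        $line =~ s/^\s+|\s+$//g;      # trim surrounding whitespace
        if ($line =~ /^<([^>]*)>\s*(.*)$/) {
            $metadata{$1} = $2;       # <Name>value
        } else {
            # pagenum:imagefile:textfile:r (textfile and r are optional)
            my ($pagenum, $img, $txt, $rotate) = split /:/, $line;
            push @pages, {
                pagenum => $pagenum,
                image   => $img,
                text    => (defined $txt && $txt ne "") ? $txt : undef,
                rotate  => (defined $rotate && $rotate eq "r") ? 1 : 0,
            };
        }
    }
    return (\%metadata, \@pages);
}

my ($meta, $pages) = parse_item_lines(
    "<Title>Snail farming",
    "<Date>19960403",
    "1:p1.gif:p1.txt:",
    "2:p2.gif::",
    "3b:p3b.gif:p3b.txt:r",
);
printf "%s: %d pages\n", $meta->{Title}, scalar @$pages;
```

# The trailing fields can be left empty because split drops empty trailing
# fields, so "2:p2.gif::" yields only a page number and an image name.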
69	#### XML FORMAT
70	# The xml format looks like the following
71	#<PagedDocument>
72	#<Metadata name="Title">The Title of the entire document</Metadata>
73	#<Page pagenum="1" imgfile="xxx.jpg" txtfile="yyy.txt">
74	#<Metadata name="Title">The Title of this page</Metadata>
75	#</Page>
76	#... more pages
77	#</PagedDocument>
78	#PagedDocument contains a list of Pages, Metadata and PageGroups. Any metadata
79	#that is not inside another tag will belong to the document.
80	#Each Page has a pagenum (not used at the moment), an imgfile and/or a txtfile.
81	#These are both optional - if neither is used, the section will have no content.
82	#Pages can also have metadata associated with them.
83	#PageGroups can be introduced at any point - they can contain Metadata and Pages and other PageGroups. They are used to introduce hierarchical structure into the document.
84	#For example
85	#<PagedDocument>
86	#<PageGroup>
87	#<Page>
88	#<Page>
89	#</PageGroup>
90	#<Page>
91	#</PagedDocument>
92	#would generate a structure like
93	#X
94	#--X
95	#    --X
96	#    --X
97	#--X
98	#PageGroup tags can also have imgfile/txtfile attributes if you like - this way they get some content themselves.
99
100	#Currently the XML structure doesn't work very well with the paged document type, unless you use numerical Titles for each section.
101	#There is still a bit of work to do on this format:
102	#* enable other text file types, eg html, pdf etc
103	#* make the document paging work properly
104	#* add pagenum as Title unless a Title is present?
105
106	# All the supplementary image and text files should be in the same folder as
107	# the .item file.
108	#
109	# To display the images instead of the document text, you can use [srcicon]
110	# in the DocumentText format statement.
111	# For example,
112	#
113	# format DocumentText "<center><table width=_pagewidth_><tr><td>[srcicon]</td></tr></table></center>"
114	#
115	# To have it create thumbnail size images, use the '-thumbnail' option.
116	# To have it create medium size images for display, use the '-screenview'
117	# option. As usual, running
118	# 'perl -S pluginfo.pl PagedImgPlug' will list all the options.
119
120	# If you want the resulting documents to be presented with a table of
121	# contents, use '-documenttype hierarchy', otherwise they will have
122	# next and previous arrows, and a goto page X box.
123
124	# If you have used -screenview, you can also use [screenicon] in the format
125	# statement to display the smaller image. Here is an example that switches
126	# between the two:
127	#
128	# format DocumentText "<center><table width=_pagewidth_><tr><td>{If}{_cgiargp_ eq full,<a href='_httpdocument_&d=_cgiargd_&p=small'>Switch to small version.</a>,<a href='_httpdocument_&d=_cgiargd_&p=full'>Switch to fullsize version</a>}</td></tr><tr><td>{If}{_cgiargp_ eq full,<a href='_httpdocument_&d=_cgiargd_&p=small' title='Switch to small version'>[srcicon]</a>,<a href='_httpdocument_&d=_cgiargd_&p=full' title='Switch to fullsize version'>[screenicon]</a>}</td></tr></table></center>"
129	#
130	# Additional metadata can be added into the .item files, alternatively you can
131	# use normal metadata.xml files, with the name of the xxx.item file as the
132	# FileName (only for document level metadata).
133
134	package PagedImgPlug;
135
136	use XMLPlug;
137	use strict;
138	no strict 'refs'; # allow filehandles to be variables and viceversa
139
140	sub BEGIN {
141	@PagedImgPlug::ISA = ('XMLPlug');
142	}
143
144	my $type_list =
145	[ { 'name' => "paged",
146	'desc' => "{PagedImgPlug.documenttype.paged}" },
147	{ 'name' => "hierarchy",
148	'desc' => "{PagedImgPlug.documenttype.hierarchy}" } ];
149
150	my $arguments =
151	[ { 'name' => "process_exp",
152	'desc' => "{BasPlug.process_exp}",
153	'type' => "string",
154	'deft' => &get_default_process_exp(),
155	'reqd' => "no" },
156	{ 'name' => "block_exp",
157	'desc' => "{BasPlug.block_exp}",
158	'type' => "string",
159	'deft' => &get_default_block_exp(),
160	'reqd' => "no" },
161	{ 'name' => "title_sub",
162	'desc' => "{HTMLPlug.title_sub}",
163	'type' => "string",
164	'deft' => "" },
165	{ 'name' => "noscaleup",
166	'desc' => "{ImagePlug.noscaleup}",
167	'type' => "flag",
168	'reqd' => "no" },
169	{ 'name' => "thumbnail",
170	'desc' => "{PagedImgPlug.thumbnail}",
171	'type' => "flag",
172	'reqd' => "no" },
173	{ 'name' => "thumbnailsize",
174	'desc' => "{ImagePlug.thumbnailsize}",
175	'type' => "int",
176	'deft' => "100",
177	'range' => "1,",
178	'reqd' => "no" },
179	{ 'name' => "thumbnailtype",
180	'desc' => "{ImagePlug.thumbnailtype}",
181	'type' => "string",
182	'deft' => "gif",
183	'reqd' => "no" },
184	{ 'name' => "screenview",
185	'desc' => "{PagedImgPlug.screenview}",
186	'type' => "flag",
187	'reqd' => "no" },
188	{ 'name' => "screenviewsize",
189	'desc' => "{PagedImgPlug.screenviewsize}",
190	'type' => "int",
191	'deft' => "500",
192	'range' => "1,",
193	'reqd' => "no" },
194	{ 'name' => "screenviewtype",
195	'desc' => "{PagedImgPlug.screenviewtype}",
196	'type' => "string",
197	'deft' => "jpg",
198	'reqd' => "no" },
199	{ 'name' => "converttotype",
200	'desc' => "{ImagePlug.converttotype}",
201	'type' => "string",
202	'deft' => "",
203	'reqd' => "no" },
204	{ 'name' => "minimumsize",
205	'desc' => "{ImagePlug.minimumsize}",
206	'type' => "int",
207	'deft' => "100",
208	'range' => "1,",
209	'reqd' => "no" },
210	{ 'name' => "headerpage",
211	'desc' => "{PagedImgPlug.headerpage}",
212	'type' => "flag",
213	'reqd' => "no" },
214	{ 'name' => "documenttype",
215	'desc' => "{PagedImgPlug.documenttype}",
216	'type' => "enum",
217	'list' => $type_list,
218	'deft' => "paged",
219	'reqd' => "no" } ];
220
221
222	my $options = { 'name' => "PagedImgPlug",
223	'desc' => "{PagedImgPlug.desc}",
224	'abstract' => "no",
225	'inherits' => "yes",
226	'args' => $arguments };
227
228	sub new {
229	my ($class) = shift (@_);
230	my ($pluginlist,$inputargs,$hashArgOptLists) = @_;
231	push(@$pluginlist, $class);
232
233	if(defined $arguments){ push(@{$hashArgOptLists->{"ArgList"}},@{$arguments});}
234	if(defined $options) { push(@{$hashArgOptLists->{"OptList"}},$options)};
235
236	my $self = new XMLPlug($pluginlist, $inputargs, $hashArgOptLists);
237
238	return bless $self, $class;
239	}
240
241	sub get_default_process_exp {
242	my $self = shift (@_);
243
244	return q^\.item$^;
245	}
246
247	sub get_doctype {
248	my $self = shift(@_);
249
250	return "PagedDocument";
251	}
252
253
254	# want to block everything except the .item ones
255	# but instead we will block images and txt files
256	sub get_default_block_exp {
257	my $self = shift (@_);
258
259	return q^(?i)(\.jpe?g|\.gif|\.png|\.tif?f|\.te?xt|\.html?|~)$^
260	}
261
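# A small sketch of what the default block expression is intended to do,
# assuming it is an ordinary alternation of file extensions plus a trailing
# "~" for editor backup files. The file names below are hypothetical.

```perl
use strict;
use warnings;

# The alternation that get_default_block_exp() is intended to express.
my $block_exp = q^(?i)(\.jpe?g|\.gif|\.png|\.tif?f|\.te?xt|\.html?|~)$^;

# Hypothetical file names, chosen only for illustration.
my @names = ("book.item", "p1.GIF", "p1.txt", "scan.tiff", "notes.html", "book.item~");
my @blocked = grep { /$block_exp/ } @names;
my @passed  = grep { !/$block_exp/ } @names;

print "blocked: @blocked\n";
print "passed:  @passed\n";
```

# Only the .item file gets through; the page images and text files are
# blocked so that no other plugin picks them up as separate documents.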
262	# Create the thumbnail and screenview images, and discover the Image's
263	# size, width, and height using the convert utility.
264	sub process_image {
265	my $self = shift (@_);
266	my $filename = shift (@_); # filename with full path
267	my $srcfile = shift (@_); # filename without path
268	my $doc_obj = shift (@_);
269	my $section = shift (@_); #the current section
270	my $rotate = shift (@_); # whether to rotate the image or not
271	$rotate = 0 unless defined $rotate;
272
273	# check that the image file exists!!
274	if (!-f $filename) {
275	print "PagedImgPlug: ERROR: File $filename does not exist, skipping\n";
276	return 0;
277	}
278
279	my $top=0;
280	if ($section eq $doc_obj->get_top_section()) {
281	$top=1;
282	}
283	my $verbosity = $self->{'verbosity'};
284	my $outhandle = $self->{'outhandle'};
285
286	# check the filename is okay
287	return 0 if ($srcfile eq "" || $filename eq "");
288
289	my $minimumsize = $self->{'minimumsize'};
290	if (defined $minimumsize && (-s $filename < $minimumsize)) {
291	print $outhandle "PagedImgPlug: \"$filename\" too small, skipping\n"
292	if ($verbosity > 1);
	return 0; # actually skip the image, as the message says
293	}
294
295	# Convert the image to a new type (if required), and rotate if required.
296	my $converttotype = $self->{'converttotype'};
297	my $originalfilename = ""; # only set if we do a conversion
298	my $type = "unknown";
299	my $converted = 0;
300	my $rotated=0;
301
302	if ($converttotype ne "" && $filename !~ /$converttotype$/) {
303	$converted=1;
304	$originalfilename = $filename;
305	my $filehead = &util::get_tmp_filename();
306	$filename = $filehead . ".$converttotype";
307	my $n = 1;
308	while (-e $filename) {
309	$filename = "$filehead$n\.$converttotype";
310	$n++;
311	}
312	$self->{'tmp_filename1'} = $filename;
313
314	my $rotate_option = "";
315	if ($rotate eq "r") {
316	$rotate_option = "-rotate 180 ";
317	}
318
319	my $command = "convert -verbose \"$originalfilename\" $rotate_option \"$filename\"";
320	print $outhandle "CONVERT: $command\n" if ($verbosity > 2);
321	my $result = '';
322	$result = `$command`;
323	print $outhandle "CONVERT RESULT = $result\n" if ($verbosity > 2);
324
325	$type = $converttotype;
326	} elsif ($rotate eq "r") {
327	$rotated=1;
328	$originalfilename = $filename;
329	$filename = &util::get_tmp_filename();
330
331	my $command = "convert \"$originalfilename\" -rotate 180 \"$filename\"";
332	print $outhandle "ROTATE: $command\n" if ($verbosity > 2);
333	my $result = '';
334	$result = `$command`;
335	print $outhandle "ROTATE RESULT = $result\n" if ($verbosity > 2);
336
337	}
338
339
340	# Add the image metadata
341	my $file; # the new file name
342	my $id = $srcfile;
343	$id =~ s/\.([^\.]*)$//; # the new file name without an extension
344	if ($converted) {
345	# we have converted the image
346	# add on the new extension
347	$file .= "$id.$converttotype";
348	} else {
349	$file = $srcfile;
350	}
351
352	my $url =$file; # the new file name prepared for a url
353	my $srcurl = $srcfile;
354	##$url =~ s/ /%20/g;
355	##$srcurl =~ s/ /%20/g;
356
357	$doc_obj->add_metadata ($section, "Image", $url);
358
359	# Also want to set filename as 'Source' metadata to be
360	# consistent with other plugins
361	$doc_obj->add_metadata ($section, "Source", $srcurl);
362
363	my ($image_type, $image_width, $image_height, $image_size)
364	= &identify($filename, $outhandle, $verbosity);
365
366	$doc_obj->add_metadata ($section, "ImageType", $image_type);
367	$doc_obj->add_metadata ($section, "ImageWidth", $image_width);
368	$doc_obj->add_metadata ($section, "ImageHeight", $image_height);
369	$doc_obj->add_metadata ($section, "ImageSize", $image_size);
370	$doc_obj->add_metadata ($section, "FileFormat", "PagedImg");
371	# add NoText metadata which can be used to suppress the dummy text
372	$doc_obj->add_metadata ($section, "NoText", "1");
373
374	if ($type eq "unknown" && $image_type) {
375	$type = $image_type;
376	}
377
378	if ($top) {
379	$doc_obj->add_metadata ($section, "srclink",
380	"<a href=\"_httpprefix_/collect/[collection]/index/assoc/[assocfilepath]/[Image]\">");
381	$doc_obj->add_metadata ($section, "srcicon", "<img src=\"_httpprefix_/collect/[collection]/index/assoc/[assocfilepath]/[Image]\">");
382
383	} else {
384	$doc_obj->add_metadata ($section, "srclink",
385	"<a href=\"_httpprefix_/collect/[collection]/index/assoc/[parent(Top):assocfilepath]/[Image]\">");
386	$doc_obj->add_metadata ($section, "srcicon", "<img src=\"_httpprefix_/collect/[collection]/index/assoc/[parent(Top):assocfilepath]/[Image]\">");
387
388	}
389	$doc_obj->add_metadata ($section, "/srclink", "</a>");
390
391
392	# Add the image as an associated file
393	$doc_obj->associate_file($filename,$file,"image/$type",$section);
394	print $outhandle "associating file $filename as name $file\n" if ($verbosity > 2);
395
396	if ($self->{'thumbnail'}) {
397	# Make the thumbnail image
398	my $thumbnailsize = $self->{'thumbnailsize'} || 100;
399	my $thumbnailtype = $self->{'thumbnailtype'} || 'gif';
400
401	my $filehead = &util::get_tmp_filename();
402	my $thumbnailfile = $filehead . ".$thumbnailtype";
403	my $n=1;
404	while (-e $thumbnailfile) {
405	$thumbnailfile = $filehead . $n . ".$thumbnailtype";
406	$n++;
407	}
408
409	$self->{'tmp_filename2'} = $thumbnailfile;
410
411	# Generate the thumbnail with convert
412	my $command = "convert -verbose -geometry $thumbnailsize"
413	. "x$thumbnailsize \"$filename\" \"$thumbnailfile\"";
414	print $outhandle "THUMBNAIL: $command\n" if ($verbosity > 2);
415	my $result = '';
416	$result = `$command 2>&1` ;
417	print $outhandle "THUMB RESULT: $result\n" if ($verbosity > 2);
418
419	# Add the thumbnail as an associated file ...
420	if (-e "$thumbnailfile") {
421	$doc_obj->associate_file("$thumbnailfile", $id."thumb.$thumbnailtype", "image/$thumbnailtype",$section);
422	$doc_obj->add_metadata ($section, "ThumbType", $thumbnailtype);
423	$doc_obj->add_metadata ($section, "Thumb", $id."thumb.$thumbnailtype");
424	if ($top) {
425	$doc_obj->add_metadata ($section, "thumbicon", "<img src=\"_httpprefix_/collect/[collection]/index/assoc/[assocfilepath]/[Thumb]\" width=[ThumbWidth] height=[ThumbHeight]>");
426	} else {
427	$doc_obj->add_metadata ($section, "thumbicon", "<img src=\"_httpprefix_/collect/[collection]/index/assoc/[parent(Top):assocfilepath]/[Thumb]\" width=[ThumbWidth] height=[ThumbHeight]>");
428	}
429	}
430
431	# Extract Thumbnail metadata from convert output
432	if ($result =~ m/[0-9]+x[0-9]+=>([0-9]+)x([0-9]+)/) {
433	$doc_obj->add_metadata ($section, "ThumbWidth", $1);
434	$doc_obj->add_metadata ($section, "ThumbHeight", $2);
435	}
436	}
437	# Make a screen-sized version of the picture if requested
438	if ($self->{'screenview'}) {
439
440	# To do: if the actual image is smaller than the screenview size,
441	# we should use the original !
442
443	my $screenviewsize = $self->{'screenviewsize'} || 500;
444	my $screenviewtype = $self->{'screenviewtype'} || 'jpeg';
445	my $filehead = &util::get_tmp_filename();
446	my $screenviewfilename = $filehead . ".$screenviewtype";
447	my $n=1;
448	while (-e $screenviewfilename) {
449	$screenviewfilename = "$filehead$n\.$screenviewtype";
450	$n++;
451	}
452	$self->{'tmp_filename3'} = $screenviewfilename;
453
454	# make the screenview image
455	my $command = "convert -verbose -geometry $screenviewsize"
456	. "x$screenviewsize \"$filename\" \"$screenviewfilename\"";
457	print $outhandle "SCREENVIEW: $command\n" if ($verbosity > 2);
458	my $result = "";
459	$result = `$command 2>&1` ;
460	print $outhandle "SCREENVIEW RESULT: $result\n" if ($verbosity > 3);
461
462	# get screenview dimensions, size and type
463	if ($result =~ m/[0-9]+x[0-9]+=>([0-9]+)x([0-9]+)/) {
464	$doc_obj->add_metadata ($section, "ScreenWidth", $1);
465	$doc_obj->add_metadata ($section, "ScreenHeight", $2);
466	}elsif ($result =~ m/([0-9]+)x([0-9]+)/) {
467	#if the image hasn't changed size, the previous regex doesn't match
468	$doc_obj->add_metadata ($section, "ScreenWidth", $1);
469	$doc_obj->add_metadata ($section, "ScreenHeight", $2);
470	}
471
472	#add the screenview as an associated file ...
473	if (-e "$screenviewfilename") {
474	$doc_obj->associate_file("$screenviewfilename", $id."sv.$screenviewtype",
475	"image/$screenviewtype",$section);
476	print $outhandle "associating screen file $screenviewfilename as name ${id}sv.$screenviewtype\n" if ($verbosity > 2);
477
478	$doc_obj->add_metadata ($section, "ScreenType", $screenviewtype);
479	$doc_obj->add_metadata ($section, "Screen", $id."sv.$screenviewtype");
480
481	if ($top) {
482	$doc_obj->add_metadata ($section, "screenicon", "<img src=\"_httpprefix_/collect/[collection]/index/assoc/[assocfilepath]/[Screen]\" width=[ScreenWidth] height=[ScreenHeight]>");
483	} else {
484	$doc_obj->add_metadata ($section, "screenicon", "<img src=\"_httpprefix_/collect/[collection]/index/assoc/[parent(Top):assocfilepath]/[Screen]\" width=[ScreenWidth] height=[ScreenHeight]>");
485
486	}
487	} else {
488	print $outhandle "PagedImgPlug: couldn't find \"$screenviewfilename\"\n";
489	}
490	}
491
492	return $type;
493
494
495	}
496
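# A sketch of how the screenview dimensions are recovered from the text that
# "convert -verbose" prints, using the same two patterns as process_image.
# scaled_dimensions is a hypothetical helper, and the sample strings are
# assumptions: the exact wording depends on the ImageMagick version.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the output width and height from the text that
# "convert -verbose" prints.
sub scaled_dimensions {
    my ($result) = @_;
    if ($result =~ m/[0-9]+x[0-9]+=>([0-9]+)x([0-9]+)/) {
        return ($1, $2);    # "WxH=>wxh": the image was resized
    } elsif ($result =~ m/([0-9]+)x([0-9]+)/) {
        return ($1, $2);    # no "=>": the size did not change
    }
    return;                 # nothing recognisable
}

# Assumed sample strings, for illustration only.
my @resized   = scaled_dimensions("p1.gif GIF 1200x1600=>375x500 8-bit");
my @unchanged = scaled_dimensions("p2.gif GIF 300x400 8-bit");
print "@resized / @unchanged\n";
```

# The thumbnail branch only tries the first pattern, so ThumbWidth and
# ThumbHeight stay unset when the image is not resized.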
497
498
499	# Discover the characteristics of an image file with the ImageMagick
500	# "identify" command.
501
502	sub identify {
503	my ($image, $outhandle, $verbosity) = @_;
504
505	# Use the ImageMagick "identify" command to get the file specs
506	my $command = "identify \"$image\" 2>&1";
507	print $outhandle "$command\n" if ($verbosity > 2);
508	my $result = '';
509	$result = `$command`;
510	print $outhandle "$result\n" if ($verbosity > 3);
511
512	# Read the type, width, and height
513	my $type = 'unknown';
514	my $width = 'unknown';
515	my $height = 'unknown';
516
517	my $image_safe = quotemeta $image;
518	if ($result =~ /^$image_safe (\w+) (\d+)x(\d+)/) {
519	$type = $1;
520	$width = $2;
521	$height = $3;
522	}
523
524	# Read the size
525	my $size = "unknown";
526	if ($result =~ m/^.* ([0-9]+)b/) {
527	$size = $1;
528	} elsif ($result =~ m/^.* ([0-9]+)kb/) {
529	$size = 1024 * $1;
530	}
531
532	print $outhandle "file: $image:\t $type, $width, $height, $size\n"
533	if ($verbosity > 3);
534
535	# Return the specs
536	return ($type, $width, $height, $size);
537	}
538
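# A sketch of the parsing done by identify(), separated from the call to
# ImageMagick so it can be tried on a plain string. parse_identify_output is
# a hypothetical helper, and the sample line is an assumption: real
# "identify" output differs between versions.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the regexes in identify(), but taking the
# command output as a string instead of running the command.
sub parse_identify_output {
    my ($image, $result) = @_;
    my ($type, $width, $height, $size) = ('unknown') x 4;

    my $image_safe = quotemeta $image;    # the file name may contain regex metacharacters
    if ($result =~ /^$image_safe (\w+) (\d+)x(\d+)/) {
        ($type, $width, $height) = ($1, $2, $3);
    }
    if ($result =~ m/^.* ([0-9]+)b/) {
        $size = $1;                       # size given in bytes
    } elsif ($result =~ m/^.* ([0-9]+)kb/) {
        $size = 1024 * $1;                # size given in kilobytes
    }
    return ($type, $width, $height, $size);
}

# Assumed sample line, for illustration only.
my @specs = parse_identify_output("p1.gif", "p1.gif GIF 640x480 PseudoClass 256c 12kb");
print join(", ", @specs), "\n";
```

# Any field that cannot be found stays "unknown", which is what ends up in
# the ImageType, ImageWidth, ImageHeight and ImageSize metadata.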
539
540	# The PagedImgPlug read() function. This function does all the right things
541	# to make general options work for a given plugin. It calls the process()
542	# function which does all the work specific to a plugin (like the old
543	# read functions used to do). Most plugins should define their own
544	# process() function and let this read() function keep control.
545	#
546	# PagedImgPlug overrides read() because there is no need to read the actual
547	# text of the file in, because the contents of the file is not text...
548	#
549	# Return number of files processed, undef if can't process
550	# Note that $base_dir might be "" and that $file might
551	# include directories
552
553	sub read_into_doc_obj {
554	my $self = shift (@_);
555	my ($pluginfo, $base_dir, $file, $metadata, $processor, $maxdocs, $total_count, $gli) = @_;
556	my $outhandle = $self->{'outhandle'};
557
558	#check process and block exps, smart block, etc
559	my ($block_status,$filename) = $self->read_block(@_);
560	return $block_status if ((!defined $block_status) || ($block_status==0));
561
562	print $outhandle "PagedImgPlug processing \"$filename\"\n"
563	if $self->{'verbosity'} > 1;
564	print STDERR "<Processing n='$file' p='PagedImgPlug'>\n" if ($gli);
565
566	# here we need to decide if we have an old text .item file, or a new xml
567	# .item file - for now the test is if the first non-empty line is
568	# <PagedDocument> then it's xml
569	my $xml_version = 0;
570	open (ITEMFILE, $filename) || die "couldn't open $filename\n";
571
572	my $backup_filename = "backup.item";
573	open (BACKUP,">$backup_filename")|| die "couldn't write to $backup_filename\n";
574	my $line = "";
575	my $num = 0;
576	$line = <ITEMFILE>;
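577	while (defined $line && $line !~ /\w/) { # stop at end of file instead of looping forever
578	$line = <ITEMFILE>;
579	}
580	chomp $line if defined $line;
581	if (defined $line && $line =~ /<PagedDocument/) {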
582	$xml_version = 1;
583	}
584	close ITEMFILE;
585	open (ITEMFILE, $filename) || die "couldn't open $filename\n";
586	$line = <ITEMFILE>;
587	$line =~ s/^\xEF\xBB\xBF//; # strip BOM
588	$line =~ s/\x0B+//ig;
589	$line =~ s/&/&amp;/g;
590	print BACKUP ($line);
591	# Tidy up the rest of the item file: some metadata titles contain vertical tabs (\x0B)
592	while ($line = <ITEMFILE>) {
593	$line =~ s/\x0B+//ig;
594	$line =~ s/&/&amp;/g;
595	print BACKUP ($line);
596	}
597	close ITEMFILE;
598	close BACKUP;
599	&File::Copy::copy ($backup_filename, $filename);
600	&util::rm($backup_filename);
601
602	my $doc_obj;
603	if ($xml_version) {
604	$file =~ s/^[\/\\]+//; # $file often begins with / so we'll tidy it up
605	$self->{'file'} = $file;
606	$self->{'filename'} = $filename;
607	$self->{'processor'} = $processor;
608	$self->{'metadata'} = $metadata;
609
610	eval {
611	$@ = "";
612	my $xslt = $self->{'xslt'};
613	if (defined $xslt && ($xslt ne "")) {
614	# perform xslt
615	my $transformed_xml = $self->apply_xslt($xslt,$filename);
616
617	# feed transformed file (now in memory as string) into XML parser
618	#$self->{'parser'}->parse($transformed_xml);
619	$self->parse_string($transformed_xml);
620	}
621	else {
622	#$self->{'parser'}->parsefile($filename);
623	$self->parse_file($filename);
624	}
625	};
626
627
628
629	if ($@) {
630
631	# parsefile may either croak somewhere in XML::Parser (e.g. because
632	# the document is not well formed) or die somewhere in XMLPlug or a
633	# derived plugin (e.g. because we're attempting to process a
634	# document whose DOCTYPE is not meant for this plugin). For the
635	# first case we'll print a warning and continue, for the second
636	# we'll just continue quietly
637
638	print STDERR "**** XML Parse Error is: $@\n";
639
640	my ($msg) = $@ =~ /Carp::croak\(\'(.*?)\'\)/;
641	if (defined $msg) {
642	my $outhandle = $self->{'outhandle'};
643	my $plugin_name = ref ($self);
644	print $outhandle "$plugin_name failed to process $file ($msg)\n";
645	}
646
647	# reset ourself for the next document
648	$self->{'section_level'}=0;
649	print STDERR "<ProcessingError n='$file'>\n" if ($gli);
650	return -1; # error during processing
651	}
652	$doc_obj = $self->{'doc_obj'};
653	} else {
654	my ($dir);
655	($dir, $file) = $filename =~ /^(.*?)([^\/\\]*)$/;
656
657	#process the .item file
658	$doc_obj = $self->process_item($filename, $dir, $file, $processor);
659
660	}
661
662	if ($self->{'cover_image'}) {
663	$self->associate_cover_image($doc_obj, $filename);
664	}
665
666	# include any metadata passed in from previous plugins
667	# note that this metadata is associated with the top level section
668	my $section = $doc_obj->get_top_section();
669	$self->extra_metadata ($doc_obj, $section, $metadata);
670	#my $text="";
671	# do plugin specific processing of doc_obj
672	#unless (defined ($self->process(\$text, $pluginfo, $base_dir, $file, $metadata, $doc_obj))) {
673	#print STDERR "<ProcessingError n='$file'>\n" if ($gli);
674	#return -1;
675	#}
676	# do any automatic metadata extraction
677	$self->auto_extract_metadata ($doc_obj);
678
679	$self->{'num_processed'}++;
680	return (1,$doc_obj);
681	}
682
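# A sketch of the per-line clean-up that read_into_doc_obj applies to an
# item file before parsing it, written as a pure function. tidy_item_line is
# a hypothetical helper, and it assumes the ampersand substitution is meant
# to produce the XML entity "&amp;".

```perl
use strict;
use warnings;

# Hypothetical helper: clean one line of an item file.
sub tidy_item_line {
    my ($line, $is_first) = @_;
    $line =~ s/^\xEF\xBB\xBF// if $is_first;   # strip a UTF-8 byte order mark
    $line =~ s/\x0B+//g;                        # drop vertical tabs
    $line =~ s/&/&amp;/g;                       # escape ampersands for the XML parser
    return $line;
}

print tidy_item_line("\xEF\xBB\xBF<Title>Snails \x0B& slugs\n", 1);
```

# The substitution is not idempotent: a line that already contains "&amp;"
# comes out as "&amp;amp;", so a file that is tidied in place on every
# import gains another level of escaping each time.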
683	sub read
684	{
685	my $self = shift (@_);
686	my ($pluginfo, $base_dir, $file, $metadata, $processor, $maxdocs, $total_count, $gli) = @_; my ($process_status,$doc_obj) = $self->read_into_doc_obj(@_);
687
688	if ((defined $process_status) && ($process_status == 1)) {
689	# process the document
690	$processor->process($doc_obj);
691
692	#if(defined($self->{'places_filename'})){
693	# &util::rm($self->{'places_filename'});
694	# $self->{'places_filename'} = undef;
695	#}
696	#$self->{'num_processed'} ++;
697	undef $doc_obj;
698	}
699
700	# clean up temporary files - we do this here instead of in
701	# process_image because associated files aren't actually copied
702	# until after process has been run.
703	if (defined $self->{'tmp_filename1'} &&
704	-e $self->{'tmp_filename1'}) {
705	&util::rm($self->{'tmp_filename1'})
706	}
707	if (defined $self->{'tmp_filename2'} &&
708	-e $self->{'tmp_filename2'}) {
709	&util::rm($self->{'tmp_filename2'})
710	}
711	if (defined $self->{'tmp_filename3'} &&
712	-e $self->{'tmp_filename3'}) {
713	&util::rm($self->{'tmp_filename3'})
714	}
715	# if process_status == 1, then the file has been processed.
716	return $process_status;
717	}
718
719	sub xml_start_tag {
720	my $self = shift(@_);
721	my ($expat, $element) = @_;
722	$self->{'element'} = $element;
723
724	my $doc_obj = $self->{'doc_obj'};
725	if ($element eq "PagedDocument") {
726	$self->{'current_section'} = $doc_obj->get_top_section();
727	} elsif ($element eq "PageGroup" || $element eq "Page") {
728	# create a new section as a child
729	$self->{'current_section'} = $doc_obj->insert_section($doc_obj->get_end_child($self->{'current_section'}));
730	$self->{'num_pages'}++;
731	# assign pagenum as what??
732	my $pagenum = $_{'pagenum'}; #TODO!!
733	if (defined $pagenum) {
734	$doc_obj->set_utf8_metadata_element($self->{'current_section'}, 'PageNum', $pagenum);
735	}
736	my ($imgfile) = $_{'imgfile'};
737	if (defined $imgfile) {
738	$self->process_image($self->{'base_dir'}.$imgfile, $imgfile, $doc_obj, $self->{'current_section'});
739	}
740	my ($txtfile) = $_{'txtfile'};
741	if (defined($txtfile)&& $txtfile ne "") {
742	$self->process_text ($self->{'base_dir'}.$txtfile, $txtfile, $doc_obj, $self->{'current_section'});
743	$doc_obj->set_metadata_element($self->{'current_section'},"NoText","0");
744	} else {
745	# otherwise add in some dummy text
746	#create an empty text string so we don't break downstream plugins
747	my $text = &gsprintf::lookup_string("{BasPlug.dummy_text}",1);
748	$doc_obj->add_utf8_text($self->{'current_section'}, $text);
749	$doc_obj->add_metadata($self->{'current_section'},"NoText","1");
750	}
751	} elsif ($element eq "Metadata") {
752	$self->{'metadata_name'} = $_{'name'};
753	}
754	}
755
756	sub xml_end_tag {
757	my $self = shift(@_);
758	my ($expat, $element) = @_;
759
760	my $doc_obj = $self->{'doc_obj'};
761	if ($element eq "Page" || $element eq "PageGroup") {
762	# if Title hasn't been assigned, set PageNum as Title
763	if (!defined $doc_obj->get_metadata_element ($self->{'current_section'}, "Title") && defined $doc_obj->get_metadata_element ($self->{'current_section'}, "PageNum" )) {
764	$doc_obj->add_utf8_metadata ($self->{'current_section'}, "Title", $doc_obj->get_metadata_element ($self->{'current_section'}, "PageNum" ));
765	}
766	# move the current section back to the parent
767	$self->{'current_section'} = $doc_obj->get_parent_section($self->{'current_section'});
768	} elsif ($element eq "Metadata") {
769
770	$doc_obj->add_utf8_metadata ($self->{'current_section'}, $self->{'metadata_name'}, $self->{'metadata_value'});
771	$self->{'metadata_name'} = "";
772	$self->{'metadata_value'} = "";
773
774	}
775	# otherwise we ignore the end tag
776	}
777
778
779	sub xml_text {
780	my $self = shift(@_);
781	my ($expat) = @_;
782
783	if ($self->{'element'} eq "Metadata" && $self->{'metadata_name'}) {
784	$self->{'metadata_value'} .= $_;
785	}
786	}
787
788	sub xml_doctype {
789	}
790
791	sub open_document {
792	my $self = shift(@_);
793
794	# create a new document
795	$self->{'doc_obj'} = new doc ($self->{'filename'}, "indexed_doc");
796	my $doc_obj = $self->{'doc_obj'};
797	$doc_obj->set_OIDtype ($self->{'processor'}->{'OIDtype'});
798	my ($dir, $file) = $self->{'filename'} =~ /^(.*?)([^\/\\]*)$/;
799	$self->{'base_dir'} = $dir;
800	$self->{'num_pages'} = 0;
801	my $topsection = $doc_obj->get_top_section();
802	if ($self->{'documenttype'} eq 'paged') {
803	# set the gsdlthistype metadata to Paged - this ensures this document will
804	# be treated as a Paged doc, even if Titles are not numeric
805
806	$doc_obj->set_utf8_metadata_element ($topsection, "gsdlthistype", "Paged");
807	} else {
808	$doc_obj->set_utf8_metadata_element ($topsection, "gsdlthistype", "Hierarchy");
809	}
810
811	$doc_obj->add_metadata ($topsection, "Source", $file);
812	if ($self->{'headerpage'}) {
813	$doc_obj->add_text($topsection, &gsprintf::lookup_string("{BasPlug.dummy_text}"));
814	}
815
816	}
817
818	sub close_document {
819	my $self = shift(@_);
820	my $doc_obj = $self->{'doc_obj'};
821
822	$doc_obj->add_utf8_metadata($doc_obj->get_top_section(), "Plugin", "$self->{'plugin_type'}");
823	$doc_obj->add_metadata($doc_obj->get_top_section(), "FileFormat", "PagedImg");
824
825	# add numpages metadata
826	$doc_obj->set_utf8_metadata_element ($doc_obj->get_top_section(), 'NumPages', $self->{'num_pages'});
827
828	# add an OID
829	$doc_obj->set_OID();
830
831	}
832
833	sub process_item {
834	my $self = shift (@_);
835	my ($filename, $dir, $file, $processor) = @_;
836
837	my $doc_obj = new doc ($filename, "indexed_doc");
838	$doc_obj->set_OIDtype ($processor->{'OIDtype'}, $processor->{'OIDmetadata'});
839	my $topsection = $doc_obj->get_top_section();
840	$doc_obj->add_utf8_metadata($topsection, "Plugin", "$self->{'plugin_type'}");
841	$doc_obj->add_metadata($topsection, "FileFormat", "PagedImg");
842
843	if ($self->{'documenttype'} eq 'paged') {
844	# set the gsdlthistype metadata to Paged - this ensures this document will
845	# be treated as a Paged doc, even if Titles are not numeric
846	$doc_obj->set_utf8_metadata_element ($topsection, "gsdlthistype", "Paged");
847	} else {
848	$doc_obj->set_utf8_metadata_element ($topsection, "gsdlthistype", "Hierarchy");
849	}
850
851	$doc_obj->add_metadata ($topsection, "Source", $file);
852
853	open (ITEMFILE, $filename) || die "couldn't open $filename\n";
854	my $line = "";
855	my $num = 0;
856	while (defined ($line = <ITEMFILE>)) {
857	next unless $line =~ /\w/;
858	chomp $line;
859	next if $line =~ /^#/; # ignore comment lines
860	if ($line =~ /^<([^>]*)>\s*(.*?)\s*$/) {
861	$doc_obj->set_utf8_metadata_element ($topsection, $1, $2);
862	#$meta->{$1} = $2;
863	} else {
864	$num++;
865	# line should be like page:imagefilename:textfilename:r - the r is optional -> means rotate the image 180 deg
866	$line =~ s/^\s+//; #remove space at the front
867	$line =~ s/\s+$//; #remove space at the end
868	my ($pagenum, $imgname, $txtname, $rotate) = split /:/, $line;
869
870	# create a new section for each image file
871	my $cursection = $doc_obj->insert_section($doc_obj->get_end_child($topsection));
872	# the page number becomes the Title
873	$doc_obj->set_utf8_metadata_element($cursection, 'Title', $pagenum);
874
875	# process the image for this page if there is one
876	if (defined $imgname && $imgname ne "") {
877	my $result1 = $self->process_image($dir.$imgname, $imgname, $doc_obj, $cursection, $rotate);
878
879	if (!defined $result1)
880	{
881	print "PagedImgPlug: couldn't process image \"$dir$imgname\" for item \"$filename\"\n";
882	}
883	}
884	# process the text file if one is there
885	if (defined $txtname && $txtname ne "") {
886	my $result2 = $self->process_text ($dir.$txtname, $txtname, $doc_obj, $cursection);
887
888	if (!defined $result2) {
889	print "PagedImgPlug: couldn't process text file \"$dir$txtname\" for item \"$filename\"\n";
890	}
891	else{
892	$doc_obj->set_metadata_element($cursection, "NoText", "0");
893	}
894	} else {
895	# otherwise add in some dummy text
896	$doc_obj->add_text($cursection, &gsprintf::lookup_string("{BasPlug.dummy_text}"));
897	# add NoText metadata which can be used to suppress the dummy text
898	}
899	}
900	}
901
902	close ITEMFILE;
903
904	# if we want a header page, we need to add some text into the top section, otherwise this section will become invisible
905	if ($self->{'headerpage'}) {
906	$doc_obj->add_text($topsection, &gsprintf::lookup_string("{BasPlug.dummy_text}"));
907	}
908	$file =~ s/\.item//i;
909	$doc_obj->set_OID ();
910	# add numpages metadata
911	$doc_obj->set_utf8_metadata_element ($topsection, 'NumPages', "$num");
912	return $doc_obj;
913	}
914
915	sub process_text {
916	my $self = shift (@_);
917	my ($fullpath, $file, $doc_obj, $cursection) = @_;
918
919	# check that the text file exists!!
920	if (!-f $fullpath) {
921	print "PagedImgPlug: ERROR: File $fullpath does not exist, skipping\n";
922	return; # undef, so the caller's "!defined" check detects the failure
923	}
924
925	# Do encoding stuff
926	my ($language, $encoding) = $self->textcat_get_language_encoding ($fullpath);
927
928	my $text="";
929	&BasPlug::read_file($self, $fullpath, $encoding, $language, \$text);
930	if (!length ($text)) {
931	# It's a bit unusual but not out of the question to have no text, so just give a warning
932	print "PagedImgPlug: WARNING: $fullpath contains no text\n";
933	}
934
935	# we need to escape the escape character, or else mg will convert into
936	# eg literal newlines, instead of leaving the text as '\n'
937	$text =~ s/\\/\\\\/g; # macro language
938	$text =~ s/_/\\_/g; # macro language
939
940
941	if ($text =~ m/<html.*?>\s*<head.*?>.*<\/head>\s*<body.*?>(.*)<\/body>\s*<\/html>\s*$/s) {
942	# looks like HTML input
943	# no need to escape < and > or put in <pre> tags
944
945	$text = $1;
946
947	# add the body text to the document object as-is
948	$doc_obj->add_utf8_text($cursection, "$text");
949	}
950	else {
951	$text =~ s/</&lt;/g;
952	$text =~ s/>/&gt;/g;
953
954	# insert preformat tags and add text to document object
955	$doc_obj->add_utf8_text($cursection, "<pre>\n$text\n</pre>");
956	}
957
958
959	return 1;
960	}
961
962	# do plugin specific processing of doc_obj
963	sub process {
964	my $self = shift (@_);
965	my ($textref, $pluginfo, $base_dir, $file, $metadata, $doc_obj) = @_;
966	my $outhandle = $self->{'outhandle'};
967
968	return 1;
969	}
970
971	1;
