source: main/trunk/greenstone2/perllib/plugins/PagedImagePlugin.pm@ 28836

Last change on this file since 28836 was 28355, checked in by ak19, 11 years ago
1. gsConvert.pl now calls the new pptextract.vbs VBScript (which creates .item files and ppt slide.txt files in UTF-8) instead of the older VB pptextract.exe executable, which created .item and slide.txt files in the Windows default UTF-16 LE.
2. PagedImagePlugin.pm::tidy_item_file now reads the .item files in UTF-8 mode, so that its strings are Unicode-aware. Substitutions now operate on Unicode code points instead of byte sequences.
Property svn:executable set to `*`
Property svn:keywords set to `Author Date Id Revision`
File size: 29.1 KB

1	###########################################################################
2	#
3	# PagedImagePlugin.pm -- plugin for sets of images and OCR text that
4	# make up a document
5	# A component of the Greenstone digital library software
6	# from the New Zealand Digital Library Project at the
7	# University of Waikato, New Zealand.
8	#
9	# Copyright (C) 1999 New Zealand Digital Library Project
10	#
11	# This program is free software; you can redistribute it and/or modify
12	# it under the terms of the GNU General Public License as published by
13	# the Free Software Foundation; either version 2 of the License, or
14	# (at your option) any later version.
15	#
16	# This program is distributed in the hope that it will be useful,
17	# but WITHOUT ANY WARRANTY; without even the implied warranty of
18	# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
19	# GNU General Public License for more details.
20	#
21	# You should have received a copy of the GNU General Public License
22	# along with this program; if not, write to the Free Software
23	# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
24	#
25	###########################################################################
26
27	# PagedImagePlugin
28	# processes sequences of images, with optional OCR text
29	#
30	# This plugin takes *.item files, which contain metadata and lists of image
31	# files, and produces a document containing sections, one for each page.
32	# The files should be named something.item, then you can have more than one
33	# book in a directory. You will need to create these files, one for each
34	# document/book.
35	#
36	#There are two formats for the item files: a plain text format, and an xml
37	#format. You can use either format, and can have both formats in the same
38	#collection if you like. If you use the plain format, you must not start the
39	#file off with <PagedDocument>
40
41	#### PLAIN FORMAT
42	# The format of the xxx.item file is as follows:
43	# The first lines contain any metadata for the whole document
44	# <metadata-name>metadata-value
45	# eg.
46	# <Title>Snail farming
47	# <Date>19230102
48	# Then comes a list of pages, one page per line, each line has the format
49	#
50	# pagenum:imagefile:textfile:r
51	#
52	# page num and imagefile are required. pagenum is used for the Title
53	# of the section, and in the display is shown as page <pagenum>.
54	# imagefile is the image for the page. textfile is an optional text
55	# file containing the OCR (or any) text for the page - this gets added
56	# as the text for the section. r is optional, and signals that the image
57	# should be rotated 180deg. Eg use this if the image has been made upside down.
58	# So an example item file looks like:
59	# <Title>Snail farming
60	# <Date>19960403
61	# 1:p1.gif:p1.txt:
62	# 2:p2.gif::
63	# 3:p3.gif:p3.txt:
64	# 3b:p3b.gif:p3b.txt:r
65	# The second page has no text, the fourth page is a back page, and
66	# should be rotated.
67	#
68
69	#### XML FORMAT
70	# The xml format looks like the following
71	#<PagedDocument>
72	#<Metadata name="Title">The Title of the entire document</Metadata>
73	#<Page pagenum="1" imgfile="xxx.jpg" txtfile="yyy.txt">
74	#<Metadata name="Title">The Title of this page</Metadata>
75	#</Page>
76	#... more pages
77	#</PagedDocument>
78	#PagedDocument contains a list of Pages, Metadata and PageGroups. Any metadata
79	#that is not inside another tag will belong to the document.
80	#Each Page has a pagenum (not used at the moment), an imgfile and/or a txtfile.
81	#These are both optional - if neither is used, the section will have no content.
82	#Pages can also have metadata associated with them.
83	#PageGroups can be introduced at any point - they can contain Metadata and Pages and other PageGroups. They are used to introduce hierarchical structure into the document.
84	#For example
85	#<PagedDocument>
86	#<PageGroup>
87	#<Page>
88	#<Page>
89	#</PageGroup>
90	#<Page>
91	#</PagedDocument>
92	#would generate a structure like
93	#X
94	#--X
95	#  --X
96	#  --X
97	#--X
98	#PageGroup tags can also have imgfile/textfile metadata if you like - this way they get some content themselves.
99
100	#Currently the XML structure doesn't work very well with the paged document type, unless you use numerical Titles for each section.
101	#There is still a bit of work to do on this format:
102	#* enable other text file types, eg html, pdf etc
103	#* make the document paging work properly
104	#* add pagenum as Title unless a Title is present?
105
106	# All the supplementary image and text files should be in the same folder as
107	# the .item file.
108	#
109	# To display the images instead of the document text, you can use [srcicon]
110	# in the DocumentText format statement.
111	# For example,
112	#
113	# format DocumentText "<center><table width=_pagewidth_><tr><td>[srcicon]</td></tr></table></center>"
114	#
115	# To have it create thumbnail size images, use the '-create_thumbnail' option.
116	# To have it create medium size images for display, use the '-create_screenview'
117	# option. As usual, running
118	# 'perl -S pluginfo.pl PagedImagePlugin' will list all the options.
119
120	# If you want the resulting documents to be presented with a table of
121	# contents, use '-documenttype hierarchy', otherwise they will have
122	# next and previous arrows, and a goto page X box.
123
124	# If you have used -create_screenview, you can also use [screenicon] in the format
125	# statement to display the smaller image. Here is an example that switches
126	# between the two:
127	#
128	# format DocumentText "<center><table width=_pagewidth_><tr><td>{If}{_cgiargp_ eq full,<a href='_httpdocument_&d=_cgiargd_&p=small'>Switch to small version.</a>,<a href='_httpdocument_&d=_cgiargd_&p=full'>Switch to fullsize version</a>}</td></tr><tr><td>{If}{_cgiargp_ eq full,<a href='_httpdocument_&d=_cgiargd_&p=small' title='Switch to small version'>[srcicon]</a>,<a href='_httpdocument_&d=_cgiargd_&p=full' title='Switch to fullsize version'>[screenicon]</a>}</td></tr></table></center>"
129	#
130	# Additional metadata can be added into the .item files, alternatively you can
131	# use normal metadata.xml files, with the name of the xxx.item file as the
132	# FileName (only for document level metadata).
133
134	package PagedImagePlugin;
135
136	use Encode;
137	use ReadXMLFile;
138	use ReadTextFile;
139	use ImageConverter;
140	use MetadataRead;
141
142	use strict;
143	no strict 'refs'; # allow filehandles to be variables and viceversa
144
145	sub BEGIN {
146	@PagedImagePlugin::ISA = ('MetadataRead', 'ReadXMLFile', 'ReadTextFile', 'ImageConverter');
147	}
148
149	my $gs2_type_list =
150	[ { 'name' => "auto",
151	'desc' => "{PagedImagePlugin.documenttype.auto2}" },
152	{ 'name' => "paged",
153	'desc' => "{PagedImagePlugin.documenttype.paged2}" },
154	{ 'name' => "hierarchy",
155	'desc' => "{PagedImagePlugin.documenttype.hierarchy}" }
156	];
157
158	my $gs3_type_list =
159	[ { 'name' => "auto",
160	'desc' => "{PagedImagePlugin.documenttype.auto3}" },
161	{ 'name' => "paged",
162	'desc' => "{PagedImagePlugin.documenttype.paged3}" },
163	{ 'name' => "hierarchy",
164	'desc' => "{PagedImagePlugin.documenttype.hierarchy}" },
165	{ 'name' => "pagedhierarchy",
166	'desc' => "{PagedImagePlugin.documenttype.pagedhierarchy}" }
167	];
168
169	my $arguments =
170	[ { 'name' => "process_exp",
171	'desc' => "{BasePlugin.process_exp}",
172	'type' => "string",
173	'deft' => &get_default_process_exp(),
174	'reqd' => "no" },
175	{ 'name' => "title_sub",
176	'desc' => "{HTMLPlugin.title_sub}",
177	'type' => "string",
178	'deft' => "" },
179	{ 'name' => "headerpage",
180	'desc' => "{PagedImagePlugin.headerpage}",
181	'type' => "flag",
182	'reqd' => "no" },
183	# { 'name' => "documenttype",
184	# 'desc' => "{PagedImagePlugin.documenttype}",
185	# 'type' => "enum",
186	# 'list' => $type_list,
187	# 'deft' => "auto",
188	# 'reqd' => "no" },
189	{'name' => "processing_tmp_files",
190	'desc' => "{BasePlugin.processing_tmp_files}",
191	'type' => "flag",
192	'hiddengli' => "yes"}
193	];
194
195	my $doc_type_opt = { 'name' => "documenttype",
196	'desc' => "{PagedImagePlugin.documenttype}",
197	'type' => "enum",
198	'deft' => "auto",
199	'reqd' => "no" };
200
201	my $options = { 'name' => "PagedImagePlugin",
202	'desc' => "{PagedImagePlugin.desc}",
203	'abstract' => "no",
204	'inherits' => "yes",
205	'args' => $arguments };
206
207	sub new {
208	my ($class) = shift (@_);
209	my ($pluginlist,$inputargs,$hashArgOptLists) = @_;
210	push(@$pluginlist, $class);
211
212	push(@{$hashArgOptLists->{"OptList"}},$options);
213
214	my $imc_self = new ImageConverter($pluginlist, $inputargs, $hashArgOptLists);
215
216	# we can use this plugin to check gs3 version
217	if ($imc_self->{'gs_version'} eq "3") {
218	$doc_type_opt->{'list'} = $gs3_type_list;
219	}
220	else {
221	$doc_type_opt->{'list'} = $gs2_type_list;
222	}
223	push(@$arguments,$doc_type_opt);
224	# now we add the args to the list for parsing
225	push(@{$hashArgOptLists->{"ArgList"}},@{$arguments});
226
227	my $rtf_self = new ReadTextFile($pluginlist, $inputargs, $hashArgOptLists, 1);
228	my $rxf_self = new ReadXMLFile($pluginlist, $inputargs, $hashArgOptLists);
229
230	my $self = BasePlugin::merge_inheritance($imc_self,$rtf_self,$rxf_self);
231
232	# Update $self used by XML::Parser so it finds callback functions
233	# such as start_document here and not in ReadXMLFile (which is what
234	# $self was when new XML::Parser was done)
235	#
236	# If the $self returned by this constructor is the same as the one
237	# used in ReadXMLFile (e.g. in the GreenstoneXMLPlugin) then this step isn't necessary
238	#
239	# Consider embedding this type of assignment into merge_inheritance
240	# to help catch all cases?
241
242	$rxf_self->{'parser'}->{'PluginObj'} = $self;
243
244	return bless $self, $class;
245	}
246
247
248	sub init {
249	my $self = shift (@_);
250	my ($verbosity, $outhandle, $failhandle) = @_;
251
252	$self->SUPER::init(@_);
253	$self->ImageConverter::init();
254	}
255
256	sub begin {
257	my $self = shift (@_);
258	my ($pluginfo, $base_dir, $processor, $maxdocs) = @_;
259
260	$self->SUPER::begin(@_);
261	$self->ImageConverter::begin(@_);
262	}
263
264	sub get_default_process_exp {
265	my $self = shift (@_);
266
267	return q^\.item$^;
268	}
269
270	sub get_doctype {
271	my $self = shift(@_);
272
273	return "PagedDocument";
274	}
275
276
277	# want to use BasePlugin's version of this, not ReadXMLFile's
278	sub can_process_this_file {
279	my $self = shift(@_);
280	return $self->BasePlugin::can_process_this_file(@_);
281	}
282
283	# instead of a block exp, now we scan the file and record all text and img files mentioned there for blocking.
284	sub store_block_files
285	{
286	my $self = shift (@_);
287	my ($filename_full_path, $block_hash) = @_;
288
289	my $xml_version = $self->is_xml_item_file($filename_full_path);
290
291	# do we need to do this?
292	# does BOM interfere just with XML parsing? In that case don't need it here
293	# if we do it here, we are modifying the file before we have worked out if
294	# it's new or not, so it will always be reimported.
295	#$self->tidy_item_file($filename_full_path);
296
297	my ($dir, $file) = $filename_full_path =~ /^(.*?)([^\/\\]*)$/;
298	if ($xml_version) {
299
300	# do something
301	$self->scan_xml_for_files_to_block($filename_full_path, $dir, $block_hash);
302	} else {
303
304	$self->scan_item_for_files_to_block($filename_full_path, $dir, $block_hash);
305	}
306
307	}
308
309	# we want to use BasePlugin's read, not ReadXMLFile's
310	sub read
311	{
312	my $self = shift (@_);
313
314	$self->BasePlugin::read(@_);
315	}
316
317
318
319	sub read_into_doc_obj {
320	my $self = shift (@_);
321	my ($pluginfo, $base_dir, $file, $block_hash, $metadata, $processor, $maxdocs, $total_count, $gli) = @_;
322	my $outhandle = $self->{'outhandle'};
323	my $verbosity = $self->{'verbosity'};
324
325	my ($filename_full_path, $filename_no_path) = &util::get_full_filenames($base_dir, $file);
326
327	print $outhandle "PagedImagePlugin processing \"$filename_full_path\"\n"
328	if $verbosity > 1;
329	print STDERR "<Processing n='$file' p='PagedImagePlugin'>\n" if ($gli);
330
331	$self->{'MaxImageWidth'} = 0;
332	$self->{'MaxImageHeight'} = 0;
333
334	# here we need to decide if we have an old text .item file, or a new xml
335	# .item file
336	my $xml_version = $self->is_xml_item_file($filename_full_path);
337
338	$self->tidy_item_file($filename_full_path);
339
340	my $doc_obj;
341	if ($xml_version) {
342	# careful checking needed here!! are we using local xml handlers or super ones
343	$self->ReadXMLFile::read($pluginfo, $base_dir, $file, $block_hash, $metadata, $processor, $maxdocs, $total_count, $gli);
344	$doc_obj = $self->{'doc_obj'};
345	} else {
346	my ($dir, $item_file);
347	($dir, $item_file) = $filename_full_path =~ /^(.*?)([^\/\\]*)$/;
348
349	#process the .item file
350	$doc_obj = $self->process_item($filename_full_path, $dir, $item_file, $processor, $metadata);
351
352	}
353
354	my $section = $doc_obj->get_top_section();
355
356	$doc_obj->add_utf8_metadata($section, "Plugin", "$self->{'plugin_type'}");
357	$doc_obj->add_metadata($section, "FileFormat", "PagedImage");
358
359	# include any metadata passed in from previous plugins
360	# note that this metadata is associated with the top level section
361	$self->add_associated_files($doc_obj, $filename_full_path);
362	$self->extra_metadata ($doc_obj, $section, $metadata);
363	$self->auto_extract_metadata ($doc_obj);
364	$self->plugin_specific_process($base_dir, $file, $doc_obj, $gli);
365	# if we haven't found any Title so far, assign one
366	$self->title_fallback($doc_obj,$section,$filename_no_path);
367
368	$self->add_OID($doc_obj);
369	return (1,$doc_obj);
370	}
371	# override this for an inheriting plugin to add extra metadata etc
372	sub plugin_specific_process {
373	my $self = shift(@_);
374	my ($base_dir, $file, $doc_obj, $gli) = @_;
375
376	}
377
378	# for now, the test is: if the first non-empty line is <PagedDocument>, then it's XML
379	sub is_xml_item_file {
380	my $self = shift(@_);
381	my ($filename) = @_;
382
383	my $xml_version = 0;
384	open (ITEMFILE, $filename) || die "couldn't open $filename\n";
385
386	my $line = "";
387	my $num = 0;
388
389	$line = <ITEMFILE>;
390	while (defined ($line) && ($line !~ /\w/)) {
391	$line = <ITEMFILE>;
392	}
393
394	if (defined $line) {
395	chomp $line;
396	if ($line =~ /<PagedDocument/) {
397	$xml_version = 1;
398	}
399	}
400
401	close ITEMFILE;
402	return $xml_version;
403	}
404
405	sub tidy_item_file {
406	my $self = shift(@_);
407	my ($filename) = @_;
408
409	open (ITEMFILE, "<:encoding(UTF-8)", $filename) || die "couldn't open $filename\n";
410	my $backup_filename = "backup.item";
411	open (BACKUP,">$backup_filename") || die "couldn't write to $backup_filename\n";
412	binmode(BACKUP, ":utf8");
413	my $line = "";
414	$line = <ITEMFILE>;
415	#$line =~ s/^\xEF\xBB\xBF//; # strip BOM in text file read in as a sequence of bytes (not unicode aware strings)
416	$line =~ s/^\x{FEFF}//; # strip BOM in file opened as UTF-8. Strings in the file just read in are now unicode-aware,
417	# this means the BOM is now a unicode codepoint instead of a byte sequence
418	# See http://en.wikipedia.org/wiki/Byte_order_mark and http://perldoc.perl.org/5.14.0/perlunicode.html
419	$line =~ s/\x{0B}+//ig; # removing \vt-vertical tabs using the unicode codepoint for \vt
420	$line =~ s/&/&amp;/g;
421	print BACKUP ($line);
422	#Tidy up the item file some metadata title contains \vt-vertical tab
423	while ($line = <ITEMFILE>) {
424	$line =~ s/\x{0B}+//ig; # removing \vt-vertical tabs using the unicode codepoint for \vt
425	$line =~ s/&/&amp;/g;
426	print BACKUP ($line);
427	}
428	close ITEMFILE;
429	close BACKUP;
430	&File::Copy::copy ($backup_filename, $filename);
431	&FileUtils::removeFiles($backup_filename);
432
433	}
434
435	sub rotate_image {
436	my $self = shift (@_);
437	my ($filename_full_path) = @_;
438
439	my ($this_filetype) = $filename_full_path =~ /\.([^\.]*)$/;
440	my $result = $self->convert($filename_full_path, $this_filetype, "-rotate 180", "ROTATE");
441	my ($new_filename) = ($result =~ /=>(.*\.$this_filetype)/);
442	if (-e "$new_filename") {
443	return $new_filename;
444	}
445	# something's gone wrong
446	return $filename_full_path;
447
448	}
449
450	sub process_image {
451	my $self = shift(@_);
452	my ($filename_full_path, $filename_no_path, $doc_obj, $section, $rotate) = @_;
453	# check the filenames
454	return 0 if ($filename_no_path eq "" || !-f $filename_full_path);
455
456	# remember that this image file was one of our source files, but only
457	# if we are not processing a tmp file
458	if (!$self->{'processing_tmp_files'} ) {
459	$doc_obj->associate_source_file($filename_full_path);
460	}
461	# do rotation
462	if ((defined $rotate) && ($rotate eq "r")) {
463	# we get a new temporary file which is rotated
464	$filename_full_path = $self->rotate_image($filename_full_path);
465	}
466
467	# do generate images
468	my $result = 0;
469	if ($self->{'image_conversion_available'} == 1) {
470	# do we need to convert $filename_no_path to utf8/url encoded?
471	# We are already reading in from a file, what encoding is it in???
472	my $url_encoded_full_filename
473	= &unicode::raw_filename_to_url_encoded($filename_full_path);
474	$result = $self->generate_images($filename_full_path, $url_encoded_full_filename, $doc_obj, $section);
475	}
476	#overwrite one set in ImageConverter
477	$doc_obj->set_metadata_element ($section, "FileFormat", "PagedImage");
478	return $result;
479	}
480
481
482	sub xml_start_tag {
483	my $self = shift(@_);
484	my ($expat, $element) = @_;
485	$self->{'element'} = $element;
486
487	my $doc_obj = $self->{'doc_obj'};
488	if ($element eq "PagedDocument") {
489	$self->{'current_section'} = $doc_obj->get_top_section();
490	} elsif ($element eq "PageGroup" || $element eq "Page") {
491	if ($element eq "PageGroup") {
492	$self->{'has_internal_structure'} = 1;
493	}
494	# create a new section as a child
495	$self->{'current_section'} = $doc_obj->insert_section($doc_obj->get_end_child($self->{'current_section'}));
496	$self->{'num_pages'}++;
497	# assign pagenum as what??
498	my $pagenum = $_{'pagenum'}; #TODO!!
499	if (defined $pagenum) {
500	$doc_obj->set_utf8_metadata_element($self->{'current_section'}, 'PageNum', $pagenum);
501	}
502	my ($imgfile) = $_{'imgfile'};
503	if (defined $imgfile) {
504	# *****
505	# What about support for rotate image (e.g. old ':r' notation)?
506	$self->process_image($self->{'xml_file_dir'}.$imgfile, $imgfile, $doc_obj, $self->{'current_section'});
507	}
508	my ($txtfile) = $_{'txtfile'};
509	if (defined($txtfile)&& $txtfile ne "") {
510	$self->process_text ($self->{'xml_file_dir'}.$txtfile, $txtfile, $doc_obj, $self->{'current_section'});
511	} else {
512	$self->add_dummy_text($doc_obj, $self->{'current_section'});
513	}
514	} elsif ($element eq "Metadata") {
515	$self->{'metadata_name'} = $_{'name'};
516	}
517	}
518
519	sub xml_end_tag {
520	my $self = shift(@_);
521	my ($expat, $element) = @_;
522
523	my $doc_obj = $self->{'doc_obj'};
524	if ($element eq "Page" || $element eq "PageGroup") {
525	# if Title hasn't been assigned, set PageNum as Title
526	if (!defined $doc_obj->get_metadata_element ($self->{'current_section'}, "Title") && defined $doc_obj->get_metadata_element ($self->{'current_section'}, "PageNum" )) {
527	$doc_obj->add_utf8_metadata ($self->{'current_section'}, "Title", $doc_obj->get_metadata_element ($self->{'current_section'}, "PageNum" ));
528	}
529	# move the current section back to the parent
530	$self->{'current_section'} = $doc_obj->get_parent_section($self->{'current_section'});
531	} elsif ($element eq "Metadata") {
532
533	# text read in by XML::Parser is in Perl's binary byte value
534	# form ... need to explicitly make it UTF-8
535	my $meta_name = decode("utf-8",$self->{'metadata_name'});
536	my $metadata_value = decode("utf-8",$self->{'metadata_value'});
537
538	if ($meta_name =~ /\./) {
539	$meta_name = "ex.$meta_name";
540	}
541
542	$doc_obj->add_utf8_metadata ($self->{'current_section'}, $meta_name, $metadata_value);
543	$self->{'metadata_name'} = "";
544	$self->{'metadata_value'} = "";
545
546	}
547	# otherwise we ignore the end tag
548	}
549
550
551	sub xml_text {
552	my $self = shift(@_);
553	my ($expat) = @_;
554
555	if ($self->{'element'} eq "Metadata" && $self->{'metadata_name'}) {
556	$self->{'metadata_value'} .= $_;
557	}
558	}
559
560	sub xml_doctype {
561	}
562
563	sub open_document {
564	my $self = shift(@_);
565
566	# create a new document
567	$self->{'doc_obj'} = new doc ($self->{'filename'}, "indexed_doc", $self->{'file_rename_method'});
568	# TODO is file filename_no_path??
569	$self->set_initial_doc_fields($self->{'doc_obj'}, $self->{'filename'}, $self->{'processor'}, $self->{'metadata'});
570
571	my ($dir, $file) = $self->{'filename'} =~ /^(.*?)([^\/\\]*)$/;
572	$self->{'xml_file_dir'} = $dir;
573	$self->{'num_pages'} = 0;
574	$self->{'has_internal_structure'} = 0;
575
576	}
577
578	sub close_document {
579	my $self = shift(@_);
580	my $doc_obj = $self->{'doc_obj'};
581
582	my $topsection = $doc_obj->get_top_section();
583
584	# add numpages metadata
585	$doc_obj->set_utf8_metadata_element ($topsection, 'NumPages', $self->{'num_pages'});
586
587	# set the document type
588	my $final_doc_type = "";
589	if ($self->{'documenttype'} eq "auto") {
590	if ($self->{'has_internal_structure'}) {
591	if ($self->{'gs_version'} eq "3") {
592	$final_doc_type = "pagedhierarchy";
593	}
594	else {
595	$final_doc_type = "hierarchy";
596	}
597	} else {
598	$final_doc_type = "paged";
599	}
600	} else {
601	# set to what doc type option was set to
602	$final_doc_type = $self->{'documenttype'};
603	}
604	$doc_obj->set_utf8_metadata_element ($topsection, "gsdlthistype", $final_doc_type);
605	### capitalisation????
606	# if ($self->{'documenttype'} eq 'paged') {
607	# set the gsdlthistype metadata to Paged - this ensures this document will
608	# be treated as a Paged doc, even if Titles are not numeric
609	# $doc_obj->set_utf8_metadata_element ($topsection, "gsdlthistype", "Paged");
610	# } else {
611	# $doc_obj->set_utf8_metadata_element ($topsection, "gsdlthistype", "Hierarchy");
612	# }
613
614	$doc_obj->set_utf8_metadata_element($topsection,"MaxImageWidth",$self->{'MaxImageWidth'});
615	$doc_obj->set_utf8_metadata_element($topsection,"MaxImageHeight",$self->{'MaxImageHeight'});
616	$self->{'MaxImageWidth'} = undef;
617	$self->{'MaxImageHeight'} = undef;
618
619	}
620
621
622	sub set_initial_doc_fields {
623	my $self = shift(@_);
624	my ($doc_obj, $filename_full_path, $processor, $metadata) = @_;
625
626	my $topsection = $doc_obj->get_top_section();
627
628	my $plugin_filename_encoding = $self->{'filename_encoding'};
629	my $filename_encoding = $self->deduce_filename_encoding($filename_full_path,$metadata,$plugin_filename_encoding);
630	$self->set_Source_metadata($doc_obj, $filename_full_path, $filename_encoding);
631
632	# if we want a header page, we need to add some text into the top section, otherwise this section will become invisible
633	if ($self->{'headerpage'}) {
634	$self->add_dummy_text($doc_obj, $topsection);
635	}
636	}
637
638	sub scan_xml_for_files_to_block
639	{
640	my $self = shift (@_);
641	my ($filename_full_path, $dir, $block_hash) = @_;
642
643	open (ITEMFILE, $filename_full_path) || die "couldn't open $filename_full_path to work out which files to block\n";
644	my $line = "";
645	while (defined ($line = <ITEMFILE>)) {
646	next unless $line =~ /\w/;
647
648	if ($line =~ /imgfile=\"([^\"]+)\"/) {
649	&util::block_filename($block_hash,&FileUtils::filenameConcatenate($dir,$1));
650	}
651	if ($line =~ /txtfile=\"([^\"]+)\"/) {
652	&util::block_filename($block_hash,&FileUtils::filenameConcatenate($dir,$1));
653	}
654	}
655	close ITEMFILE;
656
657	}
658
659	sub scan_item_for_files_to_block
660	{
661	my $self = shift (@_);
662	my ($filename_full_path, $dir, $block_hash) = @_;
663
664
665	open (ITEMFILE, $filename_full_path) || die "couldn't open $filename_full_path to work out which files to block\n";
666	my $line = "";
667	while (defined ($line = <ITEMFILE>)) {
668	next unless $line =~ /\w/;
669	chomp $line;
670	next if $line =~ /^#/; # ignore comment lines
671	next if ($line =~ /^<([^>]*)>\s*(.*?)\s*$/); # ignore metadata lines
672	# line should be like page:imagefilename:textfilename:r
673	$line =~ s/^\s+//; #remove space at the front
674	$line =~ s/\s+$//; #remove space at the end
675	my ($pagenum, $imgname, $txtname, $rotate) = split /:/, $line;
676
677	# find the image file if there is one
678	if (defined $imgname && $imgname ne "") {
679	&util::block_filename($block_hash, &FileUtils::filenameConcatenate( $dir,$imgname));
680	}
681	# find the text file if there is one
682	if (defined $txtname && $txtname ne "") {
683	&util::block_filename($block_hash, &FileUtils::filenameConcatenate($dir,$txtname));
684	}
685	}
686	close ITEMFILE;
687
688	}
689
690	sub process_item {
691	my $self = shift (@_);
692	my ($filename_full_path, $dir, $filename_no_path, $processor, $metadata) = @_;
693
694	my $doc_obj = new doc ($filename_full_path, "indexed_doc", $self->{'file_rename_method'});
695	$self->set_initial_doc_fields($doc_obj, $filename_full_path, $processor, $metadata);
696	my $topsection = $doc_obj->get_top_section();
697	# simple item files are always paged unless user specified
698	if ($self->{'documenttype'} eq "auto") {
699	$doc_obj->set_utf8_metadata_element ($topsection, "gsdlthistype", "paged");
700	} else {
701	$doc_obj->set_utf8_metadata_element ($topsection, "gsdlthistype", $self->{'documenttype'});
702	}
703	open (ITEMFILE, $filename_full_path) || die "couldn't open $filename_full_path\n";
704	my $line = "";
705	my $num = 0;
706	while (defined ($line = <ITEMFILE>)) {
707
708	# Since process_item is called not on an XML item file, but a text item file
709	# don't decode into UTF8 the text that was read in, since it's already UTF-8
710	#$line = decode("utf-8",$line);
711
712	next unless $line =~ /\w/;
713	chomp $line;
714	next if $line =~ /^#/; # ignore comment lines
715	if ($line =~ /^<([^>]*)>\s*(.*?)\s*$/) {
716	my $meta_name = $1;
717	my $meta_value = $2;
718	if ($meta_name =~ /\./) {
719	$meta_name = "ex.$meta_name";
720	}
721	$doc_obj->set_utf8_metadata_element ($topsection, $meta_name, $meta_value);
722	#$meta->{$1} = $2;
723	} else {
724	$num++;
725	# line should be like page:imagefilename:textfilename:r - the r is optional -> means rotate the image 180 deg
726	$line =~ s/^\s+//; #remove space at the front
727	$line =~ s/\s+$//; #remove space at the end
728	my ($pagenum, $imgname, $txtname, $rotate) = split /:/, $line;
729
730	# create a new section for each image file
731	my $cursection = $doc_obj->insert_section($doc_obj->get_end_child($topsection));
732	# the page number becomes the Title
733	$doc_obj->set_utf8_metadata_element($cursection, 'Title', $pagenum);
734
735	# process the image for this page if there is one
736	if (defined $imgname && $imgname ne "") {
737	my $result1 = $self->process_image($dir.$imgname, $imgname, $doc_obj, $cursection, $rotate);
738	if (!defined $result1)
739	{
740	print "PagedImagePlugin: couldn't process image \"$dir$imgname\" for item \"$filename_full_path\"\n";
741	}
742	}
743	# process the text file if one is there
744	if (defined $txtname && $txtname ne "") {
745	my $result2 = $self->process_text ($dir.$txtname, $txtname, $doc_obj, $cursection);
746
747	if (!$result2) { # process_text returns 0 (not undef) on failure
748	print "PagedImagePlugin: couldn't process text file \"$dir$txtname\" for item \"$filename_full_path\"\n";
749	$self->add_dummy_text($doc_obj, $cursection);
750	}
751	} else {
752	# otherwise add in some dummy text
753	$self->add_dummy_text($doc_obj, $cursection);
754	}
755	}
756	}
757
758	close ITEMFILE;
759
760	# add numpages metadata
761	$doc_obj->set_utf8_metadata_element ($topsection, 'NumPages', "$num");
762
763	$doc_obj->set_utf8_metadata_element($topsection,"MaxImageWidth",$self->{'MaxImageWidth'});
764	$doc_obj->set_utf8_metadata_element($topsection,"MaxImageHeight",$self->{'MaxImageHeight'});
765	$self->{'MaxImageWidth'} = undef;
766	$self->{'MaxImageHeight'} = undef;
767
768
769	return $doc_obj;
770	}
771
772	sub process_text {
773	my $self = shift (@_);
774	my ($filename_full_path, $file, $doc_obj, $cursection) = @_;
775
776	# check that the text file exists!!
777	if (!-f $filename_full_path) {
778	print "PagedImagePlugin: ERROR: File $filename_full_path does not exist, skipping\n";
779	return 0;
780	}
781
782	# remember that this text file was one of our source files, but only
783	# if we are not processing a tmp file
784	if (!$self->{'processing_tmp_files'} ) {
785	$doc_obj->associate_source_file($filename_full_path);
786	}
787	# Do encoding stuff
788	my ($language, $encoding) = $self->textcat_get_language_encoding ($filename_full_path);
789
790	my $text="";
791	&ReadTextFile::read_file($self, $filename_full_path, $encoding, $language, \$text); # already decoded as utf8
792	if (!length ($text)) {
793	# It's a bit unusual but not out of the question to have no text, so just give a warning
794	print "PagedImagePlugin: WARNING: $filename_full_path contains no text\n";
795	}
796
797	# we need to escape the escape character, or else mg will convert into
798	# eg literal newlines, instead of leaving the text as '\n'
799	$text =~ s/\\/\\\\/g; # macro language
800	$text =~ s/_/\\_/g; # macro language
801
802
803	if ($text =~ m/<html.*?>\s*<head.*?>.*<\/head>\s*<body.*?>(.*)<\/body>\s*<\/html>\s*$/is) {
804	# looks like HTML input
805	# no need to escape < and > or put in <pre> tags
806
807	$text = $1;
808
809	# add text to document object
810	$doc_obj->add_utf8_text($cursection, "$text");
811	}
812	else {
813	$text =~ s/</&lt;/g;
814	$text =~ s/>/&gt;/g;
815
816	# insert preformat tags and add text to document object
817	$doc_obj->add_utf8_text($cursection, "<pre>\n$text\n</pre>");
818	}
819
820
821	return 1;
822	}
823
824
825	sub clean_up_after_doc_obj_processing {
826	my $self = shift(@_);
827
828	$self->ImageConverter::clean_up_temporary_files();
829	}
830
831	1;
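
The plain .item format described in the header comments can be exercised on its own. The following is a minimal standalone sketch, not part of the plugin: `parse_plain_item` is a hypothetical helper that mirrors the line handling in `process_item` (skip blank and comment lines, match `<name>value` metadata, split page lines on `:`), and assumes the metadata pattern is `/^<([^>]*)>\s*(.*?)\s*$/`. It does not touch Greenstone's `doc` objects, images or text files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in PagedImagePlugin.pm): parse the lines of a
# plain-format .item file into document metadata and a list of pages.
sub parse_plain_item {
    my (@lines) = @_;
    my (%metadata, @pages);
    foreach my $line (@lines) {
        next unless $line =~ /\w/;    # skip blank lines
        chomp $line;
        next if $line =~ /^#/;        # skip comment lines
        if ($line =~ /^<([^>]*)>\s*(.*?)\s*$/) {
            # <metadata-name>metadata-value applies to the whole document
            $metadata{$1} = $2;
            next;
        }
        $line =~ s/^\s+//;            # remove space at the front
        $line =~ s/\s+$//;            # remove space at the end
        # pagenum:imagefile:textfile:r (textfile and r are optional)
        my ($pagenum, $imgname, $txtname, $rotate) = split /:/, $line;
        push @pages, {
            pagenum => $pagenum,
            image   => $imgname,
            text    => (defined $txtname && $txtname ne "") ? $txtname : undef,
            rotate  => (defined $rotate && $rotate eq "r") ? 1 : 0,
        };
    }
    return (\%metadata, \@pages);
}

# The example item file from the header comments.
my ($meta, $pages) = parse_plain_item(
    "<Title>Snail farming\n",
    "<Date>19960403\n",
    "1:p1.gif:p1.txt:\n",
    "2:p2.gif::\n",
    "3:p3.gif:p3.txt:\n",
    "3b:p3b.gif:p3b.txt:r\n",
);

print "Title: $meta->{Title}\n";
foreach my $page (@$pages) {
    printf "page %s image=%s text=%s rotate=%d\n",
        $page->{pagenum}, $page->{image},
        defined $page->{text} ? $page->{text} : "(none)",
        $page->{rotate};
}
```

Because Perl's `split` drops trailing empty fields, a line such as `2:p2.gif::` yields only a page number and an image name, which is why the text and rotate fields are checked with `defined` before use.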
