source: main/trunk/greenstone2/perllib/plugins/HTMLImagePlugin.pm@ 24600

Last change on this file since 24600 was 24600, checked in by ak19, 13 years ago

Added gs-magick.pl script which will set the environment for ImageMagick (including LD_LIBRARY_PATH) before launching the requested ImageMagick command and arguments. By setting the Imagemagick environment from this script we ensure that the modified env variables don't create conflicts with libraries needed for normal linux execution. All the Greenstone files in the *binary* that made direct calls to imagemagick now go through this script. The affected files are perl files in bin/script and perllib and Gatherer.java of GLI. (wvware has files that test for imagemagick during compilation stage, which is independent of our changs which are only for users running imagemagick from a GS binary.) The final problems were related to how different perl files made use of the return values and the output of running their imagemagick command: they would query the 127 and/or and/or run the command with backtick operators to get the output printed to STDOUT. By inserting an intermediate gs-magick.pl file, needed to ensure that the exit code stored in 127 would at least be passed on correctly, as is necessary when testing the exit code against non-zero values or greater/less than zero (instead of comparing them with equals/not equal to 0). To get the correct exit code as emitted by imagemagick, calling code needs to shift bits in 127 and converting it to a signed value.

  • Property svn:keywords set to Author Date Id Revision
File size: 37.3 KB
Line 
1###########################################################################
2#
3# HTMLImagePlugin.pm -- Context-based image indexing plugin for HTML documents
4#
5# A component of the Greenstone digital library software
6# from the New Zealand Digital Library Project at the
7# University of Waikato, New Zealand.
8#
9# Copyright (C) 2001 New Zealand Digital Library Project
10#
11# This program is free software; you can redistribute it and/or modify
12# it under the terms of the GNU General Public License as published by
13# the Free Software Foundation; either version 2 of the License, or
14# (at your option) any later version.
15#
16# This program is distributed in the hope that it will be useful,
17# but WITHOUT ANY WARRANTY; without even the implied warranty of
18# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
19# GNU General Public License for more details.
20#
21# You should have received a copy of the GNU General Public License
22# along with this program; if not, write to the Free Software
23# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
24#
25###########################################################################
26
27# DESCRIPTION:
28#
29# Extracts images and associated text and metadata from
30# web pages as individual documents for indexing. Thumbnails
31# are created from each image for browsing purposes.
32#
33# Options are available for configuring the aggressiveness of the
34# associated text extraction mechanisms. A higher level of
35# aggressiveness will extract more text and consequently
36# may mean lower accuracy (precision); however, it may also
37# retrieve more of the relevant images from the collection (recall).
38# Lower levels of aggressiveness maybe result in slightly faster
39# collection builds at the import stage.
40#
41# HTMLImagePlugin is a subclass of HTMLPlug (i.e. it will index pages also
42# if required). It can be used in place of HTMLPlugin to index both
43# pages and their images.
44#
45# REQUIREMENTS:
46#
47# The ImageMagick image manipulation is used to create
48# thumbnails and extract some image metadata. (Available
49# from http://www.imagemagick.org/)
50#
51# Unix:
52# Many Linux distributions contain ImageMagick.
53#
54# Windows:
55# ImageMagick can be downloaded from the website above.
56# Make sure the system path includes the ImageMagick binaries
57# before using HTMLImagePlugin.
58#
59# NOTE: NT/2000/XP contain a filesystem utility 'convert.exe'
60# with the same name as the image conversion utility. The
61# ImageMagick FAQ recommends renaming the filesystem
62# utility (e.g. to 'fsconvert.exe') to avoid this clash.
63#
64# USAGE:
65#
66# An image document consists of metadata elements:
67#
68# OriginalFilename, FilePath, Filename, FileExt, FileSize,
69# Width, Height, URL, PageURL, ThumbURL, CacheURL, CachePageURL
70# ImageText, PageTitle
71#
72# Most of these are only useful in format strings (e.g. ThumbURL,
73# Filename, URL, PageURL, CachePageURL).
74#
75# ImageText, as the name suggests contains the indexable text.
76# (unless using the -document_text plugin option)
77#
78# Since image documents are made up of metadata elements
79# alone, format strings are needed to display them properly.
80# NOTE: The receptionist will only display results (e.g. thumbnails)
81# in 4 columns if the format string begins with "<td><table>".
82#
83# The example below takes the user to the image within the
84# source HTML document rather than using a format string
85# on DocumentText to display the image document itself.
86#
87# Example collect.cfg:
88#
89# ...
90#
91# indexes document:ImageText document:text
92# defaultindex document:ImageText
93#
94# collectionmeta .document:ImageText "images"
95# collectionmeta .document:text "documents"
96#
97# ...
98#
99# plugin HTMLImagePlugin -index_pages -aggressiveness 6
100#
101# ...
102#
103# format SearchVList '<td>{If}{[Title],[link][icon]&nbsp;[Title][[/link],
104# <table><tr><td align="center"><a href="[CachePageURL]">
105# <img src="[ThumbURL]"></a></td></tr><tr><td align="center">
106# <a href="[CachePageURL]"><font size="-1">[OriginalFilename]</font></a>
107# <br>[Width]x[Height]</td></tr></table>}</td>'
108#
109# ...
110#
111
112package HTMLImagePlugin;
113
114use HTMLPlugin;
115use ghtml;
116use unicode;
117use util;
118use strict; # 'subs';
119no strict 'refs'; # allow filehandles to be variables and viceversa
120
121sub BEGIN {
122 @HTMLImagePlugin::ISA = qw( HTMLPlugin );
123}
124
125my $aggressiveness_list =
126 [ { 'name' => "1",
127 'desc' => "{HTMLImagePlugin.aggressiveness.1}" },
128 { 'name' => "2",
129 'desc' => "{HTMLImagePlugin.aggressiveness.2}" },
130 { 'name' => "3",
131 'desc' => "{HTMLImagePlugin.aggressiveness.3}" },
132 { 'name' => "4",
133 'desc' => "{HTMLImagePlugin.aggressiveness.4}" },
134 { 'name' => "5",
135 'desc' => "{HTMLImagePlugin.aggressiveness.5}" },
136 { 'name' => "6",
137 'desc' => "{HTMLImagePlugin.aggressiveness.6}" },
138 { 'name' => "7",
139 'desc' => "{HTMLImagePlugin.aggressiveness.7}" },
140 { 'name' => "8",
141 'desc' => "{HTMLImagePlugin.aggressiveness.8}" },
142 { 'name' => "9",
143 'desc' => "{HTMLImagePlugin.aggressiveness.9}" } ];
144
145my $arguments =
146 [ { 'name' => "aggressiveness",
147 'desc' => "{HTMLImagePlugin.aggressiveness}",
148 'type' => "int",
149 'list' => $aggressiveness_list,
150 'deft' => "3",
151 'reqd' => "no" },
152 { 'name' => "index_pages",
153 'desc' => "{HTMLImagePlugin.index_pages}",
154 'type' => "flag",
155 'reqd' => "no" },
156 { 'name' => "no_cache_images",
157 'desc' => "{HTMLImagePlugin.no_cache_images}",
158 'type' => "flag",
159 'reqd' => "no" },
160 { 'name' => "min_size",
161 'desc' => "{HTMLImagePlugin.min_size}",
162 'type' => "int",
163 'deft' => "2000",
164 'reqd' => "no" },
165 { 'name' => "min_width",
166 'desc' => "{HTMLImagePlugin.min_width}",
167 'type' => "int",
168 'deft' => "50",
169 'reqd' => "no" },
170 { 'name' => "min_height",
171 'desc' => "{HTMLImagePlugin.min_height}",
172 'type' => "int",
173 'deft' => "50",
174 'reqd' => "no" },
175 { 'name' => "thumb_size",
176 'desc' => "{HTMLImagePlugin.thumb_size}",
177 'type' => "int",
178 'deft' => "100",
179 'reqd' => "no" },
180 { 'name' => "convert_params",
181 'desc' => "{HTMLImagePlugin.convert_params}",
182 'type' => "string",
183 'deft' => "",
184 'reqd' => "no" },
185 { 'name' => "min_near_text",
186 'desc' => "{HTMLImagePlugin.min_near_text}",
187 'type' => "int",
188 'deft' => "10",
189 'reqd' => "no" },
190 { 'name' => "max_near_text",
191 'desc' => "{HTMLImagePlugin.max_near_text}",
192 'type' => "int",
193 'deft' => "400",
194 'reqd' => "no" },
195 { 'name' => "smallpage_threshold",
196 'desc' => "{HTMLImagePlugin.smallpage_threshold}",
197 'type' => "int",
198 'deft' => "2048",
199 'reqd' => "no" },
200 { 'name' => "textrefs_threshold",
201 'desc' => "{HTMLImagePlugin.textrefs_threshold}",
202 'type' => "int",
203 'deft' => "2",
204 'reqd' => "no" },
205 { 'name' => "caption_length",
206 'desc' => "{HTMLImagePlugin.caption_length}",
207 'type' => "int",
208 'deft' => "80",
209 'reqd' => "no" },
210 { 'name' => "neartext_length",
211 'desc' => "{HTMLImagePlugin.neartext_length}",
212 'type' => "int",
213 'deft' => "300",
214 'reqd' => "no" },
215 { 'name' => "document_text",
216 'desc' => "{HTMLImagePlugin.document_text}",
217 'type' => "flag",
218 'reqd' => "no" } ];
219
220my $options = { 'name' => "HTMLImagePlugin",
221 'desc' => "{HTMLImagePlugin.desc}",
222 'abstract' => "no",
223 'inherits' => "yes",
224 'args' => $arguments };
225
226sub new {
227 my ($class) = shift (@_);
228 my ($pluginlist,$inputargs,$hashArgOptLists) = @_;
229 push(@$pluginlist, $class);
230
231 push(@{$hashArgOptLists->{"ArgList"}},@{$arguments});
232 push(@{$hashArgOptLists->{"OptList"}},$options);
233
234 my $self = new HTMLPlugin($pluginlist, $inputargs, $hashArgOptLists);
235
236 # init class variables
237 $self->{'textref'} = undef; # init by read_file fn
238 $self->{'htdoc_obj'} = undef; # init by process fn
239 $self->{'htpath'} = undef; # init by process fn
240 $self->{'hturl'} = undef; # init by process fn
241 $self->{'plaintext'} = undef; # HTML stripped version - only init if needed by raw_neartext sub
242 $self->{'smallpage'} = 0; # set by process fn
243 $self->{'images_indexed'} = undef; # num of images indexed - if 1 or 2 then we know page is small
244 $self->{'initialised'} = undef; # flag (see set_extraction_options())
245
246 return bless $self, $class;
247}
248
249# if indexing pages, let HTMLPlugin do it's stuff
250# image extraction done through read()
251sub process {
252 my $self = shift(@_);
253 my ($textref, $pluginfo, $base_dir, $file, $metadata, $doc_obj, $gli) = @_;
254 $self->{'imglist'} = ();
255 if ( $self->{'index_pages'} ) {
256 my $ok = $self->SUPER::process($textref, $pluginfo, $base_dir, $file, $metadata, $doc_obj, $gli);
257 if ( ! $ok ) { return $ok }
258 $self->{'htdoc_obj'} = $doc_obj;
259 }
260 # else use URL for referencing
261 #if ( $file =~ /(.*)[\/\\]/ ) { $self->{'htpath'} = $1; } else { $self->{'htpath'} = $file; }
262
263 $self->{'htpath'} = $base_dir if (-d $base_dir);
264 if ( $file =~ /(.*)[\/\\]/ ) { $self->{'htpath'} .= "/$1"; }
265 $self->{'htpath'} =~ s/\\/\//g; # replace \ with /
266
267 $self->{'hturl'} = "http://$file";
268 $self->{'hturl'} =~ s/\\/\//g; # for windows
269 ($self->{'filename'}) = $file =~ /.*[\/\\](.*)/;
270 ($self->{'base_path'}) = $file =~ /(.*)[\/\\]/i;
271 if ( ( -s "$base_dir/$file") <= $self->{'smallpage_threshold'} ) {
272 $self->{'smallpage'} = 1;
273 } else { $self->{'smallpage'} = 0; }
274
275 if ( defined($self->{'initialised'}) ) { return 1; }
276 else {
277 $self->{'initialised'} = $self->set_extraction_options($base_dir =~ /^(.*?)\/import/i);
278 return $self->{'initialised'};
279 }
280}
281
282# get complex configuration options from configuration files
283# -- $GSDLCOLLECTION/etc/HTMLImagePlugin.cfg (tag sets for aggr 2+)
284# -- $GSDLHOME/etc/packages/phind/stopword/en/brown.sw (stopwords for aggr 5+)
285
286# If there's no HTMLImagePlugin.cfg file we'll use the following default values
287my $defaultcfg = '
288<delimitertagset>
289 <setname>Caption</setname>
290 <taggroup>font</taggroup>
291 <taggroup>tt</taggroup>
292 <taggroup>small</taggroup>
293 <taggroup>b</taggroup>
294 <taggroup>i</taggroup>
295 <taggroup>u</taggroup>
296 <taggroup>em</taggroup>
297 <taggroup>td</taggroup>
298 <taggroup>li</taggroup>
299 <taggroup>a</taggroup>
300 <taggroup>p</taggroup>
301 <taggroup>tr</taggroup>
302 <taggroup>center</taggroup>
303 <taggroup>div</taggroup>
304 <taggroup>caption</taggroup>
305 <taggroup>br</taggroup>
306 <taggroup>ul</taggroup>
307 <taggroup>ol</taggroup>
308 <taggroup>table</taggroup>
309 <taggroup>hr</taggroup>
310</delimitertagset>
311
312<delimitertagset>
313 <setname>Neartext</setname>
314 <taggroup>tr|hr|table|h\d|img|body</taggroup>
315 <taggroup>td|tr|hr|table|h\d|img|body</taggroup>
316 <taggroup>p|br|td|tr|hr|table|h\d|img|body</taggroup>
317 <taggroup>font|p|i|b|em|img</taggroup>
318</delimitertagset>
319';
320
321sub set_extraction_options() {
322 my ($self, $collpath) = @_;
323 my ($filepath);
324
325 print {$self->{'outhandle'}} "HTMLImagePlugin: Initialising\n"
326 if $self->{'verbosity'} > 1;
327 # etc/HTMLImagePlugin.cfg (XML)
328 # tag sets for captions and neartext
329 if ( $self->{'aggressiveness'} > 1 && $self->{'aggressiveness'} != 9 ) {
330 $self->{'delims'} = [];
331 $self->{'cdelims'} = [];
332 my ($cfg, @tagsets, $tagset, $type, @delims);
333
334 $filepath = "$collpath/etc/HTMLImagePlugin.cfg";
335 if ( open CFG, "<$filepath" ) {
336 while (<CFG>) { $cfg .= $_ }
337 close CFG;
338 } else {
339 $cfg = $defaultcfg;
340 }
341
342 (@tagsets) =
343 $cfg =~ /<delimitertagset>(.*?)<\/delimitertagset>/igs;
344 foreach $tagset ( @tagsets ) {
345 ($type) = $tagset =~ /<setname>(.*?)<\/setname>/i;
346 if ( lc($type) eq "caption" ) {
347 (@{$self->{'cdelims'}}) = $tagset =~ /<taggroup>(.*?)<\/taggroup>/igs;
348 }
349 elsif ( lc($type) eq "neartext" ) {
350 (@{$self->{'delims'}}) = $tagset =~ /<taggroup>(.*?)<\/taggroup>/igs;
351 }
352 }
353
354 # output a warning if there seem to be no delimiters
355 if ( scalar(@{$self->{'cdelims'}} == 0)) {
356 print {$self->{'outhandle'}} "HTMLImagePlugin: Warning: no caption delimiters found in $filepath\n";
357 }
358 if ( scalar(@{$self->{'delims'}} == 0)) {
359 print {$self->{'outhandle'}} "HTMLImagePlugin: Warning: no neartext delimiters found in $filepath\n";
360 }
361 }
362
363 # get stop words for textual reference extraction
364 # TODO: warnings scroll off. Would be best to output them again at end of import
365 if ( $self->{'aggressiveness'} >=5 && $self->{'aggressiveness'} != 9 ) {
366 $self->{'stopwords'} = ();
367 $filepath = &util::filename_cat($ENV{'GSDLHOME'}, "etc", "packages", "phind", "stopword", "en", "brown.sw");
368 if ( open STOPWORDS, "<$filepath" ) {
369 while ( <STOPWORDS> ) {
370 chomp;
371 $self->{'stopwords'}{$_} = 1;
372 }
373 close STOPWORDS;
374 } else {
375 print {$self->{'outhandle'}} "HTMLImagePlugin: Warning: couldn't open stopwords file at $filepath ($!)\n";
376 }
377
378 }
379
380 if ( $self->{'neartext_length'} > $self->{'max_near_text'} ) {
381 $self->{'max_near_text'} = $self->{'neartext_length'} * 1.33;
382 print {$self->{'outhandle'}} "HTMLImagePlugin: Warning: adjusted max_text to $self->{'max_near_text'}\n";
383 }
384 if ( $self->{'caption_length'} > $self->{'max_near_text'} ) {
385 $self->{'max_near_text'} = $self->{'caption_length'} * 1.33;
386 print {$self->{'outhandle'}} "HTMLImagePlugin: Warning: adjusted max_text to $self->{'max_near_text'}\n";
387 }
388
389 return 1;
390}
391
392# return number of files processed, undef if can't recognise, -1 if
393# cant process
394# Note that $base_dir might be "" and that $file might
395# include directories
396sub read {
397 my ($self, $pluginfo, $base_dir, $file, $block_hash, $metadata, $processor, $maxdocs, $total_count, $gli) = (@_);
398 my ($doc_obj, $section, $filepath, $imgtag, $pos, $context, $numdocs, $tndir, $imgs);
399 # forward normal read (runs HTMLPlugin if index_pages T)
400 my $ok = $self->SUPER::read($pluginfo, $base_dir, $file, $block_hash, $metadata, $processor, $maxdocs, $total_count, $gli);
401 if ( ! $ok ) { return $ok } # what is this returning??
402
403 my $outhandle = $self->{'outhandle'};
404 my $textref = $self->{'textref'};
405 my $htdoc_obj = $self->{'htdoc_obj'};
406 $numdocs = 0;
407 $base_dir =~ /(.*)\/.*/;
408 $tndir = "$1/archives/thumbnails"; # TODO: this path shouldn't be hardcoded?
409 &util::mk_all_dir($tndir) unless -e "$tndir";
410
411 $imgs = \%{$self->{'imglist'}};
412 my $nimgs = $self->get_img_list($textref);
413 $self->{'images_indexed'} = $nimgs;
414 if ( $nimgs > 0 ) {
415 my @fplist = (sort { $imgs->{$a}{'pos'} <=> $imgs->{$b}{'pos'} } keys %{$imgs});
416 my $i = 0;
417 foreach $filepath ( @fplist ) {
418 $pos = $imgs->{$filepath}{'pos'};
419 $context = substr ($$textref, $pos - 50, $pos + 50); # grab context (quicker)
420 ($imgtag) = ($context =~ /(<(?:img|a|body)\s[^>]*$filepath[^>]*>)/is );
421 if (! defined($imgtag)) { $imgtag = $filepath }
422 print $outhandle "HTMLImagePlugin: extracting $filepath\n"
423 if ( $self->{'verbosity'} > 1 );
424 $doc_obj = new doc ("", "indexed_doc", $self->{'file_rename_method'});
425 $section = $doc_obj->get_top_section();
426 my $prevpos = ( $i == 0 ? 0 : $imgs->{$fplist[$i - 1]}{'pos'});
427 my $nextpos = ( $i >= ($nimgs -1) ? -1 : $imgs->{$fplist[$i + 1]}{'pos'} );
428
429 $self->extract_image_info($imgtag, $filepath, $textref, $doc_obj, $section, $tndir, $prevpos, $nextpos);
430 $processor->process($doc_obj);
431 $numdocs++;
432 $i++;
433 }
434 return $numdocs;
435 } else {
436 print $outhandle "HTMLImagePlugin: No images from $file indexed\n"
437 if ( $self->{'verbosity'} > 2 );
438 return 1;
439 }
440
441}
442
443# for every valid image tag
444# 1. extract related text and image metadata
445# 2. add this as document meta-data
446# 3. add assoc image(s) as files
447#
448sub extract_image_info {
449 my $self = shift (@_);
450 my ($tag, $id, $textref, $doc_obj, $section, $tndir, $prevpos, $nextpos) = (@_);
451 my ($filename, $orig_fp, $fn, $ext, $reltext, $relreltext, $crcid, $imgs,
452 $thumbfp, $pagetitle, $alttext, $filepath, $aggr);
453
454 my $imagick_cmd = "\"".&util::get_perl_exec()."\" -S gs-magick.pl";
455
456 $aggr = $self->{'aggressiveness'};
457 $imgs = \%{$self->{'imglist'}};
458 $filepath = $imgs->{$id}{'relpath'};
459 ($filename) = $filepath =~ /([^\/\\]+)$/s;
460 ($orig_fp) = "$self->{'base_path'}/$filepath";
461 $orig_fp =~ tr/+/ /;
462 $orig_fp =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg; # translate %2E to space, etc
463 $orig_fp =~ s/\\/\//g;
464 $filepath = "$self->{'htpath'}/$filepath";
465 my ($onlyfn) = $filename =~ /([^\\\/]*)$/;
466 ($fn, $ext) = $onlyfn =~ /(.*)\.(.*)/;
467 $fn = lc $fn; $ext = lc $ext;
468 ($reltext) = "<tr><td>GifComment</td><td>" . `$imagick_cmd identify $filepath -ping -format "%c"` . "</td></tr>\n"
469 if ($ext eq "gif");
470 $reltext .= "<tr><td>FilePath</td><td>$orig_fp</td></tr>\n";
471
472 if ($ENV{'GSDLOS'} =~ /^windows$/i) {
473 $crcid = "$fn.$ext." . $self->{'next_crcid'}++;
474 } else {
475 ($crcid) = `cksum $filepath` =~ /^(\d+)/;
476 }
477
478 $thumbfp = "$tndir/tn_$crcid.jpg";
479 `$imagick_cmd convert -flatten -filter Hanning $self->{'convert_params'} -geometry "$self->{'thumb_size'}x$self->{'thumb_size'}>" $filepath $thumbfp` unless -e $thumbfp;
480 if ( ! (-e $thumbfp) ) {
481 print STDERR "HTMLImagePlugin: 'convert' failed. Check ImageMagicK binaries are installed and working correctly\n"; return 0;
482 }
483
484 # shove in full text (tag stripped or unstripped) if settings require it
485 if ( $aggr == 10) {
486 $reltext = "<tr><td>AllPage</td><td>" . $$textref . "</td><tr>\n"; # level 10 (all text, verbatim)
487 } else {
488 $pagetitle = $self->get_meta_value("title", $textref);
489 ($alttext) = $tag =~ /\salt\s*=\s*(?:\"|\')(.+?)(?:\"|\')/is;
490 if ( defined($alttext) && length($alttext) > 1) {
491 $reltext .= "<tr><td>ALTtext</td><td>$alttext</td></tr>\n"; }
492 $reltext .= "<tr><td>SplitCapitalisation</td><td>" .
493 $self->split_filepath($orig_fp) . "</td></tr>\n";
494
495 # get caption/tag based near text (if appropriate)
496 if ( $aggr > 1 ) {
497 if ( $aggr >= 2 ) {
498 $reltext .=
499 $self->extract_caption_text($tag, $textref, $prevpos, $imgs->{$id}{'pos'}, $nextpos);
500 $relreltext = $reltext;
501 }
502 # repeat the filepath, alt-text, caption, etc
503 if ( $aggr == 8 ) {
504 $reltext .= $relreltext;
505 }
506 if ( $aggr >= 3 ) {
507 $reltext .=
508 $self->extract_near_text($tag, $textref, $prevpos, $imgs->{$id}{'pos'}, $nextpos);
509 }
510
511 # get page metadata (if appropriate)
512 if ( $aggr >= 6 || ( $aggr >= 2 &&
513 ( $self->{'images_indexed'} < 2 ||
514 ($self->{'smallpage'} == 1 && $self->{'images_indexed'} < 6 )))) {
515 $reltext .= $self->get_page_metadata($textref);
516 }
517 # textual references
518 if ( $aggr == 5 || $aggr >= 7) {
519 if ( length($relreltext) > ($self->{'caption_length'} * 2) ) {
520 $reltext .= $self->get_textrefs($relreltext, $textref, $prevpos, $imgs->{$id}{'pos'}, $nextpos); }
521 else {
522 $reltext .= $self->get_textrefs($reltext, $textref, $prevpos, $imgs->{$id}{'pos'}, $nextpos);
523 }
524 }
525 } # aggr > 1
526 } # aggr != 10
527
528 $doc_obj->set_OID($crcid);
529 $doc_obj->associate_file($thumbfp, "$fn.thumb.jpg", undef, $section);
530 $doc_obj->add_metadata($section, "OriginalFilename", $filename);
531 $doc_obj->add_metadata($section, "FilePath", $orig_fp);
532 $doc_obj->add_metadata($section, "Filename", $fn);
533 $doc_obj->add_metadata($section, "FileExt", $ext);
534 $doc_obj->add_metadata($section, "FileSize", $imgs->{$id}{'filesize'});
535 $doc_obj->add_metadata($section, "Width", $imgs->{$id}{'width'});
536 $doc_obj->add_metadata($section, "Height", $imgs->{$id}{'height'});
537 $doc_obj->add_metadata($section, "URL", "http://$orig_fp");
538 $doc_obj->add_metadata($section, "PageURL", $self->{'hturl'});
539 $doc_obj->add_metadata($section, "PageTitle", $pagetitle);
540 $doc_obj->add_metadata($section, "ThumbURL",
541 "_httpprefix_/collect/[collection]/index/assoc/[assocfilepath]/$fn.thumb.jpg");
542 $doc_obj->add_metadata($section, "FileFormat", "W3Img");
543
544 if ( $self->{'document_text'} ) {
545 $doc_obj->add_utf8_text($section, "<table border=1>\n$reltext</table>");
546 } else {
547 $doc_obj->add_metadata($section, "ImageText", "<table border=1>\n$reltext</table>\n");
548 }
549
550 if ( $self->{'index_pages'} ) {
551 my ($cache_url) = "_httpdoc_&d=" . $self->{'htdoc_obj'}->get_OID();
552 if ( $imgs->{$id}{'anchored'} ) {
553 my $a_name = $id;
554 $a_name =~ s/[\/\\\:\&]/_/g;
555 $cache_url .= "#gsdl_$a_name" ;
556 }
557 $doc_obj->add_utf8_metadata($section, "CachePageURL", $cache_url);
558 }
559 if ( ! $self->{'no_cache_images'} ) {
560 $onlyfn = lc $onlyfn;
561 $doc_obj->associate_file($filepath, $onlyfn, undef, $section);
562 $doc_obj->add_utf8_metadata($section, "CacheURL",
563 "_httpprefix_/collect/[collection]/index/assoc/[assocfilepath]/$onlyfn");
564 }
565 return 1;
566}
567
568sub get_page_metadata {
569 my ($self, $textref) = (@_);
570 my (@rval);
571 $rval[0] = $self->get_meta_value("title", $textref);
572 $rval[1] = $self->get_meta_value("keywords", $textref);
573 $rval[2] = $self->get_meta_value("description", $textref);
574 $rval[3] = $self->{'filename'};
575
576 return wantarray ? @rval : "<tr><td>PageMeta</td><td>@rval</td></tr>\n" ;
577}
578
579# turns LargeCatFish into Large,Cat,Fish so MG sees the separate words
580sub split_filepath {
581 my ($self, $filepath) = (@_);
582 my (@words) = $filepath =~ /([A-Z][a-z]+)/g;
583 return join(',', @words);
584}
585
586# finds and extracts sentences
587# that seem to be on the same topic
588# as other related text (correlations)
589# and textual references (e.g. in figure 3 ...)
590sub get_textrefs {
591 my ($self, $reltext, $textref, $prevpos, $pos, $nextpos) = (@_);
592 my ($maxtext, $mintext, $startpos, $context_size, $context);
593
594 my (@relwords, @refwords, %sentences, @pagemeta);
595
596 # extract larger context
597 $maxtext = $self->{'max_near_text'};
598 $startpos = $pos - ($maxtext * 4);
599 $context_size = $maxtext*10;
600 if ($startpos < $prevpos ) { $startpos = $prevpos }
601 if ($nextpos != -1 && $context_size > ( $nextpos - $startpos )) { $context_size = ($nextpos - $startpos) }
602 $context = substr ( $$textref, $startpos, $context_size );
603 $context =~ s/<.*?>//gs;
604 $context =~ s/^.*>(.*)/$1/gs;
605 $context =~ s/(.*)<.*$/$1/gs;
606
607 # get page meta-data (if not already included)
608 if ( $self->{'aggressiveness'} == 5 && ! $self->{'smallpage'} ) {
609 @pagemeta = $self->get_page_metadata($textref);
610 foreach my $value ( @pagemeta ) {
611 $context .= "$value."; # make each into psuedo-sentence
612 }
613 }
614
615 # TODO: this list is not exhaustive
616 @refwords = ( '(?:is|are)? ?(?:show(?:s|n)|demonstrate(?:d|s)|explains|features) (?:in|by|below|above|here)',
617 '(?:see)? (?:figure|table)? (?:below|above)');
618
619 # extract general references
620 foreach my $rw ( @refwords ) {
621 while ( $context =~ /[\.\?\!\,](.*?$rw\W.*?[\.\?\!\,])/ig ) {
622 my $sentence = $1;
623 $sentence =~ s/\s+/ /g;
624 $sentences{$sentence}+=2;
625 }
626 }
627 # extract specific (figure, table) references by number
628 my ($fignum) = $context =~ /[\.\?\!].*?(?:figure|table)s?[\-\_\ \.](\d+\w*)\W.*?[\.\?\!]/ig;
629 if ( $fignum ) {
630 foreach my $rw ( @refwords ) {
631 while ( $context =~ /[\.\?\!](.*?(figure|table)[\-\_\ \.]$fignum\W.*?[\.\?\!])/ig ) {
632 my $sentence = $1;
633 $sentence =~ s/\s+/ /g;
634 $sentences{$sentence}+=4;
635 }
636 }
637 }
638
639 # sentences with occurances of important words
640 @relwords = $reltext =~ /([a-zA-Z]{4,})/g; # take out small words
641 foreach my $word ( @relwords ) {
642 if ( $self->{'stopwords'}{$word} ) { next } # skip stop words
643 while ( $context =~ /([^\.\?\!]*?$word\W.*?[\.\?\!])/ig ) {
644 my $sentence = $1;
645 $sentence =~ s/\s+/ /g;
646 $sentences{$sentence}++;
647 }
648 }
649 foreach my $sentence ( keys %sentences ) {
650 if ($sentences{$sentence} < $self->{'textrefs_threshold'}) {
651 delete $sentences{$sentence};
652 }
653 }
654 my ($rval) = join "<br>\n", (keys %sentences);
655 if ( $rval && length($rval) > 5 ) {
656 return ( "<tr><td>TextualReferences</td><td>" . $rval . "</td></tr>\n") }
657 else { return "" }
658}
659
660# handles caption extraction
661# calling the extractor with different
662# tags and choosing the best candidate caption
663sub extract_caption_text {
664 my ($self, $tag, $textref, $prevpos, $pos, $nextpos) = (@_);
665 my (@neartext, $len, $hdelim, $mintext, $goodlen,
666 $startpos, $context, $context_size);
667
668 $mintext = $self->{'min_near_text'};
669 $goodlen = $self->{'caption_length'};
670
671 # extract a context to extract near text from (faster)
672 $context_size = $self->{'max_near_text'}*3;
673 $startpos = $pos - ($context_size / 2);
674 if ($startpos < $prevpos ) { $startpos = $prevpos }
675 if ($nextpos != -1 && $context_size > ( $nextpos - $startpos ))
676 { $context_size = ($nextpos - $startpos) }
677
678 $context = substr ( $$textref, $startpos, $context_size );
679 $context =~ s/<!--.*?-->//gs;
680 $context =~ s/^.*-->(.*)/$1/gs;
681 $context =~ s/(.*)<!--.*$/$1/gs;
682
683 # try stepping through markup delimiter sets
684 # and selecting the best one
685 foreach $hdelim ( @{ $self->{'cdelims'} } ) {
686 @neartext = $self->extract_caption($tag, $hdelim, \$context);
687 $len = length(join("", @neartext));
688 last if ($len >= $mintext && $len <= $goodlen);
689 }
690 # reject if well over reasonable length
691 if ( $len > $goodlen ) {
692 @neartext = [];
693 }
694 $neartext[0] = " " if (! defined $neartext[0]);
695 $neartext[1] = " " if (! defined $neartext[1]);
696 return "<tr><td>Caption</td><td>" . (join ",", @neartext) . "</td></tr>\n"; # TODO: the | is for testing purposes
697} # end extract_caption_text
698
699# the previous section header often gives a bit
700# of context to the section that the image is
701# in (invariably the header is before/above the image)
702# so extract the text of the closest header above the image
703#
704# this fn just gets all the headers above the image, within the context window
705sub get_prev_header {
706 my ($self, $pos, $textref) = (@_);
707 my ($rhtext);
708 while ( $$textref =~ /<h\d>(.*?)<\/h\d>/sig ) {
709 # only headers before image
710 if ((pos $$textref) < $pos) {
711 $rhtext .= "$1, ";
712 }
713 }
714 if ( $rhtext ) { return "Header($rhtext)" }
715 else { return "" }
716}
717
718# not the most robust tag stripping
719# regexps (see perl.com FAQ) but good enough
720#
721# used by caption & tag-based near text algorithms
722sub strip_tags {
723 my ( $self, $value ) = @_;
724 if ( ! defined($value) ) { $value = "" } # handle nulls
725 else {
726 $value =~ s/<.*?>//gs; # strip all html tags
727 $value =~ s/\s+/\ /g; # remove extra whitespace
728 $value =~ s/\&\w+\;//g; # remove &nbsp; etc
729 }
730 return $value;
731}
732
733# uses the given tag(s) to identify
734# the caption near to the image
735# (below, above or both below and above)
736sub extract_caption {
737 my ($self, $tag, $bound_tag, $contextref) = (@_);
738 my (@nt, $n, $etag, $gotcap);
739 return ("", "") if ( ! ($$contextref =~ /\Q$tag/) );
740
741 $nt[0] = $`;
742 $nt[1] = $';
743 $gotcap = 0;
744
745 # look before the image for a boundary tag
746 ($etag, $nt[0]) = $nt[0] =~ /<($bound_tag)[\s]?.*?>(.*?)$/is;
747 # if bound_tag too far from the image, then prob not caption
748 # (note: have to allow for tags, so multiply by 3
749 if ( $etag && length($nt[0]) < ($self->{'caption_length'} * 3) ) {
750 if ( $nt[0] =~ /<\/$etag>/si ) {
751 # the whole caption is above the image: <tag>text</tag><img>
752 ($nt[0]) =~ /<(?:$etag)[\s]?.*?>(.*?)<\/$etag>/is;
753 $nt[0] = $self->strip_tags($nt[0]);
754 if ( length($nt[0]) > $self->{'min_near_text'} ) {
755 $gotcap = 1;
756 $nt[1] = "";
757 }
758
759 } elsif ( $nt[1] =~ /<\/$etag>/si) {
760 # the caption tag covers image: <tag>text?<img>text?</tag>
761 ($nt[1]) = $nt[1] =~ /(.*?)<\/$etag>/si;
762 $nt[0] = $self->strip_tags($nt[0] . $nt[1]);
763 if ( length($nt[0]) > $self->{'min_near_text'} ) {
764 $gotcap = 2;
765 $nt[1] = "";
766 }
767 }
768 }
769 # else try below the image
770 if ( ! $gotcap ) {
771 # the caption is after the image: <img><tag>text</tag>
772 ($etag, $nt[1]) = $nt[1] =~ /^.*?<($bound_tag)[\s]?.*?>(.*)/is;
773 if ( $etag && $nt[1] =~ /<\/$etag>/s) {
774 ($nt[1]) = $nt[1] =~ /(.*?)<\/$etag>/si;
775 $gotcap = 3;
776 $nt[0] = "";
777 $nt[1] = $self->strip_tags($nt[1]);
778 }
779 }
780 if ( ! $gotcap ) { $nt[0] = $nt[1] = "" }
781 else {
782 # strip part-tags
783 $nt[0] =~ s/^.*>//s;
784 $nt[1] =~ s/<.*$//s;
785 }
786 my ($type);
787 if ( $gotcap == 0 ) { return ("nocaption", "") }
788 elsif ( $gotcap == 1 ) { $type = "captionabove:" }
789 elsif ( $gotcap == 2 ) { $type = "captioncovering:" }
790 elsif ( $gotcap == 3 ) { $type = "captionbelow:" }
791 return ($type, $nt[0], $nt[1]);
792}
793
794# tag-based near text
795#
796# tries different tag sets
797# and chooses the best one
798sub extract_near_text {
799 my ($self, $tag, $textref, $prevpos, $pos, $nextpos) = (@_);
800 my (@neartext, $len, $hdelim, $maxtext, $mintext, $goodlen,
801 @bestlen, @best, $startpos, $context, $context_size,
802 $dist, $bdist, $best1, $i, $nt);
803 $bestlen[0] = $bestlen[1] = 0; $bestlen[2] = $bdist = 999999;
804 $best[0] = $best[1] = $best[2] = "";
805 $maxtext = $self->{'max_near_text'};
806 $mintext = $self->{'min_near_text'};
807 $goodlen = $self->{'neartext_length'};
808
809 # extract a context to extract near text from (faster)
810 $context_size = $maxtext*4;
811 $startpos = $pos - ($context_size / 2);
812 if ($startpos < $prevpos ) { $startpos = $prevpos }
813 if ($nextpos != -1 && $context_size > ( $nextpos - $startpos ))
814 { $context_size = ($nextpos - $startpos) }
815 $context = substr ( $$textref, $startpos, $context_size );
816 $context =~ s/<!--.*?-->//gs;
817 $context =~ s/^.*-->(.*)/$1/gs;
818 $context =~ s/(.*)<!--.*$/$1/gs;
819
820 # try stepping through markup delimiter sets
821 # and selecting the best one
822 foreach $hdelim ( @{ $self->{'delims'} } ) {
823 @neartext = $self->extract_tagged_neartext($tag, $hdelim, \$context);
824 $nt = join("", @neartext);
825 $len = length($nt);
826 # Priorities:
827 # 1. Greater than mintext
828 # 2. Less than maxtext
829 # 3. Closest to goodlen
830 if ( $len <= $goodlen && $len > $bestlen[0] ) {
831 $bestlen[0] = $len;
832 $best[0] = $hdelim;
833 } elsif ( $len >= $maxtext && $len < $bestlen[2] ) {
834 $bestlen[2] = $len;
835 $best[2] = $hdelim;
836 } elsif ( $len >= $bestlen[0] && $len <= $bestlen[2] ) {
837 $dist = abs($goodlen - $len);
838 if ( $dist < $bdist ) {
839 $bestlen[1] = $len;
840 $best[1] = $hdelim;
841 $bdist = $dist;
842 }
843 }
844 }
845 $best1 = 2;
846 foreach $i ( 0..2 ) {
847 if ( $bestlen[$i] == 999999 ) { $bestlen[$i] = 0 }
848 $dist = abs($goodlen - $bestlen[$i]);
849 if ( $bestlen[$i] > $mintext && $dist <= $bdist ) {
850 $best1 = $i;
851 $bdist = $dist;
852 }
853 }
854 @neartext = $self->extract_tagged_neartext($tag, $best[$best1], \$context);
855 if ( $bestlen[$best1] > $maxtext ) {
856 # truncate on word boundary if too much text
857 my $hmax = $maxtext / 2;
858 ($neartext[0]) = $neartext[0] =~ /([^\s]*.{1,$hmax})$/s;
859 ($neartext[1]) = $neartext[1] =~ /^(.{1,$hmax}[^\s]*)/s;
860 } elsif ( $bestlen[$best1] < $mintext ) {
861 # use plain text extraction if tags failed (e.g. usable tag outside context)
862 print {$self->{'outhandle'}} "HTMLImagePlugin: Fallback to plain-text extraction for $tag\n"
863 if $self->{'verbosity'} > 2;
864 $neartext[0] = "<tr><td>RawNeartext</td><td>" . $self->extract_raw_neartext($tag, $textref) . "</td></tr>";
865 $neartext[1] = "";
866 }
867 # get previous header if available
868 $neartext[0] .= "<br>\n" .
869 $self->get_prev_header($pos, \$context) if ( $self->{'aggressiveness'} >= 4 );
870 $neartext[0] = " " if (! defined $neartext[0]);
871 $neartext[1] = " " if (! defined $neartext[1]);
872
873 return "<tr><td>NearText</td><td>" . (join "|", @neartext) . "</td></tr>\n"; # TODO: the | is for testing purposes
874} # end extract_near_text
875
876# actually captures tag-based
877# near-text given a tag set
878sub extract_tagged_neartext {
879 my ($self, $tag, $bound_tag, $textref) = (@_);
880 return "" if ( ! ($$textref =~ /\Q$tag/) );
881 my (@nt, $delim, $pre_tag, $n);
882 $nt[0] = $`;
883 $nt[1] = $';
884
885 # get text after previous image tag
886 $nt[0] =~ s/.*<($bound_tag)[^>]*>(.*)/$2/is; # get rid of preceding text
887 if (defined($1)) { $delim = $1 }
888 $pre_tag = $bound_tag;
889
890 if (defined($delim)) {
891 # we want to try and use the end tag of the previous delimiter
892 # (put it on the front of the list)
893 $pre_tag =~ s/(^|\|)($delim)($|\|)//i; # take it out
894 $pre_tag =~ s/\|\|/\|/i; # replace || with |
895 $pre_tag = $delim . "|" . $pre_tag; # put it on the front
896 }
897
898 # get text before next image tag
899 $nt[1] =~ s/<\/?(?:$pre_tag)[^>]*>.*//is; # get rid of stuff after first delimiter
900
901 # process related text
902 for $n (0..1) {
903 if ( defined($nt[$n]) ) {
904 $nt[$n] =~ s/<.*?>//gs; # strip all html tags
905 $nt[$n] =~ s/\s+/\ /gs; # remove extra whitespace
906 $nt[$n] =~ s/\&\w+\;//sg; # remove &nbsp; etc
907 # strip part-tags
908 if ( $n == 0 ) { $nt[0] =~ s/^.*>//s }
909 if ( $n == 1 ) { $nt[1] =~ s/<.*$//s }
910 } else { $nt[$n] = ""; } # handle nulls
911 }
912 return @nt;
913}
914
915# this function is fall-back
916# if tags aren't suitable.
917#
918# extracts a fixed length of characters
919# either side of image tag (on word boundary)
920sub extract_raw_neartext {
921 my ($self, $tag, $textref) = (@_);
922 my ($rawtext, $startpos, $fp);
923 my $imgs = \%{$self->{'imglist'}};
924 ($fp) = $tag =~ /([\w\\\/]+\.(?:gif|jpe?g|png))/is;
925 if (! $fp) { return " " };
926 # if the cached, plain-text version isn't there, then create it
927 $self->init_plaintext($textref) unless defined($self->{'plaintext'});
928
929 # take the closest maxtext/2 characters
930 # either side of the tag (by word boundary)
931 return "" if ( ! exists $imgs->{$fp}{'rawpos'} );
932 $startpos = $imgs->{$fp}{'rawpos'} - (($self->{'max_near_text'} / 2) + 20);
933 if ( $startpos < 0 ) { $startpos = 0 }
934 $rawtext = substr $self->{'plaintext'}, $startpos, $self->{'max_near_text'} + 20;
935 $rawtext =~ s/\s\s/ /g;
936
937 return $rawtext;
938}
939
940# init plaintext variable for HTML-stripped version
941# (for full text index/raw assoc text extraction)
942sub init_plaintext {
943 my ($self, $textref) = (@_);
944 my ($page, $fp);
945 my $imgs = \%{$self->{'imglist'}};
946 $page = $$textref; # make a copy of original
947
948 # strip tags around image filenames so they don't get zapped
949 $page =~ s/<\w+\s+.*?([\w\/\\]+\.(?:gif|jpe?g|png))[^>]*>/\"$1\"/gsi;
950 $page =~ s/<.*?>//gs;
951 $page =~ s/&nbsp;/ /gs;
952 $page =~ s/&amp;/&/gs; #TODO: more &zzz; replacements (except &lt;, $gt;)
953
954 # get positions and strip images
955 while ( $page =~ /([^\s\'\"]+\.(jpe?g|gif|png))/ig ) {
956 $fp = $1;
957 if ( $imgs->{$fp}{'exists'} ) {
958 $imgs->{$fp}{'rawpos'} = pos $page;
959 }
960 $page =~ s/\"$fp\"//gs;
961 }
962 $self->{'plaintext'} = $page;
963}
964
965# finds and filters images based on size
966# (dimensions, height, filesize) and existence
967#
968# looks for image filenames (.jpg, .gif, etc)
969# and checks for existence on disk
970# (hence supports most JavaScript images)
971sub get_img_list {
972 my $self = shift (@_);
973 my ($textref) = (@_);
974 my ($filepath, $relpath, $abspath, $pos, $num, $width, $height, $filesize);
975 my $imgs = \%{$self->{'imglist'}};
976 while ( $$textref =~ /([^\s\'\"]+\.(jpe?g|gif|png))/ig ) {
977 $filepath = $1;
978 $pos = pos $$textref;
979 next if ( $imgs->{$filepath}{'relpath'} );
980 $relpath = $filepath;
981 $relpath =~ s/^http\:\/\///; # remove http:// in case we have mirrored it
982 $relpath =~ s/\\/\//g; # replace \ with /
983 $relpath =~ s/^\.\///s; # make "./filepath" into "filepath"
984 $imgs->{$filepath}{'relpath'} = $relpath;
985 $abspath = "$self->{'htpath'}/$relpath";
986
987 if (! -e $abspath) { next }
988
989 # can't modify real filepath var because it
990 # then can't be located in the page for tag recognition later
991 my $imagick_cmd = "\"".&util::get_perl_exec()."\" -S gs-magick.pl";
992 ($width, $height) =
993 `$imagick_cmd identify $abspath -ping -format "%wx%h"` =~ /^(\d*)x(\d*)$/m;
994 if (! ($width && $height)) {
995 print STDERR "HTMLImagePlugin: ($abspath) 'identify' failed. Check ImageMagicK binaries are installed and working correctly\n"; next;
996 }
997 $filesize = (-s $abspath);
998 if ( $filesize >= $self->{'min_size'}
999 && ( $width >= $self->{'min_width'} )
1000 && ( $height >= $self->{'min_height'} ) ) {
1001
1002 $imgs->{$filepath}{'exists'} = 1;
1003 $imgs->{$filepath}{'pos'} = $pos;
1004 $imgs->{$filepath}{'width'} = $width;
1005 $imgs->{$filepath}{'height'} = $height;
1006 $imgs->{$filepath}{'filesize'} = $filesize;
1007 } else {
1008 print {$self->{'outhandle'}} "HTMLImagePlugin: skipping $self->{'base_path'}/$relpath: $filesize, $width x $height\n"
1009 if $self->{'verbosity'} > 2;
1010 }
1011 }
1012 $num = 0;
1013 foreach my $i ( keys %{$imgs} ) {
1014 if ( $imgs->{$i}{'pos'} ) {
1015 $num++;
1016 } else { delete $imgs->{$i} }
1017 }
1018 return $num;
1019}
1020
1021# make the text available to the read function
1022# by making it an object variable
1023sub read_file {
1024 my ($self, $filename, $encoding, $language, $textref) = @_;
1025 $self->SUPER::read_file($filename, $encoding, $language, $textref);
1026
1027 # if HTMLplug has run through, then it will
1028 # have replaced references so we have to
1029 # make a copy of the text before processing
1030 if ( $self->{'index_pages'} ) {
1031 $self->{'text'} = $$textref;
1032 $self->{'textref'} = \($self->{'text'});
1033 } else {
1034 $self->{'textref'} = $textref;
1035 }
1036 $self->{'plaintext'} = undef;
1037}
1038
1039# HTMLPlugin only extracts meta-data if it is specified in plugin options
1040# hence a special function to do it here
1041sub get_meta_value {
1042 my ($self, $name, $textref) = @_;
1043 my ($value);
1044 $name = lc $name;
1045 if ($name eq "title") {
1046 ($value) = $$textref =~ /<title>(.*?)<\/title>/is
1047 } else {
1048 my $qm = "(?:\"|\')";
1049 ($value) = $$textref =~ /<meta name\s*=\s*$qm?$name$qm?\s+content\s*=\s*$qm?(.*?)$qm?\s*>/is
1050 }
1051 $value = "" unless $value;
1052 return $value;
1053}
1054
1055# make filename an anchor reference
1056# so we can go straight to the image
1057# within the cached version of the source page
1058# (augment's HTMLPlugin sub)
1059sub replace_images {
1060 my $self = shift (@_);
1061 my ($front, $link, $back, $base_dir,
1062 $file, $doc_obj, $section) = @_;
1063 $link =~ s/\"//g;
1064 my ($a_name) = $link;
1065 $a_name =~ s/[\/\\\:\&]/_/g;
1066 # keep a list so we don't repeat the same anchor
1067 if ( ! $self->{'imglist'}{$link}{'anchored'} ) {
1068 $front = "<a name=\"gsdl_$a_name\">$front";
1069 $back = "$back</a>";
1070 $self->{'imglist'}{$link}{'anchored'} = 1;
1071 }
1072 return $self->SUPER::replace_images($front, $link, $back, $base_dir,
1073 $file, $doc_obj, $section);
1074}
1075
10761;
Note: See TracBrowser for help on using the repository browser.