source: gsdl/trunk/perllib/plugins/AutoExtractMetadata.pm@ 15918

Last change on this file since 15918 was 15869, checked in by kjdon, 16 years ago

Plugin overhaul: BasPlug has been split into several base plugins.

PrintInfo just does the printing for pluginfo.pl, and does the argument parsing in the constructor. All plugins and supporting extractors etc. inherit directly or indirectly from this.

AbstractPlugin adds a few methods to this and is used by the Directory and ArchivesInf plugins. These are not really plugins, so can we remove them? Anyway, not sure if AbstractPlugin will live for very long.

BasePlugin is a proper base plugin, with read and read_into_doc_obj methods. It does nothing with reading in the file or the textcat stuff. It makes a basic doc obj and adds some metadata. It also handles all the blocking stuff, associate ext stuff etc. Binary plugins can implement the process method to do file-specific stuff.

AutoExtractMetadata inherits from BasePlugin and adds automatic metadata extraction using the new Extractor plugins.

ReadTextFile is equivalent in functionality to the old BasPlug: it does language and encoding extraction, and reads in the file. It inherits from AutoExtractMetadata.

If your file type is binary and will have no text, then inherit from BasePlugin. If it's binary but ends up with text (e.g. using convert_to), then inherit from AutoExtractMetadata. If your file is a text type file, then inherit from ReadTextFile.
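The inheritance advice in the commit message can be illustrated with a small, self-contained sketch. The three base classes below are stand-in stubs that only record what the commit message says each layer contributes; they are not the real Greenstone modules. The plugin names (MyImagePlugin, MyConvertedPlugin, MyTextPlugin) and the capabilities method are hypothetical, invented purely for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in stubs, NOT the real Greenstone classes. Each one only lists
# what the commit message says that layer adds.
package BasePlugin;
sub new { my ($class) = @_; return bless {}, $class; }
sub capabilities { return ('basic doc obj'); }

package AutoExtractMetadata;
our @ISA = ('BasePlugin');
sub capabilities {
    my ($self) = @_;
    return ($self->SUPER::capabilities(), 'automatic metadata extraction');
}

package ReadTextFile;
our @ISA = ('AutoExtractMetadata');
sub capabilities {
    my ($self) = @_;
    return ($self->SUPER::capabilities(),
            'language and encoding detection', 'file reading');
}

# Hypothetical plugins, one per case described in the commit message.
package MyImagePlugin;     our @ISA = ('BasePlugin');          # binary, no text
package MyConvertedPlugin; our @ISA = ('AutoExtractMetadata'); # binary, yields text
package MyTextPlugin;      our @ISA = ('ReadTextFile');        # plain text file

package main;

# Return a comma-separated summary of what a plugin class inherits.
sub describe {
    my ($class) = @_;
    return join(', ', $class->new()->capabilities());
}

print "$_: ", describe($_), "\n"
    for qw(MyImagePlugin MyConvertedPlugin MyTextPlugin);
```

Each hypothetical plugin picks up only the layers below its chosen base class, which is the reason for choosing the shallowest base that still covers what the file type needs.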

###########################################################################
#
# AutoExtractMetadata.pm -- base plugin for all plugins that want to do metadata extraction from text and/or metadata
# A component of the Greenstone digital library software
# from the New Zealand Digital Library Project at the
# University of Waikato, New Zealand.
#
# Copyright (C) 2008 New Zealand Digital Library Project
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
###########################################################################

# This plugin uses the supporting Extractors to add metadata extraction
# functionality to BasePlugin.


package AutoExtractMetadata;

use strict;
no strict 'subs';
no strict 'refs'; # allow filehandles to be variables and vice versa

use BasePlugin;
use AcronymExtractor;
use KeyphraseExtractor;
use EmailAddressExtractor;
use DateExtractor;
use GISExtractor;

sub BEGIN {
    @AutoExtractMetadata::ISA = ( 'BasePlugin', 'AcronymExtractor', 'KeyphraseExtractor', 'EmailAddressExtractor', 'DateExtractor', 'GISExtractor' );
}

my $arguments = [];


my $options = { 'name' => "AutoExtractMetadata",
                'desc' => "{AutoExtractMetadata.desc}",
                'abstract' => "yes",
                'inherits' => "no",
                'args' => $arguments };


sub new {

    # Start the AutoExtractMetadata Constructor
    my $class = shift (@_);
    my ($pluginlist,$inputargs,$hashArgOptLists) = @_;
    push(@$pluginlist, $class);

    push(@{$hashArgOptLists->{"ArgList"}},@{$arguments});
    push(@{$hashArgOptLists->{"OptList"}},$options);

    # load up the options and args for the supporting plugins
    new AcronymExtractor($pluginlist, $inputargs, $hashArgOptLists);
    new KeyphraseExtractor($pluginlist, $inputargs, $hashArgOptLists);
    new EmailAddressExtractor($pluginlist, $inputargs, $hashArgOptLists);
    new DateExtractor($pluginlist, $inputargs, $hashArgOptLists);
    new GISExtractor($pluginlist, $inputargs, $hashArgOptLists);

    my $self = new BasePlugin($pluginlist, $inputargs, $hashArgOptLists);

    return bless $self, $class;

}

sub begin {
    my $self = shift (@_);
    my ($pluginfo, $base_dir, $processor, $maxdocs) = @_;

    # initialise those extractors that need initialisation
    $self->initialise_acronym_extractor();
}

sub end {
    # potentially called at the end of each plugin pass
    # import.pl only has one plugin pass, but buildcol.pl has multiple ones

    my ($self) = @_;
    # finalise those extractors that need finalisation
    $self->finalise_acronym_extractor();
}

# here is where we call the extraction methods inherited from the supporting Extractors
sub auto_extract_metadata {
    my $self = shift(@_);
    my ($doc_obj) = @_;

    $self->extract_acronym_metadata($doc_obj);
    $self->extract_keyphrase_metadata($doc_obj);
    $self->extract_email_metadata($doc_obj);
    $self->extract_date_metadata($doc_obj);
    $self->extract_gis_metadata($doc_obj);

}

sub clean_up_after_doc_obj_processing {
    my $self = shift(@_);

    # run the inherited clean-up first, then remove the GISExtractor temp files
    $self->SUPER::clean_up_after_doc_obj_processing();
    $self->GISExtractor::clean_up_temp_files();
}

1;
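
The final method of the listing combines two Perl dispatch styles: a SUPER:: call, which searches the parent classes of the current package, and a fully qualified call (Class::method), which starts the method lookup in a named class. The following is a minimal sketch of that pattern under stated assumptions: StubBase, StubExtractor and StubPlugin are hypothetical stand-ins, not Greenstone classes, and the log array exists only to make the call order visible.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins, NOT the Greenstone classes.
package StubBase;
sub new { my ($class) = @_; return bless { log => [] }, $class; }
sub clean_up {
    my ($self) = @_;
    push @{ $self->{log} }, 'StubBase::clean_up';
}

package StubExtractor;
sub clean_up_temp_files {
    my ($self) = @_;
    push @{ $self->{log} }, 'StubExtractor::clean_up_temp_files';
}

package StubPlugin;
our @ISA = ('StubBase', 'StubExtractor');
sub clean_up {
    my ($self) = @_;
    # SUPER:: searches this package's @ISA from left to right.
    $self->SUPER::clean_up();
    # A qualified name starts the method lookup in the named class.
    $self->StubExtractor::clean_up_temp_files();
}

package main;

my $plugin = StubPlugin->new();
$plugin->clean_up();
print "$_\n" for @{ $plugin->{log} };
```

Naming the class explicitly keeps the call unambiguous when several parent classes could define a method of the same name.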