root/main/trunk/greenstone2/perllib/plugins/GreenstoneSQLPlugin.pm @ 32565

Revision 32565, 19.6 KB (checked in by ak19, 20 months ago)

I think this is a bugfix to plugin.pm::remove_some(): when processing files to delete on deletion/reindexing, it would return after the first successful remove_one(). This wasn't a noticeable problem in the past, since no plugin had a real need to implement remove_one(). But now that GreenstoneSQLPlugin does remove_one() in earnest, we need all the docs marked for deletion to be processed.

###########################################################################
#
# GreenstoneSQLPlugin.pm -- reads into doc_obj from SQL db and docsql.xml
# Metadata and/or fulltext are stored in SQL db, the rest may be stored in
# the docsql .xml files.
# A component of the Greenstone digital library software
# from the New Zealand Digital Library Project at the
# University of Waikato, New Zealand.
#
# Copyright (C) 2001 New Zealand Digital Library Project
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
###########################################################################

package GreenstoneSQLPlugin;


use strict;
no strict 'refs'; # allow filehandles to be variables and vice versa

use DBI;
use docprint; # for the new unescape_text() subroutine
use GreenstoneXMLPlugin;
use gssql;

# TODO:
# - Run the TODOs here, in the Plugout and in gssql.pm past Dr Bainbridge.
# - Have not yet tested writing out just meta or just fulltxt to the sql db and reading just that
#   back in from the sql db while the remainder is read back in from the docsql .xml files.

# TODO: deal with incremental vs removeold. If docs are removed from the import folder, then the
# import step won't delete them from archives but the buildcol step will. Need to implement this
# with this database plugin, or wherever the actual flow is.

# TODO Q: is "reindex" = del from db + add to db?
# - is this okay for reindexing, or will it need to modify existing values (update table)?
# - if it's okay, what does reindex need to accomplish (and how) if the OID changes because the hash id produced is different?
# - delete is accomplished in the GS SQL Plugin, during buildcol.pl. When should reindexing take place?
#   During the SQL plugout/import.pl or during the plugin? If adding is done by GSSQLPlugout, does it
#   need to be reimplemented in GSSQLPlugin to support the adding portion of reindexing?

# TODO: Add public instructions on using this plugin and its plugout: start with installing the mysql
# binary, changing the pwd, and running the server (and the client against it for checking: basic cmds
# like create and drop). Then discuss the db name, the table names (per coll), the db cols and col types,
# and how the plugout and plugin work.
# Discuss the plugin/plugout parameters. (A rough sketch of those manual setup steps follows below.)
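# A rough, illustrative sketch of those setup steps (the commands and the database name below
# are examples only, not prescriptive, and the password syntax applies to recent MySQL versions):
#   1. Install the MySQL server binaries and start the server.
#   2. Optionally change the root password from the mysql client, e.g.:
#        ALTER USER 'root'@'localhost' IDENTIFIED BY 'new_password';
#   3. Sanity-check the server with the client, e.g.:
#        $ mysql -u root -p
#        mysql> CREATE DATABASE test_gs_setup;
#        mysql> DROP DATABASE test_gs_setup;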


# DONE:
# + TODO: Incremental delete can't work until GSSQLPlugout has implemented build_mode = incremental
#   (instead of tossing away the db on every build)
# + Ask about the docsql naming convention adopted to identify the OID. Better way?
#   collection names -> table names: it seems hyphens are not allowed. Changed to underscores.
# + Startup parameters (except removeold/build_mode)
# + How do we detect we're to do removeold during the plugout in the import.pl phase?
# + Incremental building: where do we need to add code to delete rows from our sql table after
#   incrementally importing a coll with fewer docs (for instance)? What about deleted/modified meta?
# + Ask if I can assume that all SQL dbs (not just MySQL) will preserve the order of inserted nodes
#   (sections), which in this case makes it easy to reconstruct the doc_obj in memory in the correct order.
#   YES. Otherwise, for later db types (drivers), we can order by the primary key column and then by the did column.
# + NOTTODO: when the db is not running, GLI is paralyzed -> can we set a timeout on the DBI connection attempt?
#   NOT A PROBLEM: Tested and found the DBI connection attempt fails immediately when the MySQL server is not
#   running. The GLI "paralyzing" incident last time was not because of the gs sql connection code,
#   but because my computer was freezing on-and-off.
# + "Courier" demo documents in the lucene-sql collection: a character (degree symbol) was not preserved
#   in the title. Is this because we encode in utf8 when putting into the db and reading back in?
#   Test a doc with meta and text containing chars like the macron in Maori text.
# + TODO Q: During import, the GS SQL Plugin is called before the GS SQL Plugout, with the undesirable side
#   effect that if the db doesn't exist, gssql::use_db() fails, as it won't create the db.
#   This got fixed when GSSQLPlugin stopped connecting on init().


########################################################################################

# GreenstoneSQLPlugin inherits from GreenstoneXMLPlugin so that if meta or fulltext
# is still written out to doc.xml (docsql .xml), that will be processed as usual,
# whereas GreenstoneSQLPlugin will process all the rest (fulltext and/or meta, whichever
# is written out by GreenstoneSQLPlugout into the SQL db).


sub BEGIN {
    @GreenstoneSQLPlugin::ISA = ('GreenstoneXMLPlugin');
}

# This plugin must be in the document plugins pipeline IN PLACE OF GreenstoneXMLPlugin,
# so we won't have a process_exp conflict here.
# The structure of docsql.xml files is identical to doc.xml and the contents are similar, except:
#   - since metadata and/or fulltxt are stored in the mysql db instead, just XML comments indicating
#     this are left inside docsql.xml within the <Description> (for meta) and/or <Content> (for txt)
#   - the root element Archive now has a docoid attribute: <Archive docoid="OID">
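# An illustrative docsql .xml skeleton along those lines (the OID value and the exact
# wording of the placeholder comments are hypothetical):
#
#   <?xml version="1.0" encoding="utf-8" ?>
#   <Archive docoid="HASH0123456789abcdef">
#     <Section>
#       <Description>
#         <!-- metadata stored in SQL db -->
#       </Description>
#       <Content>
#         <!-- full text stored in SQL db -->
#       </Content>
#     </Section>
#   </Archive>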
sub get_default_process_exp {
    my $self = shift (@_);

    return q^(?i)docsql(-\d+)?\.xml$^; # regex based on this method in GreenstoneXMLPlugin
    #return q^(?i)docsql(-.+)?\.xml$^; # no longer storing the OID embedded in the docsql .xml filename
}

my $process_mode_list =
    [ { 'name' => "meta_only",
        'desc' => "{GreenstoneSQLPlug.process_mode.meta_only}" },
      { 'name' => "text_only",
        'desc' => "{GreenstoneSQLPlug.process_mode.text_only}" },
      { 'name' => "all",
        'desc' => "{GreenstoneSQLPlug.process_mode.all}" } ];

my $arguments =
    [ { 'name' => "process_exp",
        'desc' => "{BaseImporter.process_exp}",
        'type' => "regexp",
        'deft' => &get_default_process_exp(),
        'reqd' => "no" },
      { 'name' => "process_mode",
        'desc' => "{GreenstoneSQLPlug.process_mode}",
        'type' => "enum",
        'list' => $process_mode_list,
        'deft' => "all",
        'reqd' => "no"},
      { 'name' => "db_driver",
        'desc' => "{GreenstoneSQLPlug.db_driver}",
        'type' => "string",
        'deft' => "mysql",
        'reqd' => "yes"},
      { 'name' => "db_client_user",
        'desc' => "{GreenstoneSQLPlug.db_client_user}",
        'type' => "string",
        'deft' => "root",
        'reqd' => "yes"},
      { 'name' => "db_client_pwd",
        'desc' => "{GreenstoneSQLPlug.db_client_pwd}",
        'type' => "string",
        'deft' => "",
        'reqd' => "yes"}, # pwd required?
      { 'name' => "db_host",
        'desc' => "{GreenstoneSQLPlug.db_host}",
        'type' => "string",
        'deft' => "127.0.0.1",
        'reqd' => "yes"},
    ];

my $options = { 'name'     => "GreenstoneSQLPlugin",
                'desc'     => "{GreenstoneSQLPlugin.desc}",
                'abstract' => "no",
                'inherits' => "yes",
                'args'     => $arguments };

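# For reference, a hypothetical collect.cfg plugin line using these options
# (the password value is an example only):
#
#   plugin GreenstoneSQLPlugin -process_mode all -db_driver mysql -db_client_user root -db_client_pwd mypassword -db_host 127.0.0.1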

# TODO: For "on cancel", add a SIGTERM handler or similar to call end(),
# or to explicitly call gs_sql->close_connection if $gs_sql is defined

sub new {
    my ($class) = shift (@_);
    my ($pluginlist,$inputargs,$hashArgOptLists) = @_;
    push(@$pluginlist, $class);

    push(@{$hashArgOptLists->{"ArgList"}},@{$arguments});
    push(@{$hashArgOptLists->{"OptList"}},$options);

    my $self = new GreenstoneXMLPlugin($pluginlist, $inputargs, $hashArgOptLists);


    #return bless $self, $class;
    $self = bless $self, $class;
    if ($self->{'info_only'}) {
        # If running pluginfo, we don't need to go further.
        return $self;
    }

    # do anything else that needs to be done here when not running pluginfo

    return $self;
}

# This is called once if removeold is set with import.pl. Most plugins will do
# nothing, but if a plugin does any stuff outside of creating the doc obj, then
# it may need to clear something.
# In the case of the GreenstoneSQL plugs: this is the first time we have a chance
# to purge the current collection's tables from the current site's database.
sub remove_all {
    my $self = shift (@_);
    my ($pluginfo, $base_dir, $processor, $maxdocs) = @_;

    print STDERR "   Building with removeold option set, so deleting current collection's tables if they exist\n" if($self->{'verbosity'});

    # if we're in here, we'd already have run 'use database <site_name>;' during sub init()
    # so we can go ahead and delete the collection's tables
    my $gs_sql = $self->{'gs_sql'};
    $gs_sql->delete_collection_tables(); # will delete them if they exist

    # and recreate tables? No. Tables' existence is ensured in GreenstoneSQLPlugout::begin()
    my $proc_mode = $self->{'process_mode'};
    if($proc_mode ne "text_only") {
        $gs_sql->ensure_meta_table_exists();
    }
    if($proc_mode ne "meta_only") {
        $gs_sql->ensure_fulltxt_table_exists();
    }
}

# This is called per document for docs that have been deleted from the
# collection. Most plugins will do nothing,
# but if a plugin does any stuff outside of creating the doc obj, then it may need
# to clear something.
# Remove the doc(s) denoted by oids from the GS SQL db.
# This takes care of incremental deletes (docs marked D by ArchivesInfPlugin when building
# incrementally) as well as cases of "Non-incremental Delete", see ArchivesInfPlugin.pm
sub remove_one {
    my $self = shift (@_);

    my ($file, $oids, $archivedir) = @_;

    print STDERR "@@@ IN SQLPLUG::REMOVE_ONE: $file\n";

    #return undef unless $self->can_process_this_file($file); # NO, DON'T DO THIS:
           # WE DON'T CARE IF IT'S AN IMAGE FILE THAT WAS DELETED.
           # WE CARE ABOUT REMOVING THE DOCOID OF THAT IMAGE FILE FROM THE DB

    print STDERR "*****************************\nAsked to remove_one oid\n***********************\n";

    my $gs_sql = $self->{'gs_sql'} || return 0; # couldn't make the connection, or no db, etc

    if(scalar @$oids > 1) {
        print STDERR "TODO: We now have reason to optimise GreenstoneSQLPlugin::remove_one() by using dbi::prepare() + dbi::execute() instead of dbi::do().\n";
    }
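    # A rough sketch of that prepare()/execute() optimisation, kept as a comment since it is
    # untested and assumes hypothetical gssql accessors (get_db_handle(), get_metadata_table_name())
    # plus the 'did' doc-id column used elsewhere in this plugin:
    #
    #   my $dbh = $gs_sql->get_db_handle();
    #   my $del_sth = $dbh->prepare("DELETE FROM " . $gs_sql->get_metadata_table_name() . " WHERE did = ?");
    #   $del_sth->execute($_) foreach (@$oids);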

    my $proc_mode = $self->{'process_mode'};
    foreach my $oid (@$oids) {
        if($proc_mode eq "all" || $proc_mode eq "meta_only") {
            print STDERR "@@@@@@@@ Deleting $oid from meta table\n" if $self->{'verbosity'} > 2;
            $gs_sql->delete_recs_from_metatable_with_docid($oid);
        }
        if($proc_mode eq "all" || $proc_mode eq "text_only") {
            print STDERR "@@@@@@@@ Deleting $oid from fulltxt table\n" if $self->{'verbosity'} > 2;
            $gs_sql->delete_recs_from_texttable_with_docid($oid);
        }
    }
    return 1;
}


sub xml_start_tag {
    my $self = shift(@_);
    my ($expat, $element) = @_;

    my $outhandle = $self->{'outhandle'};

    $self->{'element'} = $element;
    if ($element eq "Archive") { # docsql.xml files contain an OID attribute on the Archive element
        # the element's attributes are in %_ as per ReadXMLFile::xml_start_tag() (while $_
        # contains the tag)

        # Don't access %_{'docoid'} directly: we keep getting a warning message to
        # use $_{'docoid'} for scalar contexts, but %_ is the element's attr hashmap
        # whereas $_ has the tag info. So we don't want to do $_{'docoid'}.
        my %attr_hash = %_; # right way, see OAIPlugin.pm
        $self->{'doc_oid'} = $attr_hash{'docoid'};
        ##print STDERR "XXXXXXXXXXXXXX in SQLPlugin::xml_start_tag()\n";
        print $outhandle "Extracted OID from docsql.xml: ".$self->{'doc_oid'}."\n"
            if $self->{'verbosity'} > 2;

    }
    else { # let superclass GreenstoneXMLPlugin continue to process <Section> and <Metadata> elements
        $self->SUPER::xml_start_tag(@_);
    }
}

# TODO Q: Why are there 4 passes when we're only indexing at doc and section level (2 passes)?
# What's the dummy pass, and why is there a pass for infodb?

# We should only ever get here during the buildcol.pl phase.
# At the end of superclass GreenstoneXMLPlugin.pm's close_document() method,
# the doc_obj in memory is processed (indexed) and then made undef.
# So we have to work with doc_obj before superclass close_document() is finished.
sub close_document {
    my $self = shift(@_);

    ##print STDERR "XXXXXXXXX in SQLPlugin::close_doc()\n";

    my $gs_sql = $self->{'gs_sql'};

    my $outhandle = $self->{'outhandle'};
    my $doc_obj = $self->{'doc_obj'};

    my $oid = $self->{'doc_oid'}; # we stored the current doc's OID during sub xml_start_tag()
    my $proc_mode = $self->{'process_mode'};

    # For now, we have access to doc_obj (until just before super::close_document() terminates)

    # no need to call $self->{'doc_obj'}->set_OID($oid);
    # because either the OID is stored in the SQL db as meta 'Identifier' alongside other metadata
    # or it's stored in the doc.xml as metadata 'Identifier' alongside other metadata.
    # Either way, the Identifier meta will be read into the docobj automatically with the other meta.

    print STDERR "   GreenstoneSQLPlugin processing doc $oid (reading into docobj from SQL db)\n"
        if $self->{'verbosity'} > 0;

    if($proc_mode eq "all" || $proc_mode eq "meta_only") {
        # read in meta for the collection (i.e. select * from the <col>_metadata table)

        my $sth = $gs_sql->select_from_metatable_matching_docid($oid);
        print $outhandle "### SQL select stmt: ".$sth->{'Statement'}."\n"
            if $self->{'verbosity'} > 2;

        print $outhandle "----------SQL DB contains meta-----------\n" if $self->{'verbosity'} > 2;
        # https://www.effectiveperlprogramming.com/2010/07/set-custom-dbi-error-handlers/
        while( my @row = $sth->fetchrow_array() ) {
            #print $outhandle "row: @row\n";
            my ($primary_key, $did, $sid, $metaname, $metaval) = @row;

            # get rid of the artificial "root" introduced in the section id when saving to the sql db
            $sid =~ s@^root@@;
            $sid = $doc_obj->get_top_section() unless $sid;
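            # e.g. a sid stored as plain "root" becomes empty after the substitution above
            # and is then mapped to the doc_obj's top section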
            print $outhandle "### did: $did, sid: |$sid|, meta: $metaname, val: $metaval\n"
                if $self->{'verbosity'} > 2;

            # TODO: we accessed the db in utf8 mode, so we can call doc_obj->add_utf8_metadata directly:
            $doc_obj->add_utf8_metadata($sid, $metaname, &docprint::unescape_text($metaval));
        }
        print $outhandle "----------FIN READING DOC's META FROM SQL DB------------\n"
            if $self->{'verbosity'} > 2;
    }

    if($proc_mode eq "all" || $proc_mode eq "text_only") {
        # read in fulltxt for the collection (i.e. select * from the <col>_fulltxt table)

        my $fulltxt_table = $gs_sql->get_fulltext_table_name();


        my $sth = $gs_sql->select_from_texttable_matching_docid($oid);
        print $outhandle "### stmt: ".$sth->{'Statement'}."\n" if $self->{'verbosity'} > 2;

        print $outhandle "----------\nSQL DB contains txt entries for-----------\n"
            if $self->{'verbosity'} > 2;
        while( my ($primary_key, $did, $sid, $text) = $sth->fetchrow_array() ) {

            # get rid of the artificial "root" introduced in the section id when saving to the sql db
            #$sid =~ s@^root@@;
            $sid = $doc_obj->get_top_section() if ($sid eq "root");
            print $outhandle "### did: $did, sid: |$sid|, fulltext: <TXT>\n"
                if $self->{'verbosity'} > 2;

            # TODO - pass by ref?
            # TODO: we accessed the db in utf8 mode, so we can call doc_obj->add_utf8_text directly:
            $doc_obj->add_utf8_text($sid, &docprint::unescape_text($text));
        }
        print $outhandle "----------FIN READING DOC's TXT FROM SQL DB------------\n"
            if $self->{'verbosity'} > 2;
    }

    # done reading into the docobj from the SQL db

    # don't forget to clean up on close() in the superclass:
    # it will get the doc_obj indexed and then make it undef
    $self->SUPER::close_document(@_);
}


# TODO: we only want to work with the sql db when running buildcol.pl. Unfortunately, this also runs on import.pl.
# During import, the GS SQL Plugin is called before the GS SQL Plugout, with the undesirable side
# effect that if the db doesn't exist, gssql::use_db() fails, as it won't create the db.

# GS SQL Plugin::init() (and deinit()) is called by import.pl and also by buildcol.pl.
# This means it connects and disconnects during import.pl as well. This is okay,
# as removeold, which should drop the collection tables, happens during the import phase
# and therefore also requires a db connection.
# TODO: Eventually we could try moving get_gssql_instance into gssql.pm? That way both the GS SQL Plugin
# and Plugout would be using one connection during the import.pl phase when both plugs exist.
# (A rough sketch of that idea follows below.)

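# A very rough sketch of that shared-instance idea (hypothetical: gssql.pm does not
# currently provide such an accessor, and the parameter handling is only indicative):
#
#   # in gssql.pm
#   my $_shared_gssql = undef;
#   sub get_gssql_instance {
#       my ($params) = @_;
#       $_shared_gssql = new gssql($params) unless defined $_shared_gssql;
#       return $_shared_gssql;
#   }
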
# Call init() not begin() because there can be multiple plugin passes and begin() is called for
# each pass (one for doc level and another for section level indexing), whereas init() is
# called before any and all passes.
# This way, we can connect to the SQL database once per buildcol run.
sub init {
    my ($self) = shift (@_);
    ##print STDERR "@@@@@@@@@@ INIT CALLED\n";

    $self->SUPER::init(@_); # super (GreenstoneXMLPlugin) will not yet be trying to read from doc.xml (docsql .xml) files in init().

    ####################
#    print "@@@ SITE NAME: ". $self->{'site_name'} . "\n" if defined $self->{'site_name'};
#    print "@@@ COLL NAME: ". $ENV{'GSDLCOLLECTION'} . "\n";

#    print STDERR "@@@@ db_pwd: " . $self->{'db_client_pwd'} . "\n";
#    print STDERR "@@@@ user: " . $self->{'db_client_user'} . "\n";
#    print STDERR "@@@@ db_host: " . $self->{'db_host'} . "\n";
#    print STDERR "@@@@ db_driver: " . $self->{'db_driver'} . "\n";
    ####################

    # create the gssql object.
    # The collection name will be used for naming tables (the site name will be used for naming the database)
    my $gs_sql = new gssql({
        'collection_name' => $ENV{'GSDLCOLLECTION'},
        'verbosity' => $self->{'verbosity'}
    });

    # try connecting to the mysql db; if that fails, terminate here
    if(!$gs_sql->connect_to_db({
        'db_driver' => $self->{'db_driver'},
        'db_client_user' => $self->{'db_client_user'},
        'db_client_pwd' => $self->{'db_client_pwd'},
        'db_host' => $self->{'db_host'}
       })
    )
    {
        # This is fatal for the plugin, so terminate here.
        # PrintError would already have displayed the warning message on connection failure.
        die("Could not connect to db. Can't proceed.\n");
    }

    my $db_name = $self->{'site_name'} || "greenstone2"; # one database per GS3 site; for GS2 the db is called greenstone2

    # Attempt to use the db, creating it if it doesn't exist (but don't create the tables yet).
    # Bail if we can't use the database.
    if(!$gs_sql->use_db($db_name)) {

        # This is fatal for the plugin, so terminate here after disconnecting again.
        # PrintError would already have displayed the warning message on load failure.
        $gs_sql->disconnect_from_db()
            || warn("Unable to disconnect from database.\n");
        die("Could not use db $db_name. Can't proceed.\n");
    }


    # store the db handle now that we're connected
    $self->{'gs_sql'} = $gs_sql;
}


# This method also runs on import.pl if gs_sql has a value, but we only want it to do its work on buildcol.
# Call deinit() not end() because there can be multiple plugin passes
# (one for doc level and another for section level indexing),
# and deinit() should be called after all passes.
# This way, we can close the SQL database once per buildcol run.
sub deinit {
    my ($self) = shift (@_);

    ##print STDERR "@@@@@@@@@@ GreenstoneSQLPlugin::DEINIT CALLED\n";

    if($self->{'gs_sql'}) { # only want to work with the sql db if running buildcol.pl; gs_sql won't have
                            # a value except during buildcol, i.e. when processor =~ m/buildproc$/.
        $self->{'gs_sql'}->disconnect_from_db()
            || warn("Unable to disconnect from database " . $self->{'site_name'} . "\n");

        # explicitly delete the gs_sql key (setting the key to undef has a different meaning from deleting it)
        # so that all future use has to make the connection again
        delete $self->{'gs_sql'};
    }

    $self->SUPER::deinit(@_);
}