root/main/trunk/greenstone2/perllib/plugins/GreenstoneSQLPlugin.pm @ 32563

Revision 32563, 19.6 KB (checked in by ak19, 21 months ago)

1. Overhaul of GreenstoneSQLPlugs to handle removeold and incremental delete correctly. And now code also automatically handles 'non-incremental delete' (see mention in ArchivesInfPlugin?). The new version no longer does lazy loading for getting the sql db connection in the GS SQL Plugin, as now the connection needs to be active since the start of the plugin to run SQL delete statements on remove_old. So the db connection code for the GS SQL plugin has moved back into its init() method. Lots of changes to gssql.pm (and some flow on effects to the GS SQL Plugout) as when database tables exist and need to be created have changed. 2. Undoing most of the changes of changeset 32555 since we're doing incremental delete and removeold differently and in the correct way now when using the GreenstoneSQLPlugs.

Line 
1###########################################################################
2#
3# GreenstoneSQLPlugin.pm -- reads into doc_obj from SQL db and docsql.xml
4# Metadata and/or fulltext are stored in SQL db, the rest may be stored in
5# the docsql .xml files.
6# A component of the Greenstone digital library software
7# from the New Zealand Digital Library Project at the
8# University of Waikato, New Zealand.
9#
10# Copyright (C) 2001 New Zealand Digital Library Project
11#
12# This program is free software; you can redistribute it and/or modify
13# it under the terms of the GNU General Public License as published by
14# the Free Software Foundation; either version 2 of the License, or
15# (at your option) any later version.
16#
17# This program is distributed in the hope that it will be useful,
18# but WITHOUT ANY WARRANTY; without even the implied warranty of
19# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
20# GNU General Public License for more details.
21#
22# You should have received a copy of the GNU General Public License
23# along with this program; if not, write to the Free Software
24# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
25#
26###########################################################################
27
28package GreenstoneSQLPlugin;
29
30
31use strict;
32no strict 'refs'; # allow filehandles to be variables and viceversa
33
34use DBI;
35use docprint; # for new unescape_text() subroutine
36use GreenstoneXMLPlugin;
37use gssql;
38
39
40# TODO:
41# - Run TODOs here, in Plugout and in gssql.pm by Dr Bainbridge.
42# - Have not yet tested writing out just meta or just fulltxt to sql db and reading just that
43# back in from the sql db while the remainder is to be read back in from the docsql .xml files.
44
45# TODO: deal with incremental vs removeold. If docs removed from import folder, then import step
46# won't delete it from archives but buildcol step will. Need to implement this with this database plugin or wherever the actual flow is
47
48# TODO Q: is "reindex" = del from db + add to db?
49# - is this okay for reindexing, or will it need to modify existing values (update table)
50# - if it's okay, what does reindex need to accomplish (and how) if the OID changes because hash id produced is different?
51# - delete is accomplished in GS SQL Plugin, during buildcol.pl. When should reindexing take place?
52# during SQL plugout/import.pl or during plugin? If adding is done by GSSQLPlugout, does it need to
53# be reimplemented in GSSQLPlugin to support the adding portion of reindexing.
54
55# TODO: Add public instructions on using this plugin and its plugout: start with installing mysql binary, changing pwd, running the server (and the client against it for checking: basic cmds like create and drop). Then discuss db name, table names (per coll), db cols and col types, and how the plugout and plugin work.
56# Discuss the plugin/plugout parameters.
57
58
59# DONE:
60# + TODO: Incremental delete can't work until GSSQLPlugout has implemented build_mode = incremental
61# (instead of tossing away db on every build)
62# + Ask about docsql naming convention adopted to identify OID. Better way?
63# collection names -> table names: it seems hyphens not allowed. Changed to underscores.
64# + Startup parameters (except removeold/build_mode)
65# + how do we detect we're to do removeold during plugout in import.pl phase
66# + incremental building: where do we need to add code to delete rows from our sql table after
67# incrementally importing a coll with fewer docs (for instance)? What about deleted/modified meta?
68# + Ask if I can assume that all SQL dbs (not just MySQL) will preserve the order of inserted nodes
69# (sections) which in this case had made it easy to reconstruct the doc_obj in memory in the correct order.
70# YES: Otherwise for later db types (drivers), can set order by primary key column and then order by did column
71# + NOTTODO: when db is not running GLI is paralyzed -> can we set timeout on DBI connection attempt?
72#   NOT A PROBLEM: Tested to find DBI connection attempt fails immediately when MySQL server not
73# running. The GLI "paralyzing" incident last time was not because of the gs sql connection code,
74# but because my computer was freezing on-and-off.
75# + "Courier" demo documents in lucene-sql collection: character (degree symbol) not preserved in title. Is this because we encode in utf8 when putting into db and reading back in?
76# Test doc with meta and text like macron in Maori text.
77# + TODO Q: During import, the GS SQL Plugin is called before the GS SQL Plugout with undesirable side
78# effect that if the db doesn't exist, gssql::use_db() fails, as it won't create db.
79#   This got fixed when GSSQLPlugin stopped connecting on init().
80
81
82########################################################################################
83
84# GreenstoneSQLPlugin inherits from GreenstoneXMLPlugin so that it if meta or fulltext
85# is still written out to doc.xml (docsql .xml), that will be processed as usual,
86# whereas GreenstoneSQLPlugin will process all the rest (full text and/or meta, whichever
87# is written out by GreenstoneSQLPlugout into the SQL db).
88
89
90sub BEGIN {
91    @GreenstoneSQLPlugin::ISA = ('GreenstoneXMLPlugin');
92}
93
94# This plugin must be in the document plugins pipeline IN PLACE OF GreenstoneXMLPlugin
95# So we won't have a process exp conflict here.
96# The structure of docsql.xml files is identical to doc.xml and the contents are similar except:
97#   - since metadata and/or fulltxt are stored in mysql db instead, just XML comments indicating
98#   this are left inside docsql.xml within the <Description> (for meta) and/or <Content> (for txt)
99#   - the root element Archive now has a docoid attribute: <Archive docoid="OID">
100sub get_default_process_exp {
101    my $self = shift (@_);
102
103    return q^(?i)docsql(-\d+)?\.xml$^; # regex based on this method in GreenstoneXMLPlugin
104    #return q^(?i)docsql(-.+)?\.xml$^; # no longer storing the OID embedded in docsql .xml filename
105}
106
107my $process_mode_list =
108    [ { 'name' => "meta_only",
109        'desc' => "{GreenstoneSQLPlug.process_mode.meta_only}" },     
110      { 'name' => "text_only",
111        'desc' => "{GreenstoneSQLPlug.process_mode.text_only}" },
112      { 'name' => "all",
113        'desc' => "{GreenstoneSQLPlug.process_mode.all}" } ];
114
115my $arguments =
116    [ { 'name' => "process_exp",
117    'desc' => "{BaseImporter.process_exp}",
118    'type' => "regexp",
119    'deft' => &get_default_process_exp(),
120    'reqd' => "no" },
121      { 'name' => "process_mode",
122    'desc' => "{GreenstoneSQLPlug.process_mode}",
123    'type' => "enum",
124    'list' => $process_mode_list,
125    'deft' => "all",
126    'reqd' => "no"},
127      { 'name' => "db_driver",
128    'desc' => "{GreenstoneSQLPlug.db_driver}",
129    'type' => "string",
130    'deft' => "mysql",
131    'reqd' => "yes"},
132      { 'name' => "db_client_user",
133    'desc' => "{GreenstoneSQLPlug.db_client_user}",
134    'type' => "string",
135    'deft' => "root",
136    'reqd' => "yes"},
137      { 'name' => "db_client_pwd",
138    'desc' => "{GreenstoneSQLPlug.db_client_pwd}",
139    'type' => "string",
140    'deft' => "",
141    'reqd' => "yes"}, # pwd required?
142      { 'name' => "db_host",
143    'desc' => "{GreenstoneSQLPlug.db_host}",
144    'type' => "string",
145    'deft' => "127.0.0.1",
146    'reqd' => "yes"},
147    ];
148
149my $options = { 'name'     => "GreenstoneSQLPlugin",
150        'desc'     => "{GreenstoneSQLPlugin.desc}",
151        'abstract' => "no",
152        'inherits' => "yes",
153            'args'     => $arguments };
154
155
156# TODO: For on cancel, add a SIGTERM handler or so to call end()
157# or to explicitly call gs_sql->close_connection if $gs_sql def
158
159sub new {
160    my ($class) = shift (@_);
161    my ($pluginlist,$inputargs,$hashArgOptLists) = @_;
162    push(@$pluginlist, $class);
163
164    push(@{$hashArgOptLists->{"ArgList"}},@{$arguments});
165    push(@{$hashArgOptLists->{"OptList"}},$options);
166
167    my $self = new GreenstoneXMLPlugin($pluginlist, $inputargs, $hashArgOptLists);
168
169   
170    #return bless $self, $class;
171    $self = bless $self, $class;
172    if ($self->{'info_only'}) {
173    # If running pluginfo, we don't need to go further.
174    return $self;
175    }
176
177    # do anything else that needs to be done here when not pluginfo
178   
179    return $self;
180}
181
182# This is called once if removeold is set with import.pl. Most plugins will do
183# nothing but if a plugin does any stuff outside of creating doc obj, then
184# it may need to clear something.
185# In the case of GreenstoneSQL plugs: this is the first time we have a chance
186# to purge the tables of the current collection from the current site's database
187sub remove_all {
188    my $self = shift (@_);
189    my ($pluginfo, $base_dir, $processor, $maxdocs) = @_;
190
191    print STDERR "   Building with removeold option set, so deleting current collection's tables if they exist\n" if($self->{'verbosity'});
192   
193    # if we're in here, we'd already have run 'use database <site_name>;' during sub init()
194    # so we can go ahead and delete the collection's tables
195    my $gs_sql = $self->{'gs_sql'};
196    $gs_sql->delete_collection_tables(); # will delete them if they exist
197
198    # and recreate tables? No. Tables' existence is ensured in GreenstoneSQLPlugout::begin()
199    my $proc_mode = $self->{'process_mode'};
200    if($proc_mode ne "text_only") {
201    $gs_sql->ensure_meta_table_exists();
202    }
203    if($proc_mode ne "meta_only") {
204    $gs_sql->ensure_fulltxt_table_exists();
205    }
206}
207
208# This is called per document for docs that have been deleted from the
209# collection. Most plugins will do nothing
210# but if a plugin does any stuff outside of creating doc obj, then it may need
211# to clear something.
212# remove the doc(s) denoted by oids from GS SQL db
213# This takes care of incremental deletes (docs marked D by ArchivesInfPlugin when building
214# incrementally) as well as cases of "Non-icremental Delete", see ArchivesInfPlugin.pm
215sub remove_one {
216    my $self = shift (@_);
217   
218    my ($file, $oids, $archivedir) = @_;
219
220    print STDERR "@@@ IN SQLPLUG::REMOVE_ONE: $file\n";
221   
222    #return undef unless $self->can_process_this_file($file); # NO, DON'T DO THIS:
223           # WE DON'T CARE IF IT'S AN IMAGE FILE THAT WAS DELETED.
224           # WE CARE ABOUT REMOVING THE DOCOID OF THAT IMAGE FILE FROM THE DB
225
226    print STDERR "*****************************\nAsked to remove_one oid\n***********************\n";
227   
228    my $gs_sql = $self->{'gs_sql'} || return 0; # couldn't make the connection or no db etc
229    if(scalar @$oids > 1) {
230    print STDERR "TODO: We now have reason to optimise GreenstoneSQLPlugin::remove_one() by using prepare and execute.\n";
231    }
232   
233    my $proc_mode = $self->{'process_mode'};
234    foreach my $oid (@$oids) { 
235    if($proc_mode eq "all" || $proc_mode eq "meta_only") {
236        print STDERR "@@@@@@@@ Deleting $oid from meta table\n" if $self->{'verbosity'} > 2;
237        $gs_sql->delete_recs_from_metatable_with_docid($oid);
238    }
239    if($proc_mode eq "all" || $proc_mode eq "text_only") {
240        print STDERR "@@@@@@@@ Deleting $oid from fulltxt table\n" if $self->{'verbosity'} > 2;
241        $gs_sql->delete_recs_from_texttable_with_docid($oid);
242    }
243    }
244    return 1;
245}
246
247
248sub xml_start_tag {
249    my $self = shift(@_);
250    my ($expat, $element) = @_;
251
252    my $outhandle = $self->{'outhandle'};
253   
254    $self->{'element'} = $element;
255    if ($element eq "Archive") { # docsql.xml files contain a OID attribute on Archive element
256    # the element's attributes are in %_ as per ReadXMLFile::xml_start_tag() (while $_
257    # contains the tag)
258
259    # Don't access %_{'docoid'} directly: keep getting a warning message to
260    # use $_{'docoid'} for scalar contexts, but %_ is the element's attr hashmap
261    # whereas $_ has the tag info. So we don't want to do $_{'docoid'}.
262    my %attr_hash = %_; # right way, see OAIPlugin.pm
263    $self->{'doc_oid'} = $attr_hash{'docoid'};
264    ##print STDERR "XXXXXXXXXXXXXX in SQLPlugin::xml_start_tag()\n";
265    print $outhandle "Extracted OID from docsql.xml: ".$self->{'doc_oid'}."\n"
266        if $self->{'verbosity'} > 2;
267
268    }
269    else { # let superclass GreenstoneXMLPlugin continue to process <Section> and <Metadata> elements
270    $self->SUPER::xml_start_tag(@_);
271    }
272}
273
274# TODO Q: Why are there 4 passes when we're only indexing at doc and section level (2 passes)? What's the dummy pass, why is there a pass for infodb?
275
276# We should only ever get here during the buildcol.pl phase
277# At the end of superclass GreenstoneXMLPlugin.pm's close_document() method,
278# the doc_obj in memory is processed (indexed) and then made undef.
279# So we have to work with doc_obj before superclass close_document() is finished.
280sub close_document {
281    my $self = shift(@_);
282
283    ##print STDERR "XXXXXXXXX in SQLPlugin::close_doc()\n";
284   
285    my $gs_sql = $self->{'gs_sql'};
286   
287    my $outhandle = $self->{'outhandle'};
288    my $doc_obj = $self->{'doc_obj'};
289
290    my $oid = $self->{'doc_oid'}; # we stored current doc's OID during sub xml_start_tag()
291    my $proc_mode = $self->{'process_mode'};
292   
293    # For now, we have access to doc_obj (until just before super::close_document() terminates)
294   
295    # no need to call $self->{'doc_obj'}->set_OID($oid);
296    # because either the OID is stored in the SQL db as meta 'Identifier' alongside other metadata
297    # or it's stored in the doc.xml as metadata 'Identifier' alongside other metadata
298    # Either way, Identifier meta will be read into the docobj automatically with other meta.
299
300    print STDERR "   GreenstoneSQLPlugin processing doc $oid (reading into docobj from SQL db)\n"
301    if $self->{'verbosity'} > 0;
302   
303    if($proc_mode eq "all" || $proc_mode eq "meta_only") {
304    # read in meta for the collection (i.e. select * from <col>_metadata table
305   
306    my $sth = $gs_sql->select_from_metatable_matching_docid($oid);
307    print $outhandle "### SQL select stmt: ".$sth->{'Statement'}."\n"
308        if $self->{'verbosity'} > 2;
309   
310    print $outhandle "----------SQL DB contains meta-----------\n" if $self->{'verbosity'} > 2;
311    # https://www.effectiveperlprogramming.com/2010/07/set-custom-dbi-error-handlers/
312    while( my @row = $sth->fetchrow_array() ) {     
313        #print $outhandle "row: @row\n";
314        my ($primary_key, $did, $sid, $metaname, $metaval) = @row;
315       
316        # get rid of the artificial "root" introduced in section id when saving to sql db
317        $sid =~ s@^root@@;
318        $sid = $doc_obj->get_top_section() unless $sid;
319        print $outhandle "### did: $did, sid: |$sid|, meta: $metaname, val: $metaval\n"
320        if $self->{'verbosity'} > 2;
321       
322        # TODO:  we accessed the db in utf8 mode, so, we can call doc_obj->add_utf8_meta directly:
323        $doc_obj->add_utf8_metadata($sid, $metaname, &docprint::unescape_text($metaval));
324    }
325    print $outhandle "----------FIN READING DOC's META FROM SQL DB------------\n"
326        if $self->{'verbosity'} > 2;
327    }
328   
329    if($proc_mode eq "all" || $proc_mode eq "text_only") {
330    # read in fulltxt for the collection (i.e. select * from <col>_fulltxt table
331   
332    my $fulltxt_table = $gs_sql->get_fulltext_table_name();
333   
334   
335    my $sth = $gs_sql->select_from_texttable_matching_docid($oid);
336    print $outhandle "### stmt: ".$sth->{'Statement'}."\n" if $self->{'verbosity'} > 2;
337   
338    print $outhandle "----------\nSQL DB contains txt entries for-----------\n"
339        if $self->{'verbosity'} > 2;
340    while( my ($primary_key, $did, $sid, $text) = $sth->fetchrow_array() ) {       
341       
342        # get rid of the artificial "root" introduced in section id when saving to sql db
343        #$sid =~ s@^root@@;
344        $sid = $doc_obj->get_top_section() if ($sid eq "root");
345        print $outhandle "### did: $did, sid: |$sid|, fulltext: <TXT>\n"
346        if $self->{'verbosity'} > 2;
347       
348        # TODO - pass by ref?
349        # TODO: we accessed the db in utf8 mode, so, we can call doc_obj->add_utf8_text directly:
350        $doc_obj->add_utf8_text($sid, &docprint::unescape_text($text));
351    }   
352    print $outhandle "----------FIN READING DOC's TXT FROM SQL DB------------\n"
353        if $self->{'verbosity'} > 2;
354    }
355   
356    # done reading into docobj from SQL db
357   
358    # don't forget to clean up on close() in superclass
359    # It will get the doc_obj indexed then make it undef
360    $self->SUPER::close_document(@_);
361}
362
363
364# TODO: only want to work with sql db if buildcol.pl. Unfortunately, also runs on import.pl.
365# During import, the GS SQL Plugin is called before the GS SQL Plugout with undesirable side
366# effect that if the db doesn't exist, gssql::use_db() fails, as it won't create db.
367
368# GS SQL Plugin::init() (and deinit()) is called by import.pl and also by buildcol.pl
369# This means it connects and deconnects during import.pl as well. This is okay
370# as removeold, which should drop the collection tables, happens during the import phase
371# and therefore also requires a db connection.
372# TODO: Eventually can try moving get_gssql_instance into gssql.pm? That way both GS SQL Plugin
373# and Plugout would be using one connection during import.pl phase when both plugs exist.
374
375# Call init() not begin() because there can be multiple plugin passes and begin() called for
376# each pass (one for doc level and another for section level indexing), whereas init() should
377# be called before any and all passes.
378# This way, we can connect to the SQL database once per buildcol run.
379sub init {
380    my ($self) = shift (@_);
381    ##print STDERR "@@@@@@@@@@ INIT CALLED\n";
382   
383    $self->SUPER::init(@_); # super (GreenstoneXMLPlugin) will not yet be trying to read from doc.xml (docsql .xml) files in init().
384
385    ####################
386#    print "@@@ SITE NAME: ". $self->{'site_name'} . "\n" if defined $self->{'site_name'};
387#    print "@@@ COLL NAME: ". $ENV{'GSDLCOLLECTION'} . "\n";
388
389#    print STDERR "@@@@ db_pwd: " . $self->{'db_client_pwd'} . "\n";
390#    print STDERR "@@@@ user: " . $self->{'db_client_user'} . "\n";
391#    print STDERR "@@@@ db_host: " . $self->{'db_host'} . "\n";
392#    print STDERR "@@@@ db_driver: " . $self->{'db_driver'} . "\n";
393    ####################
394
395    # create gssql object.
396    # collection name will be used for naming tables (site name will be used for naming database)
397    my $gs_sql = new gssql({
398    'collection_name' => $ENV{'GSDLCOLLECTION'},
399    'verbosity' => $self->{'verbosity'}
400               });
401
402    # try connecting to the mysql db, if that fails it will die
403    if(!$gs_sql->connect_to_db({
404    'db_driver' => $self->{'db_driver'},
405    'db_client_user' => $self->{'db_client_user'},
406    'db_client_pwd' => $self->{'db_client_pwd'},
407    'db_host' => $self->{'db_host'}
408                   })
409    )
410    {
411    # This is fatal for the plugout, let's terminate here
412    # PrintError would already have displayed the warning message on connection fail   
413    die("Could not connect to db. Can't proceed.\n");
414    }
415   
416    my $db_name = $self->{'site_name'} || "greenstone2"; # one database per GS3 site, for GS2 the db is called greenstone2
417
418    # Attempt to use the db, create it if it doesn't exist (but don't create the tables yet)
419    # Bail if we can't use the database
420    if(!$gs_sql->use_db($db_name)) {
421   
422    # This is fatal for the plugout, let's terminate here after disconnecting again
423    # PrintError would already have displayed the warning message on load fail
424    $gs_sql->disconnect_from_db()
425        || warn("Unable to disconnect from database.\n");
426    die("Could not use db $db_name. Can't proceed.\n");
427    }
428   
429   
430    # store db handle now that we're connected
431    $self->{'gs_sql'} = $gs_sql;   
432}
433
434
435# This method also runs on import.pl if gs_sql has a value. But we just want to run it on buildcol
436# Call deinit() not end() because there can be multiple plugin passes:
437# one for doc level and another for section level indexing
438# and deinit() should be called before all passes
439# This way, we can close the SQL database once per buildcol run.
440sub deinit {
441    my ($self) = shift (@_);
442   
443    ##print STDERR "@@@@@@@@@@ GreenstoneSQLPlugin::DEINIT CALLED\n";
444   
445    if($self->{'gs_sql'}) { # only want to work with sql db if buildcol.pl, gs_sql won't have
446    # a value except during buildcol, so when processor =~ m/buildproc$/.
447    $self->{'gs_sql'}->disconnect_from_db()
448        || warn("Unable to disconnect from database " . $self->{'site_name'} . "\n");
449
450    # explicitly delete gs_sql key (setting key to undef has a different meaning from deleting)
451    # so all future use has to make the connection again
452    delete $self->{'gs_sql'};
453    }
454
455    $self->SUPER::deinit(@_);
456}
457
458
459
460
Note: See TracBrowser for help on using the browser.