root/main/trunk/greenstone2/perllib/plugins/GreenstoneSQLPlugin.pm @ 32570

Revision 32570, 20.1 KB (checked in by ak19, 20 months ago)

1. Bugfix for when renaming an imported doc and incrementally-rebuilding the collection: need to do set_OID() during GreenstoneSQLPlugin::close_doc() in order for the old doc oid to be removed from the index, else the old doc filenames/titles are still present when browsing rebuilt collection. 2. To be on the safe side, overridden remove_one() in GS SQL Plugin now remembers to call super version whose implementation is ultimately inherited from BaseImporter?, just in case this (ever) does something relevant.

Line 
1###########################################################################
2#
3# GreenstoneSQLPlugin.pm -- reads into doc_obj from SQL db and docsql.xml
4# Metadata and/or fulltext are stored in SQL db, the rest may be stored in
5# the docsql .xml files.
6# A component of the Greenstone digital library software
7# from the New Zealand Digital Library Project at the
8# University of Waikato, New Zealand.
9#
10# Copyright (C) 2001 New Zealand Digital Library Project
11#
12# This program is free software; you can redistribute it and/or modify
13# it under the terms of the GNU General Public License as published by
14# the Free Software Foundation; either version 2 of the License, or
15# (at your option) any later version.
16#
17# This program is distributed in the hope that it will be useful,
18# but WITHOUT ANY WARRANTY; without even the implied warranty of
19# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
20# GNU General Public License for more details.
21#
22# You should have received a copy of the GNU General Public License
23# along with this program; if not, write to the Free Software
24# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
25#
26###########################################################################
27
28package GreenstoneSQLPlugin;
29
30
31use strict;
32no strict 'refs'; # allow filehandles to be variables and viceversa
33
34use DBI;
35use docprint; # for new unescape_text() subroutine
36use GreenstoneXMLPlugin;
37use gssql;
38
39
40# TODO:
41# - Run TODOs here, in Plugout and in gssql.pm by Dr Bainbridge.
42# - Have not yet tested writing out just meta or just fulltxt to sql db and reading just that
43# back in from the sql db while the remainder is to be read back in from the docsql .xml files.
44
45# TODO: deal with incremental vs removeold. If docs removed from import folder, then import step
46# won't delete it from archives but buildcol step will. Need to implement this with this database plugin or wherever the actual flow is
47
48# TODO Q: is "reindex" = del from db + add to db?
49# - is this okay for reindexing, or will it need to modify existing values (update table)
50# - if it's okay, what does reindex need to accomplish (and how) if the OID changes because hash id produced is different?
51# - delete is accomplished in GS SQL Plugin, during buildcol.pl. When should reindexing take place?
52# during SQL plugout/import.pl or during plugin? If adding is done by GSSQLPlugout, does it need to
53# be reimplemented in GSSQLPlugin to support the adding portion of reindexing.
54
55# TODO: Add public instructions on using this plugin and its plugout: start with installing mysql binary, changing pwd, running the server (and the client against it for checking: basic cmds like create and drop). Then discuss db name, table names (per coll), db cols and col types, and how the plugout and plugin work.
56# Discuss the plugin/plugout parameters.
57
58
59# DONE:
60# + TODO: Incremental delete can't work until GSSQLPlugout has implemented build_mode = incremental
61# (instead of tossing away db on every build)
62# + Ask about docsql naming convention adopted to identify OID. Better way?
63# collection names -> table names: it seems hyphens not allowed. Changed to underscores.
64# + Startup parameters (except removeold/build_mode)
65# + how do we detect we're to do removeold during plugout in import.pl phase
66# + incremental building: where do we need to add code to delete rows from our sql table after
67# incrementally importing a coll with fewer docs (for instance)? What about deleted/modified meta?
68# + Ask if I can assume that all SQL dbs (not just MySQL) will preserve the order of inserted nodes
69# (sections) which in this case had made it easy to reconstruct the doc_obj in memory in the correct order.
70# YES: Otherwise for later db types (drivers), can set order by primary key column and then order by did column
71# + NOTTODO: when db is not running GLI is paralyzed -> can we set timeout on DBI connection attempt?
72#   NOT A PROBLEM: Tested to find DBI connection attempt fails immediately when MySQL server not
73# running. The GLI "paralyzing" incident last time was not because of the gs sql connection code,
74# but because my computer was freezing on-and-off.
75# + "Courier" demo documents in lucene-sql collection: character (degree symbol) not preserved in title. Is this because we encode in utf8 when putting into db and reading back in?
76# Test doc with meta and text like macron in Maori text.
77# + TODO Q: During import, the GS SQL Plugin is called before the GS SQL Plugout with undesirable side
78# effect that if the db doesn't exist, gssql::use_db() fails, as it won't create db.
79#   This got fixed when GSSQLPlugin stopped connecting on init().
80
81
82########################################################################################
83
84# GreenstoneSQLPlugin inherits from GreenstoneXMLPlugin so that it if meta or fulltext
85# is still written out to doc.xml (docsql .xml), that will be processed as usual,
86# whereas GreenstoneSQLPlugin will process all the rest (full text and/or meta, whichever
87# is written out by GreenstoneSQLPlugout into the SQL db).
88
89
90sub BEGIN {
91    @GreenstoneSQLPlugin::ISA = ('GreenstoneXMLPlugin');
92}
93
94# This plugin must be in the document plugins pipeline IN PLACE OF GreenstoneXMLPlugin
95# So we won't have a process exp conflict here.
96# The structure of docsql.xml files is identical to doc.xml and the contents are similar except:
97#   - since metadata and/or fulltxt are stored in mysql db instead, just XML comments indicating
98#   this are left inside docsql.xml within the <Description> (for meta) and/or <Content> (for txt)
99#   - the root element Archive now has a docoid attribute: <Archive docoid="OID">
100sub get_default_process_exp {
101    my $self = shift (@_);
102
103    return q^(?i)docsql(-\d+)?\.xml$^; # regex based on this method in GreenstoneXMLPlugin
104    #return q^(?i)docsql(-.+)?\.xml$^; # no longer storing the OID embedded in docsql .xml filename
105}
106
107my $process_mode_list =
108    [ { 'name' => "meta_only",
109        'desc' => "{GreenstoneSQLPlug.process_mode.meta_only}" },     
110      { 'name' => "text_only",
111        'desc' => "{GreenstoneSQLPlug.process_mode.text_only}" },
112      { 'name' => "all",
113        'desc' => "{GreenstoneSQLPlug.process_mode.all}" } ];
114
115my $arguments =
116    [ { 'name' => "process_exp",
117    'desc' => "{BaseImporter.process_exp}",
118    'type' => "regexp",
119    'deft' => &get_default_process_exp(),
120    'reqd' => "no" },
121      { 'name' => "process_mode",
122    'desc' => "{GreenstoneSQLPlug.process_mode}",
123    'type' => "enum",
124    'list' => $process_mode_list,
125    'deft' => "all",
126    'reqd' => "no"},
127      { 'name' => "db_driver",
128    'desc' => "{GreenstoneSQLPlug.db_driver}",
129    'type' => "string",
130    'deft' => "mysql",
131    'reqd' => "yes"},
132      { 'name' => "db_client_user",
133    'desc' => "{GreenstoneSQLPlug.db_client_user}",
134    'type' => "string",
135    'deft' => "root",
136    'reqd' => "yes"},
137      { 'name' => "db_client_pwd",
138    'desc' => "{GreenstoneSQLPlug.db_client_pwd}",
139    'type' => "string",
140    'deft' => "",
141    'reqd' => "yes"}, # pwd required?
142      { 'name' => "db_host",
143    'desc' => "{GreenstoneSQLPlug.db_host}",
144    'type' => "string",
145    'deft' => "127.0.0.1",
146    'reqd' => "yes"},
147    ];
148
149my $options = { 'name'     => "GreenstoneSQLPlugin",
150        'desc'     => "{GreenstoneSQLPlugin.desc}",
151        'abstract' => "no",
152        'inherits' => "yes",
153            'args'     => $arguments };
154
155
156# TODO: For on cancel, add a SIGTERM handler or so to call end()
157# or to explicitly call gs_sql->close_connection if $gs_sql def
158
159sub new {
160    my ($class) = shift (@_);
161    my ($pluginlist,$inputargs,$hashArgOptLists) = @_;
162    push(@$pluginlist, $class);
163
164    push(@{$hashArgOptLists->{"ArgList"}},@{$arguments});
165    push(@{$hashArgOptLists->{"OptList"}},$options);
166
167    my $self = new GreenstoneXMLPlugin($pluginlist, $inputargs, $hashArgOptLists);
168
169   
170    #return bless $self, $class;
171    $self = bless $self, $class;
172    if ($self->{'info_only'}) {
173    # If running pluginfo, we don't need to go further.
174    return $self;
175    }
176
177    # do anything else that needs to be done here when not pluginfo
178   
179    return $self;
180}
181
182# This is called once if removeold is set with import.pl. Most plugins will do
183# nothing but if a plugin does any stuff outside of creating doc obj, then
184# it may need to clear something.
185# In the case of GreenstoneSQL plugs: this is the first time we have a chance
186# to purge the tables of the current collection from the current site's database
187sub remove_all {
188    my $self = shift (@_);
189    my ($pluginfo, $base_dir, $processor, $maxdocs) = @_;
190
191    print STDERR "   Building with removeold option set, so deleting current collection's tables if they exist\n" if($self->{'verbosity'});
192   
193    # if we're in here, we'd already have run 'use database <site_name>;' during sub init()
194    # so we can go ahead and delete the collection's tables
195    my $gs_sql = $self->{'gs_sql'};
196    $gs_sql->delete_collection_tables(); # will delete them if they exist
197
198    # and recreate tables? No. Tables' existence is ensured in GreenstoneSQLPlugout::begin()
199    my $proc_mode = $self->{'process_mode'};
200    if($proc_mode ne "text_only") {
201    $gs_sql->ensure_meta_table_exists();
202    }
203    if($proc_mode ne "meta_only") {
204    $gs_sql->ensure_fulltxt_table_exists();
205    }
206}
207
208# This is called per document for docs that have been deleted from the
209# collection. Most plugins will do nothing
210# but if a plugin does any stuff outside of creating doc obj, then it may need
211# to clear something.
212# remove the doc(s) denoted by oids from GS SQL db
213# This takes care of incremental deletes (docs marked D by ArchivesInfPlugin when building
214# incrementally) as well as cases of "Non-icremental Delete", see ArchivesInfPlugin.pm
215sub remove_one {
216    my $self = shift (@_);
217   
218    my ($file, $oids, $archivedir) = @_;
219
220    my $rv = $self->SUPER::remove_one(@_);
221   
222    print STDERR "@@@ IN SQLPLUG::REMOVE_ONE: $file\n";
223   
224    #return undef unless $self->can_process_this_file($file); # NO, DON'T DO THIS (inherited remove_one behaviour) HERE:
225           # WE DON'T CARE IF IT'S AN IMAGE FILE THAT WAS DELETED.
226           # WE CARE ABOUT REMOVING THE DOCOID OF THAT IMAGE FILE FROM THE SQL DB
227           # SO DON'T RETURN IF CAN'T_PROCESS_THIS_FILE
228   
229    print STDERR "*****************************\nAsked to remove_one oid\n***********************\n";
230   
231    my $gs_sql = $self->{'gs_sql'} || return 0; # couldn't make the connection or no db etc
232    if(scalar (@$oids) > 1) {
233    print STDERR "TODO: As there's more than 1 docid, we now have reason to optimise GreenstoneSQLPlugin::remove_one() by using dbi::prepare() + dbi::execute() instead of dbi::do().\n";
234    }
235   
236    my $proc_mode = $self->{'process_mode'};
237    foreach my $oid (@$oids) { 
238    if($proc_mode eq "all" || $proc_mode eq "meta_only") {
239        print STDERR "@@@@@@@@ Deleting $oid from meta table\n" if $self->{'verbosity'} > 2;
240        $gs_sql->delete_recs_from_metatable_with_docid($oid);
241    }
242    if($proc_mode eq "all" || $proc_mode eq "text_only") {
243        print STDERR "@@@@@@@@ Deleting $oid from fulltxt table\n" if $self->{'verbosity'} > 2;
244        $gs_sql->delete_recs_from_texttable_with_docid($oid);
245    }
246    }
247    return $rv;
248}
249
250
251sub xml_start_tag {
252    my $self = shift(@_);
253    my ($expat, $element) = @_;
254
255    my $outhandle = $self->{'outhandle'};
256   
257    $self->{'element'} = $element;
258    if ($element eq "Archive") { # docsql.xml files contain a OID attribute on Archive element
259    # the element's attributes are in %_ as per ReadXMLFile::xml_start_tag() (while $_
260    # contains the tag)
261
262    # Don't access %_{'docoid'} directly: keep getting a warning message to
263    # use $_{'docoid'} for scalar contexts, but %_ is the element's attr hashmap
264    # whereas $_ has the tag info. So we don't want to do $_{'docoid'}.
265    my %attr_hash = %_; # right way, see OAIPlugin.pm
266    $self->{'doc_oid'} = $attr_hash{'docoid'};
267    ##print STDERR "XXXXXXXXXXXXXX in SQLPlugin::xml_start_tag()\n";
268    print $outhandle "Extracted OID from docsql.xml: ".$self->{'doc_oid'}."\n"
269        if $self->{'verbosity'} > 2;
270
271    }
272    else { # let superclass GreenstoneXMLPlugin continue to process <Section> and <Metadata> elements
273    $self->SUPER::xml_start_tag(@_);
274    }
275}
276
277# TODO Q: Why are there 4 passes when we're only indexing at doc and section level (2 passes)? What's the dummy pass, why is there a pass for infodb?
278
279# We should only ever get here during the buildcol.pl phase
280# At the end of superclass GreenstoneXMLPlugin.pm's close_document() method,
281# the doc_obj in memory is processed (indexed) and then made undef.
282# So we have to work with doc_obj before superclass close_document() is finished.
283sub close_document {
284    my $self = shift(@_);
285
286    ##print STDERR "XXXXXXXXX in SQLPlugin::close_doc()\n";
287   
288    my $gs_sql = $self->{'gs_sql'};
289   
290    my $outhandle = $self->{'outhandle'};
291    my $doc_obj = $self->{'doc_obj'};
292
293    my $oid = $self->{'doc_oid'}; # we stored current doc's OID during sub xml_start_tag()
294    my $proc_mode = $self->{'process_mode'};
295   
296    # For now, we have access to doc_obj (until just before super::close_document() terminates)
297
298    # OID parsed of docsql.xml file does need to be set on $doc_obj, as noticed in this case:
299    # when a doc in import is renamed, and you do incremental import, it is marked for reindexing
300    # (reindexing is implemented by this plugin as a delete followed by add into the sql db).
301    # In that case, UNLESS you set the OID at this stage, the old deleted doc id (for the old doc
302    # name) continues to exist in the index at the end of incremental rebuilding if you were to
303    # browse the rebuilt collection by files/titles. So unless you set the OID here, the deleted
304    # doc oids will still be listed in the index.
305    $self->{'doc_obj'}->set_OID($oid);
306   
307    print STDERR "   GreenstoneSQLPlugin processing doc $oid (reading into docobj from SQL db)\n"
308    if $self->{'verbosity'} > 0;
309   
310    if($proc_mode eq "all" || $proc_mode eq "meta_only") {
311    # read in meta for the collection (i.e. select * from <col>_metadata table
312   
313    my $sth = $gs_sql->select_from_metatable_matching_docid($oid);
314    print $outhandle "### SQL select stmt: ".$sth->{'Statement'}."\n"
315        if $self->{'verbosity'} > 2;
316   
317    print $outhandle "----------SQL DB contains meta-----------\n" if $self->{'verbosity'} > 2;
318    # https://www.effectiveperlprogramming.com/2010/07/set-custom-dbi-error-handlers/
319    while( my @row = $sth->fetchrow_array() ) {     
320        #print $outhandle "row: @row\n";
321        my ($primary_key, $did, $sid, $metaname, $metaval) = @row;
322       
323        # get rid of the artificial "root" introduced in section id when saving to sql db
324        $sid =~ s@^root@@;
325        $sid = $doc_obj->get_top_section() unless $sid;
326        print $outhandle "### did: $did, sid: |$sid|, meta: $metaname, val: $metaval\n"
327        if $self->{'verbosity'} > 2;
328       
329        # TODO:  we accessed the db in utf8 mode, so, we can call doc_obj->add_utf8_meta directly:
330        $doc_obj->add_utf8_metadata($sid, $metaname, &docprint::unescape_text($metaval));
331    }
332    print $outhandle "----------FIN READING DOC's META FROM SQL DB------------\n"
333        if $self->{'verbosity'} > 2;
334    }
335   
336    if($proc_mode eq "all" || $proc_mode eq "text_only") {
337    # read in fulltxt for the collection (i.e. select * from <col>_fulltxt table
338   
339    my $fulltxt_table = $gs_sql->get_fulltext_table_name();
340   
341   
342    my $sth = $gs_sql->select_from_texttable_matching_docid($oid);
343    print $outhandle "### stmt: ".$sth->{'Statement'}."\n" if $self->{'verbosity'} > 2;
344   
345    print $outhandle "----------\nSQL DB contains txt entries for-----------\n"
346        if $self->{'verbosity'} > 2;
347    while( my ($primary_key, $did, $sid, $text) = $sth->fetchrow_array() ) {       
348       
349        # get rid of the artificial "root" introduced in section id when saving to sql db
350        #$sid =~ s@^root@@;
351        $sid = $doc_obj->get_top_section() if ($sid eq "root");
352        print $outhandle "### did: $did, sid: |$sid|, fulltext: <TXT>\n"
353        if $self->{'verbosity'} > 2;
354       
355        # TODO - pass by ref?
356        # TODO: we accessed the db in utf8 mode, so, we can call doc_obj->add_utf8_text directly:
357        $doc_obj->add_utf8_text($sid, &docprint::unescape_text($text));
358    }   
359    print $outhandle "----------FIN READING DOC's TXT FROM SQL DB------------\n"
360        if $self->{'verbosity'} > 2;
361    }
362   
363    # done reading into docobj from SQL db
364   
365    # don't forget to clean up on close() in superclass
366    # It will get the doc_obj indexed then make it undef
367    $self->SUPER::close_document(@_);
368}
369
370
371# TODO: only want to work with sql db if buildcol.pl. Unfortunately, also runs on import.pl.
372# During import, the GS SQL Plugin is called before the GS SQL Plugout with undesirable side
373# effect that if the db doesn't exist, gssql::use_db() fails, as it won't create db.
374
375# GS SQL Plugin::init() (and deinit()) is called by import.pl and also by buildcol.pl
376# This means it connects and deconnects during import.pl as well. This is okay
377# as removeold, which should drop the collection tables, happens during the import phase
378# and therefore also requires a db connection.
379# TODO: Eventually can try moving get_gssql_instance into gssql.pm? That way both GS SQL Plugin
380# and Plugout would be using one connection during import.pl phase when both plugs exist.
381
382# Call init() not begin() because there can be multiple plugin passes and begin() called for
383# each pass (one for doc level and another for section level indexing), whereas init() should
384# be called before any and all passes.
385# This way, we can connect to the SQL database once per buildcol run.
386sub init {
387    my ($self) = shift (@_);
388    ##print STDERR "@@@@@@@@@@ INIT CALLED\n";
389   
390    $self->SUPER::init(@_); # super (GreenstoneXMLPlugin) will not yet be trying to read from doc.xml (docsql .xml) files in init().
391
392    ####################
393#    print "@@@ SITE NAME: ". $self->{'site_name'} . "\n" if defined $self->{'site_name'};
394#    print "@@@ COLL NAME: ". $ENV{'GSDLCOLLECTION'} . "\n";
395
396#    print STDERR "@@@@ db_pwd: " . $self->{'db_client_pwd'} . "\n";
397#    print STDERR "@@@@ user: " . $self->{'db_client_user'} . "\n";
398#    print STDERR "@@@@ db_host: " . $self->{'db_host'} . "\n";
399#    print STDERR "@@@@ db_driver: " . $self->{'db_driver'} . "\n";
400    ####################
401
402    # create gssql object.
403    # collection name will be used for naming tables (site name will be used for naming database)
404    my $gs_sql = new gssql({
405    'collection_name' => $ENV{'GSDLCOLLECTION'},
406    'verbosity' => $self->{'verbosity'}
407               });
408
409    # try connecting to the mysql db, if that fails it will die
410    if(!$gs_sql->connect_to_db({
411    'db_driver' => $self->{'db_driver'},
412    'db_client_user' => $self->{'db_client_user'},
413    'db_client_pwd' => $self->{'db_client_pwd'},
414    'db_host' => $self->{'db_host'}
415                   })
416    )
417    {
418    # This is fatal for the plugout, let's terminate here
419    # PrintError would already have displayed the warning message on connection fail   
420    die("Could not connect to db. Can't proceed.\n");
421    }
422   
423    my $db_name = $self->{'site_name'} || "greenstone2"; # one database per GS3 site, for GS2 the db is called greenstone2
424
425    # Attempt to use the db, create it if it doesn't exist (but don't create the tables yet)
426    # Bail if we can't use the database
427    if(!$gs_sql->use_db($db_name)) {
428   
429    # This is fatal for the plugout, let's terminate here after disconnecting again
430    # PrintError would already have displayed the warning message on load fail
431    $gs_sql->disconnect_from_db()
432        || warn("Unable to disconnect from database.\n");
433    die("Could not use db $db_name. Can't proceed.\n");
434    }
435   
436   
437    # store db handle now that we're connected
438    $self->{'gs_sql'} = $gs_sql;   
439}
440
441
442# This method also runs on import.pl if gs_sql has a value. But we just want to run it on buildcol
443# Call deinit() not end() because there can be multiple plugin passes:
444# one for doc level and another for section level indexing
445# and deinit() should be called before all passes
446# This way, we can close the SQL database once per buildcol run.
447sub deinit {
448    my ($self) = shift (@_);
449   
450    ##print STDERR "@@@@@@@@@@ GreenstoneSQLPlugin::DEINIT CALLED\n";
451   
452    if($self->{'gs_sql'}) { # only want to work with sql db if buildcol.pl, gs_sql won't have
453    # a value except during buildcol, so when processor =~ m/buildproc$/.
454    $self->{'gs_sql'}->disconnect_from_db()
455        || warn("Unable to disconnect from database " . $self->{'site_name'} . "\n");
456
457    # explicitly delete gs_sql key (setting key to undef has a different meaning from deleting)
458    # so all future use has to make the connection again
459    delete $self->{'gs_sql'};
460    }
461
462    $self->SUPER::deinit(@_);
463}
464
465
466
467
Note: See TracBrowser for help on using the browser.