root/main/trunk/greenstone2/perllib/plugins/GreenstoneSQLPlugin.pm @ 32555

Revision 32555, 18.5 KB (checked in by ak19, 20 months ago)

1. In GreenstoneSQLPlugout, removeold is now parameterised (as are keepold, incremental, incremental_mode). 2. Deletion on incremental_build works. But there are more questions. Why are there 4 passes? What to do on reindexing and when to do it (should it happen during GS SQL plugout or plugin)?

###########################################################################
#
# GreenstoneSQLPlugin.pm -- reads into doc_obj from SQL db and docsql.xml
# Metadata and/or fulltext are stored in SQL db, the rest may be stored in
# the docsql .xml files.
# A component of the Greenstone digital library software
# from the New Zealand Digital Library Project at the
# University of Waikato, New Zealand.
#
# Copyright (C) 2001 New Zealand Digital Library Project
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
###########################################################################

package GreenstoneSQLPlugin;


use strict;
no strict 'refs'; # allow filehandles to be variables and vice versa

use DBI;
use docprint; # for new unescape_text() subroutine
use GreenstoneXMLPlugin;
use gssql;


# TODO:
# - Run TODOs here, in Plugout and in gssql.pm by Dr Bainbridge.
# - "Courier" demo documents in lucene-sql collection: character (degree symbol) not preserved
#   in title. Is this because we encode in utf8 when putting into db and reading back in?
#   Test doc with meta and text like macron in Maori text.
# - Have not yet tested writing out just meta or just fulltxt to sql db and reading just that
#   back in from the sql db while the remainder is to be read back in from the docsql .xml files.

# TODO: deal with incremental vs removeold. If docs are removed from the import folder, then the
# import step won't delete them from archives but the buildcol step will. Need to implement this
# with this database plugin or wherever the actual flow is.

# TODO: Add public instructions on using this plugin and its plugout: start with installing the
# mysql binary, changing pwd, running the server (and the client against it for checking, basic
# cmds like create and drop). Then discuss db name, table names (per coll), db cols and col types,
# and how the plugout and plugin work.
# Discuss the plugin/plugout parameters.

# TODO: when db is not running GLI is paralyzed -> can we set timeout on DBI connection attempt?
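
# Sketch for the timeout TODO above (illustrative only, nothing in this plugin calls it).
# Two options, both assumptions rather than anything gssql.pm is known to do:
#  (a) DBD::mysql accepts a mysql_connect_timeout=N option in the DSN string;
#  (b) a generic alarm()-based wrapper around any blocking call. alarm() is not
#      reliable on every platform, and Perl's deferred signals may not interrupt a
#      driver that blocks inside C code, so (a) is probably the safer bet for MySQL.
# The helper names and the DSN layout below are hypothetical.
sub sketch_dsn_with_connect_timeout {
    my ($db_driver, $db_host, $timeout_secs) = @_;
    my $dsn = "dbi:${db_driver}:host=${db_host}";
    # mysql_connect_timeout is specific to DBD::mysql; other drivers have their own options
    $dsn .= ";mysql_connect_timeout=${timeout_secs}" if $db_driver eq "mysql";
    return $dsn;
}

sub sketch_run_with_timeout {
    my ($timeout_secs, $code) = @_;
    my $result = eval {
        local $SIG{ALRM} = sub { die "timed out\n" };
        alarm($timeout_secs);
        my $r = $code->();
        alarm(0);
        $r;
    };
    my $err = $@;
    alarm(0); # cancel any alarm still pending if $code died early
    if ($err) {
        warn "Call abandoned: $err";
        return undef;
    }
    return $result;
}
# Possible use: sketch_run_with_timeout(10, sub { DBI->connect($dsn, $user, $pwd) });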

# TODO Q: is "reindex" = del from db + add to db?
# - is this okay for reindexing, or will it need to modify existing values (update table)
# - if it's okay, what does reindex need to accomplish (and how) if the OID changes because hash id produced is different?
# - delete is accomplished in GS SQL Plugin, during buildcol.pl. When should reindexing take place?
# during SQL plugout/import.pl or during plugin? If adding is done by GSSQLPlugout, does it need to
# be reimplemented in GSSQLPlugin to support the adding portion of reindexing.


# TODO Q: During import, the GS SQL Plugin is called before the GS SQL Plugout with undesirable side
# effect that if the db doesn't exist, gssql::use_db() fails, as it won't create db.


# + TODO: Incremental delete can't work until GSSQLPlugout has implemented build_mode = incremental
# (instead of tossing away db on every build)
# + Ask about docsql naming convention adopted to identify OID. Better way?
# collection names -> table names: it seems hyphens not allowed. Changed to underscores.
# + Startup parameters (except removeold/build_mode)
# + how do we detect we're to do removeold during plugout in import.pl phase
# + incremental building: where do we need to add code to delete rows from our sql table after
# incrementally importing a coll with fewer docs (for instance)? What about deleted/modified meta?
# + Ask if I can assume that all SQL dbs (not just MySQL) will preserve the order of inserted nodes
# (sections) which in this case had made it easy to reconstruct the doc_obj in memory in the correct order.
# YES: Otherwise for later db types (drivers), can set order by primary key column and then order by did column


########################################################################################

# GreenstoneSQLPlugin inherits from GreenstoneXMLPlugin so that if meta or fulltext
# is still written out to doc.xml (docsql .xml), that will be processed as usual,
# whereas GreenstoneSQLPlugin will process all the rest (full text and/or meta, whichever
# is written out by GreenstoneSQLPlugout into the SQL db).


sub BEGIN {
    @GreenstoneSQLPlugin::ISA = ('GreenstoneXMLPlugin');
}

# This plugin must be in the document plugins pipeline IN PLACE OF GreenstoneXMLPlugin
# So we won't have a process exp conflict here.
# The structure of docsql.xml files is identical to doc.xml and the contents are similar except:
#   - since metadata and/or fulltxt are stored in mysql db instead, just XML comments indicating
#   this are left inside docsql.xml within the <Description> (for meta) and/or <Content> (for txt)
#   - the root element Archive now has a docoid attribute: <Archive docoid="OID">
sub get_default_process_exp {
    my $self = shift (@_);

    return q^(?i)docsql(-\d+)?\.xml$^; # regex based on this method in GreenstoneXMLPlugin
    #return q^(?i)docsql(-.+)?\.xml$^; # no longer storing the OID embedded in docsql .xml filename
}
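
# Sketch (illustrative only; the helper name is hypothetical and nothing calls it):
# which filenames the default process_exp above accepts. The pattern is case-insensitive,
# allows an optional numeric suffix such as docsql-2.xml, and is anchored only at the end.
sub sketch_matches_default_process_exp {
    my ($filename) = @_;
    my $process_exp = q^(?i)docsql(-\d+)?\.xml$^; # copy of the default regex above
    return ($filename =~ m/$process_exp/) ? 1 : 0;
}
# docsql.xml, docsql-12.xml and DOCSQL.XML are accepted;
# doc.xml and docsql-abc.xml are not.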

my $process_mode_list =
    [ { 'name' => "meta_only",
        'desc' => "{GreenstoneSQLPlug.process_mode.meta_only}" },
      { 'name' => "text_only",
        'desc' => "{GreenstoneSQLPlug.process_mode.text_only}" },
      { 'name' => "all",
        'desc' => "{GreenstoneSQLPlug.process_mode.all}" } ];

my $arguments =
    [ { 'name' => "process_exp",
        'desc' => "{BaseImporter.process_exp}",
        'type' => "regexp",
        'deft' => &get_default_process_exp(),
        'reqd' => "no" },
      { 'name' => "process_mode",
        'desc' => "{GreenstoneSQLPlug.process_mode}",
        'type' => "enum",
        'list' => $process_mode_list,
        'deft' => "all",
        'reqd' => "no"},
      { 'name' => "db_driver",
        'desc' => "{GreenstoneSQLPlug.db_driver}",
        'type' => "string",
        'deft' => "mysql",
        'reqd' => "yes"},
      { 'name' => "db_client_user",
        'desc' => "{GreenstoneSQLPlug.db_client_user}",
        'type' => "string",
        'deft' => "root",
        'reqd' => "yes"},
      { 'name' => "db_client_pwd",
        'desc' => "{GreenstoneSQLPlug.db_client_pwd}",
        'type' => "string",
        'deft' => "",
        'reqd' => "yes"}, # pwd required?
      { 'name' => "db_host",
        'desc' => "{GreenstoneSQLPlug.db_host}",
        'type' => "string",
        'deft' => "127.0.0.1",
        'reqd' => "yes"},
      { 'name' => "db_encoding",
        'desc' => "{GreenstoneSQLPlug.db_encoding}",
        'type' => "string",
        'deft' => "utf8",
        'reqd' => "yes"}
    ];

my $options = { 'name'     => "GreenstoneSQLPlugin",
                'desc'     => "{GreenstoneSQLPlugin.desc}",
                'abstract' => "no",
                'inherits' => "yes",
                'args'     => $arguments };


# TODO: For on cancel, add a SIGTERM handler or so to call end()
# or to explicitly call gs_sql->close_connection if $gs_sql def
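
# Sketch for the cancel TODO above (illustrative only; helper names are hypothetical
# and nothing in this plugin installs the handler). The idea: on SIGTERM, give the
# plugin a chance to run deinit(), which in this file is what closes the db connection,
# and then exit. Whether deinit() or end() is the right hook is still an open question.
use Scalar::Util qw(blessed);

sub sketch_cleanup_on_cancel {
    my ($plugin) = @_;
    return 0 unless blessed($plugin) && $plugin->can('deinit');
    # eval so that a failure during cleanup cannot prevent the process from exiting
    my $ok = eval { $plugin->deinit(); 1 };
    warn "Cleanup on cancel failed: $@" unless $ok;
    return $ok ? 1 : 0;
}

sub sketch_install_sigterm_handler {
    my ($plugin) = @_;
    $SIG{TERM} = sub {
        sketch_cleanup_on_cancel($plugin);
        exit(1);
    };
}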

sub new {
    my ($class) = shift (@_);
    my ($pluginlist,$inputargs,$hashArgOptLists) = @_;
    push(@$pluginlist, $class);

    push(@{$hashArgOptLists->{"ArgList"}},@{$arguments});
    push(@{$hashArgOptLists->{"OptList"}},$options);

    my $self = new GreenstoneXMLPlugin($pluginlist, $inputargs, $hashArgOptLists);


    #return bless $self, $class;
    $self = bless $self, $class;
    if ($self->{'info_only'}) {
        # If running pluginfo, we don't need to go further.
        return $self;
    }

    # do anything else that needs to be done here when not pluginfo
    #$self->{'delete_docids'} = (); # list of doc oids to delete during deinit()

    return $self;
}

sub xml_start_tag {
    my $self = shift(@_);
    my ($expat, $element) = @_;

    my $outhandle = $self->{'outhandle'};

    $self->{'element'} = $element;
    if ($element eq "Archive") { # docsql.xml files contain an OID attribute on Archive element
        # the element's attributes are in %_ as per ReadXMLFile::xml_start_tag() (while $_
        # contains the tag)

        # Don't access %_{'docoid'} directly: keep getting a warning message to
        # use $_{'docoid'} for scalar contexts, but %_ is the element's attr hashmap
        # whereas $_ has the tag info. So we don't want to do $_{'docoid'}.
        my %attr_hash = %_; # right way, see OAIPlugin.pm
        $self->{'doc_oid'} = $attr_hash{'docoid'};
        print STDERR "XXXXXXXXXXXXXX in SQLPlugin::xml_start_tag()\n";
        print $outhandle "Extracted OID from docsql.xml: ".$self->{'doc_oid'}."\n"
            if $self->{'verbosity'} > 2;

    }
    else { # let superclass GreenstoneXMLPlugin continue to process <Section> and <Metadata> elements
        $self->SUPER::xml_start_tag(@_);
    }
}

# TODO Q: Why are there 4 passes when we're only indexing at doc and section level (2 passes)?
# What's the dummy pass, why is there a pass for infodb?

# At the end of superclass GreenstoneXMLPlugin.pm's close_document() method,
# the doc_obj in memory is processed (indexed) and then made undef.
# So we have to work with doc_obj before superclass close_document() is finished.
sub close_document {
    my $self = shift(@_);

    print STDERR "XXXXXXXXX in SQLPlugin::close_doc()\n";

    my $gs_sql = $self->get_gssql_instance();

    my $outhandle = $self->{'outhandle'};
    my $doc_obj = $self->{'doc_obj'};

    my $build_proc_mode = $self->{'processor'}->get_mode(); # can be "text" as per basebuildproc or infodb
    my $oid = $self->{'doc_oid'}; # we stored current doc's OID during sub xml_start_tag()
    my $proc_mode = $self->{'process_mode'};

    print $outhandle "++++ OID of document (meta|text) to be del or read in from DB: ".$self->{'doc_oid'}."\n"
        if $self->{'verbosity'} > 2;

    # For now, we have access to doc_obj (until just before super::close_document() terminates)

    # no need to call $self->{'doc_obj'}->set_OID($oid);
    # because either the OID is stored in the SQL db as meta 'Identifier' alongside other metadata
    # or it's stored in the doc.xml as metadata 'Identifier' alongside other metadata
    # Either way, Identifier meta will be read into the docobj automatically with other meta.

    if ($self->{'verbosity'} > 2) {
        print STDERR "+++++++++++ buildproc_mode: $build_proc_mode\n";
        print STDERR "+++++++++++ SQLPlug proc_mode: $proc_mode\n";
    }

    # TODO: where does reindexing take place, GreenstoneSQL -Plugout or -Plugin?
    #if($build_proc_mode =~ m/(delete|reindex)$/) { # doc denoted by current OID has been marked for deletion or reindexing (=delete + add)
    if($build_proc_mode =~ m/(delete)$/) { # doc denoted by current OID has been marked for deletion

        # build_proc_mode could be "(infodb|text)(delete|reindex)"
        # "...delete" or "...reindex" as per ArchivesInfPlugin

        print STDERR "@@@@ DELETING DOC FROM SQL DB\n";

        if($proc_mode eq "all" || $proc_mode eq "meta_only") {
            print STDERR "@@@@@@@@ Deleting $oid from meta table\n" if $self->{'verbosity'} > 2;
            $gs_sql->delete_recs_from_metatable_with_docid($oid);
        }
        if($proc_mode eq "all" || $proc_mode eq "text_only") {
            print STDERR "@@@@@@@@ Deleting $oid from fulltxt table\n" if $self->{'verbosity'} > 2;
            $gs_sql->delete_recs_from_texttable_with_docid($oid);
        }

        # If we're reindexing the current doc, we will want to continue, which
        # will add this doc ID back into the db with the new meta/full txt values.
        # But if we're deleting, then we're done processing the document, so set doc_oid to undef
        # to prevent adding it back into db
        #undef $self->{'doc_oid'} if($build_proc_mode =~ m/delete$/);

    } # done deleting doc from SQL db

    else {#if($self->{'doc_oid'}) { # if loading doc from SQL db
        print STDERR "@@@@ LOADING DOC FROM SQL DB\n";

        if($proc_mode eq "all" || $proc_mode eq "meta_only") {
            # read in meta for the collection (i.e. select * from <col>_metadata table)

            my $sth = $gs_sql->select_from_metatable_matching_docid($oid);
            print $outhandle "### SQL select stmt: ".$sth->{'Statement'}."\n"
                if $self->{'verbosity'} > 2;

            print $outhandle "----------SQL DB contains meta-----------\n" if $self->{'verbosity'} > 2;
            # https://www.effectiveperlprogramming.com/2010/07/set-custom-dbi-error-handlers/
            while( my @row = $sth->fetchrow_array() ) {
                #print $outhandle "row: @row\n";
                my ($primary_key, $did, $sid, $metaname, $metaval) = @row;

                # get rid of the artificial "root" introduced in section id when saving to sql db
                $sid =~ s@^root@@;
                $sid = $doc_obj->get_top_section() unless $sid;
                print $outhandle "### did: $did, sid: |$sid|, meta: $metaname, val: $metaval\n"
                    if $self->{'verbosity'} > 2;

                # TODO: we accessed the db in utf8 mode, so we can call doc_obj->add_utf8_meta directly:
                $doc_obj->add_utf8_metadata($sid, $metaname, &docprint::unescape_text($metaval));
            }
            print $outhandle "----------FIN READING DOC's META FROM SQL DB------------\n"
                if $self->{'verbosity'} > 2;
        }

        if($proc_mode eq "all" || $proc_mode eq "text_only") {
            # read in fulltxt for the collection (i.e. select * from <col>_fulltxt table)

            my $fulltxt_table = $gs_sql->get_fulltext_table_name();


            my $sth = $gs_sql->select_from_texttable_matching_docid($oid);
            print $outhandle "### stmt: ".$sth->{'Statement'}."\n" if $self->{'verbosity'} > 2;

            print $outhandle "----------\nSQL DB contains txt entries for-----------\n"
                if $self->{'verbosity'} > 2;
            while( my ($primary_key, $did, $sid, $text) = $sth->fetchrow_array() ) {

                # get rid of the artificial "root" introduced in section id when saving to sql db
                #$sid =~ s@^root@@;
                $sid = $doc_obj->get_top_section() if ($sid eq "root");
                print $outhandle "### did: $did, sid: |$sid|, fulltext: <TXT>\n"
                    if $self->{'verbosity'} > 2;

                # TODO - pass by ref?
                # TODO: we accessed the db in utf8 mode, so we can call doc_obj->add_utf8_text directly:
                $doc_obj->add_utf8_text($sid, &docprint::unescape_text($text));
            }
            print $outhandle "----------FIN READING DOC's TXT FROM SQL DB------------\n"
                if $self->{'verbosity'} > 2;
        }

    } # done reading into docobj from SQL db

    # don't forget to clean up on close() in superclass
    # It will get the doc_obj indexed then make it undef
    $self->SUPER::close_document(@_);
}


# We want SQLPlugin to connect to db only during buildcol.pl phase, not during import.pl
# This works out okay, as close_document() (called by read()) is only invoked during buildcol.pl
#
# Further, we want a single db connection for the GS SQL Plugin to be used for
# the multiple plugin passes: for "dummy" pass, and for doc level and for section level indexing
# By calling the lazy loading get_gssql_instance() from close_document(),
# we connect to the SQL database once per GSSQLPlugin and only during the buildcol phase.
#
# get_gssql_instance() is a lazy loading method that returns the singleton db connection for a
# GreenstoneSQLPlugin object. ("Code pattern" get instance vs singleton.)
# One instance of db connection that can be used for all the many doc_objects processed by this plugin
#
# Except in methods get_gssql_instance() and deinit(), don't access self->{'_gs_sql'} directly.
# Instead, call method get_gssql_instance() and store return value in a local variable, my $gs_sql
#
sub get_gssql_instance
{
    my $self = shift(@_);

    # if we failed to successfully connect once before, don't bother attempting to connect again
    #return undef if(defined $self->{'failed'}); # plugin/process would have terminated with die()
                                  # if we couldn't succeed connecting on any connection attempt

    return $self->{'_gs_sql'} if($self->{'_gs_sql'});

    # assume we'll fail to connect
    $self->{'failed'} = 1;

    print STDERR "@@@@@@@@@@ LAZY CONNECT CALLED\n";

    ####################
#    print "@@@ SITE NAME: ". $self->{'site_name'} . "\n" if defined $self->{'site_name'};
#    print "@@@ COLL NAME: ". $ENV{'GSDLCOLLECTION'} . "\n";

#    print STDERR "@@@@ db_pwd: " . $self->{'db_client_pwd'} . "\n";
#    print STDERR "@@@@ user: " . $self->{'db_client_user'} . "\n";
#    print STDERR "@@@@ db_host: " . $self->{'db_host'} . "\n";
#    print STDERR "@@@@ db_enc: " . $self->{'db_encoding'} . "\n";
#    print STDERR "@@@@ db_driver: " . $self->{'db_driver'} . "\n";
    ####################

    my $gs_sql = new gssql({
        'collection_name' => $ENV{'GSDLCOLLECTION'},
        'db_encoding' => $self->{'db_encoding'}
    });

    # try connecting to the mysql db, if that fails it will die
    if(!$gs_sql->connect_to_db({
        'db_driver' => $self->{'db_driver'},
        'db_client_user' => $self->{'db_client_user'},
        'db_client_pwd' => $self->{'db_client_pwd'},
        'db_host' => $self->{'db_host'}
    }))
    {
        # This is fatal for the plugin, let's terminate here
        # PrintError would already have displayed the warning message on connection fail
        die("Could not connect to db. Can't proceed.\n");
    }

    my $db_name = $self->{'site_name'} || "greenstone2"; # one database per GS3 site, for GS2 the db is called greenstone2
    #my $build_mode = $self->{'build_mode'} || "removeold";

    # the db and its tables should exist. Attempt to use the db:
    if(!$gs_sql->use_db($db_name)) {

        # This is fatal for the plugin, let's terminate here after disconnecting again
        # PrintError would already have displayed the warning message on load fail
        $gs_sql->disconnect_from_db()
            || warn("Unable to disconnect from database.\n");
        die("Could not use db $db_name. Can't proceed.\n");
    }

    #undef $self->{'failed'};

    # store db handle now that we're connected
    $self->{'_gs_sql'} = $gs_sql;
    return $gs_sql;

}

# This method also runs on import.pl if gs_sql has a value. But we just want to run it on buildcol
# Call deinit() not end() because there can be multiple plugin passes:
# one for doc level and another for section level indexing
# and deinit() should be called after all passes
# This way, we can close the SQL database once per buildcol run.
sub deinit {
    my ($self) = shift (@_);

    print STDERR "@@@@@@@@@@ GreenstoneSQLPlugin::DEINIT CALLED\n";

    if($self->{'_gs_sql'}) { # only want to work with sql db if buildcol.pl, gs_sql won't have
        # a value except during buildcol, so when processor =~ m/buildproc$/.
        $self->{'_gs_sql'}->disconnect_from_db()
            || warn("Unable to disconnect from database " . $self->{'site_name'} . "\n");

        # explicitly set to undef so all future use has to make the connection again
        undef $self->{'_gs_sql'};
    }

    $self->SUPER::deinit(@_);
}
