source: main/trunk/greenstone2/perllib/plugins/GreenstoneSQLPlugin.pm@ 32570

Last change on this file since 32570 was 32570, checked in by ak19, 5 years ago
  1. Bugfix for when renaming an imported doc and incrementally rebuilding the collection: need to do set_OID() during GreenstoneSQLPlugin::close_doc() so that the old doc OID is removed from the index; otherwise the old doc's filenames/titles are still present when browsing the rebuilt collection. 2. To be on the safe side, the overridden remove_one() in GS SQL Plugin now remembers to call the super version, whose implementation is ultimately inherited from BaseImporter, just in case this (ever) does something relevant.
File size: 20.1 KB
###########################################################################
#
# GreenstoneSQLPlugin.pm -- reads into doc_obj from SQL db and docsql.xml
# Metadata and/or fulltext are stored in SQL db, the rest may be stored in
# the docsql .xml files.
# A component of the Greenstone digital library software
# from the New Zealand Digital Library Project at the
# University of Waikato, New Zealand.
#
# Copyright (C) 2001 New Zealand Digital Library Project
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
###########################################################################

package GreenstoneSQLPlugin;


use strict;
no strict 'refs'; # allow filehandles to be variables and vice versa

use DBI;
use docprint; # for new unescape_text() subroutine
use GreenstoneXMLPlugin;
use gssql;

# TODO:
# - Run the TODOs here, in the Plugout and in gssql.pm past Dr Bainbridge.
# - Have not yet tested writing out just meta or just fulltext to the SQL db and reading just that
#   back in from the SQL db while the remainder is read back in from the docsql .xml files.

# TODO: deal with incremental vs removeold. If docs are removed from the import folder, the import step
# won't delete them from archives but the buildcol step will. Need to implement this in this database
# plugin or wherever the actual flow is.

# TODO Q: is "reindex" = delete from db + add to db?
# - is this okay for reindexing, or will it need to modify existing values (update table)?
# - if it's okay, what does reindex need to accomplish (and how) if the OID changes because the hash id produced is different?
# - delete is accomplished in GS SQL Plugin, during buildcol.pl. When should reindexing take place?
#   During SQL plugout/import.pl or during the plugin? If adding is done by GSSQLPlugout, does it need to
#   be reimplemented in GSSQLPlugin to support the adding portion of reindexing?

# TODO: Add public instructions on using this plugin and its plugout: start with installing the MySQL
# binary, changing the password, running the server (and the client against it for checking: basic
# commands like create and drop). Then discuss db name, table names (per collection), db columns and
# column types, and how the plugout and plugin work.
# Discuss the plugin/plugout parameters. An illustrative configuration line is sketched below.
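# For illustration only (unverified; exact collect.cfg syntax and defaults should be checked against
# the Greenstone documentation), a collection configuration line using the parameters declared in
# $arguments below might look roughly like:
#   plugin GreenstoneSQLPlugin -process_mode all -db_driver mysql -db_client_user root -db_client_pwd mypwd -db_host 127.0.0.1
# where "mypwd" is a placeholder password.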


# DONE:
# + TODO: Incremental delete can't work until GSSQLPlugout has implemented build_mode = incremental
#   (instead of tossing away the db on every build)
# + Ask about the docsql naming convention adopted to identify the OID. Better way?
#   collection names -> table names: it seems hyphens are not allowed. Changed to underscores.
# + Startup parameters (except removeold/build_mode)
# + how do we detect we're to do removeold during plugout in the import.pl phase
# + incremental building: where do we need to add code to delete rows from our sql table after
#   incrementally importing a coll with fewer docs (for instance)? What about deleted/modified meta?
# + Ask if I can assume that all SQL dbs (not just MySQL) will preserve the order of inserted nodes
#   (sections), which in this case has made it easy to reconstruct the doc_obj in memory in the correct order.
#   YES: Otherwise for later db types (drivers), can set order by the primary key column and then order
#   by the did column (see the rough SQL sketch after this DONE list).
# + NOTTODO: when the db is not running GLI is paralyzed -> can we set a timeout on the DBI connection attempt?
#   NOT A PROBLEM: Tested to find the DBI connection attempt fails immediately when the MySQL server is not
#   running. The GLI "paralyzing" incident last time was not because of the gs sql connection code,
#   but because my computer was freezing on-and-off.
# + "Courier" demo documents in lucene-sql collection: character (degree symbol) not preserved in title.
#   Is this because we encode in utf8 when putting into the db and reading back in?
#   Test doc with meta and text like macron in Maori text.
# + TODO Q: During import, the GS SQL Plugin is called before the GS SQL Plugout with the undesirable side
#   effect that if the db doesn't exist, gssql::use_db() fails, as it won't create the db.
#   This got fixed when GSSQLPlugin stopped connecting on init().
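# A rough, hypothetical sketch of the kind of ordered query meant in the DONE item above; the did, sid,
# metaname and metaval columns are the ones unpacked in close_document() below, but the "id" primary key
# column name and exact table name are assumptions, not the actual gssql schema:
#   SELECT id, did, sid, metaname, metaval
#   FROM <collection>_metadata
#   WHERE did = ?
#   ORDER BY id, did;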


########################################################################################

# GreenstoneSQLPlugin inherits from GreenstoneXMLPlugin so that if meta or fulltext
# is still written out to doc.xml (docsql .xml), that will be processed as usual,
# whereas GreenstoneSQLPlugin will process all the rest (full text and/or meta, whichever
# is written out by GreenstoneSQLPlugout into the SQL db).


sub BEGIN {
    @GreenstoneSQLPlugin::ISA = ('GreenstoneXMLPlugin');
}

# This plugin must be in the document plugins pipeline IN PLACE OF GreenstoneXMLPlugin,
# so we won't have a process_exp conflict here.
# The structure of docsql.xml files is identical to doc.xml and the contents are similar except:
# - since metadata and/or fulltext are stored in the MySQL db instead, just XML comments indicating
#   this are left inside docsql.xml within the <Description> (for meta) and/or <Content> (for txt)
# - the root element Archive now has a docoid attribute: <Archive docoid="OID">
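# An illustrative (hypothetical, not copied from a real archives folder) docsql.xml with both meta and
# fulltext stored in the SQL db might therefore look roughly like:
#   <?xml version="1.0" encoding="utf-8" standalone="no"?>
#   <Archive docoid="HASH0123456789abcdef">
#     <Section>
#       <Description>
#         <!-- metadata stored in SQL database -->
#       </Description>
#       <Content>
#         <!-- full text stored in SQL database -->
#       </Content>
#     </Section>
#   </Archive>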
sub get_default_process_exp {
    my $self = shift (@_);

    return q^(?i)docsql(-\d+)?\.xml$^; # regex based on this method in GreenstoneXMLPlugin
    #return q^(?i)docsql(-.+)?\.xml$^; # no longer storing the OID embedded in docsql .xml filename
}

my $process_mode_list =
    [ { 'name' => "meta_only",
        'desc' => "{GreenstoneSQLPlug.process_mode.meta_only}" },
      { 'name' => "text_only",
        'desc' => "{GreenstoneSQLPlug.process_mode.text_only}" },
      { 'name' => "all",
        'desc' => "{GreenstoneSQLPlug.process_mode.all}" } ];

my $arguments =
    [ { 'name' => "process_exp",
        'desc' => "{BaseImporter.process_exp}",
        'type' => "regexp",
        'deft' => &get_default_process_exp(),
        'reqd' => "no" },
      { 'name' => "process_mode",
        'desc' => "{GreenstoneSQLPlug.process_mode}",
        'type' => "enum",
        'list' => $process_mode_list,
        'deft' => "all",
        'reqd' => "no"},
      { 'name' => "db_driver",
        'desc' => "{GreenstoneSQLPlug.db_driver}",
        'type' => "string",
        'deft' => "mysql",
        'reqd' => "yes"},
      { 'name' => "db_client_user",
        'desc' => "{GreenstoneSQLPlug.db_client_user}",
        'type' => "string",
        'deft' => "root",
        'reqd' => "yes"},
      { 'name' => "db_client_pwd",
        'desc' => "{GreenstoneSQLPlug.db_client_pwd}",
        'type' => "string",
        'deft' => "",
        'reqd' => "yes"}, # pwd required?
      { 'name' => "db_host",
        'desc' => "{GreenstoneSQLPlug.db_host}",
        'type' => "string",
        'deft' => "127.0.0.1",
        'reqd' => "yes"},
    ];

my $options = { 'name' => "GreenstoneSQLPlugin",
                'desc' => "{GreenstoneSQLPlugin.desc}",
                'abstract' => "no",
                'inherits' => "yes",
                'args' => $arguments };


# TODO: On cancel, add a SIGTERM handler or similar to call end(),
# or to explicitly call gs_sql->close_connection if $gs_sql is defined

sub new {
    my ($class) = shift (@_);
    my ($pluginlist,$inputargs,$hashArgOptLists) = @_;
    push(@$pluginlist, $class);

    push(@{$hashArgOptLists->{"ArgList"}},@{$arguments});
    push(@{$hashArgOptLists->{"OptList"}},$options);

    my $self = new GreenstoneXMLPlugin($pluginlist, $inputargs, $hashArgOptLists);


    #return bless $self, $class;
    $self = bless $self, $class;
    if ($self->{'info_only'}) {
        # If running pluginfo, we don't need to go further.
        return $self;
    }

    # do anything else that needs to be done here when not pluginfo

    return $self;
}

# This is called once if removeold is set with import.pl. Most plugins will do
# nothing, but if a plugin does any work outside of creating the doc_obj, then
# it may need to clear something.
# In the case of the GreenstoneSQL plugs: this is the first time we have a chance
# to purge the current collection's tables from the current site's database.
sub remove_all {
    my $self = shift (@_);
    my ($pluginfo, $base_dir, $processor, $maxdocs) = @_;

    print STDERR " Building with removeold option set, so deleting current collection's tables if they exist\n" if($self->{'verbosity'});

    # if we're in here, we'd already have run 'use database <site_name>;' during sub init()
    # so we can go ahead and delete the collection's tables
    my $gs_sql = $self->{'gs_sql'};
    $gs_sql->delete_collection_tables(); # will delete them if they exist

    # and recreate tables? No. Tables' existence is ensured in GreenstoneSQLPlugout::begin()
    my $proc_mode = $self->{'process_mode'};
    if($proc_mode ne "text_only") {
        $gs_sql->ensure_meta_table_exists();
    }
    if($proc_mode ne "meta_only") {
        $gs_sql->ensure_fulltxt_table_exists();
    }
}

# This is called per document for docs that have been deleted from the
# collection. Most plugins will do nothing,
# but if a plugin does any work outside of creating the doc_obj, then it may need
# to clear something.
# Remove the doc(s) denoted by oids from the GS SQL db.
# This takes care of incremental deletes (docs marked D by ArchivesInfPlugin when building
# incrementally) as well as cases of "Non-incremental Delete", see ArchivesInfPlugin.pm
sub remove_one {
    my $self = shift (@_);

    my ($file, $oids, $archivedir) = @_;

    my $rv = $self->SUPER::remove_one(@_);

    print STDERR "@@@ IN SQLPLUG::REMOVE_ONE: $file\n";

    #return undef unless $self->can_process_this_file($file); # NO, DON'T DO THIS (inherited remove_one behaviour) HERE:
    # WE DON'T CARE IF IT'S AN IMAGE FILE THAT WAS DELETED.
    # WE CARE ABOUT REMOVING THE DOCOID OF THAT IMAGE FILE FROM THE SQL DB
    # SO DON'T RETURN IF CAN'T_PROCESS_THIS_FILE

    print STDERR "*****************************\nAsked to remove_one oid\n***********************\n";

    my $gs_sql = $self->{'gs_sql'} || return 0; # couldn't make the connection or no db etc
    if(scalar (@$oids) > 1) {
        print STDERR "TODO: As there's more than 1 docid, we now have reason to optimise GreenstoneSQLPlugin::remove_one() by using dbi::prepare() + dbi::execute() instead of dbi::do().\n";
        # (a rough prepare/execute sketch is given after this sub)
    }

    my $proc_mode = $self->{'process_mode'};
    foreach my $oid (@$oids) {
        if($proc_mode eq "all" || $proc_mode eq "meta_only") {
            print STDERR "@@@@@@@@ Deleting $oid from meta table\n" if $self->{'verbosity'} > 2;
            $gs_sql->delete_recs_from_metatable_with_docid($oid);
        }
        if($proc_mode eq "all" || $proc_mode eq "text_only") {
            print STDERR "@@@@@@@@ Deleting $oid from fulltxt table\n" if $self->{'verbosity'} > 2;
            $gs_sql->delete_recs_from_texttable_with_docid($oid);
        }
    }
    return $rv;
}


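# A minimal sketch (assumptions only, not the current gssql implementation) of the prepare()/execute()
# optimisation mentioned in the TODO inside remove_one() above, for deleting many docids in one go;
# DBI's prepare() and execute() are real, but $dbh, the table name and the did column are placeholders:
#   my $sth = $dbh->prepare("DELETE FROM <collection>_metadata WHERE did = ?");
#   foreach my $oid (@$oids) {
#       $sth->execute($oid);
#   }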
sub xml_start_tag {
    my $self = shift(@_);
    my ($expat, $element) = @_;

    my $outhandle = $self->{'outhandle'};

    $self->{'element'} = $element;
    if ($element eq "Archive") { # docsql.xml files contain an OID attribute on the Archive element
        # the element's attributes are in %_ as per ReadXMLFile::xml_start_tag() (while $_
        # contains the tag)

        # Don't access %_{'docoid'} directly: we keep getting a warning message to
        # use $_{'docoid'} for scalar contexts, but %_ is the element's attr hashmap
        # whereas $_ has the tag info. So we don't want to do $_{'docoid'}.
        my %attr_hash = %_; # right way, see OAIPlugin.pm
        $self->{'doc_oid'} = $attr_hash{'docoid'};
        ##print STDERR "XXXXXXXXXXXXXX in SQLPlugin::xml_start_tag()\n";
        print $outhandle "Extracted OID from docsql.xml: ".$self->{'doc_oid'}."\n"
            if $self->{'verbosity'} > 2;

    }
    else { # let superclass GreenstoneXMLPlugin continue to process <Section> and <Metadata> elements
        $self->SUPER::xml_start_tag(@_);
    }
}

# TODO Q: Why are there 4 passes when we're only indexing at doc and section level (2 passes)? What's the dummy pass, and why is there a pass for the infodb?

# We should only ever get here during the buildcol.pl phase.
# At the end of superclass GreenstoneXMLPlugin.pm's close_document() method,
# the doc_obj in memory is processed (indexed) and then made undef.
# So we have to work with doc_obj before superclass close_document() is finished.
sub close_document {
    my $self = shift(@_);

    ##print STDERR "XXXXXXXXX in SQLPlugin::close_doc()\n";

    my $gs_sql = $self->{'gs_sql'};

    my $outhandle = $self->{'outhandle'};
    my $doc_obj = $self->{'doc_obj'};

    my $oid = $self->{'doc_oid'}; # we stored the current doc's OID during sub xml_start_tag()
    my $proc_mode = $self->{'process_mode'};

    # For now, we have access to doc_obj (until just before super::close_document() terminates)

    # The OID parsed from the docsql.xml file does need to be set on $doc_obj, as noticed in this case:
    # when a doc in import is renamed and you do an incremental import, it is marked for reindexing
    # (reindexing is implemented by this plugin as a delete followed by an add into the sql db).
    # In that case, UNLESS you set the OID at this stage, the old deleted doc id (for the old doc
    # name) continues to exist in the index at the end of incremental rebuilding if you were to
    # browse the rebuilt collection by files/titles. So unless you set the OID here, the deleted
    # doc oids will still be listed in the index.
    $self->{'doc_obj'}->set_OID($oid);

    print STDERR " GreenstoneSQLPlugin processing doc $oid (reading into docobj from SQL db)\n"
        if $self->{'verbosity'} > 0;

    if($proc_mode eq "all" || $proc_mode eq "meta_only") {
        # read in meta for the collection (i.e. select * from the <col>_metadata table)

        my $sth = $gs_sql->select_from_metatable_matching_docid($oid);
        print $outhandle "### SQL select stmt: ".$sth->{'Statement'}."\n"
            if $self->{'verbosity'} > 2;

        print $outhandle "----------SQL DB contains meta-----------\n" if $self->{'verbosity'} > 2;
        # https://www.effectiveperlprogramming.com/2010/07/set-custom-dbi-error-handlers/
        while( my @row = $sth->fetchrow_array() ) {
            #print $outhandle "row: @row\n";
            my ($primary_key, $did, $sid, $metaname, $metaval) = @row;

            # get rid of the artificial "root" introduced in the section id when saving to the sql db
            $sid =~ s@^root@@;
            $sid = $doc_obj->get_top_section() unless $sid;
            print $outhandle "### did: $did, sid: |$sid|, meta: $metaname, val: $metaval\n"
                if $self->{'verbosity'} > 2;

            # TODO: we accessed the db in utf8 mode, so we can call doc_obj->add_utf8_meta directly:
            $doc_obj->add_utf8_metadata($sid, $metaname, &docprint::unescape_text($metaval));
        }
        print $outhandle "----------FIN READING DOC's META FROM SQL DB------------\n"
            if $self->{'verbosity'} > 2;
    }

    if($proc_mode eq "all" || $proc_mode eq "text_only") {
        # read in fulltxt for the collection (i.e. select * from the <col>_fulltxt table)

        my $fulltxt_table = $gs_sql->get_fulltext_table_name();


        my $sth = $gs_sql->select_from_texttable_matching_docid($oid);
        print $outhandle "### stmt: ".$sth->{'Statement'}."\n" if $self->{'verbosity'} > 2;

        print $outhandle "----------\nSQL DB contains txt entries for-----------\n"
            if $self->{'verbosity'} > 2;
        while( my ($primary_key, $did, $sid, $text) = $sth->fetchrow_array() ) {

            # get rid of the artificial "root" introduced in the section id when saving to the sql db
            #$sid =~ s@^root@@;
            $sid = $doc_obj->get_top_section() if ($sid eq "root");
            print $outhandle "### did: $did, sid: |$sid|, fulltext: <TXT>\n"
                if $self->{'verbosity'} > 2;

            # TODO - pass by ref?
            # TODO: we accessed the db in utf8 mode, so we can call doc_obj->add_utf8_text directly:
            $doc_obj->add_utf8_text($sid, &docprint::unescape_text($text));
        }
        print $outhandle "----------FIN READING DOC's TXT FROM SQL DB------------\n"
            if $self->{'verbosity'} > 2;
    }

    # done reading into docobj from SQL db

    # don't forget to clean up on close() in superclass
    # It will get the doc_obj indexed then make it undef
    $self->SUPER::close_document(@_);
}


# TODO: only want to work with the sql db if buildcol.pl. Unfortunately, this also runs on import.pl.
# During import, the GS SQL Plugin is called before the GS SQL Plugout with the undesirable side
# effect that if the db doesn't exist, gssql::use_db() fails, as it won't create the db.

# GS SQL Plugin::init() (and deinit()) is called by import.pl and also by buildcol.pl.
# This means it connects and disconnects during import.pl as well. This is okay
# as removeold, which should drop the collection tables, happens during the import phase
# and therefore also requires a db connection.
# TODO: Eventually can try moving get_gssql_instance into gssql.pm? That way both GS SQL Plugin
# and Plugout would be using one connection during the import.pl phase when both plugs exist.

# Call init() not begin() because there can be multiple plugin passes and begin() is called for
# each pass (one for doc level and another for section level indexing), whereas init() should
# be called before any and all passes.
# This way, we can connect to the SQL database once per buildcol run.
sub init {
    my ($self) = shift (@_);
    ##print STDERR "@@@@@@@@@@ INIT CALLED\n";

    $self->SUPER::init(@_); # super (GreenstoneXMLPlugin) will not yet be trying to read from doc.xml (docsql .xml) files in init().

    ####################
# print "@@@ SITE NAME: ". $self->{'site_name'} . "\n" if defined $self->{'site_name'};
# print "@@@ COLL NAME: ". $ENV{'GSDLCOLLECTION'} . "\n";

# print STDERR "@@@@ db_pwd: " . $self->{'db_client_pwd'} . "\n";
# print STDERR "@@@@ user: " . $self->{'db_client_user'} . "\n";
# print STDERR "@@@@ db_host: " . $self->{'db_host'} . "\n";
# print STDERR "@@@@ db_driver: " . $self->{'db_driver'} . "\n";
    ####################

    # create gssql object.
    # collection name will be used for naming tables (site name will be used for naming the database)
    my $gs_sql = new gssql({
        'collection_name' => $ENV{'GSDLCOLLECTION'},
        'verbosity' => $self->{'verbosity'}
    });

    # try connecting to the mysql db; if that fails, die
    if(!$gs_sql->connect_to_db({
        'db_driver' => $self->{'db_driver'},
        'db_client_user' => $self->{'db_client_user'},
        'db_client_pwd' => $self->{'db_client_pwd'},
        'db_host' => $self->{'db_host'}
       })
      )
    {
        # This is fatal for the plugin, let's terminate here
        # PrintError would already have displayed the warning message on connection fail
        die("Could not connect to db. Can't proceed.\n");
    }

    my $db_name = $self->{'site_name'} || "greenstone2"; # one database per GS3 site, for GS2 the db is called greenstone2

    # Attempt to use the db, create it if it doesn't exist (but don't create the tables yet)
    # Bail if we can't use the database
    if(!$gs_sql->use_db($db_name)) {

        # This is fatal for the plugin, let's terminate here after disconnecting again
        # PrintError would already have displayed the warning message on load fail
        $gs_sql->disconnect_from_db()
            || warn("Unable to disconnect from database.\n");
        die("Could not use db $db_name. Can't proceed.\n");
    }


    # store the db handle now that we're connected
    $self->{'gs_sql'} = $gs_sql;
}


# This method also runs on import.pl if gs_sql has a value. But we just want to run it on buildcol.
# Call deinit() not end() because there can be multiple plugin passes:
# one for doc level and another for section level indexing,
# and deinit() should be called after all passes.
# This way, we can close the SQL database once per buildcol run.
sub deinit {
    my ($self) = shift (@_);

    ##print STDERR "@@@@@@@@@@ GreenstoneSQLPlugin::DEINIT CALLED\n";

    if($self->{'gs_sql'}) { # only want to work with the sql db if buildcol.pl; gs_sql won't have
                            # a value except during buildcol, so when processor =~ m/buildproc$/.
        $self->{'gs_sql'}->disconnect_from_db()
            || warn("Unable to disconnect from database " . $self->{'site_name'} . "\n");

        # explicitly delete the gs_sql key (setting the key to undef has a different meaning from deleting it)
        # so all future use has to make the connection again
        delete $self->{'gs_sql'};
    }

    $self->SUPER::deinit(@_);
}
