source: main/trunk/greenstone2/perllib/plugins/GreenstoneSQLPlugin.pm@ 32565

Last change on this file since 32565 was 32565, checked in by ak19, 5 years ago

I think this is a bugfix to plugin.pm::remove_some(): when processing files to delete on deletion/reindexing, it would return after the first successful remove_one(). This wasn't a noticeable problem in the past since no plugin had a real need for implementing remove_one(). But now GS SQLPlugin does remove_one() in earnest, we need all the docs marked for deletion to be processed.

File size: 19.6 KB
Line 
1###########################################################################
2#
3# GreenstoneSQLPlugin.pm -- reads into doc_obj from SQL db and docsql.xml
4# Metadata and/or fulltext are stored in SQL db, the rest may be stored in
5# the docsql .xml files.
6# A component of the Greenstone digital library software
7# from the New Zealand Digital Library Project at the
8# University of Waikato, New Zealand.
9#
10# Copyright (C) 2001 New Zealand Digital Library Project
11#
12# This program is free software; you can redistribute it and/or modify
13# it under the terms of the GNU General Public License as published by
14# the Free Software Foundation; either version 2 of the License, or
15# (at your option) any later version.
16#
17# This program is distributed in the hope that it will be useful,
18# but WITHOUT ANY WARRANTY; without even the implied warranty of
19# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
20# GNU General Public License for more details.
21#
22# You should have received a copy of the GNU General Public License
23# along with this program; if not, write to the Free Software
24# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
25#
26###########################################################################
27
28package GreenstoneSQLPlugin;
29
30
31use strict;
32no strict 'refs'; # allow filehandles to be variables and viceversa
33
34use DBI;
35use docprint; # for new unescape_text() subroutine
36use GreenstoneXMLPlugin;
37use gssql;
38
39
40# TODO:
41# - Run TODOs here, in Plugout and in gssql.pm by Dr Bainbridge.
42# - Have not yet tested writing out just meta or just fulltxt to sql db and reading just that
43# back in from the sql db while the remainder is to be read back in from the docsql .xml files.
44
45# TODO: deal with incremental vs removeold. If docs removed from import folder, then import step
46# won't delete it from archives but buildcol step will. Need to implement this with this database plugin or wherever the actual flow is
47
48# TODO Q: is "reindex" = del from db + add to db?
49# - is this okay for reindexing, or will it need to modify existing values (update table)
50# - if it's okay, what does reindex need to accomplish (and how) if the OID changes because hash id produced is different?
51# - delete is accomplished in GS SQL Plugin, during buildcol.pl. When should reindexing take place?
52# during SQL plugout/import.pl or during plugin? If adding is done by GSSQLPlugout, does it need to
53# be reimplemented in GSSQLPlugin to support the adding portion of reindexing.
54
55# TODO: Add public instructions on using this plugin and its plugout: start with installing mysql binary, changing pwd, running the server (and the client against it for checking: basic cmds like create and drop). Then discuss db name, table names (per coll), db cols and col types, and how the plugout and plugin work.
56# Discuss the plugin/plugout parameters.
57
58
59# DONE:
60# + TODO: Incremental delete can't work until GSSQLPlugout has implemented build_mode = incremental
61# (instead of tossing away db on every build)
62# + Ask about docsql naming convention adopted to identify OID. Better way?
63# collection names -> table names: it seems hyphens not allowed. Changed to underscores.
64# + Startup parameters (except removeold/build_mode)
65# + how do we detect we're to do removeold during plugout in import.pl phase
66# + incremental building: where do we need to add code to delete rows from our sql table after
67# incrementally importing a coll with fewer docs (for instance)? What about deleted/modified meta?
68# + Ask if I can assume that all SQL dbs (not just MySQL) will preserve the order of inserted nodes
69# (sections) which in this case had made it easy to reconstruct the doc_obj in memory in the correct order.
70# YES: Otherwise for later db types (drivers), can set order by primary key column and then order by did column
71# + NOTTODO: when db is not running GLI is paralyzed -> can we set timeout on DBI connection attempt?
72# NOT A PROBLEM: Tested to find DBI connection attempt fails immediately when MySQL server not
73# running. The GLI "paralyzing" incident last time was not because of the gs sql connection code,
74# but because my computer was freezing on-and-off.
75# + "Courier" demo documents in lucene-sql collection: character (degree symbol) not preserved in title. Is this because we encode in utf8 when putting into db and reading back in?
76# Test doc with meta and text like macron in Maori text.
77# + TODO Q: During import, the GS SQL Plugin is called before the GS SQL Plugout with undesirable side
78# effect that if the db doesn't exist, gssql::use_db() fails, as it won't create db.
79# This got fixed when GSSQLPlugin stopped connecting on init().
80
81
82########################################################################################
83
84# GreenstoneSQLPlugin inherits from GreenstoneXMLPlugin so that it if meta or fulltext
85# is still written out to doc.xml (docsql .xml), that will be processed as usual,
86# whereas GreenstoneSQLPlugin will process all the rest (full text and/or meta, whichever
87# is written out by GreenstoneSQLPlugout into the SQL db).
88
89
90sub BEGIN {
91 @GreenstoneSQLPlugin::ISA = ('GreenstoneXMLPlugin');
92}
93
94# This plugin must be in the document plugins pipeline IN PLACE OF GreenstoneXMLPlugin
95# So we won't have a process exp conflict here.
96# The structure of docsql.xml files is identical to doc.xml and the contents are similar except:
97# - since metadata and/or fulltxt are stored in mysql db instead, just XML comments indicating
98# this are left inside docsql.xml within the <Description> (for meta) and/or <Content> (for txt)
99# - the root element Archive now has a docoid attribute: <Archive docoid="OID">
100sub get_default_process_exp {
101 my $self = shift (@_);
102
103 return q^(?i)docsql(-\d+)?\.xml$^; # regex based on this method in GreenstoneXMLPlugin
104 #return q^(?i)docsql(-.+)?\.xml$^; # no longer storing the OID embedded in docsql .xml filename
105}
106
107my $process_mode_list =
108 [ { 'name' => "meta_only",
109 'desc' => "{GreenstoneSQLPlug.process_mode.meta_only}" },
110 { 'name' => "text_only",
111 'desc' => "{GreenstoneSQLPlug.process_mode.text_only}" },
112 { 'name' => "all",
113 'desc' => "{GreenstoneSQLPlug.process_mode.all}" } ];
114
115my $arguments =
116 [ { 'name' => "process_exp",
117 'desc' => "{BaseImporter.process_exp}",
118 'type' => "regexp",
119 'deft' => &get_default_process_exp(),
120 'reqd' => "no" },
121 { 'name' => "process_mode",
122 'desc' => "{GreenstoneSQLPlug.process_mode}",
123 'type' => "enum",
124 'list' => $process_mode_list,
125 'deft' => "all",
126 'reqd' => "no"},
127 { 'name' => "db_driver",
128 'desc' => "{GreenstoneSQLPlug.db_driver}",
129 'type' => "string",
130 'deft' => "mysql",
131 'reqd' => "yes"},
132 { 'name' => "db_client_user",
133 'desc' => "{GreenstoneSQLPlug.db_client_user}",
134 'type' => "string",
135 'deft' => "root",
136 'reqd' => "yes"},
137 { 'name' => "db_client_pwd",
138 'desc' => "{GreenstoneSQLPlug.db_client_pwd}",
139 'type' => "string",
140 'deft' => "",
141 'reqd' => "yes"}, # pwd required?
142 { 'name' => "db_host",
143 'desc' => "{GreenstoneSQLPlug.db_host}",
144 'type' => "string",
145 'deft' => "127.0.0.1",
146 'reqd' => "yes"},
147 ];
148
149my $options = { 'name' => "GreenstoneSQLPlugin",
150 'desc' => "{GreenstoneSQLPlugin.desc}",
151 'abstract' => "no",
152 'inherits' => "yes",
153 'args' => $arguments };
154
155
156# TODO: For on cancel, add a SIGTERM handler or so to call end()
157# or to explicitly call gs_sql->close_connection if $gs_sql def
158
159sub new {
160 my ($class) = shift (@_);
161 my ($pluginlist,$inputargs,$hashArgOptLists) = @_;
162 push(@$pluginlist, $class);
163
164 push(@{$hashArgOptLists->{"ArgList"}},@{$arguments});
165 push(@{$hashArgOptLists->{"OptList"}},$options);
166
167 my $self = new GreenstoneXMLPlugin($pluginlist, $inputargs, $hashArgOptLists);
168
169
170 #return bless $self, $class;
171 $self = bless $self, $class;
172 if ($self->{'info_only'}) {
173 # If running pluginfo, we don't need to go further.
174 return $self;
175 }
176
177 # do anything else that needs to be done here when not pluginfo
178
179 return $self;
180}
181
182# This is called once if removeold is set with import.pl. Most plugins will do
183# nothing but if a plugin does any stuff outside of creating doc obj, then
184# it may need to clear something.
185# In the case of GreenstoneSQL plugs: this is the first time we have a chance
186# to purge the tables of the current collection from the current site's database
187sub remove_all {
188 my $self = shift (@_);
189 my ($pluginfo, $base_dir, $processor, $maxdocs) = @_;
190
191 print STDERR " Building with removeold option set, so deleting current collection's tables if they exist\n" if($self->{'verbosity'});
192
193 # if we're in here, we'd already have run 'use database <site_name>;' during sub init()
194 # so we can go ahead and delete the collection's tables
195 my $gs_sql = $self->{'gs_sql'};
196 $gs_sql->delete_collection_tables(); # will delete them if they exist
197
198 # and recreate tables? No. Tables' existence is ensured in GreenstoneSQLPlugout::begin()
199 my $proc_mode = $self->{'process_mode'};
200 if($proc_mode ne "text_only") {
201 $gs_sql->ensure_meta_table_exists();
202 }
203 if($proc_mode ne "meta_only") {
204 $gs_sql->ensure_fulltxt_table_exists();
205 }
206}
207
208# This is called per document for docs that have been deleted from the
209# collection. Most plugins will do nothing
210# but if a plugin does any stuff outside of creating doc obj, then it may need
211# to clear something.
212# remove the doc(s) denoted by oids from GS SQL db
213# This takes care of incremental deletes (docs marked D by ArchivesInfPlugin when building
214# incrementally) as well as cases of "Non-icremental Delete", see ArchivesInfPlugin.pm
215sub remove_one {
216 my $self = shift (@_);
217
218 my ($file, $oids, $archivedir) = @_;
219
220 print STDERR "@@@ IN SQLPLUG::REMOVE_ONE: $file\n";
221
222 #return undef unless $self->can_process_this_file($file); # NO, DON'T DO THIS:
223 # WE DON'T CARE IF IT'S AN IMAGE FILE THAT WAS DELETED.
224 # WE CARE ABOUT REMOVING THE DOCOID OF THAT IMAGE FILE FROM THE DB
225
226 print STDERR "*****************************\nAsked to remove_one oid\n***********************\n";
227
228 my $gs_sql = $self->{'gs_sql'} || return 0; # couldn't make the connection or no db etc
229 if(scalar @$oids > 1) {
230 print STDERR "TODO: We now have reason to optimise GreenstoneSQLPlugin::remove_one() by using dbi::prepare() + dbi::execute() instead of dbi::do().\n";
231 }
232
233 my $proc_mode = $self->{'process_mode'};
234 foreach my $oid (@$oids) {
235 if($proc_mode eq "all" || $proc_mode eq "meta_only") {
236 print STDERR "@@@@@@@@ Deleting $oid from meta table\n" if $self->{'verbosity'} > 2;
237 $gs_sql->delete_recs_from_metatable_with_docid($oid);
238 }
239 if($proc_mode eq "all" || $proc_mode eq "text_only") {
240 print STDERR "@@@@@@@@ Deleting $oid from fulltxt table\n" if $self->{'verbosity'} > 2;
241 $gs_sql->delete_recs_from_texttable_with_docid($oid);
242 }
243 }
244 return 1;
245}
246
247
248sub xml_start_tag {
249 my $self = shift(@_);
250 my ($expat, $element) = @_;
251
252 my $outhandle = $self->{'outhandle'};
253
254 $self->{'element'} = $element;
255 if ($element eq "Archive") { # docsql.xml files contain a OID attribute on Archive element
256 # the element's attributes are in %_ as per ReadXMLFile::xml_start_tag() (while $_
257 # contains the tag)
258
259 # Don't access %_{'docoid'} directly: keep getting a warning message to
260 # use $_{'docoid'} for scalar contexts, but %_ is the element's attr hashmap
261 # whereas $_ has the tag info. So we don't want to do $_{'docoid'}.
262 my %attr_hash = %_; # right way, see OAIPlugin.pm
263 $self->{'doc_oid'} = $attr_hash{'docoid'};
264 ##print STDERR "XXXXXXXXXXXXXX in SQLPlugin::xml_start_tag()\n";
265 print $outhandle "Extracted OID from docsql.xml: ".$self->{'doc_oid'}."\n"
266 if $self->{'verbosity'} > 2;
267
268 }
269 else { # let superclass GreenstoneXMLPlugin continue to process <Section> and <Metadata> elements
270 $self->SUPER::xml_start_tag(@_);
271 }
272}
273
274# TODO Q: Why are there 4 passes when we're only indexing at doc and section level (2 passes)? What's the dummy pass, why is there a pass for infodb?
275
276# We should only ever get here during the buildcol.pl phase
277# At the end of superclass GreenstoneXMLPlugin.pm's close_document() method,
278# the doc_obj in memory is processed (indexed) and then made undef.
279# So we have to work with doc_obj before superclass close_document() is finished.
280sub close_document {
281 my $self = shift(@_);
282
283 ##print STDERR "XXXXXXXXX in SQLPlugin::close_doc()\n";
284
285 my $gs_sql = $self->{'gs_sql'};
286
287 my $outhandle = $self->{'outhandle'};
288 my $doc_obj = $self->{'doc_obj'};
289
290 my $oid = $self->{'doc_oid'}; # we stored current doc's OID during sub xml_start_tag()
291 my $proc_mode = $self->{'process_mode'};
292
293 # For now, we have access to doc_obj (until just before super::close_document() terminates)
294
295 # no need to call $self->{'doc_obj'}->set_OID($oid);
296 # because either the OID is stored in the SQL db as meta 'Identifier' alongside other metadata
297 # or it's stored in the doc.xml as metadata 'Identifier' alongside other metadata
298 # Either way, Identifier meta will be read into the docobj automatically with other meta.
299
300 print STDERR " GreenstoneSQLPlugin processing doc $oid (reading into docobj from SQL db)\n"
301 if $self->{'verbosity'} > 0;
302
303 if($proc_mode eq "all" || $proc_mode eq "meta_only") {
304 # read in meta for the collection (i.e. select * from <col>_metadata table
305
306 my $sth = $gs_sql->select_from_metatable_matching_docid($oid);
307 print $outhandle "### SQL select stmt: ".$sth->{'Statement'}."\n"
308 if $self->{'verbosity'} > 2;
309
310 print $outhandle "----------SQL DB contains meta-----------\n" if $self->{'verbosity'} > 2;
311 # https://www.effectiveperlprogramming.com/2010/07/set-custom-dbi-error-handlers/
312 while( my @row = $sth->fetchrow_array() ) {
313 #print $outhandle "row: @row\n";
314 my ($primary_key, $did, $sid, $metaname, $metaval) = @row;
315
316 # get rid of the artificial "root" introduced in section id when saving to sql db
317 $sid =~ s@^root@@;
318 $sid = $doc_obj->get_top_section() unless $sid;
319 print $outhandle "### did: $did, sid: |$sid|, meta: $metaname, val: $metaval\n"
320 if $self->{'verbosity'} > 2;
321
322 # TODO: we accessed the db in utf8 mode, so, we can call doc_obj->add_utf8_meta directly:
323 $doc_obj->add_utf8_metadata($sid, $metaname, &docprint::unescape_text($metaval));
324 }
325 print $outhandle "----------FIN READING DOC's META FROM SQL DB------------\n"
326 if $self->{'verbosity'} > 2;
327 }
328
329 if($proc_mode eq "all" || $proc_mode eq "text_only") {
330 # read in fulltxt for the collection (i.e. select * from <col>_fulltxt table
331
332 my $fulltxt_table = $gs_sql->get_fulltext_table_name();
333
334
335 my $sth = $gs_sql->select_from_texttable_matching_docid($oid);
336 print $outhandle "### stmt: ".$sth->{'Statement'}."\n" if $self->{'verbosity'} > 2;
337
338 print $outhandle "----------\nSQL DB contains txt entries for-----------\n"
339 if $self->{'verbosity'} > 2;
340 while( my ($primary_key, $did, $sid, $text) = $sth->fetchrow_array() ) {
341
342 # get rid of the artificial "root" introduced in section id when saving to sql db
343 #$sid =~ s@^root@@;
344 $sid = $doc_obj->get_top_section() if ($sid eq "root");
345 print $outhandle "### did: $did, sid: |$sid|, fulltext: <TXT>\n"
346 if $self->{'verbosity'} > 2;
347
348 # TODO - pass by ref?
349 # TODO: we accessed the db in utf8 mode, so, we can call doc_obj->add_utf8_text directly:
350 $doc_obj->add_utf8_text($sid, &docprint::unescape_text($text));
351 }
352 print $outhandle "----------FIN READING DOC's TXT FROM SQL DB------------\n"
353 if $self->{'verbosity'} > 2;
354 }
355
356 # done reading into docobj from SQL db
357
358 # don't forget to clean up on close() in superclass
359 # It will get the doc_obj indexed then make it undef
360 $self->SUPER::close_document(@_);
361}
362
363
364# TODO: only want to work with sql db if buildcol.pl. Unfortunately, also runs on import.pl.
365# During import, the GS SQL Plugin is called before the GS SQL Plugout with undesirable side
366# effect that if the db doesn't exist, gssql::use_db() fails, as it won't create db.
367
368# GS SQL Plugin::init() (and deinit()) is called by import.pl and also by buildcol.pl
369# This means it connects and deconnects during import.pl as well. This is okay
370# as removeold, which should drop the collection tables, happens during the import phase
371# and therefore also requires a db connection.
372# TODO: Eventually can try moving get_gssql_instance into gssql.pm? That way both GS SQL Plugin
373# and Plugout would be using one connection during import.pl phase when both plugs exist.
374
375# Call init() not begin() because there can be multiple plugin passes and begin() called for
376# each pass (one for doc level and another for section level indexing), whereas init() should
377# be called before any and all passes.
378# This way, we can connect to the SQL database once per buildcol run.
379sub init {
380 my ($self) = shift (@_);
381 ##print STDERR "@@@@@@@@@@ INIT CALLED\n";
382
383 $self->SUPER::init(@_); # super (GreenstoneXMLPlugin) will not yet be trying to read from doc.xml (docsql .xml) files in init().
384
385 ####################
386# print "@@@ SITE NAME: ". $self->{'site_name'} . "\n" if defined $self->{'site_name'};
387# print "@@@ COLL NAME: ". $ENV{'GSDLCOLLECTION'} . "\n";
388
389# print STDERR "@@@@ db_pwd: " . $self->{'db_client_pwd'} . "\n";
390# print STDERR "@@@@ user: " . $self->{'db_client_user'} . "\n";
391# print STDERR "@@@@ db_host: " . $self->{'db_host'} . "\n";
392# print STDERR "@@@@ db_driver: " . $self->{'db_driver'} . "\n";
393 ####################
394
395 # create gssql object.
396 # collection name will be used for naming tables (site name will be used for naming database)
397 my $gs_sql = new gssql({
398 'collection_name' => $ENV{'GSDLCOLLECTION'},
399 'verbosity' => $self->{'verbosity'}
400 });
401
402 # try connecting to the mysql db, if that fails it will die
403 if(!$gs_sql->connect_to_db({
404 'db_driver' => $self->{'db_driver'},
405 'db_client_user' => $self->{'db_client_user'},
406 'db_client_pwd' => $self->{'db_client_pwd'},
407 'db_host' => $self->{'db_host'}
408 })
409 )
410 {
411 # This is fatal for the plugout, let's terminate here
412 # PrintError would already have displayed the warning message on connection fail
413 die("Could not connect to db. Can't proceed.\n");
414 }
415
416 my $db_name = $self->{'site_name'} || "greenstone2"; # one database per GS3 site, for GS2 the db is called greenstone2
417
418 # Attempt to use the db, create it if it doesn't exist (but don't create the tables yet)
419 # Bail if we can't use the database
420 if(!$gs_sql->use_db($db_name)) {
421
422 # This is fatal for the plugout, let's terminate here after disconnecting again
423 # PrintError would already have displayed the warning message on load fail
424 $gs_sql->disconnect_from_db()
425 || warn("Unable to disconnect from database.\n");
426 die("Could not use db $db_name. Can't proceed.\n");
427 }
428
429
430 # store db handle now that we're connected
431 $self->{'gs_sql'} = $gs_sql;
432}
433
434
435# This method also runs on import.pl if gs_sql has a value. But we just want to run it on buildcol
436# Call deinit() not end() because there can be multiple plugin passes:
437# one for doc level and another for section level indexing
438# and deinit() should be called before all passes
439# This way, we can close the SQL database once per buildcol run.
440sub deinit {
441 my ($self) = shift (@_);
442
443 ##print STDERR "@@@@@@@@@@ GreenstoneSQLPlugin::DEINIT CALLED\n";
444
445 if($self->{'gs_sql'}) { # only want to work with sql db if buildcol.pl, gs_sql won't have
446 # a value except during buildcol, so when processor =~ m/buildproc$/.
447 $self->{'gs_sql'}->disconnect_from_db()
448 || warn("Unable to disconnect from database " . $self->{'site_name'} . "\n");
449
450 # explicitly delete gs_sql key (setting key to undef has a different meaning from deleting)
451 # so all future use has to make the connection again
452 delete $self->{'gs_sql'};
453 }
454
455 $self->SUPER::deinit(@_);
456}
457
458
459
460
Note: See TracBrowser for help on using the repository browser.