Timestamp:
2018-11-09T22:33:51+13:00
Author:
ak19
Message:

Major tidying up: last remaining debug statements, lots of comments, removed TODO lists.

File:
1 edited

Legend:

  Unmodified lines are shown with no prefix
  + marks an added line
  - marks a removed line
  • main/trunk/greenstone2/perllib/plugins/GreenstoneSQLPlugin.pm

r32592 → r32595


-# TODO:
-# - Run TODOs here, in Plugout and in gsmysql.pm by Dr Bainbridge.
-# - Have not yet tested writing out just meta or just fulltxt to sql db and reading just that
-# back in from the sql db while the remainder is to be read back in from the docsql .xml files.
-
-# + TODO: Add public instructions on using this plugin and its plugout: start with installing mysql binary, changing pwd, running the server (and the client against it for checking: basic cmds like create and drop). Then discuss db name, table names (per coll), db cols and col types, and how the plugout and plugin work.
-# Discuss the plugin/plugout parameters.
-
-# TODO, test on windows and mac.
-# Note: if parsing fails (e.g. using wrong plugout like GS XML plugout, which chokes on args intended for SQL plugout) then SQL plugin init would have already been called and done connection, but disconnect would not have been done because SQL plugin disconnect would not have been called upon parse failure.
-
-# DONE:
-# + TODO: For on cancel, add a SIGTERM handler or so to call end()
-# or to explicitly call gs_sql->close_connection if $gs_sql def
-#
-# + TODO: Incremental delete can't work until GSSQLPlugout has implemented build_mode = incremental
-# (instead of tossing away db on every build)
-# + Ask about docsql naming convention adopted to identify OID. Better way?
-# collection names -> table names: it seems hyphens not allowed. Changed to underscores.
-# + Startup parameters (except removeold/build_mode)
-# + how do we detect we're to do removeold during plugout in import.pl phase
-# + incremental building: where do we need to add code to delete rows from our sql table after
-# incrementally importing a coll with fewer docs (for instance)? What about deleted/modified meta?
-# + Ask if I can assume that all SQL dbs (not just MySQL) will preserve the order of inserted nodes
-# (sections) which in this case had made it easy to reconstruct the doc_obj in memory in the correct order.
-# YES: Otherwise for later db types (drivers), can set order by primary key column and then order by did column
-# + NOTTODO: when db is not running GLI is paralyzed -> can we set timeout on DBI connection attempt?
-#   NOT A PROBLEM: Tested to find DBI connection attempt fails immediately when MySQL server not
-# running. The GLI "paralyzing" incident last time was not because of the gs sql connection code,
-# but because my computer was freezing on-and-off.
-# + "Courier" demo documents in lucene-sql collection: character (degree symbol) not preserved in title. Is this because we encode in utf8 when putting into db and reading back in?
-# Test doc with meta and text like macron in Maori text.
-# + TODO Q: During import, the GS SQL Plugin is called before the GS SQL Plugout with undesirable side
-# effect that if the db doesn't exist, gsmysql::use_db() fails, as it won't create db.
-#   This got fixed when GSSQLPlugin stopped connecting on init().
-#
-#
-#+ TODO: deal with incremental vs removeold. If docs removed from import folder, then import step
-# won't delete it from archives but buildcol step will. Need to implement this with this database plugin or wherever the actual flow is.
-#
-# + TODO Q: is "reindex" = del from db + add to db?
-# - is this okay for reindexing, or will it need to modify existing values (update table)
-# - if it's okay, what does reindex need to accomplish (and how) if the OID changes because hash id produced is different?
-# - delete is accomplished in GS SQL Plugin, during buildcol.pl. When should reindexing take place?
-# during SQL plugout/import.pl or during plugin? If adding is done by GSSQLPlugout, does it need to
-# be reimplemented in GSSQLPlugin to support the adding portion of reindexing.
-#
-# INCREMENTAL REBUILDING IMPLEMENTED CORRECTLY AND WORKS:
-# Overriding plugins' remove_all() method covered removeold.
-# Overriding plugins' remove_one() method is all I needed to do for reindex and deletion
-# (incremental and non-incremental) to work.
-# but doing all this needed an overhaul of gsmysql.pm and its use by the GS SQL plugin and plugout.
-# - needed to correct plugin.pm::remove_some() to process all files
-# - and needed to correct GreenstoneSQLPlugin::close_document() to setOID() after all
-# All incremental import and buildcol worked after that:
-# - deleting files and running incr-import and incr-buildcol (= "incr delete"),
-# - deleting files and running incr-import and buildcol (="non-incr delete")
-# - modifying meta and doing an incr rebuild
-# - modifying fulltext and doing an incr rebuild
-# - renaming a file forces a reindex: doc is removed from db and added back in, due to remove_one()
-# - tested CSV file: adding some records, changing some records
-#    + CSVPlugin test (collection csvsql)
-#    + MetadataCSVPlugin test (modified collection sqltest to have metadata.csv refer to the
-#      filenames of sqltest's documents)
-#    + shared image test (collection shareimg): if 2 html files reference the same image, the docs
-#      are indeed both reindexed if the image is modified (e.g. I replaced the image with another
-#      of the same name) which in the GS SQL plugin/plugout case is that the 2 docs are deleted
-#      and added in again.

########################################################################################

-# GreenstoneSQLPlugin inherits from GreenstoneXMLPlugin so that it if meta or fulltext
+# GreenstoneSQLPlugin inherits from GreenstoneXMLPlugin so that if meta or fulltext
# is still written out to doc.xml (docsql .xml), that will be processed as usual,
# whereas GreenstoneSQLPlugin will process all the rest (full text and/or meta, whichever
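Two items in the removed list above are worth a concrete illustration. The first is the note about handling cancel by installing a SIGTERM handler that closes the connection via gs_sql->close_connection(). A minimal sketch of that idea, not taken from the plugin itself (only the close_connection() call is named in the comments above):

    use strict;
    use warnings;

    my $gs_sql;    # would hold the gsmysql connection object once created

    # On cancel (e.g. a TERM signal sent when the build is stopped), close the
    # db connection cleanly before exiting.
    $SIG{TERM} = sub {
        $gs_sql->close_connection() if defined $gs_sql;
        exit(1);
    };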
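The second is the question of section order: the note says MySQL happened to return the inserted section nodes in insertion order, and that other drivers can instead order explicitly by the primary key and did columns. A hedged DBI sketch of that explicit ordering; the table name demo_metadata, the id primary key and the connection values are assumptions (only did, sid and metaname are implied by the code further down):

    use strict;
    use warnings;
    use DBI;

    # Connection values are placeholders; the plugin takes them from its
    # db_driver/db_host/db_client_user/db_client_pwd options.
    my $dbh = DBI->connect("DBI:mysql:database=greenstone2;host=localhost",
                           "gsuser", "gspwd", { RaiseError => 1 });

    my $oid = $ARGV[0];    # OID of the document whose sections are being read back

    # Do not rely on insertion order: sort explicitly by the primary key so the
    # doc_obj sections can be rebuilt in the order they were written out.
    my $sth = $dbh->prepare("SELECT did, sid, metaname, metavalue FROM demo_metadata"
                            . " WHERE did = ? ORDER BY id");
    $sth->execute($oid);
    while (my ($did, $sid, $metaname, $metaval) = $sth->fetchrow_array()) {
        print "$sid\t$metaname\t$metaval\n";
    }
    $dbh->disconnect();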
…
        'desc' => "{GreenstoneSQLPlug.rollbacl_on_cancel}" } ];

-# TODO: If subclassing gsmysql for other supporting databases and if they have different required
+# NOTE: If subclassing gsmysql for other supporting databases and if they have different required
# connection parameters, we can check how WordPlugin, upon detecting Word is installed,
# dynamically loads Word specific configuration options.
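On that NOTE about per-driver connection parameters: Greenstone plugins declare their configurable options as a list of hashrefs (the 'desc' entry visible above is one field of such an entry). A hedged sketch of adding driver-specific options only when a driver is present; the option names and resource keys below are invented for illustration, and this is not how WordPlugin actually does it:

    # Hypothetical: append extra options only when a given DBI driver is available.
    my $arguments = [ { 'name' => "db_driver",
                        'desc' => "{GreenstoneSQLPlug.db_driver}",
                        'type' => "string" } ];

    if (eval { require DBD::Pg; 1 }) {
        push(@$arguments,
             { 'name' => "db_port",
               'desc' => "{GreenstoneSQLPlug.db_port}",
               'type' => "string" });
    }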
…
}

-# GS SQL Plugin::init() (and deinit()) is called by import.pl and also by buildcol.pl
-# This means it connects and deconnects during import.pl as well. This is okay
-# as removeold, which should drop the collection tables, happens during the import phase,
-# calling GreenstoneSQLPlugin::and therefore also requires a db connection.
-# + TODO: Eventually can try moving get_gssql_instance into gsmysql.pm? That way both GS SQL Plugin
-# and Plugout would be using one connection during import.pl phase when both plugs exist.
-
# Call init() not begin() because there can be multiple plugin passes and begin() called for
# each pass (one for doc level and another for section level indexing), whereas init() should
# be called before any and all passes.
# This way, we can connect to the SQL database once per buildcol run.
+# Although now it doesn't matter, since gsmysql.pm uses the get_instance pattern to return a
+# singleton db connection, regardless of the number of gsmysql objects instantiated and
+# the number of connect() calls made on them.
sub init {
    my ($self) = shift (@_);
-    ##print STDERR "@@@@@@@@@@ INIT CALLED\n";

    $self->SUPER::init(@_); # super (GreenstoneXMLPlugin) will not yet be trying to read from doc.xml (docsql .xml) files in init().

-    ####################
-#    print "@@@ SITE NAME: ". $self->{'site'} . "\n" if defined $self->{'site'};
-#    print "@@@ COLL NAME: ". $ENV{'GSDLCOLLECTION'} . "\n";
-
-#    print STDERR "@@@@ db_pwd: " . $self->{'db_client_pwd'} . "\n";
-#    print STDERR "@@@@ user: " . $self->{'db_client_user'} . "\n";
-#    print STDERR "@@@@ db_host: " . $self->{'db_host'} . "\n";
-#    print STDERR "@@@@ db_driver: " . $self->{'db_driver'} . "\n";
-    ####################

    # create gsmysql object.
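The added comment above explains that gsmysql.pm now hands out a single shared connection through a get_instance pattern, so repeated connect() calls from the plugin and the plugout are harmless. gsmysql.pm itself is not part of this changeset, so the following is only a sketch of that pattern under assumed package, field and argument names:

    package gsmysql_sketch;    # illustrative stand-in, not the real gsmysql.pm

    use strict;
    use warnings;
    use DBI;

    my $_instance;    # one shared object (and db handle) per perl process

    sub get_instance {
        my ($class, %args) = @_;
        $_instance //= bless { args => \%args, dbh => undef }, $class;
        return $_instance;
    }

    sub connect {
        my ($self) = @_;
        # A second or third connect() call just returns the handle already made.
        return $self->{dbh} if defined $self->{dbh};
        $self->{dbh} = DBI->connect(
            "DBI:mysql:database=$self->{args}{db};host=$self->{args}{host}",
            $self->{args}{user}, $self->{args}{pwd}, { RaiseError => 1 });
        return $self->{dbh};
    }

    1;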
…
    # PrintError would already have displayed the warning message on load fail
    # And on die() perl will call gsmysql destroy which will ensure a disconnect() from db
-    #$gs_sql->force_disconnect_from_db();
    die("Could not use db $db_name. Can't proceed.\n");
    }
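The context above relies on the gsmysql object's destructor to disconnect when die() unwinds, which is why the explicit force_disconnect_from_db() call could be dropped. A sketch of what such a destructor looks like, continuing the assumed gsmysql_sketch package from earlier rather than quoting the real module:

    sub DESTROY {
        my ($self) = @_;
        # Perl calls this when the object is released, including after die(),
        # so the handle is returned to the server exactly once.
        if (defined $self->{dbh}) {
            $self->{dbh}->disconnect();
            $self->{dbh} = undef;
        }
    }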
…

-# This method also runs on import.pl if gs_sql has a value. But we just want to run it on buildcol
+# This method also runs on import.pl if gs_sql has a value.
# Call deinit() not end() because there can be multiple plugin passes:
# one for doc level and another for section level indexing
# and deinit() should be called before all passes
# This way, we can close the SQL database once per buildcol run.
+# Again, this doesn't matter because we gsmysql the ensures the connection
+# is a singleton connection instance, which connects once and disconnects once per perl process.
sub deinit {
    my ($self) = shift (@_);
-
-    ##print STDERR "@@@@@@@@@@ GreenstoneSQLPlugin::DEINIT CALLED\n";

    if($self->{'gs_sql'}) {
…
# but if a plugin does any stuff outside of creating doc obj, then it may need
# to clear something.
-# remove the doc(s) denoted by oids from GS SQL db
+# In the case of GreenstoneSQL plugs: Remove the doc(s) denoted by oids from GS SQL db.
# This takes care of incremental deletes (docs marked D by ArchivesInfPlugin when building
-# incrementally) as well as cases of "Non-icremental Delete", see ArchivesInfPlugin.pm
+# incrementally) as well as cases of "Non-icremental Delete", see ArchivesInfPlugin.pm.
+# As well as cases involving reindexing, which are implemented here as delete followed by add.
sub remove_one {
    my $self = shift (@_);
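remove_one() is what makes incremental delete and reindex work: the rows belonging to the given OIDs are dropped from the per-collection tables (and, for a reindex, written back afterwards). The deletion itself is delegated to gsmysql.pm, which this changeset does not show, so the following is only a sketch with assumed table names; the did column and the hyphen-to-underscore table naming come from the comments above:

    # $dbh is an open DBI handle, $coll the collection name, $oid the document's OID.
    sub delete_doc_rows {
        my ($dbh, $coll, $oid) = @_;
        (my $table_base = $coll) =~ tr/-/_/;     # hyphens not allowed in table names
        for my $suffix ("metadata", "fulltxt") {
            $dbh->do("DELETE FROM ${table_base}_$suffix WHERE did = ?", undef, $oid);
        }
    }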
…

    my $gs_sql = $self->{'gs_sql'} || return 0; # couldn't make the connection or no db etc
-
-    print STDERR "*****************************\nAsked to remove_one oid\n***********************\n";
-    print STDERR "Num oids: " . scalar (@$oids) . "\n";

    my $proc_mode = $self->{'process_mode'};
…
# Since all passes need the doc_obj, all are read in from docsql + SQL db into the docobj in memory

-# We should only ever get here during the buildcol.pl phase
+# We only ever get here or do any parsing of the docsql.xml file during the buildcol.pl phase.
# At the end of superclass GreenstoneXMLPlugin.pm's close_document() method,
# the doc_obj in memory is processed (indexed) and then made undef.
…
    my $self = shift(@_);

-    ##print STDERR "XXXXXXXXX in SQLPlugin::close_doc()\n";
-
    my $gs_sql = $self->{'gs_sql'};

…

    print $outhandle "----------SQL DB contains meta-----------\n" if $self->{'verbosity'} > 2;
-    # https://www.effectiveperlprogramming.com/2010/07/set-custom-dbi-error-handlers/

    foreach my $row (@$records) {
-        #print $outhandle "row: @$row\n";
        my ($primary_key, $did, $sid, $metaname, $metaval) = @$row;

…
        if $self->{'verbosity'} > 2;

-        # + TODO:  we accessed the db in utf8 mode, so, we can call doc_obj->add_utf8_meta directly:
-        #$doc_obj->add_utf8_metadata($sid, $metaname, &docprint::unescape_text($metaval));
-
-        # data stored unescaped in db: escaping only for html/xml files, not for txt files or db
+        # We're only dealing with utf8 data where docobj is concerned
+        # Data stored unescaped in db: escaping only for html/xml files, not for txt files or db
        $doc_obj->add_utf8_metadata($sid, $metaname, $metaval);
    }
…
        print $outhandle "### did: $did, sid: |$sid|, fulltext: <TXT>\n"
        if $self->{'verbosity'} > 2;
-
-        # TODO - pass by ref?
-        # + TODO: we accessed the db in utf8 mode, so, we can call doc_obj->add_utf8_text directly:
-        # data stored unescaped in db: escaping is only for html/xml files, not for txt files or db
-        #my $textref = &docprint::unescape_textref(\$text);
+
+        # We're only dealing with utf8 data where docobj is concerned
+        # Data stored unescaped in db: escaping is only for html/xml files, not for txt files or db
        $doc_obj->add_utf8_textref($sid, \$text);
    }
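Finally, the replacement comments in this last hunk note that metadata and text are stored unescaped and handled as utf8 end to end, so the values can go straight into add_utf8_metadata()/add_utf8_textref(). A sketch of reading the fulltext back over a utf8-enabled DBD::mysql connection; the table and column names and connection values are assumptions, only did, sid and the doc_obj calls come from the code above:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("DBI:mysql:database=greenstone2;host=localhost",
                           "gsuser", "gspwd",
                           { RaiseError => 1, mysql_enable_utf8 => 1 });   # decode rows as utf8

    my $oid = $ARGV[0];
    my $records = $dbh->selectall_arrayref(
        "SELECT did, sid, fulltxt FROM demo_fulltxt WHERE did = ?", undef, $oid);

    foreach my $row (@$records) {
        my ($did, $sid, $text) = @$row;
        # In the plugin this becomes: $doc_obj->add_utf8_textref($sid, \$text);
        print "section $sid holds " . length($text) . " characters\n";
    }
    $dbh->disconnect();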