GreenstoneSQLPlugin.pm

Timestamp:

2018-10-25T21:17:02+13:00

Author:

ak19

Message:

Instead of the docoid being stored in the docsql-<OID>.xml filename, all filenames produced are back to being docsql.xml, but the root element Archive now contains the doc oid as attribute: <Archive docoid="oid">
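The attribute-based lookup described in the message can be sketched as follows. This is a minimal illustration, not the plugin's code: it assumes a parser that, like XML::Parser's Stream style, hands the start-tag handler its attributes in the global hash `%_` and the raw tag in `$_`. The `$plugin` hashref, the `start_tag` sub and the OID value `HASH0123abcd` are hypothetical stand-ins.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the plugin object; only the field used below.
my $plugin = { doc_oid => undef };

# Mimics a Stream-style start-tag handler: the element's attributes are
# expected in the global hash %_ and the raw tag text in $_.
sub start_tag {
    my ($self, $element) = @_;
    if ($element eq 'Archive') {
        my %attr = %_;                       # copy the attribute hash first
        $self->{doc_oid} = $attr{docoid};    # stays undef if the attribute is absent
        return 1;                            # handled here
    }
    return 0;                                # other elements would go to the superclass
}

# Simulate the parser reaching <Archive docoid="HASH0123abcd"> (made-up OID).
$_ = '<Archive docoid="HASH0123abcd">';
%_ = (docoid => 'HASH0123abcd');
my $handled = start_tag($plugin, 'Archive');

print "handled=$handled doc_oid=$plugin->{doc_oid}\n";
```

Copying `%_` into a lexical hash before the lookup keeps the attribute access unambiguous, since `$_{'docoid'}` is easy to misread as an operation on `$_`.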

File:

: 1 edited

main/trunk/greenstone2/perllib/plugins/GreenstoneSQLPlugin.pm (modified) (7 diffs)

main/trunk/greenstone2/perllib/plugins/GreenstoneSQLPlugin.pm

-              r32541
+              r32542
 ###########################################################################
+#
-# GreenstoneSQLPlugin.pm -- reads into doc_obj from SQL db and docsql-<OID>.xml
+# GreenstoneSQLPlugin.pm -- reads into doc_obj from SQL db and docsql.xml
 # Metadata and/or fulltext are stored in SQL db, the rest may be stored in
 # the docsql .xml files.
 …
 # Ask about docsql naming convention adopted to identify OID. Better way?
 # collection names -> table names: it seems hyphens not allowed. Changed to underscores.
-# - Startup parameters
+# + Startup parameters (except removeold/build_mode)
 # - incremental building: where do we need to add code to delete rows from our sql table after
 # incrementally importing a coll with fewer docs (for instance)? What about deleted/modified meta?
 …
 # won't delete it from archives but buildcol step will. Need to implement this with this database plugin or wherever the actual flow is
+# TODO: Add public instructions on using this plugin and its plugout: start with installing mysql binary, changing pwd, running the server (and the client against it for checking, basic cmds like create and drop). Then discuss db name, table names (per coll), db cols and col types, and how the plugout and plugin work.
+# Discuss the plugin/plugout parameters.
 sub BEGIN {
     @GreenstoneSQLPlugin::ISA = ('GreenstoneXMLPlugin');
 …
 # This plugin must be in the document plugins pipeline IN PLACE OF GreenstoneXMLPlugin
 # So we won't have a process exp conflict here.
+# The structure of docsql.xml files is identical to doc.xml and the contents are similar except:
+#   - since metadata and/or fulltxt are stored in mysql db instead, just XML comments indicating
+#   this are left inside docsql.xml within the <Description> (for meta) and/or <Content> (for txt)
+#   - the root element Archive now has a docoid attribute: <Archive docoid="OID">
 sub get_default_process_exp {
     my $self = shift (@_);
-    #return q^(?i)docsql(-\d+)?\.xml$^;
-    return q^(?i)docsql(-.+)?\.xml$^;
+    return q^(?i)docsql(-\d+)?\.xml$^; # regex based on this method in GreenstoneXMLPlugin
+    #return q^(?i)docsql(-.+)?\.xml$^; # no longer storing the OID embedded in docsql .xml filename
+}
 …
+}
+sub xml_start_tag {
+    my $self = shift(@_);
+    my ($expat, $element) = @_;
+    my $outhandle = $self->{'outhandle'};
+    $self->{'element'} = $element;
+    if ($element eq "Archive") { # docsql.xml files contain a OID attribute on Archive element
+        # the element's attributes are in %_ as per ReadXMLFile::xml_start_tag() (while $_
+        # contains the tag)
+        # Don't access %_{'docoid'} directly: keep getting a warning message to
+        # use $_{'docoid'} for scalar contexts, but %_ is the element's attr hashmap
+        # whereas $_ has the tag info. So we don't want to do $_{'docoid'}.
+        my %attr_hash = %_; # right way, see OAIPlugin.pm
+        $self->{'doc_oid'} = $attr_hash{'docoid'};
+        print $outhandle "Extracted OID from docsql.xml: ".$self->{'doc_oid'}."\n"
+            if $self->{'verbosity'} > 1;
+    }
+    else { # let superclass GreenstoneXMLPlugin continue to process <Section> and <Metadata> elements
+        $self->SUPER::xml_start_tag(@_);
+    }
+}
 # TODO Q: Why are there 3 passes when we're only indexing at doc and section level (2 passes)?
 …
     my $gs_sql = $self->{'gs_sql'};
-    my $oid = $self->{'doc_oid'}; # we stored current doc's OID during sub read()
+    my $oid = $self->{'doc_oid'}; # we stored current doc's OID during sub xml_start_tag()
     print $outhandle "==== OID of document (meta|text) to be read in from DB: $oid\n"
     if $self->{'verbosity'} > 1;
 …
+}
-sub read {
-    my $self = shift (@_);
-    my ($pluginfo, $base_dir, $file, $block_hash, $metadata, $processor, $maxdocs, $total_count, $gli) = @_;
-    # when running buildcol.pl, the filename should match "docsql-<OID>.xml"
-    # when running import.pl it will be the original document's filename
-    # we only want to read in from db when running buildcol.pl
-    # doc_obj doesn't exist yet and only exists during super::read(): a new doc (doc_obj)
-    # is created in super::open_document() and is made undef again on super::close_document().
-    # Further, can't read it in from doc.xml to work out which OID to query in sql db:
-    # even if we got access to doc_obj, if no meta stored in docsql.xml, then when
-    # doc_obj is read in from docsql.xml there will be no OID. So OID is docsql.xml filename
-    # contains OID in filename. Having extracted OID from the filename, store OID in plugin-self
-    if($file =~ m/docsql-(.+?)\.xml$/) {
-    # work out docoid from filename of form "docsql-<OID>.xml". $file can have a containing
-    # subfolder besides filename, e.g. "dir/docsql-<OID>.xml"
-    # https://stackoverflow.com/questions/22836/how-do-i-perform-a-perl-substitution-on-a-string-while-keeping-the-original
-    (my $oid = $file) =~ s@^(.*?)docsql-(.+?)\.xml$@$2@;
-    $self->{'doc_oid'} = $oid;
+    }
-    # always read docsql.xml, as we then know doc structure, and assoc files are dealt with
-    # Plus we need to read docsql.xml if either meta or fulltxt went into there instead of to sql db
-    return $self->SUPER::read(@_); # will open_doc, close_doc then process doc_obj for indexing, then undef doc_obj
+}
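
The two process expressions that swap places in this changeset differ only in the optional filename suffix they accept. The sketch below copies the two patterns verbatim from the diff and compares them on a few sample filenames. The filenames and the `matches` helper are made up for illustration and are not part of the plugin.

```perl
use strict;
use warnings;

# Patterns copied from the diff; q^...^ is just a single-quoted string.
my $numeric_suffix = q^(?i)docsql(-\d+)?\.xml$^;   # active as of r32542
my $any_suffix     = q^(?i)docsql(-.+)?\.xml$^;    # active in r32541

# Hypothetical helper: 1 if $file matches the pattern string, else 0.
sub matches {
    my ($file, $exp) = @_;
    return $file =~ /$exp/ ? 1 : 0;
}

# Made-up sample filenames covering both naming schemes.
my @files = (
    'docsql.xml',
    'dir/docsql-12.xml',
    'dir/docsql-HASH0123abcd.xml',
    'DOCSQL.XML',
    'doc.xml',
);

for my $file (@files) {
    printf "%-30s numeric=%d any=%d\n",
        $file, matches($file, $numeric_suffix), matches($file, $any_suffix);
}
```

Under these patterns, a name that embeds a non-numeric OID after the hyphen is accepted only by the `(-.+)?` variant, which fits the message's statement that filenames are back to plain docsql.xml.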

Changeset 32542 for main/trunk/greenstone2/perllib/plugins/GreenstoneSQLPlugin.pm
