Changeset 34129

Timestamp:

2020-05-30T01:01:01+12:00

Author:

ak19

Message:

Implemented Kathy's suggestions:
1. Removed the explicit ex. prefix from ex metadata, which resolves many of the questions about setting, retrieving and configuring ex. metadata.
2. Moved the setup_keep_urls() call to the start of process(), where I can likewise ensure it is done only once and only during import. This is a better location for it than an overridden can_process_this_file().
3. Investigated and fixed the symptom of util::tidy_up_OID() displaying a warning for every segment that the baseOID had a D prefix. It happened because get_base_OID(), which I called from NutchTextDumpPlugin::process(), is side-effecting code that doesn't just return the base OID but also does an add-OID for segments, which ended up calling util::tidy_up_OID() and so printing the message multiple times. The solution wasn't to override get_base_OID() to store the return value the first time and return the stored value on subsequent calls, as this important side-effect would be lost. Instead, the solution was to override get_base_OID() to store the superclass version's return value every time, and to have a separate get_siteID() method that calls get_base_OID() only if the stored value isn't set yet. The new get_siteID() method also made me think about additionally supporting siteID.txt files as an alternative to my custom arrangement of siteID/dump.txt files.
4. I was unsuccessful in further investigating how to undo binmode changes, which I wanted to make locally for debugging but which would currently be global if uncommented.
5. Investigating the encoding questions raised more questions, which I've added in.
Finally, did some tidying up. Will do more in the next commit.
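
The override described in point 3 can be sketched in isolation as follows. This is a minimal, self-contained illustration and not the plugin's actual code: the packages BaseSplitter and DumpPlugin and the add_oid_calls counter are hypothetical stand-ins for SplitTextFile, NutchTextDumpPlugin and the add-OID side effect.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the superclass: get_base_OID() has a side effect
# (here just a counter) in addition to returning a value.
package BaseSplitter;
sub new { my ($class) = @_; return bless { add_oid_calls => 0 }, $class; }
sub get_base_OID {
    my ($self, $doc) = @_;
    $self->{add_oid_calls}++;    # side effect that must happen on every call
    return "D" . $doc->{dirname};
}

package DumpPlugin;
our @ISA = ('BaseSplitter');

# Always delegate to the superclass so that its side effect is preserved,
# but remember the latest return value for internal use.
sub get_base_OID {
    my ($self, $doc) = @_;
    $self->{dirname_siteID} = $self->SUPER::get_base_OID($doc);
    return $self->{dirname_siteID};
}

# Accessor without the side effect: it only falls back to get_base_OID()
# if no value has been stored yet.
sub get_siteID {
    my ($self, $doc) = @_;
    my $site_id = $self->{dirname_siteID} || $self->get_base_OID($doc);
    $site_id =~ s/^D//;
    return $site_id;
}

package main;
my $plugin = DumpPlugin->new();
my $doc = { dirname => "00436" };
$plugin->get_base_OID($doc) for 1 .. 3;     # superclass-driven calls, one per segment
my $site_id = $plugin->get_siteID($doc);    # triggers no further side effect
print "siteID=$site_id calls=$plugin->{add_oid_calls}\n";
```

In this sketch the counter stays at 3 after get_siteID() is called, which mirrors the aim of the fix: the per-segment behaviour of the superclass is left alone, while the plugin reads the stored value.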

File:

: 1 edited

main/trunk/greenstone2/perllib/plugins/NutchTextDumpPlugin.pm (modified) (15 diffs)


main/trunk/greenstone2/perllib/plugins/NutchTextDumpPlugin.pm

-              r34126
+              r34129
+#
 # CLEANUP:
-# Remove MetadataRead functions and inheritance
+# + Remove MetadataRead functions and inheritance
+#
 # QUESTIONS:
 # - encoding = utf-8, changed to "utf8" as required by copied to_utf8(str) method. Why does it not convert
 # the string parameter but fails in decode() step? Is it because the string is already in UTF8?
-# - Problem converting text with encoding in full set of nutch dump.txt when there encoding is windows-1252.
+# - Problem converting text with encoding in the full set of nutch dump.txt files when their encoding is windows-1252 or Shift-JIS.
 # - TODOs
+#
+# - Should I add metadata as "ex."+meta or as meta? e.g. ex.srcURL or srcURL?
+# - Want to read in keep_urls_file, maintaining a hashmap of its URLs, only on import, isn't that correct?
+# Then how can I initialise this only once and only during import? constructor and init() methods are called during buildcol too.
+# For now, I've done it in can_proc_this_file() but there must be a more appropriate place and correct way to do this?
+# - why can't I do doc_obj->get_meta_element($section, "ex.srcURL") but have to pass "srcURL" and 1 to ignore
+# namespace?
+# - in collectionConfig file I have to leave out ex. prefix for all but Title, why?
+# - in GLI, browsing classifier sort_leaf options, "ex.srcURL" appears only as "ex.srcurl" (lowercased). Why?
+# - On the other hand, in GLI's search indexes, both ex.srcurl and ex.srcURL appear. But only building
+# with an index on ex.srcURL provides a search option in the search box. But then searching on an existing
+# srcURL produces 0 results anyway.
+# - Is this all because I am naming my ex metadata names wrongly? e.g. ex.srcURL, ex.siteID, ex.srcDomain.
+#
 # CHECK:
 # - title fallback is URL.
-# - util::tidy_up_OID() prints warning. SiteID is foldername and OIDtype=dirname, so fully numeric
+# + util::tidy_up_OID() prints warning. SiteID is foldername and OIDtype=dirname, so fully numeric
 # siteID to OID conversion results in warning message that siteID is fully numeric and gets 'D' prefixed.
 # Is this warning still necessary?
+# - Ask about binmode usage (for debugging) in this file
 # To get all the isMRI results, I ran Robo-3T against our mongodb as
 …
 # into our collection.
 # Remember to configure the NutchTextDumpPlugin with option "keep_urls_file" = isMRI_urls.txt to make use of this.
+#
+# + ex meta -> don't add with ex. prefix
+# + check for and call to setup_keep_urls(): move into process() rather than doing this in more convoluted way in can_process_this_file()
+# + util::tidy_up_oid() -> print callstack to find why it's called on every segment
+# X- binmode STDERR: work out what default mode on STDERR is and reset to that after printing debug messages in utf8 binmode
+# - test collection to check various encodings with and without to_utf8() function - tested collection 00436 in collection cctest3.
+# The srcURL .../divrey/shaar.htm (Identifier: D00436s184) is in Hebrew and described as being in char encoding iso-8859-8.
+# But when I paste the build output when using NutchTextDumpPlugin.pm_debug_iso-8859-8
+# into emacs, the text for this record reads and scrolls R to L in emacs.
+# When previewing the text in the full text section in GS3, it reads L to R.
+# The digits used in the text seem to match, occurring in reverse order from each other between emacs and GS3 preview.
+# Building displays error messages if to_utf8() called to decode this record's title meta or full text
+# using the discovered encoding.
 sub BEGIN {
 …
     #return bless $self, $class;
     $self = bless $self, $class;
+    # Can only call any methods on $self AFTER the bless operation above
+    #$self->setup_keep_urls(); # want to set up the keep_urls hashmap only once, so have to do it here (init is also called by buildcol)
+    # Can only call any $self->method(); AFTER the bless operation above, so from this point onward
     return $self;
+}
-# sub init {
-    # my $self = shift (@_);
-    # my ($verbosity, $outhandle, $failhandle) = @_;
-    # if(!$self->{'keep_urls_file'}) {
-        # my $msg = "NutchTextDumpPlugin INFO: No urls file provided.\n" .
-            # "    No records will be filtered.\n";
-        # print $outhandle $msg if ($verbosity > 2);
-        # $self->SUPER::init(@_);
-        # return;
-    # }
-    # # read in the keep urls files
-    # my $keep_urls_file = &util::locate_config_file($self->{'keep_urls_file'});
-    # if (!defined $keep_urls_file)
-    # {
-        # my $msg = "NutchTextDumpPlugin INFO: Can't locate urls file $keep_urls_file.\n" .
-            # "    No records will be filtered.\n";
-        # print $outhandle $msg;
-        # $self->{'keep_urls'} = undef;
-        # # Not an error if there's no $keep_urls_file: it just means all records
-        # # in dump.txt will be processed.
-    # }
-    # else {
-        # #$self->{'keep_urls'} = $self->parse_keep_urls_file($keep_urls_file, $outhandle);
-        # #$self->{'keep_urls'} = {};
-        # $self->parse_keep_urls_file($keep_urls_file, $outhandle, $failhandle);
-    # }
-    ## if($self->{'keep_urls'} && $verbosity > 2) {
-    #   # print STDERR "@@@@ keep_urls hash map contains:\n";
-    #   # map { print STDERR $_."=>".$self->{'keep_urls'}->{$_}."\n"; } keys %{$self->{'keep_urls'}};
-    ## }
-    # $self->SUPER::init(@_);
-# }
 sub setup_keep_urls {
 …
     $self->{'keep_urls_processed'} = 1; # flag to track whether this method has been called already during import
-    #print $outhandle "@@@@ In NutchTextDumpPlugin::setup_keep_urls()\n";
+    #print $outhandle "@@@@ In NutchTextDumpPlugin::setup_keep_urls() - this method should only be called once and only during import.pl\n";
     if(!$self->{'keep_urls_file'}) {
 …
+}
-# TODO: This is an ugly way to do this and a non-intuitive place to do this. Is there a better way?
-# Overriding can_process_this_file() in order to avoid setting up the keep_urls hashmap during
-# buildcol.pl. We only want to setup the hash during import.
-# During buildcol, this method is called with directories and not files and this method will return
-# false as a result. So when it returns true, it will be import.pl, and we check whether we haven't
-# already setup the keep_urls map. If the keep urls file has not yet been processed, then we set up
-# the hashmap once.
-sub can_process_this_file {
-    my $self = shift(@_);
-    my ($filename) = @_;
-    my $can_process_return_val = $self->SUPER::can_process_this_file(@_);
-    # We want to load in the keep_urls_file and create the keep_urls hashmap only once, during import
-    # Because the keep urls file can be large and it and the hashmap serve no purpose during buildcol.pl.
-    # Check whether we've already processed the file/built the hashmap, as we don't want to do this
-    # more than 1 time even within just the import cycle.
-    if($can_process_return_val && !$self->{'keep_urls_processed'}) { #!defined $self->{'keep_urls'}) {
-    $self->setup_keep_urls();
+    }
-    return $can_process_return_val;
+}
 sub parse_keep_urls_file {
 …
     my $self = shift (@_);
     my ($textref, $pluginfo, $base_dir, $file, $metadata, $doc_obj, $gli) = @_;
+    # Only load the urls from the keep_urls_file into a hash if we've not done so before.
+    # Although this method is called on each dump.txt file found, and we want to only setup_keep_urls()
+    # once for a collection and only during import and not buildcol, it's best to do the check and setup_keep_urls()
+    # call here, because this subroutine, process(), is only called during import() and not during buildcol.
+    # During buildcol, can_process_this_file() is not called on dump.txt files but on folders (archives folder).
+    # Only if this plugin's can_process_this_file() is called on a dump.txt will this process() be called
+    # on each segment of the dump.txt file
+    # So this is the best spot to ensure we've setup_keep_urls() here, if we haven't already:
+    if(!$self->{'keep_urls_processed'}) {
+    $self->setup_keep_urls();
+    }
     my $outhandle = $self->{'outhandle'};
     my $filename = &util::filename_cat($base_dir, $file);
     my $cursection = $doc_obj->get_top_section();
+    # https://perldoc.perl.org/functions/binmode.html
+    # "To mark FILEHANDLE as UTF-8, use :utf8 or :encoding(UTF-8) . :utf8 just marks the data as UTF-8 without further checking,
+    # while :encoding(UTF-8) checks the data for actually being valid UTF-8. More details can be found in PerlIO::encoding."
     # https://stackoverflow.com/questions/27801561/turn-off-binmodestdout-utf8-locally
+    #binmode STDERR, ':utf8'; ## FOR DEBUGGING! To avoid "wide character in print" messages
+    # Is there anything useful here:
+    # https://perldoc.perl.org/PerlIO/encoding.html and https://stackoverflow.com/questions/21452621/binmode-encoding-handling-malformed-data
+    # https://stackoverflow.com/questions/1348639/how-can-i-reinitialize-perls-stdin-stdout-stderr
+    # https://metacpan.org/pod/open::layers
+    #binmode(STDERR, ':utf8'); ## FOR DEBUGGING! To avoid "wide character in print" messages, but modifies globally for process!
     #print STDERR "---------------\nDUMP.TXT\n---------\n", $$textref, "\n------------------------\n";
 …
             if $self->{'verbosity'} > 3;
+        }
-        $doc_obj->add_utf8_metadata ($cursection, "ex.srcURL", $url);
-        $doc_obj->add_utf8_metadata ($cursection, "ex.key", $key);
+        $doc_obj->add_utf8_metadata ($cursection, "srcURL", $url);
+        $doc_obj->add_utf8_metadata ($cursection, "key", $key);
 …
         my ($domain, $basicDomain) = $url =~ m@(^https?://(?:www\.)?([^/]+)).*@;
         #my ($domain, $protocol, $basicdomain) = $url =~ m@((^https?)://([^/]+)).*@; # Works
-        $doc_obj->add_utf8_metadata ($cursection, "ex.srcDomain", $domain);
-        $doc_obj->add_utf8_metadata ($cursection, "ex.basicDomain", $basicDomain);
+        $doc_obj->add_utf8_metadata ($cursection, "srcDomain", $domain);
+        $doc_obj->add_utf8_metadata ($cursection, "basicDomain", $basicDomain);
+    }
 …
             $encoding = "utf8"; # method to_utf8() recognises "utf8" not "utf-8"
             } else {
             print STDERR "@@@@@@ WARNING NutchTextDumpPlugin::process(): Record's Nutch-assigned CharEncodingForConversion was not utf-8: $encoding\n";
+            }
+            my $srcURL = $doc_obj->get_metadata_element($cursection, "srcURL");
+            print STDERR "@@@@@@ WARNING NutchTextDumpPlugin::process(): Record's Nutch-assigned CharEncodingForConversion was not utf-8 but $encoding\n\tfor record: $srcURL\n";
+            }
+        }
 …
         # add meta to docObject if both metaname and metavalue are non-empty strings
         if($metaname ne "" && $metavalue ne "") { # && $metaname ne "rs" && $metaname ne "csh") {
-            $doc_obj->add_utf8_metadata ($cursection, "ex.".$metaname, $metavalue);
+            # when no namespace is provided as here, adds as ex. meta.
+            # Don't explicitly prefix ex., as things become convoluted when retrieving meta
+            $doc_obj->add_utf8_metadata ($cursection, $metaname, $metavalue);
             #print STDERR "Added meta |$metaname| = |$metavalue|\n"; #if $metaname =~ m/ProtocolStatus/i;
+        }
 …
     # Correct title metadata using encoding, if we have $encoding at last
-    # $title_meta = $self->to_utf8($encoding, $title_meta) if $encoding;
     # https://stackoverflow.com/questions/12994100/perl-encode-pm-cannot-decode-string-with-wide-character
     # Error message: "Perl Encode.pm cannot decode string with wide character"
 …
     #$title_meta = $self->to_utf8($encoding, $title_meta) if ($encoding);
     } else { # if we have "null" as title metadata, set it to the record URL?
-    my $srcURL = $doc_obj->get_metadata_element($cursection, "srcURL", 1); # TODO: why does ex.srcURL not work, nor srcURL without 3rd param
+    my $srcURL = $doc_obj->get_metadata_element($cursection, "srcURL");
     my ($basicURL) = $srcURL =~ m@^https?://(?:www\.)?(.*)$@; # use basicURL for title instead of srcURL, else many docs get classified under "Htt" bucket for https
     if(defined $srcURL) {
 …
     # which was crafted to be the siteID. However, because our siteID is all numeric,
     # a D gets prepended to create baseOID. Remove the starting 'D' to get actual siteID.
-    my $siteID = $self->get_base_OID($doc_obj);
-    #print STDERR "BASE OID: " . $self->get_base_OID($doc_obj) . "\n";
+    my $siteID = $self->get_siteID($doc_obj, $file);
+    #print STDERR "BASE OID: " . $siteID . "\n";
     $siteID =~ s/^D//;
-    $doc_obj->add_utf8_metadata ($cursection, "ex.siteID", $siteID);
+    $doc_obj->add_utf8_metadata ($cursection, "siteID", $siteID);
 …
     my $no_text = 1;
     if($text_start_index != -1) { # had found a "text:start:" marker, so we should have text content for this record
     if($$textref =~ m/text:start:\r?\n(.*?)\r?\ntext:end:/) {
         $$textref = $1;
 …
+}
+sub get_siteID {
+    my $self = shift(@_);
+    my ($doc_obj, $file) = @_;
+    my $siteID;
+    if ($file =~ /(\d+)\.txt/) {
+    # file name without extension is site ID, e.g. 00001.txt
+    $siteID = $1;
+    #$siteID = $file;
+    #$siteID =~ s@\.txt$@@;
+    }
+    else { # if($doc_obj->{'OIDtype'} eq "dirname") or even otherwise, just use baseOID
+    # baseOID is the same as site ID when OIDtype is configured to dirname because docs are stored as 00001/dump.txt
+    # siteID has no real meaning in other cases
+    $siteID = $self->{'dirname_siteID'} || $self->get_base_OID($doc_obj);
+    }
+    if(!$self->{'siteID'} || $siteID ne $self->{'siteID'}) {
+    $self->{'siteID'} = $siteID;
+    }
+    return $self->{'siteID'};
+}
+# SplitTextFile::get_base_OID() has the side-effect of calling SUPER::add_OID()
+# in order to initialise it. This ultimately results in util::tidy_up_OID() printing warning messages
+# about an all-numeric baseOID requiring the D prefix to be prepended.
+# Once the base_OID is already set, we want to get the baseOID without that side-effect, because siteID = baseOID
+# in cases where OIDtype=dirname.
+# We don't want to recalculate baseOID for each segment, only once per dump.txt file as the superclass SplitTextFile
+# did it. However, we need access to the baseOID from this plugin
+# So we override this method to store the calculated baseOID in a variable for use and check if it's set before
+# calling this method.
+# CANNOT override this method in the usual way though: to calculate baseOID once per dump.txt, store it and return
+# the stored value for each segment because the superclass version of get_base_OID has a side-effect and needs to
+# continue doing everything it usually does each time the superclass calls this method.
+sub get_base_OID {
+    my $self = shift(@_);
+    my ($doc_obj) = @_;
+    # Let this method do what it always did, as it does more than return a value and has important side-effects!
+    # SplitTextFile calls this method once for every segment, not just for the base document, with the side-effect
+    # of calculating and adding the OID for each segment.
+    # Therefore, do not return the stored dirname_siteID if already set, as otherwise this method will have
+    # the ominous side-effect of "Warning: D00001s1 already exists with index status I" messages for every segment!
+    # Instead, when trying to work out $siteID (when OIDtype=dirname), check if $self->{'dirname_siteID'} already set
+    # and use that else call this method.
+    #if(!defined $self->{'dirname_siteID'}) {
+    $self->{'dirname_siteID'} = $self->SUPER::get_base_OID($doc_obj); # store for NutchTextDumpPlugin's internal use
+    #}
+    return $self->{'dirname_siteID'}; # return superclass return value as always
+}
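
On the binmode question (point 4 of the commit message): one common way to avoid changing STDERR for the whole process is to duplicate the handle and apply the encoding layer to the duplicate only. The following is a minimal sketch under that assumption. The helper name debug_print_utf8 is hypothetical and not part of the plugin.

```perl
use strict;
use warnings;

# Hypothetical helper: print debug text containing wide characters without
# changing the PerlIO layers on the real STDERR.
sub debug_print_utf8 {
    my (@messages) = @_;
    # Duplicate STDERR; the duplicate has its own layer stack.
    open(my $dbg, '>&', \*STDERR) or die "Cannot dup STDERR: $!";
    binmode($dbg, ':encoding(UTF-8)');
    print {$dbg} @messages;
    close($dbg);    # closes only the duplicate, not STDERR itself
}

my @before = PerlIO::get_layers(*STDERR, output => 1);
debug_print_utf8("title: \x{05E9}\x{05E2}\x{05E8}\n");
my @after = PerlIO::get_layers(*STDERR, output => 1);
print STDERR "STDERR layers unchanged\n" if "@before" eq "@after";
```

Because the layer is applied to a separate handle, there is nothing to undo afterwards. The trade-off is that each call opens and closes a duplicate file descriptor, which should be acceptable for occasional debug output.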