Changeset 34137

Timestamp:
05.06.2020 19:51:38 (5 weeks ago)
Author:
ak19
Message:

Have only been able to incorporate one of Dr Bainbridge's improvements so far: when there's no title meta, the first title fallback is no longer basicURL but the web page name without its file extension, e.g. domain.com/path/my-web-page.html gets the title 'my web page'. Only if that works out to be the empty string do we resort to basicURL again for the title.
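
As a rough illustration of that fallback chain, here is a minimal standalone Perl sketch (the variable names mirror the plugin code, but the snippet is only an approximation of the actual change shown in the diff below):

    use strict;
    use warnings;

    my $srcURL = "https://www.domain.com/path/my-web-page.html";
    my ($basicURL) = $srcURL =~ m@^https?://(?:www\.)?(.*)$@;   # strip protocol and any leading www.
    my ($pageName) = $basicURL =~ m@([^/]+)$@;                  # last path component, e.g. "my-web-page.html"
    if (defined $pageName && $pageName ne "") {
        $pageName =~ s@\.[^\.]+$@@;    # drop the file extension
        $pageName =~ s@[_\-]@ @g;      # underscores and hyphens become spaces
    }
    my $title = (defined $pageName && $pageName ne "") ? $pageName : $basicURL;   # resort to basicURL if nothing remains
    print "$title\n";    # prints: my web page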

Files:
1 modified

Legend:

    (space)  Unmodified
    +        Added
    -        Removed
  • main/trunk/greenstone2/perllib/plugins/NutchTextDumpPlugin.pm

--- r34131
+++ r34137

@@ -143,11 +143,31 @@
 no strict 'refs'; # allow filehandles to be variables and viceversa
 
+
+# Seems to be
+# nohup command
+# Not: nohup command > bla.txt 2&>1 &
+# nor even: nohup command &
+#    nohup.out (possibly both STDERR and STDOUT, do a quick test first and then delete nohup.out before re-running)
+#    in the folder the command is run
+# Delete nohup.out when re-running command.
+# Tripped up and unhappy only when commands require keyboard input at any stage.
+#
+#
 # TODO:
+# Use "od" to print out bytevalues of the dump.txt file to check _rs_ and _csh_
+# Also google Nutch about what those fields mean.
+# od -a
+# every byte as ASCII character
+# od -ab
+# ASCII and bytevalue:
+# First comes byteoffset and then ascii character (sp for space). Line underneath the numeric byte values in hex of the individual characters.
+#
 # + 1. Split each dump.txt file into its individual records as individual docs
 # + 2. Store the meta of each individual record/doc
 # ?3. Name each doc, siteID.docID else HASH internal text. See EmailPlugin?
-# - In SplitTextFile::read(), why is $segment which counts discarded docs too used to add record ID
+# + In SplitTextFile::read(), why is $segment which counts discarded docs too used to add record ID
 # rather than $count which only counts included docs? I am referring to code:
 #   $self->add_OID($doc_obj, $id, $segment);
+# Because we get persistent URLs, regardless of whitelist urls file content!
 # The way I've solved this is by setting the OIDtype importOption. Not sure if this is what was required.
 # + 4. Keep a map of all URLs seen - whitelist URLs.
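
For the byte-value check mentioned in the TODO block above, something along these lines could be run from Perl instead of od; this is only a hypothetical helper (the dump.txt path and the _rs_/_csh_ markers being looked for come from the comment, nothing here is plugin code):

    use strict;
    use warnings;

    # Print byte offset (octal, as od does), a printable form of each character
    # (sp for space, . for other non-printables) and its hex value, roughly like "od -ab".
    open(my $fh, '<:raw', 'dump.txt') or die "Cannot open dump.txt: $!";
    my $offset = 0;
    my $byte;
    while (read($fh, $byte, 1)) {
        my $ord  = ord($byte);
        my $char = $byte eq ' ' ? 'sp' : ($ord >= 33 && $ord <= 126 ? $byte : '.');
        printf "%08o  %-2s  %02x\n", $offset, $char, $ord;
        $offset++;
    }
    close($fh);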
     
@@ -171,9 +191,10 @@
 
 # CHECK:
-# - title fallback is URL.
+# + title fallback is URL. Remove domain/all folder prefix (unless nothing remains), convert underscores and hyphens to spaces.
 # + util::tidy_up_OID() prints warning. SiteID is foldername and OIDtype=dirname, so fully numeric
 # siteID to OID conversion results in warning message that siteID is fully numeric and gets 'D' prefixed.
 # Is this warning still necessary?
 # - Ask about binmode usage (for debugging) in this file
+
 
 # To get all the isMRI results, I ran Robo-3T against our mongodb as
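
The 'D'-prefixing behaviour questioned above boils down to not allowing fully numeric OIDs; a hypothetical illustration (not the real util::tidy_up_OID() code, and the example siteID is made up):

    # Hypothetical sketch: a fully numeric siteID-based OID gets a 'D' prefixed,
    # along with the warning the CHECK note above refers to.
    sub tidy_oid_sketch {
        my ($oid) = @_;
        if ($oid =~ /^\d+$/) {
            print STDERR "Warning: OID '$oid' is fully numeric, prefixing 'D'\n";
            $oid = "D$oid";
        }
        return $oid;
    }

    print tidy_oid_sketch("00123"), "\n";   # prints: D00123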
     
@@ -479,5 +500,9 @@
     # https://stackoverflow.com/questions/1348639/how-can-i-reinitialize-perls-stdin-stdout-stderr
     # https://metacpan.org/pod/open::layers
+    # if() { # Google: "what is perl choosing to make the default char encoding for the file handle". Does it take a hint from somewhere, like env vars? Look for env vars
+    #  # is there a perl env var to use, to check char enc? If set to utf-8, do this
     #binmode(STDERR, ':utf8'); ## FOR DEBUGGING! To avoid "wide character in print" messages, but modifies globally for process!
+    #}
+    # Then move this if-block to BEGIN blocks of all perl process files.
     
     #print STDERR "---------------\nDUMP.TXT\n---------\n", $$textref, "\n------------------------\n";
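
The commented-out idea above (only switching STDERR to utf8 when some environment setting says so, and doing it in a BEGIN block) might look roughly like this; the locale-based check is an assumption, since the comment itself is still asking which variable to consult:

    BEGIN {
        # Hypothetical guard: only set STDERR to utf8 if the locale advertises UTF-8,
        # instead of applying binmode unconditionally for the whole process.
        my $locale = $ENV{'LC_ALL'} // $ENV{'LANG'} // '';
        if ($locale =~ /UTF-?8/i) {
            binmode(STDERR, ':utf8');   # avoids "Wide character in print" warnings when debugging
        }
    }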
     
@@ -609,11 +634,26 @@
     } else { # if we have "null" as title metadata, set it to the record URL?
         my $srcURL = $doc_obj->get_metadata_element($cursection, "srcURL");
-        my ($basicURL) = $srcURL =~ m@^https?://(?:www\.)?(.*)$@; # use basicURL for title instead of srcURL, else many docs get classified under "Htt" bucket for https
         if(defined $srcURL) {
-            print STDERR "@@@@ null/empty title to be replaced with ".$basicURL."\n"
-                if $self->{'verbosity'} > 3;
-            $title_meta = $basicURL;
+            # Use the web page name without file ext for doc title, if web page name present,
+            # else use basicURL for title instead of srcURL,
+            # else many docs get classified under "Htt" bucket for https
+
+            my ($basicURL) = $srcURL =~ m@^https?://(?:www\.)?(.*)$@;
+            my ($pageName) = $basicURL =~ m@([^/]+)$@;
+            if (!$pageName) {
+                $pageName = $basicURL;
+            } else {
+                # remove any file extension
+                $pageName =~ s@\.[^\.]+@@;
+                # replace _ and - with spaces
+                $pageName =~ s@[_\-]@ @g;
+            }
+
+            print STDERR "@@@@ null/empty title for $basicURL to be replaced with: $pageName\n"
+                if $self->{'verbosity'} > 3;
+            $title_meta = $pageName;
         }
     }
+
     $doc_obj->add_utf8_metadata ($cursection, "Title", $title_meta);
 
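
As a quick standalone check of the regexes added in this hunk, the same pipeline can be run over a few sample URLs (the URLs are made-up test values; expected titles are shown in the comments):

    use strict;
    use warnings;

    for my $srcURL ("https://www.domain.com/path/my-web-page.html",      # expect: my web page
                    "http://domain.com/docs/annual_report_2019.pdf",     # expect: annual report 2019
                    "https://www.domain.com/") {                         # expect: domain.com/ (falls back to basicURL)
        my ($basicURL) = $srcURL =~ m@^https?://(?:www\.)?(.*)$@;
        my ($pageName) = $basicURL =~ m@([^/]+)$@;
        if (!$pageName) {
            $pageName = $basicURL;
        } else {
            $pageName =~ s@\.[^\.]+@@;   # remove any file extension
            $pageName =~ s@[_\-]@ @g;    # replace _ and - with spaces
        }
        print "$srcURL => $pageName\n";
    }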