Changeset 34122 for main

Timestamp:

2020-05-26T01:13:33+12:00

Author:

ak19

Message:

After testing a build of the complete commoncrawl collection, noticed warnings about windows-1252 being set by nutch as the charset encoding. 1. Attempting to use latin-1 for windows-1252 encodings in to_utf8() as well, to decode text in such cases. 2. When the encoding is utf8 (set by nutch as utf8), uncommented the immediate return statement in to_utf8(), so that the if(not utf8) conditions guarding calls to the function can be removed. 3. Tidying up. 4. Re-tabbed lines in emacs after earlier occasional work on Windows.
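
The difference between windows-1252 and latin-1 that the message touches on can be sketched as follows. This is a minimal illustration using Perl's core Encode module, not the plugin's to_utf8(), which goes through Greenstone's unicode helpers. The helper name decode_record_text and its alias table are hypothetical and exist only for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of NutchTextDumpPlugin.pm): turn raw bytes into
# a Perl character string based on the charset label reported for a record.
sub decode_record_text {
    my ($encoding, $bytes) = @_;
    # Assumed mapping from the labels seen in this changeset to Encode names.
    my %alias = (
        'utf8'         => 'UTF-8',
        'utf-8'        => 'UTF-8',
        'iso_8859_1'   => 'iso-8859-1',
        'windows-1252' => 'cp1252',
    );
    my $name = $alias{lc $encoding} || $encoding;
    return decode($name, $bytes);
}

# Byte 0x93 is a left curly quote in windows-1252 but a C1 control in latin-1.
printf "windows-1252: U+%04X\n", ord(decode_record_text('windows-1252', "\x93"));  # U+201C
printf "latin-1:      U+%04X\n", ord(decode_record_text('iso_8859_1', "\x93"));    # U+0093
```

The two encodings agree on every byte outside 0x80-0x9F, so treating windows-1252 text as latin-1 only affects characters in that range, such as curly quotes and the euro sign.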

File:

1 edited

main/trunk/greenstone2/perllib/plugins/NutchTextDumpPlugin.pm (modified) (16 diffs)

main/trunk/greenstone2/perllib/plugins/NutchTextDumpPlugin.pm

-              r34121
+              r34122
 # both sorted by ex.srcURL, and an ex.Title classifier.
 # For the ex.srcDomain classifier, set removeprefix to: https?\:\/\/(www\.)?
+# An alternative is to build that List classifier on ex.basicDomain instead of ex.srcDomain.
 # Finally, in the "display" format statement, add the following before the "wrappedSectionText" to
 # display the most relevant metadata of each record:
 …
   #   </div>
 # TODO: remove illegible values for metadata _rs_ and _csh_ in the example below before
+# + DONE: remove illegible values for metadata _rs_ and _csh_ in the example below before
 # committing, in case their encoding affects the loading/reading in of this perl file.
+#
 …
     # metadata CharEncodingForConversion :  utf-8
     # metadata OriginalCharEncoding :   utf-8
     # metadata _rs_ :     ï¿œ
     # metadata _csh_ :
+    # metadata _rs_ :
+    # metadata _csh_ :
     # text:start:
     # Te Kura Kaupapa Māori o Te Whānau Tahi He mihi He mihi Te Kaupapa Ngā Tāngata Te Kākano Te Pihinga Te Tipuranga Te Puāwaitanga Te Tari Te Poari Matua Whakapā mai He mihi He mihi Te Kaupapa Ngā Tāngata Te Kākano Te Pihinga Te Tipuranga Te Puāwaitanga Te Tari Te Poari Matua Whakapā mai TE KURA KAUPAPA MĀORI O TE WHĀNAU TAHI He mihi Kei te mōteatea tonu nei ngā mahara ki te huhua kua mene atu ki te pō, te pōuriuri, te pōtangotango, te pō oti atu rā. Kua rite te wāhanga ki a rātou, hoki mai ki te ao tūroa nei Ko Io Matua Kore te pūtaketanga, te pūkaea, te pūtātara ka rangona whānuitia e te ao. Ko tāna ko ngā whetū, te marama, te haeata ki a Tamanui te rā. He atua i whakateretere mai ai ngā waka i tawhiti nui, i tawhiti roa, i tawhiti mai rā anō. Kei nga ihorei, kei ngā wahapū, kei ngā pukumahara, kei ngā kanohi kai mātārae o tō tātou nei kura Aho Matua, Te Kura Kaupapa Māori o Te Whanau Tahi. Anei rā te maioha ki a koutou katoa e pūmau tonu ki ngā wawata me ngā whakakitenga i whakatakotoria e ngā poupou i te wā i a rātou. Ka whakanuia hoki te toru tekau tau o tēnei kura mai i tōna orokohanga timatanga tae noa ki tēnei wā Ka pūmau tōnu mātou ki te whakatauki o te kura e mea ana “Poipoia ō tātou nei pūmanawa” Takiritia tonutia te ra ki runga i Te Kura Kaupapa Maori o Te Whanau Tahi . Back to Top " Poipoia ō tātou nei pūmanawa - Making our potential a reality " © Te Kura Kaupapa Māori o Te Whānau Tahi, 2019 Cart ( 0 )
 …
 # - encoding = utf-8, changed to "utf8" as required by copied to_utf8(str) method. Why does it not convert
 # the string parameter but fails in decode() step? Is it because the string is already in UTF8?
+# - Problem converting text in the full set of nutch dump.txt files when their encoding is windows-1252.
+# - TODOs
+#
 # - Should I add metadata as "ex."+meta or as meta? e.g. ex.srcURL or srcURL?
 # - Want to read in keep_urls_file, maintaining a hashmap of its URLs, only on import, isn't that correct?
 # Then how can I initialise this only once and only during import? constructor and init() methods are called during buildcol too.
 # For now, I've done it in can_proc_this_file() but there must be a more appropriate place and correct way to do this?
-# - TODOs
 # - why can't I do doc_obj->get_meta_element($section, "ex.srcURL") but have to pass "srcURL" and 1 to ignore
 # namespace?
 …
 # Is this warning still necessary?
+# methods defined in superclasses that have the same signature take
+# precedence in the order given in the ISA list. We want MetaPlugins to
+# call MetadataRead's can_process_this_file_for_metadata(), rather than
+# calling BaseImporter's version of the same method, so list inherited
+# superclasses in this order.
 sub BEGIN {
     @NutchTextDumpPlugin::ISA = ('SplitTextFile');
 …
     my ($pluginlist,$inputargs,$hashArgOptLists) = @_;
     push(@$pluginlist, $class);
     push(@{$hashArgOptLists->{"ArgList"}},@{$arguments});
     push(@{$hashArgOptLists->{"OptList"}},$options);
     my $self = new SplitTextFile($pluginlist, $inputargs, $hashArgOptLists);
     if ($self->{'info_only'}) {
     # don't worry about the options
     return bless $self, $class;
+    }
     $self->{'keep_urls_processed'} = 0;
     $self->{'keep_urls'} = undef;
-    $self->{'type'} = ""; # TODO: value can be 'ascii' or other. Used in MARCPlugin.pm. Keep this field here?
     #return bless $self, $class;
     $self = bless $self, $class;
     # Can only call any methods on $self AFTER the bless operation above
     #$self->setup_keep_urls(); # want to set up the keep_urls hashmap only once, so have to do it here (init is also called by buildcol)
     return $self;
+}
 …
     my $self = shift (@_);
     my $verbosity = $self->{'verbosity'};
     my $outhandle = $self->{'outhandle'};
     my $failhandle = $self->{'failhandle'};
     $self->{'keep_urls_processed'} = 1; # flag to track whether this method has been called already during import
     #print $outhandle "@@@@ In NutchTextDumpPlugin::setup_keep_urls()\n";
+    my $verbosity = $self->{'verbosity'};
+    my $outhandle = $self->{'outhandle'};
+    my $failhandle = $self->{'failhandle'};
+    $self->{'keep_urls_processed'} = 1; # flag to track whether this method has been called already during import
+    #print $outhandle "@@@@ In NutchTextDumpPlugin::setup_keep_urls()\n";
     if(!$self->{'keep_urls_file'}) {
         my $msg = "NutchTextDumpPlugin INFO: No urls file provided.\n" .
             "    No records will be filtered.\n";
         print $outhandle $msg if ($verbosity > 2);
         return;
+    }
+    if(!$self->{'keep_urls_file'}) {
+    my $msg = "NutchTextDumpPlugin INFO: No urls file provided.\n" .
+        "    No records will be filtered.\n";
+    print $outhandle $msg if ($verbosity > 2);
+    return;
+    }
     # read in the keep urls files
     my $keep_urls_file = &util::locate_config_file($self->{'keep_urls_file'});
     if (!defined $keep_urls_file)
+    {
         my $msg = "NutchTextDumpPlugin INFO: Can't locate urls file $self->{'keep_urls_file'}.\n" .
             "    No records will be filtered.\n";
         print $outhandle $msg;
         $self->{'keep_urls'} = undef;
         # TODO: Not a fatal error if $keep_urls_file can't be found: it just means all records
         # in dump.txt will be processed?
+    my $msg = "NutchTextDumpPlugin INFO: Can't locate urls file $self->{'keep_urls_file'}.\n" .
+        "    No records will be filtered.\n";
+    print $outhandle $msg;
+    $self->{'keep_urls'} = undef;
+    # TODO: Not a fatal error if $keep_urls_file can't be found: it just means all records
+    # in dump.txt will be processed?
+    }
     else {
         #$self->{'keep_urls'} = $self->parse_keep_urls_file($keep_urls_file, $outhandle);
         #$self->{'keep_urls'} = {};
         $self->parse_keep_urls_file($keep_urls_file, $outhandle, $failhandle);
+    }
+    #$self->{'keep_urls'} = $self->parse_keep_urls_file($keep_urls_file, $outhandle);
+    #$self->{'keep_urls'} = {};
+    $self->parse_keep_urls_file($keep_urls_file, $outhandle, $failhandle);
+    }
     #if(defined $self->{'keep_urls'}) {
     #   print STDERR "@@@@ keep_urls hash map contains:\n";
     #   map { print STDERR $_."=>".$self->{'keep_urls'}->{$_}."\n"; } keys %{$self->{'keep_urls'}};
     #}
+}
 …
     my $self = shift(@_);
     my ($filename) = @_;
     my $can_process_return_val = $self->SUPER::can_process_this_file(@_);
     # We want to load in the keep_urls_file and create the keep_urls hashmap only once, during import
     # Because the keep urls file can be large and it and the hashmap serve no purpose during buildcol.pl.
     # Check whether we've already processed the file/built the hashmap, as we don't want to do this
     # more than 1 time even within just the import cycle.
     if($can_process_return_val && !$self->{'keep_urls_processed'}) { #!defined $self->{'keep_urls'}) {
         $self->setup_keep_urls();
+    }
     return $can_process_return_val;
+    my $can_process_return_val = $self->SUPER::can_process_this_file(@_);
+    # We want to load in the keep_urls_file and create the keep_urls hashmap only once, during import
+    # Because the keep urls file can be large and it and the hashmap serve no purpose during buildcol.pl.
+    # Check whether we've already processed the file/built the hashmap, as we don't want to do this
+    # more than 1 time even within just the import cycle.
+    if($can_process_return_val && !$self->{'keep_urls_processed'}) { #!defined $self->{'keep_urls'}) {
+    $self->setup_keep_urls();
+    }
+    return $can_process_return_val;
+}
 …
 sub parse_keep_urls_file {
     my $self = shift (@_);
+    my ($urls_file, $outhandle, $failhandle) = @_;
+    my ($urls_file, $outhandle, $failhandle) = @_;
+    # https://www.caveofprogramming.com/perl-tutorial/perl-hashes-a-guide-to-associative-arrays-in-perl.html
+    # https://stackoverflow.com/questions/1817394/whats-the-difference-between-a-hash-and-hash-reference-in-perl
+    $self->{'keep_urls'} = {}; # hash reference init to {}
+    # https://www.caveofprogramming.com/perl-tutorial/perl-hashes-a-guide-to-associative-arrays-in-perl.html
+    # https://stackoverflow.com/questions/1817394/whats-the-difference-between-a-hash-and-hash-reference-in-perl
+    #my %urls_map = (); # hash init to ()
+    $self->{'keep_urls'} = {}; # hash reference init to {}
+    # What if it is a very long file of URLs? Need to read a line at a time!
+    #my $contents = &FileUtils::readUTF8File($urls_file); # could just call $self->read_file() inherited from SplitTextFile's parent ReadTextFile
+    #my @lines = split(/(?:\r?\n)+/, $$textref);
+    # Open the file in UTF-8 mode https://stackoverflow.com/questions/2220717/perl-read-file-with-encoding-method
+    # and read in line by line into map
+    my $fh;
+    if (open($fh,'<:encoding(UTF-8)', $urls_file)) {
+        while (defined (my $line = <$fh>)) {
+            $line = &util::trim($line); #$line =~ s/^\s+|\s+$//g; # trim whitespace
+            if($line =~ m@^https?://@) { # add only URLs
+                #%urls_map{$line} = 1; # add the url to our perl hash
+                $self->{'keep_urls'}->{$line} = 1;
+            }
+        }
+        close $fh;
+    } else {
+        my $msg = "NutchTextDumpPlugin ERROR: Unable to open file keep_urls_file: \"" .
+            $self->{'keep_urls_file'} . "\".\n " .
+            "    No records will be filtered.\n";
+        print $outhandle $msg;
+        print $failhandle $msg;
+        # Not fatal. TODO: should it be fatal when it can still process all URLs just because
+        # it can't find the specified keep-urls.txt file?
+    # What if it is a very long file of URLs? Need to read a line at a time!
+    #my $contents = &FileUtils::readUTF8File($urls_file); # could just call $self->read_file() inherited from SplitTextFile's parent ReadTextFile
+    #my @lines = split(/(?:\r?\n)+/, $$textref);
+    # Open the file in UTF-8 mode https://stackoverflow.com/questions/2220717/perl-read-file-with-encoding-method
+    # and read in line by line into map
+    my $fh;
+    if (open($fh,'<:encoding(UTF-8)', $urls_file)) {
+    while (defined (my $line = <$fh>)) {
+        $line = &util::trim($line); #$line =~ s/^\s+|\s+$//g; # trim whitespace
+        if($line =~ m@^https?://@) { # add only URLs
+        $self->{'keep_urls'}->{$line} = 1; # add the url to our perl hash
+        }
+    }
+    # if keep_urls hash is empty, ensure it is undefined from this point onward
+    # https://stackoverflow.com/questions/9444915/how-to-check-if-a-hash-is-empty-in-perl
+    my %urls_map = %{$self->{'keep_urls'}};
+    if(!keys %urls_map) {
+        $self->{'keep_urls'} = undef;
+    }
+    #return %urls_map;
+}
+    close $fh;
+    } else {
+    my $msg = "NutchTextDumpPlugin ERROR: Unable to open file keep_urls_file: \"" .
+        $self->{'keep_urls_file'} . "\".\n " .
+        "    No records will be filtered.\n";
+    print $outhandle $msg;
+    print $failhandle $msg;
+    # Not fatal. TODO: should it be fatal when it can still process all URLs just because
+    # it can't find the specified keep-urls.txt file?
+    }
+    # if keep_urls hash is empty, ensure it is undefined from this point onward
+    # https://stackoverflow.com/questions/9444915/how-to-check-if-a-hash-is-empty-in-perl
+    my %urls_map = %{$self->{'keep_urls'}};
+    if(!keys %urls_map) {
+    $self->{'keep_urls'} = undef;
+    }
+}
+# Accept "dump.txt" files (which are in numeric siteID folders),
+# and txt files with numeric siteID, e.g. "01441.txt"
+# if I preprocessed dump.txt files by renaming them this way.
 sub get_default_process_exp {
     my $self = shift (@_);
     return q^(?i)((dump|\d+)\.txt)$^;
+}
 …
 sub get_default_split_exp {
     # prev line is either a new line or start of dump.txt
     # current line should start with url protocol and contain " key: .... http(s)/"
     # \r\n for msdos eol, \n for unix
+    #return q^($|\r?\n)https?://\w+\s+key:\s+\w+https?/^;
+    #return q^\r?\n(text:end:|metadata _csh_ :)\r?\n\r?\n^;
+    #return q^(\r?\n)*https?://\w+\s+key:\s+\w+https?/\s*\r?\n^;
+    #return q^(?:$|\r?\n\r?\n)(https?://.+?\skey:\s+.*?https?/)^;
+    #return q^($|\r?\n\r?\n)https?://^;
+    #return q^\r?\n(text:end:)\r?\n\r?\n^;
+    # return q^\r?\n\s*\r?\n|\[\w+\]Record type: USmarc^;
+    # split by default throws away delimiter
+    # The regex return value of this method is passed into a call to perl split.
+    # Perl's split(), by default throws away delimiter
     # Any capturing group that makes up or is part of the delimiter becomes a separate element returned by split
     # We want to throw away the empty newlines preceding the first line of a record "https? .... key: https?/"
 …
     #    https://stackoverflow.com/questions/14907772/split-but-keep-delimiter
     #   - To skip the unwanted empty lines preceding the first line of a record use ?: in front of its capture group
     #    to discard that group:
+    #    to discard that group:
     #    https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-in-regular-expressions
     #   - Next use a positive look-ahead (?= in front of capture group, vs ?! for negative look ahead)
     #    to match but not capture the first line of a record (so the look-ahead matched is retained as the
     #    first line of the next record):
+    #    to match but not capture the first line of a record (so the look-ahead matched is retained as the
+    #    first line of the next record):
     #    https://stackoverflow.com/questions/14907772/split-but-keep-delimiter
     #    and http://www.regular-expressions.info/lookaround.html
 …
     #    https://stackoverflow.com/questions/11898998/how-can-i-write-a-regex-which-matches-non-greedy
     return q^(?:$|\r?\n\r?\n)(?=https?://.+?\skey:\s+.*?https?/)^;
+}
+# TODO: COPIED METHOD STRAIGHT FROM MarcPlugin.pm - move to a utility perl file?
+}
+# TODO: Copied method from MARCPlugin.pm and uncommented return statement when encoding = utf8
+# Move to a utility perl file, since code is mostly shared?
 # The bulk of this function is based on read_line in multiread.pm
 # Unable to use read_line original because it expects to get its input
 # from a file.  Here the line to be converted is passed in as a string
+# TODO:
+# Is this function even applicable to NutchTextDumpPlugin?
+# I get errors in this method when encoding is utf-8 in the decode step.
+# I get warnings/errors somewhere in this file (maybe also at decode) when encoding is windows-1252.
 sub to_utf8
 …
     if ($encoding eq "utf8") {
     # nothing needs to be done
     #return $line;
     } elsif ($encoding eq "iso_8859_1") {
+    return $line;
+    } elsif ($encoding eq "iso_8859_1" || $encoding eq "windows-1252") { # TODO: do this also for windows-1252?
     # we'll use ascii2utf8() for this as it's faster than going
     # through convert2unicode()
 …
     } else {
     # everything else uses unicode::convert2unicode
     $line = &unicode::unicode2utf8 (&unicode::convert2unicode ($encoding, \$line));
+    # everything else uses unicode::convert2unicode
+    $line = &unicode::unicode2utf8 (&unicode::convert2unicode ($encoding, \$line));
+    }
     # At this point $line is a binary byte string
 …
     # Unicode aware pattern matching can be used.
     # For instance: 's/\x{0101}//g' or '[[:upper:]]'
     return decode ("utf8", $line);
+}
 …
     my $self = shift (@_);
     my ($textref, $pluginfo, $base_dir, $file, $metadata, $doc_obj, $gli) = @_;
     my $outhandle = $self->{'outhandle'};
     my $filename = &util::filename_cat($base_dir, $file);
     my $cursection = $doc_obj->get_top_section();
     #print STDERR "---------------\nDUMP.TXT\n---------\n", $$textref, "\n------------------------\n";
+    # (1) parse out the metadata of this record
+    my $metaname;
+    my $encoding;
+    my $title_meta;
+    my $line_index = 0;
+    my $text_start_index = -1;
+    my @lines = split(/(?:\r?\n)+/, $$textref);
+    foreach my $line (@lines) {
+        # first line is special and contains the URL (no metaname)
+        # and the inverted URL labelled with metaname "key"
+        if($line =~ m/^https?/ && $line =~ m/\s+key:\s+/) {
+            my @vals = split(/key:/, $line);
+            my $url = $vals[0];
+            my $key = $vals[1];
+            # trim whitespace https://perlmaven.com/trim
+            $url = &util::trim($url); #=~ s/^\s+|\s+$//g;
+            $key = &util::trim($key); #=~ s/^\s+|\s+$//g;
+            # if we have a keep_urls hash, then only process records of whitelisted urls
+            if(defined $self->{'keep_urls'} && !$self->{'keep_urls'}->{$url}) {
+                # URL not whitelisted, so stop processing this record
+                print STDERR "@@@@@@ INFO NutchTextDumpPlugin::process(): discarding record for URL not whitelisted: $url\n"
+                    if $self->{'verbosity'} > 3;
+                return 0;
+            } else {
+                print STDERR "@@@@@@ INFO NutchTextDumpPlugin::process(): processing record of whitelisted URL $url...\n"
+                    if $self->{'verbosity'} > 3;
+            }
+            $doc_obj->add_utf8_metadata ($cursection, "ex.srcURL", $url);
+            $doc_obj->add_utf8_metadata ($cursection, "ex.key", $key);
+            # # let's also set the domain from the URL, as that will make a
+            # # more informative bookshelf label than siteID
+            # my $domain = $url;
+            # # remove protocol:// and everything after and including subsequent slash
+            # $domain =~ s@^https?://([^/]+).*@$1@;
+            # #$domain =~ s@^https?://@@; # remove protocol
+            # #$domain =~ s@/.*$@@; # now remove everything after first slash
+            # my $protocol = $url;# =~ s@(^https?).*$@@;
+            # $protocol =~ s@(^https?).*$@$1@;
+            # $domain = $protocol."://".$domain;
+            # #$domain =~ s@[\.\-]@@g;
+            # #$domain = "pinky";
+            # $doc_obj->add_utf8_metadata ($cursection, "ex.srcDomain", $domain);
+            # let's also set the domain from the URL, as that will make a
+            # more informative bookshelf label than siteID
+            # For complete domain, keep protocol:// and every non-slash after,
+            # without requiring presence of subsequent slash
+            # https://stackoverflow.com/questions/3652527/match-regex-and-assign-results-in-single-line-of-code
+            # Can clean up protocol and www. in bookshelf's remove_prefix option
+            my ($domain, $basicDomain) = $url =~ m@(^https?://(?:www\.)?([^/]+)).*@;
+            # For domain, the following removes protocol:// and
+            # everything after and including subsequent slash, without requiring subsequent slash
+            #my ($domain, $protocol, $basicdomain) = $url =~ m@((^https?)://([^/]+)).*@; # Works
+            #my ($protocol, $basicdomain) = $url =~ m@(^https?)://([^/]+).*@; # Should work
+            #my $domain = $protocol."://".$basicdomain;
+            $doc_obj->add_utf8_metadata ($cursection, "ex.srcDomain", $domain);
+            $doc_obj->add_utf8_metadata ($cursection, "ex.basicDomain", $basicDomain);
+    # (1) parse out the metadata of this record
+    my $metaname;
+    my $encoding;
+    my $title_meta;
+    my $line_index = 0;
+    my $text_start_index = -1;
+    my @lines = split(/(?:\r?\n)+/, $$textref);
+    foreach my $line (@lines) {
+    # first line is special and contains the URL (no metaname)
+    # and the inverted URL labelled with metaname "key"
+    if($line =~ m/^https?/ && $line =~ m/\s+key:\s+/) {
+        my @vals = split(/key:/, $line);
+        # get url and key, and trim whitespace simultaneously
+        my $url = &util::trim($vals[0]);
+        my $key = &util::trim($vals[1]);
+        # if we have a keep_urls hash, then only process records of whitelisted urls
+        if(defined $self->{'keep_urls'} && !$self->{'keep_urls'}->{$url}) {
+        # URL not whitelisted, so stop processing this record
+        print STDERR "@@@@@@ INFO NutchTextDumpPlugin::process(): discarding record for URL not whitelisted: $url\n"
+            if $self->{'verbosity'} > 3;
+        return 0;
+        } else {
+        print STDERR "@@@@@@ INFO NutchTextDumpPlugin::process(): processing record of whitelisted URL $url...\n"
+            if $self->{'verbosity'} > 3;
+        }
+        $doc_obj->add_utf8_metadata ($cursection, "ex.srcURL", $url);
+        $doc_obj->add_utf8_metadata ($cursection, "ex.key", $key);
+        # let's also set the domain from the URL, as that will make a
+        # more informative bookshelf label than siteID
+        # For complete domain, keep protocol:// and every non-slash after.
+        # (This avoids requiring presence of subsequent slash)
+        # https://stackoverflow.com/questions/3652527/match-regex-and-assign-results-in-single-line-of-code
+        # Can clean up protocol and www. in List classifier's bookshelf's remove_prefix option
+        # or can build classifier on basicDomain instead.
+        my ($domain, $basicDomain) = $url =~ m@(^https?://(?:www\.)?([^/]+)).*@;
+        #my ($domain, $protocol, $basicdomain) = $url =~ m@((^https?)://([^/]+)).*@; # Works
+        $doc_obj->add_utf8_metadata ($cursection, "ex.srcDomain", $domain);
+        $doc_obj->add_utf8_metadata ($cursection, "ex.basicDomain", $basicDomain);
+    }
+    # check for full text
+    elsif ($line =~ m/text:start:/) {
+        $text_start_index = $line_index;
+        last; # if we've reached the full text portion, we're past the metadata portion of this record
+    }
+    elsif($line =~ m/^[^:]+:.+$/) { # look for meta #elsif($line =~ m/^[^:]+:[^:]+$/) { # won't allow protocol://url in metavalue
+        my @metakeyvalues = split(/:/, $line); # split on first :
+        my $metaname = shift(@metakeyvalues);
+        my $metavalue = join("", @metakeyvalues);
+        # skip "metadata _rs_" and "metadata _csh_" as these contain illegible characters for values
+        if($metaname !~ m/metadata\s+_(rs|csh)_/) {
+        # trim whitespace
+        $metaname = &util::trim($metaname);
+        $metavalue = &util::trim($metavalue);
+        if($metaname eq "title") { # TODO: what to do about "title: null" cases?
+            ##print STDERR "@@@@ Found title: $metavalue\n";
+            #$metaname = "Title"; # will set "title" as "Title" metadata instead
+            # TODO: treat title metadata specially by using character encoding to store correctly?
+            # Won't add Title metadata to docObj until after all meta is processed,
+            # when we'll know encoding and can process title meta
+            $title_meta = $metavalue;
+            $metavalue = ""; # will force ex.Title metadata to be added AFTER for loop
+        }
+        # check for full text
+        elsif ($line =~ m/text:start:/) {
+            $text_start_index = $line_index;
+            last; # if we've reached the full text portion, we're past the metadata portion of this record
+        elsif($metaname =~ m/CharEncodingForConversion/) { # TODO: or look for "OriginalCharEncoding"?
+            ##print STDERR "@@@@ Found encoding: $metavalue\n";
+            $encoding = $metavalue; # TODO: should we use this to interpret the text and title in the correct encoding and convert to utf-8?
+            if($encoding eq "utf-8") {
+            $encoding = "utf8"; # method to_utf8() recognises "utf8" not "utf-8"
+            } else {
+            print STDERR "@@@@@@ WARNING NutchTextDumpPlugin::process(): Record's Nutch-assigned CharEncodingForConversion was not utf-8: $encoding\n";
+            }
+        }
+        elsif($line =~ m/^[^:]+:.+$/) { # look for meta #elsif($line =~ m/^[^:]+:[^:]+$/) { # won't allow protocol://url in metavalue
+            my @metakeyvalues = split(/:/, $line);
+            #my $metaname = $metakeyvalues[0];
+            #my $metavalue = $metakeyvalues[1];
+            my $metaname = shift(@metakeyvalues);
+            my $metavalue = join("", @metakeyvalues);
+            # skip "metadata _rs_" and "metadata _csh_" as these contain illegible characters for values
+            if($metaname !~ m/metadata\s+_(rs|csh)_/) {
+                # trim whitespace
+                $metaname = &util::trim($metaname); #=~ s/^\s+|\s+$//g;
+                $metavalue = &util::trim($metavalue); #=~ s/^\s+|\s+$//g;
+                if($metaname eq "title") { # TODO: what to do about "title: null" cases?
+                    ##print STDERR "@@@@ Found title: $metavalue\n";
+                    #$metaname = "Title"; # set this as ex.Title metadata
+                    # TODO: treat title metadata specially by using character encoding to store correctly?
+                    # won't add Title metadata to docObj until after all meta is processed, when we'll know encoding and can process title meta
+                    $title_meta = $metavalue;
+                    $metavalue = "";
+                }
+                elsif($metaname =~ m/CharEncodingForConversion/) { # TODO: or look for "OriginalCharEncoding"?
+                    ##print STDERR "@@@@ Found encoding: $metavalue\n";
+                    $encoding = $metavalue; # TODO: should we use this to interpret the text and title in the correct encoding and convert to utf-8?
+                    if($encoding eq "utf-8") {
+                        $encoding = "utf8"; # method to_utf8() recognises "utf8" not "utf-8"
+                    } else {
+                        print STDERR "@@@@@@ WARNING NutchTextDumpPlugin::process(): Record's Nutch-assigned CharEncodingForConversion was not utf-8: $encoding\n";
+                    }
+                }
+                # move occurrences of "marker " or "metadata " strings at start of metaname to end
+                #$metaname =~ s/^(marker|metadata)\s+(.*)$/$2$1/;
+                # remove "marker " or "metadata " strings from start of metaname
+                $metaname =~ s/^(marker|metadata)\s+//;
+                # remove underscores and all remaining spaces in metaname
+                $metaname =~ s/[ _]//g;
+                # add meta to docObject if both metaname and metavalue are non-empty strings
+                if($metaname ne "" && $metavalue ne "") { # && $metaname ne "rs" && $metaname ne "csh") {
+                    $doc_obj->add_utf8_metadata ($cursection, "ex.".$metaname, $metavalue);
+                    #print STDERR "Added meta |$metaname| = |$metavalue|\n"; #if $metaname =~ m/ProtocolStatus/i;
+                }
+            }
+        } elsif ($line !~ m/^\s*$/) { # Not expecting any other type of non-empty line (or even empty lines)
+            print STDERR "NutchTextDump line not recognised as URL meta, other metadata or text content:\n\t$line\n";
+        # move occurrences of "marker " or "metadata " strings at start of metaname to end
+        #$metaname =~ s/^(marker|metadata)\s+(.*)$/$2$1/;
+        # remove "marker " or "metadata " strings from start of metaname
+        $metaname =~ s/^(marker|metadata)\s+//;
+        # remove underscores and all remaining spaces in metaname
+        $metaname =~ s/[ _]//g;
+        # add meta to docObject if both metaname and metavalue are non-empty strings
+        if($metaname ne "" && $metavalue ne "") { # && $metaname ne "rs" && $metaname ne "csh") {
+            $doc_obj->add_utf8_metadata ($cursection, "ex.".$metaname, $metavalue);
+            #print STDERR "Added meta |$metaname| = |$metavalue|\n"; #if $metaname =~ m/ProtocolStatus/i;
+        }
+        $line_index++;
+        }
+    } elsif ($line !~ m/^\s*$/) { # Not expecting any other type of non-empty line (or even empty lines)
+        print STDERR "NutchTextDump line not recognised as URL meta, other metadata or text content:\n\t$line\n";
+    }
+    $line_index++;
+    }
     # Add fileFormat as the metadata
     $doc_obj->add_metadata($cursection, "FileFormat", "NutchDumpTxt");
+    # Correct title metadata using encoding, if we have $encoding at last
+    # $title_meta = $self->to_utf8($encoding, $title_meta) if $encoding;
+    # https://stackoverflow.com/questions/12994100/perl-encode-pm-cannot-decode-string-with-wide-character
+    # Error message: "Perl Encode.pm cannot decode string with wide character"
+    # "That error message is saying that you have passed in a string that has already been decoded
+    # (and contains characters above codepoint 255). You can't decode it again."
+    if($title_meta && $title_meta ne "" && $title_meta ne "null") {
+        $title_meta = $self->to_utf8($encoding, $title_meta) if ($encoding && $encoding ne "utf8");
+    } else { # if we have "null" as title metadata, set it to the record URL?
+        #my $srcURLs = $doc_obj->get_metadata($cursection, "ex.srcURL");
+        #print STDERR "@@@@ null title to be replaced with ".$srcURLs->[0]."\n";
+        #$title_meta = $srcURLs->[0] if (scalar @$srcURLs > 0);
+        my $srcURL = $doc_obj->get_metadata_element($cursection, "srcURL", 1); # TODO: why does ex.srcURL not work, nor srcURL without 3rd param
+        if(defined $srcURL) {
+            print STDERR "@@@@ null/empty title to be replaced with ".$srcURL."\n"
+                if $self->{'verbosity'} > 3;
+            $title_meta = $srcURL;
+        }
+    # Correct title metadata using encoding, if we have $encoding at last
+    # $title_meta = $self->to_utf8($encoding, $title_meta) if $encoding;
+    # https://stackoverflow.com/questions/12994100/perl-encode-pm-cannot-decode-string-with-wide-character
+    # Error message: "Perl Encode.pm cannot decode string with wide character"
+    # "That error message is saying that you have passed in a string that has already been decoded
+    # (and contains characters above codepoint 255). You can't decode it again."
+    if($title_meta && $title_meta ne "" && $title_meta ne "null") {
+    $title_meta = $self->to_utf8($encoding, $title_meta) if ($encoding);
+    } else { # if we have "null" as title metadata, set it to the record URL?
+    #my $srcURLs = $doc_obj->get_metadata($cursection, "ex.srcURL");
+    #print STDERR "@@@@ null title to be replaced with ".$srcURLs->[0]."\n";
+    #$title_meta = $srcURLs->[0] if (scalar @$srcURLs > 0);
+    my $srcURL = $doc_obj->get_metadata_element($cursection, "srcURL", 1); # TODO: why does ex.srcURL not work, nor srcURL without 3rd param
+    if(defined $srcURL) {
+        print STDERR "@@@@ null/empty title to be replaced with ".$srcURL."\n"
+        if $self->{'verbosity'} > 3;
+        $title_meta = $srcURL;
+    }
+    $doc_obj->add_utf8_metadata ($cursection, "Title", $title_meta);
+    }
+    $doc_obj->add_utf8_metadata ($cursection, "Title", $title_meta);
+    # When importOption OIDtype = dirname, the base_OID will be that dirname
+    # which was crafted to be the siteID. However, because our siteID is all numeric,
+    # a D gets prepended to create baseOID. Remove the starting 'D' to get actual siteID.
     my $siteID = $self->get_base_OID($doc_obj);
     #print STDERR "BASE OID: " . $self->get_base_OID($doc_obj) . "\n";
-    # remove the 'D' that was inserted by a superclass in front of the all-numeric siteID to become baseOID:
     $siteID =~ s/^D//;
     $doc_obj->add_utf8_metadata ($cursection, "ex.siteID", $siteID);
+    # (2) parse out text of this record
+    # if($text_start_index != -1 && pop(@lines) =~ m/text:end:/) { # we only have text content if there were "text:start:" and "text:end:" markers.
+                                                                # # TODO: are we guaranteed popped line is text:end: and not empty/newline?
+        # @lines = splice(@lines,0,$text_start_index+1); # just keep every line AFTER text:start:, have already removed (popped) "text:end:"
+        # # glue together remaining lines, if there are any, into textref
+        # # https://stackoverflow.com/questions/7406807/find-size-of-an-array-in-perl
+        # if(scalar (@lines) > 0) {
+            # # TODO: do anything with $encoding to convert line to utf-8?
+            # foreach my $line (@lines) {
+                # $line = $self->to_utf8($encoding, $line) if $encoding; #if $encoding ne "utf-8";
+                # $$textref .= $line."\n";
+            # }
+        # }
+        # $$textref = "<pre>\n".$$textref."</pre>";
+    # } else {
+        # print STDERR "WARNING: NutchTextDumpPlugin::process: had found a text start marker but not text end marker.\n");
+        # $$textref = "<pre></pre>";
+    # }
+    # (2) parse out text of this record
+    # if($text_start_index != -1 && pop(@lines) =~ m/text:end:/) { # we only have text content if there were "text:start:" and "text:end:" markers.
+    #   # TODO: are we guaranteed popped line is text:end: and not empty/newline?
+    #   @lines = splice(@lines,0,$text_start_index+1); # just keep every line AFTER text:start:, have already removed (popped) "text:end:"
+    my $no_text = 1;
+    if($text_start_index != -1) { # had found a "text:start:" marker, so we should have text content for this record
+        if($$textref =~ m/text:start:\r?\n(.*?)\r?\ntext:end:/) {
+            $$textref = $1;
+            if($$textref !~ m/^\s*$/) {
+                $$textref = $self->to_utf8($encoding, $$textref) if ($encoding && $encoding ne "utf8");
+                $$textref = "<pre>\n".$$textref."\n</pre>";
+                $no_text = 0;
+            }
+        }
+    #   # glue together remaining lines, if there are any, into textref
+    #   # https://stackoverflow.com/questions/7406807/find-size-of-an-array-in-perl
+    #   if(scalar (@lines) > 0) {
+    #       # TODO: do anything with $encoding to convert line to utf-8?
+    #       foreach my $line (@lines) {
+    #       $line = $self->to_utf8($encoding, $line) if $encoding; #if $encoding ne "utf-8";
+    #       $$textref .= $line."\n";
+    #       }
+    #   }
+    #   $$textref = "<pre>\n".$$textref."</pre>";
+    # } else {
+    #   print STDERR "WARNING: NutchTextDumpPlugin::process: had found a text start marker but not text end marker.\n";
+    #   $$textref = "<pre></pre>";
+    # }
+    # (2) parse out text of this record
+    my $no_text = 1;
+    if($text_start_index != -1) { # had found a "text:start:" marker, so we should have text content for this record
+    if($$textref =~ m/text:start:\r?\n(.*?)\r?\ntext:end:/) {
+        $$textref = $1;
+        if($$textref !~ m/^\s*$/) {
+        $$textref = $self->to_utf8($encoding, $$textref) if ($encoding);
+        $$textref = "<pre>\n".$$textref."\n</pre>";
+        $no_text = 0;
+        }
+    }
+    if($no_text) {
+        $$textref = "<pre></pre>";
+    }
+        # Debugging
+        # To avoid "wide character in print" messages for debugging, set binmode of handle to utf8/encoding
+    # https://stackoverflow.com/questions/15210532/use-of-use-utf8-gives-me-wide-character-in-print
+        # if ($self->{'verbosity'} > 3) {
+    #     if($encoding && $encoding eq "utf8") {
+    #   binmode STDERR, ':utf8';
+    #     }
+    #     print STDERR "TITLE: $title_meta\n";
+    #     print STDERR "ENCODING = $encoding\n" if $encoding;
+    #     #print STDERR "---------------\nTEXT CONTENT\n---------\n", $$textref, "\n------------------------\n";
+    # }
+    }
+    if($no_text) {
+    $$textref = "<pre></pre>";
+    }
+    # Debugging
+    # To avoid "wide character in print" messages for debugging, set binmode of handle to utf8/encoding
+    # https://stackoverflow.com/questions/15210532/use-of-use-utf8-gives-me-wide-character-in-print
+    # if ($self->{'verbosity'} > 3) {
+    #     if($encoding && $encoding eq "utf8") {
+    #   binmode STDERR, ':utf8';
+    #     }
+    #     print STDERR "TITLE: $title_meta\n";
+    #     print STDERR "ENCODING = $encoding\n" if $encoding;
+    #     #print STDERR "---------------\nTEXT CONTENT\n---------\n", $$textref, "\n------------------------\n";
+    # }
     $doc_obj->add_utf8_text($cursection, $$textref);
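
The Stack Overflow note quoted in the comments above says that a string already containing characters above codepoint 255 has been decoded once and cannot be decoded again. A minimal sketch of a guard built on that observation follows. The helper name decode_if_bytes is hypothetical: it is not part of NutchTextDumpPlugin.pm and not how to_utf8() is known to work. It also cannot detect an already-decoded string whose characters all fall at or below codepoint 255.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (an assumption for illustration, not plugin code):
# decode $text from $encoding only if it can still be a byte string.
sub decode_if_bytes {
    my ($encoding, $text) = @_;
    return $text unless defined $text && $encoding;
    # Raw bytes never exceed codepoint 255, so any higher character means
    # the string was already decoded; return it untouched.
    return $text if $text =~ /[^\x00-\xFF]/;
    return decode($encoding, $text);
}

my $bytes   = "M\xC4\x81ori";                      # UTF-8 bytes for "Māori"
my $decoded = decode_if_bytes("UTF-8", $bytes);    # decoded to 5 characters
my $again   = decode_if_bytes("UTF-8", $decoded);  # second call is a no-op
print length($decoded), " ", ($again eq $decoded ? "same" : "different"), "\n";
# prints "5 same"
```

A caller could apply such a guard to both the title metadata and the record text before add_utf8_metadata / add_utf8_text, instead of testing the encoding name at each call site.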
