Changeset 34131 for main

Timestamp: 2020-05-30T15:18:25+12:00
Author: ak19
Message:

Allowing input keep-urls-file to contain a comma followed by country code at end, as that's the sort of URLs file I want for the newest commoncrawl collection. The URLs file is the one at http://trac.greenstone.org/browser/other-projects/maori-lang-detection/mongodb-data-auto/isMRI_full_manualList_globalDomains_whereAPageContainsMRI.txt

File: 1 edited

  • main/trunk/greenstone2/perllib/plugins/NutchTextDumpPlugin.pm

    r34130  r34131
    329     329         my $fh;
    330     330         if (open($fh,'<:encoding(UTF-8)', $urls_file)) {
    331                 while (defined (my $line = <$fh>)) {           
            331         while (defined (my $line = <$fh>)) {
    332     332             $line = &util::trim($line); #$line =~ s/^\s+|\s+$//g; # trim whitespace
    333                     if($line =~ m@^https?://@) { # add only URLs       
            333
            334             if($line =~ m@^https?://@) { # add only URLs
            335             # remove any ",COUNTRYCODE" at end
            336             # country code can be NZ but also UNKNOWN, so not 2 chars
            337             $line =~ s/,[A-Z]+$//;
            338             #print STDERR "LINE: |$line|\n";
    334     339             $self->{'keep_urls'}->{$line} = 1; # add the url to our perl hash
    335     340             }
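
The effect of the added substitution can be tried outside the plugin with a small standalone sketch. Everything below is illustrative rather than taken from NutchTextDumpPlugin.pm: the helper name normalise_keep_url and the sample lines are hypothetical, and the inline trim stands in for &util::trim, whose exact behaviour is assumed to match the commented-out regex in the diff.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of NutchTextDumpPlugin.pm): trims a line,
# rejects anything that is not an http(s) URL, and removes one trailing
# ",COUNTRYCODE" suffix using the same substitution as the changeset.
sub normalise_keep_url {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;                 # stand-in for &util::trim (assumed equivalent)
    return unless $line =~ m@^https?://@;    # add only URLs
    $line =~ s/,[A-Z]+$//;                   # code can be NZ but also UNKNOWN, so not 2 chars
    return $line;
}

# Made-up sample lines, for illustration only
my @sample = (
    "https://example.org/page,NZ",
    "http://example.com/a,b.html,UNKNOWN",
    "https://example.net/plain",
    "# a comment line",
);

my %keep_urls;
for my $raw (@sample) {
    my $url = normalise_keep_url($raw);
    $keep_urls{$url} = 1 if defined $url;    # mirrors $self->{'keep_urls'}->{$line} = 1
}

print "$_\n" for sort keys %keep_urls;
```

Because the pattern is anchored at the end of the line and matches only uppercase letters, commas elsewhere in the URL are left alone. A URL that itself ends in a comma followed by uppercase letters would also lose that tail, which may be worth keeping in mind when preparing the URLs file.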