Changeset 34121 for main


Timestamp:
2020-05-25T23:53:29+12:00
Author:
ak19
Message:
  1. Introducing NutchTextDumpPlugin to process the records (representing web pages' text content) of the dump.txt files produced for each website crawled by Nutch. Created for handling the commoncrawl URLs of interest that we recrawled with Nutch. This first version does everything, but the code requires more cleaning up.
  2. Also added a useful util::trim() function as I kept reusing the same code several times.
Location:
main/trunk/greenstone2/perllib
Files:
1 added
2 edited

  • main/trunk/greenstone2/perllib/strings.properties

    r33900 r34121  
    @@ -1141,4 +1141,6 @@
     MetadataXMLPlugin.desc:Plugin that processes metadata.xml files.
     
    +NutchTextDumpMARCXMLPlugin.keep_urls_file:File path or name of optional whitelist file containing one URL per line, whose records are to be retained when processing each url's record in the dump.txt files produced by nutch per website. Those records whose URLs are not listed in the file will be discarded. For relative paths, the plugin will look for the file in the collection's etc directory.
    +
     GreenstoneMETSPlugin.desc:Process Greenstone-style METS documents
     
  • main/trunk/greenstone2/perllib/util.pm

    r33757 r34121  
    @@ -1959,3 +1959,15 @@
     
     
    +# Useful String utility functions
    +# trim(str) removes whitespace at start and end of string parameter
    +sub trim {
    +    my ($str) = @_;
    +
    +    # trim whitespace at start and end of str
    +    # https://perlmaven.com/trim
    +    $str =~ s/^\s+|\s+$//g;
    +
    +    return $str;
    +}
    +
     1;
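
The following is a minimal sketch, not code from this changeset, of how a trim function like the one added to util.pm might be used when reading a whitelist of the kind described for keep_urls_file (one URL per line). The helper name read_keep_urls, the in-memory file, and the example URLs are all assumptions made for illustration. The plugin's actual implementation is not shown in this view and may differ.

```perl
use strict;
use warnings;

# Same logic as the util::trim() added in this changeset, copied here so the
# sketch runs without the Greenstone perllib on @INC.
sub trim {
    my ($str) = @_;
    $str =~ s/^\s+|\s+$//g;
    return $str;
}

# Hypothetical helper (not part of the changeset): read a whitelist with one
# URL per line from a filehandle and return a hash ref of the URLs to keep.
sub read_keep_urls {
    my ($fh) = @_;
    my %keep;
    while (my $line = <$fh>) {
        my $url = trim($line);
        next if $url eq '';    # skip blank lines
        $keep{$url} = 1;
    }
    return \%keep;
}

# In-memory stand-in for a keep_urls_file; the URLs are made up.
my $whitelist = "  https://example.org/a \n\nhttps://example.org/b\t\n";
open(my $fh, '<', \$whitelist) or die "Cannot open in-memory file: $!";
my $keep = read_keep_urls($fh);
close($fh);

# Records whose URL is not in the whitelist would be discarded.
foreach my $url ('https://example.org/a', 'https://example.org/c') {
    print "$url: ", ($keep->{$url} ? "keep" : "discard"), "\n";
}
```

Trimming each line before the lookup means stray spaces, tabs, or line endings in the whitelist file do not cause a listed URL to be missed.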