Timestamp:
2019-10-16T20:00:09+13:00
Author:
ak19
Message:
  1. batchcrawl.sh now does what it should have done from the start: it moves the log.out and UNFINISHED files into the output folder instead of leaving them in the input folder, since the input to_crawl folder can and does get replaced every time it is regenerated after black/white/greylisting more URLs.
  2. Blacklisted more adult sites; greylisted more product sites and .ru, .pl and .tk domains, with exceptions whitelisted in the whitelist file.
  3. CCWETProcessor now watches for additional adult sites based on URL, adds them to its blacklist in memory (not the blacklist file), and logs the domain so it can be checked and manually added to the blacklist file.
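CCWETProcessor itself is a Java class, but the URL-based check described in point 3 can be sketched in the shell register of this changeset. Everything below (the pattern list, the function name `check_url`, and the log file name) is illustrative, not the actual implementation:

```shell
#!/bin/sh
# Hypothetical sketch of the in-memory blacklisting described in point 3.
# Patterns, variable names and the log file are assumptions, not what
# CCWETProcessor really uses.
ADULT_PATTERNS="xxx porn adult"
EXTRA_BLACKLIST=""                      # held in memory only; never written to the blacklist file
DOMAIN_LOG="possible-adult-domains.log" # domains logged here for manual review

check_url() {
    url="$1"
    # crude domain extraction: strip the scheme, then everything after the first slash
    domain=$(printf '%s\n' "$url" | sed -e 's|^[a-z]*://||' -e 's|/.*||')
    for pat in $ADULT_PATTERNS; do
        case "$url" in
            *"$pat"*)
                # matched: blacklist in memory and log the domain for manual checking
                EXTRA_BLACKLIST="$EXTRA_BLACKLIST $domain"
                echo "$domain" >> "$DOMAIN_LOG"
                return 0
                ;;
        esac
    done
    return 1
}
```

The point of keeping the additions in memory is that the on-disk blacklist file stays a curated, human-reviewed list; automated matches only become permanent after someone inspects the logged domains.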
File:
1 edited

  • gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/batchcrawl.sh

    r33567 r33569

     74  74          echo "CRAWL FAILED." 2>&1 | tee -a ${siteDir}log.out
     75  75      fi
     76      -
         76  +
         77  +
         78  +    # move the peripheral crawl products (the log.out and UNFINISHED files)
         79  +    # from the input to the output folder. This way we can re-run the crawl and
         80  +    # the original output will still have been preserved
         81  +    mv ${siteDir}log.out $outputDir/$crawlId/log.out
         82  +    mv ${siteDir}UNFINISHED $outputDir/$crawlId/UNFINISHED
     77  83  }
     78  84
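One caveat with the unconditional `mv` calls in the diff: `mv` prints an error if a source file is absent (for instance if UNFINISHED was never written for a given site). A more defensive variant, using the same `${siteDir}`, `$outputDir` and `$crawlId` variables (the concrete values below are examples only), could look like this:

```shell
#!/bin/sh
# Defensive sketch of the moves added in this changeset: only move each
# peripheral file if it actually exists. The directory names are examples.
siteDir="to_crawl/site-00001/"
outputDir="crawl-output"
crawlId="site-00001"

move_peripherals() {
    mkdir -p "$outputDir/$crawlId"
    for f in log.out UNFINISHED; do
        if [ -e "${siteDir}${f}" ]; then
            mv "${siteDir}${f}" "$outputDir/$crawlId/$f"
        fi
    done
}

move_peripherals
```

Because the input to_crawl folder gets regenerated, moving (rather than copying) these files also keeps stale logs from a previous run out of the freshly generated input folder.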