Changeset 33574

Timestamp:
16.10.2019 23:35:45
Author:
ak19
Message:

If Nutch stores a crawled site in more than one file, cat all of them into dump.txt. So far each crawl has produced only one part file per site, but Nutch can create more.

Files:
1 modified

  • gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/batchcrawl.sh

    r33573 → r33574

    @@ -36,8 +36,9 @@
     
     # $siteDir parameter is the folder containing seedURLs.txt
    +crawl_cmd="./$CRAWL_COMMAND $siteDir $crawlId $CRAWL_ITERATIONS"
    +
    +# Since we're going to crawl from scratch, create log.out file
    +# Logging to terminal and log file simultaenously
     # https://stackoverflow.com/questions/418896/how-to-redirect-output-to-a-file-and-stdout
    -crawl_cmd="./$CRAWL_COMMAND $siteDir $crawlId $CRAWL_ITERATIONS"
    -
    -# Since we're going to crawl from scratch, create log.out file
     echo "Going to run nutch crawl command (and copy output to ${siteDir}log.out):" 2>&1 | tee ${siteDir}log.out
     # append to log.out file hereafter

    @@ -69,5 +70,5 @@
     ./$NUTCH_COMMAND readdb -dump $outputDir/$crawlId -text -crawlId $crawlId
     ./$NUTCH_COMMAND readdb -stats -crawlId $crawlId > $outputDir/$crawlId/stats
    -cat $outputDir/$crawlId/part-r-00000 > $outputDir/$crawlId/dump.txt
    +cat $outputDir/$crawlId/part-r-* > $outputDir/$crawlId/dump.txt
     else
     # appending to log.out
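The core of the fix is the last hunk: Nutch's readdb -dump (a Hadoop MapReduce job) may emit several part-r-NNNNN files, and the old line only captured part-r-00000. A minimal, self-contained sketch of why the glob form is the safe choice (the directory layout and file contents below are simulated, not real Nutch output; only the cat line mirrors the script):

```shell
#!/bin/sh
# Simulate a readdb -dump output directory with more than one part file.
outputDir=$(mktemp -d)
crawlId=site00001
mkdir -p "$outputDir/$crawlId"
printf 'page A\n' > "$outputDir/$crawlId/part-r-00000"
printf 'page B\n' > "$outputDir/$crawlId/part-r-00001"

# Old behaviour: only the first part file would reach dump.txt.
# New behaviour: the glob matches every part file; shell pathname
# expansion sorts matches, so parts are concatenated in order.
cat "$outputDir/$crawlId"/part-r-* > "$outputDir/$crawlId/dump.txt"

cat "$outputDir/$crawlId/dump.txt"
```

With a single part file the two forms behave identically, which is why the bug stayed invisible until a multi-part dump appeared.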