Changeset 33574


Timestamp: 2019-10-16T23:35:45+13:00 (5 years ago)
Author: ak19
Message:

If nutch stores a crawled site in more than one file, then cat all of them into dump.txt. So far, nutch has produced only one file per crawled site, but it is possible for it to create more.
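As a sketch of the behaviour this commit fixes: nutch's dump step can emit several `part-r-*` files, and globbing catches all of them where naming `part-r-00000` caught only the first. The directory and crawl id below are hypothetical stand-ins for the script's `$outputDir` and `$crawlId`.

```shell
#!/bin/sh
# Hypothetical setup standing in for nutch's readdb -dump output,
# which may produce part-r-00000, part-r-00001, ...
outputDir="./crawl-output"   # assumed path, stands in for $outputDir
crawlId="site00001"          # assumed crawl id, stands in for $crawlId

mkdir -p "$outputDir/$crawlId"
printf 'first part\n'  > "$outputDir/$crawlId/part-r-00000"
printf 'second part\n' > "$outputDir/$crawlId/part-r-00001"

# The r33574 approach: the glob concatenates every part file,
# not just part-r-00000, into a single dump.txt
cat "$outputDir/$crawlId"/part-r-* > "$outputDir/$crawlId/dump.txt"
```

The old `cat .../part-r-00000` would silently drop any second file; the glob keeps the one-file case identical while handling the multi-file case.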

File: 1 edited

  • gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/batchcrawl.sh

    r33573 r33574

     36   36
     37   37       # $siteDir parameter is the folder containing seedURLs.txt
          38   +   crawl_cmd="./$CRAWL_COMMAND $siteDir $crawlId $CRAWL_ITERATIONS"
          39   +
          40   +   # Since we're going to crawl from scratch, create log.out file
          41   +   # Logging to terminal and log file simultaneously
     38   42       # https://stackoverflow.com/questions/418896/how-to-redirect-output-to-a-file-and-stdout
     39        -   crawl_cmd="./$CRAWL_COMMAND $siteDir $crawlId $CRAWL_ITERATIONS"
     40        -
     41        -   # Since we're going to crawl from scratch, create log.out file
     42   43       echo "Going to run nutch crawl command (and copy output to ${siteDir}log.out):" 2>&1 | tee ${siteDir}log.out
     43   44       # append to log.out file hereafter

     69   70       ./$NUTCH_COMMAND readdb -dump $outputDir/$crawlId -text -crawlId $crawlId
     70   71       ./$NUTCH_COMMAND readdb -stats -crawlId $crawlId > $outputDir/$crawlId/stats
     71        -   cat $outputDir/$crawlId/part-r-00000 > $outputDir/$crawlId/dump.txt
          72   +   cat $outputDir/$crawlId/part-r-* > $outputDir/$crawlId/dump.txt
     72   73   else
     73   74   else
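The logging pattern moved around in this diff (write the first line with tee, then append thereafter) can be sketched as follows; the log path is an illustrative stand-in for `${siteDir}log.out`.

```shell
#!/bin/sh
# Illustrative sketch of the tee-based logging in the diff above:
# the first tee creates/truncates log.out, later writes use tee -a
# to append, so output reaches both the terminal and the file.
log="./log.out"   # assumed path, stands in for ${siteDir}log.out

echo "Going to run nutch crawl command (and copy output to $log):" 2>&1 | tee "$log"
# append to log file hereafter
echo "crawl step 1 output" 2>&1 | tee -a "$log"
```

Without `-a` on the later calls, each tee would truncate the log and only the last line would survive.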