Changeset 33574

Timestamp:
16.10.2019 23:35:45
Author:
ak19
Message:

If Nutch stores a crawled site in more than one file, cat all of them into dump.txt. So far each crawl has produced only one part file per site, but Nutch can create more.

Files:
1 modified

  • gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/batchcrawl.sh

    r33573 → r33574

    @@ -36,8 +36,9 @@
     
     # $siteDir parameter is the folder containing seedURLs.txt
    +crawl_cmd="./$CRAWL_COMMAND $siteDir $crawlId $CRAWL_ITERATIONS"
    +
    +# Since we're going to crawl from scratch, create log.out file
    +# Logging to terminal and log file simultaenously
     # https://stackoverflow.com/questions/418896/how-to-redirect-output-to-a-file-and-stdout
    -crawl_cmd="./$CRAWL_COMMAND $siteDir $crawlId $CRAWL_ITERATIONS"
    -
    -# Since we're going to crawl from scratch, create log.out file
     echo "Going to run nutch crawl command (and copy output to ${siteDir}log.out):" 2>&1 | tee ${siteDir}log.out
     # append to log.out file hereafter

    @@ -69,5 +70,5 @@
     ./$NUTCH_COMMAND readdb -dump $outputDir/$crawlId -text -crawlId $crawlId
     ./$NUTCH_COMMAND readdb -stats -crawlId $crawlId > $outputDir/$crawlId/stats
    -cat $outputDir/$crawlId/part-r-00000 > $outputDir/$crawlId/dump.txt
    +cat $outputDir/$crawlId/part-r-* > $outputDir/$crawlId/dump.txt
     else
     # appending to log.out
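The core of the fix is the last hunk: Nutch's readdb -dump (a Hadoop MapReduce job) may emit several part-r-NNNNN files, and the old line only captured part-r-00000. A minimal, self-contained sketch of why the glob form is the safe choice (the directory layout and file contents below are simulated, not real Nutch output; only the cat line mirrors the script):

```shell
#!/bin/sh
# Simulate a readdb -dump output directory with more than one part file.
outputDir=$(mktemp -d)
crawlId=site00001
mkdir -p "$outputDir/$crawlId"
printf 'page A\n' > "$outputDir/$crawlId/part-r-00000"
printf 'page B\n' > "$outputDir/$crawlId/part-r-00001"

# Old behaviour: only the first part file would reach dump.txt.
# New behaviour: the glob matches every part file; shell pathname
# expansion sorts matches, so parts are concatenated in order.
cat "$outputDir/$crawlId"/part-r-* > "$outputDir/$crawlId/dump.txt"

cat "$outputDir/$crawlId/dump.txt"
```

With a single part file the two forms behave identically, which is why the bug stayed invisible until a multi-part dump appeared.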