Changeset 33564 for gs3-extensions


Timestamp:
2019-10-14T21:01:17+13:00
Author:
ak19
Message:

batchcrawl.sh now runs the crawl and logs the crawl's output, dumps the text and stats resulting from the crawl into an output folder, and creates an UNFINISHED file with instructions and the old crawl command if the crawl did not terminate within the specified number of iterations. At present there is still a break statement that stops after the first site has been processed.
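
To make the message concrete, here is an illustrative sketch (not part of the changeset) of what one pass over a site produces, assuming the directory conventions visible in the diff below; the site id 00001 is hypothetical:

    # Run the batch crawler from the directory containing to_crawl/ and
    # apache-nutch-2.3.1/ (the script uses relative paths to both):
    ./batchcrawl.sh
    #
    # For a site folder to_crawl/sites/00001/ (holding seedURLs.txt), one pass yields:
    #   to_crawl/sites/00001/log.out      - full crawl output, captured via tee
    #   to_crawl/sites/00001/UNFINISHED   - written only if $CRAWL_ITERATIONS ran out
    #   crawled/00001/stats               - output of nutch readdb -stats
    #   crawled/00001/dump.txt            - text dump of the crawled pages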

File:
1 edited

Legend:

  ' ' Unmodified
  '+' Added
  '-' Removed
  • gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/batchcrawl.sh

--- gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/batchcrawl.sh (r33563)
+++ gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/batchcrawl.sh (r33564)
@@ -1,24 +1,26 @@
 #!/bin/bash
-echo "Hello world!"
 
 sitesDir=to_crawl/sites
-echo "SITESDIR: $sitesDir"
+echo "SITES DIR (INPUT): $sitesDir"
+outputDir=crawled
+mkdir -p $outputDir
+echo "OUTPUT DIR: $outputDir"
+
 
 NUTCH_HOME=apache-nutch-2.3.1
-NUTCH_CONF_DIR=$NUTCH_HOME/conf
+NUTCH_CONF_DIR=$NUTCH_HOME/runtime/local/conf
 NUTCH_URLFILTER_TEMPLATE=$NUTCH_CONF_DIR/regex-urlfilter.GS_TEMPLATE
 NUTCH_URLFILTER_FILE=$NUTCH_CONF_DIR/regex-urlfilter.txt
 
 CRAWL_COMMAND=$NUTCH_HOME/runtime/local/bin/crawl
+NUTCH_COMMAND=$NUTCH_HOME/runtime/local/bin/nutch
 
 CRAWL_ITERATIONS=10
 
-function prepareSite() {
+function crawlSite() {
     siteDir=$1
     crawlId=$2
 
-    #echo "processing site $siteDir"
-
-    #echo "processing site $siteDir with crawlId: $crawlId"
+    echo "processing site $siteDir with crawlId: $crawlId"
 
     echo "Copying over template $NUTCH_URLFILTER_TEMPLATE to live version of file"
@@ -32,9 +34,42 @@
 
     # $siteDir parameter is the folder containing seedURLs.txt
+    # https://stackoverflow.com/questions/418896/how-to-redirect-output-to-a-file-and-stdout
     crawl_cmd="./$CRAWL_COMMAND $siteDir $crawlId $CRAWL_ITERATIONS"
 
-    echo "Going to run nutch crawl command:"
-    echo "  $crawl_cmd"
+    # Since we're going to crawl from scratch, create log.out file
+    echo "Going to run nutch crawl command (and copy output to ${siteDir}log.out):" 2>&1 | tee ${siteDir}log.out
+    # append to log.out file hereafter
+    echo "  $crawl_cmd" 2>&1 | tee -a ${siteDir}log.out
+    echo "--------------------------------------------------" 2>&1 | tee -a ${siteDir}log.out
 
+    # append output of $crawl_cmd to log.out
+    $crawl_cmd 2>&1 | tee -a ${siteDir}log.out
+    result=$?
+
+    if [ "x$result" = "x0" ]; then
+    # nutch finished crawling successfully.
+
+    # But check if the site was crawled thoroughly within $CRAWL_ITERATIONS
+    # If not, create file UNFINISHED to indicate a more thorough crawl needed
+    tail -10 ${siteDir}log.out | grep "no more URLs to fetch now" > /dev/null
+    result=$?
+    if [ "x$result" != "x0" ]; then
+        echo "A crawl of $CRAWL_ITERATIONS iterations was insufficient for crawlId $crawlId" 2>&1 | tee ${siteDir}UNFINISHED
+        echo "" 2>&1 | tee -a ${siteDir}UNFINISHED
+        echo "To re-run crawl of site with crawlId $crawlId with a larger number of iterations:" 2>&1 | tee -a ${siteDir}UNFINISHED
+        echo "1. delete $outputDir/$crawlId" 2>&1 | tee -a ${siteDir}UNFINISHED
+        echo "2. copy the regex-urlfilter file:" 2>&1 | tee -a ${siteDir}UNFINISHED
+        echo "   cp $NUTCH_URLFILTER_TEMPLATE $NUTCH_URLFILTER_FILE" 2>&1 | tee -a ${siteDir}UNFINISHED
+        echo "3. Adjust # crawl iterations in old crawl command:\n$crawl_cmd" 2>&1 | tee -a ${siteDir}UNFINISHED
+    fi
+
+    # outputDir/$crawlId should not yet exist
+        ./$NUTCH_COMMAND readdb -dump $outputDir/$crawlId -text -crawlId $crawlId
+        ./$NUTCH_COMMAND readdb -stats -crawlId $crawlId > $outputDir/$crawlId/stats
+        cat $outputDir/$crawlId/part-r-00000 > $outputDir/$crawlId/dump.txt
+    else
+    # appending to log.out
+        echo "CRAWL FAILED." 2>&1 | tee -a ${siteDir}log.out
+    fi
 
 }
@@ -43,5 +78,5 @@
 # https://stackoverflow.com/questions/4000613/perform-an-action-in-every-sub-directory-using-bash
 for siteDir in $sitesDir/*/; do
-    #echo "$siteDir"
+
     # to get crawl_id like 00001 from $siteDir like to_crawl/sites/00001/
     # Remove the $sitesDir prefix of to_crawl/sites followed by /,
@@ -50,6 +85,18 @@
     crawlId=${crawlId%/}
 
-    #echo "crawlId: $crawlId"
-    prepareSite $siteDir $crawlId
+    echo "Processing crawlId: $crawlId"
+
+    if [ -d "$outputDir/$crawlId" ]; then
+    # Skip site already processed. *Append* this msg to log.out
+    echo "" 2>&1 | tee -a ${siteDir}log.out
+    echo "**** $siteDir already processed. Skipping...." 2>&1 | tee -a ${siteDir}log.out
+    echo "Delete $outputDir/$crawlId if you want to reprocess it." 2>&1 | tee -a ${siteDir}log.out
+
+    else
+    crawlSite $siteDir $crawlId
+
+    fi
+    echo "--------------------------------------------------"
+
     break
 done
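
A note on the exit-status checks in the new code: in bash, $? after a pipeline such as the tee'd crawl invocation reports the exit status of the last command in the pipeline (tee), not of $crawl_cmd itself. A minimal sketch of how the crawl's own status could be captured instead, offered as an illustration rather than as part of this changeset:

    # ${PIPESTATUS[0]} is the exit status of the first command in the most
    # recent pipeline, i.e. of $crawl_cmd rather than of tee.
    $crawl_cmd 2>&1 | tee -a ${siteDir}log.out
    result=${PIPESTATUS[0]}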