source: gs3-extensions/maori-lang-detection/hdfs-cc-work/scripts/batchcrawl.sh@ 33563

Last change on this file since 33563 was 33563, checked in by ak19, 5 years ago
Committing inactive testing batch scripts (they only create the regex-urlfilter.txt files and put them in the correct location before printing the crawl command). Somehow the crawl command printed for any of the 1st, 2nd and last site (with the supposedly correct regex-urlfilter.txt manually put in place) does not work yet. For a while I couldn't get the davidb homepage to work either, but after removing the nutch bin/crawl binary script and recompiling nutch it worked again. The similar commands generated by this batchcrawl.sh script are still not working, though.
Property svn:executable set to *
File size: 1.6 KB

#!/bin/bash
echo "Hello world!"

# Each subfolder of $sitesDir holds one site's seedURLs.txt and regex-urlfilter.txt
sitesDir=to_crawl/sites
echo "SITESDIR: $sitesDir"

NUTCH_HOME=apache-nutch-2.3.1
NUTCH_CONF_DIR=$NUTCH_HOME/conf
NUTCH_URLFILTER_TEMPLATE=$NUTCH_CONF_DIR/regex-urlfilter.GS_TEMPLATE
NUTCH_URLFILTER_FILE=$NUTCH_CONF_DIR/regex-urlfilter.txt

CRAWL_COMMAND=$NUTCH_HOME/runtime/local/bin/crawl

CRAWL_ITERATIONS=10

# Puts the url-filter file for one site in place and prints (but does not run)
# the nutch crawl command for it.
#   $1: site folder, with trailing slash, e.g. to_crawl/sites/00001/
#   $2: crawl id, e.g. 00001
function prepareSite() {
    local siteDir=$1
    local crawlId=$2

    #echo "processing site $siteDir"

    #echo "processing site $siteDir with crawlId: $crawlId"

    echo "Copying over template $NUTCH_URLFILTER_TEMPLATE to live version of file"
    cp "$NUTCH_URLFILTER_TEMPLATE" "$NUTCH_URLFILTER_FILE"

    echo "Appending contents of regex-urlfilter file for site $siteDir to url-filter file:"
    cat "${siteDir}regex-urlfilter.txt" >> "$NUTCH_URLFILTER_FILE"

    #echo "Contents of seedURLs.txt file for site:"
    #cat "${siteDir}seedURLs.txt"

    # $siteDir parameter is the folder containing seedURLs.txt
    crawl_cmd="./$CRAWL_COMMAND $siteDir $crawlId $CRAWL_ITERATIONS"

    echo "Going to run nutch crawl command:"
    echo " $crawl_cmd"
}


# https://stackoverflow.com/questions/4000613/perform-an-action-in-every-sub-directory-using-bash
for siteDir in "$sitesDir"/*/; do
    #echo "$siteDir"
    # to get crawl_id like 00001 from $siteDir like to_crawl/sites/00001/
    # Remove the $sitesDir prefix of to_crawl/sites followed by /,
    # Next remove the / suffix that remains
    crawlId=${siteDir#"$sitesDir/"}
    crawlId=${crawlId%/}

    #echo "crawlId: $crawlId"
    prepareSite "$siteDir" "$crawlId"
    # Stops the loop after the first site
    break
done
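
The crawl id derivation in the loop above relies on two bash parameter expansions: `${var#prefix}` strips a leading match and `${var%suffix}` strips a trailing one. As a minimal sketch, the same two steps can be wrapped in a function and tried out without any site folders on disk. The function name `crawl_id_from_dir` is hypothetical and is not part of batchcrawl.sh.

```shell
#!/bin/bash
# Hypothetical helper, not part of batchcrawl.sh: derives a crawl id such as
# 00001 from a site folder such as to_crawl/sites/00001/.
#   $1: the sites root, e.g. to_crawl/sites
#   $2: the site folder, with trailing slash
crawl_id_from_dir() {
    local sitesDir=$1
    local siteDir=$2
    local crawlId

    # Remove the "$sitesDir/" prefix: to_crawl/sites/00001/ becomes 00001/
    crawlId=${siteDir#"$sitesDir/"}
    # Remove the trailing slash that remains: 00001/ becomes 00001
    crawlId=${crawlId%/}

    printf '%s\n' "$crawlId"
}

crawl_id_from_dir to_crawl/sites to_crawl/sites/00001/
```

Quoting `"$sitesDir/"` inside the expansion makes bash treat the prefix as a literal string rather than a glob pattern, which matters if the folder name ever contains characters like `*` or `?`.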
