
get_Maori_WET_records_from_CCSep2018_on.sh@ 33513

Last change on this file since 33513 was 33513, checked in by ak19, 5 years ago
Higher-level script that runs get_maori_WET_records_for_crawl.sh against each named crawl since Sep 2018, which is when crawls started featuring the content_languages field
Property svn:executable set to *
File size: 1017 bytes

#!/bin/bash

# crawl_ids are from http://index.commoncrawl.org/
# We only want the crawl_ids from Sep 2018 onwards, as that's when
# the content_languages field was included in CommonCrawl's columnar index

# https://www.cyberciti.biz/faq/bash-for-loop-array/
# (else chain commands as at https://superuser.com/questions/237072/wrapping-long-bash-commands-in-script-files)
crawl_ids=( "CC-MAIN-2019-35" "CC-MAIN-2019-30" "CC-MAIN-2019-26" \
    "CC-MAIN-2019-22" "CC-MAIN-2019-18" "CC-MAIN-2019-13" \
    "CC-MAIN-2019-09" "CC-MAIN-2019-04" "CC-MAIN-2018-51" \
    "CC-MAIN-2018-47" "CC-MAIN-2018-43" "CC-MAIN-2018-39" )

# Process the crawls one at a time, stopping at the first failure
for crawl_id in "${crawl_ids[@]}"
do
    echo "About to start off index and WARC download process for CRAWL ID: $crawl_id"
    # Quote the argument so the crawl id is always passed as a single word
    ./src/script/get_maori_WET_records_for_crawl.sh "$crawl_id"
    result=$?
    # -ne compares numerically (!= would compare as strings)
    if [ "$result" -ne 0 ]; then
        echo "Processing common-crawl $crawl_id failed with exit value: $result"
        echo "Will cease to process remaining cc crawls. Exiting..."
        exit 1
    fi
done
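
The stop-on-first-failure loop above can be tried out without downloading anything. The following is only a minimal sketch: `process_crawl` and the `FAIL_ON` variable are hypothetical stand-ins for `./src/script/get_maori_WET_records_for_crawl.sh` and are not part of the repository, and `run_crawls` returns the failing exit value instead of exiting, so that it can be called from another script.

```shell
#!/bin/bash

# Hypothetical stand-in for ./src/script/get_maori_WET_records_for_crawl.sh:
# succeeds for every crawl id except the one named in FAIL_ON
# (FAIL_ON is an assumption made purely for illustration).
process_crawl() {
    [ "$1" != "${FAIL_ON:-}" ]
}

# Run process_crawl over each crawl id given as an argument,
# stopping at the first failure and returning its exit value.
run_crawls() {
    local crawl_id status
    for crawl_id in "$@"; do
        echo "Processing $crawl_id"
        status=0
        process_crawl "$crawl_id" || status=$?
        if [ "$status" -ne 0 ]; then
            echo "Processing $crawl_id failed with exit value: $status" >&2
            return "$status"
        fi
    done
}
```

Calling `run_crawls "${crawl_ids[@]}"` with `FAIL_ON` set to one of the ids would process the crawls in order and stop at that id, leaving the later ones untouched, which mirrors the behaviour of the script above.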
