Last change on this file since 33524 was 33524, checked in by ak19, 5 years ago:
- Further adjustments to documenting what we did to get things to run on the hadoop filesystem.
- All the hadoop-related git projects (with patches), separate copy of patches, config modifications and missing jar files that we needed, and scripts we created to run on the hdfs machine and its host machine.

Property svn:executable set to *
File size: 835 bytes
#!/bin/bash

# crawl_ids are from http://index.commoncrawl.org/
# We only want the crawl_ids from Sep 2018 and onwards, as that's when
# the content_languages field was included in CommonCrawl's index.

# Bash array-loop pattern: https://www.cyberciti.biz/faq/bash-for-loop-array/
#crawl_ids=( "CC-MAIN-2019-35" "CC-MAIN-2019-30" "CC-MAIN-2019-26" "CC-MAIN-2019-22" "CC-MAIN-2019-18" "CC-MAIN-2019-13" "CC-MAIN-2019-09" "CC-MAIN-2019-04" "CC-MAIN-2018-51" "CC-MAIN-2018-47" "CC-MAIN-2018-43" "CC-MAIN-2018-39" )

crawl_ids=( "CC-MAIN-2019-18" "CC-MAIN-2019-13" "CC-MAIN-2019-09" "CC-MAIN-2019-04" "CC-MAIN-2018-51" "CC-MAIN-2018-47" "CC-MAIN-2018-43" "CC-MAIN-2018-39" )

for crawl_id in "${crawl_ids[@]}"
do
    echo "About to start off index and WARC download process for CRAWL ID: $crawl_id"
    ./src/script/get_maori_WET_records_for_crawl.sh "$crawl_id"
done