Last change on this file since 33524 was 33524, checked in by ak19, 5 years ago:

- Further adjustments to documenting what we did to get things to run on the hadoop filesystem. 2. All the hadoop related git projects (with patches), separate copy of patches, config modifications and missing jar files that we needed, and scripts we created to run on the hdfs machine and its host machine.

File size: 812 bytes

1. Most of the scripts in this folder are based on https://github.com/commoncrawl/cc-index-table/tree/master/src/script/convert_url_index.sh, which sets the environment and sets up Spark etc., in order to run cc-index-table related tasks to process the data we're particularly interested in.

Several of the modifications are further based on the instructions and examples at https://github.com/commoncrawl/cc-index-table

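The environment setup and job launch that these scripts borrow from convert_url_index.sh is roughly of the following shape. This is only a minimal sketch: the install path, resource settings, jar name and entry-point class below are all placeholder assumptions, not the actual configuration used by the scripts in this folder.

```shell
#!/bin/sh
# Hypothetical sketch of a convert_url_index.sh-style environment setup
# followed by a spark-submit invocation. All values are assumptions.

SPARK_HOME=/opt/spark                              # assumed Spark install location
MASTER="--master yarn"                             # run on the Hadoop/YARN cluster
RESOURCES="--executor-memory 4g --num-executors 4" # placeholder resource settings
JARFILE=cc-index-table-jar-with-dependencies.jar   # placeholder jar name
MAIN_CLASS=org.commoncrawl.spark.examples.CCIndexExport  # assumed entry point

CMD="$SPARK_HOME/bin/spark-submit $MASTER $RESOURCES --class $MAIN_CLASS $JARFILE"

# Echo the command rather than executing it, since this is only a sketch:
echo "$CMD"
```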
2. The third/final phase of the script get_maori_WET_records_for_crawl.sh is based on the instructions at https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to

3. The script get_Maori_WET_records_from_CCSep2018_on.sh is merely a batch processing script.

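A batch script of this shape typically just loops over crawl identifiers and invokes the per-crawl script once for each. The sketch below is hypothetical: the crawl IDs listed and the per-crawl command line are illustrative assumptions, not taken from the actual script.

```shell
#!/bin/sh
# Hypothetical sketch of a batch driver in the style of
# get_Maori_WET_records_from_CCSep2018_on.sh: iterate over CommonCrawl
# crawl identifiers from September 2018 onward, invoking the per-crawl
# script once for each.

CRAWL_IDS="CC-MAIN-2018-39 CC-MAIN-2018-43 CC-MAIN-2018-47"  # illustrative IDs

for crawl in $CRAWL_IDS; do
    # The real batch script would run get_maori_WET_records_for_crawl.sh;
    # echoing the command keeps this sketch safely runnable.
    echo "./get_maori_WET_records_for_crawl.sh $crawl"
done
```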
4. The script limit10_export_index.sh never really worked, but it was written only for testing purposes and serves no other purpose.