Changeset 33494 for gs3-extensions

Timestamp:
2019-09-21T22:49:56+12:00
Author:
ak19
Message:

All-in-one script that takes a Common Crawl identifier of the form CC-MAIN-YEAR-NUMBER as its parameter. The first phase uses Spark over the full Common Crawl index to extract, in CSV format, the WARC URL and offset portion of the index for the crawled pages whose primary language is Maori (code MRI). The zipped CSV parts on HDFS are then merged and unzipped into a single large CSV file. The second phase uses Spark with Hadoop again: it takes that CSV as input and downloads the WARC records at the WARC URLs and offsets listed in the CSV index (db table) file onto the Hadoop file system. The third phase converts the WARC records to WET records for text and WAT records for metadata. The WAT records are unused, while the WET records are put into the mounted shared area on vagrant on analytics, ready to be copied to my local machine and processed by WETProcessor.java. The idea is to run the new get_maori_WET_records_for_crawl over every Common Crawl back to September 2018. That's about 12 crawls so far, and another is due at the end of this month.
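
To illustrate the input check and the offset-based record lookup described above, here is a minimal Python sketch. It is not the committed script. The function names are hypothetical, the two-digit NUMBER part of the identifier is an assumption, and the column names warc_filename, warc_record_offset and warc_record_length are assumed to match the CSV produced in the first phase. The sketch only computes the HTTP byte range that would select each WARC record and performs no download.

```python
import csv
import io
import re

# Assumption: identifiers look like CC-MAIN-2019-35 (4-digit year, 2-digit number).
CRAWL_ID_PATTERN = re.compile(r"CC-MAIN-(\d{4})-(\d{2})")


def parse_crawl_id(crawl_id):
    """Return (year, number) for an identifier of the form CC-MAIN-YEAR-NUMBER."""
    match = CRAWL_ID_PATTERN.fullmatch(crawl_id)
    if match is None:
        raise ValueError("expected CC-MAIN-YEAR-NUMBER, got: " + repr(crawl_id))
    return int(match.group(1)), int(match.group(2))


def byte_range_header(offset, length):
    """Build an HTTP Range header value covering one record.

    HTTP byte ranges are inclusive at both ends, hence the -1.
    """
    return "bytes={}-{}".format(offset, offset + length - 1)


def ranges_from_csv(csv_text):
    """Map each CSV index row to (warc_filename, Range header value).

    Assumption: the CSV has a header row with the column names used below.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (
            row["warc_filename"],
            byte_range_header(
                int(row["warc_record_offset"]),
                int(row["warc_record_length"]),
            ),
        )
        for row in reader
    ]
```

Validating the identifier up front lets a wrapper fail before any Spark job is launched, and the inclusive byte range is what allows a single record to be fetched without downloading the whole WARC file.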

File:
1 added
