source: main/trunk/model-sites-dev/commoncrawl/resources/siteConfig.properties@ 36270

Last change on this file since 36270 was 34132, checked in by ak19, 4 years ago

Committing the commoncrawl site of Nutch recrawls of our CC data where content-language = MRI.
1. Contains the collection configuration files, but also the keep-urls *.txt files in the etc folder, used by NutchTextDumpPlugin to filter URLs of interest.
2. The import_nutchDumpTxtsOfcrawledMRICC.tar.gz file needs to be decompressed into any of the collections that need to be rebuilt. It contains just the Nutch dump.txt files (in their siteID folders), as I've removed the binary files.
3. The script moveDumpTxtFilesIntoImport.sh can be used to generate such cut-down versions of the Nutch crawled folders, containing only the dump.txt files within their siteID folders.
4. In the next commit, I'll try to add svn externals to get the import_nutchDumpTxtsOfcrawledMRICC.tar.gz from site level into the collection folders for the 2 current collections in this site.
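
The cut-down copy described in item 3 can be sketched in Python. This is not the contents of moveDumpTxtFilesIntoImport.sh, which is not shown here; it is a minimal illustration that assumes each siteID folder sits directly under a crawl root and holds a file named dump.txt. The function name copy_dump_txts and both directory arguments are hypothetical.

```python
import shutil
from pathlib import Path


def copy_dump_txts(src_root, dest_root):
    """Copy only the dump.txt files from src_root into dest_root.

    Assumed layout (hypothetical): src_root/<siteID>/dump.txt.
    The siteID folder structure is preserved; all other files are skipped.
    dest_root should live outside src_root.
    """
    src_root, dest_root = Path(src_root), Path(dest_root)
    copied = []
    for src in sorted(src_root.rglob("dump.txt")):
        if not src.is_file():
            continue
        rel = src.relative_to(src_root)  # e.g. 00001/dump.txt
        dest = dest_root / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy content and timestamps
        copied.append(rel)
    return copied
```

Calling copy_dump_txts("crawled", "import") with placeholder directory names would leave the binary crawl data behind and return the relative paths of the files it copied.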

File size: 151 bytes
siteName=DL of CommonCrawl MRI recrawls with Nutch
siteDescription=Collections of Nutch crawls of CommonCrawl results where content language was MRI.
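
As an illustration of how the two keys in this file could be read, here is a small Python sketch. It handles only plain key=value lines plus # and ! comment lines, so it is not a full Java .properties parser (no escapes, line continuations, or colon separators). The helper name parse_simple_properties is hypothetical and is not part of the site's own code.

```python
def parse_simple_properties(text):
    """Parse plain key=value lines into a dict.

    Simplified on purpose: blank lines and comment lines starting with
    '#' or '!' are skipped; everything after the first '=' is the value.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' separator
            props[key.strip()] = value.strip()
    return props


site_config = parse_simple_properties(
    "siteName=DL of CommonCrawl MRI recrawls with Nutch\n"
    "siteDescription=Collections of Nutch crawls of CommonCrawl results "
    "where content language was MRI.\n"
)
print(site_config["siteName"])
```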