Changeset 33625

05.11.2019 21:58:44

Adds two files:

1. A file listing domains that have seedURLs containing /mi(/) and that are located outside NZ. Sites outside NZ may be autotranslated and may point to product sites, so we have to inspect this file and decide, for each domain, whether to keep its crawldata.

2. A file listing domains that have any seedURL containing /mi(/) and that are located inside NZ, which we keep. This 2nd file was generated by accident, when the generating code did the opposite of what was intended, but its contents are informative. Our programs don't need to use it in any way.

Once we've inspected the first file and reduced it to just the sites to be dropped, it may in future be consulted or merged into the blacklist or greylist before generating the to_crawl folder for nutch. Now that we've finished crawling, we can use the reduced version of file 1 to determine which web sites' crawldata to ignore, so that it won't go into the mongodb or csv file.
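For illustration only, the split described in this message could be sketched as below. This is not the committed code: the function names, the input dictionaries, and the idea of a ready-made domain-to-country lookup are all assumptions, and "/mi(/)" is interpreted here as a URL path segment that is exactly "mi". Domains with no known country are put in the outside-NZ list so that they still get inspected by hand.

```python
import re
from urllib.parse import urlsplit

# Assumption: "/mi(/)" means a path segment that is exactly "mi",
# i.e. "/mi" at the end of the path or "/mi/" anywhere in it.
MI_SEGMENT = re.compile(r"/mi(/|$)")


def has_mi_segment(url):
    """Return True if the URL's path contains a /mi or /mi/ segment."""
    return MI_SEGMENT.search(urlsplit(url).path) is not None


def split_mi_domains(seed_urls_by_domain, country_by_domain):
    """Split domains having any /mi(/) seedURL into (outside_nz, inside_nz).

    Both arguments are hypothetical inputs: a dict of domain -> list of
    seedURLs, and a dict of domain -> country code from some geolocation
    step. Domains without a known country are listed as outside NZ.
    """
    outside_nz, inside_nz = [], []
    for domain, urls in sorted(seed_urls_by_domain.items()):
        if not any(has_mi_segment(u) for u in urls):
            continue  # no /mi(/) seedURL, so the domain is in neither file
        if country_by_domain.get(domain) == "NZ":
            inside_nz.append(domain)   # file 2: informative only, kept
        else:
            outside_nz.append(domain)  # file 1: needs manual inspection
    return outside_nz, inside_nz


def should_ignore_crawldata(domain, reduced_file1_domains):
    """True if the domain is in the manually reduced version of file 1."""
    return domain in reduced_file1_domains
```

Once file 1 has been reduced by hand, a check like `should_ignore_crawldata` could be applied per site before writing to the mongodb or csv output.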

2 added