Last change on this file since 33604 was 33604, checked in by ak19, 5 years ago:
1. Better output into possible-product-sites.txt, including the overseas country code prefix, to help decide whether the site is worth keeping or not.
2. Updated whitelisting and top-sites filters to grab the /mi/ subsections of sites that don't appear to be autotranslated. This is done in preparation for blocking out product sites hereafter.

File size: 1.3 KB

# URL 'whitelist': urls of these forms go into the keep pile.
# whitelist overrides blacklist and greylist.
# FORMAT:
# precede the URL with ^ to whitelist urls that match the given prefix
# follow the URL with $ to whitelist urls that match the given suffix
# ^url$ will whitelist urls that match the given url completely
# Without either ^ or $ symbol, urls containing the given url will get whitelisted

# Special exception for this url on yale.edu, since we needed to blacklist
# some particular other urls on yale.edu
http://korora.econ.yale.edu/phillips/archive/hauraki.htm

# We've added .ru$ sites to the blacklist, but the following
# Russian websites contain actual Maori language content
http://www.krassotkin.ru/sites/prayer.su/maori/
https://mi.centr-zashity.ru/



# WHITELIST WEBSITES THAT HAVE NON-AUTOMATED /mi/ SUBSECTIONS
# WE CONTROL WHAT PART OF THEM WILL BE DOWNLOADED (THE /mi SUBSECTION)
# IN sites-too-big-to-exhaustively-crawl.txt
#https://www.martinvrijland.nl/mi/te-mana-hinengaro/Ko-te-nuinga-ake-o-nga-tangata-kei-te-timata-ki-te-kite-kei-te-noho-tatou-i-roto-i-te-whakaata-ko-te-aha-tenei/
#https://www.csunplugged.org/mi/principles/
#http://www.gpedia.com/mi/gpedia/Reo_M%C4%81ori

https://www.martinvrijland.nl
https://www.csunplugged.org
http://www.gpedia.com
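
The FORMAT comments in this file describe four matching modes (prefix, suffix, exact, and substring). The sketch below shows one way those rules could be interpreted. It is only an illustration: the function names `parse_rules`, `rule_matches`, and `is_whitelisted` are hypothetical, and it assumes plain, case-sensitive string comparison rather than regular expressions. The code that actually reads this file may behave differently.

```python
def parse_rules(text):
    """Return the rule lines of a filter file, skipping blanks and # comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            rules.append(line)
    return rules


def rule_matches(rule, url):
    """Apply one rule to a URL using the ^ / $ conventions from the FORMAT comments.

    Assumption: ^ and $ are simple anchors on a literal string, not regex syntax.
    """
    anchored_start = rule.startswith("^")
    anchored_end = rule.endswith("$")
    pattern = rule[1:] if anchored_start else rule
    if anchored_end:
        pattern = pattern[:-1]

    if anchored_start and anchored_end:
        return url == pattern            # ^url$ : exact match
    if anchored_start:
        return url.startswith(pattern)   # ^url  : prefix match
    if anchored_end:
        return url.endswith(pattern)     # url$  : suffix match
    return pattern in url                # url   : substring match


def is_whitelisted(url, rules):
    """True if any rule in the list matches the URL."""
    return any(rule_matches(rule, url) for rule in rules)
```

Under these assumptions, the unanchored entry `https://www.csunplugged.org` would match `https://www.csunplugged.org/mi/principles/` because the rule text is contained in the URL, which is consistent with the whole site being whitelisted here and the `/mi` restriction being handled in sites-too-big-to-exhaustively-crawl.txt.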