source: gs3-extensions/maori-lang-detection/conf/url-whitelist-filter.txt@ 33604

Last change on this file since 33604 was 33604, checked in by ak19, 5 years ago
  1. Better output into possible-product-sites.txt including the overseas country code prefix to help decide whether the site is worth keeping or not. 2. Updated whitelisting and top-sites filters to grab the /mi/ subsections of sites that don't appear to be autotranslated. This is done in preparation for blocking out product sites hereafter
File size: 1.3 KB
# URL 'whitelist': URLs of these forms go into the keep pile.
# The whitelist overrides the blacklist and the greylist.
# FORMAT:
# Precede a URL with ^ to whitelist URLs that start with the given prefix.
# Follow a URL with $ to whitelist URLs that end with the given suffix.
# ^url$ whitelists only URLs that match the given URL exactly.
# Without either ^ or $, any URL containing the given URL gets whitelisted.

# Special exception for this URL on yale.edu, since we needed to blacklist
# certain other URLs on yale.edu.
http://korora.econ.yale.edu/phillips/archive/hauraki.htm

# We've added .ru$ sites to the blacklist, but the following
# Russian websites contain actual Maori-language content.
http://www.krassotkin.ru/sites/prayer.su/maori/
https://mi.centr-zashity.ru/



# WHITELIST WEBSITES THAT HAVE NON-AUTOTRANSLATED /mi/ SUBSECTIONS.
# WHICH PART OF THEM GETS DOWNLOADED (THE /mi/ SUBSECTION) IS CONTROLLED
# IN sites-too-big-to-exhaustively-crawl.txt
#https://www.martinvrijland.nl/mi/te-mana-hinengaro/Ko-te-nuinga-ake-o-nga-tangata-kei-te-timata-ki-te-kite-kei-te-noho-tatou-i-roto-i-te-whakaata-ko-te-aha-tenei/
#https://www.csunplugged.org/mi/principles/
#http://www.gpedia.com/mi/gpedia/Reo_M%C4%81ori

https://www.martinvrijland.nl
https://www.csunplugged.org
http://www.gpedia.com
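
The FORMAT comments at the top of the file describe four matching modes (prefix, suffix, exact, and substring). The following is a minimal Python sketch of how such entries could be interpreted. It is not the project's actual code: the function names `parse_filter_lines`, `entry_matches`, and `is_whitelisted` are hypothetical, and the sketch assumes that blank lines and lines starting with `#` are ignored. It does not model how the whitelist overrides the blacklist and greylist.

```python
def parse_filter_lines(lines):
    """Return the filter entries, skipping blank lines and # comments (assumed)."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        entries.append(line)
    return entries


def entry_matches(entry, url):
    """Apply the ^ / $ rules from the FORMAT comments to a single entry."""
    anchored_start = entry.startswith("^")
    anchored_end = entry.endswith("$")
    core = entry
    if anchored_start:
        core = core[1:]
    if anchored_end:
        core = core[:-1]
    if anchored_start and anchored_end:
        return url == core           # ^url$ : exact match
    if anchored_start:
        return url.startswith(core)  # ^url  : prefix match
    if anchored_end:
        return url.endswith(core)    # url$  : suffix match
    return core in url               # url   : substring match


def is_whitelisted(url, entries):
    """True if any entry matches the URL."""
    return any(entry_matches(entry, url) for entry in entries)


if __name__ == "__main__":
    sample = [
        "# a comment line",
        "",
        "http://korora.econ.yale.edu/phillips/archive/hauraki.htm",
        "https://www.csunplugged.org",
    ]
    entries = parse_filter_lines(sample)
    print(is_whitelisted("https://www.csunplugged.org/mi/principles/", entries))
    print(is_whitelisted("https://example.org/", entries))
```

Under these assumptions the first call prints `True`, because the unanchored entry `https://www.csunplugged.org` is contained in the URL, and the second prints `False`.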