source: gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt@ 33559

Last change on this file since 33559 was 33559, checked in by ak19, 5 years ago
  1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge explained why it was more accurate to the behaviour.
  2. Comments to explain how the sites-too-big-to-exhaustively-crawl.txt should be formatted, what values are expected and how they work.
  3. Special blacklisting and whitelisting of urls on yale.edu, coupled with special treatment in topsites file too.
File size: 2.0 KB
# URL blacklist
# FORMAT:
# precede a URL with ^ to blacklist URLs that start with the given prefix
# follow a URL with $ to blacklist URLs that end with the given suffix
# ^url$ will blacklist only URLs that match the given URL exactly
# Without either the ^ or $ symbol, any URL containing the given string will get blacklisted


# manually adjusting for irrelevant topsite hits
# Rapa-Nui is related to Easter Island
^http://codex.cs.yale.edu/avi/silberschatz/gallery/trips-photos/South-America/Rapa-Nui/

# We blacklist this yale.edu domain except for the subportion that gets whitelisted.
# Then, in sites-too-big-to-exhaustively-crawl.txt, we have a mapping for an allowed URL
# pattern in case elements on the page are stored elsewhere
^http://korora.econ.yale.edu/

# Wikipedia pages in
# ksh (a German dialect), ilo (Filipino), ty (Tahitian), wa (Walon/Walloon),
# io (Ido version of Esperanto) and zh-min-nan (Min-Nan-Chinese) are not in the Maori language.
# Not sure why Common Crawl found them for language code MRI
ksh.wikipedia.org
ilo.wikipedia.org
wa.wikipedia.org
ty.m.wikipedia.org
io.m.wikipedia.org
zh-min-nan.wikipedia.org
zh-min-nan.wiktionary.org

# unwanted domains
.video-chat.
.videochat.
3chat.ru
livevideochatting.org
lovewebcam.net

cherrybabe.biz
dreamsbabes.com
adultfantasyboutique.com
adultterra.com

leatherdyke.porn
hornyteenharlots.com
adultviewsex.com
adultsexualvideo.com
ctbererotica.sexe-traque.com
cybererotia.porn234.com
cybereroticz.adultsupermart.com
freegaywebcams.info
lesbiansinmysoup.com
videopornoxx.online
sexandplay.com
sexynakedselfies.info
barebabez.com
britnudes.net
camaporno.com
webxvideo.com
gayspornosex.com
jasminreviews.com
sexchatlines4u.com
sexybabeworld.org
sexyleaks.info
uniqueporno.com
wildsexsluts.com
xxxblacknudes.com

# sounds like a piracy site
^http://pirateguides.com/

# from Alexa topsites at https://www.alexa.com/topsites
livejasmin.com
pornhub.com
# listed as a similar topsite at https://en.wikipedia.org/wiki/List_of_most_popular_websites
redtube.com
xvideos.com
xhamster.com
xnxx.com
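
The FORMAT comments at the top of the file describe four kinds of rule: prefix (`^url`), suffix (`url$`), exact (`^url$`) and substring (no marker). The Python sketch below shows one way those rules could be read and applied. It is an illustration only, not the code the crawler actually uses. The function names `parse_rules` and `is_blacklisted` are hypothetical, and the sketch assumes that rules are compared against the full URL string, that lines starting with `#` are comments, and that blank lines are ignored. The `.pdf$` rule and all test URLs are made up for the example.

```python
def parse_rules(text):
    """Turn filter-file text into (kind, pattern) pairs.

    Assumption: '#' starts a comment line and blank lines are skipped.
    """
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        starts = line.startswith("^")
        ends = line.endswith("$")
        # Drop the ^ and $ markers to get the literal text to compare.
        pattern = line[1:] if starts else line
        if ends:
            pattern = pattern[:-1]
        if not pattern:
            continue  # a bare "^" or "$" would otherwise match everything
        if starts and ends:
            kind = "exact"
        elif starts:
            kind = "prefix"
        elif ends:
            kind = "suffix"
        else:
            kind = "contains"
        rules.append((kind, pattern))
    return rules


def is_blacklisted(url, rules):
    """Return True if any rule matches the URL."""
    for kind, pattern in rules:
        if kind == "exact" and url == pattern:
            return True
        if kind == "prefix" and url.startswith(pattern):
            return True
        if kind == "suffix" and url.endswith(pattern):
            return True
        if kind == "contains" and pattern in url:
            return True
    return False


# The first three rules are copied from the file; ".pdf$" is a hypothetical
# suffix rule added only to exercise that case.
SAMPLE = """\
# sample rules
^http://korora.econ.yale.edu/
ksh.wikipedia.org
.videochat.
.pdf$
"""

rules = parse_rules(SAMPLE)

# Hypothetical URLs, used only to show each rule kind in action.
for url in [
    "http://korora.econ.yale.edu/some/page",
    "https://ksh.wikipedia.org/wiki/Example",
    "http://example.org/report.pdf",
    "http://example.org/index.html",
]:
    print(url, is_blacklisted(url, rules))
```

Because the substring rules are plain text rather than regular expressions, a rule such as `.videochat.` matches only URLs that contain those literal dots. That is the reading taken in this sketch; the real filter may interpret them differently.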