source: other-projects/maori-lang-detection/conf/url-blacklist-filter.txt@ 33800

Last change on this file since 33800 was 33800, checked in by ak19, 4 years ago

Removed an adult site from the crawled contents and added its URL to the blacklist conf file (in case anyone ever crawls our MRI set of Common Crawl sites again)

File size: 3.1 KB
# URL blacklist
# FORMAT:
# Precede a URL with ^ to blacklist URLs that start with the given prefix
# Follow a URL with $ to blacklist URLs that end with the given suffix
# ^url$ will blacklist only URLs that match the given URL exactly
# Without either the ^ or $ symbol, any URL containing the given URL will get blacklisted


# manually adjusting for irrelevant topsite hits
# Rapa-Nui is related to Easter Island
^http://codex.cs.yale.edu/avi/silberschatz/gallery/trips-photos/South-America/Rapa-Nui/

# We blacklist this yale.edu domain except for the subportion that gets whitelisted.
# Then, in sites-too-big-to-exhaustively-crawl.txt, we have a mapping for an allowed URL
# pattern in case elements on the page are stored elsewhere
^http://korora.econ.yale.edu/

# Wikipedia pages in
# ksh (a German dialect), ilo (Ilocano, a Philippine language), ty (Tahitian), wa (Walon/Walloon),
# io (Ido, a derivative of Esperanto) and zh-min-nan (Min-Nan Chinese) are not in the Maori language.
# Not sure why Common Crawl had found them for language code MRI
ksh.wikipedia.org
ilo.wikipedia.org
wa.wikipedia.org
ty.m.wikipedia.org
io.m.wikipedia.org
zh-min-nan.wikipedia.org
zh-min-nan.wiktionary.org

######
# unwanted domains
.video-chat.
.videochat.
3chat.ru
livevideochatting.org
lovewebcam.net

cherrybabe.biz
dreamsbabes.com
adultfantasyboutique.com
adultterra.com

leatherdyke.porn
hornyteenharlots.com
adultviewsex.com
adultsexualvideo.com
ctbererotica.sexe-traque.com
cybererotia.porn234.com
cybereroticz.adultsupermart.com
freegaywebcams.info
lesbiansinmysoup.com
videopornoxx.online
sexandplay.com
sexynakedselfies.info
barebabez.com
britnudes.net
camaporno.com
webxvideo.com
gayspornosex.com
jasminreviews.com
sexchatlines4u.com
sexybabeworld.org
sexyleaks.info
uniqueporno.com
wildsexsluts.com
xxxblacknudes.com
bigsexymelons.com
mi.thebestmasturbators.com

# more adult sites
acba.osb-land.com


# just get rid of any URL containing "livejasmin"
## livejasmin
# Actually: do that in the code (CCWETProcessor) with a log message,
# since we need to get rid of, in their entirety, any sites that contain
# any URL with the string "livejasmin".
# So run the program once, check the log for messages mentioning "additional"
# adult sites found, and add their domains in here.
anigma-beauty.com
adultfeet.com
atopian.org
bellydancingvideo.net
bmmodelsagency.com
brucknergallery.com
fuckvidz.org
photobattle.net
votekat.info

# Similar to the above, the following contained the string "jasmin" in the URL
teenycuties.com
a.tiles.mapbox.com
blazingteens.net
redtubeporn.info
osb-land.com
totallyhotmales.com
babeevents.com
talkserver.de
hehechat.org
fetish-nights.com
lesslove.com
hebertsvideo.com

# sounds like a piracy site
^http://pirateguides.com/
fastmp3.ru

# from alexa topsites at https://www.alexa.com/topsites
livejasmin.com
pornhub.com
# listed as a similar topsite at https://en.wikipedia.org/wiki/List_of_most_popular_websites
redtube.com
xvideos.com
xhamster.com
xnxx.com


# not sure about this one, but the domain name and/or full URL seems like it belongs here
abcutie.com

# only had a single seedURL and it quickly redirected to an adult site
apparactes.gq
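
The FORMAT rules at the top of this file (prefix with ^, suffix with $, exact match with ^url$, substring otherwise) can be illustrated with a minimal Python sketch. This is not the project's actual filter code, which lives in CCWETProcessor. The function names `parse_filter_lines` and `is_blacklisted` are hypothetical, and skipping blank lines and lines starting with # is an assumption based on how this file is laid out.

```python
from typing import List, Tuple

# Hypothetical helper: turn filter-file lines into (kind, pattern) rules.
# Assumption: blank lines and lines starting with '#' are ignored.
def parse_filter_lines(lines: List[str]) -> List[Tuple[str, str]]:
    rules = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        starts = line.startswith("^")
        ends = line.endswith("$")
        if starts and ends:
            kind, pattern = "exact", line[1:-1]
        elif starts:
            kind, pattern = "prefix", line[1:]
        elif ends:
            kind, pattern = "suffix", line[:-1]
        else:
            kind, pattern = "contains", line
        if pattern:  # skip degenerate entries such as a lone '^' or '$'
            rules.append((kind, pattern))
    return rules

# Hypothetical helper: True if any rule matches the URL.
def is_blacklisted(url: str, rules: List[Tuple[str, str]]) -> bool:
    for kind, pattern in rules:
        if kind == "exact" and url == pattern:
            return True
        if kind == "prefix" and url.startswith(pattern):
            return True
        if kind == "suffix" and url.endswith(pattern):
            return True
        if kind == "contains" and pattern in url:
            return True
    return False

if __name__ == "__main__":
    sample = [
        "# URL blacklist",
        "^http://korora.econ.yale.edu/",
        ".videochat.",
    ]
    rules = parse_filter_lines(sample)
    print(is_blacklisted("http://korora.econ.yale.edu/some/page", rules))
    print(is_blacklisted("http://example.org/", rules))
```

Note that a prefix entry such as `^http://korora.econ.yale.edu/` would not match the `https://` form of the same site under this reading of the rules, whereas a bare domain entry matches anywhere in the URL.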