source: other-projects/maori-lang-detection/conf/url-blacklist-filter.txt@ 33823

Last change on this file since 33823 was 33823, checked in by ak19, 14 months ago

Recommitting mongo-data folder with renamed files with numbering.

File size: 3.2 KB
# URL blacklist
# FORMAT:
# Precede a URL with ^ to blacklist URLs that start with the given prefix
# Follow a URL with $ to blacklist URLs that end with the given suffix
# ^url$ blacklists only URLs that match the given URL exactly
# Without either ^ or $, any URL containing the given string is blacklisted
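#
# A minimal sketch of how the four rule forms above could be interpreted, in Python.
# This is an illustration only, not the project's actual filter code; the names
# parse_rule and is_blacklisted are hypothetical.

```python
def parse_rule(line):
    """Turn one blacklist line into (kind, text); return None for blanks and comments."""
    rule = line.strip()
    if not rule or rule.startswith("#"):
        return None
    anchored_start = rule.startswith("^")
    anchored_end = rule.endswith("$")
    if anchored_start:
        rule = rule[1:]
    if anchored_end:
        rule = rule[:-1]
    if anchored_start and anchored_end:
        return ("exact", rule)
    if anchored_start:
        return ("prefix", rule)
    if anchored_end:
        return ("suffix", rule)
    return ("contains", rule)


def is_blacklisted(url, rules):
    """Return True if any parsed rule matches the URL."""
    for kind, text in rules:
        if kind == "exact" and url == text:
            return True
        if kind == "prefix" and url.startswith(text):
            return True
        if kind == "suffix" and url.endswith(text):
            return True
        if kind == "contains" and text in url:
            return True
    return False
```

# Under this reading, "^http://korora.econ.yale.edu/" is a prefix rule and
# "ksh.wikipedia.org" is a contains rule.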


# manually adjusting for irrelevant topsite hits
# Rapa-Nui refers to Easter Island
^http://codex.cs.yale.edu/avi/silberschatz/gallery/trips-photos/South-America/Rapa-Nui/

# We blacklist this yale.edu domain except for the subportion that gets whitelisted.
# Then, in sites-too-big-to-exhaustively-crawl.txt, we have a mapping for an allowed URL
# pattern in case elements on the page are stored elsewhere
^http://korora.econ.yale.edu/

# Wikipedia pages in
# ksh (a German dialect), ilo (Ilocano, a Filipino language), ty (Tahitian), wa (Walloon),
# io (Ido, a variant of Esperanto) and zh-min-nan (Min Nan Chinese) are not in the Maori language
# Not sure why Common Crawl found them for language code MRI
ksh.wikipedia.org
ilo.wikipedia.org
wa.wikipedia.org
ty.m.wikipedia.org
io.m.wikipedia.org
zh-min-nan.wikipedia.org
zh-min-nan.wiktionary.org

######
# unwanted domains
.video-chat.
.videochat.
3chat.ru
livevideochatting.org
lovewebcam.net

cherrybabe.biz
dreamsbabes.com
adultfantasyboutique.com
adultterra.com

leatherdyke.porn
hornyteenharlots.com
adultviewsex.com
adultsexualvideo.com
ctbererotica.sexe-traque.com
cybererotia.porn234.com
cybereroticz.adultsupermart.com
freegaywebcams.info
lesbiansinmysoup.com
videopornoxx.online
sexandplay.com
sexynakedselfies.info
barebabez.com
britnudes.net
camaporno.com
webxvideo.com
gayspornosex.com
jasminreviews.com
sexchatlines4u.com
sexybabeworld.org
sexyleaks.info
uniqueporno.com
wildsexsluts.com
xxxblacknudes.com
bigsexymelons.com
mi.thebestmasturbators.com

# more adult sites
acba.osb-land.com
the-naked.com
# the full URL is http://ww25.milfsplease.com; not sure whether the ww25 prefix
# should be included, so both forms are listed
ww25.milfsplease.com
milfsplease.com

# just get rid of any URL containing "livejasmin"
## livejasmin
# Actually: do that in the code (CCWETProcessor) with a log message,
# since we need to get rid of entire sites that contain
# any URL with the string "livejasmin".
# So run the program once, check the log for messages mentioning "additional"
# adult sites found, and add their domains here.
anigma-beauty.com
adultfeet.com
atopian.org
bellydancingvideo.net
bmmodelsagency.com
brucknergallery.com
fuckvidz.org
photobattle.net
votekat.info
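#
# A rough sketch of the site-level step described above, in Python. This is not the
# CCWETProcessor code: the function name drop_flagged_sites, the logger name and the
# log wording are assumptions made for illustration. It flags every domain that has
# at least one URL containing the marker string, then drops all URLs from those domains.

```python
import logging
from urllib.parse import urlsplit

log = logging.getLogger("blacklist-sketch")  # hypothetical logger name


def drop_flagged_sites(urls, marker="livejasmin"):
    """Return (kept_urls, flagged_domains) for the given list of absolute URLs."""
    flagged = set()
    for url in urls:
        host = urlsplit(url).hostname  # None if the URL has no scheme/host part
        if host and marker in url:
            flagged.add(host)
    for host in sorted(flagged):
        # Hypothetical message; the real log format is not shown in this file.
        log.info("additional adult site found: %s", host)
    kept = [u for u in urls if urlsplit(u).hostname not in flagged]
    return kept, flagged
```

# The flagged domains returned by such a pass are the kind of entries that
# would then be added to the list above by hand.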

# Similar to the above, the following contained the string "jasmin" in the URL
teenycuties.com
a.tiles.mapbox.com
blazingteens.net
redtubeporn.info
osb-land.com
totallyhotmales.com
babeevents.com
talkserver.de
hehechat.org
fetish-nights.com
lesslove.com
hebertsvideo.com

# sounds like a piracy site
^http://pirateguides.com/
fastmp3.ru

# from Alexa topsites at https://www.alexa.com/topsites
livejasmin.com
pornhub.com
# listed as a similar topsite at https://en.wikipedia.org/wiki/List_of_most_popular_websites
redtube.com
xvideos.com
xhamster.com
xnxx.com


# not sure about this one, but the domain name and/or full URL seems like it belongs here
abcutie.com

# only had a single seedURL and it quickly redirected to an adult site
apparactes.gq