source: gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt@ 33569

Last change on this file since 33569 was 33569, checked in by ak19, 5 years ago
  1. batchcrawl.sh now does what it should have from the start: it moves the log.out and UNFINISHED files into the output folder instead of leaving them in the input folder, since the input to_crawl folder gets replaced every time I regenerate it after black/white/greylisting more urls.
  2. Blacklisted more adult sites; greylisted more product sites and the .ru, .pl and .tk domains, with whitelisting in the whitelist file.
  3. CCWETProcessor now looks out for additional adult sites based on URL, adds them to its in-memory blacklist (not the file), and logs the domain for checking and manual addition to the blacklist file.
File size: 3.0 KB
# URL blacklist
# FORMAT:
# precede the URL with ^ to blacklist urls that start with the given prefix
# follow the URL with $ to blacklist urls that end with the given suffix
# ^url$ will blacklist urls that match the given url exactly
# Without a ^ or $ symbol, any url containing the given url gets blacklisted


# manually adjusting for irrelevant topsite hits
# Rapa-Nui is Easter Island
^http://codex.cs.yale.edu/avi/silberschatz/gallery/trips-photos/South-America/Rapa-Nui/

# Blacklist this yale.edu domain except for the subportion that gets whitelisted.
# In sites-too-big-to-exhaustively-crawl.txt there is then a mapping for an allowed url
# pattern, in case elements on the page are stored elsewhere
^http://korora.econ.yale.edu/

# wikipedia pages in
# ksh (a German dialect), ilo (Ilocano, a Philippine language), ty (Tahitian), wa (Walloon),
# io (Ido, derived from Esperanto) and zh-min-nan (Min Nan Chinese) are not in the Maori language
# Not sure why CommonCrawl found them for language code MRI
ksh.wikipedia.org
ilo.wikipedia.org
wa.wikipedia.org
ty.m.wikipedia.org
io.m.wikipedia.org
zh-min-nan.wikipedia.org
zh-min-nan.wiktionary.org

######
# unwanted domains
.video-chat.
.videochat.
3chat.ru
livevideochatting.org
lovewebcam.net

cherrybabe.biz
dreamsbabes.com
adultfantasyboutique.com
adultterra.com

leatherdyke.porn
hornyteenharlots.com
adultviewsex.com
adultsexualvideo.com
ctbererotica.sexe-traque.com
cybererotia.porn234.com
cybereroticz.adultsupermart.com
freegaywebcams.info
lesbiansinmysoup.com
videopornoxx.online
sexandplay.com
sexynakedselfies.info
barebabez.com
britnudes.net
camaporno.com
webxvideo.com
gayspornosex.com
jasminreviews.com
sexchatlines4u.com
sexybabeworld.org
sexyleaks.info
uniqueporno.com
wildsexsluts.com
xxxblacknudes.com
bigsexymelons.com

# more adult sites
acba.osb-land.com


# just get rid of any URL containing "livejasmin"
## livejasmin
# Actually: do that in the code (CCWETProcessor) with a log message,
# since we need to get rid of entire sites that contain
# any url with the string "livejasmin"
# So run the program once, check the log for messages mentioning "additional"
# adult sites found and add their domains in here.
anigma-beauty.com
adultfeet.com
atopian.org
bellydancingvideo.net
bmmodelsagency.com
brucknergallery.com
fuckvidz.org
photobattle.net
votekat.info

# Similar to above, the following contained the string "jasmin" in the URL
teenycuties.com
a.tiles.mapbox.com
blazingteens.net
redtubeporn.info
osb-land.com
totallyhotmales.com
babeevents.com
talkserver.de
hehechat.org
fetish-nights.com
lesslove.com
hebertsvideo.com

# sounds like a piracy site
^http://pirateguides.com/
fastmp3.ru

# from alexa topsites at https://www.alexa.com/topsites
livejasmin.com
pornhub.com
# listed as a similar topsite at https://en.wikipedia.org/wiki/List_of_most_popular_websites
redtube.com
xvideos.com
xhamster.com
xnxx.com


# not sure, but the domain name and/or full url suggests it belongs here
abcutie.com

# only had a single seedURL and it quickly redirected to an adult site
apparactes.gq
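
The FORMAT rules in the header of this file can be illustrated with a small matcher. This is only a sketch in Python of the four cases (prefix, suffix, exact, contains), not the CCWETProcessor code. The function names `parse_rules`, `rule_matches` and `is_blacklisted` are hypothetical, and it is an assumption that rules are compared against the full URL string, including the protocol, and that every line starting with # is a comment.

```python
from typing import List


def parse_rules(text: str) -> List[str]:
    """Return the rule lines of a filter file, skipping blanks and # comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            rules.append(line)
    return rules


def rule_matches(rule: str, url: str) -> bool:
    """Apply one rule using the ^ / $ conventions described in the file header."""
    anchored_start = rule.startswith("^")
    anchored_end = rule.endswith("$")
    pattern = rule[1:] if anchored_start else rule
    pattern = pattern[:-1] if anchored_end else pattern
    if not pattern:
        # Assumption: a rule with nothing but anchors matches no URL.
        return False
    if anchored_start and anchored_end:
        return url == pattern          # ^url$ : exact match
    if anchored_start:
        return url.startswith(pattern)  # ^url  : prefix match
    if anchored_end:
        return url.endswith(pattern)    # url$  : suffix match
    return pattern in url               # url   : substring match


def is_blacklisted(url: str, rules: List[str]) -> bool:
    """True if any rule in the list matches the URL."""
    return any(rule_matches(rule, url) for rule in rules)


# A few entries copied from this file, used as sample input.
SAMPLE = """
# URL blacklist
^http://korora.econ.yale.edu/
ksh.wikipedia.org
.videochat.
## livejasmin
"""

rules = parse_rules(SAMPLE)
print(is_blacklisted("http://korora.econ.yale.edu/some/page.html", rules))
print(is_blacklisted("https://ksh.wikipedia.org/wiki/Example", rules))
print(is_blacklisted("https://mi.wikipedia.org/wiki/Aotearoa", rules))
```

Under these assumptions the first two URLs are rejected (prefix rule and substring rule respectively) and the mi.wikipedia.org URL passes. Note that a ^ rule written with http:// would not catch the https:// form of the same site under this full-string comparison.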