# URL blacklist
# FORMAT:
# Precede a URL with ^ to blacklist URLs that start with the given prefix
# Follow a URL with $ to blacklist URLs that end with the given suffix
# ^url$ blacklists only URLs that match the given URL exactly
# Without either ^ or $, any URL containing the given string is blacklisted
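#
# A minimal sketch of the matching rules above, in Python. It is an illustration only and
# not the actual matching code; is_blacklisted and load_patterns are hypothetical names,
# and skipping blank lines and # comments when loading is an assumption.

```python
def load_patterns(lines):
    """Keep only pattern lines: skip blank lines and '#' comments (assumed convention)."""
    patterns = []
    for line in lines:
        entry = line.strip()
        if entry and not entry.startswith("#"):
            patterns.append(entry)
    return patterns


def is_blacklisted(url, patterns):
    """Return True if url matches any pattern under the FORMAT rules."""
    for pattern in patterns:
        anchored_start = pattern.startswith("^")
        anchored_end = pattern.endswith("$")
        if anchored_start and anchored_end:
            # ^url$ : exact match only
            if url == pattern[1:-1]:
                return True
        elif anchored_start:
            # ^url : prefix match
            if url.startswith(pattern[1:]):
                return True
        elif anchored_end:
            # url$ : suffix match
            if url.endswith(pattern[:-1]):
                return True
        elif pattern in url:
            # no anchors : substring match
            return True
    return False
```

# Under these rules, the entry ^http://korora.econ.yale.edu/ matches
# http://korora.econ.yale.edu/foo, but not a URL that only contains that string further along.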


# Manually adjusting for irrelevant topsite hits
# Rapa-Nui is related to Easter Island
^http://codex.cs.yale.edu/avi/silberschatz/gallery/trips-photos/South-America/Rapa-Nui/

# Blacklist this yale.edu domain, except for the subsection that gets whitelisted.
# sites-too-big-to-exhaustively-crawl.txt then maps an allowed URL pattern,
# in case elements on the page are stored elsewhere.
^http://korora.econ.yale.edu/

# Wikipedia pages in ksh (a German dialect), ilo (Ilocano, a Philippine language),
# ty (Tahitian), wa (Walloon), io (Ido, a derivative of Esperanto) and
# zh-min-nan (Min Nan Chinese) are not in the Maori language.
# Not sure why Common Crawl found them for language code MRI.
ksh.wikipedia.org
ilo.wikipedia.org
wa.wikipedia.org
ty.m.wikipedia.org
io.m.wikipedia.org
zh-min-nan.wikipedia.org
zh-min-nan.wiktionary.org

######
# unwanted domains
.video-chat.
.videochat.
3chat.ru
livevideochatting.org
lovewebcam.net

cherrybabe.biz
dreamsbabes.com
adultfantasyboutique.com
adultterra.com

leatherdyke.porn
hornyteenharlots.com
adultviewsex.com
adultsexualvideo.com
ctbererotica.sexe-traque.com
cybererotia.porn234.com
cybereroticz.adultsupermart.com
freegaywebcams.info
lesbiansinmysoup.com
videopornoxx.online
sexandplay.com
sexynakedselfies.info
barebabez.com
britnudes.net
camaporno.com
webxvideo.com
gayspornosex.com
jasminreviews.com
sexchatlines4u.com
sexybabeworld.org
sexyleaks.info
uniqueporno.com
wildsexsluts.com
xxxblacknudes.com
bigsexymelons.com

# more adult sites
acba.osb-land.com


# Just get rid of any URL containing "livejasmin":
## livejasmin
# Actually, do that in the code (CCWETProcessor) with a log message, since we
# need to discard in their entirety any sites that contain a URL with the
# string "livejasmin".
# So run the program once, check the log for messages mentioning "additional"
# adult sites found, and add their domains here.
anigma-beauty.com
adultfeet.com
atopian.org
bellydancingvideo.net
bmmodelsagency.com
brucknergallery.com
fuckvidz.org
photobattle.net
votekat.info
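#
# The step described above (flag every site that has any URL containing "livejasmin", then
# add the new domains here) could be sketched as follows. This is an assumption-laden
# illustration in Python, not the CCWETProcessor code: find_additional_domains is a
# hypothetical name, and treating existing entries as substrings of the host is a simplification.

```python
from urllib.parse import urlparse


def find_additional_domains(urls, needle, known_domains):
    """Return the sorted domains of URLs containing `needle` that are not already known."""
    found = set()
    for url in urls:
        if needle not in url:
            continue
        host = urlparse(url).hostname  # None when the URL has no scheme/host part
        if not host:
            continue
        if host.startswith("www."):
            host = host[4:]
        # Skip hosts already covered by a plain (substring) blacklist entry.
        if any(known in host for known in known_domains):
            continue
        found.add(host)
    return sorted(found)
```

# The returned domains are candidates only; each one still needs a manual check before
# it is added to this file.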

# As above, the following sites had URLs containing the string "jasmin"
teenycuties.com
a.tiles.mapbox.com
blazingteens.net
redtubeporn.info
osb-land.com
totallyhotmales.com
babeevents.com
talkserver.de
hehechat.org
fetish-nights.com
lesslove.com
hebertsvideo.com

# Sounds like a piracy site
^http://pirateguides.com/
fastmp3.ru

# From Alexa topsites at https://www.alexa.com/topsites
livejasmin.com
pornhub.com
# Listed as similar topsites at https://en.wikipedia.org/wiki/List_of_most_popular_websites
redtube.com
xvideos.com
xhamster.com
xnxx.com


# Not sure about this one, but the domain name and/or full URL seems like it belongs here
abcutie.com

# Only had a single seed URL, which quickly redirected to an adult site
apparactes.gq