url-blacklist-filter.txt@ 33569

Last change on this file since 33569 was 33569, checked in by ak19, 5 years ago

1. batchcrawl.sh now does what it should have done from the start: it moves the log.out and UNFINISHED files into the output folder instead of leaving them in the input folder, since the input to_crawl folder gets replaced every time I regenerate it after black/white/greylisting more urls. 2. Blacklisted more adult sites; greylisted more product sites and the .ru, .pl and .tk domains, with whitelisting done in the whitelist file. 3. CCWETProcessor now looks out for additional adult sites based on URL, adds them to its in-memory blacklist (not the file), and logs the domain so it can be checked and manually added to the blacklist file.
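The third point of the change description can be illustrated with a minimal Python sketch. This is not the CCWETProcessor code: the function name, the keyword tuple and the logger name are hypothetical, and the sketch assumes the check is a case-insensitive substring test on the URL, with the "domain" taken to be the URL's host name.

```python
import logging
from urllib.parse import urlsplit

# Hypothetical logger name, for illustration only.
logger = logging.getLogger("blacklist_sketch")

# Hypothetical keyword list; the blacklist file's comments mention "livejasmin".
ADULT_KEYWORDS = ("livejasmin",)


def note_additional_adult_site(url, in_memory_blacklist, keywords=ADULT_KEYWORDS):
    """If the URL contains a keyword, add its host to the in-memory set and log it.

    Returns True when the URL was flagged, False otherwise.
    The blacklist file on disk is never touched.
    """
    lowered = url.lower()
    if not any(keyword in lowered for keyword in keywords):
        return False

    domain = urlsplit(url).hostname  # None when the URL has no scheme/host part
    if domain is None:
        return False

    if domain not in in_memory_blacklist:
        in_memory_blacklist.add(domain)
        # Logged so the domain can be reviewed and added to the file by hand.
        logger.info("additional adult site found: %s", domain)
    return True


# Usage with placeholder URLs (example.org / example.net are not from the file).
seen = set()
note_additional_adult_site("http://example.org/links/livejasmin/page.html", seen)
note_additional_adult_site("http://example.net/about.html", seen)
```

After the two calls above, `seen` holds only `example.org`. Keeping the additions in memory means a false positive can be caught in the log before it is committed to the blacklist file.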

File size: 3.0 KB

# URL blacklist
# FORMAT:
# Precede URL with ^ to blacklist urls that start with the given prefix
# Follow URL with $ to blacklist urls that end with the given suffix
# ^url$ will blacklist urls that match the given url completely
# Without either ^ or $ symbol, urls containing the given url will get blacklisted


# manually adjusting for irrelevant topsite hits
# Rapa-Nui is related to Easter Island
^http://codex.cs.yale.edu/avi/silberschatz/gallery/trips-photos/South-America/Rapa-Nui/

# We will blacklist this yale.edu domain except for the subportion that gets whitelisted.
# Then, in sites-too-big-to-exhaustively-crawl.txt, we have a mapping for an allowed url
# pattern in case elements on the page are stored elsewhere
^http://korora.econ.yale.edu/

# wikipedia pages in
# ksh (a German dialect), ilo (Ilocano, a Philippine language), ty (Tahitian), wa (Walloon),
# io (Ido, derived from Esperanto) and zh-min-nan (Min Nan Chinese) are not in the Maori language
# Not sure why Common Crawl found them for language code MRI
ksh.wikipedia.org
ilo.wikipedia.org
wa.wikipedia.org
ty.m.wikipedia.org
io.m.wikipedia.org
zh-min-nan.wikipedia.org
zh-min-nan.wiktionary.org

######
# unwanted domains
.video-chat.
.videochat.
3chat.ru
livevideochatting.org
lovewebcam.net

cherrybabe.biz
dreamsbabes.com
adultfantasyboutique.com
adultterra.com

leatherdyke.porn
hornyteenharlots.com
adultviewsex.com
adultsexualvideo.com
ctbererotica.sexe-traque.com
cybererotia.porn234.com
cybereroticz.adultsupermart.com
freegaywebcams.info
lesbiansinmysoup.com
videopornoxx.online
sexandplay.com
sexynakedselfies.info
barebabez.com
britnudes.net
camaporno.com
webxvideo.com
gayspornosex.com
jasminreviews.com
sexchatlines4u.com
sexybabeworld.org
sexyleaks.info
uniqueporno.com
wildsexsluts.com
xxxblacknudes.com
bigsexymelons.com

# more adult sites
acba.osb-land.com


# just get rid of any URL containing "livejasmin"
## livejasmin
# Actually: do that in the code (CCWETProcessor) with a log message,
# since we need to get rid of entire sites that contain
# any url with the string "livejasmin"
# So run the program once, check the log for messages mentioning "additional"
# adult sites found and add their domains in here.
anigma-beauty.com
adultfeet.com
atopian.org
bellydancingvideo.net
bmmodelsagency.com
brucknergallery.com
fuckvidz.org
photobattle.net
votekat.info

# Similar to above, the following contained the string "jasmin" in the URL
teenycuties.com
a.tiles.mapbox.com
blazingteens.net
redtubeporn.info
osb-land.com
totallyhotmales.com
babeevents.com
talkserver.de
hehechat.org
fetish-nights.com
lesslove.com
hebertsvideo.com

# sounds like some pirating site
^http://pirateguides.com/
fastmp3.ru

# from alexa topsites at https://www.alexa.com/topsites
livejasmin.com
pornhub.com
# listed as a similar topsite at https://en.wikipedia.org/wiki/List_of_most_popular_websites
redtube.com
xvideos.com
xhamster.com
xnxx.com


# not sure, but the domain name and/or full url seems like it belongs here
abcutie.com

# only had a single seedURL and it quickly redirected to an adult site
apparactes.gq
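The FORMAT comment at the top of the file describes four matching modes. As a rough illustration, here is a minimal Python sketch of how such a file could be parsed and applied. It is an assumption-laden sketch, not the project's implementation: the names `Rule`, `parse_rules` and `is_blacklisted` are hypothetical, matching is assumed to be plain case-sensitive string comparison rather than regular expressions, and lines starting with `#` are assumed to be comments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    kind: str     # one of "prefix", "suffix", "exact", "contains"
    pattern: str  # the url text with any ^ / $ markers removed


def parse_rules(lines):
    """Turn filter-file lines into rules, skipping blank lines and # comments."""
    rules = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        starts = line.startswith("^")
        ends = line.endswith("$")
        if starts and ends:
            rules.append(Rule("exact", line[1:-1]))
        elif starts:
            rules.append(Rule("prefix", line[1:]))
        elif ends:
            rules.append(Rule("suffix", line[:-1]))
        else:
            rules.append(Rule("contains", line))
    return rules


def is_blacklisted(url, rules):
    """Return True if any rule matches the url under its matching mode."""
    for rule in rules:
        if rule.kind == "exact" and url == rule.pattern:
            return True
        if rule.kind == "prefix" and url.startswith(rule.pattern):
            return True
        if rule.kind == "suffix" and url.endswith(rule.pattern):
            return True
        if rule.kind == "contains" and rule.pattern in url:
            return True
    return False


# Usage with a few entries taken from the file above.
sample = [
    "# URL blacklist",
    "^http://korora.econ.yale.edu/",
    "ksh.wikipedia.org",
    "",
]
rules = parse_rules(sample)
blocked = is_blacklisted("https://ksh.wikipedia.org/wiki/Example", rules)
allowed = not is_blacklisted("https://mi.wikipedia.org/wiki/Example", rules)
```

Under these assumptions both `blocked` and `allowed` are true. Note that a "contains" entry such as `osb-land.com` also covers its subdomains, which is why a separate `acba.osb-land.com` entry is redundant but harmless.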

source: gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt@ 33569