# URL blacklist
# FORMAT:
# precede URL by ^ to blacklist urls that match the given prefix
# follow URL with $ to blacklist urls that match the given suffix
# ^url$ will blacklist urls that match the given url completely
# Without either ^ or $ symbol, urls containing the given url will get blacklisted
# manually adjusting for irrelevant topsite hits
# Rapa-Nui is related to Easter Island
^http://codex.cs.yale.edu/avi/silberschatz/gallery/trips-photos/South-America/Rapa-Nui/
# We will blacklist this yale.edu domain except for the subportion that gets whitelisted
# then in the sites-too-big-to-exhaustively-crawl.txt, we have a mapping for an allowed url
# pattern in case elements on the page are stored elsewhere
^http://korora.econ.yale.edu/
# wikipedia pages in
# ksh (a German dialect), ilo (Filipino), ty Tahitian, wa for Walons/Walloon,
# io (Ido version of Esperanto) and zh-min-nan (Min-Nan-Chinese) are not in the Maori language
# Not sure why Commoncrawl had found them for language code MRI
ksh.wikipedia.org
ilo.wikipedia.org
wa.wikipedia.org
ty.m.wikipedia.org
io.m.wikipedia.org
zh-min-nan.wikipedia.org
zh-min-nan.wiktionary.org
######
# unwanted domains
.video-chat.
.videochat.
3chat.ru
livevideochatting.org
lovewebcam.net
cherrybabe.biz
dreamsbabes.com
adultfantasyboutique.com
adultterra.com
leatherdyke.porn
hornyteenharlots.com
adultviewsex.com
adultsexualvideo.com
ctbererotica.sexe-traque.com
cybererotia.porn234.com
cybereroticz.adultsupermart.com
freegaywebcams.info
lesbiansinmysoup.com
videopornoxx.online
sexandplay.com
sexynakedselfies.info
barebabez.com
britnudes.net
camaporno.com
webxvideo.com
gayspornosex.com
jasminreviews.com
sexchatlines4u.com
sexybabeworld.org
sexyleaks.info
uniqueporno.com
wildsexsluts.com
xxxblacknudes.com
bigsexymelons.com
mi.thebestmasturbators.com
# more adult sites
acba.osb-land.com
the-naked.com
# the full URL is http://ww25.milfsplease.com, but don't know whether the ww25 prefix should be included or not
ww25.milfsplease.com
milfsplease.com
# just get rid of any URL containing "livejasmin"
## livejasmin
# Actually: do that in the code (CCWETProcessor) with a log message,
# since we actually need to get rid of any sites in entirety that contain
# any url with the string "livejasmin"
# So run the program once, check the log for messages mentioning "additional"
# adult sites found and add their domains in here.
anigma-beauty.com
adultfeet.com
atopian.org
bellydancingvideo.net
bmmodelsagency.com
brucknergallery.com
fuckvidz.org
photobattle.net
votekat.info
# Similar to above, the following contained the string "jasmin" in the URL
teenycuties.com
a.tiles.mapbox.com
blazingteens.net
redtubeporn.info
osb-land.com
totallyhotmales.com
babeevents.com
talkserver.de
hehechat.org
fetish-nights.com
lesslove.com
hebertsvideo.com
# sounds like some pirating site
^http://pirateguides.com/
fastmp3.ru
# from alexa topsites at https://www.alexa.com/topsites
livejasmin.com
pornhub.com
# listed as a similar topsite at https://en.wikipedia.org/wiki/List_of_most_popular_websites
redtube.com
xvideos.com
xhamster.com
xnxx.com
# not sure about the domain name and/or full url, but it seems like it belongs here
abcutie.com
# only had a single seedURL and it quickly redirected to an adult site
apparactes.gq