# URL blacklist
# FORMAT:
# Precede a URL with ^ to blacklist URLs that start with the given prefix
# Follow a URL with $ to blacklist URLs that end with the given suffix
# ^url$ blacklists only URLs that match the given URL exactly
# With neither ^ nor $, any URL containing the given string is blacklisted
#
# Manual adjustments for irrelevant topsite hits
# Rapa-Nui is related to Easter Island
^http://codex.cs.yale.edu/avi/silberschatz/gallery/trips-photos/South-America/Rapa-Nui/
# We blacklist this yale.edu domain except for the subportion that gets whitelisted.
# sites-too-big-to-exhaustively-crawl.txt then has a mapping for an allowed URL
# pattern, in case elements on the page are stored elsewhere
^http://korora.econ.yale.edu/
# Wikipedia pages in
# ksh (a German dialect), ilo (Filipino), ty (Tahitian), wa (Walloon),
# io (Ido version of Esperanto) and zh-min-nan (Min-Nan Chinese) are not in the Maori language.
# Not sure why Common Crawl found them for language code MRI
ksh.wikipedia.org
ilo.wikipedia.org
wa.wikipedia.org
ty.m.wikipedia.org
io.m.wikipedia.org
zh-min-nan.wikipedia.org
zh-min-nan.wiktionary.org
# Unwanted domains
.video-chat.
.videochat.
3chat.ru
livevideochatting.org
lovewebcam.net
cherrybabe.biz
dreamsbabes.com
adultfantasyboutique.com
adultterra.com
leatherdyke.porn
hornyteenharlots.com
adultviewsex.com
adultsexualvideo.com
ctbererotica.sexe-traque.com
cybererotia.porn234.com
cybereroticz.adultsupermart.com
freegaywebcams.info
lesbiansinmysoup.com
videopornoxx.online
sexandplay.com
sexynakedselfies.info
barebabez.com
britnudes.net
camaporno.com
webxvideo.com
gayspornosex.com
jasminreviews.com
sexchatlines4u.com
sexybabeworld.org
sexyleaks.info
uniqueporno.com
wildsexsluts.com
xxxblacknudes.com
# sounds like some pirating site
^http://pirateguides.com/
# from alexa topsites at https://www.alexa.com/topsites
livejasmin.com
pornhub.com
# listed as a similar topsite at https://en.wikipedia.org/wiki/List_of_most_popular_websites
redtube.com
xvideos.com
xhamster.com
xnxx.com