Timestamp:
2019-10-16T20:00:09+13:00
Author:
ak19
Message:
  1. batchcrawl.sh now does what it should have done from the start: it moves the log.out and UNFINISHED files into the output folder instead of leaving them in the input folder, because the input to_crawl folder is replaced every time I regenerate it after black/white/greylisting more URLs.
  2. Blacklisted more adult sites; greylisted more product sites and the .ru, .pl and .tk domains, with whitelisting handled in the whitelist file.
  3. CCWETProcessor now looks out for additional adult sites based on their URLs, adds them to its in-memory blacklist (not the file), and logs each domain so it can be checked and manually added to the blacklist file.
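The third point of the message can be illustrated with a minimal sketch. This is not the CCWETProcessor code: the function name `flag_additional_adult_sites`, the `ADULT_MARKERS` tuple, the logger name, and the log wording are all hypothetical, chosen only to show the idea of extending an in-memory blacklist from a URL substring and logging the domain for later manual review.

```python
import logging
from urllib.parse import urlparse

logger = logging.getLogger("ccwet_sketch")  # hypothetical logger name

# Hypothetical marker list; the changeset itself only mentions "livejasmin".
ADULT_MARKERS = ("livejasmin",)


def flag_additional_adult_sites(urls, blacklist):
    """Add the domain of every URL containing a marker to the in-memory set.

    The blacklist file is never touched; newly found domains are returned
    and logged so they can be checked and added to the file by hand.
    """
    added = set()
    for url in urls:
        lowered = url.lower()
        if any(marker in lowered for marker in ADULT_MARKERS):
            domain = urlparse(url).netloc.lower()
            # Skip URLs without a parseable host and domains already listed.
            if domain and domain not in blacklist:
                blacklist.add(domain)
                added.add(domain)
                logger.warning("additional adult site found: %s (from %s)", domain, url)
    return added
```

Because the whole domain goes into the set, every other URL from that site is rejected as well, which matches the stated aim of dropping such sites in their entirety.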
File:
1 edited

  • gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt

    r33568  r33569
        28      28  zh-min-nan.wiktionary.org
        29      29
                30  ######
        30      31  # unwanted domains
        31      32  .video-chat.

        69      70  acba.osb-land.com
        70      71
                72
                73  # just get rid of any URL containing "livejasmin"
                74  ## livejasmin
                75  # Actually: do that in the code (CCWETProcessor) with a log message,
                76  # since we actually need to get rid of any sites in entirety that contain
                77  # any url with the string "livejasmin"
                78  # So run the program once, check the log for messages mentioning "additional"
                79  # adult sites found and add their domains in here.
                80  anigma-beauty.com
                81  adultfeet.com
                82  atopian.org
                83  bellydancingvideo.net
                84  bmmodelsagency.com
                85  brucknergallery.com
                86  fuckvidz.org
                87  photobattle.net
                88  votekat.info
                89
                90  # Similar to above, the following contained the string "jasmin" in the URL
                91  teenycuties.com
                92  a.tiles.mapbox.com
                93  blazingteens.net
                94  redtubeporn.info
                95  osb-land.com
                96  totallyhotmales.com
                97  babeevents.com
                98  talkserver.de
                99  hehechat.org
               100  fetish-nights.com
               101  lesslove.com
               102  hebertsvideo.com
               103
        71     104  # sounds like some pirating site
        72     105  ^http://pirateguides.com/

        85     118  # not sure about the domain name and/or full url seems like it belongs here
        86     119  abcutie.com
               120
               121  # only had a single seedURL and it quickly redirected to an adult site
               122  apparactes.gq
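The entries in this diff come in a few shapes: `#` comment lines, bare strings such as `.video-chat.` or `abcutie.com`, and one entry with a leading `^`. The sketch below shows one way such a file could be read and applied. The matching semantics are an assumption, not something this changeset states: a leading `^` is taken to mean "URL starts with", and any other entry is treated as a plain substring. The function names are hypothetical.

```python
def load_filter_entries(lines):
    """Return the entries of a filter file, skipping blank and '#' comment lines."""
    entries = []
    for raw in lines:
        entry = raw.strip()
        if entry and not entry.startswith("#"):
            entries.append(entry)
    return entries


def is_blacklisted(url, entries):
    """Check a URL against the entries.

    Assumed semantics: '^' anchors the entry at the start of the URL,
    anything else matches anywhere in the URL.
    """
    for entry in entries:
        if entry.startswith("^"):
            if url.startswith(entry[1:]):
                return True
        elif entry in url:
            return True
    return False


entries = load_filter_entries([
    "# unwanted domains",
    ".video-chat.",
    "",
    "## livejasmin",
    "^http://pirateguides.com/",
    "abcutie.com",
])
```

Under these assumptions, `## livejasmin` is ignored as a comment, which is consistent with the note in the file that the "livejasmin" check was moved into the code instead.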