source: gs3-extensions/maori-lang-detection/conf/url-greylist-filter.txt@ 33569

Last change on this file since 33569 was 33569, checked in by ak19, 5 years ago
  1. batchcrawl.sh now does what it should have done from the start: it moves the log.out and UNFINISHED files into the output folder instead of leaving them in the input folder, since the input to_crawl folder gets replaced every time I regenerate it after black/white/greylisting more urls.
  2. Blacklisted more adult sites; greylisted more product sites and the .ru, .pl and .tk domains, with exceptions whitelisted in the whitelist file.
  3. CCWETProcessor now looks out for additional adult sites based on URL, adds them to its blacklist in memory (not the file), and logs the domain for checking and manual addition to the blacklist file.
File size: 1.7 KB
# URL 'greylist': save matching urls to one side, to inspect later and decide whether
# they should be included after all or whether it was okay to have skipped them
# FORMAT:
# precede URL with ^ to greylist urls that start with the given prefix
# follow URL with $ to greylist urls that end with the given suffix
# ^url$ will greylist urls that match the given url exactly
# With neither ^ nor $, urls containing the given url will get greylisted


# Product sites: unwanted auto-translation pages of online product stores and other websites
/product/
/products/
/product-page/
/product-category/
ledlamp.china-led-lighting.com
ledpar64.china-led-lighting.com
ledwallwasher.china-led-lighting.com
abacre.com
cn-huafu.net
apteka.social


# not product stores but autotranslated?
192-168-1-1l.com
19216811login.club
1videosmusica.com
256file.com
# already covered by the greylisting of all .ru domains
#7773033.ru
#abali.ru
#allbeautyone.ru
aqualuz.org

# if page doesn't load and can't be tested
1videosmusica.com
www.kiterewa.pl



# MANUALLY INSPECTED URLS AND ADDED TO GREYLIST

# license plate site? - already covered by the greylisting of all .ru domains
#eba.com.ru

# As per archive.org, there's just a photo on the defunct page at this site
# And the picture label and filename are probably Japanese
agri.mine.utsunomiya-u.ac.jp

# seems to be an Indonesian or Malaysian Bible rather than one in Maori or any Polynesian language
alkitab.life:2022

# appears defunct
alixira.com

# single seedURL was not a page in Maori, but in global languages.
# And the rest of the domain appears to be in English
anglican.org


### TLDs that we greylist - any exceptions will be in the whitelist
# The .ru and .pl domains in our list were not relevant
.ru/
.pl/
.tk/
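
The matching rules described in the FORMAT comments at the top of this file can be sketched as follows. This is a minimal illustration in Python, not the actual CCWETProcessor code: the function names `parse_filter_lines`, `matches` and `is_greylisted` are hypothetical, and it assumes each URL is compared as a plain string (whether the real processor strips the protocol or normalises the URL first is not stated in this file). Apart from `/product/` and `.ru/`, the sample entries are made up for the example.

```python
def parse_filter_lines(lines):
    """Return the filter entries, skipping blank lines and # comments."""
    patterns = []
    for line in lines:
        entry = line.strip()
        if entry and not entry.startswith("#"):
            patterns.append(entry)
    return patterns


def matches(pattern, url):
    """Apply one filter entry to a URL using the rules in the FORMAT comments."""
    anchored_start = pattern.startswith("^")
    anchored_end = pattern.endswith("$")
    if anchored_start and anchored_end:
        return url == pattern[1:-1]           # ^url$ : exact match
    if anchored_start:
        return url.startswith(pattern[1:])    # ^url  : prefix match
    if anchored_end:
        return url.endswith(pattern[:-1])     # url$  : suffix match
    return pattern in url                     # url   : substring match


def is_greylisted(url, patterns):
    """True if any filter entry matches the URL."""
    return any(matches(p, url) for p in patterns)


# Sample entries: /product/ and .ru/ come from this file,
# the anchored entries are hypothetical.
sample_lines = [
    "# Product sites",
    "/product/",
    ".ru/",
    "^example.org/shop",
    "index.php$",
    "^example.com/exact$",
    "",
]
patterns = parse_filter_lines(sample_lines)
```

Under these assumptions, `is_greylisted("https://shop.example.com/product/123", patterns)` is true because of the substring entry `/product/`. Because `.ru/` is also a plain substring entry, it matches `http://abali.ru/page` but not a bare `http://abali.ru` with no trailing slash.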