source: other-projects/maori-lang-detection/conf/url-greylist-filter.txt@ 33961

Last change on this file since 33961 was 33904, checked in by ak19, 4 years ago

Shouldn't greylist anglican.org, as this prevented crawling of the justus.anglican.org seedURLs. However, there's no need to add an exception to sites-too-big-to-exhaustively-crawl.txt to control how much we crawl, as we only crawl to depth 10 anyway and the seedURLs already list the most promising pages (as well as 2 URLs on anglican.org which weren't promising). Added the to_crawl and finished crawled data for this. siteID is 01463.

File size: 1.8 KB
# URL 'greylist': save matching urls to one side, to eyeball later and decide if
# they should be included after all or whether it was okay to have skipped them
# FORMAT:
# precede URL by ^ to greylist urls that match the given prefix
# follow URL by $ to greylist urls that match the given suffix
# ^url$ will greylist urls that match the given url completely
# Without either ^ or $ symbol, urls containing the given url will get greylisted


# Product sites: unwanted auto-translation pages of online product stores and other websites
/product/
/products/
/product-page/
/product-category/
ledlamp.china-led-lighting.com
ledpar64.china-led-lighting.com
ledwallwasher.china-led-lighting.com
abacre.com
cn-huafu.net
apteka.social


# not product stores but autotranslated?
192-168-1-1l.com
19216811login.club
19216811login.club
1videosmusica.com
256file.com
# already in greylisting of all .ru
#7773033.ru
#abali.ru
#allbeautyone.ru
aqualuz.org

# if page doesn't load and can't be tested
1videosmusica.com
www.kiterewa.pl



# MANUALLY INSPECTED URLS AND ADDED TO GREYLIST

# license plate site? - already in greylisting of all .ru
#eba.com.ru

# As per archive.org, there's just a photo on the defunct page at this site
# And the picture label and filename are probably Japanese
agri.mine.utsunomiya-u.ac.jp

# seems to be Indonesian or Malaysian Bible rather than in Maori or any Polynesian language
alkitab.life:2022

# appears defunct
alixira.com

# single seedURL was not a page in Maori, but in global languages.
# And the rest of the domain appears to be in English.
#anglican.org
# but we want the seedURLs from justus.anglican.org,
# so grab anglican.org anyway


### TLDs that we greylist - any exceptions will be in the whitelist
# The .ru and .pl domains in our list were not relevant
.ru/
.pl/
.tk/
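
The FORMAT comments at the top of the file describe four matching modes (prefix, suffix, exact, contains). The following is a minimal Python sketch of how those rules could be read and applied, based only on those comments. It is not the crawler's actual implementation: the names Rule, parse_rules and is_greylisted are hypothetical, and it assumes the URL is compared as given (whether the real code strips the protocol or "www." before matching is not stated in this file). Whitelist exceptions are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    pattern: str
    kind: str  # one of "exact", "prefix", "suffix", "contains"


def parse_rules(text):
    """Parse filter-file text, skipping blank lines and '#' comment lines."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        starts = line.startswith("^")
        ends = line.endswith("$")
        # Drop the leading ^ and/or trailing $ marker to get the bare pattern.
        pattern = line[(1 if starts else 0):len(line) - (1 if ends else 0)]
        if not pattern:
            continue
        if starts and ends:
            kind = "exact"
        elif starts:
            kind = "prefix"
        elif ends:
            kind = "suffix"
        else:
            kind = "contains"
        rules.append(Rule(pattern, kind))
    return rules


def is_greylisted(url, rules):
    """Return True if any rule matches the URL (assumed compared as given)."""
    for rule in rules:
        if rule.kind == "exact" and url == rule.pattern:
            return True
        if rule.kind == "prefix" and url.startswith(rule.pattern):
            return True
        if rule.kind == "suffix" and url.endswith(rule.pattern):
            return True
        if rule.kind == "contains" and rule.pattern in url:
            return True
    return False


if __name__ == "__main__":
    # A few entries taken from the file above, plus made-up example URLs.
    rules = parse_rules("/product/\n.ru/\nagri.mine.utsunomiya-u.ac.jp\n")
    for url in ("http://shop.example.com/product/42",
                "http://example.ru/",
                "http://example.org/about"):
        print(url, is_greylisted(url, rules))
```

Under this reading, a TLD entry such as .ru/ is a plain "contains" rule, so it only matches URLs in which the domain is followed by a slash.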