source: gs3-extensions/maori-lang-detection/conf/url-greylist-filter.txt@ 33532

Last change on this file since 33532 was 33532, checked in by ak19, 5 years ago

Found again, at last, the other top 500 sites link that Dr Bainbridge had discovered the other day. Still need to go through the links in there.

File size: 1.4 KB
# URL 'greylist': save matching urls to one side, to eyeball later and decide if
# they should be included after all or whether it was okay to have skipped them
# FORMAT:
# precede URL by ^ to greylist urls that match the given prefix
# succeed URL by $ to greylist urls that match the given suffix
# ^url$ will greylist urls that match the given url completely
# Without either ^ or $ symbol, urls containing the given url will get greylisted


# Product sites: unwanted auto-translation pages of online product stores
/product/
/products/
/product-page/
/product-category/

# Add alexa top sites to greylist

youtube.com
tmall.com
baidu.com
qq.com
sohu.com
facebook.com
taobao.com
#login.tmall.com
wikipedia.org
yahoo.com
360.cn
jd.com
amazon.com
Sina.com.cn
weibo.com
#pages.tmall.com
live.com
vk.com
netflix.com
alipay.com
office.com
okezone.com
csdn.net
instagram.com
xinhuanet.com
babytree.com
twitter.com
ebay.com
stackoverflow.com
naver.com
aliexpress.com
twitch.tv
tribunnews.com
apple.com
soso.com
tianya.cn
microsoftonline.com
yandex.ru

# Remaining top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites

ok.ru
paypal.com
t.co
pinterest.com
sogou.com
espn.com
walmart.com
bitly.com
ampproject.org
sm.cn



# UNSURE - what if these contain translated pages?
google.com
bing.com
amazon.co
msn.com
microsoft.com
accuweather.com

#nasa.gov
# w3schools.com
# quora.com
#reddit.com
#blogspot.com
#yahoo.co.


## TODO: Get more from https://moz.com/top500
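
The FORMAT comment at the top of the file describes four matching modes (prefix, suffix, exact, and substring). The following is a minimal Python sketch of how a reader of this file might interpret those rules. It is not the project's actual implementation: the names `parse_rules` and `is_greylisted` are hypothetical, and it assumes case-sensitive matching against the full URL string (scheme included) and that lines starting with `#` are comments.

```python
from typing import Iterable, List, Tuple

# A rule is (mode, pattern); mode is one of
# "prefix", "suffix", "exact", "contains".
Rule = Tuple[str, str]


def parse_rules(lines: Iterable[str]) -> List[Rule]:
    """Turn filter-file lines into rules, skipping blanks and # comments."""
    rules: List[Rule] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        anchored_start = line.startswith("^")
        anchored_end = line.endswith("$")
        pattern = line[1:] if anchored_start else line
        if anchored_end:
            pattern = pattern[:-1]
        if not pattern:
            continue  # a bare "^", "$" or "^$" carries no URL text
        if anchored_start and anchored_end:
            mode = "exact"
        elif anchored_start:
            mode = "prefix"
        elif anchored_end:
            mode = "suffix"
        else:
            mode = "contains"
        rules.append((mode, pattern))
    return rules


def is_greylisted(url: str, rules: List[Rule]) -> bool:
    """Return True if any rule matches the URL (assumed case-sensitive)."""
    for mode, pattern in rules:
        if mode == "exact" and url == pattern:
            return True
        if mode == "prefix" and url.startswith(pattern):
            return True
        if mode == "suffix" and url.endswith(pattern):
            return True
        if mode == "contains" and pattern in url:
            return True
    return False
```

Under these assumptions, an unanchored entry such as `/product/` would greylist any URL containing that path segment, and an unanchored domain entry such as `t.co` would match any URL in which that text appears anywhere, not only URLs on that host. Anchoring with `^` or `$` narrows the match.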