Changeset 33550


Timestamp:
2019-10-04T19:06:51+13:00
Author:
ak19
Message:

First stage of introducing sites-too-big-to-exhaustively-crawl.txt: split url-greylist-filter.txt into the true greylisted sites (product sites so far) and the existing top-site URLs, which simply represent sites too big to crawl in their entirety.

Location:
gs3-extensions/maori-lang-detection/conf
Files:
1 added
1 edited

  • gs3-extensions/maori-lang-detection/conf/url-greylist-filter.txt

    r33532 r33550  
    13 13 /product-page/
    14 14 /product-category/
    15 
    16 # Add alexa top sites to greylist
    17 
    18 youtube.com
    19 tmall.com
    20 baidu.com
    21 qq.com
    22 sohu.com
    23 facebook.com
    24 taobao.com
    25 #login.tmall.com
    26 wikipedia.org
    27 yahoo.com
    28 360.cn
    29 jd.com
    30 amazon.com
    31 Sina.com.cn
    32 weibo.com
    33 #pages.tmall.com
    34 live.com
    35 vk.com
    36 netflix.com
    37 alipay.com
    38 office.com
    39 okezone.com
    40 csdn.net
    41 instagram.com
    42 xinhuanet.com
    43 babytree.com
    44 twitter.com
    45 ebay.com
    46 stackoverflow.com
    47 naver.com
    48 aliexpress.com
    49 twitch.tv
    50 tribunnews.com
    51 apple.com
    52 soso.com
    53 tianya.cn
    54 microsoftonline.com
    55 yandex.ru
    56 
    57 # Remaining top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
    58 
    59 ok.ru
    60 paypal.com
    61 t.co
    62 pinterest.com
    63 sogou.com
    64 espn.com
    65 walmart.com
    66 bitly.com
    67 ampproject.org
    68 sm.cn
    69 
    70 
    71 
    72 # UNSURE - what if these contain translated pages?
    73 google.com
    74 bing.com
    75 amazon.co
    76 msn.com
    77 microsoft.com
    78 accuweather.com
    79 
    80 #nasa.gov
    81 # w3schools.com
    82 # quora.com
    83 #reddit.com
    84 #blogspot.com
    85 #yahoo.co.
    86 
    87 
    88 ## TODO: Get more from https://moz.com/top500
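
As a rough illustration of how a crawler might consume the two lists after this split, here is a minimal Python sketch. It is not the project's actual code. The function names (`parse_filter_lines`, `matches`) are hypothetical, and the matching rules are assumptions: blank lines and `#` lines are skipped, entries are compared case-insensitively, an entry starting with `/` is treated as a path fragment, and any other entry is treated as a domain that also covers its subdomains.

```python
from urllib.parse import urlsplit


def parse_filter_lines(lines):
    """Return the active entries: blanks and '#' comment lines dropped, lowercased."""
    entries = []
    for raw in lines:
        entry = raw.strip()
        if not entry or entry.startswith("#"):
            continue
        entries.append(entry.lower())
    return entries


def matches(url, entries):
    """Return True if the URL matches any entry (assumed semantics, see lead-in)."""
    parts = urlsplit(url.lower())
    host = parts.hostname or ""
    for entry in entries:
        if entry.startswith("/"):
            # Path-fragment entry such as /product-page/
            if entry in parts.path:
                return True
        elif host == entry or host.endswith("." + entry):
            # Domain entry such as youtube.com, including subdomains
            return True
    return False


# A few lines taken from each list in this changeset.
greylist = parse_filter_lines(["/product-page/", "/product-category/"])
too_big = parse_filter_lines(
    ["youtube.com", "#login.tmall.com", "Sina.com.cn", "", "# w3schools.com"]
)

print(matches("http://example.org/product-page/42", greylist))
print(matches("https://www.youtube.com/watch?v=1", too_big))
```

Keeping the two lists separate lets a caller treat them differently, for example skipping greylisted product pages outright while still allowing a limited crawl of a too-big site.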