Changeset 33555

Timestamp: 09.10.2019 18:43:47
Author: ak19
Message:

Modified the top sites list as Dr Bainbridge described: entries that differ only in their suffix (e.g. google.com, google.it) are all retained. I have removed prefixed entries, however: e.g. translate.google.com is removed since google.com is already there. Used the latest version of the Alexa and wiki top sites pages, and the Moz top 500 sites page Dr Bainbridge had found.
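
The prefix rule in the message can be sketched as below. This is only an illustration under stated assumptions, not the procedure actually used for this commit: the function name drop_subdomain_prefixes and the sample list are hypothetical, and the committed file still keeps some prefixed entries (e.g. storage.googleapis.com alongside googleapis.com), so the real list evidently involved manual judgement as well.

```python
def drop_subdomain_prefixes(domains):
    """Return the domains that are not subdomains of another listed domain.

    Hypothetical helper: entries that differ only by suffix (google.com,
    google.it) are all kept, while an entry such as translate.google.com
    is dropped when its parent google.com is also listed.
    """
    listed = set(domains)
    kept = []
    for domain in domains:
        labels = domain.split(".")
        # Every proper parent, obtained by stripping one or more leading labels.
        parents = {".".join(labels[i:]) for i in range(1, len(labels))}
        if parents & listed:
            continue  # a parent domain is already in the list
        kept.append(domain)
    return kept


# Assumed sample input, not taken from the committed file.
sites = ["google.com", "google.it", "translate.google.com", "google.co.uk"]
print(drop_subdomain_prefixes(sites))
# prints ['google.com', 'google.it', 'google.co.uk']
```

Because only exact parents present in the list are checked, no public suffix data is needed: google.co.uk survives since neither co.uk nor uk is itself an entry.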

Files: 1 modified

  • gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt

    r33554  r33555
        27      27    addtoany.com
        28      28    adobe.com
                29  + adweek.com
        29      30    airbnb.com
        30      31    akamaihd.net

        36      37    allaboutcookies.org
        37      38    allrecipes.com
        38          - amazon.
                39  + amazon.ca
                40  + amazon.co.jp
                41  + amazon.co.uk
                42  + amazon.com
                43  + amazon.de
                44  + amazon.es
                45  + amazon.fr
                46  + amazon.in
                47  + ameblo.jp
        39      48    ampproject.org
        40      49    android.com

        45      54    apple.com
        46      55    archive.org
                56  + archives.gov
        47      57    arstechnica.com
        48      58    arxiv.org

        56      66    bbc.co.uk
        57      67    bbc.com
                68  + behance.net
        58      69    berkeley.edu
        59      70    biblegateway.com

        68      79    bloomberg.com
        69      80    booking.com
                81  + boston.com
        70      82    box.com
        71      83    britannica.com

        78      90    ca.gov
        79      91    cambridge.org
                92  + canalblog.com
        80      93    cbc.ca
                94  + cbslocal.com
        81      95    cbsnews.com
        82      96    cdc.gov

        84      98    channel4.com
        85      99    chicagotribune.com
               100  + chinadaily.com.cn
        86     101    cisco.com
        87     102    clickbank.net
        88     103    cloudflare.com
               104  + cmu.edu
        89     105    cnbc.com
        90     106    cnet.com

        92     108    cocolog-nifty.com
        93     109    columbia.edu
               110  + connect.over-blog.com
        94     111    cornell.edu
        95     112    corriere.it

       103     120    dan.com
       104     121    daum.net
               122  + debian.org
       105     123    dell.com
       106     124    depositfiles.com
       107     125    detik.com
       108     126    digg.com
               127  + discovery.com
       109     128    disney.com
               129  + disney.go.com
       110     130    disqus.com
       111     131    doubleclick.net

       166     186    goo.ne.jp
       167     187    goodreads.com
       168          - google.
               188  + google.ca
               189  + google.co.id
               190  + google.co.in
               191  + google.co.jp
               192  + google.co.uk
               193  + google.com
               194  + google.com.br
               195  + google.com.hk
               196  + google.com.tr
               197  + google.de
               198  + google.es
               199  + google.fr
               200  + google.it
               201  + google.nl
               202  + google.pl
               203  + google.ru
               204  + googleapis.com
       169     205    googleblog.com
       170     206    googleusercontent.com

       203     239    indiegogo.com
       204     240    instagram.com
               241  + instructables.com
       205     242    intel.com
               243  + interia.pl
       206     244    issuu.com
       207     245    istockphoto.com

       224     262    livescience.com
       225     263    loc.gov
               264  + lonelyplanet.com
       226     265    lycos.com
               266  + m.wikipedia.org
       227     267    mail.ru
       228     268    marketwatch.com

       232     272    medium.com
       233     273    mega.nz
               274  + megaupload.com
       234     275    mercurynews.com
       235     276    merriam-webster.com

       253     294    naver.com
       254     295    naver.jp
               296  + nba.com
       255     297    nbcnews.com
       256     298    ndtv.com

       262     304    newscientist.com
       263     305    newsweek.com
               306  + newyorker.com
       264     307    nginx.com
       265     308    nginx.org

       279     322    odnoklassniki.ru
       280     323    office.com
               324  + offset.com
       281     325    ok.ru
       282     326    okezone.com

       295     339    paypal.com
       296     340    pbs.org
               341  + pcmag.com
       297     342    people.com
       298     343    photobucket.com

       302     347    playstation.com
       303     348    plesk.com
               349  + plos.org
       304     350    politico.com
               351  + prestashop.com
       305     352    prezi.com
       306     353    princeton.edu

       316     363    reddit.com
       317     364    repubblica.it
               365  + researchgate.net
       318     366    reuters.com
       319     367    ria.ru

       321     369    rt.com
       322     370    rtve.es
               371  + sakura.ne.jp
       323     372    samsung.com
       324     373    sapo.pt
               374  + scholastic.com
       325     375    sciencedaily.com
       326     376    sciencedirect.com

       341     391    skype.com
       342     392    skyrock.com
               393  + slate.com
       343     394    slideshare.net
       344     395    sm.cn

       355     406    springer.com
       356     407    sputniknews.com
               408  + ssl-images-amazon.com
       357     409    stackoverflow.com
               410  + standard.co.uk
       358     411    stanford.edu
       359     412    state.gov

       361     414    steampowered.com
       362     415    storage.canalblog.com
               416  + storage.googleapis.com
       363     417    stores.jp
       364     418    storify.com

       372     426    taobao.com
       373     427    target.com
               428  + teamviewer.com
       374     429    techcrunch.com
       375     430    ted.com

       377     432    telegraph.co.uk
       378     433    terra.com.br
               434  + theatlantic.com
               435  + thefreedictionary.com
       379     436    theglobeandmail.com
       380     437    theguardian.com
       381     438    themeforest.net
               439  + thenextweb.com
       382     440    thestar.com
       383     441    thesun.co.uk

       403     461    uol.com.br
       404     462    urbandictionary.com
               463  + usa.gov
       405     464    usatoday.com
       406     465    usgs.gov

       424     483    washingtonpost.com
       425     484    wattpad.com
               485  + weather.com
       426     486    web.fc2.com
       427     487    webmd.com

       448     508    xinhuanet.com
       449     509    yadi.sk
       450          - yahoo.co.
               510  + yahoo.co.jp
       451     511    yahoo.com
       452     512    yale.edu

       459     519    ytimg.com
       460     520    zdnet.com
               521  + zend.com
       461     522    zendesk.com
       462          -
               523  + zippyshare.com