Changeset 33555 for gs3-extensions


Timestamp:
2019-10-09T18:43:47+13:00 (5 years ago)
Author:
ak19
Message:

Modified the top-sites list as Dr Bainbridge described: all suffix variants of the same site (e.g. google.com, google.it) are retained. Prefixed entries are removed, however: e.g. translate.google.com is dropped because google.com is already listed. Used the latest versions of the Alexa top sites list, the wiki top sites page, and the Moz top 500 sites page that Dr Bainbridge had found.
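
The prefix rule described in the message can be illustrated with a short Python sketch. It is not taken from the repository: the function name `drop_prefixed_entries` and the sample list are hypothetical. The committed file also still contains some prefixed entries (e.g. storage.googleapis.com alongside googleapis.com), so the real list may have been edited by hand or with exceptions.

```python
def drop_prefixed_entries(domains):
    """Drop every entry that is a subdomain of another entry in the list.

    Hypothetical helper: suffix variants (google.com, google.it) are kept,
    prefixed entries (translate.google.com) go when their parent is listed.
    """
    listed = {d.strip().lower() for d in domains if d.strip()}
    kept = []
    for domain in sorted(listed):
        labels = domain.split(".")
        # Proper parents of a.b.c are b.c and c.
        parents = (".".join(labels[i:]) for i in range(1, len(labels)))
        if not any(parent in listed for parent in parents):
            kept.append(domain)
    return kept


sites = ["google.com", "translate.google.com", "google.it", "bbc.co.uk"]
print(drop_prefixed_entries(sites))
# → ['bbc.co.uk', 'google.com', 'google.it']
```

An entry is only dropped when one of its parents appears in the list itself, so bbc.co.uk survives unless co.uk or uk is listed.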

File:
1 edited

  • gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt

    r33554  r33555
        27      27    addtoany.com
        28      28    adobe.com
                29  + adweek.com
        29      30    airbnb.com
        30      31    akamaihd.net

        36      37    allaboutcookies.org
        37      38    allrecipes.com
        38          - amazon.
                39  + amazon.ca
                40  + amazon.co.jp
                41  + amazon.co.uk
                42  + amazon.com
                43  + amazon.de
                44  + amazon.es
                45  + amazon.fr
                46  + amazon.in
                47  + ameblo.jp
        39      48    ampproject.org
        40      49    android.com

        45      54    apple.com
        46      55    archive.org
                56  + archives.gov
        47      57    arstechnica.com
        48      58    arxiv.org

        56      66    bbc.co.uk
        57      67    bbc.com
                68  + behance.net
        58      69    berkeley.edu
        59      70    biblegateway.com

        68      79    bloomberg.com
        69      80    booking.com
                81  + boston.com
        70      82    box.com
        71      83    britannica.com

        78      90    ca.gov
        79      91    cambridge.org
                92  + canalblog.com
        80      93    cbc.ca
                94  + cbslocal.com
        81      95    cbsnews.com
        82      96    cdc.gov

        84      98    channel4.com
        85      99    chicagotribune.com
               100  + chinadaily.com.cn
        86     101    cisco.com
        87     102    clickbank.net
        88     103    cloudflare.com
               104  + cmu.edu
        89     105    cnbc.com
        90     106    cnet.com

        92     108    cocolog-nifty.com
        93     109    columbia.edu
               110  + connect.over-blog.com
        94     111    cornell.edu
        95     112    corriere.it

       103     120    dan.com
       104     121    daum.net
               122  + debian.org
       105     123    dell.com
       106     124    depositfiles.com
       107     125    detik.com
       108     126    digg.com
               127  + discovery.com
       109     128    disney.com
               129  + disney.go.com
       110     130    disqus.com
       111     131    doubleclick.net

       166     186    goo.ne.jp
       167     187    goodreads.com
       168          - google.
               188  + google.ca
               189  + google.co.id
               190  + google.co.in
               191  + google.co.jp
               192  + google.co.uk
               193  + google.com
               194  + google.com.br
               195  + google.com.hk
               196  + google.com.tr
               197  + google.de
               198  + google.es
               199  + google.fr
               200  + google.it
               201  + google.nl
               202  + google.pl
               203  + google.ru
               204  + googleapis.com
       169     205    googleblog.com
       170     206    googleusercontent.com

       203     239    indiegogo.com
       204     240    instagram.com
               241  + instructables.com
       205     242    intel.com
               243  + interia.pl
       206     244    issuu.com
       207     245    istockphoto.com

       224     262    livescience.com
       225     263    loc.gov
               264  + lonelyplanet.com
       226     265    lycos.com
               266  + m.wikipedia.org
       227     267    mail.ru
       228     268    marketwatch.com

       232     272    medium.com
       233     273    mega.nz
               274  + megaupload.com
       234     275    mercurynews.com
       235     276    merriam-webster.com

       253     294    naver.com
       254     295    naver.jp
               296  + nba.com
       255     297    nbcnews.com
       256     298    ndtv.com

       262     304    newscientist.com
       263     305    newsweek.com
               306  + newyorker.com
       264     307    nginx.com
       265     308    nginx.org

       279     322    odnoklassniki.ru
       280     323    office.com
               324  + offset.com
       281     325    ok.ru
       282     326    okezone.com

       295     339    paypal.com
       296     340    pbs.org
               341  + pcmag.com
       297     342    people.com
       298     343    photobucket.com

       302     347    playstation.com
       303     348    plesk.com
               349  + plos.org
       304     350    politico.com
               351  + prestashop.com
       305     352    prezi.com
       306     353    princeton.edu

       316     363    reddit.com
       317     364    repubblica.it
               365  + researchgate.net
       318     366    reuters.com
       319     367    ria.ru

       321     369    rt.com
       322     370    rtve.es
               371  + sakura.ne.jp
       323     372    samsung.com
       324     373    sapo.pt
               374  + scholastic.com
       325     375    sciencedaily.com
       326     376    sciencedirect.com

       341     391    skype.com
       342     392    skyrock.com
               393  + slate.com
       343     394    slideshare.net
       344     395    sm.cn

       355     406    springer.com
       356     407    sputniknews.com
               408  + ssl-images-amazon.com
       357     409    stackoverflow.com
               410  + standard.co.uk
       358     411    stanford.edu
       359     412    state.gov

       361     414    steampowered.com
       362     415    storage.canalblog.com
               416  + storage.googleapis.com
       363     417    stores.jp
       364     418    storify.com

       372     426    taobao.com
       373     427    target.com
               428  + teamviewer.com
       374     429    techcrunch.com
       375     430    ted.com

       377     432    telegraph.co.uk
       378     433    terra.com.br
               434  + theatlantic.com
               435  + thefreedictionary.com
       379     436    theglobeandmail.com
       380     437    theguardian.com
       381     438    themeforest.net
               439  + thenextweb.com
       382     440    thestar.com
       383     441    thesun.co.uk

       403     461    uol.com.br
       404     462    urbandictionary.com
               463  + usa.gov
       405     464    usatoday.com
       406     465    usgs.gov

       424     483    washingtonpost.com
       425     484    wattpad.com
               485  + weather.com
       426     486    web.fc2.com
       427     487    webmd.com

       448     508    xinhuanet.com
       449     509    yadi.sk
       450          - yahoo.co.
               510  + yahoo.co.jp
       451     511    yahoo.com
       452     512    yale.edu

       459     519    ytimg.com
       460     520    zdnet.com
               521  + zend.com
       461     522    zendesk.com
       462          -
               523  + zippyshare.com