Changeset 33551


Timestamp:
2019-10-04T19:35:06+13:00 (5 years ago)
Author:
ak19
Message:

Added the top 500 URLs from moz.com/top500, removed duplicates, collapsed subdomain variants so that only the main site variant is kept, and re-sorted the list alphabetically.
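
The cleanup described in this message (collapse subdomain variants, remove duplicates, sort alphabetically) can be sketched roughly as below. This is only an illustration, not the procedure actually used for this changeset, which was done by hand in LibreOffice Calc and Gedit. The names `main_site`, `clean_site_list` and `TWO_LABEL_SUFFIXES` are hypothetical, and the sketch splits host names on dots instead of using a regex search and replace.

```python
# Hypothetical, deliberately tiny set of two-label suffixes (assumption).
# A real cleanup would need a complete list such as the Public Suffix List.
TWO_LABEL_SUFFIXES = {
    "co.uk", "ac.uk", "com.au", "net.au", "co.jp", "ne.jp",
    "or.jp", "com.br", "com.cn", "co.nz",
}


def main_site(host):
    """Reduce <subdomain>.<site>.<ext> to <site>.<ext> (naive heuristic)."""
    labels = host.strip().lower().split(".")
    if len(labels) <= 2:
        return ".".join(labels)
    # Keep one extra label when the extension itself has two labels (co.uk).
    suffix_len = 2 if ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES else 1
    return ".".join(labels[-(suffix_len + 1):])


def clean_site_list(hosts):
    """Strip subdomains, drop blanks and duplicates, sort alphabetically."""
    return sorted({main_site(h) for h in hosts if h.strip()})


if __name__ == "__main__":
    sample = ["www.bbc.co.uk", "bbc.co.uk", "news.yahoo.com", "yahoo.com"]
    for site in clean_site_list(sample):
        print(site)
```

Note that a heuristic like this would also collapse entries such as abcnews.go.com or files.wordpress.com, which the committed list keeps, so any automated pass would still need manual review.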

File:
1 edited

  • gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt

    --- r33550
    +++ r33551
     
    -# Add alexa top sites to greylist
    -# Remaining top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
    -## TODO: Get more from https://moz.com/top500
    +# Add alexa top sites (only 50 visible)
    +# Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
    +## Finally also got the CSV from https://moz.com/top500 and added it to the list and added them in.
    +# Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
    +# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext to keep just <site>.ext
    +# And resorted alphabetically
     
    +
    +000webhost.com
     360.cn
    +4shared.com
    +a8.net
    +abc.es
    +abc.net.au
    +abcnews.go.com
    +about.com
    +about.me
    +aboutads.info
    +abril.com.br
    +academia.edu
     accuweather.com
    -amazon.co
    -amazon.com
    -ampproject.org
    +addthis.com
    +addtoany.com
    +adobe.com
    +airbnb.com
    +akamaihd.net
    +alexa.com
    +alibaba.com
     aliexpress.com
     alipay.com
    +aljazeera.com
    +allaboutcookies.org
    +allrecipes.com
    +amazon.
    +ampproject.org
    +android.com
    +aol.com
    +ap.org
    +apache.org
    +apachefriends.org
     apple.com
    +archive.org
    +arstechnica.com
    +arxiv.org
    +asahi.com
    +ask.fm
    +asus.com
    +axs.com
     babytree.com
     baidu.com
    +bandcamp.com
    +bbc.co.uk
    +bbc.com
    +berkeley.edu
    +biblegateway.com
    +biglobe.ne.jp
    +billboard.com
     bing.com
    +bit.ly
     bitly.com
    +blackberry.com
    +blogger.com
     blogspot.com
    +bloomberg.com
    +booking.com
    +box.com
    +britannica.com
    +bt.com
    +bund.de
    +businessinsider.com
    +businesswire.com
    +buydomains.com
    +buzzfeed.com
    +ca.gov
    +cambridge.org
    +cbc.ca
    +cbsnews.com
    +cdc.gov
    +change.org
    +channel4.com
    +chicagotribune.com
    +cisco.com
    +clickbank.net
    +cloudflare.com
    +cnbc.com
    +cnet.com
    +cnn.com
    +cocolog-nifty.com
    +columbia.edu
    +cornell.edu
    +corriere.it
    +cpanel.com
    +cpanel.net
    +creativecommons.org
     csdn.net
    +csmonitor.com
    +dailymail.co.uk
    +dailymotion.com
    +dan.com
    +daum.net
    +dell.com
    +depositfiles.com
    +detik.com
    +digg.com
    +disney.com
    +disqus.com
    +doubleclick.net
    +dreniq.com
    +dribbble.com
    +dropbox.com
    +dropboxusercontent.com
    +dw.com
    +e-recht24.de
    +ea.com
    +ebay.co.uk
     ebay.com
    +economist.com
    +eff.org
    +ehow.com
    +elmundo.es
    +elpais.com
    +engadget.com
    +entrepreneur.com
    +eonline.com
     espn.com
    +espn.go.com
    +etsy.com
    +europa.eu
    +eventbrite.com
    +example.com
    +excite.co.jp
    +express.co.uk
     facebook.com
    +fandom.com
    +fastcompany.com
    +fb.com
    +fb.me
    +fda.gov
    +fedoraproject.org
    +feedburner.com
    +fifa.com
    +files.wordpress.com
    +flickr.com
    +forbes.com
    +fortune.com
    +foursquare.com
    +foxnews.com
    +ft.com
    +ftc.gov
    +gen.xyz
    +geocities.jp
    +gesetze-im-internet.de
    +ggpht.com
    +github.com
    +gizmodo.com
    +globo.com
    +gmail.com
    +gnu.org
    +godaddy.com
    +gofundme.com
    +goo.gl
    +goo.ne.jp
    +goodreads.com
    +google.
    +googleblog.com
    +googleusercontent.com
    +gooyaabitemplates.com
    +gov.uk
    +gravatar.com
    +greenpeace.org
    +gstatic.com
    +guardian.co.uk
    +harvard.edu
    +hatena.ne.jp
    +histats.com
    +hm.com
    +hollywoodreporter.com
    +home.pl
    +house.gov
    +howstuffworks.com
    +hp.com
    +huffingtonpost.com
    +huffpost.com
    +hugedomains.com
    +ibm.com
    +ibtimes.com
    +icann.org
    +ieee.org
    +ietf.org
    +ig.com.br
    +ign.com
    +ikea.com
    +imageshack.us
    +imdb.com
    +imgur.com
    +inc.com
    +independent.co.uk
    +indiatimes.com
    +indiegogo.com
     instagram.com
    +intel.com
    +issuu.com
    +istockphoto.com
    +iubenda.com
     jd.com
    -google.com
    +joomla.org
    +jquery.com
    +jstor.org
    +kickstarter.com
    +kinja.com
    +last.fm
    +latimes.com
    +lefigaro.fr
    +lemonde.fr
    +line.me
    +linkedin.com
    +list-manage.com
     live.com
    +livejournal.com
    +livescience.com
    +loc.gov
    +lycos.com
    +mail.ru
    +marketwatch.com
    +marriott.com
    +mashable.com
    +mediafire.com
    +medium.com
    +mega.nz
    +mercurynews.com
    +merriam-webster.com
    +metro.co.uk
     microsoft.com
     microsoftonline.com
    +mirror.co.uk
    +mit.edu
    +mixcloud.com
    +mlb.com
    +mozilla.com
    +mozilla.org
     msn.com
    +myspace.com
    +mysql.com
    +namecheap.com
    +narod.ru
    +nasa.gov
    +nationalgeographic.com
    +nature.com
     naver.com
    -nasa.gov
    +naver.jp
    +nbcnews.com
    +ndtv.com
     netflix.com
    +netsons.com
    +netvibes.com
    +networkadvertising.org
    +news.com.au
    +newscientist.com
    +newsweek.com
    +nginx.com
    +nginx.org
    +nhk.or.jp
    +nicovideo.jp
    +nifty.com
    +nih.gov
    +nikkei.com
    +noaa.gov
    +nokia.com
    +npr.org
    +nvidia.com
    +nydailynews.com
    +nypost.com
    +nytimes.com
    +nyu.edu
    +odnoklassniki.ru
     office.com
     ok.ru
     okezone.com
    +opera.com
    +oracle.com
    +orange.fr
    +oreilly.com
    +oup.com
    +over-blog.com
    +ovh.co.uk
    +ovh.com
    +ovh.net
    +ox.ac.uk
    +parallels.com
    +pastebin.com
     paypal.com
    +pbs.org
    +people.com
    +photobucket.com
    +php.net
     pinterest.com
    +pixabay.com
    +playstation.com
    +plesk.com
    +politico.com
    +prezi.com
    +princeton.edu
    +privacyshield.gov
    +prnewswire.com
    +psychologytoday.com
     qq.com
    +quantcast.com
     quora.com
    +rakuten.co.jp
    +rambler.ru
    +rapidshare.com
     reddit.com
    -Sina.com.cn
    +repubblica.it
    +reuters.com
    +ria.ru
    +rottentomatoes.com
    +rt.com
    +rtve.es
    +samsung.com
    +sapo.pt
    +sciencedaily.com
    +sciencedirect.com
    +sciencemag.org
    +scientificamerican.com
    +scribd.com
    +seattletimes.com
    +secureserver.net
    +sedo.com
    +seesaa.net
    +sendspace.com
    +sfgate.com
    +shopify.com
    +shutterstock.com
    +siemens.com
    +sina.com.cn
    +sky.com
    +skype.com
    +skyrock.com
    +slideshare.net
     sm.cn
    +smh.com.au
    +so-net.ne.jp
    +softonic.com
     sogou.com
     sohu.com
    +soratemplates.com
     soso.com
    +soundcloud.com
    +spiegel.de
    +spotify.com
    +springer.com
    +sputniknews.com
     stackoverflow.com
    +stanford.edu
    +state.gov
    +steamcommunity.com
    +steampowered.com
    +storage.canalblog.com
    +stores.jp
    +storify.com
    +stuff.co.nz
    +surveymonkey.com
    +symantec.com
    +t-online.de
     t.co
    +t.me
    +tabelog.com
     taobao.com
    +target.com
    +techcrunch.com
    +ted.com
    +telegram.me
    +telegraph.co.uk
    +terra.com.br
    +theglobeandmail.com
    +theguardian.com
    +themeforest.net
    +thestar.com
    +thesun.co.uk
    +thetimes.co.uk
    +theverge.com
    +thoughtco.com
     tianya.cn
    +time.com
    +tinyurl.com
     tmall.com
    -#pages.tmall.com
    -#login.tmall.com
    +tmz.com
     tribunnews.com
    +tripadvisor.com
    +trustpilot.com
     twitch.tv
     twitter.com
    +ucoz.ru
    +uiuc.edu
    +umich.edu
    +un.org
    +undeveloped.com
    +unesco.org
    +uol.com.br
    +urbandictionary.com
    +usatoday.com
    +usgs.gov
    +usnews.com
    +uspto.gov
    +ustream.tv
    +utexas.edu
    +variety.com
    +venturebeat.com
    +vice.com
    +viglink.com
    +vimeo.com
     vk.com
    +vkontakte.ru
    +vox.com
    +w3.org
     w3schools.com
    +wa.me
     walmart.com
    +washington.edu
    +washingtonpost.com
    +wattpad.com
    +web.fc2.com
    +webmd.com
    +weebly.com
     weibo.com
    +welt.de
    +whatsapp.com
    +whitehouse.gov
    +who.int
    +wikia.com
    +wikihow.com
    +wikimedia.org
     wikipedia.org
    +wikipedia.org
    +wikipedia.org
    +wiktionary.org
    +wiley.com
    +windowsphone.com
    +wired.com
    +wix.com
    +wordpress.org
    +worldbank.org
    +wp.com
    +wsj.com
    +xbox.com
     xinhuanet.com
    +yadi.sk
     yahoo.co.
     yahoo.com
    +yahoo.com
    +yale.edu
     yandex.ru
    +yelp.com
    +youku.com
    +youronlinechoices.com
    +youtu.be
     youtube.com
    +ytimg.com
    +zdnet.com
    +zendesk.com
     
     
    -
    -
    -
    -
    -
    -