source: gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt@ 33551

Last change on this file since 33551 was 33551, checked in by ak19, 5 years ago

Added the top 500 URLs from moz.com/top500, removed duplicates, removed subdomain variants (keeping just the main site variant), and sorted alphabetically again.

File size: 5.8 KB

# Added Alexa top sites (only 50 visible)
# Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
# Finally, also got the CSV from https://moz.com/top500 and added its sites to the list.
# Then used LibreOffice Calc to sort alphabetically and remove duplicates.
# Then, in Gedit, used regex search and replace to reduce <subdomain>.<site>.ext to just <site>.ext
# And re-sorted alphabetically

000webhost.com
360.cn
4shared.com
a8.net
abc.es
abc.net.au
abcnews.go.com
about.com
about.me
aboutads.info
abril.com.br
academia.edu
accuweather.com
addthis.com
addtoany.com
adobe.com
airbnb.com
akamaihd.net
alexa.com
alibaba.com
aliexpress.com
alipay.com
aljazeera.com
allaboutcookies.org
allrecipes.com
amazon.
ampproject.org
android.com
aol.com
ap.org
apache.org
apachefriends.org
apple.com
archive.org
arstechnica.com
arxiv.org
asahi.com
ask.fm
asus.com
axs.com
babytree.com
baidu.com
bandcamp.com
bbc.co.uk
bbc.com
berkeley.edu
biblegateway.com
biglobe.ne.jp
billboard.com
bing.com
bit.ly
bitly.com
blackberry.com
blogger.com
blogspot.com
bloomberg.com
booking.com
box.com
britannica.com
bt.com
bund.de
businessinsider.com
businesswire.com
buydomains.com
buzzfeed.com
ca.gov
cambridge.org
cbc.ca
cbsnews.com
cdc.gov
change.org
channel4.com
chicagotribune.com
cisco.com
clickbank.net
cloudflare.com
cnbc.com
cnet.com
cnn.com
cocolog-nifty.com
columbia.edu
cornell.edu
corriere.it
cpanel.com
cpanel.net
creativecommons.org
csdn.net
csmonitor.com
dailymail.co.uk
dailymotion.com
dan.com
daum.net
dell.com
depositfiles.com
detik.com
digg.com
disney.com
disqus.com
doubleclick.net
dreniq.com
dribbble.com
dropbox.com
dropboxusercontent.com
dw.com
e-recht24.de
ea.com
ebay.co.uk
ebay.com
economist.com
eff.org
ehow.com
elmundo.es
elpais.com
engadget.com
entrepreneur.com
eonline.com
espn.com
espn.go.com
etsy.com
europa.eu
eventbrite.com
example.com
excite.co.jp
express.co.uk
facebook.com
fandom.com
fastcompany.com
fb.com
fb.me
fda.gov
fedoraproject.org
feedburner.com
fifa.com
files.wordpress.com
flickr.com
forbes.com
fortune.com
foursquare.com
foxnews.com
ft.com
ftc.gov
gen.xyz
geocities.jp
gesetze-im-internet.de
ggpht.com
github.com
gizmodo.com
globo.com
gmail.com
gnu.org
godaddy.com
gofundme.com
goo.gl
goo.ne.jp
goodreads.com
google.
googleblog.com
googleusercontent.com
gooyaabitemplates.com
gov.uk
gravatar.com
greenpeace.org
gstatic.com
guardian.co.uk
harvard.edu
hatena.ne.jp
histats.com
hm.com
hollywoodreporter.com
home.pl
house.gov
howstuffworks.com
hp.com
huffingtonpost.com
huffpost.com
hugedomains.com
ibm.com
ibtimes.com
icann.org
ieee.org
ietf.org
ig.com.br
ign.com
ikea.com
imageshack.us
imdb.com
imgur.com
inc.com
independent.co.uk
indiatimes.com
indiegogo.com
instagram.com
intel.com
issuu.com
istockphoto.com
iubenda.com
jd.com
joomla.org
jquery.com
jstor.org
kickstarter.com
kinja.com
last.fm
latimes.com
lefigaro.fr
lemonde.fr
line.me
linkedin.com
list-manage.com
live.com
livejournal.com
livescience.com
loc.gov
lycos.com
mail.ru
marketwatch.com
marriott.com
mashable.com
mediafire.com
medium.com
mega.nz
mercurynews.com
merriam-webster.com
metro.co.uk
microsoft.com
microsoftonline.com
mirror.co.uk
mit.edu
mixcloud.com
mlb.com
mozilla.com
mozilla.org
msn.com
myspace.com
mysql.com
namecheap.com
narod.ru
nasa.gov
nationalgeographic.com
nature.com
naver.com
naver.jp
nbcnews.com
ndtv.com
netflix.com
netsons.com
netvibes.com
networkadvertising.org
news.com.au
newscientist.com
newsweek.com
nginx.com
nginx.org
nhk.or.jp
nicovideo.jp
nifty.com
nih.gov
nikkei.com
noaa.gov
nokia.com
npr.org
nvidia.com
nydailynews.com
nypost.com
nytimes.com
nyu.edu
odnoklassniki.ru
office.com
ok.ru
okezone.com
opera.com
oracle.com
orange.fr
oreilly.com
oup.com
over-blog.com
ovh.co.uk
ovh.com
ovh.net
ox.ac.uk
parallels.com
pastebin.com
paypal.com
pbs.org
people.com
photobucket.com
php.net
pinterest.com
pixabay.com
playstation.com
plesk.com
politico.com
prezi.com
princeton.edu
privacyshield.gov
prnewswire.com
psychologytoday.com
qq.com
quantcast.com
quora.com
rakuten.co.jp
rambler.ru
rapidshare.com
reddit.com
repubblica.it
reuters.com
ria.ru
rottentomatoes.com
rt.com
rtve.es
samsung.com
sapo.pt
sciencedaily.com
sciencedirect.com
sciencemag.org
scientificamerican.com
scribd.com
seattletimes.com
secureserver.net
sedo.com
seesaa.net
sendspace.com
sfgate.com
shopify.com
shutterstock.com
siemens.com
sina.com.cn
sky.com
skype.com
skyrock.com
slideshare.net
sm.cn
smh.com.au
so-net.ne.jp
softonic.com
sogou.com
sohu.com
soratemplates.com
soso.com
soundcloud.com
spiegel.de
spotify.com
springer.com
sputniknews.com
stackoverflow.com
stanford.edu
state.gov
steamcommunity.com
steampowered.com
storage.canalblog.com
stores.jp
storify.com
stuff.co.nz
surveymonkey.com
symantec.com
t-online.de
t.co
t.me
tabelog.com
taobao.com
target.com
techcrunch.com
ted.com
telegram.me
telegraph.co.uk
terra.com.br
theglobeandmail.com
theguardian.com
themeforest.net
thestar.com
thesun.co.uk
thetimes.co.uk
theverge.com
thoughtco.com
tianya.cn
time.com
tinyurl.com
tmall.com
tmz.com
tribunnews.com
tripadvisor.com
trustpilot.com
twitch.tv
twitter.com
ucoz.ru
uiuc.edu
umich.edu
un.org
undeveloped.com
unesco.org
uol.com.br
urbandictionary.com
usatoday.com
usgs.gov
usnews.com
uspto.gov
ustream.tv
utexas.edu
variety.com
venturebeat.com
vice.com
viglink.com
vimeo.com
vk.com
vkontakte.ru
vox.com
w3.org
w3schools.com
wa.me
walmart.com
washington.edu
washingtonpost.com
wattpad.com
web.fc2.com
webmd.com
weebly.com
weibo.com
welt.de
whatsapp.com
whitehouse.gov
who.int
wikia.com
wikihow.com
wikimedia.org
wikipedia.org
wikipedia.org
wikipedia.org
wiktionary.org
wiley.com
windowsphone.com
wired.com
wix.com
wordpress.org
worldbank.org
wp.com
wsj.com
xbox.com
xinhuanet.com
yadi.sk
yahoo.co.
yahoo.com
yahoo.com
yale.edu
yandex.ru
yelp.com
youku.com
youronlinechoices.com
youtu.be
youtube.com
ytimg.com
zdnet.com
zendesk.com
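
The header comments of this file describe a manual cleanup done in LibreOffice Calc and Gedit: strip subdomain variants, remove duplicates, and sort alphabetically. As a rough illustration only, the same steps might be scripted as below. The function name `normalise_sites` and the prefix tuple are hypothetical and are not part of the repository. The sketch strips only an explicit, assumed set of leading labels, because the list deliberately keeps some multi-label entries such as abcnews.go.com and files.wordpress.com.

```python
def normalise_sites(lines, prefixes=("www.", "m.", "en.")):
    """Return sorted, de-duplicated site entries from raw list lines.

    `prefixes` is an assumed, illustrative set of subdomain labels to strip;
    the original cleanup used a hand-written regex in Gedit instead.
    """
    sites = set()
    for raw in lines:
        entry = raw.strip().lower()
        # Skip blank lines and '#' comment lines.
        if not entry or entry.startswith("#"):
            continue
        # Strip at most one known leading subdomain label.
        for prefix in prefixes:
            if entry.startswith(prefix):
                entry = entry[len(prefix):]
                break
        # Adding to a set after stripping removes duplicates that the
        # stripping itself creates (e.g. en.wikipedia.org vs wikipedia.org).
        sites.add(entry)
    return sorted(sites)


if __name__ == "__main__":
    sample = ["# comment", "", "en.wikipedia.org", "wikipedia.org", "abc.es"]
    for site in normalise_sites(sample):
        print(site)
```

De-duplicating after the subdomain step, rather than before it, matters here: stripping subdomains can collapse several distinct entries into the same site name.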