source: gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt@ 33553

Last change on this file since 33553 was 33553, checked in by ak19, 5 years ago

# URL blacklist
# FORMAT:
# Precede a URL with ^ to blacklist URLs that start with the given prefix
# Follow a URL with $ to blacklist URLs that end with the given suffix
# ^url$ blacklists only URLs that match the given URL exactly
# Without either ^ or $, any URL containing the given text is blacklisted
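#
# The four rules above can be sketched in Python as follows. This is only a
# minimal illustration, not the crawler's actual implementation: the names
# load_rules and is_blacklisted are hypothetical, and the sketch assumes that
# rules are compared against the full URL string as plain text (no regex).

```python
def load_rules(text: str) -> list[str]:
    """Return the non-blank, non-comment lines of a blacklist file."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            rules.append(line)
    return rules


def is_blacklisted(url: str, rules: list[str]) -> bool:
    """Apply the ^prefix, suffix$, ^exact$ and substring rules to one URL."""
    for rule in rules:
        has_prefix = rule.startswith("^")
        has_suffix = rule.endswith("$")
        if has_prefix and has_suffix:
            # ^url$ : the URL must match completely
            if url == rule[1:-1]:
                return True
        elif has_prefix:
            # ^url : the URL must start with the given text
            if url.startswith(rule[1:]):
                return True
        elif has_suffix:
            # url$ : the URL must end with the given text
            if url.endswith(rule[:-1]):
                return True
        elif rule in url:
            # bare entry : the URL only needs to contain the given text
            return True
    return False
```

# Under these assumptions, a bare entry such as "google." matches any URL
# containing that text, whatever top-level domain follows it.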

# Contains Alexa top sites (only the first 50 were visible)
# Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
# Finally, also added https://moz.com/top500 by downloading its CSV file and
# adding its URLs to the existing Alexa/Wikipedia listing here.
# Then used LibreOffice Calc to sort the list alphabetically and remove duplicates.
# Then, in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
# just <site>.ext
# And finally, re-sorted the reduced list alphabetically and pasted it in here.
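#
# The clean-up steps above were done by hand in LibreOffice Calc and Gedit.
# A rough scripted equivalent might look like the sketch below. It is an
# assumption, not the method actually used: reduce_domains is a hypothetical
# name, and instead of a regex it drops an entry only when one of its parent
# domains is itself in the list, so the result can differ from the manual edit.

```python
def reduce_domains(domains):
    """Lowercase, de-duplicate, drop subdomains of listed sites, and sort."""
    unique = {d.strip().lower() for d in domains if d.strip()}
    kept = []
    for domain in unique:
        labels = domain.split(".")
        # Every parent of a.b.c is b.c and c
        parents = {".".join(labels[i:]) for i in range(1, len(labels))}
        if not (parents & unique):
            kept.append(domain)
    return sorted(kept)
```

# With this approach an entry such as espn.go.com is kept unless go.com is
# also listed, since no public-suffix knowledge is applied.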


000webhost.com
360.cn
4shared.com
a8.net
abc.es
abc.net.au
abcnews.go.com
about.com
about.me
aboutads.info
abril.com.br
academia.edu
accuweather.com
addthis.com
addtoany.com
adobe.com
airbnb.com
akamaihd.net
alexa.com
alibaba.com
aliexpress.com
alipay.com
aljazeera.com
allaboutcookies.org
allrecipes.com
amazon.
ampproject.org
android.com
aol.com
ap.org
apache.org
apachefriends.org
apple.com
archive.org
arstechnica.com
arxiv.org
asahi.com
ask.fm
asus.com
axs.com
babytree.com
baidu.com
bandcamp.com
bbc.co.uk
bbc.com
berkeley.edu
biblegateway.com
biglobe.ne.jp
billboard.com
bing.com
bit.ly
bitly.com
blackberry.com
blogger.com
blogspot.com
bloomberg.com
booking.com
box.com
britannica.com
bt.com
bund.de
businessinsider.com
businesswire.com
buydomains.com
buzzfeed.com
ca.gov
cambridge.org
cbc.ca
cbsnews.com
cdc.gov
change.org
channel4.com
chicagotribune.com
cisco.com
clickbank.net
cloudflare.com
cnbc.com
cnet.com
cnn.com
cocolog-nifty.com
columbia.edu
cornell.edu
corriere.it
cpanel.com
cpanel.net
creativecommons.org
csdn.net
csmonitor.com
dailymail.co.uk
dailymotion.com
dan.com
daum.net
dell.com
depositfiles.com
detik.com
digg.com
disney.com
disqus.com
doubleclick.net
dreniq.com
dribbble.com
dropbox.com
dropboxusercontent.com
dw.com
e-recht24.de
ea.com
ebay.co.uk
ebay.com
economist.com
eff.org
ehow.com
elmundo.es
elpais.com
engadget.com
entrepreneur.com
eonline.com
espn.com
espn.go.com
etsy.com
europa.eu
eventbrite.com
example.com
excite.co.jp
express.co.uk
facebook.com
fandom.com
fastcompany.com
fb.com
fb.me
fda.gov
fedoraproject.org
feedburner.com
fifa.com
files.wordpress.com
flickr.com
forbes.com
fortune.com
foursquare.com
foxnews.com
ft.com
ftc.gov
gen.xyz
geocities.jp
gesetze-im-internet.de
ggpht.com
github.com
gizmodo.com
globo.com
gmail.com
gnu.org
godaddy.com
gofundme.com
goo.gl
goo.ne.jp
goodreads.com
google.
googleblog.com
googleusercontent.com
gooyaabitemplates.com
gov.uk
gravatar.com
greenpeace.org
gstatic.com
guardian.co.uk
harvard.edu
hatena.ne.jp
histats.com
hm.com
hollywoodreporter.com
home.pl
house.gov
howstuffworks.com
hp.com
huffingtonpost.com
huffpost.com
hugedomains.com
ibm.com
ibtimes.com
icann.org
ieee.org
ietf.org
ig.com.br
ign.com
ikea.com
imageshack.us
imdb.com
imgur.com
inc.com
independent.co.uk
indiatimes.com
indiegogo.com
instagram.com
intel.com
issuu.com
istockphoto.com
iubenda.com
jd.com
joomla.org
jquery.com
jstor.org
kickstarter.com
kinja.com
last.fm
latimes.com
lefigaro.fr
lemonde.fr
line.me
linkedin.com
list-manage.com
live.com
livejournal.com
livescience.com
loc.gov
lycos.com
mail.ru
marketwatch.com
marriott.com
mashable.com
mediafire.com
medium.com
mega.nz
mercurynews.com
merriam-webster.com
metro.co.uk
microsoft.com
microsoftonline.com
mirror.co.uk
mit.edu
mixcloud.com
mlb.com
mozilla.com
mozilla.org
msn.com
myspace.com
mysql.com
namecheap.com
narod.ru
nasa.gov
nationalgeographic.com
nature.com
naver.com
naver.jp
nbcnews.com
ndtv.com
netflix.com
netsons.com
netvibes.com
networkadvertising.org
news.com.au
newscientist.com
newsweek.com
nginx.com
nginx.org
nhk.or.jp
nicovideo.jp
nifty.com
nih.gov
nikkei.com
noaa.gov
nokia.com
npr.org
nvidia.com
nydailynews.com
nypost.com
nytimes.com
nyu.edu
odnoklassniki.ru
office.com
ok.ru
okezone.com
opera.com
oracle.com
orange.fr
oreilly.com
oup.com
over-blog.com
ovh.co.uk
ovh.com
ovh.net
ox.ac.uk
parallels.com
pastebin.com
paypal.com
pbs.org
people.com
photobucket.com
php.net
pinterest.com
pixabay.com
playstation.com
plesk.com
politico.com
prezi.com
princeton.edu
privacyshield.gov
prnewswire.com
psychologytoday.com
qq.com
quantcast.com
quora.com
rakuten.co.jp
rambler.ru
rapidshare.com
reddit.com
repubblica.it
reuters.com
ria.ru
rottentomatoes.com
rt.com
rtve.es
samsung.com
sapo.pt
sciencedaily.com
sciencedirect.com
sciencemag.org
scientificamerican.com
scribd.com
seattletimes.com
secureserver.net
sedo.com
seesaa.net
sendspace.com
sfgate.com
shopify.com
shutterstock.com
siemens.com
sina.com.cn
sky.com
skype.com
skyrock.com
slideshare.net
sm.cn
smh.com.au
so-net.ne.jp
softonic.com
sogou.com
sohu.com
soratemplates.com
soso.com
soundcloud.com
spiegel.de
spotify.com
springer.com
sputniknews.com
stackoverflow.com
stanford.edu
state.gov
steamcommunity.com
steampowered.com
storage.canalblog.com
stores.jp
storify.com
stuff.co.nz
surveymonkey.com
symantec.com
t-online.de
t.co
t.me
tabelog.com
taobao.com
target.com
techcrunch.com
ted.com
telegram.me
telegraph.co.uk
terra.com.br
theglobeandmail.com
theguardian.com
themeforest.net
thestar.com
thesun.co.uk
thetimes.co.uk
theverge.com
thoughtco.com
tianya.cn
time.com
tinyurl.com
tmall.com
tmz.com
tribunnews.com
tripadvisor.com
trustpilot.com
twitch.tv
twitter.com
ucoz.ru
uiuc.edu
umich.edu
un.org
undeveloped.com
unesco.org
uol.com.br
urbandictionary.com
usatoday.com
usgs.gov
usnews.com
uspto.gov
ustream.tv
utexas.edu
variety.com
venturebeat.com
vice.com
viglink.com
vimeo.com
vk.com
vkontakte.ru
vox.com
w3.org
w3schools.com
wa.me
walmart.com
washington.edu
washingtonpost.com
wattpad.com
web.fc2.com
webmd.com
weebly.com
weibo.com
welt.de
whatsapp.com
whitehouse.gov
who.int
wikia.com
wikihow.com
wikimedia.org
wikipedia.org
wiktionary.org
wiley.com
windowsphone.com
wired.com
wix.com
wordpress.org
worldbank.org
wp.com
wsj.com
xbox.com
xinhuanet.com
yadi.sk
yahoo.co.
yahoo.com
yale.edu
yandex.ru
yelp.com
youku.com
youronlinechoices.com
youtu.be
youtube.com
ytimg.com
zdnet.com
zendesk.com