Changeset 33551 for gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt
- Timestamp: 2019-10-04T19:35:06+13:00
- File: 1 edited
gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt
--- r33550
+++ r33551

- # Add alexa top sites to greylist
- # Remaining top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
- ## TODO: Get more from https://moz.com/top500
+ # Add alexa top sites (only 50 visible)
+ # Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
+ ## Finally also got the CSV from https://moz.com/top500 and added it to the list and added them in.
+ # Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
+ # Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext to keep just <site>.ext
+ # And resorted alphabetically

+
+ 000webhost.com
  360.cn
+ 4shared.com
+ a8.net
+ abc.es
+ abc.net.au
+ abcnews.go.com
+ about.com
+ about.me
+ aboutads.info
+ abril.com.br
+ academia.edu
  accuweather.com
- amazon.co
- amazon.com
- ampproject.org
+ addthis.com
+ addtoany.com
+ adobe.com
+ airbnb.com
+ akamaihd.net
+ alexa.com
+ alibaba.com
  aliexpress.com
  alipay.com
+ aljazeera.com
+ allaboutcookies.org
+ allrecipes.com
+ amazon.
+ ampproject.org
+ android.com
+ aol.com
+ ap.org
+ apache.org
+ apachefriends.org
  apple.com
+ archive.org
+ arstechnica.com
+ arxiv.org
+ asahi.com
+ ask.fm
+ asus.com
+ axs.com
  babytree.com
  baidu.com
+ bandcamp.com
+ bbc.co.uk
+ bbc.com
+ berkeley.edu
+ biblegateway.com
+ biglobe.ne.jp
+ billboard.com
  bing.com
+ bit.ly
  bitly.com
+ blackberry.com
+ blogger.com
  blogspot.com
+ bloomberg.com
+ booking.com
+ box.com
+ britannica.com
+ bt.com
+ bund.de
+ businessinsider.com
+ businesswire.com
+ buydomains.com
+ buzzfeed.com
+ ca.gov
+ cambridge.org
+ cbc.ca
+ cbsnews.com
+ cdc.gov
+ change.org
+ channel4.com
+ chicagotribune.com
+ cisco.com
+ clickbank.net
+ cloudflare.com
+ cnbc.com
+ cnet.com
+ cnn.com
+ cocolog-nifty.com
+ columbia.edu
+ cornell.edu
+ corriere.it
+ cpanel.com
+ cpanel.net
+ creativecommons.org
  csdn.net
+ csmonitor.com
+ dailymail.co.uk
+ dailymotion.com
+ dan.com
+ daum.net
+ dell.com
+ depositfiles.com
+ detik.com
+ digg.com
+ disney.com
+ disqus.com
+ doubleclick.net
+ dreniq.com
+ dribbble.com
+ dropbox.com
+ dropboxusercontent.com
+ dw.com
+ e-recht24.de
+ ea.com
+ ebay.co.uk
  ebay.com
+ economist.com
+ eff.org
+ ehow.com
+ elmundo.es
+ elpais.com
+ engadget.com
+ entrepreneur.com
+ eonline.com
  espn.com
+ espn.go.com
+ etsy.com
+ europa.eu
+ eventbrite.com
+ example.com
+ excite.co.jp
+ express.co.uk
  facebook.com
+ fandom.com
+ fastcompany.com
+ fb.com
+ fb.me
+ fda.gov
+ fedoraproject.org
+ feedburner.com
+ fifa.com
+ files.wordpress.com
+ flickr.com
+ forbes.com
+ fortune.com
+ foursquare.com
+ foxnews.com
+ ft.com
+ ftc.gov
+ gen.xyz
+ geocities.jp
+ gesetze-im-internet.de
+ ggpht.com
+ github.com
+ gizmodo.com
+ globo.com
+ gmail.com
+ gnu.org
+ godaddy.com
+ gofundme.com
+ goo.gl
+ goo.ne.jp
+ goodreads.com
+ google.
+ googleblog.com
+ googleusercontent.com
+ gooyaabitemplates.com
+ gov.uk
+ gravatar.com
+ greenpeace.org
+ gstatic.com
+ guardian.co.uk
+ harvard.edu
+ hatena.ne.jp
+ histats.com
+ hm.com
+ hollywoodreporter.com
+ home.pl
+ house.gov
+ howstuffworks.com
+ hp.com
+ huffingtonpost.com
+ huffpost.com
+ hugedomains.com
+ ibm.com
+ ibtimes.com
+ icann.org
+ ieee.org
+ ietf.org
+ ig.com.br
+ ign.com
+ ikea.com
+ imageshack.us
+ imdb.com
+ imgur.com
+ inc.com
+ independent.co.uk
+ indiatimes.com
+ indiegogo.com
  instagram.com
+ intel.com
+ issuu.com
+ istockphoto.com
+ iubenda.com
  jd.com
- google.com
+ joomla.org
+ jquery.com
+ jstor.org
+ kickstarter.com
+ kinja.com
+ last.fm
+ latimes.com
+ lefigaro.fr
+ lemonde.fr
+ line.me
+ linkedin.com
+ list-manage.com
  live.com
+ livejournal.com
+ livescience.com
+ loc.gov
+ lycos.com
+ mail.ru
+ marketwatch.com
+ marriott.com
+ mashable.com
+ mediafire.com
+ medium.com
+ mega.nz
+ mercurynews.com
+ merriam-webster.com
+ metro.co.uk
  microsoft.com
  microsoftonline.com
+ mirror.co.uk
+ mit.edu
+ mixcloud.com
+ mlb.com
+ mozilla.com
+ mozilla.org
  msn.com
+ myspace.com
+ mysql.com
+ namecheap.com
+ narod.ru
+ nasa.gov
+ nationalgeographic.com
+ nature.com
  naver.com
- nasa.gov
+ naver.jp
+ nbcnews.com
+ ndtv.com
  netflix.com
+ netsons.com
+ netvibes.com
+ networkadvertising.org
+ news.com.au
+ newscientist.com
+ newsweek.com
+ nginx.com
+ nginx.org
+ nhk.or.jp
+ nicovideo.jp
+ nifty.com
+ nih.gov
+ nikkei.com
+ noaa.gov
+ nokia.com
+ npr.org
+ nvidia.com
+ nydailynews.com
+ nypost.com
+ nytimes.com
+ nyu.edu
+ odnoklassniki.ru
  office.com
  ok.ru
  okezone.com
+ opera.com
+ oracle.com
+ orange.fr
+ oreilly.com
+ oup.com
+ over-blog.com
+ ovh.co.uk
+ ovh.com
+ ovh.net
+ ox.ac.uk
+ parallels.com
+ pastebin.com
  paypal.com
+ pbs.org
+ people.com
+ photobucket.com
+ php.net
  pinterest.com
+ pixabay.com
+ playstation.com
+ plesk.com
+ politico.com
+ prezi.com
+ princeton.edu
+ privacyshield.gov
+ prnewswire.com
+ psychologytoday.com
  qq.com
+ quantcast.com
  quora.com
+ rakuten.co.jp
+ rambler.ru
+ rapidshare.com
  reddit.com
- Sina.com.cn
+ repubblica.it
+ reuters.com
+ ria.ru
+ rottentomatoes.com
+ rt.com
+ rtve.es
+ samsung.com
+ sapo.pt
+ sciencedaily.com
+ sciencedirect.com
+ sciencemag.org
+ scientificamerican.com
+ scribd.com
+ seattletimes.com
+ secureserver.net
+ sedo.com
+ seesaa.net
+ sendspace.com
+ sfgate.com
+ shopify.com
+ shutterstock.com
+ siemens.com
+ sina.com.cn
+ sky.com
+ skype.com
+ skyrock.com
+ slideshare.net
  sm.cn
+ smh.com.au
+ so-net.ne.jp
+ softonic.com
  sogou.com
  sohu.com
+ soratemplates.com
  soso.com
+ soundcloud.com
+ spiegel.de
+ spotify.com
+ springer.com
+ sputniknews.com
  stackoverflow.com
+ stanford.edu
+ state.gov
+ steamcommunity.com
+ steampowered.com
+ storage.canalblog.com
+ stores.jp
+ storify.com
+ stuff.co.nz
+ surveymonkey.com
+ symantec.com
+ t-online.de
  t.co
+ t.me
+ tabelog.com
  taobao.com
+ target.com
+ techcrunch.com
+ ted.com
+ telegram.me
+ telegraph.co.uk
+ terra.com.br
+ theglobeandmail.com
+ theguardian.com
+ themeforest.net
+ thestar.com
+ thesun.co.uk
+ thetimes.co.uk
+ theverge.com
+ thoughtco.com
  tianya.cn
+ time.com
+ tinyurl.com
  tmall.com
- #pages.tmall.com
- #login.tmall.com
+ tmz.com
  tribunnews.com
+ tripadvisor.com
+ trustpilot.com
  twitch.tv
  twitter.com
+ ucoz.ru
+ uiuc.edu
+ umich.edu
+ un.org
+ undeveloped.com
+ unesco.org
+ uol.com.br
+ urbandictionary.com
+ usatoday.com
+ usgs.gov
+ usnews.com
+ uspto.gov
+ ustream.tv
+ utexas.edu
+ variety.com
+ venturebeat.com
+ vice.com
+ viglink.com
+ vimeo.com
  vk.com
+ vkontakte.ru
+ vox.com
+ w3.org
  w3schools.com
+ wa.me
  walmart.com
+ washington.edu
+ washingtonpost.com
+ wattpad.com
+ web.fc2.com
+ webmd.com
+ weebly.com
  weibo.com
+ welt.de
+ whatsapp.com
+ whitehouse.gov
+ who.int
+ wikia.com
+ wikihow.com
+ wikimedia.org
  wikipedia.org
+ wikipedia.org
+ wikipedia.org
+ wiktionary.org
+ wiley.com
+ windowsphone.com
+ wired.com
+ wix.com
+ wordpress.org
+ worldbank.org
+ wp.com
+ wsj.com
+ xbox.com
  xinhuanet.com
+ yadi.sk
  yahoo.co.
  yahoo.com
+ yahoo.com
+ yale.edu
  yandex.ru
+ yelp.com
+ youku.com
+ youronlinechoices.com
+ youtu.be
  youtube.com
+ ytimg.com
+ zdnet.com
+ zendesk.com


-
-
-
-
-
-
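
The header comments added in r33551 describe a manual clean-up: sort the list, remove duplicates, and reduce <subdomain>.<site>.ext entries to <site>.ext, using LibreOffice Calc and Gedit. As a rough sketch only, the same normalisation could be scripted in Python along the following lines. The function names and the TWO_LABEL_SUFFIXES set are hypothetical and not part of the changeset. The suffix set is a small, incomplete assumption, and the sketch splits host names on dots instead of using the regex the comment mentions, which the changeset does not show. The committed list still contains some subdomain entries (for example abcnews.go.com) and repeated lines (wikipedia.org, yahoo.com), so this sketch would not reproduce it exactly.

```python
# Assumption: a tiny, incomplete set of two-label suffixes under which sites
# are registered, so that "bbc.co.uk" is not cut down to "co.uk".
TWO_LABEL_SUFFIXES = {
    "co.uk", "ac.uk", "com.au", "net.au", "com.br",
    "co.jp", "ne.jp", "or.jp", "com.cn", "co.nz",
}


def strip_subdomain(host):
    """Reduce <subdomain>.<site>.ext to <site>.ext (hypothetical helper)."""
    labels = host.strip().lower().split(".")
    if len(labels) <= 2:
        # Already <site>.ext, or a prefix entry such as "amazon."
        return ".".join(labels)
    last_two = ".".join(labels[-2:])
    keep = 3 if last_two in TWO_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])


def normalise(lines):
    """Drop comments and blank lines, strip subdomains, de-duplicate, sort."""
    sites = set()
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        sites.add(strip_subdomain(entry))
    return sorted(sites)


if __name__ == "__main__":
    sample = ["www.bbc.co.uk", "bbc.co.uk", "# comment", "", "Sina.com.cn"]
    for site in normalise(sample):
        print(site)
```

Lower-casing every entry is a choice made for this sketch; it happens to match the Sina.com.cn to sina.com.cn change visible in the diff, but the changeset does not say how that change was made.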