# Mapping of top sites, in base URL form, to a value

# This file lists sites that are too large to crawl exhaustively.
# The domains come from the Alexa top sites (of which only the first 50 were visible),
# supplemented with further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
# and with https://moz.com/top500, whose CSV file was downloaded and its URLs added
# to the existing Alexa/Wikipedia listing.
# The combined list was sorted alphabetically and de-duplicated in LibreOffice Calc.
# Then, in Gedit, regex search and replace was used to remove <subdomain>.<site>.ext variants,
# keeping just <site>.ext.
# Finally, the reduced list was re-sorted alphabetically and pasted in here.

# FORMAT OF THIS FILE'S CONTENTS:
#    <topsite-base-url><tabspace><value>
# where <value> can be empty or one of SUBDOMAIN-COPY, SINGLEPAGE, <url-form-without-protocol>
#
#   - if value is empty: if the seedurl contains topsite-base-url, the seedurl will go into the file
#     unprocessed-topsite-matches.txt and the site/page won't be crawled.
#     The user will be notified to inspect the file unprocessed-topsite-matches.txt.
#   - SINGLEPAGE: if the seedurl matches topsite-base-url, then only the page at that seedurl is downloaded.
#     For example, if the seedurl is http://docs.google.com/some-long-suffix-in-base64, then it
#     matches the topsite-base-url of docs.google.com, and its value of SINGLEPAGE will add the
#     seedurl itself as the regex urlfilter, to restrict the crawl to just the specified page.
#   - SUBDOMAIN-COPY: if the seedurl CONTAINS topsite-base-url, then the seedurl's subdomain
#     (or else its domain) will make up the urlfilter, so we don't leak out into a larger domain.
#     Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if the seedurl is
#     pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
#     will ensure we restrict crawling to pages on pinky.blogspot.com.
#     However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
#     into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
#   - <url-form-without-protocol>: if the seedurl contains topsite-base-url, then the provided
#     url-form-without-protocol will make up the urlfilter, again preventing leaking into a
#     larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
#     match the topsite-base-url of wikipedia.org, for which the <url-form-without-protocol>
#     value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
#     <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
#     crawl to just mi.wikipedia.org.
#     Remember to leave out any protocol (such as http://) from <url-form-without-protocol>.
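#
# WORKED EXAMPLES: an illustrative summary of the rules above, not an exhaustive specification.
# The entries are ones listed in this file; the seedurl paths (some-page, some-article) are
# made up for illustration only.
#
#   seedurl                            matching entry     value              expected outcome
#   http://docs.google.com/some-page   docs.google.com    SINGLEPAGE         only that one page is downloaded
#   pinky.blogspot.com                 blogspot.com       SUBDOMAIN-COPY     crawl restricted to pinky.blogspot.com
#   blogspot.com                       blogspot.com       SUBDOMAIN-COPY     not crawled; goes into unprocessed-topsite-matches.txt
#   mi.wikipedia.org/SomePage          wikipedia.org      mi.wikipedia.org   crawl restricted to mi.wikipedia.org
#   nytimes.com/some-article           nytimes.com        (empty)            not crawled; goes into unprocessed-topsite-matches.txt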

docs.google.com	SINGLEPAGE
drive.google.com	SINGLEPAGE
forms.office.com	SINGLEPAGE
player.vimeo.com	SINGLEPAGE
static-promote.weebly.com	SINGLEPAGE

# Special case of yale.edu: its Rapa-Nui pages are on the blacklist, but we want this page and its photos.
# The page's containing folder is whitelisted in case the photos are there.
korora.econ.yale.edu	SINGLEPAGE

000webhost.com
360.cn
4shared.com
a8.net
abc.es
abc.net.au
abcnews.go.com
about.com
about.me
aboutads.info
abril.com.br
academia.edu
accuweather.com
addthis.com
addtoany.com
adobe.com
adweek.com
airbnb.com
akamaihd.net
alexa.com
alibaba.com
aliexpress.com
alipay.com
aljazeera.com
allaboutcookies.org
allrecipes.com
amazon.ca
amazon.co.jp
amazon.co.uk
amazon.com
amazon.de
amazon.es
amazon.fr
amazon.in
ameblo.jp
ampproject.org
android.com
aol.com
ap.org
apache.org
apachefriends.org
apple.com
archive.org
archives.gov
arstechnica.com
arxiv.org
asahi.com
ask.fm
asus.com
axs.com
babytree.com
baidu.com
bandcamp.com
bbc.co.uk
bbc.com
behance.net
berkeley.edu
biblegateway.com
biglobe.ne.jp
billboard.com
bing.com
bit.ly
bitly.com
blackberry.com
blogger.com
blogspot.com	SUBDOMAIN-COPY
bloomberg.com
booking.com
boston.com
box.com
britannica.com
bt.com
bund.de
businessinsider.com
businesswire.com
buydomains.com
buzzfeed.com
ca.gov
cambridge.org
canalblog.com
cbc.ca
cbslocal.com
cbsnews.com
cdc.gov
change.org
channel4.com
chicagotribune.com
chinadaily.com.cn
cisco.com
clickbank.net
cloudflare.com
cmu.edu
cnbc.com
cnet.com
cnn.com
cocolog-nifty.com
columbia.edu
connect.over-blog.com
cornell.edu
corriere.it
cpanel.com
cpanel.net
creativecommons.org
csdn.net
csmonitor.com
dailymail.co.uk
dailymotion.com
dan.com
daum.net
debian.org
dell.com
depositfiles.com
detik.com
digg.com
discovery.com
disney.com
disney.go.com
disqus.com
doubleclick.net
dreniq.com
dribbble.com
dropbox.com	SINGLEPAGE
dropboxusercontent.com
dw.com
e-recht24.de
ea.com
ebay.co.uk
ebay.com
economist.com
eff.org
ehow.com
elmundo.es
elpais.com
engadget.com
entrepreneur.com
eonline.com
espn.com
espn.go.com
etsy.com
europa.eu
eventbrite.com
example.com
excite.co.jp
express.co.uk
facebook.com
fandom.com
fastcompany.com
fb.com
fb.me
fda.gov
fedoraproject.org
feedburner.com
fifa.com
files.wordpress.com
flickr.com
forbes.com
fortune.com
foursquare.com
foxnews.com
ft.com
ftc.gov
gen.xyz
geocities.jp
gesetze-im-internet.de
ggpht.com
github.com
gizmodo.com
globo.com
gmail.com
gnu.org
godaddy.com
gofundme.com
goo.gl
goo.ne.jp
goodreads.com
google.ca
google.co.id
google.co.in
google.co.jp
google.co.uk
google.com
google.com.br
google.com.hk
google.com.tr
google.de
google.es
google.fr
google.it
google.nl
google.pl
google.ru
googleapis.com
googleblog.com
googleusercontent.com
gooyaabitemplates.com
gov.uk
gravatar.com
greenpeace.org
gstatic.com
guardian.co.uk
harvard.edu
hatena.ne.jp
histats.com
hm.com
hollywoodreporter.com
home.pl
house.gov
howstuffworks.com
hp.com
huffingtonpost.com
huffpost.com
hugedomains.com
ibm.com
ibtimes.com
icann.org
ieee.org
ietf.org
ig.com.br
ign.com
ikea.com
imageshack.us
imdb.com
imgur.com
inc.com
independent.co.uk
indiatimes.com
indiegogo.com
instagram.com
instructables.com
intel.com
interia.pl
issuu.com
istockphoto.com
iubenda.com
jd.com
joomla.org
jquery.com
jstor.org
kickstarter.com
kinja.com
last.fm
latimes.com
lefigaro.fr
lemonde.fr
line.me
linkedin.com
list-manage.com
live.com
livejournal.com
livescience.com
loc.gov
lonelyplanet.com
lycos.com
m.wikipedia.org	mi.m.wikipedia.org
mail.ru
marketwatch.com
marriott.com
mashable.com
mediafire.com
medium.com
mega.nz
megaupload.com
mercurynews.com
merriam-webster.com
metro.co.uk
microsoft.com	microsoft.com/mi-nz/
microsoftonline.com
mirror.co.uk
mit.edu
mixcloud.com
mlb.com
mozilla.com
mozilla.org
msn.com
myspace.com
mysql.com
namecheap.com
narod.ru
nasa.gov
nationalgeographic.com
nature.com
naver.com
naver.jp
nba.com
nbcnews.com
ndtv.com
netflix.com
netsons.com
netvibes.com
networkadvertising.org
news.com.au
newscientist.com
newsweek.com
newyorker.com
nginx.com
nginx.org
nhk.or.jp
nicovideo.jp
nifty.com
nih.gov
nikkei.com
noaa.gov
nokia.com
npr.org
nvidia.com
nydailynews.com
nypost.com
nytimes.com
nyu.edu
odnoklassniki.ru
office.com
offset.com
ok.ru
okezone.com
opera.com
oracle.com
orange.fr
oreilly.com
oup.com
over-blog.com
ovh.co.uk
ovh.com
ovh.net
ox.ac.uk
parallels.com
pastebin.com
paypal.com
pbs.org
pcmag.com
people.com
photobucket.com
php.net
pinterest.com	SINGLEPAGE
pixabay.com
playstation.com
plesk.com
plos.org
politico.com
prestashop.com
prezi.com
princeton.edu
privacyshield.gov
prnewswire.com
psychologytoday.com
qq.com
quantcast.com
quora.com
rakuten.co.jp
rambler.ru
rapidshare.com
reddit.com
repubblica.it
researchgate.net
reuters.com
ria.ru
rottentomatoes.com
rt.com
rtve.es
sakura.ne.jp
samsung.com
sapo.pt
scholastic.com
sciencedaily.com
sciencedirect.com
sciencemag.org
scientificamerican.com
scribd.com
seattletimes.com
secureserver.net
sedo.com
seesaa.net
sendspace.com
sfgate.com
shopify.com
shutterstock.com
siemens.com
sina.com.cn
sky.com
skype.com
skyrock.com
slate.com
slideshare.net
sm.cn
smh.com.au
so-net.ne.jp
softonic.com
sogou.com
sohu.com
soratemplates.com
soso.com
soundcloud.com
spiegel.de
spotify.com
springer.com
sputniknews.com
ssl-images-amazon.com
stackoverflow.com
standard.co.uk
stanford.edu
state.gov
steamcommunity.com
steampowered.com
storage.canalblog.com
storage.googleapis.com
stores.jp
storify.com
stuff.co.nz	SINGLEPAGE
surveymonkey.com
symantec.com
t-online.de
t.co
t.me
tabelog.com
taobao.com
target.com
teamviewer.com
techcrunch.com
ted.com
telegram.me
telegraph.co.uk
terra.com.br
theatlantic.com
thefreedictionary.com
theglobeandmail.com
theguardian.com
themeforest.net
thenextweb.com
thestar.com
thesun.co.uk
thetimes.co.uk
theverge.com
thoughtco.com
tianya.cn
time.com
tinyurl.com
tmall.com
tmz.com
tribunnews.com
tripadvisor.com
trustpilot.com
twitch.tv
twitter.com
ucoz.ru
uiuc.edu
umich.edu
un.org
undeveloped.com
unesco.org
uol.com.br
urbandictionary.com
usa.gov
usatoday.com
usgs.gov
usnews.com
uspto.gov
ustream.tv
utexas.edu
variety.com
venturebeat.com
vice.com
viglink.com
vimeo.com
vk.com
vkontakte.ru
vox.com
w3.org
w3schools.com
wa.me
walmart.com
washington.edu
washingtonpost.com
wattpad.com
weather.com
web.fc2.com
webmd.com
weebly.com
weibo.com
welt.de
whatsapp.com
whitehouse.gov
who.int
wikia.com
wikihow.com
wikimedia.org
wikipedia.org	mi.wikipedia.org
wiktionary.org	mi.wiktionary.org
wiley.com
windowsphone.com
wired.com
wix.com
wordpress.org	SUBDOMAIN-COPY
worldbank.org
wp.com
wsj.com
xbox.com
xinhuanet.com
yadi.sk
yahoo.co.jp
yahoo.com
yale.edu
yandex.ru
yelp.com
youku.com
youronlinechoices.com
youtu.be
youtube.com
ytimg.com
zdnet.com
zend.com
zendesk.com
zippyshare.com