# Mapping of top sites, in base URL form, to a value

# This file contains sites that are too large to crawl exhaustively.
# The domains are from Alexa top sites (where only the first 50 were visible).
# Further top sites were added from https://en.wikipedia.org/wiki/List_of_most_popular_websites
# Finally, https://moz.com/top500 was also added by downloading its CSV file and
# adding its URLs to the existing listing here from alexa/wiki.
# LibreOffice's Calc spreadsheet software was then used to sort alphabetically and remove duplicates.
# Then, in Gedit, regex search and replace was used to remove <subdomain>.<site>.ext variants, keeping
# just <site>.ext
# And finally, the reduced list was re-sorted alphabetically and pasted in here.

# FORMAT OF THIS FILE'S CONTENTS:
#    <topsite-base-url>,<value>
# where <value> is one of:
#    empty, SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE, SINGLEPAGE, <url-form-without-protocol>
#
#   - empty value: if the seedurl contains topsite-base-url, the seedurl will go into the
#     file unprocessed-topsite-matches.txt and the site/page won't be crawled.
#     The user will be notified to inspect the file unprocessed-topsite-matches.txt.
#   - SINGLEPAGE: if the seedurl matches topsite-base-url, then only download the page at that seedurl.
#     For example, if the seedurl is http://drive.google.com/some-long-suffix-in-base64, then it
#     matches the topsite-base-url of drive.google.com and its value of SINGLEPAGE will add the
#     seedurl itself as the regex url-filter, to restrict the crawl to just the specified page.
#   - SUBDOMAIN-COPY: if the seedurl CONTAINS topsite-base-url, then the seedurl's subdomain
#     (or, failing that, its domain) will make up the urlfilter, so we don't leak out into a larger domain.
#     Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if the seedurl is
#     pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
#     will ensure we restrict crawling to pages on pinky.blogspot.com.
#     However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
#     into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
#   - FOLLOW-LINKS-WITHIN-TOPSITE: pages linked from the seedURL page can be followed and
#     downloaded, as long as they're within the same subdomain matching the topsite-base-url.
#     This is different from SUBDOMAIN-COPY, as that can download all of a specific subdomain but
#     restricts against downloading the entire domain (e.g. all of pinky.blogspot.com and not anything
#     else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (at the
#     depth specified for the nutch crawl) as long as they're within the topsite-base-url.
#     E.g. seedURLs on docs.google.com containing links will have those linked pages, and any
#     pages they link to in turn, downloaded as long as they're on docs.google.com.
#   - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
#     url-form-without-protocol will make up the urlfilter, again preventing leaking into a
#     larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
#     match the topsite-base-url of wikipedia.org, for which the <url-form-without-protocol>
#     value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
#     <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
#     crawl to just mi.wikipedia.org.
#     Remember to leave out any protocol from <url-form-without-protocol>.
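#
# WORKED EXAMPLES (an illustrative sketch only: the seedurls are made up, the entries are taken
# from the listing below, and the outcomes restate the rules above rather than the exact regex
# written to the urlfilter file; it is assumed the more specific entry is the one matched):
#    seedurl                        entry matched                                 outcome
#    pinky.blogspot.com/a-post      blogspot.com,SUBDOMAIN-COPY                   crawl restricted to pinky.blogspot.com
#    blogspot.com                   blogspot.com,SUBDOMAIN-COPY                   exact match: unprocessed-topsite-matches.txt, not crawled
#    mi.wikipedia.org/SomePage      wikipedia.org,mi.wikipedia.org                crawl restricted to mi.wikipedia.org
#    drive.google.com/some-file     drive.google.com,SINGLEPAGE                   only the page at the seedurl is downloaded
#    docs.google.com/some-doc       docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE   linked pages on docs.google.com followed to crawl depth
#    www.bbc.com/some-article       bbc.com (empty value)                         unprocessed-topsite-matches.txt, not crawled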
#
# TODO If useful:
#   column 3: whether nutch should do fetch all or not
#   column 4: number of crawl iterations

# docs.google.com is a special case: not all pages are public and any interlinking is likely to
# be intentional. Grab all linked pages, to the link depth set for the nutch crawl, as long as the
# links are within the given topsite-base-url.
docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE

# Just crawl a single page for these:
drive.google.com,SINGLEPAGE
forms.office.com,SINGLEPAGE
player.vimeo.com,SINGLEPAGE
static-promote.weebly.com,SINGLEPAGE

# Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
# The page's containing folder is whitelisted in case the photos are there.
korora.econ.yale.edu,SINGLEPAGE

000webhost.com
360.cn
4shared.com
a8.net
abc.es
abc.net.au
abcnews.go.com
about.com
about.me
aboutads.info
abril.com.br
academia.edu
accuweather.com
addthis.com
addtoany.com
adobe.com
adweek.com
airbnb.com
akamaihd.net
alexa.com
alibaba.com
aliexpress.com
alipay.com
aljazeera.com
allaboutcookies.org
allrecipes.com
amazon.ca
amazon.co.jp
amazon.co.uk
amazon.com
amazon.de
amazon.es
amazon.fr
amazon.in
ameblo.jp
ampproject.org
android.com
aol.com
ap.org
apache.org
apachefriends.org
apple.com
archive.org
archives.gov
arstechnica.com
arxiv.org
asahi.com
ask.fm
asus.com
axs.com
babytree.com
baidu.com
bandcamp.com
bbc.co.uk
bbc.com
behance.net
berkeley.edu
biblegateway.com
biglobe.ne.jp
billboard.com
bing.com
bit.ly
bitly.com
blackberry.com
blogger.com
blogspot.com,SUBDOMAIN-COPY
bloomberg.com
booking.com
boston.com
box.com
britannica.com
bt.com
bund.de
businessinsider.com
businesswire.com
buydomains.com
buzzfeed.com
ca.gov
cambridge.org
canalblog.com
cbc.ca
cbslocal.com
cbsnews.com
cdc.gov
change.org
channel4.com
chicagotribune.com
chinadaily.com.cn
cisco.com
clickbank.net
cloudflare.com
cmu.edu
cnbc.com
cnet.com
cnn.com
cocolog-nifty.com
columbia.edu
connect.over-blog.com
cornell.edu
corriere.it
cpanel.com
cpanel.net
creativecommons.org
csdn.net
csmonitor.com
dailymail.co.uk
dailymotion.com
dan.com
daum.net
debian.org
dell.com
depositfiles.com
detik.com
digg.com
discovery.com
disney.com
disney.go.com
disqus.com
doubleclick.net
dreniq.com
dribbble.com
dropbox.com,SINGLEPAGE
dropboxusercontent.com
dw.com
e-recht24.de
ea.com
ebay.co.uk
ebay.com
economist.com
eff.org
ehow.com
elmundo.es
elpais.com
engadget.com
entrepreneur.com
eonline.com
espn.com
espn.go.com
etsy.com
europa.eu
eventbrite.com
example.com
excite.co.jp
express.co.uk
facebook.com
fandom.com
fastcompany.com
fb.com
fb.me
fda.gov
fedoraproject.org
feedburner.com
fifa.com
files.wordpress.com
flickr.com
forbes.com
fortune.com
foursquare.com
foxnews.com
ft.com
ftc.gov
gen.xyz
geocities.jp
gesetze-im-internet.de
ggpht.com
github.com
gizmodo.com
globo.com
gmail.com
gnu.org
godaddy.com
gofundme.com
goo.gl
goo.ne.jp
goodreads.com
google.ca
google.co.id
google.co.in
google.co.jp
google.co.uk
google.com
google.com.br
google.com.hk
google.com.tr
google.de
google.es
google.fr
google.it
google.nl
google.pl
google.ru
googleapis.com
googleblog.com
googleusercontent.com
gooyaabitemplates.com
gov.uk
gravatar.com
greenpeace.org
gstatic.com
guardian.co.uk
harvard.edu
hatena.ne.jp
histats.com
hm.com
hollywoodreporter.com
home.pl
house.gov
howstuffworks.com
hp.com
huffingtonpost.com
huffpost.com
hugedomains.com
ibm.com
ibtimes.com
icann.org
ieee.org
ietf.org
ig.com.br
ign.com
ikea.com
imageshack.us
imdb.com
imgur.com
inc.com
independent.co.uk
indiatimes.com
indiegogo.com
instagram.com
instructables.com
intel.com
interia.pl
issuu.com
istockphoto.com
iubenda.com
jd.com
joomla.org
jquery.com
jstor.org
kickstarter.com
kinja.com
last.fm
latimes.com
lefigaro.fr
lemonde.fr
line.me
linkedin.com
list-manage.com
live.com
livejournal.com
livescience.com
loc.gov
lonelyplanet.com
lycos.com
m.wikipedia.org,mi.m.wikipedia.org
mail.ru
marketwatch.com
marriott.com
mashable.com
mediafire.com
medium.com
mega.nz
megaupload.com
mercurynews.com
merriam-webster.com
metro.co.uk
microsoft.com,microsoft.com/mi-nz/
microsoftonline.com
mirror.co.uk
mit.edu
mixcloud.com
mlb.com
mozilla.com
mozilla.org
msn.com
myspace.com
mysql.com
namecheap.com
narod.ru
nasa.gov
nationalgeographic.com
nature.com
naver.com
naver.jp
nba.com
nbcnews.com
ndtv.com
netflix.com
netsons.com
netvibes.com
networkadvertising.org
news.com.au
newscientist.com
newsweek.com
newyorker.com
nginx.com
nginx.org
nhk.or.jp
nicovideo.jp
nifty.com
nih.gov
nikkei.com
noaa.gov
nokia.com
npr.org
nvidia.com
nydailynews.com
nypost.com
nytimes.com
nyu.edu
odnoklassniki.ru
office.com
offset.com
ok.ru
okezone.com
opera.com
oracle.com
orange.fr
oreilly.com
oup.com
over-blog.com
ovh.co.uk
ovh.com
ovh.net
ox.ac.uk
parallels.com
pastebin.com
paypal.com
pbs.org
pcmag.com
people.com
photobucket.com
php.net
pinterest.com,SINGLEPAGE
pixabay.com
playstation.com
plesk.com
plos.org
politico.com
prestashop.com
prezi.com
princeton.edu
privacyshield.gov
prnewswire.com
psychologytoday.com
qq.com
quantcast.com
quora.com
rakuten.co.jp
rambler.ru
rapidshare.com
reddit.com
repubblica.it
researchgate.net
reuters.com
ria.ru
rottentomatoes.com
rt.com
rtve.es
sakura.ne.jp
samsung.com
sapo.pt
scholastic.com
sciencedaily.com
sciencedirect.com
sciencemag.org
scientificamerican.com
scribd.com
seattletimes.com
secureserver.net
sedo.com
seesaa.net
sendspace.com
sfgate.com
shopify.com
shutterstock.com
siemens.com
sina.com.cn
sky.com
skype.com
skyrock.com
slate.com
slideshare.net
sm.cn
smh.com.au
so-net.ne.jp
softonic.com
sogou.com
sohu.com
soratemplates.com
soso.com
soundcloud.com
spiegel.de
spotify.com
springer.com
sputniknews.com
ssl-images-amazon.com
stackoverflow.com
standard.co.uk
stanford.edu
state.gov
steamcommunity.com
steampowered.com
storage.canalblog.com
storage.googleapis.com
stores.jp
storify.com
stuff.co.nz,SINGLEPAGE
surveymonkey.com
symantec.com
t-online.de
t.co
t.me
tabelog.com
taobao.com
target.com
teamviewer.com
techcrunch.com
ted.com
telegram.me
telegraph.co.uk
terra.com.br
theatlantic.com
thefreedictionary.com
theglobeandmail.com
theguardian.com
themeforest.net
thenextweb.com
thestar.com
thesun.co.uk
thetimes.co.uk
theverge.com
thoughtco.com
tianya.cn
time.com
tinyurl.com
tmall.com
tmz.com
tribunnews.com
tripadvisor.com
trustpilot.com
twitch.tv
twitter.com
ucoz.ru
uiuc.edu
umich.edu
un.org
undeveloped.com
unesco.org
uol.com.br
urbandictionary.com
usa.gov
usatoday.com
usgs.gov
usnews.com
uspto.gov
ustream.tv
utexas.edu
variety.com
venturebeat.com
vice.com
viglink.com
vimeo.com
vk.com
vkontakte.ru
vox.com
w3.org
w3schools.com
wa.me
walmart.com
washington.edu
washingtonpost.com
wattpad.com
weather.com
web.fc2.com
webmd.com
weebly.com
weibo.com
welt.de
whatsapp.com
whitehouse.gov
who.int
wikia.com
wikihow.com
wikimedia.org
wikipedia.org,mi.wikipedia.org
wiktionary.org,mi.wiktionary.org
wiley.com
windowsphone.com
wired.com
wix.com
wordpress.org,SUBDOMAIN-COPY
worldbank.org
wp.com
wsj.com
xbox.com
xinhuanet.com
yadi.sk
yahoo.co.jp
yahoo.com
yale.edu
yandex.ru
yelp.com
youku.com
youronlinechoices.com
youtu.be
youtube.com
ytimg.com
zdnet.com
zend.com
zendesk.com
zippyshare.com