# Mapping of top sites, in base URL form, to a handling value

# This file contains sites that are too large to crawl exhaustively.
# The domains are from Alexa top sites (where only the first 50 were visible).
# Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
# Finally also added https://moz.com/top500 by downloading its CSV file and
# adding its URLs to the existing listing here from Alexa/Wikipedia.
# Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
# just <site>.ext
# And finally, re-sorted the reduced list alphabetically and pasted into here.

# FORMAT OF THIS FILE'S CONTENTS:
#     <topsite-base-url>,<value>
# where <value> is one of
#     empty, SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE, SINGLEPAGE, <url-form-without-protocol>
#
# - if value is left empty: if seedurl contains topsite-base-url, the seedurl will go into the
#   file unprocessed-topsite-matches.txt and the site/page won't be crawled.
#   The user will be notified to inspect the file unprocessed-topsite-matches.txt.
# - SINGLEPAGE: if seedurl matches topsite-base-url, then only download the page at that seedurl.
#   For example, if the seedurl is http://docs.google.com/some-long-suffix-in-base64, then it
#   matches the topsite-base-url of docs.google.com and its value of SINGLEPAGE will add the
#   seedurl itself as the regex url-filter, to restrict the crawl to just the specified page.
# - SUBDOMAIN-COPY: if seedurl CONTAINS topsite-base-url, then the seedurl's own subdomain
#   (or, failing that, its domain) makes up the urlfilter, so we don't leak out into a larger domain.
#   Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if seedurl is
#   pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
#   will ensure we restrict crawling to pages on pinky.blogspot.com.
#   However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
#   into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
# - FOLLOW-LINKS-WITHIN-TOPSITE: pages linked from the seedURL page can be followed and
#   downloaded, as long as they're within the same subdomain matching the topsite-base-url.
#   This is different from SUBDOMAIN-COPY, as that can download all of a specific subdomain but
#   restricts against downloading the entire domain (e.g. all pinky.blogspot.com and not anything
#   else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (at
#   depth specified for the nutch crawl) as long as they're within the topsite-base-url.
#   e.g. seedURLs on docs.google.com containing links will have those linked pages and any
#   they link to etc. downloaded as long as they're on docs.google.com.
# - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
#   url-form-without-protocol will make up the urlfilter, again preventing leaking into a
#   larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
#   match the topsite-base-url of wikipedia.org for which the <url-form-without-protocol>
#   value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
#   <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
#   crawl to just mi.wikipedia.org.
#   Remember to leave out any protocol from <url-form-without-protocol>.
#
# TODO If useful:
#   column 3: whether nutch should do fetch all or not
#   column 4: number of crawl iterations

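# A minimal Python sketch of how the value rules described in this file's header could be
# applied to a seed URL. It is not the project's actual implementation: the function names
# parse_topsites and resolve, the action labels UNPROCESSED, URL-FORM and NOT-A-TOPSITE, the
# host-suffix reading of "contains", and the longest-match tie-break (so docs.google.com wins
# over google.com) are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def parse_topsites(text):
    """Return {topsite-base-url: value}; value is '' when left empty."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        base, _, value = line.partition(",")
        mapping[base.strip()] = value.strip()
    return mapping


def resolve(seedurl, mapping):
    """Return (action, urlfilter_target) for one seed URL.

    urlfilter_target is a protocol-less string, or None when nothing is crawled
    or when the seed URL is not covered by this file at all.
    """
    # Accept seed URLs with or without a protocol.
    parts = urlsplit(seedurl if "://" in seedurl else "//" + seedurl)
    host = (parts.hostname or "").lower()

    # Assumption: "contains" means the host equals the base or is a subdomain of it.
    matches = [b for b in mapping if host == b or host.endswith("." + b)]
    if not matches:
        return ("NOT-A-TOPSITE", None)
    base = max(matches, key=len)  # assumption: most specific entry wins
    value = mapping[base]

    if value == "":
        # Goes to unprocessed-topsite-matches.txt; not crawled.
        return ("UNPROCESSED", None)
    if value == "SINGLEPAGE":
        # The seed URL itself becomes the filter.
        return (value, seedurl)
    if value == "SUBDOMAIN-COPY":
        if host == base:
            # Exact match on the topsite: not crawled.
            return ("UNPROCESSED", None)
        return (value, host)
    if value == "FOLLOW-LINKS-WITHIN-TOPSITE":
        return (value, base)
    # Anything else is treated as a <url-form-without-protocol>.
    return ("URL-FORM", value)
```

# Under these assumptions, resolve("http://pinky.blogspot.com/post", mapping) gives the target
# pinky.blogspot.com, while a seed on blogspot.com itself is marked UNPROCESSED. How a target
# is written into the regex urlfilter file is deliberately left out of the sketch.
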
# NOT TOP SITES, BUT SITES WE INSPECTED AND WANT TO CONTROL SIMILARLY TO TOP SITES
00.gs,SINGLEPAGE

# May be a large site
topographic-map.com,SINGLEPAGE

# TOP SITES

# docs.google.com is a special case: not all pages are public and any interlinking is likely to
# be intentional. Grab all linked pages, for link depth set with nutch's crawl, as long as the
# links are within the given topsite-base-url
docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE

# Just crawl a single page for these:
drive.google.com,SINGLEPAGE
forms.office.com,SINGLEPAGE
player.vimeo.com,SINGLEPAGE
static-promote.weebly.com,SINGLEPAGE

# Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
# The page's containing folder is whitelisted in case the photos are there.
korora.econ.yale.edu,SINGLEPAGE

000webhost.com
360.cn
4shared.com
a8.net
abc.es
abc.net.au
abcnews.go.com
about.com
about.me
aboutads.info
abril.com.br
academia.edu
accuweather.com
addthis.com
addtoany.com
adobe.com
adweek.com
airbnb.com
akamaihd.net
alexa.com
alibaba.com
aliexpress.com
alipay.com
aljazeera.com
allaboutcookies.org
allrecipes.com
amazon.ca
amazon.co.jp
amazon.co.uk
amazon.com
amazon.de
amazon.es
amazon.fr
amazon.in
ameblo.jp
ampproject.org
android.com
aol.com
ap.org
apache.org
apachefriends.org
apple.com
archive.org
archives.gov
arstechnica.com
arxiv.org
asahi.com
ask.fm
asus.com
axs.com
babytree.com
baidu.com
bandcamp.com
bbc.co.uk
bbc.com
behance.net
berkeley.edu
biblegateway.com
biglobe.ne.jp
billboard.com
bing.com
bit.ly
bitly.com
blackberry.com
blogger.com
blogspot.com,SUBDOMAIN-COPY
bloomberg.com
booking.com
boston.com
box.com
britannica.com
bt.com
bund.de
businessinsider.com
businesswire.com
buydomains.com
buzzfeed.com
ca.gov
cambridge.org
canalblog.com
cbc.ca
cbslocal.com
cbsnews.com
cdc.gov
change.org
channel4.com
chicagotribune.com
chinadaily.com.cn
cisco.com
clickbank.net
cloudflare.com
cmu.edu
cnbc.com
cnet.com
cnn.com
cocolog-nifty.com
columbia.edu
connect.over-blog.com
cornell.edu
corriere.it
cpanel.com
cpanel.net
creativecommons.org
csdn.net
csmonitor.com
dailymail.co.uk
dailymotion.com
dan.com
daum.net
debian.org
dell.com
depositfiles.com
detik.com
digg.com
discovery.com
disney.com
disney.go.com
disqus.com
doubleclick.net
dreniq.com
dribbble.com
dropbox.com,SINGLEPAGE
dropboxusercontent.com
dw.com
e-recht24.de
ea.com
ebay.co.uk
ebay.com
economist.com
eff.org
ehow.com
elmundo.es
elpais.com
engadget.com
entrepreneur.com
eonline.com
espn.com
espn.go.com
etsy.com
europa.eu
eventbrite.com
example.com
excite.co.jp
express.co.uk
facebook.com
fandom.com
fastcompany.com
fb.com
fb.me
fda.gov
fedoraproject.org
feedburner.com
fifa.com
files.wordpress.com
flickr.com
forbes.com
fortune.com
foursquare.com
foxnews.com
ft.com
ftc.gov
gen.xyz
geocities.jp
gesetze-im-internet.de
ggpht.com
github.com
gizmodo.com
globo.com
gmail.com
gnu.org
godaddy.com
gofundme.com
goo.gl
goo.ne.jp
goodreads.com
google.ca
google.co.id
google.co.in
google.co.jp
google.co.uk
google.com
google.com.br
google.com.hk
google.com.tr
google.de
google.es
google.fr
google.it
google.nl
google.pl
google.ru
googleapis.com
googleblog.com
googleusercontent.com
gooyaabitemplates.com
gov.uk
gravatar.com
greenpeace.org
gstatic.com
guardian.co.uk
harvard.edu
hatena.ne.jp
histats.com
hm.com
hollywoodreporter.com
home.pl
house.gov
howstuffworks.com
hp.com
huffingtonpost.com
huffpost.com
hugedomains.com
ibm.com
ibtimes.com
icann.org
ieee.org
ietf.org
ig.com.br
ign.com
ikea.com
imageshack.us
imdb.com
imgur.com
inc.com
independent.co.uk
indiatimes.com
indiegogo.com
instagram.com
instructables.com
intel.com
interia.pl
issuu.com
istockphoto.com
iubenda.com
jd.com
joomla.org
jquery.com
jstor.org
kickstarter.com
kinja.com
last.fm
latimes.com
lefigaro.fr
lemonde.fr
line.me
linkedin.com
list-manage.com
live.com
livejournal.com
livescience.com
loc.gov
lonelyplanet.com
lycos.com
m.wikipedia.org,mi.m.wikipedia.org
mail.ru
marketwatch.com
marriott.com
mashable.com
mediafire.com
medium.com
mega.nz
megaupload.com
mercurynews.com
merriam-webster.com
metro.co.uk
microsoft.com,microsoft.com/mi-nz/
microsoftonline.com
mirror.co.uk
mit.edu
mixcloud.com
mlb.com
mozilla.com
mozilla.org
msn.com
myspace.com
mysql.com
namecheap.com
narod.ru
nasa.gov
nationalgeographic.com
nature.com
naver.com
naver.jp
nba.com
nbcnews.com
ndtv.com
netflix.com
netsons.com
netvibes.com
networkadvertising.org
news.com.au
newscientist.com
newsweek.com
newyorker.com
nginx.com
nginx.org
nhk.or.jp
nicovideo.jp
nifty.com
nih.gov
nikkei.com
noaa.gov
nokia.com
npr.org
nvidia.com
nydailynews.com
nypost.com
nytimes.com
nyu.edu
odnoklassniki.ru
office.com
offset.com
ok.ru
okezone.com
opera.com
oracle.com
orange.fr
oreilly.com
oup.com
over-blog.com
ovh.co.uk
ovh.com
ovh.net
ox.ac.uk
parallels.com
pastebin.com
paypal.com
pbs.org
pcmag.com
people.com
photobucket.com
php.net
pinterest.com,SINGLEPAGE
pixabay.com
playstation.com
plesk.com
plos.org
politico.com
prestashop.com
prezi.com
princeton.edu
privacyshield.gov
prnewswire.com
psychologytoday.com
qq.com
quantcast.com
quora.com
rakuten.co.jp
rambler.ru
rapidshare.com
reddit.com
repubblica.it
researchgate.net
reuters.com
ria.ru
rottentomatoes.com
rt.com
rtve.es
sakura.ne.jp
samsung.com
sapo.pt
scholastic.com
sciencedaily.com
sciencedirect.com
sciencemag.org
scientificamerican.com
scribd.com
seattletimes.com
secureserver.net
sedo.com
seesaa.net
sendspace.com
sfgate.com
shopify.com
shutterstock.com
siemens.com
sina.com.cn
sky.com
skype.com
skyrock.com
slate.com
slideshare.net
sm.cn
smh.com.au
so-net.ne.jp
softonic.com
sogou.com
sohu.com
soratemplates.com
soso.com
soundcloud.com
spiegel.de
spotify.com
springer.com
sputniknews.com
ssl-images-amazon.com
stackoverflow.com
standard.co.uk
stanford.edu
state.gov
steamcommunity.com
steampowered.com
storage.canalblog.com
storage.googleapis.com
stores.jp
storify.com
stuff.co.nz,SINGLEPAGE
surveymonkey.com
symantec.com
t-online.de
t.co
t.me
tabelog.com
taobao.com
target.com
teamviewer.com
techcrunch.com
ted.com
telegram.me
telegraph.co.uk
terra.com.br
theatlantic.com
thefreedictionary.com
theglobeandmail.com
theguardian.com
themeforest.net
thenextweb.com
thestar.com
thesun.co.uk
thetimes.co.uk
theverge.com
thoughtco.com
tianya.cn
time.com
tinyurl.com
tmall.com
tmz.com
tribunnews.com
tripadvisor.com
trustpilot.com
twitch.tv
twitter.com
ucoz.ru
uiuc.edu
umich.edu
un.org
undeveloped.com
unesco.org
uol.com.br
urbandictionary.com
usa.gov
usatoday.com
usgs.gov
usnews.com
uspto.gov
ustream.tv
utexas.edu
variety.com
venturebeat.com
vice.com
viglink.com
vimeo.com
vk.com
vkontakte.ru
vox.com
w3.org
w3schools.com
wa.me
walmart.com
washington.edu
washingtonpost.com
wattpad.com
weather.com
web.fc2.com
webmd.com
weebly.com
weibo.com
welt.de
whatsapp.com
whitehouse.gov
who.int
wikia.com
wikihow.com
wikimedia.org
wikipedia.org,mi.wikipedia.org
wiktionary.org,mi.wiktionary.org
wiley.com
windowsphone.com
wired.com
wix.com
wordpress.org,SUBDOMAIN-COPY
worldbank.org
wp.com
wsj.com
xbox.com
xinhuanet.com
yadi.sk
yahoo.co.jp
yahoo.com
yale.edu
yandex.ru
yelp.com
youku.com
youronlinechoices.com
youtu.be
youtube.com
ytimg.com
zdnet.com
zend.com
zendesk.com
zippyshare.com