source: gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt@ 33568

Last change on this file since 33568 was 33568, checked in by ak19, 5 years ago
  1. More sites greylisted and blacklisted, discovered as I attempted to crawl them and afterwards learnt to investigate sites first. Should all .ru and .pl domains be on the greylist? 2. Adjusted instruction comments in CCWETProcessor for compiling and running
# Mapping of top sites, in base URL form, to a value

# This file contains sites that are too large to crawl exhaustively.
# The domains are from Alexa top sites (where only the first 50 were visible).
# Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
# Finally, also added https://moz.com/top500 by downloading its CSV file and
# adding its URLs to the existing listing here from Alexa/Wikipedia.
# Then used LibreOffice Calc to sort alphabetically and remove duplicates.
# Then, in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
# just <site>.ext
# And finally, re-sorted the reduced list alphabetically and pasted it in here.

# FORMAT OF THIS FILE'S CONTENTS:
# <topsite-base-url>,<value>
# where <value> is one of
# empty, SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE, SINGLEPAGE, <url-form-without-protocol>
#
# - if value is left empty: if seedurl contains topsite-base-url, the seedurl will go into the
#   file unprocessed-topsite-matches.txt and the site/page won't be crawled.
#   The user will be notified to inspect the file unprocessed-topsite-matches.txt.
# - SINGLEPAGE: if seedurl matches topsite-base-url, then only download the page at that seedurl.
#   For example, if the seedurl is http://drive.google.com/some-long-suffix-in-base64, then it
#   matches the topsite-base-url of drive.google.com and its value of SINGLEPAGE will add the
#   seedurl itself as the regex url-filter, to restrict the crawl to just the specified page.
# - SUBDOMAIN-COPY: if seedurl CONTAINS topsite-base-url, then the seedurl's subdomain (or else
#   its domain) will make up the urlfilter, so we don't leak out into a larger domain.
#   Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if seedurl is
#   pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
#   will ensure we restrict crawling to pages on pinky.blogspot.com.
#   However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
#   into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
# - FOLLOW-LINKS-WITHIN-TOPSITE: pages linked from the seedURL page can be followed and
#   downloaded, as long as they're within the same subdomain matching the topsite-base-url.
#   This is different from SUBDOMAIN-COPY, as that can download all of a specific subdomain but
#   restricts against downloading the entire domain (e.g. all pinky.blogspot.com and not anything
#   else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (at the
#   depth specified for the nutch crawl) as long as they're within the topsite-base-url.
#   e.g. seedURLs on docs.google.com containing links will have those linked pages and any
#   they link to etc. downloaded as long as they're on docs.google.com.
# - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
#   url-form-without-protocol will make up the urlfilter, again preventing leaking into a
#   larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
#   match the topsite-base-url of wikipedia.org for which the <url-form-without-protocol>
#   value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
#   <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
#   crawl to just mi.wikipedia.org.
#   Remember to leave out any protocol from <url-form-without-protocol>.
#
# TODO If useful:
# column 3: whether nutch should fetch all or not
# column 4: number of crawl iterations
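#
# A minimal sketch, in Python, of how the <value> rules described above could be applied to a
# seed URL. This is an illustration only, not the CCWETProcessor implementation: the function
# names, the action labels UNPROCESSED and URL-FORM, host-suffix matching in place of a plain
# "contains" test, and the longest-match tie-break are all assumptions made for this example.

```python
def parse_topsites(text):
    """Parse '<topsite-base-url>,<value>' lines; skip blank lines and '#' comments."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first comma only; a missing value becomes the empty string.
        base, _, value = line.partition(",")
        mapping[base.strip().lower()] = value.strip()
    return mapping


def resolve(seedurl, mapping):
    """Return (action, urlfilter) for a seed URL, or None if no topsite entry matches."""
    url = seedurl.split("://", 1)[-1]      # drop any protocol
    host = url.split("/", 1)[0].lower()
    # Assumption: match the host itself or any parent domain; the most specific entry wins.
    matches = [b for b in mapping if host == b or host.endswith("." + b)]
    if not matches:
        return None
    base = max(matches, key=len)
    value = mapping[base]
    if value == "":
        # Empty value: the seed URL would go to unprocessed-topsite-matches.txt.
        return ("UNPROCESSED", None)
    if value == "SINGLEPAGE":
        # The seed URL itself becomes the filter.
        return ("SINGLEPAGE", url)
    if value == "SUBDOMAIN-COPY":
        if host == base:
            # Exact match on the topsite: not crawled, left for manual inspection.
            return ("UNPROCESSED", None)
        return ("SUBDOMAIN-COPY", host)
    if value == "FOLLOW-LINKS-WITHIN-TOPSITE":
        return (value, base)
    # Anything else is treated as a <url-form-without-protocol>.
    return ("URL-FORM", value)
```

# With this sketch, resolve("http://pinky.blogspot.com/post", mapping) yields
# ("SUBDOMAIN-COPY", "pinky.blogspot.com") when the mapping contains blogspot.com,SUBDOMAIN-COPY.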


# NOT TOP SITES, BUT SITES WE INSPECTED AND WANT TO CONTROL SIMILARLY TO TOP SITES
00.gs,SINGLEPAGE

# May be a large site
topographic-map.com,SINGLEPAGE

# TOP SITES

# docs.google.com is a special case: not all pages are public and any interlinking is likely to
# be intentional. Grab all linked pages, for the link depth set with nutch's crawl, as long as the
# links are within the given topsite-base-url
docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE

# Just crawl a single page for these:
drive.google.com,SINGLEPAGE
forms.office.com,SINGLEPAGE
player.vimeo.com,SINGLEPAGE
static-promote.weebly.com,SINGLEPAGE

# Special case of yale.edu: its Rapa-Nui pages are on the blacklist, but we want this page + its photos.
# The page's containing folder is whitelisted in case the photos are there.
korora.econ.yale.edu,SINGLEPAGE

000webhost.com
360.cn
4shared.com
a8.net
abc.es
abc.net.au
abcnews.go.com
about.com
about.me
aboutads.info
abril.com.br
academia.edu
accuweather.com
addthis.com
addtoany.com
adobe.com
adweek.com
airbnb.com
akamaihd.net
alexa.com
alibaba.com
aliexpress.com
alipay.com
aljazeera.com
allaboutcookies.org
allrecipes.com
amazon.ca
amazon.co.jp
amazon.co.uk
amazon.com
amazon.de
amazon.es
amazon.fr
amazon.in
ameblo.jp
ampproject.org
android.com
aol.com
ap.org
apache.org
apachefriends.org
apple.com
archive.org
archives.gov
arstechnica.com
arxiv.org
asahi.com
ask.fm
asus.com
axs.com
babytree.com
baidu.com
bandcamp.com
bbc.co.uk
bbc.com
behance.net
berkeley.edu
biblegateway.com
biglobe.ne.jp
billboard.com
bing.com
bit.ly
bitly.com
blackberry.com
blogger.com
blogspot.com,SUBDOMAIN-COPY
bloomberg.com
booking.com
boston.com
box.com
britannica.com
bt.com
bund.de
businessinsider.com
businesswire.com
buydomains.com
buzzfeed.com
ca.gov
cambridge.org
canalblog.com
cbc.ca
cbslocal.com
cbsnews.com
cdc.gov
change.org
channel4.com
chicagotribune.com
chinadaily.com.cn
cisco.com
clickbank.net
cloudflare.com
cmu.edu
cnbc.com
cnet.com
cnn.com
cocolog-nifty.com
columbia.edu
connect.over-blog.com
cornell.edu
corriere.it
cpanel.com
cpanel.net
creativecommons.org
csdn.net
csmonitor.com
dailymail.co.uk
dailymotion.com
dan.com
daum.net
debian.org
dell.com
depositfiles.com
detik.com
digg.com
discovery.com
disney.com
disney.go.com
disqus.com
doubleclick.net
dreniq.com
dribbble.com
dropbox.com,SINGLEPAGE
dropboxusercontent.com
dw.com
e-recht24.de
ea.com
ebay.co.uk
ebay.com
economist.com
eff.org
ehow.com
elmundo.es
elpais.com
engadget.com
entrepreneur.com
eonline.com
espn.com
espn.go.com
etsy.com
europa.eu
eventbrite.com
example.com
excite.co.jp
express.co.uk
facebook.com
fandom.com
fastcompany.com
fb.com
fb.me
fda.gov
fedoraproject.org
feedburner.com
fifa.com
files.wordpress.com
flickr.com
forbes.com
fortune.com
foursquare.com
foxnews.com
ft.com
ftc.gov
gen.xyz
geocities.jp
gesetze-im-internet.de
ggpht.com
github.com
gizmodo.com
globo.com
gmail.com
gnu.org
godaddy.com
gofundme.com
goo.gl
goo.ne.jp
goodreads.com
google.ca
google.co.id
google.co.in
google.co.jp
google.co.uk
google.com
google.com.br
google.com.hk
google.com.tr
google.de
google.es
google.fr
google.it
google.nl
google.pl
google.ru
googleapis.com
googleblog.com
googleusercontent.com
gooyaabitemplates.com
gov.uk
gravatar.com
greenpeace.org
gstatic.com
guardian.co.uk
harvard.edu
hatena.ne.jp
histats.com
hm.com
hollywoodreporter.com
home.pl
house.gov
howstuffworks.com
hp.com
huffingtonpost.com
huffpost.com
hugedomains.com
ibm.com
ibtimes.com
icann.org
ieee.org
ietf.org
ig.com.br
ign.com
ikea.com
imageshack.us
imdb.com
imgur.com
inc.com
independent.co.uk
indiatimes.com
indiegogo.com
instagram.com
instructables.com
intel.com
interia.pl
issuu.com
istockphoto.com
iubenda.com
jd.com
joomla.org
jquery.com
jstor.org
kickstarter.com
kinja.com
last.fm
latimes.com
lefigaro.fr
lemonde.fr
line.me
linkedin.com
list-manage.com
live.com
livejournal.com
livescience.com
loc.gov
lonelyplanet.com
lycos.com
m.wikipedia.org,mi.m.wikipedia.org
mail.ru
marketwatch.com
marriott.com
mashable.com
mediafire.com
medium.com
mega.nz
megaupload.com
mercurynews.com
merriam-webster.com
metro.co.uk
microsoft.com,microsoft.com/mi-nz/
microsoftonline.com
mirror.co.uk
mit.edu
mixcloud.com
mlb.com
mozilla.com
mozilla.org
msn.com
myspace.com
mysql.com
namecheap.com
narod.ru
nasa.gov
nationalgeographic.com
nature.com
naver.com
naver.jp
nba.com
nbcnews.com
ndtv.com
netflix.com
netsons.com
netvibes.com
networkadvertising.org
news.com.au
newscientist.com
newsweek.com
newyorker.com
nginx.com
nginx.org
nhk.or.jp
nicovideo.jp
nifty.com
nih.gov
nikkei.com
noaa.gov
nokia.com
npr.org
nvidia.com
nydailynews.com
nypost.com
nytimes.com
nyu.edu
odnoklassniki.ru
office.com
offset.com
ok.ru
okezone.com
opera.com
oracle.com
orange.fr
oreilly.com
oup.com
over-blog.com
ovh.co.uk
ovh.com
ovh.net
ox.ac.uk
parallels.com
pastebin.com
paypal.com
pbs.org
pcmag.com
people.com
photobucket.com
php.net
pinterest.com,SINGLEPAGE
pixabay.com
playstation.com
plesk.com
plos.org
politico.com
prestashop.com
prezi.com
princeton.edu
privacyshield.gov
prnewswire.com
psychologytoday.com
qq.com
quantcast.com
quora.com
rakuten.co.jp
rambler.ru
rapidshare.com
reddit.com
repubblica.it
researchgate.net
reuters.com
ria.ru
rottentomatoes.com
rt.com
rtve.es
sakura.ne.jp
samsung.com
sapo.pt
scholastic.com
sciencedaily.com
sciencedirect.com
sciencemag.org
scientificamerican.com
scribd.com
seattletimes.com
secureserver.net
sedo.com
seesaa.net
sendspace.com
sfgate.com
shopify.com
shutterstock.com
siemens.com
sina.com.cn
sky.com
skype.com
skyrock.com
slate.com
slideshare.net
sm.cn
smh.com.au
so-net.ne.jp
softonic.com
sogou.com
sohu.com
soratemplates.com
soso.com
soundcloud.com
spiegel.de
spotify.com
springer.com
sputniknews.com
ssl-images-amazon.com
stackoverflow.com
standard.co.uk
stanford.edu
state.gov
steamcommunity.com
steampowered.com
storage.canalblog.com
storage.googleapis.com
stores.jp
storify.com
stuff.co.nz,SINGLEPAGE
surveymonkey.com
symantec.com
t-online.de
t.co
t.me
tabelog.com
taobao.com
target.com
teamviewer.com
techcrunch.com
ted.com
telegram.me
telegraph.co.uk
terra.com.br
theatlantic.com
thefreedictionary.com
theglobeandmail.com
theguardian.com
themeforest.net
thenextweb.com
thestar.com
thesun.co.uk
thetimes.co.uk
theverge.com
thoughtco.com
tianya.cn
time.com
tinyurl.com
tmall.com
tmz.com
tribunnews.com
tripadvisor.com
trustpilot.com
twitch.tv
twitter.com
ucoz.ru
uiuc.edu
umich.edu
un.org
undeveloped.com
unesco.org
uol.com.br
urbandictionary.com
usa.gov
usatoday.com
usgs.gov
usnews.com
uspto.gov
ustream.tv
utexas.edu
variety.com
venturebeat.com
vice.com
viglink.com
vimeo.com
vk.com
vkontakte.ru
vox.com
w3.org
w3schools.com
wa.me
walmart.com
washington.edu
washingtonpost.com
wattpad.com
weather.com
web.fc2.com
webmd.com
weebly.com
weibo.com
welt.de
whatsapp.com
whitehouse.gov
who.int
wikia.com
wikihow.com
wikimedia.org
wikipedia.org,mi.wikipedia.org
wiktionary.org,mi.wiktionary.org
wiley.com
windowsphone.com
wired.com
wix.com
wordpress.org,SUBDOMAIN-COPY
worldbank.org
wp.com
wsj.com
xbox.com
xinhuanet.com
yadi.sk
yahoo.co.jp
yahoo.com
yale.edu
yandex.ru
yelp.com
youku.com
youronlinechoices.com
youtu.be
youtube.com
ytimg.com
zdnet.com
zend.com
zendesk.com
zippyshare.com