source: gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt@ 33561

Last change on this file since 33561 was 33561, checked in by ak19, 5 years ago
  1. sites-too-big-to-exhaustively-crawl.txt is now a comma-separated list. 2. After the discussion with Dr Bainbridge that SINGLEPAGE is not what we want for docs.google.com, I found that the tentative switch to SUBDOMAIN-COPY for docs.google.com will not work, precisely because of the important change we had to make yesterday: with SUBDOMAIN-COPY, only subdomains are copied, not root domains. If a root domain has SUBDOMAIN-COPY, the seedURL gets written out to unprocessed-topsite-matches.txt and its site doesn't get crawled. 3. This revealed a lacuna in the list of possible values in sites-too-big-to-exhaustively-crawl.txt, so I had to invent a new value, which I introduce and have tested with this commit: FOLLOW_LINKS_WITHIN_TOPSITE. This value so far applies only to docs.google.com and will keep following any links originating in a seedURL on docs.google.com, but only as long as they stay within that topsite domain (docs.google.com). 4. Tidied some old-fashioned use of Iterator, replacing it with the newer style of for loops that work with types. Committing before updating the code to use the Apache CSV API.
File size: 10.5 KB
# Mapping of top sites in base url forms to value

# This file contains sites that are too large to crawl exhaustively.
# The domains are from Alexa top sites (where only the first 50 were visible)
# Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
# Finally also added https://moz.com/top500 by downloading its CSV file and
# adding its URLs to the existing listing here from alexa/wiki.
# Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
# just <site>.ext
# And finally, re-sorted the reduced list alphabetically and pasted into here.

# FORMAT OF THIS FILE'S CONTENTS:
# <topsite-base-url>,<value>
# where <value> can be empty or one of SUBDOMAIN-COPY, SINGLEPAGE, FOLLOW-LINKS-WITHIN-TOPSITE,
# <url-form-without-protocol>
#
# - if value is empty: if seedurl contains topsite-base-url, the seedurl will go into the file
# unprocessed-topsite-matches.txt and the site/page won't be crawled.
# The user will be notified to inspect the file unprocessed-topsite-matches.txt.
# - SINGLEPAGE: if seedurl matches topsite-base-url, then only download the page at that seedurl.
# For example, if the seedurl is http://docs.google.com/some-long-suffix-in-base64, then it
# matches the topsite-base-url of docs.google.com and its value of SINGLEPAGE will add the
# seedurl itself as the regex url-filter, to restrict the crawl to just the specified page.
# - SUBDOMAIN-COPY: if seedurl CONTAINS topsite-base-url, then the seedurl's own subdomain
# will make up the urlfilter, so we don't leak out into the larger domain.
# Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if seedurl is
# pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
# will ensure we restrict crawling to pages on pinky.blogspot.com.
# However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
# into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
# - FOLLOW-LINKS-WITHIN-TOPSITE: pages linked from the seedURL page can be followed and
# downloaded, as long as they're within the topsite-base-url.
# This is different from SUBDOMAIN-COPY, as that can download all of a specific subdomain but
# restricts against downloading the entire domain (e.g. all pinky.blogspot.com and not anything
# else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (at
# depth specified for the nutch crawl) as long as they're within the topsite-base-url.
# e.g. seedURLs on docs.google.com containing links will have those linked pages and any
# they link to etc. downloaded as long as they're on docs.google.com.
# - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
# url-form-without-protocol will make up the urlfilter, again preventing leaking into a
# larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
# match the topsite-base-url of wikipedia.org for which the <url-form-without-protocol>
# value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
# <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
# crawl to just mi.wikipedia.org.
# Remember to leave out any protocol from <url-form-without-protocol>.

# column 3: whether nutch should do fetch all or not
# column 4: number of crawl iterations

# docs.google.com is a special case: not all pages are public and any interlinking is likely to
# be intentional. But SUBDOMAIN-COPY does not work: the seedURL's domain is docs.google.com
# itself, an exact match on the topsite-base-url, which the Java code treats as a special case
# when combined with SUBDOMAIN-COPY, so that any seedURL on docs.google.com ends up pushed out
# into the "unprocessed....txt" text file.
#docs.google.com,SUBDOMAIN-COPY
docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE

drive.google.com,SINGLEPAGE
forms.office.com,SINGLEPAGE
player.vimeo.com,SINGLEPAGE
static-promote.weebly.com,SINGLEPAGE

# Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
# The page's containing folder is whitelisted in case the photos are there.
korora.econ.yale.edu,,SINGLEPAGE

000webhost.com
360.cn
4shared.com
a8.net
abc.es
abc.net.au
abcnews.go.com
about.com
about.me
aboutads.info
abril.com.br
academia.edu
accuweather.com
addthis.com
addtoany.com
adobe.com
adweek.com
airbnb.com
akamaihd.net
alexa.com
alibaba.com
aliexpress.com
alipay.com
aljazeera.com
allaboutcookies.org
allrecipes.com
amazon.ca
amazon.co.jp
amazon.co.uk
amazon.com
amazon.de
amazon.es
amazon.fr
amazon.in
ameblo.jp
ampproject.org
android.com
aol.com
ap.org
apache.org
apachefriends.org
apple.com
archive.org
archives.gov
arstechnica.com
arxiv.org
asahi.com
ask.fm
asus.com
axs.com
babytree.com
baidu.com
bandcamp.com
bbc.co.uk
bbc.com
behance.net
berkeley.edu
biblegateway.com
biglobe.ne.jp
billboard.com
bing.com
bit.ly
bitly.com
blackberry.com
blogger.com
blogspot.com,SUBDOMAIN-COPY
bloomberg.com
booking.com
boston.com
box.com
britannica.com
bt.com
bund.de
businessinsider.com
businesswire.com
buydomains.com
buzzfeed.com
ca.gov
cambridge.org
canalblog.com
cbc.ca
cbslocal.com
cbsnews.com
cdc.gov
change.org
channel4.com
chicagotribune.com
chinadaily.com.cn
cisco.com
clickbank.net
cloudflare.com
cmu.edu
cnbc.com
cnet.com
cnn.com
cocolog-nifty.com
columbia.edu
connect.over-blog.com
cornell.edu
corriere.it
cpanel.com
cpanel.net
creativecommons.org
csdn.net
csmonitor.com
dailymail.co.uk
dailymotion.com
dan.com
daum.net
debian.org
dell.com
depositfiles.com
detik.com
digg.com
discovery.com
disney.com
disney.go.com
disqus.com
doubleclick.net
dreniq.com
dribbble.com
dropbox.com,SINGLEPAGE
dropboxusercontent.com
dw.com
e-recht24.de
ea.com
ebay.co.uk
ebay.com
economist.com
eff.org
ehow.com
elmundo.es
elpais.com
engadget.com
entrepreneur.com
eonline.com
espn.com
espn.go.com
etsy.com
europa.eu
eventbrite.com
example.com
excite.co.jp
express.co.uk
facebook.com
fandom.com
fastcompany.com
fb.com
fb.me
fda.gov
fedoraproject.org
feedburner.com
fifa.com
files.wordpress.com
flickr.com
forbes.com
fortune.com
foursquare.com
foxnews.com
ft.com
ftc.gov
gen.xyz
geocities.jp
gesetze-im-internet.de
ggpht.com
github.com
gizmodo.com
globo.com
gmail.com
gnu.org
godaddy.com
gofundme.com
goo.gl
goo.ne.jp
goodreads.com
google.ca
google.co.id
google.co.in
google.co.jp
google.co.uk
google.com
google.com.br
google.com.hk
google.com.tr
google.de
google.es
google.fr
google.it
google.nl
google.pl
google.ru
googleapis.com
googleblog.com
googleusercontent.com
gooyaabitemplates.com
gov.uk
gravatar.com
greenpeace.org
gstatic.com
guardian.co.uk
harvard.edu
hatena.ne.jp
histats.com
hm.com
hollywoodreporter.com
home.pl
house.gov
howstuffworks.com
hp.com
huffingtonpost.com
huffpost.com
hugedomains.com
ibm.com
ibtimes.com
icann.org
ieee.org
ietf.org
ig.com.br
ign.com
ikea.com
imageshack.us
imdb.com
imgur.com
inc.com
independent.co.uk
indiatimes.com
indiegogo.com
instagram.com
instructables.com
intel.com
interia.pl
issuu.com
istockphoto.com
iubenda.com
jd.com
joomla.org
jquery.com
jstor.org
kickstarter.com
kinja.com
last.fm
latimes.com
lefigaro.fr
lemonde.fr
line.me
linkedin.com
list-manage.com
live.com
livejournal.com
livescience.com
loc.gov
lonelyplanet.com
lycos.com
m.wikipedia.org,mi.m.wikipedia.org
mail.ru
marketwatch.com
marriott.com
mashable.com
mediafire.com
medium.com
mega.nz
megaupload.com
mercurynews.com
merriam-webster.com
metro.co.uk
microsoft.com,microsoft.com/mi-nz/
microsoftonline.com
mirror.co.uk
mit.edu
mixcloud.com
mlb.com
mozilla.com
mozilla.org
msn.com
myspace.com
mysql.com
namecheap.com
narod.ru
nasa.gov
nationalgeographic.com
nature.com
naver.com
naver.jp
nba.com
nbcnews.com
ndtv.com
netflix.com
netsons.com
netvibes.com
networkadvertising.org
news.com.au
newscientist.com
newsweek.com
newyorker.com
nginx.com
nginx.org
nhk.or.jp
nicovideo.jp
nifty.com
nih.gov
nikkei.com
noaa.gov
nokia.com
npr.org
nvidia.com
nydailynews.com
nypost.com
nytimes.com
nyu.edu
odnoklassniki.ru
office.com
offset.com
ok.ru
okezone.com
opera.com
oracle.com
orange.fr
oreilly.com
oup.com
over-blog.com
ovh.co.uk
ovh.com
ovh.net
ox.ac.uk
parallels.com
pastebin.com
paypal.com
pbs.org
pcmag.com
people.com
photobucket.com
php.net
pinterest.com,SINGLEPAGE
pixabay.com
playstation.com
plesk.com
plos.org
politico.com
prestashop.com
prezi.com
princeton.edu
privacyshield.gov
prnewswire.com
psychologytoday.com
qq.com
quantcast.com
quora.com
rakuten.co.jp
rambler.ru
rapidshare.com
reddit.com
repubblica.it
researchgate.net
reuters.com
ria.ru
rottentomatoes.com
rt.com
rtve.es
sakura.ne.jp
samsung.com
sapo.pt
scholastic.com
sciencedaily.com
sciencedirect.com
sciencemag.org
scientificamerican.com
scribd.com
seattletimes.com
secureserver.net
sedo.com
seesaa.net
sendspace.com
sfgate.com
shopify.com
shutterstock.com
siemens.com
sina.com.cn
sky.com
skype.com
skyrock.com
slate.com
slideshare.net
sm.cn
smh.com.au
so-net.ne.jp
softonic.com
sogou.com
sohu.com
soratemplates.com
soso.com
soundcloud.com
spiegel.de
spotify.com
springer.com
sputniknews.com
ssl-images-amazon.com
stackoverflow.com
standard.co.uk
stanford.edu
state.gov
steamcommunity.com
steampowered.com
storage.canalblog.com
storage.googleapis.com
stores.jp
storify.com
stuff.co.nz,SINGLEPAGE
surveymonkey.com
symantec.com
t-online.de
t.co
t.me
tabelog.com
taobao.com
target.com
teamviewer.com
techcrunch.com
ted.com
telegram.me
telegraph.co.uk
terra.com.br
theatlantic.com
thefreedictionary.com
theglobeandmail.com
theguardian.com
themeforest.net
thenextweb.com
thestar.com
thesun.co.uk
thetimes.co.uk
theverge.com
thoughtco.com
tianya.cn
time.com
tinyurl.com
tmall.com
tmz.com
tribunnews.com
tripadvisor.com
trustpilot.com
twitch.tv
twitter.com
ucoz.ru
uiuc.edu
umich.edu
un.org
undeveloped.com
unesco.org
uol.com.br
urbandictionary.com
usa.gov
usatoday.com
usgs.gov
usnews.com
uspto.gov
ustream.tv
utexas.edu
variety.com
venturebeat.com
vice.com
viglink.com
vimeo.com
vk.com
vkontakte.ru
vox.com
w3.org
w3schools.com
wa.me
walmart.com
washington.edu
washingtonpost.com
wattpad.com
weather.com
web.fc2.com
webmd.com
weebly.com
weibo.com
welt.de
whatsapp.com
whitehouse.gov
who.int
wikia.com
wikihow.com
wikimedia.org
wikipedia.org,mi.wikipedia.org
wiktionary.org,mi.wiktionary.org
wiley.com
windowsphone.com
wired.com
wix.com
wordpress.org,SUBDOMAIN-COPY
worldbank.org
wp.com
wsj.com
xbox.com
xinhuanet.com
yadi.sk
yahoo.co.jp
yahoo.com
yale.edu
yandex.ru
yelp.com
youku.com
youronlinechoices.com
youtu.be
youtube.com
ytimg.com
zdnet.com
zend.com
zendesk.com
zippyshare.com
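
The value semantics described in the header comments of this file can be sketched in a few lines of Python. This is an illustrative sketch only, not the project's Java implementation. The function names (parse_topsites, decide), the action labels, the host-suffix matching rule and the "longest matching base wins" tie-break are all assumptions made for the example. The returned url_filter is a plain string; how it is turned into a line of the regex urlfilter file is not shown here. Only the first two columns are interpreted.

```python
from urllib.parse import urlsplit

# Hypothetical action labels, invented for this sketch.
UNPROCESSED = "unprocessed"  # seed URL is only logged for manual inspection
CRAWL = "crawl"              # seed URL is crawled, restricted by url_filter
NO_MATCH = "no-match"        # seed URL is not on any listed top site


def parse_topsites(lines):
    """Read '<topsite-base-url>,<value>' entries; skip blank and '#' lines."""
    topsites = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        columns = [c.strip() for c in line.split(",")]
        # Only the first two columns are interpreted in this sketch.
        topsites[columns[0]] = columns[1] if len(columns) > 1 else ""
    return topsites


def decide(seed_url, topsites):
    """Return (action, url_filter) for one seed URL."""
    if "://" not in seed_url:
        seed_url = "http://" + seed_url  # urlsplit needs a scheme to find the host
    parts = urlsplit(seed_url)
    host = parts.hostname or ""
    # Assumption: a seed matches when its host equals the base or is a
    # subdomain of it, and the longest matching base wins.
    matches = [b for b in topsites if host == b or host.endswith("." + b)]
    if not matches:
        return (NO_MATCH, None)
    base = max(matches, key=len)
    value = topsites[base]
    if value == "":
        return (UNPROCESSED, None)
    if value == "SINGLEPAGE":
        return (CRAWL, host + parts.path)  # only the seed page itself
    if value == "SUBDOMAIN-COPY":
        if host == base:                   # exact match on the root domain
            return (UNPROCESSED, None)
        return (CRAWL, host)               # e.g. pinky.blogspot.com
    if value == "FOLLOW-LINKS-WITHIN-TOPSITE":
        return (CRAWL, base)               # stay within the top site
    return (CRAWL, value)                  # <url-form-without-protocol>
```

For example, with the entries blogspot.com,SUBDOMAIN-COPY and wikipedia.org,mi.wikipedia.org loaded, decide("pinky.blogspot.com", topsites) yields a crawl restricted to pinky.blogspot.com, while decide("http://blogspot.com/", topsites) is reported as unprocessed, mirroring the behaviour described for unprocessed-topsite-matches.txt.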