sites-too-big-to-exhaustively-crawl.txt@ 33561

Last change on this file since 33561 was 33561, checked in by ak19, 5 years ago

1. sites-too-big-to-exhaustively-crawl.txt is now a comma-separated list.
2. After the discussion with Dr Bainbridge that SINGLEPAGE is not what we want for docs.google.com, I found that the tentative switch to SUBDOMAIN-COPY for docs.google.com will not work, precisely because of the important change we had to make yesterday: with SUBDOMAIN-COPY, only subdomains are copied, not root domains. If a seedURL is on a root domain marked SUBDOMAIN-COPY, the seedURL gets written out to unprocessed-topsite-matches.txt and its site doesn't get crawled.
3. This revealed a gap in the list of possible values for sites-too-big-to-exhaustively-crawl.txt, so I had to invent a new value, which I introduce and have tested with this commit: FOLLOW-LINKS-WITHIN-TOPSITE. This value so far applies only to docs.google.com and will keep following any links originating in a seedURL on docs.google.com, but only as long as they stay within that topsite domain (docs.google.com).
4. Tidied some old-fashioned use of Iterator, replacing it with newer-style for loops that work with types. Committing before updating the code to use the Apache CSV API.

File size: 10.5 KB

# Mapping of top sites in base url forms to value

# This file contains sites that are too large to crawl exhaustively.
# The domains are from Alexa top sites (where only the first 50 were visible).
# Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
# Finally also added https://moz.com/top500 by downloading its CSV file and
# adding its URLs to the existing listing here from alexa/wiki.
# Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
# just <site>.ext
# And finally, re-sorted the reduced list alphabetically and pasted into here.

# FORMAT OF THIS FILE'S CONTENTS:
# <topsite-base-url>,<value>
# where <value> can be empty or one of SUBDOMAIN-COPY, SINGLEPAGE, FOLLOW-LINKS-WITHIN-TOPSITE,
# <url-form-without-protocol>
#
# - if value is empty: if seedurl contains topsite-base-url, the seedurl will go into the file
# unprocessed-topsite-matches.txt and the site/page won't be crawled.
# The user will be notified to inspect the file unprocessed-topsite-matches.txt.
# - SINGLEPAGE: if seedurl matches topsite-base-url, then only download the page at that seedurl.
# For example, if the seedurl is http://docs.google.com/some-long-suffix-in-base64, then it
# matches the topsite-base-url of docs.google.com and its value of SINGLEPAGE will add the
# seedurl itself as the regex url-filter, to restrict the crawl to just the specified page.
# - SUBDOMAIN-COPY: if seedurl CONTAINS topsite-base-url, then the seedurl's subdomain (or else
# its domain) will make up the urlfilter, so we don't leak out into a larger domain.
# Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if seedurl is
# pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
# will ensure we restrict crawling to pages on pinky.blogspot.com.
# However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
# into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
# - FOLLOW-LINKS-WITHIN-TOPSITE: if seedurl matches topsite-base-url, then pages linked from the
# seedurl page can be followed and downloaded, as long as they're within the same subdomain
# matching the topsite-base-url.
# This is different from SUBDOMAIN-COPY, as that can download all of a specific subdomain but
# restricts against downloading the entire domain (e.g. all pinky.blogspot.com and not anything
# else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (at
# depth specified for the nutch crawl) as long as they're within the topsite-base-url.
# e.g. seedURLs on docs.google.com containing links will have those linked pages and any
# they link to etc. downloaded as long as they're on docs.google.com.
# - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
# url-form-without-protocol will make up the urlfilter, again preventing leaking into a
# larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
# match the topsite-base-url of wikipedia.org for which the <url-form-without-protocol>
# value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
# <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
# crawl to just mi.wikipedia.org.
# Remember to leave out any protocol from <url-form-without-protocol>.

# column 3: whether nutch should do fetch all or not
# column 4: number of crawl iterations

# docs.google.com is a special case: not all pages are public and any interlinking is likely to
# be intentional. But SUBDOMAIN-COPY does not work: the seedURL's domain is then docs.google.com
# itself which, when combined with SUBDOMAIN-COPY, the Java code treats as a special case, so that
# any seedURL on docs.google.com ends up pushed out into the "unprocessed....txt" text file.
#docs.google.com,SUBDOMAIN-COPY
docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE

drive.google.com,SINGLEPAGE
forms.office.com,SINGLEPAGE
player.vimeo.com,SINGLEPAGE
static-promote.weebly.com,SINGLEPAGE

# Special case of yale.edu: its Rapa-Nui pages are on the blacklist, but we want this page + its photos.
# The page's containing folder is whitelisted in case the photos are there.
korora.econ.yale.edu,SINGLEPAGE

000webhost.com
360.cn
4shared.com
a8.net
abc.es
abc.net.au
abcnews.go.com
about.com
about.me
aboutads.info
abril.com.br
academia.edu
accuweather.com
addthis.com
addtoany.com
adobe.com
adweek.com
airbnb.com
akamaihd.net
alexa.com
alibaba.com
aliexpress.com
alipay.com
aljazeera.com
allaboutcookies.org
allrecipes.com
amazon.ca
amazon.co.jp
amazon.co.uk
amazon.com
amazon.de
amazon.es
amazon.fr
amazon.in
ameblo.jp
ampproject.org
android.com
aol.com
ap.org
apache.org
apachefriends.org
apple.com
archive.org
archives.gov
arstechnica.com
arxiv.org
asahi.com
ask.fm
asus.com
axs.com
babytree.com
baidu.com
bandcamp.com
bbc.co.uk
bbc.com
behance.net
berkeley.edu
biblegateway.com
biglobe.ne.jp
billboard.com
bing.com
bit.ly
bitly.com
blackberry.com
blogger.com
blogspot.com,SUBDOMAIN-COPY
bloomberg.com
booking.com
boston.com
box.com
britannica.com
bt.com
bund.de
businessinsider.com
businesswire.com
buydomains.com
buzzfeed.com
ca.gov
cambridge.org
canalblog.com
cbc.ca
cbslocal.com
cbsnews.com
cdc.gov
change.org
channel4.com
chicagotribune.com
chinadaily.com.cn
cisco.com
clickbank.net
cloudflare.com
cmu.edu
cnbc.com
cnet.com
cnn.com
cocolog-nifty.com
columbia.edu
connect.over-blog.com
cornell.edu
corriere.it
cpanel.com
cpanel.net
creativecommons.org
csdn.net
csmonitor.com
dailymail.co.uk
dailymotion.com
dan.com
daum.net
debian.org
dell.com
depositfiles.com
detik.com
digg.com
discovery.com
disney.com
disney.go.com
disqus.com
doubleclick.net
dreniq.com
dribbble.com
dropbox.com,SINGLEPAGE
dropboxusercontent.com
dw.com
e-recht24.de
ea.com
ebay.co.uk
ebay.com
economist.com
eff.org
ehow.com
elmundo.es
elpais.com
engadget.com
entrepreneur.com
eonline.com
espn.com
espn.go.com
etsy.com
europa.eu
eventbrite.com
example.com
excite.co.jp
express.co.uk
facebook.com
fandom.com
fastcompany.com
fb.com
fb.me
fda.gov
fedoraproject.org
feedburner.com
fifa.com
files.wordpress.com
flickr.com
forbes.com
fortune.com
foursquare.com
foxnews.com
ft.com
ftc.gov
gen.xyz
geocities.jp
gesetze-im-internet.de
ggpht.com
github.com
gizmodo.com
globo.com
gmail.com
gnu.org
godaddy.com
gofundme.com
goo.gl
goo.ne.jp
goodreads.com
google.ca
google.co.id
google.co.in
google.co.jp
google.co.uk
google.com
google.com.br
google.com.hk
google.com.tr
google.de
google.es
google.fr
google.it
google.nl
google.pl
google.ru
googleapis.com
googleblog.com
googleusercontent.com
gooyaabitemplates.com
gov.uk
gravatar.com
greenpeace.org
gstatic.com
guardian.co.uk
harvard.edu
hatena.ne.jp
histats.com
hm.com
hollywoodreporter.com
home.pl
house.gov
howstuffworks.com
hp.com
huffingtonpost.com
huffpost.com
hugedomains.com
ibm.com
ibtimes.com
icann.org
ieee.org
ietf.org
ig.com.br
ign.com
ikea.com
imageshack.us
imdb.com
imgur.com
inc.com
independent.co.uk
indiatimes.com
indiegogo.com
instagram.com
instructables.com
intel.com
interia.pl
issuu.com
istockphoto.com
iubenda.com
jd.com
joomla.org
jquery.com
jstor.org
kickstarter.com
kinja.com
last.fm
latimes.com
lefigaro.fr
lemonde.fr
line.me
linkedin.com
list-manage.com
live.com
livejournal.com
livescience.com
loc.gov
lonelyplanet.com
lycos.com
m.wikipedia.org,mi.m.wikipedia.org
mail.ru
marketwatch.com
marriott.com
mashable.com
mediafire.com
medium.com
mega.nz
megaupload.com
mercurynews.com
merriam-webster.com
metro.co.uk
microsoft.com,microsoft.com/mi-nz/
microsoftonline.com
mirror.co.uk
mit.edu
mixcloud.com
mlb.com
mozilla.com
mozilla.org
msn.com
myspace.com
mysql.com
namecheap.com
narod.ru
nasa.gov
nationalgeographic.com
nature.com
naver.com
naver.jp
nba.com
nbcnews.com
ndtv.com
netflix.com
netsons.com
netvibes.com
networkadvertising.org
news.com.au
newscientist.com
newsweek.com
newyorker.com
nginx.com
nginx.org
nhk.or.jp
nicovideo.jp
nifty.com
nih.gov
nikkei.com
noaa.gov
nokia.com
npr.org
nvidia.com
nydailynews.com
nypost.com
nytimes.com
nyu.edu
odnoklassniki.ru
office.com
offset.com
ok.ru
okezone.com
opera.com
oracle.com
orange.fr
oreilly.com
oup.com
over-blog.com
ovh.co.uk
ovh.com
ovh.net
ox.ac.uk
parallels.com
pastebin.com
paypal.com
pbs.org
pcmag.com
people.com
photobucket.com
php.net
pinterest.com,SINGLEPAGE
pixabay.com
playstation.com
plesk.com
plos.org
politico.com
prestashop.com
prezi.com
princeton.edu
privacyshield.gov
prnewswire.com
psychologytoday.com
qq.com
quantcast.com
quora.com
rakuten.co.jp
rambler.ru
rapidshare.com
reddit.com
repubblica.it
researchgate.net
reuters.com
ria.ru
rottentomatoes.com
rt.com
rtve.es
sakura.ne.jp
samsung.com
sapo.pt
scholastic.com
sciencedaily.com
sciencedirect.com
sciencemag.org
scientificamerican.com
scribd.com
seattletimes.com
secureserver.net
sedo.com
seesaa.net
sendspace.com
sfgate.com
shopify.com
shutterstock.com
siemens.com
sina.com.cn
sky.com
skype.com
skyrock.com
slate.com
slideshare.net
sm.cn
smh.com.au
so-net.ne.jp
softonic.com
sogou.com
sohu.com
soratemplates.com
soso.com
soundcloud.com
spiegel.de
spotify.com
springer.com
sputniknews.com
ssl-images-amazon.com
stackoverflow.com
standard.co.uk
stanford.edu
state.gov
steamcommunity.com
steampowered.com
storage.canalblog.com
storage.googleapis.com
stores.jp
storify.com
stuff.co.nz,SINGLEPAGE
surveymonkey.com
symantec.com
t-online.de
t.co
t.me
tabelog.com
taobao.com
target.com
teamviewer.com
techcrunch.com
ted.com
telegram.me
telegraph.co.uk
terra.com.br
theatlantic.com
thefreedictionary.com
theglobeandmail.com
theguardian.com
themeforest.net
thenextweb.com
thestar.com
thesun.co.uk
thetimes.co.uk
theverge.com
thoughtco.com
tianya.cn
time.com
tinyurl.com
tmall.com
tmz.com
tribunnews.com
tripadvisor.com
trustpilot.com
twitch.tv
twitter.com
ucoz.ru
uiuc.edu
umich.edu
un.org
undeveloped.com
unesco.org
uol.com.br
urbandictionary.com
usa.gov
usatoday.com
usgs.gov
usnews.com
uspto.gov
ustream.tv
utexas.edu
variety.com
venturebeat.com
vice.com
viglink.com
vimeo.com
vk.com
vkontakte.ru
vox.com
w3.org
w3schools.com
wa.me
walmart.com
washington.edu
washingtonpost.com
wattpad.com
weather.com
web.fc2.com
webmd.com
weebly.com
weibo.com
welt.de
whatsapp.com
whitehouse.gov
who.int
wikia.com
wikihow.com
wikimedia.org
wikipedia.org,mi.wikipedia.org
wiktionary.org,mi.wiktionary.org
wiley.com
windowsphone.com
wired.com
wix.com
wordpress.org,SUBDOMAIN-COPY
worldbank.org
wp.com
wsj.com
xbox.com
xinhuanet.com
yadi.sk
yahoo.co.jp
yahoo.com
yale.edu
yandex.ru
yelp.com
youku.com
youronlinechoices.com
youtu.be
youtube.com
ytimg.com
zdnet.com
zend.com
zendesk.com
zippyshare.com
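
The value rules described in the file's header comments can be illustrated with a short sketch. This is a Python illustration only, not the project's Java code. The names parse_topsites and decide, the action labels CRAWL, UNPROCESSED and FILTER, and the choice to prefer the longest matching topsite are all assumptions made for the example. Matching is also simplified to a comparison on the host name, whereas the comments speak of the seedurl "containing" the topsite-base-url.

```python
from urllib.parse import urlparse


def parse_topsites(text):
    """Parse '<topsite-base-url>,<value>' lines into a dict.

    Blank lines and '#' comment lines are skipped. A line without a comma
    yields an empty value. (Hypothetical helper, for illustration only.)
    """
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        topsite, _, value = line.partition(",")
        mapping[topsite.strip()] = value.strip()
    return mapping


def decide(seed_url, mapping):
    """Return (action, url_filter) for a seed URL.

    Assumption: when several topsites match, the longest one wins, so that
    docs.google.com takes precedence over google.com.
    """
    # Tolerate seed URLs given without a protocol.
    with_scheme = seed_url if "://" in seed_url else "http://" + seed_url
    host = urlparse(with_scheme).hostname or ""

    matches = [t for t in mapping if host == t or host.endswith("." + t)]
    if not matches:
        return ("CRAWL", None)  # not a topsite: no special handling
    topsite = max(matches, key=len)
    value = mapping[topsite]

    if value == "":
        # Goes into unprocessed-topsite-matches.txt; not crawled.
        return ("UNPROCESSED", None)
    if value == "SINGLEPAGE":
        # The seed URL itself becomes the url-filter.
        return ("FILTER", seed_url)
    if value == "SUBDOMAIN-COPY":
        # An exact match on the topsite is not crawled; a subdomain is.
        if host == topsite:
            return ("UNPROCESSED", None)
        return ("FILTER", host)
    if value == "FOLLOW-LINKS-WITHIN-TOPSITE":
        # Links are followed as long as they stay within the topsite.
        return ("FILTER", topsite)
    # Otherwise the value is a <url-form-without-protocol>.
    return ("FILTER", value)
```

For instance, under these assumptions a seed URL on pinky.blogspot.com would be filtered to pinky.blogspot.com, while a seed URL on blogspot.com itself would be reported as unprocessed.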

source: gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt@ 33561