sites-too-big-to-exhaustively-crawl.txt@ 34011

Last change on this file since 34011 was 33904, checked in by ak19, 4 years ago

Shouldn't greylist anglican.org, as this prevented crawling of the justus.anglican.org seedURLs. However, there's no need to add an exception to sites-too-big-to-exhaustively-crawl.txt to control how much we crawl, as we only crawl to depth 10 anyway and the seedURLs already list the most promising pages (as well as 2 URLs on anglican.org which weren't promising). Added the to_crawl and finished crawled data for this. siteID is 01463.

File size: 11.3 KB

Line
1	# Mapping of top sites in base url forms to value
2
3	# This file contains sites that are too large to crawl exhaustively.
4	# The domains are from Alexa top sites (where only the first 50 were visible)
5	# Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
6	# Finally also added https://moz.com/top500 by downloading its CSV file and
7	# adding its URLs to the existing listing here from alexa/wiki.
8	# Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
9	# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
10	# just <site>.ext
11	# And finally, re-sorted the reduced list alphabetically and pasted into here.
12
13	# FORMAT OF THIS FILE'S CONTENTS:
14	# <topsite-base-url>,<value>
15	# where <value> is one of
16	# empty, SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE, SINGLEPAGE, <url-form-without-protocol>
17	#
18	# - if value is left empty: if seedurl contains topsite-base-url, the seedurl will go into the
19	# file unprocessed-topsite-matches.txt and the site/page won't be crawled.
20	# The user will be notified to inspect the file unprocessed-topsite-matches.txt.
21	# - SINGLEPAGE: if seedurl matches topsite-base-url, then only download the page at that seedurl.
22	# For example, if the seedurl is http://drive.google.com/some-long-suffix-in-base64, then it
23	# matches the topsite-base-url of drive.google.com and its value of SINGLEPAGE will add the
24	# seedurl itself as the regex url-filter, to restrict the crawl to just the specified page.
25	# - SUBDOMAIN-COPY: if seedurl CONTAINS topsite-base-url, then the seedurl's own subdomain (or
26	# else its domain) will make up the urlfilter, so we don't leak out into a larger domain.
27	# Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if seedurl is
28	# pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
29	# will ensure we restrict crawling to pages on pinky.blogspot.com.
30	# However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
31	# into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
32	# - FOLLOW-LINKS-WITHIN-TOPSITE: download the seedURL pages, then follow and download the pages
33	# linked from each seedURL page too, as long as they're within the same subdomain
34	# matching the topsite-base-url.
35	# This is different from SUBDOMAIN-COPY, as that can download all of a specific subdomain but
36	# restricts against downloading the entire domain (e.g. all pinky.blogspot.com and not anything
37	# else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (at
38	# depth specified for the nutch crawl) as long as they're within the topsite-base-url.
39	# e.g. seedURLs on docs.google.com containing links will have those linked pages and any
40	# they link to etc. downloaded as long as they're on docs.google.com.
41	# - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
42	# url-form-without-protocol will make up the urlfilter, again preventing leaking into a
43	# larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
44	# match the topsite-base-url of wikipedia.org for which the <url-form-without-protocol>
45	# value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
46	# <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
47	# crawl to just mi.wikipedia.org.
48	# Remember to leave out any protocol from <url-form-without-protocol>.
49	#
50	# TODO If useful:
51	# column 3: whether nutch should do fetch all or not
52	# column 4: number of crawl iterations
53
54
55	# NOT TOP SITES, BUT SITES WE INSPECTED AND WANT TO CONTROL SIMILARLY TO TOP SITES
56	00.gs,SINGLEPAGE
57	# May be a large site with only seedURLs of real relevance
58	topographic-map.com,SINGLEPAGE
59	ami-media.net,SINGLEPAGE
60	# 2 pages of declarations of human rights in Maori, rest in other languages
61	anitra.net,SINGLEPAGE
62	# special case
63	mi.centr-zashity.ru,SINGLEPAGE
64
65	# we want the http://loquevendra318.com/fox/maori.html seed URL but also
66	# pages within the following subsection
67	loquevendra318.com,loquevendra318.com/fox/maori/
68
69	martinvrijland.nl,martinvrijland.nl/mi/
70	csunplugged.org,csunplugged.org/mi/
71	gpedia.com,gpedia.com/mi/
72
73	# TOP SITE BUT NOT TOP 500
74	www.tumblr.com,SINGLEPAGE
75
76
77	# TOP SITES
78
79	# docs.google.com is a special case: not all pages are public and any interlinking is likely to
80	# be intentional. Grab all linked pages, for link depth set with nutch's crawl, as long as the
81	# links are within the given topsite-base-url
82	docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE
83
84	# Just crawl a single page for these:
85	drive.google.com,SINGLEPAGE
86	forms.office.com,SINGLEPAGE
87	player.vimeo.com,SINGLEPAGE
88	static-promote.weebly.com,SINGLEPAGE
89
90	# Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
91	# The page's containing folder is whitelisted in case the photos are there.
92	korora.econ.yale.edu,SINGLEPAGE
93
94	# Special case of justus.anglican.org - no meaningful seedURLs directly on anglican.org
95	# but justus.anglican.org has them
96	#anglican.org,justus.anglican.org
97
98	000webhost.com
99	360.cn
100	4shared.com
101	a8.net
102	abc.es
103	abc.net.au
104	abcnews.go.com
105	about.com
106	about.me
107	aboutads.info
108	abril.com.br
109	academia.edu
110	accuweather.com
111	addthis.com
112	addtoany.com
113	adobe.com
114	adweek.com
115	airbnb.com
116	akamaihd.net
117	alexa.com
118	alibaba.com
119	aliexpress.com
120	alipay.com
121	aljazeera.com
122	allaboutcookies.org
123	allrecipes.com
124	amazon.ca
125	amazon.co.jp
126	amazon.co.uk
127	amazon.com
128	amazon.de
129	amazon.es
130	amazon.fr
131	amazon.in
132	ameblo.jp
133	ampproject.org
134	android.com
135	aol.com
136	ap.org
137	apache.org
138	apachefriends.org
139	apple.com
140	archive.org
141	archives.gov
142	arstechnica.com
143	arxiv.org
144	asahi.com
145	ask.fm
146	asus.com
147	axs.com
148	babytree.com
149	baidu.com
150	bandcamp.com
151	bbc.co.uk
152	bbc.com
153	behance.net
154	berkeley.edu
155	biblegateway.com
156	biglobe.ne.jp
157	billboard.com
158	bing.com
159	bit.ly
160	bitly.com
161	blackberry.com
162	blogger.com
163	blogspot.com,SUBDOMAIN-COPY
164	bloomberg.com
165	booking.com
166	boston.com
167	box.com
168	britannica.com
169	bt.com
170	bund.de
171	businessinsider.com
172	businesswire.com
173	buydomains.com
174	buzzfeed.com
175	ca.gov
176	cambridge.org
177	canalblog.com
178	cbc.ca
179	cbslocal.com
180	cbsnews.com
181	cdc.gov
182	change.org
183	channel4.com
184	chicagotribune.com
185	chinadaily.com.cn
186	cisco.com
187	clickbank.net
188	cloudflare.com
189	cmu.edu
190	cnbc.com
191	cnet.com
192	cnn.com
193	cocolog-nifty.com
194	columbia.edu
195	connect.over-blog.com
196	cornell.edu
197	corriere.it
198	cpanel.com
199	cpanel.net
200	creativecommons.org
201	csdn.net
202	csmonitor.com
203	dailymail.co.uk
204	dailymotion.com
205	dan.com
206	daum.net
207	debian.org
208	dell.com
209	depositfiles.com
210	detik.com
211	digg.com
212	discovery.com
213	disney.com
214	disney.go.com
215	disqus.com
216	doubleclick.net
217	dreniq.com
218	dribbble.com
219	dropbox.com,SINGLEPAGE
220	dropboxusercontent.com
221	dw.com
222	e-recht24.de
223	ea.com
224	ebay.co.uk
225	ebay.com
226	economist.com
227	eff.org
228	ehow.com
229	elmundo.es
230	elpais.com
231	engadget.com
232	entrepreneur.com
233	eonline.com
234	espn.com
235	espn.go.com
236	etsy.com
237	europa.eu
238	eventbrite.com
239	example.com
240	excite.co.jp
241	express.co.uk
242	facebook.com
243	fandom.com
244	fastcompany.com
245	fb.com
246	fb.me
247	fda.gov
248	fedoraproject.org
249	feedburner.com
250	fifa.com
251	files.wordpress.com
252	flickr.com
253	forbes.com
254	fortune.com
255	foursquare.com
256	foxnews.com
257	ft.com
258	ftc.gov
259	gen.xyz
260	geocities.jp
261	gesetze-im-internet.de
262	ggpht.com
263	github.com
264	gizmodo.com
265	globo.com
266	gmail.com
267	gnu.org
268	godaddy.com
269	gofundme.com
270	goo.gl
271	goo.ne.jp
272	goodreads.com
273	google.ca
274	google.co.id
275	google.co.in
276	google.co.jp
277	google.co.uk
278	google.com
279	google.com.br
280	google.com.hk
281	google.com.tr
282	google.de
283	google.es
284	google.fr
285	google.it
286	google.nl
287	google.pl
288	google.ru
289	googleapis.com
290	googleblog.com
291	googleusercontent.com
292	gooyaabitemplates.com
293	gov.uk
294	gravatar.com
295	greenpeace.org
296	gstatic.com
297	guardian.co.uk
298	harvard.edu
299	hatena.ne.jp
300	histats.com
301	hm.com
302	hollywoodreporter.com
303	home.pl
304	house.gov
305	howstuffworks.com
306	hp.com
307	huffingtonpost.com
308	huffpost.com
309	hugedomains.com
310	ibm.com
311	ibtimes.com
312	icann.org
313	ieee.org
314	ietf.org
315	ig.com.br
316	ign.com
317	ikea.com
318	imageshack.us
319	imdb.com
320	imgur.com
321	inc.com
322	independent.co.uk
323	indiatimes.com
324	indiegogo.com
325	instagram.com
326	instructables.com
327	intel.com
328	interia.pl
329	issuu.com
330	istockphoto.com
331	iubenda.com
332	jd.com
333	joomla.org
334	jquery.com
335	jstor.org
336	kickstarter.com
337	kinja.com
338	last.fm
339	latimes.com
340	lefigaro.fr
341	lemonde.fr
342	line.me
343	linkedin.com
344	list-manage.com
345	live.com
346	livejournal.com
347	livescience.com
348	loc.gov
349	lonelyplanet.com
350	lycos.com
351	m.wikipedia.org,mi.m.wikipedia.org
352	mail.ru
353	marketwatch.com
354	marriott.com
355	mashable.com
356	mediafire.com
357	medium.com
358	mega.nz
359	megaupload.com
360	mercurynews.com
361	merriam-webster.com
362	metro.co.uk
363	microsoft.com,microsoft.com/mi-nz/
364	microsoftonline.com
365	mirror.co.uk
366	mit.edu
367	mixcloud.com
368	mlb.com
369	mozilla.com
370	mozilla.org
371	msn.com
372	myspace.com
373	mysql.com
374	namecheap.com
375	narod.ru
376	nasa.gov
377	nationalgeographic.com
378	nature.com
379	naver.com
380	naver.jp
381	nba.com
382	nbcnews.com
383	ndtv.com
384	netflix.com
385	netsons.com
386	netvibes.com
387	networkadvertising.org
388	news.com.au
389	newscientist.com
390	newsweek.com
391	newyorker.com
392	nginx.com
393	nginx.org
394	nhk.or.jp
395	nicovideo.jp
396	nifty.com
397	nih.gov
398	nikkei.com
399	noaa.gov
400	nokia.com
401	npr.org
402	nvidia.com
403	nydailynews.com
404	nypost.com
405	nytimes.com
406	nyu.edu
407	odnoklassniki.ru
408	office.com
409	offset.com
410	ok.ru
411	okezone.com
412	opera.com
413	oracle.com
414	orange.fr
415	oreilly.com
416	oup.com
417	over-blog.com
418	ovh.co.uk
419	ovh.com
420	ovh.net
421	ox.ac.uk
422	parallels.com
423	pastebin.com
424	paypal.com
425	pbs.org
426	pcmag.com
427	people.com
428	photobucket.com
429	php.net
430	pinterest.com,SINGLEPAGE
431	pixabay.com
432	playstation.com
433	plesk.com
434	plos.org
435	politico.com
436	prestashop.com
437	prezi.com
438	princeton.edu
439	privacyshield.gov
440	prnewswire.com
441	psychologytoday.com
442	qq.com
443	quantcast.com
444	quora.com
445	rakuten.co.jp
446	rambler.ru
447	rapidshare.com
448	reddit.com
449	repubblica.it
450	researchgate.net
451	reuters.com
452	ria.ru
453	rottentomatoes.com
454	rt.com
455	rtve.es
456	sakura.ne.jp
457	samsung.com
458	sapo.pt
459	scholastic.com
460	sciencedaily.com
461	sciencedirect.com
462	sciencemag.org
463	scientificamerican.com
464	scribd.com
465	seattletimes.com
466	secureserver.net
467	sedo.com
468	seesaa.net
469	sendspace.com
470	sfgate.com
471	shopify.com
472	shutterstock.com
473	siemens.com
474	sina.com.cn
475	sky.com
476	skype.com
477	skyrock.com
478	slate.com
479	slideshare.net
480	sm.cn
481	smh.com.au
482	so-net.ne.jp
483	softonic.com
484	sogou.com
485	sohu.com
486	soratemplates.com
487	soso.com
488	soundcloud.com
489	spiegel.de
490	spotify.com
491	springer.com
492	sputniknews.com
493	ssl-images-amazon.com
494	stackoverflow.com
495	standard.co.uk
496	stanford.edu
497	state.gov
498	steamcommunity.com
499	steampowered.com
500	storage.canalblog.com
501	storage.googleapis.com
502	stores.jp
503	storify.com
504	stuff.co.nz,SINGLEPAGE
505	surveymonkey.com
506	symantec.com
507	t-online.de
508	t.co
509	t.me
510	tabelog.com
511	taobao.com
512	target.com
513	teamviewer.com
514	techcrunch.com
515	ted.com
516	telegram.me
517	telegraph.co.uk
518	terra.com.br
519	theatlantic.com
520	thefreedictionary.com
521	theglobeandmail.com
522	theguardian.com
523	themeforest.net
524	thenextweb.com
525	thestar.com
526	thesun.co.uk
527	thetimes.co.uk
528	theverge.com
529	thoughtco.com
530	tianya.cn
531	time.com
532	tinyurl.com
533	tmall.com
534	tmz.com
535	tribunnews.com
536	tripadvisor.com
537	trustpilot.com
538	twitch.tv
539	twitter.com
540	ucoz.ru
541	uiuc.edu
542	umich.edu
543	un.org
544	undeveloped.com
545	unesco.org
546	uol.com.br
547	urbandictionary.com
548	usa.gov
549	usatoday.com
550	usgs.gov
551	usnews.com
552	uspto.gov
553	ustream.tv
554	utexas.edu
555	variety.com
556	venturebeat.com
557	vice.com
558	viglink.com
559	vimeo.com
560	vk.com
561	vkontakte.ru
562	vox.com
563	w3.org
564	w3schools.com
565	wa.me
566	walmart.com
567	washington.edu
568	washingtonpost.com
569	wattpad.com
570	weather.com
571	web.fc2.com
572	webmd.com
573	weebly.com
574	weibo.com
575	welt.de
576	whatsapp.com
577	whitehouse.gov
578	who.int
579	wikia.com
580	wikihow.com
581	wikimedia.org
582	wikipedia.org,mi.wikipedia.org
583	wiktionary.org,mi.wiktionary.org
584	wiley.com
585	windowsphone.com
586	wired.com
587	wix.com
588	wordpress.org,SUBDOMAIN-COPY
589	worldbank.org
590	wp.com
591	wsj.com
592	xbox.com
593	xinhuanet.com
594	yadi.sk
595	yahoo.co.jp
596	yahoo.com
597	yale.edu
598	yandex.ru
599	yelp.com
600	youku.com
601	youronlinechoices.com
602	youtu.be
603	youtube.com
604	ytimg.com
605	zdnet.com
606	zend.com
607	zendesk.com
608	zippyshare.com
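
The value rules described in the header comments of this file can be sketched in Python roughly as follows. This is only an illustrative sketch, not the project's actual code. The function names `parse_topsites` and `decide`, the action labels, the use of host-suffix matching (the comments only say the seedurl "contains" the topsite-base-url), and the choice to prefer the longest matching base URL are all assumptions made for the example.

```python
def parse_topsites(text):
    """Return {topsite-base-url: value}; value is '' when left empty."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        base, _, value = line.partition(",")
        rules[base.strip()] = value.strip()
    return rules


def strip_protocol(url):
    """Drop a leading 'http://' or 'https://' if present."""
    return url.split("://", 1)[-1]


def decide(seed_url, rules):
    """Return (action, filter_target) for one seed URL.

    action is one of the hypothetical labels 'NO-MATCH', 'UNPROCESSED'
    (seed would be listed in unprocessed-topsite-matches.txt) or 'FILTER'
    (filter_target would go into the regex urlfilter).
    """
    bare = strip_protocol(seed_url)
    host = bare.split("/", 1)[0]
    # Assumption: match on the host, and let the most specific entry win,
    # so docs.google.com is preferred over google.com.
    matches = [b for b in rules if host == b or host.endswith("." + b)]
    if not matches:
        return ("NO-MATCH", None)
    base = max(matches, key=len)
    value = rules[base]
    if value == "":
        return ("UNPROCESSED", None)
    if value == "SINGLEPAGE":
        return ("FILTER", bare)  # only the seed page itself
    if value == "SUBDOMAIN-COPY":
        if host == base:  # exact match on the top site: not crawled
            return ("UNPROCESSED", None)
        return ("FILTER", host)  # e.g. pinky.blogspot.com
    if value == "FOLLOW-LINKS-WITHIN-TOPSITE":
        return ("FILTER", base)
    return ("FILTER", value)  # <url-form-without-protocol>


# A few entries copied from this file, used as sample input.
SAMPLE = """
# comment lines and blank lines are ignored
blogspot.com,SUBDOMAIN-COPY
docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE
drive.google.com,SINGLEPAGE
google.com
wikipedia.org,mi.wikipedia.org
"""

if __name__ == "__main__":
    rules = parse_topsites(SAMPLE)
    for seed in ("http://pinky.blogspot.com/post",
                 "https://www.google.com/search",
                 "https://mi.wikipedia.org/wiki/SomePage"):
        print(seed, "->", decide(seed, rules))
```

Matching on the host rather than on the whole seed URL is a deliberate simplification here: a plain substring test would also match a top site name that merely appears in another site's path or query string.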


source: other-projects/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt@ 34011
