source: gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt@ 33568

Last change on this file since 33568 was 33568, checked in by ak19, 5 years ago
More sites greylisted and blacklisted, discovered as I attempted to crawl them and afterwards learnt to investigate sites first. Should all .ru and .pl domains be on the greylist? 2. Adjusted instruction comments in CCWETProcessor for compiling and running
File size: 10.6 KB

1	# Mapping of top sites in base url forms to value
2
3	# This file contains sites that are too large to crawl exhaustively.
4	# The domains are from Alexa top sites (where only the first 50 were visible)
5	# Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
6	# Finally also added https://moz.com/top500 by downloading its CSV file and
7	# adding its URLs to the existing listing here from alexa/wiki.
8	# Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
9	# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
10	# just <site>.ext
11	# And finally, re-sorted the reduced list alphabetically and pasted into here.
12
13	# FORMAT OF THIS FILE'S CONTENTS:
14	# <topsite-base-url>,<value>
15	# where <value> is one of
16	# empty, SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE, SINGLEPAGE, <url-form-without-protocol>
17	#
18	# - if value is left empty: if seedurl contains topsite-base-url, the seedurl will go into the
19	# file unprocessed-topsite-matches.txt and the site/page won't be crawled.
20	# The user will be notified to inspect the file unprocessed-topsite-matches.txt.
21	# - SINGLEPAGE: if seedurl matches topsite-base-url, then only download the page at that seedurl.
22	# For example, if the seedurl is http://docs.google.com/some-long-suffix-in-base64, then it
23	# matches the topsite-base-url of docs.google.com and its value of SINGLEPAGE will add the
24	# seedurl itself as the regex url-filter, to restrict the crawl to just the specified page.
25	# - SUBDOMAIN-COPY: if seedurl CONTAINS topsite-base-url, then the seedurl's subdomain (or,
26	# failing that, its domain) will make up the urlfilter, so we don't leak out into a larger domain.
27	# Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if seedurl is
28	# pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
29	# will ensure we restrict crawling to pages on pinky.blogspot.com.
30	# However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
31	# into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
32	# - FOLLOW-LINKS-WITHIN-TOPSITE: pages linked from the seedURL page can be followed and
33	# downloaded, as long as they're within the same subdomain matching the topsite-base-url.
34	# This is different from SUBDOMAIN-COPY, as that can download all of a specific subdomain but
35	# restricts against downloading the entire domain (e.g. all pinky.blogspot.com and not anything
36	# else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (at
37	# depth specified for the nutch crawl) as long as they're within the topsite-base-url.
38	# e.g. seedURLs on docs.google.com containing links will have those linked pages and any
39	# they link to etc. downloaded as long as they're on docs.google.com.
40	# - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
41	# url-form-without-protocol will make up the urlfilter, again preventing leaking into a
42	# larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
43	# match the topsite-base-url of wikipedia.org for which the <url-form-without-protocol>
44	# value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
45	# <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
46	# crawl to just mi.wikipedia.org.
47	# Remember to leave out any protocol from <url-form-without-protocol>.
48	#
49	# TODO If useful:
50	# column 3: whether nutch should fetch all or not
51	# column 4: number of crawl iterations
52
53
54	# NOT TOP SITES, BUT SITES WE INSPECTED AND WANT TO CONTROL SIMILARLY TO TOP SITES
55	00.gs,SINGLEPAGE
56
57	# May be a large site
58	topographic-map.com,SINGLEPAGE
59
60	# TOP SITES
61
62	# docs.google.com is a special case: not all pages are public and any interlinking is likely to
63	# be intentional. Grab all linked pages, to the link depth set for nutch's crawl, as long as the
64	# links are within the given topsite-base-url.
65	docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE
66
67	# Just crawl a single page for these:
68	drive.google.com,SINGLEPAGE
69	forms.office.com,SINGLEPAGE
70	player.vimeo.com,SINGLEPAGE
71	static-promote.weebly.com,SINGLEPAGE
72
73	# Special case of yale.edu: its Rapa-Nui pages are on the blacklist, but we want this page and its photos.
74	# The page's containing folder is whitelisted in case the photos are there.
75	korora.econ.yale.edu,SINGLEPAGE
76
77	000webhost.com
78	360.cn
79	4shared.com
80	a8.net
81	abc.es
82	abc.net.au
83	abcnews.go.com
84	about.com
85	about.me
86	aboutads.info
87	abril.com.br
88	academia.edu
89	accuweather.com
90	addthis.com
91	addtoany.com
92	adobe.com
93	adweek.com
94	airbnb.com
95	akamaihd.net
96	alexa.com
97	alibaba.com
98	aliexpress.com
99	alipay.com
100	aljazeera.com
101	allaboutcookies.org
102	allrecipes.com
103	amazon.ca
104	amazon.co.jp
105	amazon.co.uk
106	amazon.com
107	amazon.de
108	amazon.es
109	amazon.fr
110	amazon.in
111	ameblo.jp
112	ampproject.org
113	android.com
114	aol.com
115	ap.org
116	apache.org
117	apachefriends.org
118	apple.com
119	archive.org
120	archives.gov
121	arstechnica.com
122	arxiv.org
123	asahi.com
124	ask.fm
125	asus.com
126	axs.com
127	babytree.com
128	baidu.com
129	bandcamp.com
130	bbc.co.uk
131	bbc.com
132	behance.net
133	berkeley.edu
134	biblegateway.com
135	biglobe.ne.jp
136	billboard.com
137	bing.com
138	bit.ly
139	bitly.com
140	blackberry.com
141	blogger.com
142	blogspot.com,SUBDOMAIN-COPY
143	bloomberg.com
144	booking.com
145	boston.com
146	box.com
147	britannica.com
148	bt.com
149	bund.de
150	businessinsider.com
151	businesswire.com
152	buydomains.com
153	buzzfeed.com
154	ca.gov
155	cambridge.org
156	canalblog.com
157	cbc.ca
158	cbslocal.com
159	cbsnews.com
160	cdc.gov
161	change.org
162	channel4.com
163	chicagotribune.com
164	chinadaily.com.cn
165	cisco.com
166	clickbank.net
167	cloudflare.com
168	cmu.edu
169	cnbc.com
170	cnet.com
171	cnn.com
172	cocolog-nifty.com
173	columbia.edu
174	connect.over-blog.com
175	cornell.edu
176	corriere.it
177	cpanel.com
178	cpanel.net
179	creativecommons.org
180	csdn.net
181	csmonitor.com
182	dailymail.co.uk
183	dailymotion.com
184	dan.com
185	daum.net
186	debian.org
187	dell.com
188	depositfiles.com
189	detik.com
190	digg.com
191	discovery.com
192	disney.com
193	disney.go.com
194	disqus.com
195	doubleclick.net
196	dreniq.com
197	dribbble.com
198	dropbox.com,SINGLEPAGE
199	dropboxusercontent.com
200	dw.com
201	e-recht24.de
202	ea.com
203	ebay.co.uk
204	ebay.com
205	economist.com
206	eff.org
207	ehow.com
208	elmundo.es
209	elpais.com
210	engadget.com
211	entrepreneur.com
212	eonline.com
213	espn.com
214	espn.go.com
215	etsy.com
216	europa.eu
217	eventbrite.com
218	example.com
219	excite.co.jp
220	express.co.uk
221	facebook.com
222	fandom.com
223	fastcompany.com
224	fb.com
225	fb.me
226	fda.gov
227	fedoraproject.org
228	feedburner.com
229	fifa.com
230	files.wordpress.com
231	flickr.com
232	forbes.com
233	fortune.com
234	foursquare.com
235	foxnews.com
236	ft.com
237	ftc.gov
238	gen.xyz
239	geocities.jp
240	gesetze-im-internet.de
241	ggpht.com
242	github.com
243	gizmodo.com
244	globo.com
245	gmail.com
246	gnu.org
247	godaddy.com
248	gofundme.com
249	goo.gl
250	goo.ne.jp
251	goodreads.com
252	google.ca
253	google.co.id
254	google.co.in
255	google.co.jp
256	google.co.uk
257	google.com
258	google.com.br
259	google.com.hk
260	google.com.tr
261	google.de
262	google.es
263	google.fr
264	google.it
265	google.nl
266	google.pl
267	google.ru
268	googleapis.com
269	googleblog.com
270	googleusercontent.com
271	gooyaabitemplates.com
272	gov.uk
273	gravatar.com
274	greenpeace.org
275	gstatic.com
276	guardian.co.uk
277	harvard.edu
278	hatena.ne.jp
279	histats.com
280	hm.com
281	hollywoodreporter.com
282	home.pl
283	house.gov
284	howstuffworks.com
285	hp.com
286	huffingtonpost.com
287	huffpost.com
288	hugedomains.com
289	ibm.com
290	ibtimes.com
291	icann.org
292	ieee.org
293	ietf.org
294	ig.com.br
295	ign.com
296	ikea.com
297	imageshack.us
298	imdb.com
299	imgur.com
300	inc.com
301	independent.co.uk
302	indiatimes.com
303	indiegogo.com
304	instagram.com
305	instructables.com
306	intel.com
307	interia.pl
308	issuu.com
309	istockphoto.com
310	iubenda.com
311	jd.com
312	joomla.org
313	jquery.com
314	jstor.org
315	kickstarter.com
316	kinja.com
317	last.fm
318	latimes.com
319	lefigaro.fr
320	lemonde.fr
321	line.me
322	linkedin.com
323	list-manage.com
324	live.com
325	livejournal.com
326	livescience.com
327	loc.gov
328	lonelyplanet.com
329	lycos.com
330	m.wikipedia.org,mi.m.wikipedia.org
331	mail.ru
332	marketwatch.com
333	marriott.com
334	mashable.com
335	mediafire.com
336	medium.com
337	mega.nz
338	megaupload.com
339	mercurynews.com
340	merriam-webster.com
341	metro.co.uk
342	microsoft.com,microsoft.com/mi-nz/
343	microsoftonline.com
344	mirror.co.uk
345	mit.edu
346	mixcloud.com
347	mlb.com
348	mozilla.com
349	mozilla.org
350	msn.com
351	myspace.com
352	mysql.com
353	namecheap.com
354	narod.ru
355	nasa.gov
356	nationalgeographic.com
357	nature.com
358	naver.com
359	naver.jp
360	nba.com
361	nbcnews.com
362	ndtv.com
363	netflix.com
364	netsons.com
365	netvibes.com
366	networkadvertising.org
367	news.com.au
368	newscientist.com
369	newsweek.com
370	newyorker.com
371	nginx.com
372	nginx.org
373	nhk.or.jp
374	nicovideo.jp
375	nifty.com
376	nih.gov
377	nikkei.com
378	noaa.gov
379	nokia.com
380	npr.org
381	nvidia.com
382	nydailynews.com
383	nypost.com
384	nytimes.com
385	nyu.edu
386	odnoklassniki.ru
387	office.com
388	offset.com
389	ok.ru
390	okezone.com
391	opera.com
392	oracle.com
393	orange.fr
394	oreilly.com
395	oup.com
396	over-blog.com
397	ovh.co.uk
398	ovh.com
399	ovh.net
400	ox.ac.uk
401	parallels.com
402	pastebin.com
403	paypal.com
404	pbs.org
405	pcmag.com
406	people.com
407	photobucket.com
408	php.net
409	pinterest.com,SINGLEPAGE
410	pixabay.com
411	playstation.com
412	plesk.com
413	plos.org
414	politico.com
415	prestashop.com
416	prezi.com
417	princeton.edu
418	privacyshield.gov
419	prnewswire.com
420	psychologytoday.com
421	qq.com
422	quantcast.com
423	quora.com
424	rakuten.co.jp
425	rambler.ru
426	rapidshare.com
427	reddit.com
428	repubblica.it
429	researchgate.net
430	reuters.com
431	ria.ru
432	rottentomatoes.com
433	rt.com
434	rtve.es
435	sakura.ne.jp
436	samsung.com
437	sapo.pt
438	scholastic.com
439	sciencedaily.com
440	sciencedirect.com
441	sciencemag.org
442	scientificamerican.com
443	scribd.com
444	seattletimes.com
445	secureserver.net
446	sedo.com
447	seesaa.net
448	sendspace.com
449	sfgate.com
450	shopify.com
451	shutterstock.com
452	siemens.com
453	sina.com.cn
454	sky.com
455	skype.com
456	skyrock.com
457	slate.com
458	slideshare.net
459	sm.cn
460	smh.com.au
461	so-net.ne.jp
462	softonic.com
463	sogou.com
464	sohu.com
465	soratemplates.com
466	soso.com
467	soundcloud.com
468	spiegel.de
469	spotify.com
470	springer.com
471	sputniknews.com
472	ssl-images-amazon.com
473	stackoverflow.com
474	standard.co.uk
475	stanford.edu
476	state.gov
477	steamcommunity.com
478	steampowered.com
479	storage.canalblog.com
480	storage.googleapis.com
481	stores.jp
482	storify.com
483	stuff.co.nz,SINGLEPAGE
484	surveymonkey.com
485	symantec.com
486	t-online.de
487	t.co
488	t.me
489	tabelog.com
490	taobao.com
491	target.com
492	teamviewer.com
493	techcrunch.com
494	ted.com
495	telegram.me
496	telegraph.co.uk
497	terra.com.br
498	theatlantic.com
499	thefreedictionary.com
500	theglobeandmail.com
501	theguardian.com
502	themeforest.net
503	thenextweb.com
504	thestar.com
505	thesun.co.uk
506	thetimes.co.uk
507	theverge.com
508	thoughtco.com
509	tianya.cn
510	time.com
511	tinyurl.com
512	tmall.com
513	tmz.com
514	tribunnews.com
515	tripadvisor.com
516	trustpilot.com
517	twitch.tv
518	twitter.com
519	ucoz.ru
520	uiuc.edu
521	umich.edu
522	un.org
523	undeveloped.com
524	unesco.org
525	uol.com.br
526	urbandictionary.com
527	usa.gov
528	usatoday.com
529	usgs.gov
530	usnews.com
531	uspto.gov
532	ustream.tv
533	utexas.edu
534	variety.com
535	venturebeat.com
536	vice.com
537	viglink.com
538	vimeo.com
539	vk.com
540	vkontakte.ru
541	vox.com
542	w3.org
543	w3schools.com
544	wa.me
545	walmart.com
546	washington.edu
547	washingtonpost.com
548	wattpad.com
549	weather.com
550	web.fc2.com
551	webmd.com
552	weebly.com
553	weibo.com
554	welt.de
555	whatsapp.com
556	whitehouse.gov
557	who.int
558	wikia.com
559	wikihow.com
560	wikimedia.org
561	wikipedia.org,mi.wikipedia.org
562	wiktionary.org,mi.wiktionary.org
563	wiley.com
564	windowsphone.com
565	wired.com
566	wix.com
567	wordpress.org,SUBDOMAIN-COPY
568	worldbank.org
569	wp.com
570	wsj.com
571	xbox.com
572	xinhuanet.com
573	yadi.sk
574	yahoo.co.jp
575	yahoo.com
576	yale.edu
577	yandex.ru
578	yelp.com
579	youku.com
580	youronlinechoices.com
581	youtu.be
582	youtube.com
583	ytimg.com
584	zdnet.com
585	zend.com
586	zendesk.com
587	zippyshare.com
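
The <value> rules described in the file's FORMAT comments can be sketched as a small decision function. This is a minimal illustration in Python, not the CCWETProcessor implementation. The names host_of, find_topsite and decide are hypothetical, and two behaviours are assumptions made for the example: "contains" is read as "the seed URL's host equals the topsite-base-url or is a subdomain of it", and the most specific (longest) matching entry wins. Escaping the result for use in a regex urlfilter is left out.

```python
# Hypothetical sketch of the <value> rules; not the CCWETProcessor code.

def host_of(seed_url):
    """Drop any protocol and path, returning the lower-cased host."""
    without_protocol = seed_url.split("://", 1)[-1]
    return without_protocol.split("/", 1)[0].lower()


def find_topsite(host, topsites):
    """Return the most specific topsite-base-url covering host, or None.

    Assumption: an entry matches when host equals it or is a subdomain of it,
    and longer (more specific) entries are tried first.
    """
    for base in sorted(topsites, key=len, reverse=True):
        if host == base or host.endswith("." + base):
            return base
    return None


def decide(seed_url, topsites):
    """Return (action, urlfilter) for one seed URL."""
    host = host_of(seed_url)
    base = find_topsite(host, topsites)
    if base is None:
        return ("not-a-topsite", None)
    value = topsites[base]
    if value == "":
        # Empty value: goes to unprocessed-topsite-matches.txt, not crawled.
        return ("unprocessed", None)
    if value == "SINGLEPAGE":
        # Restrict the crawl to the seed URL itself (protocol removed).
        return ("filter", seed_url.split("://", 1)[-1])
    if value == "SUBDOMAIN-COPY":
        if host == base:
            # Exact match on the top site: too big, leave for inspection.
            return ("unprocessed", None)
        return ("filter", host)
    if value == "FOLLOW-LINKS-WITHIN-TOPSITE":
        return ("filter", base)
    # Anything else is taken to be a <url-form-without-protocol>.
    return ("filter", value)


# A few entries copied from the listing above.
topsites = {
    "docs.google.com": "FOLLOW-LINKS-WITHIN-TOPSITE",
    "drive.google.com": "SINGLEPAGE",
    "blogspot.com": "SUBDOMAIN-COPY",
    "wikipedia.org": "mi.wikipedia.org",
    "korora.econ.yale.edu": "SINGLEPAGE",
    "yale.edu": "",
}

for url in ("http://pinky.blogspot.com", "https://mi.wikipedia.org/SomePage"):
    print(url, "->", decide(url, topsites))
```

Trying longer entries first is what lets korora.econ.yale.edu be handled as SINGLEPAGE while every other yale.edu seed URL falls through to the empty value. Whether the real processor resolves overlapping entries this way is not stated in the file.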

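Entries in this file follow the <topsite-base-url>,<value> layout described under FORMAT OF THIS FILE'S CONTENTS, with "#" comment lines and blank lines in between. A minimal Python sketch of reading such lines into a dictionary follows. The name parse_topsites is hypothetical, and treating an entry without a comma as having an empty value is an assumption based on the bare entries in the listing. The line numbers shown by the repository browser are not part of the file and are not handled here.

```python
# Hypothetical reader for the <topsite-base-url>,<value> format.

def parse_topsites(lines):
    """Map each topsite-base-url to its value ('' when the value is empty)."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first comma only; entries without one get value ''.
        base_url, _, value = line.partition(",")
        mapping[base_url.strip()] = value.strip()
    return mapping


sample = [
    "# TOP SITES",
    "",
    "docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE",
    "blogspot.com,SUBDOMAIN-COPY",
    "microsoft.com,microsoft.com/mi-nz/",
    "yale.edu",
]
topsites = parse_topsites(sample)
print(topsites)
```

To read the real file, the same function could be given an open file object, for example `parse_topsites(open("sites-too-big-to-exhaustively-crawl.txt", encoding="utf-8"))`.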