source: gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt@ 33562

Last change on this file since 33562 was 33562, checked in by ak19, 5 years ago
1. The sites-too-big-to-exhaustively-crawl.txt is now a csv file of a semi-custom format, and the Java code now uses the Apache Commons CSV jar file (v1.7 for Java 8) to parse the contents thereof. 2. Tidied up code to reuse reference to ClassLoader.
File size: 10.5 KB
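
The change note above says the project's Java code parses this file with Apache Commons CSV. As a rough illustration of the `<topsite-base-url>,<value>` format only, the sketch below uses plain string handling instead of Commons CSV; it is not the project's parser. It skips blank lines and `#` comment lines and splits each remaining line at the first comma. The class name `TopSiteListParser` and the method `parse` are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the repository: a stdlib-only stand-in
// for the Apache Commons CSV based parsing mentioned in the change note.
final class TopSiteListParser {

    /**
     * Maps each topsite-base-url to its value ("" when no value is given).
     * Insertion order is kept so entries stay in file order.
     */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> topSites = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank line or comment
            }
            int comma = line.indexOf(',');
            String baseUrl = comma < 0 ? line : line.substring(0, comma).trim();
            String value = comma < 0 ? "" : line.substring(comma + 1).trim();
            topSites.put(baseUrl, value);
        }
        return topSites;
    }

    public static void main(String[] args) {
        Map<String, String> topSites = parse(Arrays.asList(
                "# Just crawl a single page for these:",
                "drive.google.com,SINGLEPAGE",
                "",
                "000webhost.com",
                "blogspot.com,SUBDOMAIN-COPY",
                "wikipedia.org,mi.wikipedia.org"));
        topSites.forEach((baseUrl, value) ->
                System.out.println(baseUrl + " -> [" + value + "]"));
    }
}
```

Splitting at the first comma only is an assumption that keeps values such as microsoft.com/mi-nz/ intact; it would not handle quoted fields, which a real CSV library does.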

# Mapping of top sites, in base url form, to a value

# This file contains sites that are too large to crawl exhaustively.
# The domains are from Alexa top sites (where only the first 50 were visible).
# Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
# Finally also added https://moz.com/top500 by downloading its CSV file and
# adding its URLs to the existing listing here from alexa/wiki.
# Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
# just <site>.ext
# And finally, re-sorted the reduced list alphabetically and pasted it into here.

# FORMAT OF THIS FILE'S CONTENTS:
#    <topsite-base-url>,<value>
# where <value> is one of
#    empty, SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE, SINGLEPAGE, <url-form-without-protocol>
#
# - if value is left empty: if seedurl contains topsite-base-url, the seedurl will go into the
#   file unprocessed-topsite-matches.txt and the site/page won't be crawled.
#   The user will be notified to inspect the file unprocessed-topsite-matches.txt.
# - SINGLEPAGE: if seedurl matches topsite-base-url, then only download the page at that seedurl.
#   For example, if the seedurl is http://drive.google.com/some-long-suffix-in-base64, then it
#   matches the topsite-base-url of drive.google.com and its value of SINGLEPAGE will add the
#   seedurl itself as the regex url-filter, to restrict the crawl to just the specified page.
# - SUBDOMAIN-COPY: if seedurl CONTAINS topsite-base-url, then the seedurl's subdomain (or else
#   its domain) will make up the urlfilter, so we don't leak out into a larger domain.
#   Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if seedurl is
#   pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
#   will ensure we restrict crawling to pages on pinky.blogspot.com.
#   However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
#   into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
# - FOLLOW-LINKS-WITHIN-TOPSITE: pages linked from the seedurl page can be followed and
#   downloaded, as long as they are within the same subdomain matching the topsite-base-url.
#   This is different from SUBDOMAIN-COPY, as that can download all of a specific subdomain but
#   restricts against downloading the entire domain (e.g. all pinky.blogspot.com and not anything
#   else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (at
#   depth specified for the nutch crawl) as long as they're within the topsite-base-url.
#   e.g. seedurls on docs.google.com containing links will have those linked pages and any
#   they link to etc. downloaded as long as they're on docs.google.com.
# - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
#   url-form-without-protocol will make up the urlfilter, again preventing leaking into a
#   larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
#   match the topsite-base-url of wikipedia.org for which the <url-form-without-protocol>
#   value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
#   <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
#   crawl to just mi.wikipedia.org.
#   Remember to leave out any protocol from <url-form-without-protocol>.
#
# TODO If useful:
#   column 3: whether nutch should do a fetch-all or not
#   column 4: number of crawl iterations

# docs.google.com is a special case: not all pages are public and any interlinking is likely to
# be intentional. Grab all linked pages, for the link depth set with nutch's crawl, as long as the
# links are within the given topsite-base-url.
docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE

# Just crawl a single page for these:
drive.google.com,SINGLEPAGE
forms.office.com,SINGLEPAGE
player.vimeo.com,SINGLEPAGE
static-promote.weebly.com,SINGLEPAGE

# Special case of yale.edu: its Rapa-Nui pages are on the blacklist, but we want this page and
# its photos. The page's containing folder is whitelisted in case the photos are there.
korora.econ.yale.edu,SINGLEPAGE

000webhost.com
360.cn
4shared.com
a8.net
abc.es
abc.net.au
abcnews.go.com
about.com
about.me
aboutads.info
abril.com.br
academia.edu
accuweather.com
addthis.com
addtoany.com
adobe.com
adweek.com
airbnb.com
akamaihd.net
alexa.com
alibaba.com
aliexpress.com
alipay.com
aljazeera.com
allaboutcookies.org
allrecipes.com
amazon.ca
amazon.co.jp
amazon.co.uk
amazon.com
amazon.de
amazon.es
amazon.fr
amazon.in
ameblo.jp
ampproject.org
android.com
aol.com
ap.org
apache.org
apachefriends.org
apple.com
archive.org
archives.gov
arstechnica.com
arxiv.org
asahi.com
ask.fm
asus.com
axs.com
babytree.com
baidu.com
bandcamp.com
bbc.co.uk
bbc.com
behance.net
berkeley.edu
biblegateway.com
biglobe.ne.jp
billboard.com
bing.com
bit.ly
bitly.com
blackberry.com
blogger.com
blogspot.com,SUBDOMAIN-COPY
bloomberg.com
booking.com
boston.com
box.com
britannica.com
bt.com
bund.de
businessinsider.com
businesswire.com
buydomains.com
buzzfeed.com
ca.gov
cambridge.org
canalblog.com
cbc.ca
cbslocal.com
cbsnews.com
cdc.gov
change.org
channel4.com
chicagotribune.com
chinadaily.com.cn
cisco.com
clickbank.net
cloudflare.com
cmu.edu
cnbc.com
cnet.com
cnn.com
cocolog-nifty.com
columbia.edu
connect.over-blog.com
cornell.edu
corriere.it
cpanel.com
cpanel.net
creativecommons.org
csdn.net
csmonitor.com
dailymail.co.uk
dailymotion.com
dan.com
daum.net
debian.org
dell.com
depositfiles.com
detik.com
digg.com
discovery.com
disney.com
disney.go.com
disqus.com
doubleclick.net
dreniq.com
dribbble.com
dropbox.com,SINGLEPAGE
dropboxusercontent.com
dw.com
e-recht24.de
ea.com
ebay.co.uk
ebay.com
economist.com
eff.org
ehow.com
elmundo.es
elpais.com
engadget.com
entrepreneur.com
eonline.com
espn.com
espn.go.com
etsy.com
europa.eu
eventbrite.com
example.com
excite.co.jp
express.co.uk
facebook.com
fandom.com
fastcompany.com
fb.com
fb.me
fda.gov
fedoraproject.org
feedburner.com
fifa.com
files.wordpress.com
flickr.com
forbes.com
fortune.com
foursquare.com
foxnews.com
ft.com
ftc.gov
gen.xyz
geocities.jp
gesetze-im-internet.de
ggpht.com
github.com
gizmodo.com
globo.com
gmail.com
gnu.org
godaddy.com
gofundme.com
goo.gl
goo.ne.jp
goodreads.com
google.ca
google.co.id
google.co.in
google.co.jp
google.co.uk
google.com
google.com.br
google.com.hk
google.com.tr
google.de
google.es
google.fr
google.it
google.nl
google.pl
google.ru
googleapis.com
googleblog.com
googleusercontent.com
gooyaabitemplates.com
gov.uk
gravatar.com
greenpeace.org
gstatic.com
guardian.co.uk
harvard.edu
hatena.ne.jp
histats.com
hm.com
hollywoodreporter.com
home.pl
house.gov
howstuffworks.com
hp.com
huffingtonpost.com
huffpost.com
hugedomains.com
ibm.com
ibtimes.com
icann.org
ieee.org
ietf.org
ig.com.br
ign.com
ikea.com
imageshack.us
imdb.com
imgur.com
inc.com
independent.co.uk
indiatimes.com
indiegogo.com
instagram.com
instructables.com
intel.com
interia.pl
issuu.com
istockphoto.com
iubenda.com
jd.com
joomla.org
jquery.com
jstor.org
kickstarter.com
kinja.com
last.fm
latimes.com
lefigaro.fr
lemonde.fr
line.me
linkedin.com
list-manage.com
live.com
livejournal.com
livescience.com
loc.gov
lonelyplanet.com
lycos.com
m.wikipedia.org,mi.m.wikipedia.org
mail.ru
marketwatch.com
marriott.com
mashable.com
mediafire.com
medium.com
mega.nz
megaupload.com
mercurynews.com
merriam-webster.com
metro.co.uk
microsoft.com,microsoft.com/mi-nz/
microsoftonline.com
mirror.co.uk
mit.edu
mixcloud.com
mlb.com
mozilla.com
mozilla.org
msn.com
myspace.com
mysql.com
namecheap.com
narod.ru
nasa.gov
nationalgeographic.com
nature.com
naver.com
naver.jp
nba.com
nbcnews.com
ndtv.com
netflix.com
netsons.com
netvibes.com
networkadvertising.org
news.com.au
newscientist.com
newsweek.com
newyorker.com
nginx.com
nginx.org
nhk.or.jp
nicovideo.jp
nifty.com
nih.gov
nikkei.com
noaa.gov
nokia.com
npr.org
nvidia.com
nydailynews.com
nypost.com
nytimes.com
nyu.edu
odnoklassniki.ru
office.com
offset.com
ok.ru
okezone.com
opera.com
oracle.com
orange.fr
oreilly.com
oup.com
over-blog.com
ovh.co.uk
ovh.com
ovh.net
ox.ac.uk
parallels.com
pastebin.com
paypal.com
pbs.org
pcmag.com
people.com
photobucket.com
php.net
pinterest.com,SINGLEPAGE
pixabay.com
playstation.com
plesk.com
plos.org
politico.com
prestashop.com
prezi.com
princeton.edu
privacyshield.gov
prnewswire.com
psychologytoday.com
qq.com
quantcast.com
quora.com
rakuten.co.jp
rambler.ru
rapidshare.com
reddit.com
repubblica.it
researchgate.net
reuters.com
ria.ru
rottentomatoes.com
rt.com
rtve.es
sakura.ne.jp
samsung.com
sapo.pt
scholastic.com
sciencedaily.com
sciencedirect.com
sciencemag.org
scientificamerican.com
scribd.com
seattletimes.com
secureserver.net
sedo.com
seesaa.net
sendspace.com
sfgate.com
shopify.com
shutterstock.com
siemens.com
sina.com.cn
sky.com
skype.com
skyrock.com
slate.com
slideshare.net
sm.cn
smh.com.au
so-net.ne.jp
softonic.com
sogou.com
sohu.com
soratemplates.com
soso.com
soundcloud.com
spiegel.de
spotify.com
springer.com
sputniknews.com
ssl-images-amazon.com
stackoverflow.com
standard.co.uk
stanford.edu
state.gov
steamcommunity.com
steampowered.com
storage.canalblog.com
storage.googleapis.com
stores.jp
storify.com
stuff.co.nz,SINGLEPAGE
surveymonkey.com
symantec.com
t-online.de
t.co
t.me
tabelog.com
taobao.com
target.com
teamviewer.com
techcrunch.com
ted.com
telegram.me
telegraph.co.uk
terra.com.br
theatlantic.com
thefreedictionary.com
theglobeandmail.com
theguardian.com
themeforest.net
thenextweb.com
thestar.com
thesun.co.uk
thetimes.co.uk
theverge.com
thoughtco.com
tianya.cn
time.com
tinyurl.com
tmall.com
tmz.com
tribunnews.com
tripadvisor.com
trustpilot.com
twitch.tv
twitter.com
ucoz.ru
uiuc.edu
umich.edu
un.org
undeveloped.com
unesco.org
uol.com.br
urbandictionary.com
usa.gov
usatoday.com
usgs.gov
usnews.com
uspto.gov
ustream.tv
utexas.edu
variety.com
venturebeat.com
vice.com
viglink.com
vimeo.com
vk.com
vkontakte.ru
vox.com
w3.org
w3schools.com
wa.me
walmart.com
washington.edu
washingtonpost.com
wattpad.com
weather.com
web.fc2.com
webmd.com
weebly.com
weibo.com
welt.de
whatsapp.com
whitehouse.gov
who.int
wikia.com
wikihow.com
wikimedia.org
wikipedia.org,mi.wikipedia.org
wiktionary.org,mi.wiktionary.org
wiley.com
windowsphone.com
wired.com
wix.com
wordpress.org,SUBDOMAIN-COPY
worldbank.org
wp.com
wsj.com
xbox.com
xinhuanet.com
yadi.sk
yahoo.co.jp
yahoo.com
yale.edu
yandex.ru
yelp.com
youku.com
youronlinechoices.com
youtu.be
youtube.com
ytimg.com
zdnet.com
zend.com
zendesk.com
zippyshare.com
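
The `<value>` rules described in the file's header comments can be sketched roughly as below. This is an illustration under stated assumptions, not the project's implementation: the class `TopSiteScope` and its methods are hypothetical, the caller is assumed to have already found the matching topsite-base-url, and an empty result stands for "send the seedurl to unprocessed-topsite-matches.txt". How the returned scope is written into the regex urlfilter file is not shown.

```java
import java.util.Optional;

// Hypothetical sketch of how a seedurl and its topsite <value> could be
// turned into a crawl scope. Names and return conventions are assumptions.
final class TopSiteScope {

    /** Removes a leading "scheme://" if present. */
    static String stripProtocol(String url) {
        return url.replaceFirst("^[a-zA-Z][a-zA-Z0-9+.-]*://", "");
    }

    /** Host part of a seedurl: the text between the protocol and the first '/'. */
    static String hostOf(String seedUrl) {
        String rest = stripProtocol(seedUrl);
        int slash = rest.indexOf('/');
        return slash < 0 ? rest : rest.substring(0, slash);
    }

    /**
     * Returns the scope the crawl should be restricted to, or empty when the
     * seedurl should not be crawled and needs manual inspection instead.
     */
    static Optional<String> scopeFor(String seedUrl, String baseUrl, String value) {
        switch (value) {
            case "":
                return Optional.empty();            // no value: do not crawl
            case "SINGLEPAGE":
                return Optional.of(seedUrl);        // only the page at the seedurl
            case "SUBDOMAIN-COPY": {
                String host = hostOf(seedUrl);
                // An exact match on the topsite itself is left uncrawled.
                return host.equals(baseUrl) ? Optional.empty() : Optional.of(host);
            }
            case "FOLLOW-LINKS-WITHIN-TOPSITE":
                return Optional.of(baseUrl);        // anything within the topsite-base-url
            default:
                return Optional.of(value);          // <url-form-without-protocol>
        }
    }

    public static void main(String[] args) {
        System.out.println(scopeFor("http://pinky.blogspot.com/2020/post.html",
                "blogspot.com", "SUBDOMAIN-COPY"));
        System.out.println(scopeFor("https://mi.wikipedia.org/SomePage",
                "wikipedia.org", "mi.wikipedia.org"));
        System.out.println(scopeFor("https://www.amazon.com/some-item",
                "amazon.com", ""));
    }
}
```

The sketch leaves the lookup step out on purpose. The comments say a seedurl matches when it contains a topsite-base-url, and a plain substring test would need care: for instance pinterest.com contains the text t.co, and docs.google.com also contains google.com, so match order and host boundaries both matter.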
