sites-too-big-to-exhaustively-crawl.txt@ 33622

Last change on this file since 33622 was 33604, checked in by ak19, 5 years ago
1. Better output into possible-product-sites.txt, including the overseas country code prefix to help decide whether the site is worth keeping or not. 2. Updated whitelisting and top-sites filters to grab the /mi/ subsections of sites that don't appear to be autotranslated. This is done in preparation for blocking out product sites hereafter.
File size: 11.0 KB

# Mapping of top sites, in base URL form, to a value

# This file contains sites that are too large to crawl exhaustively.
# The domains are from Alexa top sites (where only the first 50 were visible).
# Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
# Finally, also added https://moz.com/top500 by downloading its CSV file and
# adding its URLs to the existing listing here from alexa/wiki.
# Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
# just <site>.ext
# And finally, re-sorted the reduced list alphabetically and pasted into here.

# FORMAT OF THIS FILE'S CONTENTS:
#   <topsite-base-url>,<value>
# where <value> is one of:
#   empty, SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE, SINGLEPAGE, <url-form-without-protocol>
#
# - if value is left empty: if seedurl contains topsite-base-url, the seedurl will go into the
#   file unprocessed-topsite-matches.txt and the site/page won't be crawled.
#   The user will be notified to inspect the file unprocessed-topsite-matches.txt.
# - SINGLEPAGE: if seedurl matches topsite-base-url, then only download the page at that seedurl.
#   For example, if the seedurl is http://docs.google.com/some-long-suffix-in-base64, then it
#   matches the topsite-base-url of docs.google.com and its value of SINGLEPAGE will add the
#   seedurl itself as the regex url-filter, to restrict the crawl to just the specified page.
# - SUBDOMAIN-COPY: if seedurl CONTAINS topsite-base-url, then the seedurl's subdomain (or,
#   failing that, its domain) will make up the urlfilter, so we don't leak out into a larger domain.
#   Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if seedurl is
#   pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
#   will ensure we restrict crawling to pages on pinky.blogspot.com.
#   However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
#   into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
# - FOLLOW-LINKS-WITHIN-TOPSITE: pages linked from the seedURL page can be followed and
#   downloaded, as long as they're within the same subdomain matching the topsite-base-url.
#   This is different from SUBDOMAIN-COPY, as that can download all of a specific subdomain but
#   restricts against downloading the entire domain (e.g. all pinky.blogspot.com and not anything
#   else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (at
#   depth specified for the nutch crawl) as long as they're within the topsite-base-url.
#   e.g. seedURLs on docs.google.com containing links will have those linked pages and any
#   they link to etc. downloaded as long as they're on docs.google.com.
# - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
#   url-form-without-protocol will make up the urlfilter, again preventing leaking into a
#   larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
#   match the topsite-base-url of wikipedia.org for which the <url-form-without-protocol>
#   value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
#   <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
#   crawl to just mi.wikipedia.org.
#   Remember to leave out any protocol from <url-form-without-protocol>.
#
# TODO If useful:
#   column 3: whether nutch should fetch all or not
#   column 4: number of crawl iterations


# NOT TOP SITES, BUT SITES WE INSPECTED AND WANT TO CONTROL SIMILARLY TO TOP SITES
00.gs,SINGLEPAGE
# May be a large site with only seedURLs of real relevance
topographic-map.com,SINGLEPAGE
ami-media.net,SINGLEPAGE
# 2 pages of declarations of human rights in Maori, rest in other languages
anitra.net,SINGLEPAGE
# special case
mi.centr-zashity.ru,SINGLEPAGE

martinvrijland.nl,martinvrijland.nl/mi/
csunplugged.org,csunplugged.org/mi/
gpedia.com,gpedia.com/mi/

# TOP SITE BUT NOT TOP 500
www.tumblr.com,SINGLEPAGE


# TOP SITES

# docs.google.com is a special case: not all pages are public and any interlinking is likely to
# be intentional. Grab all linked pages, for link depth set with nutch's crawl, as long as the
# links are within the given topsite-base-url
docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE

# Just crawl a single page for these:
drive.google.com,SINGLEPAGE
forms.office.com,SINGLEPAGE
player.vimeo.com,SINGLEPAGE
static-promote.weebly.com,SINGLEPAGE

# Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
# The page's containing folder is whitelisted in case the photos are there.
korora.econ.yale.edu,SINGLEPAGE


000webhost.com
360.cn
4shared.com
a8.net
abc.es
abc.net.au
abcnews.go.com
about.com
about.me
aboutads.info
abril.com.br
academia.edu
accuweather.com
addthis.com
addtoany.com
adobe.com
adweek.com
airbnb.com
akamaihd.net
alexa.com
alibaba.com
aliexpress.com
alipay.com
aljazeera.com
allaboutcookies.org
allrecipes.com
amazon.ca
amazon.co.jp
amazon.co.uk
amazon.com
amazon.de
amazon.es
amazon.fr
amazon.in
ameblo.jp
ampproject.org
android.com
aol.com
ap.org
apache.org
apachefriends.org
apple.com
archive.org
archives.gov
arstechnica.com
arxiv.org
asahi.com
ask.fm
asus.com
axs.com
babytree.com
baidu.com
bandcamp.com
bbc.co.uk
bbc.com
behance.net
berkeley.edu
biblegateway.com
biglobe.ne.jp
billboard.com
bing.com
bit.ly
bitly.com
blackberry.com
blogger.com
blogspot.com,SUBDOMAIN-COPY
bloomberg.com
booking.com
boston.com
box.com
britannica.com
bt.com
bund.de
businessinsider.com
businesswire.com
buydomains.com
buzzfeed.com
ca.gov
cambridge.org
canalblog.com
cbc.ca
cbslocal.com
cbsnews.com
cdc.gov
change.org
channel4.com
chicagotribune.com
chinadaily.com.cn
cisco.com
clickbank.net
cloudflare.com
cmu.edu
cnbc.com
cnet.com
cnn.com
cocolog-nifty.com
columbia.edu
connect.over-blog.com
cornell.edu
corriere.it
cpanel.com
cpanel.net
creativecommons.org
csdn.net
csmonitor.com
dailymail.co.uk
dailymotion.com
dan.com
daum.net
debian.org
dell.com
depositfiles.com
detik.com
digg.com
discovery.com
disney.com
disney.go.com
disqus.com
doubleclick.net
dreniq.com
dribbble.com
dropbox.com,SINGLEPAGE
dropboxusercontent.com
dw.com
e-recht24.de
ea.com
ebay.co.uk
ebay.com
economist.com
eff.org
ehow.com
elmundo.es
elpais.com
engadget.com
entrepreneur.com
eonline.com
espn.com
espn.go.com
etsy.com
europa.eu
eventbrite.com
example.com
excite.co.jp
express.co.uk
facebook.com
fandom.com
fastcompany.com
fb.com
fb.me
fda.gov
fedoraproject.org
feedburner.com
fifa.com
files.wordpress.com
flickr.com
forbes.com
fortune.com
foursquare.com
foxnews.com
ft.com
ftc.gov
gen.xyz
geocities.jp
gesetze-im-internet.de
ggpht.com
github.com
gizmodo.com
globo.com
gmail.com
gnu.org
godaddy.com
gofundme.com
goo.gl
goo.ne.jp
goodreads.com
google.ca
google.co.id
google.co.in
google.co.jp
google.co.uk
google.com
google.com.br
google.com.hk
google.com.tr
google.de
google.es
google.fr
google.it
google.nl
google.pl
google.ru
googleapis.com
googleblog.com
googleusercontent.com
gooyaabitemplates.com
gov.uk
gravatar.com
greenpeace.org
gstatic.com
guardian.co.uk
harvard.edu
hatena.ne.jp
histats.com
hm.com
hollywoodreporter.com
home.pl
house.gov
howstuffworks.com
hp.com
huffingtonpost.com
huffpost.com
hugedomains.com
ibm.com
ibtimes.com
icann.org
ieee.org
ietf.org
ig.com.br
ign.com
ikea.com
imageshack.us
imdb.com
imgur.com
inc.com
independent.co.uk
indiatimes.com
indiegogo.com
instagram.com
instructables.com
intel.com
interia.pl
issuu.com
istockphoto.com
iubenda.com
jd.com
joomla.org
jquery.com
jstor.org
kickstarter.com
kinja.com
last.fm
latimes.com
lefigaro.fr
lemonde.fr
line.me
linkedin.com
list-manage.com
live.com
livejournal.com
livescience.com
loc.gov
lonelyplanet.com
lycos.com
m.wikipedia.org,mi.m.wikipedia.org
mail.ru
marketwatch.com
marriott.com
mashable.com
mediafire.com
medium.com
mega.nz
megaupload.com
mercurynews.com
merriam-webster.com
metro.co.uk
microsoft.com,microsoft.com/mi-nz/
microsoftonline.com
mirror.co.uk
mit.edu
mixcloud.com
mlb.com
mozilla.com
mozilla.org
msn.com
myspace.com
mysql.com
namecheap.com
narod.ru
nasa.gov
nationalgeographic.com
nature.com
naver.com
naver.jp
nba.com
nbcnews.com
ndtv.com
netflix.com
netsons.com
netvibes.com
networkadvertising.org
news.com.au
newscientist.com
newsweek.com
newyorker.com
nginx.com
nginx.org
nhk.or.jp
nicovideo.jp
nifty.com
nih.gov
nikkei.com
noaa.gov
nokia.com
npr.org
nvidia.com
nydailynews.com
nypost.com
nytimes.com
nyu.edu
odnoklassniki.ru
office.com
offset.com
ok.ru
okezone.com
opera.com
oracle.com
orange.fr
oreilly.com
oup.com
over-blog.com
ovh.co.uk
ovh.com
ovh.net
ox.ac.uk
parallels.com
pastebin.com
paypal.com
pbs.org
pcmag.com
people.com
photobucket.com
php.net
pinterest.com,SINGLEPAGE
pixabay.com
playstation.com
plesk.com
plos.org
politico.com
prestashop.com
prezi.com
princeton.edu
privacyshield.gov
prnewswire.com
psychologytoday.com
qq.com
quantcast.com
quora.com
rakuten.co.jp
rambler.ru
rapidshare.com
reddit.com
repubblica.it
researchgate.net
reuters.com
ria.ru
rottentomatoes.com
rt.com
rtve.es
sakura.ne.jp
samsung.com
sapo.pt
scholastic.com
sciencedaily.com
sciencedirect.com
sciencemag.org
scientificamerican.com
scribd.com
seattletimes.com
secureserver.net
sedo.com
seesaa.net
sendspace.com
sfgate.com
shopify.com
shutterstock.com
siemens.com
sina.com.cn
sky.com
skype.com
skyrock.com
slate.com
slideshare.net
sm.cn
smh.com.au
so-net.ne.jp
softonic.com
sogou.com
sohu.com
soratemplates.com
soso.com
soundcloud.com
spiegel.de
spotify.com
springer.com
sputniknews.com
ssl-images-amazon.com
stackoverflow.com
standard.co.uk
stanford.edu
state.gov
steamcommunity.com
steampowered.com
storage.canalblog.com
storage.googleapis.com
stores.jp
storify.com
stuff.co.nz,SINGLEPAGE
surveymonkey.com
symantec.com
t-online.de
t.co
t.me
tabelog.com
taobao.com
target.com
teamviewer.com
techcrunch.com
ted.com
telegram.me
telegraph.co.uk
terra.com.br
theatlantic.com
thefreedictionary.com
theglobeandmail.com
theguardian.com
themeforest.net
thenextweb.com
thestar.com
thesun.co.uk
thetimes.co.uk
theverge.com
thoughtco.com
tianya.cn
time.com
tinyurl.com
tmall.com
tmz.com
tribunnews.com
tripadvisor.com
trustpilot.com
twitch.tv
twitter.com
ucoz.ru
uiuc.edu
umich.edu
un.org
undeveloped.com
unesco.org
uol.com.br
urbandictionary.com
usa.gov
usatoday.com
usgs.gov
usnews.com
uspto.gov
ustream.tv
utexas.edu
variety.com
venturebeat.com
vice.com
viglink.com
vimeo.com
vk.com
vkontakte.ru
vox.com
w3.org
w3schools.com
wa.me
walmart.com
washington.edu
washingtonpost.com
wattpad.com
weather.com
web.fc2.com
webmd.com
weebly.com
weibo.com
welt.de
whatsapp.com
whitehouse.gov
who.int
wikia.com
wikihow.com
wikimedia.org
wikipedia.org,mi.wikipedia.org
wiktionary.org,mi.wiktionary.org
wiley.com
windowsphone.com
wired.com
wix.com
wordpress.org,SUBDOMAIN-COPY
worldbank.org
wp.com
wsj.com
xbox.com
xinhuanet.com
yadi.sk
yahoo.co.jp
yahoo.com
yale.edu
yandex.ru
yelp.com
youku.com
youronlinechoices.com
youtu.be
youtube.com
ytimg.com
zdnet.com
zend.com
zendesk.com
zippyshare.com
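
The format rules described in the file's header comments can be sketched in Python as follows. This is only an illustration, not the project's actual crawler code: the function names `parse_topsites` and `resolve`, the returned action labels `UNPROCESSED` and `URL-FORM`, and the choice to prefer the longest matching base URL are all assumptions made for the example. The header says a seedurl "contains" the topsite-base-url; the sketch instead matches on a domain boundary of the host, which is a stricter assumption chosen so that a short entry such as t.co does not match pinky.blogspot.com.

```python
KEYWORDS = ("SINGLEPAGE", "SUBDOMAIN-COPY", "FOLLOW-LINKS-WITHIN-TOPSITE")


def parse_topsites(text):
    """Parse '<topsite-base-url>,<value>' lines; skip blank lines and '#' comments."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        base, _, value = line.partition(",")  # value is "" when there is no comma
        mapping[base.strip()] = value.strip()
    return mapping


def resolve(seed_url, mapping):
    """Return (action, url_filter) for a seed URL, or None if no top site matches.

    Hypothetical helper: the action labels and return shape are illustrative.
    """
    without_protocol = seed_url.split("://", 1)[-1]
    host = without_protocol.split("/", 1)[0].lower()
    # Assumption: try longer base URLs first, so docs.google.com wins over google.com.
    for base in sorted(mapping, key=len, reverse=True):
        if host != base and not host.endswith("." + base):
            continue
        value = mapping[base]
        if value == "":
            # Empty value: goes to unprocessed-topsite-matches.txt, not crawled.
            return ("UNPROCESSED", None)
        if value == "SINGLEPAGE":
            # The seed URL itself becomes the filter.
            return ("SINGLEPAGE", without_protocol)
        if value == "SUBDOMAIN-COPY":
            if host == base:
                # Exact match on the top site: also left unprocessed.
                return ("UNPROCESSED", None)
            return ("SUBDOMAIN-COPY", host)
        if value == "FOLLOW-LINKS-WITHIN-TOPSITE":
            return ("FOLLOW-LINKS-WITHIN-TOPSITE", base)
        # Anything else is treated as a <url-form-without-protocol>.
        return ("URL-FORM", value)
    return None


sample = """
# comment lines are ignored
blogspot.com,SUBDOMAIN-COPY
wikipedia.org,mi.wikipedia.org
docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE
dropbox.com,SINGLEPAGE
google.com
t.co
"""
topsites = parse_topsites(sample)
print(resolve("http://pinky.blogspot.com/post", topsites))
print(resolve("https://mi.wikipedia.org/SomePage", topsites))
```

In this sketch a seed URL on pinky.blogspot.com is restricted to that subdomain, while a bare blogspot.com seed is reported as unprocessed, mirroring the SUBDOMAIN-COPY description in the header.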


source: gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt@ 33622