source: gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt@ 33559

Last change on this file since 33559 was 33559, checked in by ak19, 5 years ago
1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge explained why it was more accurate to the behaviour. 2. Comments to explain how the sites-too-big-to-exhaustively-crawl.txt should be formatted, what values are expected and how they work. 3. Special blacklisting and whitelisting of urls on yale.edu, coupled with special treatment in topsites file too.
File size: 9.3 KB

1	# Mapping of top sites in base url forms to value
2
3	# This file contains sites that are too large to crawl exhaustively.
4	# The domains are from Alexa top sites (where only the first 50 were visible)
5	# Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
6	# Finally also added https://moz.com/top500 by downloading its CSV file and
7	# adding its URLs to the existing listing here from alexa/wiki.
8	# Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
9	# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
10	# just <site>.ext
11	# And finally, re-sorted the reduced list alphabetically and pasted into here.
12
13	# FORMAT OF THIS FILE'S CONTENTS:
14	# <topsite-base-url><tabspace><value>
15	# where <value> can be empty or one of SUBDOMAIN-COPY, SINGLEPAGE, <url-form-without-protocol>
16	#
17	# - if value is empty: if seedurl contains topsite-base-url, the seedurl will go into the file
18	# unprocessed-topsite-matches.txt and the site/page won't be crawled.
19	# The user will be notified to inspect the file unprocessed-topsite-matches.txt.
20	# - SINGLEPAGE: if seedurl matches topsite-base-url, then only download the page at that seedurl.
21	# For example, if the seedurl is http://docs.google.com/some-long-suffix-in-base64, then it
22	# matches the topsite-base-url of docs.google.com and its value of SINGLEPAGE will add the
23	# seedurl itself as the regex url-filter, to restrict the crawl to just the specified page.
24	# - SUBDOMAIN-COPY: if seedurl CONTAINS topsite-base-url, then the seedurl's own subdomain (or,
25	# if it has none, its domain) makes up the urlfilter, so we don't leak out into a larger domain.
26	# Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if seedurl is
27	# pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
28	# will ensure we restrict crawling to pages on pinky.blogspot.com.
29	# However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
30	# into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
31	# - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
32	# url-form-without-protocol will make up the urlfilter, again preventing leaking into a
33	# larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
34	# match the topsite-base-url of wikipedia.org for which the <url-form-without-protocol>
35	# value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
36	# <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
37	# crawl to just mi.wikipedia.org.
38	# Remember to leave out any protocol from <url-form-without-protocol>.
39
40
41
42	docs.google.com SINGLEPAGE
43	drive.google.com SINGLEPAGE
44	forms.office.com SINGLEPAGE
45	player.vimeo.com SINGLEPAGE
46	static-promote.weebly.com SINGLEPAGE
47
48	# Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
49	# The page's containing folder is whitelisted in case the photos are there.
50	korora.econ.yale.edu SINGLEPAGE
51
52	000webhost.com
53	360.cn
54	4shared.com
55	a8.net
56	abc.es
57	abc.net.au
58	abcnews.go.com
59	about.com
60	about.me
61	aboutads.info
62	abril.com.br
63	academia.edu
64	accuweather.com
65	addthis.com
66	addtoany.com
67	adobe.com
68	adweek.com
69	airbnb.com
70	akamaihd.net
71	alexa.com
72	alibaba.com
73	aliexpress.com
74	alipay.com
75	aljazeera.com
76	allaboutcookies.org
77	allrecipes.com
78	amazon.ca
79	amazon.co.jp
80	amazon.co.uk
81	amazon.com
82	amazon.de
83	amazon.es
84	amazon.fr
85	amazon.in
86	ameblo.jp
87	ampproject.org
88	android.com
89	aol.com
90	ap.org
91	apache.org
92	apachefriends.org
93	apple.com
94	archive.org
95	archives.gov
96	arstechnica.com
97	arxiv.org
98	asahi.com
99	ask.fm
100	asus.com
101	axs.com
102	babytree.com
103	baidu.com
104	bandcamp.com
105	bbc.co.uk
106	bbc.com
107	behance.net
108	berkeley.edu
109	biblegateway.com
110	biglobe.ne.jp
111	billboard.com
112	bing.com
113	bit.ly
114	bitly.com
115	blackberry.com
116	blogger.com
117	blogspot.com SUBDOMAIN-COPY
118	bloomberg.com
119	booking.com
120	boston.com
121	box.com
122	britannica.com
123	bt.com
124	bund.de
125	businessinsider.com
126	businesswire.com
127	buydomains.com
128	buzzfeed.com
129	ca.gov
130	cambridge.org
131	canalblog.com
132	cbc.ca
133	cbslocal.com
134	cbsnews.com
135	cdc.gov
136	change.org
137	channel4.com
138	chicagotribune.com
139	chinadaily.com.cn
140	cisco.com
141	clickbank.net
142	cloudflare.com
143	cmu.edu
144	cnbc.com
145	cnet.com
146	cnn.com
147	cocolog-nifty.com
148	columbia.edu
149	connect.over-blog.com
150	cornell.edu
151	corriere.it
152	cpanel.com
153	cpanel.net
154	creativecommons.org
155	csdn.net
156	csmonitor.com
157	dailymail.co.uk
158	dailymotion.com
159	dan.com
160	daum.net
161	debian.org
162	dell.com
163	depositfiles.com
164	detik.com
165	digg.com
166	discovery.com
167	disney.com
168	disney.go.com
169	disqus.com
170	doubleclick.net
171	dreniq.com
172	dribbble.com
173	dropbox.com SINGLEPAGE
174	dropboxusercontent.com
175	dw.com
176	e-recht24.de
177	ea.com
178	ebay.co.uk
179	ebay.com
180	economist.com
181	eff.org
182	ehow.com
183	elmundo.es
184	elpais.com
185	engadget.com
186	entrepreneur.com
187	eonline.com
188	espn.com
189	espn.go.com
190	etsy.com
191	europa.eu
192	eventbrite.com
193	example.com
194	excite.co.jp
195	express.co.uk
196	facebook.com
197	fandom.com
198	fastcompany.com
199	fb.com
200	fb.me
201	fda.gov
202	fedoraproject.org
203	feedburner.com
204	fifa.com
205	files.wordpress.com
206	flickr.com
207	forbes.com
208	fortune.com
209	foursquare.com
210	foxnews.com
211	ft.com
212	ftc.gov
213	gen.xyz
214	geocities.jp
215	gesetze-im-internet.de
216	ggpht.com
217	github.com
218	gizmodo.com
219	globo.com
220	gmail.com
221	gnu.org
222	godaddy.com
223	gofundme.com
224	goo.gl
225	goo.ne.jp
226	goodreads.com
227	google.ca
228	google.co.id
229	google.co.in
230	google.co.jp
231	google.co.uk
232	google.com
233	google.com.br
234	google.com.hk
235	google.com.tr
236	google.de
237	google.es
238	google.fr
239	google.it
240	google.nl
241	google.pl
242	google.ru
243	googleapis.com
244	googleblog.com
245	googleusercontent.com
246	gooyaabitemplates.com
247	gov.uk
248	gravatar.com
249	greenpeace.org
250	gstatic.com
251	guardian.co.uk
252	harvard.edu
253	hatena.ne.jp
254	histats.com
255	hm.com
256	hollywoodreporter.com
257	home.pl
258	house.gov
259	howstuffworks.com
260	hp.com
261	huffingtonpost.com
262	huffpost.com
263	hugedomains.com
264	ibm.com
265	ibtimes.com
266	icann.org
267	ieee.org
268	ietf.org
269	ig.com.br
270	ign.com
271	ikea.com
272	imageshack.us
273	imdb.com
274	imgur.com
275	inc.com
276	independent.co.uk
277	indiatimes.com
278	indiegogo.com
279	instagram.com
280	instructables.com
281	intel.com
282	interia.pl
283	issuu.com
284	istockphoto.com
285	iubenda.com
286	jd.com
287	joomla.org
288	jquery.com
289	jstor.org
290	kickstarter.com
291	kinja.com
292	last.fm
293	latimes.com
294	lefigaro.fr
295	lemonde.fr
296	line.me
297	linkedin.com
298	list-manage.com
299	live.com
300	livejournal.com
301	livescience.com
302	loc.gov
303	lonelyplanet.com
304	lycos.com
305	m.wikipedia.org mi.m.wikipedia.org
306	mail.ru
307	marketwatch.com
308	marriott.com
309	mashable.com
310	mediafire.com
311	medium.com
312	mega.nz
313	megaupload.com
314	mercurynews.com
315	merriam-webster.com
316	metro.co.uk
317	microsoft.com microsoft.com/mi-nz/
318	microsoftonline.com
319	mirror.co.uk
320	mit.edu
321	mixcloud.com
322	mlb.com
323	mozilla.com
324	mozilla.org
325	msn.com
326	myspace.com
327	mysql.com
328	namecheap.com
329	narod.ru
330	nasa.gov
331	nationalgeographic.com
332	nature.com
333	naver.com
334	naver.jp
335	nba.com
336	nbcnews.com
337	ndtv.com
338	netflix.com
339	netsons.com
340	netvibes.com
341	networkadvertising.org
342	news.com.au
343	newscientist.com
344	newsweek.com
345	newyorker.com
346	nginx.com
347	nginx.org
348	nhk.or.jp
349	nicovideo.jp
350	nifty.com
351	nih.gov
352	nikkei.com
353	noaa.gov
354	nokia.com
355	npr.org
356	nvidia.com
357	nydailynews.com
358	nypost.com
359	nytimes.com
360	nyu.edu
361	odnoklassniki.ru
362	office.com
363	offset.com
364	ok.ru
365	okezone.com
366	opera.com
367	oracle.com
368	orange.fr
369	oreilly.com
370	oup.com
371	over-blog.com
372	ovh.co.uk
373	ovh.com
374	ovh.net
375	ox.ac.uk
376	parallels.com
377	pastebin.com
378	paypal.com
379	pbs.org
380	pcmag.com
381	people.com
382	photobucket.com
383	php.net
384	pinterest.com SINGLEPAGE
385	pixabay.com
386	playstation.com
387	plesk.com
388	plos.org
389	politico.com
390	prestashop.com
391	prezi.com
392	princeton.edu
393	privacyshield.gov
394	prnewswire.com
395	psychologytoday.com
396	qq.com
397	quantcast.com
398	quora.com
399	rakuten.co.jp
400	rambler.ru
401	rapidshare.com
402	reddit.com
403	repubblica.it
404	researchgate.net
405	reuters.com
406	ria.ru
407	rottentomatoes.com
408	rt.com
409	rtve.es
410	sakura.ne.jp
411	samsung.com
412	sapo.pt
413	scholastic.com
414	sciencedaily.com
415	sciencedirect.com
416	sciencemag.org
417	scientificamerican.com
418	scribd.com
419	seattletimes.com
420	secureserver.net
421	sedo.com
422	seesaa.net
423	sendspace.com
424	sfgate.com
425	shopify.com
426	shutterstock.com
427	siemens.com
428	sina.com.cn
429	sky.com
430	skype.com
431	skyrock.com
432	slate.com
433	slideshare.net
434	sm.cn
435	smh.com.au
436	so-net.ne.jp
437	softonic.com
438	sogou.com
439	sohu.com
440	soratemplates.com
441	soso.com
442	soundcloud.com
443	spiegel.de
444	spotify.com
445	springer.com
446	sputniknews.com
447	ssl-images-amazon.com
448	stackoverflow.com
449	standard.co.uk
450	stanford.edu
451	state.gov
452	steamcommunity.com
453	steampowered.com
454	storage.canalblog.com
455	storage.googleapis.com
456	stores.jp
457	storify.com
458	stuff.co.nz SINGLEPAGE
459	surveymonkey.com
460	symantec.com
461	t-online.de
462	t.co
463	t.me
464	tabelog.com
465	taobao.com
466	target.com
467	teamviewer.com
468	techcrunch.com
469	ted.com
470	telegram.me
471	telegraph.co.uk
472	terra.com.br
473	theatlantic.com
474	thefreedictionary.com
475	theglobeandmail.com
476	theguardian.com
477	themeforest.net
478	thenextweb.com
479	thestar.com
480	thesun.co.uk
481	thetimes.co.uk
482	theverge.com
483	thoughtco.com
484	tianya.cn
485	time.com
486	tinyurl.com
487	tmall.com
488	tmz.com
489	tribunnews.com
490	tripadvisor.com
491	trustpilot.com
492	twitch.tv
493	twitter.com
494	ucoz.ru
495	uiuc.edu
496	umich.edu
497	un.org
498	undeveloped.com
499	unesco.org
500	uol.com.br
501	urbandictionary.com
502	usa.gov
503	usatoday.com
504	usgs.gov
505	usnews.com
506	uspto.gov
507	ustream.tv
508	utexas.edu
509	variety.com
510	venturebeat.com
511	vice.com
512	viglink.com
513	vimeo.com
514	vk.com
515	vkontakte.ru
516	vox.com
517	w3.org
518	w3schools.com
519	wa.me
520	walmart.com
521	washington.edu
522	washingtonpost.com
523	wattpad.com
524	weather.com
525	web.fc2.com
526	webmd.com
527	weebly.com
528	weibo.com
529	welt.de
530	whatsapp.com
531	whitehouse.gov
532	who.int
533	wikia.com
534	wikihow.com
535	wikimedia.org
536	wikipedia.org mi.wikipedia.org
537	wiktionary.org mi.wiktionary.org
538	wiley.com
539	windowsphone.com
540	wired.com
541	wix.com
542	wordpress.org SUBDOMAIN-COPY
543	worldbank.org
544	wp.com
545	wsj.com
546	xbox.com
547	xinhuanet.com
548	yadi.sk
549	yahoo.co.jp
550	yahoo.com
551	yale.edu
552	yandex.ru
553	yelp.com
554	youku.com
555	youronlinechoices.com
556	youtu.be
557	youtube.com
558	ytimg.com
559	zdnet.com
560	zend.com
561	zendesk.com
562	zippyshare.com
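
The value rules described in the file's header comments can be sketched in Python as follows. This is only an illustration under stated assumptions, not the project's actual implementation: the function names `parse_topsites` and `resolve`, the returned action labels, and the `SAMPLE` text are all hypothetical. Two behaviours are also assumptions: entries are tried in file order with the first match winning (which is why the more specific `docs.google.com` line sits above the alphabetical list), and "contains" is read as a match on a domain boundary of the seed URL's host. A plain substring test would let `t.co` match inside `blogspot.com`. The returned filter is the bare string the header comments describe, not a finished regex.

```python
# Hypothetical sample in the documented format: <topsite-base-url><tab><value>
SAMPLE = """\
# comment lines and blank lines are skipped

docs.google.com\tSINGLEPAGE
blogspot.com\tSUBDOMAIN-COPY
google.com
t.co
wikipedia.org\tmi.wikipedia.org
"""


def parse_topsites(text):
    """Return (base_url, value) pairs in file order; value is '' when absent."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: accept any whitespace run as the separator, not only a tab.
        parts = line.split(None, 1)
        value = parts[1].strip() if len(parts) > 1 else ""
        entries.append((parts[0], value))
    return entries


def host_of(seed_url):
    """Lower-cased host of a seed URL, with protocol, port and path removed."""
    bare = seed_url.split("://", 1)[-1]
    return bare.split("/", 1)[0].split(":", 1)[0].lower()


def resolve(seed_url, entries):
    """Return (action, url_filter) for a seed URL.

    action is one of 'no-match', 'unprocessed', 'singlepage', 'filter'.
    """
    host = host_of(seed_url)
    for base, value in entries:
        # Assumption: match on a domain boundary rather than a raw substring.
        if host != base and not host.endswith("." + base):
            continue
        if value == "":
            # Empty value: goes to unprocessed-topsite-matches.txt, not crawled.
            return ("unprocessed", None)
        if value == "SINGLEPAGE":
            # Restrict the crawl to the seed URL itself.
            return ("singlepage", seed_url)
        if value == "SUBDOMAIN-COPY":
            if host == base:
                # Exact match on the top site: left for manual inspection.
                return ("unprocessed", None)
            return ("filter", host)
        # Otherwise the value is a <url-form-without-protocol>.
        return ("filter", value)
    return ("no-match", None)
```

For example, with `entries = parse_topsites(SAMPLE)`, the seed `pinky.blogspot.com` resolves to the filter `pinky.blogspot.com`, while `https://blogspot.com/` is reported as unprocessed, mirroring the two SUBDOMAIN-COPY cases in the header comments.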
