sites-too-big-to-exhaustively-crawl.txt@ 33569

Last change on this file since 33569 was 33569, checked in by ak19, 5 years ago

batchcrawl.sh now does what it should have from the start, which is to move the log.out and UNFINISHED files into the output folder instead of leaving them in the input folder, as the input to_crawl folder can and does get replaced all the time, every time I regenerate it after black/white/greylisting more urls. 2. Blacklisted more adult sites, greylisted more product sites and .ru, .pl and .tk domains with whitelisting in the whitelist file. 3. CCWETProcessor now looks out for additional adult sites based on URL and adds them to its blacklist in memory (not the file) and logs the domain for checking and manually adding to the blacklist file.

File size: 10.9 KB

Line
1	# Mapping of top sites in base url forms to value
2
3	# This file contains sites that are too large to crawl exhaustively.
4	# The domains are from Alexa top sites (where only the first 50 were visible)
5	# Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
6	# Finally also added https://moz.com/top500 by downloading its CSV file and
7	# adding its URLs to the existing listing here from alexa/wiki.
8	# Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
9	# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
10	# just <site>.ext
11	# And finally, re-sorted the reduced list alphabetically and pasted into here.
12
13	# FORMAT OF THIS FILE'S CONTENTS:
14	# <topsite-base-url>,<value>
15	# where <value> is one of
16	# empty, SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE, SINGLEPAGE, <url-form-without-protocol>
17	#
18	# - if value is left empty: if seedurl contains topsite-base-url, the seedurl will go into the
19	# file unprocessed-topsite-matches.txt and the site/page won't be crawled.
20	# The user will be notified to inspect the file unprocessed-topsite-matches.txt.
21	# - SINGLEPAGE: if seedurl matches topsite-base-url, then only download the page at that seedurl.
22	# For example, if the seedurl is http://docs.google.com/some-long-suffix-in-base64, then it
23	# matches the topsite-base-url of docs.google.com and its value of SINGLEPAGE will add the
24	# seedurl itself as the regex url-filter, to restrict the crawl to just the specified page.
25	# - SUBDOMAIN-COPY: if seedurl CONTAINS topsite-base-url, then the seedurl's own subdomain
26	# (or, failing that, its domain) makes up the urlfilter, so we don't leak out into a larger domain.
27	# Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if seedurl is
28	# pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
29	# will ensure we restrict crawling to pages on pinky.blogspot.com.
30	# However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
31	# into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
32	# - FOLLOW-LINKS-WITHIN-TOPSITE: pages linked from the seedURL page can be followed and
33	# downloaded, as long as they're within the same subdomain matching the topsite-base-url.
34	# This is different from SUBDOMAIN-COPY, as that can download all of a specific subdomain but
35	# restricts against downloading the entire domain (e.g. all pinky.blogspot.com and not anything
36	# else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (at
37	# depth specified for the nutch crawl) as long as they're within the topsite-base-url.
38	# e.g. seedURLs on docs.google.com containing links will have those linked pages and any
39	# they link to etc. downloaded as long as they're on docs.google.com.
40	# - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
41	# url-form-without-protocol will make up the urlfilter, again preventing leaking into a
42	# larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
43	# match the topsite-base-url of wikipedia.org for which the <url-form-without-protocol>
44	# value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
45	# <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
46	# crawl to just mi.wikipedia.org.
47	# Remember to leave out any protocol from <url-form-without-protocol>.
48	#
49	# TODO If useful:
50	# column 3: whether nutch should do fetch all or not
51	# column 4: number of crawl iterations
52
53
54	# NOT TOP SITES, BUT SITES WE INSPECTED AND WANT TO CONTROL SIMILARLY TO TOP SITES
55	00.gs,SINGLEPAGE
56	# May be a large site with only seedURLs of real relevance
57	topographic-map.com,SINGLEPAGE
58	ami-media.net,SINGLEPAGE
59	# 2 pages of declarations of human rights in Maori, rest in other languages
60	anitra.net,SINGLEPAGE
61	# special case
62	mi.centr-zashity.ru,SINGLEPAGE
63
64	# TOP SITE BUT NOT TOP 500
65	www.tumblr.com,SINGLEPAGE
66
67
68	# TOP SITES
69
70	# docs.google.com is a special case: not all pages are public and any interlinking is likely to
71	# be intentional. Grab all linked pages, for link depth set with nutch's crawl, as long as the
72	# links are within the given topsite-base-url
73	docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE
74
75	# Just crawl a single page for these:
76	drive.google.com,SINGLEPAGE
77	forms.office.com,SINGLEPAGE
78	player.vimeo.com,SINGLEPAGE
79	static-promote.weebly.com,SINGLEPAGE
80
81	# Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
82	# The page's containing folder is whitelisted in case the photos are there.
83	korora.econ.yale.edu,SINGLEPAGE
84
85
86	000webhost.com
87	360.cn
88	4shared.com
89	a8.net
90	abc.es
91	abc.net.au
92	abcnews.go.com
93	about.com
94	about.me
95	aboutads.info
96	abril.com.br
97	academia.edu
98	accuweather.com
99	addthis.com
100	addtoany.com
101	adobe.com
102	adweek.com
103	airbnb.com
104	akamaihd.net
105	alexa.com
106	alibaba.com
107	aliexpress.com
108	alipay.com
109	aljazeera.com
110	allaboutcookies.org
111	allrecipes.com
112	amazon.ca
113	amazon.co.jp
114	amazon.co.uk
115	amazon.com
116	amazon.de
117	amazon.es
118	amazon.fr
119	amazon.in
120	ameblo.jp
121	ampproject.org
122	android.com
123	aol.com
124	ap.org
125	apache.org
126	apachefriends.org
127	apple.com
128	archive.org
129	archives.gov
130	arstechnica.com
131	arxiv.org
132	asahi.com
133	ask.fm
134	asus.com
135	axs.com
136	babytree.com
137	baidu.com
138	bandcamp.com
139	bbc.co.uk
140	bbc.com
141	behance.net
142	berkeley.edu
143	biblegateway.com
144	biglobe.ne.jp
145	billboard.com
146	bing.com
147	bit.ly
148	bitly.com
149	blackberry.com
150	blogger.com
151	blogspot.com,SUBDOMAIN-COPY
152	bloomberg.com
153	booking.com
154	boston.com
155	box.com
156	britannica.com
157	bt.com
158	bund.de
159	businessinsider.com
160	businesswire.com
161	buydomains.com
162	buzzfeed.com
163	ca.gov
164	cambridge.org
165	canalblog.com
166	cbc.ca
167	cbslocal.com
168	cbsnews.com
169	cdc.gov
170	change.org
171	channel4.com
172	chicagotribune.com
173	chinadaily.com.cn
174	cisco.com
175	clickbank.net
176	cloudflare.com
177	cmu.edu
178	cnbc.com
179	cnet.com
180	cnn.com
181	cocolog-nifty.com
182	columbia.edu
183	connect.over-blog.com
184	cornell.edu
185	corriere.it
186	cpanel.com
187	cpanel.net
188	creativecommons.org
189	csdn.net
190	csmonitor.com
191	dailymail.co.uk
192	dailymotion.com
193	dan.com
194	daum.net
195	debian.org
196	dell.com
197	depositfiles.com
198	detik.com
199	digg.com
200	discovery.com
201	disney.com
202	disney.go.com
203	disqus.com
204	doubleclick.net
205	dreniq.com
206	dribbble.com
207	dropbox.com,SINGLEPAGE
208	dropboxusercontent.com
209	dw.com
210	e-recht24.de
211	ea.com
212	ebay.co.uk
213	ebay.com
214	economist.com
215	eff.org
216	ehow.com
217	elmundo.es
218	elpais.com
219	engadget.com
220	entrepreneur.com
221	eonline.com
222	espn.com
223	espn.go.com
224	etsy.com
225	europa.eu
226	eventbrite.com
227	example.com
228	excite.co.jp
229	express.co.uk
230	facebook.com
231	fandom.com
232	fastcompany.com
233	fb.com
234	fb.me
235	fda.gov
236	fedoraproject.org
237	feedburner.com
238	fifa.com
239	files.wordpress.com
240	flickr.com
241	forbes.com
242	fortune.com
243	foursquare.com
244	foxnews.com
245	ft.com
246	ftc.gov
247	gen.xyz
248	geocities.jp
249	gesetze-im-internet.de
250	ggpht.com
251	github.com
252	gizmodo.com
253	globo.com
254	gmail.com
255	gnu.org
256	godaddy.com
257	gofundme.com
258	goo.gl
259	goo.ne.jp
260	goodreads.com
261	google.ca
262	google.co.id
263	google.co.in
264	google.co.jp
265	google.co.uk
266	google.com
267	google.com.br
268	google.com.hk
269	google.com.tr
270	google.de
271	google.es
272	google.fr
273	google.it
274	google.nl
275	google.pl
276	google.ru
277	googleapis.com
278	googleblog.com
279	googleusercontent.com
280	gooyaabitemplates.com
281	gov.uk
282	gravatar.com
283	greenpeace.org
284	gstatic.com
285	guardian.co.uk
286	harvard.edu
287	hatena.ne.jp
288	histats.com
289	hm.com
290	hollywoodreporter.com
291	home.pl
292	house.gov
293	howstuffworks.com
294	hp.com
295	huffingtonpost.com
296	huffpost.com
297	hugedomains.com
298	ibm.com
299	ibtimes.com
300	icann.org
301	ieee.org
302	ietf.org
303	ig.com.br
304	ign.com
305	ikea.com
306	imageshack.us
307	imdb.com
308	imgur.com
309	inc.com
310	independent.co.uk
311	indiatimes.com
312	indiegogo.com
313	instagram.com
314	instructables.com
315	intel.com
316	interia.pl
317	issuu.com
318	istockphoto.com
319	iubenda.com
320	jd.com
321	joomla.org
322	jquery.com
323	jstor.org
324	kickstarter.com
325	kinja.com
326	last.fm
327	latimes.com
328	lefigaro.fr
329	lemonde.fr
330	line.me
331	linkedin.com
332	list-manage.com
333	live.com
334	livejournal.com
335	livescience.com
336	loc.gov
337	lonelyplanet.com
338	lycos.com
339	m.wikipedia.org,mi.m.wikipedia.org
340	mail.ru
341	marketwatch.com
342	marriott.com
343	mashable.com
344	mediafire.com
345	medium.com
346	mega.nz
347	megaupload.com
348	mercurynews.com
349	merriam-webster.com
350	metro.co.uk
351	microsoft.com,microsoft.com/mi-nz/
352	microsoftonline.com
353	mirror.co.uk
354	mit.edu
355	mixcloud.com
356	mlb.com
357	mozilla.com
358	mozilla.org
359	msn.com
360	myspace.com
361	mysql.com
362	namecheap.com
363	narod.ru
364	nasa.gov
365	nationalgeographic.com
366	nature.com
367	naver.com
368	naver.jp
369	nba.com
370	nbcnews.com
371	ndtv.com
372	netflix.com
373	netsons.com
374	netvibes.com
375	networkadvertising.org
376	news.com.au
377	newscientist.com
378	newsweek.com
379	newyorker.com
380	nginx.com
381	nginx.org
382	nhk.or.jp
383	nicovideo.jp
384	nifty.com
385	nih.gov
386	nikkei.com
387	noaa.gov
388	nokia.com
389	npr.org
390	nvidia.com
391	nydailynews.com
392	nypost.com
393	nytimes.com
394	nyu.edu
395	odnoklassniki.ru
396	office.com
397	offset.com
398	ok.ru
399	okezone.com
400	opera.com
401	oracle.com
402	orange.fr
403	oreilly.com
404	oup.com
405	over-blog.com
406	ovh.co.uk
407	ovh.com
408	ovh.net
409	ox.ac.uk
410	parallels.com
411	pastebin.com
412	paypal.com
413	pbs.org
414	pcmag.com
415	people.com
416	photobucket.com
417	php.net
418	pinterest.com,SINGLEPAGE
419	pixabay.com
420	playstation.com
421	plesk.com
422	plos.org
423	politico.com
424	prestashop.com
425	prezi.com
426	princeton.edu
427	privacyshield.gov
428	prnewswire.com
429	psychologytoday.com
430	qq.com
431	quantcast.com
432	quora.com
433	rakuten.co.jp
434	rambler.ru
435	rapidshare.com
436	reddit.com
437	repubblica.it
438	researchgate.net
439	reuters.com
440	ria.ru
441	rottentomatoes.com
442	rt.com
443	rtve.es
444	sakura.ne.jp
445	samsung.com
446	sapo.pt
447	scholastic.com
448	sciencedaily.com
449	sciencedirect.com
450	sciencemag.org
451	scientificamerican.com
452	scribd.com
453	seattletimes.com
454	secureserver.net
455	sedo.com
456	seesaa.net
457	sendspace.com
458	sfgate.com
459	shopify.com
460	shutterstock.com
461	siemens.com
462	sina.com.cn
463	sky.com
464	skype.com
465	skyrock.com
466	slate.com
467	slideshare.net
468	sm.cn
469	smh.com.au
470	so-net.ne.jp
471	softonic.com
472	sogou.com
473	sohu.com
474	soratemplates.com
475	soso.com
476	soundcloud.com
477	spiegel.de
478	spotify.com
479	springer.com
480	sputniknews.com
481	ssl-images-amazon.com
482	stackoverflow.com
483	standard.co.uk
484	stanford.edu
485	state.gov
486	steamcommunity.com
487	steampowered.com
488	storage.canalblog.com
489	storage.googleapis.com
490	stores.jp
491	storify.com
492	stuff.co.nz,SINGLEPAGE
493	surveymonkey.com
494	symantec.com
495	t-online.de
496	t.co
497	t.me
498	tabelog.com
499	taobao.com
500	target.com
501	teamviewer.com
502	techcrunch.com
503	ted.com
504	telegram.me
505	telegraph.co.uk
506	terra.com.br
507	theatlantic.com
508	thefreedictionary.com
509	theglobeandmail.com
510	theguardian.com
511	themeforest.net
512	thenextweb.com
513	thestar.com
514	thesun.co.uk
515	thetimes.co.uk
516	theverge.com
517	thoughtco.com
518	tianya.cn
519	time.com
520	tinyurl.com
521	tmall.com
522	tmz.com
523	tribunnews.com
524	tripadvisor.com
525	trustpilot.com
526	twitch.tv
527	twitter.com
528	ucoz.ru
529	uiuc.edu
530	umich.edu
531	un.org
532	undeveloped.com
533	unesco.org
534	uol.com.br
535	urbandictionary.com
536	usa.gov
537	usatoday.com
538	usgs.gov
539	usnews.com
540	uspto.gov
541	ustream.tv
542	utexas.edu
543	variety.com
544	venturebeat.com
545	vice.com
546	viglink.com
547	vimeo.com
548	vk.com
549	vkontakte.ru
550	vox.com
551	w3.org
552	w3schools.com
553	wa.me
554	walmart.com
555	washington.edu
556	washingtonpost.com
557	wattpad.com
558	weather.com
559	web.fc2.com
560	webmd.com
561	weebly.com
562	weibo.com
563	welt.de
564	whatsapp.com
565	whitehouse.gov
566	who.int
567	wikia.com
568	wikihow.com
569	wikimedia.org
570	wikipedia.org,mi.wikipedia.org
571	wiktionary.org,mi.wiktionary.org
572	wiley.com
573	windowsphone.com
574	wired.com
575	wix.com
576	wordpress.org,SUBDOMAIN-COPY
577	worldbank.org
578	wp.com
579	wsj.com
580	xbox.com
581	xinhuanet.com
582	yadi.sk
583	yahoo.co.jp
584	yahoo.com
585	yale.edu
586	yandex.ru
587	yelp.com
588	youku.com
589	youronlinechoices.com
590	youtu.be
591	youtube.com
592	ytimg.com
593	zdnet.com
594	zend.com
595	zendesk.com
596	zippyshare.com
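
The <topsite-base-url>,<value> format described in the header comments can be sketched in Python as follows. This is an illustration only, not the CCWETProcessor code. The names parse_topsites and resolve, the (action, url_filter) return tuples, matching on the host suffix rather than on a plain substring, and letting the longest matching entry win (so docs.google.com beats google.com) are all assumptions made for the example.

```python
def parse_topsites(lines):
    """Return {topsite-base-url: value}; value is '' when left empty."""
    topsites = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        base, _, value = line.partition(",")
        topsites[base.strip()] = value.strip()
    return topsites


def resolve(seed_url, topsites):
    """Return (action, url_filter) for one seed URL.

    action is "crawl" or "unprocessed" (hypothetical labels); "unprocessed"
    stands for the seed URL being written to unprocessed-topsite-matches.txt.
    url_filter is None when there is nothing to restrict.
    """
    no_protocol = seed_url.split("://", 1)[-1]
    host = no_protocol.split("/", 1)[0].lower()

    # Assumption: "contains" is approximated by a host-suffix match.
    matches = [b for b in topsites if host == b or host.endswith("." + b)]
    if not matches:
        return ("crawl", None)  # not listed as a top site

    base = max(matches, key=len)  # assumption: most specific entry wins
    value = topsites[base]

    if value == "":
        return ("unprocessed", None)
    if value == "SINGLEPAGE":
        return ("crawl", seed_url)  # restrict the crawl to this one page
    if value == "SUBDOMAIN-COPY":
        if host == base:  # exact match on the top site itself
            return ("unprocessed", None)
        return ("crawl", host)  # e.g. pinky.blogspot.com
    if value == "FOLLOW-LINKS-WITHIN-TOPSITE":
        return ("crawl", base)
    return ("crawl", value)  # <url-form-without-protocol>


if __name__ == "__main__":
    sample = [
        "# comment line",
        "docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE",
        "google.com",
        "blogspot.com,SUBDOMAIN-COPY",
        "wikipedia.org,mi.wikipedia.org",
    ]
    table = parse_topsites(sample)
    for url in ("http://pinky.blogspot.com/", "http://www.google.com/"):
        print(url, resolve(url, table))
```

With this sketch, a seed URL on a subdomain of blogspot.com is restricted to that subdomain, while a seed URL that matches an entry with an empty value is set aside for manual inspection, as the header comments describe.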

source: gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt@ 33569