root/other-projects/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt @ 33666

Revision 33666, 11.2 KB (checked in by ak19, 2 months ago)

Having finished sending all the crawl data to MongoDB:
1. Recrawled the 2 sites which I had earlier noted required recrawling, 00152 and 00332. 00152 required changes to how it needed to be crawled: MP3 files needed to be blocked, as there were HBase error messages about key values being too large.
2. Modified the regex-urlfilter.GS_TEMPLATE file to block mp3 files in general for future crawls too (in the location of the file where jpg etc. were already blocked by nutch's default regex url filters).
3. Further had to restrict the 00152 site to be crawled only under its /maori/ sub-domain. Since the seedURL maori.html was not off a /maori/ url, this revealed that the CCWETProcessor code didn't yet allow filters to okay seedURLs in the case where the crawl was controlled to run over a subdomain (as expressed in the conf/sites-too-big-to-exhaustively-crawl file) but the seedURL didn't match those controlling regex filters. So now, in such cases, CCWETProcessor adds the non-matching seedURLs to the filters too (so we get just the single page of each such seedURL), besides a filter on the requested subdomain, so that we follow all pages linked by the seedURLs that do match the subdomain expression.
4. Added to_crawl.tar.gz to svn: the tarball of the to_crawl sites that I actually ran nutch over, i.e. all the site folders with their seedURL.txt and regex-urlfilter.txt files that batchcrawl.sh runs over. This didn't use the latest version of the sites folder and blacklist/whitelist files generated by CCWETProcessor, since the latest version was regenerated after the final modifications to CCWETProcessor, which came after crawling was finished. But to_crawl.tar.gz does have a manually modified 00152, with the correct regex-urlfilter file, and uses the newer regex-urlfilter.GS_TEMPLATE file that blocks mp3 files.
5. crawledNode6.tar.gz now contains the dump output for sites 00152 and 00332, which were crawled on node6 today (after which their processed dump.txt results were added into MongoDB).
7. MoreReading/mongodb.txt now contains the results of some queries I ran against the total nutch-crawled data.

# Mapping of top sites in base url forms to value

# This file contains sites that are too large to crawl exhaustively.
# The domains are from Alexa top sites (where only the first 50 were visible).
# Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
# Finally also added https://moz.com/top500 by downloading its CSV file and
# adding its URLs to the existing listing here from alexa/wiki.
# Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
# just <site>.ext
# And finally, re-sorted the reduced list alphabetically and pasted it in here.

# FORMAT OF THIS FILE'S CONTENTS:
#    <topsite-base-url>,<value>
# where <value> is empty or one of
#    SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE, SINGLEPAGE, <url-form-without-protocol>
#
#   - if value is left empty: if the seedurl contains topsite-base-url, the seedurl goes into the
#     file unprocessed-topsite-matches.txt and the site/page won't be crawled.
#     The user will be notified to inspect the file unprocessed-topsite-matches.txt.
#   - SINGLEPAGE: if the seedurl matches topsite-base-url, then only the page at that seedurl is
#     downloaded. For example, if the seedurl is http://docs.google.com/some-long-suffix-in-base64,
#     it matches the topsite-base-url docs.google.com, whose SINGLEPAGE value adds the
#     seedurl itself as the regex url-filter, restricting the crawl to just the specified page.
#   - SUBDOMAIN-COPY: if the seedurl CONTAINS topsite-base-url, then the seedurl's subdomain
#     (or, failing that, its domain) makes up the urlfilter, so we don't leak out into a larger
#     domain. Use SUBDOMAIN-COPY to restrict the crawl to a domain prefix/subdomain. For example,
#     if the seedurl is pinky.blogspot.com, it matches the topsite-base-url blogspot.com, but
#     SUBDOMAIN-COPY ensures we restrict crawling to pages on pinky.blogspot.com.
#     However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl goes
#     into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
#   - FOLLOW-LINKS-WITHIN-TOPSITE: download the seedURL pages; pages linked from each seedURL
#     page are followed and downloaded too, as long as they're within the same subdomain
#     matching the topsite-base-url.
#     This differs from SUBDOMAIN-COPY, which can download all of a specific subdomain but
#     prevents downloading the entire domain (e.g. all of pinky.blogspot.com but nothing
#     else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (to
#     the depth specified for the nutch crawl) as long as they're within the topsite-base-url.
#     e.g. seedURLs on docs.google.com containing links will have those linked pages, and any
#     pages they link to, etc., downloaded as long as they're on docs.google.com.
#   - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
#     url-form-without-protocol makes up the urlfilter, again preventing leaking into a
#     larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it
#     matches the topsite-base-url wikipedia.org, whose <url-form-without-protocol>
#     value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
#     <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
#     crawl to just mi.wikipedia.org.
#     Remember to leave the protocol out of <url-form-without-protocol>.
#
# TODO if useful:
#   column 3: whether nutch should do fetch all or not
#   column 4: number of crawl iterations

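# The value rules above can be sketched in code. The following is a hypothetical
# Python illustration (NOT the actual CCWETProcessor implementation; function names
# and the dict-based lookup are assumptions for clarity) of how one entry in this
# file maps a seed URL to a url-filter, or to "skip" for unprocessed matches:

```python
# Hypothetical sketch of the topsite value rules described above.
# Not the real CCWETProcessor code; names here are illustrative only.
from urllib.parse import urlparse

def load_topsites(path):
    """Parse <topsite-base-url>,<value> lines; a missing value is stored as ""."""
    topsites = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            base, _, value = line.partition(",")
            topsites[base] = value
    return topsites

def url_filter_for(seedurl, topsites):
    """Return the url-filter string for a seed URL, or None if it must be skipped
    (i.e. it would go into unprocessed-topsite-matches.txt)."""
    no_proto = seedurl.split("://", 1)[-1]          # drop any protocol
    host = urlparse("http://" + no_proto).netloc    # seedurl's (sub)domain
    for base, value in topsites.items():
        if base not in no_proto:
            continue
        if value == "SINGLEPAGE":
            return no_proto          # restrict crawl to just this one page
        if value == "SUBDOMAIN-COPY":
            if host == base:         # exact domain match: leave uncrawled
                return None
            return host              # e.g. pinky.blogspot.com
        if value == "FOLLOW-LINKS-WITHIN-TOPSITE":
            return base              # follow links, but stay within topsite
        if value == "":
            return None              # empty value: notify user, don't crawl
        return value                 # explicit <url-form-without-protocol>
    return no_proto                  # not a topsite: crawl as usual
```

# For example, with the blogspot.com,SUBDOMAIN-COPY entry below, a seed of
# http://pinky.blogspot.com/page would yield the filter pinky.blogspot.com.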
# NOT TOP SITES, BUT SITES WE INSPECTED AND WANT TO CONTROL SIMILARLY TO TOP SITES
00.gs,SINGLEPAGE
# May be a large site with only the seedURLs of real relevance
topographic-map.com,SINGLEPAGE
ami-media.net,SINGLEPAGE
# 2 pages of declarations of human rights in Maori, the rest in other languages
anitra.net,SINGLEPAGE
# special case
mi.centr-zashity.ru,SINGLEPAGE

# we want the http://loquevendra318.com/fox/maori.html seed URL but also
# pages within the following subsection
loquevendra318.com,loquevendra318.com/fox/maori/

martinvrijland.nl,martinvrijland.nl/mi/
csunplugged.org,csunplugged.org/mi/
gpedia.com,gpedia.com/mi/

# TOP SITE BUT NOT TOP 500
www.tumblr.com,SINGLEPAGE


# TOP SITES

# docs.google.com is a special case: not all pages are public and any interlinking is likely to
# be intentional. Grab all linked pages, to the link depth set for nutch's crawl, as long as the
# links are within the given topsite-base-url
docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE

# Just crawl a single page for these:
drive.google.com,SINGLEPAGE
forms.office.com,SINGLEPAGE
player.vimeo.com,SINGLEPAGE
static-promote.weebly.com,SINGLEPAGE

# Special case of yale.edu: its Rapa-Nui pages are on the blacklist, but we want this page + its
# photos. The page's containing folder is whitelisted in case the photos are there.
korora.econ.yale.edu,SINGLEPAGE

000webhost.com
360.cn
4shared.com
a8.net
abc.es
abc.net.au
abcnews.go.com
about.com
about.me
aboutads.info
abril.com.br
academia.edu
accuweather.com
addthis.com
addtoany.com
adobe.com
adweek.com
airbnb.com
akamaihd.net
alexa.com
alibaba.com
aliexpress.com
alipay.com
aljazeera.com
allaboutcookies.org
allrecipes.com
amazon.ca
amazon.co.jp
amazon.co.uk
amazon.com
amazon.de
amazon.es
amazon.fr
amazon.in
ameblo.jp
ampproject.org
android.com
aol.com
ap.org
apache.org
apachefriends.org
apple.com
archive.org
archives.gov
arstechnica.com
arxiv.org
asahi.com
ask.fm
asus.com
axs.com
babytree.com
baidu.com
bandcamp.com
bbc.co.uk
bbc.com
behance.net
berkeley.edu
biblegateway.com
biglobe.ne.jp
billboard.com
bing.com
bit.ly
bitly.com
blackberry.com
blogger.com
blogspot.com,SUBDOMAIN-COPY
bloomberg.com
booking.com
boston.com
box.com
britannica.com
bt.com
bund.de
businessinsider.com
businesswire.com
buydomains.com
buzzfeed.com
ca.gov
cambridge.org
canalblog.com
cbc.ca
cbslocal.com
cbsnews.com
cdc.gov
change.org
channel4.com
chicagotribune.com
chinadaily.com.cn
cisco.com
clickbank.net
cloudflare.com
cmu.edu
cnbc.com
cnet.com
cnn.com
cocolog-nifty.com
columbia.edu
connect.over-blog.com
cornell.edu
corriere.it
cpanel.com
cpanel.net
creativecommons.org
csdn.net
csmonitor.com
dailymail.co.uk
dailymotion.com
dan.com
daum.net
debian.org
dell.com
depositfiles.com
detik.com
digg.com
discovery.com
disney.com
disney.go.com
disqus.com
doubleclick.net
dreniq.com
dribbble.com
dropbox.com,SINGLEPAGE
dropboxusercontent.com
dw.com
e-recht24.de
ea.com
ebay.co.uk
ebay.com
economist.com
eff.org
ehow.com
elmundo.es
elpais.com
engadget.com
entrepreneur.com
eonline.com
espn.com
espn.go.com
etsy.com
europa.eu
eventbrite.com
example.com
excite.co.jp
express.co.uk
facebook.com
fandom.com
fastcompany.com
fb.com
fb.me
fda.gov
fedoraproject.org
feedburner.com
fifa.com
files.wordpress.com
flickr.com
forbes.com
fortune.com
foursquare.com
foxnews.com
ft.com
ftc.gov
gen.xyz
geocities.jp
gesetze-im-internet.de
ggpht.com
github.com
gizmodo.com
globo.com
gmail.com
gnu.org
godaddy.com
gofundme.com
goo.gl
goo.ne.jp
goodreads.com
google.ca
google.co.id
google.co.in
google.co.jp
google.co.uk
google.com
google.com.br
google.com.hk
google.com.tr
google.de
google.es
google.fr
google.it
google.nl
google.pl
google.ru
googleapis.com
googleblog.com
googleusercontent.com
gooyaabitemplates.com
gov.uk
gravatar.com
greenpeace.org
gstatic.com
guardian.co.uk
harvard.edu
hatena.ne.jp
histats.com
hm.com
hollywoodreporter.com
home.pl
house.gov
howstuffworks.com
hp.com
huffingtonpost.com
huffpost.com
hugedomains.com
ibm.com
ibtimes.com
icann.org
ieee.org
ietf.org
ig.com.br
ign.com
ikea.com
imageshack.us
imdb.com
imgur.com
inc.com
independent.co.uk
indiatimes.com
indiegogo.com
instagram.com
instructables.com
intel.com
interia.pl
issuu.com
istockphoto.com
iubenda.com
jd.com
joomla.org
jquery.com
jstor.org
kickstarter.com
kinja.com
last.fm
latimes.com
lefigaro.fr
lemonde.fr
line.me
linkedin.com
list-manage.com
live.com
livejournal.com
livescience.com
loc.gov
lonelyplanet.com
lycos.com
m.wikipedia.org,mi.m.wikipedia.org
mail.ru
marketwatch.com
marriott.com
mashable.com
mediafire.com
medium.com
mega.nz
megaupload.com
mercurynews.com
merriam-webster.com
metro.co.uk
microsoft.com,microsoft.com/mi-nz/
microsoftonline.com
mirror.co.uk
mit.edu
mixcloud.com
mlb.com
mozilla.com
mozilla.org
msn.com
myspace.com
mysql.com
namecheap.com
narod.ru
nasa.gov
nationalgeographic.com
nature.com
naver.com
naver.jp
nba.com
nbcnews.com
ndtv.com
netflix.com
netsons.com
netvibes.com
networkadvertising.org
news.com.au
newscientist.com
newsweek.com
newyorker.com
nginx.com
nginx.org
nhk.or.jp
nicovideo.jp
nifty.com
nih.gov
nikkei.com
noaa.gov
nokia.com
npr.org
nvidia.com
nydailynews.com
nypost.com
nytimes.com
nyu.edu
odnoklassniki.ru
office.com
offset.com
ok.ru
okezone.com
opera.com
oracle.com
orange.fr
oreilly.com
oup.com
over-blog.com
ovh.co.uk
ovh.com
ovh.net
ox.ac.uk
parallels.com
pastebin.com
paypal.com
pbs.org
pcmag.com
people.com
photobucket.com
php.net
pinterest.com,SINGLEPAGE
pixabay.com
playstation.com
plesk.com
plos.org
politico.com
prestashop.com
prezi.com
princeton.edu
privacyshield.gov
prnewswire.com
psychologytoday.com
qq.com
quantcast.com
quora.com
rakuten.co.jp
rambler.ru
rapidshare.com
reddit.com
repubblica.it
researchgate.net
reuters.com
ria.ru
rottentomatoes.com
rt.com
rtve.es
sakura.ne.jp
samsung.com
sapo.pt
scholastic.com
sciencedaily.com
sciencedirect.com
sciencemag.org
scientificamerican.com
scribd.com
seattletimes.com
secureserver.net
sedo.com
seesaa.net
sendspace.com
sfgate.com
shopify.com
shutterstock.com
siemens.com
sina.com.cn
sky.com
skype.com
skyrock.com
slate.com
slideshare.net
sm.cn
smh.com.au
so-net.ne.jp
softonic.com
sogou.com
sohu.com
soratemplates.com
soso.com
soundcloud.com
spiegel.de
spotify.com
springer.com
sputniknews.com
ssl-images-amazon.com
stackoverflow.com
standard.co.uk
stanford.edu
state.gov
steamcommunity.com
steampowered.com
storage.canalblog.com
storage.googleapis.com
stores.jp
storify.com
stuff.co.nz,SINGLEPAGE
surveymonkey.com
symantec.com
t-online.de
t.co
t.me
tabelog.com
taobao.com
target.com
teamviewer.com
techcrunch.com
ted.com
telegram.me
telegraph.co.uk
terra.com.br
theatlantic.com
thefreedictionary.com
theglobeandmail.com
theguardian.com
themeforest.net
thenextweb.com
thestar.com
thesun.co.uk
thetimes.co.uk
theverge.com
thoughtco.com
tianya.cn
time.com
tinyurl.com
tmall.com
tmz.com
tribunnews.com
tripadvisor.com
trustpilot.com
twitch.tv
twitter.com
ucoz.ru
uiuc.edu
umich.edu
un.org
undeveloped.com
unesco.org
uol.com.br
urbandictionary.com
usa.gov
usatoday.com
usgs.gov
usnews.com
uspto.gov
ustream.tv
utexas.edu
variety.com
venturebeat.com
vice.com
viglink.com
vimeo.com
vk.com
vkontakte.ru
vox.com
w3.org
w3schools.com
wa.me
walmart.com
washington.edu
washingtonpost.com
wattpad.com
weather.com
web.fc2.com
webmd.com
weebly.com
weibo.com
welt.de
whatsapp.com
whitehouse.gov
who.int
wikia.com
wikihow.com
wikimedia.org
wikipedia.org,mi.wikipedia.org
wiktionary.org,mi.wiktionary.org
wiley.com
windowsphone.com
wired.com
wix.com
wordpress.org,SUBDOMAIN-COPY
worldbank.org
wp.com
wsj.com
xbox.com
xinhuanet.com
yadi.sk
yahoo.co.jp
yahoo.com
yale.edu
yandex.ru
yelp.com
youku.com
youronlinechoices.com
youtu.be
youtube.com
ytimg.com
zdnet.com
zend.com
zendesk.com
zippyshare.com