source: other-projects/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt@ 34011

Last change on this file since 34011 was 33904, checked in by ak19, 4 years ago

Shouldn't greylist anglican.org, as this prevented crawling of justus.anglican.org seedURLs. There's however no need to add an exception into sites-too-big-to-exhaustively-crawl.txt to control how much we crawl, as we only crawl to depth 10 anyway and the seedURLs already list the most promising pages (as well as 2 URLs on anglican.org which weren't promising). Added the to_crawl and finished crawled data for this. siteID is 01463.

File size: 11.3 KB
# Mapping of top sites in base url forms to value

# This file contains sites that are too large to crawl exhaustively.
# The domains are from Alexa top sites (where only the first 50 were visible)
# Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
# Finally also added https://moz.com/top500 by downloading its CSV file and
# adding its URLs to the existing listing here from alexa/wiki.
# Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
# just <site>.ext
# And finally, re-sorted the reduced list alphabetically and pasted into here.

# FORMAT OF THIS FILE'S CONTENTS:
#   <topsite-base-url>,<value>
# where <value> is one of
#   empty, SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE, SINGLEPAGE, <url-form-without-protocol>
#
#   - if value is left empty: if seedurl contains topsite-base-url, the seedurl will go into the
#     file unprocessed-topsite-matches.txt and the site/page won't be crawled.
#     The user will be notified to inspect the file unprocessed-topsite-matches.txt.
#   - SINGLEPAGE: if seedurl matches topsite-base-url, then only download the page at that seedurl.
#     For example, if the seedurl is http://docs.google.com/some-long-suffix-in-base64, then it
#     matches the topsite-base-url of docs.google.com and its value of SINGLEPAGE will add the
#     seedurl itself as the regex url-filter, to restrict the crawl to just the specified page.
#   - SUBDOMAIN-COPY: if seedurl CONTAINS topsite-base-url, then whatever the seedurl's subdomain
#     (or else domain) is will make up the urlfilter, so we don't leak out into a larger domain.
#     Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if seedurl is
#     pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
#     will ensure we restrict crawling to pages on pinky.blogspot.com.
#     However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
#     into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
#   - FOLLOW-LINKS-WITHIN-TOPSITE: download the seedURL pages, and also follow and download the
#     pages linked from each seedURL page, as long as they're within the same subdomain
#     matching the topsite-base-url.
#     This is different from SUBDOMAIN-COPY, as that can download all of a specific subdomain but
#     restricts against downloading the entire domain (e.g. all pinky.blogspot.com and not anything
#     else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (to the
#     depth specified for the nutch crawl) as long as they're within the topsite-base-url.
#     e.g. seedURLs on docs.google.com containing links will have those linked pages and any
#     they link to etc. downloaded as long as they're on docs.google.com.
#   - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
#     url-form-without-protocol will make up the urlfilter, again preventing leaking into a
#     larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
#     match the topsite-base-url of wikipedia.org for which the <url-form-without-protocol>
#     value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
#     <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
#     crawl to just mi.wikipedia.org.
#     Remember to leave out any protocol from <url-form-without-protocol>.
#
# TODO If useful:
#   column 3: whether nutch should fetch all or not
#   column 4: number of crawl iterations


# NOT TOP SITES, BUT SITES WE INSPECTED AND WANT TO CONTROL SIMILARLY TO TOP SITES
00.gs,SINGLEPAGE
# May be a large site with only seedURLs of real relevance
topographic-map.com,SINGLEPAGE
ami-media.net,SINGLEPAGE
# 2 pages of declarations of human rights in Maori, rest in other languages
anitra.net,SINGLEPAGE
# special case
mi.centr-zashity.ru,SINGLEPAGE

# we want the http://loquevendra318.com/fox/maori.html seed URL but also
# pages within the following subsection
loquevendra318.com,loquevendra318.com/fox/maori/

martinvrijland.nl,martinvrijland.nl/mi/
csunplugged.org,csunplugged.org/mi/
gpedia.com,gpedia.com/mi/

# TOP SITE BUT NOT TOP 500
www.tumblr.com,SINGLEPAGE


# TOP SITES

# docs.google.com is a special case: not all pages are public and any interlinking is likely to
# be intentional. Grab all linked pages, for link depth set with nutch's crawl, as long as the
# links are within the given topsite-base-url
docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE

# Just crawl a single page for these:
drive.google.com,SINGLEPAGE
forms.office.com,SINGLEPAGE
player.vimeo.com,SINGLEPAGE
static-promote.weebly.com,SINGLEPAGE

# Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
# The page's containing folder is whitelisted in case the photos are there.
korora.econ.yale.edu,SINGLEPAGE

# Special case of justus.anglican.org - no meaningful seedURLs directly on anglican.org
# but justus.anglican.org has them
#anglican.org,justus.anglican.org

000webhost.com
360.cn
4shared.com
a8.net
abc.es
abc.net.au
abcnews.go.com
about.com
about.me
aboutads.info
abril.com.br
academia.edu
accuweather.com
addthis.com
addtoany.com
adobe.com
adweek.com
airbnb.com
akamaihd.net
alexa.com
alibaba.com
aliexpress.com
alipay.com
aljazeera.com
allaboutcookies.org
allrecipes.com
amazon.ca
amazon.co.jp
amazon.co.uk
amazon.com
amazon.de
amazon.es
amazon.fr
amazon.in
ameblo.jp
ampproject.org
android.com
aol.com
ap.org
apache.org
apachefriends.org
apple.com
archive.org
archives.gov
arstechnica.com
arxiv.org
asahi.com
ask.fm
asus.com
axs.com
babytree.com
baidu.com
bandcamp.com
bbc.co.uk
bbc.com
behance.net
berkeley.edu
biblegateway.com
biglobe.ne.jp
billboard.com
bing.com
bit.ly
bitly.com
blackberry.com
blogger.com
blogspot.com,SUBDOMAIN-COPY
bloomberg.com
booking.com
boston.com
box.com
britannica.com
bt.com
bund.de
businessinsider.com
businesswire.com
buydomains.com
buzzfeed.com
ca.gov
cambridge.org
canalblog.com
cbc.ca
cbslocal.com
cbsnews.com
cdc.gov
change.org
channel4.com
chicagotribune.com
chinadaily.com.cn
cisco.com
clickbank.net
cloudflare.com
cmu.edu
cnbc.com
cnet.com
cnn.com
cocolog-nifty.com
columbia.edu
connect.over-blog.com
cornell.edu
corriere.it
cpanel.com
cpanel.net
creativecommons.org
csdn.net
csmonitor.com
dailymail.co.uk
dailymotion.com
dan.com
daum.net
debian.org
dell.com
depositfiles.com
detik.com
digg.com
discovery.com
disney.com
disney.go.com
disqus.com
doubleclick.net
dreniq.com
dribbble.com
dropbox.com,SINGLEPAGE
dropboxusercontent.com
dw.com
e-recht24.de
ea.com
ebay.co.uk
ebay.com
economist.com
eff.org
ehow.com
elmundo.es
elpais.com
engadget.com
entrepreneur.com
eonline.com
espn.com
espn.go.com
etsy.com
europa.eu
eventbrite.com
example.com
excite.co.jp
express.co.uk
facebook.com
fandom.com
fastcompany.com
fb.com
fb.me
fda.gov
fedoraproject.org
feedburner.com
fifa.com
files.wordpress.com
flickr.com
forbes.com
fortune.com
foursquare.com
foxnews.com
ft.com
ftc.gov
gen.xyz
geocities.jp
gesetze-im-internet.de
ggpht.com
github.com
gizmodo.com
globo.com
gmail.com
gnu.org
godaddy.com
gofundme.com
goo.gl
goo.ne.jp
goodreads.com
google.ca
google.co.id
google.co.in
google.co.jp
google.co.uk
google.com
google.com.br
google.com.hk
google.com.tr
google.de
google.es
google.fr
google.it
google.nl
google.pl
google.ru
googleapis.com
googleblog.com
googleusercontent.com
gooyaabitemplates.com
gov.uk
gravatar.com
greenpeace.org
gstatic.com
guardian.co.uk
harvard.edu
hatena.ne.jp
histats.com
hm.com
hollywoodreporter.com
home.pl
house.gov
howstuffworks.com
hp.com
huffingtonpost.com
huffpost.com
hugedomains.com
ibm.com
ibtimes.com
icann.org
ieee.org
ietf.org
ig.com.br
ign.com
ikea.com
imageshack.us
imdb.com
imgur.com
inc.com
independent.co.uk
indiatimes.com
indiegogo.com
instagram.com
instructables.com
intel.com
interia.pl
issuu.com
istockphoto.com
iubenda.com
jd.com
joomla.org
jquery.com
jstor.org
kickstarter.com
kinja.com
last.fm
latimes.com
lefigaro.fr
lemonde.fr
line.me
linkedin.com
list-manage.com
live.com
livejournal.com
livescience.com
loc.gov
lonelyplanet.com
lycos.com
m.wikipedia.org,mi.m.wikipedia.org
mail.ru
marketwatch.com
marriott.com
mashable.com
mediafire.com
medium.com
mega.nz
megaupload.com
mercurynews.com
merriam-webster.com
metro.co.uk
microsoft.com,microsoft.com/mi-nz/
microsoftonline.com
mirror.co.uk
mit.edu
mixcloud.com
mlb.com
mozilla.com
mozilla.org
msn.com
myspace.com
mysql.com
namecheap.com
narod.ru
nasa.gov
nationalgeographic.com
nature.com
naver.com
naver.jp
nba.com
nbcnews.com
ndtv.com
netflix.com
netsons.com
netvibes.com
networkadvertising.org
news.com.au
newscientist.com
newsweek.com
newyorker.com
nginx.com
nginx.org
nhk.or.jp
nicovideo.jp
nifty.com
nih.gov
nikkei.com
noaa.gov
nokia.com
npr.org
nvidia.com
nydailynews.com
nypost.com
nytimes.com
nyu.edu
odnoklassniki.ru
office.com
offset.com
ok.ru
okezone.com
opera.com
oracle.com
orange.fr
oreilly.com
oup.com
over-blog.com
ovh.co.uk
ovh.com
ovh.net
ox.ac.uk
parallels.com
pastebin.com
paypal.com
pbs.org
pcmag.com
people.com
photobucket.com
php.net
pinterest.com,SINGLEPAGE
pixabay.com
playstation.com
plesk.com
plos.org
politico.com
prestashop.com
prezi.com
princeton.edu
privacyshield.gov
prnewswire.com
psychologytoday.com
qq.com
quantcast.com
quora.com
rakuten.co.jp
rambler.ru
rapidshare.com
reddit.com
repubblica.it
researchgate.net
reuters.com
ria.ru
rottentomatoes.com
rt.com
rtve.es
sakura.ne.jp
samsung.com
sapo.pt
scholastic.com
sciencedaily.com
sciencedirect.com
sciencemag.org
scientificamerican.com
scribd.com
seattletimes.com
secureserver.net
sedo.com
seesaa.net
sendspace.com
sfgate.com
shopify.com
shutterstock.com
siemens.com
sina.com.cn
sky.com
skype.com
skyrock.com
slate.com
slideshare.net
sm.cn
smh.com.au
so-net.ne.jp
softonic.com
sogou.com
sohu.com
soratemplates.com
soso.com
soundcloud.com
spiegel.de
spotify.com
springer.com
sputniknews.com
ssl-images-amazon.com
stackoverflow.com
standard.co.uk
stanford.edu
state.gov
steamcommunity.com
steampowered.com
storage.canalblog.com
storage.googleapis.com
stores.jp
storify.com
stuff.co.nz,SINGLEPAGE
surveymonkey.com
symantec.com
t-online.de
t.co
t.me
tabelog.com
taobao.com
target.com
teamviewer.com
techcrunch.com
ted.com
telegram.me
telegraph.co.uk
terra.com.br
theatlantic.com
thefreedictionary.com
theglobeandmail.com
theguardian.com
themeforest.net
thenextweb.com
thestar.com
thesun.co.uk
thetimes.co.uk
theverge.com
thoughtco.com
tianya.cn
time.com
tinyurl.com
tmall.com
tmz.com
tribunnews.com
tripadvisor.com
trustpilot.com
twitch.tv
twitter.com
ucoz.ru
uiuc.edu
umich.edu
un.org
undeveloped.com
unesco.org
uol.com.br
urbandictionary.com
usa.gov
usatoday.com
usgs.gov
usnews.com
uspto.gov
ustream.tv
utexas.edu
variety.com
venturebeat.com
vice.com
viglink.com
vimeo.com
vk.com
vkontakte.ru
vox.com
w3.org
w3schools.com
wa.me
walmart.com
washington.edu
washingtonpost.com
wattpad.com
weather.com
web.fc2.com
webmd.com
weebly.com
weibo.com
welt.de
whatsapp.com
whitehouse.gov
who.int
wikia.com
wikihow.com
wikimedia.org
wikipedia.org,mi.wikipedia.org
wiktionary.org,mi.wiktionary.org
wiley.com
windowsphone.com
wired.com
wix.com
wordpress.org,SUBDOMAIN-COPY
worldbank.org
wp.com
wsj.com
xbox.com
xinhuanet.com
yadi.sk
yahoo.co.jp
yahoo.com
yale.edu
yandex.ru
yelp.com
youku.com
youronlinechoices.com
youtu.be
youtube.com
ytimg.com
zdnet.com
zend.com
zendesk.com
zippyshare.com
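
The value rules described in the file's header comment can be sketched in Python as below. This is only an illustration of the documented behaviour, not the project's actual implementation: the function names (parse_topsites, host_of, resolve), the returned labels ("crawl", "unprocessed", "no-match") and the SAMPLE entries are hypothetical. Two assumptions go beyond what the header states: a seed URL is matched on its host name (equal to the topsite-base-url or a subdomain of it) rather than by a plain substring test, and when several entries match, the longest one wins, so that docs.google.com takes precedence over google.com. The sketch returns the restriction as a plain string; how it is rendered into the regex urlfilter file is not shown.

```python
# A handful of entries in the same <topsite-base-url>,<value> form as the listing.
SAMPLE = """\
# comment lines and blank lines are skipped

docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE
blogspot.com,SUBDOMAIN-COPY
wikipedia.org,mi.wikipedia.org
dropbox.com,SINGLEPAGE
google.com
"""


def parse_topsites(lines):
    """Map each topsite-base-url to its value ('' when no value is given)."""
    topsites = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        base, _, value = line.partition(",")
        topsites[base.strip()] = value.strip()
    return topsites


def host_of(seed_url):
    """Drop the protocol and any path, keeping the lower-cased host name."""
    without_protocol = seed_url.split("://", 1)[-1]
    return without_protocol.split("/", 1)[0].lower()


def resolve(seed_url, topsites):
    """Return (label, restriction) for a seed URL.

    'unprocessed' stands for "goes into unprocessed-topsite-matches.txt";
    'no-match' means the seed URL is not covered by this file at all.
    """
    host = host_of(seed_url)
    # Assumption: match on host name, not on a raw substring of the URL.
    matches = [b for b in topsites if host == b or host.endswith("." + b)]
    if not matches:
        return ("no-match", None)
    base = max(matches, key=len)  # assumption: the most specific entry wins
    value = topsites[base]
    if value == "":
        return ("unprocessed", None)
    if value == "SINGLEPAGE":
        return ("crawl", seed_url)  # restrict to just this page
    if value == "SUBDOMAIN-COPY":
        if host == base:  # exact match on the top site itself
            return ("unprocessed", None)
        return ("crawl", host)  # e.g. pinky.blogspot.com
    if value == "FOLLOW-LINKS-WITHIN-TOPSITE":
        return ("crawl", base)  # stay within the topsite-base-url
    return ("crawl", value)  # <url-form-without-protocol>


topsites = parse_topsites(SAMPLE.splitlines())
for url in ("http://pinky.blogspot.com/post",
            "https://mi.wikipedia.org/SomePage",
            "https://www.google.com/search"):
    print(url, "->", resolve(url, topsites))
```

Matching on the host name avoids false positives that a literal "contains" check would produce for very short entries such as t.co, which is a substring of many unrelated host names.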