root/other-projects/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt @ 33666

Revision 33666, 11.2 KB (checked in by ak19, 2 months ago)

Having finished sending all the crawl data to MongoDB:
1. Recrawled the 2 sites which I had earlier noted required recrawling, 00152 and 00332. 00152 required changes to how it needed to be crawled: MP3 files needed to be blocked, as there were HBase error messages about key values being too large.
2. Modified the regex-urlfilter.GS_TEMPLATE file to block mp3 files in general for future crawls too (in the location of the file where jpg etc. were already blocked by nutch's default regex url filters).
3. Further had to restrict the 00152 site to be crawled only under its /maori/ sub-domain. Since the seedURL maori.html was not off a /maori/ url, this revealed that the CCWETProcessor code didn't yet allow filters to okay seedURLs in the case where the crawl was controlled to run over a subdomain (as expressed in the conf/sites-too-big-to-exhaustively-crawl file) but the seedURL didn't match those controlling regex filters. So now, in such cases, CCWETProcessor adds the non-matching seedURLs to the filters too (so we get just the single page of each such seedURL), besides a filter on the requested subdomain, so that we follow all pages linked by the seedURLs that do match the subdomain expression.
4. Added to_crawl.tar.gz to svn: the tarball of the to_crawl sites that I actually ran nutch over, i.e. all the site folders with their seedURL.txt and regex-urlfilter.txt files that batchcrawl.sh runs over. This didn't use the latest version of the sites folder and blacklist/whitelist files generated by CCWETProcessor, since the latest version was regenerated after the final modifications to CCWETProcessor, which came after crawling was finished. But to_crawl.tar.gz does have a manually modified 00152, with the correct regex-urlfilter file, and uses the newer regex-urlfilter.GS_TEMPLATE file that blocks mp3 files.
5. crawledNode6.tar.gz now contains the dump output for sites 00152 and 00332, which were crawled on node6 today (after which their processed dump.txt results were added into MongoDB).
7. MoreReading/mongodb.txt now contains the results of some queries I ran against the total nutch-crawled data.

# Mapping of top sites in base url forms to value

# This file contains sites that are too large to crawl exhaustively.
# The domains are from Alexa top sites (where only the first 50 were visible).
# Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
# Finally also added https://moz.com/top500 by downloading its CSV file and
# adding its URLs to the existing listing here from alexa/wiki.
# Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
# just <site>.ext
# And finally, re-sorted the reduced list alphabetically and pasted it in here.

# FORMAT OF THIS FILE'S CONTENTS:
#    <topsite-base-url>,<value>
# where <value> is empty or one of
#    SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE, SINGLEPAGE, <url-form-without-protocol>
#
#   - if value is left empty: if the seedurl contains topsite-base-url, the seedurl goes into the
#     file unprocessed-topsite-matches.txt and the site/page won't be crawled.
#     The user will be notified to inspect the file unprocessed-topsite-matches.txt.
#   - SINGLEPAGE: if the seedurl matches topsite-base-url, then only the page at that seedurl is
#     downloaded. For example, if the seedurl is http://docs.google.com/some-long-suffix-in-base64,
#     it matches the topsite-base-url docs.google.com, whose SINGLEPAGE value adds the
#     seedurl itself as the regex url-filter, restricting the crawl to just the specified page.
#   - SUBDOMAIN-COPY: if the seedurl CONTAINS topsite-base-url, then the seedurl's subdomain
#     (or, failing that, its domain) makes up the urlfilter, so we don't leak out into a larger
#     domain. Use SUBDOMAIN-COPY to restrict the crawl to a domain prefix/subdomain. For example,
#     if the seedurl is pinky.blogspot.com, it matches the topsite-base-url blogspot.com, but
#     SUBDOMAIN-COPY ensures we restrict crawling to pages on pinky.blogspot.com.
#     However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl goes
#     into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
#   - FOLLOW-LINKS-WITHIN-TOPSITE: download the seedURL pages; pages linked from each seedURL
#     page are followed and downloaded too, as long as they're within the same subdomain
#     matching the topsite-base-url.
#     This differs from SUBDOMAIN-COPY, which can download all of a specific subdomain but
#     prevents downloading the entire domain (e.g. all of pinky.blogspot.com but nothing
#     else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (to
#     the depth specified for the nutch crawl) as long as they're within the topsite-base-url.
#     e.g. seedURLs on docs.google.com containing links will have those linked pages, and any
#     pages they link to, etc., downloaded as long as they're on docs.google.com.
#   - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
#     url-form-without-protocol makes up the urlfilter, again preventing leaking into a
#     larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it
#     matches the topsite-base-url wikipedia.org, whose <url-form-without-protocol>
#     value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
#     <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
#     crawl to just mi.wikipedia.org.
#     Remember to leave the protocol out of <url-form-without-protocol>.
#
# TODO if useful:
#   column 3: whether nutch should do fetch all or not
#   column 4: number of crawl iterations

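# The value rules above can be sketched in code. The following is a hypothetical
# Python illustration (NOT the actual CCWETProcessor implementation; function names
# and the dict-based lookup are assumptions for clarity) of how one entry in this
# file maps a seed URL to a url-filter, or to "skip" for unprocessed matches:

```python
# Hypothetical sketch of the topsite value rules described above.
# Not the real CCWETProcessor code; names here are illustrative only.
from urllib.parse import urlparse

def load_topsites(path):
    """Parse <topsite-base-url>,<value> lines; a missing value is stored as ""."""
    topsites = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            base, _, value = line.partition(",")
            topsites[base] = value
    return topsites

def url_filter_for(seedurl, topsites):
    """Return the url-filter string for a seed URL, or None if it must be skipped
    (i.e. it would go into unprocessed-topsite-matches.txt)."""
    no_proto = seedurl.split("://", 1)[-1]          # drop any protocol
    host = urlparse("http://" + no_proto).netloc    # seedurl's (sub)domain
    for base, value in topsites.items():
        if base not in no_proto:
            continue
        if value == "SINGLEPAGE":
            return no_proto          # restrict crawl to just this one page
        if value == "SUBDOMAIN-COPY":
            if host == base:         # exact domain match: leave uncrawled
                return None
            return host              # e.g. pinky.blogspot.com
        if value == "FOLLOW-LINKS-WITHIN-TOPSITE":
            return base              # follow links, but stay within topsite
        if value == "":
            return None              # empty value: notify user, don't crawl
        return value                 # explicit <url-form-without-protocol>
    return no_proto                  # not a topsite: crawl as usual
```

# For example, with the blogspot.com,SUBDOMAIN-COPY entry below, a seed of
# http://pinky.blogspot.com/page would yield the filter pinky.blogspot.com.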
# NOT TOP SITES, BUT SITES WE INSPECTED AND WANT TO CONTROL SIMILARLY TO TOP SITES
00.gs,SINGLEPAGE
# May be a large site with only the seedURLs of real relevance
topographic-map.com,SINGLEPAGE
ami-media.net,SINGLEPAGE
# 2 pages of declarations of human rights in Maori, the rest in other languages
anitra.net,SINGLEPAGE
# special case
mi.centr-zashity.ru,SINGLEPAGE

# we want the http://loquevendra318.com/fox/maori.html seed URL but also
# pages within the following subsection
loquevendra318.com,loquevendra318.com/fox/maori/

martinvrijland.nl,martinvrijland.nl/mi/
csunplugged.org,csunplugged.org/mi/
gpedia.com,gpedia.com/mi/

# TOP SITE BUT NOT TOP 500
www.tumblr.com,SINGLEPAGE


# TOP SITES

# docs.google.com is a special case: not all pages are public and any interlinking is likely to
# be intentional. Grab all linked pages, to the link depth set for nutch's crawl, as long as the
# links are within the given topsite-base-url
docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE

# Just crawl a single page for these:
drive.google.com,SINGLEPAGE
forms.office.com,SINGLEPAGE
player.vimeo.com,SINGLEPAGE
static-promote.weebly.com,SINGLEPAGE

# Special case of yale.edu: its Rapa-Nui pages are on the blacklist, but we want this page + its
# photos. The page's containing folder is whitelisted in case the photos are there.
korora.econ.yale.edu,SINGLEPAGE

000webhost.com
360.cn
4shared.com
a8.net
abc.es
abc.net.au
abcnews.go.com
about.com
about.me
aboutads.info
abril.com.br
academia.edu
accuweather.com
addthis.com
addtoany.com
adobe.com
adweek.com
airbnb.com
akamaihd.net
alexa.com
alibaba.com
aliexpress.com
alipay.com
aljazeera.com
allaboutcookies.org
allrecipes.com
amazon.ca
amazon.co.jp
amazon.co.uk
amazon.com
amazon.de
amazon.es
amazon.fr
amazon.in
ameblo.jp
ampproject.org
android.com
aol.com
ap.org
apache.org
apachefriends.org
apple.com
archive.org
archives.gov
arstechnica.com
arxiv.org
asahi.com
ask.fm
asus.com
axs.com
babytree.com
baidu.com
bandcamp.com
bbc.co.uk
bbc.com
behance.net
berkeley.edu
biblegateway.com
biglobe.ne.jp
billboard.com
bing.com
bit.ly
bitly.com
blackberry.com
blogger.com
blogspot.com,SUBDOMAIN-COPY
bloomberg.com
booking.com
boston.com
box.com
britannica.com
bt.com
bund.de
businessinsider.com
businesswire.com
buydomains.com
buzzfeed.com
ca.gov
cambridge.org
canalblog.com
cbc.ca
cbslocal.com
cbsnews.com
cdc.gov
change.org
channel4.com
chicagotribune.com
chinadaily.com.cn
cisco.com
clickbank.net
cloudflare.com
cmu.edu
cnbc.com
cnet.com
cnn.com
cocolog-nifty.com
columbia.edu
connect.over-blog.com
cornell.edu
corriere.it
cpanel.com
cpanel.net
creativecommons.org
csdn.net
csmonitor.com
dailymail.co.uk
dailymotion.com
dan.com
daum.net
debian.org
dell.com
depositfiles.com
detik.com
digg.com
discovery.com
disney.com
disney.go.com
disqus.com
doubleclick.net
dreniq.com
dribbble.com
dropbox.com,SINGLEPAGE
dropboxusercontent.com
dw.com
e-recht24.de
ea.com
ebay.co.uk
ebay.com
economist.com
eff.org
ehow.com
elmundo.es
elpais.com
engadget.com
entrepreneur.com
eonline.com
espn.com
espn.go.com
etsy.com
europa.eu
eventbrite.com
example.com
excite.co.jp
express.co.uk
facebook.com
fandom.com
fastcompany.com
fb.com
fb.me
fda.gov
fedoraproject.org
feedburner.com
fifa.com
files.wordpress.com
flickr.com
forbes.com
fortune.com
foursquare.com
foxnews.com
ft.com
ftc.gov
gen.xyz
geocities.jp
gesetze-im-internet.de
ggpht.com
github.com
gizmodo.com
globo.com
gmail.com
gnu.org
godaddy.com
gofundme.com
goo.gl
goo.ne.jp
goodreads.com
google.ca
google.co.id
google.co.in
google.co.jp
google.co.uk
google.com
google.com.br
google.com.hk
google.com.tr
google.de
google.es
google.fr
google.it
google.nl
google.pl
google.ru
googleapis.com
googleblog.com
googleusercontent.com
gooyaabitemplates.com
gov.uk
gravatar.com
greenpeace.org
gstatic.com
guardian.co.uk
harvard.edu
hatena.ne.jp
histats.com
hm.com
hollywoodreporter.com
home.pl
house.gov
howstuffworks.com
hp.com
huffingtonpost.com
huffpost.com
hugedomains.com
ibm.com
ibtimes.com
icann.org
ieee.org
ietf.org
ig.com.br
ign.com
ikea.com
imageshack.us
imdb.com
imgur.com
inc.com
independent.co.uk
indiatimes.com
indiegogo.com
instagram.com
instructables.com
intel.com
interia.pl
issuu.com
istockphoto.com
iubenda.com
jd.com
joomla.org
jquery.com
jstor.org
kickstarter.com
kinja.com
last.fm
latimes.com
lefigaro.fr
lemonde.fr
line.me
linkedin.com
list-manage.com
live.com
livejournal.com
livescience.com
loc.gov
lonelyplanet.com
lycos.com
m.wikipedia.org,mi.m.wikipedia.org
mail.ru
marketwatch.com
marriott.com
mashable.com
mediafire.com
medium.com
mega.nz
megaupload.com
mercurynews.com
merriam-webster.com
metro.co.uk
microsoft.com,microsoft.com/mi-nz/
microsoftonline.com
mirror.co.uk
mit.edu
mixcloud.com
mlb.com
mozilla.com
mozilla.org
msn.com
myspace.com
mysql.com
namecheap.com
narod.ru
nasa.gov
nationalgeographic.com
nature.com
naver.com
naver.jp
nba.com
nbcnews.com
ndtv.com
netflix.com
netsons.com
netvibes.com
networkadvertising.org
news.com.au
newscientist.com
newsweek.com
newyorker.com
nginx.com
nginx.org
nhk.or.jp
nicovideo.jp
nifty.com
nih.gov
nikkei.com
noaa.gov
nokia.com
npr.org
nvidia.com
nydailynews.com
nypost.com
nytimes.com
nyu.edu
odnoklassniki.ru
office.com
offset.com
ok.ru
okezone.com
opera.com
oracle.com
orange.fr
oreilly.com
oup.com
over-blog.com
ovh.co.uk
ovh.com
ovh.net
ox.ac.uk
parallels.com
pastebin.com
paypal.com
pbs.org
pcmag.com
people.com
photobucket.com
php.net
pinterest.com,SINGLEPAGE
pixabay.com
playstation.com
plesk.com
plos.org
politico.com
prestashop.com
prezi.com
princeton.edu
privacyshield.gov
prnewswire.com
psychologytoday.com
qq.com
quantcast.com
quora.com
rakuten.co.jp
rambler.ru
rapidshare.com
reddit.com
repubblica.it
researchgate.net
reuters.com
ria.ru
rottentomatoes.com
rt.com
rtve.es
sakura.ne.jp
samsung.com
sapo.pt
scholastic.com
sciencedaily.com
sciencedirect.com
sciencemag.org
scientificamerican.com
scribd.com
seattletimes.com
secureserver.net
sedo.com
seesaa.net
sendspace.com
sfgate.com
shopify.com
shutterstock.com
siemens.com
sina.com.cn
sky.com
skype.com
skyrock.com
slate.com
slideshare.net
sm.cn
smh.com.au
so-net.ne.jp
softonic.com
sogou.com
sohu.com
soratemplates.com
soso.com
soundcloud.com
spiegel.de
spotify.com
springer.com
sputniknews.com
ssl-images-amazon.com
stackoverflow.com
standard.co.uk
stanford.edu
state.gov
steamcommunity.com
steampowered.com
storage.canalblog.com
storage.googleapis.com
stores.jp
storify.com
stuff.co.nz,SINGLEPAGE
surveymonkey.com
symantec.com
t-online.de
t.co
t.me
tabelog.com
taobao.com
target.com
teamviewer.com
techcrunch.com
ted.com
telegram.me
telegraph.co.uk
terra.com.br
theatlantic.com
thefreedictionary.com
theglobeandmail.com
theguardian.com
themeforest.net
thenextweb.com
thestar.com
thesun.co.uk
thetimes.co.uk
theverge.com
thoughtco.com
tianya.cn
time.com
tinyurl.com
tmall.com
tmz.com
tribunnews.com
tripadvisor.com
trustpilot.com
twitch.tv
twitter.com
ucoz.ru
uiuc.edu
umich.edu
un.org
undeveloped.com
unesco.org
uol.com.br
urbandictionary.com
usa.gov
usatoday.com
usgs.gov
usnews.com
uspto.gov
ustream.tv
utexas.edu
variety.com
venturebeat.com
vice.com
viglink.com
vimeo.com
vk.com
vkontakte.ru
vox.com
w3.org
w3schools.com
wa.me
walmart.com
washington.edu
washingtonpost.com
wattpad.com
weather.com
web.fc2.com
webmd.com
weebly.com
weibo.com
welt.de
whatsapp.com
whitehouse.gov
who.int
wikia.com
wikihow.com
wikimedia.org
wikipedia.org,mi.wikipedia.org
wiktionary.org,mi.wiktionary.org
wiley.com
windowsphone.com
wired.com
wix.com
wordpress.org,SUBDOMAIN-COPY
worldbank.org
wp.com
wsj.com
xbox.com
xinhuanet.com
yadi.sk
yahoo.co.jp
yahoo.com
yale.edu
yandex.ru
yelp.com
youku.com
youronlinechoices.com
youtu.be
youtube.com
ytimg.com
zdnet.com
zend.com
zendesk.com
zippyshare.com