source: other-projects/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt@ 33666

Last change on this file since 33666 was 33666, checked in by ak19, 4 years ago

Having finished sending all the crawl data to mongodb 1. Recrawled the 2 sites which I had earlier noted required recrawling 00152, 00332. 00152 required changes to how it needed to be crawled. MP3 files needed to be blocked, as there were HBase error messages about key values being too large. 2. Modified the regex-urlfilter.GS_TEMPLATE file for this to block mp3 files in general for future crawls too (in the location of the file where jpg etc were already blocked by nutch's default regex url filters). 3. Further had to control the 00152 site to only be crawled under its /maori/ sub-domain. Since the seedURL maori.html was not off a /maori/ url, this revealed that the CCWETProcessor code didn't already consider allowing filters to okay seedURLs even where the crawl was controlled to run over a subdomain (as expressed in conf/sites-too-big-to-exhaustively-crawl file) but where the seedURL didn't match these controlled regex filters. So now, in such cases, the CCWETProcessor adds seedURLs that don't match to the filters too (so we get just the single file of the seedURL pages) besides a filter on the requested subdomain, so we follow all pages linked by the seedURLs that match the subdomain expression. 4. Adding to_crawl.tar.gz to svn, the tarball of the sites to_crawl that I actually ran nutch over, of all the sites folders with their seedURL.txt and regex-urlfilter.txt files that the batchcrawl.sh runs over. This didn't use the latest version of the sites folder and blacklist/whitelist files generated by CCWETProcessor, since the latest version was regenerated after the final modifications to CCWETProcessor which was after crawling was finished. But to_crawl.tar.gz does have a manually modified 00152, wit the correct regex-urlfilter file and uses the newer regex-urlfilter.GS_TEMPLATE file that blocks mp3 files. 5. crawledNode6.tar.gz now contains the dump output for sites 00152 and 00332, which were crawled on node6 today (after which their processed dump.txt file results were added into MongoDB). 7. MoreReading/mongodb.txt now contains the results of some queries I ran against the total nutch-crawled data.

File size: 11.2 KB
RevLine 
[33559]1# Mapping of top sites in base url forms to value
[33550]2
[33559]3# This file contains sites that are too large to crawl exhaustively.
4# The domains are from Alexa top sites (where only the first 50 were visible)
[33551]5# Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
[33553]6# Finally also added https://moz.com/top500 by downloading its CSV file and
7# adding its URLs to the existing listing here from alexa/wiki.
[33551]8# Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
[33553]9# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
10# just <site>.ext
11# And finally, re-sorted the reduced list alphabetically and pasted into here.
[33550]12
[33559]13# FORMAT OF THIS FILE'S CONTENTS:
[33561]14# <topsite-base-url>,<value>
[33562]15# where <value> can or is one of
16# empty, SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE, SINGLEPAGE, <url-form-without-protocol>
[33559]17#
[33562]18# - if value is left empty: if seedurl contains topsite-base-url, the seedurl will go into the
19# file unprocessed-topsite-matches.txt and the site/page won't be crawled.
[33559]20# The user will be notified to inspect the file unprocessed-topsite-matches.txt.
21# - SINGLEPAGE: if seedurl matches topsite-base-url, then only download the page at that seedurl.
22# For example, if the seedurl is http://docs.google.com/some-long-suffix-in-base64, then it
23# matches the topsite-base-url of docs.google.com and its value of SINGLEPAGE will add the
24# seedurl itself as the regex url-filter, to restrict the crawl to just the specified page.
25# - SUBDOMAIN-COPY: if seedurl CONTAINS topsite-base-url, then whatever the seedurl's subdomain
26# or else domain is, will make up the urlfilter, so we don't leak out into a larger domain.
27# Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if seedurl is
28# pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
29# will ensure we restrict crawling to pages on pinky.blogspot.com.
30# However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
31# into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
[33666]32# - FOLLOW-LINKS-WITHIN-TOPSITE: download seedURL pages and pages linked from each seedURL
33# page should be followed and downloaded too, as long as they're within the same subdomain
34# matching the topsite-base-url.
[33561]35# This is different from SUBDOMAIN-COPY, as that can download all of a specific subdomain but
36# restricts against downloading the entire domain (e.g. all pinky.blogspot.com and not anything
37# else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (at
38# depth specified for the nutch crawl) as long as they're within the topsite-base-url.
39# e.g. seedURLs on docs.google.com containing links will have those linked pages and any
40# they link to etc. downloaded as long as they're on docs.google.com.
[33559]41# - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
42# url-form-without-protocol will make up the urlfilter, again preventing leaking into a
43# larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
44# match the topsite-base-url of wikipedia.org for which the <url-form-without-protocol>
45# value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
46# <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
47# crawl to just mi.wikipedia.org.
48# Remember to leave out any protocol <from url-form-without-protocol>.
[33562]49#
50# TODO If useful:
51# column 3: whether nutch should do fetch all or not
52# column 4: number of crawl iterations
[33551]53
[33565]54
55# NOT TOP SITES, BUT SITES WE INSPECTED AND WANT TO CONTROL SIMILARLY TO TOP SITES
5600.gs,SINGLEPAGE
[33569]57# May be a large site with only seedURLs of real relevance
[33568]58topographic-map.com,SINGLEPAGE
[33569]59ami-media.net,SINGLEPAGE
60# 2 pages of declarations of human rights in Maori, rest in other languages
61anitra.net,SINGLEPAGE
62# special case
63mi.centr-zashity.ru,SINGLEPAGE
[33565]64
[33666]65# we want the http://loquevendra318.com/fox/maori.html seed URL but also
66# pages within the following subsection
67loquevendra318.com,loquevendra318.com/fox/maori/
68
[33604]69martinvrijland.nl,martinvrijland.nl/mi/
70csunplugged.org,csunplugged.org/mi/
71gpedia.com,gpedia.com/mi/
72
[33569]73# TOP SITE BUT NOT TOP 500
74www.tumblr.com,SINGLEPAGE
75
76
[33565]77# TOP SITES
78
[33561]79# docs.google.com is a special case: not all pages are public and any interlinking is likely to
[33562]80# be intentional. Grab all linked pages, for link depth set with nutch's crawl, as long as the
81# links are within the given topsite-base-url
[33561]82docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE
[33559]83
[33562]84# Just crawl a single page for these:
[33561]85drive.google.com,SINGLEPAGE
86forms.office.com,SINGLEPAGE
87player.vimeo.com,SINGLEPAGE
88static-promote.weebly.com,SINGLEPAGE
[33559]89
90# Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
91# The page's containing folder is whitelisted in case the photos are there.
[33562]92korora.econ.yale.edu,SINGLEPAGE
[33559]93
[33569]94
[33551]95000webhost.com
[33550]96360.cn
[33551]974shared.com
98a8.net
99abc.es
100abc.net.au
101abcnews.go.com
102about.com
103about.me
104aboutads.info
105abril.com.br
106academia.edu
[33550]107accuweather.com
[33551]108addthis.com
109addtoany.com
110adobe.com
[33555]111adweek.com
[33551]112airbnb.com
113akamaihd.net
114alexa.com
115alibaba.com
[33550]116aliexpress.com
117alipay.com
[33551]118aljazeera.com
119allaboutcookies.org
120allrecipes.com
[33555]121amazon.ca
122amazon.co.jp
123amazon.co.uk
124amazon.com
125amazon.de
126amazon.es
127amazon.fr
128amazon.in
129ameblo.jp
[33551]130ampproject.org
131android.com
132aol.com
133ap.org
134apache.org
135apachefriends.org
[33550]136apple.com
[33551]137archive.org
[33555]138archives.gov
[33551]139arstechnica.com
140arxiv.org
141asahi.com
142ask.fm
143asus.com
144axs.com
[33550]145babytree.com
146baidu.com
[33551]147bandcamp.com
148bbc.co.uk
149bbc.com
[33555]150behance.net
[33551]151berkeley.edu
152biblegateway.com
153biglobe.ne.jp
154billboard.com
[33550]155bing.com
[33551]156bit.ly
[33550]157bitly.com
[33551]158blackberry.com
159blogger.com
[33561]160blogspot.com,SUBDOMAIN-COPY
[33551]161bloomberg.com
162booking.com
[33555]163boston.com
[33551]164box.com
165britannica.com
166bt.com
167bund.de
168businessinsider.com
169businesswire.com
170buydomains.com
171buzzfeed.com
172ca.gov
173cambridge.org
[33555]174canalblog.com
[33551]175cbc.ca
[33555]176cbslocal.com
[33551]177cbsnews.com
178cdc.gov
179change.org
180channel4.com
181chicagotribune.com
[33555]182chinadaily.com.cn
[33551]183cisco.com
184clickbank.net
185cloudflare.com
[33555]186cmu.edu
[33551]187cnbc.com
188cnet.com
189cnn.com
190cocolog-nifty.com
191columbia.edu
[33555]192connect.over-blog.com
[33551]193cornell.edu
194corriere.it
195cpanel.com
196cpanel.net
197creativecommons.org
[33550]198csdn.net
[33551]199csmonitor.com
200dailymail.co.uk
201dailymotion.com
202dan.com
203daum.net
[33555]204debian.org
[33551]205dell.com
206depositfiles.com
207detik.com
208digg.com
[33555]209discovery.com
[33551]210disney.com
[33555]211disney.go.com
[33551]212disqus.com
213doubleclick.net
214dreniq.com
215dribbble.com
[33561]216dropbox.com,SINGLEPAGE
[33551]217dropboxusercontent.com
218dw.com
219e-recht24.de
220ea.com
221ebay.co.uk
[33550]222ebay.com
[33551]223economist.com
224eff.org
225ehow.com
226elmundo.es
227elpais.com
228engadget.com
229entrepreneur.com
230eonline.com
[33550]231espn.com
[33551]232espn.go.com
233etsy.com
234europa.eu
235eventbrite.com
236example.com
237excite.co.jp
238express.co.uk
[33550]239facebook.com
[33551]240fandom.com
241fastcompany.com
242fb.com
243fb.me
244fda.gov
245fedoraproject.org
246feedburner.com
247fifa.com
248files.wordpress.com
249flickr.com
250forbes.com
251fortune.com
252foursquare.com
253foxnews.com
254ft.com
255ftc.gov
256gen.xyz
257geocities.jp
258gesetze-im-internet.de
259ggpht.com
260github.com
261gizmodo.com
262globo.com
263gmail.com
264gnu.org
265godaddy.com
266gofundme.com
267goo.gl
268goo.ne.jp
269goodreads.com
[33555]270google.ca
271google.co.id
272google.co.in
273google.co.jp
274google.co.uk
275google.com
276google.com.br
277google.com.hk
278google.com.tr
279google.de
280google.es
281google.fr
282google.it
283google.nl
284google.pl
285google.ru
286googleapis.com
[33551]287googleblog.com
288googleusercontent.com
289gooyaabitemplates.com
290gov.uk
291gravatar.com
292greenpeace.org
293gstatic.com
294guardian.co.uk
295harvard.edu
296hatena.ne.jp
297histats.com
298hm.com
299hollywoodreporter.com
300home.pl
301house.gov
302howstuffworks.com
303hp.com
304huffingtonpost.com
305huffpost.com
306hugedomains.com
307ibm.com
308ibtimes.com
309icann.org
310ieee.org
311ietf.org
312ig.com.br
313ign.com
314ikea.com
315imageshack.us
316imdb.com
317imgur.com
318inc.com
319independent.co.uk
320indiatimes.com
321indiegogo.com
[33550]322instagram.com
[33555]323instructables.com
[33551]324intel.com
[33555]325interia.pl
[33551]326issuu.com
327istockphoto.com
328iubenda.com
[33550]329jd.com
[33551]330joomla.org
331jquery.com
332jstor.org
333kickstarter.com
334kinja.com
335last.fm
336latimes.com
337lefigaro.fr
338lemonde.fr
339line.me
340linkedin.com
341list-manage.com
[33550]342live.com
[33551]343livejournal.com
344livescience.com
345loc.gov
[33555]346lonelyplanet.com
[33551]347lycos.com
[33561]348m.wikipedia.org,mi.m.wikipedia.org
[33551]349mail.ru
350marketwatch.com
351marriott.com
352mashable.com
353mediafire.com
354medium.com
355mega.nz
[33555]356megaupload.com
[33551]357mercurynews.com
358merriam-webster.com
359metro.co.uk
[33561]360microsoft.com,microsoft.com/mi-nz/
[33550]361microsoftonline.com
[33551]362mirror.co.uk
363mit.edu
364mixcloud.com
365mlb.com
366mozilla.com
367mozilla.org
[33550]368msn.com
[33551]369myspace.com
370mysql.com
371namecheap.com
372narod.ru
373nasa.gov
374nationalgeographic.com
375nature.com
[33550]376naver.com
[33551]377naver.jp
[33555]378nba.com
[33551]379nbcnews.com
380ndtv.com
[33550]381netflix.com
[33551]382netsons.com
383netvibes.com
384networkadvertising.org
385news.com.au
386newscientist.com
387newsweek.com
[33555]388newyorker.com
[33551]389nginx.com
390nginx.org
391nhk.or.jp
392nicovideo.jp
393nifty.com
394nih.gov
395nikkei.com
396noaa.gov
397nokia.com
398npr.org
399nvidia.com
400nydailynews.com
401nypost.com
402nytimes.com
403nyu.edu
404odnoklassniki.ru
[33550]405office.com
[33555]406offset.com
[33550]407ok.ru
408okezone.com
[33551]409opera.com
410oracle.com
411orange.fr
412oreilly.com
413oup.com
414over-blog.com
415ovh.co.uk
416ovh.com
417ovh.net
418ox.ac.uk
419parallels.com
420pastebin.com
[33550]421paypal.com
[33551]422pbs.org
[33555]423pcmag.com
[33551]424people.com
425photobucket.com
426php.net
[33561]427pinterest.com,SINGLEPAGE
[33551]428pixabay.com
429playstation.com
430plesk.com
[33555]431plos.org
[33551]432politico.com
[33555]433prestashop.com
[33551]434prezi.com
435princeton.edu
436privacyshield.gov
437prnewswire.com
438psychologytoday.com
[33550]439qq.com
[33551]440quantcast.com
[33550]441quora.com
[33551]442rakuten.co.jp
443rambler.ru
444rapidshare.com
[33550]445reddit.com
[33551]446repubblica.it
[33555]447researchgate.net
[33551]448reuters.com
449ria.ru
450rottentomatoes.com
451rt.com
452rtve.es
[33555]453sakura.ne.jp
[33551]454samsung.com
455sapo.pt
[33555]456scholastic.com
[33551]457sciencedaily.com
458sciencedirect.com
459sciencemag.org
460scientificamerican.com
461scribd.com
462seattletimes.com
463secureserver.net
464sedo.com
465seesaa.net
466sendspace.com
467sfgate.com
468shopify.com
469shutterstock.com
470siemens.com
471sina.com.cn
472sky.com
473skype.com
474skyrock.com
[33555]475slate.com
[33551]476slideshare.net
[33550]477sm.cn
[33551]478smh.com.au
479so-net.ne.jp
480softonic.com
[33550]481sogou.com
482sohu.com
[33551]483soratemplates.com
[33550]484soso.com
[33551]485soundcloud.com
486spiegel.de
487spotify.com
488springer.com
489sputniknews.com
[33555]490ssl-images-amazon.com
[33550]491stackoverflow.com
[33555]492standard.co.uk
[33551]493stanford.edu
494state.gov
495steamcommunity.com
496steampowered.com
497storage.canalblog.com
[33555]498storage.googleapis.com
[33551]499stores.jp
500storify.com
[33561]501stuff.co.nz,SINGLEPAGE
[33551]502surveymonkey.com
503symantec.com
504t-online.de
[33550]505t.co
[33551]506t.me
507tabelog.com
[33550]508taobao.com
[33551]509target.com
[33555]510teamviewer.com
[33551]511techcrunch.com
512ted.com
513telegram.me
514telegraph.co.uk
515terra.com.br
[33555]516theatlantic.com
517thefreedictionary.com
[33551]518theglobeandmail.com
519theguardian.com
520themeforest.net
[33555]521thenextweb.com
[33551]522thestar.com
523thesun.co.uk
524thetimes.co.uk
525theverge.com
526thoughtco.com
[33550]527tianya.cn
[33551]528time.com
529tinyurl.com
[33550]530tmall.com
[33551]531tmz.com
[33550]532tribunnews.com
[33551]533tripadvisor.com
534trustpilot.com
[33550]535twitch.tv
536twitter.com
[33551]537ucoz.ru
538uiuc.edu
539umich.edu
540un.org
541undeveloped.com
542unesco.org
543uol.com.br
544urbandictionary.com
[33555]545usa.gov
[33551]546usatoday.com
547usgs.gov
548usnews.com
549uspto.gov
550ustream.tv
551utexas.edu
552variety.com
553venturebeat.com
554vice.com
555viglink.com
556vimeo.com
[33550]557vk.com
[33551]558vkontakte.ru
559vox.com
560w3.org
[33550]561w3schools.com
[33551]562wa.me
[33550]563walmart.com
[33551]564washington.edu
565washingtonpost.com
566wattpad.com
[33555]567weather.com
[33551]568web.fc2.com
569webmd.com
570weebly.com
[33550]571weibo.com
[33551]572welt.de
573whatsapp.com
574whitehouse.gov
575who.int
576wikia.com
577wikihow.com
578wikimedia.org
[33561]579wikipedia.org,mi.wikipedia.org
580wiktionary.org,mi.wiktionary.org
[33551]581wiley.com
582windowsphone.com
583wired.com
584wix.com
[33561]585wordpress.org,SUBDOMAIN-COPY
[33551]586worldbank.org
587wp.com
588wsj.com
589xbox.com
[33550]590xinhuanet.com
[33551]591yadi.sk
[33555]592yahoo.co.jp
[33550]593yahoo.com
[33551]594yale.edu
[33550]595yandex.ru
[33551]596yelp.com
597youku.com
598youronlinechoices.com
599youtu.be
[33550]600youtube.com
[33551]601ytimg.com
602zdnet.com
[33555]603zend.com
[33551]604zendesk.com
[33555]605zippyshare.com
Note: See TracBrowser for help on using the repository browser.