# Mapping of top sites in base url forms to value
# This file contains sites that are too large to crawl exhaustively.
#
# The domains are from Alexa top sites (where only the first 50 were visible).
# Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
# Finally also added https://moz.com/top500 by downloading its CSV file and
# adding its URLs to the existing listing here from alexa/wiki.
# Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
# Then in Gedit, used regex search and replace to remove ..ext variants, keeping
# just .ext
# And finally, re-sorted the reduced list alphabetically and pasted into here.
#
# FORMAT OF THIS FILE'S CONTENTS:
#    topsite-base-url,value
# where value is one of
#    empty, SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE, SINGLEPAGE, url-form-without-protocol
#
#   - if value is left empty: if seedurl contains topsite-base-url, the seedurl will go into the
#     file unprocessed-topsite-matches.txt and the site/page won't be crawled.
#     The user will be notified to inspect the file unprocessed-topsite-matches.txt.
#   - SINGLEPAGE: if seedurl matches topsite-base-url, then only download the page at that seedurl.
#     For example, if the seedurl is http://drive.google.com/some-long-suffix-in-base64, then it
#     matches the topsite-base-url of drive.google.com and its value of SINGLEPAGE will add the
#     seedurl itself as the regex url-filter, to restrict the crawl to just the specified page.
#   - SUBDOMAIN-COPY: if seedurl CONTAINS topsite-base-url, then the seedurl's subdomain
#     (or else its domain) will make up the urlfilter, so we don't leak out into a larger domain.
#     Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if seedurl is
#     pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
#     will ensure we restrict crawling to pages on pinky.blogspot.com.
#     However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
#     into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
#   - FOLLOW-LINKS-WITHIN-TOPSITE: download the seedURL pages, and also follow and download the
#     pages linked from each seedURL page, as long as they're within the same subdomain
#     matching the topsite-base-url.
#     This is different from SUBDOMAIN-COPY, as that can download all of a specific subdomain but
#     restricts against downloading the entire domain (e.g. all pinky.blogspot.com and not anything
#     else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (at
#     depth specified for the nutch crawl) as long as they're within the topsite-base-url.
#     e.g. seedURLs on docs.google.com containing links will have those linked pages and any
#     they link to etc. downloaded as long as they're on docs.google.com.
#   - url-form-without-protocol: if a seedurl contains topsite-base-url, then the provided
#     url-form-without-protocol will make up the urlfilter, again preventing leaking into a
#     larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
#     match the topsite-base-url of wikipedia.org, for which the url-form-without-protocol
#     value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
#     url-form-without-protocol ends up in the regex urlfilter file, thereby restricting the
#     crawl to just mi.wikipedia.org.
#     Remember to leave out any protocol (such as http://) from this value.
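#
# ILLUSTRATIVE SKETCH (an assumption, not taken from the crawl scripts): how entries in this
# file could translate into accept rules in nutch's regex urlfilter file, where a line that
# starts with + accepts URLs matching the regex that follows. The regexes actually generated
# may differ. The drive.google.com and blogspot.com seedurls shown are hypothetical.
#
#   entry:     blogspot.com,SUBDOMAIN-COPY
#   seedurl:   http://pinky.blogspot.com/
#   urlfilter: +^https?://pinky\.blogspot\.com/
#
#   entry:     wikipedia.org,mi.wikipedia.org
#   seedurl:   http://mi.wikipedia.org/SomePage
#   urlfilter: +^https?://mi\.wikipedia\.org/
#
#   entry:     docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE
#   seedurl:   http://docs.google.com/some-long-suffix-in-base64
#   urlfilter: +^https?://docs\.google\.com/
#
#   entry:     drive.google.com,SINGLEPAGE
#   seedurl:   http://drive.google.com/some-long-suffix-in-base64
#   urlfilter: +^https?://drive\.google\.com/some-long-suffix-in-base64$
#
#   entry:     blogspot.com,SUBDOMAIN-COPY   (seedurl's domain is an exact match)
#   seedurl:   http://blogspot.com/
#   result:    no urlfilter line; seedurl goes into unprocessed-topsite-matches.txt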
#
# TODO If useful:
#   column 3: whether nutch should do fetch all or not
#   column 4: number of crawl iterations
#
# NOT TOP SITES, BUT SITES WE INSPECTED AND WANT TO CONTROL SIMILARLY TO TOP SITES
00.gs,SINGLEPAGE
# May be a large site with only seedURLs of real relevance
topographic-map.com,SINGLEPAGE
ami-media.net,SINGLEPAGE
# 2 pages of declarations of human rights in Maori, rest in other languages
anitra.net,SINGLEPAGE
# special case
mi.centr-zashity.ru,SINGLEPAGE
# we want the http://loquevendra318.com/fox/maori.html seed URL but also
# pages within the following subsection
loquevendra318.com,loquevendra318.com/fox/maori/
martinvrijland.nl,martinvrijland.nl/mi/
csunplugged.org,csunplugged.org/mi/
gpedia.com,gpedia.com/mi/
# TOP SITE BUT NOT TOP 500
www.tumblr.com,SINGLEPAGE
# TOP SITES
# docs.google.com is a special case: not all pages are public and any interlinking is likely to
# be intentional. Grab all linked pages, for link depth set with nutch's crawl, as long as the
# links are within the given topsite-base-url
docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE
# Just crawl a single page for these:
drive.google.com,SINGLEPAGE
forms.office.com,SINGLEPAGE
player.vimeo.com,SINGLEPAGE
static-promote.weebly.com,SINGLEPAGE
# Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
# The page's containing folder is whitelisted in case the photos are there.
korora.econ.yale.edu,SINGLEPAGE
000webhost.com
360.cn
4shared.com
a8.net
abc.es
abc.net.au
abcnews.go.com
about.com
about.me
aboutads.info
abril.com.br
academia.edu
accuweather.com
addthis.com
addtoany.com
adobe.com
adweek.com
airbnb.com
akamaihd.net
alexa.com
alibaba.com
aliexpress.com
alipay.com
aljazeera.com
allaboutcookies.org
allrecipes.com
amazon.ca
amazon.co.jp
amazon.co.uk
amazon.com
amazon.de
amazon.es
amazon.fr
amazon.in
ameblo.jp
ampproject.org
android.com
aol.com
ap.org
apache.org
apachefriends.org
apple.com
archive.org
archives.gov
arstechnica.com
arxiv.org
asahi.com
ask.fm
asus.com
axs.com
babytree.com
baidu.com
bandcamp.com
bbc.co.uk
bbc.com
behance.net
berkeley.edu
biblegateway.com
biglobe.ne.jp
billboard.com
bing.com
bit.ly
bitly.com
blackberry.com
blogger.com
blogspot.com,SUBDOMAIN-COPY
bloomberg.com
booking.com
boston.com
box.com
britannica.com
bt.com
bund.de
businessinsider.com
businesswire.com
buydomains.com
buzzfeed.com
ca.gov
cambridge.org
canalblog.com
cbc.ca
cbslocal.com
cbsnews.com
cdc.gov
change.org
channel4.com
chicagotribune.com
chinadaily.com.cn
cisco.com
clickbank.net
cloudflare.com
cmu.edu
cnbc.com
cnet.com
cnn.com
cocolog-nifty.com
columbia.edu
connect.over-blog.com
cornell.edu
corriere.it
cpanel.com
cpanel.net
creativecommons.org
csdn.net
csmonitor.com
dailymail.co.uk
dailymotion.com
dan.com
daum.net
debian.org
dell.com
depositfiles.com
detik.com
digg.com
discovery.com
disney.com
disney.go.com
disqus.com
doubleclick.net
dreniq.com
dribbble.com
dropbox.com,SINGLEPAGE
dropboxusercontent.com
dw.com
e-recht24.de
ea.com
ebay.co.uk
ebay.com
economist.com
eff.org
ehow.com
elmundo.es
elpais.com
engadget.com
entrepreneur.com
eonline.com
espn.com
espn.go.com
etsy.com
europa.eu
eventbrite.com
example.com
excite.co.jp
express.co.uk
facebook.com
fandom.com
fastcompany.com
fb.com
fb.me
fda.gov
fedoraproject.org
feedburner.com
fifa.com
files.wordpress.com
flickr.com
forbes.com
fortune.com
foursquare.com
foxnews.com
ft.com
ftc.gov
gen.xyz
geocities.jp
gesetze-im-internet.de
ggpht.com
github.com
gizmodo.com
globo.com
gmail.com
gnu.org
godaddy.com
gofundme.com
goo.gl
goo.ne.jp
goodreads.com
google.ca
google.co.id
google.co.in
google.co.jp
google.co.uk
google.com
google.com.br
google.com.hk
google.com.tr
google.de
google.es
google.fr
google.it
google.nl
google.pl
google.ru
googleapis.com
googleblog.com
googleusercontent.com
gooyaabitemplates.com
gov.uk
gravatar.com
greenpeace.org
gstatic.com
guardian.co.uk
harvard.edu
hatena.ne.jp
histats.com
hm.com
hollywoodreporter.com
home.pl
house.gov
howstuffworks.com
hp.com
huffingtonpost.com
huffpost.com
hugedomains.com
ibm.com
ibtimes.com
icann.org
ieee.org
ietf.org
ig.com.br
ign.com
ikea.com
imageshack.us
imdb.com
imgur.com
inc.com
independent.co.uk
indiatimes.com
indiegogo.com
instagram.com
instructables.com
intel.com
interia.pl
issuu.com
istockphoto.com
iubenda.com
jd.com
joomla.org
jquery.com
jstor.org
kickstarter.com
kinja.com
last.fm
latimes.com
lefigaro.fr
lemonde.fr
line.me
linkedin.com
list-manage.com
live.com
livejournal.com
livescience.com
loc.gov
lonelyplanet.com
lycos.com
m.wikipedia.org,mi.m.wikipedia.org
mail.ru
marketwatch.com
marriott.com
mashable.com
mediafire.com
medium.com
mega.nz
megaupload.com
mercurynews.com
merriam-webster.com
metro.co.uk
microsoft.com,microsoft.com/mi-nz/
microsoftonline.com
mirror.co.uk
mit.edu
mixcloud.com
mlb.com
mozilla.com
mozilla.org
msn.com
myspace.com
mysql.com
namecheap.com
narod.ru
nasa.gov
nationalgeographic.com
nature.com
naver.com
naver.jp
nba.com
nbcnews.com
ndtv.com
netflix.com
netsons.com
netvibes.com
networkadvertising.org
news.com.au
newscientist.com
newsweek.com
newyorker.com
nginx.com
nginx.org
nhk.or.jp
nicovideo.jp
nifty.com
nih.gov
nikkei.com
noaa.gov
nokia.com
npr.org
nvidia.com
nydailynews.com
nypost.com
nytimes.com
nyu.edu
odnoklassniki.ru
office.com
offset.com
ok.ru
okezone.com
opera.com
oracle.com
orange.fr
oreilly.com
oup.com
over-blog.com
ovh.co.uk
ovh.com
ovh.net
ox.ac.uk
parallels.com
pastebin.com
paypal.com
pbs.org
pcmag.com
people.com
photobucket.com
php.net
pinterest.com,SINGLEPAGE
pixabay.com
playstation.com
plesk.com
plos.org
politico.com
prestashop.com
prezi.com
princeton.edu
privacyshield.gov
prnewswire.com
psychologytoday.com
qq.com
quantcast.com
quora.com
rakuten.co.jp
rambler.ru
rapidshare.com
reddit.com
repubblica.it
researchgate.net
reuters.com
ria.ru
rottentomatoes.com
rt.com
rtve.es
sakura.ne.jp
samsung.com
sapo.pt
scholastic.com
sciencedaily.com
sciencedirect.com
sciencemag.org
scientificamerican.com
scribd.com
seattletimes.com
secureserver.net
sedo.com
seesaa.net
sendspace.com
sfgate.com
shopify.com
shutterstock.com
siemens.com
sina.com.cn
sky.com
skype.com
skyrock.com
slate.com
slideshare.net
sm.cn
smh.com.au
so-net.ne.jp
softonic.com
sogou.com
sohu.com
soratemplates.com
soso.com
soundcloud.com
spiegel.de
spotify.com
springer.com
sputniknews.com
ssl-images-amazon.com
stackoverflow.com
standard.co.uk
stanford.edu
state.gov
steamcommunity.com
steampowered.com
storage.canalblog.com
storage.googleapis.com
stores.jp
storify.com
stuff.co.nz,SINGLEPAGE
surveymonkey.com
symantec.com
t-online.de
t.co
t.me
tabelog.com
taobao.com
target.com
teamviewer.com
techcrunch.com
ted.com
telegram.me
telegraph.co.uk
terra.com.br
theatlantic.com
thefreedictionary.com
theglobeandmail.com
theguardian.com
themeforest.net
thenextweb.com
thestar.com
thesun.co.uk
thetimes.co.uk
theverge.com
thoughtco.com
tianya.cn
time.com
tinyurl.com
tmall.com
tmz.com
tribunnews.com
tripadvisor.com
trustpilot.com
twitch.tv
twitter.com
ucoz.ru
uiuc.edu
umich.edu
un.org
undeveloped.com
unesco.org
uol.com.br
urbandictionary.com
usa.gov
usatoday.com
usgs.gov
usnews.com
uspto.gov
ustream.tv
utexas.edu
variety.com
venturebeat.com
vice.com
viglink.com
vimeo.com
vk.com
vkontakte.ru
vox.com
w3.org
w3schools.com
wa.me
walmart.com
washington.edu
washingtonpost.com
wattpad.com
weather.com
web.fc2.com
webmd.com
weebly.com
weibo.com
welt.de
whatsapp.com
whitehouse.gov
who.int
wikia.com
wikihow.com
wikimedia.org
wikipedia.org,mi.wikipedia.org
wiktionary.org,mi.wiktionary.org
wiley.com
windowsphone.com
wired.com
wix.com
wordpress.org,SUBDOMAIN-COPY
worldbank.org
wp.com
wsj.com
xbox.com
xinhuanet.com
yadi.sk
yahoo.co.jp
yahoo.com
yale.edu
yandex.ru
yelp.com
youku.com
youronlinechoices.com
youtu.be
youtube.com
ytimg.com
zdnet.com
zend.com
zendesk.com
zippyshare.com