source: gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt@ 33569

Last change on this file since 33569 was 33569, checked in by ak19, 5 years ago
  1. batchcrawl.sh now does what it should have from the start, which is to move the log.out and UNFINISHED files into the output folder instead of leaving them in the input folder, as the input to_crawl folder can and does get replaced all the time, every time I regenerate it after black/white/greylisting more urls. 2. Blacklisted more adult sites, greylisted more product sites and .ru, .pl and .tk domains with whitelisting in the whitelist file. 3. CCWETProcessor now looks out for additional adult sites based on URL and adds them to its blacklist in memory (not the file) and logs the domain for checking and manually adding to the blacklist file.
File size: 10.9 KB
# Mapping of top sites, in base URL form, to a value

# This file contains sites that are too large to crawl exhaustively.
# The domains are from Alexa top sites (where only the first 50 were visible).
# Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
# Finally also added https://moz.com/top500 by downloading its CSV file and
# adding its URLs to the existing listing here from alexa/wiki.
# Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
# just <site>.ext
# And finally, re-sorted the reduced list alphabetically and pasted it in here.

# FORMAT OF THIS FILE'S CONTENTS:
#    <topsite-base-url>,<value>
# where <value> is one of
#    empty, SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE, SINGLEPAGE, <url-form-without-protocol>
#
# - if value is left empty: if seedurl contains topsite-base-url, the seedurl will go into the
#   file unprocessed-topsite-matches.txt and the site/page won't be crawled.
#   The user will be notified to inspect the file unprocessed-topsite-matches.txt.
# - SINGLEPAGE: if seedurl matches topsite-base-url, then only download the page at that seedurl.
#   For example, if the seedurl is http://drive.google.com/some-long-suffix-in-base64, then it
#   matches the topsite-base-url of drive.google.com and its value of SINGLEPAGE will add the
#   seedurl itself as the regex url-filter, to restrict the crawl to just the specified page.
# - SUBDOMAIN-COPY: if seedurl CONTAINS topsite-base-url, then the seedurl's own subdomain
#   (or else domain) will make up the urlfilter, so we don't leak out into a larger domain.
#   Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if seedurl is
#   pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
#   will ensure we restrict crawling to pages on pinky.blogspot.com.
#   However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
#   into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
# - FOLLOW-LINKS-WITHIN-TOPSITE: pages linked from the seedURL page can be followed and
#   downloaded, as long as they're within the same subdomain matching the topsite-base-url.
#   This is different from SUBDOMAIN-COPY, as that can download all of a specific subdomain but
#   restricts against downloading the entire domain (e.g. all pinky.blogspot.com and not anything
#   else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (at
#   depth specified for the nutch crawl) as long as they're within the topsite-base-url.
#   e.g. seedURLs on docs.google.com containing links will have those linked pages and any
#   they link to etc. downloaded as long as they're on docs.google.com.
# - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
#   url-form-without-protocol will make up the urlfilter, again preventing leaking into a
#   larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
#   match the topsite-base-url of wikipedia.org for which the <url-form-without-protocol>
#   value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
#   <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
#   crawl to just mi.wikipedia.org.
#   Remember to leave out any protocol from <url-form-without-protocol>.
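#
# The value rules above can be sketched in Python roughly as follows. This is only an
# illustration of the rules as described in this file, not the project's actual code:
# the names parse_topsites and decide and the labels UNPROCESSED, FILTER and NOT-TOPSITE
# are hypothetical, "contains" is assumed to mean "host equals the base url or is a
# subdomain of it", and the longest matching base url is assumed to win. How the returned
# url form gets written into the regex urlfilter file is left out.

```python
UNPROCESSED = "UNPROCESSED"   # hypothetical label: log to unprocessed-topsite-matches.txt
FILTER = "FILTER"             # hypothetical label: crawl, restricted to the returned url form
NOT_TOPSITE = "NOT-TOPSITE"   # hypothetical label: seedurl matched no entry in this file


def parse_topsites(lines):
    """Read '<topsite-base-url>,<value>' lines into a dict, skipping blanks and # comments."""
    topsites = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        base, _, value = line.partition(",")
        topsites[base.strip().lower()] = value.strip()
    return topsites


def decide(seedurl, topsites):
    """Return (label, url_form) for one seed URL, following the value rules above."""
    without_protocol = seedurl.split("://", 1)[-1]
    host = without_protocol.split("/", 1)[0].lower()
    # Assumption: try the longest base first, so docs.google.com wins over google.com.
    for base in sorted(topsites, key=len, reverse=True):
        # Assumption: "contains" is read as "host equals base or is a subdomain of it".
        if host != base and not host.endswith("." + base):
            continue
        value = topsites[base]
        if value == "":
            return (UNPROCESSED, None)
        if value == "SINGLEPAGE":
            return (FILTER, without_protocol)   # only the seed page itself
        if value == "SUBDOMAIN-COPY":
            if host == base:                    # exact match on the top site: don't crawl
                return (UNPROCESSED, None)
            return (FILTER, host)               # e.g. pinky.blogspot.com
        if value == "FOLLOW-LINKS-WITHIN-TOPSITE":
            return (FILTER, base)               # stay within e.g. docs.google.com
        return (FILTER, value)                  # <url-form-without-protocol>
    return (NOT_TOPSITE, None)
```

# For instance, with the line "blogspot.com,SUBDOMAIN-COPY" loaded through parse_topsites,
# decide("http://pinky.blogspot.com/a.html", topsites) gives ("FILTER", "pinky.blogspot.com"),
# while a seed URL on blogspot.com itself comes back as ("UNPROCESSED", None).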
#
# TODO If useful:
#   column 3: whether nutch should do fetch all or not
#   column 4: number of crawl iterations


# NOT TOP SITES, BUT SITES WE INSPECTED AND WANT TO CONTROL SIMILARLY TO TOP SITES
00.gs,SINGLEPAGE
# May be a large site with only seedURLs of real relevance
topographic-map.com,SINGLEPAGE
ami-media.net,SINGLEPAGE
# 2 pages of declarations of human rights in Maori, rest in other languages
anitra.net,SINGLEPAGE
# special case
mi.centr-zashity.ru,SINGLEPAGE

# TOP SITE BUT NOT TOP 500
www.tumblr.com,SINGLEPAGE


# TOP SITES

# docs.google.com is a special case: not all pages are public and any interlinking is likely to
# be intentional. Grab all linked pages, for link depth set with nutch's crawl, as long as the
# links are within the given topsite-base-url
docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE

# Just crawl a single page for these:
drive.google.com,SINGLEPAGE
forms.office.com,SINGLEPAGE
player.vimeo.com,SINGLEPAGE
static-promote.weebly.com,SINGLEPAGE

# Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
# The page's containing folder is whitelisted in case the photos are there.
korora.econ.yale.edu,SINGLEPAGE


000webhost.com
360.cn
4shared.com
a8.net
abc.es
abc.net.au
abcnews.go.com
about.com
about.me
aboutads.info
abril.com.br
academia.edu
accuweather.com
addthis.com
addtoany.com
adobe.com
adweek.com
airbnb.com
akamaihd.net
alexa.com
alibaba.com
aliexpress.com
alipay.com
aljazeera.com
allaboutcookies.org
allrecipes.com
amazon.ca
amazon.co.jp
amazon.co.uk
amazon.com
amazon.de
amazon.es
amazon.fr
amazon.in
ameblo.jp
ampproject.org
android.com
aol.com
ap.org
apache.org
apachefriends.org
apple.com
archive.org
archives.gov
arstechnica.com
arxiv.org
asahi.com
ask.fm
asus.com
axs.com
babytree.com
baidu.com
bandcamp.com
bbc.co.uk
bbc.com
behance.net
berkeley.edu
biblegateway.com
biglobe.ne.jp
billboard.com
bing.com
bit.ly
bitly.com
blackberry.com
blogger.com
blogspot.com,SUBDOMAIN-COPY
bloomberg.com
booking.com
boston.com
box.com
britannica.com
bt.com
bund.de
businessinsider.com
businesswire.com
buydomains.com
buzzfeed.com
ca.gov
cambridge.org
canalblog.com
cbc.ca
cbslocal.com
cbsnews.com
cdc.gov
change.org
channel4.com
chicagotribune.com
chinadaily.com.cn
cisco.com
clickbank.net
cloudflare.com
cmu.edu
cnbc.com
cnet.com
cnn.com
cocolog-nifty.com
columbia.edu
connect.over-blog.com
cornell.edu
corriere.it
cpanel.com
cpanel.net
creativecommons.org
csdn.net
csmonitor.com
dailymail.co.uk
dailymotion.com
dan.com
daum.net
debian.org
dell.com
depositfiles.com
detik.com
digg.com
discovery.com
disney.com
disney.go.com
disqus.com
doubleclick.net
dreniq.com
dribbble.com
dropbox.com,SINGLEPAGE
dropboxusercontent.com
dw.com
e-recht24.de
ea.com
ebay.co.uk
ebay.com
economist.com
eff.org
ehow.com
elmundo.es
elpais.com
engadget.com
entrepreneur.com
eonline.com
espn.com
espn.go.com
etsy.com
europa.eu
eventbrite.com
example.com
excite.co.jp
express.co.uk
facebook.com
fandom.com
fastcompany.com
fb.com
fb.me
fda.gov
fedoraproject.org
feedburner.com
fifa.com
files.wordpress.com
flickr.com
forbes.com
fortune.com
foursquare.com
foxnews.com
ft.com
ftc.gov
gen.xyz
geocities.jp
gesetze-im-internet.de
ggpht.com
github.com
gizmodo.com
globo.com
gmail.com
gnu.org
godaddy.com
gofundme.com
goo.gl
goo.ne.jp
goodreads.com
google.ca
google.co.id
google.co.in
google.co.jp
google.co.uk
google.com
google.com.br
google.com.hk
google.com.tr
google.de
google.es
google.fr
google.it
google.nl
google.pl
google.ru
googleapis.com
googleblog.com
googleusercontent.com
gooyaabitemplates.com
gov.uk
gravatar.com
greenpeace.org
gstatic.com
guardian.co.uk
harvard.edu
hatena.ne.jp
histats.com
hm.com
hollywoodreporter.com
home.pl
house.gov
howstuffworks.com
hp.com
huffingtonpost.com
huffpost.com
hugedomains.com
ibm.com
ibtimes.com
icann.org
ieee.org
ietf.org
ig.com.br
ign.com
ikea.com
imageshack.us
imdb.com
imgur.com
inc.com
independent.co.uk
indiatimes.com
indiegogo.com
instagram.com
instructables.com
intel.com
interia.pl
issuu.com
istockphoto.com
iubenda.com
jd.com
joomla.org
jquery.com
jstor.org
kickstarter.com
kinja.com
last.fm
latimes.com
lefigaro.fr
lemonde.fr
line.me
linkedin.com
list-manage.com
live.com
livejournal.com
livescience.com
loc.gov
lonelyplanet.com
lycos.com
m.wikipedia.org,mi.m.wikipedia.org
mail.ru
marketwatch.com
marriott.com
mashable.com
mediafire.com
medium.com
mega.nz
megaupload.com
mercurynews.com
merriam-webster.com
metro.co.uk
microsoft.com,microsoft.com/mi-nz/
microsoftonline.com
mirror.co.uk
mit.edu
mixcloud.com
mlb.com
mozilla.com
mozilla.org
msn.com
myspace.com
mysql.com
namecheap.com
narod.ru
nasa.gov
nationalgeographic.com
nature.com
naver.com
naver.jp
nba.com
nbcnews.com
ndtv.com
netflix.com
netsons.com
netvibes.com
networkadvertising.org
news.com.au
newscientist.com
newsweek.com
newyorker.com
nginx.com
nginx.org
nhk.or.jp
nicovideo.jp
nifty.com
nih.gov
nikkei.com
noaa.gov
nokia.com
npr.org
nvidia.com
nydailynews.com
nypost.com
nytimes.com
nyu.edu
odnoklassniki.ru
office.com
offset.com
ok.ru
okezone.com
opera.com
oracle.com
orange.fr
oreilly.com
oup.com
over-blog.com
ovh.co.uk
ovh.com
ovh.net
ox.ac.uk
parallels.com
pastebin.com
paypal.com
pbs.org
pcmag.com
people.com
photobucket.com
php.net
pinterest.com,SINGLEPAGE
pixabay.com
playstation.com
plesk.com
plos.org
politico.com
prestashop.com
prezi.com
princeton.edu
privacyshield.gov
prnewswire.com
psychologytoday.com
qq.com
quantcast.com
quora.com
rakuten.co.jp
rambler.ru
rapidshare.com
reddit.com
repubblica.it
researchgate.net
reuters.com
ria.ru
rottentomatoes.com
rt.com
rtve.es
sakura.ne.jp
samsung.com
sapo.pt
scholastic.com
sciencedaily.com
sciencedirect.com
sciencemag.org
scientificamerican.com
scribd.com
seattletimes.com
secureserver.net
sedo.com
seesaa.net
sendspace.com
sfgate.com
shopify.com
shutterstock.com
siemens.com
sina.com.cn
sky.com
skype.com
skyrock.com
slate.com
slideshare.net
sm.cn
smh.com.au
so-net.ne.jp
softonic.com
sogou.com
sohu.com
soratemplates.com
soso.com
soundcloud.com
spiegel.de
spotify.com
springer.com
sputniknews.com
ssl-images-amazon.com
stackoverflow.com
standard.co.uk
stanford.edu
state.gov
steamcommunity.com
steampowered.com
storage.canalblog.com
storage.googleapis.com
stores.jp
storify.com
stuff.co.nz,SINGLEPAGE
surveymonkey.com
symantec.com
t-online.de
t.co
t.me
tabelog.com
taobao.com
target.com
teamviewer.com
techcrunch.com
ted.com
telegram.me
telegraph.co.uk
terra.com.br
theatlantic.com
thefreedictionary.com
theglobeandmail.com
theguardian.com
themeforest.net
thenextweb.com
thestar.com
thesun.co.uk
thetimes.co.uk
theverge.com
thoughtco.com
tianya.cn
time.com
tinyurl.com
tmall.com
tmz.com
tribunnews.com
tripadvisor.com
trustpilot.com
twitch.tv
twitter.com
ucoz.ru
uiuc.edu
umich.edu
un.org
undeveloped.com
unesco.org
uol.com.br
urbandictionary.com
usa.gov
usatoday.com
usgs.gov
usnews.com
uspto.gov
ustream.tv
utexas.edu
variety.com
venturebeat.com
vice.com
viglink.com
vimeo.com
vk.com
vkontakte.ru
vox.com
w3.org
w3schools.com
wa.me
walmart.com
washington.edu
washingtonpost.com
wattpad.com
weather.com
web.fc2.com
webmd.com
weebly.com
weibo.com
welt.de
whatsapp.com
whitehouse.gov
who.int
wikia.com
wikihow.com
wikimedia.org
wikipedia.org,mi.wikipedia.org
wiktionary.org,mi.wiktionary.org
wiley.com
windowsphone.com
wired.com
wix.com
wordpress.org,SUBDOMAIN-COPY
worldbank.org
wp.com
wsj.com
xbox.com
xinhuanet.com
yadi.sk
yahoo.co.jp
yahoo.com
yale.edu
yandex.ru
yelp.com
youku.com
youronlinechoices.com
youtu.be
youtube.com
ytimg.com
zdnet.com
zend.com
zendesk.com
zippyshare.com