source: gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt@ 33562

Last change on this file since 33562 was 33562, checked in by ak19, 5 years ago
  1. The sites-too-big-to-exhaustively-crawl.txt is now a csv file of a semi-custom format, and the Java code now uses the Apache Commons CSV jar file (v1.7 for Java 8) to parse the contents thereof. 2. Tidied up code to reuse reference to ClassLoader.
# Mapping of top sites, in base URL form, to a value

# This file contains sites that are too large to crawl exhaustively.
# The domains are from Alexa top sites (where only the first 50 were visible).
# Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
# Finally, also added https://moz.com/top500 by downloading its CSV file and
# adding its URLs to the existing listing here from Alexa/Wikipedia.
# Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
# Then, in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
# just <site>.ext
# And finally, re-sorted the reduced list alphabetically and pasted it into here.

# FORMAT OF THIS FILE'S CONTENTS:
#    <topsite-base-url>,<value>
# where <value> is one of
#    empty, SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE, SINGLEPAGE, <url-form-without-protocol>
#
#   - if value is left empty: if the seedurl contains topsite-base-url, the seedurl will go into the
#     file unprocessed-topsite-matches.txt and the site/page won't be crawled.
#     The user will be notified to inspect the file unprocessed-topsite-matches.txt.
#   - SINGLEPAGE: if the seedurl matches topsite-base-url, then only download the page at that seedurl.
#     For example, if the seedurl is http://docs.google.com/some-long-suffix-in-base64, then it
#     matches the topsite-base-url of docs.google.com, and a value of SINGLEPAGE will add the
#     seedurl itself as the regex url-filter, to restrict the crawl to just the specified page.
#   - SUBDOMAIN-COPY: if the seedurl CONTAINS topsite-base-url, then the seedurl's subdomain
#     (or else its domain) will make up the urlfilter, so we don't leak out into a larger domain.
#     Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if the seedurl is
#     pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
#     will ensure we restrict crawling to pages on pinky.blogspot.com.
#     However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
#     into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
#   - FOLLOW-LINKS-WITHIN-TOPSITE: pages linked from the seedURL page can be followed and
#     downloaded, as long as they're within the same subdomain matching the topsite-base-url.
#     This is different from SUBDOMAIN-COPY, as that can download all of a specific subdomain but
#     restricts against downloading the entire domain (e.g. all of pinky.blogspot.com and not anything
#     else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (at the
#     depth specified for the nutch crawl) as long as they're within the topsite-base-url.
#     e.g. seedURLs on docs.google.com containing links will have those linked pages, and any
#     pages they link to etc., downloaded as long as they're on docs.google.com.
#   - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
#     url-form-without-protocol will make up the urlfilter, again preventing leaking into a
#     larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
#     match the topsite-base-url of wikipedia.org, for which the <url-form-without-protocol>
#     value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
#     <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
#     crawl to just mi.wikipedia.org.
#     Remember to leave out any protocol from <url-form-without-protocol>.
#
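# The <topsite-base-url>,<value> layout described above could be read along the following lines.
# This is only a minimal Python sketch for illustration: the project's own parser is Java code
# using Apache Commons CSV, and the names SAMPLE and parse_topsites are hypothetical.

```python
import csv

# A few rows in the format of this file (hypothetical sample, not the full list).
SAMPLE = """\
# comment lines and blank lines are skipped

docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE
blogspot.com,SUBDOMAIN-COPY
wikipedia.org,mi.wikipedia.org
yale.edu
"""


def parse_topsites(text):
    """Return (topsite_base_url, value) pairs in file order.

    The value is the empty string when the row has no second column.
    """
    # Drop blank lines and '#' comments before handing the rows to the csv module.
    data_lines = [
        line for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    entries = []
    for row in csv.reader(data_lines):
        base = row[0].strip()
        value = row[1].strip() if len(row) > 1 else ""
        entries.append((base, value))
    return entries


for base, value in parse_topsites(SAMPLE):
    print(base, "->", value or "(empty)")
```

# Keeping the pairs in file order, rather than in an unordered map, preserves the order in
# which the entries were written.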
# TODO If useful:
#   column 3: whether nutch should do fetch all or not
#   column 4: number of crawl iterations

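# The rules for the five kinds of <value> can be summarised as one decision function.
# This is a hedged Python sketch of the behaviour described in the comments above, not the
# project's Java implementation. The function name resolve, the "UNPROCESSED"/"FILTER" labels
# and the plain-string filter targets are assumptions made for illustration; the real code
# writes regex urlfilter entries whose exact form is not shown in this file.

```python
def resolve(seed_url, base, value):
    """Decide what to do with seed_url for one (base, value) topsite entry.

    Returns None if the seed URL does not contain the topsite base URL,
    ("UNPROCESSED", seed_url) if it belongs in unprocessed-topsite-matches.txt,
    or ("FILTER", target) where target is what the crawl is restricted to.
    """
    bare = seed_url.split("://", 1)[-1]      # seed URL without its protocol
    if base not in bare:
        return None
    host = bare.split("/", 1)[0]             # subdomain + domain of the seed URL

    if value == "":
        return ("UNPROCESSED", seed_url)     # left for the user to inspect
    if value == "SINGLEPAGE":
        return ("FILTER", bare)              # just the page at the seed URL
    if value == "SUBDOMAIN-COPY":
        if host == base:                     # exact match on the top site itself
            return ("UNPROCESSED", seed_url)
        return ("FILTER", host)              # e.g. pinky.blogspot.com
    if value == "FOLLOW-LINKS-WITHIN-TOPSITE":
        return ("FILTER", base)              # anything within topsite-base-url
    return ("FILTER", value)                 # <url-form-without-protocol>


print(resolve("http://pinky.blogspot.com/post.html", "blogspot.com", "SUBDOMAIN-COPY"))
print(resolve("https://mi.wikipedia.org/SomePage", "wikipedia.org", "mi.wikipedia.org"))
```

# The substring test mirrors the "contains" wording used above; a stricter host-based match
# would be a different design choice.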
# docs.google.com is a special case: not all pages are public and any interlinking is likely to
# be intentional. Grab all linked pages, to the link depth set for the nutch crawl, as long as the
# links are within the given topsite-base-url.
docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE

# Just crawl a single page for these:
drive.google.com,SINGLEPAGE
forms.office.com,SINGLEPAGE
player.vimeo.com,SINGLEPAGE
static-promote.weebly.com,SINGLEPAGE

# Special case of yale.edu: its Rapa-Nui pages are on the blacklist, but we want this page and
# its photos. The page's containing folder is whitelisted in case the photos are there.
korora.econ.yale.edu,SINGLEPAGE

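# A seed URL on korora.econ.yale.edu also contains yale.edu, which is listed further down with
# an empty value. The Python sketch below ASSUMES that entries are tried in file order and that
# the first match wins, which would explain why the special cases sit above the alphabetical
# list. This file does not confirm that lookup order; first_match and the sample URL path are
# hypothetical.

```python
# Two entries from this file, in the order in which they appear.
ENTRIES = [
    ("korora.econ.yale.edu", "SINGLEPAGE"),
    ("yale.edu", ""),
]


def first_match(seed_url, entries):
    """Return the first (base, value) entry whose base URL occurs in seed_url, else None."""
    bare = seed_url.split("://", 1)[-1]   # drop the protocol before comparing
    for base, value in entries:
        if base in bare:
            return (base, value)
    return None


# The special case is reached before the general yale.edu entry.
print(first_match("http://korora.econ.yale.edu/some/page.htm", ENTRIES))
# Any other yale.edu page falls through to the entry with the empty value.
print(first_match("https://www.yale.edu/", ENTRIES))
```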
000webhost.com
360.cn
4shared.com
a8.net
abc.es
abc.net.au
abcnews.go.com
about.com
about.me
aboutads.info
abril.com.br
academia.edu
accuweather.com
addthis.com
addtoany.com
adobe.com
adweek.com
airbnb.com
akamaihd.net
alexa.com
alibaba.com
aliexpress.com
alipay.com
aljazeera.com
allaboutcookies.org
allrecipes.com
amazon.ca
amazon.co.jp
amazon.co.uk
amazon.com
amazon.de
amazon.es
amazon.fr
amazon.in
ameblo.jp
ampproject.org
android.com
aol.com
ap.org
apache.org
apachefriends.org
apple.com
archive.org
archives.gov
arstechnica.com
arxiv.org
asahi.com
ask.fm
asus.com
axs.com
babytree.com
baidu.com
bandcamp.com
bbc.co.uk
bbc.com
behance.net
berkeley.edu
biblegateway.com
biglobe.ne.jp
billboard.com
bing.com
bit.ly
bitly.com
blackberry.com
blogger.com
blogspot.com,SUBDOMAIN-COPY
bloomberg.com
booking.com
boston.com
box.com
britannica.com
bt.com
bund.de
businessinsider.com
businesswire.com
buydomains.com
buzzfeed.com
ca.gov
cambridge.org
canalblog.com
cbc.ca
cbslocal.com
cbsnews.com
cdc.gov
change.org
channel4.com
chicagotribune.com
chinadaily.com.cn
cisco.com
clickbank.net
cloudflare.com
cmu.edu
cnbc.com
cnet.com
cnn.com
cocolog-nifty.com
columbia.edu
connect.over-blog.com
cornell.edu
corriere.it
cpanel.com
cpanel.net
creativecommons.org
csdn.net
csmonitor.com
dailymail.co.uk
dailymotion.com
dan.com
daum.net
debian.org
dell.com
depositfiles.com
detik.com
digg.com
discovery.com
disney.com
disney.go.com
disqus.com
doubleclick.net
dreniq.com
dribbble.com
dropbox.com,SINGLEPAGE
dropboxusercontent.com
dw.com
e-recht24.de
ea.com
ebay.co.uk
ebay.com
economist.com
eff.org
ehow.com
elmundo.es
elpais.com
engadget.com
entrepreneur.com
eonline.com
espn.com
espn.go.com
etsy.com
europa.eu
eventbrite.com
example.com
excite.co.jp
express.co.uk
facebook.com
fandom.com
fastcompany.com
fb.com
fb.me
fda.gov
fedoraproject.org
feedburner.com
fifa.com
files.wordpress.com
flickr.com
forbes.com
fortune.com
foursquare.com
foxnews.com
ft.com
ftc.gov
gen.xyz
geocities.jp
gesetze-im-internet.de
ggpht.com
github.com
gizmodo.com
globo.com
gmail.com
gnu.org
godaddy.com
gofundme.com
goo.gl
goo.ne.jp
goodreads.com
google.ca
google.co.id
google.co.in
google.co.jp
google.co.uk
google.com
google.com.br
google.com.hk
google.com.tr
google.de
google.es
google.fr
google.it
google.nl
google.pl
google.ru
googleapis.com
googleblog.com
googleusercontent.com
gooyaabitemplates.com
gov.uk
gravatar.com
greenpeace.org
gstatic.com
guardian.co.uk
harvard.edu
hatena.ne.jp
histats.com
hm.com
hollywoodreporter.com
home.pl
house.gov
howstuffworks.com
hp.com
huffingtonpost.com
huffpost.com
hugedomains.com
ibm.com
ibtimes.com
icann.org
ieee.org
ietf.org
ig.com.br
ign.com
ikea.com
imageshack.us
imdb.com
imgur.com
inc.com
independent.co.uk
indiatimes.com
indiegogo.com
instagram.com
instructables.com
intel.com
interia.pl
issuu.com
istockphoto.com
iubenda.com
jd.com
joomla.org
jquery.com
jstor.org
kickstarter.com
kinja.com
last.fm
latimes.com
lefigaro.fr
lemonde.fr
line.me
linkedin.com
list-manage.com
live.com
livejournal.com
livescience.com
loc.gov
lonelyplanet.com
lycos.com
m.wikipedia.org,mi.m.wikipedia.org
mail.ru
marketwatch.com
marriott.com
mashable.com
mediafire.com
medium.com
mega.nz
megaupload.com
mercurynews.com
merriam-webster.com
metro.co.uk
microsoft.com,microsoft.com/mi-nz/
microsoftonline.com
mirror.co.uk
mit.edu
mixcloud.com
mlb.com
mozilla.com
mozilla.org
msn.com
myspace.com
mysql.com
namecheap.com
narod.ru
nasa.gov
nationalgeographic.com
nature.com
naver.com
naver.jp
nba.com
nbcnews.com
ndtv.com
netflix.com
netsons.com
netvibes.com
networkadvertising.org
news.com.au
newscientist.com
newsweek.com
newyorker.com
nginx.com
nginx.org
nhk.or.jp
nicovideo.jp
nifty.com
nih.gov
nikkei.com
noaa.gov
nokia.com
npr.org
nvidia.com
nydailynews.com
nypost.com
nytimes.com
nyu.edu
odnoklassniki.ru
office.com
offset.com
ok.ru
okezone.com
opera.com
oracle.com
orange.fr
oreilly.com
oup.com
over-blog.com
ovh.co.uk
ovh.com
ovh.net
ox.ac.uk
parallels.com
pastebin.com
paypal.com
pbs.org
pcmag.com
people.com
photobucket.com
php.net
pinterest.com,SINGLEPAGE
pixabay.com
playstation.com
plesk.com
plos.org
politico.com
prestashop.com
prezi.com
princeton.edu
privacyshield.gov
prnewswire.com
psychologytoday.com
qq.com
quantcast.com
quora.com
rakuten.co.jp
rambler.ru
rapidshare.com
reddit.com
repubblica.it
researchgate.net
reuters.com
ria.ru
rottentomatoes.com
rt.com
rtve.es
sakura.ne.jp
samsung.com
sapo.pt
scholastic.com
sciencedaily.com
sciencedirect.com
sciencemag.org
scientificamerican.com
scribd.com
seattletimes.com
secureserver.net
sedo.com
seesaa.net
sendspace.com
sfgate.com
shopify.com
shutterstock.com
siemens.com
sina.com.cn
sky.com
skype.com
skyrock.com
slate.com
slideshare.net
sm.cn
smh.com.au
so-net.ne.jp
softonic.com
sogou.com
sohu.com
soratemplates.com
soso.com
soundcloud.com
spiegel.de
spotify.com
springer.com
sputniknews.com
ssl-images-amazon.com
stackoverflow.com
standard.co.uk
stanford.edu
state.gov
steamcommunity.com
steampowered.com
storage.canalblog.com
storage.googleapis.com
stores.jp
storify.com
stuff.co.nz,SINGLEPAGE
surveymonkey.com
symantec.com
t-online.de
t.co
t.me
tabelog.com
taobao.com
target.com
teamviewer.com
techcrunch.com
ted.com
telegram.me
telegraph.co.uk
terra.com.br
theatlantic.com
thefreedictionary.com
theglobeandmail.com
theguardian.com
themeforest.net
thenextweb.com
thestar.com
thesun.co.uk
thetimes.co.uk
theverge.com
thoughtco.com
tianya.cn
time.com
tinyurl.com
tmall.com
tmz.com
tribunnews.com
tripadvisor.com
trustpilot.com
twitch.tv
twitter.com
ucoz.ru
uiuc.edu
umich.edu
un.org
undeveloped.com
unesco.org
uol.com.br
urbandictionary.com
usa.gov
usatoday.com
usgs.gov
usnews.com
uspto.gov
ustream.tv
utexas.edu
variety.com
venturebeat.com
vice.com
viglink.com
vimeo.com
vk.com
vkontakte.ru
vox.com
w3.org
w3schools.com
wa.me
walmart.com
washington.edu
washingtonpost.com
wattpad.com
weather.com
web.fc2.com
webmd.com
weebly.com
weibo.com
welt.de
whatsapp.com
whitehouse.gov
who.int
wikia.com
wikihow.com
wikimedia.org
wikipedia.org,mi.wikipedia.org
wiktionary.org,mi.wiktionary.org
wiley.com
windowsphone.com
wired.com
wix.com
wordpress.org,SUBDOMAIN-COPY
worldbank.org
wp.com
wsj.com
xbox.com
xinhuanet.com
yadi.sk
yahoo.co.jp
yahoo.com
yale.edu
yandex.ru
yelp.com
youku.com
youronlinechoices.com
youtu.be
youtube.com
ytimg.com
zdnet.com
zend.com
zendesk.com
zippyshare.com