source: gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt@ 33622

Last change on this file since 33622 was 33604, checked in by ak19, 5 years ago
  1. Better output into possible-product-sites.txt including the overseas country code prefix to help decide whether the site is worth keeping or not. 2. Updated whitelisting and top-sites filters to grab the /mi/ subsections of sites that don't appear to be autotranslated. This is done in preparation for blocking out product sites hereafter
# Mapping of top sites in base url forms to value

# This file contains sites that are too large to crawl exhaustively.
# The domains are from Alexa top sites (where only the first 50 were visible).
# Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
# Finally also added https://moz.com/top500 by downloading its CSV file and
# adding its URLs to the existing listing here from alexa/wiki.
# Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
# just <site>.ext
# And finally, re-sorted the reduced list alphabetically and pasted into here.

# FORMAT OF THIS FILE'S CONTENTS:
#    <topsite-base-url>,<value>
# where <value> is one of:
#    empty, SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE, SINGLEPAGE, <url-form-without-protocol>
#
# - if value is left empty: if the seedurl contains topsite-base-url, the seedurl will go into the
#   file unprocessed-topsite-matches.txt and the site/page won't be crawled.
#   The user will be notified to inspect the file unprocessed-topsite-matches.txt.
# - SINGLEPAGE: if the seedurl matches topsite-base-url, then only download the page at that seedurl.
#   For example, if the seedurl is http://docs.google.com/some-long-suffix-in-base64, then it
#   matches the topsite-base-url of docs.google.com and its value of SINGLEPAGE will add the
#   seedurl itself as the regex url-filter, to restrict the crawl to just the specified page.
# - SUBDOMAIN-COPY: if the seedurl CONTAINS topsite-base-url, then the seedurl's subdomain (or,
#   failing that, its domain) makes up the urlfilter, so we don't leak out into a larger domain.
#   Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if the seedurl is
#   pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
#   will ensure we restrict crawling to pages on pinky.blogspot.com.
#   However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
#   into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
# - FOLLOW-LINKS-WITHIN-TOPSITE: pages linked from the seedurl page can be followed and
#   downloaded, as long as they are within the same subdomain matching the topsite-base-url.
#   This is different from SUBDOMAIN-COPY, as that can download all of a specific subdomain but
#   restricts against downloading the entire domain (e.g. all of pinky.blogspot.com and not anything
#   else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (at the
#   depth specified for the nutch crawl) as long as they're within the topsite-base-url.
#   E.g. seedurls on docs.google.com containing links will have those linked pages, and any
#   they link to etc., downloaded as long as they're on docs.google.com.
# - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
#   url-form-without-protocol will make up the urlfilter, again preventing leaking into a
#   larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
#   match the topsite-base-url of wikipedia.org, for which the <url-form-without-protocol>
#   value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
#   <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
#   crawl to just mi.wikipedia.org.
#   Remember to leave out any protocol from <url-form-without-protocol>.
#
# TODO If useful:
#   column 3: whether nutch should do fetch all or not
#   column 4: number of crawl iterations
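#
# The <value> rules described above could be sketched in Python roughly as follows. This is
# only an illustration, not the crawler's actual code: the function names (parse_topsites,
# url_filter_for), the returned (action, filter) pairs, and the choice to match on the
# host's domain suffix and to prefer the longest matching topsite-base-url are all
# assumptions made for this example.

```python
def parse_topsites(lines):
    """Parse '<topsite-base-url>,<value>' lines; skip blank lines and '#' comments."""
    rules = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        base, _, value = line.partition(",")
        rules[base.strip().lower()] = value.strip()
    return rules


def url_filter_for(seed_url, rules):
    """Return a hypothetical (action, filter) pair for one seed URL.

    action is "no-match" (not a listed site), "unprocessed" (would be written to
    unprocessed-topsite-matches.txt and not crawled) or "filter" (crawl, restricted
    to the returned url-filter string).
    """
    # Drop the protocol, then split off the host part.
    without_protocol = seed_url.split("://", 1)[-1]
    host = without_protocol.split("/", 1)[0].lower()

    # Assumption: a seed "contains" a base url if its host equals the base
    # or is a subdomain of it; the most specific (longest) base wins.
    matches = [b for b in rules if host == b or host.endswith("." + b)]
    if not matches:
        return ("no-match", None)
    base = max(matches, key=len)
    value = rules[base]

    if value == "":
        return ("unprocessed", None)
    if value == "SINGLEPAGE":
        # Restrict the crawl to exactly the seed page.
        return ("filter", without_protocol)
    if value == "SUBDOMAIN-COPY":
        # An exact match on the base domain is not crawled.
        if host == base:
            return ("unprocessed", None)
        return ("filter", host)
    if value == "FOLLOW-LINKS-WITHIN-TOPSITE":
        return ("filter", base)
    # Otherwise the value is a <url-form-without-protocol>.
    return ("filter", value)
```

# For instance, with the entry blogspot.com,SUBDOMAIN-COPY, the seedurl
# http://pinky.blogspot.com/post yields the filter pinky.blogspot.com, while a seedurl on
# blogspot.com itself is reported as unprocessed.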


# NOT TOP SITES, BUT SITES WE INSPECTED AND WANT TO CONTROL SIMILARLY TO TOP SITES
00.gs,SINGLEPAGE
# May be a large site with only seedURLs of real relevance
topographic-map.com,SINGLEPAGE
ami-media.net,SINGLEPAGE
# 2 pages of declarations of human rights in Maori, rest in other languages
anitra.net,SINGLEPAGE
# special case
mi.centr-zashity.ru,SINGLEPAGE

martinvrijland.nl,martinvrijland.nl/mi/
csunplugged.org,csunplugged.org/mi/
gpedia.com,gpedia.com/mi/

# TOP SITE BUT NOT TOP 500
www.tumblr.com,SINGLEPAGE


# TOP SITES

# docs.google.com is a special case: not all pages are public and any interlinking is likely to
# be intentional. Grab all linked pages, for link depth set with nutch's crawl, as long as the
# links are within the given topsite-base-url
docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE

# Just crawl a single page for these:
drive.google.com,SINGLEPAGE
forms.office.com,SINGLEPAGE
player.vimeo.com,SINGLEPAGE
static-promote.weebly.com,SINGLEPAGE

# Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
# The page's containing folder is whitelisted in case the photos are there.
korora.econ.yale.edu,SINGLEPAGE


000webhost.com
360.cn
4shared.com
a8.net
abc.es
abc.net.au
abcnews.go.com
about.com
about.me
aboutads.info
abril.com.br
academia.edu
accuweather.com
addthis.com
addtoany.com
adobe.com
adweek.com
airbnb.com
akamaihd.net
alexa.com
alibaba.com
aliexpress.com
alipay.com
aljazeera.com
allaboutcookies.org
allrecipes.com
amazon.ca
amazon.co.jp
amazon.co.uk
amazon.com
amazon.de
amazon.es
amazon.fr
amazon.in
ameblo.jp
ampproject.org
android.com
aol.com
ap.org
apache.org
apachefriends.org
apple.com
archive.org
archives.gov
arstechnica.com
arxiv.org
asahi.com
ask.fm
asus.com
axs.com
babytree.com
baidu.com
bandcamp.com
bbc.co.uk
bbc.com
behance.net
berkeley.edu
biblegateway.com
biglobe.ne.jp
billboard.com
bing.com
bit.ly
bitly.com
blackberry.com
blogger.com
blogspot.com,SUBDOMAIN-COPY
bloomberg.com
booking.com
boston.com
box.com
britannica.com
bt.com
bund.de
businessinsider.com
businesswire.com
buydomains.com
buzzfeed.com
ca.gov
cambridge.org
canalblog.com
cbc.ca
cbslocal.com
cbsnews.com
cdc.gov
change.org
channel4.com
chicagotribune.com
chinadaily.com.cn
cisco.com
clickbank.net
cloudflare.com
cmu.edu
cnbc.com
cnet.com
cnn.com
cocolog-nifty.com
columbia.edu
connect.over-blog.com
cornell.edu
corriere.it
cpanel.com
cpanel.net
creativecommons.org
csdn.net
csmonitor.com
dailymail.co.uk
dailymotion.com
dan.com
daum.net
debian.org
dell.com
depositfiles.com
detik.com
digg.com
discovery.com
disney.com
disney.go.com
disqus.com
doubleclick.net
dreniq.com
dribbble.com
dropbox.com,SINGLEPAGE
dropboxusercontent.com
dw.com
e-recht24.de
ea.com
ebay.co.uk
ebay.com
economist.com
eff.org
ehow.com
elmundo.es
elpais.com
engadget.com
entrepreneur.com
eonline.com
espn.com
espn.go.com
etsy.com
europa.eu
eventbrite.com
example.com
excite.co.jp
express.co.uk
facebook.com
fandom.com
fastcompany.com
fb.com
fb.me
fda.gov
fedoraproject.org
feedburner.com
fifa.com
files.wordpress.com
flickr.com
forbes.com
fortune.com
foursquare.com
foxnews.com
ft.com
ftc.gov
gen.xyz
geocities.jp
gesetze-im-internet.de
ggpht.com
github.com
gizmodo.com
globo.com
gmail.com
gnu.org
godaddy.com
gofundme.com
goo.gl
goo.ne.jp
goodreads.com
google.ca
google.co.id
google.co.in
google.co.jp
google.co.uk
google.com
google.com.br
google.com.hk
google.com.tr
google.de
google.es
google.fr
google.it
google.nl
google.pl
google.ru
googleapis.com
googleblog.com
googleusercontent.com
gooyaabitemplates.com
gov.uk
gravatar.com
greenpeace.org
gstatic.com
guardian.co.uk
harvard.edu
hatena.ne.jp
histats.com
hm.com
hollywoodreporter.com
home.pl
house.gov
howstuffworks.com
hp.com
huffingtonpost.com
huffpost.com
hugedomains.com
ibm.com
ibtimes.com
icann.org
ieee.org
ietf.org
ig.com.br
ign.com
ikea.com
imageshack.us
imdb.com
imgur.com
inc.com
independent.co.uk
indiatimes.com
indiegogo.com
instagram.com
instructables.com
intel.com
interia.pl
issuu.com
istockphoto.com
iubenda.com
jd.com
joomla.org
jquery.com
jstor.org
kickstarter.com
kinja.com
last.fm
latimes.com
lefigaro.fr
lemonde.fr
line.me
linkedin.com
list-manage.com
live.com
livejournal.com
livescience.com
loc.gov
lonelyplanet.com
lycos.com
m.wikipedia.org,mi.m.wikipedia.org
mail.ru
marketwatch.com
marriott.com
mashable.com
mediafire.com
medium.com
mega.nz
megaupload.com
mercurynews.com
merriam-webster.com
metro.co.uk
microsoft.com,microsoft.com/mi-nz/
microsoftonline.com
mirror.co.uk
mit.edu
mixcloud.com
mlb.com
mozilla.com
mozilla.org
msn.com
myspace.com
mysql.com
namecheap.com
narod.ru
nasa.gov
nationalgeographic.com
nature.com
naver.com
naver.jp
nba.com
nbcnews.com
ndtv.com
netflix.com
netsons.com
netvibes.com
networkadvertising.org
news.com.au
newscientist.com
newsweek.com
newyorker.com
nginx.com
nginx.org
nhk.or.jp
nicovideo.jp
nifty.com
nih.gov
nikkei.com
noaa.gov
nokia.com
npr.org
nvidia.com
nydailynews.com
nypost.com
nytimes.com
nyu.edu
odnoklassniki.ru
office.com
offset.com
ok.ru
okezone.com
opera.com
oracle.com
orange.fr
oreilly.com
oup.com
over-blog.com
ovh.co.uk
ovh.com
ovh.net
ox.ac.uk
parallels.com
pastebin.com
paypal.com
pbs.org
pcmag.com
people.com
photobucket.com
php.net
pinterest.com,SINGLEPAGE
pixabay.com
playstation.com
plesk.com
plos.org
politico.com
prestashop.com
prezi.com
princeton.edu
privacyshield.gov
prnewswire.com
psychologytoday.com
qq.com
quantcast.com
quora.com
rakuten.co.jp
rambler.ru
rapidshare.com
reddit.com
repubblica.it
researchgate.net
reuters.com
ria.ru
rottentomatoes.com
rt.com
rtve.es
sakura.ne.jp
samsung.com
sapo.pt
scholastic.com
sciencedaily.com
sciencedirect.com
sciencemag.org
scientificamerican.com
scribd.com
seattletimes.com
secureserver.net
sedo.com
seesaa.net
sendspace.com
sfgate.com
shopify.com
shutterstock.com
siemens.com
sina.com.cn
sky.com
skype.com
skyrock.com
slate.com
slideshare.net
sm.cn
smh.com.au
so-net.ne.jp
softonic.com
sogou.com
sohu.com
soratemplates.com
soso.com
soundcloud.com
spiegel.de
spotify.com
springer.com
sputniknews.com
ssl-images-amazon.com
stackoverflow.com
standard.co.uk
stanford.edu
state.gov
steamcommunity.com
steampowered.com
storage.canalblog.com
storage.googleapis.com
stores.jp
storify.com
stuff.co.nz,SINGLEPAGE
surveymonkey.com
symantec.com
t-online.de
t.co
t.me
tabelog.com
taobao.com
target.com
teamviewer.com
techcrunch.com
ted.com
telegram.me
telegraph.co.uk
terra.com.br
theatlantic.com
thefreedictionary.com
theglobeandmail.com
theguardian.com
themeforest.net
thenextweb.com
thestar.com
thesun.co.uk
thetimes.co.uk
theverge.com
thoughtco.com
tianya.cn
time.com
tinyurl.com
tmall.com
tmz.com
tribunnews.com
tripadvisor.com
trustpilot.com
twitch.tv
twitter.com
ucoz.ru
uiuc.edu
umich.edu
un.org
undeveloped.com
unesco.org
uol.com.br
urbandictionary.com
usa.gov
usatoday.com
usgs.gov
usnews.com
uspto.gov
ustream.tv
utexas.edu
variety.com
venturebeat.com
vice.com
viglink.com
vimeo.com
vk.com
vkontakte.ru
vox.com
w3.org
w3schools.com
wa.me
walmart.com
washington.edu
washingtonpost.com
wattpad.com
weather.com
web.fc2.com
webmd.com
weebly.com
weibo.com
welt.de
whatsapp.com
whitehouse.gov
who.int
wikia.com
wikihow.com
wikimedia.org
wikipedia.org,mi.wikipedia.org
wiktionary.org,mi.wiktionary.org
wiley.com
windowsphone.com
wired.com
wix.com
wordpress.org,SUBDOMAIN-COPY
worldbank.org
wp.com
wsj.com
xbox.com
xinhuanet.com
yadi.sk
yahoo.co.jp
yahoo.com
yale.edu
yandex.ru
yelp.com
youku.com
youronlinechoices.com
youtu.be
youtube.com
ytimg.com
zdnet.com
zend.com
zendesk.com
zippyshare.com