source: gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt@ 33559

Last change on this file since 33559 was 33559, checked in by ak19, 5 years ago
  1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge explained why it was more accurate to the behaviour. 2. Comments to explain how the sites-too-big-to-exhaustively-crawl.txt should be formatted, what values are expected and how they work. 3. Special blacklisting and whitelisting of urls on yale.edu, coupled with special treatment in topsites file too.
File size: 9.3 KB
# Mapping of top sites, in base URL form, to a value

# This file contains sites that are too large to crawl exhaustively.
# The domains come from Alexa top sites (where only the first 50 were visible).
# Further top sites were added from https://en.wikipedia.org/wiki/List_of_most_popular_websites
# Finally, the sites at https://moz.com/top500 were added by downloading its CSV file and
# adding its URLs to the existing listing here from Alexa/Wikipedia.
# The combined list was then sorted alphabetically and de-duplicated in LibreOffice Calc.
# Next, a regex search and replace in Gedit removed <subdomain>.<site>.ext variants, keeping
# just <site>.ext.
# Finally, the reduced list was re-sorted alphabetically and pasted in here.
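#
# The reduce/de-duplicate/sort steps above were done by hand in a spreadsheet and a text
# editor. A minimal Python sketch of the same idea follows. It is not how this list was
# produced: the names base_domain, reduce_hosts and TWO_LABEL_SUFFIXES are hypothetical, and
# the suffix set is a deliberately incomplete assumption standing in for a real
# public-suffix list.

```python
# Hypothetical, incomplete set of two-label suffixes (e.g. "co.uk"); an assumption
# for illustration only, not a real public-suffix list.
TWO_LABEL_SUFFIXES = {
    "co.uk", "co.jp", "ne.jp", "or.jp", "co.nz", "co.in", "co.id",
    "com.au", "net.au", "com.br", "com.cn", "com.hk", "com.tr", "ac.uk",
}


def base_domain(host):
    """Reduce <subdomain>.<site>.ext to <site>.ext, keeping two-label suffixes whole."""
    labels = host.strip().lower().strip(".").split(".")
    # The suffix is two labels long if the last two labels form a known suffix.
    suffix_len = 2 if ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES else 1
    # Keep the suffix plus the one label in front of it.
    return ".".join(labels[-(suffix_len + 1):])


def reduce_hosts(hosts):
    """Map every host to its base domain, drop duplicates, and sort alphabetically."""
    return sorted({base_domain(h) for h in hosts if h.strip()})
```

# Entries that should stay more specific than <site>.ext (such as docs.google.com further
# down) would still have to be added back by hand after such a reduction.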

# FORMAT OF THIS FILE'S CONTENTS:
# <topsite-base-url><tabspace><value>
# where <value> can be empty or one of SUBDOMAIN-COPY, SINGLEPAGE, <url-form-without-protocol>
#
# - if value is empty: if the seedurl contains topsite-base-url, the seedurl goes into the file
#   unprocessed-topsite-matches.txt and the site/page won't be crawled.
#   The user will be notified to inspect the file unprocessed-topsite-matches.txt.
# - SINGLEPAGE: if the seedurl matches topsite-base-url, then only the page at that seedurl is
#   downloaded. For example, if the seedurl is http://docs.google.com/some-long-suffix-in-base64,
#   it matches the topsite-base-url of docs.google.com, whose value of SINGLEPAGE adds the
#   seedurl itself as the regex url-filter, restricting the crawl to just the specified page.
# - SUBDOMAIN-COPY: if the seedurl CONTAINS topsite-base-url, then the seedurl's own subdomain
#   (or, failing that, its domain) makes up the urlfilter, so we don't leak out into a larger
#   domain. Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if the
#   seedurl is pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but
#   SUBDOMAIN-COPY will ensure we restrict crawling to pages on pinky.blogspot.com.
#   However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
#   into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
# - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
#   url-form-without-protocol will make up the urlfilter, again preventing leaking into a
#   larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
#   match the topsite-base-url of wikipedia.org, for which the <url-form-without-protocol>
#   value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
#   <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
#   crawl to just mi.wikipedia.org.
#   Remember to leave out any protocol from <url-form-without-protocol>.
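#
# The four cases above can be sketched in Python as follows. This is an illustrative sketch
# only, not the crawler's actual code. The function names and the returned labels
# ("UNPROCESSED", "FILTER", "NO_MATCH") are hypothetical. Two further assumptions: the first
# matching entry wins, and "contains" is read as a host match on a dot boundary (a plain
# substring test would also find t.co inside blogspot.com). Lines are split on any
# whitespace so that either a tab or a space separator is tolerated.

```python
def parse_topsites(lines):
    """Parse '<topsite-base-url><whitespace><value>' lines into (base, value) pairs."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        value = parts[1].strip() if len(parts) > 1 else ""
        entries.append((parts[0], value))
    return entries


def strip_protocol(url):
    """Drop a leading 'http://', 'https://' and the like, if present."""
    return url.split("://", 1)[-1]


def decide(seed_url, entries):
    """Return (action, urlfilter) for a seed URL; the labels are hypothetical."""
    seed = strip_protocol(seed_url.strip())
    host = seed.split("/", 1)[0].lower()
    for base, value in entries:
        # Assumption: match the host itself or any subdomain of it.
        if host != base and not host.endswith("." + base):
            continue
        if value == "":
            return ("UNPROCESSED", None)  # goes to unprocessed-topsite-matches.txt
        if value == "SINGLEPAGE":
            return ("FILTER", seed)       # restrict the crawl to this one page
        if value == "SUBDOMAIN-COPY":
            if host == base:
                return ("UNPROCESSED", None)  # exact match on the top site itself
            return ("FILTER", host)       # restrict the crawl to the seed's subdomain
        return ("FILTER", value)          # the provided url-form-without-protocol
    return ("NO_MATCH", None)             # not a top site; this file does not say what follows
```

# With entries for docs.google.com (SINGLEPAGE), blogspot.com (SUBDOMAIN-COPY) and
# wikipedia.org (mi.wikipedia.org), decide() reproduces the three worked examples above.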



docs.google.com	SINGLEPAGE
drive.google.com	SINGLEPAGE
forms.office.com	SINGLEPAGE
player.vimeo.com	SINGLEPAGE
static-promote.weebly.com	SINGLEPAGE

# Special case of yale.edu: its Rapa-Nui pages are on the blacklist, but we want this page and its photos.
# The page's containing folder is whitelisted in case the photos are there.
korora.econ.yale.edu	SINGLEPAGE

000webhost.com
360.cn
4shared.com
a8.net
abc.es
abc.net.au
abcnews.go.com
about.com
about.me
aboutads.info
abril.com.br
academia.edu
accuweather.com
addthis.com
addtoany.com
adobe.com
adweek.com
airbnb.com
akamaihd.net
alexa.com
alibaba.com
aliexpress.com
alipay.com
aljazeera.com
allaboutcookies.org
allrecipes.com
amazon.ca
amazon.co.jp
amazon.co.uk
amazon.com
amazon.de
amazon.es
amazon.fr
amazon.in
ameblo.jp
ampproject.org
android.com
aol.com
ap.org
apache.org
apachefriends.org
apple.com
archive.org
archives.gov
arstechnica.com
arxiv.org
asahi.com
ask.fm
asus.com
axs.com
babytree.com
baidu.com
bandcamp.com
bbc.co.uk
bbc.com
behance.net
berkeley.edu
biblegateway.com
biglobe.ne.jp
billboard.com
bing.com
bit.ly
bitly.com
blackberry.com
blogger.com
blogspot.com	SUBDOMAIN-COPY
bloomberg.com
booking.com
boston.com
box.com
britannica.com
bt.com
bund.de
businessinsider.com
businesswire.com
buydomains.com
buzzfeed.com
ca.gov
cambridge.org
canalblog.com
cbc.ca
cbslocal.com
cbsnews.com
cdc.gov
change.org
channel4.com
chicagotribune.com
chinadaily.com.cn
cisco.com
clickbank.net
cloudflare.com
cmu.edu
cnbc.com
cnet.com
cnn.com
cocolog-nifty.com
columbia.edu
connect.over-blog.com
cornell.edu
corriere.it
cpanel.com
cpanel.net
creativecommons.org
csdn.net
csmonitor.com
dailymail.co.uk
dailymotion.com
dan.com
daum.net
debian.org
dell.com
depositfiles.com
detik.com
digg.com
discovery.com
disney.com
disney.go.com
disqus.com
doubleclick.net
dreniq.com
dribbble.com
dropbox.com	SINGLEPAGE
dropboxusercontent.com
dw.com
e-recht24.de
ea.com
ebay.co.uk
ebay.com
economist.com
eff.org
ehow.com
elmundo.es
elpais.com
engadget.com
entrepreneur.com
eonline.com
espn.com
espn.go.com
etsy.com
europa.eu
eventbrite.com
example.com
excite.co.jp
express.co.uk
facebook.com
fandom.com
fastcompany.com
fb.com
fb.me
fda.gov
fedoraproject.org
feedburner.com
fifa.com
files.wordpress.com
flickr.com
forbes.com
fortune.com
foursquare.com
foxnews.com
ft.com
ftc.gov
gen.xyz
geocities.jp
gesetze-im-internet.de
ggpht.com
github.com
gizmodo.com
globo.com
gmail.com
gnu.org
godaddy.com
gofundme.com
goo.gl
goo.ne.jp
goodreads.com
google.ca
google.co.id
google.co.in
google.co.jp
google.co.uk
google.com
google.com.br
google.com.hk
google.com.tr
google.de
google.es
google.fr
google.it
google.nl
google.pl
google.ru
googleapis.com
googleblog.com
googleusercontent.com
gooyaabitemplates.com
gov.uk
gravatar.com
greenpeace.org
gstatic.com
guardian.co.uk
harvard.edu
hatena.ne.jp
histats.com
hm.com
hollywoodreporter.com
home.pl
house.gov
howstuffworks.com
hp.com
huffingtonpost.com
huffpost.com
hugedomains.com
ibm.com
ibtimes.com
icann.org
ieee.org
ietf.org
ig.com.br
ign.com
ikea.com
imageshack.us
imdb.com
imgur.com
inc.com
independent.co.uk
indiatimes.com
indiegogo.com
instagram.com
instructables.com
intel.com
interia.pl
issuu.com
istockphoto.com
iubenda.com
jd.com
joomla.org
jquery.com
jstor.org
kickstarter.com
kinja.com
last.fm
latimes.com
lefigaro.fr
lemonde.fr
line.me
linkedin.com
list-manage.com
live.com
livejournal.com
livescience.com
loc.gov
lonelyplanet.com
lycos.com
m.wikipedia.org	mi.m.wikipedia.org
mail.ru
marketwatch.com
marriott.com
mashable.com
mediafire.com
medium.com
mega.nz
megaupload.com
mercurynews.com
merriam-webster.com
metro.co.uk
microsoft.com	microsoft.com/mi-nz/
microsoftonline.com
mirror.co.uk
mit.edu
mixcloud.com
mlb.com
mozilla.com
mozilla.org
msn.com
myspace.com
mysql.com
namecheap.com
narod.ru
nasa.gov
nationalgeographic.com
nature.com
naver.com
naver.jp
nba.com
nbcnews.com
ndtv.com
netflix.com
netsons.com
netvibes.com
networkadvertising.org
news.com.au
newscientist.com
newsweek.com
newyorker.com
nginx.com
nginx.org
nhk.or.jp
nicovideo.jp
nifty.com
nih.gov
nikkei.com
noaa.gov
nokia.com
npr.org
nvidia.com
nydailynews.com
nypost.com
nytimes.com
nyu.edu
odnoklassniki.ru
office.com
offset.com
ok.ru
okezone.com
opera.com
oracle.com
orange.fr
oreilly.com
oup.com
over-blog.com
ovh.co.uk
ovh.com
ovh.net
ox.ac.uk
parallels.com
pastebin.com
paypal.com
pbs.org
pcmag.com
people.com
photobucket.com
php.net
pinterest.com	SINGLEPAGE
pixabay.com
playstation.com
plesk.com
plos.org
politico.com
prestashop.com
prezi.com
princeton.edu
privacyshield.gov
prnewswire.com
psychologytoday.com
qq.com
quantcast.com
quora.com
rakuten.co.jp
rambler.ru
rapidshare.com
reddit.com
repubblica.it
researchgate.net
reuters.com
ria.ru
rottentomatoes.com
rt.com
rtve.es
sakura.ne.jp
samsung.com
sapo.pt
scholastic.com
sciencedaily.com
sciencedirect.com
sciencemag.org
scientificamerican.com
scribd.com
seattletimes.com
secureserver.net
sedo.com
seesaa.net
sendspace.com
sfgate.com
shopify.com
shutterstock.com
siemens.com
sina.com.cn
sky.com
skype.com
skyrock.com
slate.com
slideshare.net
sm.cn
smh.com.au
so-net.ne.jp
softonic.com
sogou.com
sohu.com
soratemplates.com
soso.com
soundcloud.com
spiegel.de
spotify.com
springer.com
sputniknews.com
ssl-images-amazon.com
stackoverflow.com
standard.co.uk
stanford.edu
state.gov
steamcommunity.com
steampowered.com
storage.canalblog.com
storage.googleapis.com
stores.jp
storify.com
stuff.co.nz	SINGLEPAGE
surveymonkey.com
symantec.com
t-online.de
t.co
t.me
tabelog.com
taobao.com
target.com
teamviewer.com
techcrunch.com
ted.com
telegram.me
telegraph.co.uk
terra.com.br
theatlantic.com
thefreedictionary.com
theglobeandmail.com
theguardian.com
themeforest.net
thenextweb.com
thestar.com
thesun.co.uk
thetimes.co.uk
theverge.com
thoughtco.com
tianya.cn
time.com
tinyurl.com
tmall.com
tmz.com
tribunnews.com
tripadvisor.com
trustpilot.com
twitch.tv
twitter.com
ucoz.ru
uiuc.edu
umich.edu
un.org
undeveloped.com
unesco.org
uol.com.br
urbandictionary.com
usa.gov
usatoday.com
usgs.gov
usnews.com
uspto.gov
ustream.tv
utexas.edu
variety.com
venturebeat.com
vice.com
viglink.com
vimeo.com
vk.com
vkontakte.ru
vox.com
w3.org
w3schools.com
wa.me
walmart.com
washington.edu
washingtonpost.com
wattpad.com
weather.com
web.fc2.com
webmd.com
weebly.com
weibo.com
welt.de
whatsapp.com
whitehouse.gov
who.int
wikia.com
wikihow.com
wikimedia.org
wikipedia.org	mi.wikipedia.org
wiktionary.org	mi.wiktionary.org
wiley.com
windowsphone.com
wired.com
wix.com
wordpress.org	SUBDOMAIN-COPY
worldbank.org
wp.com
wsj.com
xbox.com
xinhuanet.com
yadi.sk
yahoo.co.jp
yahoo.com
yale.edu
yandex.ru
yelp.com
youku.com
youronlinechoices.com
youtu.be
youtube.com
ytimg.com
zdnet.com
zend.com
zendesk.com
zippyshare.com