Changeset 33559 for gs3-extensions
- Timestamp: 2019-10-10T23:44:31+13:00
- Location: gs3-extensions/maori-lang-detection/conf
- Files: 3 edited
gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt
(r33555 → r33559)

-# top sites - base url forms
-
-# Contains alexa top sites (where only the first 50 were visible)
+# Mapping of top sites in base url forms to value
+
+# This file contains sites that are too large to crawl exhaustively.
+# The domains are from Alexa top sites (where only the first 50 were visible)
 # Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
 # Finally also added https://moz.com/top500 by downloading its CSV file and
…
 # And finally, re-sorted the reduced list alphabetically and pasted into here.
 
+# FORMAT OF THIS FILE'S CONTENTS:
+# <topsite-base-url><tabspace><value>
+# where <value> can be empty or one of SUBDOMAIN-COPY, SINGLEPAGE, <url-form-without-protocol>
+#
+# - if value is empty: if seedurl contains topsite-base-url, the seedurl will go into the file
+#   unprocessed-topsite-matches.txt and the site/page won't be crawled.
+#   The user will be notified to inspect the file unprocessed-topsite-matches.txt.
+# - SINGLEPAGE: if seedurl matches topsite-base-url, then only download the page at that seedurl.
+#   For example, if the seedurl is http://docs.google.com/some-long-suffix-in-base64, then it
+#   matches the topsite-base-url of docs.google.com and its value of SINGLEPAGE will add the
+#   seedurl itself as the regex url-filter, to restrict the crawl to just the specified page.
+# - SUBDOMAIN-COPY: if seedurl CONTAINS topsite-base-url, then whatever the seedurl's subdomain
+#   (or else domain) is will make up the urlfilter, so we don't leak out into a larger domain.
+#   Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if seedurl is
+#   pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
+#   will ensure we restrict crawling to pages on pinky.blogspot.com.
+#   However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
+#   into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
+# - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
+#   url-form-without-protocol will make up the urlfilter, again preventing leaking into a
+#   larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
+#   match the topsite-base-url of wikipedia.org for which the <url-form-without-protocol>
+#   value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
+#   <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
+#   crawl to just mi.wikipedia.org.
+#   Remember to leave out any protocol from <url-form-without-protocol>.
+
+
+docs.google.com SINGLEPAGE
+drive.google.com SINGLEPAGE
+forms.office.com SINGLEPAGE
+player.vimeo.com SINGLEPAGE
+static-promote.weebly.com SINGLEPAGE
+
+# Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
+# The page's containing folder is whitelisted in case the photos are there.
+korora.econ.yale.edu SINGLEPAGE
 
 000webhost.com
…
 blackberry.com
 blogger.com
-blogspot.com
+blogspot.com SUBDOMAIN-COPY
 bloomberg.com
 booking.com
…
 dreniq.com
 dribbble.com
-dropbox.com
+dropbox.com SINGLEPAGE
 dropboxusercontent.com
 dw.com
…
 lonelyplanet.com
 lycos.com
-m.wikipedia.org
+m.wikipedia.org mi.m.wikipedia.org
 mail.ru
 marketwatch.com
…
 merriam-webster.com
 metro.co.uk
-microsoft.com
+microsoft.com microsoft.com/mi-nz/
 microsoftonline.com
 mirror.co.uk
…
 photobucket.com
 php.net
-pinterest.com
+pinterest.com SINGLEPAGE
 pixabay.com
 playstation.com
…
 stores.jp
 storify.com
-stuff.co.nz
+stuff.co.nz SINGLEPAGE
 surveymonkey.com
 symantec.com
…
 wikihow.com
 wikimedia.org
-wikipedia.org
-wiktionary.org
+wikipedia.org mi.wikipedia.org
+wiktionary.org mi.wiktionary.org
 wiley.com
 windowsphone.com
 wired.com
 wix.com
-wordpress.org
+wordpress.org SUBDOMAIN-COPY
 worldbank.org
 wp.com
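The FORMAT rules in the file's header comment can be illustrated with a short sketch. The following is a hypothetical Python rendering of the described behaviour, not the project's actual crawler code: the `TOPSITES` dict and the `resolve()` helper are invented for illustration, though the entries and values are taken from the file itself.

```python
# Hypothetical sketch of the sites-too-big-to-exhaustively-crawl.txt value
# semantics; names (TOPSITES, resolve) are invented for illustration.
from urllib.parse import urlparse

TOPSITES = {
    "docs.google.com": "SINGLEPAGE",
    "blogspot.com": "SUBDOMAIN-COPY",
    "wikipedia.org": "mi.wikipedia.org",   # a <url-form-without-protocol> value
    "000webhost.com": "",                  # empty value
}

def resolve(seedurl):
    """Return (action, urlfilter) for a seed URL, or None if no topsite matches."""
    domain = urlparse(seedurl).netloc
    for topsite, value in TOPSITES.items():
        if topsite not in domain:
            continue
        if value == "SINGLEPAGE":
            # Restrict the crawl to just this one page: the seedurl itself
            # becomes the regex url-filter.
            return ("crawl", seedurl)
        if value == "SUBDOMAIN-COPY":
            if domain == topsite:
                # Exact domain match: goes to unprocessed-topsite-matches.txt.
                return ("skip", None)
            # Keep the seed's own subdomain so we don't leak into the wider site.
            return ("crawl", domain)
        if value:
            # A <url-form-without-protocol> becomes the urlfilter directly.
            return ("crawl", value)
        # Empty value: record in unprocessed-topsite-matches.txt, don't crawl.
        return ("skip", None)
    return None
```

For example, a seed of pinky.blogspot.com would resolve to a urlfilter of `pinky.blogspot.com`, while a seed of blogspot.com itself would be skipped and queued for inspection.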
gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt
(r33556 → r33559)

 # Without either ^ or $ symbol, urls containing the given url will get blacklisted
 
+
+# manually adjusting for irrelevant topsite hits
+# Rapa-Nui is related to Easter Island
+^http://codex.cs.yale.edu/avi/silberschatz/gallery/trips-photos/South-America/Rapa-Nui/
+
+# We will blacklist this yale.edu domain except for the subportion that gets whitelisted
+# then in the sites-too-big-to-exhaustively-crawl.txt, we have a mapping for an allowed url
+# pattern in case elements on the page are stored elsewhere
+^http://korora.econ.yale.edu/
 
 # wikipedia pages in
gs3-extensions/maori-lang-detection/conf/url-whitelist-filter.txt
(r33531 → r33559)

 # Without either ^ or $ symbol, urls containing the given url will get greylisted
 
-mi.wikipedia.org
+# Special exception for this url on yale.edu, since we needed to blacklist
+# some particular other urls on yale.edu
+http://korora.econ.yale.edu/phillips/archive/hauraki.htm
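Both filter files state the same matching rule in their headers: a line starting with `^` anchors at the start of the URL, a line ending with `$` anchors at the end, and a bare line matches any URL containing it. A minimal sketch of that rule, under the assumption that these are the only three cases used (the `matches_filter` helper is invented for illustration, not the project's actual code):

```python
# Hypothetical sketch of the ^/$-style matching rule described in the
# blacklist/whitelist filter file headers.

def matches_filter(url, pattern):
    """Does `url` match one line from a blacklist/whitelist filter file?"""
    if pattern.startswith("^"):
        # Anchored at the start: pattern must be a prefix of the URL.
        return url.startswith(pattern[1:])
    if pattern.endswith("$"):
        # Anchored at the end: pattern must be a suffix of the URL.
        return url.endswith(pattern[:-1])
    # No anchor: any URL containing the pattern matches.
    return pattern in url
```

Under this rule, the blacklist line `^http://korora.econ.yale.edu/` blocks everything on that host, while the whitelist line for the hauraki.htm page carves out the single allowed URL.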