Timestamp:
2019-10-10T23:44:31+13:00 (5 years ago)
Author:
ak19
Message:
  1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge explained why it was more accurate to the behaviour.
  2. Comments to explain how the sites-too-big-to-exhaustively-crawl.txt should be formatted, what values are expected and how they work.
  3. Special blacklisting and whitelisting of urls on yale.edu, coupled with special treatment in topsites file too.
File:
1 edited

  • gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt

    r33555 r33559  
         1        - # top sites - base url forms
         2        -
         3        - # Contains alexa top sites (where only the first 50 were visible)
                1 + # Mapping of top sites in base url forms to value
                2 +
                3 + # This file contains sites that are too large to crawl exhaustively.
                4 + # The domains are from Alexa top sites (where only the first 50 were visible)
         4      5   # Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
         5      6   # Finally also added https://moz.com/top500 by downloading its CSV file and
     
        10     11   # And finally, re-sorted the reduced list alphabetically and pasted into here.
        11     12
               13 + # FORMAT OF THIS FILE'S CONTENTS:
               14 + #    <topsite-base-url><tabspace><value>
               15 + # where <value> can be empty or one of SUBDOMAIN-COPY, SINGLEPAGE, <url-form-without-protocol>
               16 + #
               17 + #   - if value is empty: if seedurl contains topsite-base-url, the seedurl will go into the file
               18 + #     unprocessed-topsite-matches.txt and the site/page won't be crawled.
               19 + #     The user will be notified to inspect the file unprocessed-topsite-matches.txt.
               20 + #   - SINGLEPAGE: if seedurl matches topsite-base-url, then only download the page at that seedurl.
               21 + #     For example, if the seedurl is http://docs.google.com/some-long-suffix-in-base64, then it
               22 + #     matches the topsite-base-url of docs.google.com and its value of SINGLEPAGE will add the
               23 + #     seedurl itself as the regex url-filter, to restrict the crawl to just the specified page.
               24 + #   - SUBDOMAIN-COPY: if seedurl CONTAINS topsite-base-url, then whatever the seedurl's subdomain
               25 + #     or else domain is, will make up the urlfilter, so we don't leak out into a larger domain.
               26 + #     Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if seedurl is
               27 + #     pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
               28 + #     will ensure we restrict crawling to pages on pinky.blogspot.com.
               29 + #     However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
               30 + #     into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
               31 + #   - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
               32 + #     url-form-without-protocol will make up the urlfilter, again preventing leaking into a
               33 + #     larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
               34 + #     match the topsite-base-url of wikipedia.org for which the <url-form-without-protocol>
               35 + #     value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
               36 + #     <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
               37 + #     crawl to just mi.wikipedia.org.
               38 + #     Remember to leave out any protocol <from url-form-without-protocol>.
               39 +
               40 +
               41 +
               42 + docs.google.com  SINGLEPAGE
               43 + drive.google.com    SINGLEPAGE
               44 + forms.office.com    SINGLEPAGE
               45 + player.vimeo.com    SINGLEPAGE
               46 + static-promote.weebly.com   SINGLEPAGE
               47 +
               48 + # Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
               49 + # The page's containing folder is whitelisted in case the photos are there.
               50 + korora.econ.yale.edu        SINGLEPAGE
        12     51
        13     52   000webhost.com
     
        76    115   blackberry.com
        77    116   blogger.com
        78        - blogspot.com
              117 + blogspot.com    SUBDOMAIN-COPY
        79    118   bloomberg.com
        80    119   booking.com
     
       132    171   dreniq.com
       133    172   dribbble.com
       134        - dropbox.com
              173 + dropbox.com SINGLEPAGE
       135    174   dropboxusercontent.com
       136    175   dw.com
     
       264    303   lonelyplanet.com
       265    304   lycos.com
       266        - m.wikipedia.org
              305 + m.wikipedia.org mi.m.wikipedia.org
       267    306   mail.ru
       268    307   marketwatch.com
     
       276    315   merriam-webster.com
       277    316   metro.co.uk
       278        - microsoft.com
              317 + microsoft.com   microsoft.com/mi-nz/
       279    318   microsoftonline.com
       280    319   mirror.co.uk
     
       343    382   photobucket.com
       344    383   php.net
       345        - pinterest.com
              384 + pinterest.com   SINGLEPAGE
       346    385   pixabay.com
       347    386   playstation.com
     
       417    456   stores.jp
       418    457   storify.com
       419        - stuff.co.nz
              458 + stuff.co.nz SINGLEPAGE
       420    459   surveymonkey.com
       421    460   symantec.com
     
       495    534   wikihow.com
       496    535   wikimedia.org
       497        - wikipedia.org
       498        - wiktionary.org
              536 + wikipedia.org   mi.wikipedia.org
              537 + wiktionary.org  mi.wiktionary.org
       499    538   wiley.com
       500    539   windowsphone.com
       501    540   wired.com
       502    541   wix.com
       503        - wordpress.org
              542 + wordpress.org   SUBDOMAIN-COPY
       504    543   worldbank.org
       505    544   wp.com
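
Illustration of the value rules described in the "FORMAT OF THIS FILE'S CONTENTS" comment block added by this changeset. This is only a minimal Python sketch, not the project's actual code. The names parse_topsites and decide, the returned action labels, the choice of the longest matching entry when several entries match, and the reading of "contains" as "the seed URL's host equals or is a subdomain of the topsite-base-url" are all assumptions made for this example. The sketch also returns the plain filter string and does not attempt to reproduce the syntax of the regex urlfilter file.

```python
import re

# Hypothetical sample in the documented "<topsite-base-url><tab><value>" form.
SAMPLE = (
    "# comment lines and blank lines are skipped\n"
    "docs.google.com\tSINGLEPAGE\n"
    "blogspot.com\tSUBDOMAIN-COPY\n"
    "wikipedia.org\tmi.wikipedia.org\n"
    "000webhost.com\n"
)


def parse_topsites(text):
    """Map each topsite-base-url to its value ('' when no value is given)."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: accept any run of whitespace, not only a single tab.
        parts = line.split(None, 1)
        mapping[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return mapping


def strip_protocol(url):
    """Drop a leading 'scheme://' if the URL has one."""
    return re.sub(r"^[A-Za-z][A-Za-z0-9+.\-]*://", "", url)


def decide(seed_url, mapping):
    """Return (action, url_filter) for a seed URL.

    action is one of the hypothetical labels 'no-match', 'unprocessed'
    (seed would be listed in unprocessed-topsite-matches.txt) or 'crawl'.
    """
    no_protocol = strip_protocol(seed_url)
    host = no_protocol.split("/", 1)[0].lower()

    # Assumption: match on host boundaries, and prefer the longest entry.
    matches = [b for b in mapping if host == b or host.endswith("." + b)]
    if not matches:
        return ("no-match", None)
    base = max(matches, key=len)
    value = mapping[base]

    if value == "":
        return ("unprocessed", None)
    if value == "SINGLEPAGE":
        # Restrict the crawl to the seed page itself.
        return ("crawl", no_protocol)
    if value == "SUBDOMAIN-COPY":
        # An exact match on the base domain is not crawled.
        if host == base:
            return ("unprocessed", None)
        return ("crawl", host)
    # Otherwise the value is a <url-form-without-protocol>.
    return ("crawl", value)


if __name__ == "__main__":
    topsites = parse_topsites(SAMPLE)
    for seed in (
        "http://docs.google.com/some-long-suffix-in-base64",
        "pinky.blogspot.com",
        "https://blogspot.com/",
        "mi.wikipedia.org/SomePage",
        "http://000webhost.com/",
    ):
        print(seed, "->", decide(seed, topsites))
```

The five seeds in the demo correspond to the SINGLEPAGE, SUBDOMAIN-COPY (subdomain and exact-match cases), <url-form-without-protocol> and empty-value rules, in that order.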