Changeset 33559

Timestamp: 10.10.2019 23:44:31
Author: ak19
Message:

1. The special string COPY has been changed to SUBDOMAIN-COPY, after Dr Bainbridge explained why that name more accurately describes the behaviour.
2. Added comments explaining how sites-too-big-to-exhaustively-crawl.txt should be formatted, what values are expected and how they work.
3. Special blacklisting and whitelisting of urls on yale.edu, coupled with special treatment in the topsites file too.

Location: gs3-extensions/maori-lang-detection/conf
Files: 3 modified

  • gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt

    r33555 → r33559

    -# top sites - base url forms
    -
    -# Contains alexa top sites (where only the first 50 were visible)
    +# Mapping of top sites in base url forms to value
    +
    +# This file contains sites that are too large to crawl exhaustively.
    +# The domains are from Alexa top sites (where only the first 50 were visible)
     # Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
     # Finally also added https://moz.com/top500 by downloading its CSV file and
     
     # And finally, re-sorted the reduced list alphabetically and pasted into here.

    +# FORMAT OF THIS FILE'S CONTENTS:
    +#    <topsite-base-url><tabspace><value>
    +# where <value> can be empty or one of SUBDOMAIN-COPY, SINGLEPAGE, <url-form-without-protocol>
    +#
    +#   - if value is empty: if seedurl contains topsite-base-url, the seedurl will go into the file
    +#     unprocessed-topsite-matches.txt and the site/page won't be crawled.
    +#     The user will be notified to inspect the file unprocessed-topsite-matches.txt.
    +#   - SINGLEPAGE: if seedurl matches topsite-base-url, then only download the page at that seedurl.
    +#     For example, if the seedurl is http://docs.google.com/some-long-suffix-in-base64, then it
    +#     matches the topsite-base-url of docs.google.com and its value of SINGLEPAGE will add the
    +#     seedurl itself as the regex url-filter, to restrict the crawl to just the specified page.
    +#   - SUBDOMAIN-COPY: if seedurl CONTAINS topsite-base-url, then whatever the seedurl's subdomain
    +#     or else domain is, will make up the urlfilter, so we don't leak out into a larger domain.
    +#     Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if seedurl is
    +#     pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
    +#     will ensure we restrict crawling to pages on pinky.blogspot.com.
    +#     However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
    +#     into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
    +#   - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
    +#     url-form-without-protocol will make up the urlfilter, again preventing leaking into a
    +#     larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
    +#     match the topsite-base-url of wikipedia.org for which the <url-form-without-protocol>
    +#     value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
    +#     <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
    +#     crawl to just mi.wikipedia.org.
    +#     Remember to leave out any protocol from <url-form-without-protocol>.
    +
    +
    +
    +docs.google.com  SINGLEPAGE
    +drive.google.com    SINGLEPAGE
    +forms.office.com    SINGLEPAGE
    +player.vimeo.com    SINGLEPAGE
    +static-promote.weebly.com   SINGLEPAGE
    +
    +# Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
    +# The page's containing folder is whitelisted in case the photos are there.
    +korora.econ.yale.edu        SINGLEPAGE

     000webhost.com
     
     blackberry.com
     blogger.com
    -blogspot.com
    +blogspot.com    SUBDOMAIN-COPY
     bloomberg.com
     booking.com
     
     dreniq.com
     dribbble.com
    -dropbox.com
    +dropbox.com SINGLEPAGE
     dropboxusercontent.com
     dw.com
     
     lonelyplanet.com
     lycos.com
    -m.wikipedia.org
    +m.wikipedia.org mi.m.wikipedia.org
     mail.ru
     marketwatch.com
     
     merriam-webster.com
     metro.co.uk
    -microsoft.com
    +microsoft.com   microsoft.com/mi-nz/
     microsoftonline.com
     mirror.co.uk
     
     photobucket.com
     php.net
    -pinterest.com
    +pinterest.com   SINGLEPAGE
     pixabay.com
     playstation.com
     
     stores.jp
     storify.com
    -stuff.co.nz
    +stuff.co.nz SINGLEPAGE
     surveymonkey.com
     symantec.com
     
     wikihow.com
     wikimedia.org
    -wikipedia.org
    -wiktionary.org
    +wikipedia.org   mi.wikipedia.org
    +wiktionary.org  mi.wiktionary.org
     wiley.com
     windowsphone.com
     wired.com
     wix.com
    -wordpress.org
    +wordpress.org   SUBDOMAIN-COPY
     worldbank.org
     wp.com
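
The FORMAT comments added above describe how each value in this file shapes the regex url-filter generated for a seed URL. As a rough illustration of that decision logic only, here is a minimal Java sketch; it is not the extension's actual code, and the class, method and variable names (TopSiteFilterSketch, urlFilterFor, seedUrlNoProtocol) are invented for this example.

    import java.util.Map;
    import java.util.TreeMap;

    public class TopSiteFilterSketch {

        /** Returns the url-filter string for seedUrlNoProtocol, or null if the
         *  seed URL should instead be left in unprocessed-topsite-matches.txt. */
        static String urlFilterFor(String seedUrlNoProtocol, Map<String, String> topSites) {
            for (Map.Entry<String, String> entry : topSites.entrySet()) {
                String base = entry.getKey();
                String value = entry.getValue();
                if (!seedUrlNoProtocol.contains(base)) {
                    continue;
                }
                if (value.isEmpty()) {
                    return null;                      // empty value: leave for manual inspection
                }
                if (value.equals("SINGLEPAGE")) {
                    return seedUrlNoProtocol;         // crawl only the seed page itself
                }
                if (value.equals("SUBDOMAIN-COPY")) {
                    String domain = seedUrlNoProtocol.split("/")[0];   // e.g. pinky.blogspot.com
                    if (domain.equals(base)) {
                        return null;                  // exact match on the topsite base: leave unprocessed
                    }
                    return domain;                    // confine the crawl to the seed's own subdomain
                }
                return value;                         // <url-form-without-protocol>, e.g. mi.wikipedia.org
            }
            return seedUrlNoProtocol.split("/")[0];   // not a topsite: assume the seed's own domain
        }

        public static void main(String[] args) {
            Map<String, String> topSites = new TreeMap<>();
            topSites.put("blogspot.com", "SUBDOMAIN-COPY");
            topSites.put("docs.google.com", "SINGLEPAGE");
            topSites.put("wikipedia.org", "mi.wikipedia.org");

            System.out.println(urlFilterFor("pinky.blogspot.com/somepage", topSites));  // pinky.blogspot.com
            System.out.println(urlFilterFor("mi.wikipedia.org/SomePage", topSites));    // mi.wikipedia.org
            System.out.println(urlFilterFor("docs.google.com/abc123xyz", topSites));    // docs.google.com/abc123xyz
        }
    }

Run against a few of the entries above, a pinky.blogspot.com seed stays confined to pinky.blogspot.com, mi.wikipedia.org/SomePage is restricted to mi.wikipedia.org, and a docs.google.com seed is limited to that single page.
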
  • gs3-extensions/maori-lang-detection/conf/url-blacklist-filter.txt

    r33556 → r33559

     # Without either ^ or $ symbol, urls containing the given url will get blacklisted

    +
    +# manually adjusting for irrelevant topsite hits
    +# Rapa-Nui is related to Easter Island
    +^http://codex.cs.yale.edu/avi/silberschatz/gallery/trips-photos/South-America/Rapa-Nui/
    +
    +# We will blacklist this yale.edu domain except for the subportion that gets whitelisted
    +# then in the sites-too-big-to-exhaustively-crawl.txt, we have a mapping for an allowed url
    +# pattern in case elements on the page are stored elsewhere
    +^http://korora.econ.yale.edu/

     # wikipedia pages in
  • gs3-extensions/maori-lang-detection/conf/url-whitelist-filter.txt

    r33531 → r33559

     # Without either ^ or $ symbol, urls containing the given url will get greylisted

    -mi.wikipedia.org
    +# Special exception for this url on yale.edu, since we needed to blacklist
    +# some particular other urls on yale.edu
    +http://korora.econ.yale.edu/phillips/archive/hauraki.htm
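
Both filter files state the same matching convention in their header comments: a leading ^ anchors an entry to the start of a URL, a trailing $ anchors it to the end, and an entry with neither matches any URL that contains it. As an illustration only, here is a minimal Java sketch of that rule applied to the new yale.edu entries; it is not the extension's actual code, and the class and method names are invented for this example.

    public class UrlFilterLineSketch {

        /** Applies one line from url-blacklist-filter.txt or url-whitelist-filter.txt to a URL:
         *  '^' anchors the start, '$' anchors the end, and a line with neither matches any URL
         *  that contains it. */
        static boolean matches(String filterLine, String url) {
            boolean anchorStart = filterLine.startsWith("^");
            boolean anchorEnd = filterLine.endsWith("$");
            String pattern = filterLine;
            if (anchorStart) {
                pattern = pattern.substring(1);
            }
            if (anchorEnd) {
                pattern = pattern.substring(0, pattern.length() - 1);
            }
            if (anchorStart && anchorEnd) {
                return url.equals(pattern);
            }
            if (anchorStart) {
                return url.startsWith(pattern);
            }
            if (anchorEnd) {
                return url.endsWith(pattern);
            }
            return url.contains(pattern);
        }

        public static void main(String[] args) {
            String blacklistLine = "^http://korora.econ.yale.edu/";
            String whitelistLine = "http://korora.econ.yale.edu/phillips/archive/hauraki.htm";
            String url = "http://korora.econ.yale.edu/phillips/archive/hauraki.htm";

            // Both entries match this URL; the whitelist entry is presumably what lets the
            // hauraki.htm page through despite the domain-level blacklisting of korora.econ.yale.edu.
            System.out.println(matches(blacklistLine, url));   // true
            System.out.println(matches(whitelistLine, url));   // true
        }
    }

Since both the domain-level blacklist entry and the single-page whitelist entry match the hauraki.htm URL, the whitelist entry is presumably what allows that page through despite the broader blacklisting of korora.econ.yale.edu.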