Timestamp: 2019-10-11T20:49:05+13:00
Author: ak19
Message:
  1. sites-too-big-to-exhaustively-crawl.txt is now a comma-separated list.
  2. After the discussion with Dr Bainbridge that SINGLEPAGE is not what we want for docs.google.com, I found that the tentative switch to SUBDOMAIN-COPY for docs.google.com will not work, precisely because of the important change we had to make yesterday: with SUBDOMAIN-COPY, only subdomains are copied, not root domains. If a root domain is paired with SUBDOMAIN-COPY, the seedURL gets written out to unprocessed-topsite-matches.txt and its site doesn't get crawled.
  3. This revealed a lacuna in the possible values for sites-too-big-to-exhaustively-crawl.txt, so I had to invent a new value, which I introduce and have tested with this commit: FOLLOW-LINKS-WITHIN-TOPSITE. So far this value applies only to docs.google.com; it keeps following any links originating in a seedURL on docs.google.com, but only as long as they stay within that topsite domain (docs.google.com).
  4. Tidied some old-fashioned use of Iterator, replacing it with the newer style of for loop that works with types. Committing before updating the code to use the Apache CSV API.
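Point 4 describes replacing raw, untyped Iterator loops with generics-aware for-each loops. The following is a minimal sketch of that kind of tidy-up; the class and method names are illustrative only, not taken from the actual maori-lang-detection source:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class IteratorTidyup {

    // Old-fashioned style: raw Iterator with an explicit cast on each element.
    static String joinOldStyle(List<String> urls) {
        StringBuilder sb = new StringBuilder();
        for (Iterator it = urls.iterator(); it.hasNext(); ) {
            sb.append((String) it.next()).append('\n');
        }
        return sb.toString();
    }

    // Newer style: a typed for-each loop, no Iterator boilerplate, no casts.
    static String joinNewStyle(List<String> urls) {
        StringBuilder sb = new StringBuilder();
        for (String url : urls) {
            sb.append(url).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("docs.google.com", "blogspot.com");
        // Both styles produce identical output; only the loop form changes.
        if (!joinOldStyle(urls).equals(joinNewStyle(urls))) {
            throw new AssertionError("styles disagree");
        }
        System.out.println("ok");
    }
}
```

The behaviour is unchanged; the refactor only removes the raw-type Iterator and its casts.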
File:
1 edited

  • gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt

    r33559 r33561

     # FORMAT OF THIS FILE'S CONTENTS:
    -#    <topsite-base-url><tabspace><value>
    +#    <topsite-base-url>,<value>
     # where <value> can be empty or one of SUBDOMAIN-COPY, SINGLEPAGE, <url-form-without-protocol>
     #
    …
     #     However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
     #     into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
    +#   - FOLLOW-LINKS-WITHIN-TOPSITE: if pages linked from the seedURL page can be followed and
    +#     downloaded, as long as it's within the same subdomain matching the topsite-base-url.
    +#     This is different from SUBDOMAIN-COPY, as that can download all of a specific subdomain but
    +#     restricts against downloading the entire domain (e.g. all pinky.blogspot.com and not anything
    +#     else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (at
    +#     depth specified for the nutch crawl) as long as they're within the topsite-base-url.
    +#     e.g. seedURLs on docs.google.com containing links will have those linked pages and any
    +#     they link to etc. downloaded as long as they're on docs.google.com.
     #   - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
     #     url-form-without-protocol will make up the urlfilter, again preventing leaking into a
    …
     #     Remember to leave out any protocol <from url-form-without-protocol>.

    -
    -
    -docs.google.com  SINGLEPAGE
    -drive.google.com    SINGLEPAGE
    -forms.office.com    SINGLEPAGE
    -player.vimeo.com    SINGLEPAGE
    -static-promote.weebly.com   SINGLEPAGE
    +# column 3: whether nutch should do fetch all or not
    +# column 4: number of crawl iterations
    +
    +# docs.google.com is a special case: not all pages are public and any interlinking is likely to
    +# be intentional. But SUBDOMAIN-COPY does not work: as seedURL's domain becomes docs.google.com
    +# which, when combined with SUBDOMAIN-COPY, the Java code treats as a special case so that
    +# any seedURL on docs.google.com ends up pushed out into the "unprocessed....txt" text file.
    +#docs.google.com,SUBDOMAIN-COPY
    +docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE
    +
    +drive.google.com,SINGLEPAGE
    +forms.office.com,SINGLEPAGE
    +player.vimeo.com,SINGLEPAGE
    +static-promote.weebly.com,SINGLEPAGE

     # Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
     # The page's containing folder is whitelisted in case the photos are there.
    -korora.econ.yale.edu        SINGLEPAGE
    +korora.econ.yale.edu,,SINGLEPAGE

     000webhost.com
    …
     blackberry.com
     blogger.com
    -blogspot.com    SUBDOMAIN-COPY
    +blogspot.com,SUBDOMAIN-COPY
     bloomberg.com
     booking.com
    …
     dreniq.com
     dribbble.com
    -dropbox.com SINGLEPAGE
    +dropbox.com,SINGLEPAGE
     dropboxusercontent.com
     dw.com
    …
     lonelyplanet.com
     lycos.com
    -m.wikipedia.org mi.m.wikipedia.org
    +m.wikipedia.org,mi.m.wikipedia.org
     mail.ru
     marketwatch.com
    …
     merriam-webster.com
     metro.co.uk
    -microsoft.com   microsoft.com/mi-nz/
    +microsoft.com,microsoft.com/mi-nz/
     microsoftonline.com
     mirror.co.uk
    …
     photobucket.com
     php.net
    -pinterest.com   SINGLEPAGE
    +pinterest.com,SINGLEPAGE
     pixabay.com
     playstation.com
    …
     stores.jp
     storify.com
    -stuff.co.nz SINGLEPAGE
    +stuff.co.nz,SINGLEPAGE
     surveymonkey.com
     symantec.com
    …
     wikihow.com
     wikimedia.org
    -wikipedia.org   mi.wikipedia.org
    -wiktionary.org  mi.wiktionary.org
    +wikipedia.org,mi.wikipedia.org
    +wiktionary.org,mi.wiktionary.org
     wiley.com
     windowsphone.com
     wired.com
     wix.com
    -wordpress.org   SUBDOMAIN-COPY
    +wordpress.org,SUBDOMAIN-COPY
     worldbank.org
     wp.com
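The comma-separated format and the FOLLOW-LINKS-WITHIN-TOPSITE semantics introduced in this changeset could be consumed along the following lines. This is a hypothetical sketch, not the project's actual implementation: the class and method names are invented, and the plain String.split parsing is an assumption (the commit message notes the code was about to move to the Apache CSV API):

```java
import java.util.HashMap;
import java.util.Map;

public class TopsiteRules {
    final Map<String, String> rules = new HashMap<>();

    // Parse one line of sites-too-big-to-exhaustively-crawl.txt:
    // <topsite-base-url>,<value> with '#' comment lines skipped.
    void addLine(String line) {
        line = line.trim();
        if (line.isEmpty() || line.startsWith("#")) return;
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        String value = fields.length > 1 ? fields[1] : "";
        rules.put(fields[0], value);
    }

    // Decide whether a link found on a seed page may be followed.
    boolean mayFollow(String seedDomain, String linkDomain) {
        String value = rules.get(seedDomain);
        if ("FOLLOW-LINKS-WITHIN-TOPSITE".equals(value)) {
            // Follow links, but only within the same topsite domain.
            return linkDomain.equals(seedDomain)
                || linkDomain.endsWith("." + seedDomain);
        }
        if ("SINGLEPAGE".equals(value)) {
            return false; // only the seed page itself is fetched
        }
        return true; // no special restriction modelled in this sketch
    }

    public static void main(String[] args) {
        TopsiteRules r = new TopsiteRules();
        r.addLine("docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE");
        r.addLine("drive.google.com,SINGLEPAGE");
        System.out.println(r.mayFollow("docs.google.com", "docs.google.com"));  // true
        System.out.println(r.mayFollow("docs.google.com", "drive.google.com")); // false
        System.out.println(r.mayFollow("drive.google.com", "drive.google.com")); // false
    }
}
```

This captures the distinction drawn in the file's comments: SINGLEPAGE stops at the seed page, while FOLLOW-LINKS-WITHIN-TOPSITE keeps following links so long as they remain on the topsite-base-url's domain.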