Timestamp:
2019-10-11T21:52:40+13:00
Author:
ak19
Message:
  1. The sites-too-big-to-exhaustively-crawl.txt file is now a CSV file of a semi-custom format, and
     the Java code now uses the Apache Commons CSV jar file (v1.7 for Java 8) to parse its contents.
  2. Tidied up the code to reuse a single reference to the ClassLoader.
File:
1 edited
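
The changeset message notes that the Java code now parses this file with Apache Commons CSV 1.7. The following is a minimal sketch of how such parsing might look; it is not the code from this revision, and the class name, resource path and map-based return type are assumptions made for illustration only.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVParser;
    import org.apache.commons.csv.CSVRecord;

    public class TopSitesReader {

        /**
         * Reads sites-too-big-to-exhaustively-crawl.txt off the classpath and returns a map
         * of topsite-base-url to its (possibly empty) value column.
         * The ClassLoader is passed in so callers can reuse a single reference to it.
         */
        public static Map<String, String> loadTopSitesMap(ClassLoader classLoader) throws IOException {
            Map<String, String> topSitesMap = new HashMap<>();

            // Lines starting with '#' are comments; empty lines and surrounding whitespace are skipped.
            CSVFormat format = CSVFormat.DEFAULT
                    .withCommentMarker('#')
                    .withIgnoreEmptyLines(true)
                    .withTrim(true);

            InputStream in = classLoader.getResourceAsStream("sites-too-big-to-exhaustively-crawl.txt");
            if (in == null) {
                throw new IOException("sites-too-big-to-exhaustively-crawl.txt not found on the classpath");
            }
            try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
                 CSVParser parser = format.parse(reader)) {

                for (CSVRecord record : parser) {
                    String topsiteBaseURL = record.get(0);
                    // Column 2 is optional: empty, SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE,
                    // SINGLEPAGE, or a url-form-without-protocol.
                    String value = record.size() > 1 ? record.get(1) : "";
                    topSitesMap.put(topsiteBaseURL, value);
                }
            }
            return topSitesMap;
        }
    }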

  • gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt

--- gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (r33561)
+++ gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (r33562)
@@ -13,8 +13,9 @@
 # FORMAT OF THIS FILE'S CONTENTS:
 #    <topsite-base-url>,<value>
-# where <value> can be empty or one of SUBDOMAIN-COPY, SINGLEPAGE, <url-form-without-protocol>
+# where <value> can or is one of
+#    empty, SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE, SINGLEPAGE, <url-form-without-protocol>
 #
-#   - if value is empty: if seedurl contains topsite-base-url, the seedurl will go into the file
-#     unprocessed-topsite-matches.txt and the site/page won't be crawled.
+#   - if value is left empty: if seedurl contains topsite-base-url, the seedurl will go into the
+#     file unprocessed-topsite-matches.txt and the site/page won't be crawled.
 #     The user will be notified to inspect the file unprocessed-topsite-matches.txt.
 #   - SINGLEPAGE: if seedurl matches topsite-base-url, then only download the page at that seedurl.
@@ -45,15 +46,15 @@
 #     crawl to just mi.wikipedia.org.
 #     Remember to leave out any protocol <from url-form-without-protocol>.
-
-# column 3: whether nutch should do fetch all or not
-# column 4: number of crawl iterations
+#
+# TODO If useful:
+#   column 3: whether nutch should do fetch all or not
+#   column 4: number of crawl iterations
 
 # docs.google.com is a special case: not all pages are public and any interlinking is likely to
-# be intentional. But SUBDOMAIN-COPY does not work: as seedURL's domain becomes docs.google.com
-# which, when combined with SUBDOMAIN-COPY, the Java code treats as a special case so that
-# any seedURL on docs.google.com ends up pushed out into the "unprocessed....txt" text file.
-#docs.google.com,SUBDOMAIN-COPY
+# be intentional. Grab all linked pages, for link depth set with nutch's crawl, as long as the
+# links are within the given topsite-base-url
 docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE
 
+# Just crawl a single page for these:
 drive.google.com,SINGLEPAGE
 forms.office.com,SINGLEPAGE
@@ -63,5 +64,5 @@
 # Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
 # The page's containing folder is whitelisted in case the photos are there.
-korora.econ.yale.edu,,SINGLEPAGE
+korora.econ.yale.edu,SINGLEPAGE
 
 000webhost.com
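
The comments in the revised file above describe what each <value> means for a seed URL that falls under a known topsite-base-url. The sketch below is a hypothetical illustration of branching on those values; the enum and method names are invented here and are not taken from the project's code.

    public class TopSiteValueSketch {

        /** Possible handling decisions, mirroring the values documented in the file's comments. */
        enum Action {
            UNPROCESSED,                 // empty value: log to unprocessed-topsite-matches.txt, don't crawl
            SINGLE_PAGE,                 // SINGLEPAGE: download only the page at the seed URL
            SUBDOMAIN_COPY,              // SUBDOMAIN-COPY: restrict the crawl to the seed URL's subdomain
            FOLLOW_LINKS_WITHIN_TOPSITE, // FOLLOW-LINKS-WITHIN-TOPSITE: follow links that stay within the topsite
            CONSTRAIN_TO_URL_FORM        // anything else: a url-form-without-protocol to constrain the crawl to
        }

        static Action actionFor(String value) {
            if (value == null || value.isEmpty()) {
                return Action.UNPROCESSED;
            }
            switch (value) {
                case "SINGLEPAGE":
                    return Action.SINGLE_PAGE;
                case "SUBDOMAIN-COPY":
                    return Action.SUBDOMAIN_COPY;
                case "FOLLOW-LINKS-WITHIN-TOPSITE":
                    return Action.FOLLOW_LINKS_WITHIN_TOPSITE;
                default:
                    return Action.CONSTRAIN_TO_URL_FORM;
            }
        }
    }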