Changeset 33562 for gs3-extensions/maori-lang-detection/conf
Timestamp: 2019-10-11T21:52:40+13:00
Files: 1 edited
gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt
Diff r33561 → r33562 (unchanged context lines have no marker; "…" marks sections elided by the viewer):

  # FORMAT OF THIS FILE'S CONTENTS:
  # <topsite-base-url>,<value>
- # where <value> can be empty or one of SUBDOMAIN-COPY, SINGLEPAGE, <url-form-without-protocol>
+ # where <value> can or is one of
+ # empty, SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE, SINGLEPAGE, <url-form-without-protocol>
  #
- # - if value is empty: if seedurl contains topsite-base-url, the seedurl will go into the file
- # unprocessed-topsite-matches.txt and the site/page won't be crawled.
+ # - if value is left empty: if seedurl contains topsite-base-url, the seedurl will go into the
+ # file unprocessed-topsite-matches.txt and the site/page won't be crawled.
  # The user will be notified to inspect the file unprocessed-topsite-matches.txt.
  # - SINGLEPAGE: if seedurl matches topsite-base-url, then only download the page at that seedurl.
…
  # crawl to just mi.wikipedia.org.
  # Remember to leave out any protocol <from url-form-without-protocol>.
- # column 3: whether nutch should do fetch all or not
- # column 4: number of crawl iterations
+ #
+ # TODO If useful:
+ # column 3: whether nutch should do fetch all or not
+ # column 4: number of crawl iterations

  # docs.google.com is a special case: not all pages are public and any interlinking is likely to
- # be intentional. But SUBDOMAIN-COPY does not work: as seedURL's domain becomes docs.google.com
- # which, when combined with SUBDOMAIN-COPY, the Java code treats as a special case so that
- # any seedURL on docs.google.com ends up pushed out into the "unprocessed....txt" text file.
- #docs.google.com,SUBDOMAIN-COPY
+ # be intentional. Grab all linked pages, for link depth set with nutch's crawl, as long as the
+ # links are within the given topsite-base-url
  docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE

+ # Just crawl a single page for these:
  drive.google.com,SINGLEPAGE
  forms.office.com,SINGLEPAGE
…
  # Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
  # The page's containing folder is whitelisted in case the photos are there.
- korora.econ.yale.edu, ,SINGLEPAGE
+ korora.econ.yale.edu,SINGLEPAGE

  000webhost.com
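To make the `<topsite-base-url>,<value>` format above concrete, here is a minimal sketch of how a crawler might read this file. This is an illustrative Python example, not the project's actual (Java) implementation; the function name `parse_topsite_rules` is hypothetical.

```python
# Hypothetical parser for sites-too-big-to-exhaustively-crawl.txt entries.
# Assumption: each non-comment line is "<topsite-base-url>,<value>", where an
# empty <value> means matching seed URLs are set aside in
# unprocessed-topsite-matches.txt rather than crawled.

def parse_topsite_rules(text):
    """Map each topsite-base-url to its crawl instruction string."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        base_url, _, value = line.partition(",")  # value is "" if no comma
        rules[base_url.strip()] = value.strip()
    return rules

sample = """\
# comment line
docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE
drive.google.com,SINGLEPAGE
korora.econ.yale.edu,SINGLEPAGE
000webhost.com
"""

rules = parse_topsite_rules(sample)
print(rules["docs.google.com"])  # FOLLOW-LINKS-WITHIN-TOPSITE
print(rules["000webhost.com"])   # empty string: send seed to unprocessed-topsite-matches.txt
```

The one-comma split mirrors the two-column format the file's header describes; the commented-out "column 3"/"column 4" fields in the TODO would require splitting on further commas if they were ever adopted.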