Timestamp:
2019-10-14T21:04:58+13:00 (5 years ago)
Author:
ak19
Message:

1. CCWETProcessor: the domain URL now goes in as a seedURL after the individual seedURLs, after Dr Bainbridge explained why the original ordering didn't make sense. 2. conf: we inspected the first site to be crawled. It was a non-top site, but we still wanted to control the crawling of it in the same way we control top sites. 3. Documented use of the nutch command for testing which URLs pass and fail the existing regex-urlfilter checks.

File:
1 edited

  • gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt

    r33558 r33565  
    ---

    ----------------------------------------------------------------------
        Testing URLFilters: testing a URL to see if it's accepted
    ----------------------------------------------------------------------
    Use the command
        ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
    (mentioned at https://lucene.472066.n3.nabble.com/Correct-syntax-for-regex-urlfilter-txt-trying-to-exclude-single-path-results-td3600376.html)

    Use as follows:

        cd apache-nutch-2.3.1/runtime/local

        ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

    Then paste the URL you want to test and press Enter.
        A + in front of the response means the URL was accepted.
        A - in front of the response means the URL was rejected.
    You can continue pasting URLs to test against the filters until you send Ctrl-D to terminate input.
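Because the checker reads URLs from standard input until end of input, it should also be possible to feed it a list of URLs instead of pasting them one at a time. The following is a minimal sketch under that assumption: the helper names accepted_urls and rejected_urls and the file urls.txt are hypothetical, and the sketch assumes the + or - sign is the first character of each response line, as described in the notes above.

```shell
#!/bin/sh
# Hypothetical helpers for sorting URLFilterChecker responses.
# Assumption: each response line starts with "+" (accepted) or "-" (rejected);
# any other line (for example a start-up banner) is ignored.

# Print only the accepted URLs, with the leading "+" removed.
accepted_urls() {
    sed -n 's/^+//p'
}

# Print only the rejected URLs, with the leading "-" removed.
rejected_urls() {
    sed -n 's/^-//p'
}

# Example use, run from apache-nutch-2.3.1/runtime/local, where urls.txt
# is a hypothetical file with one URL per line:
#   ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined < urls.txt | rejected_urls
```

Redirecting a file into the checker replaces the paste-then-Ctrl-D interaction, since end of file terminates the input in the same way.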