Changeset 33565 for gs3-extensions


Timestamp:
2019-10-14T21:04:58+13:00 (5 years ago)
Author:
ak19
Message:

1. CCWETProcessor: the domain URL now goes in as a seedURL after the individual seedURLs, after Dr Bainbridge explained why the original ordering didn't make sense. 2. conf: we inspected the first site to be crawled; it was a non-top site, but we still wanted to control its crawling in the same way we control top sites. 3. Documented use of the nutch command for testing which URLs pass and fail the existing regex-urlfilter checks.

Location:
gs3-extensions/maori-lang-detection
Files:
3 edited

Legend:

    unmarked lines: unmodified context
    lines prefixed with +: added
    lines prefixed with -: removed
  • gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt

    r33558 → r33565

     ---

    +----------------------------------------------------------------------
    +    Testing URLFilters: testing a URL to see if it's accepted
    +----------------------------------------------------------------------
    +Use the command
    +    ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
    +(mentioned at https://lucene.472066.n3.nabble.com/Correct-syntax-for-regex-urlfilter-txt-trying-to-exclude-single-path-results-td3600376.html)
    +
    +Use it as follows:
    +
    +    cd apache-nutch-2.3.1/runtime/local
    +
    +    ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
    +
    +Then paste the URL you want to test and press Enter.
    +    A + in front of the response means the URL is accepted.
    +    A - in front of the response means the URL is rejected.
    +You can keep pasting URLs to test against the filters until you press Ctrl-D to terminate input.
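The + / - in the checker's response mirrors the + / - rules in Nutch's regex-urlfilter.txt, where the first rule whose regex matches a URL decides whether it is included or ignored. The Java sketch below only illustrates that first-match-wins idea; it is not Nutch's actual URLFilter or URLFilterChecker code, and the class name and example rules are made up.

    // Illustrative sketch only, NOT Nutch's URLFilterChecker code: demonstrates the
    // first-match-wins semantics of +/- rules in a regex-urlfilter.txt-style file.
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Pattern;

    public class RegexUrlFilterSketch {
        // insertion-ordered map: rule pattern -> true for '+' (accept), false for '-' (reject)
        private final Map<Pattern, Boolean> rules = new LinkedHashMap<>();

        public void addRule(String line) {
            char sign = line.charAt(0); // '+' or '-'
            rules.put(Pattern.compile(line.substring(1).trim()), sign == '+');
        }

        // Mimics the checker's output: "+url" if the first matching rule accepts,
        // "-url" if it rejects or if no rule matches at all.
        public String check(String url) {
            for (Map.Entry<Pattern, Boolean> rule : rules.entrySet()) {
                if (rule.getKey().matcher(url).find()) {
                    return (rule.getValue() ? "+" : "-") + url;
                }
            }
            return "-" + url;
        }

        public static void main(String[] args) {
            RegexUrlFilterSketch filter = new RegexUrlFilterSketch();
            filter.addRule("-\\.(gif|jpg|png|css|js)$"); // made-up rule: reject common asset suffixes
            filter.addRule("+^https?://");               // made-up rule: otherwise accept http(s) URLs
            System.out.println(filter.check("https://example.org/page.html")); // prints +https://...
            System.out.println(filter.check("https://example.org/logo.png"));  // prints -https://...
        }
    }

In the real setup the rules come from the Nutch filter configuration combined by -allCombined, which is what makes the checker useful for verifying the per-site regex-urlfilter.txt files this changeset's code generates.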
  • gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt

    r33562 → r33565

     #   column 3: whether nutch should do fetch all or not
     #   column 4: number of crawl iterations
    +
    +
    +# NOT TOP SITES, BUT SITES WE INSPECTED AND WANT TO CONTROL SIMILARLY TO TOP SITES
    +00.gs,SINGLEPAGE
    +
    +
    +# TOP SITES
     
     # docs.google.com is a special case: not all pages are public and any interlinking is likely to
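For context, this is roughly how a line such as 00.gs,SINGLEPAGE could be read: the first comma-separated field is the domain, and the remaining fields are the crawl-control columns described by the file's own comments (fetch-all flag, number of crawl iterations, and so on). The sketch below is hypothetical; it is not the parsing code actually used by CCWETProcessor, and the class name and the opaque handling of values like SINGLEPAGE are assumptions.

    // Hypothetical reader for sites-too-big-to-exhaustively-crawl.txt style lines;
    // not CCWETProcessor's actual parsing code.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class TopSitesConfSketch {
        public static Map<String, String[]> load(String confFile) throws IOException {
            Map<String, String[]> domainToFields = new HashMap<>();
            try (BufferedReader reader = new BufferedReader(new FileReader(confFile))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    line = line.trim();
                    if (line.isEmpty() || line.startsWith("#")) {
                        continue; // skip blank lines and comment lines
                    }
                    String[] fields = line.split(",");
                    // fields[0] = domain (e.g. 00.gs); the rest are crawl-control columns
                    // such as SINGLEPAGE, the fetch-all flag or the iteration count
                    domainToFields.put(fields[0].trim(), fields);
                }
            }
            return domainToFields;
        }
    }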
  • gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java

    r33562 → r33565

                 // files (and write regexed domain into each sites/0000#/regex-urlfilter.txt)
                 // If we ever run nutch on a single seedURLs listing containing
    -            // all seed pages to crawl sites from, the above two files will work for that. 
    +            // all seed pages to crawl sites from, the above two files will work for that.
    +
    +            // first write out the urls for the domain into the sites/0000x/seedURLs.txt file
    +            // also write into the global seeds file (with a tab prefixed to each?)
    +            Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
    +            for(String url : urlsForDomainSet) {
    +            seedURLsWriter.write(url + "\n"); // global seedURLs file
    +            siteURLsWriter.write(url + "\n");
    +            }
    +
     
                 if(allowedURLPatternRegex == null) { // entire site can be crawled

                     // since we will only be downloading the single page
     
    -                Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
    +                urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
                     for(String urlInDomain : urlsForDomainSet) {
                     // don't append slash to end this time

     
                 }
    -            }
    -
    -            // next write out the urls for the domain into the sites/0000x/seedURLs.txt file
    -            // also write into the global seeds file (with a tab prefixed to each?)
    -            Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
    -            for(String url : urlsForDomainSet) {
    -            seedURLsWriter.write(url + "\n"); // global seedURLs file
    -            siteURLsWriter.write(url + "\n");
    -            }
                 }
     
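Reading the two hunks together, the per-domain flow after this changeset writes the domain's individual URLs into the global seedURLs file and the per-site sites/0000x/seedURLs.txt before the allowedURLPatternRegex branch is evaluated. The sketch below only illustrates that reordering; the method signature, writer setup and the handling inside each branch are simplified assumptions, not the full CCWETProcessor code.

    // Simplified sketch of the reordered per-domain logic after r33565;
    // not the full CCWETProcessor method (regex-urlfilter output and error handling omitted).
    import java.io.IOException;
    import java.io.Writer;
    import java.util.Map;
    import java.util.Set;

    public class SeedURLOrderingSketch {
        static void processDomain(String domainWithProtocol,
                                  Map<String, Set<String>> domainsToURLsMap,
                                  String allowedURLPatternRegex,
                                  Writer seedURLsWriter,
                                  Writer siteURLsWriter) throws IOException {
            // 1. first write out the domain's individual URLs: into the global seedURLs
            //    file and into this site's sites/0000x/seedURLs.txt
            Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
            for (String url : urlsForDomainSet) {
                seedURLsWriter.write(url + "\n"); // global seedURLs file
                siteURLsWriter.write(url + "\n"); // per-site seedURLs.txt
            }

            // 2. only then decide how the site itself may be crawled
            if (allowedURLPatternRegex == null) {
                // entire site can be crawled: the whole domain goes into regex-urlfilter.txt
            } else {
                // crawling is restricted (e.g. to the listed pages only), so the same
                // urlsForDomainSet is reused to build the site's URL filter patterns
            }
        }
    }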