Changeset 33565

Timestamp: 14.10.2019 21:04:58
Author: ak19
Message:

1. CCWETProcessor: the domain URL now goes in as a seedURL after the individual seedURLs, after Dr Bainbridge explained why the original ordering didn't make sense. 2. conf: we inspected the first site to be crawled. It was a non-top site, but we still wanted to control the crawling of it in the same way we control top sites. 3. Documented use of the nutch command for testing which URLs pass and fail the existing regex-urlfilter checks.

Location: gs3-extensions/maori-lang-detection
Files: 3 modified

Legend: lines showing both an old and a new line number are unmodified context; lines with only a new line number (right column) were added, and lines with only an old line number (left column) were removed.
  • gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt

    r33558 r33565
    293 293  ---
    294 294  
        295  ----------------------------------------------------------------------
        296      Testing URLFilters: testing a URL to see if it's accepted
        297  ----------------------------------------------------------------------
        298  Use the command
        299      ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
        300  (mentioned at https://lucene.472066.n3.nabble.com/Correct-syntax-for-regex-urlfilter-txt-trying-to-exclude-single-path-results-td3600376.html)
        301  
        302  Use as follows:
        303  
        304      cd apache-nutch-2.3.1/runtime/local
        305  
        306      ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
        307  
        308  Then paste the URL you want to test, press Enter.
        309      A + in front of response means accepted
        310      A - in front of response means rejected.
        311  Can continue pasting URLs to test against filters until you send Ctrl-D to terminate input.
        312  
        313  
        314  
        315  
        316  
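
For reference, the same filter chain can also be exercised from Java rather than interactively. The following is a minimal sketch, assuming Nutch's org.apache.nutch.net.URLFilters API and a conf/ directory (containing regex-urlfilter.txt) on the classpath; the class name URLFilterCheck is made up for illustration and is not part of this changeset.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilters;
    import org.apache.nutch.util.NutchConfiguration;

    // Checks each URL given on the command line against the configured filter chain,
    // printing + for accepted and - for rejected, mirroring the interactive checker.
    public class URLFilterCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = NutchConfiguration.create(); // picks up conf/ from the classpath
            URLFilters filters = new URLFilters(conf);
            for (String url : args) {
                // filter() returns the URL when every filter accepts it, or null when one rejects it
                String result = filters.filter(url);
                System.out.println((result != null ? "+" : "-") + " " + url);
            }
        }
    }
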
  • gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt

    r33562 r33565
    50  50   #   column 3: whether nutch should do fetch all or not
    51  51   #   column 4: number of crawl iterations
        52   
        53   
        54   # NOT TOP SITES, BUT SITES WE INSPECTED AND WANT TO CONTROL SIMILARLY TO TOP SITES
        55   00.gs,SINGLEPAGE
        56   
        57   
        58   # TOP SITES
    52  59   
    53  60   # docs.google.com is a special case: not all pages are public and any interlinking is likely to
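
To make the file's layout concrete, here is a hypothetical sketch of how an entry such as 00.gs,SINGLEPAGE could be read, assuming comma-separated columns where column 1 is the domain and column 2 the crawl-control keyword, with the later columns (fetch-all flag, iteration count) optional; the class and method names below are illustrative and are not taken from CCWETProcessor.java.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Reads entries in the style of conf/sites-too-big-to-exhaustively-crawl.txt:
    // '#' lines and blank lines are skipped; remaining lines are split on commas.
    public class SiteControlList {
        public static Map<String, String[]> read(String path) throws IOException {
            Map<String, String[]> domainToColumns = new LinkedHashMap<>();
            for (String line : Files.readAllLines(Paths.get(path))) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) continue; // comment or blank line
                String[] cols = line.split(",");
                domainToColumns.put(cols[0], cols); // e.g. "00.gs" -> ["00.gs", "SINGLEPAGE"]
            }
            return domainToColumns;
        }
    }
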
  • gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java

    r33562 r33565
    423 423             // files (and write regexed domain into each sites/0000#/regex-urlfilter.txt)
    424 424             // If we ever run nutch on a single seedURLs listing containing
    425                 // all seed pages to crawl sites from, the above two files will work for that.
        425             // all seed pages to crawl sites from, the above two files will work for that.
        426 
        427             // first write out the urls for the domain into the sites/0000x/seedURLs.txt file
        428             // also write into the global seeds file (with a tab prefixed to each?)
        429             Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
        430             for(String url : urlsForDomainSet) {
        431             seedURLsWriter.write(url + "\n"); // global seedURLs file
        432             siteURLsWriter.write(url + "\n");
        433             }
        434 
    426 435 
    427 436             if(allowedURLPatternRegex == null) { // entire site can be crawled

    455 464                 // since we will only be downloading the single page
    456 465 
    457                     Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
        466                 urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
    458 467                 for(String urlInDomain : urlsForDomainSet) {
    459 468                 // don't append slash to end this time

    482 491 
    483 492             }
    484                 }
    485 
    486                 // next write out the urls for the domain into the sites/0000x/seedURLs.txt file
    487                 // also write into the global seeds file (with a tab prefixed to each?)
    488                 Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
    489                 for(String url : urlsForDomainSet) {
    490                 seedURLsWriter.write(url + "\n"); // global seedURLs file
    491                 siteURLsWriter.write(url + "\n");
    492 493             }
    493 494 
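
In effect, the change moves the writing of the per-domain seed URLs ahead of the whole-site branch, so the domain URL only goes into the seed files after the individual seedURLs, as the commit message describes. The condensed sketch below illustrates that ordering, reusing the writer and map names from the diff above; the method name and the exact form in which the domain URL is written are assumptions, not code taken from CCWETProcessor.java.

    import java.io.IOException;
    import java.io.Writer;
    import java.util.Map;
    import java.util.Set;

    class SeedOrderingSketch {
        // 1. the individual page URLs for the domain are written to both seed files first;
        // 2. only afterwards, when the whole site may be crawled, is the domain URL itself added.
        void writeSeedsForDomain(String domainWithProtocol,
                                 Map<String, Set<String>> domainsToURLsMap,
                                 Writer seedURLsWriter, Writer siteURLsWriter,
                                 String allowedURLPatternRegex) throws IOException {
            Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
            for (String url : urlsForDomainSet) {
                seedURLsWriter.write(url + "\n"); // global seedURLs file
                siteURLsWriter.write(url + "\n"); // per-site sites/0000x/seedURLs.txt
            }
            if (allowedURLPatternRegex == null) { // entire site can be crawled
                seedURLsWriter.write(domainWithProtocol + "\n");
                siteURLsWriter.write(domainWithProtocol + "\n");
            }
        }
    }
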