Changeset 33565
- Timestamp: 2019-10-14T21:04:58+13:00 (5 years ago)
- Location: gs3-extensions/maori-lang-detection
- Files: 3 edited
gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (r33558 -> r33565)

Added the following section after line 294:

    ----------------------------------------------------------------------
    Testing URLFilters: testing a URL to see if it's accepted
    ----------------------------------------------------------------------
    Use the command:

        ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

    (mentioned at https://lucene.472066.n3.nabble.com/Correct-syntax-for-regex-urlfilter-txt-trying-to-exclude-single-path-results-td3600376.html)

    Use as follows:

        cd apache-nutch-2.3.1/runtime/local
        ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

    Then paste the URL you want to test and press Enter.
    A + in front of the response means the URL was accepted.
    A - in front of the response means the URL was rejected.
    You can keep pasting URLs to test against the filters until you press Ctrl-D to terminate input.
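The +/- decision above comes from ordered regex rules of the kind found in a regex-urlfilter.txt. The sketch below is a hypothetical illustration of that accept/reject convention, not Nutch's actual URLFilterChecker code; the class name, rule list, and sample URLs are all invented for the example.

```java
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // Rules are tried in order; a leading '+' accepts, '-' rejects,
    // mimicking the regex-urlfilter.txt convention. These two rules are
    // illustrative assumptions only.
    static final List<String> RULES = List.of(
        "-\\.(gif|jpg|png|css|js)$",  // reject common asset suffixes
        "+^https?://"                 // accept anything else with an http(s) scheme
    );

    // Returns the URL prefixed with '+' if accepted, '-' if rejected,
    // like the checker's console output.
    static String check(String url) {
        for (String rule : RULES) {
            Pattern p = Pattern.compile(rule.substring(1));
            if (p.matcher(url).find()) {
                return rule.charAt(0) + url;
            }
        }
        return "-" + url; // default: rejected when no rule matches
    }

    public static void main(String[] args) {
        System.out.println(check("https://example.org/page.html"));
        System.out.println(check("https://example.org/logo.png"));
    }
}
```

The first matching rule wins, so rule order matters just as it does in a real filter file.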
gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (r33562 -> r33565)

Context (unchanged comment lines at 50-51):

    # column 3: whether nutch should do fetch all or not
    # column 4: number of crawl iterations

Added:

    # NOT TOP SITES, BUT SITES WE INSPECTED AND WANT TO CONTROL SIMILARLY TO TOP SITES
    00.gs,SINGLEPAGE

    # TOP SITES

Unchanged context follows:

    # docs.google.com is a special case: not all pages are public and any interlinking is likely to
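A data line in this file is comma-separated (e.g. "00.gs,SINGLEPAGE"), with '#' lines as comments. The following is a minimal, hypothetical parser for that shape; the field meanings beyond what the file's own comments state (column 3: fetch all or not, column 4: number of crawl iterations) are assumptions, and this is not the actual parsing code in CCWETProcessor.

```java
import java.util.ArrayList;
import java.util.List;

public class SiteListParserSketch {
    // Returns the comma-separated fields of a data line,
    // or null for comment/blank lines.
    static String[] parseLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return null; // skip comments and blank lines
        }
        return trimmed.split(",");
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "# TOP SITES",
            "",
            "00.gs,SINGLEPAGE");
        List<String[]> entries = new ArrayList<>();
        for (String line : lines) {
            String[] fields = parseLine(line);
            if (fields != null) {
                entries.add(fields);
            }
        }
        // The single data line yields one entry: site "00.gs", control value "SINGLEPAGE"
        System.out.println(entries.get(0)[0] + " -> " + entries.get(0)[1]);
    }
}
```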
gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (r33562 -> r33565)

Moved the loop that writes each domain's URLs into sites/0000x/seedURLs.txt and the global seeds file so that it runs before the allowedURLPatternRegex branching instead of after it.

Unchanged context (lines 423-425; line 425 has a whitespace-only change):

    // files (and write regexed domain into each sites/0000#/regex-urlfilter.txt)
    // If we ever run nutch on a single seedURLs listing containing
    // all seed pages to crawl sites from, the above two files will work for that.

Added before the if(allowedURLPatternRegex == null) branch (new lines 426-435):

    // first write out the urls for the domain into the sites/0000x/seedURLs.txt file
    // also write into the global seeds file (with a tab prefixed to each?)
    Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
    for(String url : urlsForDomainSet) {
        seedURLsWriter.write(url + "\n"); // global seedURLs file
        siteURLsWriter.write(url + "\n");
    }

Changed in the single-page branch ("since we will only be downloading the single page"), reusing the variable now declared earlier:

    - Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
    + urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);

Removed the original copy of the loop after the branch (old lines 485-492):

    - // next write out the urls for the domain into the sites/0000x/seedURLs.txt file
    - // also write into the global seeds file (with a tab prefixed to each?)
    - Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
    - for(String url : urlsForDomainSet) {
    -     seedURLsWriter.write(url + "\n"); // global seedURLs file
    -     siteURLsWriter.write(url + "\n");
    - }
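The net effect of the relocated loop can be shown in isolation. This sketch uses StringWriter stand-ins for the two file writers; the names domainsToURLsMap, seedURLsWriter and siteURLsWriter mirror the changeset, but the sample domain and URLs are invented.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.util.Map;
import java.util.Set;

public class SeedUrlWriteSketch {
    public static void main(String[] args) throws IOException {
        // Invented sample data standing in for the real map built by CCWETProcessor.
        Map<String, Set<String>> domainsToURLsMap = Map.of(
            "https://example.org",
            Set.of("https://example.org/a", "https://example.org/b"));

        StringWriter seedURLsWriter = new StringWriter(); // global seedURLs file
        StringWriter siteURLsWriter = new StringWriter(); // per-site sites/0000x/seedURLs.txt

        String domainWithProtocol = "https://example.org";

        // After r33565 this loop runs before the allowedURLPatternRegex branching,
        // so every URL of the domain reaches both files exactly once,
        // regardless of which branch is taken afterwards.
        Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
        for (String url : urlsForDomainSet) {
            seedURLsWriter.write(url + "\n"); // global seedURLs file
            siteURLsWriter.write(url + "\n");
        }

        System.out.print(seedURLsWriter);
    }
}
```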