Changeset 33528
- Timestamp: 2019-09-26T21:47:13+12:00
- File: 1 moved
gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt
r33527 r33528
  https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps
  
+ -----------
+ NUTCH
+ -----------
+ https://stackoverflow.com/questions/35449673/nutch-and-solr-indexing-blacklist-domain
+ https://nutch.apache.org/apidocs/apidocs-1.6/org/apache/nutch/urlfilter/domainblacklist/DomainBlacklistURLFilter.html
+ 
+ https://lucene.472066.n3.nabble.com/blacklist-for-crawling-td618343.html
+ https://lucene.472066.n3.nabble.com/Content-of-size-X-was-truncated-to-Y-td4003517.html
  
+ 
+ Google: nutch mirror web site
+ https://stackoverflow.com/questions/33354460/nutch-clone-website
+ [https://stackoverflow.com/questions/35714897/nutch-not-crawling-entire-website
+ fetch -all seems to be a nutch v2 thing?]
+ 
+ Google: nutch performance tuning
+ * https://stackoverflow.com/questions/24383212/apache-nutch-performance-tuning-for-whole-web-crawling
+ * https://stackoverflow.com/questions/4871972/how-to-speed-up-crawling-in-nutch
+ 
+ Nutch v2 installation and set up:
+ https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781783286850/1/ch01lvl1sec09/installing-and-configuring-apache-nutch
+ 
+ 
+ SOLR:
+ * Query syntax: http://www.solrtutorial.com/solr-query-syntax.html
+ * Deleting a core: https://factorpad.com/tech/solr/reference/solr-delete.html
+ 