Changeset 33558
Timestamp: 2019-10-10T23:41:36+13:00 (4 years ago)
File: 1 edited
gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt
--- r33545
+++ r33558
@@ -4,5 +4,9 @@
 https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps
 
 https://cwiki.apache.org/confluence/display/nutch/
+https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Crawling
+https://cwiki.apache.org/confluence/display/nutch/ReaddbOptions
+
+https://moz.com/top500
 -----------
 NUTCH
@@ -43,8 +47,17 @@
 
 
+Nutch doesn't work with spark (yet):
+https://stackoverflow.com/questions/29950299/distributed-web-crawling-using-apache-spark-is-it-possible
+
 SOLR:
 * Query syntax: http://www.solrtutorial.com/solr-query-syntax.html
 * Deleting a core: https://factorpad.com/tech/solr/reference/solr-delete.html
 
+
+* If you change a nutch 2 configuration, https://stackoverflow.com/questions/16401667/java-lang-classnotfoundexception-org-apache-gora-hbase-store-hbasestore
+explains you can rebuild nutch with:
+cd <apache-nutch>
+ant clean
+ant runtime
 ----------------------------------
 Apache Nutch 2 with newer HBase
@@ -71,4 +84,5 @@
 * https://waue0920.wordpress.com/2016/08/25/nutch-2-3-1-hbase-0-98-hadoop-2-5-solr-4-10-3/
 
+The older but recommended hbase 0.98.21 for hadoop 2 can be downloaded from https://archive.apache.org/dist/hbase/0.98.21/
 
 -----
@@ -77,4 +91,5 @@
 https://learnhbase.net/2013/03/02/hbase-shell-commands/
 http://dwgeek.com/read-hbase-table-using-hbase-shell-get-command.html/
+dropping tables: https://www.tutorialspoint.com/hbase/hbase_drop_table.htm
 
 > list
@@ -184,5 +199,5 @@
 Dump output on local filesystem:
 rm -rf /tmp/bla
-./bin/nutch readdb -dump /tmp/bla
+./bin/nutch readdb -dump /tmp/bla [-crawlId ID -text]
 less /tmp/bla/part-r-00000
 
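The two command sequences this changeset adds or changes (rebuilding Nutch 2 after a configuration change, and dumping the crawl database) can be sketched as a small shell helper. This is a minimal sketch and not part of the committed file: NUTCH_HOME, NUTCH_BIN, DUMP_DIR, CRAWL_ID, DRY_RUN and the function names are hypothetical, and the -crawlId and -text options are taken as written in the changed line. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper; all variable and function names are assumptions
# made for illustration, not something defined by Nutch or by the notes.
NUTCH_HOME="${NUTCH_HOME:-./apache-nutch}"  # assumed location of <apache-nutch>
NUTCH_BIN="${NUTCH_BIN:-./bin/nutch}"       # launcher path as written in the notes
DUMP_DIR="${DUMP_DIR:-/tmp/bla}"            # dump directory used in the notes
CRAWL_ID="${CRAWL_ID:-}"                    # optional crawl id; empty means omit
DRY_RUN="${DRY_RUN:-1}"                     # 1 = print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it in this shell.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

# Rebuild Nutch after a configuration change: cd, ant clean, ant runtime.
# Note: outside dry-run mode the cd changes the caller's working directory.
rebuild_nutch() {
    run cd "$NUTCH_HOME" || return 1
    run ant clean || return 1
    run ant runtime
}

# Remove any old dump, then dump the crawl db to the local filesystem.
# The optional flags are added only when CRAWL_ID is set.
dump_crawldb() {
    run rm -rf "$DUMP_DIR" || return 1
    if [ -n "$CRAWL_ID" ]; then
        run "$NUTCH_BIN" readdb -dump "$DUMP_DIR" -crawlId "$CRAWL_ID" -text
    else
        run "$NUTCH_BIN" readdb -dump "$DUMP_DIR"
    fi
}
```

After sourcing the file, calling rebuild_nutch or dump_crawldb prints the planned commands; setting DRY_RUN=0 executes them instead. The output can then be inspected with less /tmp/bla/part-r-00000, as in the notes.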