Changeset 33558 for gs3-extensions


Timestamp: 2019-10-10T23:41:36+13:00
Author: ak19
Message: Committing cumulative changes since last commit.
File: 1 edited

Index: gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt
===================================================================
--- gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (revision 33545)
+++ gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt (revision 33558)
@@ -4,5 +4,9 @@
 https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps
 
-
+https://cwiki.apache.org/confluence/display/nutch/
+https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Crawling
+https://cwiki.apache.org/confluence/display/nutch/ReaddbOptions
+
+https://moz.com/top500
 -----------
 NUTCH
@@ -43,8 +47,17 @@
 
 
+Nutch doesn't work with spark (yet):
+https://stackoverflow.com/questions/29950299/distributed-web-crawling-using-apache-spark-is-it-possible
+
 SOLR:
 * Query syntax: http://www.solrtutorial.com/solr-query-syntax.html
 * Deleting a core: https://factorpad.com/tech/solr/reference/solr-delete.html
 
+
+* If you change a nutch 2 configuration, https://stackoverflow.com/questions/16401667/java-lang-classnotfoundexception-org-apache-gora-hbase-store-hbasestore
+explains you can rebuild nutch with:
+     cd <apache-nutch>
+     ant clean
+     ant runtime
 ----------------------------------
 Apache Nutch 2 with newer HBase
@@ -71,4 +84,5 @@
 * https://waue0920.wordpress.com/2016/08/25/nutch-2-3-1-hbase-0-98-hadoop-2-5-solr-4-10-3/
 
+The older but recommended hbase 0.98.21 for hadoop 2 can be downloaded from https://archive.apache.org/dist/hbase/0.98.21/
 
 -----
@@ -77,4 +91,5 @@
 https://learnhbase.net/2013/03/02/hbase-shell-commands/
 http://dwgeek.com/read-hbase-table-using-hbase-shell-get-command.html/
+dropping tables: https://www.tutorialspoint.com/hbase/hbase_drop_table.htm
 
 > list
@@ -184,5 +199,5 @@
 Dump output on local filesystem:
     rm -rf /tmp/bla
-    ./bin/nutch readdb -dump /tmp/bla
+    ./bin/nutch readdb -dump /tmp/bla [-crawlId ID -text]
     less /tmp/bla/part-r-00000
 
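
The two command sequences added in this changeset (rebuilding Nutch 2 after a configuration change, and dumping the web table to the local filesystem) could be wrapped in a small script along the following lines. This is only a sketch and not part of the commit: the function names run, rebuild_nutch and dump_webtable and the RUN variable are made up for illustration, and the script only prints the commands unless RUN=1 is set. The commands themselves and the -crawlId and -text options are taken from the notes above.

```shell
#!/bin/sh
# Sketch only: by default each command is printed, not executed.
# Set RUN=1 in the environment to actually run the commands.

# Hypothetical helper: print the command, and execute it only when RUN=1.
run() {
    printf '%s\n' "$*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

# Rebuild Nutch 2 after a configuration change.
# $1 is the Nutch source directory (the <apache-nutch> placeholder in the notes).
rebuild_nutch() {
    run cd "$1" && run ant clean && run ant runtime
}

# Dump the web table to a local directory.
# $1 is the output directory; any further arguments (for example
# -crawlId ID -text) are passed through to readdb unchanged.
# Uses the relative path ./bin/nutch, as the notes do, so with RUN=1 it must
# be called from the directory that contains bin/nutch.
dump_webtable() {
    out_dir="$1"
    shift
    run rm -rf "$out_dir" && run ./bin/nutch readdb -dump "$out_dir" "$@"
}
```

For example, dump_webtable /tmp/bla -crawlId ID -text prints the rm and readdb commands from the notes; the resulting /tmp/bla/part-r-00000 can then be inspected with less as before. Printing by default is a deliberate choice here, since the dump step starts with rm -rf on the output directory.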