Changeset 33558

Timestamp: 10.10.2019 23:41:36
Author: ak19
Message: Committing cumulative changes since last commit.

Files: 1 modified

  • gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt

    r33545 r33558  
 https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps
 
-
+https://cwiki.apache.org/confluence/display/nutch/
+https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Crawling
+https://cwiki.apache.org/confluence/display/nutch/ReaddbOptions
+
+https://moz.com/top500
 -----------
 NUTCH
     
 
 
+Nutch doesn't work with spark (yet):
+https://stackoverflow.com/questions/29950299/distributed-web-crawling-using-apache-spark-is-it-possible
+
 SOLR:
 * Query syntax: http://www.solrtutorial.com/solr-query-syntax.html
 * Deleting a core: https://factorpad.com/tech/solr/reference/solr-delete.html
 
+
+* If you change a nutch 2 configuration, https://stackoverflow.com/questions/16401667/java-lang-classnotfoundexception-org-apache-gora-hbase-store-hbasestore
+explains you can rebuild nutch with:
+     cd <apache-nutch>
+     ant clean
+     ant runtime
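
A minimal sketch of the rebuild steps above wrapped in a shell function, assuming `ant` is on the PATH. The function name `rebuild_nutch` and its single argument (the directory the notes call `<apache-nutch>`) are hypothetical and not part of Nutch itself.

```shell
#!/bin/sh
# Hypothetical helper: rebuild Nutch 2 after a configuration change.
# $1 is the Nutch source directory (<apache-nutch> in the notes above).
rebuild_nutch() {
    nutch_home="${1:?usage: rebuild_nutch <apache-nutch dir>}"
    # Use a subshell so the caller's working directory is left unchanged.
    (
        cd "$nutch_home" || exit 1
        # Same two targets as in the notes; stop if the clean step fails.
        ant clean && ant runtime
    )
}
```

It would be called as `rebuild_nutch /path/to/apache-nutch`; the function returns the exit status of the last `ant` invocation, so a failed build can be detected by the caller.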
 ----------------------------------
 Apache Nutch 2 with newer HBase
     
 * https://waue0920.wordpress.com/2016/08/25/nutch-2-3-1-hbase-0-98-hadoop-2-5-solr-4-10-3/
 
+The older but recommended hbase 0.98.21 for hadoop 2 can be downloaded from https://archive.apache.org/dist/hbase/0.98.21/
 
 -----
     
 https://learnhbase.net/2013/03/02/hbase-shell-commands/
 http://dwgeek.com/read-hbase-table-using-hbase-shell-get-command.html/
+dropping tables: https://www.tutorialspoint.com/hbase/hbase_drop_table.htm
 
 > list
     
 Dump output on local filesystem:
     rm -rf /tmp/bla
-    ./bin/nutch readdb -dump /tmp/bla
+    ./bin/nutch readdb -dump /tmp/bla [-crawlId ID -text]
     less /tmp/bla/part-r-00000
 
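
The dump steps in the last hunk could be sketched as one shell function. This assumes it is run from the Nutch runtime directory, so that `./bin/nutch` resolves, and that `-crawlId` and `-text` are passed together, as the bracketed form in the changed line suggests. The function name `dump_crawldb` and its arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: dump the crawl db to a local directory.
# $1 = output directory (defaults to /tmp/bla, as in the notes)
# $2 = optional crawl ID; when given, -crawlId ID -text are appended.
dump_crawldb() {
    out_dir="${1:-/tmp/bla}"
    crawl_id="$2"
    # Remove any previous dump first, as the notes do.
    rm -rf "$out_dir"
    if [ -n "$crawl_id" ]; then
        ./bin/nutch readdb -dump "$out_dir" -crawlId "$crawl_id" -text
    else
        ./bin/nutch readdb -dump "$out_dir"
    fi
}
```

After `dump_crawldb /tmp/bla ID`, the result can be inspected with `less /tmp/bla/part-r-00000` as in the notes. Note that the function deletes the output directory before dumping, so it should only be pointed at a scratch location.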