Changeset 33537

Timestamp: 30.09.2019 22:51:36 (2 weeks ago)
Author: ak19
Message:

More nutch and general site mirroring related links

Files:
1 modified

  • gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt

    r33529 r33537  
    1919fetch -all seems to be a nutch v2 thing?] 
    2020 
     21Google (30 Sep): site mirroring with nutch 
     22https://grokbase.com/t/nutch/user/125sfbg0pt/using-nutch-for-web-site-mirroring 
     23https://lucene.472066.n3.nabble.com/Using-nutch-just-for-the-crawler-fetcher-td611918.html 
     24http://www.cs.ucy.ac.cy/courses/EPL660/lectures/lab6.pdf 
     25    slide p.5 onwards 
     26 
     27crawler software options: https://repositorio.iscte-iul.pt/bitstream/10071/2871/1/Building%20a%20Scalable%20Index%20and%20Web%20Search%20Engine%20for%20Music%20on.pdf 
     28See also p.20. HTTrack 
     29 
     30 
    2131Google: nutch performance tuning 
    2232* https://stackoverflow.com/questions/24383212/apache-nutch-performance-tuning-for-whole-web-crawling 
    2333* https://stackoverflow.com/questions/4871972/how-to-speed-up-crawling-in-nutch 
    24  
     34* https://cwiki.apache.org/confluence/display/nutch/OptimizingCrawls 
    2535 
    2636NUTCH INSTALLATION: 
     
    3646* Deleting a core: https://factorpad.com/tech/solr/reference/solr-delete.html 
    3747 
     48---------------------------------- 
     49Apache Nutch 2 with newer HBase 
     50 
     51hbase-common-1.4.8.jar 
     52 
     531. hbase jar files need to go into runtime/local/lib 
     54 
     55Exclude slf4j-log4j12-1.7.10.jar, though: runtime/local/lib already contains slf4j-log4j12-1.7.5.jar, so delete the 1.7.10 jar from runtime/local/lib after copying the HBase jars over. 
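Step 1 can be sketched as a small shell function. This is a rough sketch, not a tested procedure: the function name `copy_hbase_jars` is hypothetical, both directory arguments are assumptions about the local install, and copying every jar from the HBase lib directory is a guess, since the note does not say exactly which jars are needed.

```shell
#!/bin/sh
# Sketch only: copy jars from an HBase lib directory into Nutch's
# runtime/local/lib, then drop the duplicate SLF4J binding.
# Both directory arguments are assumptions; adjust them to the local install.
copy_hbase_jars() {
    hbase_lib="$1"   # e.g. /usr/local/hbase/lib (assumed location)
    nutch_lib="$2"   # e.g. <nutch>/runtime/local/lib

    # Copy every jar; narrowing this to the jars Nutch actually needs is left open.
    cp "$hbase_lib"/*.jar "$nutch_lib"/ || return 1

    # runtime/local/lib already has slf4j-log4j12-1.7.5.jar, so remove the newer copy.
    rm -f "$nutch_lib/slf4j-log4j12-1.7.10.jar"
}
```

A call would look like `copy_hbase_jars /usr/local/hbase/lib "$NUTCH_HOME/runtime/local/lib"`, where `NUTCH_HOME` is a placeholder for wherever Nutch was unpacked.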
     56 
     572. https://stackoverflow.com/questions/46340416/how-to-compile-nutch-2-3-1-with-hbase-1-2-6 
     58    https://stackoverflow.com/questions/39834423/apache-nutch-fetcherjob-throws-nosuchelementexception-deep-in-gora/39837926#39837926 
     59 
     60Unfortunately, the page https://paste.apache.org/jjqz referred to in the links above, which contained patches for using Gora 0.7, is no longer available. 
     61 
     62http://mail-archives.apache.org/mod_mbox/nutch-user/201602.mbox/%3C56B2EA23.8080801@cisinlabs.com%3E 
     63 
     64https://www.mail-archive.com/user@nutch.apache.org/msg14245.html 
     65 
     66------------------------------------------------------------------------------ 
     67Alternative: Nutch on its own Vagrant box with a specific HBase version, or Nutch with MongoDB 
     68------------------------------------------------------------------------------ 
     69* https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/ 
     70* https://waue0920.wordpress.com/2016/08/25/nutch-2-3-1-hbase-0-98-hadoop-2-5-solr-4-10-3/ 
     71 
     72 
     73----- 
     74HBASE commands 
     75/usr/local/hbase/bin/hbase shell 
     76https://learnhbase.net/2013/03/02/hbase-shell-commands/ 
     77 
     78 
     79list 
     80 
     81davidbHomePage_webpage is a table 
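The shell session above could be scripted roughly as follows. This is a sketch under assumptions: the helper name `hbase_inspect_commands` is hypothetical, and `describe`, `count` and `scan` go beyond the `list` command recorded in the notes. They are standard HBase shell commands, but nothing here says they were run against this table.

```shell
#!/bin/sh
# Sketch only: print HBase shell commands that inspect one table.
# The output is plain text meant to be piped into the HBase shell.
hbase_inspect_commands() {
    table="$1"                                   # e.g. davidbHomePage_webpage
    printf "list\n"                              # show all tables
    printf "describe '%s'\n" "$table"            # column families and settings
    printf "count '%s'\n" "$table"               # number of rows
    printf "scan '%s', {LIMIT => 1}\n" "$table"  # peek at a single row
}
```

Piping the output into the shell, for example `hbase_inspect_commands davidbHomePage_webpage | /usr/local/hbase/bin/hbase shell -n`, assumes the installed HBase version supports the non-interactive `-n` flag.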
     82 
     83