Changeset 33537 for gs3-extensions


Timestamp: 2019-09-30T22:51:36+13:00
Author: ak19
Message: More nutch and general site mirroring related links

File: 1 edited

  • gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt

    r33529  r33537
        19      19  fetch -all seems to be a nutch v2 thing?]
        20      20
                21  Google (30 Sep): site mirroring with nutch
                22  https://grokbase.com/t/nutch/user/125sfbg0pt/using-nutch-for-web-site-mirroring
                23  https://lucene.472066.n3.nabble.com/Using-nutch-just-for-the-crawler-fetcher-td611918.html
                24  http://www.cs.ucy.ac.cy/courses/EPL660/lectures/lab6.pdf
                25      slide p.5 onwards
                26
                27  crawler softw options: https://repositorio.iscte-iul.pt/bitstream/10071/2871/1/Building%20a%20Scalable%20Index%20and%20Web%20Search%20Engine%20for%20Music%20on.pdf
                28  See also p.20. HTTrack
                29
                30
        21      31  Google: nutch performance tuning
        22      32  * https://stackoverflow.com/questions/24383212/apache-nutch-performance-tuning-for-whole-web-crawling
        23      33  * https://stackoverflow.com/questions/4871972/how-to-speed-up-crawling-in-nutch
        24
                34  * https://cwiki.apache.org/confluence/display/nutch/OptimizingCrawls
        25      35
        26      36  NUTCH INSTALLATION:

        36      46  * Deleting a core: https://factorpad.com/tech/solr/reference/solr-delete.html
        37      47
                48  ----------------------------------
                49  Apache Nutch 2 with newer HBase
                50
                51  hbase-common-1.4.8.jar
                52
                53  1. hbase jar files need to go into runtime/local/lib
                54
                55  But not slf4j-log4j12-1.7.10.jar (there's already a slf4j-log4j12-1.7.5.jar) - so remove that from runtime/local/lib after copying it over.
                56
                57  2. https://stackoverflow.com/questions/46340416/how-to-compile-nutch-2-3-1-with-hbase-1-2-6
                58      https://stackoverflow.com/questions/39834423/apache-nutch-fetcherjob-throws-nosuchelementexception-deep-in-gora/39837926#39837926
                59
                60  Unfortunately, the page https://paste.apache.org/jjqz referred to above that contains patches for using Gora 0.7 is no longer available.
                61
                62  http://mail-archives.apache.org/mod_mbox/nutch-user/201602.mbox/%[email protected]%3E
                63
                64  https://www.mail-archive.com/[email protected]/msg14245.html
                65
                66  ------------------------------------------------------------------------------
                67  Other way: Nutch on its own vagrant with specified hbase or nutch with mongodb
                68  ------------------------------------------------------------------------------
                69  * https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/
                70  * https://waue0920.wordpress.com/2016/08/25/nutch-2-3-1-hbase-0-98-hadoop-2-5-solr-4-10-3/
                71
                72
                73  -----
                74  HBASE commands
                75  /usr/local/hbase/bin/hbase shell
                76  https://learnhbase.net/2013/03/02/hbase-shell-commands/
                77
                78
                79  list
                80
                81  davidbHomePage_webpage is a table
                82
                83
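
Step 1 of the "Apache Nutch 2 with newer HBase" notes in the diff (copy the HBase jars into runtime/local/lib, but do not end up with a second slf4j-log4j12 jar) could be scripted roughly as follows. This is only a sketch under assumptions: the source directory /usr/local/hbase/lib is a guess based on the /usr/local/hbase/bin/hbase path in the notes, the function name copy_hbase_jars is hypothetical, and the script skips the slf4j-log4j12 jar up front instead of copying it and deleting it afterwards as the notes describe.

```python
import shutil
from pathlib import Path

# Assumed locations (not stated in the notes); adjust to the real install.
HBASE_LIB = Path("/usr/local/hbase/lib")
NUTCH_LIB = Path("runtime/local/lib")

# Jar-name prefixes to leave out, because Nutch already ships its own
# slf4j-log4j12 binding in runtime/local/lib.
SKIP_PREFIXES = ("slf4j-log4j12-",)


def copy_hbase_jars(src_dir, dest_dir, skip_prefixes=SKIP_PREFIXES):
    """Copy every *.jar from src_dir into dest_dir, except skipped prefixes.

    Returns the sorted list of jar file names that were copied.
    """
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for jar in sorted(src_dir.glob("*.jar")):
        if jar.name.startswith(skip_prefixes):
            continue  # keep the slf4j binding that is already in place
        shutil.copy2(jar, dest_dir / jar.name)
        copied.append(jar.name)
    return copied
```

Calling copy_hbase_jars(HBASE_LIB, NUTCH_LIB) from the Nutch checkout directory would then perform the copy; the returned list makes it easy to check which jars were brought over.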