https://codereview.stackexchange.com/questions/198343/crawl-and-gather-all-the-urls-recursively-in-a-domain
http://lucene.472066.n3.nabble.com/Using-nutch-just-for-the-crawler-fetcher-td611918.html

https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps

-----------
NUTCH
-----------
https://stackoverflow.com/questions/35449673/nutch-and-solr-indexing-blacklist-domain
https://nutch.apache.org/apidocs/apidocs-1.6/org/apache/nutch/urlfilter/domainblacklist/DomainBlacklistURLFilter.html

https://lucene.472066.n3.nabble.com/blacklist-for-crawling-td618343.html
https://lucene.472066.n3.nabble.com/Content-of-size-X-was-truncated-to-Y-td4003517.html


Google: nutch mirror web site
https://stackoverflow.com/questions/33354460/nutch-clone-website
[https://stackoverflow.com/questions/35714897/nutch-not-crawling-entire-website
fetch -all seems to be a Nutch v2 thing?]

Google (30 Sep): site mirroring with nutch
https://grokbase.com/t/nutch/user/125sfbg0pt/using-nutch-for-web-site-mirroring
https://lucene.472066.n3.nabble.com/Using-nutch-just-for-the-crawler-fetcher-td611918.html
http://www.cs.ucy.ac.cy/courses/EPL660/lectures/lab6.pdf
slide p.5 onwards

Crawler software options: https://repositorio.iscte-iul.pt/bitstream/10071/2871/1/Building%20a%20Scalable%20Index%20and%20Web%20Search%20Engine%20for%20Music%20on.pdf
See also p.20. HTTrack


Google: nutch performance tuning
* https://stackoverflow.com/questions/24383212/apache-nutch-performance-tuning-for-whole-web-crawling
* https://stackoverflow.com/questions/4871972/how-to-speed-up-crawling-in-nutch
* https://cwiki.apache.org/confluence/display/nutch/OptimizingCrawls

NUTCH INSTALLATION:
* Nutch v1: https://cwiki.apache.org/confluence/display/nutch/NutchTutorial#NutchTutorial-SetupSolrforsearch

Nutch v2 installation and set up:
* https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Tutorial
* https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781783286850/1/ch01lvl1sec09/installing-and-configuring-apache-nutch


SOLR:
* Query syntax: http://www.solrtutorial.com/solr-query-syntax.html
* Deleting a core: https://factorpad.com/tech/solr/reference/solr-delete.html

----------------------------------
Apache Nutch 2 with newer HBase

hbase-common-1.4.8.jar

1. The HBase jar files need to go into runtime/local/lib

Exception: slf4j-log4j12-1.7.10.jar. runtime/local/lib already contains slf4j-log4j12-1.7.5.jar, so remove the 1.7.10 jar from runtime/local/lib after copying the HBase jars over.

2. https://stackoverflow.com/questions/46340416/how-to-compile-nutch-2-3-1-with-hbase-1-2-6
https://stackoverflow.com/questions/39834423/apache-nutch-fetcherjob-throws-nosuchelementexception-deep-in-gora/39837926#39837926

Unfortunately, the page https://paste.apache.org/jjqz referred to above, which contained patches for using Gora 0.7, is no longer available.

http://mail-archives.apache.org/mod_mbox/nutch-user/201602.mbox/%[email protected]%3E

https://www.mail-archive.com/[email protected]/msg14245.html

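Step 1 above (getting the HBase jars into runtime/local/lib without ending up with two slf4j-log4j12 jars) could be scripted roughly as below. This is only a sketch: the function name copy_hbase_jars and the example paths are made up for illustration, and it skips the slf4j-log4j12 jar during the copy rather than copying it and deleting it afterwards, which should leave runtime/local/lib in the same state.

```python
import shutil
from pathlib import Path


def copy_hbase_jars(hbase_lib, nutch_lib, skip_prefixes=("slf4j-log4j12-",)):
    """Copy every *.jar from hbase_lib into nutch_lib.

    Jars whose file name starts with one of skip_prefixes are not copied,
    so the slf4j-log4j12 jar already present in nutch_lib stays the only one.
    Returns the sorted list of jar names that were copied.
    """
    hbase_lib = Path(hbase_lib)
    nutch_lib = Path(nutch_lib)
    nutch_lib.mkdir(parents=True, exist_ok=True)

    copied = []
    for jar in sorted(hbase_lib.glob("*.jar")):
        if jar.name.startswith(skip_prefixes):
            continue  # e.g. slf4j-log4j12-1.7.10.jar
        shutil.copy2(jar, nutch_lib / jar.name)
        copied.append(jar.name)
    return copied


# Example call (both paths are assumptions, adjust to the actual install):
# copy_hbase_jars("/usr/local/hbase/lib", "runtime/local/lib")
```

Run it from the Nutch source directory so that the relative runtime/local/lib path resolves, or pass absolute paths.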
------------------------------------------------------------------------------
Other way: Nutch on its own Vagrant with specified HBase or Nutch with MongoDB
------------------------------------------------------------------------------
* https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/
* https://waue0920.wordpress.com/2016/08/25/nutch-2-3-1-hbase-0-98-hadoop-2-5-solr-4-10-3/


-----
HBASE commands
/usr/local/hbase/bin/hbase shell
https://learnhbase.net/2013/03/02/hbase-shell-commands/


list

davidbHomePage_webpage is a table