https://codereview.stackexchange.com/questions/198343/crawl-and-gather-all-the-urls-recursively-in-a-domain
http://lucene.472066.n3.nabble.com/Using-nutch-just-for-the-crawler-fetcher-td611918.html

https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps

https://cwiki.apache.org/confluence/display/nutch/
https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Crawling
https://cwiki.apache.org/confluence/display/nutch/ReaddbOptions

https://moz.com/top500
-----------
NUTCH
-----------
https://stackoverflow.com/questions/35449673/nutch-and-solr-indexing-blacklist-domain
  https://nutch.apache.org/apidocs/apidocs-1.6/org/apache/nutch/urlfilter/domainblacklist/DomainBlacklistURLFilter.html

https://lucene.472066.n3.nabble.com/blacklist-for-crawling-td618343.html
https://lucene.472066.n3.nabble.com/Content-of-size-X-was-truncated-to-Y-td4003517.html


Google: nutch mirror web site
https://stackoverflow.com/questions/33354460/nutch-clone-website
[https://stackoverflow.com/questions/35714897/nutch-not-crawling-entire-website
fetch -all seems to be a nutch v2 thing?]

Google (30 Sep): site mirroring with nutch
https://grokbase.com/t/nutch/user/125sfbg0pt/using-nutch-for-web-site-mirroring
https://lucene.472066.n3.nabble.com/Using-nutch-just-for-the-crawler-fetcher-td611918.html
http://www.cs.ucy.ac.cy/courses/EPL660/lectures/lab6.pdf
  slide p.5 onwards

crawler software options: https://repositorio.iscte-iul.pt/bitstream/10071/2871/1/Building%20a%20Scalable%20Index%20and%20Web%20Search%20Engine%20for%20Music%20on.pdf
See also p.20 on HTTrack.


Google: nutch performance tuning
* https://stackoverflow.com/questions/24383212/apache-nutch-performance-tuning-for-whole-web-crawling
* https://stackoverflow.com/questions/4871972/how-to-speed-up-crawling-in-nutch
* https://cwiki.apache.org/confluence/display/nutch/OptimizingCrawls

NUTCH INSTALLATION:
* Nutch v1: https://cwiki.apache.org/confluence/display/nutch/NutchTutorial#NutchTutorial-SetupSolrforsearch

Nutch v2 installation and set up:
* https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Tutorial
* https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781783286850/1/ch01lvl1sec09/installing-and-configuring-apache-nutch


Nutch doesn't work with Spark (yet):
https://stackoverflow.com/questions/29950299/distributed-web-crawling-using-apache-spark-is-it-possible

SOLR:
* Query syntax: http://www.solrtutorial.com/solr-query-syntax.html
* Deleting a core: https://factorpad.com/tech/solr/reference/solr-delete.html
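
A minimal sketch of deleting a core from the command line (the core name "nutch" here is just a placeholder; the factorpad page above covers the other delete variants):
  # run from the Solr install directory; removes the named core and its index
  bin/solr delete -c nutch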


* If you change a Nutch 2 configuration, https://stackoverflow.com/questions/16401667/java-lang-classnotfoundexception-org-apache-gora-hbase-store-hbasestore
explains that you can rebuild Nutch with:
  cd <apache-nutch>
  ant clean
  ant runtime
----------------------------------
Apache Nutch 2 with newer HBase

hbase-common-1.4.8.jar

1. HBase jar files need to go into runtime/local/lib.

But not slf4j-log4j12-1.7.10.jar (there's already a slf4j-log4j12-1.7.5.jar), so remove the 1.7.10 jar from runtime/local/lib after copying the jars over.
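
A rough sketch of step 1, assuming HBase 1.4.8 was unpacked at ~/hbase-1.4.8 and Nutch lives at ~/apache-nutch-2.3.1 (both paths are assumptions):
  cd ~/apache-nutch-2.3.1/runtime/local/lib
  # copy the HBase client jars over (the exact jar set needed may differ)
  cp ~/hbase-1.4.8/lib/hbase-*.jar .
  # avoid two competing slf4j-log4j12 bindings on the classpath
  rm -f slf4j-log4j12-1.7.10.jar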

2. https://stackoverflow.com/questions/46340416/how-to-compile-nutch-2-3-1-with-hbase-1-2-6
   https://stackoverflow.com/questions/39834423/apache-nutch-fetcherjob-throws-nosuchelementexception-deep-in-gora/39837926#39837926

Unfortunately, the page https://paste.apache.org/jjqz referred to above, which contained patches for using Gora 0.7, is no longer available.

http://mail-archives.apache.org/mod_mbox/nutch-user/201602.mbox/%[email protected]%3E

https://www.mail-archive.com/[email protected]/msg14245.html

------------------------------------------------------------------------------
Other way: Nutch on its own vagrant VM with a specified HBase, or Nutch with MongoDB
------------------------------------------------------------------------------
* https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/
* https://waue0920.wordpress.com/2016/08/25/nutch-2-3-1-hbase-0-98-hadoop-2-5-solr-4-10-3/

The older but recommended HBase 0.98.21 for Hadoop 2 can be downloaded from https://archive.apache.org/dist/hbase/0.98.21/
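
For example (the exact archive filename is an assumption; check the directory listing at the URL above):
  wget https://archive.apache.org/dist/hbase/0.98.21/hbase-0.98.21-hadoop2-bin.tar.gz
  tar xzf hbase-0.98.21-hadoop2-bin.tar.gz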

-----
HBASE commands
/usr/local/hbase/bin/hbase shell
https://learnhbase.net/2013/03/02/hbase-shell-commands/
http://dwgeek.com/read-hbase-table-using-hbase-shell-get-command.html/
dropping tables: https://www.tutorialspoint.com/hbase/hbase_drop_table.htm

> list

davidbHomePage_webpage is a table

> get 'davidbHomePage_webpage', '1'

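If a crawl's table ever needs to be cleared out, the standard hbase shell sequence (per the tutorialspoint page above) is disable followed by drop:

> disable 'davidbHomePage_webpage'
> drop 'davidbHomePage_webpage'
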
Solution to get a working Nutch 2:
Get http://trac.greenstone.org/browser/gs3-extensions/maori-lang-detection/hdfs-cc-work/vagrant-for-nutch2.tar.gz
and follow the instructions in my README file in there.

---------------------------------------------------------------------
ALTERNATIVES TO NUTCH - looking for site mirroring capabilities
---------------------------------------------------------------------
=> https://anarc.at/services/archive/web/
   Autistici's crawl [https://git.autistici.org/ale/crawl] needs Go:
   https://medium.com/better-programming/install-go-1-11-on-ubuntu-18-04-16-04-lts-8c098c503c5f
   https://guide.freecodecamp.org/go/installing-go/ubuntu-apt-get/
   To uninstall: https://medium.com/@firebitsbr/how-to-uninstall-from-the-apt-manager-uninstall-just-golang-go-from-universe-debian-ubuntu-82d6a3692cbd
   https://tecadmin.net/install-go-on-ubuntu/ [our vagrant VMs are Ubuntu 16.04 LTS, as discovered by running the cmd "lsb_release -a"]
https://alternativeto.net/software/apache-nutch/
https://alternativeto.net/software/wget/
https://github.com/ArchiveTeam/grab-site/blob/master/README.md#inspecting-warc-files-in-the-terminal
https://github.com/ArchiveTeam/wpull

-------------------

Running nutch 2.x

-------------------

LINKS

https://lucene.472066.n3.nabble.com/Nutch-2-x-readdb-command-dump-td4033937.html
https://cwiki.apache.org/confluence/display/nutch/ReaddbOptions


https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/ ## most useful for running nutch 2.x crawls

https://www.mobomo.com/2017/06/the-basics-working-with-nutch-2-x/
134 "Fetch
135
136 This is where the magic happens. During the fetch step, Nutch crawls the urls selected in the generate step. The most important argument you need is -threads: this sets the number of fetcher threads per task. Increasing this will make crawling faster, but setting it too high can overwhelm a site and it might shut out your crawler, as well as take up too much memory from your machine. Run it like this:
137 $ nutch fetch -threads 50"
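
For the Nutch 2.x commands used further below, the equivalent would presumably be (the thread count is just an example; -threads is an option to the 2.x fetch job too):
  ./bin/nutch fetch -all -threads 50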


https://examples.javacodegeeks.com/enterprise-java/apache-hadoop/apache-hadoop-nutch-tutorial/
https://www.yegor256.com/2019/04/17/nutch-from-java.html

http://nutch.sourceforge.net/docs/en/tutorial.html

Intranet: Configuration
To configure things for intranet crawling you must:

  Create a flat file of root urls. For example, to crawl the nutch.org site you might start with a file named urls containing just the Nutch home page. All other Nutch pages should be reachable from this page. The urls file would thus look like:

  http://www.nutch.org/

  Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.org domain, the line should read:

  +^http://([a-z0-9]*\.)*nutch.org/

  This will include any url in the domain nutch.org.
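
A quick shell sketch of those two configuration steps (file layout as in the old tutorial above; swap in the real domain being mirrored):
  # the flat file of root urls
  echo "http://www.nutch.org/" > urls
  # then edit conf/crawl-urlfilter.txt so the domain filter line reads:
  #   +^http://([a-z0-9]*\.)*nutch.org/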

Intranet: Running the Crawl
Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:

  -dir dir names the directory to put the crawl in.
  -depth depth indicates the link depth from the root page that should be crawled.
  -delay delay determines the number of seconds between accesses to each host.
  -threads threads determines the number of threads that will fetch in parallel.

For example, a typical call might be:

bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log

Typically one starts testing one's configuration by crawling at low depths, and watching the output to check that desired pages are found. Once one is more confident of the configuration, then an appropriate depth for a full crawl is around 10. <===========
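
So a follow-up full crawl with the options listed above might look something like this (the depth and thread count are only illustrative):
  bin/nutch crawl urls -dir crawl.full -depth 10 -threads 10 >& crawl-full.log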

Once crawling has completed, one can skip to the Searching section below (i.e. in the original tutorial).


-----------------------------------
Actually running nutch 2.x - steps
-----------------------------------
MANUALLY GOING THROUGH THE CYCLE 3 TIMES:

cd ~/apache-nutch-2.3.1/runtime/local

./bin/nutch inject urls

./bin/nutch generate -topN 50
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb -all

./bin/nutch generate -topN 50
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb -all

./bin/nutch generate -topN 50
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb -all

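The three identical blocks above could equally be run as a small shell loop (a sketch; the topN value and iteration count are just the ones used here):

  cd ~/apache-nutch-2.3.1/runtime/local
  ./bin/nutch inject urls
  for i in 1 2 3; do
      ./bin/nutch generate -topN 50
      ./bin/nutch fetch -all
      ./bin/nutch parse -all
      ./bin/nutch updatedb -all
  done
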
Dump output on local filesystem:
  rm -rf /tmp/bla
  ./bin/nutch readdb -dump /tmp/bla [-crawlId ID -text]
  less /tmp/bla/part-r-00000

To dump output to HDFS:
  Need the hdfs host name if sending/dumping nutch crawl output to a location on hdfs.
  The host is defined in /usr/local/hadoop/etc/hadoop/core-site.xml for property fs.defaultFS (https://stackoverflow.com/questions/27956973/java-io-ioexception-incomplete-hdfs-uri-no-host);
  the host is hdfs://node2/ in this case.
  So:

  hdfs dfs -rmdir /user/vagrant/dump
  XXX ./bin/nutch readdb -dump user/vagrant/dump -text ### won't work
  XXX ./bin/nutch readdb -dump hdfs:///user/vagrant/dump -text ### won't work
  ./bin/nutch readdb -dump hdfs://node2/user/vagrant/dump -text
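
If unsure of the host, the configured fs.defaultFS value can also be read back directly (assuming the hadoop binaries are on the PATH):
  hdfs getconf -confKey fs.defaultFS
  # prints hdfs://node2 (or similar) in this setup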


USING THE SCRIPT TO ATTEMPT TO CRAWL A SITE
* Choosing to repeat the cycle 10 times because, as per http://nutch.sourceforge.net/docs/en/tutorial.html

"Typically one starts testing one's configuration by crawling at low depths, and watching the output to check that desired pages are found. Once one is more confident of the configuration, then an appropriate depth for a full crawl is around 10."

* Use the ./bin/crawl script, providing the seed URLs dir, the crawlId and the number of times to repeat (10):
vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/crawl urls davidbHomePage 10


* View the downloaded crawls.
This time we need to provide the crawlId to readdb in order to get a dump of its text contents:
  hdfs dfs -rm -r hdfs://node2/user/vagrant/dump2
  ./bin/nutch readdb -dump hdfs://node2/user/vagrant/dump2 -text -crawlId davidbHomePage

* View the contents:
hdfs dfs -cat hdfs://node2/user/vagrant/dump2/part-r-*


* FIND OUT NUMBER OF URLS DOWNLOADED FOR THE SITE:
vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/nutch readdb -stats -crawlId davidbHomePage
WebTable statistics start
Statistics for WebTable:
retry 0: 44
status 5 (status_redir_perm): 4
status 3 (status_gone): 1
status 2 (status_fetched): 39
jobs: {[davidbHomePage]db_stats-job_local647846559_0001={jobName=[davidbHomePage]db_stats, jobID=job_local647846559_0001, counters={Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=135, REDUCE_INPUT_RECORDS=8, SPILLED_RECORDS=16, MERGED_MAP_OUTPUTS=1, VIRTUAL_MEMORY_BYTES=0, MAP_INPUT_RECORDS=44, SPLIT_RAW_BYTES=935, FAILED_SHUFFLE=0, MAP_OUTPUT_BYTES=2332, REDUCE_SHUFFLE_BYTES=135, PHYSICAL_MEMORY_BYTES=0, GC_TIME_MILLIS=0, REDUCE_INPUT_GROUPS=8, COMBINE_OUTPUT_RECORDS=8, SHUFFLED_MAPS=1, REDUCE_OUTPUT_RECORDS=8, MAP_OUTPUT_RECORDS=176, COMBINE_INPUT_RECORDS=176, CPU_MILLISECONDS=0, COMMITTED_HEAP_BYTES=595591168}, File Input Format Counters ={BYTES_READ=0}, File System Counters={FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, FILE_READ_OPS=0, FILE_BYTES_WRITTEN=1788140, FILE_BYTES_READ=1223290}, File Output Format Counters ={BYTES_WRITTEN=275}, Shuffle Errors={CONNECTION=0, WRONG_LENGTH=0, BAD_ID=0, WRONG_MAP=0, WRONG_REDUCE=0, IO_ERROR=0}}}}
TOTAL urls: 44
max score: 1.0
avg score: 0.022727273
min score: 0.0
WebTable statistics: done
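
To pull out just the URL count from that output, something like the following should do (a sketch; -stats prints its summary to the console as shown above):
  ./bin/nutch readdb -stats -crawlId davidbHomePage | grep "TOTAL urls"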

------------------------------------
STOPPING CONDITION
Seems to be built in.
* When I tell it to cycle 15 times, it stops after 6 cycles saying no more URLs to fetch:

vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/crawl urls davidbHomePage2 15
---
No SOLRURL specified. Skipping indexing.
Injecting seed URLs

...

Thu Oct 3 09:22:23 UTC 2019 : Iteration 6 of 15
Generating batchId
Generating a new fetchlist
...
Generating batchId
Generating a new fetchlist
/home/vagrant/apache-nutch-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId davidbHomePage2 -batchId 1570094569-27637
GeneratorJob: starting at 2019-10-03 09:22:49
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2019-10-03 09:22:52, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1570094569-27637 containing 0 URLs
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
vagrant@node2:~/apache-nutch-2.3.1/runtime/local$
---

* Running readdb -stats shows 44 URLs fetched, just as the first time (when the crawlId was "davidbHomePage"):

vagrant@node2:~/apache-nutch-2.3.1/runtime/local$ ./bin/nutch readdb -stats -crawlId davidbHomePage2
---
WebTable statistics start
Statistics for WebTable:
retry 0: 44
status 5 (status_redir_perm): 4
status 3 (status_gone): 1
status 2 (status_fetched): 39
jobs: {[davidbHomePage2]db_stats-job_local985519583_0001={jobName=[davidbHomePage2]db_stats, jobID=job_local985519583_0001, counters={Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=135, REDUCE_INPUT_RECORDS=8, SPILLED_RECORDS=16, MERGED_MAP_OUTPUTS=1, VIRTUAL_MEMORY_BYTES=0, MAP_INPUT_RECORDS=44, SPLIT_RAW_BYTES=935, FAILED_SHUFFLE=0, MAP_OUTPUT_BYTES=2332, REDUCE_SHUFFLE_BYTES=135, PHYSICAL_MEMORY_BYTES=0, GC_TIME_MILLIS=4, REDUCE_INPUT_GROUPS=8, COMBINE_OUTPUT_RECORDS=8, SHUFFLED_MAPS=1, REDUCE_OUTPUT_RECORDS=8, MAP_OUTPUT_RECORDS=176, COMBINE_INPUT_RECORDS=176, CPU_MILLISECONDS=0, COMMITTED_HEAP_BYTES=552599552}, File Input Format Counters ={BYTES_READ=0}, File System Counters={FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, FILE_READ_OPS=0, FILE_BYTES_WRITTEN=1788152, FILE_BYTES_READ=1223290}, File Output Format Counters ={BYTES_WRITTEN=275}, Shuffle Errors={CONNECTION=0, WRONG_LENGTH=0, BAD_ID=0, WRONG_MAP=0, WRONG_REDUCE=0, IO_ERROR=0}}}}
TOTAL urls: 44
---

----------------------------------------------------------------------
  Testing URLFilters: testing a URL to see if it's accepted
----------------------------------------------------------------------
Use the command
  ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
(mentioned at https://lucene.472066.n3.nabble.com/Correct-syntax-for-regex-urlfilter-txt-trying-to-exclude-single-path-results-td3600376.html)

Use as follows:

  cd apache-nutch-2.3.1/runtime/local

  ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

Then paste the URL you want to test and press Enter.
  A + in front of the response means the URL was accepted.
  A - in front of the response means it was rejected.
You can continue pasting URLs to test against the filters until you send Ctrl-D to terminate input.
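
Since the checker reads from stdin, a whole file of test URLs can presumably be piped through it in one go (test-urls.txt is a hypothetical file with one URL per line):
  cat test-urls.txt | ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined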